Abstract
The emergence of text-to-image generative models (e.g., Midjourney, DALL-E 2, Stable Diffusion) in the summer of 2022 impacted architectural visual culture suddenly, severely, and seemingly out of nowhere. To contextualize this phenomenon, this text offers a socio-technical history of text-to-image generative systems. Three moments in time, or “scenes,” are presented here: the first at the advent of AI in the middle of the last century; the second at the “reawakening” of a specific approach to machine learning at the turn of this century; the third that documents a rapid sequence of innovations, dubbed “clever little tricks,” that occurred across just 18 months. This final scene is the crux, and represents the first formal documentation of the recent history of a specific set of informal innovations. These innovations were produced by non-affiliated researchers and communities of creative contributors, and directly led to the technologies that so compellingly captured the architectural imagination in the summer of 2022. Across these scenes, we examine the technologies, application domains, infrastructures, social contexts, and practices that drive technical research and shape creative practice in this space.
Introduction
Despite appearances, contemporary imaging technologies based on neural networks—such as Midjourney, DALL-E, and Stable Diffusion—are neither sorcery nor sentient. Rather, they are tools produced in a specific socio-technical context. While the impact of the broad adoption of these tools may feel like “a rock from the sky,” the events of the summer of 2022 were rooted in a particular social context and arose out of a specific technical history. This paper seeks to illuminate exactly this. The history presented here finds relevance in architectural visual culture not only as a way of ensuring accurate attribution for our tools, but also in that it reveals the provenance of the specific biases that have been “baked” into an emerging architectural visualization technique. Further, the broader history surveyed here demonstrates that themes of accessibility and access have permeated research and creative practice in AI since its inception.
Three scenes, four lenses
Our story unfolds in three scenes, with the first two setting the stage for a culmination in the third—a carefully documented history of informal innovations in text-to-image systems that were developed by non-affiliated researchers and creative communities. Although text-to-image systems were built using foundational elements developed in large institutional settings, here on the steep incline of yet another hype cycle, an important aspect of their origin might otherwise be overlooked. This paper aims to ensure the recognition of the contributions of independent and amateur developers that produced innovations core to the development of the better-known systems that captured the architectural imagination in the summer of 2022. To do so, this history by necessity draws from a dynamic set of primary sources—blog posts, Discord servers, Colab notebooks, GitHub repos, and social media posts—that are not formally structured and sometimes unattributed, undated, or incomplete.
To help us understand what led to the summer of 2022, this paper offers a look at the socio-technical forces across the following three moments in time:
• Scene 1—the middle of the last century—the 1950s and 1960s
• Scene 2—the turn of this century—the 2000s and 2010s
• Scene 3—the months leading up to the summer of 2022.
Across these scenes, we offer the following four lenses through which to view the context for creative technical work, and that shape the perception of what creative ML is “for”:
• Technology—This includes both the “hard” technology (e.g., algorithms, architectures, models, data, and data formats), as well as domains of application (i.e., those real-world problems that AI researchers seek to address).
• Social Contexts—This includes those organizational structures—public and private research institutions, conferences, informal online communities, social networks, and creative collectives—that support ML research and creative practice.
• Infrastructures—This includes those tools, resources, software, platforms, programming languages, and hardware accessibility that make ML research and creative practice possible.
• Practices—This includes those influential creative works, aesthetic theories, and impactful cultural events that expand the space of possibilities. This will be looked at through the lens of architectural design in particular.
These four lenses must be considered together because technologies, infrastructures, social structures, and creative practices are intimately interlinked. Creative practices, for example, are often organized around and catalyzed by specific technologies. Similarly, technologies are rarely developed outside of a conception of their application in a specific social context. There are layers of dependencies at work here that require more than a single point of view to effectively pull back. Such a holistic account is a prerequisite for the critical examination of text-to-image systems in particular, of generative AI in general, and of the impact that these technologies may hold on architectural culture. Critical contemporary issues of technological accessibility and data bias, for example, are more clearly illuminated by an account of the cultures in which these technologies were developed and the audiences they were intended to serve.
The first section of this text concerns the scene at the middle of the last century—the 1950s and 1960s—and describes this context through the lenses described above as a baseline for subsequent periods of time. The second section of this text concerns the scene at the turn of this century—the 2000s and 2010s. Here, we go into more detail on the technologies, application domains, social contexts, practices, and (in particular) the infrastructures established in this period that hold particular sway on the current state of things. The final section concerns the present, and departs from this pattern somewhat. We begin with an account of the social groundwork that underpins the recent rise of text-to-image generative models, including a detailed unpacking of the social prerequisites to this phenomenon. Next, we offer a section that discusses the technical groundwork for this period, and that “takes apart” the current state-of-the-art approach to text-to-image generation—CLIP-Guided Diffusion—and conducts a biography of the constituent elements. Finally, we offer a history of the rapid-fire series of innovations—what has been termed a series of “clever little tricks”—that directly preceded contemporary text-to-image systems.
Scene 1—the 1950s and 1960s
In this section, we describe the socio-technical context of artificial intelligence research and applications as it stood in the middle of the last century—that is, from the late 1940s to the early 1970s. While the history of AI in this period of time is well-accounted for in scholarship, we offer a brief summary here through the socio-technical lens described above, in order to provide a comparison to later periods.
The field of AI developed in the 1950s, proceeding through the application of a “systems theory” model of human cognition as the basis for machine logic.1,2 In what became known as the “connectionist” approach to AI, some of the earliest work in the late 1940s and 1950s used networks of connected circuits to simulate intelligent behavior. Notable among these early models was the concept of the “Perceptron,” a classification algorithm 3 whose conceptual basis was laid in 1943 by Warren McCulloch and Walter Pitts, 4 and which was first implemented as a physical machine in 1958 by Frank Rosenblatt. 5 The Perceptron was a binary classifier, that is, a function that decides whether a given input belongs to a specific class. Similar to important advances in machine vision in the early 2000s that set the stage for contemporary image generation techniques, the Perceptron was a machine designed for image recognition.
In stark contrast with our current context, which finds AI in everything from web applications to household appliances, early AI research was far removed from practical application. While advocates extolled potential applicability to a broad set of general problems, 6 the work was not propelled by academic interest alone, but rather was supported for its promising applications in national defense in particular. The Perceptron, for example, was developed at the Cornell Aeronautical Laboratory in research funded by the US Navy. 7 Nearly all early AI work was conducted at academic research institutions, primarily in the UK and the US, and was propelled by generous national defense funding. For instance, at MIT alone, ARPA (later DARPA) contributed 2–3 million dollars per year to AI research from 1963 through the mid-1970s. 8 This money was provided with few strings attached, and contributed to what has been termed a “technological libertine” atmosphere at MIT, an atmosphere strongly related to what we would now call “hacker culture.” 1
A summary of the socio-technical context of AI in the mid-twentieth century.
Intersections with architectural discourse and culture
The technological milieu in this period held sway in the creative and architectural world, as many of the early founders of AI directly intersected with architects and architectural thinking. Luminaries such as Allen Newell and Herbert Simon influenced, and were influenced by, figures in the early history of design and computation. These include Christopher Alexander, Richard Saul Wurman, Cedric Price, Nicholas Negroponte, and others. 3 The concurrence of early AI and the early history of computer-aided design resulted in these two fields shaping one another.
As an illustration, consider the position of Stephen Coons—a progenitor of computer-aided design and the inventor of the mathematical foundation of surface descriptions still used by contemporary geometric modeling software. For Coons, computation in design held a liberatory promise similar to claims often made about AI. Writing in 1966 on the potential of CAD as a transformative force in design, Coons referred to the computer as a “compliant partner,” an “appropriate kind of slave,” and a “magic instrument of creative liberation.” 19 Such characterizations not only find echoes in the current climate of over-hyped AI-assisted, augmented, and automated design, they also mark the crystallization of a particular position on the role of technology in the design process that persists to this day. Coons understood design as a “conversation between humans and machines” with distinct creative and mechanical stages. In what Daniel Cardoso Llach described as “Albertian” in its implicit privileging of mind over matter, Coons celebrated the computer as “an obedient agent dealing with the toils of construction work” while designers were “liberated from the drudgery of materials and physical work.” 20 This theme of the “liberation” of design and designers enabled by artificial intelligence recurs across the subsequent decades, and may be found in the scenes documented in the sections that follow.
Scene 2—the early 2000s
To jump forward 50 years from the early history of AI to the 2000s and early 2010s leaves aside a body of history that is inexcusably large, but that is also well-represented in existing scholarship. For our purposes, a coarse overview will suffice: the AI Winter of the late 1970s led to an AI summer in the mid-1980s, 4 which led to a second winter in the early 1990s. The most recent thaw occurred in the period we discuss here, the early 2010s, which directly preceded the springtime we currently enjoy, and re-established connectionism as the dominant “tribe” in AI research. As always, none of the technological development of the intervening period occurred in a vacuum. Much changed across this 50-year span, but two major developments are particularly relevant to this discussion:
• Computers got faster—both in terms of general processing and dedicated graphics processing (GPUs).
• The world began producing and storing vastly more data.
While the first point above—the expansion of computing power across this half-century span—is plain to see, the second point is worth emphasizing. Looking at just the first decades of the current century, access to large amounts of data, and cheaper computers to store and process this data, radically transformed the global economy. It is difficult to overstate the scale of this shift. To illustrate, consider Figure 1, which shows:
• In 1997, Michael Lesk, the head of the Division of Information and Intelligent Systems at the National Science Foundation, estimated that there were a “few thousand petabytes” of information in the world, and that data storage would match this capacity by the year 2000. 22 This is roughly equivalent to 3 billion gigabytes of information in the world, in total (not data produced per year, as expressed below).
• In 2011, McKinsey Global Institute estimated that 13 exabytes of new data was being created and stored by companies and consumers per year. 23 This is roughly equivalent to 13 billion gigabytes per year.
• The volume of data produced today (2021) is estimated to be around 94 zettabytes. 24 This is roughly equivalent to 94,000 billion gigabytes per year.
A visualization of estimates of the relative amount of data produced per year in 1997, 2011, and 2021.

A summary of the socio-technical context of machine learning in the early 21st century.
Intersections with architectural discourse and culture
In this period of time, the problem of “big data” was clearly recognized in architectural discourse and practice. Further, the specific technologies and methods devised to deal with this problem in the context of AI were often adopted by architects both rhetorically and instrumentally. In the early 2000s, we find appeals to large accumulations of data deployed rhetorically in architectural circles, often to lend design proposals “the authority of retroactive inevitability.” 45 This tactic—an appeal to the ethos of data as an unimpeachable authority—may be found across the period, from Winy Maas and MVRDV’s “Metacity Datatown” 46 to Bjarke Ingels’ “data-based approach that drives design.” 47 While the 2009 financial crisis shifted the terms of the argument “from gee-whiz form-making” to “social relevance and environmental stewardship,” 45 appeals to the authority of incomprehensible volumes of data remained a prominent rhetorical device.
Also notable in this period is the shift in the instrumentality of computational design as a response to the technologies and methods related to “big data.” Paralleling the shift in AI from the encoding of rules (i.e., expert systems) to the discovery of correlation (i.e., computational statistics), this period anticipates a shift in computer-aided design from systems that facilitate the composition of rules and relationships (e.g., parametric modeling) and systems that manage and coordinate information (e.g., BIM) to new forms of computer augmentation that operate on data at a higher level of abstraction. As an example, consider Mario Carpo’s 2015 speculation on the potential of “searching without sorting (also known as the art of finding stuff without knowing where it is)” as a generative instrument in design. 48 Years before animated “latent space walks” entered into the architectural imagination, Carpo offered the prescient observation that unstructured search may provide an antidote to both the explicit rule-based approaches of parametric design and the highly structured semantic approach found in BIM systems. Carpo observed that this “new big-data science of searching” appears to be alien to existing methods of design, and possibly also deeper aspects of the Western philosophical tradition. Carpo envisioned “a new digital style in architecture, one that is based on the integration of real-time data streams and dynamic modeling,” and speculated that this new style would require architects to think differently, with more focus on data analysis and algorithmic design. Carpo’s predictions take on new relevance in light of the nascent cultures of architectural visualization that are taking root around text-to-image generative models.
Scene 3—January 2021 to June 2022
Our current “springtime”—including the text-to-image generative models that seemingly suddenly appeared in the summer of 2022, and that disrupted certain segments of architectural visual culture—came about in the specific social and technological context discussed above. In many ways, there is little new about the period of time we discuss here. The technology that so compellingly produces synthetic images of imaginary architecture, for example, is the direct descendant of Rosenblatt’s Perceptron, which we may recall was itself a machine designed for image recognition. A key factor in the success of today’s systems is access to data. We can only speculate what Rosenblatt or Minsky, LeCun or Hopfield, might have accomplished in their time given the quantity of data produced by our world today. In other ways, much has changed to bring us to this moment. Infrastructural shifts established in the early 21st century have since flourished, and have brought about a level of accessibility—to computing resources, to data, to communities of collaborators, and to knowledge—that was not possible in the past. This accessibility has allowed a new generation of unaffiliated and uncredentialed creative technologists 11 to thrive. As we demonstrate below, it has also enabled this community (in parallel with large-scale institutions) to produce innovations that at first glance may appear to be mere “clever little tricks,” 12 but that turn out to be serious contributions and that hold deep cultural impact.
It is better at this point to depart slightly from our chronological account, and to instead tell an instrumental story. To understand the emergence of text-to-image generative models in the summer of 2022, we shift the emphases of both our technical and social account of this history. On the technical side, we move from a discussion of changes in the general theory and foundational technologies of AI, to one that focuses on smaller-scale innovations and specific applications. In this way, the second section below combines a discussion of technologies and application domains across this period of time. On the social side, more will be illuminated through a detailed account of the specific mechanisms by which individuals, communities, and corporations are enabled by certain technological infrastructures. For this reason, the first section that follows combines a discussion of the social context of text-to-image models with an account of the technical infrastructure that supports it.
Social Groundwork—Infrastructures and Social Contexts
In this section, we account for the social context that gave rise to the text-to-image models under consideration, as well as the technical infrastructure that supports it. Here we ask: What were the prerequisites for this rise, and which conditions enabled it? What empowered “unaffiliated” research groups and casual technologists to make contributions that hold impact on the social production of creative work? As we describe below, the establishment of a specific set of social and technical infrastructures in the time immediately preceding this period directly gave rise to these changes. This set includes both infrastructural technologies—including the availability of published model architectures, the open-sourcing of foundational code, and the availability of data for training—as well as social technologies, such as cheap access to compute, collaboration technologies (e.g., GitHub, Google Colab), and communication and coordination technologies (e.g., Discord, Reddit, and Twitter), all of which fueled novel structures of communal labor.
Published Models and Open-Source Code
In ML research, it has long been a standard practice to include the details of how novel neural networks are structured as a part of publication, often expressed as a diagrammatic description 13 of the architecture of a model, such as those shown in Figure 2 and Figure 3. Alongside technical information described in a paper, these diagrams allow other researchers to reproduce the published work.
The first neural network architectural diagram—Warren McCulloch and Walter Pitts, 1943. 4
A selection of neural network architectural diagrams from the “Neural Network Zoo”—Fjodor van Veen and Stefan Leijnen, 2016. 49


In practice, this sort of information alone is not sufficient to duplicate published research. Among other capacities, 14 such an endeavor would also require: the translation of architectural diagrams into functional code, the possession of appropriate data for training the model, and access to the computational power required for performing the training. While practices around publishing model architectures had long been established, a shift in other publishing norms played an important role in enabling innovations in this period.
First, we find in recent years more willingness to publish not only model architectures (such as the diagrams shown above), but also implemented code. For example, when researchers at NVIDIA published the paper that introduced StyleGAN in 2019, 50 it was closely followed by a repository of code 51 that implemented the proposed architecture. Repositories of functional code such as this not only allowed researchers to verify claims made in the paper, but also enabled creative practitioners to directly apply leading-edge technologies in their work. As a direct result, we find in the mid-2010s and early 2020s the emergence of ML-driven creative work in a diverse set of fields, from fine art, 15 to music, 16 to design. 17 We see this trend reflected in the establishment of “creative ML” events, workshops, and exhibitions. 18
Another shift in the culture of ML research has amplified this first factor: it has become commonplace for research to be posted to open-access preprint repositories 19 before being published via the typical academic channels. In the context of the period described here, we can see that preprint publication acts as an accelerant, since papers are approved for posting in the repository after a quick (and mostly automated) moderation process and do not need to wait for the slow mechanism of peer review to find a public audience. As such, just as the pace of publishing in ML in general has increased in recent years, it has also become more immediate, with research moving from the lab out into the public sphere more quickly.
Finally, while it is more common to release only the model architecture and source code, we do find instances of researchers releasing the “weights” of a model publicly. 20 This information, in conjunction with the source code, effectively provides a functional model for those with access to the expertise and compute required to run it. While not eliminated, the barriers are significantly lowered here, since any technologist that follows up on this work is free to bypass the training step altogether, thereby side-stepping the need for access to data and the high-powered computers required for training.
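To make this distinction concrete, the sketch below shows, in a minimal and hypothetical form, what a released set of weights affords: the architecture is re-implemented from the published description, the checkpoint is loaded, and the model is run directly for inference, with no training data or training compute required. The model class and file name here are illustrative stand-ins, not any particular published release (a stand-in checkpoint is saved first so the example runs end-to-end).

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for an architecture re-implemented from a published diagram.
class SmallClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Linear(256, 10)
        )

    def forward(self, x):
        return self.net(x)

# Stand-in for the checkpoint a research group might publish alongside its code.
torch.save(SmallClassifier().state_dict(), "released_weights.pt")

# A downstream practitioner loads the published weights and skips training entirely.
model = SmallClassifier()
model.load_state_dict(torch.load("released_weights.pt", map_location="cpu"))
model.eval()

with torch.no_grad():
    logits = model(torch.randn(1, 3, 32, 32))  # inference on a dummy input
    print(logits.shape)  # torch.Size([1, 10])
```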
As we show in the sections that follow, the pivot toward open model architectures, open source code, and open model weights enabled the contributions made by “casual technologists” across this period of time. A model released in such a way is distinct from consumer-facing software, and still requires significant expertise to run, modify, and extend, but nonetheless represents an important prerequisite of the small-scale innovation discussed below.
Cloud Compute
As described above, one of the most impactful shifts in the previous period of time was the establishment of cloud infrastructure, which made accessible the storage and GPU compute required to train and run ML models. While certainly technical in nature, we see the widespread adoption of cloud computing services (e.g., AWS and Google App Engine) as holding important social ramifications as well. This is because these services provide cheap access to powerful hardware, thereby allowing relatively under-resourced groups to perform compute-demanding tasks, such as training, running, and serving large models. In recent years, something like a secondary market has developed 21 around providing “wrappers” for core cloud computing services that ease access even further.
Datasets
A clear prerequisite for working in machine learning is data. The centrality of data in this work is evidenced by the breathless comparisons to essential natural resources 22 that became prevalent at the time. As is the case with natural resources, access to data is not evenly distributed. A wide variety of datasets were required to train the assorted text-to-image generative systems mentioned in this text. Some of these datasets are known, while others were developed without revealing the provenance of their data. Where the data sources are known, some are openly available and some are proprietary. These distinctions can make all the difference to a creative technologist seeking to make contributions in this area. In this context, a number of private and public initiatives have been established in recent years 23 to connect ML practitioners with data for training and testing. Access to these resources remains a critical piece of infrastructure, and can enable relatively small-scale contributors to reproduce, verify, amplify, and extend the work of larger and better-funded labs. For example, OpenCLIP 66 (an implementation of OpenAI’s CLIP) was trained and released open-source 67 by LAION, a German non-profit organization (as detailed in a section below) that has collected, maintained, and publicly released a number of large datasets widely used by AI researchers. 68 Another organization, EleutherAI (detailed below), created and maintains an open-source language modeling dataset called the “Pile” 69 to facilitate the training of open-source large-language models to compete with closed systems (such as GPT-3). 70 The construction and maintenance of these datasets is crucial to the development of ML research, and the public and private research communities rely equally on such organizations.
Collaboration, communication, and coordination technologies
Colab and GitHub are simultaneously social and technical infrastructure, as they function as platforms for both collaboration and dissemination of technical work. Further, these platforms connect to and intersect with social platforms, such as Reddit, Twitter, and Discord. As demonstrated in the sections that follow, this infrastructural network has played a crucial role in the development of the technologies considered here.
• GitHub (est. 2008) 24 —a collaborative version control service often used to support the development of open source software projects. GitHub repositories are open by default, and an essential feature is the ability to “fork” and “merge”—that is, for anyone in a community to work on their own private copy of some code, and then later integrate any changes that work back into the communal source. The Stable Diffusion GitHub repository, for example, shows 5,600 “forks” at the time of writing—representing more than five thousand potential community contributors to the project. More than 350 million repositories of code are currently reported to be hosted on the site. 71
• Google Colaboratory (“Colab,” released 2017) 72 —a “freemium” web-based environment that allows users to write and execute Python code in the cloud. This Google initiative builds on earlier nonprofit initiatives—specifically, Project Jupyter, an open-source project intended “to support interactive data science and scientific computing” 73 —and connects these existing systems to internal Google services that provide storage and compute. Colab was an essential piece of infrastructure for the development of text-to-image generative models. Nearly all the innovations documented in the section below were implemented in and disseminated via Colab notebooks.
• Discord (released 2015)—an instant messaging social platform established with an online gaming audience in mind, 25 and as a way for the founders “to talk to their gamer friends.” 74 Discord holds a unique place in relation to text-to-image generative models. Importantly, it is the platform that supported certain tools (e.g., Midjourney and Stable Diffusion) in their “alpha test” phase before they moved to stand-alone web apps. Further, Discord is the social platform of choice for the creative technologists that laid the technical groundwork for contemporary text-to-image systems; important organizations—such as Disco Diffusion, Eleuther, and Stability—organize community efforts using this app.
The social context in 2021
The technologies described above—including published models, open-source code, cloud compute, public datasets, and collaboration technologies—supported the emergence of novel social organizations that directly gave rise to the text-to-image generative models under consideration. While the academic institutions and private research labs that dominated our earlier scenes still loom large in this period, we see here a surge in the importance of small-scale and informal social organizations, as well as the emergence of novel forms of collective work. Specifically, the technological changes described above empowered “unaffiliated” and “uncredentialed” casual technologists to make contributions of significance. As demonstrated in a section below, we use the latter term, “uncredentialed,” in that many of the central contributors involved are largely self-taught, without formal education in machine learning, and were not working as ML engineers at the time of their most important contributions. Our former term, “unaffiliated,” is meant less to describe the status of individuals, and more to refer to the loose social associations that these individuals employ to organize their labor. To illustrate this point, we offer here three examples that are emblematic of the time:
• LAION, mentioned above, is a German non-profit that “aims to make large-scale machine learning models, datasets and related code available to the general public.” 68 The work of organizations like LAION is critical to providing access to data, and the specific datasets assembled by LAION have served to train many of the text-to-image generative systems that we consider here. Notable among these is LAION-400M (August 2021), a collection of 400 million image-caption pairs that was scraped from the Internet and that has been used to train text-to-image generative models. 26 Perhaps the most traditional of the organizations discussed here, there is nothing particularly novel about a non-profit organization that is dedicated to the maintenance of “common good” resources like datasets, and that operates in the space between academic institutions and private interests. What is notable, however, is the way in which organizations like LAION intertwine academic, governmental, and private interests. As an analogous example, consider ImageNet—an organization established in 2009 in order to manage a foundational dataset of images, that is led by researchers at Stanford and Princeton, and is financially supported by Google, NVIDIA, and the National Science Foundation. 78
• EleutherAI is a deep learning research lab founded in July of 2020. In contrast with academic labs, with privately funded labs such as OpenAI, and with labs that are a part of established tech companies, EleutherAI is largely “unaffiliated.” Self-described as “a decentralized collective of volunteer researchers, engineers, and developers focused on … open source AI research,” Eleuther was started as a counter-point to larger initiatives: to “give OpenAI a run for their money like the good ol’ days.” 79 This grassroots collective has indeed found remarkable success, and has started a number of initiatives that have held wide impact. As mentioned above, Eleuther’s “Pile” dataset has been used to train open-source large-language models that approach the performance of state-of-the-art models such as GPT-3. More recently—in partnership with Mila Quebec, LAION, IBM Research, and Stability AI—Eleuther was awarded computation time on a supercomputer by the U.S. Department of Energy’s Office of Science, 80 a sign of its increasing prominence. Eleuther organizes its efforts on an open Discord server, and a number of the individuals discussed in the sections that follow continue to be active in this community.
• Disco Diffusion is both a generative text-to-image model (as detailed in a section below) and a community of creative technologists that created and maintains this tool, which is expressed through a collection of shared and commonly forked Colab notebooks. Although the first Disco Diffusion notebooks date back to October 2021, 81 the Discord server that the group uses to organize its efforts was established in February of 2022. 27
The social organizations described above demonstrate that, in contrast with earlier periods, and due in no small part to the infrastructural and social technologies described above, innovation in this period of time is no longer the province of large academic institutions and closed corporate labs. Rather, non-affiliated labs and independent contributors can make breakthroughs of significance. The text-to-image technologies we consider here are the product of just this, and, as we demonstrate in a section below, a number of “clever little tricks” developed by relative amateurs led to innovations that are directly responsible for the events of summer 2022.
Technical groundwork—technologies and application domains
The text-to-image generative models that are our subject are an amalgam of technologies, each holding its own provenance. In this section, we “take apart” one such model that has become the dominant approach—CLIP-Guided Diffusion—and conduct a biography of the constituent elements. Through this, we can see how large-scale organizations, such as academic institutions and large corporate labs, laid the necessary groundwork for later innovations by smaller-scale actors. As mentioned in the section above, although the fruit of some of these efforts remains proprietary, without the openness of publicly published research that includes detailed descriptions of model architectures (at times accompanied by open-source code), many of the novel structures of communal labor discussed above would not be possible. The interdependency of large-scale private interests with small-scale and public ones is also visible in the reliance that some of these efforts hold on public datasets maintained by volunteers.
Before detailing the specific technologies that can be seen as the “ancestors” of CLIP-Guided Diffusion, we detail below parallel and preceding approaches to generating images from text.
Preceding and parallel approaches to text-to-image synthesis
Described in this section are approaches to text-to-image synthesis that are unrelated to the CLIP-Guided Diffusion approach that is currently the state-of-the-art. These approaches find relevance here not only in that they represent alternative means to similar ends, but also in that they have directly inspired some of the “clever little tricks” described in the section that follows.
Tracing the origins of text-to-image synthesis brings us relatively quickly to a different problem. Machine learning researchers did not start out intending to generate images from text. To the contrary, modern approaches are derived from automatic image captioning systems. It turns out that text-to-image systems grew out of image-to-text systems. These include systems such as “BabyTalk,” 83 developed in 2013 as the first system to automatically generate natural language descriptions from images. The first system capable of generating images from text was created just a year later, in 2014. In “Generating Images from Captions with Attention,” 84 a team of researchers from the University of Toronto demonstrated a system capable of synthesizing 32 × 32 pixel images from natural language descriptions, samples of which are shown in Figure 4. Technically, this work bears little relation to contemporary approaches. 28
Synthetic images generated in 2014 with the following captions (from left to right): “A toilet seat sits open in the grass field”, “A herd of elephants flying in the blue skies”, “A stop sign is flying in blue skies”, “A person skiing on sand clad vast desert”. From “Generating Images from Captions with Attention” [84].
Three years later, a separate effort that brought academic researchers into a partnership with Microsoft was the first to recognize “art generation and computer-aided design” 87 as a domain of application for this technology, the results of which may be seen in Figure 5. AttnGAN 88 proposed the use of a modified GAN—an “attentional generative network”—combined with a system capable of determining the similarity between a synthetic image and a caption.
An image generated from the caption “A man and women rest as their horse stands on the trail.” using AttnGAN—Connie Fan, 2018. 89

Two years after AttnGAN, we are approaching text-to-image generative models that appear quite similar to CLIP-Guided Diffusion systems, and that were developed in parallel with such systems. Despite these similarities, here we find a number of notable examples that use entirely different technological approaches. Most prominent among these is the original DALL-E system, 90 developed by OpenAI and first reported publicly in 2021. The components of the first DALL-E system were completely different from those of the second release: it did not use CLIP for understanding text (instead relying on something closer to GPT-3), and did not use diffusion as the generative engine (instead employing a variational autoencoder, or VAE). While the technological underpinnings were different, the public release of this system had a direct impact on subsequent events, with the image of the now-iconic “armchair in the shape of an avocado” (shown in Figure 6) entering into the consciousness of technical popular media at the time. 91
An image generated from the caption “an armchair in the shape of an avocado” using DALL-E 1, 2021. 92

While the first DALL-E was widely publicized by OpenAI in 2021, and was promoted with an on-line demo, the technical components of this system were only selectively released. Despite this limited information, the architectural details discussed in the paper were sufficient for others to reproduce this work in an open-source way. The ruDALL-E (Nov 2021) model is “comparable to the English DALL-E from OpenAI” 93 in terms of architecture, and offers only somewhat diminished performance. Here, the research arm of the Russian financial services company Sber AI re-created OpenAI’s DALL-E by combining the components that OpenAI released with custom-trained elements. Resulting synthetic images may be seen in Figure 7. Sber AI was much more open with their work, and publicly released the code 94 for their “Russian clone” along with model weights, and published a Colab notebook 95 that has since been copied and modified widely.
An image generated with the prompt “Sneakers, Nike, color: black” using ruDALL-E—Alex Nikolich, 2022. 96

A biography of CLIP-guided diffusion
Having detailed the parallel history of technical work that has held similar functional aims to the current state-of-the-art text-to-image generative system, we now turn our attention to CLIP-Guided Diffusion itself. As seen in the section that follows, the specific architecture of CLIP-Guided Diffusion was invented by Katherine Crowson. 97 However, this was not an isolated breakthrough, as Crowson’s was just one in a rapid series of innovations made by communities of creative contributors from 2021–22. This architecture set the basic approach that has since been adopted by all the public-facing text-to-image generative systems, including DALL-E 2, Midjourney, Stable Diffusion, and others. As the name implies, there are two basic components at play in CLIP-Guided Diffusion: Diffusion and CLIP. The former is a system that generates images through a “de-noising” process, while the latter is a system that guides the generation of images in order to better match a given caption. In this section, we trace the technical lineage of each of these components.
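Before tracing each lineage, it may help to sketch how the two components fit together in a sampling loop. The sketch below is a schematic reconstruction in PyTorch-style code, not Crowson’s notebook or any released implementation: the `diffusion` model and `clip_image_encoder` are hypothetical stand-ins supplied by the caller, and practical details (noise schedules, image augmentations, spherical-distance losses) are omitted. At each de-noising step, CLIP scores the current image estimate against the prompt, and the gradient of that score nudges the next step.

```python
import torch

def clip_guided_sample(diffusion, clip_image_encoder, text_embed, shape,
                       steps=250, guidance_scale=1000.0):
    """Schematic CLIP-guided diffusion sampling; all models are hypothetical stand-ins."""
    x = torch.randn(shape)                                        # begin from pure noise
    for t in reversed(range(steps)):
        x = x.detach().requires_grad_(True)
        pred_x0 = diffusion.predict_clean_image(x, t)             # de-noiser's current guess at the final image
        image_embed = clip_image_encoder(pred_x0)                 # CLIP embedding of that guess
        similarity = torch.cosine_similarity(image_embed, text_embed, dim=-1).sum()
        grad = torch.autograd.grad(similarity, x)[0]              # direction that better matches the prompt
        x = diffusion.denoise_step(x, t) + guidance_scale * grad  # ordinary de-noise step, nudged by CLIP
    return x.detach()
```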
We begin with Diffusion.
As is often the case in ML research, diffusion-driven image generation did not find its origins in images at all. Here, inspiration was drawn from non-equilibrium statistical physics. In 2015, a team of researchers at Stanford and UC Berkeley developed a system that “systematically and slowly” destroys structure in a given data sample, and then teaches a neural network to restore the sample to something that meaningfully resembles its previous state. 98 Building on this work, in 2020, a separate team from UC Berkeley applied a similar “de-noising” approach to image synthesis problems, specifically as an alternative to a GAN-based approach to generating synthetic images that match a given class of images. 99 Results from this work may be seen in Figure 8.
Images generated by a team at UC Berkeley to match the LSUN Church class, 2020. 99
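A minimal sketch of the de-noising idea, under heavily simplified assumptions (toy data, a toy fully-connected network, and a simple linear corruption schedule rather than the schedules used in the cited papers): structure is destroyed by mixing a sample with Gaussian noise at a random strength, and a network is trained to predict the noise that was added so that the corruption can later be undone.

```python
import torch
import torch.nn as nn

# Toy stand-ins: real systems operate on image batches with a U-Net-style de-noiser.
denoiser = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

for step in range(1000):
    clean = torch.randn(32, 64)                  # stand-in for a batch of real data samples
    noise_level = torch.rand(32, 1)              # how thoroughly to destroy structure
    noise = torch.randn_like(clean)
    noisy = (1 - noise_level) * clean + noise_level * noise   # "systematically and slowly" corrupt
    predicted_noise = denoiser(noisy)            # the network learns to identify the corruption
    loss = nn.functional.mse_loss(predicted_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```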

In 2021, following closely after this work completed in academic labs, OpenAI published an approach to image synthesis that, by trading “diversity for fidelity,” successfully brought some of the advantages previously held by GANs to diffusion systems. 100 The open-source release of this work 101 marks the first time that a diffusion-based approach outperformed a GAN, thereby establishing diffusion as the new state-of-the-art in image synthesis.
We turn now from the image synthesis component to the system that accounts for text-to-image relationships—CLIP.
A summary of the socio-technical context of creative machine learning from January 2021 to June 2022.
Clever little tricks
Documented here is the recent history of a series of critical innovations expressed through informal pieces of software (typically Colab notebooks) that directly precede contemporary text-to-image systems. While these systems rely on architectural designs and technical components developed in larger institutional settings (most notably OpenAI’s CLIP), their assembly was completed by non-affiliated researchers and communities of creative contributors. Such “assemblages” represent significant innovations that directly led to the technologies that drew the attention of architectural visual culture in the summer of 2022. Although these technologies were essentially developed by amateurs, without this work there would be no DALL-E 2, Imagen, Midjourney, or Stable Diffusion.
Before proceeding, a note on the way in which this history was constructed. The nature of the innovations that we account for here resists the formal documentation mechanisms we would typically expect when writing in a technical context. When seeking to account for the contributions of a dispersed community of non-affiliated (and often non-credentialed) technologists, we cannot count on the same reliable sources found in the context of an academic lab or a research group in a large tech firm. By necessity, the following account was pieced together from a number of informal sources 30 that include blog posts, Discord servers, log files found in GitHub repos, attributions included in code comments, and posts found on a variety of social media such as Reddit and Twitter.
The specific software discussed are listed in overview below. In the section that follows, we describe the unique architecture of each, who contributed to its development, and present images generated by these systems at the time of their creation. Wherever possible, we have also retroactively re-created images using each system, employing a standard set of text prompts to facilitate comparison in an architectural visualization context.
• SIREN and CLIP (Jan 2021)
• The Big Sleep/BigGANxCLIP (∼Feb 2021)
• Aleph2Image (∼Feb 2021)
• VQGAN + CLIP (∼April 2021)
• CLIP-Guided Diffusion (∼July 2021)
• Disco Diffusion (Oct 2021)
• Aphantasia/Illustra/FFT + CLIP (Nov 2021)
• JAX CLIP-Guided Diffusion (Nov 2021)
• Looking Glass (Dec 2021)
• CLIP Guided Deep Image Prior (Jan 2022)
Within this landscape, a number of individuals—largely self-taught and not affiliated with any formal lab at the time of their most important contributions—stand out as particularly instrumental. We list a selection of such individual contributors here, alongside a brief biography to offer context:
• Ryan Murdock (@advadnoun) contributed in significant ways to a number of the innovations described below, and was the first person to connect CLIP to an image generation function (outside of OpenAI). A “self taught” 106 machine learning engineer, Murdock holds a Bachelor’s in Psychology from the University of Utah. Murdock joined Adobe in 2021 to work on “neural filters” 107 for Photoshop.
• Katherine Crowson (@RiversHaveWings) is a central figure in the AI art community, and initiated a number of foundational approaches—such as VQGAN + CLIP and the first CLIP-Guided Diffusion notebook—that have since been widely forked and modified by others. Crowson is a self-described “AI/generative artist” that “writes her own code.” She has been a member of the EleutherAI Discord since its establishment in August of 2020, and currently works as a Principal Researcher at Stability AI.
• Maxwell (Max) Ingham (@Somnai_dreams) created and initially maintained Disco Diffusion, an influential image generation model and creative community that directly preceded DALL-E 2 and Midjourney. Ingham holds a diploma in graphic design from the Canberra Institute of Technology (2014), and has worked as a web designer and front-end developer since 2012. Ingham now works as a developer at Midjourney.
SIREN and CLIP (Jan 2021)
The SIREN + CLIP notebook, 108 by Ryan Murdock, represents the first attempt to connect CLIP to an image generation function; in this case, a sinusoidal representation network (SIREN) 109 serves as the image generation model. This work produced the first published CLIP-guided synthetic image, shown in Figure 9.
The first published CLIP-guided image—Ryan Murdock, 2021. 110
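The mechanism can be sketched as follows. This is a schematic reconstruction, not Murdock’s notebook: the image is represented by a small coordinate network with sine activations, and its weights are optimized so that CLIP rates the rendered image as similar to the prompt. The CLIP image encoder and text embedding are stand-ins passed in by the caller, and the usual tricks (random crops and augmentations before scoring, SIREN’s frequency scaling) are omitted.

```python
import torch
import torch.nn as nn

class TinySiren(nn.Module):
    """Maps (x, y) pixel coordinates to RGB values through sine activations."""
    def __init__(self, hidden=64):
        super().__init__()
        self.l1 = nn.Linear(2, hidden)
        self.l2 = nn.Linear(hidden, hidden)
        self.l3 = nn.Linear(hidden, 3)

    def forward(self, coords):
        h = torch.sin(self.l1(coords))
        h = torch.sin(self.l2(h))
        return torch.sigmoid(self.l3(h))

def siren_clip_generate(clip_image_encoder, text_embed, size=64, iters=200):
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, size), torch.linspace(-1, 1, size), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    siren = TinySiren()
    opt = torch.optim.Adam(siren.parameters(), lr=1e-3)
    for _ in range(iters):
        image = siren(coords).reshape(1, size, size, 3).permute(0, 3, 1, 2)  # render current image
        score = torch.cosine_similarity(clip_image_encoder(image), text_embed, dim=-1).mean()
        loss = -score                               # maximize CLIP similarity to the prompt
        opt.zero_grad()
        loss.backward()
        opt.step()
    return siren(coords).detach().reshape(size, size, 3)
```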

In comparison to state-of-the-art synthetic image generation techniques (e.g., StyleGAN) 51 and text-to-image generation techniques (DALL-E 1) 90 of the time, the quality of the synthetic images produced by SIREN + CLIP, such as those shown in Figure 10 and Figure 11, is not particularly noteworthy. The given prompt appears to affect the image nearly exclusively in terms of texture, tone, and color. Distinct figures are impossible to discern, and there is little observable compositional coherence. At times, the image will contain certain sections of the prompt itself, rendered as barely decipherable text.
An image generated by SIREN + CLIP using the prompt “A professional photograph of a Brutalist museum in Berkeley, California.” Image courtesy of the author, 2023.
An image generated by SIREN + CLIP using the prompt “Interior photo of a house. Thin wooden construction scaffolding. Architectural HD, HQ, 4k, 8k.” Image courtesy of the author, 2023.

The Big Sleep/BigGANxCLIP (∼Feb 2021)
The Big Sleep notebook, 111 also by Murdock, was the first to connect CLIP to a generative adversarial network (GAN), and uses BigGAN 112 as the generative model.
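The change in mechanism can be sketched as follows (again a schematic reconstruction, not Murdock’s notebook): the generator is a frozen, pre-trained GAN, and what is optimized is its latent input, so that CLIP rates the generated image as similar to the prompt. The `gan_generator` and CLIP encoder here are stand-ins for the pre-trained models.

```python
import torch

def big_sleep_style_search(gan_generator, clip_image_encoder, text_embed,
                           latent_dim=128, iters=300):
    z = torch.randn(1, latent_dim, requires_grad=True)   # the latent code is the only thing trained
    opt = torch.optim.Adam([z], lr=5e-2)
    for _ in range(iters):
        image = gan_generator(z)                          # frozen generator; its weights never change
        score = torch.cosine_similarity(clip_image_encoder(image), text_embed, dim=-1).mean()
        loss = -score                                     # maximize similarity to the prompt
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gan_generator(z).detach()
```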
Because this technique uses BigGAN as the generator, the aesthetic of the resulting images, such as those seen in Figures 12–14, is easily recognizable as such. In contrast with earlier incarnations of GANs, BigGAN was capable of representing a wide range of subjects, and could produce higher-resolution work, but was also far too expensive to custom train. Because of this, images produced by BigGAN tend to be more visually homogeneous than those of other GAN models, a characteristic that is evident in the results of the Big Sleep notebook, seen nearby. Here, while figures are more distinct, textures remain dominant, and small-scale details are not discernible.
An image generated with the prompt “a cityscape in the style of Van Gogh”—Ryan Murdock, 2021. 113
An image generated by BigGANxCLIP using the prompt “A professional photograph of a Brutalist museum in Berkeley, California.” Image courtesy of the author, 2023.
An image generated by BigGANxCLIP using the prompt “Interior photo of a house. Thin wooden construction scaffolding. Architectural HD, HQ, 4k, 8k.” Image courtesy of the author, 2023.


Aleph2Image (∼Feb 2021)
While the Aleph2Image notebook 114 does not employ CLIP, it is worth including in our timeline in that it demonstrates the diversity of architectures that may be applied to text-to-image problems. Here, Murdock takes apart DALL-E, using only the image generation part (the variational autoencoder, or VAE) of OpenAI’s process in order to generate images.
Aesthetically, Aleph2Image may represent a step backward from Murdock’s earlier experiments. Without the reliable image synthesis capacities of BigGAN, figures are once again lost in these images, such as those seen in Figures 15–17, and we again begin to see text from the prompt present in the results.
An image generated with the prompt “Godzilla attacks the University of Chicago”—Ryan Murdock, 2021. 115
An image generated by Aleph2Image using the prompt “A professional photograph of a Brutalist museum in Berkeley, California.” Image courtesy of the author, 2023.
An image generated by Aleph2Image using the prompt “Interior photo of a house. Thin wooden construction scaffolding. Architectural HD, HQ, 4k, 8k.” Image courtesy of the author, 2023.


VQGAN + CLIP (∼April 2021)
The VQGAN + CLIP notebook, 116 authored by Katherine Crowson, builds on Murdock’s success in employing GANs for image synthesis, and represents a significant leap forward in image coherence and figuration. Here, BigGAN is swapped out in favor of VQGAN, 117 which combines a variational autoencoder (VAE) as the generator with a standard GAN discriminator to produce higher-quality results. It was at this point that the generative art scene took notice of the potential of CLIP-guided systems, making the VQGAN + CLIP notebook one of the most widely forked X + CLIP image generation systems at the time. It is also notable that at this moment we begin to see prompts that address not only subject matter, but also style, through the invocation of individual artists and movements. Consider, for example, Crowson’s image below of a “pencil sketch in the style of beksinski (sic).”
In large part due to the advances in image quality provided by VQGAN—seen in Figures 18–20—Crowson’s work shows a remarkable advance in the aesthetic range and coherence of synthetic image-making. Here, figures are as clearly visible as textures, and spatial arrangements and compositions begin to appear in meaningful ways.
An image generated with the prompt “a black and white pencil sketch in the style of beksinski”—Katherine Crowson, 2021. 118
An image generated by VQGAN + CLIP using the prompt “A professional photograph of a Brutalist museum in Berkeley, California.” Image courtesy of the author, 2023.
An image generated by VQGAN + CLIP using the prompt “Interior photo of a house. Thin wooden construction scaffolding. Architectural HD, HQ, 4k, 8k.” Image courtesy of the author, 2023.


CLIP-guided diffusion (∼July 2021)
In July of 2021, a Colab notebook authored by Katherine Crowson 97 was publicly posted. This release marks a seminal milestone in the history of machine-augmented visualization, as it was the first time that CLIP was paired with a diffusion model, and it established the basic architecture that persists in contemporary text-to-image systems. The technical components of this first CLIP-Guided Diffusion model are described in detail above. The notebook produced “absolutely unbelievable results,” including the image seen in Figure 21, but because it was resource-heavy and very slow by today’s standards, it failed to gain the traction of Crowson’s VQGAN + CLIP notebook.
An image generated with the prompt “Psychic Sermon in the Temple of the Mind”—KaliYuga, 2021. 105

While in terms of texture, figure, and spatial composition the early incarnations of CLIP-Guided Diffusion are not so aesthetically distinct from the best GAN-based models that preceded them, we can see from our current vantage point, and also in the resulting images shown in Figure 22 and Figure 23, that this diffusion-based approach held more promise.
An image generated by CLIP-Guided Diffusion using the prompt “A professional photograph of a Brutalist museum in Berkeley, California.” Image courtesy of the author, 2023.
An image generated by CLIP-Guided Diffusion using the prompt “Interior photo of a house. Thin wooden construction scaffolding. Architectural HD, HQ, 4k, 8k.” Image courtesy of the author, 2023.

Disco diffusion (Oct 2021)
As discussed above, Disco Diffusion is both a generative text-to-image model and a community that created and maintains this tool. The project is expressed through a loose collection of shared and commonly forked Colab notebooks. The original model was authored by Katherine Crowson, while the original notebook was written by Maxwell Ingham. Thereafter came a steady stream of monthly revisions, starting in October of 2021 and extending through the first major release of the current version in February of 2022. 81,119–122 Across this time, contributions were made by a considerable number of programmers, 31 many of whom now work for organizations—such as Midjourney, Stable Diffusion, and Palmweaver—that have brought text-to-image technology into the public consciousness.
Aesthetic progress in Disco Diffusion—demonstrated in Figures 24–26—may be seen less in terms of image coherence (the texture, figure, and spatial composition attributes discussed above) and more in terms of the capacity of this community for innovative variation. The widespread popularity of Disco Diffusion inspired a number of forks and adaptations, 32 each with its own aesthetic quality (e.g., “pixel art diffusion,” “pulp science fiction diffusion,” “watercolor diffusion”). The proliferation of “disco variants,” each authored by a subset of enthusiasts and each fine-tuned on its own narrowly focused dataset, upends the problem of data bias: while harmful and discriminatory content remains a critical issue, the question of bias among small subcultures and focused models is more a function of the intentional decisions within these communities than the province of large tech firms managing general-purpose models.
An image generated by Disco Diffusion—Adam Letts, 2022. 123
An image generated by Disco Diffusion using the prompt “A professional photograph of a Brutalist museum in Berkeley, California.” Image courtesy of the author, 2023.
An image generated by Disco Diffusion using the prompt “Interior photo of a house. Thin wooden construction scaffolding. Architectural HD, HQ, 4k, 8k.” Image courtesy of the author, 2023.


Disco Diffusion marks an inflection point in the history described here. While small-scale experiments in this space continued well after the introduction of Disco, the creative community catalyzed by this technology (as well as the specific architecture of CLIP-Guided Diffusion) was central to most of the well-known text-to-image software that are the subject of this paper. These larger efforts directly benefited from the technical innovations developed in this small community. Perhaps more importantly, the large organizations that developed systems such as Midjourney and Stable Diffusion benefited from the human capital of the Disco Diffusion community, and directly hired many of the central contributors in early 2022. Although the parallel and subsequent experiments listed below have not yet made as significant an impact, we list them here both to complete the historical record and in anticipation of the possibility of future developments.
Aphantasia/Illustra/FFT + CLIP (Nov 2021)
Aphantasia 126 is the first of a number of experiments by Vadim Epstein that attempt to improve upon the established CLIP + Diffusion architecture by again swapping out one of the basic components. Here, CLIP is combined with FFT (Fast Fourier Transform) as an alternative generator to diffusion.
Aesthetically, as shown in Figures 27–29, Epstein’s work recalls Murdock’s early pre-GAN assemblages. Figures that operate at the scale of textures dominate these images, which suggest a collage of merged spaces and vanishing points.
An image generated by Aphantasia—Vadim Epstein, 2022. 126
An image generated by Aphantasia using the prompt “A professional photograph of a Brutalist museum in Berkeley, California.” Image courtesy of the author, 2023.
An image generated by Aphantasia using the prompt “Interior photo of a house. Thin wooden construction scaffolding. Architectural HD, HQ, 4k, 8k.” Image courtesy of the author, 2023.
A still of an animation generated with JAX CLIP-Guided Diffusion—@huemin, 2022. 129




JAX CLIP-guided diffusion (Nov 2021)
This variant of CLIP-Guided Diffusion by Katherine Crowson and John David Pressman (@jd_pressman) 127 is a ground-up re-implementation of the original CLIP-Guided Diffusion model in the JAX framework. The notebook 128 was authored by @huemin and @nshepperd. Since there is little aesthetic distinction from the July 2021 CLIP-Guided Diffusion, as shown in Figure 30, we do not present comparison images here.
Figure 30. A still of an animation generated with JAX CLIP-Guided Diffusion—@huemin, 2022. 129

Looking glass (Dec 2021)
This experiment, 130 authored by @Mx. AI Curio, @Bearsharktopus, and others (exact attribution is unclear), is perhaps the most radical presented here, as it abandons both CLIP and diffusion. Instead, RuDALL-E is adapted as a “One-Shot Finetuner” that necessarily starts with an initialization image. This departs from our text-to-image history in that textual prompts are not required; as such, we do not present comparison images here.
CLIP guided deep image prior (Jan 2022)
This experiment by Daniel Russell (@russelldc, now at Midjourney) combines CLIP with Deep Image Prior 131 as the generator in place of diffusion.
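As a rough illustration of this pairing (again, not Russell’s code), the sketch below swaps the fixed Fourier parameterization shown earlier for a small untrained convolutional network, the “deep image prior,” whose weights are optimized so that its output satisfies CLIP. The toy generator, prompt, and hyperparameters are assumptions made for brevity.

```python
# A minimal sketch of CLIP guiding a Deep Image Prior: an untrained convolutional network
# maps a fixed noise tensor to an image, and its weights are optimized against the prompt.
# This is not Russell's implementation; the tiny generator below stands in for the deeper
# hourglass network used in the original Deep Image Prior work.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()

prompt = "interior photo of a house with thin wooden construction scaffolding"
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize([prompt]).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

# The generator is never pre-trained: its architecture alone supplies the "prior."
generator = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
).to(device)

z = torch.randn(1, 32, 224, 224, device=device)  # fixed noise input, never updated
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-3)

mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

for step in range(300):
    optimizer.zero_grad()
    image = generator(z)  # pixels are produced by the network, not stored directly
    img_feat = model.encode_image((image - mean) / std)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    loss = -(img_feat * text_feat).sum()  # pull the generated image toward the prompt
    loss.backward()
    optimizer.step()  # update the generator's weights rather than the pixels
```

In this arrangement, image structure emerges from the inductive bias of the convolutional generator rather than from a learned denoiser, which may help account for the stronger compositional coherence noted below.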
While aesthetically less developed than contemporaneous efforts, CLIP Guided Deep Image Prior represents the most successful non-diffusion text-to-image synthesis system at the time of writing. As shown in Figures 31–33, in contrast with early diffusion-based systems, these images do not suffer from the dominance of textural fields, and instead begin from a place of strongly coherent image composition and arrangement. While much work remains at the level of fine detail and figural coherence, the approach appears to hold promise.
Figure 31. An image generated with the prompt “a moody painting of a lonely duckling”—Daniel Russell, 2022. 132
Figure 32. An image generated by CLIP Guided Deep Image Prior using the prompt “A professional photograph of a Brutalist museum in Berkeley, California.” Image courtesy of the author, 2023.
Figure 33. An image generated by CLIP Guided Deep Image Prior using the prompt “Interior photo of a house. Thin wooden construction scaffolding. Architectural HD, HQ, 4k, 8k.” Image courtesy of the author, 2023.

Subsequent developments in larger institutional settings
While not strictly a part of the timeline of informal innovations that are the subject of this section, it is worth noting that the technologies established by the non-affiliated technologists detailed above continued to develop with the support of larger institutional players. With the benefit of significant capital investment, and the attendant professional labor force that comes with it, these technologies have been considerably refined and adapted for a commercial audience since being released widely in the summer of 2022. To facilitate a comparison with the images shown above, we extend our sequence of synthetic images to two of the more popular text-to-image systems, Midjourney and Stable Diffusion, using our standard set of text prompts, and chart their progression through the latest versions.
Reflection
This text documents the socio-technical context that gave rise to the text-to-image generation tools that found prominence in architectural visual culture in the summer of 2022. To facilitate a critical examination of these tools, we present three scenes in time, and four interlinked lenses through which to view this creative and technical labor: technology, social contexts, infrastructures, and practices. In placing particular emphasis on the recent history of informal innovations developed by non-affiliated researchers and communities, we aim to ensure recognition of the contributions of independent and amateur creative technologists who produced innovations core to the development of better-known systems. The specific story of the development of these tools parallels larger shifts in the culture across these periods, and holds lessons for how we understand the broader landscape of forces that shape creative work.
The history presented here reveals that issues of accessibility and access have been present since the inception of AI, and it also illuminates the origin of certain biases implicit in an emerging architectural visualization technique. Computation in design has historically been seen as a liberating force, and themes of emancipation and labor replacement remain prevalent in contemporary dialogs surrounding AI-augmented design. Just as an “appeal to data” was used rhetorically across the early 2000s to lend authority to design proposals, we may expect that similar claims will be made regarding the “new digital styles” that are sure to emerge in relation to text-to-image generation systems.
From the perspective of creative work, the alternative models of labor found across our three scenes echo trends and suggest alternatives for the AEC industry. Specifically, the communal models of labor described here (e.g., those at work in the Disco Diffusion community) may hold relevance for technical architectural labor. Technical innovation in our final scene is marked by a fluid and diffuse sense of authorship. Work is completed in collectives whose members may only know one another by their Discord handles, and attribution occurs in “blobs” rather than in clear hierarchical lines of contribution. To illustrate, consider the characterization of attribution made by one core contributor to Disco: “I just spam a long list of what people did.” 33 Such a posture cuts both ways. On the one hand, it offers an appealing alternative model of creative technical practice for an AEC industry dominated by the near-monopoly of a small number of software companies. Deeply impactful innovation, it would seem, need not cross the threshold of Autodesk’s San Francisco headquarters. On the other, the loss of clear lines of attribution risks exploitation; it is no wonder that tools emerging from such a context are beset by criticisms of visual plagiarism and intellectual theft.
Limits and Future Work
This study suggests multiple layers of dependencies—technological, infrastructural, financial, social, and cultural—that cannot be effectively examined from a single perspective, nor within the confines of this format. Even a cursory review of the details of any one of our lenses reveals that we are not conducting an apples-to-apples comparison. A look at the application domains across these three periods, for example, shows the limits of comparing the significance of, say, research that aimed to develop a “general problem solver” in 1969 to that of a system that generates pixel art animations from text prompts in 2022. Since the ambitions, audiences, and proportions of the creative and intellectual workforce brought to bear on these applications are markedly different across these contexts, this text is suggestive of more comprehensive future work that accounts for these distinctions. Further, our account of the recent history of informal innovations would benefit from an oral history approach, and would be enriched by a series of interviews with the key figures in the field.
Footnotes
Acknowledgements
The research that led to this publication was supported by a Fellowship at Stochastic Labs in Berkeley in the summer of 2022, and could not have proceeded without the insights and inspiration found in this generous community. The authors would like to thank Vero Bollow, Alexander Reben, Joel Simon, and Virj Kan of Stochastic for their kindness and hospitality in sharing their magnificent Victorian home. We would also like to thank the summer 2022 residents – Lisha Li, Chigozie Nri, Adam Letts, Max Kremenski, Evan Casey, Joel Lehman, Nick Shaheed, and Cory Li – for guiding us into the world of creative machine learning.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
