
Abstract

As a sociotechnical practice at the nexus of humans, machines, and visual culture, text-to-image generation relies on verbal prompts as the primary technique to guide generative models. To align desired aesthetic outcomes with computer vision, human prompters engage in extensive experimentation, leveraging the model’s affordances through prompting for style. Focusing on the interplay between machine originality and repetition, this study addresses the dynamics of human-model interaction on Midjourney, a popular generative model (version 6) hosted on Discord. It examines style modifiers that users of visual generative media add to their prompts and addresses the aesthetic quality of AI images as a multilayered construct resulting from affordance actualization. I argue that while visual generative media holds promise for expanding the boundaries of creative expression, prompting for style is implicated in the practice of generating a visual aesthetic that mimics paradigms of existing cultural phenomena, which are never fully reduced to the optimized target output.

Keywords

visual generative media, prompting for style, affordance actualization, text-to-image generation, prompt modifiers, Midjourney

Silicon-based intelligence of AI-enabled image generators is often described as magic. A discourse of enchanted determinism (Campolo and Crawford, 2020) serves AI impresarios to capture public trust, framing visual AI as a new medium of thought to “expand the imaginative power of the human species” (Midjourney), “translate ideas into exceptionally accurate images” (DALL·E 3), and “provide users with a variety of options for scalability and quality to best meet their creative needs” (Stable Diffusion 3). In Heidegger’s sense of Being and Time, machines can indeed be conceptualized as tools since they, like humans, operate on the basis of language. However, in a world in which machines are becoming more computationally powerful, claims about accuracy and superhuman creativity are increasingly paired with the inability to understand how deep neural networks lead to such extraordinary visual outputs. The magic of visual generative media lies in their enigmatic and unpredictable processes (Chesher and Albarrán-Torres, 2023) and manifests in a visual aesthetic of AI images invoked through prompting.

As the main algorithmic technique to direct foundation models of text-to-image generators, prompting entails crafting natural language descriptions of the desired visual outputs by adding prompt “modifiers,” or short key phrases (see Oppenlaender, 2023), which specify representational content (e.g., subject matter, setting), technical aspects (e.g., aspect ratio, region variation), and visual aesthetic (e.g., style, art movement) (Feng et al., 2023; Liu and Chilton, 2022). Among these, style modifiers raise the most questions because, in addition to specifying technical aspects of visual composition, they allow users to mimic the style of human artists, serving as a shortcut to achieving desired visual outputs. This shortcut is enabled either by the system, through associations made from training data containing copyrighted artworks scraped from the web, or by the human prompter, through style modifiers introduced to a prompt, such as --stylize on Midjourney or “in the style of” on DALL·E 2. Both practices are far from unproblematic and require critical appraisal.

Drawing on the earlier work on text-to-image generation as a sociotechnical practice at the nexus of humans, machines, and visual culture (Laba, 2024), I address style modifiers as an entry point into a critical study of human-model interaction through prompting. This investigation primarily focuses on the nuances of such interaction in relation to what this paper terms prompting for style, which involves adding style modifiers to a text prompt to achieve a particular visual aesthetic in AI-generated images. More specifically, I ask:

RQ1: Which style modifiers do users of visual generative media add to their prompts to produce desired visual outputs?

RQ2: How might prompting for style be more adequately thought through in both practice and theory?

The primary objective of this study is to enhance the current understanding of human-model interaction¹ dynamics through prompting and to explore the extent to which generative models can be seen as facilitators or arbiters of originality. Focusing on user discussions about the platform affordances of Midjourney (version 6), a popular generative AI model hosted on Discord, I examine the practice of prompting for style on the #prompt-chat channel of Midjourney’s Discord server, a “dedicated discussion room for talking about how to craft prompts.” This space offers an insight into a user perspective on AI image generation, pointing to the value of community learning in the context of unpredictable machine behavior while also revealing affordances that enable questionable production practices around style recontextualization without proper attribution. I argue that while visual generative media holds promise for expanding the boundaries of creative expression, prompting for style is implicated in the practice of generating a visual aesthetic that mimics paradigms of existing cultural phenomena, which are never fully reduced to the optimized target output. The next section reviews prompting for style as a technique for producing desired visual outputs and positions this work within the existing literature.

Prompting for style as a pixel predictor

As direct-to-consumer interfaces, visual generative media enable creation of novel images, providing opportunities for both novices and experienced users to supplant previous visual production methods with generation (Burkhardt and Rieder, 2024) in response to verbal prompts initiated by human actors. The growing interest in semantics of natural language as a form of human-model interaction has seen emergent research efforts to tackle the open problem of prompting in large foundation models, mostly from computer science and human-computer interaction (HCI) perspectives. This body of research advances the current understanding of how people interact with generative models, contributing to practitioner knowledge of text-to-image generation. In other words, its focus is on how to maximize the system’s capacity to render accurate representations, enhance aesthetic quality, and achieve cohesive visual outputs.

For example, Pavlichenko and Ustalov (2023) experimentally demonstrate how a set of keywords for Stable Diffusion (version 1.4) improves the aesthetics of generated images, most of which relate to visual effects (e.g., cinematic, colorful background, dramatic lighting, high detail) and art references. Feng et al. (2023) offer computational and practice-based solutions to interactive prompt engineering, such as PromptMagician, a visual analysis system that provides a multi-tiered visualization of retrieved images with suggested prompt words, supporting users in setting various criteria for personalized prompt refinement. Liu and Chilton’s (2022) design guidelines on prompt engineering identify successful themes across styles, such as salient color palettes, relevant textures, appropriate lighting and perspective, and motifs. Their extensive experimentation involves prompt modifiers and model hyperparameters across three partitions of styles, understood by the authors as “keywords” to suggest an aesthetic within a generation, such as abstract versus figurative, Western versus non-Western, and styles partitioned by period (premodern, modern, and digital). Similarly, Oppenlaender (2023) proposes a taxonomy of prompt modifiers (or “keywords” in Liu and Chilton’s sense). This (auto)ethnographic study makes references to a collection of style modifiers or details about “art periods, schools, and styles, but also art materials and media, techniques, and artists” (p. 7), demonstrating how these can help to reproduce a characteristic style (e.g., #pixelart) and artistic medium (e.g., oil on canvas).

Perceptual attributions of creativity to machines, which Natale and Henrickson (2024) term the Lovelace Effect, have been studied through techniques like image quality assessment (IQA) (Combs et al., 2024). This research demonstrates that people assign AI images higher ratings on certain metrics compared to human-made art (Elgammal et al., 2017) but negatively assess “artwork” produced by generative models due to the lack of ascribed mind in terms of machine agency and experience (Messingschlager and Appel, 2023), which are traditionally seen as human qualities. From a user-model interaction perspective, Oppenlaender (2022) argues for machine creativity arising from “the text-based interaction of human users with text-to-image generation systems” (p. 196). Hence, determining whether the value of a specific AI image framed as “AI art” should be based on the technological intricacy and innovation employed in its creation, or solely on its perceptions by humans, proves challenging (Cetinic and She, 2022).

The ongoing debate about the interplay of machine originality and repetition corresponds with contemporary discourses on posthumanism that contrasts with anthropocentric AI (see, e.g., Mellamphy, 2021), the proponents of which aim to safeguard communities against the overshadowing of human values and potentially distorted perception of the world by AI (Boddington, 2021). Viewed this way, creations of generative models are “forms of aesthetic mimicry” (O’Meara and Murphy, 2023: 1070), where celebrated aesthetic brilliance of machine originality emerges from repetition, and “one repeats because one doesn’t know or cannot do and becomes equal or similar through becoming identical, acquiring an identity as fixity” (Goriunova, 2012: 49).

When particular style modifiers are introduced, a generative model produces visual outputs that emulate aesthetic qualities of existing works on which the model has been trained. As such, the system operates on the basis of procedural imitation of what it has “seen” at training through a statistical pattern distribution (Pasquinelli, 2019). In other words, the model does not copy the training data directly but looks for patterns between pixels and verbal captions compared against prompt words during image generation. According to Tilford (2024), this results in “a synthetic imagination” that mimics existing cultural products, thereby granting users “a form of artistic subjectivity previously unavailable to them because of its practical unattainability” (p. 130). Similarly, for Zeilinger (2021), text-to-image generators are “generative adversarial copy machines” capable of both conformity and subversion to established norms of creativity, as “aligned with a progressive (posthumanist) notion of expressive agency that contradicts romantic ideals of creativity and originality, and which, in doing so, also challenges the cultural logic of intellectual property” (p. 12).

In sum, the practice of human-model interaction raises complex questions regarding the distribution of agency and the extent to which generative models serve as facilitators or arbiters of originality, highlighting the unresolved tension between innovation and imitation inherent in visual generative media. In the context of the machine originality/repetition debate,² Denson (2023) argues that a turn to art and creativity might be misguided altogether because the notion of machine intelligence obfuscates the agency of people—both those holding authority and those being exploited to sustain that authority. Any examination of AI’s role in shaping the future of creative practice, as suggested by Atkinson and Barker (2023), should consider the contexts of AI’s use and application. Entanglement with practice—that is, how users reshape their practices to manipulate the algorithms they depend on (Gillespie, 2014: 168)—can elucidate which interactions with a generative model are possible through the system’s interface. Seen from a media practice perspective, the power of visual generative media does not reside solely within the algorithm itself but in the “ever-changing outcome of its enactment” (Passoth et al., 2012: 4) through platform affordances (i.e., perceived action potentials) of different prompt modifiers. Affordance actualization theory, introduced next, helps to identify key constructs involved in AI image generation and how the relationships between these constructs can be addressed to explain the dynamics of human-model interaction.

Theoretical framework: Affordance actualization

An overarching theoretical framework for this study is affordance actualization theory (AAT), useful for addressing the relationship between an actor and a technological system (Bao et al., 2023; Strong et al., 2014). AAT is concerned with how users of technological systems perceive and utilize affordances (i.e., possibilities for action signaled by a technological environment), and how, in turn, affordances relate to action possibilities for goal-directed actors rather than actual actions, objects, or states (Volkoff and Strong, 2017). Affordance actualization involves an interplay between affordance perception, which is shaped by users’ goals and capabilities, and affordance enactment, where these perceived opportunities are acted upon to produce desired outcomes (Bernhard et al., 2013).

While AAT has been instrumental in fields such as HCI and organization studies,³ it has the potential to complement media studies approaches to critical issues around liminal technologies such as generative AI. The relational actor-affordance property is essential in technological system implementation and use (Bao et al., 2023), and, in the context of AI image generation, prompt modifiers offer action possibilities to direct the opaque foundation models toward desired visual outputs. Seen from an AAT perspective, a human prompter is a goal-oriented actor who perceives a style modifier as an affordance to be acted upon to achieve certain stylistic outcomes. Such affordance is also “imagined” because it evokes “expectations for technology that are not fully realized in conscious, rational knowledge but are nonetheless concretized or materialized in socio-technical systems” (Nagy and Neff, 2015: 1). Importantly, affordance actualization in visual generative media environments does not necessarily lead to desired visual outcomes due to the technological complexity of the system (Ananny and Crawford, 2018) and model uncertainty (Combs et al., 2024).

An examination of prompt affordances identified from user discussions rather than user manuals and guidelines will provide an insight into which style modifiers human prompters experiment with to enact envisaged stylistic outcomes. The study proceeds with the analysis of the most widely used affordances of Midjourney, one of the most popular image generators to date. More specifically, the focus is on how human prompters identify the system’s capacity to manipulate the style of AI images through Midjourney’s affordance actualization.

Midjourney uses: Parameters of style and their effects

Midjourney (MJ hereafter), created by the independent San Francisco–based research lab Midjourney Inc., differentiates itself by hosting its image generator on a Discord chat server (Salkowitz, 2022). As a communication platform designed for community building, Discord has been valuable in enabling and facilitating chat-based groups centered around particular topics (Oppenlaender, 2023), with features like servers (communities centered around particular topics), channels (discussion spaces within a particular server), and customizable roles (permissions and labels assigned to users that define what users can see and do on a server). While other prominent image generators such as Adobe Firefly and Leonardo AI also have dedicated Discord servers, real-time image generation is only enabled on MJ’s Discord. In this way, MJ provides a unique platform where users both generate images and learn from each other about how to design effective prompts, with the ambient co-presence of the founder David Holz (@DavidH) across its seventy-nine channels.

On MJ’s Discord, image generation unfolds across channels labeled #general and #newbies. In April 2024, twenty channels under #general (#general 1–20) and four newcomer rooms (labeled as #newbies-8, #newbies-38, #newbies-68, and #newbies-98) were available. Image generation is only possible on these twenty-four channels, with other channels serving other purposes, which vary in terms of participatory power distribution, from top-down (e.g., announcements, rules, support, community guidelines, prompt-faqs) to community discussions (e.g., #prompt-chat, #discussion). User guidelines (https://docs.midjourney.com) specify that the MJ bot generates images by breaking down “the words and phrases in a prompt into smaller pieces, called tokens, that are compared to its training data and then used to generate an image.” After users input verbal prompts starting with the /imagine command, they are presented with a set of options, as shown in Figure 1.

Figure 1.

Model-to-human feedback options on Midjourney. Prompt: A high-contrast surrealist collage --niji 6 (Author, 22 April 2024). Left: first-order options. Right: second-order options.

First-order options (Figure 1, left) include: (i) selecting any of the four images generated in response to the initial prompt with the U+image number command, (ii) generating different versions of any of the four images with the V+image number command, and (iii) repeating the generation process for the same prompt with the ♻ command. Unlike traditional visual production tools and photo editors that aim at a refined version of the original image, edits to AI-generated outputs result in entirely new images that meet prompt criteria with varying degrees of accuracy.

Second-order options (Figure 1, right) include upscalers, variations, custom zoom, and panning (i.e., expanding) any of the four zones of visual composition (indicated by arrows). Among these, subtle and creative upscales enhance the quality of the original image and double the size of the image to 2048 x 2048 pixels; variations (subtle, strong, and region) are guided by the content in the original image and the area selected for variation; zoom-out options change the positioning of the image subject in relation to the user. On the whole, possibilities for human-centered editing of the original designs of AI-generated images in MJ are limited to several options across two orders (Table 1).

Table 1.

A summary of the model-to-human feedback options on Midjourney.

First-order option	Action	Second-order option	Settings
U1–U4	select	Upscale	subtle / creative
V1–V4	vary	Vary	subtle / strong / region
♻	regenerate	Zoom out	2x / 1.5x / custom
		Pan	left arrow / right arrow / upward arrow / downward arrow

Second-order options resemble a simplified version of traditional Adobe Photoshop, the use of which requires some level of expertise in graphic design, with human input guiding every aspect of the editing process. In contrast to Photoshop, the MJ algorithm has disproportionate control over the final output, with a limited set of affordances for modifying generated images. In line with the overarching logic of human-computer interaction, feedback runs from the model to the user rather than the other way around, which often distorts human insight and intention (Kittler, 2006). Unlike traditional error messages in computational code, the feedback from image generators materializes as a visual output in the form of four images generated in response to a prompt, along with a limited set of options to edit the selected images.

Given the importance of prompting for the actualization of MJ’s affordances, the #prompt-chat channel is where discussions around prompt experimentation are most pronounced. To identify the most widely discussed prompt modifiers, I analyzed over 13,000 chat messages exchanged between users on the #prompt-chat channel over two weeks from April 1 to April 14, 2024. The period and volume were selected for two reasons.

Firstly, I aimed to access the most recent discussions of MJ’s affordances at the time of writing. The then-default model of MJ, version 6 (v 6), released in December 2023, introduced a new feature called the style raw parameter (discussed in the next section), which is relevant to the study of prompting for style. Collecting the data about four months after the release of this new feature ensured that regular users had time to experiment with different style parameters and form opinions about their effects on the visual aesthetics of AI images. Secondly, to collect the data, I used the Google Chrome extension Discordmate, a Discord chat exporter that provides access to the chat history, allowing the capture of data organized in HTML and CSV files. Because Discordmate supports exporting only 1,000 chat logs between certain dates, only 13,695 chat messages could be captured out of 15,730 messages shared in two weeks (see Appendix I). During this period, an average of roughly 1,120 messages were exchanged daily (15,730 messages over fourteen days), with the number ranging from 874 to 1,484 messages per day.
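The sampling figures reported above can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the totals stated in the text; the variable names are illustrative and are not part of the original analysis.

```python
# Totals reported in the text for 1-14 April 2024 (names are illustrative).
DAYS = 14
shared = 15_730    # messages posted on #prompt-chat during the period
captured = 13_695  # messages exported with the chat exporter

# Share of the channel traffic that made it into the corpus.
capture_rate = captured / shared

# Mean daily volume, for posted and for captured messages.
mean_shared_per_day = shared / DAYS
mean_captured_per_day = captured / DAYS

print(f"capture rate: {capture_rate:.1%}")
print(f"mean posted per day: {mean_shared_per_day:.0f}")
print(f"mean captured per day: {mean_captured_per_day:.0f}")
```

A daily mean computed this way must fall inside the reported daily range of 874 to 1,484 messages, which makes it a quick sanity check on the descriptive statistics.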

The analysis was conducted with Voyant, an open-source text analysis environment maintained by Sinclair and Rockwell (2016). Identifying handles were removed during the data cleaning process to preserve user anonymity. Six users were found to be the most active on the #prompt-chat channel, accounting for 40.4% of mentions (5,545 total; min = 400 mentions per user). From their Discord profiles, these users appeared to be human prompt engineering enthusiasts, with five of them holding the roles of a pro-member and a guide. Presumably, these roles were assigned by the server moderators for active participation and meaningful contribution, although this kind of information is not available to laypersons. The six most active commenters on the channel were found to regularly participate in promptcraft volunteer rosters and engage in playtesting new channel activities, with no indication that they were MJ’s employees or bots like Charon the FAQ Bot.

Out of 400,941 total words and 14,919 unique word forms, the top five most frequent words in the corpus were prompt (2,095), image (1,666), like (1,454), style (1,251), and get (1,067). Figure 2 shows a word cloud of the fifty most frequent words in the corpus (see also Appendix II). The terms that appeared most frequently are positioned centrally and are sized the largest.

Figure 2.

Fifty most frequently used words on the #prompt-chat channel 1–14 April 2024. Produced with Voyant (http://www.voyant-tools.org/?view=Cirrus), 20 April 2024.

To unpack user discussions around affordances of prompting for style, I examined the top five collocates co-occurring with the word “style” over a hundred times. In Table 2, “count” shows the raw frequency with which each word appears within five words of “style” (“~5”), and “relative” gives that frequency per million words in the corpus.

Table 2.

Top five collocates with the term style on the #prompt-chat channel of Midjourney’s Discord server between 1–14 April 2024.

#	Term	Count	Relative (per million)
1	“style raw”~5	488	1,217
2	“style s”~5	245	611
3	“style ar”~5	225	561
4	“style sref”~5	122	304
5	“style v”~5	107	267
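To make the two columns of Table 2 concrete, the sketch below counts words that fall within five tokens of a keyword and converts raw counts into frequencies per million words. It is an illustration under stated assumptions, not Voyant's implementation: the tokenizer, the function names, and the symmetric window are all hypothetical choices, while the counts and the corpus size (400,941 words) come from the text.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def collocates(tokens, keyword, window=5):
    """Count tokens that occur within `window` positions of each `keyword` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def per_million(count, total_words):
    """Relative frequency of a raw count, expressed per million words."""
    return round(count / total_words * 1_000_000)


# Raw counts and corpus size as reported in the text and in Table 2.
TOTAL_WORDS = 400_941
table2_counts = {"raw": 488, "s": 245, "ar": 225, "sref": 122, "v": 107}
relative = {term: per_million(n, TOTAL_WORDS) for term, n in table2_counts.items()}

for term, value in relative.items():
    print(f"style {term}: {value} per million")
```

Applied to a toy message such as `tokenize("photo of a cat --style raw --ar 3:2")`, the `collocates` function would register both "raw" and "ar" as neighbors of "style", which is the kind of co-occurrence the “~5” notation describes.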

The frequencies of the five collocates with the term style identified in Table 2 across fourteen segments of equal size are shown in Figure 3. Each series in the graph is colored according to the collocate it represents, and a legend at the top of the graph displays which words are associated with which colors. The X-axis divides the fourteen days of discussion into equal segments. The Y-axis shows relative frequencies (per million) of the two co-occurring terms (e.g., “style” and “v”). The trends observed in the two-week discussion on the #prompt-chat channel point to the highest frequencies across all collocates in segments 1, 5, 8, and 11, which suggests that the discussions are most active on Mondays, Thursdays, and Fridays, subject to some variation depending on the user time zone. The graph shows that the following style parameters are featured in user discussion most frequently: --style raw (note double hyphen), --s <value>, --v <value>, --ar <value>:<value>, and --sref <URL>. These parameters are explored in detail next through reported user observations, MJ’s documentation, and self-experimentation.
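The mapping from peak segments to weekdays can be reproduced in a few lines. This sketch assumes that each segment corresponds to one calendar day and that segment 1 is 1 April 2024; both are assumptions made for illustration, since equal-sized text segments only approximate calendar days.

```python
from datetime import date, timedelta

# Assumption: segment 1 is the first day of the collection window.
START = date(2024, 4, 1)
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def segment_weekday(segment):
    """Return the weekday name for a 1-based segment number."""
    day = START + timedelta(days=segment - 1)
    return WEEKDAYS[day.weekday()]


# Segments with the highest collocate frequencies, as reported in the text.
peak_segments = [1, 5, 8, 11]
peak_days = {s: segment_weekday(s) for s in peak_segments}
print(peak_days)
```

Under these assumptions the four peaks fall on two Mondays, one Friday, and one Thursday, which is consistent with the weekdays named in the text.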

On MJ, style refers to a specific parameter, or an option that can be added to the prompt to impact the visual output. By default, when the style is not specified, MJ’s “house style” is applied. In MJ model version 6, the new --style raw parameter configures the model to produce images closer to their raw output state, which is photographic, to achieve more literal results (https://docs.midjourney.com/docs/model-versions). Because this parameter has been introduced only in version 6, it features most prominently in the user discussion of style (see Figure 3). Without --style raw, the model applies more unconventional aesthetics, and when --style raw is added, the volume on such aesthetic treatments is turned down (Figure 4). As one of the users clarifies, since --style raw does less work than the house style, it may be less coherent, but it may also “hear” the verbal prompt better. Discussions on how to apply this parameter feature prominently on the channel, which suggests that users attempt to master different image modalities, from abstract to naturalistic.

Figure 3.

Five most frequently discussed style-related parameters on the #prompt-chat channel. Produced with Voyant (http://www.voyant-tools.org/?view=Trends).

Figure 4.

Effects of the --style raw parameter on the visual aesthetics of AI images generated with Midjourney. Prompt: The flower bomb, graffiti-style painting, in the style of Tom Swain / Dript (Author, 3 May 2024). Artist’s work: https://driptart.wixsite.com/dript-art.

The --s (stylize) parameter determines how strongly MJ’s custom style is applied. Lower stylization values (--s 0–50) are used to match user prompts more closely, without applying the model’s artistic aesthetic. Higher stylization values result in higher artistic values, but, similarly to --style raw, users reported that image content is often less connected to the prompt. The default stylize parameter is set to 100 (medium), and users can further select between high (--s 250–500) and very high (--s 750) values. Since the model has been trained to “produce images that favor artistic color, composition, and forms” (https://docs.midjourney.com/docs/stylize-1), lower stylize values are used when the content of the image needs to match the prompt more closely. The effects of this parameter are less pronounced when the prompt is open-ended, that is, when it does not specify the subject matter: the only demonstrable difference is the color differentiation and the level of detail (Figure 5). When applying --s, users must make choices between the accuracy of visual outputs and artistic values produced by the system.

Figure 5.

Effects of the stylize (--s) parameter on the visual aesthetics of AI images generated with Midjourney. Prompt: A finite slice of infinite space, in the style of Bram Braam, --s <value> (Author, 3 May 2024). Original artwork: https://www.brambraam.com/artwork/a-finite-slice-of-infinite-space.

Another technical parameter, --ar, refers to the aspect ratio or the canvas size. It usually appears at the end of the prompt as --ar <value>:<value> or, less frequently, --aspect <value>:<value>. Common choices across model-supported ratios include: --ar 1:1 (default square aspect ratio; e.g., images in Figure 5), --ar 5:4 (common frame and print ratio), --ar 3:2 (common in print photography), and --ar 7:4 (close to HDTV screens and smartphone screens). In Figure 6, the traditional principle of the grid is applied to demonstrate the differences between these choices. The --ar parameter was often discussed and used in conjunction with --style raw.
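The ratios listed above are easier to compare as width-to-height values. The sketch below parses the `<value>:<value>` notation and measures how close 7:4 is to the 16:9 ratio of HDTV screens; the 16:9 reference and the function name are assumptions introduced here for illustration and do not come from the user discussions.

```python
from fractions import Fraction


def parse_ratio(value):
    """Turn a 'width:height' string into an exact fraction."""
    width, height = value.split(":")
    return Fraction(int(width), int(height))


# Ratios named in the text.
common_ratios = ["1:1", "5:4", "3:2", "7:4"]
for ratio in common_ratios:
    print(f"{ratio} -> {float(parse_ratio(ratio)):.3f}")

# Assumption: HDTV screens use the standard 16:9 ratio.
hdtv = Fraction(16, 9)
gap = abs(parse_ratio("7:4") - hdtv)
print(f"7:4 differs from 16:9 by {float(gap):.3f}")
```

Exact fractions are used instead of floats so that ratios can be compared without rounding error.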

Figure 6.

Aspect ratio chart of the most typical values in Midjourney. Prompt: Girl with the balloon, Banksy --ar <value>:<value> (Author, 3 May 2024).

Discussions of style often revolved around troubleshooting, which gives an additional insight into the inconsistent workings of MJ despite user specification of prompt modifiers. For example, in a thread below, @user1 (the most engaged discussant of the #prompt-chat channel, with 3,302 mentions across two weeks) shares advanced solutions around style manipulation:

@user1: @user2 Put her [image subject] in a narrower taller frame, maybe --ar 3:4 and then pan to the side to add the couch. You should be using “photograph of” at the beginning and --style raw as a parameter after the prompt. Otherwise, it might stray into other stylistic territory (like illustration styles). Also, if you increase --stylize it will improve coherence (once you get a prompt that is working). I often do a test pic with lower stylize, then if it’s a robust prompt (3 or 4 frames the way I like them), I’ll roll another one with higher stylize (for the quality). At low levels it listens to the prompt. At higher levels it puts more effort into the stylization (sometimes details, sometimes aesthetics).

This discussion indicates that the model does not always actualize specified style affordances, and users devise their own techniques to produce desired outputs while also developing literacies around the quality of prompts and their robustness.

The --sref parameter enables using external images as style references to influence the style of AI-generated images. To add a style reference to a prompt, the --sref parameter must be used with the web address where the reference image is stored online. This functionality allows users to replicate the style of any artist, even if their work was not included in the model’s training data. There are some limitations of this parameter:

@user3: MJ won’t copy an image exactly, but it can use them for inspiration . . . . “industrial truck with a large panda head on the front driving through an Asian city --sref <URL> --cref <URL> --s 50 --v 6.0 --style raw”
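The structure of such a prompt, a verbal description followed by a series of parameters, can be sketched as a small parser. This is a hypothetical illustration, not MJ's own parsing logic: it assumes that parameters are written with a double hyphen, that they follow the descriptive text, and that each parameter takes every token up to the next double-hyphen flag as its value. The function name `split_prompt` is invented for this example.

```python
def split_prompt(prompt):
    """Split a prompt into its descriptive text and a dict of parameters.

    Assumption: a token starting with '--' opens a parameter, and all
    tokens up to the next '--' token form that parameter's value.
    """
    text_parts = []
    params = {}
    current = None
    for token in prompt.split():
        if token.startswith("--") and len(token) > 2:
            current = token[2:]
            params[current] = []
        elif current is None:
            text_parts.append(token)
        else:
            params[current].append(token)
    joined = {name: " ".join(values) for name, values in params.items()}
    return " ".join(text_parts), joined


# The prompt shared by @user3, with URLs left as placeholders.
example = ("industrial truck with a large panda head on the front driving "
           "through an Asian city --sref <URL> --cref <URL> --s 50 --v 6.0 "
           "--style raw")
text, params = split_prompt(example)
print(text)
print(params)
```

Separating the description from the parameters in this way mirrors how users in the channel discuss prompts: the wording and the style modifiers are adjusted independently of each other.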

To test the impact of the --sref parameter on the aesthetics of the visual output, I used Martine Mooijenkind’s collage as a style reference. The resulting images closely resembled the original artwork’s style, content, and composition (Figure 7).

Figure 7.

The style reference parameter --sref <URL> in Midjourney. Prompt: A surrealist collage --sref <URL> (Author, 23 April 2024). Original artwork She was the World by Martine Mooijenkind: https://www.saatchiart.com/art/Collage-She-was-the-world/1033395/8344210/view.

As Figure 7 shows, the --sref parameter mimics original artworks without proper attribution, which raises questions about intellectual property and artist rights. The character reference parameter --cref, which can be used with MJ and Niji version 6⁴, also features prominently in the #prompt-chat discussion (569 mentions), which suggests that this parameter is widely applied to copy image subjects (i.e., characters).

The last of the most frequently discussed parameters, and perhaps the least problematic, is --v. In April 2024, this parameter accepted values 1, 2, 3, 4, 5, 5.0, 5.1, 5.2, and 6, reflective of nine models of MJ since its initial release on 12 July 2022 (open beta). Version 6 has “enhanced prompt accuracy for longer inputs, improved coherence and knowledge, and advanced image prompting and remixing capabilities” (https://docs.midjourney.com/docs/models). Although the most distinctive feature of v 6 is the --style raw parameter, discussed earlier, this model also appears to synchronize better with the semantics of the prompts generated with large language models (LLMs), as highlighted by @user4 on 22 December 2023:

@user4: You can talk to Midjourney a little bit like it’s ChatGPT. You can’t tell Midjourney “<URL> but make it an oil painting” or use remix [clarify] and tell Midjourney “but put more flowers in her hair.” But a single direct /imagine can be written like an instruction and it will probably play just fine.

From earlier observations of MJ’s prompting style, the ChatGPT style was often used by newcomers who appeared to have little experience with prompting. The upgraded instruction-like prompt style in –v 6 contrasts with the prompt tips for earlier versions, which preferred clipped prompt semantics (see, e.g., Parsons, 2022). As generative models evolve, users adapt their prompting techniques to mitigate model unpredictability and maximize its efficiency.

Discussion: Directing the visual aesthetic and composition of AI images

The analysis reveals that users of visual generative media have two avenues for directing the visual aesthetic and composition of AI-generated images—(i) integrating several technical parameters (i.e., style modifiers) into prompts and (ii) selecting from predetermined options outlined in Table 1. Human control over the output is proportional to the level of abstraction—while straightforward options to select, regenerate, zoom out, pan, and specify the aspect ratio are generally executed with a high degree of precision, the options to vary, upscale, and stylize are far more unpredictable. Because the system is set to optimize for opaque associations between its parameters produced during deep learning (Jacobsen, 2023), machine visions and visions of the human prompters often diverge.

Due to nuances of deep learning and the increasing complexity of proprietary models, each model behaves differently after it learns new patterns from new data. Prompt actualization therefore involves considerable experimentation, “both to discover and optimize capabilities and to understand what renders a particular outcome satisfying or ‘authentic’” (Burkhardt and Rieder, 2024: 8). The value of community learning is arguably heightened precisely because the unpredictable behavior of generative models poses practical challenges in aligning human vision with built-in functionality. As such, MJ’s Discord server points to a dynamic community that emphasizes peer learning, with some users taking on mentor roles, fostering discussions around the aesthetic outcomes and styles of AI images.

On the #prompt-chat channel, style is conceived of as a problem that can be solved by manipulating image composition and visual aesthetic through a set of perceived affordances. These affordances are actualized through technical parameters such as –style raw, –s, –v, –ar, and –sref, which vary in their purposes and outcomes. The most popular affordances range from image modality (–style raw and –s) to size of the frame (–ar) to “calling” a model version (–v) known for its aesthetic particularities and recontextualizing styles without proper attribution (–sref). The parameters –ar and –v can be seen as purely technical: they evoke system functionality and are subject to the most human control, while the –style raw and –s parameters are far more ambiguous. Alongside the opportunity for a more realistic or less detailed aesthetic, –s influences the extent to which artistic composition, forms, and color are applied, while –style raw has the potential to contribute to visual deception, allowing users to generate less polished images that approximate photos taken by everyday users.⁵
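To make the structure of such prompts concrete, the following minimal sketch separates the verbal description from the technical parameters appended to it. It is an illustration only and not part of Midjourney or of the method used in this study: the function name parse_prompt is hypothetical, and the sketch assumes that each parameter is introduced by whitespace followed by either a double hyphen or the en dash used in this article’s typesetting.

```python
import re

# Hypothetical helper, not a Midjourney API. Assumption: a parameter starts
# with whitespace, then "--" or an en dash, then a lowercase name.
PARAM_BOUNDARY = re.compile(r"\s+(?:--|–)(?=[a-z])")


def parse_prompt(prompt: str) -> tuple[str, dict[str, str]]:
    """Split a prompt into its verbal part and a dict of parameters."""
    parts = PARAM_BOUNDARY.split(prompt.strip())
    text, params = parts[0], {}
    for chunk in parts[1:]:
        # The first token is the parameter name; the remainder is its value.
        name, _, value = chunk.partition(" ")
        params[name] = value.strip()
    return text, params


# Example modeled on the Figure 7 prompt; <URL> stays a placeholder.
example = "A surrealist collage --sref <URL> --s 50 --v 6.0 --style raw"
text, params = parse_prompt(example)
print(text)
print(params)
```

Seen this way, the verbal description and the style modifiers are two distinct inputs: the first is open-ended language, while the second is a small set of named values that the platform defines.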

The –sref parameter raises the most questions, particularly around intellectual property and copyright, as it is a direct way to mimic any image available on the web, art or otherwise. The prompt-faqs community forum states that style references only “roughly transfer aesthetic style to your canvas,” but in reality, –sref also transfers composition, subject, and style, bypassing the limitations of its training data (see Figure 7). This means that even if the model has not been trained on a particular artist’s piece, MJ provides its users with a quick way to access, scan, and copy the style of any artist who publicly shares their work online, as long as it is available in a .jpg format. Equally alarming are additional options to bypass even this limitation, shared on the same channel. Users are provided with an option to upload any image and send it to the MJ bot in a private message, akin to a private interaction with other Discord users. They can then generate a link to the uploaded image in one click and use the –sref parameter to copy the artist’s style. In sum, MJ has the computational power to both generate images and analyze external data, copying not only the salient features of referenced images but also their characters (i.e., subjects), backgrounds, visual compositions, and styles.

Overall, discussion on the #prompt-chat channel revolves around maximizing the system’s capacity to render visual outputs more closely aligned with the vision of human prompters. In this sense, MJ serves as a medium that precludes the human vision or, as Paul Klee might say, makes the invisible visible, opening onto a future that should not be constrained to a solitary, perfected endpoint (Amoore, 2019). As a dynamic but ambiguous entity responsive to changes in context (Kim et al., 2023), generative models also enact time in that they produce visual outputs in a matter of seconds, challenging temporalities of traditional visual production, and in doing so, provide several shortcuts to producing a desired visual aesthetic. This is particularly evident in user discussions around style, featuring prominently on the #prompt-chat channel. For MJ users, style is thought of not as an artistic skill driven by “the motivation of a human artist to make human connections” (Volz, 2018) but as a series of predictive values associated with verbal descriptors to generate desirable outputs.

Affordance actualization theory, which emphasizes the interplay between the social and material aspects of technology use, offers a valuable lens for understanding the dynamics of human-model interaction, contributing to a more nuanced picture of how generative models afford new possibilities for action. As the first (to the best of my knowledge) study to theorize prompting for style as affordance actualization, this work establishes a foundation for future scholarly inquiries and encourages further research on how evolving technical features of generative media impact the dynamics of agency in the context of human-model interaction.

Conclusions, limitations, and future research

Taken on the whole, the results of this study suggest that visual generative media hold potential for expanding the boundaries of creative expression through both design features of the system and prompt parameters. However, what should not go unnoticed are the hidden processes of technical layering involved in the generation and questionable practices of style mimicking through affordances framed as innovations that benefit the users. It is imperative to address the questions of originality and creativity in the context of visual generative media together with how computation takes place (Kittler, 2006), beyond a glossy veneer that presents deep learning techniques as magical, seemingly beyond the scope of present scientific knowledge (Campolo and Crawford, 2020: 3). In an environment where data takes precedence and human artistic skill is relegated to the background, creativity should not be conflated with the mastery of technical parameters—some of which encourage questionable visual production practices.

Shifting the focus from the ontological creativity of generative models to human-model interaction allows us to address the originality of AI images as a multi-layered construct resulting from affordance actualization. On a deeper level, the originality of AI images can be understood as a cline that instantiates prompt modifiers with varying degrees of human control over visual output. Among the five main technical parameters, –ar executes human vision with a high degree of precision; –s, –v, and –style raw guide the model toward certain values with a lesser degree of model certainty; and –sref mimics existing cultural phenomena without proper attribution. Ultimately, the originality of visual generative media lies in its ability to detect statistical patterns in the extensive datasets on which it has been trained. The disproportional power rests with the platform and its creators, who determine which artistic works can be mimicked and which prompts are deemed appropriate (see note 5).

The current study engaged in a critical examination of the capabilities of Midjourney, addressing how its users leverage platform affordances to enact a visual aesthetic through sociotechnical assemblages comprising different components, including data, algorithms, fellow users, platform creators, and cultural phenomena entangled with data objects. Focusing on people’s practices around prompting for style, I demonstrated how style is conceptualized as an end goal that can be reached through prompt parameters, some of which (–style raw, –s, –ar, and –v) orient toward maximization of the system’s capacity to produce desired outputs while others (–sref) directly abstract people away. Future research could further examine affordance actualization practices among a more diverse range of lay users, as this study found that a select group of prompt experts seemed to dominate the conversation. It is possible that the most active and experienced users prefer certain solutions. An in-depth qualitative study and thematic analysis of user discussions about platform affordances would complement the current findings.

While this study investigated affordance actualization practices of MJ’s users, it is essential to acknowledge the limitations in generalizing findings to other visual generative media. Due to nuances of deep learning, each model behaves differently after it learns new patterns from the new data. Consequently, it is important to recognize and address these dynamics as well as new features that will become available with the release of new models.

Future research could expand the analysis to other prominent image generators such as DALL·E 3, Adobe Firefly, or DreamStudio by Stability AI, as well as external prompt discussion channels such as Reddit or X, to investigate whether these platforms provide options for style recontextualization without proper attribution equivalent to MJ’s –sref. Such insights could inform regulatory acts and policies aimed at “provably beneficial AI” (Russell, 2022) to ensure that its “habitual use by individuals and society does not warp or eclipse our values and our goals and does not distort or obscure our view of the world” (Boddington, 2021: 109). As AI capabilities become increasingly integrated into existing products like Adobe Photoshop and Pixlr, how users respond to these innovations and the implications that follow also remain open for critical investigation.

Appendix

Appendix II.

Top fifty most frequent words on the #prompt-chat channel, Midjourney’s Discord server, between 1–14 April 2024 (raw count).

#	Term	Frequency	#	Term	Frequency
1	prompt	2095	26	sref	442
2	image	1666	27	mj	429
3	like	1454	28	reference	422
4	style	1251	29	vary	420
5	get	1067	30	trying	418
6	use	1061	31	thanks	416
7	how	956	32	work	415
8	want	888	33	background	415
9	more	882	34	different	414
10	midjourney	847	35	white	402
11	images	817	36	add	391
12	make	683	37	create	387
13	try	615	38	see	371
14	ar	591	39	way	367
15	help	583	40	channels	367
16	cref	569	41	good	362
17	using	558	42	region	354
18	bot	541	43	look	351
19	right	529	44	text	350
20	know	518	45	photo	350
21	same	511	46	tried	343
22	raw	495	47	black	338
23	s	493	48	prompts	329
24	character	483	49	v	323
25	need	457	50	thank	315
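For readers who wish to produce a comparable raw-count table from their own chat export, the sketch below shows one simple way to do so. It is not the procedure used for this appendix: the function name top_terms, the tokenization rule, and the stopword handling are assumptions, and the two sample messages are invented for illustration.

```python
import re
from collections import Counter


def top_terms(messages, n=50, stopwords=frozenset()):
    """Return the n most frequent lowercase terms with their raw counts."""
    counts = Counter()
    for message in messages:
        # Assumed tokenization: runs of letters and digits, case-folded.
        tokens = re.findall(r"[a-z0-9]+", message.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)


# Invented sample messages; the study's corpus is available from the author.
sample = [
    "Try --ar 16:9 with this prompt",
    "my prompt uses --sref and --ar",
]
for term, frequency in top_terms(sample, n=5, stopwords={"with", "this", "and"}):
    print(term, frequency)
```

The choice of stopword list strongly shapes such a ranking, so any counts obtained this way are only comparable to the table above if the same exclusions are applied.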

Acknowledgements

I sincerely thank two anonymous reviewers for their constructive and thoughtful feedback.

Funding

The author received no financial support for the research, authorship, or publication of this article.

Data availability statement:

Please contact the author if you wish to access the corpus used for this study.

ORCID iD

Nataliia Laba

Notes

Author biography

Nataliia Laba, PhD, is an Assistant Professor in Digital and Multimodal Communication / Humane AI at the Department of Communication and Information Studies, University of Groningen. She is interested in the instrumental and political powers of digital technologies, focusing on the relationship between the production and circulation of visual images and multimodal texts across novel sociotechnical systems such as generative AI. Her recent publications include an original research article addressing the issue of representation in visual generative media (Media, Culture & Society, 2024).

References

Amoore (2019) Doubt and the algorithm: On the partial accounts of machine learning. Theory, Culture & Society 36(6): 147–169.

Ananny and Crawford (2018) Seeing without knowing: Limitations of the transparency ideal and its application to algorithmic accountability. New Media & Society 20(3): 973–989.

Atkinson and Barker (2023) AI and the social construction of creativity. Convergence 29(4): 1054–1069.

Bao, Gong and Yang (2023) A literature review of human–AI synergy in decision making: From the perspective of affordance actualization theory. Systems 11(9): 442. Available at: https://www.mdpi.com/2079-8954/11/9/442 (accessed 18 July 2024).

Bernhard, Recker and Burton-Jones (2013) Understanding the actualization of affordances: A study in the process modeling context. In: Chau and Baskerville (eds.) Proceedings of the 34th international conference on information systems (ICIS 2013), Milan, Italy, 15–18 December 2013, pp. 1–11. Atlanta, Georgia: Association for Information Systems.

Boddington (2021) AI and moral thinking: How can we live well with machines to enhance our moral agency? AI and Ethics 1(2): 109–111.

Bryan-Kinns, Ford, Chamberlain, et al. (2023) Explainable AI for the arts: XAIxArts. In: Proceedings of the 15th conference on creativity and cognition, virtual event, USA, 19–21 June 2023, pp. 1–17. New York: Association for Computing Machinery.

Burkhardt and Rieder (2024) Foundation models are platform models: Prompting and the political economy of AI. Big Data & Society 11(2). Epub ahead of print 22 April 2024. DOI: 10.1177/20539517241247839.

Campolo and Crawford (2020) Enchanted determinism: Power without responsibility in artificial intelligence. Engaging Science, Technology, and Society 6. Available at: https://knowledge.uchicago.edu/record/6022 (accessed 22 April 2024).

Cetinic and She (2022) Understanding and creating art with AI: Review and outlook. ACM Transactions on Multimedia Computing, Communications, and Applications 18(2): 1–22.

Chesher and Albarrán-Torres (2023) The emergence of autolography: The “magical” invocation of images from text through AI. Media International Australia 189(1): 57–73. DOI: 10.1177/1329878X231193252.

Combs, Moyer and Bihl (2024) Uncertainty in visual generative AI. Algorithms 17(4): 136. Available at: https://www.mdpi.com/1999-4893/17/4/136 (accessed 20 April 2024).

Denson (2023) From sublime awe to abject cringe: On the embodied processing of AI art. Journal of Visual Culture 22(2): 146–175.

Elgammal, Liu, Elhoseiny and Mazzone (2017) CAN: Creative adversarial networks, generating “art” by learning about styles and deviating from style norms. arXiv preprint, arXiv:1706.07068.

Feng, Wang, Wong, et al. (2023) PromptMagician: Interactive prompt engineering for text-to-image creation. IEEE Transactions on Visualization and Computer Graphics 30(1): 295–305.

Gillespie (2014) The relevance of algorithms. In: Gillespie, Foot and Boczkowski (eds.) Media Technologies: Essays on Communication, Materiality, and Society. Cambridge: MIT Press, pp. 167–193.

Goriunova (2012) Art Platforms and Cultural Production on the Internet. New York: Routledge.

Heidegger (2002) On Time and Being. Chicago, Illinois: University of Chicago Press.

Jacobsen (2023) Machine learning and the politics of synthetic data. Big Data & Society 10(1). Epub ahead of print 17 January 2023. DOI: 10.1177/20539517221145372.

Kim and Yun (2023) Sociotechnical challenges to the technological accuracy of computer vision: The new materialism perspective. Technology in Society 75. Epub ahead of print November 2023. DOI: 10.1016/j.techsoc.2023.102388.

Kittler (2006) Thinking colours and/or machines. Theory, Culture & Society 23(7–8): 39–50.

Laba (2024) Engine for the imagination? Visual generative media and the issue of representation. Media, Culture & Society. Epub ahead of print 12 June 2024. DOI: 10.1177/01634437241259950.

Liu and Chilton (2022) Design guidelines for prompt engineering text-to-image generative models. In: Proceedings of the 2022 CHI conference on human factors in computing systems, New Orleans, USA, 29 April–5 May 2022, pp. 1–23. New York: The Association for Computing Machinery.

Llano, D’Inverno, Yee-King, et al. (2022) Explainable computational creativity. arXiv preprint, arXiv:2205.05682.

Mellamphy (2021) Humans “in the loop”? Human-centrism, posthumanism, and AI. Nature and Culture 16(1): 11–27.

Messingschlager and Appel (2023) Mind ascribed to AI and the appreciation of AI-generated art. New Media & Society. Epub ahead of print 26 September 2023. DOI: 10.1177/14614448231200248.

Nagy and Neff (2015) Imagined affordance: Reconstructing a keyword for communication theory. Social Media + Society 1(2). Epub ahead of print 30 September 2015. DOI: 10.1177/2056305115603385.

Natale and Henrickson (2024) The Lovelace Effect: Perceptions of creativity in machines. New Media & Society 26(4): 1909–1926.

Midjourney (2024) V6 of Niji-Journey is now live! [X]. 30 January 2024. Available at: https://x.com/midjourney/status/1752115495065755798 (accessed 1 May 2024).

O’Meara and Murphy (2023) Aberrant AI creations: Co-creating surrealist body horror using the DALL-E Mini text-to-image generator. Convergence 29(4): 1070–1096.

Oppenlaender (2022) The creativity of text-to-image generation. In: Proceedings of the 25th international academic Mindtrek conference, Tampere, Finland, 16–18 November 2022, pp. 192–202. New York: Association for Computing Machinery.

Oppenlaender (2023) A taxonomy of prompt modifiers for text-to-image generation. Behaviour & Information Technology. Epub ahead of print 28 November 2023. DOI: 10.1080/0144929X.2023.2286532.

Parsons (2022) DALL·E 2 prompt book. dall·ery gall·ery, 13 July. Available at: https://dallery.gallery/the-dalle-2-prompt-book/ (accessed 24 April 2024).

Passoth J-H, Peuker and Schillmeier (2012) Introduction. In: Passoth J-H, Peuker and Schillmeier (eds.) Agency without Actors?: New Approaches to Collective Action. Abingdon, Oxon: Routledge, pp. 1–11.

Pasquinelli (2019) How a machine learns and fails – A grammar of error for artificial intelligence. Spheres: Journal for Digital Cultures 5: 1–17.

Pavlichenko and Ustalov (2023) Best prompts for text-to-image models and how to find them. In: Proceedings of the 46th international ACM SIGIR conference on research and development in information retrieval, Taipei, Taiwan, 23–27 July 2023, pp. 2067–2071. New York: The Association for Computing Machinery.

Russell (2022) Provably beneficial artificial intelligence. In: IUI ’22: Proceedings of the 27th international conference on intelligent user interfaces, Helsinki, Finland, 22–25 March 2022, p. 3. New York: The Association for Computing Machinery.

Salkowitz (2022) Midjourney founder David Holz on the impact of AI on art, imagination and the creative economy. Forbes, 16 September. Available at: https://shorturl.at/D8Pcp (accessed 19 July 2024).

Sinclair and Rockwell (2024) Voyant tools. Available at: https://voyant-tools.org/ (accessed 20 April 2024).

Strong, Volkoff, Johnson, et al. (2014) A theory of organization-EHR affordance actualization. Journal of the Association for Information Systems 15(2): 2. Available at: https://aisel.aisnet.org/jais/vol15/iss2/2 (accessed 17 July 2024).

Tilford (2024) Performing creativity: Text-to-image synthesis and the mimicry of artistic subjectivity. In: Trillo and Poliks (eds.) Choreomata: Performance and Performativity after AI. Abingdon, Oxon: CRC Press, pp. 125–187.

Volkoff and Strong (2017) Affordance theory and how to use it in IS research. In: Galliers and Stein M-K (eds.) The Routledge Companion to Management Information Systems. London: Routledge, pp. 232–245.

Volz (2018) When computers make art. Pursuit, 22 February. Available at: https://pursuit.unimelb.edu.au/articles/when-computers-make-art (accessed 24 April 2024).

Zeilinger (2021) Generative adversarial copy machines. Culture Machine 20. Available at: https://culturemachine.net/vol-20-machine-intelligences (accessed 20 April 2024).

Beyond magic: Prompting for style as affordance actualization in visual generative media