Abstract
Over recent years, the stakes and complexity of online content moderation have been steadily raised, swelling from concerns about personal conflict in smaller communities to worries about effects on public life and democracy. Because of the massive growth in online expressions, automated tools based on machine learning are increasingly used to moderate speech. While ‘design-based governance’ through complex algorithmic techniques has come under intense scrutiny, critical research covering algorithmic content moderation is still rare. To add to our understanding of concrete instances of machine moderation, this article examines Perspective API, a system for the automated detection of ‘toxicity’ developed and run by the Google unit Jigsaw that can be used by websites to help moderate their forums and comment sections. The article proceeds in four steps. First, we present our methodological strategy and the empirical materials we were able to draw on, including interviews, documentation, and GitHub repositories. We then summarize our findings along five axes to identify the various threads Perspective API brings together to deliver a working product. The third section discusses two conflicting organizational logics within the project, paying attention to both critique and what can be learned from the specific case at hand. We conclude by arguing that the opposition between ‘human’ and ‘machine’ in speech moderation obscures the many ways these two come together in concrete systems, and suggest that the way forward requires proactive engagement with the design of technologies as well as the institutions they are embedded in.
Introduction
Whether enacted by dedicated staff, distributed over a site's user base, or automatically performed by algorithmic systems, content moderation takes part in shaping increasingly pivotal social infrastructures and circumscribes the communicative situations that make up everyday life and democratic debate. But content moderation has long been a complex and contested issue. While aggressive measures against blatantly illegal material such as child pornography or terrorist propaganda have had broad support, there is much disagreement over the boundaries of “acceptable” discourse and, consequently, different online venues operate according to different standards when it comes to filtering user contributions. The current situation was in large part made possible by legal landmarks such as section 230 of the 1996 US Communications Decency Act and similar legislation in other countries, for example, the European Union's e-Commerce Directive. These “safe harbor provisions” state that “interactive computer services” cannot be held liable for the speech of their users. At the same time, section 230, in particular, determines that if an intermediary does decide to police its users, it will not lose these protections, leading Grimmelmann to argue that the “underlying policy is to encourage moderation by taking away the threat of liability for mismoderation” (2017: 103). Most websites that allow users to contribute, and especially social media platforms, have thus invested in some form of content moderation. Research on how this plays out in practice has often focused on specific sites such as Reddit (Massanari, 2017) and Facebook (Myers West, 2018), or on specific mechanisms like flagging (Crawford and Gillespie, 2016). These efforts point to the different instruments platforms use to manage their user base, and “moderation” can indeed be understood as “governance mechanisms that structure participation in a community to facilitate cooperation and prevent abuse” (Grimmelmann, 2017: 47), which implies a procedural interpretation of appropriate behavior and discourse.
Actual moderation practices and the debates surrounding them have been in flux. Events such as the Christchurch shootings and the Rohingya genocide in Myanmar, which implicated Facebook in particular, the increasing virulence of online harassment that became visible around controversies such as Gamergate, the rise of “alt-right” extremism, the spread of misinformation, and even Donald Trump's manic use of social media have led to demands for more stringent moderation. The scale and character of these controversies show how the stakes of online discourse—and therefore of content moderation—have been steadily raised, swelling from concerns about personal conflict in smaller communities to worries about effects on public life and the future of democracy. A shift can also be traced in public opinion, particularly in the US, where a recent survey indicated that a large majority of citizens wanted the government to strengthen laws against online hate and platform companies to provide more options to filter hateful or harassing content (Anti-Defamation League, 2019).
Enforcing stricter guidelines on expression means that content moderation increasingly concerns the boundaries and governance of speech in a more broadly moral sense, taking into account questions of participation and “quality” of exchange. The forms of speech coming into focus here are often those that have a “harassing” tendency that is meant to distress, insult, or annoy participants in a discussion, often but not necessarily directed at minority groups. Leiter defines these expressions as “tortious harms […] such as defamation and infliction of emotional distress and dignitary harms […] that are real enough to those affected and recognized by ordinary standards of decency” (2010: 155). While moderation can be seen as exclusionary if judged as censorship of free speech, it can also be understood as inclusionary when considering the various ways voices are silenced through tortious and dignitary harms in the absence of intervention. Every moderation system thus implies a particular operational conception of how to balance competing values, affecting how “political participation is enacted through the medium of talk” (Fraser, 1992).
The growing demands for more and subtler forms of moderation, combined with the sheer speed and volume of online expression, have rendered the task more difficult in logistical terms as well. Websites have long relied on hired moderators and user reporting to flag and filter posts and comments. But the mass of online expressions can overwhelm manual review and human moderators often risk serious psychological harms (Roberts, 2019). Machine learning (ML) techniques are therefore increasingly deployed as supposedly cheap and effective solutions. While other forms of “design-based governance” (Gritsenko and Wood, 2020) face heavy scrutiny, critical research on algorithmic content moderation is still rare, in particular when compared to the quickly expanding literature on practical approaches in computer science and adjacent disciplines (Schmidt and Wiegand, 2017; Waseem et al., 2017). Gorwa et al. (2020) provide an excellent overview of the technologies involved and the political and ethical problems that ensue, but if we take “moderation as an expansive socio-technical phenomenon, one that functions in many contexts and takes many forms” (Gillespie et al., 2020: 3), there is a need to investigate concrete instances of machine moderation in greater detail. This is precisely what this paper is setting out to do.
Like much of what large internet companies build, moderation tools are rarely open to outside scrutiny. Perspective API, a text classification system developed by Google unit Jigsaw, is in many ways an exception. Developed around open source practices and in collaboration with large knowledge organizations such as Wikipedia and The New York Times (NYT), Perspective is an attempt to package the identification of “toxic” speech into a service that can be used by websites to help moderate their forums and comment sections. In this paper, we analyze the project with regard to two main aspects or sets of questions. First, we are interested in what we call the fabrics of machine moderation, that is, in the specific ways Perspective weaves different technologies, data sources, forms of labor, and normative commitments into a working system that makes decisions with real-world consequences. Following Ananny and Crawford's argument that holding “an assemblage accountable requires not just seeing inside any one component […] but understanding how it works as a system” (2018: 983), our empirical work focuses less on whether or how Perspective's models are biased and more on the processes and relationships that inform the product and its integration into concrete contexts. To achieve this, we propose a layered analysis that moves from Perspective's basic mandate to the ways the service is integrated into different interfaces and software artifacts. Second, we inquire into the organizational logics that characterize Perspective as a project and part of Google. Here, a cooperative, multi-polar model that involves a degree of openness, transparency, and adaptability contrasts with platformization dynamics (Helmond, 2015) that raise questions about centralization and cultural normalization. The already fraught relationship between platform companies and news providers, a key target audience for Perspective, is further complicated by services that propose to alleviate the burden of moderation at the cost of further logistical dependence and a potential loss of editorial agency. We shed light on these power relations and their implications, but also show how Perspective's basic setup as a web service allows for considerably more leeway than the monolithic and hidden systems at work in social media platforms. The overarching goal is to reflect on machine moderation as a form of techno-institutional design that defines and imposes key conditions on public discourse (Calhoun, 1992).
The paper proceeds in four steps. The following section presents our methodological strategy and empirical sources, including interviews, documentation, and GitHub repositories. We then summarize our findings in mostly descriptive terms along five axes or “steps” to identify the various threads Perspective API brings together to deliver a working product. The third section discusses the two organizational logics we identified, paying attention to both critique and what can be learned from the specific case at hand. We conclude by arguing that the opposition between “human” and “machine” in speech moderation obscures the many ways the two come together in concrete systems and suggest that the way forward requires proactive engagement with the design of technologies as well as the institutions they are embedded in.
Case and analytical strategy
While any moderation system involves software, algorithmic content classification is increasingly common. When dealing with textual material, the general idea is to submit individual messages—conversation-based systems are still rare—to one or several classification modules that produce scores for notions such as “toxicity” or “hatefulness”. Today, this task is generally handled by ML systems, “since basic word filters do not provide a sufficient remedy” (Schmidt and Wiegand, 2017: 1). The work of creating the classifiers at the center of such systems consists mainly in assembling adequate training data, applying a series of algorithmic techniques for model generation, and tuning both to the specificities of the task at hand.
Although our investigation of Perspective API can be seen as a contribution to the quickly growing literature on ML accountability and transparency, our approach departs in important ways from the very model-focused work on bias (Mehrabi et al., 2019) or explainability (Belle and Papantonis, 2020). In fact, the outputs Perspective produces have already been scrutinized by researchers, journalists, and activists (e.g. Blue, 2017; Gröndahl et al., 2018) trying to “trip up” the system through adversarial attacks designed to “reveal brittleness in toxic speech detectors” (Noever, 2018: 1), for example racial or gender biases. This paper, however, takes the speech classifier not as a singular object but as an entry point into a complex arrangement of work processes, technologies, partnerships, and normative choices. Whether these combinations of human and machine agency are conceptualized as “assemblages” or “socio-technical networks” is less important than the recognition that contributions to the character and performativity of speech classification are spread out over technical components, actor relations, and various temporalities. As Ananny and Crawford argue, “[t]here is no ‘single’ system to see inside when the system itself is distributed among and embedded within environments that define its operation” (2018: 982). We use the metaphor of “fabrics” to emphasize that the different components involved are not a coincidental collection but woven together with purpose into complex yet ordered arrangements that are not easily taken apart, potentially binding actors into relationships of dependency. Approaching empirical work from this vantage point is daunting and even more so when moderation systems operate within secretive online platforms. In the case of Perspective API, however, we could access substantial amounts of information and in the following sections, we introduce our case, sources, and analytical strategy.
Perspective API
Jigsaw was initially founded as “Google Ideas” in 2010, with Jared Cohen, a former staffer in the US State Department, as the “think/do” tank's first and so far only director. 1 In 2016, Google CEO Eric Schmidt announced a renaming and expanded mandate, presenting Jigsaw as a “technology incubator” that would “use technology to tackle the toughest geopolitical challenges, from countering violent extremism to thwarting online censorship to mitigating the threats associated with digital attacks”. 2 While many of the unit's projects concern security in a more restricted sense, its Conversation AI group works on what Wired Magazine called an “AI-Powered War on Trolls” (Greenberg, 2016) and Perspective API is their main product.
In practical terms, Perspective is a simple web service that takes a string of text as input, processes it, and sends back a probability score along one or several “attributes.” The main attribute is TOXICITY, 3 but at the time of writing, SEVERE_TOXICITY, IDENTITY_ATTACK, INSULT, PROFANITY, and THREAT have also reached production status. A second set of attributes is derived from a collaboration with the NYT, which shared 16 million moderated comments going back to 2007 (Etim, 2017). While news websites play a prominent role in Conversation AI's plans, the service targets anyone who wants to “host better conversations online” 4 and is currently free for accepted clients. Since Perspective is a Web-API that other programs interface with, a variety of tools make use of it, and these tools determine what data is sent and how the returned scores are used. They may give feedback to commenters before sending potentially offending messages, help human moderators by organizing comments by toxicity score, or allow messages below a certain threshold to be published without human review.
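To make this interaction tangible, the following sketch shows what such a request could look like in Python, following the public developer documentation at the time of writing; the API key is a placeholder and the exact request shape may have evolved since.

```python
# Illustrative sketch of a Perspective API request, following the public
# developer documentation at the time of writing; the key is a placeholder
# and the exact request shape may have evolved since.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; clients must be accepted by Jigsaw
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:analyze?key={API_KEY}")

payload = {
    "comment": {"text": "You are a wonderful person."},
    "languages": ["en"],
    # Request scores along two of the production attributes.
    "requestedAttributes": {"TOXICITY": {}, "INSULT": {}},
}

response = requests.post(URL, json=payload)
for attribute, data in response.json()["attributeScores"].items():
    # summaryScore.value is a probability between zero and one that a
    # reader would perceive the comment as containing the attribute.
    print(attribute, data["summaryScore"]["value"])
```

The returned scores carry no instructions of their own; as discussed below, it is up to the client application to decide what, if anything, to do with them.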
Approach
While we were able to interview several people 5 involved in the project, including Conversation AI's founder and Jigsaw's former chief scientist Lucas Dixon, our research is not an ethnography of an algorithmic system (Seaver, 2017). We rely in large part on various kinds of published material to iteratively compose a picture of a case where key decision-making processes—Jigsaw is managed by Google after all—are off-limits. We thus hew closer to Bucher's notion of technography as a method for dealing with algorithmic systems. Technography, she writes, “is a way of describing and observing the workings of technology in order to examine the interplay between a diverse set of actors (both human and nonhuman)” (2018: 60). Rather than “reveal some hidden truth about the exact workings of software or to unveil the precise formula of an algorithm” (Bucher, 2018: 61), the approach seeks to “develop a critical understanding of the mechanisms and operational logic of software” (Bucher, 2018: 61), including the way it connects to and drives social purposes. To do this, we draw on journalistic work, such as company profiles and interviews, but more importantly, on a wide array of material published by Jigsaw itself. In addition to corporate websites, 6 a developer blog called The False Positive, 7 a not very active Google Group, 8 and several academic publications, this includes robust developer documentation, 9 an extensive network of GitHub repositories, an ML practicum that takes Perspective as an example, 10 and three Kaggle challenges hosted by the Conversation AI group. 11
While critical approaches to studying AI systems have been proliferating, there is no broadly shared consensus on what to highlight and what to leave out in assessments of concrete examples. In the following section, we discuss Perspective API through the lens of five central aspects that structure the project. These aspects were inductively developed from our interaction with the material at hand, but they also constitute a rudimentary method for examining similar systems. While the boundaries between them are porous, they distinguish a series of design “steps” leading from a general idea toward a working system that continues to evolve after its initial design.
From problem(atization) to working system
Problem definition, mandate, values
The first point to address is the overall orientation and self-definition of Perspective. Technical projects generally set a specific goal or mandate: they single out a particular “problem” they seek to “solve.” Critical scholars have argued that technical fields have their own worldview, “linguistic forms, habits of thought, established techniques, ritualized work practices, ways of framing questions and answers, genre conventions, and so forth” (Agre, 1997: 150) that recast complex social phenomena in technical language as problems amenable to technical solutions. These problem definitions or “problematizations” are both descriptive and prescriptive in nature and the initial diagnosis justifies and orients the technical propositions that follow. In the case of Perspective API, the project developed from a meeting between the Conversation AI group and women targeted by the notorious Gamergate campaign (Greenberg, 2016). Hearing about the relentless attempts to silence these women through targeted harassment inspired one of the core diagnoses behind the project, namely that “[a]buse and harassment stop people from expressing themselves or makes them give up on conversations entirely.” 12
This initial realization is developed into a substantive assessment: “The problem of toxic language online is broad and global. According to Pew Research Center, 27% of American internet users chose not to post something online after seeing someone being harassed. Toxic language makes it hard to discuss important issues. The best approaches we’ve seen involve people facilitating discussions, but these are labor-intensive, often requiring community managers to manually review every comment. That kind of system is obviously hard to scale.” 13
The reference to Pew Research, here, sets up “toxic language” as a pervasive phenomenon rather than a series of anecdotes, and anchors it as a threat to expression and participation. Moderation is thus not stifling free speech but enabling “better conversations.” 14 The preferred approach—manual reviewing of comments—cannot keep up with the sheer volume of messages, however. Using ML to emulate that process thus comes into view as the “obvious” solution. According to Gillespie (2020), this is the dominant “discursive justification” for machine moderation, which resonates with the emerging consensus in the Natural Language Processing (NLP) field that “hateful” or “abusive” language can indeed be detected (and thus countered) by automated means (Schmidt and Wiegand, 2017; Waseem et al., 2017). Jigsaw was moreover encouraged by practical experiences with the computer game League of Legends, where 92% of players changed their behavior after receiving automated warnings when making sexist or abusive remarks (Greenberg, 2016). Pragmatic attention to communicative practice is indeed visible throughout Perspective, for example when we read that its classifiers “score the perceived impact a comment might have on a conversation.” 15 These normative and intellectual “anchor points” are crucial for the overall orientation of the project.
Conversation AI sets for itself five broad values. 16 Community, transparency, inclusivity, and privacy are standard fare, but “topic neutrality” highlights a normative concern that is specific to the task at hand: certain issues lead to more heated debates and certain minorities suffer from higher rates of targeting, with the effect that certain words—most importantly “identity terms” like “gay” or “jew”—become associated with toxicity and are scored accordingly. Conversation AI, however, argues that “[g]ood faith discussion can happen on controversial topics” 17 and we will see how this is tackled further down.
It bears mentioning that Conversation AI's goals are in alignment with its mother company's interests. According to Dixon, “a healthy internet is a good internet for Google,” because the alternative could be that “everyone moved to closed walled gardens.” This rhymes with the company's many efforts, for example around technical standards, to shape the web in ways that end up protecting its web-focused business model. Initiatives like Perspective allow Google to position itself as a guardian or “custodian” (Gillespie, 2018) of the open internet, a benevolent provider of technical services that smaller actors may be unable to create and run on their own.
Models and ground truth
While the basic rationale behind Perspective provides a general idea for how to tackle the problem, creating a working system requires further specification. For supervised ML, this involves compiling training data and selecting “target variables,” for example labels such as “toxic” and “not toxic.” This “ground truth” data encapsulates what the ideal outcome should look like. Most contemporary ML techniques require a very large number of examples to capture the nuances of complex classification tasks and, as is increasingly well understood, any biases and inaccuracies in the training data will skew the model if left unaccounted for. But if terms like “bias” still suggest that there may be a neutral way of assessing contested notions like toxicity, a closer look at the construction of ground truth data reveals a process of “semantic design” that is necessarily oriented and idiosyncratic.
To assemble training data, Perspective relied on two of the most common strategies: crowd work and expert input. Interestingly, these efforts were not mixed together but produced two different sets of models. The first includes TOXICITY, which remains Perspective's “main” model. 18 Training data was created by asking Crowdflower workers to rate a “proprietary” 19 set of text messages, including comments from Wikipedia talk page discussions, on a four-point scale ranging from “very toxic” to “not toxic.” 20 Comments were annotated without conversational context by at least ten workers who were tested on several “correct” examples beforehand. All assessments were aggregated into a single label (Wulczyn et al., 2017: 3). Instructions contained eight examples for orientation and defined toxicity as “a rude, disrespectful, unreasonable comment or otherwise somewhat likely to make a user leave a discussion or give up on sharing their perspective.” 21 A second model, SEVERE_TOXICITY, is derived from the same classification data but meant to detect “very toxic” language specifically. Crowd workers were also asked to label each comment with five additional binary attributes: profanity/obscenity, sexually explicit, identity-based attack, insulting, and threatening. 22 These attributes have been funneled into separate models 23 that, with the exception of SEXUALLY_EXPLICIT, had all reached production status at the time of writing (Figure 1).

Figure 1. The current list of models based on crowd work. Source: https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages
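To make the aggregation step concrete, the following toy sketch shows one way multiple crowd ratings could be collapsed into a probabilistic training label, in the spirit of Wulczyn et al. (2017); the ratings, the numeric scale, and the aggregation rule are invented for illustration.

```python
# Toy sketch: collapsing crowd ratings into a probabilistic training label,
# in the spirit of Wulczyn et al. (2017). Ratings, scale, and aggregation
# rule are invented for illustration. Each comment was rated by at least
# ten workers on a four-point scale; here:
# -2 = very toxic, -1 = toxic, 0 = unsure, 1 = not toxic.
ratings = {
    "comment_1": [-2, -2, -1, -1, -2, -1, -2, -1, -1, -2],
    "comment_2": [1, 1, 0, 1, 1, 1, -1, 1, 1, 0],
}

def toxicity_label(worker_ratings):
    """Fraction of workers who judged the comment toxic or very toxic;
    this fraction can serve directly as a probabilistic target."""
    toxic_votes = sum(1 for r in worker_ratings if r < 0)
    return toxic_votes / len(worker_ratings)

for comment, r in ratings.items():
    print(comment, toxicity_label(r))  # comment_1 -> 1.0, comment_2 -> 0.1
```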
Experimental attributes “have not been tested as thoroughly as production attributes” 24 and their use in real-world environments is discouraged. Since 2017, models for Spanish, French, German, Portuguese, Italian, Russian, and Arabic have been added. At least part of the necessary training data comes from “close partners” such as Le Monde 25 and El Pais, 26 but, as is often the case, there are fewer first-hand accounts for the non-English models. They may, however, have been trained on smaller amounts of data and could therefore be less accurate. Another noteworthy model is TOXICITY_FAST, which trades accuracy for low latency, facilitating moderation in near real-time situations. All attributes are probabilistic scores between zero and one that indicate “how likely it is that a reader would perceive the comment […] as containing the given attribute.” 27 Scores, therefore, do not reflect “severity” (Noever, 2018: 2).
A second set of models was developed from a collaboration with the community management team at the NYT. This partnership involves the continuous sharing of moderation data, including the full archive, and the creation of a moderation interface, which later became the open source toolkit Moderator. Although these datasets remain proprietary, Perspective API makes eight models publicly available (Figure 2).

Figure 2. The eight attributes based on data from the NYT. Source: https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages
These models are derived from a different vocabulary of labels, and we can see the Times’ understanding of what constitutes proper public discussion shine through. Attributes are not only meant to detect forms of aggression (ATTACK_ON_AUTHOR, ATTACK_ON_COMMENTER, and INFLAMMATORY) but also low-quality contributions (INCOHERENT, SPAM, and UNSUBSTANTIAL). LIKELY_TO_REJECT is an umbrella category that involves a range of pragmatic commitments tied to a conversational context that includes specific topics, participants, conversational norms, and economic interests. Here, ML is used to emulate an idiosyncratic practice without engineering an ontology of torts around it. Labels were indeed developed by and for human moderators, long before anyone planned for them to become training data for an ML classifier. As Bassey Etim, a former community editor at the NYT, recounts: “[We gave Jigsaw] the anonymized comments data and then the tags from the professional moderators, why they decided to reject them… that is not perfect, some of the tags are… if we had known when we started doing that, tagging comments a decade ago, probably we would have had a little more fine grained tags. There is one tag that is just inflammatory, and most of the comments were rejected because of that… […] But it worked well enough, and we came back with a model that we were really happy with.”
According to our interviews, the catch-all LIKELY_TO_REJECT is used by other news outlets to promote “really high-quality discussion,” showing how a specific understanding of quality can ripple through the internet once reified in a classification model. Paying attention to training data and labels is crucial for understanding how cultural logics flow into a system, even before any kind of “algorithmic” work has been done. In Perspective's case, this involves the particular semantic design used in a crowd work setting as well as the cultural and commercial specificities of the NYT comment section, including the vagaries induced by the daily grind of moderating thousands of comments manually.
Algorithmic techniques
Once training data is assembled, the actual “learning” can happen. While many ML tasks require complicated parsing of camera or sensor data, the automated assessment of written information relies on the input fields already present all over the web. Certainly, a system could consider things like the semantic environment a conversation is embedded in, its conversational structure and dynamics, voting or flagging signals, and even the full comment history of a user, but Perspective API classifies single units of text individually and without context. 28 While we cannot fully reconstruct the text parsing pipeline and how it has evolved in detail, initial work relied on a simple “bag of words” approach, without syntactic or lexicon-based enhancements (Wulczyn et al., 2017). The production version encodes words as word embeddings, using the open source Global Vectors for Word Representation (GloVe) algorithm. 29 Word embeddings basically substitute each word with a vector of hundreds or thousands of semantic dimensions that are themselves learned from parsing very large corpora of text, in an attempt to generalize from words to “meaning” and solve problems like synonymy and homonymy. GloVe is one of many available techniques, but its authors claim that it “outperforms other models on word analogy, word similarity, and named entity recognition tasks” (Pennington et al., 2014: 11). In the end, comments are transformed into sequences of semantic coordinates that are fed into Google's open source TensorFlow framework to train neural networks that associate input components with output labels. The result is a set of models that apply a form of statistical “if-then” to incoming text, in the sense that the presence of certain words or word combinations adds to the probability score for a given attribute.
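As an illustration of this pipeline, the following sketch assembles a minimal GloVe-plus-TensorFlow classifier; the file names, dimensions, toy data, and architecture are our assumptions for demonstration, not Jigsaw's actual production configuration.

```python
# Illustrative sketch of a GloVe + TensorFlow pipeline of the kind described
# above; file names, dimensions, and architecture are assumptions, not
# Jigsaw's actual production configuration.
import numpy as np
import tensorflow as tf

EMBED_DIM = 100   # matches the glove.6B.100d.txt vectors assumed below
SEQ_LEN = 100     # comments are padded/truncated to a fixed length

# Toy training data; a real pipeline would use millions of labeled comments.
train_texts = ["you are an idiot", "thanks for the thoughtful reply"]
train_labels = np.array([1.0, 0.0])  # 1 = toxic, 0 = not toxic

# 1. Map raw comments to integer sequences.
vectorizer = tf.keras.layers.TextVectorization(output_sequence_length=SEQ_LEN)
vectorizer.adapt(train_texts)
vocab = vectorizer.get_vocabulary()

# 2. Load pre-trained GloVe vectors and align them with the vocabulary.
glove = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:  # assumed local file
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

embedding_matrix = np.zeros((len(vocab), EMBED_DIM))
for i, word in enumerate(vocab):
    if word in glove:
        embedding_matrix[i] = glove[word]  # unknown words stay at zero

# 3. A small classifier mapping embedded comments to a probability score.
inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = vectorizer(inputs)
x = tf.keras.layers.Embedding(
    len(vocab), EMBED_DIM,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False)(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
x = tf.keras.layers.Dense(64, activation="relu")(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # score in [0, 1]

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(np.array([[t] for t in train_texts]), train_labels, epochs=2)
```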
While this kind of technical work has become “standard fare” in ML circles, there are many decisions, possibilities for optimization, and potential pitfalls when building text processing pipelines. Training, experience, and the capacity to try out many different ideas are key differentiators, and large internet companies have been frantically hiring researchers from computing disciplines, leading to increased dominance in academic publications and to a “de-democratization of AI” (Ahmed and Wahed, 2020). Conversation AI is indeed a good example of Google assembling teams of scientists that produce working systems but stay close to academic practices and associated knowledge reservoirs.
Jigsaw also drew on the wider ML community through three “challenges,” a common free-labor format in tech circles, with thousands of teams participating through Google's Kaggle website. While all contributions were made open source, the considerable technical creativity at work represents an important—and cheap—source of input for Perspective. For example, the “secret” of the winning submission to the first “toxic comment classification challenge,” where 4500 teams participated, seems to have mainly been the mixing of word embeddings, combining variants of GloVe and FastText, and the translation of comments to other languages and back to “augment” the training data. 30 The capacity of skilled Kaggle users to explore and test original ideas and combinations should not be underestimated, in particular when considering that Jigsaw on the whole has only around 50 employees.
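The back-translation trick can be rendered schematically as follows; translate() below is a stand-in (an identity function, so the sketch runs) for whatever machine translation system a competitor would actually call.

```python
# Toy sketch of the back-translation "augmentation" trick used by the
# winning Kaggle entry; translate() is a stand-in for any MT system.
def translate(text, src, dst):
    """Placeholder for a machine translation call; identity here so the
    sketch runs, a real pipeline would call an MT model or API."""
    return text

def back_translate(text, pivot="fr"):
    """Round-trip a comment through a pivot language to create a
    paraphrased variant that keeps its original label."""
    return translate(translate(text, "en", pivot), pivot, "en")

train_set = [("you are an idiot", 1), ("thanks for the reply", 0)]
augmented = train_set + [(back_translate(t), label) for t, label in train_set]
print(len(augmented))  # -> 4: the original comments plus their paraphrases
```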
Implementation and evolution
While the steps discussed so far yield a working model, there are important differences between one-off experiments and systems used in production settings. To reach this next stage, performance considerations such as response time, computational cost, reliability, and scalability must be taken into account. While we can only speculate about how Jigsaw handles these things internally, we understand that software offered as a web service rarely stands still: it changes continuously, in part due to forms of automated optimization, but also because manual modifications can be rolled out transparently and without disruption. As Jigsaw revealed, “Perspective models are not automatically learning all the time, but we update our models regularly,” 31 resulting in versioned releases. The latest iteration, TOXICITY@6, was launched in August 2018. 32 We know little about how updated versions are compiled, although two pathways for data acquisition stand out. First, partner organizations continue to contribute (labeled) datasets and an online form allows anyone to share datasets with Jigsaw. Shared collections should be “useful,” which means “more than 100k comments w/labels or more than 1 M w/o labels.” 33 Second, users can contribute directly through the API. Comments submitted for assessment are stored by default, although a “doNotStore” flag allows for opting out. And the “SuggestCommentScore” endpoint makes it possible for users to send “corrections” that flow back into the system. 34 Both pathways establish infrastructural relations with outside actors that can transform the optimization and adaptation of models into a byproduct of normal use.
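To illustrate the second pathway, here is a sketch of what a “correction” sent via the “SuggestCommentScore” endpoint could look like, following the public developer documentation; the field names and values should be read as indicative rather than authoritative.

```python
# Illustrative sketch of the "SuggestCommentScore" feedback pathway,
# following the public developer documentation; field names and values
# are indicative rather than authoritative.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       f"comments:suggestscore?key={API_KEY}")

correction = {
    "comment": {"text": "what a total waste of time"},
    "languages": ["en"],
    # A human moderator judged this comment acceptable, "correcting" an
    # overly high machine score; the suggested value flows back to Jigsaw.
    "attributeScores": {"TOXICITY": {"summaryScore": {"value": 0.0}}},
}

requests.post(URL, json=correction)
```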
A central vector of Perspective's evolution has been the question of bias. After the API launched in 2017, researchers and activists quickly identified many problems, in particular the tendency to score comments containing terms like “black” or “gay” with high probabilities for toxicity (Blue, 2017). These initial revelations prompted attempts to detect and mitigate “unintended bias” in Perspective and to achieve the above-mentioned “topic neutrality,” starting from the idea that “a model contains unintended bias if it performs better for comments containing some particular identity terms than for comments containing others” (Dixon et al., 2018: 2). Jigsaw “manually created a set of 51 common identity terms” (Dixon et al., 2018: 2) to measure and mitigate the issue as well as “nuanced metrics” (Borkan et al., 2019) to evaluate bias quantitatively, show progress between model versions, and rank contributions to a specifically designed Kaggle challenge. 35 Detection of bias within this framework is relatively straightforward since results can be “calculated using a synthetically generated test set where a range of identity terms are swapped into template sentences, both toxic and non-toxic.” 36 This “operationalized” bias is then mitigated using one of several techniques circulating through the FAT (Fairness, Accountability, and Transparency) ML community (Friedler et al., 2019). The basic problem is that an identity term may appear more often in comments classified as toxic in the training data. The mitigation strategy is thus to add published Wikipedia articles containing the term—which are considered non-toxic since verified by the Wikipedia community—to the training data “to bring the toxic/non-toxic balance in line with the prior distribution for the overall dataset” (Dixon et al., 2018: 3). This strategy works well for the selected identity terms within the measurement framework but also raises the complicated question of how such terms are selected. We will come back to this issue further down.
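The template-based detection approach can be sketched in a few lines; the templates and identity terms below are invented examples and score() is a crude stand-in for the classifier under test.

```python
# Toy sketch of the synthetic bias test described above (after Dixon
# et al., 2018). Templates and identity terms are invented examples;
# score() is a crude keyword stand-in so the sketch runs without the API.
TOXIC_TEMPLATES = ["I hate all {} people.", "All {} people are disgusting."]
BENIGN_TEMPLATES = ["I am a proud {} person.", "Many of my friends are {}."]
IDENTITY_TERMS = ["gay", "straight", "black", "white", "muslim", "christian"]

def score(text):
    """Stand-in for the classifier under test; a real evaluation would
    submit the sentence to the model and read back its toxicity score."""
    return 0.9 if any(w in text.lower() for w in ("hate", "disgusting")) else 0.1

for term in IDENTITY_TERMS:
    benign = [score(t.format(term)) for t in BENIGN_TEMPLATES]
    toxic = [score(t.format(term)) for t in TOXIC_TEMPLATES]
    # Unintended bias shows up when *benign* sentences containing some
    # identity terms score markedly higher than those containing others.
    print(f"{term:>10}  benign={sum(benign)/len(benign):.2f}  "
          f"toxic={sum(toxic)/len(toxic):.2f}")
```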
Orchestration
As already mentioned, Perspective's services are delivered through a Web-API that accepts a text snippet as input and serves a range of scores as output. This means that the designers and community managers of end-user applications decide for themselves how to “orchestrate” activity flows, that is, how to integrate toxicity assessments into functionalities and interfaces. How are algorithmic classifications made effective? Do they appear as mere tags on a human moderator's screen, as in Disqus’ implementation of Perspective API? 37 As recommendations? Are they used to render a comment automatically invisible or less visible? Are humans in the loop, on the loop, or out of the loop (Citron and Pasquale, 2014: 6–7)? Which humans? Are end-users alerted about “speech problems” during writing or before sending a comment, as is the case for El Pais' Perspective-based moderation system? 38 Is there a notification after the fact? Are there facilities for contestation or redress? Or are end-users the ones “calling upon” machine moderation in the first place? Within this space of design choices, Perspective explicitly identifies two “uses to avoid”: 39 fully automated moderation is discouraged since “[m]achine learning models will always make some mistakes”; and “character judgment,” in the sense that scores become part of users’ profiles, is considered problematic from a privacy standpoint.
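A schematic sketch of one possible orchestration may help illustrate this design space: comments are routed by score, with thresholds and categories invented for illustration.

```python
# Schematic sketch of one possible orchestration: routing comments by
# toxicity score. Thresholds and categories are invented for illustration.
AUTO_PUBLISH_BELOW = 0.3   # low-risk comments go live without review
HUMAN_REVIEW_BELOW = 0.8   # mid-range scores are queued for a moderator

def route(toxicity_score):
    if toxicity_score < AUTO_PUBLISH_BELOW:
        return "publish"              # human out of the loop
    if toxicity_score < HUMAN_REVIEW_BELOW:
        return "queue_for_moderator"  # human in the loop
    return "warn_author"              # preemptive feedback before posting

print(route(0.05))  # -> publish
print(route(0.55))  # -> queue_for_moderator
print(route(0.93))  # -> warn_author
```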
Jigsaw itself develops two tools that demonstrate technical possibilities and potential use cases. Tune is a browser plug-in that gives users the means to “adjust the level of toxicity they see in comments across the internet,” 40 including on YouTube, Twitter, Facebook, Reddit, and Disqus. Reversing the standard logic of moderation, where content is taken offline, Tune users filter what they see by selecting from the above-mentioned models and tuning a dial (Figure 3).

Figure 3. Tune’s interface.
Tune shows how machine moderation can be “laid over” the internet without any technical adaptations needed on the covered websites. This approach fundamentally reconfigures the governance constellation: since content is filtered by the user rather than removed by the websites themselves, site operators do not have to deal with the ramifications of content removal.
Moderator, on the other hand, is an open source “machine-assisted human-moderation toolkit,” 41 developed in collaboration with the NYT, that can be integrated into a website's editorial systems. It serves as a technical demonstrator but also introduces specific interaction paradigms such as bulk moderating (Figure 4), where operators can use sliders to act on sets of comments, for example, grouping anything below a certain threshold, which “really lightens up the mental load of having to know that any given second something awful might be there,” according to one of our interviews. Moderator thus highlights how moderation is not merely a question of “human vs. machine,” but rather a question of how different forms of cognition and agency are arranged into hybrid systems.

Figure 4. Moderator’s bulk moderation interface.
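The bulk moderation paradigm can be rendered schematically as follows; the comments, scores, and threshold are invented for illustration.

```python
# Schematic sketch of bulk moderation: a slider threshold selects whole
# groups of comments for a single action. Data and threshold are invented.
comments = [
    ("Great reporting, thank you!", 0.02),
    ("You people are all idiots.", 0.94),
    ("I disagree with the premise here.", 0.11),
    ("Get lost, you absolute clown.", 0.88),
]

def bulk_select(scored_comments, threshold):
    """Return comments at or above the slider threshold so a moderator
    can reject (or review) them in one operation."""
    return [text for text, score in scored_comments if score >= threshold]

print(bulk_select(comments, threshold=0.85))
# -> ['You people are all idiots.', 'Get lost, you absolute clown.']
```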
There are several interesting implementations beyond Google's own tools. One example is Coral Talk, an open source commenting platform designed by The Coral Project that was bought in early 2019 42 by Vox Media, a central technology provider in online news through its Chorus content management system. Coral Talk is provided as software as a service (SaaS) and, according to the company, “powers the communities of nearly 50 publishers in 12 countries, including The Wall Street Journal, the Washington Post, The Intercept, and New York Magazine.” 43 Perspective API is a part of that system, both for real-time user warnings and backend moderation. 44 The former is an attempt at modifying user behavior preemptively by showing a notification if a comment might violate community guidelines. Andrew Losowsky, head of Coral at Vox Media, argues that this both improves commenting practices and makes the use of ML more transparent: “From the beginning we decided we will use this as AI assisted human moderation rather than AI Moderation. And at the same time we wanted to understand the feedback using the AI, first of all to be transparent about the AI we use and also there was a way to use the AI feedback as a way to encourage commenters to improve their own behaviour first.”
Losowsky also emphasized that the benefits of using ML come from embedding it in human moderation processes, rather than replacing them. This demonstrates how partners can calibrate the use of Perspective to suit their own ideas.
The last example is Twitter's low-level implementation in the “hide replies” API-endpoint, which gives developers of third-party software, for example, tools for brand monitoring, the possibility to integrate Perspective directly. 45 Here, we see how one API can become part of another API's inner workings, affording the creation of systems that combine layers of functionality and reach deeply into the bowels of several companies at once. The different forms of orchestration show how the generation of outcomes is distributed over increasingly complex chains of human and machine work, forming fabrics that combine technical infrastructures with different partnerships and institutional relations.
The fabrics of power
What emerges from the previous section is not a snapshot of a singular model or algorithm, but a fragmented description of a complex fabric of technologies, actors, and normative commitments, drawn together into working artifacts that take part in millions of speech decisions every day. Instead of a singular logic of “automation,” where manual labor is replaced by machine operation, we find many kinds of involvements and contributions, many instances of decision-making, and thus many moments where explicit and implicit forms of human judgment come together with technical methods and artifacts. The case of Perspective invites reflection on how these different strands are woven together into more stable structures, in particular when considering how “organization is an achievement, a process, a consequence, a set of resistances overcome, a precarious effect” (Law, 1992: 390). What comes into view when taking a step back from the minutiae of implementing and running a ML system are two conflicting organizational logics that reveal fundamental tensions at the heart of the project. One is a multi-polar model inspired by academic norms and open source practices, the other evokes familiar processes of platformization and cultural normalization. While machine moderation inevitably takes part in shaping public discourse, “a possible mode of coordination of human life” (Calhoun, 1992: 6), it is monolithic in neither technical nor organizational terms. The question of whether it becomes an instrument for communication rather than domination (Calhoun, 1992: 29) largely depends on which organizational logic will eventually prevail.
A cooperative, multi-polar model
On a first level, Jigsaw implements what could be called an “academic startup model” that involves different forms of collaboration and communication. Perspective keeps close ties with the NLP community, drawing on scientists as key personnel, engaging in publications and co-authorships with academic researchers, participating in conferences, and so forth. Open source practices play a central role in research collaborations, but also for organizing highly valuable Kaggle competitions. These connections allow the company to draw on the state of the art in ML and to anchor their efforts within academic discourse. In line with AI's broader “social sciences deficit” (Sloane and Moss, 2019), the range of disciplines taken into account is restricted, however. Since topics like bias detection and mitigation require technical knowledge as well as certain cultural sensitivities, the substantial empirical critiques Perspective API has received can be seen as “free input” in this regard. Similarly, the longer-term collaboration with the NYT contributed not only high-quality data already shaped by a well-honed moderation apparatus, but also decades of tacit experience.
These efforts toward making “an ecosystem that works towards healthy conversations,” as Dixon puts it, may facilitate the task at hand while saving costs, but Perspective's relative openness also has the effect of countering one of the most common critiques of ML, the lack of transparency (Burrell, 2016). Gorwa et al. (2020) indeed argue that algorithmic content moderation runs the risk of making already opaque moderation practices even less transparent. However, while some of the fundamental problems with techniques that take millions of signals into account remain, the way Perspective presents itself does enhance our capacity to understand and critique the project. The availability of academic publications, blog posts, open source code, public datasets, tutorials, model evaluations, and so forth may not provide deep insights into the inner workings of each classification model, but they go much further than what we are used to from large platforms. And access to the behavior of the different models via an API constitutes a level of “observability” (Rieder and Hofmann, 2020) that allows for investigating concrete biases or blind spots.
There are clear limitations, however. The API does not include an explanation for its decisions, for example, a score for individual words or word combinations—although longer text passages receive scores for different sections or “spans.” More importantly, the composition of the “proprietary” dataset used to train the main toxicity models was never fully clear and recent updates are no longer public. We were not able to find any information on data or partners for languages other than English, French, and Spanish. The use of Google's own “model card” framework for “transparent model reporting” (Mitchell et al., 2019: 220) is certainly good practice, but the only card made publicly available is for the English version of TOXICITY. Maybe these pieces of information exist somewhere, but they are certainly not foregrounded. The same goes for partnerships and actual uses of the API: it is possible to piece together a narrative from interviews and published materials, but many details remain in the dark. The lack of information in this regard may paradoxically be in part an effect of the multi-polar, partnership-based way Perspective functions, which allows collaborators to pursue their own communication strategies.
Multi-polarity is in part an effect of Perspective's setup as a Web-API, which implies a high degree of adaptability, for example when developers select from the list of available models or integrate results in different ways into activity flows. This shows very concretely how accountability concerns a whole “assemblage” or system rather than a single component (Ananny and Crawford, 2018). Actual applications of Perspective include back-office tools that can render moderation work for human moderators more practical or humane (Gillespie, 2020), front-end widgets that warn users preemptively instead of deleting after the fact, and filter plug-ins that allow individuals to calibrate the level of toxicity they want to encounter. Each of these orchestrations inserts AI components differently into lived realities and implies quite different cultural logics. The plug-in approach, for example, applies the filter at the very end of the chain, thus framing toxicity as a nuisance or danger that individuals can protect against. While this may “work” in certain circumstances, it does not solve the matter of harmful content being out there, easily available for people willing or wanting to see it, for example in situations of social unrest or radicalization. Thinking about these possibilities can, however, lead to fine-grained and balanced strategies that may have mitigating effects without excessive blocking. Concepts like “design-based governance” (Gritsenko and Wood, 2020) can serve as both markers of critique and invitations for creative intervention.
Finally, Perspective's partners are also involved in developing classifiers further. While the current 46 free-to-use model means that users “pay” with their data and datafied use patterns, the API's “SuggestCommentScore” endpoint and data sharing facilities open pathways toward deliberate forms of participation that are not reduced to some automated optimization loop. If we consider that moderation models should evolve over time as cultures change (Gillespie, 2020: 3), the question of how different actors can effect change is crucial and Perspective again provides a starting point for how to organize such a process in practical terms.
Platformization and moral engineering
Even if Perspective succeeds, to a certain degree, in realizing a multi-polar, cooperative model that provides useful ways for thinking about machine moderation going forward, its embedding within Google cannot be ignored. On a practical level, the project depends on substantial material and financial inputs, for example, datacenter capacities for training and inference. And Google's considerable clout is certainly a reason why many of the mentioned partnerships were possible in the first place. Within this actor constellation, multi-polarity does not mean that there are no power asymmetries.
Regarding news organizations, which are particularly important for Perspective as both core partners in model creation and target market, there are clear differences between leading institutions such as the NYT, which have huge archives of moderation decisions and are therefore sufficiently “valuable” to establish a somewhat symmetric relationship with a company like Google, and smaller actors that are dependent on third parties and ready-made models and tools. Like social media platforms, news publishers have been witnessing the growth of problematic expression, ranging from spam to blatant hate speech and personal attacks on writers or other commenters. Allowing reader comments on their websites adds valuable traffic but comes with the expectation that comments will be treated with consideration (Ksiazek, 2018). If a comment section is to add value to journalistic work, it needs to adhere to certain quality standards, and this raises the stakes for content moderation. Smaller websites and platforms have already been outsourcing moderation work to actors like Disqus (Gillespie et al., 2020: 5), and machine moderation potentially introduces another layer of centralization into the equation. According to Nechushtai (2018), Google already plays a pivotal role in this increasingly entangled configuration: products like Search and News are crucial audience gateways; the company's ad networks are central for income generation; Analytics and other products are the industry standard for audience analysis and web design optimization; services like Trends have become reporting tools; programs like Google News Lab and Google Digital News Initiative wield considerable resources. Within this constellation, Perspective API appears as another piece in a puzzle that Nechushtai terms “infrastructural capture,” that is, “circumstances in which a scrutinizing body is incapable of operating sustainably without the physical or digital resources and services provided by the businesses it oversees” (2018: 1). Considering that the media indeed play a central role when it comes to keeping platform companies in check, machine moderation can become yet another vector of dependence, given the amount of data, expertise, and work required.
Machine moderation as a service also raises concerns about normative plurality and cultural streamlining if the same moderation tools are used by different publishers all over the internet. The preparation of training data and bias mitigation show how machine moderation is a much more complex design process than the simplistic opposition between “human” and “machine” admits. Ultimately, classification models seek to understand human language and intent, funneling that understanding into some decision or outcome. This implies subtle forms of semantic and “moral” engineering. Each model implemented in Perspective projects a particular “view” onto the submitted comments. Together, they form a “vocabulary of concern,” akin to the different possibilities for flagging that Crawford and Gillespie (2016) refer to as “vocabulary of complaint.” But unlike these buttons and menus, machine moderation organizes a tight amalgam of human judgement and computational prowess into an automated mechanism that interprets every comment along a series of defined cultural axes. These modes of appreciation do not necessarily form a singular grid: there are clear semantic and pragmatic differences between TOXICITY and LIKELY_TO_REJECT, the former seeking to attribute a particular “state” to a comment, the latter aggregating a potentially wide range of objections into a single model. But all models encode normative choices, that is, specific ideas of what constitutes “low-value speech” (Leiter, 2010) and, consequently, of what a “healthy” conversation should look like.
In recent work coming out of a collaboration between Google and researchers from the universities of Oxford and South Carolina, these operative distinctions are becoming more fine-grained, seeking to tackle “subtle forms of toxicity” (Price et al., 2020), which means identifying comments that are “(1) hostile; (2) antagonistic, insulting, provocative or trolling; (3) dismissive; (4) condescending or patronizing; (5) sarcastic; and/or (6) an unfair generalisation” (Price et al., 2020: 1). Whether this work ends up in a production system or not, it signals the deep entanglement between technical design and potentially far-reaching moments of normative prescription in machine moderation. And this is not merely an incursion of unruly cultural matters into the pristine spaces of technology. As Callon argues, “engineers transform themselves into sociologists, moralists or political scientists at precisely those moments when they are most caught up in technical questions” (1990: 136). Paying attention to the details of design processes, as we have tried to do for Perspective, can safeguard against the danger of obscuring “the fundamentally political nature of speech decisions being executed at scale” (Gorwa et al., 2020: 1).
Any medium shapes the interactions it affords, but the integration of machine moderation into communication infrastructures adds an explicitly moral mechanism, dispelling any lingering notion that these infrastructures are neutral message conduits. In the case of Perspective, normative entanglements are particularly visible around the selection of data labels and identity terms for bias mitigation. The latter covers areas like ethnicity, sexual orientation, age, and disability, but leaves out markers for class, for example. 47 While lists of terms and training data can and should be improved, Gorwa et al. rightfully argue that even “a perfectly ‘accurate’ toxic speech classifier will have unequal impacts on different populations because it will inevitably have to privilege certain formalizations of offence above others” (2020: 11). This is one of the reasons why terms like accuracy and bias are ultimately too limiting when it comes to understanding the fundamental way machine moderation is adopting and performing specific political and cultural viewpoints. For Perspective, this may well come down to what Sarah Roberts contends for content moderation on social media platforms, namely that “American mores, jurisprudence and norms, conflated with what is best for American firms, have been seamlessly packaged into technological and policy affordances” (Gillespie et al., 2020: 20f).
And there are more subtle concerns and effects coming into play, for example when considering that the pathologizing of speech at work in terms like “toxicity” makes no distinction between trolling, intent to harm, and justified outrage. A key term that mirrors toxicity is “civility,” highlighted by two initial partners in the project, the NYT (Long, 2017) and Wikipedia, 48 to describe ideal online conversations. Similar to toxicity, civility does not seem to have deep roots in psychological or social research. It is an interesting choice from an anthropological perspective, however, since it reveals the deep ambiguities involved in normative discussions of communicative practice. On the one hand, civility has the potential to protect the marginalized if it offers everyone in society a fair share of respect and dignity. Balibar indeed points out that the politics of civility can draw limits to violence and humiliation (2002: 30). But civility can also be seen as an elite project that coerces the marginalized factions of society—the “rightfully angry”—into accepting hegemonic violence without dissent or rebellion (Thiranagama et al., 2018). Perspective's ideas about “healthy” conversations thus run into some of the same problems Fraser (1992) and others have pointed out with regards to utopian ideals of a public sphere that brackets and neutralizes status differences and private interests, namely that capacities for participation in such “rationalized” spaces continue to be structured by social and cultural inequalities in myriad ways, conferring power to those capable of emulating the required codes and conversational styles. Weighing in on these debates is beyond our scope, but machine moderation models necessarily make commitments within these moral spaces and if systems like Perspective gain pervasive uptake, normative homogenization is a potential outcome.
Conclusion
This paper set out to investigate machine moderation through the lens of a concrete system and its different components and connections. We found a complex constellation involving a variety of actors, sites, processes, technologies, and points of decision-making, not a singular, coherent thing that one could call “the algorithm” (Seaver, 2017). Indeed, what we have thought of as the “fabrics” of machine moderation—different heterogeneous layers or systems that stabilize into functional ensembles—involve feats of technological prowess, but also many unwieldy “chores” that draw on human labor, ranging from tagging comments to “repairing” bias. It also involves substantial definitional work around “toxicity” and other forms of semantic engineering and rule-setting, much of it done in impromptu fashion. We should expect social media platforms to orchestrate similar techno-bureaucratic processes around machine moderation that, despite much more limited access to empirical material, could be studied following some of the methodological directions taken in this paper.
The focus on a concrete empirical example invariably led us away from the human/machine dichotomy to the question of how different technical and organizational practices are woven together into working systems. The two competing logics we identified with regard to Perspective API are relevant to the case at hand, but also to broader assessments of machine moderation and its cultural and political significance. What we have called a cooperative, multi-polar model concerns practices like open communication, code sharing, and academic involvement, as well as the technical setup as a Web-API that allows for various orchestrations and incorporates elements of two-way interaction. Responsibility and accountability are shared, at least to a degree, and actors can position themselves in different ways within the larger assemblage, whether as users, contributors, or critics. On the level of technical design, a multi-polar model allows for nuanced responses to specific concerns, for example when Coral Talk or El Pais use Perspective to issue warnings in real time, potentially reducing the need for deletion after the fact and avoiding the reactions deletion may provoke. The API's multi-headed classification models also allow designers and community managers to think in differential terms about the impact on both their users and human moderation staff, for example by selecting purposefully from the range of available models or by implementing different interface orchestrations for models that flag different speech problems, such as PROFANITY, THREAT, or ATTACK_ON_AUTHOR. On the level of actor involvement, one can think not only in terms of commercial partnerships but also about a more substantial inclusion of civil society, for example, affected marginalized groups themselves. Feedback on classification scores, critiques of problematic outcomes, sharing of training data, and involvement in the compilation of identity terms are all forms of participation that are already possible in one way or another, even if more robust collaboration would certainly require greater efforts from within Jigsaw. These organizational arrangements do not solve all problems at hand, but they show that machine moderation is not necessarily incompatible with a conception of public discourse that involves multiple spheres operating according to different rules rather than a single, hegemonic cultural space (Fraser, 1992).
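To illustrate what such per-model orchestration can look like in practice, the following sketch requests several attributes in a single call to Perspective's public AnalyzeComment endpoint and routes each comment to a different interface response. The endpoint and request shape follow the API's public documentation at the time of writing; the attribute selection, thresholds, and routing rules are our own illustrative assumptions, not recommendations from Jigsaw.

```python
import requests

# Endpoint as documented for Perspective's AnalyzeComment method.
API_URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
           "comments:analyze")


def analyze(comment: str, api_key: str) -> dict:
    """Request several attribute scores for one comment in a single call."""
    payload = {
        "comment": {"text": comment},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}, "PROFANITY": {}, "THREAT": {}},
    }
    resp = requests.post(API_URL, params={"key": api_key}, json=payload)
    resp.raise_for_status()
    scores = resp.json()["attributeScores"]
    return {name: s["summaryScore"]["value"] for name, s in scores.items()}


def route(scores: dict) -> str:
    """Hypothetical routing: different attributes trigger different
    interface responses rather than a single delete/keep decision."""
    if scores.get("THREAT", 0) > 0.9:
        return "hold for human moderator"          # high stakes: humans decide
    if scores.get("TOXICITY", 0) > 0.8:
        return "show real-time warning to author"  # nudge before posting
    if scores.get("PROFANITY", 0) > 0.8:
        return "collapse comment behind click-through"
    return "publish"
```

The point of the sketch is that the classification scores underdetermine the moderation outcome: the same scores can feed a warning, a queue, or a deletion, and that choice remains with the integrating site rather than with the model.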
The embedding of Perspective in Google, however, evokes familiar tropes about platformization, centralization, and infrastructural capture. Is machine moderation yet another task for which news organizations are becoming dependent on platform operators? Will it be a vector of political and cultural normalization as smaller organizations are forced to rely on third parties like Coral or Disqus? What if Google decides to stop the project altogether? Since the production of workable machine moderation services is by no means trivial and requires heavy investments in labor and partnerships, powerful actors and the viewpoints they subscribe to are bound to dominate, which includes increasingly globalized news providers such as the NYT. And the divide between the “data rich” and “data poor” separates not only companies and institutions, but also languages and, by extension, the cultures they represent. The decision of what should count as “toxic” or “civil” has clear political implications, and Google's enormous reach may give it outsize influence over definitions and ideological diversity. And is Google still committed to keeping a multi-polar setup alive? We noticed a decline in public communication around Perspective after a promising start, and one may indeed wonder whether this is already a return to “normal” corporate practice after the initial openness strategy served its purpose in getting the project running. Other signs, such as the firing of Timnit Gebru in late 2020, are reasons to be pessimistic about Google's willingness to allow anything that goes beyond cosmetic repair of the most egregious problems with its products.
And yet, despite the shortcomings within the multi-polar model and the looming threat of corporate takeover, Perspective contains the seeds of what Helberger et al. call “cooperative responsibility,” that is, the “dynamic interaction between platforms, users, and public institutions” (2018: 10). Instead of opposing human and machine moderation, we suggest that the more important question is how to foster organizational arrangements that promote openness, participation, and multi-polarity, not only with regard to Perspective, but for the internet as a whole. Looking at the fabrics of machine moderation and the different components that come into play indeed raises the question of how these fabrics could be woven differently, including ideas for smaller, more fine-grained interventions as well as larger organizational innovations. Given the immense importance of public debate for democratic life, alternative institutional arrangements should be considered: Perspective shows that, given the right partnerships, even a small team can produce a machine moderation system in ways that afford a level of transparency, reactivity, and plurality. Could we imagine industry alliances, publicly funded institutions, or civil society organizations taking on the role of providing moderation services, allowing us to choose from a variety of models and model providers? If machine moderation is here to stay, and we believe it is, thinking in terms of fabrics is necessary not only for analysis and critique, but also for the increasingly entangled practice of thinking technologies through the lens of institutions and vice versa.
Acknowledgements
The authors would like to thank Wiebke Denkena, Dieuwertje Luitse, Ariadna Matamoros-Fernández, and three anonymous reviewers for their very helpful comments and suggestions. We are also grateful to our interviewees for their time and contribution.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Deutsche Forschungsgemeinschaft (Project 262513311 – SFB 1187 Media of Cooperation).
