Abstract
Content moderation algorithms influence how users understand and engage with social media platforms. However, when identifying hate speech, these automated systems often contain biases that can silence or further harm marginalized users. Recently, scholars have offered both restorative and transformative justice frameworks as alternative approaches to platform governance to mitigate harms caused to marginalized users. As a complement to these recent calls, in this essay, I take up the concept of reparation as one substantive approach social media platforms can use alongside and within these justice frameworks to take actionable steps toward addressing, undoing and proactively preventing the harm caused by algorithmic content moderation. Specifically, I draw on established legal and legislative reparations frameworks to suggest how social media platforms can reconceptualize algorithmic content moderation in ways that decrease harm to marginalized users when identifying hate speech. I argue that the concept of reparations can reorient how researchers and corporate social media platforms approach content moderation, away from capitalist impulses and efficiency and toward a framework that prioritizes creating an environment where individuals from marginalized communities feel safe, protected and empowered.
This article is part of the special theme on Algorithmic Reparation. To see a full list of all articles in this special theme, please visit: https://journals.sagepub.com/page/bds/collections/Algorithmic%20Reparation
Introduction
Content moderation algorithms govern what content gets seen on social media platforms and influence how users understand and engage within these digital spaces (Gillespie, 2018a). The corporations behind these platforms often argue that these moderation systems help better protect and keep users safe, especially marginalized individuals (Díaz and Hecht-Felella, 2021). However, these systems result in frequent instances in which content about racism, sexism, homophobia or ableism is “misattributed” as inappropriate content or actual racist, sexist, homophobic or ableist content is not removed, leading to the further marginalization of minoritized users (Guynn, 2019; Marshall, 2021; Murphy and Murgia, 2019; Siapera, 2022). These occurrences are due, in large part, to the fact that automated content moderation systems prioritize efficient and punitive approaches over the concerns and safety of marginalized users (Gillespie, 2020; Schoenebeck and Blackwell, 2021).
Scholars have suggested broad theoretical approaches to addressing the harm automated content moderation causes to marginalized users (Sander, 2019; Siapera, 2022). Recently, a handful of scholars have drawn on restorative and transformative justice frameworks to put forth value-driven suggestions that can help platforms address marginalized users’ fraught experiences with content moderation (Hasinoff and Schneider, 2022; Schoenebeck et al., 2023; Schoenebeck and Blackwell, 2021; Xiao et al., 2023). As a complement to this work, I draw on legal theories of reparations and build on the concept of algorithmic reparations put forth by Davis et al. (2021) to propose design suggestions for how social media platforms can implement algorithmic content moderation in ways that protect the safety of marginalized users without causing them additional harm, and that empower marginalized users to engage freely on these platforms.
This essay begins with an overview of the existing literature on platform content moderation, particularly algorithmic content moderation, and the systemic biases these systems can introduce into platform governance of hate speech. I then review theoretical frameworks scholars have offered to address platform governance, particularly the alternative justice frameworks of restorative and transformative justice. Drawing on these previous suggestions, I argue reparations can help support these governance strategies. Next, I draw on existing legal frameworks of reparations, particularly a social justice through healing framework of reparations (Yamamoto et al., 2007), to suggest how platforms can reorient the design of their content moderation systems, while also acknowledging the challenges and limitations posed by this orientation.
Platform governance and modes of content moderation
Contrary to popular rhetoric, for-profit social media platforms are not open forums for free speech (Gillespie, 2018a; Meta, 2019). These platforms are privately owned commercial entities that profit from monetizing user activity through advertising (Zuboff, 2019). Thus, the owners of these platforms have legal, economic and customer service imperatives to govern what is on these platforms through content moderation, broadly defined as a form of platform governance meant to prevent harm and abuse by identifying and removing undesirable content (Gillespie, 2018a; Grimmelmann, 2015).
While many large commercial social media platforms, such as Facebook and X, started with little to no content moderation, as these platforms have grown, content moderation has become imperative to their business models. Content moderation allows these platforms to retain advertisers who may not feel comfortable advertising on platforms that enable certain types of user-generated content, such as hate speech; retain users who may be scared off by “toxic” or “harmful” content; and comply with expanding global regulation of platforms (Castets-Renard, 2020; Gillespie, 2018a; Viejo Otero, 2022). Thus, as Gillespie (2018b) argues, content moderation has become the core service of social media platforms, allowing them to serve business, user and regulatory needs.
There are three main modes of content moderation: editorial review, in which human workers review whether content should be removed from the platform; “flagging,” which allows other users to mark content as inappropriate; and algorithmic moderation, which relies on “systems that classify user-generated content based on either matching or prediction” (Gillespie, 2018a; Gorwa et al., 2020: 5). While most social media platforms use a combination of all three types of content moderation, platforms have increasingly turned to algorithmic content moderation to keep up with the increasing scale at which users and advertisers demand platforms enforce governance policies (Gillespie, 2020; Gorwa et al., 2020).
Platforms use two main design techniques to implement algorithmic content moderation: classification and matching. Classification algorithms remove undesirable content by using a binary classifier trained to identify or predict whether original content on the platform is or is not offensive based on the presence of certain words or hashtags (Gorwa et al., 2020; Mullick et al., 2023). While classification algorithms were initially constructed from a manually curated set of “blacklisted” words, today most platforms use natural language processing (NLP) and other machine learning (ML) techniques, in which algorithms are trained on previous data to identify or predict the presence of these violatory words and phrases through character-level analysis (Gorwa et al., 2020; Schmidt and Wiegand, 2017). Alternatively, matching algorithms identify known undesirable content, such as videos of terror attacks and unlicensed copyrighted materials, by using a process called “hashing,” in which a piece of known content is turned into a “hash,” a string of data meant to uniquely identify it, that can then be matched against all other content on the platform. In this way, while different, classification and matching algorithms both aim to use known harmful words or content to identify or predict and subsequently remove posts. Thus, while all types of content moderation arguably impose ideological decisions on the kind of community social media platforms want to create, algorithmic content moderation does this at a much larger—and arguably less nuanced—scale (Gillespie, 2018a).
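To make the contrast concrete, the sketch below illustrates the two techniques in miniature. It is a toy example, not any platform's actual system: the blocklist and hash set are hypothetical placeholders, and production matching systems use perceptual hashes designed to survive small edits (e.g., Microsoft's PhotoDNA) rather than the cryptographic hash used here for simplicity.

```python
import hashlib

# Matching: known undesirable content is stored as a set of hashes;
# each new upload is hashed and checked against that set.
KNOWN_HASHES = {
    # placeholder digest standing in for a previously identified video/image
    "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def matches_known_content(upload: bytes) -> bool:
    return hashlib.sha256(upload).hexdigest() in KNOWN_HASHES

# Classification: a crude binary classifier that flags posts containing
# any term from a curated list of "blacklisted" words.
BLOCKLIST = {"slur_a", "slur_b"}  # placeholder tokens

def classify_as_offensive(post: str) -> bool:
    return any(token in BLOCKLIST for token in post.lower().split())
```

Both functions reduce moderation to a lookup against previously identified harm, which is what gives these systems their scale and, as the following sections discuss, their blind spots.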
Bias in hate speech algorithmic content moderation
The corporate entities behind social media platforms often use the rhetoric of algorithmic neutrality and scale to depoliticize algorithmic content moderation (Gillespie, 2018a; Gorwa et al., 2020; Roberts, 2018). However, as Roberts (2018) argues, obfuscation and secrecy create a “logic of opacity” around moderation algorithms that deters “large-scale questioning of the policies and values governing the decision-making.” Further, Gillespie (2020) contends that while platform representatives use “scale” to describe the number of users, scale refers to how the small can effectively be made large. Thus, “scale” becomes a justification for putting specific articulations into place without any deeper interrogation of the capitalist impulses of these practices (Gillespie, 2020). In this way, the rhetoric of scale—aside from being a misnomer—turns public attention away from potential biases embedded in algorithmic content moderation (Upda et al., 2022).
Algorithms impose large-scale ideological beliefs on platforms based on how they are designed to match, include, sort, prioritize and classify data. Bias is usually introduced into algorithmic systems not because the technology itself is biased but because of the biases and blind spots of the engineers who design these systems and of the training data used to build them (Binns et al., 2017; Gillespie, 2014; O’Neil, 2016). Further, ML algorithms, which use large-scale data sets to change and modify their output over time for optimum results, increase the opacity through which these biases develop and get replicated over time (O’Neil, 2016). As Gorwa et al. (2020) suggest, “Even a perfectly ‘accurate’ toxic speech classifier will have unequal impacts on different populations because it will inevitably have to privilege certain formalisations of offense above others, disproportionately blocking (or allowing) content produced by (or targeted at) certain groups” (11), resulting in what Safiya Noble (2018) terms algorithmic oppression: algorithms that reinforce oppressive social structures. In the case of content moderation algorithms, these biases often manifest as instances of the over- or underblocking of hate speech.
Overblocking: silencing the experiences of marginalized users
Classification algorithms used to identify hate speech often use binary classifiers to identify and remove a post containing, or predicted to contain, hate speech in the form of violatory words or phrases (Gorwa et al., 2020). However, while these classification algorithms identify harmful posts based on string matches to previously identified harmful words and phrases, they are not trained to ascertain the context in which these words are being used (Caplan, 2018). Thus, algorithms meant to identify hate speech cannot distinguish between a poster using hateful words or phrases as hate speech and a poster using those same words or phrases to recount an experience of hate speech. Removing the latter is termed overblocking (Gorwa et al., 2020; Marshall, 2021; Siapera, 2022).
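This failure mode is easy to reproduce in miniature. In the hypothetical sketch below (the blocklist token is a placeholder), a context-blind string match cannot tell an attack apart from testimony about that attack:

```python
BLOCKLIST = {"slur"}  # placeholder token standing in for a real slur

def flags_post(post: str) -> bool:
    # Context-blind string matching: flag any post containing a listed term.
    return any(token in BLOCKLIST for token in post.lower().split())

attack = "you are a slur"                      # hate speech: should be flagged
testimony = "a stranger called me a slur today"  # recounting: should not be

print(flags_post(attack))     # True
print(flags_post(testimony))  # True -- overblocking: the testimony is removed too
```

Because both posts contain the same surface string, the classifier treats them identically, removing the account of harm along with the harm itself.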
As a result, users with marginalized identities experience overblocking, or having their content removed when recounting experiences of or defending themselves against hate speech and harassment. For instance, Black women on social media have had their content removed and been banned for responding to and defending themselves against racist and sexist posts and incidents, highlighting how content moderation algorithms reinforce misogynoir, or anti-Black sexist logic (Bailey, 2021; Marshall, 2021). On Facebook, the phrase getting “Zucked” developed amongst Black users to describe the experience of having posts and chats about racism they experienced being censored as hate speech (Gray and Stein, 2021; Guynn, 2019). Gray and Stein (2021) argue that these systematic policies are rooted in “carceral logics of controlling the bodies, minds, and actions of minoritized folks,” replicating offline systemic oppression on digital platforms. Additionally, this can lead to what Karizat et al. (2021) term algorithmic representational harm, or the emotional and material harm marginalized users experience when they feel their experiences are not privileged or recognized by platform algorithms (Andalibi and Garcia, 2021).
Likewise, social media platforms have consistently blocked political posts in Arabic, due to an overenforcement of Arabic words as hate speech (Alimardani and Elswah, 2021; Fatat, 2021). These instances of overblocking result from the fact that social media companies predominately use English-based large language models (LLMs) in training their content moderation algorithms, limiting these algorithms’ ability to accurately classify non-English words (Nicholas, 2023). For instance, The Oversight Board, an independent panel of individuals established in 2020 to review Facebook's moderation policies, recently began examining the company's automated blocking of the Arabic word “shaheed,” roughly translated in English as “martyr,” under the assumption that this word indicated terrorist inclinations (Benesch, 2023; Wisniak et al., 2023). These instances of what Alimardani and Elswah (2021) term “digital orientalism” further disenfranchise and suppress the voices of Arab activists, particularly in countries where their views and opinions are already censored.
Content moderation algorithms also have cultural biases in speech patterns that can lead to overblocking. Empirical studies have found hate speech detection systems are more likely to label posts containing African American Vernacular English (AAE/AAVE), a dialect of American English often spoken in Black communities, as hateful, putting Black users who engage in this cultural dialect at higher risk of having their content removed (Davidson et al., 2019; Sap et al., 2019). Thus, content moderation algorithms’ inability to correctly classify AAVE linguistic practices, which fall outside the white hegemonic standard of American English, leads to misattributions that punish Black users. These misattributions, as well as the misattribution of non-English words as hate speech, are also exacerbated by the fact that human moderators who review content flagged by algorithmic detection are usually contracted workers located in a different country than the original poster, leading to cultural misunderstandings of content as toxic or harmful (Caplan, 2018; Díaz and Hecht-Felella, 2021; Roberts, 2019).
Underblocking: algorithmic loopholes and inequitable governance policies
These same automated moderation systems can also protect users with more privilege and power when they engage in hate speech (Gerrard, 2018; Gorwa et al., 2020). For instance, platforms’ over-reliance on English-trained LLMs means harmful, non-English posts containing misinformation often go undetected (AVAAZ, 2020). Additionally, a recent report by Amnesty International (2022) claims that in 2017, Facebook's algorithm contributed to material violence against the Rohingya people in Myanmar by failing to remove and inadvertently promoting anti-Rohingya hate speech on the platform, which arguably stemmed from Facebook employees’ lack of contextual knowledge about the systemic oppression of the Rohingya people.
Further, since an ML algorithm can only accurately predict that content is hateful if similar content is present in its training or test data, bad actors on social media take advantage of moderation algorithms focused on blocked words and images by using tactics such as linguistic changes or image blurring to evade detection when engaging in hate speech and harassment online, a practice commonly termed content moderation circumvention (Gillett et al., 2023; Schmidt and Wiegand, 2017; Wang et al., 2023). For instance, Hemphill (2022) and her team found white supremacists often use the pluralized version of racial group names, such as “Jews,” to evade moderation when engaging in hate speech. Thus, techniques of classification based on commonly understood ideas of “offensive” terms leave open the ability for individuals wishing to spread harmful rhetoric to evade detection by rearticulating their language in algorithmically acceptable terms (Gerrard, 2018; Gorwa et al., 2020; Hemphill, 2022). While marginalized users also use specific language and techniques to evade algorithmic moderation, this is often done to engage in anti-racist and social justice discourse when they believe platform algorithms unfairly target and suppress content by marginalized people (Peterson-Salahuddin, 2022). In other words, while on the surface these practices of content moderation circumvention may appear the same, it would be a false equivalency to equate using these practices to engage in anti-racist activism and using them to engage in white supremacy.
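The sketch below illustrates how slight rearticulations slip past an exact-match blocklist, and how simple normalization recovers some, though far from all, of these evasions. The blocklist entry is a placeholder, and the normalization rules are deliberately crude stand-ins for the character-level models production systems use:

```python
BLOCKLIST = {"slur"}  # placeholder token

def naive_flag(post: str) -> bool:
    # Exact token matching: evaded by pluralization or character swaps.
    return any(token in BLOCKLIST for token in post.lower().split())

def normalized_flag(post: str) -> bool:
    # Partial mitigation: undo common evasions (digit/symbol swaps,
    # trailing plural "s") before matching. Crude: it will also mangle
    # innocent words, one source of new false positives.
    table = str.maketrans({"0": "o", "1": "i", "3": "e", "@": "a"})
    tokens = (t.translate(table).rstrip("s") for t in post.lower().split())
    return any(token in BLOCKLIST for token in tokens)

print(naive_flag("slurs"))       # False -- pluralization evades the match
print(normalized_flag("slurs"))  # True  -- normalization recovers it
```

Each mitigation invites a counter-move, which is why circumvention is better understood as an ongoing arms race than a solvable classification problem.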
Further, platforms often do not consider the role of power in governance policies, resulting in equal but inequitable content moderation systems. Policies based on equality treat everyone as the same in status; policies based on equity, or the quality of being fair, actively account for systemic power differentials. In their analysis of company documents outlining Facebook's approach to racist content on the platform, or what Matamoros-Fernández (2017) terms “platformed racism,” Siapera and Viejo-Otero (2021) argue the platform's moderation policies fail to address racism directly and treat all identity groups, such as Black/white or male/female, as fundamentally equal, without accounting for how some identity groups have historically been systemically marginalized and thus are not treated as equal more broadly within society. It is due to policies under this ideological framework that Facebook documents from 2017 revealed that between “female drivers,” “white men,” and “Black children,” white men would be the only group protected under the company's hate speech guidelines, despite holding more systemic power than the other two groups, because both race (white) and gender (male) were protected categories (Angwin and Grassegger, 2017; Díaz and Hecht-Felella, 2021). Siapera and Viejo-Otero (2021) argue these liberal, color-blind and individualist approaches to addressing racism on the platform fail to equitably protect marginalized users from hate speech.
Through over- and underblocking, content moderation algorithms further marginalize minoritized communities by silencing them when they express experiences with hate speech while simultaneously allowing for the proliferation of hate speech against them. Many of these algorithmic biases are rooted in the fact that most social media platforms are developed and exist in formerly colonized Western countries; these algorithmic systems reflect those countries’ existing racist, sexist, homophobic and ableist power structures (Buolamwini and Gebru, 2018; Eubanks, 2018; Noble, 2018; Siapera, 2022). Additionally, since these algorithms are trained on datasets from human content moderators, over time, the biases of these moderators train ML algorithms not only to replicate but also to amplify these biased outcomes (Binns et al., 2017; Crosset and Dupont, 2022; Hayles, 2017).
Alternative approaches to platform governance
Schoenebeck and Blackwell (2021) contend systems of inequitable platform governance are also rooted in the fact that social media governance policies are based on Western models of criminal justice that encourage compliance by relying on punishments and, in turn, “overlook the needs and interests of targets of harassment and remove offenses and offenders from the community without any attempt at rehabilitation” (15). In other words, these governance systems focus on identifying and punishing users for posting harmful content, without a greater consideration for users’ needs and experiences on (and off) the platform.
In an attempt to make these systems more equitable, scholars have offered a variety of governance frameworks to help better address the harms experienced by marginalized users. For instance, Sander (2019) argues that a human rights approach to content moderation can more effectively center users’ best interests. Siapera (2022) argues for a decolonial approach to algorithmic content moderation that would consider the broader historical and social context in which (racist) hate speech and algorithmic systems are embedded—partly by giving marginalized and oppressed peoples a voice in the moderation process. Further, several scholars have recently proposed applying restorative and transformative justice frameworks to platform governance (Hasinoff and Schneider, 2022; Schoenebeck et al., 2023; Schoenebeck and Blackwell, 2021; Xiao et al., 2022).
Restorative and transformative justice, respectively, are alternative justice frameworks focused on harm reduction, meeting the distinct needs of harmed individuals, helping the offenders take accountability and creating pathways to prevent future harm (Hasinoff and Schneider, 2022). However, while restorative justice practices focus on creating a dialogue between the victim and offender, transformative justice moves beyond individual reconciliation to focus on community-based approaches to combating the underlying, systemic conditions that led to the initial harm (Hasinoff and Schneider, 2022; Mingus, 2019; Zehr, 2015). Thus, while distinct, restorative and transformative justice frameworks both encourage non-punitive, community-based and victim-centered approaches to harm reduction.
Scholars have argued employing alternative justice frameworks can better support the moderation needs of marginalized communities on social media platforms. For instance, Schoenebeck and Blackwell (2021) draw on restorative and transformative justice frameworks to suggest social media governance policies focused on repairing harm. Hasinoff and Schneider (2022) further suggest one way to support restorative and transformative approaches to governance platforms is to employ the principle of subsidiarity, the idea that local units and communities should play a meaningful role in shaping larger systems, instead of scalability, which relies on large-scale systems based on blanket policies. Empirical studies examining the utility of restorative and transformative justice governance for users have highlighted that while some users express a desire for restorative justice practices, others still desire more punitive approaches, and that different marginalized communities have different needs and preferences when addressing the harms they experience on social media (Schoenebeck et al., 2023; Xiao et al., 2022, 2023). These studies have also noted the technical and resource-based limitations to implementing restorative and transformative justice frameworks, such as the potential increased labor required, a lack of cohesive community and user resistance (Xiao et al., 2023). Building on this work, I argue that reparations can be a helpful tool in supporting alternative justice approaches to platform governance.
Outlining a reparative approach to algorithmic content moderation
Reparations, broadly defined, are programs designed to repair past injustices, improve conditions for the harmed group and build a more just world in the future (Brophy, 2006; Táíwò, 2022; Wenar, 2006). Reparations programs have existed in various global contexts since the early 1900s. For instance, in 1956, the West German government began issuing payments to Holocaust survivors following World War II; in 1992, the Chilean government offered reparations to victims of human rights violations during Pinochet's dictatorship in the form of a 140,000 peso monthly stipend, educational benefits for children of the disappeared, priority access to state health care services and exemption from military service; and in 1988, the U.S. government paid approximately $1.65 billion to Japanese Americans the U.S. had interned during World War II (Posner and Vermeule, 2003; The JUST Act Report: Germany, 2019; The Series of Reparations Programs in Chile, n.d.). Corporations have also administered reparations for their role in historical injustices. For instance, in 2005, J.P. Morgan Chase engaged in reparations by apologizing for the role of two of the company's predecessors in supporting the transatlantic slave trade and creating a $5 million scholarship for Black students (Brophy, 2006; Magill, 2005).
Reparations schemes can take several forms. Many reparations programs include monetary payments by those who committed an injustice, their descendants and/or taxpayers to those individuals harmed or their descendants. However, reparations can also take the form of symbolic gestures that work toward repair, such as apologies; truth commissions, official bodies meant to investigate past injustices; affirmative action, policies that transfer preference from a group that benefited from an injustice to the group that was harmed; the return of seized and stolen land; and community development programs aimed at bettering conditions for harmed groups (Brophy, 2006; Posner and Vermeule, 2003). Thus, reparations draw on material and non-material processes to articulate responsibility and “balance the scales” for past harms (Ifill, 2007: 24).
Reparations can be a useful tool to support restorative and transformative justice approaches to platform governance. Like restorative and transformative justice, reparations are mechanisms to address injustice and improve future conditions for the harmed individual(s) on a systemic level. Further, Yamamoto et al. (2007) suggest reparations can elevate the role of “social healing” that links healing to justice work. Drawing on critical race theories of reparations as primarily a form of repair and reconciliation, Yamamoto and co-authors suggest a social justice through healing framework of reparations that: (a) focuses on reconciliation over compensation as the foundation of reparations programs, (b) positions reparations not as the end-all but as one part of a larger process of social healing and (c) draws on grass-roots insights in establishing reparations practices (Yamamoto et al., 2007). In this way, by using a social justice through healing approach to reparations, platforms can call on reparations frameworks to address, reconcile and repair harms that occur on the platform and enable systemic injustice.
However, precursors necessary for implementing restorative and transformative justice frameworks have not historically been central to reparations frameworks. For instance, restorative and transformative justice present community-based, alternative approaches to justice that often exist outside of or in opposition to legal and state governance and focus on “micro communities” of care based on geographical and social relationships (Zehr, 2015). Conversely, legal scholars have mainly theorized reparations within existing state litigation models, which necessitate the clear identification of an offender/victim relationship (Posner and Vermeule, 2003), or legislative models, in which governmental bodies issue reparations and the state and/or the taxpayers take on the monetary and symbolic cost of reparations (Brophy, 2005, 2006). Thus, the concept of a well-defined community that scholars argue is central to enacting transformative and restorative justice has traditionally not been a precursor for administering reparations (Xiao et al., 2023).
Restorative and transformative justice also differ from reparations in their orientation toward individual accountability. Central to restorative and transformative justice projects is the voluntary participation of the victim and the offender, to ensure the offender does not offer an insincere apology and those harmed do not feel coerced into accepting reparative measures (Kaba and Rice, 2020; Menkel-Meadow, 2007; Xiao et al., 2023; Zehr, 2015). However, since most historical reparations schemes have emanated from a legislative framework, individual accountability and the voluntary participation of offenders are often not key considerations; culpability attaches to the governing body, not the individual (Brophy, 2005). Thus, while reparations can occur both on the individual level between a victim and offender (i.e., the litigation model) and, more broadly, between a governing body and a larger community that has been harmed (i.e., the legislative model), it is the responsibility of the governing body to ensure reparations take place, whether or not all individuals tasked with taking part in this reparative scheme agree. In these ways, reparations as a governing tool could arguably circumvent limitations identified in implementing restorative justice approaches to platform governance.
A reparative approach to algorithmic content moderation builds on calls by Davis et al. (2021) for a reparative approach to ML. Instead of addressing algorithmic bias by focusing on fairness or equality in ML, Davis and co-authors suggest a reparative approach to ML would account for inequity in algorithmic bias by programming for the differential power dynamics that exist between users along multiple axes of oppression, such as race, gender and sexuality, to create more equity in algorithmic outcomes. For example, the authors suggest that while a fairness approach to ML would aim to achieve classification parity or calibration, which aim to limit or make equal the extent to which identity is taken into account in algorithmic systems, a reparative approach would aim to achieve systemic redress by deploying algorithms that account and correct for the disproportionate risk factors that inform training data (Davis et al., 2021). Further, drawing on theories of reparations that argue reparations are not only about repairing and healing past harms, but also making conditions better for marginalized communities in the future (Táíwò, 2022; Wenar, 2006), I argue a reparative approach to algorithmic content moderation would also implement policies that make conditions for marginalized users on the platform better going forward. In the next section, I draw on the reparation frameworks outlined above to propose how social media platforms can reorient their approach to designing algorithmic content moderation to better address harm on social media platforms.
Designing a reparative approach to algorithmic content moderation
Designing for redress, not removal
In line with Schoenebeck and Blackwell's (2021) assertion that social media platforms' focus on punitive governance can foreclose more restorative approaches, I suggest that a reparative approach to algorithmic content moderation would first identify content for redress, not removal. As mentioned, content moderation algorithms focus on identifying and subsequently removing harmful content or, in some cases, the poster of said harmful content. While this approach creates efficiency by summarily removing content and individuals that violate community norms, this punitive focus on removal does little to repair the harm(s) or lasting damage that the offending post may have caused marginalized users, such as isolation or emotional trauma (Schoenebeck et al., 2023; Xiao et al., 2023). Alternatively, a reparative approach to content moderation would design algorithmic systems to facilitate reparations between the offending poster and the target(s) of the harm in question. If an algorithm identifies a post as containing hate speech, instead of simply removing or hiding the offending content, the system would divert the offending poster into a reparative process with those user(s) harmed.
In line with a social justice through healing framework of reparations, I suggest these systems should develop reparations schemes that engage those harmed to establish grass-roots insights into these reparations practices (Yamamoto et al., 2007). If the post in question directs hate speech at a specific user on the platform, an automated system could ask the targeted user how they would like this harm addressed. Drawing from the established reparations schemes previously outlined, these options could include apology, in the form of an individual and/or public platform apology from the offending poster, or affirmative action, in the form of algorithmically boosting the content of the harmed user and/or hiding the content of the offender for some period of time. If the post promotes hate speech toward a socially marginalized community broadly, the platform could first use the algorithmically identified hash or string to determine what group(s) were targeted by the offending content and subsequently survey a sample of users who self-identify on the platform as being a part of the harmed group(s) regarding what reparations framework they believe would best repair the harm caused by the offensive post.
In addition to reparations such as apologies and affirmative action policies, on a broader level, reparations schemes could also include truth commissions or community development programs, in which a subset of users from the marginalized community in question would be invited, and compensated, to investigate the incident or to develop educational initiatives and additional governance policies that mitigate harms against their community. Over time, the broad reparative schemes developed from these surveys can become systematized, replicated and iterated upon when similarly problematic content targeting the specific community(ies) is identified, reinforcing the idea that these reparative processes are just one part of a larger social healing process on the platform (Yamamoto et al., 2007).
In this way, instead of being punitive, a reparative approach to algorithmic content moderation would use algorithms to establish reparations that address the deeper harm(s) the offending post perpetuated. In line with alternative justice frameworks, this reparative approach to designing for algorithmic content moderation would aim to promote social healing, focus on posts that propagate systemic injustice, prioritize symbolic over monetary reparations and allow for multiple grassroots-based reparative schemes based on the specific preferences and needs of those harmed (Hasinoff and Schneider, 2022; Schoenebeck et al., 2023; Yamamoto et al., 2007). To clarify, I do not mean to suggest that designing for algorithmic reparations in content moderation means eliminating more punitive forms of governance. If a user continuously engages in the same harmful and violating behavior, even after partaking in a reparative process, more punitive forms of algorithmic governance, such as temporary or permanent suspension from the platform, may become necessary (Schoenebeck and Blackwell, 2021). Rather, my argument here is that these more punitive approaches should not be the first and only goal when designing content moderation algorithms.
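To make this routing concrete, the following is a minimal sketch, not a production design: the Post structure, the menu of reparative options and the survey step are all hypothetical placeholders for processes a platform would need to build with affected communities.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional

class Reparation(Enum):
    PRIVATE_APOLOGY = "private apology from the offending poster"
    PUBLIC_APOLOGY = "public platform apology"
    BOOST_HARMED = "algorithmically boost the harmed user's content"
    HIDE_OFFENDER = "temporarily hide the offender's content"

@dataclass
class Post:
    author: str
    text: str
    targets: List[str]  # targeted user(s) or community, per the classifier

def moderate(post: Post,
             is_hate_speech: Callable[[str], bool],
             survey: Callable[[List[str], List[Reparation]], Reparation]
             ) -> Optional[Reparation]:
    """Divert flagged posts into a reparative process instead of removal."""
    if not is_hate_speech(post.text):
        return None  # no action needed
    # Grass-roots step: those harmed choose how the harm is addressed,
    # echoing the reconciliation-first orientation of Yamamoto et al.
    return survey(post.targets, list(Reparation))
```

The key design commitment is in the return path: a positive detection yields a reparative option chosen by those harmed, with removal or suspension reserved as a later escalation rather than the default outcome.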
Addressing over- and underblocking: designing for context, not words
As I previously stated, a reparative approach to algorithmic content moderation would also be forward-facing, aiming to improve conditions on the platform for harmed groups to prevent these harms from reoccurring in the future, as part of a larger reparative project (Táíwò, 2022; Wenar, 2006). Thus, approaching algorithmic content moderation through a reparative lens may also entail platforms designing algorithmic systems to identify harm based on context as opposed to individual words or phrases, to avoid future instances of the overblocking of marginalized users or the underblocking of hate speech (Caplan, 2018; Gorwa et al., 2020).
There are several complementary mechanisms through which this contextual approach to algorithmic content moderation could be implemented. Mullick et al. (2023) propose content moderation designed as a cascade of binary questions about themes within a policy. Thus, instead of a single binary classifier that labels content as offensive or not offensive, this model asks multiple layered questions about the content to determine whether it violates platform policies. While Mullick and co-authors propose this design with the idea that it could be more easily adaptable to changes in platform policies by adding and removing themes, this same approach could be used to implement added layers of scrutiny in an algorithmic content moderation decision. For instance, within the model, an algorithm designed to identify hate speech could first answer binary questions to identify the presence of discrete slurs or words related to identity and then subsequently answer binary questions regarding other contextual elements, such as the sentiment of the post, which examines the predominance of negative, positive or neutral words (see Gitari et al., 2015; Van Hee et al., 2015), and meta-information about the user, to better infer whether a post is an instance of hate speech or is recalling an instance of hate speech. This design of content moderation algorithms would be more adept at identifying contextual uses of hate speech in posts, and thus avoid bringing additional harm to users utilizing the platform to voice their experiences with hate speech as an act of counter-hegemonic resistance (hooks, 1989).
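The sketch below renders this cascade idea in miniature, following the layered-question structure proposed by Mullick et al. (2023) but with trivial keyword stubs standing in for the trained per-theme classifiers a platform would actually use; all three vocabularies are hypothetical placeholders.

```python
IDENTITY_TERMS = {"immigrants", "women"}   # placeholder identity vocabulary
NEGATIVE_WORDS = {"hate", "awful"}         # placeholder sentiment cues
TESTIMONY_CUES = ("called me", "told me")  # placeholder first-person cues

def cascade_moderate(post: str) -> str:
    text = post.lower()
    # Layer 1: is identity-related language present at all?
    if not any(term in text for term in IDENTITY_TERMS):
        return "allow"
    # Layer 2: contextual signal -- does the post's sentiment skew negative?
    if not any(word in text for word in NEGATIVE_WORDS):
        return "allow"
    # Layer 3: contextual signal -- is the poster recounting hate speech
    # rather than directing it at someone?
    if any(cue in text for cue in TESTIMONY_CUES):
        return "allow"  # avoids overblocking testimony
    return "divert_to_reparative_process"

print(cascade_moderate("i hate immigrants"))                     # divert_to_reparative_process
print(cascade_moderate("someone told me they hate immigrants"))  # allow (testimony)
```

Each added layer trades some recall for precision, a trade a reparative orientation arguably prefers: missing an ambiguous post remains recoverable through other reporting channels, while silencing testimony compounds the original harm.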
Researchers have also proposed content moderation models that go beyond the surface-level classification of the words to more accurately classify the context in which hate speech is being used in ways that may help mitigate content moderation circumvention (Schmidt and Wiegand, 2017). Studies have used linguistic information, specifically the relationship between words, to infer hate speech even when any individual word may not be identified as such, potentially preventing underblocking (Chen et al., 2012). Similarly, to help algorithmic systems better understand if hate speech takes the form of inference, as opposed to being an identifiable slur, Dinakar et al. (2012) propose creating a “common sense knowledge base” that would encode knowledge about certain types of bullying to help systems better identify it. While this approach is arguably labor intensive since this database of knowledge would need to be replicated for each form of hate speech (i.e., sexist hate speech, antisemitic hate speech, racist hate speech, etc.), it can help prevent instances of underblocking by picking up on inferred and indirect instances of hate speech. Researchers have also shown that using image-to-text encoders and incorporating examples of content moderation circumvention in toxic and hateful images can also help platforms better identify image-based instances of hate speech that occur alongside captions and hashtags (Sabat et al., 2019; Wang et al., 2023).
Designing for context would also require social media platforms to build content moderation algorithms on non-English datasets and LLMs or, in regions that use niche dialects, to forgo the use of LLMs altogether, to properly understand and contextualize posts in non-majority English-speaking countries and regions. In line with a social justice through healing approach to reparations and a decolonial approach to content moderation (Siapera, 2022; Yamamoto et al., 2007), this would require social media platforms to invest more in local editorial reviewers and coders within these regions to build these models with local knowledge, cultural context and needs in mind (Yamamoto et al., 2007). Building content moderation algorithms this way would allow social media platforms to better recognize linguistically and contextually specific instances of hate speech while avoiding overblocking that arises from cultural discrepancies between content moderation algorithms and local linguistic and cultural practices.
Designing content moderation algorithms to identify hate speech based on context instead of individual words and phrases could more proactively prevent content moderation algorithms from causing future harm and furthering injustices against marginalized users on their platform (Davis et al., 2021). Further, this approach can function as a step toward the platform itself, as the corporate and governing entity responsible for past injustices, engaging in reparations by making amends for the harms caused by the design of its content moderation algorithms.
Designing for equity, not equality
In line with a social justice through healing approach to reparations that aims to build a more just society (Táíwò, 2022; Yamamoto et al., 2007), a reparative approach to designing content moderation algorithms would focus on designing for equity over equality. Schoenebeck and Blackwell (2021) contend that one of the shifts platforms must make to establish governance models grounded in repair is to develop policies focused on equity instead of equality. Echoing Siapera and Viejo-Otero (2021), the authors maintain that while equal treatment may seem fitting on an individual level, it also assumes all users start with the same resources, needs and levels of privilege, and thus obscures or can even perpetuate more socially embedded structural inequalities (Schoenebeck and Blackwell, 2021). For instance, if a content moderation algorithm sanctions a user belonging to a historically marginalized group for using the term “white men,” it not only masks and replicates the offline social power of white men but also further harms the marginalized user by silencing them.
Extending this assertion, a reparative approach to algorithmic content moderation would classify mentions of identity as violatory only if the identity being attacked faces systemic marginalization more broadly within the specific societal and cultural context where the user is located. Within this framework, a content moderation algorithm would not divert a user into a reparative process if the identity group(s) or individual mentioned in the post is one of relative social power to the poster, based on socially relevant intersectional classifiers such as race, gender, sexuality and socioeconomic class. This would thus require social media platforms to re-train content moderation algorithms on new data sets that exclude mentions or uses of non-marginalized identities. As with designing for context, this also means that how these content moderation algorithms are trained and deployed should differ between cultural contexts, depending on the prevailing social hierarchies and racialization schemes within the specified areas and regions. Thus, in line with the larger orientation of algorithmic reparations, content moderation algorithms would not erase user differences but take user differences into account in the proper social and historical context to implement systemic redress (Davis et al., 2021).
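As a minimal sketch, and under the strong simplifying assumption that group-level marginalization can be tabulated at all, such an equity check might look as follows; the context-to-groups mapping is a hypothetical placeholder that, in practice, would need to be built and maintained with local experts and affected communities.

```python
# Hypothetical mapping from sociocultural context to identity groups
# facing systemic marginalization there; entries are illustrative only.
MARGINALIZED_BY_CONTEXT = {
    "US": {"black_people", "lgbtq_people", "immigrants"},
    # ...each region/context would carry its own locally built mapping
}

def warrants_reparative_process(targeted_group: str, context: str) -> bool:
    """Escalate only if the targeted identity is systemically marginalized
    within the context where the post circulates (equity, not equality)."""
    return targeted_group in MARGINALIZED_BY_CONTEXT.get(context, set())

print(warrants_reparative_process("white_men", "US"))     # False: no diversion
print(warrants_reparative_process("black_people", "US"))  # True: divert to redress
```

The lookup table is doing all the political work here, which is the point: an equity-oriented classifier cannot be context-free, and maintaining it is a governance question as much as a technical one.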
Challenges and limitations to reparative algorithmic content moderation
While reparations are a productive tool to help social media platforms better address harms on the platform, implementing this approach has challenges and limitations. One challenge in designing content moderation algorithms for redress is that it requires platforms to identify which user(s) on the platform are harmed in any given instance. If a post directly attacks one specific user, for example, directly calls another user a racial slur or sexist epithet using the “@” feature or their name, these victims may be easier to identify, as in legal reparative frameworks. However, determining who can claim membership in a harmed group, more broadly, could prove more challenging for several reasons. First, many social media platforms do not directly collect demographic information about their users. Second, studies have recounted instances where users falsely claim marginalized identities online, often for malicious purposes (Freelon and Lokot, 2020). Third, since experiences with identity are not monolithic, within and across demographic groups, individuals may experience the impact of hate speech differently based on their specific social, political and economic positionality.
Thus, when designing content moderation algorithms for reparations, it may become necessary for the corporations behind social media platforms to allow users to report additional demographic information about themselves, such as race, ethnicity, sexuality and socioeconomic class. However, given the higher risks marginalized users may feel in disclosing this information to a large technology corporation, other mechanisms for identifying potentially harmed users may become necessary, such as inferring identity based on a triangulation of meta-information such as profile picture, post content, geographic location and social network. In these instances, platforms should still verify with users if they see themselves as part of the harmed group in question and if they would like to engage in a reparative process, before proceeding with any action.
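Whatever the inference mechanism, the verification step described above can be made a hard precondition in the moderation flow. The sketch below is hypothetical, with `ask` standing in for whatever notification and response channel a platform provides:

```python
from typing import Callable

def confirm_and_enroll(user: str, inferred_group: str,
                       ask: Callable[[str, str], bool]) -> bool:
    """Act on inferred group membership only with explicit, two-step consent."""
    # Step 1: verify the inference with the user rather than assuming it.
    if not ask(user, f"This post may target {inferred_group}. "
                     "Do you identify as part of this group?"):
        return False
    # Step 2: opt in to the reparative process itself.
    return ask(user, "Would you like to take part in a reparative process?")
```

Gating enrollment on both answers keeps inferred identity advisory rather than determinative, which matters given the risks of misclassification and forced disclosure noted above.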
Another potential challenge to adopting a reparative approach to algorithmic content moderation is mitigating the unintended consequences of operationalizing reparations schemes. While reparations in the form of affirmative action could make marginalized users feel more seen and validated on social media platforms (Andalibi and Garcia, 2021), the increased exposure of their content to a wider audience could also lead to increased levels of harassment directed toward these marginalized creators (DeVito, 2022; Duffy and Meisner, 2023). For each potential reparations scheme, users should be made aware of any potential risks or harms, and these future harms should similarly be mitigated through the lens of reparations. This is why it must be the harmed user(s) in question who choose what they want reparations to look like for them, informed by their lived experiences, rather than having reparations thrust upon them by platform designers and executives outside of their communities.
Additionally, a limitation to implementing reparations in designing content moderation algorithms is that platforms may be met with resistance from individual offenders who do not want to partake in a reparative process, or from individuals or communities of victims who would prefer a punitive approach to content moderation over a reparative one. As noted, accountability and voluntary engagement on the part of the harmed and offending parties have not historically been central to reparations schemes. However, as restorative justice practitioners argue, resistance to these processes by either party could lead to continued repetition of the offending behavior, limiting social media platforms’ ability to repair harm and injustice in ways that better conditions for marginalized users on the platform going forward. For instance, if, within this framework, a harmed user requests an apology but the offender refuses to earnestly engage in the process, the exchange would not truly fall within the bounds of a social justice through healing framework of reparations because it would not lead to lasting, meaningful change; alternative measures may then need to be taken to address the offending action. Thus, in designing content moderation systems for reparations, it is imperative that platforms not force users into these processes but rather, acknowledging the limitations of this framework and of restorative justice approaches to harm, offer these reparative options alongside existing punitive measures to expand and deepen the ways platforms address harm.
Finally, a challenge imposed by a reparative approach to algorithmic content moderation is that platforms will inevitably have to disentangle complex arguments around where to draw the line between free speech and hate speech. In designing content moderation algorithms for context and equity, platforms may be met with resistance from users who feel these policies unfairly target their ability to engage in free speech over others. Arguably, it is due to this challenge that platforms craft their content moderation policies to focus on equality over equity, which has the advantage of buttressing platforms against claims of prejudice, discrimination, or “reverse racism.”
For corporate businesses that profit from keeping users on the platform, navigating these free speech concerns in policy and design can be difficult. In the U.S., conversations around free speech online are bound up in Section 230 of the 1996 Communications Decency Act (CDA), which states that providers of an “interactive computer service,” which social media platforms claim to be, cannot be treated as publishers or conveyors of content in the same way as broadcasting and publishing organizations, and therefore are not legally liable for the content they distribute (47 U.S. Code § 230, 1996). Thus, in the U.S., where there are no clear financial or regulatory incentives to implement these changes, treating all users equally is more strategic and less risky, as social media companies are legally protected from being held liable for hate speech. Alternatively, in other geographic contexts, such as countries belonging to the European Union, legislation and codes of conduct require service providers to be responsible for, monitor and remove the hate speech distributed on their platforms (European Commission, 2016). However, as studies have shown, the contextual nature of these policies often makes the line between hate speech and free speech hard to distinguish (Enarsson and Lindgren, 2019).
Nevertheless, a key part of reparations is making a principled claim about who has suffered an injustice, not on an individual level, but more broadly through the propagation of a societal injustice for which social healing and reparations are necessary (Brophy, 2005; Yamamoto et al., 2007). To this end, as a part of a reparative approach to algorithmic content moderation, platforms must be willing to encode, in policy and design, a difference between free expression and hate speech, depending on the sociocultural context in which harm occurs. These removal policies should be determined by the legal standards outlined in each national and geographic context where the offender(s) or victim(s) are located, and by broader standards of global human rights, particularly if the offender(s) and victim(s) exist in different cultural contexts.
Conclusion
Approximately 4.76 billion individuals worldwide, or a little over half of the world's population, are on social media, with many using these platforms to advocate for their social, political and human rights in sociopolitical contexts that seek to marginalize them (Kepios, 2023). Thus, as global instances of algorithmic content moderation that harm and silence marginalized communities around the world arise, highlighting new pathways to mitigate these issues is paramount. Adding to calls from other scholars to propose alternative governance frameworks that better address the harms experienced by marginalized users (Sander, 2019; Siapera, 2022), particularly practices drawn from restorative and transformative justice (Hasinoff and Schneider, 2022; Schoenebeck and Blackwell, 2021; Xiao et al., 2023), this paper argues that the concept of reparations could be a useful tool to help social media platforms better address harms and injustices against marginalized users that take place on their platforms. I draw on historical reparations schemes outlined by legal scholars, particularly the social justice through healing orientation to reparations outlined by Yamamoto et al. (2007), and build on Davis et al.'s (2021) concept of algorithmic reparations to suggest ways social media platforms can design content moderation algorithms to enable reparative processes and frameworks.
In line with the two dominant perspectives on reparations that legal scholars often consider, litigation, in which direct claims for reparations attach culpability to one individual or group (Posner and Vermeule, 2003), and legislative, in which reparations are instituted on behalf of a group by a larger legislative body (Brophy, 2006), I suggest social media platforms design content moderation algorithms that address the algorithmic biases harming marginalized users on both an individual and a platform-systemic level. Specifically, on the individual level, I echo previous research on the use of restorative and transformative justice on online platforms to argue that the reparative processes implemented should meet the wants and needs of the marginalized community(ies) to which these reparations are being made (Hasinoff and Schneider, 2022; Schoenebeck et al., 2023; Xiao et al., 2022). On the platform level, following legal arguments that reparations are not only backward-looking but also about designing better conditions for the future (Wenar, 2006; Yamamoto et al., 2007), I suggest that to address instances of over- and underblocking by content moderation algorithms, platforms should design for context over words and equity over equality. In this way, a reparative approach to algorithmic content moderation echoes both Siapera's (2022) call to socially and historically contextualize approaches to content moderation and Schoenebeck and Blackwell's (2021) contention that to govern for repair, platforms should focus governance policies and design on equity over equality. However, I also highlight several challenges and limitations to designing for algorithmic reparations in content moderation, including identifying what group or groups of users are being targeted, user resistance and counterarguments that a more principled approach to algorithmic content moderation will inhibit free speech.
While algorithmic reparations present one potential solution to bias in algorithmic content moderation around hate speech, it would be unwise to assume this solution alone will upend the harms of algorithmic bias on systematically marginalized communities. As Cunningham et al. (2022) argue, a solutions-oriented approach to technology—especially in response to the harms already enacted on communities by these technologies—is often encased in the ontological limitations of larger historical and contemporary structural inequities. Even while algorithmic reparations as a praxis and approach explicitly aim to acknowledge and upend these inequities in the digital space, it is hard to foresee what effects implementing such an approach will have on marginalized communities over time. Thus, rather than considering this approach a definitive solution, I position it as one tool in a larger toolbox of alternative governance frameworks aimed at promoting equitable policies across these platforms.
Acknowledgments
I would like to thank the reviewers, whose thoughtful and insightful comments helped me to clarify and refine the ideas and arguments presented in this manuscript.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
