Abstract
As natural language processing tools powered by big data become increasingly ubiquitous, questions of how to design, develop, and manage these tools and their impacts on diverse populations are pressing. We propose utilizing the concept of linguistic justice—the realization of equitable access to social and political life regardless of language—to provide a framework for examining natural language processing tools that learn from and use human language data. To support linguistic justice, we argue that natural language processing tools (along with the datasets that are used to train and evaluate them) must be examined not only from the perspective of a privileged, majority language user, but also from the perspectives of minoritized language users. Considering such perspectives can help to surface areas in which the data used within natural language processing tools may be (often inadvertently) working against linguistic justice by failing to provide access to information, services, or opportunities in users’ language of choice, underperforming for certain linguistic groups, or advancing harmful stereotypes that can lead to negative life outcomes for members of marginalized groups. At the same time, this framework can help to illuminate ways that these shortcomings can be addressed and allow us to use inclusive language data and approaches to leverage natural language processing technologies that advance linguistic justice.
Introduction
As natural language processing (NLP) tools powered by big data become increasingly ubiquitous, questions of how to design and evaluate these tools and their impacts on social justice are pressing. This paper asks: how might we leverage NLP tools to advance social justice through linguistic justice? We define linguistic justice as the realization of equitable access to social, economic, and political life regardless of linguistic repertoire (Gazzola et al., 2021). Language underlies much of this access, and an increasing number of critical services are being rendered with NLP technologies, from automated courtroom transcription to video interviews in organizational hiring. While these technologies have the potential to benefit users, they will only contribute to linguistic justice and social equity if they are designed to avoid linguistic prejudice and to serve all speakers, not just speakers of “privileged” language varieties (e.g. “standard” American English).
A framework of linguistic justice illustrates that linguistically just NLP tools must: (1) work well for users regardless of the language variety they use and (2) work to counteract inequities based on language use in decision-making and resource allocation. To build linguistically just NLP tools, we must recognize and address power inequities such as over/underrepresentation of linguistic patterns and discourses within datasets. This paper begins with key concepts about language, power, and social identity. We apply these understandings to explore two concerns in NLP systems: differential performance and opportunity allocation, and linguistic profiling. We present a path forward through nine concrete actions across the design, development, and management of NLP tools before concluding.
Language, power, and social identity
To understand linguistic justice, we must first understand standard language ideology. This common belief holds that some language varieties are “better” than others. However, it has no basis in fact; all language varieties are equally capable of expression (Hill, 2008). Nevertheless, some language varieties have been privileged as “standard” or viewed as more “appropriate” because of their association with people in power. This is why, for example, “standard” American English (“S”AE) reflects the linguistic norms of middle-class, White men who have overwhelmingly held power within the United States (Baker-Bell, 2020; Hill, 2008). Meanwhile, other equally valid ways of speaking, such as African-American English (AAE), have been devalued (Baker-Bell, 2020; Hill, 2008; King, 2020). Because “standardized” languages are not linguistically better than any others, our definition of linguistic justice requires that users of any language variety be equally able to access services. One should not be denied housing, for example, based on “sounding White” or “sounding Black” (Baugh, 2003). Even when Black people use “S”AE, listeners may show linguistic bias against them (Alim, 2007). For example, when Black students use “S”AE, they can be perceived as “underperforming” (Alim, 2007). This reality reflects how race, along with other aspects of identity including gender, sexual orientation, and nationality, is intertwined with power and continues to affect people’s life outcomes (Crenshaw et al., 1995). In the next section, we delve further into the links between language, power, and identity and how they manifest in NLP systems by exploring two unjust linguistic outcomes of NLP tools: differential performance and opportunity allocation, and linguistic profiling.
Concerns for NLP tools trained on human language data
NLP tools perform better and provide more opportunities for speakers of privileged language varieties
To achieve linguistic justice, we must examine whether NLP tools are providing equitable access to information, services, and opportunities in the language variety of all potential users. Because NLP tools recognize and replicate language patterns based on their training data, we should examine the language contained in NLP datasets to see whose language and viewpoints are (not) represented, as data collection and data itself are not neutral (D’Ignazio and Klein, 2020). Large language models are increasingly powering NLP tools. These language models rely on data from the internet, but internet use varies by social factors, resulting in skews in representation. Some estimate that over 60% of all language content on the internet is in English (W3Techs, n.d.), despite only ∼17% of people speaking English globally. Although ∼7000 languages are in use worldwide, only 7 have the large digital data that are typically called for in machine learning (Joshi et al., 2021). Meanwhile, over 88% of languages have “exceptionally limited resources” in the digital space (Joshi et al., 2021).
Even among well-represented languages, some perspectives are overrepresented. Reddit users, for example, are 67% male and 70% White; using language from Reddit, then, results in the reification of White, male perspectives (Bender et al., 2021, citing Barthel et al., 2016). Moreover, some perspectives are actively marginalized online. On Twitter, pervasive harassment of women, especially Black women, may lead to self-censorship (Toxic, n.d.), reducing the representation of (Black) women’s viewpoints in NLP datasets (Bender et al., 2021). In addition, while one in 10 Black people in 2010 accessed Twitter daily (a rate over four times higher than that of White people) (Brock, 2012: 535), Black Tweets are still often considered “inappropriate” uses of Twitter (Brock, 2012: 542) and are more often inaccurately flagged as hateful by automatic hate speech detection tools (Davidson et al., 2019). This deficit model of Black internet use is both inaccurate and harmful as it minimizes the importance of Black internet content, while disproportionately censoring Black speakers.
Underrepresentation of diverse language data contributes to disparities in NLP tool performance, resulting in inequitable access for speakers whose language varieties are not yet served (at all, or as well) by existing NLP tools. This could result in linguistic injustice through differential access to goods, services, and opportunities by language variety. This has been shown, for example, with automatic speech recognition (ASR) tools from Apple, IBM, Google, Amazon, and Microsoft, which show higher error rates for Black speakers than White speakers (Koenecke et al., 2020). Higher error rates in automated hate speech detection tools, as discussed above in the Twitter example, are partially linked to under- and overrepresentation of particular languages in training data (Davidson et al., 2019; Tatman, 2017), impacting use of and access to social media platforms.
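One way to make such error-rate disparities concrete is to compute word error rate (WER), a standard ASR metric, separately for each speaker group. The sketch below is a minimal illustration and not the procedure used in the studies cited above; the function names, group labels, and transcripts are all hypothetical.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_by_group(samples):
    """Average WER per group; `samples` holds (group, reference, hypothesis)."""
    rates = defaultdict(list)
    for group, reference, hypothesis in samples:
        rates[group].append(word_error_rate(reference, hypothesis))
    return {group: sum(values) / len(values) for group, values in rates.items()}


# Hypothetical audit data, invented for illustration only.
samples = [
    ("group_a", "we are going to the store", "we are going to the store"),
    ("group_b", "please call my sister tomorrow", "please call my system tomorrow"),
]
report = wer_by_group(samples)
```

A persistent gap between groups in `report` would suggest that the tool serves some language varieties less well, which is the kind of differential performance described above.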
Further, data labelers may incorporate their own biases (Davidson et al., 2019)—such as labeling AAE as more negative than “S”AE when labeling training data for hate speech detection, or more often inaccurately labeling AAE as “unintelligible”. In the context of decision-making and resource allocation, NLP errors can have significant negative impacts. For example, certified court reporters in an experimental setting mistranscribed key facts from AAE speech, such as transcribing the utterance, “why your door always locked?” as “why you always lie?” (Jones et al., 2019: e236).
While many speakers of marginalized languages are also speakers of majority languages, asking minoritized language users to modify their language use by adopting dominant language practices results in an inequitable burden, as minoritized individuals may spend significant time, money, and psychological energy to modify their speech (Hughes and Mamiseishvili, 2013). Moreover, addressing linguistic inequities earlier on will ameliorate larger effects that may accrue over time. For example, if an algorithm ranks video search results based on NLP-generated transcripts, but the tool can only transcribe content from some language varieties, others will not surface in search results as easily. This inequity can compound over time if search results are also ranked by popularity.
To work toward linguistic justice in NLP, developers must think carefully about datasets they use and create datasets that are more balanced for language variety (Bender et al., 2021). They must also involve speakers of diverse language varieties so they can accurately analyze and label language data. If only majority language speakers are involved, they may improperly label or even discard language data from other varieties. It is also important to conduct audits and user testing with a diverse set of users to spot and address potential disparities in performance by language variety. Accurately including language from a greater diversity of varieties is an important step toward achieving equitable access, and with it, linguistic justice.
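An audit of the kind described above can be sketched for a hate speech classifier by comparing, for each language variety, how often benign posts are wrongly flagged. This is a minimal sketch under stated assumptions: the record format, variety labels, and counts are hypothetical, and a real audit would need labels produced by speakers of the varieties in question.

```python
from collections import Counter


def false_positive_rate_by_variety(records):
    """Share of benign posts wrongly flagged, per language variety.

    `records` holds (variety, is_hateful, flagged) triples, where
    `is_hateful` is the trusted label and `flagged` is the tool's output.
    """
    benign = Counter()           # benign posts seen per variety
    false_positives = Counter()  # benign posts the tool flagged anyway
    for variety, is_hateful, flagged in records:
        if not is_hateful:
            benign[variety] += 1
            if flagged:
                false_positives[variety] += 1
    return {v: false_positives[v] / n for v, n in benign.items()}


# Hypothetical audit records, invented for illustration only.
records = (
    [("variety_a", False, False)] * 3
    + [("variety_a", False, True)]
    + [("variety_a", True, True)]
    + [("variety_b", False, False)] * 2
    + [("variety_b", False, True)] * 2
    + [("variety_b", True, True)]
)
fpr = false_positive_rate_by_variety(records)
```

In this invented sample, benign posts in `variety_b` are flagged twice as often as those in `variety_a`; a gap of this kind in real data would warrant revisiting the training data and labeling process.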
Not all communities, however, wish to contribute data or share the same definition of intellectual property (Tatsch, 2004). Some language data, for example, may contain sensitive, culturally specific knowledge, and use of language data by individuals outside the community may require specific practices (Tatsch, 2004). Although some language communities may be excluded at present, others may choose not to participate. While some may view representation as key to linguistic justice, others may see it as contributing to unjust surveillance and control. Instead, the right to privacy or opacity may be considered more crucial (Blas, 2016; Glissant, 1997). Ultimately, while we recognize it may not be possible to curate datasets representative of all language varieties, we can continue to collaborate with diverse speaker communities and work toward equity, while being transparent in decisions and limitations. This includes giving communities choices in whether or not to share their linguistic data and honoring their decisions.
NLP tools can advance linguistic injustice through linguistic profiling
NLP tools can exacerbate social disparities through their advancement of language-based stereotypes. When we communicate, we not only communicate the literal content, but we simultaneously convey massive amounts of associated information about our social identities, such as race, gender, and nationality (Baugh, 2003; Hughes and Mamiseishvili, 2013). These associations are inherent to human communication, but can result in harm through linguistic profiling: making assumptions about an individual’s identity based on their language use (Baugh, 2003; Hughes and Mamiseishvili, 2013). These associations can be activated with extremely small amounts of linguistic data, making it difficult or impossible to curate datasets without identity markers. Purnell, Idsardi, and Baugh (1999), for example, showed that participants were able to correctly identify the race of 70% of speakers (who were not visible) after hearing them say only the word “hello.” Participants were also largely able to identify speakers’ gender by voice. Thus, natural language data that serve as inputs to NLP systems may lead the systems to learn and adopt identity-based stereotypes.
Research shows that AI systems do form these connections, and they can use that information to discriminate. For example, an AI-powered resume scanning tool from Amazon was shown to discriminate by penalizing resumes containing words like “women” (Dastin, 2018). Even when direct references to gender were removed, discrimination continued. The tool picked up on linguistic patterns, such as men’s higher use of terms like “executed,” to infer the applicant’s gender (Dastin, 2018). While the link between terms like “executed” and masculinity may not be obvious, the repeated use of this term largely by men results in a connection that led to linguistic injustice: women were denied employment because of their language use.
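The proxy effect described above can be checked for in a dataset by looking for terms whose usage differs sharply between groups even after explicit identity words are removed. The sketch below is an assumption-laden illustration, not a description of how the Amazon tool worked; the function name, threshold, group labels, and resume snippets are hypothetical.

```python
from collections import Counter


def skewed_terms(documents, threshold=0.75):
    """Return terms whose document rate differs across groups by >= threshold.

    `documents` is a list of (group, text) pairs. A term's rate in a group
    is the share of that group's documents that contain the term.
    """
    group_sizes = Counter(group for group, _ in documents)
    doc_freq = {group: Counter() for group in group_sizes}
    for group, text in documents:
        # set() counts each term at most once per document.
        doc_freq[group].update(set(text.lower().split()))

    vocabulary = set().union(*doc_freq.values())
    flagged = {}
    for term in vocabulary:
        rates = [doc_freq[g][term] / group_sizes[g] for g in group_sizes]
        gap = max(rates) - min(rates)
        if gap >= threshold:
            flagged[term] = gap
    return flagged


# Hypothetical resume snippets, invented for illustration only.
resumes = [
    ("group_a", "executed product launch"),
    ("group_a", "executed budget plan"),
    ("group_b", "led product launch"),
    ("group_b", "coordinated budget plan"),
]
flagged = skewed_terms(resumes)
```

Terms returned in `flagged` are candidates for acting as identity proxies; whether they actually drive a model's decisions would need to be tested separately.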
As tools like NLP-powered virtual assistants or assessment tools become mediators for people seeking access to resources and opportunities, we must address the potential for NLP systems to use language to discriminate, and thus perpetuate linguistic injustice. Although Amazon no longer uses its tool, similar NLP-powered services remain (Raghavan et al., 2020). Tools like HireVue, which uses NLP to evaluate video job interviews, make judgments about applicants’ potential based largely on “their word choices and the language of their responses” (Zielinski, 2020).
Language has been shown to be a factor in hireability outside of NLP (Hosoda and Stone-Romero, 2010; Hughes and Mamiseishvili, 2013) despite the fact that human judgments of speakers’ language are notoriously subjective (Hill, 2008). Hosoda and Stone-Romero (2010), for example, showed that job applicants with French-accented English were preferred over those with Japanese-accented English, even when applicants with Japanese accents were more understandable. If NLP tools are trained on data from human decisions, they will replicate those biased outcomes. To build NLP systems that treat speakers equitably, we must identify how past human decisions contained in the data reflect human biases, work to correct biases while being transparent about limitations, and audit NLP tools. By considering linguistic justice, we can see that it is unjust to build NLP tools that prioritize certain ways of speaking (and with them, certain social identities) over others.
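One simple starting point for identifying bias in past human decisions is to compare how often applicants from each group were advanced. The sketch below assumes a hypothetical record format of (group, advanced) pairs and uses invented counts; it illustrates one possible audit statistic rather than a complete fairness assessment.

```python
from collections import Counter


def selection_rates(decisions):
    """Share of applicants advanced per group; `decisions` holds (group, advanced)."""
    totals, advanced = Counter(), Counter()
    for group, was_advanced in decisions:
        totals[group] += 1
        advanced[group] += int(was_advanced)
    return {group: advanced[group] / totals[group] for group in totals}


def selection_rate_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means parity."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was advanced in any group, so there is no gap
    return min(rates.values()) / highest


# Hypothetical historical decisions, invented for illustration only.
decisions = (
    [("group_a", True)] * 6 + [("group_a", False)] * 4
    + [("group_b", True)] * 3 + [("group_b", False)] * 7
)
rates = selection_rates(decisions)
ratio = selection_rate_ratio(rates)
```

A ratio well below 1.0, as in this invented sample, signals that a model trained on these decisions could inherit the same preference unless the data or the training objective is corrected.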
A path forward
In considering linguistic justice, we identified two main areas where injustice can occur in NLP: (1) NLP tools may perform worse for users of minoritized language varieties, resulting in inequitable access to information and opportunities, and (2) NLP tools may reproduce injustice through linguistic profiling. To move toward linguistic justice, and thereby social justice, we provide nine actions to prioritize in NLP tool development and management. These nine actions speak to data and NLP systems themselves, as well as broader power dynamics, values, and priorities in the design, development, and management of NLP tools.
Conclusion
This paper presents linguistic justice as a framework for NLP design, development, and management. By effectively centering linguistic justice in NLP, we can advance social justice. Doing so requires that NLP tools equitably serve diverse language users while acknowledging and responding to the harms of linguistic profiling within the status quo. As Baker-Bell (2020) notes, “Within a Linguistic Justice framework, excuses such as ‘that’s just the way it is’ cannot be used as justification for Anti-Black Linguistic Racism, white linguistic supremacy, and linguistic injustice” (p. 7). Instead, we must imagine and create a world where users of all language varieties are able to equitably access social, economic, and political life. Our nine actions provide a path toward that world, one in which we design, develop, and manage NLP systems that advance linguistic justice and, as a result, social justice. This critical work will require practitioners to rethink how we collect data and what data we prioritize and value in NLP development. These shifts will take time and concerted effort. Finally, we must remember that technology alone cannot solve complex societal problems, but should be part of broader efforts toward linguistic and social justice.
Supplemental Material
Supplemental material, sj-pdf-1-bds-10.1177_20539517221090930 for Linguistic justice as a framework for designing, developing, and managing natural language processing tools by Julia Nee, Genevieve Macfarlane Smith, Alicia Sheares and Ishita Rustagi in Big Data & Society
Footnotes
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The Center for Equity, Gender, and Leadership has received funding from a large Silicon Valley Tech Firm, whose work includes natural language processing and other use of big data. While this work was carried out by independent researchers at the Center, conversations between the researchers and individuals at the tech firm occurred and the research builds on a previous research collaboration between the tech company and the Center.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by The Center for Equity, Gender, and Leadership at Berkeley Haas School of Business.
Supplemental material
Supplemental material for this article is available online.
