Abstract
As natural language processing tools powered by big data become increasingly ubiquitous, questions of how to design, develop, and manage these tools and their impacts on diverse populations are pressing. We propose utilizing the concept of linguistic justice—the realization of equitable access to social and political life regardless of language—to provide a framework for examining natural language processing tools that learn from and use human language data. To support linguistic justice, we argue that natural language processing tools (along with the datasets that are used to train and evaluate them) must be examined not only from the perspective of a privileged, majority language user, but also from the perspectives of minoritized language users. Considering such perspectives can help to surface areas in which the data used within natural language processing tools may be (often inadvertently) working against linguistic justice by failing to provide access to information, services, or opportunities in users’ language of choice, underperforming for certain linguistic groups, or advancing harmful stereotypes that can lead to negative life outcomes for members of marginalized groups. At the same time, this framework can help to illuminate ways that these shortcomings can be addressed, allowing us to leverage inclusive language data and approaches to build natural language processing technologies that advance linguistic justice.
Introduction
As natural language processing (NLP) tools powered by big data become increasingly ubiquitous, questions of how to design and evaluate these tools and their impacts on social justice are pressing. This paper asks: how might we leverage NLP tools to advance social justice through linguistic justice, that is, equitable access to social and political life regardless of the language varieties one uses?
A framework of linguistic justice illustrates that linguistically just NLP tools must: (1) work well for users regardless of the language variety they use and (2) work to counteract inequities based on language use in decision-making and resource allocation. To build linguistically just NLP tools, we must recognize and address power inequities such as the over- and underrepresentation of linguistic patterns and discourses within datasets. This paper begins with key concepts about language, power, and social identity. We apply these understandings to explore two concerns in NLP systems: differential performance and opportunity allocation, and linguistic profiling. We present a path forward through nine concrete actions across the design, development, and management of NLP tools before concluding.
Language, power, and social identity
To understand linguistic justice, we must first understand the relationships among language, power, and social identity.
Concerns for NLP tools trained on human language data
NLP tools perform better and provide more opportunities for speakers of privileged language varieties
To achieve linguistic justice, we must examine whether NLP tools are providing equitable access to information, services, and opportunities in the language variety of all potential users. Because NLP tools recognize and replicate language patterns based on their training data, we should examine the language contained in NLP datasets to see whose language and viewpoints are (not) represented, as neither data collection nor data itself is neutral (D’Ignazio and Klein, 2020). NLP tools are increasingly powered by large language models, which rely on data from the internet; but internet use varies by social factors, producing skews in representation. Some estimate that over 60% of all language content on the internet is English (W3Techs, n.d.), despite only ∼17% of people globally speaking English. Although ∼7000 languages are in use worldwide, only seven have the large-scale digital data typically called for in machine learning (Joshi et al., 2021). Meanwhile, over 88% of languages have “exceptionally limited resources” in the digital space (Joshi et al., 2021).
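One concrete way to begin this examination is to profile the language composition of a candidate training corpus before using it. Below is a minimal sketch, assuming the corpus is available as a list of document strings, using the off-the-shelf langdetect package; a production audit would need stronger language identification, especially for low-resource varieties that such detectors themselves handle poorly.

```python
# A minimal sketch of a language-representation audit, assuming `corpus`
# is a list of document strings. Uses langdetect (pip install langdetect).
from collections import Counter

from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make detection deterministic across runs

def language_distribution(corpus):
    """Return the share of documents detected per language."""
    counts = Counter()
    for doc in corpus:
        try:
            counts[detect(doc)] += 1
        except LangDetectException:  # empty or non-linguistic text
            counts["unknown"] += 1
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}

# Example: a heavily English-skewed corpus
corpus = ["The cat sat on the mat.", "El gato se sentó en la alfombra.", "I love NLP."]
print(language_distribution(corpus))  # e.g., {'en': 0.67, 'es': 0.33}
```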
Even among well-represented languages, some perspectives are overrepresented. Reddit users, for example, are 67% male and 70% White; using language from Reddit, then, results in the reification of White, male perspectives (Bender et al., 2021, citing Barthel et al., 2016). Moreover, some perspectives are actively marginalized online. On Twitter, pervasive harassment of women, especially Black women, may lead to self-censorship, reducing the presence of their language in the data. At the same time, automated hate speech detection tools are more likely to flag content written in African American English (AAE) as offensive (Davidson et al., 2019), further limiting whose language appears on these platforms.
Underrepresentation of diverse language data contributes to disparities in NLP tool performance: speakers whose language varieties lack NLP tools that work (at all, or as well) face inequitable access, and with it, linguistic injustice through differential access to goods, services, and opportunities. Automatic speech recognition (ASR) tools from Apple, IBM, Google, Amazon, and Microsoft, for example, show higher error rates for Black speakers than for White speakers (Koenecke et al., 2020). Similarly, the higher error rates of automated hate speech detection tools discussed above in the Twitter example are partially linked to under- and overrepresentation of particular language varieties in training data (Davidson et al., 2019; Tatman, 2017), affecting who can use and access social media platforms.
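Audits like Koenecke et al.’s can be approximated in code by disaggregating error metrics by speaker group. The sketch below, with hypothetical records and group labels, computes word error rate (WER) per group using the jiwer package:

```python
# A minimal sketch of a disaggregated ASR audit: compare word error
# rates (WER) across speaker groups. Uses jiwer (pip install jiwer);
# the field names and sample data are hypothetical.
import jiwer

# Each record: reference (human) transcript, ASR hypothesis, speaker group.
samples = [
    {"ref": "why your door always locked", "hyp": "why you always lie", "group": "A"},
    {"ref": "the meeting starts at nine", "hyp": "the meeting starts at nine", "group": "B"},
    # ... many more utterances per group in a real audit
]

def wer_by_group(records):
    """Compute WER per speaker group; large gaps signal inequitable performance."""
    groups = {}
    for r in records:
        refs, hyps = groups.setdefault(r["group"], ([], []))
        refs.append(r["ref"])
        hyps.append(r["hyp"])
    return {g: jiwer.wer(refs, hyps) for g, (refs, hyps) in groups.items()}

print(wer_by_group(samples))  # e.g., {'A': 0.6, 'B': 0.0}
```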
Further, data labelers may incorporate their own biases (Davidson et al., 2019), such as labeling AAE as more negative than “S”AE when annotating training data for hate speech detection, or more often inaccurately labeling AAE as “unintelligible”. In the context of decision-making and resource allocation, such errors can have significant negative impacts. For example, certified court reporters in an experimental setting mistranscribed key facts from AAE speech, such as transcribing the utterance “why your door always locked?” as “why you always lie?” (Jones et al., 2019: e236).
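Labeler bias of this kind can be checked for before training. A minimal sketch, using hypothetical annotations and pandas, compares how often texts in each language variety receive a hate speech label; sharp differences for comparable content are a signal to re-examine annotation guidelines and annotator pools:

```python
# A minimal sketch of checking annotated training data for labeler bias:
# compare how often texts in each language variety are labeled as hate
# speech. Column names and data are hypothetical.
import pandas as pd

annotations = pd.DataFrame({
    "variety": ["AAE", "AAE", "AAE", "SAE", "SAE", "SAE"],
    "labeled_hate": [1, 1, 0, 0, 0, 1],
})

# Rate at which each variety is labeled as hate speech. Rates that differ
# sharply for comparable content suggest annotator bias worth investigating.
rates = annotations.groupby("variety")["labeled_hate"].mean()
print(rates)  # e.g., AAE 0.67 vs. SAE 0.33
```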
While many speakers of marginalized languages are also speakers of majority languages, asking minoritized language users to adopt dominant language practices imposes an inequitable burden, as minoritized individuals may spend significant time, money, and psychological energy modifying their speech (Hughes and Mamiseishvili, 2013). Moreover, addressing linguistic inequities early on can prevent larger harms from accruing over time. For example, if an algorithm ranks video search results based on NLP-generated transcripts, but the transcription tool handles only some language varieties, content in other varieties will not surface in search results as easily. This inequity can compound over time if search results are also ranked by popularity, as the toy simulation below illustrates.
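To make the compounding concrete, here is a toy simulation (all parameters hypothetical) in which two otherwise identical videos differ only in how well an ASR system transcribes their language variety, and clicks follow rank exposure:

```python
# A toy simulation (all numbers hypothetical) of how a transcription gap
# compounds under popularity-based ranking: content the ASR system
# transcribes poorly matches fewer queries, gets less exposure, accrues
# fewer views, and so ranks even lower in later rounds.
videos = [
    {"name": "well-transcribed variety", "match_quality": 1.0, "views": 0},
    {"name": "poorly transcribed variety", "match_quality": 0.2, "views": 0},
]

for step in range(5):
    # Rank score blends transcript match quality with accumulated popularity.
    scores = [v["match_quality"] * (1 + v["views"]) for v in videos]
    total = sum(scores)
    for v, s in zip(videos, scores):
        v["views"] += int(1000 * s / total)  # exposure-driven clicks

for v in videos:
    print(v["name"], v["views"])
# The initial 5x quality gap grows into a view gap of roughly 18x
# after five rounds of popularity-weighted ranking.
```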
To work toward linguistic justice in NLP, developers must think carefully about the datasets they use and create datasets that are better balanced across language varieties (Bender et al., 2021). They must also involve speakers of diverse language varieties, who can accurately analyze and label language data; if only majority language speakers are involved, they may improperly label or even discard data from other varieties. It is also important to conduct audits and user testing with a diverse set of users to spot and address potential disparities in performance by language variety. Accurately including language from a greater diversity of varieties is an important step toward equitable access, and with it, linguistic justice.
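Such audits can also include a simple statistical check that an observed performance gap is unlikely to be noise. A minimal sketch with hypothetical per-variety error counts, using Fisher’s exact test from scipy:

```python
# A minimal sketch of an audit step: given hypothetical counts of correct
# and incorrect predictions per language variety, test whether the
# performance gap is larger than chance (Fisher's exact test via scipy).
from scipy.stats import fisher_exact

#               correct  errors
audit_counts = [[90,     10],    # "S"AE test utterances (hypothetical)
                [70,     30]]    # AAE test utterances (hypothetical)

odds_ratio, p_value = fisher_exact(audit_counts)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
# A small p-value indicates a performance disparity unlikely to be noise,
# flagging the tool for remediation before release.
```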
Not all communities, however, wish to contribute data or share the same definition of intellectual property (Tatsch, 2004). Some language data, for example, may contain sensitive, culturally specific knowledge, and use of language data by individuals outside the community may require specific practices (Tatsch, 2004). Although some language communities may be excluded at present, others may choose not to participate. While some may view representation as key to linguistic justice, others may see it as contributing to unjust surveillance and control. Instead, the right to privacy or opacity may be considered more crucial (Blas, 2016; Glissant, 1997). Ultimately, while we recognize it may not be possible to curate datasets representative of all language varieties, we can continue to collaborate with diverse speaker communities and work toward equity, while being transparent in decisions and limitations. This includes giving communities choices in whether or not to share their linguistic data and honoring their decisions.
NLP tools can advance linguistic injustice through linguistic profiling
NLP tools can exacerbate social disparities by advancing language-based stereotypes. When we communicate, we convey not only literal content but also a wealth of associated information about our social identities, such as race, gender, and nationality (Baugh, 2003; Hughes and Mamiseishvili, 2013). These associations are inherent to human communication, but they can result in harm through linguistic profiling: the practice of drawing inferences about a speaker’s social identity from their language and using those inferences as a basis for discrimination (Baugh, 2003).
Research shows that AI systems trained on human language data can learn and act on these identity associations. Amazon, for example, reportedly stopped using an experimental NLP-powered resume-screening tool after discovering that it penalized resumes containing the word “women’s,” as in “women’s chess club captain.”
As NLP-powered virtual assistants and assessment tools become mediators for people seeking access to resources and opportunities, we must address the potential for NLP systems to use language to discriminate and thus perpetuate linguistic injustice. Although Amazon no longer uses its tool, similar NLP-powered services remain (Raghavan et al., 2020). Tools like HireVue, which uses NLP to evaluate video job interviews, make judgments about applicants’ potential based largely on “their word choices and the language of their responses” (Zielinski, 2020).
Language has been shown to be a factor in hireability outside of NLP (Hosoda and Stone-Romero, 2010; Hughes and Mamiseishvili, 2013), despite the fact that human judgments of speakers’ language are notoriously subjective (Hill, 2008). Hosoda and Stone-Romero (2010), for example, showed that job applicants with French-accented English were preferred over those with Japanese-accented English, even when the applicants with Japanese accents were more understandable. If NLP tools are trained on data from such human decisions, they will replicate those biased outcomes. To build NLP systems that treat speakers equitably, we must identify how past human decisions contained in the data reflect human biases, work to correct those biases while being transparent about limitations, and audit NLP tools, as in the sketch below.
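One widely used audit heuristic from US employment-selection practice is the “four-fifths rule”: a group whose selection rate falls below 80% of the highest group’s rate is flagged as potentially experiencing disparate impact. Below is a minimal sketch applying it to hypothetical outcomes from an NLP-driven screening tool, echoing the accent study above:

```python
# A minimal sketch of auditing an NLP-driven hiring tool for disparate
# impact using the "four-fifths rule". Group names and outcomes are
# hypothetical.
def selection_rates(decisions):
    """decisions: list of (group, hired_bool) pairs -> selection rate per group."""
    totals, hires = {}, {}
    for group, hired in decisions:
        totals[group] = totals.get(group, 0) + 1
        hires[group] = hires.get(group, 0) + int(hired)
    return {g: hires[g] / totals[g] for g in totals}

def four_fifths_check(rates):
    """Flag groups whose selection rate is < 80% of the highest group's rate."""
    top = max(rates.values())
    return {g: (r / top >= 0.8) for g, r in rates.items()}

decisions = (
    [("French-accented", True)] * 8 + [("French-accented", False)] * 2
    + [("Japanese-accented", True)] * 4 + [("Japanese-accented", False)] * 6
)
rates = selection_rates(decisions)
print(rates)                     # {'French-accented': 0.8, 'Japanese-accented': 0.4}
print(four_fifths_check(rates))  # Japanese-accented fails the 80% threshold
```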
A path forward
In considering a path forward, we propose nine concrete actions that practitioners can take across the design, development, and management of NLP tools to advance linguistic justice.
Conclusion
This paper presents linguistic justice as a framework for NLP design, development, and management. By centering linguistic justice in NLP, we can advance social justice. Doing so requires that NLP tools equitably serve diverse language users while acknowledging and responding to the harms of linguistic profiling within the current status quo. As Baker-Bell (2020) notes, “Within a Linguistic Justice framework, excuses such as ‘that’s just the way it is’ cannot be used as justification for Anti-Black Linguistic Racism, white linguistic supremacy, and linguistic injustice” (p. 7). Instead, we must imagine and create a world where users of all language varieties can equitably access social, economic, and political life. Our nine actions provide a path toward that world, one in which we design, develop, and manage NLP systems that advance linguistic justice and, as a result, social justice. This critical work will require practitioners to rethink how we collect data and what data we prioritize and value in NLP development. These shifts will take time and concerted effort. Finally, we must remember that technology alone cannot solve complex societal problems; it should instead be part of broader efforts toward linguistic and social justice.
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The Center for Equity, Gender, and Leadership has received funding from a large Silicon Valley Tech Firm, whose work includes natural language processing and other use of big data. While this work was carried out by independent researchers at the Center, conversations between the researchers and individuals at the tech firm occurred and the research builds on a previous research collaboration between the tech company and the Center.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by The Center for Equity, Gender, and Leadership at Berkeley Haas School of Business.
Supplemental material
Supplemental material for this article is available online.
