Abstract
Current practices in the field of natural language processing may reinforce stereotypes, stigmatize non-normative speech, and prevent access to public discourse online. Across Latin America, a rapid increase in digital material available in minoritized (mostly Indigenous) languages spoken in the region has been observed, and a number of Indigenous language corpora and language models are currently under development. Given that “poor data quality in high-stakes domains can have outsized effects on vulnerable communities and contexts” (Sambasivan et al., 2021), it is important to examine the norms and assumptions embedded within the process for building linguistic datasets. Operationalizing and measuring harms have been the primary focus of work investigating bias in natural language processing, and linguistic justice has recently been proposed as a framework for identifying harmful language ideologies in natural language processing systems. This article explores whether and how harmful ideologies of language may be informing the work of natural language processing researchers working on minoritized Mexican languages, through a systematic search and content analysis of published scholarship on natural language processing covering a 20-year period. The findings show that the field is changing rapidly, with far greater awareness of potentially harmful language ideologies in recent years, and attempts to mitigate associated bias. This work also shows that the concepts of linguistic justice and language ideology provide a fruitful framework for understanding, and potentially guiding, the further integration of ethical protocols into the construction of language technologies.
Introduction
In recent years, a critical mass of researchers in computational linguistics (CL) and allied fields have begun speaking out on the problem of bias introduced during the language dataset development and modeling process (Basta et al., 2019; Bender et al., 2021; Bender and Friedman, 2018). These researchers have drawn particular attention to the fact that large language models (LMs), trained on datasets containing trillions of words and word parts and their relationships to one another, are cumulative in nature and inherit any biases existing in the language corpora that have been included. Some of the major English language corpora in use within artificial intelligence (AI) development have been curated in such a way as to deliberately exclude content that does not fit into existing normative patterns of language use (Dodge et al., 2021), which has resulted in documented bias along the lines of race, gender, and sexual orientation. For example, tools used to rate toxicity levels have assigned higher toxicity to African American English tweets than to those written in Standard American English (Blodgett et al., 2020).
At the same time, linguists such as Steven Bird (2020) have noted the existence of “colonizing discourses” that reproduce colonial power relationships around the development and adoption of language-related technologies for minoritized languages (Bird, 2020: 3504). He draws attention to the commonly held notion of field linguists “discovering” and “saving” a previously undocumented language and further points to the ways in which the assumptions of Western scientific practice may lead to the erasure of culturally significant contexts from linguistic data collected by non-Indigenous researchers. Extending this concept, another example of a colonizing discourse is the convention of classifying languages into “low resource” versus “high resource” depending generally upon the amount of digital linguistic data available for use in technical applications. 1 This framing positions the overwhelming majority of the world's languages as deficient when compared to dominant (mostly colonial) world languages. 2
Others have made the case that the resulting bias in AI applications results in “epistemic injustice” (Fricker, 2007), which is a type of harm caused to someone in their capacity as a knower, as well as a structural form of prejudice against differing ways that individuals and communities make sense of their lives (Fricker, 2007; Migoli and Rasu, 2021). Such findings have real-world consequences for information seeking and use, individual self-expression, political participation, and community self-determination.
A number of recent studies in social computing and adjacent fields have pointed to the lack of both standard documentation and discipline-based codes of ethics to explain the ways in which the practices of natural language processing (NLP) have resulted in (often unintentional) bias and harms to user communities of language technologies, ranging from search and retrieval to machine translation (Blodgett, 2021; Gebru et al., 2018; Sambasivan et al., 2021). It is understood in this literature that the situated practices of language technology data scientists themselves are a relevant factor in these outcomes. As Bechmann and Bowker (2019) note, “we need to focus less solely on access to the models and algorithms as technical constructs by themselves and more on documenting the human choices made in the work process surrounding AI in order to act sustainable in relation to shared … values” (p. 7). Yet the in-depth study of these practices in context is challenging, given the difficulty of securing meaningful research access to the corporate environments in which many of these technologies are developed (Passi and Sengers, 2020).
Given the well-documented problems that arise once a problematic large LM is in wide use (Bender et al., 2021), it is urgently important to consider the question of bias, harms, and justice in the context of minoritized (primarily Indigenous) languages for which corpora and LMs are currently in development, in the hopes of avoiding discriminatory outcomes to begin with. Across Latin America, a rapid increase in the amount of digital material available in minoritized languages spoken in the region has been observed (Cassels, 2019; UNESCO, 2015), and a number of corpora and models are currently under development for Indigenous languages throughout the region (Mager et al., 2018). This activity raises questions relating to how such corpora are being developed and annotated today, and whether this work is guided by protocols around the normative decisions made therein.
A research agenda around minoritized Indigenous languages in the digital space has arisen at a moment of ethical reckoning for NLP norms in the profession as well as the popular press. 2021 marked the first year that a conference of computational linguists specifically addressed Indigenous language technologies in the Americas (AmericasNLP, 2021). It was also the year that the Ford Foundation, the MacArthur Foundation, the Kapor Center, and the Open Society Foundation jointly supported the creation of the Distributed Artificial Intelligence Research Institute (DAIR), founded by former head of Google's Ethical AI group Timnit Gebru, to address bias in computational practices employed in AI (Coldewey, 2021). Concurrently, UNESCO launched the International Decade of Indigenous Languages in 2022, a global awareness campaign focused on the human rights of Indigenous language speakers across the globe (United Nations General Assembly, 2020).
Given the evidence that some practices in NLP may reinforce stereotypes, stigmatize non-normative speech, and prevent the full enjoyment of access to public discourse online (Blodgett, 2021) and that “poor data quality in high-stakes domains can have outsized effects on vulnerable communities and contexts” (Sambasivan et al., 2021: 1), respecting the communication rights of Indigenous language speakers globally requires much more than simply facilitating access to ICT (information and communications technology) and fostering additional localized content. It is clear that the continued push to bring Indigenous languages into the digital environment could benefit from a holistic consideration of linguistic inclusivity that includes an examination of harms in the underlying communications technologies.
Digital language technologies will only continue to grow in importance for the expression of rights and identity, and, therefore, the need to empirically document the practices of practitioners in this area is a pressing concern. No work has yet explored these practices in depth in the context of the over 500 languages of Latin America, and the case of minoritized languages of Mexico, the country with some of the highest linguistic diversity in the Americas, is proposed as an area for immediate investigation. Recent efforts toward language revitalization have resulted in the growth in the number of speakers of several of Mexico's 68 government-recognized Indigenous languages over the past decade (INEGI, 2010, 2020), and several of these have been or are currently the subject of early-stage NLP work (Mager et al., 2018).
This article explores whether and how specific social values and norms around language and technology, in particular harmful ideologies of language, are informing the work of NLP researchers working on minoritized Mexican languages, through a systematic search and content analysis of published scholarship on NLP covering a 20-year period. These ideologies—the normative political, economic, cultural, and moral beliefs about languages and their value—are identified through the use of linguistic justice as a theoretical framework. This review and the future work it supports will provide the opportunity to observe and document normative decisions in NLP before they become obscured through the passage of time and future alterations to the datasets and systems.
Theoretical framework
Linguistic justice
Linguistic justice is a concept referring to the ability of every person to participate in political, economic, and social life using their preferred language, and to be free from discrimination for using their language of choice (Gazzola et al., 2023). It recognizes the importance of language to identity and self-expression, which is considered a human right by the United Nations (United Nations, 1949, 2007). A longstanding area of research and practice, the concept has easily made the jump into the digital context, given the importance of equitable access to online information in an individual's native language (Spence, 2021). This is particularly relevant in the context of minoritized, primarily Indigenous, languages, which face unique challenges to digital inclusion, namely longstanding systemic discrimination against these languages and their speakers, limited telecommunications infrastructure in rural areas, nonstandard orthographies, and the lack of a significant digitized corpus of linguistic data for use in the development of digital speech and language tools (Bird, 2020; Mager et al., 2018; Young, 2019). In a powerful statement about the historical power relations implicated in these and other factors, Anasuya Sengupta asks, “Can you only be on the internet in your nearest colonial language?” (Sengupta, quoted by Spence, 2021: 9).
Much of the literature on language justice in the context of the multilingual internet is related to access: information and communications technology (ICT) access, multilingual interface accessibility, and localized content availability (UNESCO, 2015). 87% of the content on the internet is in only 10 of the world's many thousands of languages, even though these dominant languages make up less than 1% of all languages on the globe and account for only around 50% of its speakers (Markl, 2022). Such overwhelming differentials in content availability and access contribute to the mistaken perception that “monolingualism is the global norm” (Spence, 2021: 2). Further, the digital treatment of textual language data poses additional challenges for languages with less of a written tradition than dominant global languages. Practices in NLP may inadvertently reinforce all of these factors, with a near-exclusive focus on a small number of dominant languages and little attention paid to the ∼7000 languages currently spoken by a majority of the global population (Benjamin, 2018; Joshi et al., 2021; Markl, 2022; Schwartz, 2022). These phenomena point to the larger context that “language is not just social practice … [but] is also and always infused with and caught up in the political economic, national, (post)colonial, and political circumstances that shape its use and its role as an object of study, political manipulation, and cultural value” (Cavanaugh, 2019: n.p.).
However, the concept of linguistic justice is complicated by differing views over what language actually is and how it functions in different contexts. Culturally specific ontologies of language among Indigenous communities may challenge the Western assumption that language itself is an “autonomous medium of denotational code” (Hauck, 2022: 4) that can be separated from embodied contexts, suggesting that unfettered access to the tools of communication in the digital environment alone is insufficient for securing language justice. These conflicts remain particularly salient in the current environment, where nascent AI technologies have forced a conversation about the onto-epistemological status of texts generated by LMs (Bird, 2024; Zhang et al., 2020). Positions on this debate range from Claude Shannon's—that language is inherently non-random and speakers have a tacit understanding of the statistical patterns found in their language, which they use to form sentences and communicate (i.e. humans themselves can be thought of as writing machines) (Gibson, 2023; Shannon, 1948)—to the assertion that “text generated by an LM is not grounded in communicative intent, any model of the world, or any model of the reader's state of mind” and should, therefore, not be considered language at all (Bender et al., 2021: 616). These debates around language as a concept and a phenomenon are missing crucial perspectives from speakers of Indigenous languages with distinct positions on its onto-epistemological status (Bird, 2024). For example, traditional views among speakers of White Mountain Apache that the Apache language can only be learned through the practice of family relationships and activities (baking bread, chopping wood, etc.) have led to community conflict around the implementation of digital language revitalization efforts (Nevins, 2004).
Yet recent research also “complicate[s] the bifurcation of digital and analogue materiality,” suggesting that these language ontologies are currently undergoing shifts as some communities develop “land-based cyber-pedagogies” around language learning (Caranto Morford and Ansloos, 2021: 302). There is no uniform onto-epistemology of language across distinct Indigenous communities around the globe, suggesting the need for further theory-building conducted in partnership.
Additionally, language justice as a concept is further complicated by cultural considerations around linguistic data, which may exist in tension with current trends toward open data and practices. Indigenous Data Sovereignty, the right of Indigenous communities and Nations to “govern data within Indigenous rights and human rights frameworks” (US Indigenous Data Sovereignty Network, n.d.), is an important related concept operationalized through the CARE Principles for Indigenous Data Governance. These principles draw attention to the inherent tension between the open data movement, with a “primary focus on characteristics of data that will facilitate increased data sharing among entities while ignoring power differentials and historical contexts,” and Indigenous communities’ need to steward data for their own collective benefit (Research Data Alliance International Indigenous Data Sovereignty Interest Group, 2019). Viewed in this way, the harvesting, reuse, and/or distribution of cultural heritage data, especially when removed from its original context, can be problematic (Eisenlohr, 2004). As a result, typical NLP practices such as web scraping may be perceived as exploitative and lacking in mutual benefit.
Many such factors contributing to harms and injustices in language technology development and NLP have been singled out in recent years, thus partially suggesting what justice might look like in their absence. Additional factors leading to unjust outcomes include: decisions made during language model development, training dataset curation and annotation practices (including the labor conditions of annotators), a lack of engagement with a diversity of language speakers during the language technology development process, poor data documentation practices, and a lack of interdisciplinarity in NLP (Bender et al., 2021; Blodgett et al., 2020; Denton et al., 2021; Dodge et al., 2021; Hooker, 2021; Hovy and Prabhumoye, 2021). The genesis of these factors can be traced to harmful language ideologies present in the field of NLP and data science more generally (Blodgett et al., 2020; Castelle, 2018; Gururangan et al., 2022; Nee et al., 2022). As Bechmann and Bowker (2019) note, “documenting the human choices made in the work process surrounding AI” (Bechmann and Bowker, 2019: 7) is one mechanism for avoiding unjust outcomes from these technologies. This includes going back to the original data sources and identifying their conditions of production to understand how historical practices have resulted in “clean” Indigenous language datasets.
Language ideologies
One way in which the framework of linguistic justice can be helpful is through the identification of nonharmful versus harmful language ideologies informing the work of NLP for minoritized languages. Language ideologies are the normative political, economic, cultural, and moral beliefs about languages and their value. These beliefs may be “embedded in pre-theoretical common sense about the linguistic aspects of political processes” (Gal, 1989: 377). Ideologies of language come to be dominant because they are reproduced every day by “institutional and semi-institutional practices” (Blommaert, 1999: 10) such as standards and social policies, which results in their normalization. Although not all language ideologies are harmful, some result in meaningful social and material consequences for speakers of non-dominant language varieties the world over. For example, ideologies privileging dominant colonial languages in Latin America (primarily Spanish and Portuguese) have resulted in the systematic devaluing of Indigenous culture and education, which has significant material impact on the livelihoods of Indigenous language speakers, and is a major contributing factor to the continued decline in the number of speakers of several hundred Indigenous languages in the region (Cisternas Irarrázabal and Olate Vinet, 2020).
One area where the lens of language ideology has previously been used to critique NLP practices is in the area of hate speech detection. Castelle (2018) shows the complex nature of language ideologies of hate speech, and how this relates specifically to efforts toward harm reduction in NLP. Surveying ideologies related to abusive speech in NLP, he notes: “We argue that it will be essential for NLP researchers to recognize how our tools and techniques may, in part, be material embodiments of these ideologies, but also how one might partially escape those ideologies without abandoning the use of tools and techniques entirely” (Castelle, 2018: 4). Nee et al. (2022) additionally show that looking at language ideologies through a framework of linguistic justice helps to pinpoint the ways that “NLP tools can exacerbate social disparities through their advancement of language-based stereotypes” and linguistic profiling (Nee et al., 2022: 3), and suggest multiple actions for addressing these often inadvertent outcomes.
It has been proposed that dominant ontologies of language traced to European modernity contribute to the normative conceptualization and valuation of languages in the digital environment, resulting in the development of technologies that properly support only a small number of hegemonic global languages (Schneider, 2022). Technologies constructed with European language structures in mind encode culturally specific worldviews of language that foreclose the possibility of an accurate digital representation of or access to languages falling outside this paradigm. For example, the conception of languages as “rule-based, orderly, and clearly delimitable cognitive systems that have primarily referential functions” (Schneider, 2022: 365) does not uniformly reflect the broad cultural variety around linguistic practice, resulting in the exclusion of individuals and communities from full participation in a civic life increasingly mediated by digital technologies.
Ultimately, all technological development assumes some ideological disposition, and these beliefs are not necessarily pejorative or harmful to speech communities. As Gururangan et al. (2022) note, “We argue that … one cannot avoid adopting some language ideology; the language ideology which is appropriate depends on the goals of the work, and one language ideology may conflict with another” (Gururangan et al., 2022: 2). It is therefore important to identify and understand the consequences of specific common ideologies in the field in order to specify conditions for more just technology development efforts. The theoretical framework of linguistic justice is well-suited to shedding light on which of these language ideologies have the potential to negatively impact information access and technology use by speakers of minoritized languages in the digital environment.
Related research
A number of researchers have taken a critical look at the practices around language technology development, from the perspective of the linguist as well as the NLP practitioner and data scientist. Steven Bird, a computational linguist with a focus on minoritized and Indigenous languages, addresses a tendency among linguists working on language revitalization, including himself, to “prioriti[ze] data capture over local self-determination” (Bird, 2020: 3505), and critiques the idea that documentation by non-experts using new technological advances results in the intended outcome, given the way in which these systems extract and strip data of their local context. This work follows Dobrin et al. (2009), who argue that the “discourse of language endangerment” operates in tension with the moral rationale for language revitalization through a tendency to reduce and commodify linguistic data. Both Bird (2020) and Dobrin et al. (2009) suggest that linguists’ search for pure forms of Indigenous languages is itself a colonial attitude, positioning languages as unchanging objects to be acquired by external observers. Echoing Srinivasan's (2017) call for context-sensitive and partnership-based technology development, Bird proposes that linguists “enter a research process that supports recovery from the colonial legacy” (Bird, 2020: 3509) by identifying community goals, privileging local agency, finding the human in the loop, and prioritizing the connecting function of language over language proficiency. Nee et al. (2022) further suggest nine actions to incorporate reflexivity around “power dynamics, values and priorities” in language technology development.
Decades of scholarship have worked to establish ethical norms around community-based research in linguistics and NLP, with a growing body of Indigenous methods texts addressing ethical practices when working with Indigenous individuals and communities, cultural heritage, and land (Carroll et al., 2021; Kovach, 2021; Smith, 2012). However, the complexities of actually implementing equitable community-based research practices around language technologies can be seen in the types of languages that are researched to begin with. For example, there are hundreds of distinct Indigenous languages spoken across the Latin American region exhibiting a huge diversity of linguistic phenomena, and most are at risk of extinction. However, there are only limited efforts to employ technology to aid in language conservation and revitalization. Mager et al. (2018) catalog all of the known projects addressing computational treatment of Indigenous language data in Latin America, finding language technology projects for only 35 of ∼500 languages. They note that current NLP technologies seek to be language-neutral; however, a lack of knowledge about local languages limits their utility, as the sheer diversity of language typologies and practices in the region resist standardized treatment with the methods popular among computational linguists today. Like Bird (2020), they note that Indigenous language speakers may often be bilingual, and that code switching presents unique challenges for NLP applications. This work is valuable for its thorough survey of the extent of research on Indigenous language technologies and the challenges that remain, but also for its implicit call to recognize the value of situated knowledges in technology development.
Joshi et al. (2021) note that a preference for the study of dominant global languages in NLP contributes to a “typological echo-chamber” (Joshi et al., 2021: 1) wherein the majority of language typologies are unrepresented in language technologies and receive little attention from researchers. The authors ask questions designed to understand the degree to which existing work in NLP has been inclusive of global languages, noting an abundance of typological features found in many languages that are being ignored in current research. The authors conclude with suggestions for increasing research in diverse languages, including special tracks in conferences and asking if researchers’ work applies agnostically across multiple languages. Importantly, they raise the question of researcher positionality, asking: “What role does an individual researcher, or a research community have to play in bridging the linguistic-resource divide?” (Joshi et al., 2021: 2). Since the publication of those recommendations there are further reasons to be encouraged about recent changes in practice, including the establishment of the AmericasNLP conference in 2021 and a 2022 Association for Computational Linguistics annual conference special keynote panel “Supporting Linguistic Diversity.” However, empirical research is needed to establish whether this is part of a larger observed trajectory and what work might remain.
Intervention: Systematic literature search and content analysis
This investigation is a systematic literature search and content analysis exploring whether and how potentially harmful ideologies are expressed in the work of NLP research for Mexican Indigenous languages. The analysis extends the conversation about bias and fairness in NLP to more non-English language contexts and contributes to new understandings of the situated practices and social assumptions involved in the construction of language technologies.
Research questions
RQ1: How are material and symbolic social values and norms around language represented in the scholarly literature on NLP for the minoritized languages of Mexico?
RQ2: How are these values and norms manifested in language technology development?
Methods
A content analysis of Mexican language NLP scholarship was identified as a method to understand the extent and quality of data work related to minoritized languages of Mexico, and the conceptual frameworks guiding such efforts. A systematic search of published literature on NLP for these languages was conducted to provide the data source for analysis. Following a screening process of 511 publications, 46 studies met the inclusion criteria for analysis, as detailed below.
Data sources
The primary publications for computational linguists are published by the Association for Computational Linguistics.
The metadata for each of the articles returned in these searches was evaluated for relevance to the research questions. To provide for the largest possible dataset for subsequent analysis, a thematic focus on NLP work for a Mexican Indigenous language was the sole criterion for article selection. In many cases, article relevance could be determined by reading the abstract, but there were cases when it was necessary to read through an article to ensure its relevance. The vast majority of articles identified in the initial search results included the language keywords in bibliography citations or as a brief mention in the article text, but did not address NLP for these languages in the articles themselves. These were discarded. For example, an article including both of the keywords “machine learning” and “Tzeltal” may have appeared in the initial search results, but unless the article directly addressed the development or use of NLP techniques on Tzeltal language content, it was discarded from the dataset. Citation chaining was further employed on a subset of the selected literature as a means to ensure that the corpus was comprehensive. A total of 46 articles were selected for the final dataset, covering a 20-year period of 2003–2022.
Content analysis
Research Questions 1 and 2 guided an examination of this remaining literature. Braun and Clarke's (2006) Reflexive Thematic Analysis (RTA) process was employed to conduct organic thematic coding of the corpus, acknowledging that such an approach involves significant interpretive work on the part of the researcher. In this process, codes and themes are not viewed as “emerging” from the data; rather, they are recognized as the result of active selection by a researcher with a distinct positionality. All 46 articles were read closely and coded for potentially harmful language ideologies or other harmful practices drawn from the literature on bias in NLP. These include ideologies such as language standardization ideology and practices such as under-describing datasets. In the RTA process, the themes, which may be identified at the semantic or interpretive levels, are not determined based upon a rigid notion of frequency within the corpus, but allow for the use of researcher judgment. Because of the necessarily interpretive nature of this analysis, direct quotations from the corpus are used extensively throughout to demonstrate empirical evidence.
The author acknowledges their own positionality as a non-Indigenous, White, socioeconomically advantaged researcher based at a large university in the Global North, with access to resources unavailable to scholars in the region where the languages in question are spoken most widely. While it is not appropriate for the author to suggest a research agenda that should be driven by Indigenous communities themselves, this work seeks to utilize the protocols and frameworks developed by and in consultation with these communities as a lens through which to suggest fruitful areas for investigation for non-Indigenous researchers (Gasparotto, 2025).
Discussion
Descriptive characteristics
The identified body of literature consists primarily of experiments and case studies around the development of tools and techniques for tasks such as corpus building (lexicon extraction, morphological segmentation, automatic glossing annotation, etc.), machine translation, and automatic speech recognition, among others, published between 2003 and 2022. The research is team-based: multiple authors, sometimes as many as eight, are typical, and only one single-author paper was discovered. Most of the papers are published in conference proceedings; only 14% appeared elsewhere.
Themes
Language ideologies
A variety of potentially harmful language ideologies were observed throughout the identified literature. These ideologies consist of sociocultural norms and values that have been applied across cultural contexts where they may or may not be applicable. These include the assumption of language standardization, unrepresentative or biased datasets, and an open data ideology. However, these themes are more commonly observed in the earlier literature on Mexican language NLP work, and a small but increasing number of more recent papers explicitly call attention to potentially harmful language ideologies present in the field of NLP in order to mitigate their presence in the research.
Language standardization
Normative assumptions about the existence and value of a standard variety of a given language constitute a language standardization ideology. Persistent beliefs placing language practices within a hierarchy can lead to linguistic discrimination when the variety of language selected as the standard is privileged and valued above others. Frequently, such language standardization ideology may be subtle or go unrecognized because individuals are unaware of their strong ideological position, believing their views to be commonsense and held by a majority of people (Milroy, 2001).
The most prevalent potentially harmful language ideology present in the NLP literature on Mexican Indigenous languages refers to orthographic standardization across multiple languages. Researchers note the variety of written traditions present in their data sources (both analog and digital) and describe techniques for addressing the challenges that this lack of standardization poses to their research. Orthographic variation is often presented as a deficit rather than a natural characteristic of a living language with many varieties documented over several hundred years. Researchers frequently take standardization for granted as an apolitical activity, describing orthographic normalization steps without commenting upon any of the sociopolitical implications of such a decision.
Two representative extracts describe such efforts: "In fact, this is the case for many languages spoken in the Americas: large dialectal variation," and "To treat orthographic variation, [the authors] standardize text with a rule-based system which converts diacritics and letters to contemporary orthographic convention." (Article 36)
Here, language variation is viewed as a negative feature that impedes research and therefore needs to be overcome by the researchers through standardization efforts. In the second excerpt, “contemporary orthographic convention” is left undefined, and we are left to guess which variety of the language in question was selected. In the case of Náhuatl, multiple varieties are spoken and written across Mexico today, and it is unclear which of these was chosen as the unmarked standard.
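A rule-based normalizer of the kind described in the extract above can be sketched in a few lines. The mappings shown here (e.g. ⟨hua⟩→⟨wa⟩, ⟨qu⟩→⟨k⟩) are hypothetical illustrations loosely inspired by points of variation across Náhuatl writing traditions, not the rules used by the cited authors; a real system would encode many more language-specific decisions, each of which embeds a choice of target convention.

```python
import re
import unicodedata

# Illustrative, hypothetical mapping from one variant orthography to a
# chosen target convention; real systems encode dozens of such rules,
# and the choice of target is itself a sociopolitical decision.
RULES = [
    (r"hu([aeio])", r"w\1"),   # e.g. <hua> -> <wa>
    (r"qu([ei])", r"k\1"),     # <que>, <qui> -> <ke>, <ki>
    (r"c([aou])", r"k\1"),     # hard <c> -> <k>
]

def normalize(text: str) -> str:
    """Apply Unicode normalization, then ordered rewrite rules."""
    text = unicodedata.normalize("NFC", text.lower())
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    return text

print(normalize("nahuatl"))  # -> "nawatl"
```

The sketch makes the ideological point concrete: every entry in the rule table silently selects one variety's conventions as the "standard" into which all others are rewritten.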
The standardization of contemporary Náhuatl is actively contested today by communities refuting the notion that one form of the language should be dominant, relegating all other varieties to the lesser status of "dialect" (Petrović, 2017). Only one research team provided a more nuanced understanding of this social context, in reference to Western Tlacolula Valley Zapotec: "Throughout the paper we use the term 'language variety' in place of 'dialect' because of the pejorative force of the word dialecto in Spanish." (Article 39)
Notably, this research team includes a member of the speech community of the language being studied and involves sustained involvement with the larger community of linguistic stakeholders.
Language standardization as an ideology can also be seen in the choice to normalize contemporary orthographies of a given language to an earlier variety of the language for which a greater volume of digital data exists, as in the case of Náhuatl. Náhuatl, a language with nearly 3 million speakers today, was spoken across much of what is now Mexico at the time of the arrival of the Spaniards, and is one of the first languages of the Americas to be documented by Spanish colonizers (McDonough, 2024). Because these early works in Classical Náhuatl have been transcribed and translated in published scholarship for several hundred years, they provide a more readily accessible source of data for NLP efforts than contemporary varieties of Náhuatl for which orthographic practice is less standardized. Such a corpus is a tempting data source in the absence of large-scale digital content reflecting contemporary forms of the language: "We elected to work with this language because, although it is low resource, access to textual documents is relatively easy compared to others. Náhuatl has a greater written tradition than other Mexican Indigenous languages because it was the most widely used language upon the arrival of the Spanish, and as a result they had to adopt a written practice to create religious, legal and other types of texts beginning in the 16th century." (Article 21) [Author's translation] "As there is a lack of consensus regarding the orthographic standard, for AmericasNLI the orthography has been normalized to a version similar to Classical Náhuatl." (Article 45)
Standardization in this way might be compared to normalizing contemporary British English to Shakespearean English, with clear implications for how well technologies developed using such a corpus might relate to the needs of today's speakers. This is not to say that researchers involved in normalization efforts are explicitly dismissing the importance of cultural considerations in NLP. Nor is it to say that it is unhelpful for Náhuatl speech communities to have NLP-based tools that function for Classical Náhuatl, as historical heritage materials are valued for both cultural and scholarly reasons. However, the opportunistic use of readily available datasets and standards reflects the realities of benchmark-based research, whose practices cannot easily accommodate labor-intensive work with "messy" contemporary data that more accurately reflects the cultural realities of today's active speech communities.
These examples refer to written language data; however, language standardization ideology was also observed in reference to spoken language practices. In one 2009 case, authors describe a research method that requires the collection of recorded linguistic data lacking in tonal or pitch variation, thus stripping a crucial layer of meaning from the speech: "The original speech recording needs to be as monotonous as possible to reduce discontinuities between different segments and to reduce as much as possible any need for signal processing." (Article 6)
In this case, the quality of the tools employed for text-to-speech technology development at the time explicitly required the decontextualization of linguistic data at the point of collection, pointing to ethnocentric practices that fail to appropriately prioritize the preservation of linguistic diversity. More recent articles directly address the need to change the technology to suit linguistic diversity rather than the other way around: "Due to the characters […] These failed results were mainly associated with the fact that the OCR could not properly identify the special written characters of Indigenous languages, since the majority of them uses […]"
Although the field's overall need for standardized datasets has encouraged the dominance of language ideologies that fail to account for critical sociocultural differences between language varieties, and has resulted in technology development efforts that may not be aligned with the needs of contemporary speakers, norms in the field appear to be shifting. The recognition that technological failures continue to present barriers to the development of equitable language technologies for minoritized languages in Mexico is an important shift in the published literature on NLP. Future research can show whether this trend continues.
Biased and unrepresentative datasets
Another area where the prevalence of potentially harmful language ideologies in the field of NLP may result in a disconnect between today's speech communities and technology development is corpus development. Incomplete or mismatched data is a persistent theme in the literature. For example, researchers have noted that machine translation efforts fail to properly produce sentences that would be spoken in the target language, due to a disconnect between the types of data available and the NLP tasks to which the dataset is put: "Further, while MT makes translation quick and easy, translating also means that sentences in target languages are likely not to be representative of natural utterances spoken in those languages. As the original sources of the dataset were often conversational in nature, they may be fragments, may not always be grammatical, or may cover topics which are not commonly spoken about in the target language." (Article 45)
In this case, a lack of data suited to the task renders machine translations less useful to contemporary speech communities.
Another area of concern around datasets is the ideological nature of some of the largest existing parallel corpora. For example, Christian religious texts have frequently been used as a convenient body of Spanish-to-Indigenous language parallel corpus data, including translations of the Bible or religious vocabularies. Some research teams reported on the use of such data uncritically; however, one team acknowledged the potential for bias in the resulting output of machine translation applications, and noted subsequent attempts to mitigate this through data cleanup: "The high religious content can cause a bias in the reconstruction of the translations meaning for the neural network that when fed with non religious content, it could leak a religious word in most of the translations." (Article 25) "The text suffered an extra step to strip the religious bias replacing expressions and words to a more generalized vocabulary." (Article 30)
Although the reader is left to wonder what types of substitutions were made and how thorough the word replacement process was, this is a clear indication of NLP researchers actively seeking to identify and address bias in their practices. However, given the evidence that computational approaches to language dataset cleanup may lead to further dataset bias (Dodge et al., 2021), it is unclear how effective these efforts have been, and further research is needed in this area.
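The word-replacement step described above can be sketched as a simple dictionary-based substitution pass. The substitution table below is hypothetical (the cited authors do not publish theirs), and such lexical cleanup is crude: it misses multiword expressions and context, which is one reason its effectiveness remains uncertain.

```python
import re

# Hypothetical table mapping religiously marked words to generalized
# vocabulary; the actual substitutions used in the cited work are unknown.
SUBSTITUTIONS = {
    "sermon": "speech",
    "congregation": "community",
    "scripture": "text",
}

# Compile one alternation over whole words so replacement is a single pass.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SUBSTITUTIONS)) + r")\b", re.IGNORECASE
)

def generalize(text: str) -> str:
    """Replace each flagged word with its generalized counterpart."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(1).lower()], text)

print(generalize("The congregation heard a sermon."))
# -> "The community heard a speech."
```

Even a pass this simple illustrates the concern raised by Dodge et al. (2021): the choice of table entries is itself an editorial act that can introduce new bias into the cleaned corpus.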
Many efforts in the larger field of NLP have recently turned to the development of language-neutral translation; however, researchers working on Mexican NLP have noted that such approaches simply do not function as well for languages with very small datasets: "the common assumption that machine learning approaches for MT were language independent routed the efforts into the direction of general model improvements. But this assumption does not hold completely true." (Article 20)
Due to the poor performance of language-neutral approaches to Mexican language machine translation, researchers must explore alternate approaches that do not scale as efficiently, require deeper collaboration with field linguists and communities of native speakers, and may not be viewed as state-of-the-art.
Overall, researchers were relatively open about the limitations of both their existing datasets and the specific techniques utilized to augment or create new linguistic corpora. Because the literature consists primarily of case studies looking to outperform existing benchmarks in NLP, it is to be expected that these limitations would be noted; however, a clearer trend was observed in the more recently published works. This may be the result of a greater emphasis on transparency within the larger field of NLP, including more specific ethical protocols promoted by professional organizations such as the Association for Computational Linguistics.
Centering researchers
Despite these areas of improvement, the NLP literature on Mexican Indigenous languages by and large continues to center the research interests of academics over the interests of speakers of the languages being studied. Early articles in this field most explicitly privilege research and researchers in this way: "Documenting endangered languages offers great potential to contribute exceptional primary data for linguistic research." (Article 13) "The initial rationale to experiment with the issues related to corpus creation for endangered and low-resourced languages was also that these speech and language resources are not accessible for speech and language technology related research." (Article 10)
The documentation of linguistic data is framed as important more for its usefulness for research purposes than for members of the relevant speech communities, and data sovereignty is never directly mentioned in any of the publications surveyed. Bird (2020) has remarked upon the uncritical use of revitalization as a blanket justification for the use of linguistic data by linguists outside of the communities of speakers, and Bird and Yibarbuk (2024) have explicitly called for the centering of speech communities as NLP projects are developed. Several papers include further justifications for this work based on perceived benefit to user communities, although this typically consists of generic statements about the importance of NLP for language revitalization efforts: "Efforts made for language revitalization can benefit from advances in NLP." (Article 26)
Comments such as those above are also tied to an open data ideology that has been contested by Indigenous data sovereignty frameworks (Research Data Alliance International Indigenous Data Sovereignty Interest Group, 2019). In the following example, researchers argue that language data should always be shareable with and/or accessible to non-native speakers: "If they are not transcribed, annotated and translated the collected language resources are only accessible to the native speakers of the language or experts. If the recordings can be deciphered only by native speakers or experts, this in itself presents a problem for the low-resourced or endangered languages we work with as the numbers of the potential users of the resources steadily diminish." (Article 10)
In this case, researchers have determined that their own interpretation of need is more significant than that determined by speech communities, who may have cultural reasons for preferring to maintain control over the distribution of their cultural heritage.
More recent papers, however, offer a more nuanced and thoughtful justification for this work, couching NLP research needs in terms of a given language's speakers. For example, one research team notes the importance of their proposed method for giving agency to native speakers around their own language documentation: "Náhuatl is spoken in regions where Spanish is the dominant language. This leads native people to (in some way) forget their mother language in favor of Spanish. In this environment, the language slowly disappears or, even worse, the situation leaves the people of these remote communities excluded of the technological advancements and vulnerable to laws or services that are not written in Náhuatl. Because of this, giving a proper translation from Spanish to Náhuatl and vice versa is crucial for communities that speak this language. It is vital for various scenarios, for example: read legal documentation, acquire medicines, have a more active participation inside politics and even to help spread the language." (Article 30)
Approximately 50% of the articles in the corpus were published after the release of the CARE Principles for Indigenous Data Governance (Research Data Alliance International Indigenous Data Sovereignty Interest Group, 2019), yet none directly reference the Principles.
However, the lack of direct reference to the Principles does not mean that the themes raised by them have gone unaddressed in the NLP literature. For example, the following ethics statement addressing themes raised by the concept of Indigenous data sovereignty appears in a 2022 article: "Furthermore, research involving languages spoken by Indigenous communities raises ethical concerns regarding the exploitation of these languages and communities: 'It is crucial that the process does not exploit any member of the community or commodify the language …' In addition, members of the community should directly benefit from the research. Translation for AmericasNLI was done by either paper authors or translators who were compensated at a rate based on the average rate for translation and the minimum wage in their country of residence. Additionally, many authors are members of and/or have a record of close work with communities who speak a language contained in AmericasNLI." (Article 45)
Unlike other articles with vague statements on the supposed research benefit to the community, these authors emphasize that community benefit must be determined in partnership with community members. They declare their own positionality as members of or close collaborators with speech communities of the languages they are researching, and document their consideration of ethical labor practices. The recency of this article may indicate a shift in practice in the field toward transparency in positionality, collaboration, and labor; however, future research will help to clarify whether this is an outlier or the beginning of a trend.
It is important to note that researchers may hold intersecting positionalities, and this can inform their approach to research design. Although difficult to confirm because few authors declare their positionality as members of a related speech community, an initial analysis of the selected articles indicates that only 33% appear to include at least one author who is a member of a relevant speech community.
Ultimately, the recent trend of explicitly connecting research efforts to tangible benefits for user communities may reflect larger trends around equity and justice in NLP, topics that have received increasing attention and critique from outside the field. It is also increasingly reflected in the ethics statements of international professional associations sponsoring conferences and journals, such as that of the Association for Computational Linguistics.
Conclusion
This article explored whether and how specific social values and norms around language and technology, in particular harmful ideologies of language, have informed the work of NLP researchers working on Mexican Indigenous languages. These efforts build upon the literature addressing bias and fairness in NLP by examining minoritized languages in Latin America, and expand upon the contexts in which language ideologies are explored. Findings indicate that language standardization, unrepresentative datasets, an open data ideology, and a general centering of researchers over native speakers are common in this body of literature; however, trends indicate a clear move toward a more thoughtful engagement with ethical protocols addressing these themes, as well as a growing interest in bias mitigation.
Ultimately, the evidence of harmful language ideologies and other biases and assumptions in this literature points to larger political-economic forces at play, including the relationship between academia and industry, systemic inequities in global scholarly publishing (Inefuku, 2021), changing incentive structures in Latin American higher education (Deere, 2018), and the commodification of language in the digital environment (Dobrin et al., 2009). These factors have collectively created the conditions within which researchers are working, collaborating, and presenting their findings. The push and pull experienced by researchers working in this and other disciplines means that change is slow, particularly with respect to the deeply embedded assumptions that data science is a neutral field and that any discussion of sociocultural or political concerns politicizes an apolitical space (Green, 2021). The field of NLP for Mexican languages is changing, but it remains to be seen how far norms will shift and whether the shift will have a meaningful impact on technology development in the service of speakers of minoritized languages in the country and region.
However, professional organizations such as the Association for Computational Linguistics are beginning to establish tracks and preconferences for cultural considerations in NLP. Although certainly not a panacea, the shifting of norms in the field toward an acknowledgement of technology as a shaper and amplifier of the political-economic conditions of language in the digital environment may further bear fruit by setting expectations for the graduate students who go on to move the field forward. In addition to special tracks, other organizations, such as EACL, have established reviewing committees to evaluate submissions that have been "flagged as problematic by the reviewers or the ACs during the review process" (EACL, 2023). Future work examining these conferences, particularly surveys and interviews of peer reviewers to understand how they navigate and apply ethical guidelines in their reviews, would contribute empirical evidence about the guidelines' efficacy. Additional work might include reviews of conference proceedings pre- and post-implementation of conference ethical codes, to understand whether the guidelines have resulted in meaningful shifts in practice. For example, was there a meaningful increase in the percentage of papers noting whether data annotators were compensated, or in which the authors discuss the ethical implications of the use of identity-related human characteristics in their datasets and/or resulting applications? Such evidence may provide important insights into the degree to which organizational shifts are able to push the larger field of NLP toward a more socioculturally and politically aware practice.
Limitations
The number of distinct but related fields contributing to the advancement of natural language processing technologies complicates any literature search. Although the search was scoped primarily to journals in CL, NLP practices are relevant to numerous other scientific fields, and a broader search spanning more disciplines may have retrieved additional relevant articles with a more interdisciplinary perspective.
Because of the nature of language ideologies as normative, commonsensical concepts, coding for harmful language ideologies is a subjective task, making it difficult to employ standard coding practices seeking to establish reliability through the use of multiple coders. While coding was conducted by a single researcher, this limitation is addressed by providing ample direct quotations from the corpus, allowing readers to view and evaluate the data directly. The empirical study also follows from an ample literature review demonstrating the presence of harmful language ideologies in NLP work; this study simply draws some of them out. It is not intended to be exhaustive.
An additional limitation involves the customary format of case study and experiment-driven linguistics papers: there is not always an obvious place for authors to include normative information about the data and processes employed. It was therefore unclear whether researchers who indicated no consideration about bias and its harms failed to consider these issues entirely, or simply omitted them from the papers due to standard practices in their discipline. This is a question for which follow-up fieldwork may provide clarity.
Supplemental Material
Supplemental material (sj-docx-1-bds-10.1177_20539517251406184) for "Missing standardization": Identifying harmful language ideologies in natural language processing work by Melissa Gasparotto in Big Data & Society.
Ethical approval and informed consent statements
Not applicable.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
Not applicable.
Supplemental material
Supplemental material for this article is available online.
