Abstract
Data and data modeling practices in natural language processing (NLP) privilege languages with the greatest levels of representation online, preventing speakers of the overwhelming majority of the world's languages from enjoying the full benefits of digital inclusion. These dynamics have been explored in the context of historic colonial relationships, with linguists and language activists noting the colonizing discourses around the adoption of language-related technologies, and highlighting problematic research practices among linguists and computational linguists working in the area of digital corpus development for minoritized and Indigenous languages. Such power differentials raise questions about the source(s) and nature of language technology development bias in Global South contexts. To that end, this paper is divided into three parts. The first addresses the identification and roots of bias in NLP as it is currently understood, emphasizing the ways it is expressed in Global South contexts, and in Latin America in particular. The second briefly discusses the theoretical frameworks of data colonialism and Indigenous data sovereignty as mechanisms to clarify how bias in NLP operates in these contexts. In the third section, Indigenous methods are proposed to investigate how these frameworks can help illuminate a research ethics around NLP for minoritized languages. Although there is an abundance of literature addressing ethics in NLP, no work to date has utilized an Indigenous methods framework to examine bias in the field, even as it relates specifically to Indigenous language data and technologies. This work argues that Indigenous methods can ultimately yield deeper insights into the concepts of bias and fairness in NLP, as well as the motivations and decisions employed in the development of language technologies, thereby serving as a more effective generator of equitable solutions while advancing theory development.
Introduction
Defining natural language processing and related terms
It is helpful at the outset to provide a couple of definitions. Machine learning (ML) is a statistical process by which patterns in an existing machine-readable dataset are used to infer new data (Suresh & Guttag, 2021). Natural language processing (NLP) uses ML on linguistic datasets to infer new linguistic data approximating or mimicking human-produced sentence constructions/utterances, or to generate computer code (Weidinger et al., 2021). Much of NLP work focuses on the development of language models (LMs), which are systems capable of generating sophisticated sentence constructions based upon the statistical likelihood of text combinations in their training data (Rosenfeld, 2000). LMs can be optimized for various types of output, including dialogue, and applications include language agents such as chatbots, autocomplete and machine translation. Inasmuch as these applications are being adopted in high-stakes arenas, they also raise a number of questions about public policy. In particular, concerns around the skewed nature of LM training datasets and resulting biases in text generation have been raised, especially in the context of minoritized individuals and communities around the globe (Bender et al., 2021; Dodge et al., 2021; Gururangan et al., 2022).
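To make the phrase "statistical likelihood of text combinations" concrete, the following is a minimal sketch of a bigram model, a deliberately simple relative of the LMs discussed here. The toy corpus and the function names `train_bigram_counts` and `next_word_probabilities` are illustrative assumptions and are not drawn from any of the cited works.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Count how often each word follows another in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair every token with the token that immediately follows it.
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probabilities(counts, word):
    """Return the relative frequency of each word observed after `word`."""
    followers = counts.get(word.lower())
    if not followers:
        # The model has nothing to say about words absent from its data.
        return {}
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}


# Hypothetical three-sentence training corpus (an assumption for illustration).
toy_corpus = [
    "the language is spoken here",
    "the language is written here",
    "the language is spoken widely",
]

counts = train_bigram_counts(toy_corpus)
probs = next_word_probabilities(counts, "is")
```

In this sketch, `probs` ranks "spoken" above "written" only because it appears more often in the toy corpus, and any word missing from the corpus yields no prediction at all. This is one simplified way of seeing how the composition of training data shapes what a model can generate.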
At this point it is helpful to describe how the practices and infrastructure of language technology development play out in Global South, and specifically Latin American, contexts. The following section details the normative basis of linguistic corpus development and NLP for Indigenous languages in Latin America and around the globe.
NLP in the Global South
Across Latin America, a rapid increase in the amount of digital material available in minoritized Indigenous languages spoken in the region has been observed (Cassels, 2019; UNESCO, 2015), and a number of linguistic corpora and LMs are currently under development for these languages throughout the region (Mager et al., 2018). This activity raises questions relating to whether and how this work is guided by protocols around the normative decisions made therein, which may be “shaped by historical legacies of colonial ways of knowing, and representing, the world” (Philip, 2021, p. 91). Although exceptional strides have been made in fostering the digital production and distribution of cultural heritage materials by and for Indigenous language communities, they sidestep a broader problem, which is that the very idea of Indigenous production as a site for knowledge extraction in bulk by tools and research practices in the Global North remains problematic.
As elsewhere, Latin America is the site of complex negotiations around Indigenous identities in the globalized environment of the web. These identities are dynamically negotiated within power structures imposed by and inherited from colonial systems, which cast all native peoples (both as individuals and communities) as existing outside of modernity and technological advancement (Gómez Menjívar & Chacón, 2019). Colonial legacies continue to obscure the historically contingent nature of the concept of indigeneity itself, frequently resulting in essentialist and overly generalized statements about the cultural contexts of differing communities.
A number of researchers have taken a critical look at the situated practices around Indigenous language technology development from the perspective of field linguistics. This includes addressing “colonizing discourses” (Bird, 2020, p. 3504) around the adoption of technology related to language, critiquing rhetoric around “discovering” and “saving” a previously undocumented language, and noting a tendency among linguists working on language revitalization to “prioriti[ze] data capture over local self-determination” (Bird, 2020, p. 3505) with data collection techniques that extract and strip data of their local context. Others note that the “discourse of language endangerment” (Dobrin et al., 2009, p. 38) operates in tension with the moral rationale for language revitalization through a tendency to reduce and commodify linguistic data. Linguists’ search for “pure” forms of Indigenous languages to record for study is itself a colonial attitude, positioning languages as unchanging objects to be acquired by external observers. Linguists may carry with them the unstated ideology that digitization of a language grants a certain relevance in a hierarchy, indicating a disrespect for oral and analog communication traditions that communities may value (Eisenlohr, 2004). Further, language speakers and linguists may also have differing understandings of what constitutes a language in the first place, who should have access to it, and what should be preserved and/or revitalized. All of these ideologies remain salient in the context of NLP work.
This can be seen, for example, in the types of languages that are researched in NLP to begin with. There are hundreds of distinct Indigenous languages spoken across the Latin American region exhibiting a huge diversity of linguistic phenomena, and most are experiencing patterns of language shift that will likely result in extinction (Mager et al., 2018). However, there are only limited efforts to employ NLP technology to aid in language conservation and revitalization (Joshi et al., 2021). NLP work on specific minoritized languages is hindered by current trends in the profession to develop language-neutral models that do not require substantive context-sensitive data work. However, Mager et al. (2018) note that a lack of knowledge about local languages limits the utility of such models, as the sheer diversity of language typologies and practices in the region resists standardized treatment with the methods popular among computational linguists today. In particular, Indigenous language speakers may often be bilingual, and code switching, common among speakers of Indigenous languages, presents unique challenges for NLP applications (Bird, 2020; Mager et al., 2018). Further, as Schwartz (2022) notes, “nearly every successful NLP technique in widespread current use was designed around the linguistic characteristics of English” (Schwartz, 2022, p. 3), a factor which highlights the culturally-inattentive language-as-data approach to NLP that continues to be pervasive in the field. The resulting preference for the study of dominant global languages in NLP contributes to a “typological echo-chamber” (Joshi et al., 2021, p. 1) wherein the majority of language typologies are unrepresented in language technologies and receive low attention from researchers.
In summary, the Latin American and broader Global South context shows that NLP efforts are minimal for most languages, but even when projects are undertaken they may be divorced from culturally specific approaches to data that would result in greater levels of access to digital language technologies for communities that are currently unsupported. Although linguists and NLP practitioners are beginning to critique culturally-insensitive practices, they remain the norm.
Frameworks for understanding NLP in the Global South
Data colonialism
Data colonialism (Couldry & Mejias, 2019a, 2019b) is a recent analytic that has been used to understand the ways in which capitalist and colonial relationships and logics are replicated and augmented in the context of digital data, through the unique exploitation of cognitive resources. Like colonialism, data colonialism is not just economic and territorial; it is also a knowledge enterprise, establishing legitimacy over how we come to know about ourselves and our environments. Because the concept of data colonialism (and postcolonialism more generally) is a dominant mode of theorizing sociotechnical systems by scholars from or based in Latin America (see, for example, Lehuedé, 2021; Milan & Treré, 2019; Morales & Reilly, 2023; Segura & Waisbord, 2019; Tait et al., 2022), where technological systems are entangled with and may compound existing inequalities resulting from colonial histories, it is essential to engage with the concept in the context of NLP for minoritized languages in the region.
Despite critique for ascribing a universality to colonialism that erases particular historical contexts (Calzati, 2021; Gray, 2023), the framework of data colonialism is nonetheless valuable for clarifying the high stakes of predatory inclusion in the digital economy (McMillan Cottom, 2020): the goal of equitable treatment of Indigenous language content and data brings with it material and epistemic dangers in the form of extraction and appropriation of cultural heritage with clear parallels to harmful colonial practices.
Indigenous data sovereignty
Emerging originally in the Australian context, Indigenous data sovereignty (IDS) is a concept establishing the role that Indigenous communities should play in the creation, stewardship and use of data about themselves and their histories (Maiam nayri Wingara Indigenous Data Sovereignty Collective, n.d.), putting forward the position that context-sensitive ontological and epistemological frameworks must be accounted for in the treatment of Indigenous data, which includes linguistic data. The concept has been operationalized through the CARE Principles for Indigenous Data Governance (Research Data Alliance International Indigenous Data Sovereignty Interest Group, 2019), which establish guidelines for enacting IDS in practice.
The Principles’ authors note that the move toward open science and data sharing has not sufficiently contended with the history of extraction of Indigenous resources, or with Indigenous perspectives privileging collective over individual benefit, which they seek to mitigate with meaningful and measurable benchmarks for a thoughtful engagement with culturally sensitive data. These include empowering communities to determine how such data shall be collected and used, a mandate to use Indigenous data in support of collective benefit and self-determination, and a requirement to evaluate any potential harms stemming from the collection and use of community data (Research Data Alliance International Indigenous Data Sovereignty Interest Group, 2019).
Although only recently formalized as a concept, the general idea of data sovereignty for language data has been in circulation for some time and is at least partially captured in professional guidelines such as the Linguistic Society of America's 2009 Ethics Statement, which states: “Some communities regard language, oral literature, and other forms of cultural knowledge as valuable intellectual property whose ownership should be respected by outsiders; in such cases linguists should comply with community wishes regarding access, archiving, and distribution of results. Other communities are eager to share their knowledge in the context of a long-term relationship of reciprocity and exchange” (Linguistic Society of America, 2009, p. 3).
Although the Association for Computational Linguistics has no similarly targeted guidance for researchers and the concept of IDS is not well known to computational linguists and data scientists in the NLP field (Flavelle & Lachler, 2023), some interdisciplinary teams working on Latin American languages have referenced these and related principles.
In the broader field of linguistics, the recently published Open Handbook of Linguistic Data Management (Berez-Kroeker et al., 2022) includes a chapter on working with Indigenous languages that addresses the CARE Principles, as well as the challenges posed to data sovereignty by the open data movement.
Using Indigenous methods for the study of NLP
Bhattacharya (2009) argues that “there is no purist decolonizing space devoid of imperialism but spaces where multiple colonizing and resisting discourses exist and interact simultaneously” (Bhattacharya, 2009, p. 105). Indigenous-informed research methodologies (Carroll et al., 2021; Kovach, 2021; Smith, 2012) speak to this complex relationship between colonizer-colonized and researcher-researched, and provide a particularly appropriate and culturally sensitive approach when researching the treatment of Indigenous language data within a professional field known to focus almost exclusively on hegemonic languages. Like other critical research methods, Indigenous methods call for the clear acknowledgement of the resulting power dynamics with research communities. They require the joint construction of research design through processes of respect and reciprocity, and call researchers to account for the equitable distribution of their findings, given the systemic barriers to publishing faced by BIPOC individuals in academia.
I’d like to close with a reflection on how Indigenous-informed methods can expand the theoretical toolkit we’re working with to more deeply engage with linguistic equity in the digital sphere. As a socioeconomically advantaged and White researcher operating in the Global North, I acknowledge the role that my own positionality, and power as an investigator, plays in the construction of research design, interactions with research participants, and the selection and interpretation of data. Indigenous-informed methodologies are a way to foreground consciousness around these power dynamics that have so often resulted in inequitable research design, execution and results that marginalize the communities about whom the work is created. Conceptualizing research under this framework is one method for addressing such disparities and inviting a wider conversation about what research is and how it can be made meaningful in broader ways, and settler colonialist researchers can and should contribute to ethical research practices by learning from and utilizing the principles of Indigenous methods in their work to advance these efforts. Rather than list those principles here, which can be found in a number of resources (Carroll et al., 2021; Kovach, 2021; Smith, 2012), I provide some reflections on how their operationalization results in new ways of approaching the research of NLP for Indigenous and minoritized languages.
First, Indigenous methods call us to resist essentializing Indigenous and indigeneity as concepts, by working closely with individuals and communities to understand the historically situated nature of specific contexts. Research should be grounded in the histories and actual experiences of Indigenous communities, which will help prevent the inadvertent formulation of research in ways that reproduce coloniality. For example, the assumption that digitality has no corresponding native epistemology in Latin America is challenged by work showing that “Western technologies like the Internet and the global interventions it offers fit comfortably within Maya philosophies” (Gómez Menjívar & Chacón, 2019, p. 34).
Indigenous methods also call us to “foreground relationships with land, water, and the nonhuman world” (Hearne, 2017, p. 11). They call us to recognize the way that land and the physical environment are used as metaphorical denominators of digital activity, distracting from the very real material consequences for humankind. Terms like “the cloud” serve to distance the digital from its intensely physical footprint of data centers, for example, which require an enormous amount of water to cool servers running 24 hours a day, often in areas with weak watersheds. In reference to digital language technologies, researchers acting in solidarity with Indigenous communities are called to resist that abstraction and emphasize the material conditions of their production and use. We are invited, for example, to consider the strides Google has made with autotranslation services for the Andean Indigenous languages Quechua and Aymara alongside its placement of data centers in water-stressed regions such as Chile, over the protests of environmental activists. Likewise, the expansion of digital connectivity to rural Indigenous areas must also be considered in light of the Indigenous labor enabling the tech industry to grow, such as Diné women's labor in the Fairchild Semiconductor factory.
In relation specifically to language technologies, we are invited to consider how they may establish particular ways of knowing that may not be consistent with how linguistic communities conceptualize their languages. Culturally specific ontologies of language among Indigenous communities challenge the Western assumption that language itself is an “autonomous medium of denotational code” (Hauck, 2022, p. 4) that can be separated from embodied contexts. Yet recent research also “complicate[s] the bifurcation of digital and analogue materiality,” suggesting that these language ontologies are currently undergoing shifts as some communities develop “land-based cyber-pedagogies” around language learning (Caranto Morford & Ansloos, 2021, p. 302). These distinctions and contradictions remain particularly salient in the current environment, where nascent AI technologies have forced a conversation about the onto-epistemological status of texts generated by LMs. This debate is not yet fully inclusive of Indigenous perspectives, and Indigenous methods would call for a fuller accounting to problematize and deepen the theoretical landscape.
Indigenous methods also ask us to call into question the ideologies present in the interfaces through which we access linguistic content. Are language learning apps utilizing ML and LMs designed as communal tools, or do they, like Duolingo, tacitly view language as something that can be acquired without community context? Is it appropriate to assume that all languages can and should be taught through tools within such an ideological framework?
And finally, we are also called to operationalize IDS into our research work. Culturally specific ways of relating to linguistic data may run counter to current trends toward open data in the sciences and social sciences, as well as practices in NLP. Speech communities may object to the broad distribution of their cultural heritage out of context (Eisenlohr, 2004), and therefore consider standard linguistic corpus development practices in NLP such as web scraping to be non-reciprocal and extractive in nature. Data sovereignty also asks us to critically analyze how Indigenous languages became digital in the first place. What is the history of the dataset? Who collected the data and under what paradigms? Was the documentation work done within the community or by external researchers, and were those relationships reciprocal? How do the language communities feel about the process that resulted in the datasets and do they feel that stakeholders have been appropriately consulted? In short, it becomes very clear through a consideration of these methods that identifying and proposing solutions to bias in NLP goes well beyond the suggestion of workflow changes in the tech industry, and ultimately requires researchers developing ethical protocols to broaden their methodological toolkit from the perspective of the communities impacted.
Footnotes
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
