Abstract
The problem of data extraction to feed LLMs impacts all digital archives and open-source initiatives. Though the practice of using data-scraping bots to build LLMs is recent, extractivist models of language documentation are nothing new. To redress the challenges presented by extractivist methods, Indigenous language scholars and linguists as well as digital archivists and librarians have innovated methodologies to provide for a mutually sustaining relationship between language documentation efforts and indigenous language practices with the goal of enhancing language persistence practices and recuperating dormant practices. The Digital Archive of Indigenous Language Persistence (DAILP) at Northeastern University provides one effort to redress extractivist methods using Indigenous Data Sovereignty methodological frameworks. In this essay, I describe the CARE heuristic in relation to DAILP’s goals and methods to support the decision making required to meet the challenges of language extractivism. Though data scraping for LLMs presents ongoing challenges, Indigenous data-sovereignty methodologies allow Indigenous communities and digital archives to persist with their languages.
Keywords
The Problem Generally
Indigenous language archives result from relationships with tribal community members that can take decades to cultivate and maintain. Indigenous language speakers and storytellers in communities share their linguistic, historical, and cultural information to help future and present indigenous peoples continue with language practices. This painstaking work is built on relationships to land, language, history, and people. Translation teams in tribal communities, language teachers, and education departments of tribes have relationships to the archival materials and lexical data that they share. These relationships to language, history, place, and practice coalesce over time, building tribal sovereignty through the sharing of story, knowledge, and language practice around digital archival materials. One harm of artificial intelligence (AI) scraping bots rests in their reckless disregard and erasure of these relationships. The people whose language data are extracted are typically unacknowledged, excluded from the practices and site materials they curated, and largely uncompensated by the extraction bots or the third parties who purchase the extracted data (Cushman et al., forthcoming).
Early digital archives of documents and audio in an archive’s material holdings went largely unnoticed by generative AI language-data scraping bots. The bots are not able to decode the indigenous language used in digitized documents or audio files. Handwritten indigenous language documents and voice recordings are particularly difficult to decode without humans typing transcriptions of each word and phrase. Indigenous language specialists must isolate the words in the digital media using transcription, translation, and discussion. Teams of indigenous language speakers, linguistic analysts, software engineers, and community members interact together to create relationships with each other in and through the language found in the digital images of handwritten indigenous language documents and/or audio files. AI bots and large language models (LLMs) cannot build relationships through the practice of language that post-custodial digital archives can, but AI bots love to extract the information produced from these relations.
Post-custodial digital archives of indigenous languages draw on digital images of indigenous language documents curated in museum and library digital archive collections to create secondary resources, including dictionaries, language learning and practice materials, and translation sites (Cushman & Trevino, 2021; Frey, 2020; Baldwin et al., 2016; Carpenter et al., 2021; Holton et al., 2022; Link et al., 2021; Lukaniec, 2022; Snead & Cushman, 2023). Digital data from museums and libraries gathered into post-custodial archives are initially unstructured. The data become well-structured when indigenous people collaborate to identify words and provide transcriptions, pronunciations, and general translations of each word. Language specialists and students then create spreadsheets or code that ingests lexical data sets and annotations into the computational back end. Front-end users interact with a post-custodial dictionary or digital archive to select information to display. Users can view dictionary entries, community-developed transcriptions and translations, metadata, commentary, audio files, and language parsing information on a word-by-word basis or for entire documents at a time.
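To make the spreadsheet-to-back-end step concrete, the following is a minimal sketch in Python of how annotation rows might be checked before ingestion. It is not DAILP's code: the column names, the `read_annotation_rows` function, and the sample rows are assumptions made for illustration, and the only language data used is the word and gloss that appear in this article.

```python
import csv
import io

# Hypothetical column names; a real project spreadsheet may differ.
REQUIRED_COLUMNS = ("document", "position", "source_text", "translation")


def read_annotation_rows(csv_text):
    """Split spreadsheet rows into complete and incomplete annotations.

    A row counts as complete only when every required column is filled in,
    so that unfinished community work is held back rather than ingested.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("spreadsheet is missing columns: " + ", ".join(missing))

    complete, incomplete = [], []
    for row in reader:
        is_complete = all((row.get(c) or "").strip() for c in REQUIRED_COLUMNS)
        (complete if is_complete else incomplete).append(row)
    return complete, incomplete


# Illustrative sample: one finished row and one row still awaiting annotation.
SAMPLE = (
    "document,position,source_text,translation\n"
    '"Story of Welling, Oklahoma",2,ᏗᏃᎶᎩᏍᎩ,farmers\n'
    '"Story of Welling, Oklahoma",3,,\n'
)

complete_rows, incomplete_rows = read_annotation_rows(SAMPLE)
print(len(complete_rows), "complete;", len(incomplete_rows), "awaiting annotation")
```

Separating complete from incomplete rows reflects the workflow described here, in which people, not software, decide when a word's transcription and translation are ready to display.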
The finely tuned data produced in post-custodial digital archives can potentially help LLMs train, and because of this, post-custodial sites attract scraping bots. Post-custodial digital archives advance the LLMs’ growth by facilitating the LLMs’ identification of the relationship between a digitized archival image or audio file and its transcribed form. Post-custodial archival websites take decades to develop from teams of experts working together to analyze archival data. For post-custodial indigenous language archives, the stakes of losing control of this post-custodial data to AI bot scraping are high.
Ruinous Irony of Extractivism
To illustrate the scope of harm done by AI scraping bots, let me show the language practice behind just one word from one story appearing in Cherokees Writing the Keetoowah Way, a digital edited collection found in the Digital Archive of Indigenous Language Persistence (DAILP) (https://dailp.northeastern.edu). DAILP fosters knowledge-keeping through language practices with indigenous community collaborators. DAILP provides a reading and writing interface for indigenous peoples to practice their languages using language documents and audio files found in libraries, museums, and archives. Users can select a story in Cherokees Writing the Keetoowah Way, such as the Story of Welling, Oklahoma, that draws on digitized images of a ledger housed in the Yale Beinecke Library. Figure 1 shows the first line of the Story of Welling, Oklahoma.

Figure 1. The first three words of the Story of Welling, Oklahoma, in Cherokees Writing the Keetoowah Way, written in the Cherokee syllabary.
After choosing the Story of Welling, Oklahoma, a user can then select more information about any word in that story. Figure 2 shows the word-pane display of the second word in the Story of Welling, Oklahoma. The word’s boundary was identified by a human familiar with the syllabary. With the word boundary isolated, a transliteration is entered using the Roman alphabet. Guided by the document’s free translation into English, which Cherokee language speakers provided, Cherokee language students, language specialists, and community members added detailed annotations about each word in this story, including its colloquial translation into English, its phonemics, morphology, and etymology. All this highly curated data, for one word in one story, becomes a single entry in the lexical data set collected in the back end to then be displayed in a word pane (Figure 2) in the DAILP reading interface.

Figure 2. Word pane information displayed in the DAILP reading interface for the Cherokee syllabary word for farmers, ᏗᏃᎶᎩᏍᎩ.
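The layers of annotation behind a single word can be pictured as one structured record. The sketch below, in Python, is an illustration only: the `WordEntry` and `MorphemeSegment` classes and their field names are hypothetical and do not reproduce DAILP's data model. Only the syllabary form and the gloss "farmers" come from the word pane described here; the remaining fields are left empty rather than filled with invented linguistic data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MorphemeSegment:
    """One analyzed piece of a word, with its grammatical or lexical label."""
    form: str
    gloss: str


@dataclass
class WordEntry:
    """Hypothetical record for one word in one document."""
    source_text: str                 # the word as written in the syllabary
    document_title: str
    position: int                    # 1-based position of the word in the document
    english_gloss: Optional[str] = None
    transliteration: Optional[str] = None
    phonemic_form: Optional[str] = None
    segments: List[MorphemeSegment] = field(default_factory=list)
    commentary: List[str] = field(default_factory=list)

    def missing_annotations(self) -> List[str]:
        """Name the annotation layers that people have not yet supplied."""
        checks = {
            "english_gloss": self.english_gloss,
            "transliteration": self.transliteration,
            "phonemic_form": self.phonemic_form,
            "segments": self.segments,
        }
        return [name for name, value in checks.items() if not value]


# Only the syllabary form and gloss below are taken from the article.
entry = WordEntry(
    source_text="ᏗᏃᎶᎩᏍᎩ",
    document_title="Story of Welling, Oklahoma",
    position=2,
    english_gloss="farmers",
)
print(entry.missing_annotations())
```

Each empty field in such a record stands for work that only speakers, learners, and specialists can do together, which is the relational labor that scraping bots extract without acknowledgment.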
Indigenous language data are parsed to help Cherokee speakers and language learners understand how morphemes work in indigenous languages like Cherokee, and the complexity of these data draws the unwanted attention of AI data-scraping bots. Word and morphemic boundaries identified by users provide collocation information, grammatical information, and morphemic analysis of words gathered into finely tuned lexical data sets that scraping bots seek to extract. LLMs are hungry for language data that are already finely tuned, like the language data one might find in post-custodial digital archives. AI data-scraping bots swarm post-custodial indigenous language data for content. These swarms take down digital archives by hitting them with millions of automated data requests each second. Between March and November 2025, the DAILP site was routinely swarmed by data-scraping bots. Users accessing the DAILP site during these attacks received denial-of-service errors.
The ruinous irony: The more indigenous people learn and practice their languages in sites like DAILP, the more they draw the attention of AI scraping bots. The more that post-custodial indigenous language practices create community-based translations of archival materials to persist with indigenous languages, the more vulnerable these sites become to data scraping. The reverse is also a ruinous irony for AI data-scraping bots—the more AI bots swarm post-custodial archives, the less well-curated data these post-custodial sites can produce because the sites that sustain language practice are no longer accessible to those who want to practice the language.
The ruinous irony of extractivism emerges at the intersection of three vital areas of written communication research: digital archiving of post-custodial collections, indigenous language recuperation and persistence, and language studies (including writing research, literacy research, sociolinguistics, linguistics). The complex problem space for indigenous data sovereignty in post-custodial indigenous language archives stems from:
Extractivist histories of knowledge production and
Extractivist practices of LLM data scraping bots.
The problems posed by language data-extraction bots, structured as they are by imperial legacies of thought, can be remedied with methodologies attentive to indigenous language data sovereignty, especially in post-custodial digital archives. In this article, I describe the CARE principles offered by scholars working within Indigenous Data Sovereignty and suggest how the DAILP team is deepening our adoption of these principles. Along the way, I distinguish between data sovereignty and indigenous data sovereignty, and I identify the harms posed by AI language scraping bots. I end the article with a heuristic to support decision making that other scholars might find useful. 1 Let me begin with the extractivist premise and practice of generative AI to scope its harm for indigenous peoples’ language data.
Extractivist Legacies of Thought
The computational problems of indigenous language-data scraping are not new. Extractivist thinking about data and documenting indigenous languages has been present for centuries. The sciences have a longstanding practice of taking indigenous knowledge of plants, lands, and the natural environment to create pharmaceuticals, to extract precious minerals and “natural resources,” and to document languages to assert governmental and epistemological control over indigenous people’s lifeways (Smith, 2022). For example, Wesley Leonard (2017, 2019, 2021, 2023) demonstrates the imperialist notions of indigenous languages that underpin linguistic analysis, language documentation, and extractivist practices of linguists. Extractivist thinking and methods have been at the heart of scientific methodologies and are now being addressed (Held, 2023; Wråkberg & Granqvist, 2014).
Meanwhile, Indigenous scholars and community members have been practicing and documenting the need for principles and practices of Indigenous Data Sovereignty to address extractivist data mining (Kukutai & Taylor, 2016; Rainie et al., 2019; Ruckstuhl, 2021; Walter et al., 2020, 2021). “The articulation of Indigenous Peoples’ rights and interests in data about their peoples, communities, cultures, and territories is part of reclaiming control of data, data ecosystems, data science, and data narratives in the context of open data and open science” (Carroll et al., 2020). The indigenous data spoken of here absolutely includes language data and lexical data sets. Indeed, Indigenous scholars have long demonstrated the relationship of storytelling and language as central to the sovereign identities of Indigenous peoples (Chew et al., 2019; Coronel-Molina & McCarty, 2019; Jones Brayboy & Maughan, 2009; McCarty et al., 2018; McKinley Jones Brayboy et al., 2015; Teuton, 2007; Teuton et al., 2023). Increasingly visible across the humanities and social sciences, community-based digital archiving presents important means for recuperating suppressed language practices and epistemologies. As such, Indigenous data sovereignty in community-based digital archives demands a grounding in the CARE Principles for Indigenous Data Governance as these principles apply to the quotidian reading and writing practices of language persistence. In the next section, I give an overview of the methodological principles DAILP uses to ground our work in the purposes, audiences, and reasons indigenous communities have for their writing in their own languages and on their own terms. My goal is to suggest how the DAILP team CAREs for indigenous language data sovereignty in this community-based archive in the context of LLM data scraping.
Indigenous Data Sovereignty Methodology
Indigenous language-data sovereignty provides a methodological blueprint for language persistence work undertaken by community-based indigenous language post-custodial digital archives. These methodological principles for working with indigenous community data (including language data in any modality) have been rendered into CARE principles by the Global Indigenous Data Alliance (CARE Principles, 2023). The CARE principles acknowledge the rights of Indigenous Peoples and nations to govern the collection, ownership, and application of their own data. In doing so, community-based archives operate under four CARE principles about curating data and knowledge-making with indigenous communities.
Indigenous peoples have inherent rights to govern their peoples, land, languages, knowledge, and resources.
These rights are positioned within human rights, court cases, treaties, and recognitions.
Practices of knowledge making have their genesis in the roles, responsibilities, and longstanding practices for the use of community-held information.
The knowledge generated from these practices belongs to the collective and is fundamental to who we are as peoples.
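The CARE Principles are operationalized locally and regionally by tribal groups who determine for themselves practices for data generation, collection, and use. The CARE principles ensure that indigenous data sovereignty is enacted through methods that secure the tribal community’s collective benefit; authority over its rights and interests; responsibility, reciprocity, and relationship to the data; and ethics that maximize community benefits and ensure future use.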
The CARE Principles are distinct from data sovereignty principles. Indigenous data sovereignty emphasizes indigenous peoples’ collective rights, expertise, and relationships to the data. Data sovereignty tends to privilege an individual’s rights over their private data, ownership of data, and fair use of data for the public at large (Cushman, Tarpalechee, Rivard). The DAILP team has been following protocols set forth in community-based design and indigenous data sovereignty scholarship following the CARE Principles. These principles have become all the more urgent in the context of AI scraping bots.
CARE Principles in the DAILP Reading Interface and Translation Interface
As a post-custodial digital archive, DAILP’s twofold goal is to sustain indigenous language practices (e.g., reading, writing, speaking, and listening) and to enhance documentation of indigenous languages. The initial translation work of community-based teams provides paths and guardrails needed for linguists, students, and community members to parse the words and annotate them. Language specialists and learners provide morphemic and grammatical analysis with reference to secondary sources, thereby deepening the lexical data set displayed. Translation and audio files created with indigenous language speakers are analyzed with community-based teams of teachers, scholars, and learners to ensure that tone and accent are fully represented, proving especially important for breathing life into and documenting words, many of which are rarely spoken. These languaging practices lead to digital edited collections displayed on the DAILP public-facing website (https://DAILP.northeastern.edu). The digital edited collections produced by DAILP teams gather together indigenous language documents, translations, and audio files in a reading interface for the collective benefit of everyone trying to learn and persist with indigenous languages.
Supported with grants from the National Archives and from the Henry Luce Foundation, the first digital edited collection DAILP produced is publicly accessible at Cherokees Writing the Keetoowah Way (https://dailp.northeastern.edu/collections/cwkw). This digital edited collection presents a reading interface for 87 fully translated and deeply annotated Cherokee language documents (Cushman et al., 2023). Readers can select from public notices, letters, stories, and governance documents chosen by the United Keetoowah Band of Cherokee Indians for translation and display. Curricular materials are offered at the end of the collection. Front matter and chapter introductions written by Ellen Cushman (Cherokee Nation), Rachel Jackson (Cherokee Nation), and Benjamin Frey (Eastern Band of Cherokee Indians), together with resource pages, help readers to understand the context for documents and the collection and to search the database. A team of Cherokee speakers and translators from the United Keetoowah Band of Cherokee Indians provided the audio and first translations of documents. A larger team of Cherokee community members from the Cherokee Nation and Eastern Band of Cherokee Indians, as well as learners, teachers, and scholars, provided the secondary set of sources.
Let me briefly overview how DAILP’s reading and translation interfaces operationalize the CARE principles and then offer implications.
The CARE Principles applied to the Digital Archive of Indigenous Language Persistence.
Collective benefits of DAILP
DAILP supports indigenous language communities by creating mutually sustaining relationships between language persistence and documentation meaningful for all who contribute to the site. The collective benefit provides even more resources for community members, teachers, scholars, speakers, and learners to use in deeply annotated post-custodial documents with audio recordings at the overall document level. The glossary provides a search tool and index for each word that appears across the translations and dictionaries ingested into the DAILP lexical data set.
Authority over the rights and interests
Authority over the rights and interests of the indigenous language materials (e.g., audio files and language documentation) produced in the initial stages of DAILP’s development rests with the United Keetoowah Band of Cherokee Indians tribal speakers and translation team members who determine how data displays should be designed and accessed. In other words, the reading interface for the digital edited collections was created with the rights and interests of the translators first and foremost. The reading interface of DAILP and all collections it contains are licensed under a Creative Commons Attribution-NonCommercial 4.0 license.
Responsibility, reciprocity, and relationship to the data
All data developed within or contributed to DAILP is stored either on Northeastern’s AWS servers or in repositories maintained by institutional partners. As part of its initial data-management plan, all DAILP data and code is made available to support indigenous language learning, translation, and apprenticeship programs. DAILP code and workflows are provided on GitHub (https://github.com/NEU-DSG/dailp-encoding). DAILP translations are printable, audio data are downloadable, and collections are searchable through the glossary that recognizes orthographies, per our contributors’ requests.
Ethics to maximize tribal community benefits and ensure future use
The reading interface was designed to provide language resources for upstream uses to advance indigenous language learning, teaching, and documentation practices. Scholars can draw upon the translations and richly detailed annotations to develop scholarly articles and creative works. As DAILP has begun to be shared among community members in and around Tahlequah, the Qualla reserve, and Cherokee book clubs, individuals have shared how they’re using the collection on DAILP to enhance their pronunciation, reading, and writing abilities. Audio files are played back to help learners understand the syllabary and pronounce words. Word use and derivatives are searched and then found in situ in an authentic context of use within documents. The curricular units support the use of the collection’s translations in classes taught by Cherokee community teachers. Pages are bookmarked in cell phones, then audio files are accessed on long drives. Crucially, archives in Tahlequah have reported being gifted Cherokee syllabary documents from a family’s collections, some hundreds of years old. Though largely anecdotal, we’re encouraged by these firsthand accounts of language practices unfolding in relation to DAILP’s first collection.
The DAILP Translation Interface
With CARE principles guiding our data resilience strategies, we developed a writing and translation interface under a 2022 NEH Level II Digital Humanities Advancement Grant to facilitate the collective translation practices of Cherokee speakers, teachers, learners, and community members. Through an intuitive interface designed with community members, the DAILP Translation Interface (TI) gathers community members’ audio, commentary, and language analysis into a computational back end that, in turn, displays this as new translations on the front end. Indigenous language experts, teachers, and scholars can use DAILP TI to translate archival documents or audio files collectively with Cherokee language learners in online classes, immersion schools, university classrooms, and communities. To protect users’ identities, their translation processes, and the initial products of their work from AI scraping bots and ill-intentioned users, DAILP TI must change its work protocols to ensure data sovereignty, security, and resilience as described below.
Collective benefits for tribes: the DAILP Translation Interface
Indigenous language persistence is at the heart of the DAILP TI. Cherokee was always understood to be a test case to prove the concept that digital edited collections could be created with collective translation efforts. The process and workflows involved painstaking development of spreadsheets for each document, Zoom audio recordings of the manuscripts being read aloud, and the trial of several innovations that failed to scale when the number of documents grew. In building a translation environment for collective benefit, we have always kept in mind that the Cherokee language was a stepping-stone to prove the concept and instantiate workflows for the benefit of other indigenous peoples.
In Phase 1 (2015-2019), DAILP contributors numbered 12. This agile team of community translators, teachers, and students proved the concept of DAILP’s reading interface by creating a small corpus of 25 deeply annotated documents. Phase 1 proved that a translation team working mostly with spreadsheets could establish a reliable computational back end to display the work of our teams of translators, speakers, teachers, and learners. In Phase 2, the DAILP team more than doubled its core team of contributors, speakers, teachers, learners, and community members. The number of texts translated and read aloud more than tripled to 87, with 3 times as many instructional modules offered as well. By the end of Phase 2, the DAILP team had also completed the full-stack development of the DAILP TI with the support of an NEH DHAG Level II award. 2 We applied for and received a DHAG Level III grant to refine and scale this translation interface to other tribal communities for their future use. 3
Though not yet made public, the DAILP TI runs in a staging environment and has been reviewed by 32 Cherokee and Indigenous community members. Initial responses confirmed the successful functioning of features of DAILP TI, including creating user accounts and signing on and off the site; viewing the user’s dashboard and bookmarking documents under the dashboard; entering a word pane to edit or add associated translation content; uploading, adding, and selecting the best audio for each word from among submissions; and adding and saving comments on each word, all behind a secured server and under specific user account settings. In addition, the writing environment functionality has been stress tested with initial translations of a never-before-translated document written by Willie Jumper, titled “The Bible.”
After this initial translation stress-test and the takedown of the DAILP site by scraping bots, limitations to the database back end became apparent. The team has identified enhancements to the user roles, activity display, and design of features of the software remaining to be addressed under a Mellon Foundation grant. This grant will support three primary activities over 3 years. The first involves refining the current DAILP TI activity features and user profiles as well as enabling administrative roles and expanded editor roles. The second involves continuing and deepening our engagement with members of the Anishinaabe and Cherokee communities concerning the design and collaborative ergonomics of these enhanced features as well as identifying and enabling other features desired by Anishinaabe community members. The third involves testing the supporting documentation and workflows to ensure smooth handoff of the DAILP TI and all of its code to allow other tribal communities to stand up their own DAILP sites.
Authority over the rights and interests of tribes: the DAILP Translation Interface
Partnerships with tribes and community organizations have involved scaling and expanding our data management systems to support thousands more documents from new sources, thousands of audio files, and commentary and contributions from community members behind secure, sign-in translation sites. These partnerships enable us to develop our data model to treat community commentary on texts with the same care as the source material. As we hand off DAILP’s software to other tribes, we will provide technical support to tribal communities as they set up their own sites to create language persistence materials using the DAILP TI. We will also actively add new code and features to the DAILP TI and offer these on the GitHub repository for all community members. Our aim is for tribal community members who set up their own DAILP sites to retain authority and rights over any and all translations, workflows, collections, code, and curriculum they create using the DAILP TI.
Responsibility, reciprocity, and relationship to the data: the DAILP Translation Interface
The final report for the NEH DHAG Level II award provides screenshots of two features of the Translation Interface as these most directly ensure the three Rs of the CARE principles: responsibility, reciprocity, and relationship. 4 The user account creation and sign-in pages and the detailed annotation processes made possible by the word-pane feature of the translation interface ensure that all contributors to an edited collection can maintain a relationship to the data they contribute. User accounts were necessary to ensure that tribal community members could control who has access to the language data and workflows for the creation of digital edited collections. User accounts also help to protect sites from AI scraping bots as these will become one entry point for human verification. The word-pane feature ensures that community members can support each other’s language use by leaving comments, loading audio recordings for a community-selected editor to review, and by offering learners and speakers many ways to contribute language data for each word. In other words, all relationships to the data are designated by the tribal community administrators of the DAILP site, and by those designated by the community to be editors, contributors, and readers.
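The way community-designated roles govern relationships to the data can be sketched as a simple permission table. The Python below is a hedged illustration, not the DAILP TI implementation: the role names follow those mentioned in this article (administrator, editor, contributor, reader), but the action names, the `PERMISSIONS` table, and the `can` function are assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    READER = "reader"
    CONTRIBUTOR = "contributor"
    EDITOR = "editor"
    ADMINISTRATOR = "administrator"


# Hypothetical action names; each role builds on the one before it.
READER_ACTIONS = {"read"}
CONTRIBUTOR_ACTIONS = READER_ACTIONS | {"comment", "submit_audio"}
EDITOR_ACTIONS = CONTRIBUTOR_ACTIONS | {"select_audio", "edit_translation"}
ADMINISTRATOR_ACTIONS = EDITOR_ACTIONS | {"assign_role"}

PERMISSIONS = {
    Role.READER: READER_ACTIONS,
    Role.CONTRIBUTOR: CONTRIBUTOR_ACTIONS,
    Role.EDITOR: EDITOR_ACTIONS,
    Role.ADMINISTRATOR: ADMINISTRATOR_ACTIONS,
}


def can(role: Optional[Role], action: str) -> bool:
    """Return True only if a signed-in role is allowed to take the action.

    A visitor with no account (role is None) is allowed nothing, which is
    how a sign-in requirement keeps unauthenticated traffic from the data.
    """
    if role is None:
        return False
    return action in PERMISSIONS[role]


print(can(Role.CONTRIBUTOR, "submit_audio"))  # contributors may offer recordings
print(can(Role.CONTRIBUTOR, "select_audio"))  # choosing among them is an editor's task
print(can(None, "read"))                      # no account, no access
```

In this sketch, who holds each role is decided outside the code, by the tribal community administrators of a site, which mirrors the principle that relationships to the data are designated by the community.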
Ethics to maximize tribal community benefits and ensure future use: the DAILP Translation Interface
The ethics of our tribal community efforts emerge in day-to-day methods to ensure we’re designing and securing the data with community benefits and future use in mind. We hold regular meetings with project advisors and collaborative groups (community translators and language teachers) to share progress and discuss new features. The discussion and feedback from these meetings prove essential to our workflows and ensure that we are responding to desired features (which often take different forms once a concrete implementation is available to experiment with). We continue with extensive user testing following the refinement stage of development to assess ergonomics and how well the existing and new features perform across different platforms, languages, expertise levels, and audiences. We routinely evaluate the success of our documentation and workflows to identify and address challenges with the Anishinaabe and Western Carolina University DAILP TI team leaders as they set up their sites. Based on this testing, some changes are made immediately, while more ambitious requests become part of a roadmap for future work. The goal is to ensure that DAILP TI features and workflows fully maximize tribal benefits and support future language practice. A few specific metrics of progress that we track include the number of translations, commentaries, and other contributions made (an indicator of how effectively the DAILP TI supports those activities); the size and growth of the language data sets and curatorial content during the grant period (reflecting the addition of new words and language media through ingestion of Eastern Cherokee dialect and Anishinaabe data sets, translation, and commentary); and indications of in situ usage in teaching, translating, curatorial, and apprenticeship programs by the Cherokee and Anishinaabe teaching and translation communities.
Indigenous Data Sovereignty and DAILP: Implications
Technological Implications
The DAILP TI web application will provide Cherokees in the East and Anishinaabe community members with the ability to add or create digital collections of culturally relevant documents and audio files of their choosing. The DAILP TI features remaining to be refined were either suggested by Cherokee community members in previous community reviews of DAILP’s web interface or were identified by community design key collaborators and team leaders in planning for their upcoming collections. Building new technology on existing DAILP infrastructure provides access to DAILP’s existing technological ecosystem, including page templates for edited collections; GraphQL queries and mutations for updating words, documents, and user contributions; AWS Cognito User Pools, Identity Pools, and User Roles supporting administrator, editor and contributor roles; and existing security and privacy protocols across DAILP’s stack.
All data developed within or contributed to DAILP is stored either on Northeastern’s AWS servers or in repositories maintained by institutional partners. The AWS servers DAILP uses are maintained by Northeastern University’s Library Technology Services staff and Information Technology Services staff with rigorous attention to backups, data integrity checks, and other storage and preservation practices. Institutional partners adopting the DAILP TI to set up their own collections will draw upon data management standards of preservation in use at their institutions and in their communities. We aim to place all control, authority, rights, and management for each site in the hands of those who are creating the site, as they may desire.
As we deepen our engagement with members of the Anishinaabe and Cherokee communities concerning the design and collaborative ergonomics of enhanced features, we will identify and enable other features as desired by indigenous community members. We will also expand testing of the supporting documentation and workflows to ensure smooth handoff of the DAILP TI to more indigenous communities. Finally, we will continue to support the United Keetoowah Band of Cherokee Indians (UKBCI) translation team as they create the second digital edited collection of manuscripts, titled The Willie Jumper Stories. This second Cherokee language project is funded by a National Historical Publications and Records Commission grant from the National Archives and the Mellon Foundation. The Willie Jumper Stories will translate Jumper’s entire ledger of 140 manuscript pages found in the Yale Beinecke Library. At the time of this writing, the UKBCI team of translators includes Willie Jumper’s daughter, Alice Jumper, and Kyndal Aimerson, an undergraduate student at Northeastern State University in Tahlequah, OK. The Willie Jumper Stories will deepen the Cherokee linguistic data set and the display of translations on the back end, giving other Cherokee language projects a deep and ready language resource for their use.
As of November 2025, DAILP has been coding and testing a proof-of-humanity security measure for all users of the reading and translation interfaces in order to build additional layers of data security and to protect the site’s language data from LLM scraping bots. Working with the Northeastern Library’s Digital Scholarship group and the NULab for Digital Humanities and Computational Social Sciences working groups, researchers and librarians are creating university-level data resilience plans to help ensure data security, indigenous data sovereignty, and community-based CARE principles for DAILP and for post-custodial sites and social science research centers collaborating with indigenous communities. DAILP must also create professional service agreements for all participants to ensure that the intellectual property produced during and as a result of their work with DAILP remains theirs.
Beyond DAILP, the pernicious AI scraping bots that feed LLMs’ generalizing algorithms face legal and computational challenges. If the scraped data set is very large, or is highly circumscribed by its context, case, or rhetorical circumstance, the data’s predictability cannot be well weighted unless the weighting can also account for variables particular to the rhetorical situations of the original data. Parameter settings might seek to weight the language data and its analysis to the rhetorical purpose, audience, or exigence of the original scraped data, but data-scraping bots disregard these rhetorical ecologies of thought and meaning. Thus, the decoding one receives from an LLM’s algorithm is unweighted and probabilistically sampled from across incommensurate rhetorical circumstances. In other words, the text generated by LLMs is generalized in ways that impact its creativity, reliability, validity, and predictability. Scraping data unethically impinges on copyright and data sovereignty for everyone and faces legal challenges over copyright infringement. In summer 2025, Anthropic, the company that created Claude, agreed to settle a lawsuit brought by authors whose books were scraped in violation of copyright laws. Anthropic agreed to pay $1.5 billion to authors whose books had been scraped by its AI bots (Veltman, 2025). As extractivist tools, data-scraping bots destroy the ecologies necessary to produce meaning in the first place.
Relational Implications
Returning to the CARE principles for operationalizing Indigenous Data Sovereignty, scholars and researchers of written communication should exercise restraint when operationalizing collective principles across open-source data sets. As Carroll et al. (2021) offer in “Operationalizing the CARE and FAIR Principles for Indigenous Data Futures,” researchers can “utilize Indigenous design to benefit mainstream data communities.” Carroll et al., however, warn against “the co-optation of the CARE Principles into other spaces just yet. As their full criteria for implementation have not yet been determined and used, we must leave space for the design and maturity of the CARE Principles to occur within the Indigenous environments from which they originate” (2021, p. 5). CARE Principles can be operationalized in locally grounded practices in communities that place importance on collective rights. However, wholesale co-optation of these indigenous methodological principles for all research environments risks being just as extractivist as data-scraping bots, destroying meaning, ecologies of thought, and rhetorical sovereignty relationships to language, writing, and data (Lyons, 2000). The extractivist agendas of LLMs’ scraping bots are nothing new for indigenous communities and nations vulnerable to incursion by global AI powers.
Below, with gratitude to an anonymous reviewer, I offer a heuristic for writing researchers working to operationalize the CARE principles in their own research that may place importance on the collective rights of community members.
Heuristic for Applying the CARE principles to a research project.
Footnotes
Acknowledgements
I wish to thank the translation team of the United Keetoowah Band of Cherokee Indians, members of the DAILP team and advisory board, and Assistant Project Manager Naomi Trevino for their continued partnership and encouragement.
Ethical approval
The DAILP project is approved by the IRB at Northeastern University: #
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Northeastern University and the Institute of Museum and Library Services provided proof-of-concept funding for DAILP. DAILP received interface development support from the Henry Luce Foundation, the National Archives, and a National Endowment for the Humanities (NEH) Digital Humanities Advancement Grant (DHAG), Level II, which enabled payment of community language and linguistic specialists and library experts. An NEH DHAG Level III grant was awarded in December 2024 but was terminated in March 2025. In June 2025, DAILP received generous support from the Mellon Foundation for 3 years to refine and scale the DAILP Translation Interface.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
N/A
