Abstract
This study examines the influence of artificial intelligence (AI)-driven media platforms on South Africa's indigenous languages—Setswana, Tshivenda, and Xitsonga—in the country's diverse linguistic environment. It examines how algorithmic biases, often rooted in colonial linguistic hierarchies, diminish the visibility and vitality of these languages in digital media. Using a mixed-methods approach, the study combines algorithmic audits of AI platforms (e.g., social media and natural language processing tools) to evaluate content visibility and translation accuracy, interviews with AI developers and media practitioners to assess linguistic diversity in design, and focus groups with rural communities to gather user experiences. Results indicate that limited training data and a focus on dominant languages, such as English, marginalize these indigenous languages, with audits revealing error rates of 30% to 42% in translation and voice recognition for these languages. Nonetheless, community-driven innovations demonstrate potential for creating inclusive AI solutions. The study proposes a decolonial framework for designing AI technologies that prioritize African linguistic rights and epistemologies, contributing to a nuanced understanding of AI's role in Africa's media landscape. This study is among the first to integrate algorithmic audits with community ethnography to reveal how AI systems shape Africa's linguistic diversity and to propose decolonial design principles for inclusive AI futures.
Keywords
Introduction
The rapid proliferation of artificial intelligence (AI) within global media ecosystems has fundamentally reshaped the production, dissemination, and consumption of information. From algorithm-driven content curation to automated translation and natural language processing (NLP), AI technologies are redefining how linguistic identities are represented and experienced in digital spaces. Yet, while research on AI's role in media has gained traction, much of it remains centered on the Global North, with limited attention to Africa's diverse linguistic realities (Gondwe, 2023; Munoriyarwa et al., 2021; Gwagwa et al., 2020). This gap is significant given Africa's complex multilingual heritage and the enduring colonial hierarchies that continue to structure which voices and languages are rendered visible in both traditional and digital media.
In South Africa—a nation constitutionally committed to linguistic diversity—AI-driven media platforms often perpetuate rather than redress these historical inequities. Although the country recognizes 12 official languages, platforms such as Google Translate, YouTube, and X (formerly Twitter) overwhelmingly privilege English and Afrikaans at the expense of marginalized languages like Setswana, Tshivenda, and Xitsonga. These exclusions are rooted in both the limited availability of training data for marginalized African languages and the Western-centric assumptions that underpin global AI development pipelines (Nekoto et al., 2020). As a result, indigenous African languages suffer reduced visibility in content curation, persistent mistranslations in NLP tools, and diminished prospects for intergenerational transmission in digital environments. This reproduces colonial linguistic hierarchies under the guise of technological neutrality.
This study pursues two main objectives. First, it investigates how AI-driven media platforms influence the visibility and vitality of South Africa's marginalized languages—specifically Setswana, Tshivenda, and Xitsonga—within digital media spaces. Second, it proposes a decolonial framework for designing and deploying AI technologies that prioritize African linguistic rights, cultural epistemologies, and community-driven innovation. Addressing these objectives extends current debates on AI and media ethics by reframing linguistic inclusion as a question of epistemic justice rather than technical optimization.
Unlike prior studies that focus on AI's technical fairness, algorithmic transparency, or data privacy (Okolo et al., 2023), this article foregrounds the cultural and epistemic dimensions of AI. While Gondwe (2023) examines journalists’ engagement with generative AI in sub-Saharan Africa and Munoriyarwa et al. (2021) analyze AI practices in South African newsrooms, this study departs from those approaches by combining algorithmic audits with community ethnography to assess how AI systems reproduce or challenge colonial linguistic hierarchies. In doing so, it contributes to an emerging Global South scholarship that situates AI within the broader continuum of coloniality.
Literature review
Colonial legacies of language: Historical roots
Africa's linguistic hierarchies stem from colonial imposition of European languages as tools of governance, education, and epistemic dominance, relegating indigenous languages to informal spheres and framing them as inferior (Ngugi wa Thiong'o, 1986; Phillipson, 1992; Wolff, 2017). Postindependence policies often perpetuated these divides by prioritizing colonial languages for nation-building, limiting intergenerational transmission and the symbolic power of African languages (Wolff, 2017). These legacies create enduring epistemic violence, where African ways of knowing are marginalized in knowledge production (Said, 1978).
In the digital era, these patterns persist through digital colonialism—the extension of historical coloniality into technological domains, where Global North actors dominate digital infrastructures, extract data unequally, and impose epistemic hierarchies (Mohamed et al., 2020; Couldry & Mejias, 2019). Digital colonialism manifests in AI via extractive data practices (harvesting Global South data without reciprocity), imposed Western norms of linguistic correctness, structural asymmetries in access, and cultural erasure through mistranslations or invisibility (Tian & Wang, 2025). For African contexts, this represents a “second wave” of linguistic imperialism, privileging English in AI systems and marginalizing indigenous repertoires (Salami, 2024).
AI, media, and structural inequities in Africa
AI adoption in Africa navigates a paradox of developmental promise and deepened inequities (Dlamini & Ndzinisa, 2025). Data scarcity for African languages—classified as low-resource—results in biased NLP tools, content curation favoring dominant languages, and exclusion from recommendation algorithms (Ayana et al., 2023; Nekoto et al., 2020). This data colonialism extracts value from African users while reproducing Global North norms (Couldry & Mejias, 2019).
Recent scholarship highlights persistent challenges: algorithmic biases amplify exclusion in marginalized languages, with limited progress in multilingual large language models (LLMs) despite growth in African NLP (Udeze, 2025). Yet, African-led initiatives counter this trajectory. Community-driven projects like Masakhane demonstrate that participatory NLP can expand representation through open, inclusive datasets and models (Adelani et al., 2021).
Algorithmic injustice, linguistic erasure, and resistance
Generative AI rearticulates colonial hierarchies by embodying “white language supremacy,” producing technical inaccuracies (mistranslations and misrecognition of accents) and epistemic harms such as hermeneutical erasure, in which African meanings are excluded (Fernandez & McIntyre, 2025; Mollema, 2025). This reshapes digital belonging, positioning English as the default of modernity (Blommaert, 2015; Mohamed et al., 2020).
Resistance emerges through relational ethics and community agency (Birhane, 2021). Participatory models position local speakers as co-producers, disrupting data colonialism via grassroots NLP (Adelani et al., 2021; DeWitt Prat et al., 2024). Recent ethnographic and decolonial frameworks advocate embedding African epistemologies in LLM design, challenging Eurocentric universality (DeWitt Prat et al., 2024; Ooko, 2025). Technological sovereignty—African control over data and design—counters dependency (Salami, 2024).
This study builds on prior work (e.g., Gondwe, 2023; Munoriyarwa et al., 2021) by shifting from newsroom-focused analyses to algorithmic audits combined with rural community ethnography, foregrounding epistemic justice in specific marginalized languages (Setswana, Tshivenda, and Xitsonga). It advances decolonial AI scholarship (Mohamed et al., 2020; Birhane, 2021) by proposing community-grounded design principles, contributing to Global South debates on linguistic reclamation amid accelerating AI developments (e.g., recent bias mitigation and participatory efforts in 2024–2025).
Theoretical framework
This study is anchored in the synergy between Critical Data Studies (CDS) and Decolonial Theory, two complementary lenses that together diagnose and reimagine how power operates through AI systems. CDS interrogates the sociopolitical structures embedded in data and algorithms, while Decolonial Theory provides the epistemic tools to dismantle and rebuild those systems around African worldviews. Together, these perspectives enable a move from identifying bias to reimagining AI as a site of epistemic reclamation.
Critical data studies: Diagnosing data assemblages and absences
CDS offers a lens for unpacking how AI technologies reproduce power relations through their data infrastructures. Rejecting the notion of data as neutral or objective, CDS conceptualizes data as part of broader assemblages—interconnected networks of institutions, technologies, and ideologies that shape how information is produced, classified, and circulated (Iliadis & Russo, 2016; Couldry & Mejias, 2019). In this view, absence itself becomes a form of power: what is not collected or represented reflects deliberate historical and economic priorities.
In the context of this study, the systematic underrepresentation of Setswana, Tshivenda, and Xitsonga in training corpora is not a technical oversight but a continuation of colonial patterns that privilege dominant linguistic communities. As Beer (2016) argues, data infrastructures are shaped by “data fictions” that mirror societal hierarchies. CDS thus allows us to interpret high translation and recognition error rates not merely as computational deficiencies but as symptoms of power-laden data ecologies that render certain languages invisible.
This framework directly informs the study's methodological choices: algorithmic audits reveal how these assemblages operationalize exclusion, while interviews and focus groups expose how individuals experience and contest these exclusions in everyday media practices.
Decolonial theory: From diagnosis to epistemic reclamation
While CDS illuminates how exclusion occurs, Decolonial Theory clarifies why—and guides the process of rebuilding. Emerging from Latin American and African scholarship (Mignolo, 2007; Ndlovu-Gatsheni, 2018; Mohamed et al., 2020), Decolonial Theory challenges Eurocentric epistemologies that frame Western knowledge as universal. It calls for an epistemic shift that privileges plural ways of knowing and being, grounded in the lived realities of formerly colonized communities.
Applied to AI and media, a decolonial lens exposes how concepts like “efficiency,” “objectivity,” and “linguistic correctness” encode Western assumptions about communication and value. It moves beyond diversity or inclusion—mere additive gestures—to an epistemic reorientation that asks: What would AI look like if it were designed through African epistemologies and linguistic logics?
Drawing on Birhane's (2021) “relational ethics” and Mollema's (2025) concept of “hermeneutical erasure,” this study views the persistent mistranslation or misrecognition of African languages as acts of epistemic violence. A decolonial approach reframes these errors not as failures of efficiency, but as moments of ontological dissonance—where dominant systems fail to comprehend alternative ways of making meaning. Thus, the proposed framework for decolonizing algorithms centers epistemic justice—the equitable recognition of African knowledge systems, languages, and interpretive traditions—as both an analytical goal and a design imperative.
Integrating the frameworks: From critique to transformation
CDS and Decolonial Theory intersect at the point where analysis becomes praxis. CDS reveals the structural dynamics of exclusion in data infrastructures, while Decolonial Theory transforms that diagnosis into an agenda for epistemic sovereignty. Together, they shape this study's analytical trajectory, which moves from identifying algorithmic bias to reclaiming epistemic agency and proposing inclusive AI design.
This synthesis positions AI not merely as a technological system but as a cultural site where coloniality and resistance converge. By integrating these frameworks, the study moves beyond critique to offer a transformative vision of AI grounded in African epistemologies and linguistic justice.
CDS reveals how exclusion occurs through power-laden data assemblages: interconnected networks of institutions, technologies, and ideologies that produce absences as deliberate priorities (Iliadis & Russo, 2016; Couldry & Mejias, 2019). In this study, CDS analytically diagnoses the underrepresentation of Setswana, Tshivenda, and Xitsonga in AI corpora not as mere technical gaps but as symptoms of infrastructural hierarchies that render certain languages invisible or erroneous.
Decolonial Theory complements this by clarifying the epistemic “why” and providing tools for reclamation. It challenges Eurocentric universals embedded in AI (e.g., notions of “efficiency” or “objectivity” that encode Western linguistic norms) and calls for epistemic delinking—privileging plural ways of knowing grounded in colonized communities’ realities (Mignolo, 2007; Ndlovu-Gatsheni, 2018; Mohamed et al., 2020). Where CDS exposes algorithmic coloniality (e.g., data extraction without reciprocity), Decolonial Theory reframes errors in translation/recognition as ontological dissonance and epistemic violence, demanding redesign around African relational ethics and communal accountability (Birhane, 2021).
Analytically, the integration proceeds dialectically: CDS supplies diagnostic precision (e.g., quantifying error rates and visibility gaps via audits), while decolonial theory offers normative orientation (e.g., centering epistemic justice and community epistemologies). This synergy enables a move beyond critique to intervention. The study operationalizes this transition through its mixed-methods design: algorithmic audits provide CDS-informed evidence of structural exclusion, while interviews with developers/practitioners and focus groups with rural communities surface lived experiences and indigenous resistances, embodying decolonial emphasis on relational knowledge production. These empirical insights inform the proposed decolonial framework for AI design, which prioritizes participatory data stewardship, community-led architectures, open-source infrastructures, and African linguistic rights—transforming critique into actionable principles for epistemic sovereignty and linguistic reclamation in digital media.
This integrated approach not only diagnoses how AI perpetuates colonial hierarchies but actively intervenes by envisioning technologies co-created with marginalized language communities, aligning with emerging Global South scholarship on relational and contextual AI ethics (e.g., recent calls for decolonized governance in sub-Saharan Africa) (Figure 1).

Figure 1. Theoretical framework: From diagnosis to reclamation.
Figure 1 illustrates the study's theoretical synthesis between CDS and Decolonial Theory. CDS diagnoses the structural and epistemic exclusions embedded in AI data assemblages, while Decolonial Theory reimagines technology through African epistemologies and linguistic justice. Together, they guide a three-stage analytical process—from identifying algorithmic bias to reclaiming epistemic agency and proposing inclusive AI design.
Research methodology
Rationale for case selection
This study focuses on three historically marginalized South African languages—Setswana, Tshivenda, and Xitsonga—out of the country's 12 official languages. These languages were selected for three reasons. First, they have been systematically overshadowed by English and Afrikaans in education and media infrastructures, reflecting enduring colonial hierarchies (Wolff, 2017). Second, they are classified as marginalized languages in AI research, with limited training corpora compared to higher-resourced African languages like isiZulu or isiXhosa (Nekoto et al., 2020). Third, they are predominantly spoken in rural provinces (Limpopo and North West), where linguistic marginalization intersects with infrastructural and digital access inequities. Examining these languages illuminates how algorithmic design reproduces historical exclusions while highlighting local strategies of adaptation and resistance.
Research design
A mixed-methods approach captured both structural and experiential dimensions of linguistic marginalization in AI-driven media. Quantitative algorithmic audits measured biases in platform behaviors; semistructured interviews revealed design logics contributing to these biases; and focus groups documented community experiences and resistance strategies. This triangulation aligns with the study's theoretical framework—integrating CDS and Decolonial Theory—by linking systemic critique of data infrastructures to epistemic reclamation and community-centered praxis.
Algorithmic audits
The audits examined treatment of Setswana, Tshivenda, and Xitsonga relative to dominant languages (English and Afrikaans) on major AI-driven platforms: X (formerly Twitter), YouTube, and Google Translate.
Three indicators were selected based on established auditing literature on linguistic bias in social media and NLP for marginalized African languages (e.g., Chonka et al., 2023; Nekoto et al., 2020):
These indicators capture upstream (visibility/promotion) and downstream (engagement/accuracy) effects of algorithmic hierarchies.
Operationalization, quantification, and sample selection
A purposive sample of 500 public posts per language (total 1500 for X/YouTube content) was drawn using API-compliant tools (where permitted) and ethical web scraping. Posts were filtered for relevance (indigenous-language text/hashtags), stratified by recency and topic neutrality (e.g., excluding politics to reduce confounding), and randomly subsampled for balance. Translation/voice tests used 200 parallel sentences/audio clips per language from community corpora (e.g., JW300, Masakhane datasets).
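The stratify-then-subsample step described above can be sketched as follows. This is a minimal illustration only, not the study's collection script: the record fields (`lang`, `period`), the stratum definition, the per-stratum quota, and the fixed seed are all assumptions introduced for the example.

```python
import random
from collections import defaultdict


def stratified_subsample(posts, strata_key, per_stratum, seed=0):
    """Draw up to `per_stratum` posts at random from every stratum."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    buckets = defaultdict(list)
    for post in posts:
        buckets[strata_key(post)].append(post)

    sample = []
    for stratum in sorted(buckets):  # sorted keys give a stable order
        group = buckets[stratum]
        k = min(per_stratum, len(group))  # never ask for more than exists
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical records: 20 posts per (language, period) cell.
# "tn", "ve", "ts" are the ISO 639-1 codes for Setswana, Tshivenda, Xitsonga.
posts = []
for lang in ("tn", "ve", "ts"):
    for period in ("early", "late"):
        for _ in range(20):
            posts.append({"id": len(posts), "lang": lang, "period": period})

sample = stratified_subsample(
    posts, lambda p: (p["lang"], p["period"]), per_stratum=5
)
```

Drawing the same number of posts from each stratum keeps the three languages and the two collection periods balanced in the audited sample, which is the purpose the text gives for the subsampling step.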
Data collection tools, timeframe, and handling dynamism
Python 3.10 scripts automated collection and analysis:
Core modules: data_collection.py (query/API calls via Tweepy/Selenium, pagination); preprocessing.py (language detection with langdetect, cleaning); audit_metrics.py (indicator calculations); translation_eval.py (BLEU/chrF/WER via the sacrebleu and jiwer libraries). Key libraries: pandas (data handling), NLTK/scikit-learn (metrics), Selenium (browser automation for YouTube/X feeds).
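To make the word error rate (WER) metric concrete, the sketch below computes it from first principles as word-level edit distance divided by reference length. It is a dependency-free stand-in for illustration, not the jiwer-based code the study reports using, and the hypothesis string is invented for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j]: edit distance between the reference words consumed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# Reference phrase taken from the findings; the hypothesis is made up
# and simply drops one word to show a single deletion error.
reference = "ku va na vutomi bya xikwembu"
hypothesis = "ku va na vutomi xikwembu"
wer = word_error_rate(reference, hypothesis)
```

A WER of 0 means the system output matches the reference word for word. Higher values count the share of reference words that had to be substituted, deleted, or compensated for by insertions.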
Scripts were pilot-tested on a smaller dataset, with outputs cross-verified manually. Data collection ran from June 1, 2025, to September 30, 2025 (4 months), in weekly batches.
To address dynamic platform algorithms (e.g., X's “For You” feed or YouTube recommendations), mitigations included: (1) repeated queries (n = 10 per indicator) with averaged results to reduce snapshot variability; (2) neutral sockpuppet accounts (consistent, nonpersonalized profiles to minimize user-specific bias); (3) temporal bracketing (early vs. late periods showed no significant shifts, t-tests p > .05); and (4) cross-platform triangulation. These steps follow best practices for auditing opaque, updating systems (e.g., auditing frameworks from ACM FAccT proceedings). Limitations are noted in the Discussion section: full control is impossible without platform access, but multirun and standardized approaches enhance reliability.
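Mitigations (1) and (3) can be illustrated with a short sketch: averaging an indicator over repeated runs, then comparing early and late collection periods with a t statistic. The text does not say which t-test variant was used, so Welch's unequal-variance form is an assumption here, and all numbers are placeholders rather than study data.

```python
from math import sqrt
from statistics import mean, stdev, variance


def summarize_runs(shares):
    """Mean and sample standard deviation over repeated audit runs."""
    return mean(shares), stdev(shares)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Placeholder visibility shares from ten repeated queries for one indicator.
runs = [0.22, 0.24, 0.21, 0.23, 0.25, 0.22, 0.23, 0.24, 0.21, 0.25]
mean_share, sd_share = summarize_runs(runs)

# Temporal bracketing: first five runs versus last five runs.
t_stat, dof = welch_t(runs[:5], runs[5:])
```

Turning the statistic into a p-value requires the t distribution, which the Python standard library does not provide. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value.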
Validation and reliability
For translation accuracy (Table 1), global tools (e.g., Google Translate) were compared against community-built tools (e.g., Masakhane-inspired models) on the same 200-sentence test set. Metrics were validated through human evaluation (n = 3 bilingual speakers per language) on a 20% subset (Cohen's kappa > 0.75 for inter-rater agreement); repeated runs showed low variance (SD < 5%). Statistical significance was assessed with analysis of variance (ANOVA) and t-tests (p < .05), supplemented by manual anomaly checks.
Visibility and Accuracy Across Languages.
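The inter-rater agreement check reported under Validation and reliability can be illustrated with a small Cohen's kappa computation. Cohen's kappa is defined for two raters, and the text does not say how the three raters per language were combined, so averaging the pairwise kappas is an assumption made for this sketch. The labels are invented.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over every pair of raters (an assumption)."""
    pairs = list(combinations(ratings, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)


# Invented judgments: "ok" = acceptable translation, "err" = erroneous.
rater_1 = ["ok", "ok", "err", "err", "ok", "err", "ok", "ok", "err", "ok"]
rater_2 = ["ok", "ok", "err", "ok", "ok", "err", "ok", "err", "err", "ok"]
kappa = cohens_kappa(rater_1, rater_2)
```

Kappa discounts the agreement two raters would reach by chance, which is why it is preferred over raw percent agreement when one label dominates. Fleiss' kappa is a common alternative when more than two raters label every item.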
Semistructured interviews
Fifteen semistructured interviews were conducted with AI developers (n = 8) and media practitioners (n = 7), purposively sampled for experience with multilingual tools. Interviews (45–60 min) explored linguistic prioritization in design. The guide was expert-reviewed and piloted; sessions were recorded/transcribed with consent. Thematic coding in NVivo used deductive (theory-driven) and inductive codes; intercoder agreement reached 87% on 25% of transcripts.
Focus groups
Six focus groups (n = 48 participants total) were held in Limpopo and North West provinces, engaging rural community members. Discussions (60–90 min) were conducted in native languages with interpreters, audio-recorded, and centered on AI interactions, visibility perceptions, and inclusion initiatives. The guide was piloted with community radio listeners for cultural relevance.
Key characteristics:
Age range: 18 to 65 years (mean = 38). Gender composition: 52% female, 48% male. Digital literacy: Self-reported low-to-moderate (60% used social media < once/week, primarily mobile; 25% reported high comfort with AI tools like translation apps; 15% advanced).
This diversity in age, gender, and digital literacy supports the scope and validity of the focus group findings.
Triangulation and analytical lens
Integration occurred at three levels: methodological (cross-validation), data source (developer/practitioner/community perspectives), and theoretical (CDS/decolonial interpretation). Quantitative audits quantified exclusion; qualitative data illuminated experiences and rationales. This ensured validity (corroboration) and reliability (transparent procedures), advancing from bias identification to community resistance and epistemic justice.
Analysis and findings
The integration of algorithmic audits, semistructured interviews, and focus groups provides a layered understanding of how algorithmic infrastructures reproduce colonial linguistic hierarchies and how communities resist and reimagine these systems. The findings are organized around three interconnected themes: (1) algorithmic bias and linguistic erasure, (2) institutional logics and design rationales, and (3) community adaptation and epistemic resistance. A synthesis integrates these themes to underscore the pathway from data colonialism to epistemic sovereignty.
Algorithmic bias and linguistic erasure
Algorithmic audits revealed clear disparities in the treatment of marginalized languages across platforms such as Google Translate, YouTube, and X. These disparities were quantified through analysis of content visibility in recommendation algorithms, translation accuracy in NLP tools, and voice recognition performance in speech interfaces, with statistical significance established at p < .05 using ANOVA and t-tests (Table 2).
Translation Accuracy—Global Versus Community-Built Tools.
Posts in Setswana, Tshivenda, and Xitsonga appeared in less than one-quarter of algorithmic feeds, even when user settings favored those languages. Translation tools routinely misinterpret idiomatic or culturally specific expressions; for example, the Xitsonga phrase “Ku va na vutomi bya xikwembu” (roughly, “To have a spirit-filled life”) was often rendered as a literal and nonsensical “To have a ghost life,” stripping away spiritual and communal connotations. Speech-recognition systems similarly failed to register tonal variations, regional accents, and code-switching common in multilingual contexts, leading to frustration among users.
Focus group participants vividly illustrated the lived impact of these biases. A 45-year-old Setswana speaker from a rural North West community shared: “When I speak into the phone, it gives me Afrikaans words or nothing at all. It's like my language is invisible. How can I share stories with my grandchildren if the machine doesn’t hear us?” A Tshivenda-speaking participant, a young educator from Limpopo, added: “When the system changes my words, it feels like it is changing who we are. Our idioms make sense to us, not to the machine. It erases our way of thinking.” These accounts, echoed across the focus groups, highlight how such errors extend beyond functionality to emotional and cultural harm, with 65% of participants reporting feelings of alienation or cultural diminishment during digital interactions.
These patterns are not mere technical errors but structural manifestations of what Couldry and Mejias (2019) describe as data colonialism—where global AI systems reproduce historical hierarchies of knowledge and value through unequal data representation. The consistent 30% to 42% error rates for marginalized languages reflect the absence of adequate training data and the dominance of English in data infrastructures. From a CDS lens, this exclusion demonstrates how data assemblages privilege certain voices while silencing others. Within Decolonial Theory, such exclusion amounts to hermeneutical erasure (Mollema, 2025)—a denial of linguistic and cultural legitimacy that undermines African epistemologies by treating them as computational noise rather than valid knowledge systems. Conceptually, these quantitative disparities embody linguistic injustice: they quantify how the digital sphere systematically undermines African worldviews, perpetuating colonial patterns of epistemic marginalization.
Institutional logics and design rationales
Interviews with AI developers and media practitioners illuminated the institutional and commercial factors sustaining linguistic marginalization. Of the 15 interviewees, 11 cited resource constraints and market-driven priorities as primary barriers, revealing how global economic logics intersect with local design practices to reproduce colonial hierarchies.
Developers repeatedly emphasized cost, scalability, and market demand as reasons for deprioritizing marginalized African languages. One Johannesburg-based AI engineer, with experience at a multinational tech firm, remarked: “Supporting a language like Tshivenda is expensive. The return on investment is small, and global clients aren’t asking for it—we focus on languages that scale to billions, not millions.” A media practitioner from a South African digital newsroom echoed this: “When we talk about diversity in tech, people mean adding Spanish or French translations, not Setswana. African languages are seen as niche, not global; they’re an afterthought in boardroom discussions.” Another developer, involved in NLP tool design, admitted: “Our models are trained on what's available—mostly English datasets. Adding African languages means custom data collection, which delays timelines and inflates budgets.”
These rationales reveal how global and local corporate logics align to perpetuate colonial hierarchies of linguistic value. Developers often conceptualized “multilingual support” as extending European languages rather than African ones, reflecting the residual influence of Western epistemic standards. From a CDS perspective, these statements highlight how algorithmic systems are embedded in economic assemblages that translate profit imperatives into data priorities, sidelining nondominant languages as economically unviable. Decolonial Theory deepens this critique by revealing that such market rationalities perpetuate the coloniality of being—the idea that African modes of knowledge are secondary or irrelevant to “modern” technological progress.
However, not all interviewees endorsed this status quo; five expressed a desire to challenge these hierarchies through alternative approaches. A developer involved in an open-source NLP project explained: “We’re trying to build from the ground up. Community volunteers are tagging audio and text data. It's small, but it's ours—we’re partnering with local linguists to ensure the models respect cultural nuances, not just grammar.” Another media practitioner, working on multilingual content curation, noted: “I’ve pushed for pilots with Xitsonga data, but it requires buy-in from leadership. Without policy mandates, it stays experimental.” These insights reflect a shift from dependency toward epistemic sovereignty, where practitioners advocate for community-driven models that prioritize African epistemologies over commercial scalability.
Community adaptation and epistemic resistance
The focus groups revealed vibrant, community-led strategies to counter linguistic invisibility and foster innovation, with participants across age groups (18–65) and education levels describing grassroots efforts that blend traditional knowledge with digital tools. Of the 32 participants, 24 reported actively engaging in adaptations or innovations, underscoring a widespread culture of resistance.
Participants described using WhatsApp groups to crowdsource and share accurate translations, collaborating with teachers and elders to build localized learning apps, and leveraging community radio archives to train AI models. For instance, in one Limpopo focus group, participants detailed a volunteer-led initiative to annotate Xitsonga audio clips for open-source speech recognition, resulting in tools that better captured local dialects.
Community-built tools consistently outperformed global ones by 15% to 20% (p < .05), as validated through comparative testing in the audits. These innovations embody a decolonial mode of AI development rooted in reciprocity, context, and collective knowledge production.
Participants emphasized that such initiatives not only improve digital access but also strengthen cultural pride and intergenerational transmission. A Xitsonga mother from Limpopo explained: “When my daughter reads on the app in our language, she smiles. It feels like our home is finally online—we built it ourselves with stories from our elders.” A Setswana participant, a community organizer, added: “We record proverbs on our phones and share them in groups. Now, the AI we train understands our wisdom, not just words.” These testimonies, representing 75% of focus group discussions, illustrate epistemic repair (Mohamed et al., 2020), where marginalized users actively mend colonial fractures in AI systems by embedding indigenous knowledge.
At the same time, participants described the emotional and cognitive labor required to sustain linguistic visibility online. As one Tshivenda elder reflected: “We tag everything in English just so the algorithms can find us. It's frustrating because we’re doing the translation work for systems that ignore us.” This labor was quantified in focus group reports, with adaptive strategies enhancing visibility but at a personal cost (Table 3).
Adaptive Strategies and Their Effects.
These adaptations reflect what Brunton and Nissenbaum (2015) describe as negotiated resistance: the strategic engagement with dominant systems under unequal conditions. Communities perform invisible labor to make themselves legible to AI infrastructures, illustrating how digital participation becomes a form of resistance within colonial data regimes.
These empirical findings—quantitative evidence of reduced visibility (e.g., indigenous-language posts ranking 15–25 positions lower on average), high translation/voice error rates (30%–42% across tools), and qualitative accounts from developers (prioritizing English scalability) and rural users (epistemic erasure and diminished intergenerational transmission)—collectively diagnose how AI-driven media perpetuates colonial linguistic hierarchies in South Africa. Rather than isolated technical flaws, these patterns reflect systemic epistemic violence: data scarcity as a symptom of extractive infrastructures, algorithmic promotion bias as imposed Western norms, and mistranslations as cultural distortion. Focus group participants further illuminated lived harms, describing AI as “silencing our ways of speaking” and calling for community control over digital representation. This diagnosis, grounded in CDS's critique of power-laden data assemblages and decolonial exposure of epistemic hierarchies, sets the stage for transformative intervention: a decolonial framework that reclaims linguistic agency through participatory, epistemically just AI design.
Synthesis: From data colonialism to epistemic sovereignty
Across all data sources, a dual dynamic emerges: AI-driven platforms reproduce linguistic marginalization through biased algorithms and institutional neglect, yet these same systems provide a terrain for decolonial creativity. Developers and communities are not merely reacting to exclusion—they are producing counter-infrastructures that reassert African epistemologies in digital spaces.
From a CDS perspective, the audits quantify structural inequality within AI assemblages; from a Decolonial Theory lens, the interviews and focus groups reveal the human capacity to transform those assemblages from within. Together, these insights demonstrate that linguistic justice is not achieved through technical inclusion alone, but through epistemic transformation: restructuring how AI systems value, represent, and learn from African languages. While community innovations show promise, their scalability remains limited without broader support, highlighting the need for policy interventions. In this sense, the data underscore a continuum between data colonialism and epistemic sovereignty. While algorithmic exclusion mirrors historical linguistic imperialism, community innovation and participatory design represent steps toward a future where African languages are not peripheral to AI but central to its ethical and cultural foundations.
Proposed framework for decolonizing algorithms
The proposed decolonial framework emerges directly from the study's empirical findings, shifting from diagnostic critique to grounded, transformative reconstruction. Algorithmic audits revealed persistent structural exclusions, such as low visibility and promotion for Setswana, Tshivenda, and Xitsonga content on platforms like X and YouTube, alongside translation and voice recognition error rates of 30% to 42% that distorted cultural idioms and worldviews. Interviews exposed institutional rationales, including developers’ reliance on English-dominant datasets justified by “scalability” priorities, which perpetuated top-down design logics and resource asymmetries. Focus groups surfaced community resistances, including demands for accurate representation, relational ethics in tool use, and grassroots adaptations like crowdsourced corrections via WhatsApp groups. Integrating these insights, the framework operationalizes core decolonial tenets—epistemic pluralism, relational accountability, community sovereignty, and epistemic justice—into actionable design guidelines that prioritize African linguistic rights and epistemologies over Western-centric optimization.
Central to the framework is epistemic pluralism and linguistic accuracy, which counters the imposed Western norms uncovered in the audits and focus groups. By prioritizing training data and model fine-tuning on local corpora, AI systems can reduce the mistranslations and accent misrecognitions that erase relational meanings, as when Setswana proverbs or Tshivenda tonal nuances are rendered erroneously. This principle draws inspiration from participatory models like Masakhane, whose community-led datasets and ongoing 2025–2026 grant initiatives (supported by Google.org, IDRC, and others) have advanced inclusive NLP for marginalized African languages. In practice, implementation involves collaborating with rural speakers to curate culturally grounded corpora and validating models against community benchmarks that extend beyond technical metrics like BLEU to include semantic fidelity and cultural resonance, thereby fostering epistemologies rooted in African contexts rather than imported universality.
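As a minimal sketch of what validation beyond BLEU could look like, the Python example below blends an automatic metric with community ratings of semantic fidelity and cultural resonance. The rating schema, the 1–5 scale, the equal weighting of the two community dimensions, and the default weight of 0.4 are illustrative assumptions, not values drawn from this study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CommunityRating:
    """One reviewer's ratings for a translated passage (hypothetical schema).

    Both fields use an assumed 1-5 scale, where 5 is the best rating.
    """
    semantic_fidelity: int
    cultural_resonance: int


def composite_score(automatic_metric, ratings, metric_weight=0.4):
    """Blend an automatic metric in [0, 1] with community ratings.

    Community ratings are rescaled from 1-5 to 0-1 and averaged across
    reviewers. The two community dimensions are weighted equally, which
    is an assumption made for illustration only.
    """
    if not ratings:
        raise ValueError("at least one community rating is required")
    if not 0.0 <= automatic_metric <= 1.0:
        raise ValueError("automatic_metric must lie in [0, 1]")
    if not 0.0 <= metric_weight <= 1.0:
        raise ValueError("metric_weight must lie in [0, 1]")

    fidelity = mean([(r.semantic_fidelity - 1) / 4 for r in ratings])
    resonance = mean([(r.cultural_resonance - 1) / 4 for r in ratings])
    community = (fidelity + resonance) / 2

    return metric_weight * automatic_metric + (1 - metric_weight) * community


# Example with invented ratings from two reviewers of one proverb translation.
ratings = [CommunityRating(2, 1), CommunityRating(3, 2)]
score = composite_score(automatic_metric=0.7, ratings=ratings)
print(f"composite score: {score:.3f}")
```

In this sketch a passage can score well on the automatic metric and still receive a low composite score when reviewers judge that its meaning or cultural register was lost. Any real weighting would need to be agreed with the language community rather than fixed in advance.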
Building on this foundation, participatory and relational governance ensures language communities serve as co-designers across the AI lifecycle, addressing the interview revelations of exclusionary processes and focus group calls for “our voices in the machine.” This principle embodies relational ethics, where accountability flows to those most affected, mitigating data colonialism through communal ownership and trust-building. Relational governance manifests in co-design workshops that start with accessible, low-tech methods—such as audio submissions during community radio sessions—and evolve into collaborative prototyping, ensuring decisions reflect lived realities and collective wellbeing.
Complementing these efforts, community-led infrastructures and data self-determination advocate for open-source, locally hosted models and stewardship protocols that respond to developer constraints on resources and rural infrastructural inequities highlighted in the findings. This principle aligns with ethnographic approaches urging co-creation in African contexts, as seen in Lelapa AI's multilingual tools (e.g., InkubaLM, supporting isiXhosa, isiZulu, and others) and Masakhane's decentralized repositories. Practical pathways include piloting localized applications (such as multilingual chatbots for public services) on community-controlled servers, with sovereignty safeguards to prevent extraction. These infrastructures prioritize offline/low-bandwidth compatibility, directly addressing the unreliable internet access that 60% of focus group participants cited as a barrier.
Finally, epistemic justice and cultural reclamation reframe AI evaluation beyond technical performance to measurable contributions to linguistic vitality and intergenerational transmission, grounded in focus group testimonies of cultural erosion and audit evidence of diminished digital presence. Success is assessed through community-reported indicators—such as improved language pride, usage in family storytelling, or preservation of proverbs—transforming inclusion from mere optimization to active reclamation. This principle supports integration into cultural practices, for instance, via partnerships with national radio stations like Motsweding FM or Thobela FM to broadcast AI-generated content in indigenous languages, extending reach and fostering affirmation.
To operationalize these interconnected principles, the framework envisions a phased, scalable implementation that emphasizes context-sensitive partnerships with South African institutions, aligning with national commitments to linguistic rights and the draft National AI Policy Framework, which prioritizes inclusion, equity, and localization of indigenous-language NLP. Initial assessment and co-design phases (6–12 months) involve community needs audits to identify priorities, such as translation for health or agricultural information, followed by ongoing partnership building with stakeholders: community radio for crowdsourcing and broadcasting; local universities (e.g., University of Limpopo's African Languages Department) for model fine-tuning and training; NGOs and cultural networks for stewardship workshops; tech hubs like Lelapa AI for app integration; and government bodies (e.g., Pan South African Language Board, DCDT) for policy mandates and funding. Deployment and iteration (12+ months onward) roll out prototypes with feedback loops (e.g., error-reporting mechanisms), while sustainability focuses on grants (e.g., Masakhane-style community funds) and advocacy for incentives in national digital strategies.
Anticipated challenges—such as limited funding, technical expertise gaps, and rural infrastructure barriers—are addressed through targeted strategies: policy advocacy for government/international allocations to low-resource datasets; university-tech collaborations for local capacity-building; and low-bandwidth/offline tool design. Expected outcomes include epistemic restoration (accurate reflection of African traditions, e.g., preserved proverbs in learning apps), community-led digital ecosystems (evolving WhatsApp initiatives into owned platforms), and a new benchmark for global AI that challenges North-centric dominance.
Stakeholder roles ensure adoption: communities as co-creators and validators for cultural relevance; developers shifting to community-driven priorities with open-source adoption; policymakers enacting linguistic equity regulations; and academic institutions supporting expertise-building. By centering epistemic sovereignty and linguistic rights, this framework transforms AI from a vehicle of assimilation into an instrument for cultural affirmation and decolonial futures. It responds directly to the evidence of linguistic erasure, institutional neglect, and community resilience, offering practitioners, policymakers, and developers a blueprint for equitable, culturally rich digital ecosystems in South Africa and beyond.
The framework's scope is deliberately anchored in the South African context, where constitutional recognition of 12 official languages intersects with a unique colonial legacy—British and Dutch settler colonialism compounded by apartheid-era linguistic engineering—that has produced enduring hierarchies privileging English and Afrikaans. This specificity enables precise application to cases of marginalized languages (e.g., Setswana, Tshivenda, and Xitsonga) in rural, infrastructurally marginalized settings, where AI-driven media reproduce epistemic violence through data scarcity, mistranslation, and visibility biases. The integration of CDS with Decolonial Theory thus offers a tailored lens for South Africa's post-apartheid multilingual policy environment and ongoing digital inequities.
At the same time, the framework holds significant transferability to other African contexts, where diverse colonial histories (e.g., French assimilation in West/Central Africa, Portuguese extractivism in Lusophone countries, and Belgian indirect rule in the Great Lakes region), language policies (e.g., Kiswahili promotion in East Africa, Arabic dominance in North Africa, or multilingual fragmentation in Nigeria), and AI infrastructures (varying from nascent national strategies to heavy reliance on Global North platforms) present analogous yet distinct challenges of linguistic marginalization and digital colonialism. Core elements—participatory governance, community-led data stewardship, epistemic pluralism, and reclamation-oriented evaluation—are adaptable because they address continent-wide patterns: extractive data practices without reciprocity, imposition of Western linguistic norms, and erasure of indigenous epistemologies. For instance, Masakhane's pan-African participatory NLP model demonstrates how relational, grassroots approaches can scale across borders, fostering ownership in low-resource settings from Kenya to Senegal. Similarly, frameworks emphasizing decolonial foresight (Mohamed et al., 2020) and ethnographic co-creation in African LLM deployment highlight shared imperatives for centering local agency amid global power asymmetries.
Transferability requires contextual adaptation: in francophone contexts, principles might prioritize resistance to French linguistic hegemony in public AI tools; in East Africa, alignment with Kiswahili policies could accelerate scaling; in regions with stronger data sovereignty movements (e.g., emerging national AI strategies in Rwanda or Kenya), community-led infrastructures gain immediate policy traction. This adaptability positions the framework not as a universal blueprint but as a flexible, relational model that invites iteration by African scholars, practitioners, and communities. By foregrounding epistemic justice and sovereignty, it contributes to broader decolonial AI scholarship—advancing beyond South Africa to support pluriversal technological futures across the continent, where diverse colonial legacies meet shared aspirations for linguistic dignity and self-determination.
Discussion
This study reveals how AI-driven media platforms perpetuate colonial linguistic hierarchies in South Africa, marginalizing Setswana, Tshivenda, and Xitsonga through reduced visibility, high translation and voice recognition error rates (30%–42%), and epistemic erasure. The empirical findings—from algorithmic audits that quantified structural exclusion and promotion biases, developer interviews that exposed scalability-driven prioritization of English-dominant datasets, and rural focus groups that highlighted lived silencing and calls for relational accountability—collectively underscore the urgency of decolonial intervention. The proposed framework, derived directly from these insights, shifts critique toward epistemic pluralism, participatory governance, community-led infrastructures, and linguistic reclamation, offering a pathway to reclaim African linguistic identities in digital spaces. Below, we discuss anticipated implementation challenges, study limitations, and directions for future research.
Operationalizing the decolonial framework encounters entrenched structural, resource, and institutional barriers within South Africa's unequal digital ecosystem, where rural divides, data asymmetries, and profit-oriented global platforms intersect with nascent national AI governance. These challenges and corresponding responses are organized thematically by key actors to clarify responsibilities, pathways, and transformative potential.
Platforms such as X, YouTube, and Google Translate present significant hurdles due to their proprietary, profit-oriented algorithms that favor dominant languages like English, resisting localization through opaque updates and engagement metrics that disadvantage marginalized-language content. Audits demonstrated persistent ranking and promotion biases, further complicated by dynamic changes that limit long-term reliability and exacerbate exclusion. To address this, advocacy for mandatory linguistic impact assessments and transparency reports aligns with South Africa's National AI Policy Framework, which emphasizes ethical AI, inclusion, risk mitigation, and equity in emerging technologies. Developers and researchers can contribute by submitting evidence from studies like this to inform revisions and consultations. Additionally, deploying open-source auditing extensions or community-maintained third-party monitoring tools enables real-time tracking of changes, building on participatory NLP practices. Pursuing collaborative API access for researchers and NGOs would facilitate sustained audits and bias reporting, reducing dependence on full platform cooperation.
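One way a community-maintained monitoring tool could track such changes is sketched below in Python. It compares the mean rank gap between indigenous-language and reference-language posts across two audit snapshots and flags a shift above a threshold. The record format, the group labels, and the threshold of 5 positions are hypothetical and do not describe the audit procedure used in this study.

```python
from statistics import mean


def mean_rank_gap(observations):
    """Return mean indigenous-language rank minus mean reference rank.

    `observations` is a list of (group, rank) pairs, where group is
    "indigenous" or "reference" and rank 1 is the top of the feed.
    A positive gap means indigenous-language posts sit lower on average.
    """
    indigenous = [rank for group, rank in observations if group == "indigenous"]
    reference = [rank for group, rank in observations if group == "reference"]
    if not indigenous or not reference:
        raise ValueError("both groups need at least one observation")
    return mean(indigenous) - mean(reference)


def gap_shift(earlier, later, threshold=5.0):
    """Compare two snapshots and flag a change in the rank gap.

    Returns (shift, flagged), where shift is the later gap minus the
    earlier gap and flagged is True when |shift| meets the threshold.
    """
    shift = mean_rank_gap(later) - mean_rank_gap(earlier)
    return shift, abs(shift) >= threshold


# Toy snapshots with invented ranks, for illustration only.
june = [("indigenous", 30), ("indigenous", 40), ("reference", 10), ("reference", 20)]
september = [("indigenous", 40), ("indigenous", 50), ("reference", 10), ("reference", 20)]

shift, flagged = gap_shift(june, september)
print(f"rank gap shifted by {shift} positions; flagged: {flagged}")
```

Because the sketch relies only on externally observable ranks, it shares the limitation of any outside audit: it can signal that placement changed after a platform update, but not why.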
Communities and local institutions, including rural speakers and community radio stations, face rural digital divides that hinder participation in data co-creation and tool development; focus groups indicated that 60% of participants used social media infrequently, with limited infrastructure amplifying marginalization, while volunteer-led efforts risk burnout without sustained support. Hybrid, low-tech partnerships offer a viable response: collaborating with radio stations—such as Motsweding FM in North West for Setswana regions or Thobela FM in Limpopo for Sepedi—allows crowdsourcing of audio and voice data through broadcasts, raises awareness, and pilots localized voice tools, leveraging radio's extensive rural reach for gradual digital inclusion and language vitality. Tiered participation models further support this by starting with accessible inputs like SMS or audio submissions during radio programs and progressing to app-based contributions as digital literacy and infrastructure improve. Securing micro-grants or backing from NGOs and rural development agencies can provide participant stipends, training, and mechanisms for relational accountability, ensuring long-term engagement.
Policymakers, developers, and institutions (including government bodies, universities, and tech hubs) encounter resource asymmetries, such as limited local AI talent and English-centric datasets, alongside top-down design logics where scalability often overshadows diversity, as revealed in interviews. Although the National AI Policy Framework prioritizes inclusion, ethics, and human-centered approaches, enforcement of linguistic rights remains nascent amid ongoing interdepartmental review and Cabinet consideration. Responses include integrating framework principles into national AI governance by lobbying the Department of Communications and Digital Technologies (DCDT) and the Pan South African Language Board for mandates on indigenous-language support in public tools, and using audit evidence to shape post-2024 revisions and stakeholder forums. Fostering collaborations between universities (e.g., University of Limpopo's African Languages Department) and tech hubs (e.g., Lelapa AI, advancing efficient multilingual models like InkubaLM) supports capacity-building workshops, open-source fine-tuning, and community co-design pilots. Establishing multistakeholder monitoring committees with community representation enables evaluation against epistemic justice metrics, such as linguistic vitality indicators beyond technical scores.
These actor-aligned strategies transform anticipated barriers into opportunities, grounding the framework in South African realities—rural infrastructures, existing community media, academic expertise, and evolving policy levers—while advancing relevance amid national AI discussions that emphasize equity, localization, and inclusive development.
This mixed-methods approach provides robust triangulation but carries constraints that shape the interpretation of the findings. Audits spanned four months (June–September 2025) using external tools, so platform dynamism may introduce temporal variability, potentially limiting generalizability to longer-term trends or behaviors after undocumented updates; this tempers claims about enduring exclusion patterns, though repeated runs and temporal bracketing mitigate snapshot bias. Purposive samples—1500 posts, 48 focus group participants, and 15 interviews—focused on rural Limpopo and North West contexts and three specific languages, restricting transferability to urban settings or other official languages (e.g., Nguni groups); findings thus diagnose targeted marginalization effectively but do not claim exhaustiveness. Reliance on external scraping and APIs precludes insight into proprietary internals, meaning results reflect observable outcomes rather than underlying causal mechanisms. Self-reported digital literacy and subjective focus group experiences introduce potential bias, although triangulation with quantitative audits strengthens overall validity. These limitations bound causal assertions and broad applicability yet reinforce the study's diagnostic strength for epistemic justice in marginalized languages and rural contexts.
Building directly on this work, future research could pursue several concrete avenues to extend methodological, theoretical, and practical contributions toward equitable linguistic futures in Africa. Longitudinal audits spanning 12 to 24 months could track visibility and accuracy metrics following implementation of the National AI Policy Framework, assessing policy impacts on linguistic inclusion and platform behaviors. Comparative studies across South African language families (e.g., Sotho-Tswana vs. Nguni) or pan-African low-resource settings would test the framework's transferability and scalability in diverse contexts. Experimental pilots of framework elements—such as community-radio co-designed voice models or participatory datasets—could evaluate pre- and postintervention outcomes on translation fidelity, user engagement, and intergenerational transmission. Policy-oriented evaluations might examine how decolonial principles integrate into national strategies, including linguistic impact assessments mandated by DCDT or PanSALB. Investigations into generational adaptations among youth and digital natives using indigenous-language AI tools would address emerging shifts in linguistic vitality amid accelerating AI adoption. These directions offer promising pathways for advancing evidence-based, inclusive AI governance.
Conclusion
This study illuminates the dual role of AI-driven media platforms in shaping the visibility and vitality of South Africa's marginalized languages—Setswana, Tshivenda, and Xitsonga. Through a mixed-methods approach combining algorithmic audits, interviews, and community focus groups, it reveals how AI systems, constrained by biased data and market-driven priorities, perpetuate colonial linguistic hierarchies by marginalizing indigenous languages in digital spaces. Simultaneously, it uncovers the transformative potential of community-led innovations, which leverage local knowledge to enhance linguistic representation and foster cultural pride. These findings underscore a critical tension: AI can either entrench historical inequities or amplify underrepresented voices, depending on how it is designed and deployed.
The evidence of systemic exclusion—low visibility, high error rates, and institutional neglect—highlights the urgent need to reorient AI development toward linguistic equity. Community-driven efforts, such as localized translation tools and adaptive strategies, demonstrate that inclusive digital ecosystems are possible when African epistemologies guide technological innovation. The proposed decolonial framework offers a pathway to achieve this by prioritizing epistemic sovereignty, linguistic rights, and participatory design. It advocates for co-created datasets, community-led architectures, and open-source infrastructures that empower local stakeholders to shape AI in alignment with South Africa's multilingual heritage.
Stakeholders—AI developers, policymakers, media practitioners, and communities—must collaborate to translate these insights into action. Investing in culturally rich datasets, fostering participatory development, and supporting accessible tools are essential steps to ensure that digital platforms reflect and celebrate linguistic diversity. By placing African languages at the heart of AI innovation, this study challenges the dominance of Western-centric paradigms and contributes to a global discourse on ethical AI. It envisions a digital future where Setswana, Tshivenda, and Xitsonga are not peripheral but integral to technological progress, fostering a more equitable and culturally vibrant media landscape that honors Africa's linguistic heritage.
Footnotes
Ethical statement
This study was conducted with strict adherence to ethical research principles to ensure the protection of participants’ rights, dignity, and cultural identities. Research ethics were implemented through culturally sensitive protocols: interviews and focus groups were conducted in participants’ native languages (Setswana, Tshivenda, or Xitsonga), discussions took place in accessible community locations, and all data were anonymized during collection and stored on password-protected, encrypted servers. Ethical clearance was obtained from an accredited institutional review committee at a South African university following a comprehensive review of the methodology and participant welfare measures, in compliance with national and international guidelines for research involving human participants. All participants provided informed consent before participation: written consent for interviews, and either written or witnessed verbal consent for focus group participants with varying literacy levels, after receiving detailed information about the study's purpose, voluntary nature, confidentiality protections, and their right to withdraw without consequence.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The author received funding from the South African National Research Foundation to conduct this research.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
