Abstract
Algerian Arabic (Darija) dominates digital communication in North Africa yet remains severely under-resourced in Natural Language Processing (NLP), hindering the development of robust applications for social media analysis and e-commerce. This paper addresses this scarcity by presenting a systematic framework for constructing and benchmarking Named Entity Recognition (NER) and Entity Linking (EL) resources tailored to the dialect’s linguistic complexity. We introduce a large-scale, multi-script dataset constructed through a novel hybrid methodology that integrates manual annotation of authentic texts, automated knowledge graph extraction from Wikidata, and rule-based synthetic generation. This approach ensures diverse coverage across ten semantic categories while explicitly addressing the challenges of code-switching and orthographic variation (Arabizi and Arabic script). A transformer-based model (XLM-RoBERTa) fine-tuned on this resource achieves state-of-the-art performance, demonstrating significant robustness compared to existing baselines. Beyond the dataset, we provide a practical deployment interface and comprehensive evaluation metrics, establishing a crucial foundation for advancing NLP capabilities in North African dialects and facilitating downstream tasks such as content moderation and cultural heritage preservation.
