Construction of a comprehensive dataset for named entity recognition and entity linking in Algerian Dialectal Arabic

Abstract

Algerian Arabic (Darija) dominates digital communication in North Africa yet remains severely under-resourced in Natural Language Processing (NLP), hindering the development of robust applications for social media analysis and e-commerce. This paper addresses this scarcity by presenting a systematic framework for constructing and benchmarking Named Entity Recognition (NER) and Entity Linking (EL) resources tailored to the dialect’s linguistic complexity. We introduce a large-scale, multi-script dataset constructed through a novel hybrid methodology that integrates manual annotation of authentic texts, automated knowledge graph extraction from Wikidata, and rule-based synthetic generation. This approach ensures diverse coverage across ten semantic categories while explicitly addressing the challenges of code-switching and orthographic variation (Arabizi and Arabic script). A transformer-based model (XLM-RoBERTa) fine-tuned on this resource achieves state-of-the-art performance, demonstrating significant robustness compared to existing baselines. Beyond the dataset, we provide a practical deployment interface and comprehensive evaluation metrics, establishing a crucial foundation for advancing NLP capabilities in North African dialects and facilitating downstream tasks such as content moderation and cultural heritage preservation.
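As a rough illustration of the rule-based synthetic generation component mentioned above, the following minimal sketch fills sentence templates with entities from a small gazetteer and emits BIO tags. It is not the authors' pipeline: the gazetteer entries, templates, tag names, and the functions `fill_template` and `generate` are all hypothetical, and the example sentences are invented for illustration rather than drawn from the dataset.

```python
import random

# Hypothetical gazetteer: entity type -> surface forms, mixing Arabizi and
# Arabic script. In the paper such entries would come from Wikidata or manual
# annotation; these few are invented for illustration only.
GAZETTEER = {
    "LOC": ["Wahran", "وهران", "Sidi Bel Abbes"],
    "PER": ["Amine", "أمين"],
}

# Hypothetical templates; a token of the form {TYPE} is a slot to be filled
# with an entity of that type.
TEMPLATES = [
    "rani rayeh l {LOC} ghodwa",
    "{PER} yeskon f {LOC}",
]


def fill_template(template, gazetteer, rng):
    """Fill every slot in one template and return (tokens, bio_tags)."""
    tokens, tags = [], []
    for piece in template.split():
        if piece.startswith("{") and piece.endswith("}"):
            etype = piece[1:-1]
            entity_tokens = rng.choice(gazetteer[etype]).split()
            tokens.extend(entity_tokens)
            # First token of the entity gets B-, any remaining tokens get I-.
            tags.extend(["B-" + etype] + ["I-" + etype] * (len(entity_tokens) - 1))
        else:
            tokens.append(piece)
            tags.append("O")
    return tokens, tags


def generate(n, seed=0):
    """Generate n synthetic (tokens, tags) pairs with a seeded RNG."""
    rng = random.Random(seed)
    return [fill_template(rng.choice(TEMPLATES), GAZETTEER, rng) for _ in range(n)]
```

Calling `generate(100)` would yield 100 token/tag pairs in a shape that could be written out in CoNLL-style columns. Keeping both scripts in the same gazetteer is one simple way to expose a model to orthographic variation, though the paper's actual handling of Arabizi and Arabic script may differ.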

Keywords

Named entity recognition; entity linking; Algerian Dialectal Arabic; dataset construction; corpus annotation; Arabizi; script normalization; knowledge graph; low-resource languages
