Artificial intelligence-driven multilingual corpus for enhancing information retrieval in academic libraries

Abstract

Academic libraries in Sub-Saharan Africa and China face persistent challenges in cross-lingual retrieval due to linguistic fragmentation and uneven metadata infrastructures. This study constructed and evaluated a multilingual academic corpus designed to enhance semantic retrieval, metadata interoperability, and inclusive access across 13 languages. The core innovation lies in the Multilingual Adaptive Corpus for Retrieval Equity (MACRE), a modular architecture that integrates language-specific adapters, ontology-driven metadata harmonisation, and an intent disambiguation engine, features that collectively surpass existing retrieval frameworks. The project aligns with core library and information science (LIS) objectives by advancing user-centred discovery, cross-language cataloguing, and metadata standardisation in institutional repositories. A 41,129,582-token corpus was compiled from 39,725 academic records drawn from university repositories across both regions. The corpus incorporated Mandarin, English, and 11 African languages selected to reflect regional LIS priorities. Metadata was harmonised to SKOS and Schema.org standards. The proposed MACRE retrieval model was benchmarked against ColBERT-X, SwahiliDocBERT, and CrossLingual2Vec using cosine similarity, mean reciprocal rank (MRR), mean average precision (MAP), and normalised discounted cumulative gain (NDCG). Evaluation included ablation and post hoc analysis. Mandarin and English accounted for 64.6% of all tokens; Swahili reached 16.9%, while nine African languages contributed under 1.8% each. MACRE significantly outperformed all baselines (MRR = 0.864; MAP = 0.812; p < .001), particularly in LIS-aligned fields such as metadata accuracy (98.6%) and entity completion (94.9%). Adapter performance exceeded 90% in dominant languages but revealed key gaps in under-annotated African records. These findings illustrate that retrieval accuracy is not only a technical challenge but also reflects underlying LIS concerns, such as language equity, cataloguing depth, and metadata policy enforcement.
This study contributes a scalable LIS infrastructure for multilingual academic retrieval, advancing both technological and policy innovations for cross-lingual access in library systems.
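
For readers unfamiliar with the ranking measures named in the abstract, the following is a minimal Python sketch of how cosine similarity, MRR, MAP, and NDCG are conventionally computed. It is not the study's evaluation code: the function names, the toy relevance lists, and the simplification that every relevant record appears somewhere in the ranked list (used when normalising average precision) are all assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def reciprocal_rank(ranked_relevance):
    """1 / position of the first relevant result, or 0 if none is relevant."""
    for position, rel in enumerate(ranked_relevance, start=1):
        if rel > 0:
            return 1.0 / position
    return 0.0


def average_precision(ranked_relevance):
    """Mean of precision@k taken at each relevant position.

    Assumption: all relevant records appear in the ranked list, so the
    number of hits is used as the normaliser.
    """
    hits = 0
    precision_sum = 0.0
    for position, rel in enumerate(ranked_relevance, start=1):
        if rel > 0:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def dcg(gains):
    """Discounted cumulative gain with a log2(position + 1) discount."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))


def ndcg(ranked_gains):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0


def mean_over_queries(metric, runs):
    """Average a per-query metric over a list of ranked relevance lists."""
    return sum(metric(run) for run in runs) / len(runs) if runs else 0.0


# Hypothetical toy data: one ranked relevance list per query (1 = relevant).
toy_runs = [[0, 1, 0], [1, 0, 1]]
mrr = mean_over_queries(reciprocal_rank, toy_runs)
map_score = mean_over_queries(average_precision, toy_runs)
mean_ndcg = mean_over_queries(ndcg, toy_runs)
```

MRR rewards placing the first relevant record early, MAP rewards placing all relevant records early, and NDCG additionally supports graded relevance, which is why the three are commonly reported together.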

Keywords

multilingualism; information retrieval; natural language processing; metadata; low-resource languages
