Abstract
Academic libraries in Sub-Saharan Africa and China face persistent challenges in cross-lingual retrieval due to linguistic fragmentation and uneven metadata infrastructures. This study constructed and evaluated a multilingual academic corpus designed to enhance semantic retrieval, metadata interoperability, and inclusive access across 13 languages. The core innovation is the Multilingual Adaptive Corpus for Retrieval Equity (MACRE), a modular architecture that integrates language-specific adapters, ontology-driven metadata harmonisation, and an intent disambiguation engine, features that collectively surpass existing retrieval frameworks. The project aligns with core LIS objectives by advancing user-centred discovery, cross-language cataloguing, and metadata standardisation in institutional repositories. A 41,129,582-token corpus was compiled from 39,725 academic records drawn from university repositories across both regions. The corpus incorporated Mandarin, English, and 11 African languages selected to reflect regional LIS priorities. Metadata were harmonised to SKOS and Schema.org standards. The proposed MACRE retrieval model was benchmarked against ColBERT-X, SwahiliDocBERT, and CrossLingual2Vec using cosine similarity, MRR, MAP, and NDCG. Evaluation included ablation and post hoc analysis. Mandarin and English together accounted for 64.6% of all tokens; Swahili reached 16.9%, while nine African languages contributed under 1.8% each. MACRE significantly outperformed all baselines (MRR = 0.864; MAP = 0.812;
