Abstract
The exponential growth of web scraping as a data collection methodology has outpaced the development of comprehensive ethical frameworks, particularly in Global South contexts, where digital infrastructure and regulatory environments present unique challenges. This study addresses the gap between technical capability and ethical responsibility by developing and validating an integrated Ethical Web Scraping Lifecycle Framework. Through Latent Dirichlet Allocation analysis of 6,055 scholarly documents, we first identify a fundamental epistemological schism between technical implementation and ethical discourse in current web scraping practices. Building on this empirical foundation, we introduce a novel five-phase framework that operationalizes ethical principles through practical checklists, technical protocols, and adaptive response mechanisms. The framework’s efficacy is demonstrated through a longitudinal case study monitoring commodity prices across 129 Zimbabwean firms, which extracted 12,067 product records while maintaining rigorous ethical standards. Our findings reveal that HTTP 403 errors constitute a significant form of non-response (72.9% of cases) that must be formally accounted for in sampling frameworks. The study contributes methodologically, by bridging the technical-ethical divide through an empirically grounded approach, and practically, by providing National Statistical Offices and researchers with an implementable framework for responsible data collection that balances research utility with legal compliance and social awareness in increasingly regulated digital ecosystems.
