Redefining non-response in the era of big data: An ethical and technical framework for web data extraction in national statistical offices

Abstract

The exponential growth of web scraping as a data collection methodology has outpaced the development of comprehensive ethical frameworks, particularly for Global South contexts where digital infrastructure and regulatory environments present unique challenges. This study addresses the critical gap between technical capability and ethical responsibility by developing and validating an integrated Ethical Web Scraping Lifecycle Framework. Through Latent Dirichlet Allocation analysis of 6,055 scholarly documents, we first identify the fundamental epistemological schism between technical implementation and ethical discourse in current web scraping practices. Building on this empirical foundation, we introduce a novel five-phase framework that operationalizes ethical principles through practical checklists, technical protocols, and adaptive response mechanisms. The framework’s efficacy is demonstrated through a longitudinal case study that monitored commodity prices across 129 Zimbabwean firms and extracted 12,067 product records while maintaining rigorous ethical standards. Our findings reveal that HTTP 403 errors constitute a significant form of non-response (72.9% of cases) that must be formally accounted for in sampling frameworks. The study contributes both methodologically, by bridging the technical-ethical divide through an empirically grounded approach, and practically, by providing National Statistical Offices and researchers with an implementable framework for responsible data collection that balances research utility with legal compliance and social awareness in increasingly regulated digital ecosystems.

Keywords

Data colonialism, ethical web scraping, data provenance, natural language processing, data governance

