Web Scraping Techniques for Surgical Research: A Technical Tutorial With a Worked Example in Publication Data Mining

Abstract

Background

Web scraping—the automated extraction of data from websites—has become an essential technique for researchers seeking to collect large-scale data that would be impractical to gather manually. Surgeon-scientists increasingly encounter publicly available web data relevant to outcomes research, health services analysis, workforce studies, and policy work, yet technical guidance on implementing web scrapers remains limited in the surgical literature.

Methods

This tutorial provides a clinician-oriented technical guide to web scraping for surgical research. We present key concepts including static vs dynamic websites, CSS selectors, browser automation, rate limiting, and ethical considerations. A complete worked example demonstrates the full pipeline by scraping a surgical research group’s publication page (https://www.onetomapanalytics.com) to build a structured bibliometric database.
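As a rough illustration of the rate-limiting and ethical-access concepts listed above, the following Python sketch wraps an arbitrary fetch function with a minimum delay between requests and a robots.txt check. It is not the tutorial's own code: the class name `PoliteFetcher`, the `research-bot` user agent, the 2-second delay, and the injected `fetch`, `clock`, and `sleep` parameters are all assumptions made for this example.

```python
import time
from urllib import robotparser


class PoliteFetcher:
    """Hypothetical helper: spaces out requests and honors robots.txt rules."""

    def __init__(self, fetch, robots_lines, user_agent="research-bot",
                 min_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.fetch = fetch            # callable(url) -> page content
        self.user_agent = user_agent  # assumed identifier for the scraper
        self.min_delay = min_delay    # assumed minimum seconds between requests
        self.clock = clock
        self.sleep = sleep
        self.last_request = None
        # Parse robots.txt rules supplied as an iterable of text lines.
        self.robots = robotparser.RobotFileParser()
        self.robots.parse(robots_lines)

    def get(self, url):
        # Refuse to fetch anything the site's robots.txt disallows.
        if not self.robots.can_fetch(self.user_agent, url):
            raise PermissionError(f"robots.txt disallows {url}")
        # Wait until at least min_delay seconds have passed since the last request.
        if self.last_request is not None:
            wait = self.min_delay - (self.clock() - self.last_request)
            if wait > 0:
                self.sleep(wait)
        self.last_request = self.clock()
        return self.fetch(url)
```

In practice the `fetch` argument could be any function that returns page content for a URL, for example one built on an HTTP client or a browser-automation session. Injecting it keeps the delay and permission logic separate from the download mechanism and lets the logic be exercised without network access.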

Results

The worked example successfully extracts structured publication data—including titles, author lists, abstracts, keywords, and PubMed links—from a JavaScript-rendered website, producing an analysis-ready data set. We demonstrate how this pipeline generalizes to other surgical research applications including hospital price transparency data, residency program characteristics, and quality metrics.
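The extraction step can be sketched in Python with the standard library alone. This is an illustrative sketch under stated assumptions, not the article's pipeline: the markup in `SAMPLE_HTML`, the class names `publication`, `title`, `authors`, and `pubmed`, and the placeholder records are all hypothetical, and the real page's structure may differ. For a JavaScript-rendered site, the HTML string would typically be obtained after rendering, for example from Selenium's `driver.page_source`.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical rendered markup; the real page's structure is not known here.
SAMPLE_HTML = """
<div class="publication">
  <h3 class="title">Example title A</h3>
  <p class="authors">Author One; Author Two</p>
  <a class="pubmed" href="https://pubmed.ncbi.nlm.nih.gov/00000000/">PubMed</a>
</div>
<div class="publication">
  <h3 class="title">Example title B</h3>
  <p class="authors">Author Three</p>
</div>
"""

COLUMNS = ["title", "authors", "pubmed_url"]


class PublicationParser(HTMLParser):
    """Collects one record per <div class="publication"> block."""

    TEXT_FIELDS = {"title", "authors"}  # assumed class names for text fields

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None    # record being filled, or None outside a block
        self._field = None      # text field currently being read
        self._field_tag = None  # tag that opened the current text field
        self._depth = 0         # <div> nesting depth inside the current block

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            if self._current is not None:
                self._depth += 1
            elif "publication" in classes:
                self._current = dict.fromkeys(COLUMNS, "")
                self._depth = 1
            return
        if self._current is None:
            return
        if tag == "a" and "pubmed" in classes:
            self._current["pubmed_url"] = attrs.get("href") or ""
            return
        matched = self.TEXT_FIELDS.intersection(classes)
        if matched:
            self._field, self._field_tag = matched.pop(), tag

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag == self._field_tag:
            self._field = self._field_tag = None
        elif tag == "div":
            self._depth -= 1
            if self._depth == 0:  # closing tag of the publication block
                record = {k: v.strip() for k, v in self._current.items()}
                self.records.append(record)
                self._current = None


def parse_publications(html):
    """Return a list of dicts, one per publication block found in html."""
    parser = PublicationParser()
    parser.feed(html)
    parser.close()
    return parser.records


def to_csv(records):
    """Serialize records to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Calling `to_csv(parse_publications(SAMPLE_HTML))` yields one row per publication block, with an empty `pubmed_url` where no link is present. A third-party parser with CSS selector support would shorten the extraction code; the standard-library version is used here only to keep the sketch dependency-free.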

Conclusions

Web scraping is a powerful tool for surgeon-scientists when implemented with technical rigor and ethical responsibility. By anchoring the tutorial to a concrete surgical use case and providing a reusable code template, we equip surgical researchers with the foundational knowledge to design, implement, and adapt web scrapers for their own data collection projects.

Keywords

web scraping, surgical research, bibliometrics, publication data mining, Python, Selenium, hospital price transparency, research methodology, automation, data extraction, browser automation

