Quantifying the Systematic Bias in the Accessibility and Inaccessibility of Web Scraping Content From URL-Logged Web-Browsing Digital Trace Data

Abstract

Social scientists and computer scientists are increasingly using observational digital trace data and analyzing these data post hoc to understand the content people are exposed to online. However, these content collection efforts may be systematically biased when the entirety of the data cannot be captured retroactively. We call the often unstated assumption that such content remains retrievable after the fact the problematic assumption of accessibility. To examine the extent to which this assumption may be problematic, we identify 107k hard news and misinformation web pages visited by a representative panel of 1,238 American adults and record the degree to which the web pages individuals visited were accessible via successful web scrapes or inaccessible via unsuccessful scrapes. Although the URLs collected are largely accessible with unrestricted content, we find systematic biases in which URLs are restricted, return an error, or are inaccessible. For example, conservative misinformation URLs are more likely to be inaccessible than other types of misinformation. We suggest how social scientists should capture and report digital trace and web scraping data.
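As a rough illustration of how scrape outcomes of the kind named above (accessible, restricted, error, inaccessible) might be recorded and reported, the following is a minimal Python sketch. It is not the authors' coding scheme: the function names `classify_scrape` and `outcome_shares`, the marker phrases, and the grouping of HTTP status codes into categories are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Hypothetical phrases suggesting a paywall or login gate (illustrative only).
RESTRICTION_MARKERS = ("subscribe to continue", "sign in to read")

# Status codes treated here as "restricted"; this grouping is an assumption.
RESTRICTED_STATUS = {401, 402, 403, 451}


def classify_scrape(status: Optional[int], body: str = "") -> str:
    """Map one scrape attempt to an outcome label.

    `status` is the final HTTP status code, or None when no response was
    received at all (for example a timeout or DNS failure). Redirects are
    assumed to have been followed by the HTTP client beforehand.
    """
    if status is None:
        return "inaccessible"
    if status in RESTRICTED_STATUS:
        return "restricted"
    if 200 <= status < 300:
        text = body.lower()
        # A successful response can still carry gated content.
        if any(marker in text for marker in RESTRICTION_MARKERS):
            return "restricted"
        return "accessible"
    # Any other status (e.g. 404, 500) is recorded as an error.
    return "error"


def outcome_shares(outcomes: Iterable[str]) -> Dict[str, float]:
    """Return the share of each outcome label, for reporting purposes."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}
```

Computing such shares separately for each category of URL (for example by source type) is one way the kind of differential inaccessibility described in the abstract could be surfaced and reported.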

Keywords

digital trace data, internet measurement, misinformation, web-log data, web scraping, news, news consumption
