Curating Training Data for Reliable Large-Scale Visual Data Analysis: Lessons from Identifying Trash in Street View Imagery

Abstract

Visual data have dramatically increased in quantity in the digital age, presenting new opportunities for social science research. However, the extensive time and labor required to process and analyze these data with existing approaches limit their use. Computer vision methods hold promise but often require large training datasets that do not yet exist for sociologically relevant variables. We present a cost-efficient method for curating training data that uses simple tasks and pairwise comparisons to interpret and analyze visual data at scale with computer vision. We apply our approach to detecting trash levels across space and over time in millions of street-level images from three physically distinct US cities. By comparing the results with ratings produced in a controlled setting and by applying computational methods, we demonstrate generally high reliability in the method and identify sources that limit it. Altogether, this approach expands how visual data can be used at a large scale in sociology.

Keywords

computer vision, crowdsourcing, visual data, urban sociology, systematic social observation
