The Future of Coding: A Comparison of Hand-Coding and Three Types of Computer-Assisted Text Analysis Methods

Abstract

Advances in computer science and computational linguistics have yielded new, and faster, computational approaches to structuring and analyzing textual data. These approaches perform well on tasks like information extraction, but their ability to identify complex, socially constructed, and unsettled theoretical concepts—a central goal of sociological content analysis—has not been tested. To fill this gap, we compare the results produced by three common computer-assisted approaches—dictionary, supervised machine learning (SML), and unsupervised machine learning—to those produced through a rigorous hand-coding analysis of inequality in the news (N = 1,253 articles). Although we find that SML methods perform best in replicating hand-coded results, we document and clarify the strengths and weaknesses of each approach, including how they can complement one another. We argue that content analysts in the social sciences would do well to keep all these approaches in their toolkit, deploying them purposefully according to the task at hand.

Keywords

supervised machine learning, hand-coding methods, unsupervised machine learning, dictionary methods, content/text analysis, inequality
