Text as Data: A New Framework for Machine Learning and the Social Sciences

Abstract

Text as Data: A New Framework for Machine Learning and the Social Sciences (hereafter TaD) presents a perspective developed by Justin Grimmer, Margaret Roberts, and Brandon Stewart (hereafter GRS) that allows social scientists to make the most of substantial recent advances in computational capacity, which make it possible to analyze the growing volume of large-scale text data. TaD doesn’t so much invent a set of methods (though it does present and adapt developments from GRS’s own work, along with that of myriad other scholars) as illustrate a means to conceptualize, operationalize, and analyze text data in a way that remains squarely social-scientific. This is especially refreshing when so many applications of “computational social science” merely apply recently developed computational tools to “social” questions while having little to no theoretical or empirical grounding in the social sciences.

TaD presents an extremely broad approach to making use of textual data, while also providing depth in many of its key elements (note, e.g., its 28 chapters). At the same time, there are key dimensions it doesn’t address (e.g., computational tutorials and code¹).

At the outset of the book, GRS state a few fundamental assumptions underlying their approach. First, their approach is resolutely agnostic in orientation: they assume no real “ground truth” against which methods can be assessed, because “Discovering an organization of texts is fundamentally different from classifying documents into pre-existing categories” (p. 30). This makes theory central to selecting among the range of methods they address. Second, they consider social science to be fundamentally explanatory at heart, seeking to “learn something about the human processes that documents record and represent” (p. 33).

Taking this into account, TaD presents most of its material in a parallel structure: conceptual explanation (often visual and textual), mathematical formulation, and empirical application. This combination is especially helpful for developing intuitions about how various elements operate and their potential use cases.

TaD breaks down text analysis into four primary stages: discovery, measurement, inference, then iteration and cumulation of evidence. Of these, GRS make the case that discovery is the most underappreciated element in the social sciences, which motivates them both to (1) explain its importance in some detail and (2) leverage those notions to differentiate how social scientists will likely use computational text-analytic methods from the orientations of computer science, where they were most often developed.

GRS define discovery as the phase of research where a question is refined and scholars generate a conceptualization of the problem, something TaD suggests (and I wholeheartedly agree) needs more explicit attention within the social sciences. GRS’s approach to discovery entails five steps: (1) detect difference, (2) sort data into partitions, (3) tabulate proportions, (4) reduce and summarize, then (5) label (which, they note, is ultimately a human activity that often requires “careful reading and automated heuristics” [p. 159], emphasis mine). While the steps of measurement, inference, and iteration ultimately carry more direct parallels to other research methods, GRS also highlight some key particularities of text analyses (e.g., determining the proper number of topics for a model can be a vexing combination of art and science; evaluating topic measurements requires differentiating the probability of being in category k from the intensity of supporting category k; and embracing their agnostic approach makes it necessary to formulate and apply a variety of validation methods). Perhaps surprisingly to some, their approach has strong, and direct, parallels to qualitative orientations to text analysis (e.g., approaching it as “a sequential, iterative, and inductive approach to research” [p. 4]), an overlap that could be assessed more explicitly (e.g., Miner et al. 2023).
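For readers who want a concrete feel for the middle discovery steps (partition, tabulate, summarize), a minimal Python sketch may help. It is my own illustration rather than anything drawn from TaD: the toy corpus, the hard-coded partition assignment (standing in for whatever clustering or topic model a scholar might use), and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus (not from the book).
docs = [
    "tax cuts help small business growth",
    "lower tax rates boost business investment",
    "hospital funding improves patient care",
    "patient care needs more hospital nurses",
    "business tax reform passes",
]

# Assumption: a partition assignment, one integer per document,
# standing in for the output of a clustering or topic model.
partition = [0, 0, 1, 1, 0]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def proportions(assignments):
    """Tabulate the share of documents falling in each partition."""
    counts = Counter(assignments)
    total = len(assignments)
    return {k: counts[k] / total for k in sorted(counts)}


def distinctive_words(docs, assignments, k, top_n=3):
    """Summarize partition k by the words whose relative frequency inside
    the partition most exceeds their relative frequency outside it."""
    inside, outside = Counter(), Counter()
    for doc, a in zip(docs, assignments):
        (inside if a == k else outside).update(tokenize(doc))
    n_in = sum(inside.values()) or 1
    n_out = sum(outside.values()) or 1
    score = {w: inside[w] / n_in - outside[w] / n_out for w in inside}
    # Sort by score (descending), breaking ties alphabetically.
    return sorted(score, key=lambda w: (-score[w], w))[:top_n]


if __name__ == "__main__":
    print(proportions(partition))
    for k in sorted(set(partition)):
        print(k, distinctive_words(docs, partition, k))
```

The sketch stops short of the final step on purpose: the word lists only prompt a label, and assigning one remains the human, interpretive task GRS describe.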

TaD comes at a uniquely opportune moment in the social sciences, one that combines the rising ubiquity of large volumes of increasingly digitized and widely accessible text-based data with recent advances in computational capacity, making the approaches GRS describe available to a much wider set of scholars than has previously been the case. TaD, however, emphatically notes that these opportunities heighten the importance of theory and strong disciplinary footing (rather than signaling their demise, as some have unwisely suggested). GRS especially contend that advanced methods and ubiquitous data are no replacement for theoretical development or for leveraging substantive expertise. Furthermore, knowing which data are relevant and determining how to properly interpret inferences from their analyses require drawing on the combination of these foundations. That said, this expanding set of resources in our toolkits does afford the social sciences an opportunity to rethink the types of questions we can address.

TaD provides thorough and accessible conceptual and formal coverage of the questions most necessary for scholars considering dipping their toes into this fresh pool (and likely even for those who have already taken the plunge). It will serve as a useful resource for years to come, both in courses with text analysis as their focus and as a source to sample from in courses that engage with these approaches in a more limited way. Making optimal use of TaD in courses will likely necessitate companion computational tutorials (but their intentional exclusion from TaD bolsters its long-term potential value; had they been included, those elements would likely already have been outdated by the time the book made it to print).

Footnotes

1. Example “companion” packages are available, including .

References

Miner, Adam S., Sheridan A. Stewart, Meghan C. Halley, Laura K. Nelson, and Eleni Linos. 2023. “Formally Comparing Topic Models and Human-Generated Qualitative Coding of Physician Mothers’ Experiences of Workplace Discrimination.” Big Data & Society 10(1):1–17.