Abstract
We describe some of the ways that the field of content analysis is being transformed in an era of Big Data. We argue that content analysis, from its beginning, has been concerned with extracting the main meanings of a text and mapping those meanings onto the space of a textual corpus. In contrast, we suggest that the emergence of new styles of text mining tools is creating an opportunity to develop a different kind of content analysis that we describe as a computational hermeneutics. Here the goal is to go beyond a mapping of the main meaning of a text and to mimic the kinds of questions and concerns that have traditionally been the focus of a hermeneutically grounded close reading, a reading attentive to what Kenneth Burke described as the poetic meanings of a text. We illustrate this approach by referring to our own work concerning the rhetorical character of US National Security Strategy documents.
Thin reading: The first century of content analysis
Content analysis describes a set of procedures for transforming texts, which are written by and intended to be read by people, into numerical datasets that are read by computers and intended to be interpreted with formal methods. 1 The goals of a content analysis vary from project to project, but mostly social scientists have sought to use these methods to measure the presence of a set of key meanings and to map the distribution of those meanings across the space of a textual corpus. 2
Scholars have been using these methods to analyze texts for over a hundred years, but a significant leap in technical sophistication occurred in the interdisciplinary crucible of the Second World War. Scholars like Harold Lasswell, who had written his dissertation at the University of Chicago on the propaganda campaigns of the First World War, worked on behalf of the US government to create new textual analysis procedures that could be used to gather information from newspapers and other strategic textual corpora. Lasswell, who served as director of the Experimental Division for the Study of Wartime Communications at the US Library of Congress, led a staff of brilliant young social scientists in developing a suite of methods for systematically reading large textual corpora in such a way that critical bits of information could be extracted and a measure of informational reliability could be calibrated (Lasswell et al., 1949). After the war, Lasswell signed on to help direct a project at Stanford’s Hoover Institute that used the same suite of methods to study 20,000 newspaper editorials, sampled from the “prestige” papers of five countries—France, Germany, Russia, the US and the UK (between 1890 and 1945). 3 The goal of the project was to map the changing symbolic frames of domestic and international politics and to compare these mappings across the dominant nation states in the years leading up to the war. The project is also a useful exemplar of the logic of analysis that came to define the field.
Lasswell and his Stanford colleagues began by laying out their data categories as a set of pre-defined keywords, phrases, and concepts. Next, they wrote out careful instructions (and a set of decision rules) for coding each item. 4 Finally, a team of human coders, following these procedures, read the corpus while searching for 206 place names and 210 key symbols reflecting major political ideologies of the times, concepts like “Nationalism, Nazism, Neutrality, and Nonintervention” (Lasswell et al., 1952: 43). Here, and throughout the history of modern content analysis, the technical procedures were designed to carefully pare away the complexity of textual information, reducing it to a small set of core informational units that could be reliably measured and mapped. In short, content analysis, from the start, has been focused on the goal of capturing a set of primary ideas, usually those constituting the manifest meaning of a textual corpus, which is to say, that which is expressed in plain view and about which there is little or no dispute. 5
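To make the logic of this dictionary-style coding concrete, the following sketch shows how such a procedure might be approximated today in a few lines of Python. The symbol list and the two toy "editorials" are hypothetical stand-ins for the Stanford project's actual categories and corpus; the code illustrates the counting logic only and is not a reconstruction of the original hand-coding protocol.

```python
# Hypothetical sketch of dictionary-style symbol coding: count a small set of
# pre-defined key symbols in each document of a corpus.
import re
from collections import Counter

KEY_SYMBOLS = ["nationalism", "nazism", "neutrality", "nonintervention"]  # illustrative only

def code_document(text, symbols=KEY_SYMBOLS):
    """Return the count of each key symbol found in one document."""
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {s: tokens[s] for s in symbols}

corpus = {  # toy stand-ins for the sampled editorials
    "editorial_001": "Neutrality was praised even as nationalism swept the press.",
    "editorial_002": "Editors who favored nonintervention warned against Nazism abroad.",
}

for doc_id, text in corpus.items():
    print(doc_id, code_document(text))
```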
The pursuit of this primary goal has meant that the subtleties of expression, the complexities of phrasing and the more nuanced meanings of textual corpora are discarded, with only the best, most efficient units of meaning being extracted and preserved as data. Even as the social sciences advanced and computing power exploded over the next half century, this core logic of content analysis methodologies lived on. 6 During the 1950s, content analysts began to “focus on counting internal contingencies between symbols instead of the simple frequencies of symbols” and they began to worry more about “problems of inference from verbal material to its antecedent conditions” (Pool, 1959: 2). This era introduced the use of co-occurrence matrices, which increasingly came to be used as a poor man’s measure of semantic structure. 7 In the 1960s, pre-coded computer dictionaries were compiled and shared among researchers in common domains of inquiry (Stone et al., 1966). By the 1970s and 1980s networks of causal assertions were being constructed by analysts closely reading transcripts of policy deliberations (Axelrod, 1976), and factor analysis and latent structure analysis technologies were being used to help excavate implicit meaning structures from a variety of textual datasets (e.g. Namenwirth and Lasswell, 1970; Weber, 1987). In the 1980s, hand-coding procedures were modified so as to record information about the semantic grammars that linked key terms together in relational sets (Franzosi, 1989, 1990). By the 1990s many new types of relational and, especially, network style methodologies were being applied to textual data, demonstrating new ways to unpack implicit meaning and communication structures (Abbott and Hrycak, 1990; Bearman and Stovel, 2000; Breiger, 2000; Carley, 1994; Cerulo, 1988; Ennis, 1992; Martin, 2000; Mohr, 1994, 1998; Tilly, 1997). And yet, throughout all these important advances, the core logic of the field did not change: the goal remained to extract the main bits of communicative content from the corpus, to apply formal methods to identify the principal components of the meaning structures (or the communication structures), and to map those onto the textual space of the corpus.
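The co-occurrence idea itself is simple enough to sketch. In the toy example below, two symbols are counted as co-occurring whenever they appear in the same sentence; the symbol list and sentences are invented for illustration, and applications of the period of course worked with far larger category schemes and context windows.

```python
# Hypothetical sketch of a symbol co-occurrence matrix: count how often pairs of
# key symbols appear together within the same sentence.
from collections import defaultdict
from itertools import combinations

symbols = ["security", "freedom", "terror", "weapons"]  # illustrative only
sentences = [
    "freedom and security go hand in hand",
    "weapons of terror threaten our security",
]

cooccurrence = defaultdict(int)
for sentence in sentences:
    present = sorted(s for s in symbols if s in sentence.split())
    for a, b in combinations(present, 2):
        cooccurrence[(a, b)] += 1

print(dict(cooccurrence))  # e.g. {('freedom', 'security'): 1, ('security', 'terror'): 1, ...}
```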
Close reading: The method of hermeneutic analysis
But, as any close reader will tell you, there is always more than one way to read a corpus. For humanists and scholars who specialize in the ‘close reading’ of texts it is the very complexity of their meanings, their nuanced peculiarities of style, their inherent multivocality, their complex layeredness and organized incoherence, indeed, it is precisely those things that are not so easily found on the surface of texts, that make them worthy of close study in the first place. It is hardly a surprise that content analysis projects, which have traditionally focused on capturing manifest meanings of textual corpora, have been of little interest to scholars who come from more hermeneutic disciplines. Those qualities of a text that are of greatest interest to a close reader are the very things that traditional content analysis projects seek to cleave away. 8
In contrast, when confronting a text, an experienced close reader will begin with a sense of the basic semantics, syntax, rhetorical forms, and genre of the given text or corpus, along with a knowledge of the text’s historical context. But a skilled reader must also have a sense of the positionings, form-takings and flows of the text—its creature-like qualities. A close reader will keep an eye on the text’s diachronic pulse, its fluidity, its starts and stops, expansions and contractions (managed through such things as repetition or recursivity or parentheticals …), accelerations and slow-downs. A close reader will move back and forth between the particulars and the whole and between the text and the context(s), and she will attend to any unexpected turn of phrase, to the anomalous character, to the anachronistic appearance of a feature from another genre, and so bring a more general “sense of the textual” to the reading. This will include a sense of the language(s) used, a sense of the genre in which the language is embedded, and a sense of the typical flows or relays of these texts with other texts and other symbolic mediations. This type of reading is enormously difficult and time-consuming and, as Paul Ricoeur taught us, always provisional, because once texts leave the hands of their writers, they launch themselves into unknown and unpredictable shaping contexts and interpreting readers. Nevertheless, sound and illuminating interpretive readings can be made and made to stick through the deployment of an effective hermeneutic practice.
The literary theorist Kenneth Burke described these as two different kinds of interpretations, a semantic and a poetic. According to Burke, a semantic interpretation seeks to clarify and to specify the precise and manifest communicative intention of a text, in much the same way that a postal system seeks to establish a clear and unambiguous mapping between written addresses and geo-physical destinations so that mail can be efficiently delivered to its proper destination. Thus for Burke, the semantic ideal has “the aim to evolve a vocabulary that gives the name and address of every event in the universe” (Burke, 1941: 141). But, as Burke also explains, “The address, as a counter, works only in so far as it indicates to the postal authorities what kind of operation should be undertaken. But it assumes an organization. Its meaning, then, involves the established procedures of the mails, and is in the instructions it gives for the performance of desired operations within this going concern” (Burke, 1941: 140).
Burke argues that human experiences are more complex than this because humans are suspended in elaborate webs of overlapping meanings. Burke explains, “when you have isolated your individual by the proper utilizing of the postal process, you have not at all adequately encompassed his ‘meaning.’ He means one thing to his family, another to his boss, another to his underlings, another to his creditors, etc.” (Burke, 1941: 142). It is this complex multiplicity of layered meanings that brings us toward a poetic reading of a text. Thus, a poetic interpretation is not concerned with the thinning out of meaning, but on the contrary, with the filling out of meaning. It is not arrived at by neutral analysis but instead through the expression and experience of passion and attitude. Burke writes, “(t)he semantic ideal envisions a vocabulary that avoids drama. The poetic ideal envisions a vocabulary that goes through drama … The first seeks to attain this end by the programmatic elimination of a weighted vocabulary at the start (the neutralization of names containing attitudes, emotional predisposition); and the second would attain the same end by exposure to the maximum profusion of weightings” (Burke, 1941: 149). If the first century of content analysis was focused on semantic interpretation, we expect that the next century will focus on the poetic.
Thick reading: The new age of computational hermeneutics
The arrival of Big Data is changing the way that social scientists and humanists analyze texts. Most obviously we have seen a transformation in the scale and the breadth of digitized textual corpora that are becoming available for analysis and this changes the kinds of questions that then come into focus (e.g. Goldstone and Underwood, 2012; Jockers, 2013; Jockers and Mimno, 2013; Lazer et al., 2009; Liu, 2013; Mayer-Schonberger and Cukier, 2013; Michel et al., 2011; Moretti, 2013; Tangherlini and Leonard, 2013, see also cases described by Bearman in this issue). But researchers are also making fundamental changes in how they use text analytic methodologies to measure the meanings and character of textual corpora. Here we discuss the emergence of one such strand of text mining sensibilities that we call computational hermeneutics. 9
The central idea of a computational hermeneutics is that all available text analysis tools can and should be drawn upon as needed in order to pursue a particular theory of reading. Here the most important impact of Big Data is the expansion of new types of algorithmic and computational tools for reading texts. Instead of restricting ourselves to collecting the best small pieces of information that can stand in for the textual whole, in the manner that Lasswell’s project illustrates, contemporary technologies give us the ability to consider a textual corpus in its full hermeneutic complexity and nuance. This is what creates the opportunity for a new style of computational hermeneutics. But with this change, the research questions also shift in a fundamental way. Now we must ask, given the complexity of the textual whole, how can we extract those various poetically meaningful components (or structurally intertwined sets of poetically meaningful components) that would be of greatest use to whatever interpretive intention we bring to the corpus? Put differently, how can we begin to focus on whatever combination of measurable textual features we would most want to attend to as a focused close reader of this text?
This is not just a methodological question; it is very much a theoretical question. Before we can ask what component of textual expression we would want to extract, we must have a theory of the text within which the concept of a component makes sense: a component of what? What is this meaningful whole? 10 This is the kind of interpretive endeavor that we think has become possible in the age of Big Data, and it has created the opportunity for building a different kind of computational hermeneutics.
An example: The changing rhetorical logic of US National Security Strategy
It was George W Bush’s preventive war doctrine that first led us to examine the corpus of National Security Strategy (NSS) reports of the United States executive branch. An initial close reading of several of the NSS reports suggested that there were some elements of these texts that a close reading, with its exclusively hermeneutic tools, could not illuminate but only dimly discern. The intuition was about latent networks—a deep structure of the international order (from the US point of view, of course) that involved interactions (exchanges, recognitions, performative speech acts like exhortations, expositives, exercitives, hailings, callings-out) and also political-cognitive mappings and structures. One possible approach anticipated a typologization of these networks along familiar social structural lines: family, clan, bureaucracy, corporation, schoolyard, populated by stock characters—patriarchs, elders, bullies, friends, partners, upstarts, middle managers. And the hope was that a more formal computational reading of the multiple texts would be able to discern if and how these networks took shape and behaved across the compiled set of texts.
Our first attempt to analyze this corpus of documents focused on their rhetorical form (Mohr et al., 2013). We were interested in rhetoric because it is a style of textual analysis that has deep roots in the history of hermeneutic studies and because it is constructed according to a series of fairly well understood, relatively formal properties and principles, which makes it a field of investigation that is amenable to the sorts of structured investigations that were of interest to us. The study of rhetoric is also a convenient way to link textual analysis to consequential matters because when rhetoric is deployed effectively in significant fields of social action, such as in the world of national security institutions, it can become one of the most materially powerful forms of institutionalized speech activity. This is because rhetorical logics can become enacted things in the world, by being re-deployed through bureaucratic forms: not necessarily as formal operating orders, but as rhetorical frames that undergird a broader discursive framing of the international scene. And so these are texts that matter as the public face of a configuration of rhetorical framings about the nature of international order.
Our corpus included 11 NSS statements (published from 1990 to 2010) and we drew upon Kenneth Burke’s insights about how to perform a rhetorical analysis (Burke, 1941). Burke proposes to identify the dramatic logic of a text by attending to how events are characterized within what he called the dramatistic pentad—this includes five terms, “what was done (act), when or where it was done (scene), who did it (agent), how he did it (agency), and why (purpose)” (Burke, 1945: xv). We asked: how might we apply new computer-based tools to read the corpus just as Kenneth Burke would have us read it? In our first effort, we employed topic models as a way to sort the corpus into different thematic arenas (or “scenes” in Burke’s terminology). 11 We then used Named Entity Recognition (NER) tools to identify different agents and semantic grammar analysis to identify the actions taken by these agents. We presented the results as graphs of actor–action–actor semantic networks broken out by topic. This gave us a way to begin to visualize the shifting rhetorical logics that moved across time and across US presidential administrations.
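For readers who want a concrete sense of how such a pipeline can be assembled, the sketch below strings together off-the-shelf tools in roughly the sequence just described: a topic model assigns each passage to a thematic "scene," named entities stand in for agents, and a simple dependency-parse pass approximates the semantic grammar step by linking subject–verb–object triples into actor–action–actor edges. Everything here (the toy corpus, the number of topics, the entity types retained, and the use of spaCy and scikit-learn rather than our own tool chain) is an illustrative assumption, not a description of the original study's implementation.

```python
# Hedged sketch: topic model -> "scenes"; NER -> agents; dependency parse ->
# actor-action-actor edges. Library choices and parameters are assumptions.
import spacy                                   # python -m spacy download en_core_web_sm
import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for paragraphs drawn from the NSS corpus.
paragraphs = [
    "Russia and Ukraine must control weapons of mass destruction.",
    "The United States will remove Saddam and expand freedom in Iraq.",
]
n_topics = 2                                   # illustrative; a real run would use more

# 1. Sort passages into thematic "scenes" with a topic model.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(paragraphs)
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
scene_of = lda.fit_transform(doc_term).argmax(axis=1)   # dominant topic per paragraph

# 2. Within each scene, link named-entity subjects to named-entity objects via verbs.
nlp = spacy.load("en_core_web_sm")
scene_graphs = {k: nx.MultiDiGraph() for k in range(n_topics)}
for text, scene in zip(paragraphs, scene_of):
    doc = nlp(text)
    entity_of = {ent.root.i: ent.text for ent in doc.ents
                 if ent.label_ in ("GPE", "ORG", "NORP", "PERSON")}
    for token in doc:
        if token.pos_ != "VERB":
            continue
        subjects = [c for c in token.children if c.dep_ == "nsubj" and c.i in entity_of]
        objects = [c for c in token.children if c.dep_ in ("dobj", "obj") and c.i in entity_of]
        for s in subjects:
            for o in objects:
                scene_graphs[scene].add_edge(entity_of[s.i], entity_of[o.i],
                                             action=token.lemma_)

for k, g in scene_graphs.items():
    print(k, list(g.edges(data=True)))
```

A full analysis would of course operate on the complete reports, choose the number of topics with more care, and handle passive constructions, coreference, and multiword phrases, but the overall shape of the pipeline is the same.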
Figure 1 shows how this approach captures the Clinton administration’s framing of the problem of terrorism in 1996. This was a frame that focused on the end of the Cold War. Key agents in the frame included Russia, China, Ukraine, Belarus, and Kazakhstan. Control was the primary action that was invoked, and the objects of concern were weapons of mass destruction, nuclear-tipped missiles, and nuclear proliferation. Local hot-spots were also included in this discussion—Bosnia, South Africa, and Guatemala, places where terrorism might flare up. Figure 2 shows how the same topic was discussed by the Bush administration a decade later. The frame has shifted dramatically toward the Middle East. Key agents in the discussion of terrorism now include Iraq, Iran, Syria, Egypt, Saudi Arabia, Israel, the Palestinians, and Saddam. From the Clinton administration’s concerns with ‘controlling’ the agents of terror, we have moved into the Bush administration’s focus on ‘removing’, ‘commanding’, ‘building’, and ‘expanding’. The overall goal is also much broader: from controlling the proliferation of materials, the focus has shifted toward creating opportunities for people to live in freedom so that terrorism will find no footing. In short, by focusing in on what Burke describes as the grammar of motives we are able to use automated methods to mimic a particular style of close reading and, in doing so, to reveal interpretations that are not immediately visible on the surface of these texts.
Figure 1. Actors and acts in terrorism discourse, NSS 1996 (Clinton).
Figure 2. Actors and acts in terrorism discourse, NSS 2006 (GW Bush).

Conclusion: Toward a computational hermeneutics
In this short essay we have sought to suggest that the age of Big Data is important not only because it presents us with opportunities to use larger and more comprehensive datasets, but also because it gives us the opportunity to change the way that we formally engage with and interpret textual corpora. Rather than seeking to extract small amounts of critical information that we hope is representative of the manifest meaning of a text, as has been true throughout the history of content analysis, the new age of computational hermeneutics provides us with a chance to pursue deeper, subtler and more poetic readings of textual corpora. Instead of focusing on the main communicative intentions of a text, we are now able to push toward the kind of close reading that has traditionally been conducted by hermeneutically oriented scholars who find not one simple uncontested communication, but multiple, contradictory and overlapping meanings. Instead of just content, we are now able to focus on style, and on the ways in which texts are embedded in broader literary conversations. 12 In this sense, style is substance.
Our own work in this domain is just beginning. For one thing, we have not yet completed Burke’s mandate for rhetorical analysis. In the paper discussed here, we focused on analyzing just three elements from Burke’s theory of the dramatistic pentad (actors, acts and scenes). In new work we hope to fill in the other two elements from Burke’s theory (agency and purpose). We are exploring ways of using “named-entity recognition” (NER) to code what Burke describes as the problem of “agency” (specifying what means or instruments were used in carrying out an act), and we are looking to see whether sentiment analysis can provide us with a way of coding Burke’s last rhetorical element, the “purpose” of the act. From here, there are many directions to proceed. For example, Burke highlights the elements of textual “friction” in his pentadic ratios (Act-Scene; Act-Agent; Agent-Purpose). This friction expresses the points of ambiguity, contradiction or uncertainty in texts where there is not a seamless alignment of all the pentadic elements—agents are not “at home” in their scene, acts don’t have a clear purpose, and so forth. We wish to press on this fundamental Burkean insight—a kind of textual uncertainty principle.
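One way the sentiment-for-purpose idea could be prototyped, purely as a thought experiment, is sketched below: candidate "purpose" clauses are pulled out with a crude pattern (clauses introduced by "in order to" or "so that") and each is given a score with an off-the-shelf sentiment lexicon. The clause-finding rule, the example sentence, and the choice of the VADER scorer are illustrative assumptions rather than a tested coding scheme.

```python
# Speculative sketch: extract candidate "purpose" clauses and score them with a
# sentiment lexicon. The clause pattern and the scorer are illustrative assumptions.
import re
from nltk.sentiment import SentimentIntensityAnalyzer   # requires nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()

def purpose_scores(sentence):
    """Return (clause, compound sentiment score) pairs for candidate purpose clauses."""
    clauses = re.findall(r"(?:in order to|so that)\s+([^.;,]+)", sentence, flags=re.IGNORECASE)
    return [(clause.strip(), sia.polarity_scores(clause)["compound"]) for clause in clauses]

example = ("We will act preemptively in order to prevent hostile acts, "
           "so that our people can live in freedom.")
print(purpose_scores(example))
```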
Our practical goal in this research has been to understand how these kinds of rhetorical logics are used to create a grounding of normalcy within which acts of strategy across the international order will be perceived as legitimate, rational and powerful. Once we are able to more effectively map these rhetorical elements in the current corpus, we are intrigued to see whether we will also be able to track how these rhetorical frames flow across broader institutional domains—how they are mimicked, changed, and contested by others. But our overall goal is to embrace the full complexity of the textual and to do so in a way that takes advantage of new opportunities for computational analysis. Ultimately we hope to employ these kinds of computational methods to assemble a sort of rhetorical interpretation machine. But more broadly we hope to have suggested that as these new kinds of computational methodologies continue to proliferate, so too does the need for skillful close readers, scholars who bring a sophisticated understanding of textuality, hermeneutics, and theories of close reading to bear. Without this sort of theorizing, the new computational methodologies will be severely hobbled.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
