Abstract
Debates over “Big Data” have generated more heat than light in the humanities, because the term ascribes new importance to statistical methods without explaining how those methods have changed. What we badly need instead is a conversation about the substantive innovations that have made statistical modeling useful for disciplines where, in the past, it truly wasn’t. These innovations are partly technical, but more fundamentally expressed in what Leo Breiman calls a new “culture” of statistical modeling. Where 20th-century methods often required humanists to squeeze our unstructured texts, sounds, or images into some special-purpose data model, new methods can handle unstructured evidence more directly by modeling it in a high-dimensional space. This opens a range of research opportunities that humanists have barely begun to discuss. To date, topic modeling has received most attention, but in the long run, supervised predictive models may be even more important. I sketch their potential by describing how Jordan Sellers and I have begun to model poetic distinction in the long 19th century—revealing an arc of gradual change much longer than received literary histories would lead us to expect.
Why humanists have distrusted data
In the last decade or so, humanists have struggled to assimilate a series of innovations that played out across a century in the social sciences: learning algorithms, new scales of analysis, the very idea of a “model.” Understandably, we tend to perceive all these things as a single tangled knot—and lately, “Big Data” is the name we give it. For most humanists, “Big Data” doesn’t imply a particular set of new methods; it just evokes a cloudy, gigantic version of everything we already distrusted about numbers (Marche, 2012).
It’s an unfortunate misunderstanding, because the modeling strategies that have emerged over the last 20 years are importantly different from the quantitative methods we used to know and dislike—and are creating a remarkable opportunity for humanists. It’s an opportunity related to new scales of exploration, but perhaps not best communicated as sheer bigness. This may be a better place to start: why haven’t statistical models worked in our disciplines before, and what has changed that would make them work now?
I think literary historians were mostly right, for instance, to ignore statistical models in the 20th century. The modeling methods that prevailed for most of that century were best suited to structured data sets with relatively few variables, and that isn’t the form our subject usually takes. Sociologists could use linear regression to model social mobility, but it wasn’t clear how we could use that method on unstructured text. Of course, you might convert a collection of novels into structured data by defining a scheme for content analysis (murder happens—yes or no—spooky mansion present—yes or no). But literary scholars have rarely been willing to trust that kind of data model, and there may be some rationale for our reluctance (Posner, 2015). Among other things, the questions posed in a historical discipline are likely to cut across periods that organize the world differently, and it’s difficult to envision a coding scheme that would make all those different sets of categories comparable. Similar problems can arise in social science, but in literary history, problems of historical incommensurability are in effect the discipline’s central subject, so quantitative methods have found little foothold. Instead, literary historians have typically contrasted a small number of texts in a sensitive, qualitative way. This approach also had its pitfalls, since a literary period with complex internal divisions isn’t well represented by a few emblematic works (Moretti, 2005). But it was hard to see an alternative.
Why data now matters for us
The emergence of techniques for modeling unstructured text changes this landscape in a way that humanists have been slow to appreciate. The boundary we used to take for granted between the humanities and quantitative social sciences no longer has a rationale rooted in the nature of our material.
By “new techniques for modeling unstructured text,” I mean partly the new algorithms that make it possible to model thousands of variables without overfitting, and partly the computing power that makes it practical to try. But more fundamentally I’m talking about what Leo Breiman called a new “culture” of statistical modeling—a culture that doesn’t assume we need to craft a model by deciding in advance which variables matter for a given problem (2001). Instead, it’s now possible to start a modeling process by admitting that we don’t know which variables matter. We don’t really know, for instance, whether murders and mansions were the key elements distinguishing literary genres. But we can still attempt to model genre, by gathering thousands of variables and asking a learning algorithm to identify the variables that do reliably distinguish examples of different genres. Because this strategy doesn’t require us to choose variables closely tailored to a particular research question, we can begin the process with a gesture as simple as counting words.
So-called “bags of words” are not usually perceived as a natural data model for the humanities. As writers and readers, we experience writing sequentially; we don’t experience it as a distribution over the lexicon. So scholars often find it difficult to believe that useful information can be recovered from word frequencies. The mere counting of words seems like a “blunt hermeneutic instrument” (Trumpener, 2009). But in reality, words are important little things, and a high-dimensional space defined by thousands of them gives us room to trace complex literary boundaries that don’t line up with any single term. Distributions over lexical space can simultaneously register genre, topic, tone, and even (as we will see) the social context of writing—without requiring a researcher to decide in advance which variables represent “genre” and which “social context.”
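To make this concrete, here is a minimal sketch of a bag-of-words representation in Python, using scikit-learn (the library cited later in this article). The two toy passages are invented for illustration; the point is only that each text becomes a vector in a space with one dimension per word.

```python
# A minimal bag-of-words sketch using scikit-learn (Pedregosa et al., 2011).
# The two passages are invented placeholders, not data from the study.
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "the wind moaned over the ruined abbey walls",  # hypothetical Gothic passage
    "she weighed the ledger against the lease",     # hypothetical realist passage
]

# Each text becomes a vector of word counts: one dimension per word type.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # the dimensions of the space
print(X.toarray())                         # rows = texts, columns = counts
```

Scaled up to thousands of volumes, the same gesture produces the high-dimensional space in which the boundaries discussed below are traced.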
No data model is a universal solvent, and a bag-of-words representation has limits like any other. If we want to model poetic meter, for instance, we will after all need information about word order. So there is no escape from theory here: researchers working with loosely structured data do still have to make initial assumptions about the evidence relevant to their question. But assumptions are not all equally confining. The presupposition that “genre may affect word choice” is a great deal more open-ended than an assumption that the Gothic means specifically murders and haunted mansions. Loosely structured representations of text allow us to model concepts from examples in a high-dimensional space, without designating specific variables as their proxies. That flexibility can make an enormous difference in the humanities.
Supervised and unsupervised strategies
We do sacrifice something by moving into a space with thousands of dimensions. In traditional statistical models, prediction and explanation were tightly coupled. The models produced by learning algorithms, by contrast, maximize predictive accuracy without guaranteeing a crisp explanation: their “explanations” may be diffused across thousands of coefficients (Shmueli, 2010).
But this challenge can be overstated; in practice I don’t find it difficult to glean explanatory insights from, say, a regularized regression model. If machine learning seems inscrutable, it’s more likely because recent discussions in the humanities have focused on unsupervised algorithms, like most of those used for topic modeling (DiMaggio et al., 2013; Goldstone and Underwood, 2014; Liu, 2013). Topic modeling is useful as a discovery strategy, but one has to admit that the explanatory goals of an unsupervised model are inherently a bit diffuse (see the warnings in Mohr and Bogdanov, 2013: 23–25).
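For instance, here is a hedged sketch of how one might glean that sort of insight from a regularized logistic regression: fit the model, then rank its coefficients and read the extremes. Everything below is randomly generated placeholder data, not evidence from any real corpus.

```python
# A sketch of reading "explanations" off a regularized model's coefficients.
# All data here is randomly generated for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = np.array([f"word_{i}" for i in range(1000)])  # hypothetical vocabulary
X = rng.poisson(1.0, size=(200, 1000))                # fake word-count matrix
y = rng.integers(0, 2, size=200)                      # fake class labels

# L2 regularization (scikit-learn's default) shrinks coefficients toward
# zero, which is what lets a 1000-variable model fit only 200 examples.
model = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)

ranked = np.argsort(model.coef_[0])
print("most negative weights:", vocab[ranked[:10]])
print("most positive weights:", vocab[ranked[-10:]])
```

The explanation is diffused across a thousand coefficients, but the extremes of the ranking are usually legible enough to interpret.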
For staging a demonstration of new methods, it has been a rhetorical advantage that topic models are unsupervised—it says, in effect, “nothing up my sleeve.” But as we move beyond the demonstration stage of text analysis, supervised predictive models may become more important (Scharkow, 2013). We are, after all, usually working on specific research questions: we’re going to want methods that can be pointed at specific topics.
Perhaps humanists have been slow to appreciate supervised models because the process sounds circular: since you have to start from labeled examples, it may at first appear that a supervised model could only reproduce categories you already understood. But in practice we can often locate examples of social phenomena without clearly understanding the categories they instantiate. I know how to find a sample of bestsellers, for instance, or a sample of experimental novels reviewed in highbrow magazines, although I can’t necessarily define “the avant-garde” or “mass culture.” Indeed, the difficulty of defining those abstractions may be no accident: Andrew Abbott has argued that the social entities we talk about are in general less substantial than the boundaries that separate them (1995). Whether or not we affirm Abbott’s point as a general principle, it’s certainly true that literary historians often start with observed differences between particular groups of texts. Supervised predictive models allow us to build on that foundation—mapping the literary field by sampling works from different social locations, and modeling the boundaries between them.
A model of poetic distinction, 1820–1919
For instance, suppose I want to understand the history of poetic prestige. We have inherited a literary history organized around conflicting poetic movements that displace each other every 20 years or so. Between 1820 and 1919, poetry is supposed to have passed through romantic, early Victorian, pre-Raphaelite, aesthetic, and modernist phases. Critics often characterize the last of these transitions as a “revolution”; we might well infer that standards of poetic distinction have been, in general, very volatile (Greenblatt et al., 2006: 1834). On the other hand, we draw this picture of history largely from poets’ programmatic accounts of their own work, and it wouldn’t be surprising if they had exaggerated the importance of transitory conflicts in which they were interested parties. Literary prestige might also have been governed by durable social boundaries. As things taken for granted, these boundaries would have been less likely to attract comment.
To test this hypothesis, Jordan Sellers and I gathered two sets of poetry volumes across the period 1820–1919. One we sampled from volumes reviewed in 14 British and American magazines that were widely read by literary elites; the other we assembled by sampling at random from 53,200 volumes of poetry in HathiTrust Digital Library (using methods described in Underwood, 2014). Each sample contained 360 volumes, and both samples were distributed over the timeline in the same way. But volumes from the second set were much less likely to have been reviewed in a prestigious venue. This difference gave us leverage on a social boundary that is not widely discussed in literary history, because by its nature it leaves little evidence—the boundary, not between thumbs-up and thumbs-down, but between reviewed and ignored.
Given this research problem, it might initially seem pointless to train a predictive model that distinguishes the two sets of volumes using only the poems themselves. Since reviews happen after a volume is written, it is not obvious that they should leave any trace in the text at all. Even if we could somehow train a model to predict “whether a volume got reviewed” based on the text alone, what would be the point of doing so? If we want to know whether these volumes got reviewed, we could just check.
But in historical research, the value of a predictive model is rarely literally to make predictions about individual instances. It can be just as important to see how a model works, or where it fails. In this case, it’s significant that it’s even possible to sort reviewed from random volumes, using lexical evidence alone, across a whole century and 14 different venues of review. Apparently, our stories about conflicting “movements” are not a complete picture of literary history. There were also linguistic criteria of poetic distinction that remained relatively stable.
Each point in Figure 1 is a volume of poetry, evaluated by a regularized logistic model trained to distinguish the “reviewed” and “random” samples (using software libraries described in Pedregosa et al., 2011 and Wickham, 2009). To avoid circularity, predictions about a given author are made by a model trained only on evidence from other authors, so technically we’re training 636 different models. But since any two of these models share more than 99% of their evidence, I’ll describe these inferences collectively as a single model. In each case, the model predicts a probability that the volume came from the reviewed set. The evidence it uses is simply the frequency of the 3200 most common words in the whole collection; each of those words contributes either positively or negatively to a volume’s likelihood of being reviewed. If we scan through the weights the model attaches to different words, the broad outlines of prestigious diction become rather clear. Concrete description (“eyes,” “black,” “wind,” “grass”) made a volume more likely to be reviewed, and blandly positive adjectives (“wondrous,” “grand,” “sparkling,” “joyful”) were less prestigious than negative ones (“shuddering,” “harsh,” “dead”). In the limited space of this short article I’ll characterize this simply as a linguistic boundary. In reality these verbal choices register a set of overlapping factors—including genre, topic, and tone—that we have teased out in a longer piece by reading specific examples (Underwood and Sellers, 2015).
Figure 1. Predicted probabilities that volumes come from the reviewed set, and actual set membership.
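For readers who want the cross-validation design spelled out, the following is a simplified sketch, not the code we actually ran. It assumes a document-term matrix X (volumes by words), binary labels y (reviewed or random), and one author identifier per volume, and it produces an out-of-author prediction for every volume.

```python
# A simplified sketch of the leave-one-author-out design. Assumes a
# document-term matrix X, binary labels y (1 = reviewed, 0 = random),
# and an array of author identifiers, all hypothetical here.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

def predict_out_of_author(X, y, authors):
    """Predict each volume's probability of being 'reviewed' with a model
    trained only on volumes by other authors."""
    probs = np.empty(len(y), dtype=float)
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=authors):
        model = LogisticRegression(max_iter=1000)  # L2-regularized by default
        model.fit(X[train_idx], y[train_idx])
        probs[test_idx] = model.predict_proba(X[test_idx])[:, 1]
    return probs
```

Grouping the folds by author, rather than by volume, is what prevents a model from “recognizing” an author it has already seen during training.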
Distinction and historical change
Normally we would evaluate a model of this kind by using the 50% line in the middle of the y-axis; the model effectively predicts that everything above that divide comes from our “reviewed” sample. Evaluated in this simple way, the model would perform moderately well; it’s right 77.5% of the time. But we can get better accuracy by acknowledging the odd fact that the whole collection drifts upward as historical time passes. If we consider publication date as a factor, and use the slanted black line to divide the data set, the model will be 79.2% accurate. (This “slanted line” is just the central trend line for the whole data set, inferred by ordinary linear regression.)
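The two evaluation rules can be expressed compactly. The sketch below assumes arrays of out-of-sample probabilities, true labels, and publication dates (names and helpers are illustrative, not from our code), and fits the “slanted line” by ordinary least squares as just described.

```python
# A sketch of the two evaluation rules described above, assuming arrays of
# out-of-sample probabilities, true labels (1 = reviewed), and dates.
import numpy as np

def accuracy_flat(probs, y):
    # Rule 1: everything above the 50% line is predicted "reviewed."
    return np.mean((probs > 0.5) == y)

def accuracy_trend(probs, y, dates):
    # Rule 2: fit probability against publication date by ordinary least
    # squares, then predict "reviewed" for volumes above the trend line.
    dates = np.asarray(dates, dtype=float)
    slope, intercept = np.polyfit(dates, probs, deg=1)
    return np.mean((probs > slope * dates + intercept) == y)
```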
Technically, the upward drift we’re acknowledging is a failure of the model. Volumes are not really “more likely to be reviewed” just because they were published later. But this is a failure of an interesting kind, because it suggests that the criteria of literary prestige are bound up somehow with diachronic change. The words that were more common in reviewed volumes across this whole period are also words that tended to become more common in all volumes by the end of the period. We’re still working at the level of description here rather than explanation, but it’s not unreasonable to guess that writers’ imitation of prestigious examples played some role in this process. However, we don’t need to rush to causal inference: this trend is already startling on a descriptive level. It implies that literary practices changed in an oddly predictable way. For instance, it turns out that you can predict “review-worthiness” quite well in 1910 using the same list of poetic or banal words you would have used in 1850. It’s just that a volume would need to adhere to the model’s implied standard of diction more strictly to get reviewed in 1910. It would need to be even more concrete, even more “harsh,” even less “sparkling” and “grand,” because the bar for review-worthiness has drifted upward over time, paralleling a larger drift in the whole data set. Methods like this may give us a new way to understand long-term trends in literary history.
Of course, it’s not possible to compress a century of literary history into a single visualization. Certainly, many factors that shaped poetic prestige in particular decades are left out here. The model is at present unable to capture some important aspects of poetry (like meter). It also doesn’t say anything about the social networks and marketing strategies exploited by particular publishing houses (Mason, 2013). Omissions like this are, after all, why its predictions are only right 79% of the time. But the evidence presented here doesn’t aim to prove that existing explanations of poetic history (based on publishers, or movements like “modernism”) are unimportant or unreal. It merely suggests that those explanations may coexist with larger arcs of change that we have barely begun to describe.
In a longer article, Jordan Sellers and I have fleshed out the thesis sketched here, using methods familiar to humanists (close reading of poetry) and also those familiar to social scientists (ANOVA helped us measure the confounding effects of nationality and gender). Drawing also on parallel evidence from other studies, we argue that the pace of literary change has probably been slower than scholars currently assume. Although our existing repertoire of qualitative methods excels at characterizing conflict on a roughly generational scale, a lot of important changes take place more gradually than that. These trends may be slipping through our net; to catch them we may need quantitative methods that can compare hundreds of volumes at once (Underwood and Sellers, 2015).
Conclusion
But in this brief piece I’m less interested in theses about literary history than in a broader methodological point. I’ve suggested that “Big Data” is not a useful term for humanists. The problem is not just that humanists shudder when they hear the word “data,” or that we lack consensus about the scale that counts as “big,” but that the term fails to register the really important methodological shifts that have opened up boundaries between the humanities and social sciences. What we need instead is a conversation that distinguishes the humanistic applications of different modeling strategies.
Straightforward 20th-century methods, like linear regression, can still be useful if we’re working mainly with structured social evidence. Unsupervised methods (like most forms of topic modeling) can be useful if we’re working purely with unstructured text. But between those two poles there’s a rich array of supervised methods that can use unstructured text to help us understand specific social boundaries. A connection to social evidence gives supervised predictive models a straightforward epistemological foundation; unlike topic models, they are easy to validate using out-of-sample accuracy. But these methods still benefit from recent advances in machine learning, which have made it possible to use flexible, relatively omnivorous representations of text instead of data models shaped in advance by specific assumptions about expected results. Supervised methods of this kind are just beginning to be used widely in the humanities, but they have enormous promise. Literary historians, for instance, spend a lot of time organizing texts into groups that are said to represent coherent social phenomena, and pitting them against other groups that are said to represent opposing standards. Predictive modeling could provide a way of testing and enriching these claims. Although high-dimensional models may work, at bottom, with linguistic evidence, they can give us leverage on social as well as stylistic boundaries if our questions are thoughtfully designed.
Declaration of conflicting interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported by Digital Humanities Start-Up Grant HD 5178713 from the National Endowment for the Humanities and a Digital Innovation Fellowship from the American Council of Learned Societies. Any views, findings, conclusions, or recommendations expressed do not necessarily represent those of the funding agencies.
Note
This article is part of a special theme on Colloquium: Assumptions of Sociality. A full list of all articles in this special theme is available at: http://bds.sagepub.com/content/colloquium-assumptions-sociality.
