Abstract
This essay describes how using unsupervised topic modeling (specifically the latent Dirichlet allocation topic modeling algorithm in MALLET) on relatively small corpora can help scholars of literature circumvent the limitations of some existing theories of the novel. Using an example drawn from work on Victorian novelist Anthony Trollope's Barsetshire series, it argues that unsupervised topic modeling's counter-factual and retrospective reconstruction of the topics out of which a given set of novels has been created allows for a denaturalizing and unfamiliar (though crucially not “objective” or “unbiased”) view. In other words, topic models are fictions, and scholars of literature should consider reading them as such. Drawing on one aspect of Stephen Ramsay's idea of algorithmic criticism, the essay emphasizes the continuities between “big data” methods and techniques and longer-standing methods of literary study.
In the final two paragraphs of The Last Chronicle of Barset (1867), the last-published of Anthony Trollope's six-novel series detailing the social lives of country clergymen and the effects of clergymen “on the society of those around them,” the authorial narrator says a sad goodbye to the fictional English county of Barsetshire in which all six novels are set:
And now, if the reader will allow me to seize him affectionately by the arm, we will together take our last farewell of Barset and of the towers of Barchester. I may not venture to say to him that, in this country, he and I together have wandered often through the country lanes, and have ridden together over the too-well wooded fields, or have stood together in the cathedral nave listening to the peals of the organ, or have together sat at good men's tables, or have confronted together the angry pride of men who were not good. I may not boast that any beside myself have so realized the place, and the people, and the facts, as to make such reminiscences possible as those which I should attempt to evoke by an appeal to perfect fellowship. (The Last Chronicle of Barset, 2002: 860, 861)
Literary critics have had a particularly difficult time accounting for small groups of related novels like the Barsetshire series, in part because theories of the novel almost always take the single novel as their main unit of analysis. Theories of the Victorian novel also tend to assume (even when they don't assert) that novelists like Trollope seek to represent the singular social world of the individual novel as a finished and stable totality. From this perspective, the Victorian novel's representation of social totality depends upon its unified, singular, self-enclosed formal totality. Critics imagine this formal totality as secured by a controlling omniscient narrator who sees all, knows all, and describes all from a point above and outside the novel. Such a total coherence clearly can't be the model for the more partial and contingent connections that join the Barsetshire series into a loose group. And yet critics who do deal with groups of novels, such as advocates of “distant reading,” seek to identify large-scale patterns across hundreds or thousands of novels, a search that tends to end with the discovery of new varieties of structural totality. So while distant reading offers us some new insights into the study of novels, it is equally unsuited to understanding the kinds of middle-distance questions raised by Trollope's small group.
How might we undo the ingrained habit of reading social totality and formal totality together that both close and distant readings of the novel seem to share? How might we instead find a way of reimagining the forms of the six novels as semi-detached and their social relations as more partial and unfinished? Borrowing technology built for relatively “big” data and turning it on the relatively small 1,396,000-word corpus of the Barsetshire series offers us one path. Running various iterations of the unsupervised latent Dirichlet allocation topic modeling algorithm in MALLET on the novels' collective 314 chapters generates a number of topics that suggest both expected and unexpected connections between the very different novels in the informal series. And these connections, when tracked back into individual chapters and read by humans rather than machines, offer us (among other things) a look at the Barsetshire novels' own encoding of the layered histories of the novel's many attempts to capture social relations and social worlds through testing out different genres.
For example, we can look at the various versions of one topic whose most frequently occurring words are likely to be “letter write read written letters note wrote writing received table paper send answer return judge handed desk pen addressed” (here labeled topic 38) (see Figure 1).
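For readers who want to inspect a topic key like this one programmatically, the following is a minimal Python sketch, not the procedure used for this essay. It assumes that the topic-keys file MALLET writes is tab-separated, with a topic number, a weight, and a space-separated list of top words on each line. The names `TopicKey` and `parse_topic_keys` and the weight value in the sample are hypothetical; the word list is the one quoted above.

```python
from typing import List, NamedTuple


class TopicKey(NamedTuple):
    topic_id: int
    weight: float      # per-topic weight reported alongside the words
    words: List[str]   # top words, in the order the file lists them


def parse_topic_keys(text: str) -> List[TopicKey]:
    """Parse tab-separated lines of the form: id, weight, space-separated words."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics.append(TopicKey(int(topic_id), float(weight), words.split()))
    return topics


# Illustrative input only: the words are those quoted in the essay,
# while the weight 0.05 is a made-up placeholder.
SAMPLE = (
    "38\t0.05\tletter write read written letters note wrote writing received "
    "table paper send answer return judge handed desk pen addressed\n"
)

topics = parse_topic_keys(SAMPLE)
print(topics[0].topic_id, topics[0].words[:6])
```

A list of words parsed this way is still only a list of words; as the discussion that follows suggests, its interest emerges when it is tracked back into particular chapters.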
Turning to the chapters in which the topic is likely to appear shows that the Barsetshire series isn't merely full of letters (see Moody, 2003). It is, of course, but the appearance of these letters, notes, addresses, and envelopes suggests not merely an emphasis on correspondence; it also points to a generic revenant, to the series' haunting by the ghost of the epistolary novel, or novel-in-letters. The epistolary novel was one of the most popular novelistic forms during the middle of the 18th century, but by the century's end it had fallen out of favor. By the mid-Victorian moment of the Barsetshire novels it was a distant (but, as this model helps us see, persistent) memory.
Figure 1. Topic key for 50 topics, topic 38 highlighted.
A relatively low-density topic, distributed in dribs and drabs throughout the Barsetshire novels, the “letter write read written letters note” topic thus addresses itself to the earlier genre of the epistolary novel trapped inside them; we glimpse it in outline, like a bricked-up window in a Victorian renovation of a Georgian house.
Read alone, the topic can't tell us anything about this generic fossil; it suggests only the idea that letter exchange and correspondence is a recurring topic or theme, a part of the novels' “contents.” But when we examine the “topics in documents” output, we realize that the chapters in which characters exchange letters and worry about unsent notes gesture to that earlier genre and even proffer an alternative configuration for the novel (see Figure 2). The topics in documents output even points to one chapter in which the narrator announces that for the moment he will regress to the genre of the epistolary novel for the length of the chapter.
Figure 2. Lines from the topics in documents MALLET output showing chapters with relatively high percentages of topic 38.
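The kind of ranking that the topics in documents output makes possible can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the essay's own workflow: it assumes a tab-separated table giving a document index, a document name, and then one proportion per topic (some MALLET versions may instead write topic and proportion pairs, which would need a different parser). The function name `rank_documents`, the file names, and every number in the sample are invented for the example.

```python
import csv
import io


def rank_documents(doc_topics_text: str, topic: int, top_n: int = 3):
    """Return (document name, proportion) pairs, highest share of `topic` first.

    Assumes one row per document: index, name, then one proportion per topic.
    """
    rows = csv.reader(io.StringIO(doc_topics_text), delimiter="\t")
    scored = []
    for row in rows:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and header comments
        name = row[1]
        proportions = [float(value) for value in row[2:]]
        scored.append((name, proportions[topic]))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Invented three-topic table with hypothetical file names, for illustration only.
SAMPLE = (
    "0\tchapter_001.txt\t0.70\t0.25\t0.05\n"
    "1\tchapter_002.txt\t0.10\t0.30\t0.60\n"
    "2\tchapter_003.txt\t0.45\t0.20\t0.35\n"
)

ranking = rank_documents(SAMPLE, topic=2, top_n=2)
print(ranking)
```

The output of such a ranking is a reading list, not a finding: it tells us which chapters to return to, and the interpretive work begins there.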
The generative uncertainty of topic modeling is crucial here, and stems from the enabling assumptions of topic modeling: the counter-factual assumptions upon which the topic modeling algorithm is explicitly and deliberately based. Topics are probabilistically created formations, and the algorithm that generates topic models is based on the enabling (but, crucially, counter-factual) “assumption that documents have multiple topics” (Boyd-Graber et al., 2014: 4). By looking at the documents we offer it, the algorithm generates topics that, in given proportions, compose each document. (Or, rather, it generates the probability that a certain percentage of words in every given document were generated by a particular topic.) Topics, of course, don't actually exist prior to the documents that generate them; they don't exist independently in the way that the documents (in this case, our chapters of novels) exist. They are, in a certain sense, fictions: they might have existed, they are the kind of thing that could exist given the existence of the document set in question. This deceptively simple point can seem obvious, or like a minor technical detail. And for some applications of topic modeling, it may well be. But the fictionality of topics is crucial to remember for literary-critical uses of topic modeling, for it reminds us that these models offer us a view of our document set radically at odds with any other more literal sources of a novel we might use, such as an author's notes towards a novel, or a catalog of the virtual or actual library of books a novelist brings to the writing table, or even the looser sense of social “discourses” that exist prior to novels and which we might imagine in part “composing” a novel.
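The assumption that each document mixes several topics can be written out as a generative story. What follows is the standard textbook formulation of latent Dirichlet allocation, offered as a sketch rather than as a description of MALLET's implementation. The symbols (K topics, hyperparameters α and β, document d, word position n) are notation introduced here, not the essay's.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta)
  && \text{word distribution for topic } k = 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic proportions for document } d \\
z_{d,n} &\sim \mathrm{Categorical}(\theta_d)
  && \text{topic assigned to the } n\text{th word of } d \\
w_{d,n} &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
  && \text{the observed word itself}
\end{align*}
```

Inference runs this story backwards: given only the observed words w, it estimates θ_d (the proportions in which topics “compose” each chapter) and φ_k (the word lists we read as topics). Nothing in the story claims that any chapter was actually written this way, which is the sense in which the assumption is counter-factual.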
So the topic modeling algorithm knows nothing about letters, nothing about narrative form, nothing about Trollope. All it can tell us is that 1) the tokens in this string (in our case, words) co-occur more often than we would expect, all things being equal, and 2) some particular documents (in this case, chapters) are composed of a certain number of tokens (words) with a relatively high probability of belonging to this topic. But the algorithm's lack of knowledge of semantic meaning, and particularly its lack of knowledge of the Victorian novel as a form or genre, lets it point us to a very different model of the social than the kind of formal totalities held out to us by the novel theory we currently possess. As a kind of reader who knows nothing at all about the rich historical, formal, and social contexts within which the Barsetshire novels (like any novels) are embedded, the algorithm offers us a new view: not a more accurate one, but a different one that lets us see and interpret our novels in a denaturalized and different light. Rather than suspending us in a totalizing system or network, it decomposes our novels, taking us backwards into a fictional composition history, towards the other potential, unwritten novels the Barsetshire series might have been. The algorithm helps us imagine the way any given novel contains within it many unfinished and impossible versions of itself, versions no Victorian author would or could have written. More specifically, in my example, it lets us see how the ghostly epistolary connections that stretch within and between novels in the series could replace or contradict any totalizing vision of the social, any model relying on a formal totality secured by the idea of an omniscient narrator of a single novel. In so doing, it jettisons any finished and final version of the fiction in favor of what we might think of as a kind of counter-factual set of notes.
In some sense, we might imagine topics as the notes a (fictional) narrator might have taken towards writing the novel it (or she, or he, or even they) inhabits and over which it so often claims authorial agency.
I've offered a brief and particularly reflexive example of the way a topic model can point us not to the existing “contents” of novels imagined as represented worlds, but rather to the kinds of writing that prepared for or generated the Barsetshire novels. In the context of literary study, I argue, we should train ourselves to read topic models as notes written by nobody rather than “contents” merely poured into fictional form. I want to suggest, that is, that all topics generated from literary corpora can help take us back to earlier imaginary forms and versions: discarded drafts that authors might have written but didn't, outmoded genres that are fragmentarily recycled within new forms. Topic modeling may be most useful for humanists when we use it this way, as a kind of uncanny, shifting, temporary index to the works we know best, rather than trying to imagine it, as we too often do, only as telling us something about the stable “contents” of large literary corpora. Closely linked to older traditions of indexing literature (from Victorian Bible concordances to Caroline Spurgeon's index to all of Shakespeare's figural language in Shakespeare's Imagery to Roberto Busa's Index Thomisticus), the algorithm's machinic, non-semantic, probabilistic characteristics can help denaturalize our relationship to literature and our attachments to the assumptions (about the sociality of literary form, in my example) baked into our favorite theories of the novel.
Not something a human would ever create, a topic model nevertheless perhaps has more in common than we might at first suspect with the probabilistic, counter-factual, human-created fictions we think we know. Although topics can look at first glance like a pre-existing “discourse,” that is, what topics generated from novels actually offer us is the ultimate formalist fantasy of the components of the novel's representation of a social world: a set of “topics” that make up the “contents” of a corpus with no leftovers, a nearly perfect correspondence between the materials of the work and the finished work itself. As Stephen Ramsay argues in Reading Machines, using algorithms need not propel us towards applying an ersatz scientific and scientistic evidentiary standard to literary interpretation, but rather should reveal and perhaps help amplify our already part-algorithmic literary-critical reading practices, the regular sets of protocols and procedures of analog literary criticism with which we are very, perhaps sometimes too, familiar (Ramsay, 2011: 14). It is as fantasies of formalist reading practices, perhaps, that topic models of literary texts can be most helpful to human readers, as denaturalizing indexes or suggestive counter-factual maps that open up new interpretive possibilities.
Footnotes
Acknowledgements
Valuable conversations about and feedback on the ideas in this paper came from Laura Heffernan, David Mimno, Michael Reay, the members of the Swarthmore College Victorian Novel Research Seminar, particularly Allison Shultes, and the members of the Tri-College Digital Humanities Art of Topic Modeling Seminar.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Notes
This article is part of a special theme on Colloquium: Assumptions of Sociality. For a full list of all articles in this special theme, see http://bds.sagepub.com/content/colloquium-assumptions-sociality.
