Abstract
With slogans such as ‘Tell the stories hidden in your data’ (www.narrativescience.com) and ‘From data to clear, insightful content – Wordsmith automatically generates narratives on a massive scale that sound like a person crafted each one’ (www.automatedinsights.com), a series of companies currently market themselves on the ability to turn data into stories through Natural Language Generation (NLG) techniques. These services automate the data interpretation and knowledge production process while at the same time hailing narrativity as a fundamental human ability of meaning-making. Reading both the marketing rhetoric and the functionality of the automated narrative services through narrative theory allows for a contextualization of the rhetoric flourishing in Big Data discourse. Building upon case material obtained from companies such as Arria NLG, Automated Insights, Narrativa, Narrative Science, and Yseop, this article argues that what might be seen as a ‘re-turn’ of narrative as a form of knowledge production that can make sense of large data sets inscribes itself in – but also rearticulates – an ongoing debate about what narrative entails. Methodological considerations are thus raised on the one hand about the insights to be gained for critical data studies by turning to literary theory, and on the other hand about how automated technologies may inform our understanding of narrative as a faculty of human meaning-making.
This article is part of a special theme on Knowledge Production. A full list of all articles in this special theme is available at: http://journals.sagepub.com/page/bds/collections/knowledge-production
Rage – Goddess, sing the rage of Peleus’ son Achilles, murderous, doomed, that cost the Achaeans countless losses, hurling down to the House of Death so many sturdy souls, great fighters’ souls, but made their bodies carrion, feasts for the dogs and birds, and the will of Zeus was moving toward its end. Begin, Muse, when the two first broke and clashed, Agamemnon lord of men and brilliant Achilles. (The Iliad, Book I)
When opening the website www.narrativa.com – a company that markets itself on its ability to turn vast data sets into narratives through Natural Language Generation (NLG) software – it is not the muses but data that is implored to sing. ‘Make your data sing’, is the imperative here (see Figure 1), bestowing upon the customer – rather than the gods – agency to solicit the right medium and make the narrative take form (Narrativa, 2017). In contrast to The Iliad, the vehicle for the song to be sung, the narrative to be told, is not the epic of the muse but the available data of a given database that may be transformed into a narrative through a process that rests on NLG and Artificial Intelligence (AI) and that emphasizes automation, scale, context and tone. Yet, there is a similarity in the way in which the ability to bring about such a ‘song’ requires special attunement, whether to the muses or to data.
Figure 1. www.narrativa.com screenshot.
This article approaches the alleged paradigm shifts in knowledge production brought about by Big Data, around which this theme issue centres, by taking a closer look at the way in which narratives are appropriated in the discourse of data analytics. The techniques in which I am interested offer means of addressing the current challenge of extracting useful information and producing knowledge from vast databases. They are thus part of a generation of Big Data discourse that focuses on how to make the most of the data that public and private institutions have accrued over recent decades and minimize the data the human needs to consider, thereby increasing cost effectiveness. Data visualization is one proposed and popular technique. However, another possibility is to turn to the written narrative as a means of sorting and organizing data.
As a literary scholar by training, it has struck me how companies such as Arria NLG (Scotland), Automated Insights (US), Narrativa (Spain), Narrative Science (US) and Yseop (US) promote ‘narrative’ as a mode of knowledge production capable of ‘translating’ data into human language through the use of NLG techniques. This has made me curious as to, firstly, what kind of conception of narrative is at work here; secondly, if and how the narrative theory that we find at the intersection of cultural studies, philosophy and psychology may inform the discussion about the paradigm shift in knowledge production linked to the advent of Big Data; and thirdly, how these latest developments in automatization of narratives may contribute to ongoing discussions across such fields about the nature of narrative as a mode of representation.
Hidden narratives
We shall begin with a closer look at the marketing rhetoric that triggered my curiosity. For instance, this pitch by Narrative Science:

We don’t need more data, more spreadsheets, or even more beautiful visualizations. What we need are the key takeaways: a way to understand the impact of the story the data is telling now. We need an analyst at our elbow, ready to provide us with that level of information in a quick consumable form. We need information that is tailored to our domain and to our particular roles. We all need this type of information when we want it and need it, a reality that is only achievable with machine scale. (CITO Research, 2015: 1)
The data accumulation that has prompted such rhetoric of narrative has emerged out of the past decade’s repeated claims to data as ‘the new oil’. 1 The popularized comparison between data and oil gestures towards a conception of data as crude in need of refinement, creating a parallel between the transformation of raw materials into gas, chemicals, and plastic and that of data analytics as a process of making data useful. The imagery of data as a resource resonates in phrases found on Automated Insights’ webpage such as: ‘Structured data is the fuel for Wordsmith – use your data to power your narratives by leveraging our API’ (Automated Insights, 2017). The automated NLG techniques promise to ‘refine’ the raw material by using narratives as a means of sorting data and making visible a hierarchy between information that is important and unimportant for what the customer wants to know, and in that process create energy out of the raw material that can ‘power your narratives’.
In the above quotation, Narrative Science positions itself as a viable alternative to another prominent means of transforming data into actionable information, namely data visualizations, 2 which are here relegated to ‘beautiful visualizations’ – a term that echoes discussions regarding the aestheticism of such visualizations (Philipsen and Kjærgaard, 2018). What is offered instead is ‘a way to understand the impact of the story the data is telling now’. This form of rhetoric gives the impression that the data is continually telling a story but that we need a technique to tap into this story to make the narrative legible to the human mind. Narrative Science here reiterates Narrativa’s focus on automation, scale, context, and tone, combining what sounds like an essentialist belief in the existence of recoverable narratives buried in the data with the argument that automation is needed to deliver the necessary (and cost efficient) speed and scale.
Significantly, the word ‘hidden’ recurs in the self-understanding of many of these companies. For instance, Arria NLG writes: ‘Our patented AI Natural Language Generation technology goes beyond simple templates, extracting and communicating the insights hidden in your data’ (Arria NLG, 2017). The metaphor of depth is persistent, conveying that something needs to be extracted or excavated and brought to the surface. The terminology of ‘hidden’ can be read as a nod to the use of ‘hidden layers’ in neural networks. 3 The way in which this terminology is used in marketing material such as this example or Narrative Science’s ‘Tell the stories hidden in your data’ (Narrative Science, 2017) conveys an impression to the lay customer that the narratives exist prior to the implementation of the offered techniques – i.e. that the narratives are somehow embedded in the data and only need to be carved out, which is enabled through NLG and automated with AI. This rearticulates central discussions in narrative theory throughout the twentieth century concerning the relationship between the story and the form that a narrative may take. When investigating the claim of a paradigm shift in knowledge production linked to the advent of Big Data, which is the wider scope of this issue, it is therefore productive, indeed necessary, to cross-read the current marketing articulation of hidden narratives with theoretical discussions of narrative properties, thereby placing the vocabulary of the NLG-based companies in a longer historical trajectory. This vocabulary seems to stem from assumptions about what language and narrative are in the field of NLG as well as Natural Language Processing (NLP) more generally. It thus seems productive to take a closer look at what this technique entails and how it can be seen in dialogue with narrative theory in order to trace the connotations at work in the notion of the hidden narrative.
Bringing NLG and narrative theory into dialogue
NLG systems are computer software systems that generate texts in a human language from non-linguistic input data, using techniques from computational linguistics and AI (Reiter, 2012: 558ff). The groundwork for this field was laid in the 1950s and 1960s with the development of machine translation programs. Like Natural Language Understanding (NLU), it was originally a subdiscipline of NLP, a field that cuts across computer science and cognitive science. NLU techniques are able to scan thousands of written documents, disambiguating and parsing the input as unstructured data and thereby turning it into structured data. While rooted in the same field of NLP, there are significant differences between NLU and NLG techniques. The scanning techniques of NLU are often used for newsgathering, text categorization, voice activation, archiving, and large-scale content analysis. NLG, on the other hand, ‘writes’ rather than ‘reads’ in the sense that it turns structured data into written narratives. This may range from simple letter templates that copy-paste and link so-called ‘canned text’ 4 to more sophisticated AI systems that generate textual summaries of databases. An early example often mentioned in the literature is weather forecast systems that produce forecast texts from numerical prediction data and graphical maps using a set of rules and a natural language generator (Reiter, 2012: 559; Reiter and Dale, 2006: 7ff). Reiter and Dale give the following example, generated from automatically collected meteorological data and a corpus of human-written texts: ‘The month was cooler and drier than average, with the average number of rain days. The total rain for the year so far is well below average. There was rain on every day for eight days from the 11th to the 18th.’ (2006: 8). In recent years, the field of NLG has consolidated as a significant research area in its own right and is already ubiquitous in our society through technologies such as personal assistants that rely heavily on NLG. It is furthermore predicted to boom in the coming years, currently extending its customer circle to the media, e-commerce, logistics, energy, pharmaceutical and real estate industries (Llorente, 2016). As Reiter and Dale point out, NLG raises questions pertinent to a range of research fields, including human-computer interaction:

How should computers interact with people? What is the best way for a machine to communicate information to a human? What kind of linguistic behaviour does a person expect of a computer he or she is communicating with, and how can this behaviour be implemented? (2006: 2)

What constitutes ‘readable’ or ‘appropriate’ language in a given communicative situation? How can the appropriate pragmatic, semantic, syntactic, and psycholinguistic constraints be formalised? What role does context in its many aspects play in the choice of appropriate language? (2)

How can typical computer representations of information – large amounts of low-level (often numeric) data – be converted into appropriate representations for humans, typically a small number of high-level symbolic concepts? What types of domain and world models and associated reasoning are required to ‘translate’ information from computer representations to natural language, with its human-oriented vocabulary and structure? (2)
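The rule-based logic behind such forecast generators can be sketched in a few lines of code. The following is a minimal illustration, not a reconstruction of any actual system; the thresholds, field names and phrasings are hypothetical and chosen purely for demonstration:

```python
# A minimal, illustrative sketch of rule-based weather-forecast NLG in the
# spirit of the systems described by Reiter and Dale. All thresholds and
# parameter names here are invented for illustration only.

def forecast_text(month_temp, avg_temp, month_rain, avg_rain, rain_days):
    """Turn numeric monthly weather data into a short forecast summary."""
    # Content determination: decide which comparisons are worth reporting.
    temp_word = "cooler" if month_temp < avg_temp else "warmer"
    rain_word = "drier" if month_rain < avg_rain else "wetter"

    # Microplanning: aggregate the two comparisons into one sentence
    # rather than emitting two separate ones.
    sentence1 = f"The month was {temp_word} and {rain_word} than average."

    # Realization: render the remaining fact with correct number agreement.
    day_s = "day" if rain_days == 1 else "days"
    sentence2 = f"There was rain on {rain_days} {day_s}."

    return " ".join([sentence1, sentence2])

print(forecast_text(month_temp=12.1, avg_temp=14.0,
                    month_rain=30.5, avg_rain=48.2, rain_days=8))
# The month was cooler and drier than average. There was rain on 8 days.
```

Even this toy example makes visible the decisions – which comparisons to report, how to aggregate them into sentences – that the marketing rhetoric of ‘hidden’ narratives tends to elide.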
The notion of narrative in the weather forecast example above seems in many ways closer to what German cultural theorist Walter Benjamin in his seminal essay ‘The Storyteller’ calls information rather than storytelling. This text can be read as a reaction to what Benjamin in 1936 identifies as an increase in the amount and distribution of data, which he regards as part of the reason why people’s way of relating to what they experience and communicate is changing. It thus raises a discussion not without parallel to the knowledge production paradigm under scrutiny in this issue. Benjamin makes a distinction between information and storytelling and aligns these with two kinds of experience: Erlebnis and Erfahrung. Erlebnis is what information conveys; it refers to something lived by the individual in the present. It is verifiable and concise, and leaves little room for the reader to interpret. To Benjamin, information as a form of communication grows in importance with the advent of the printing press, the growth of the middle-class bourgeoisie, and capitalism, and is exemplified by newspaper stories. As opposed to this, Erfahrung embodies the wisdom of the collective. It is linked to oral storytelling and contains a collective kind of experience in which what is told becomes modulated into the lives of the audience (Benjamin, 2006). As we shall see, NLG narratives cannot simply be relegated to ‘information’; rather, they bridge an Erlebnis and an Erfahrung mode of experience. 5
Narrative theory is, however, a diverse and extensive umbrella term that is impossible to condense into a single theory. This article nonetheless makes a case for bringing the diverse discussions in this field into dialogue with NLG and the current inclination for automatization of narratives. Just as driverless cars mean that fundamental philosophical questions of agency and volition are being re-articulated, so does the automatization of narratives rearticulate discussions that go back to Aristotle and that have dominated the twentieth century in particular: from the Russian formalists (Propp, 1968; Shklovsky, 1965) to the coinage of narratology initiated by structuralist theories of narrativity in France in the mid- to late-1960s (Todorov, 1980; Barthes, 1977; Genette, 1980; Greimas, 1970), which gradually merged into what has been termed a ‘narrative turn’ in the 1980s, during which narrativity came to be broadly regarded as a fundamental and indeed universal way of representing experience within a wide range of academic disciplines (Ricoeur, 1984–1988; Taylor, 1989; Bruner, 1987; Czarniawska, 1988). The narrative turn – with its origins in narratology, hermeneutics, structuralism and literary theory – questioned and challenged positivist approaches to the study of the social world and human experience. This development has been paralleled by anti-narrative currents up to the present day, currents that question the universality of narrative as a means of understanding ourselves and the world (Sartre, 1964; White, 1978; Strawson, 2004). In recent years, these discussions have been revisited in light of the advent of digital media (Manovich, 2001; Hayles, 2007).
The dominance of formalist and structuralist thinking in the development of narrative theory has no doubt made its appropriation appealing for AI researchers working on artificial storytelling systems and game design. A dialogue between these fields thus already exists in what is called Computational Narratology (Bringsjord and Ferrucci, 1999; Rumelhart, 1980). The conception of narrative at work in NLG is clearly informed by these discourses. Thus, when companies such as Narrativa and Narrative Science discuss hidden narratives, it is no surprise that we hear thinly veiled echoes of structuralist approaches to narrative, which likewise work with a notion of a universal pattern operating within a text – indeed across a wide variety of media – so that the same narrative may take many different forms. This is a recurring discussion of the relationship between the story and the way in which the narrative is told, which has taken on various guises over the years. The Russian formalists, for instance, distinguished between fabula and sjuzet, where fabula is defined as the events in themselves, while sjuzet describes the events as presented in the narrative. Bulgarian-French structuralist Tzvetan Todorov developed these into the notions of histoire and discours. In English, they resonate as story/discourse, which in Jonathan Culler’s Derridean deconstruction become vehicles for fundamentally problematizing such distinctions and arguing that it may in fact be the discourse that generates the story, rather than the other way around (Culler, 1981). As literary scholar Peter Brooks summarizes:

We must, however, recognize that the apparent priority of fabula to sjuzet is in the nature of a mimetic illusion, in that the fabula – “what really happened” – is in fact a mental construction that the reader derives from the sjuzet, which is all that he ever directly knows. This differing status of the two terms by no means invalidates the distinction itself, which is central to our thinking about narrative and necessary to its analysis since it allows us to juxtapose two modes of order and in the juxtaposing to see how ordering takes place. (1984: 13)
Humanizing data – Automating narratives
List of tasks in NLG pipeline architecture.
Source: Perera and Nand (2017: 3).
Reading the table from left to right, there is initially a level of document planning that involves content determination, i.e. making decisions about what to include in the text on a meta content level. The decision most often relies on a model of what is important and significant to the user. As Nick Beil, COO of Narrative Science, has expressed in an interview: ‘We aren’t starting with data […] We’re starting with intent. And intent drives the data’ (Woods, 2016). This process has in recent years been advanced by machine learning and pattern recognition, which means that the more the same user uses the system, the better equipped it becomes for predicting and facilitating user actions. The content determination phase is intricately linked to the task of document structuring, which is likewise part of the document planning level and describes the overall organization of the text: In which order is the information narrated? How should information be grouped into sentences and paragraphs? These decisions are often influenced by the genre of the output text. Document structuring also affects the level of microplanning, which involves lexical and syntactic choice, i.e. deciding on the most appropriate words, terms, concepts, and tense. When a generated sentence needs to be part of a larger context of multiple sentences, a process of aggregation is needed, which describes the generation of structured and integrated sentences that meaningfully combine the various pieces of information that must be conveyed. Microplanning also involves referring expression generation, i.e. generating expressions that are appropriate to the context, including pronouns. These decisions can be prescribed by having a human manually write decision rules. Yet, recent research seeks to automate this process through machine learning, which analyses large collections of texts written by humans and extrapolates decision rules used by the human writers who created them (Reiter, 2012: 560).
The third and final level is Realization, which is also often referred to as surface realization (again emphasizing the depth metaphor). This describes the stage at which the actual text that will be read is generated, involving both linguistic and structural realization, abiding by the rules of syntax, morphology and orthography (Reiter, 2012; Reiter and Dale, 2006; Perera and Nand, 2017).
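To make these three levels concrete, the pipeline’s division of labour can be sketched schematically. The records, threshold and phrasings below are invented for the example and do not correspond to any of the companies’ systems; real systems increasingly learn many of these decisions rather than hard-coding them:

```python
# Schematic sketch of the three-level NLG pipeline: document planning,
# microplanning, and (surface) realization. All data and rules here are
# hypothetical, invented purely for illustration.

RECORDS = [
    {"player": "Smith", "points": 31},
    {"player": "Jones", "points": 4},
    {"player": "Lee", "points": 28},
]

def document_plan(records, threshold=20):
    """Content determination and document structuring: keep only the facts
    deemed significant under a model of user intent, then order them."""
    kept = [r for r in records if r["points"] >= threshold]
    return sorted(kept, key=lambda r: -r["points"])

def microplan(plan):
    """Lexical choice: pick a verb according to the magnitude of each fact.
    The resulting messages will later be aggregated into one sentence."""
    return [(r["player"],
             "dominated with" if r["points"] >= 30 else "added",
             r["points"])
            for r in plan]

def realize(messages):
    """Surface realization: aggregate the messages into a single
    grammatical sentence, abiding by syntax and orthography."""
    clauses = [f"{player} {verb} {points} points"
               for player, verb, points in messages]
    return ", while ".join(clauses) + "."

print(realize(microplan(document_plan(RECORDS))))
# Smith dominated with 31 points, while Lee added 28 points.
```

The ‘narrative’ that emerges is visibly a product of the intent encoded in the threshold and of the available templates, not something lying dormant in the records – a point taken up in the discussion below.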
When going through the different stages of decision-making in a table like this, it becomes apparent, firstly, how the programming of NLG relies upon a series of decisions executed in accordance with a set of specific rules and, secondly, how there are similarities between the ways of thinking in NLG and in structuralism and semiotics, which also operate from the idea of a more-or-less universal pattern of discernible codes. This systematic way of thinking about a narrative as a formal structure is what seems to be at the basis of the marketing rhetoric’s emphasis on a depth narrative that can be uncovered. However, the above statement from Narrative Science’s COO also makes clear the importance of intent when constructing narratives from data. The importance of determining intent before engaging with the data highlights the intricacies that Peter Brooks noted with regard to the relationship between fabula and sjuzet, cited in the previous section. In other words, a closer look at the phases of the NLG technique itself shows that there is not just a pre-existing narrative, lying dormant in the data and waiting to be uncovered. Instead, the ‘found’ narrative is heavily influenced by what I here call intent, which should be understood as arising from the questions being asked (i.e. what the human customer wants to know – questions that can be commercial or scientific in nature and imbued with all sorts of bias and ‘Vorverständnis’) as well as from the way in which the data is structured in the database and the available linguistic templates of human language. The data thus acts as both an oracle that can deliver answers and a consultant that frames which questions can be asked. Yet the motivation of a machine, especially in machine learning, is to fulfil a function and obtain a score that is deemed acceptable.
The objective is thus removed from the story and focussed on generating coherent text and it may as such be removed from not only the reader, but also the text and data upon which it relies. Intent in this context is thus a composite phenomenon of human and technological motivations. When seeking to understand the form of knowledge production at work here, we must couple Brooks’ ‘two modes of order’ (fabula/sjuzet) with this complex question of intent: i.e. those properties which humans and technology presupposes of both the story itself and the form that it takes. Intent is inextricable from both the narrative and its form – it exists prior to the story, but can at the same time be forestalled by the output form.
Alongside the marketing rhetoric of the hidden narrative that can be uncovered, we find recurring mention of translation, arguing that data is a different language for which we require a translation that humans can understand. ‘Fundamentally, none of us speak data,’ says Arden Manning, Senior Vice President at Yseop (2017). Significantly, all of the sites I have considered emphasize that their narratives sound as though a human had written them. In this respect, it is also interesting to note the use of words that connote the human voice such as ‘speak data’ or ‘make your data sing’. Transforming data into narratives is thus a process of – in the words of Narrative Science – ‘humanizing the data’ (Narrative Science, 2017). The task of turning large amounts of data into narratives that are understandable for the employees who are going to use them is articulated as data literacy, once again giving the impression that the insights are there before our eyes and that all we need is the right technology to make them intelligible to us. Yet, as the close reading of the process has shown, we are dealing with an intricate mesh of narrative, form, and intent that influences this ‘translation’, of which the human is not only the end receiver but an active part of the entire process, and technology is not just a translation machine.
Significant in the outline of the NLG process above is the question of what automatization and machine learning add to this process when the system itself begins to ask questions based on previous use, thereby making the question of intent and authorship more ambiguous, in line with poststructuralist criticism of the author function, as voiced for instance by Roland Barthes: ‘It is language which speaks, not the author’ (Barthes, 1977: 143). In this theoretical understanding, a text should be regarded as an intertextual patchwork drawing upon multiple sources, which coalesce in the reader. As with The Iliad, with which we began, there might be doubt as to the origins of the conveyed narratives. Many scholars believe the poems of The Iliad to be the result not of one individual named Homer but instead the outcome of a long tradition of oral storytelling and the reworking of many contributors (Graziosi, 2002: 15). As such it is emblematic of what Benjamin calls ‘storytelling’. The first lines of The Iliad quoted above tell us that it is the muse who gives voice to the story, the course of which is the intent of the gods. We thus return to Narrativa’s slogan ‘Make your data sing’ and the question of how this is done and with what intent. Who gives shape to the intricate and entangled tissue of data, narrative and realization in the just-as-entangled web of technologies and humans, readers and writers? Cross-reading the marketing rhetoric with, on the one hand, the way in which NLG technology works and, on the other hand, negotiations of what narrative entails in literary narrative theory provides us with a means of critiquing a conception of narrative as ‘hidden’ in the data and problematizing a conception of the human as the end recipient rather than as something involved throughout the process.
We instead see how the fabula, sjuzet and human and non-human intent coalesce in the formation of these narratives, making them – in Benjamin’s terminology – embody both storytelling and Erfahrung, as born out of a collective of humans and technology, as well as verifiable information and Erlebnis. This realisation considerably nuances the claim of a paradigm shift in knowledge production brought about by Big Data, showing how the discussion is ingrained in a longer historical debate about meaning-making that takes on particular properties given current technological possibilities.
The narrative re-turn
We thus return to my initial puzzlement, as a literary scholar, at the ‘re-turn’ to narrative in the midst of the applause for the quantitative approaches to knowledge production that Big Data facilitates. Reading both the marketing rhetoric and the functionality of the automated narrative services through narrative theory allows for a contextualization of the rhetoric flourishing in Big Data discourse. This article thus argues that what might be seen as a ‘re-turn’ of narrative as a form of knowledge production that can make sense of large data sets inscribes itself in – but also rearticulates – an ongoing debate about what narrative entails. Methodological considerations are thus raised on the one hand concerning the insights to be gained for critical data studies by turning to literary theory and on the other hand about how automated technologies may inform and be informed by our understanding of narrative as a faculty of human meaning-making.
The rhetoric of companies such as Arria NLG, Automated Insights, Narrativa, Narrative Science and Yseop, from which the case material of this article originates, inscribes itself in a discourse on storytelling that has dominated the second half of the twentieth century and the beginning of the twenty-first. If we wish to understand the discursive framing of narrativity in Big Data discourses, I here contend that the long history of narrative negotiations between fabula and sjuzet in literary theory has important insights to offer. Including this line of thinking helps us understand Big Data’s alleged paradigm shift in knowledge production as interwoven in a longer historical trajectory that here finds new articulations.
The scope of this article only allows for an initial broaching of the richness of such perspectives, which I believe are becoming only more pertinent as NLG techniques gradually become capable of generating narratives in real time, based also on unstructured data, for instance from social media. Combined with the data analytics that enabled the targeted social media campaigns of Brexit and Donald Trump, we might foresee that it will become more important than ever to rekindle the insights of twentieth-century literary theory concerning the relationship between form and content to raise awareness of the intent with which the data is being asked to sing, and by whom.
Acknowledgements
I am grateful for the helpful suggestions and comments received from the anonymous reviewers as well as the editors of this issue.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: the Danish Research Council, as part of the research project Uncertain Archives: Adapting Cultural Theories of the Archive to Understand the Risks and Potentials of Big Data, of which the author is the Principal Investigator.
