Abstract
Working with computational methods and the analysis of large textual corpora has been challenging and very rewarding—with all the ups and downs that doing empirical social research entails. In my contribution, I relate some research experiences and reflect upon data construction and the links between theory, data, and methods.
With a formative training in social network analysis and its theoretical contributions to sociology (Breiger, 2004; Latour, 1996; White, 1992), economic sociology (Stark, 1996, 2000), and empirical cultural sociology interested in meaning making processes (Mohr, 1998), I began my fieldwork research for my dissertation. I was interested in newspapers’ meaning making at the time of the move of the German political capital from cozy Bonn to the formerly divided city of Berlin in the late 1990s. I observed and interviewed journalists in Bonn and Berlin, followed newspapers’ physical move of offices, and read five newspapers a day for two years. My hunch was that this move would be reflected in the newspapers. From my ethnographic fieldwork I found a competitive jockeying for readers, the re-wiring of journalists’ career paths, and isomorphism in newspaper design.
Translating my newspaper reading into an analytic framework posed different problems. In particular, I was interested in how those newspapers would position themselves narratively in their editorials as opinion leaders in a united Germany. From my fieldwork I learned that editorialists wrote to communicate with their colleagues at competing newspapers. They took their competitors into account without citing them directly. Ideally, I needed a method that would take contextual meaning making across different papers over time into account while at the same time allowing me to analyze how these meaning making processes shaped the identity of the newspapers—using all 9,000 editorials.
For lack of modeling and computational skills, I ended up using only a sample of editorials, highlighting particular issues and how they were disputed across the editorials. Using insights from sociological research on text as data (Bearman and Stovel, 2000; Franzosi, 1994; Mohr, 1994), I manually coded how editorials evaluated particular issues (Boltanski and Thévenot, 1999), traced how those evaluations were disputed across the editorials, and detected patterns of evaluation in those disputes using optimal matching and social network analyses. I showed “narrative competition” between the different newspapers and was able to bring together my interests in empirical cultural and economic sociology (Mützel, 2002).
While my fieldwork was a close reading of what was happening economically and organizationally in the field, I had to acknowledge that I was not able to provide a “distant reading” (Moretti, 2000) or a macroscopic view of what the newspapers were writing about. I had more textual data than I could analyze and lacked the skills and analytical tools for what I was interested in.
Encountering topic modeling
I continued on to new empirical projects. And again, I found myself interested in the meaning making processes of economic actors as a new field was emerging—using newspaper and other textual data. Studying the emergence of the field of innovative breast cancer therapeutics beginning in the late 1980s, I first resorted to close reading and coding of press statements and business reports to point to competitive processes in markets without products (Mützel, 2010).
I then encountered the method of topic modeling a couple of years ago. Right away, I was enthusiastic about its potential for analyzing emergent processes from a macroscopic view using large textual corpora.
Topic modeling, and specifically Latent Dirichlet Allocation (LDA), is a method that groups the words of a large textual corpus into topics based on their co-occurrence within the texts. Without a priori coding, and thus without the assumption of content analysis that the analyst has thoroughly understood the entire corpus, the LDA algorithm identifies clusters of words, i.e. latent topics, based on a statistical model of language. What is needed is an idea of how many topics a corpus consists of and the ability to recognize when a set of topics matches the analyst’s knowledge of the field. Developed in the fields of computer science, machine learning, and natural language processing (e.g. Blei, 2012; Blei and Lafferty, 2009; Blei et al., 2003), topic modeling has recently received heightened attention in the humanities (e.g. Jockers, 2013; Meeks and Weingart, 2013; Moretti, 2013) and the social sciences (e.g. Grimmer and Stewart, 2013; Kaplan and Vakili, 2014; Mohr and Bogdanov, 2013; Ramage et al., 2009). As users and developers of the method have pointed out, the identified topics capture “the relationality of meaning” (DiMaggio et al., 2013: 571) in that the assignment of words to topics is based on co-occurrence and thus embedded in contextual meaning. What is more, topic modeling also allows for the multiplicity of meaning: a word can belong to different topics as its meaning may vary across different contexts. Topic modeling presents an “inductive relational approach to the study of culture”; it offers substantive interpretability of topics as frames (p. 576).
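To make the mechanics of this more tangible, a minimal sketch of an LDA run in Python follows. It assumes the scikit-learn library, a toy list of documents standing in for a real corpus, and an arbitrary number of topics; none of these choices are taken from the studies discussed here.

# Minimal LDA sketch; the toy corpus and the number of topics are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "placeholder editorial about the capital move and political identity",
    "placeholder editorial about economic competition between newspapers",
    "placeholder review praising a new restaurant and its regional cuisine",
]

# Turn the texts into a document-term matrix of word counts.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)

# Fit the model; the analyst has to choose the number of topics in advance.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(dtm)  # per-document topic proportions

# Inspect each topic's most probable words to judge its interpretability.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {k}: {', '.join(top_words)}")

How many topics to fit, and when a solution counts as interpretable, remain judgments the analyst has to make against her knowledge of the field.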
To be sure, topic modeling only shifts substantive interpretation to a later position in the analytical process—it does not replace it. Analysts need knowledge of the respective field to make sense of the resulting topics. I have used topic modeling to gain macroscopic insights into the developments of entire fields. For the field of breast cancer therapeutics, I use different textual corpora to trace discursive trajectories of biochemical molecules, research strategies, and financial expectations over the span of 23 years. In another study, I analyze the gastronomic field of the city of Berlin over the span of 18 years, using restaurant reviews (Mützel, 2015). In both cases, results from the LDA procedure allow me to describe and trace developments over long periods of time. I zoom in at particular moments in time, conduct qualitative analysis of the texts, and interpret the developments using the topics as frames for what was going on based on my knowledge of the respective field. For my long-standing interest in the emergence of fields based on a study of large textual corpora, LDA is thus a valuable contribution with its own limitations.
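As a rough illustration of how such trajectories can be traced (not the actual pipeline of these studies), one can average the per-document topic proportions that LDA returns by publication year. The sketch below uses randomly generated placeholder data to show the idea.

# Placeholder illustration of tracing topic trajectories over time: average
# per-document topic proportions by publication year. The matrix and the
# years are randomly generated stand-ins, not data from the studies above.
import numpy as np

rng = np.random.default_rng(0)
doc_topics = rng.dirichlet(np.ones(20), size=500)   # 500 documents, 20 topics
years = rng.integers(1988, 2011, size=500)          # one publication year per document

# One trajectory per topic: the mean topic share in each year.
trajectories = {year: doc_topics[years == year].mean(axis=0)
                for year in np.unique(years)}

# trajectories[1995][k] is then the average share of topic k in texts from 1995,
# one point at which an analyst might zoom in and read the underlying texts closely.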
Reflections on working with computational analyses of large textual data
Venturing into the world of machine learning and natural language processing using Big Data over the past couple of years has been an amazing learning experience. Here I reflect upon some aspects of data construction and the link between theory, data, and methods.
Data construction
Starting out in each project, I had to decide what counts as being part of the data set. Moreover, to computationally analyze large textual corpora, I needed machine-readable data.
For the breast cancer project, I used already digitized data from publication repositories, selected on the basis of very general key words so as to be most inclusive when constructing the basic data set. For the gastronomic field, I transformed archival records, in this case all restaurant reviews published by two biweekly magazines, from microfilm, photographs, or paper into machine-readable format. This involved sending the scanned or photographed texts through optical character recognition (OCR) and then correcting OCR mistakes in the machine-readable text by comparing it to the scanned original.
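For readers unfamiliar with this step, a minimal sketch of such an OCR pass is given below. The pytesseract wrapper for the Tesseract engine, the Pillow library, the German language model, and the file paths are all illustrative assumptions, not a description of the actual workflow.

# Hypothetical OCR step: turn scanned review pages into machine-readable text.
# pytesseract/Tesseract, Pillow, the "deu" language model, and the paths are
# placeholder choices; the output still needs manual correction against the scan.
from pathlib import Path

from PIL import Image
import pytesseract

Path("ocr_output").mkdir(exist_ok=True)
for page in sorted(Path("scans").glob("*.png")):
    raw_text = pytesseract.image_to_string(Image.open(page), lang="deu")
    Path("ocr_output", page.stem + ".txt").write_text(raw_text, encoding="utf-8")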
Yet, this replicable selection of texts is only the first step in the data set construction. A next step involves many decisions on what will be analyzed and what will be neglected given the machine-readable texts. I found this to be far removed from any automatable processes. Instead, the procedures of “data curation” and “cleaning the data corpus” of “unnecessary” information proved extremely challenging. I learned a lot about grammar, stemming, tokenization, stop words, and n-grams. I had to make decisions about terms that in my readings of the texts suggested importance, like particular evaluative expressions in the case of the restaurant reviews, yet in the algorithmic analysis proved negligible. These decisions were time-consuming and, in a back-and-forth between data and tentative results, analytically intense. Curating the data for the algorithm brought new uncertainties. As Venturini et al. point out, “we are far from the concept of automation: computerized research is neither fast nor easier” (2014: 5).
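To give one concrete, if simplified, picture of what such curation decisions look like in code, the following sketch strings together tokenization, stop-word removal, stemming, and simple bigrams. The regex tokenizer, the tiny German stop-word list, and the Snowball stemmer are placeholder choices, and each stands in for exactly the kind of consequential decision discussed above.

# Simplified curation pipeline: tokenization, stop-word removal, stemming,
# and bigram construction. The tokenizer, the toy stop-word list, and the
# stemmer are placeholder choices standing in for many case-specific decisions.
import re
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = {"und", "der", "die", "das", "ein", "eine", "ist", "mit"}  # toy list
stemmer = SnowballStemmer("german")

def preprocess(text):
    tokens = re.findall(r"[a-zäöüß]+", text.lower())              # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]            # stop-word removal
    stems = [stemmer.stem(t) for t in tokens]                     # stemming
    bigrams = ["_".join(pair) for pair in zip(stems, stems[1:])]  # n-grams (here: bigrams)
    return stems + bigrams

print(preprocess("Das Restaurant überzeugt mit einer feinen regionalen Küche."))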
To me, these decision-laden processes speak to the insight that “raw data is an oxymoron” (Gitelman, 2013). As sociologists we know that both qualitative and quantitative approaches involve assembling, analyzing, and inscribing forms of data. In doing so, we make important decisions about how we envision the social. In “cleaning the data” I also made them “algorithm ready” (Gillespie, 2014: 171). This resonates with discussions about social media data collection processes, which point to systematic biases because the practices of users on social media platforms reflect the technical possibilities, in particular the algorithms, of those platforms (e.g. Boyd and Crawford, 2012; Marres, 2012; Ruths and Pfeffer, 2014). Social media platforms “perform and produce sociality as much as they describe it” (Burrows and Savage, 2014: 5). More sociological research is needed on how data are constructed for all types of social research, including but not limited to unstructured, large-scale social media sources. We also need to be more transparent in describing how we arrive at our data, be it in qualitative or quantitative research.
Linking theory, data, and methods
Working with these data and methods also made me reflect upon the link between theory and methods in several ways.
DiMaggio et al. (2013) indicate topic modeling’s fit with theoretical concepts of cultural sociology, e.g. operationalizing frames and the relationality of meaning. They also stress that a topic model is a starting point to be used to answer further questions. However, limitations of the method are also apparent. The statistical method ignores subtleties and ambiguities of language. Further advances are needed to incorporate qualitatively established insights, for instance on motives and logics, and how these are connected across texts (but see Mohr et al., 2013). The field of socio-semantic network analysis is rapidly developing in multiple disciplinary domains. In that sense, Big Data methods and established techniques of cultural sociology can become excellent complements (Bail, 2014).
In another sense, working with algorithms and considering their assumptions about the data I feed them, at a time when algorithms format and influence many of my everyday decisions, seems like a fitting hands-on experience. As research on the “social life of methods” has pointed out, “social science methods are now more in and of social worlds, not standing outside and detached from them as objects or subjects of inquiry” (Ruppert, 2013: 273). While social science methods are constitutive of social media platforms and search engines, I find some familiarity with the workings of machine learning, in turn, very helpful to conduct research and to teach on digital life.
I also find myself in discussions with students and colleagues about the challenges of combining theory and methods using Big Data in a more practical sense. Even undergraduate students in sociology are becoming interested in conducting empirical research using social media sources. However, methods training is typically limited to fixed tools for statistical analysis. Little training is offered on how to construct data, how to manipulate large data sets, or how to analyze large textual corpora using machine-learning tools. Little theoretical guidance for doing such research is available, as training in technical skills and in theoretical approaches is often kept separate. Students who have coding and modeling skills often leave academia for positions doing social research in the media and marketing industries. Similarly, projects using Big Data, from data journalism to computational social science, have little engagement with sociology, although many sociological insights could strengthen their analyses; in turn, sociologists could benefit from enhanced computing and visualization skills.
As these points indicate, Big Data and its methods of analysis challenge the praxis of doing sociology. But, to be sure, sociology has much to contribute to these new arenas of social science research: because of its insights into and its techniques for studying meaning and how the social is structured, sociology is highly relevant to data science projects mining large data sets.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Acknowledgement
I thank the editors of this special issue for their helpful comments.
