Abstract
Working with computational methods and the analysis of large textual corpora has been challenging and very rewarding—with all the ups and downs that doing empirical social research entails. In my contribution, I relate some research experiences and reflect upon data construction and the links between theory, data, and methods.
With a formative training in social network analysis and its theoretical contributions to sociology (Breiger, 2004; Latour, 1996; White, 1992), economic sociology (Stark, 1996, 2000), and empirical cultural sociology interested in meaning making processes (Mohr, 1998), I began my fieldwork research for my dissertation. I was interested in newspapers’ meaning making at the time of the move of the German political capital from cozy Bonn to the formerly divided city of Berlin in the late 1990s. I observed and interviewed journalists in Bonn and Berlin, followed newspapers’ physical move of offices, and read five newspapers a day for two years. My hunch was that this move would be reflected in the newspapers. From my ethnographic fieldwork I found a competitive jockeying for readers, the re-wiring of journalists’ career paths, and isomorphism in newspaper design.
Translating my newspaper reading into an analytic framework posed different problems. In particular, I was interested in how those newspapers would position themselves narratively in their editorials as opinion leaders in a united Germany. From my fieldwork I learned that editorialists wrote to communicate with their colleagues at competing newspapers. They took their competitors into account without citing them directly. Ideally, I needed a method that would take contextual meaning making across different papers over time into account while at the same time allowing me to analyze how these meaning making processes shaped the identity of the newspapers—using all 9,000 editorials.
For lack of modeling and computational skills, I ended up using only a sample of editorials, highlighting particular issues and how they were disputed across the editorials. Using insights from sociological research on text as data (Bearman and Stovel, 2000; Franzosi, 1994; Mohr, 1994), I manually coded how editorials evaluated particular issues (Boltanski and Thévenot, 1999), traced how those evaluations were disputed across the editorials, and detected patterns of evaluation in those disputes using optimal matching and social network analyses. I showed “narrative competition” between the different newspapers and was able to bring together my interests in empirical cultural and economic sociology (Mützel, 2002).
While my fieldwork was a close reading of what was happening economically and organizationally in the field, I had to acknowledge that I was not able to provide a “distant reading” (Moretti, 2000) or a macroscopic view of what the newspapers were writing about. I had more textual data than I could analyze and lacked the skills and analytical tools for what I was interested in.
Encountering topic modeling
I continued on to new empirical projects. And again, I found myself interested in the meaning making processes of economic actors as a new field was emerging—using newspaper and other textual data. Studying the emergence of the field of innovative breast cancer therapeutics beginning in the late 1980s, I first resorted to close reading and coding of press statements and business reports to point to competitive processes in markets without products (Mützel, 2010).
I then encountered the method of topic modeling a couple of years ago. Right away, I was enthusiastic about its potential for analyzing emergent processes from a macroscopic view using large textual corpora.
Topic modeling, and specifically Latent Dirichlet Allocation (LDA), is a method that groups the words of a large textual corpus into topics based on their co-occurrence within the texts. Without a priori coding, and thus without the assumption of content analysis that the analyst has thoroughly understood the entire corpus, the LDA algorithm identifies clusters of words, i.e. latent topics, based on a statistical model of language. What is needed is an idea of how many topics a corpus consists of and the ability to recognize when a set of topics matches the analyst’s knowledge of the field. Developed in the fields of computer science, machine learning, and natural language processing (e.g. Blei, 2012; Blei and Lafferty, 2009; Blei et al., 2003), topic modeling has recently received heightened attention in the humanities (e.g. Jockers, 2013; Meeks and Weingart, 2013; Moretti, 2013) and the social sciences (e.g. Grimmer and Stewart, 2013; Kaplan and Vakili, 2014; Mohr and Bogdanov, 2013; Ramage et al., 2009). As users and developers of the method have pointed out, the identified topics capture “the relationality of meaning” (DiMaggio et al., 2013: 571) in that the assignment of words to topics is based on co-occurrence and thus embedded in contextual meaning. What is more, topic modeling also allows for the multiplicity of meaning: a word can belong to different topics as its meaning may vary across different contexts. Topic modeling presents an “inductive relational approach to the study of culture”; it offers substantive interpretability of topics as frames (p. 576).
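To make the mechanics of this more tangible, a minimal sketch of an LDA run in Python follows. It assumes the scikit-learn library, a toy list of documents standing in for a real corpus, and an arbitrary number of topics; none of these choices are taken from the studies discussed here.

# Minimal LDA sketch; the toy corpus and the number of topics are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "placeholder editorial about the capital move and political identity",
    "placeholder editorial about economic competition between newspapers",
    "placeholder review praising a new restaurant and its regional cuisine",
]

# Turn the texts into a document-term matrix of word counts.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)

# Fit the model; the analyst has to choose the number of topics in advance.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(dtm)  # per-document topic proportions

# Inspect each topic's most probable words to judge its interpretability.
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {k}: {', '.join(top_words)}")

How many topics to fit, and when a solution counts as interpretable, remain judgments the analyst has to make against her knowledge of the field.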
To be sure, topic modeling only shifts substantive interpretation to a later position in the analytical process—it does not replace it. Analysts need knowledge of the respective field to make sense of the resulting topics. I have used topic modeling to gain macroscopic insights into the developments of entire fields. For the field of breast cancer therapeutics, I use different textual corpora to trace discursive trajectories of biochemical molecules, research strategies, and financial expectations over the span of 23 years. In another study, I analyze the gastronomic field of the city of Berlin over the span of 18 years, using restaurant reviews (Mützel, 2015). In both cases, results from the LDA procedure allow me to describe and trace developments over long periods of time. I zoom in at particular moments in time, conduct qualitative analysis of the texts, and interpret the developments using the topics as frames for what was going on based on my knowledge of the respective field. For my long-standing interest in the emergence of fields based on a study of large textual corpora, LDA is thus a valuable contribution with its own limitations.
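As a rough illustration of how such trajectories can be traced (not the actual pipeline of these studies), one can average the per-document topic proportions that LDA returns by publication year. The sketch below uses randomly generated placeholder data to show the idea.

# Placeholder illustration of tracing topic trajectories over time: average
# per-document topic proportions by publication year. The matrix and the
# years are randomly generated stand-ins, not data from the studies above.
import numpy as np

rng = np.random.default_rng(0)
doc_topics = rng.dirichlet(np.ones(20), size=500)   # 500 documents, 20 topics
years = rng.integers(1988, 2011, size=500)          # one publication year per document

# One trajectory per topic: the mean topic share in each year.
trajectories = {year: doc_topics[years == year].mean(axis=0)
                for year in np.unique(years)}

# trajectories[1995][k] is then the average share of topic k in texts from 1995,
# one point at which an analyst might zoom in and read the underlying texts closely.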
Reflections on working with computational analyses of large textual data
Venturing into the world of machine learning and natural language processing using Big Data over the past couple of years has been an amazing learning experience. Here I reflect upon some aspects of data construction and the link between theory, data, and methods.
Data construction
Starting out in each project, I had to decide what counts as being part of the data set. Moreover, to computationally analyze large textual corpora, I needed machine-readable data.
For the breast cancer project, I used already digitized data from publication repositories, selected on the basis of very general key words so as to be most inclusive when constructing the basic data set. For the gastronomic field, I transformed archival records, in this case all restaurant reviews published by two biweekly magazines, from microfilm, photographs, or paper into machine-readable format. This involved sending the scanned or photographed texts through optical character recognition (OCR) and then correcting OCR mistakes in the machine-readable text by comparing it to the scanned original.
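For readers unfamiliar with this step, a minimal sketch of such an OCR pass is given below. The pytesseract wrapper for the Tesseract engine, the Pillow library, the German language model, and the file paths are all illustrative assumptions, not a description of the actual workflow.

# Hypothetical OCR step: turn scanned review pages into machine-readable text.
# pytesseract/Tesseract, Pillow, the "deu" language model, and the paths are
# placeholder choices; the output still needs manual correction against the scan.
from pathlib import Path

from PIL import Image
import pytesseract

Path("ocr_output").mkdir(exist_ok=True)
for page in sorted(Path("scans").glob("*.png")):
    raw_text = pytesseract.image_to_string(Image.open(page), lang="deu")
    Path("ocr_output", page.stem + ".txt").write_text(raw_text, encoding="utf-8")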
Yet, this replicable selection of texts is only the first step in the data set construction. A next step involves many decisions on what will be analyzed and what will be neglected given the machine-readable texts. I found this to be far removed from any automatable processes. Instead, the procedures of “data curation” and “cleaning the data corpus” of “unnecessary” information proved extremely challenging. I learned a lot about grammar, stemming, tokenization, stop words, and n-grams. I had to make decisions about terms that in my readings of the texts suggested importance, like particular evaluative expressions in the case of the restaurant reviews, yet in the algorithmic analysis proved negligible. These decisions were time-consuming and, in a back-and-forth between data and tentative results, analytically intense. Curating the data for the algorithm brought new uncertainties. As Venturini et al. point out, “we are far from the concept of automation: computerized research is neither fast nor easier” (2014: 5).
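To give one concrete, if simplified, picture of what such curation decisions look like in code, the following sketch strings together tokenization, stop-word removal, stemming, and simple bigrams. The regex tokenizer, the tiny German stop-word list, and the Snowball stemmer are placeholder choices, and each stands in for exactly the kind of consequential decision discussed above.

# Simplified curation pipeline: tokenization, stop-word removal, stemming,
# and bigram construction. The tokenizer, the toy stop-word list, and the
# stemmer are placeholder choices standing in for many case-specific decisions.
import re
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = {"und", "der", "die", "das", "ein", "eine", "ist", "mit"}  # toy list
stemmer = SnowballStemmer("german")

def preprocess(text):
    tokens = re.findall(r"[a-zäöüß]+", text.lower())              # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]            # stop-word removal
    stems = [stemmer.stem(t) for t in tokens]                     # stemming
    bigrams = ["_".join(pair) for pair in zip(stems, stems[1:])]  # n-grams (here: bigrams)
    return stems + bigrams

print(preprocess("Das Restaurant überzeugt mit einer feinen regionalen Küche."))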
To me, these decision-laden processes speak to the insight that “raw data is an oxymoron” (Gitelman, 2013). As sociologists we know that both qualitative and quantitative approaches involve assembling, analyzing, and inscribing forms of data. In doing so, we make important decisions about how we envision the social. In “cleaning the data” I also made them “algorithm ready” (Gillespie, 2014: 171). This resonates with discussions about social media data collection processes, which point to systematic biases because the practices of users on social media platforms reflect the technical possibilities, in particular the algorithms, of those platforms (e.g. Boyd and Crawford, 2012; Marres, 2012; Ruths and Pfeffer, 2014). Social media platforms “perform and produce sociality as much as they describe it” (Burrows and Savage, 2014: 5). More sociological research is needed on how data are constructed for all types of social research, including but not limited to unstructured, large-scale social media sources. We also need to be more transparent in describing how we arrive at our data, be it in qualitative or quantitative research.
Linking theory, data, and methods
Working with these data and methods also made me reflect upon the link between theory and methods in several ways.
DiMaggio et al. (2013) indicate topic modeling’s fit with theoretical concepts of cultural sociology, e.g. operationalizing frames and the relationality of meaning. They also stress that a topic model is a starting point to be used to answer further questions. However, limitations of the method are also apparent. The statistical method ignores subtleties and ambiguities of language. Further advances are needed to incorporate qualitatively established insights, for instance on motives and logics, and how these are connected across texts (but see Mohr et al., 2013). The field of socio-semantic network analysis is rapidly developing in multiple disciplinary domains. In that sense, Big Data methods and established techniques of cultural sociology can become excellent complements (Bail, 2014).
In another sense, working with algorithms and considering their assumptions about the data I feed them, at a time when algorithms format and influence many of my everyday decisions, seems like a fitting hands-on experience. As research on the “social life of methods” has pointed out, “social science methods are now more in and of social worlds, not standing outside and detached from them as objects or subjects of inquiry” (Ruppert, 2013: 273). While social science methods are constitutive of social media platforms and search engines, I find some familiarity with the workings of machine learning, in turn, very helpful to conduct research and to teach on digital life.
I also find myself in discussions with students and colleagues about the challenges of combining theory and methods using Big Data in a more practical sense. Even undergraduate students in sociology are becoming interested in conducting empirical research using social media sources. However, methods training is typically limited to fixed tools for statistical analysis. Little training is offered on how to construct data, how to manipulate large data sets, or how to analyze large textual corpora using machine-learning tools. Little theoretical guidance for doing such research is available, as training in technical skills and in theoretical approaches is often kept separate. Students who have coding and modeling skills often leave academia for positions doing social research in the media and marketing industries. Similarly, projects using Big Data, from data journalism to computational social science, have little engagement with sociology, although many sociological insights could strengthen their analyses; in turn, sociologists could benefit from enhanced computing and visualization skills.
As these points indicate, Big Data and its methods of analysis challenge the praxis of doing sociology. But, to be sure, sociology has much to contribute to these new arenas of social science research: because of its insights into and its techniques for studying meaning and how the social is structured, sociology is highly relevant to data science projects mining large data sets.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Acknowledgement
I thank the editors of this special issue for their helpful comments.
