Abstract
Despite its empirical prominence, there is very little extant organizational research on Big Data. However, there is reason to believe this is changing as organizational theory scholars are beginning to embrace new methods and data sources. In this essay, I present a view that suggests there are several latent opportunities, many of which have been simmering unattended for some time. This research approach is not without its challenges, as the ontological terrain of Big Data is untested and potentially disruptive. However, we are observing a renewal of approaches to text and content analysis. By opening up the toolkit of computational linguistics methods for text analysis, Big Data may bring about fresh synthesis and reshape classic debates around social structure.
Introduction
The era of Big Data is full of promises, heralding new ways to unlock the latent value of unstructured information. For the sceptical social scientist, this may seem strange, because it is a proposition that intersects with the core of what we practise. We often conduct research by taking a closer look at information that is initially unstructured. By examining patterns, we then form cases and inductively examine them. With these data and methods of scientific inquiry (Dewey, 1910), we deductively examine theoretical mechanisms to further our understanding and contribute to epistemologies. So what is sufficiently new and relevant about Big Data that is causing such a stir? In a recent book on the subject, Mayer-Schönberger and Cukier (2013: 19) claim that, “Big Data is all about seeing and understanding the relations within and among pieces of information that, until very recently, we struggled to fully grasp”. Kitchin (2014) reflects on Big Data affecting epistemologies, reminding us that new forms of measurement often precede paradigm shifts in science. For management studies and organizational research, the conceptualization of social structure is a key topic restricted by empirical limitations. Big Data work brings a volume, scale, and precision that afford an unprecedented view of social structure, though such a view is not easily derived.
Although published organizational research on Big Data remains scarce, the movement, which emerged first from practice, has recently become of interest to the field of management. In the flagship journal, the
In this essay, I focus on the opportunities and challenges in the pairing of methodological competencies with what George et al. (2014) call ‘community data’. They define this as the “distillation of unstructured data – especially text – into dynamic networks that capture social trends … [that] can then be distilled for meaning to infer patterns in social structure” (2014: 322). Working with ‘community data’ entails first tracing patterns in public discourse to derive social structure, and then showing the impact of organizations embedded in it. In the few available examples of this type of work, the media serves as an important intermediary, stitching together cues of social structure in public discourse. Kennedy (2008) showed how the media provided a cognitive legitimacy for firms entering nascent markets by mentioning them together in clusters. Rosa et al. (1999) found media serving as a conduit for producer–customer sensemaking through market stories. Building from these, I believe there is an opportunity here with Big Data approaches and methods to study social structure at a meso-level and trace how this maintains organizational fields.
Lounsbury and Ventresca (2003) hailed the interest in measuring social structure from discourse as a ‘new structuralism’ that conceptualized organizations being constituted within cultural processes, meaning systems (Friedland and Alford, 1991; Meyer et al., 1987), and structures of social cognition (DiMaggio, 1997). A decade on, George et al. (2014) made a similar argument based on Big Data. I contend that ‘new structuralism’ was a promising perspective that set in motion fresh theorizing about organizations and institutions, but was slowed by methodological limitations in text analysis. It is well suited to the methodological particulars of Big Data with scale and precision. However, in stitching the two together, there is a potential minefield of empirical pitfalls. In this essay, I suggest that Big Data in the form of competencies with computational linguistics and natural language processing (NLP) can effectively answer the call of ‘new structuralism’.
Social structure in organizational theory
As noted by Lounsbury and Ventresca (2003), organizational theory has undergone several historical transformations where social structure has oscillated as a core concern. The early works of the 1950s and 1960s (Gouldner, 1954; Selznick, 1949) included classic case studies that considered the role of organizations in society. By the 1970s, this gave way to an era of organization research that was “dominated by a conceptualization of interorganizational relations as highly rationalistic and instrumental” (Lounsbury and Ventresca, 2003: 458). This work drew upon theories of contingency to explain formalized organizational structures and resource dependency relationships, driving general societal structures to the background. By the 1990s a ‘new structuralism’ (Lounsbury and Ventresca, 2003) emerged through an interest in societal structures based on an intersection of the sociology of culture, practice theory, and institutional analyses (Mohr, 2000). In this research programme, rational action of organizations is embedded within broader social structures, compelling us to examine societal belief systems (Friedland and Alford, 1991; Meyer et al., 1987) and social ontologies (Ruef, 1999).
By focusing on cultural processes and meaning systems, ‘new structuralism’ provided a fresh impetus for archival work, both in studying unstructured data in contemporary and historical texts (Ventresca and Mohr, 2002). The promise of this work was to have a richer conceptualization of organizations and social structure in “exploring the deeper cultural categories and meanings that inform practices in fields” (Lounsbury and Ventresca, 2003: 464). This echoed a renewed push within cultural sociology to measure culture and social cognition (DiMaggio, 1997) using more formal relational methods such as multidimensional scaling, cluster analysis, network analysis, and correspondence analysis (Mohr, 1998).
One fruitful approach to this can potentially be found in social network analysis (SNA), where there has been an ongoing tension between social networks capturing social interactions and networks as cognitive structures (Burt et al., 2013). Empirical work in SNA is often based on inferring relationships as dyads. Two people can be linked based on friendship ties, or sitting on the same corporate board. Organizations can be linked in a number of different ways, including competing in the same market, having R&D alliances, or having complementary products (George et al., 2014). When two organizations have an R&D alliance, this can be measured as a social interaction. Measuring two organizations as competing in the same market captures a cognitive structure. In considering how legitimacy is the mechanism holding the cognitive structure together, social networks can also integrate conceptions of organizational fields.
When we consider that social networks are graphical representations of either social interactions or cognitive structures, this may speak to social structure bounding organizational fields, and thus organizational activity. Networks represent both ongoing interactions and potential ones, in facilitating future interactions through mechanisms such as structural equivalence. The latter is perhaps less apparent than the former. The classic work from White (2002) conceived of a product market as a set of producers who follow one another’s actions to determine their own strategies. Porac et al. (1989: 397) introduced a cognitive perspective to studying industry by suggesting that the “structure of industry both determines and is determined by managerial perceptions of the environment”. In terms of social structure, this claimed that competitive groups were ‘cognitive communities’ based on the mental models of managers. Network representations of a market capture both a cognitive structure and a series of interactions.
Text analysis in organizational research
Relational approaches to text analysis are far from the norm in organizational research. The field has long used content analysis methods to study “individual or collective structures such as values, intentions, attitudes, and cognitions” (Duriau et al., 2007). Much of this work has a positivist leaning (Burrell and Morgan, 1979) and measures psychological constructs in texts using standardized dictionaries such as the Linguistic Inquiry and Word Count (LIWC), or the General Inquirer (Pollach, 2012). In such a deductive research design utilizing a standardized dictionary, keywords are established a priori in dictionary categories such as positive or negative sentiment. This analytical approach focuses on the manifest content that can be captured and measured in text statistics (Duriau et al., 2007). By counting the incidences of dictionary constructs, inferences are made about the meaning of the text. However, another form of computer-aided content analysis work draws on an interpretivist epistemology (Burrell and Morgan, 1979) and seeks to identify deeper meanings and discourses in texts (Krippendorff, 2003). Both the deductive and inductive approaches tend to focus on individual words and are notably detached from relational approaches to measuring social structure.
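The deductive, dictionary-based approach described above can be sketched in a few lines. The categories and keywords below are illustrative stand-ins for a standardized dictionary such as LIWC or the General Inquirer, which contain thousands of curated entries:

```python
from collections import Counter
import re

# Hypothetical mini-dictionary; real standardized dictionaries
# (e.g. LIWC, General Inquirer) contain thousands of entries per category.
DICTIONARY = {
    "positive": {"growth", "success", "innovative", "profit"},
    "negative": {"scandal", "loss", "failure", "decline"},
}

def score_text(text: str) -> dict:
    """Count incidences of each dictionary category in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {category: sum(counts[word] for word in words)
            for category, words in DICTIONARY.items()}

scores = score_text("The firm's growth was a success, despite one quarter of decline.")
# -> {'positive': 2, 'negative': 1}
```

Inferences about a text's meaning are then drawn from these category counts alone, which is precisely the word-level, manifest-content focus that detaches such work from relational measures of social structure.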
Text analysis in organizational research has largely remained in a stable state based on measuring content through dictionaries. In an effort to push this forward, Pollach (2012) urged the field to consider corpus linguistics as a further form of computationally oriented content analysis. She made the case that corpus linguistics extends content analysis, contributing on the basis that “[it] studies real-life language use on the basis of a text corpus”, through the “integration of corpus linguistics into discourse analysis with a view to reducing the subjectivity inherent in discourse analysis” (2012: 2). Corpus linguistics was portrayed not as a single approach but as a computationally driven set of approaches that used large datasets to study textual patterns. From this, it is clear that in order to effectively utilize corpus linguistics, there is a need for novel competencies not typically found in business school training, but more likely in a linguistics or computer science department.
I propose that Pollach’s (2012) argument can be extended to consider computational linguistics and text analysis more broadly. In the divide between content analysis in organizational research and the assortment of work in computational linguistics, it is evident that the latter uses bespoke, customizable tools to manipulate text and data in technically sophisticated ways. This relates to the granular dimension of Big Data as a competency to move, transform, and isolate particular units of text within a broader corpus. There is a qualitative difference between measuring constructs as pre-determined dictionaries and manipulating text to find and test relational patterns.
Deriving social structure from text analysis
If the analytical goal of ‘new structuralism’ (Lounsbury and Ventresca, 2003) is to derive evidence of social structure from archival textual data, then Big Data can be a potentially fruitful approach to address this. George et al. (2014: 325) reflect that “community data” contains relational ties and “information on such relationships is often available in unstructured textual form, such as in news articles or company blogs on the web”. No doubt, this presents an exciting opportunity for management researchers to use computational linguistics to extract relational data such as semantic networks from text that can be approximated as social networks and integrated with organizational analyses.
To conduct such work, there is a wealth of material from the last three decades on semantic network analysis to draw upon. Franzosi (1987) sought to isolate fundamental narrative units and map them. Carley and Kaufer (1993) contributed a perspective not only of concepts as symbols in texts, but also dimensions of “symbolic connectivity”. More recently, Diesner et al. (2012) used statistical co-occurrence of named entities to form socio-cultural networks. This echoed similar work by Roth and Cointet (2010) tracing the co-evolution of social and semantic networks that explored the dynamics of communities of bloggers and scientists.
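The entity co-occurrence logic behind studies such as Diesner et al. (2012) can be sketched as follows. The documents and entity names are hypothetical stand-ins for the output of a named-entity recognizer; entities appearing in the same document are linked, with edge weights given by co-occurrence frequency:

```python
from collections import Counter
from itertools import combinations

# Illustrative stand-in for named-entity-recognition output:
# one list of detected organization names per document.
documents = [
    ["Acme Corp", "Globex", "Initech"],
    ["Acme Corp", "Globex"],
    ["Globex", "Initech"],
]

# Build a weighted edge list: each unordered pair of entities
# co-occurring in a document increments that edge's weight.
edges = Counter()
for entities in documents:
    for a, b in combinations(sorted(set(entities)), 2):
        edges[(a, b)] += 1

# edges[("Acme Corp", "Globex")] -> 2
```

The resulting weighted edge list is the raw material for a semantic network; in practice the co-occurrence window might be a sentence or paragraph rather than a whole document, and statistical tests would separate meaningful ties from chance co-occurrence.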
Recently in this journal, Sudhahar et al. (2015) used “automatic corpus linguistics methods and network analysis to obtain a network representation of [an] entire [political] campaign coverage by the news media” (p. 1). This study conducted an automated Big Data analysis of 130,000 news articles covering a US presidential election. The authors used computational linguistics to identify noun phrases and verbs as the elements of a semantic network.
Disambiguating concepts as network elements
Thus far I have discussed the potential for semantic networks in the process of theorizing about social structure in organizational research. However, when we consider computational linguistics as a Big Data approach studying natural language, this also introduces several methodological pitfalls. First, is the issue of word sense disambiguation, or “how to decide between different meanings of the words” (Vossen, 2005: 9). Second, when our analytical goal is to tease out social structure through networks of entities and their relations, there is the issue of whether individual words match our conceptualization of entities, or if phrases are more adequate representations. We must be reflexive in these issues regarding assumptions about ontologies.
Although computational linguistics is a label for a family of computational approaches to studying texts, it is notable that many of the tools used to manipulate, transform, and transport the textual data fall under the umbrella of “NLP”. As a combination of computational linguistics, artificial intelligence, and statistics (Mitkov, 2005), NLP methods do not aim to uncover a fully comprehensive representation of knowledge (Vossen, 2005). Rather, they are based on using probabilistic techniques from computer science to uncover textual patterns from corpora (Manning and Schuetze, 1999).
Considering the different subfields in linguistics – morphology, semantics, pragmatics, semiotics, discourse, to mention some – it should be evident that all languages are inherently ambiguous (Vossen, 2005). Social actors interact based on shared meanings that are imperfectly constructed within social worlds (Ruef, 1999; Wittgenstein, 1958). Even if we acknowledge room for ambiguity in the meaning of words, and that cultural meanings are situated in localized social contexts, it is a formidable challenge to carefully extract entities that accurately represent concepts.
Determining adequate empirical representations of concepts is an important issue. In linguistics, studying individual words as the unit of analysis has the drawback of potentially different meanings – “polysemy” – and is resolved through making inferences based on context, or “word sense disambiguation” (Vossen, 2005: 9). However, this leads to questions around appropriate methodological ontologies. In resolving words to match concepts representing units of social structure, is it the particular meaning (sense) of a word that matters, or the linguistic role it plays? The process generally referred to in NLP as “information extraction” considers how the presence of named entities – people, places, organizations – occurs in the form of phrases that may be one or more words. In NLP, a phrase refers to a syntactically well-formed unit, such as a noun phrase, a prepositional phrase, or a verb phrase. The most frequently used phrases for research are noun phrases since they tend to capture entities and concepts (Tzoukermann et al., 2005: 6).
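As a rough illustration of extracting noun phrases from part-of-speech output, the sketch below chunks pre-tagged tokens with a simple determiner/adjective/noun pattern. In practice a linguistic parser or an NLP toolkit supplies the tags and far more robust grammars; the example sentence and the chunking rule here are illustrative assumptions only:

```python
def noun_phrases(tagged_tokens):
    """Greedily chunk optional determiner/adjective runs followed by nouns."""
    phrases, current = [], []

    def flush():
        # Keep the accumulated chunk only if it actually contains a noun.
        if any(tag.startswith("NN") for _, tag in current):
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        if tag.startswith("NN"):            # noun: extend the chunk
            current.append((word, tag))
        elif tag in ("DT", "JJ"):           # determiner/adjective: may open a chunk
            if any(t.startswith("NN") for _, t in current):
                flush()                     # a noun run ended; start a new phrase
            current.append((word, tag))
        else:                               # any other tag closes the chunk
            flush()
    flush()
    return phrases

# Hypothetical pre-tagged sentence (Penn Treebank-style tags).
tagged = [("The", "DT"), ("expenses", "NNS"), ("scandal", "NN"),
          ("embarrassed", "VBD"), ("several", "JJ"),
          ("senior", "JJ"), ("ministers", "NNS")]
# noun_phrases(tagged) -> ["The expenses scandal", "several senior ministers"]
```

Even this toy grammar shows why phrase-level units matter: “expenses scandal” is a single concept that word-level counting would split in two.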
Cultural norms of expression are embedded in the way natural language is structured and not all expressions may necessarily be commensurate. This leads to a brief aside on ontological assumptions in this type of analysis. In the linguistics tradition, ontologies are broadly defined as “the storage of information within a domain, to draw common sense inferences … an inventory of the objects, processes, etc. in a domain, as well as a specification of (some of) the relations that hold among them” (Vossen, 2005: 2). Ruef (1999: 1403) reflects this concern from a sociological tradition, describing ontologies as “systems of categories, meanings, and identities within which actors and actions are situated”. Both theoretically and methodologically, we are faced with making decisions about ontological units as the basis of analysis.
My own encounters with extracting concepts from the messiness of natural language in text have centred on resolving ambiguities. Theoretical approaches to measuring meaning as relational structures (Mohr, 1998) are based on parsimony and simplification. Whilst we require ontological stability in order to conduct large-scale analyses, the extent to which the meaning of units is shared is rather interesting. There is often an assumption of a stable ontology as a meaning system (Vossen, 2005). Within Big Data, the constitution of this knowledge lexicon is not without controversy (Boyd and Crawford, 2012). This presents a methodological difficulty in moving from constructs to measures.
Thus far, this essay has likened a Big Data analysis of social structure in text to a minefield of methodological pitfalls. However, this also opens up several interesting opportunities for analysis. My recent work is a joint project focused on the role of media in the social construction of scandal and the mechanisms of how particular cognitive frames (Cornelissen and Werner, 2014) are applied to Members of Parliament in the scandal coverage (Hannigan et al., 2015). When frames are measured as formal meaning structures, then how they are applied to actors in a period of sensemaking is a relational process as per ‘new structuralism’. We considered a set of actor entities from amongst the 644 Members of Parliament in the 2009 British “Members of Parliament Expenses Scandal”. Our analytical goal was to find media treatment variables from concept networks that would predict when MPs would resign in the scandal. The process we developed used NLP on media articles to generate concept collocation networks (Kennedy, 2008).
Similar to classic studies in SNA that generated actor incidence networks based on common events (Davis et al., 1941), sitting on corporate boards (Burt et al., 2013), or creative teams in Broadway musicals (Uzzi and Spiro, 2005), our actors are Members of Parliament, and incidences are other concepts in the texts that may constitute frames. In order to conduct our analysis, we needed to disambiguate both. Noun phrases were determined through a linguistic parser, which uses parts of speech to isolate phrases. From amongst the set of noun phrases, we then disambiguated MP names.
In approaching our project, we initially recoiled at the prospect of the complexities in disambiguating MP entities. The computational linguistics analysis yielded a set of millions of noun phrases. In order to make this process manageable, we first acknowledged that pronouns were far too difficult to automate in over 30,000 articles. We then turned to last names as the basis for candidate phrases. In the process of preparing the dataset, we also built a database of 644 MPs and their attributes. We used this attribute information to construct dynamic dictionaries for disambiguation on the fly in generating concept networks. Each article was initially sampled from LexisNexis on the basis of the MP’s full name, and when combined with the date, this provided another clue in disambiguating names. Then by using attribute information from our database of MPs, we scored phrases based on the presence of gender, political party, role and others. Through this process of building probabilities for matching, we were able to confidently collapse phrases into concepts pertaining to MPs. Once we had MPs and concepts disambiguated in text, we were then able to map out the structures of media discourse and show how these impacted the actions of resignations.
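The attribute-based scoring described above can be sketched as follows. The MP records, cue categories, and weights here are hypothetical simplifications of the actual pipeline, which combined many more signals (including the sampled article's full-name query and its date):

```python
# Illustrative stand-in for the database of 644 MPs and their attributes.
MPS = [
    {"name": "Jane Smith", "surname": "Smith", "party": "Labour",
     "gender": "female", "role": "minister"},
    {"name": "John Smith", "surname": "Smith", "party": "Conservative",
     "gender": "male", "role": "backbencher"},
]

def score_candidates(surname, context):
    """Score each MP sharing a surname by attribute cues found in the context.

    Naive substring matching is used here purely for illustration; a real
    pipeline would tokenize the context and weight cues by reliability.
    """
    context = context.lower()
    scores = {}
    for mp in MPS:
        if mp["surname"].lower() != surname.lower():
            continue
        score = 1.0  # base score for the surname match itself
        for cue in ("party", "gender", "role"):
            if mp[cue].lower() in context:
                score += 1.0  # each attribute cue raises the match probability
        scores[mp["name"]] = score
    return scores

scores = score_candidates("Smith", "The Labour minister faced questions...")
# "Jane Smith" outscores "John Smith" via the party and role cues.
```

Collapsing a surname phrase into the highest-scoring MP concept is what allowed us to move from millions of raw noun phrases to disambiguated actor entities in the concept networks.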
Conclusion
In this essay, I argued that Big Data does not simply cover volume or precision, but also an interdisciplinary renaissance whereby computer science, linguistics and NLP methods are being used for social science. This entails competencies for preparing, manipulating, and analysing data in ways that respond to renewed concerns in organizational research for measuring social structure (Lounsbury and Ventresca, 2003; Mohr, 1998). In order to handle the volume and precision required for extracting social structure, text-handling capabilities are necessary to manipulate novel methodological ontologies. On a concrete level, this speaks to skillsets around scripting languages such as Python or R, and using natural language toolkits instead of off-the-shelf software packages.
Cornelissen and Werner (2014) point out that frames and framing have been of core interest to organizational researchers, but work has lagged on using semantic networks to represent such cognitive structures. I believe the Big Data movement helps explain this lag. The issue is not about more powerful machines, or statistical tools. Rather, it is about employing a set of competencies to enable a qualitatively different type of text analysis than what is done currently with content analysis. In this way, Big Data can integrate computational linguistics and NLP competencies into the field of organizational research to propel it forward.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
