Building Social Media Observatories for Monitoring Online Opinion Dynamics

Abstract

Social media house a trove of relevant information for the study of online opinion dynamics. However, harvesting and analyzing the sheer overload of data that is produced by these media poses immense challenges to journalists, researchers, activists, policy makers, and concerned citizens. To mitigate this situation, this article discusses the creation of (social) media observatories: platforms that enable users to capture the complexities of social behavior, in particular the alignment and misalignment of opinions, through computational analyses of digital media data. The article positions the concept of “observatories” for social media monitoring among ongoing methodological developments in the computational social sciences and humanities and proceeds to discuss the technological innovations and design choices behind social media observatories currently under development for the study of opinions related to cultural and societal issues in European spaces. Notable attention is devoted to the construction of Penelope: an open, web-services-based infrastructure that allows different user groups to consult and contribute digital tools and observatories that suit their analytical needs. The potential and the limitations of this approach are discussed on the basis of a climate change opinion observatory that implements text analysis tools to study opinion dynamics concerning themes such as global warming. Throughout, the article explicitly acknowledges and addresses potential risks of the machine-guided and human-incentivized study of opinion dynamics. Concluding remarks are devoted to a synthesis of the ethical and epistemological implications of the exercise of positioning observatories in contemporary information spaces and to an examination of future pathways for the development of social media observatories.

Keywords

media data mining, opinion dynamics, digital methods, artificial intelligence

Introduction

In recent years, controversies surrounding the Trump campaign, the Brexit referendum, Russian-backed military intervention in the Ukraine, and the rise of the Islamic State have revealed just how closely cultural conflicts are entangled with the use and abuse of online (social) media. Indeed, some of the most recent clashes on themes such as nationalism, populism, and climate change have in one way or another been tied to the dynamics that govern platforms like Twitter, Facebook, Reddit, 4chan, and the comment sections of news websites (for an overview, see Singer & Brooking, 2018). It goes without saying, then, that data harvested from social media hold a trove of relevant information for the analysis of opinion dynamics, and the mechanisms that foster them. However, the sheer quantity and diversity of the data that is created each day on these platforms poses tremendous challenges to those who would benefit from a more systematic overview of this information, including policy makers, researchers, journalists, activists, and concerned citizens. For these groups, it has become all but impossible to “manually” sift through the data that might provide evidence of interference by bots and trolls, injections of fake news, polarization, coalitions and antagonisms, or the patterns through which different opinions emerge, clash, and change.

The central thesis of this article is that this problem of information overload, which is essentially a by-product of technological innovations, can, to an extent, also be mitigated by technological means. To this end, this article argues for the creation of social media “observatories”: platforms that offer the aforementioned target groups the tools to study cultural or societal conflicts using (social) media data. Following Richard Rogers (2018), the concept of an “observatory” is here not understood in the sense of an astronomical observatory that processes a constant, stable flow of “good” data (such as radiation levels registered by a telescope pointed toward the sky) (p. 560). Rather, opinion observatories are conceptualized as platforms for media monitoring that can handle the different data sources and ambiguous contents that characterize online (social) media and that are capable of capturing the “emergent” properties of societal and cultural phenomena. To make this concrete, this article illustrates and explores the design philosophies and technologies behind social media observatories under development for the study of opinion landscapes on societal and cultural issues in European spaces. As an integral part of the conceptual exercise and practice of developing media observatories, the article thereby explicitly acknowledges and addresses risks inherent to the machine-guided, but human-incentivized study of online opinion dynamics.

First, the article will introduce the Penelope platform, an open infrastructure that can house different types of observatories, including observatories that facilitate forms of social and geographical network analysis, observatories designed for textual analysis, and observatories that combine both approaches. Second, and on a more detailed level, the article discusses the workings of one concrete text-based observatory currently under development, namely the climate change opinion observatory. By means of a technical overview of this observatory, it will be shown how computational methods for text analysis can be deployed to perform precision language processing, with particular attention devoted to the semantic frame extractor at the core of this observatory. A series of sample analyses performed on commentaries from the news website of The Guardian will then further demonstrate how this climate change opinion observatory supports a range of analyses of social media data, including causation tracking, relational discourse analysis, and changing actor-composition in online discussions. Finally, the article will move back from the concrete to the general by synthesizing and reflecting on the methodological, ethical, and epistemological repercussions of opinion observatories and their applications.

Defining Media Observatories

Media observatories can be defined on the basis of their ultimate purpose: to technologically capture the complexities of social behavior based on evidence in the form of digital (social) media data. As such, the construction of these observatories can be situated among ongoing explorations in bringing computational, data-driven methods to academic and professional fields associated with the social sciences and humanities. For one thing, this includes the digital humanities (DH), an umbrella term for a series of experimental computational approaches to humanistic inquiry (Schnapp, 2014). Going back to late twentieth-century precursors such as “humanities computing” and firmly rooted within traditional humanities disciplines, DH methods rely on computers to mine, visualize, or otherwise explore digital sources. The construction of opinion observatories particularly builds on a lineage of DH experiments revolving around the analysis of large corpora of textual data (for instance, digitized books in the Google Books corpus), and notably on methods such as “distant reading” and “culturomics” (Michel et al., 2011; Moretti, 2013), which aim to extract patterns from big textual data. For another, the construction of social media observatories can be inscribed within the emerging field of “computational social science” (Watts, 2013), which, among other things, explores how digital data allow us to study society at large. For such approaches to “research with the web,” Rogers (2013, 2019) coins the term “digital methods.” These digital methods are geared toward scientifically re-purposing the functionalities and data of online media. Wikipedia articles in different languages might, for instance, be used to study the attitudes of editors from different nationalities toward the same historical event (such as the 1995 Srebrenica massacre) (Rogers, 2013, p. 165). 
The platform and pipelines for automated text analysis that will be discussed in this article are aimed at magnifying the potential of online (social) media to reflect their editors’ or users’ opinions on a series of prominent cultural conflicts, with a specific attention to climate change. As will follow, these (misaligned) opinions on climate change might be revelatory of various underlying conflicts, channeling for instance political discussions.

Taking a “big data” approach to cultural conflict, the construction of social media observatories generally implies that the theoretical complexities of social or cultural phenomena are aligned with methods associated with the “hard sciences” such as physics or computer science (Watts, 2013, p. 5). Epistemologically, this bridging is not without its challenges. As Petter Törnberg and Justus Uitermark note, the adoption of digital methods puts the social sciences, humanities, and related fields at risk of evoking the “fallacy of a naïve naturalism” (Törnberg & Uitermark, 2018, p. 3; also see Törnberg & Törnberg, 2018), thus referring to the pitfall of assuming that the complexities of social behavior can be described according to the same laws and regularities that govern the natural world. Instead, channeling Ball (2012, p. IX), Törnberg and Uitermark argue for an approach modeled after complexity science that acknowledges that cultural and societal phenomena do not follow the predictable “clockwork lines” of the Newtonian universe, but rather display the emergent properties of complex systems such as “avalanches and granular flows, flocks of birds and fish, networks of interaction in neurology, cell biology and technology” (Törnberg & Uitermark, 2018, p. 5). As such, the study of social structures meets its limitations: “[. . .] whatever way we slice them [social structures, author(s)], they keep transforming in ways that we cannot capture, leaking through our abstractions” (Törnberg & Uitermark, 2018, p. 8). Capturing societal phenomena in all their complexity thus requires a combination of different approaches and perspectives. In other words, developers and users of social media observatories should be granted the flexibility to combine, create, and add the tools and interfaces required for the purpose of their analysis. 
In addition, the outcomes of these analyses should be compatible with other qualitative and quantitative research methods deployed in users’ respective fields. As will be unpacked in the following discussion of the H2020 ODYCCEUS project’s Penelope infrastructure, this type of flexibility and compatibility can be fostered by means of an infrastructure that allows users to add and consult analytical tools as web services.

In addition to these baseline epistemological points of discussion, it should be noted that the practice of constructing media observatories is fundamentally an exercise in positioning oneself in a media ecology and information space that puts notions of trust and truth at stake on an unprecedented scale. Although inspired by a sense of technological optimism, observatory construction thus also brings into view the downsides and risks of the technological communication space of which these media observatories become a part—as well as the limitations of the proposed approaches to mitigate such risks. Among many things, social media and other digital discussion platforms create environments in which data and information are open to manipulation (compromising the validity and reliability of analyses), and in which well-intended tools can be abused to elicit and exploit weaknesses in debates or to spread misinformation more effectively. In the following sections, these difficulties will be acknowledged and addressed as an integral part of observatory construction. Notably, conceptualizing and implementing observatories foregrounds open challenges facing digital scholarship in new media environments, including matters of “openness” of data and tools, ensuring data quality and representative sampling, accounting for changing data legislation and policy, building communities and trust, and envisioning and fostering beneficial applications.

The Penelope Infrastructure: A Web Service Approach to Social Media Observatories

The goal of the H2020 ODYCCEUS project is to harness the potential of (online) social media for the analysis and detection of crises facing contemporary society, with a focus on opinion dynamics related to cultural and societal issues in European spaces (ODYCCEUS, 2019a). This includes the study of the alignment and misalignment of opinions on topics such as nationalism, migration, and climate change. On a theoretical level, ODYCCEUS is inspired by global systems science. As such, the project opts for an interdisciplinary approach in which the development of tools for social or geographical network analysis and text analysis is situated among the conceptualization of models to represent cultural frameworks, insights from game theory, and models for alignment and polarization dynamics. These modeling efforts are undertaken on different levels, ranging from mapping the inter-personal mechanisms of conceptual negation and opinion exchange, to modeling the conditions under which distinct spheres of communication emerge that might foster polarization (see, for instance, Banisch & Olbrich, 2018, 2019; Törnberg, 2018). Development and testing of these models is accompanied by empirical studies of societal structures and behavioral dynamics on the basis of (textual) data harvested from online (social) media. Through their quantity and diversity, these data offer unique opportunities to investigate the relationships between meaning, representation, and opinion or conflict dynamics. Tapping into this potential, however, requires technological advances in terms of platforms and pipelines for the large-scale and detailed analysis of social media data. The ODYCCEUS project adheres to the notion that such technologies for the analysis of web content should be made available to a range of user types, with varying degrees of technical proficiency, and offer the flexibility to answer different types of questions concerning human behavior and social structures. 
To this end, the project partners are developing Penelope, a cloud-based, open, and modular platform that facilitates the data-driven analysis of opinion dynamics in online textual media (Penelope, 2019). Through web APIs, Penelope groups a variety of interconnected components and interfaces. These components and interfaces might serve different aspects of a computational research cycle and can be combined into pipelines suiting the needs of the user. Components thus include, but are not limited to:

Components for gathering data, for instance, via databases or via the API of social media sites.

Components for analyzing data, for instance, for natural language processing, network analysis, or dimensionality reduction.

Components for visualizing data, such as tools for visually plotting data or insights from analyses.

Interfaces that allow the use of Penelope components without extensive programming, for instance, visual programming tools. The infrastructure thus enables a high degree of end-user development, allowing users to configure tools adequate to their skills and research intentions.

Interfaces and observatories that provide insight into particular topics, for instance, the climate change opinion observatory discussed further in this article.

Following a design philosophy that is becoming the standard in web development for data-intensive platforms, Penelope uses a service-oriented architecture of self-contained microservices. Interoperability between these microservices is ensured via a common communication protocol (RESTful services over HTTP), a standard data format (JSON), and public API specifications (see Figure 1). As such, the Penelope platform and services are implemented using well-established techniques that are supported by the vast majority of programming languages and tools. This approach fosters the creation of new tools and also allows for the integration of already existing components into the Penelope platform. Penelope therefore becomes a true community effort, as developers, scientists, (data) journalists, and other aforementioned users can all contribute and make use of its tools.

Figure 1.

Diagram of the Penelope infrastructure. Components for data collection, analysis, and visualization are implemented as RESTful microservices, which are exposed through their web API. The components can be used by (1) other components, for instance in the case of the semantic frame extractor, which calls the dependency parser; (2) dedicated interfaces, which are typically web pages that provide a graphical way to carry out analyses, for example, in the form of an opinion observatory or a visual programming tool; and (3) developers, who can call them directly from their computer programs.
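To make the third usage route in Figure 1 more tangible, the following is a minimal sketch of how a developer might call a RESTful microservice that exchanges JSON over HTTP, using only the Python standard library. The base URL, the endpoint name "tokenize", and the "text" field are hypothetical placeholders chosen for illustration; they are not taken from the Penelope API specifications, which should be consulted for the actual paths and payloads.

```python
import json
from urllib import request

# Hypothetical base URL (assumption); not an actual Penelope address.
BASE_URL = "https://example.org/penelope"


def build_json_request(endpoint: str, payload: dict) -> request.Request:
    """Build an HTTP POST request that carries a JSON-encoded body."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        url=f"{BASE_URL}/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


def call_service(endpoint: str, payload: dict, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response into a dictionary."""
    req = build_json_request(endpoint, payload)
    with request.urlopen(req, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building a request does not touch the network; only call_service does.
# "tokenize" and "text" are hypothetical names (assumption).
req = build_json_request("tokenize", {"text": "Floods are caused by extreme weather."})
```

Because every component would share the same protocol and data format, chaining two services amounts to passing the dictionary returned by one call as the payload of the next.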

Penelope’s decentralized and collaborative approach meets some key requirements for doing computational social sciences and humanities research. First, the open data formats and communication specifications are geared toward an increased FAIRness (Findability, Accessibility, Interoperability, and Reusability) of datasets and services (Wilkinson et al., 2016). This also opens up the possibility for the Penelope infrastructure to be integrated with other infrastructure projects that support the study of cultural and societal data, such as DARIAH and CLARIN (CLARIN, 2019; DARIAH-EU, 2019). Second, the flexibility Penelope offers in terms of creating custom pipelines and implementing graphical interfaces means that the platform can benefit diverse users and projects. The modular approach, for instance, allows developers to make resources available as soon as they are ready, without having to wait until an entire centralized infrastructure is completed. Similarly, Penelope’s modularity allows for a diversification of methods and perspectives. Rather than offering one-size-fits-all solutions, the platform allows users to combine components in ways that are required to deal with the often unstructured nature and ambiguity of web data (see Rogers, 2013, p. 205), as well as to attune the results to various qualitative and quantitative research paradigms.

It should be noted that “openness” in this technological sense does not always imply openness from a user perspective. Requirements in terms of understanding at least the basics of pipeline construction in programming languages such as R and Python, calling APIs, and being familiar with digital research methods might realistically restrict users to computational social scientists, digital humanists, and related groups. In the iteration of the infrastructure discussed in this article, technical barriers to entry are lowered through a series of user interfaces. This to an extent allows tools to be consulted both by technical specialists and researchers or citizens less versed in programming. There are, however, some limiting factors to user interfaces that should be acknowledged, notably that graphical interfaces might diminish control over options and attributes that might be more easily adjusted in programming environments. Similarly, user interfaces cannot be a stand-in for a thorough explanation of methods and the algorithms behind them. One approach for avoiding this type of black-boxing currently implemented in the project is splitting up larger pipelines into smaller components, as exemplified by the access to the spaCy natural language processing tools (see Penelope, 2019). In this suite of tools, each step in the NLP pipeline (e.g., tokenization, lemmatization, noun chunking, part-of-speech tagging) is accessed through a separate API endpoint, raising user awareness about the functionalities and roles of each of the individual components.

While improving users’ engagement can be approached in terms of education, software, and user interface design, there is much less of a clear development pathway when it comes to controlling the intentions and motivations behind the use of these observatories. Indeed, machine-guided and human-incentivized analyses are susceptible to various forms of misuse, such as exploiting weak points in debates and distributing misinformation. This raises important questions and concerns about how to prevent or mitigate this risk of abuse. Addressing all of these threats and approaches in detail is beyond the scope of this article as well as beyond the current capacities of the infrastructure under discussion (for a more thorough analysis, see Rogers, 2018). Yet some key aspects of this matter nonetheless warrant further elaboration.

One ethical aspect of digital media monitoring that is foregrounded by the construction of observatories concerns responsibility and control over analyses and underlying algorithms. The Penelope infrastructure follows a distributed model, where components can be sourced from different stakeholders. This decentralized approach entails a distribution of responsibilities, which in turn stresses the importance of community building and management. On the level of pipelines and observatories, a measure of control can be introduced by balancing modularity with teleology and narratives. The climate change opinion observatory that will be introduced and discussed in the following section is for instance conceptualized as a pipeline that guides the user step by step through the cycle of data selection, exploratory analyses, and in-depth analyses aimed specifically at mapping opinions and beliefs on the basis of expressions of causation.

Another area of ethical debate concerns not so much the observatory’s analytical components (analyses and algorithms), but rather the nature, quality, and sampling of the data that are used as inputs for observation. In the presently discussed iteration of the infrastructure, data input can be sourced from social media and other platforms, gathered or created by researchers, or accessed through open APIs. One might, for instance, use a coding environment such as a Jupyter notebook to call text analytical tools from an API for the analysis of locally stored textual data (Kluyver et al., 2016).

On a general level, matters and principles of data creation, acquisition, and data use and re-use are governed by a series of legal and ethical frameworks, as well as standards for sound scholarship. This includes data protection legislation and copyright and IP laws, control mechanisms installed by the academic community (e.g., open data policies and peer review), and institutional data management regulations and guidelines. Operating on the level of data input, these measures increase the ultimate validity and reliability of analyses. In practice, however, numerous factors might still compromise the integrity of data. Data might for instance be manipulated by untruthful social media users, or misrepresented through the algorithmic bias on the social media platforms themselves. Furthermore, changing privacy regulations might facilitate the deletion of entries from media platforms, which might impose a distorting factor that will need to be reckoned with in increasing measure. Overall, it has to be acknowledged that media observatories are tied to a number of not always equally visible dependencies at data inflow. It should also be added that this can go both ways, as observatories and the digital methods they encapsulate might provide insight into precisely those mechanisms that distort social media data and information (see the aforementioned analysis of Wikipedia editing practices and other examples in Rogers, 2013). Illustrating this principle, the current Penelope component ecology contains a multidimensional outlier explorer that can be used to detect anomalies in the data (see Penelope, 2019).

Apart from the project’s emerging ecology of tools, the observatory construction efforts discussed in this article can be situated among other international initiatives that support media monitoring for a range of research and application purposes. Examples include the Digital Methods Initiative’s (DMI) Twitter Capture and Analysis Toolset (TCAT) (Bruns et al., 2014; Digital Methods Initiative, 2014), 4CAT (Peeters & Hagen, 2018), and the Institut des Systèmes Complexes de PARIS IDF’s (ISC-PIF) Politoscope (Chavalarias et al., 2019; Gaumont et al., 2018) and Climate Tweetoscope (Chavalarias & Panahai, 2018).

In the context of the ODYCCEUS project, Penelope observatories and components for data analysis are constructed in support of five case studies (for an overview of associated publications, see ODYCCEUS, 2019b):

A historical case study on French anti-Semitism evaluates how advanced digital text analysis can be applied to study the long-term dynamics of racist political and social movements. Conducted at the Università Ca’ Foscari Venezia, this case uses historical data mined from the Gallica database of the Bibliothèque Nationale de France.

A study conducted at the Université Paris Diderot analyses the dynamics of geopolitical conflicts such as the 2008 Georgian border conflict at different spatial and temporal scales. This is achieved by mapping the conflictual definitions of political borders in digitized daily newspapers.

A case study analyzing border conflicts in the context of the Mediterranean migratory crisis. Conducted at the Université Paris Diderot, this study extracts representations of conflict from broadsheet news media to map geographical divergences among participants in the debate.

A case study exploring spatial representations of political opinions (political spaces) conducted at the Max Planck Institute for Mathematics in the Sciences.

A case on opinion dynamics in the climate change debate that is being explored at the Artificial Intelligence Lab of the Vrije Universiteit Brussel.

The climate change opinion observatory associated with the latter case study will be the focus of the remainder of this article. By means of a case example, the technological innovations required to perform precision analysis of textual data as well as the applications of this observatory will be explored.

A Climate Change Opinion Observatory

Debates concerning climate change involve many opinions and voices across online (social) media and communications channels. Because of their complexity, online representations of and discussions concerning climate change can be approached from a range of different perspectives, including scientific, economic, political, and social viewpoints. As will be illustrated on the basis of the Penelope climate change opinion observatory, computational approaches may help arrest this flow of information and map the opinion dynamics that govern the climate change debate on online (social) media. In particular, it will be shown how sophisticated tools for text analysis can be deployed to track semantic frames in online media and how these can form the basis for further digital methods pipelines.

As described in Pearce et al. (2019), the state of the art concerning the study of figurations of climate change on social media is characterized by four gaps: a bias toward Twitter data, a focus on quantitative over qualitative studies, a preference for textual information (excluding graphs or other visual information), and a focus on science communication rather than public imaginations of climate change’s role in society (Pearce et al., 2019, p. 1). The climate change opinion observatory aims to bridge a number of these gaps by extending its functionality beyond Twitter data (notably to Reddit and news articles on The Guardian with their associated comment sections) and by offering a text analysis toolbox that allows for more fine-grained discursive analysis of the imaginaries surrounding climate change causes and effects. This toolbox builds on technological innovations for text analysis which will first be discussed.

Technological Innovations

In the most general terms, two approaches to the automated analysis of textual data can be discerned. On the one hand, there is a trend to focus on pattern recognition and information retrieval, which mainly operates on the syntactic level of language, and on the other hand, there is the approach related to computational linguistics, which aims to work on the levels of textual contents and meaning. The language analysis toolbox of the climate change opinion observatory leans toward the latter paradigm. Grounded in the VUB AI Lab’s work on knowledge-based language technologies, the technology at the core of this climate change opinion observatory is Fluid Construction Grammar (FCG) (Steels, 2011, 2017). FCG is a computational platform that allows the implementation of language technologies based on the linguistic concepts of constructions and semantic frames. In the fields of artificial intelligence (AI) and linguistics, semantic frames are defined as basic data structures consisting of a number of frame elements. The act of cooking might for instance be represented by the frame Apply heat, consisting of frame elements such as Cook (the person doing the cooking), Food (food to be cooked), and Container (something to hold the food while cooking) (FrameNet, 2019). Words evoking the Apply heat frame, such as “fry,” “bake,” or “boil,” are called lexical units. The FrameNet database (Baker et al., 1998) contains lists of semantic frames and their associated lexical units in English. The frame that is central to the climate change opinion observatory is the Causation frame, as illustrated in Figure 2.

Figure 2.

Illustration of the FrameNet Causation frame.
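The notion of a semantic frame as a data structure with frame elements and evoking lexical units can be sketched in a few lines of Python. The class and method names below are illustrative assumptions and do not reflect how FCG or FrameNet represent frames internally; the example values are those of the Apply heat frame discussed above.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class SemanticFrame:
    """A frame type: a name, its frame elements, and its lexical units."""
    name: str
    elements: Tuple[str, ...]
    lexical_units: FrozenSet[str]

    def evoked_by(self, word: str) -> bool:
        """Return True if the word is one of the frame's lexical units."""
        return word.lower() in self.lexical_units


@dataclass
class FrameInstance:
    """One occurrence of a frame in a text, with filled frame elements."""
    frame: SemanticFrame
    fillers: Dict[str, str] = field(default_factory=dict)

    def fill(self, element: str, text: str) -> None:
        """Assign a text span to a frame element defined by the frame."""
        if element not in self.frame.elements:
            raise KeyError(f"{element!r} is not an element of {self.frame.name}")
        self.fillers[element] = text


# Hypothetical variable names; frame contents follow the cooking example.
APPLY_HEAT = SemanticFrame(
    name="Apply heat",
    elements=("Cook", "Food", "Container"),
    lexical_units=frozenset({"fry", "bake", "boil"}),
)

instance = FrameInstance(APPLY_HEAT)
instance.fill("Cook", "the chef")
instance.fill("Food", "potatoes")
```

Separating the frame type from its instances mirrors the distinction between a frame listed in a lexical database and a concrete occurrence of that frame found in a text.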

The climate change opinion observatory as presented in this article implements a tool that is able to extract Causation frames from large bodies of texts, based on the English lexical units “cause,” “due to,” “because (of),” “give rise to,” “lead to,” and “result in.” For each frame instance that is found, the tool will extract two frame elements: Cause and Effect. There are two main motivations for extracting this Causation frame. First, the opportunities of using social media communications (notably on Twitter) as a means of mapping the “invisible causes” and “distant impacts” of climate change have already attracted significant scholarly attention (see, for instance, Kirilenko et al., 2015; Moser, 2010; Pearce et al., 2014, 2019). In this regard, Veltri and Atanasova (2017) conclude that the semantic frame of causation is indeed of particular interest when it comes to the study of Twitter data related to climate change and that relevant insights can be obtained through the use of available tooling for the linguistic analysis of texts (T-Lab and LIWC2007). However, the authors also remark on the limitations of this tooling, notably that

[w]hile the identification of themes and subthemes can be reliably obtained by automatic procedures and applied to a multi-language corpus of a large size, higher order structures of meaning such as narratives and arguments/claims are much harder to automatically extract. (Veltri & Atanasova, 2017, p. 735)

Ongoing research thus offers grounds for comparison in terms of the study of media other than Twitter as well as the performance of tools and methods for the automated analysis of texts.

Correspondingly, a second reason for focusing on the causation frame is that it goes to the epistemological core of the digital methods paradigm, notably its ambition to move beyond a traditional “naturalism” in which causation would be interpreted in the strictly literal or scientific sense of the word (Törnberg & Uitermark, 2018, p. 5). It is, for instance, not the purpose of the climate change opinion observatory to capture which grounds are offered for climate change in order to contribute to the scientific study of the climate as such. Rather, the challenge that the observatory faces is to map the many social imaginaries that figure in the public discourse on climate change causes and effects (see, for instance, Levy & Spicer, 2013; Pearce et al., 2019). Capturing these nuances is facilitated by the observatory’s capacity to combine large datasets with tools for precision language processing, in this case semantic frame extraction.
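To give a rough impression of what extracting a Cause and an Effect from a sentence involves, the sketch below uses plain regular expressions over the lexical units listed earlier. This is a deliberately naive stand-in and not the FCG-based extractor: it has no notion of syntax, handles one clause per sentence, and will miss or mis-segment many real examples. The function name, the pattern set, and the inflected verb forms added to the patterns are assumptions made for illustration.

```python
import re
from typing import Dict, Optional

# Effect-first constructions: "<effect> (is) due to / because (of) / caused by <cause>".
# Tried first so that the passive "caused by" is not read as an active "caused".
EFFECT_FIRST = re.compile(
    r"^(?P<effect>.+?)\s+(?:(?:is|are|was|were)\s+)?"
    r"(?:due to|because(?: of)?|caused by)\s+(?P<cause>.+)$",
    re.IGNORECASE,
)

# Cause-first constructions: "<cause> causes / leads to / results in <effect>".
CAUSE_FIRST = re.compile(
    r"^(?P<cause>.+?)\s+"
    r"(?:causes?|caused|gives? rise to|gave rise to|leads? to|led to|"
    r"results? in|resulted in)\s+(?P<effect>.+)$",
    re.IGNORECASE,
)


def extract_causation(sentence: str) -> Optional[Dict[str, str]]:
    """Return the Cause and Effect of a simple sentence, or None if no match."""
    text = sentence.strip().rstrip(".!?")
    for pattern in (EFFECT_FIRST, CAUSE_FIRST):
        match = pattern.match(text)
        if match:
            return {
                "Cause": match.group("cause").strip(),
                "Effect": match.group("effect").strip(),
            }
    return None


passive = extract_causation("Floods are caused by extreme weather.")
active = extract_causation("Climate change leads to extreme weather")
```

The two pattern groups differ only in the order of the frame elements, which is precisely the kind of information a grammar-based extractor derives from constructions rather than from surface strings.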

The current prototype version of the Penelope climate change opinion observatory allows users to perform analyses on three resources: Tweets, posts on the social media website Reddit, and a corpus of news articles tagged for “climate change” and associated comments from the website of The Guardian. The latter provided the data for the sample analysis presented below. For students of opinion dynamics concerning climate change, The Guardian website is an interesting data resource. Previous research has indeed shown this news website to be a frequently cited source on other social media platforms, which allows for cross-spherical comparisons of comments (Pearce et al., 2019, p. 5).

Example Analyses

This section presents a series of preliminary examples of how the climate change opinion observatory prototype might support the analysis of opinion landscapes based on textual web data. In particular, it will be demonstrated how the observatory, in combination with other Penelope components, can be flexibly deployed to accommodate a range of digital methods from the field of computational social science (Digital Methods Initiative, 2019; Rogers, 2013). The platform derives much of this flexibility from the fact that it can combine precise semantic information with other (meta)data harvested from online (social) media. Recalling the metaphor of Törnberg and Uitermark (2018) cited earlier, it will thus be shown how the observatory can help users make a series of complementary “cuts” into the fabric of societal phenomena.

Causation Tracking

A key objective of the climate change opinion observatory is to use web media data to provide a perspective on the diversity and potential alignment or misalignment of opinions. By using the frame extraction method described earlier, the prototype observatory’s semantic frame extractor offers insight into how opinions concerning the causes and associated effects of climate change are interlinked. For each cause queried by the user, the causation tracker returns the 10 most frequent associated effects (if applicable). This cycle is repeated until no more associations are found. For one thing, this allows users to see the diverse (and potentially contradictory) nature of the causes or effects that are assigned to a certain phenomenon, such as “global warming.” For another, the tracker represents these opinions as a network, thus revealing the underlying patterns of the opinions expressed in the corpus and creating the basis of a “causal map” (see Axelrod, 1976). In Figure 3, for instance, the 10 most frequent associated effects of “climate change” are retrieved from the corpus. The effect “extreme weather” is expanded further, revealing as associated effects “floods” and “losses.” The effect “floods” is in turn associated with “anguish.” As such, it can be shown how the climate change opinion landscape as seen through the lens of data from The Guardian news website transgresses boundaries between the social and physical realms, containing associations between natural phenomena (“extreme weather,” “floods”) and human emotion (“anguish”). As also seen in Figure 3, the same holds for the path that leads from “extreme weather,” over “rising seas” to the socio-political effect of “wars,” as well as for the path linking “climate change” to “greater crimes.”

Figure 3.

A causation network initiated from the search term “climate change.” The arrows point from cause to effect. In this way, “climate change” causes “extreme weather,” which is said to cause “floods,” which in turn lead to “anguish.” The tool also allows you to search in the opposite direction, thus starting from an effect and querying its main causes.
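The expansion cycle of the causation tracker can be sketched as a breadth-first traversal over extracted (cause, effect) pairs. The sketch below is an assumption about how such a tracker could be organized, not the observatory's implementation: the function names, the cycle guard (`seen`), and the sample pairs and their frequencies are hypothetical, with only the terms borrowed from Figure 3.

```python
from collections import Counter, defaultdict, deque

def build_index(frames):
    """Map each cause to a Counter of the effects it is linked to."""
    index = defaultdict(Counter)
    for cause, effect in frames:
        index[cause.lower()][effect.lower()] += 1
    return index

def track(index, start, top_n=10):
    """Expand from `start` until no more associations are found.

    Returns a list of (cause, effect, count) edges. The `seen` set prevents
    endless expansion when the corpus contains circular causal claims.
    """
    start = start.lower()
    edges, seen, queue = [], {start}, deque([start])
    while queue:
        cause = queue.popleft()
        for effect, count in index.get(cause, Counter()).most_common(top_n):
            edges.append((cause, effect, count))
            if effect not in seen:
                seen.add(effect)
                queue.append(effect)
    return edges

# Hypothetical frame instances; the counts are invented for illustration.
sample = [
    ("climate change", "extreme weather"),
    ("climate change", "extreme weather"),
    ("extreme weather", "floods"),
    ("floods", "anguish"),
]
for edge in track(build_index(sample), "climate change"):
    print(edge)
```

Searching in the opposite direction, from an effect to its main causes, only requires building the index from the swapped pairs, that is, `build_index((e, c) for c, e in sample)`.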

Provided that the datasets that are input into the system are harvested in a neutral fashion, the causation tracker can be used to reveal opinions that exceed the filter bubble that a user of online media might find herself encapsulated in. Exposing users to a more diversified opinion landscape could thus be considered a pathway to combat polarization and the spread of misinformation or disinformation (see, for instance, Sunstein, 2018).

Relational Discourse Analysis

The opinion observatory not only allows the tracing of patterns on the level of the contents of social media posts but also combines this information with data on the social structures through which these opinions come about. In its current form, the platform’s components allow users to visualize causation-related comments through their reply structure (see Figure 4). This combination of the semantic analysis of reader comments with the social network structure that underlies them aligns with the method of relational discourse analysis (see, for instance, Uitermark et al., 2016). By enabling the combination of semantic and social information, the opinion observatory facilitates the study of the discursive dimension of opinions. One might for instance use the platform to map patterns of readers challenging or backing each other’s opinions on climate change causes (see, for instance, Stede et al., 2018). As such, the platform can render more transparent the transformations opinions undergo in the course of a discussion, as well as the role of different actors (commentators) in this process. A promising pathway in this regard is the integration of the comments’ time-stamp data to research how opinion landscapes and dynamics change over time.

Figure 4.

Relational discourse analysis: combining the reply structure of newspaper comments (a) and semantic frame extraction (b). These threaded comments evaluate the proportion of climate change due to human activity.
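As a rough illustration of how reply structure and semantic information might be joined, the sketch below turns a list of threaded comments into reply edges and annotates each edge with whether the reply and its parent contain a causal marker. The `Comment` fields, the marker list, and the substring test are assumptions; in the observatory the semantic side would come from frame extraction rather than from string matching.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Comment:
    comment_id: str
    parent_id: Optional[str]  # None for top-level comments
    author: str
    text: str

# Crude stand-in for frame extraction: substring markers (an assumption).
CAUSAL_MARKERS = ("cause", "due to", "because", "rise to",
                  "lead to", "leads to", "led to", "result in", "results in")

def has_causation(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in CAUSAL_MARKERS)

def reply_edges(comments):
    """Return one edge per reply, linking its author to the parent's author."""
    by_id = {c.comment_id: c for c in comments}
    edges = []
    for c in comments:
        parent = by_id.get(c.parent_id) if c.parent_id else None
        if parent is not None:
            edges.append({
                "from": c.author,
                "to": parent.author,
                "causal": has_causation(c.text),
                "parent_causal": has_causation(parent.text),
            })
    return edges

# Hypothetical thread, invented for illustration.
thread = [
    Comment("1", None, "reader_a", "Warming is due to human activity."),
    Comment("2", "1", "reader_b", "No, the sun causes most of it."),
    Comment("3", "1", "reader_c", "I agree."),
]
for edge in reply_edges(thread):
    print(edge)
```

Edges where both `causal` and `parent_causal` are true are candidates for closer reading, since they may indicate one reader challenging or backing another reader's causal claim.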

Changing Actor Composition

A further metric that bridges the realms of semantics and social structure is changing actor composition, that is, the evolution of the actors mentioned in a debate as a measure for the maturity of said debate (Rogers, 2013, p. 86). As shown in Figure 5, the observatory’s NER-tagger (named entity recognition, the task of automatically detecting textual references to persons, organizations, geographical locations, etc.) reveals the persons and organizations that figure in the debate.

Figure 5.

The actor composition module provides insight into the organizations and persons that are referenced or criticized across the comments network (in this case a newspaper [the Telegraph], a political party [UKIP], and a politician [Tim Yeo]).

By integrating time stamp data into this analysis, it becomes possible to study changes in this social topology over time. Further analytical methods could then be developed to associate different stages in this process with a degree of maturity. It should be noted that the NER-tagger can also be deployed to extract statistics and other figures from comments, which could facilitate the process of fact-checking the contents of the debate.
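A minimal sketch of this idea, assuming that entity mentions have already been produced by a NER tagger and paired with comment time stamps: mentions are grouped per month, and for each month the actors that appear for the first time are reported. The monthly granularity, the function names, and the sample time stamps are assumptions; the entity names are those shown in Figure 5.

```python
from collections import defaultdict
from datetime import datetime

def actors_by_period(mentions):
    """Group (ISO time stamp, entity) pairs into a sorted {YYYY-MM: set of actors} map."""
    periods = defaultdict(set)
    for timestamp, entity in mentions:
        key = datetime.fromisoformat(timestamp).strftime("%Y-%m")
        periods[key].add(entity)
    return dict(sorted(periods.items()))

def newcomers(periods):
    """For each period, return the actors not mentioned in any earlier period."""
    seen, result = set(), {}
    for period, actors in periods.items():
        result[period] = actors - seen
        seen |= actors
    return result

# Hypothetical tagger output; the time stamps are invented for illustration.
mentions = [
    ("2019-01-15T10:00:00", "UKIP"),
    ("2019-01-20T09:30:00", "the Telegraph"),
    ("2019-02-02T12:00:00", "UKIP"),
    ("2019-02-03T08:00:00", "Tim Yeo"),
]
periods = actors_by_period(mentions)
print(newcomers(periods))
```

A debate in which few new actors enter over successive periods could then be read as one indication of a stabilizing actor composition, although any threshold for "maturity" would still have to be argued for.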

Intensity of Cultural Preference or Political Expression

A semantic approach to opinion dynamics allows for the measurement of certain qualitative aspects of opinions, for instance, the intensity of a cultural or political expression. As noted by Rogers (2013, p. 121), such metadata could be useful to study when the actual user names or opinions cannot be opened up for publication (for privacy reasons, for instance). The intensity of opinions expressing causality could be measured by means of the modality of the verb: “climate change causes anxiety,” for instance, is a stronger expression than “climate change might cause anxiety.”
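One way such a modality-based measure might be operationalized is sketched below. The modal verbs and their numeric strengths are arbitrary assumptions chosen only to produce an ordering; they are not an established scale, and the function name `intensity` is hypothetical.

```python
import re

# Assumed strengths per modal verb; an unmodalized assertion scores 1.0.
MODAL_STRENGTH = {
    "might": 0.3,
    "may": 0.4,
    "could": 0.4,
    "can": 0.6,
    "should": 0.7,
    "will": 0.9,
    "must": 0.9,
}

def intensity(sentence: str) -> float:
    """Score a causal statement by its weakest modal verb (1.0 if none is present)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [MODAL_STRENGTH[w] for w in words if w in MODAL_STRENGTH]
    return min(scores) if scores else 1.0

print(intensity("climate change causes anxiety"))
print(intensity("climate change might cause anxiety"))
```

Because the scores are derived from wording alone, they can be aggregated and published without exposing user names or full comment texts.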

Spherical Analysis

By opening up multiple corpora for analysis, the observatory facilitates comparative analysis of data harvested from different media spheres (Rogers, 2013, p. 211). The platform could, for instance, be used to compare opinions in reader comments from The Guardian’s website with tweets (retweets or replies to tweets) about the same article. This comparative perspective increases the scope and diversity of opinions that can be studied. Furthermore, this type of comparative analysis could foster interactions between the use of digital methods and work conducted in the field of media studies.
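A simple comparative measure for two media spheres, offered here as an assumption rather than as a feature of the prototype, is the overlap between the effects that each corpus associates with the same cause. The sketch below computes the shared and sphere-specific effects together with their Jaccard similarity; the function name and the sample effect lists are hypothetical.

```python
def compare_spheres(effects_a, effects_b):
    """Compare the effects attributed to one cause in two corpora."""
    a, b = set(effects_a), set(effects_b)
    union = a | b
    return {
        "shared": a & b,
        "only_a": a - b,
        "only_b": b - a,
        # Jaccard similarity: size of the intersection over size of the union.
        "jaccard": len(a & b) / len(union) if union else 0.0,
    }

# Hypothetical effects of "climate change" in news comments versus tweets.
news_comments = ["floods", "wars", "anguish"]
tweets = ["floods", "droughts"]
print(compare_spheres(news_comments, tweets))
```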

Conclusion

Along with the introduction of the Web and personal computing, recent decades have seen a steep increase in the quantity and diversity of digital data and information (see, for instance, Gitelman, 2013). As novel traces of cultural and societal phenomena, these digital resources are, among many things, transforming scholarship in the social sciences and humanities, offering new perspectives to investigative journalists and policy makers, and empowering citizens to hold their governments accountable. Evidence of this transformational potential is wide-ranging: it manifests itself in substantial bodies of scientific literature, a rising interest in collaborative digital research infrastructures, the availability of curated data sets and collections, and the development of novel research practices, techniques, and methods (see, for instance, Borgman, 2010, 2015). This article has demonstrated that one area where these developments culminate is the study of cultural conflicts through web data. As online (social) media become prominent sites for clashes of opinions on nationalism, migration, and climate change, social media posts and news website comments sections offer a unique window into the opinion dynamics that govern society’s most pressing issues. Data from online media platforms indeed might hold the answer to such questions as which opinions are raised in the debate, which agents (human or other) are actively involved in shaping, propagating, or suppressing those opinions, and how opinions transform over time. Eventually, a deeper understanding of these dynamics might help stakeholders (including journalists, policy makers, citizens, and researchers) to conscientiously mediate between the online debate and its potential off-line manifestations.

As discussed in this article, insight into social media data can be obtained through observatories that bridge the gaps between humanistic or sociological inquiry and computational methods. The exercise of creating such platforms and technologies for the study of culture and society via online media, however, opens up a window on a series of methodological opportunities and associated challenges. As has been shown on the basis of a series of sample analyses, precision-language processing tools can offer insight into the diversity and dynamics of opinions on online media, thus moving into the direction of automatically capturing emergent properties of social behavior. However, some ethical and epistemological aspects of these analyses require further discussion. For one thing, it should be noted that by automatically capturing and analyzing opinions we also shape them. Put differently, an opinion observatory might also inform debate facilitation. As argued by Törnberg and Uitermark, there is indeed no non-value-laden “view from nowhere” from which to research or observe societal phenomena like opinions: “when we research the social world, we also act on it, change it as the knowledge we bring becomes part of what we study” (Törnberg & Uitermark, 2018, pp. 8–9). As the output of opinion observatories again enters the public space, there is a possibility that those outputs adjust or reinforce opinions held, for better or for worse (Rogers, 2018). Tools for the analysis of opinion dynamics thus face the same type of ethical problems that can emerge in any scenario where humans and technology mingle within the public space, such as the case of artificial bots interfering in online political debates (Veale & Cook, 2018, p. 46).

Further research should be conducted to mitigate these risks, for instance, through the development of reporting tools that render the opinion mining process more transparent by providing an overview of which pipeline components were combined and what the output of each component was. The research process can thus be broken down into discrete, reproducible steps. This approach might also reinforce the use of the aforementioned observatory not just as an instrument of study, but also as a tool to combat fake news and misinformation. Questions of transparency and explainability become particularly pressing when AI methods are incorporated into analytical pipelines. In broad terms, these methods can be situated on a spectrum between symbolic approaches (centered around formal logic and language) and numerical approaches (based on statistical methods). The semantic frame extraction method that was integrated in the observatory under discussion leans toward the symbolic end of the spectrum, as it is grounded in language, and therefore allows for a high degree of explainability. However, the same frame extraction task can also be achieved using numerical approaches, at the cost of transparency. Pipeline and observatory developers thus need to balance this aspect of transparency with other costs and benefits when selecting a method.
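A reporting tool of this kind could, in its simplest form, wrap each pipeline component so that its name and output are recorded as the data pass through. The following sketch assumes that components are plain functions applied in sequence; the function `run_pipeline`, the component names, and the example steps are hypothetical and do not describe the Penelope infrastructure.

```python
def run_pipeline(data, components):
    """Apply named components in order and record the output of each step.

    `components` is a list of (name, function) pairs. Returns the final
    result together with a report listing every component and its output.
    """
    report = []
    for name, func in components:
        data = func(data)
        report.append({"component": name, "output": data})
    return data, report

# Hypothetical three-step pipeline over a single comment.
def split_sentences(text):
    return [s.strip() for s in text.split(".") if s.strip()]

def keep_causal(sentences):
    return [s for s in sentences if "cause" in s or "due to" in s]

steps = [
    ("lowercase", str.lower),
    ("split_sentences", split_sentences),
    ("keep_causal", keep_causal),
]
result, report = run_pipeline("Floods are due to warming. I agree.", steps)
for entry in report:
    print(entry["component"], "->", entry["output"])
```

Because every intermediate output is kept, a reader of the report can check at which step a given comment was transformed or discarded.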

A second, closely related avenue for future inquiry then concerns the epistemological aspects of opinion observatories. In accordance with technological advances, the boundaries of the explanatory and interpretative power that can be assigned to platforms for automated text mining should be continuously evaluated. Indeed, while the prototype discussed in this article supports knowledge creation and intelligence gathering in the social sciences and humanities, it does not offer any interpretations or explanations of the data it represents. An avenue for further exploration would thus be to reconsider the extent to which digital technologies could actually be considered interpretative machines, and what could be done to further enhance this interpretative dimension (for a reflection on these questions, see, for instance, Romele et al., 2018). More fine-grained analyses of opinions could for instance rely on recent advances in the field of argumentation mining, which has produced frameworks and methods for studying the contextual elements inherent to opinion dynamics, such as narrative and argumentative structures (see, for instance, Stede et al., 2018). Future methodological gains can be expected at the intersection of the relational discourse pipelines presented in this article and insights from the field of argumentation mining.

Finally, it should not be forgotten that these ethical and epistemological concerns form the backdrop for a series of technical challenges still facing computational approaches to research in the social sciences and humanities (Rogers, 2013, p. 206), including the automatic analysis of textual data. Working with texts of any type is inherently difficult, given the ambiguous nature of (written) language as such (Farzindar & Inkpen, 2017; Ingersoll et al., 2013, pp. 8–10). Furthermore, as online (social) platforms and media are continuously evolving, the formats and structures of the data that are mined are far from stable, which is something analytical pipelines need to be able to deal with (Rogers, 2013, p. 206).

Footnotes

Authors’ Note

Luc Steels is now affiliated with the Catalan Institute for Research and Advanced Studies (ICREA), Spain.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 732942.

ORCID iD

Tom Willaert

Author Biographies

Tom Willaert (PhD, KU Leuven) is a postdoctoral researcher at the Artificial Intelligence Lab at the Vrije Universiteit Brussel. His research interests are situated at the intersections of media, digital methods, (dis)information, and democracy.

Paul Van Eecke (PhD, Vrije Universiteit Brussel) is a researcher in Artificial Intelligence at the Vrije Universiteit Brussel. His research interests include emergent communication and language, and computational construction grammar.

Katrien Beuls (PhD, Vrije Universiteit Brussel) is a researcher in Artificial Intelligence at the Vrije Universiteit Brussel. Her research interests include emergent communication and language, and computational construction grammar.

Luc Steels (PhD, University of Antwerp) is a research professor at the Catalan Institute for Research and Advanced Studies (ICREA) embedded in the Institute for Evolutionary Biology (UPF/CSIC). He was the founding director of the Sony Computer Science Laboratory in Paris, and the founding director of the VUB Artificial Intelligence Laboratory as well as chairman of the Computer Science Department at the University of Brussels. He has published hundreds of papers in high-profile journals, with an h-index of 71 (Google Scholar), as well as dozens of books on various aspects of AI.

References

Axelrod (1976). Structure of decision. Princeton University Press. https://books.google.be/books?id=JBUxswEACAAJ

Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998, August 10–14). The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics (Vol. 1, pp. 86–90). Stroudsburg, PA: Association for Computational Linguistics.

Ball (2012). Why society is a complex matter: Meeting twenty-first century challenges with a new kind of science. Springer Science & Business Media.

Banisch & Olbrich (2018). An argument communication model of polarization and ideological alignment. arXiv preprint arXiv:1809.06134.

Banisch & Olbrich (2019). Opinion polarization by learning from social feedback. The Journal of Mathematical Sociology, 43(2), 76–103.

Borgman, C. L. (2010). Scholarship in the digital age: Information, infrastructure, and the Internet. MIT Press.

Borgman, C. L. (2015). Big data, little data, no data: Scholarship in the networked world. MIT Press.

Bruns, Weller, Borra, & Rieder (2014). Programmed method: Developing a toolset for capturing and analyzing tweets. Aslib Journal of Information Management, 66, 262–278.

Chavalarias & Panahi (2018). Climate tweetoscope. http://tweetoscope.iscpif.fr/

Chavalarias, Panahi, & Gaumont (2019). Politoscope. https://politoscope.org/

CLARIN. (2019). CLARIN in a nutshell. https://www.clarin.eu/content/clarin-in-a-nutshell

DARIAH-EU. (2019). Dariah in a nutshell. https://www.dariah.eu/about/dariah-in-nutshell

Digital Methods Initiative. (2014). Twitter capture and analysis toolset (DMI-TCAT). https://wiki.digitalmethods.net/Dmi/ToolDmiTcat

Digital Methods Initiative. (2019). Digital methods course. https://wiki.digitalmethods.net/Digitalmethods/WebHome

Farzindar & Inkpen (2017). Natural language processing for social media. Synthesis Lectures on Human Language Technologies, 10(2), 1–195.

FrameNet. (2019). What is FrameNet. https://framenet.icsi.berkeley.edu/fndrupal/WhatIsFrameNet

Gaumont, Panahi, & Chavalarias (2018). Reconstruction of the socio-semantic dynamics of political activist twitter networks: Method and application to the 2017 French presidential election. PLOS ONE, 13(9), Article e0201879.

Gitelman (2013). “Raw data” is an oxymoron. MIT Press.

Ingersoll, G. S., Morton, T. S., & Farris, A. L. (2013). Taming text: How to find, organize, and manipulate it. Manning Publications.

Kirilenko, A. P., Molodtsova, & Stepchenkova, S. O. (2015). People as sensors: Mass media and local temperature influence climate change discussion on twitter. Global Environmental Change, 30, 92–100.

Kluyver, Ragan-Kelley, Pérez, Granger, Bussonnier, Frederic, Kelley, Hamrick, Grout, Corlay, Ivanov, Avila, Abdalla, & Willing (2016). Jupyter Notebooks: A publishing format for reproducible computational workflows. In Loizides & Schmidt (Eds.), Positioning and power in academic publishing: Players, agents and agendas (pp. 87–90). IOS Press.

Levy, D. L., & Spicer (2013). Contested imaginaries and the cultural political economy of climate change. Organization, 20(5), 659–678.

Michel, J. B., Shen, Y. K., Aiden, A. P., Veres, Gray, M. K., Pickett, J. P., Hoiberg, Clancy, Norvig, Orwant, Pinker, Nowak, M. A., Aiden, E. L., & The Google Books Team. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176–182.

Moretti (2013). Distant reading. Verso Books.

Moser, S. C. (2010). Communicating climate change: History, challenges, process and future directions. Wiley Interdisciplinary Reviews: Climate Change, 1(1), 31–53.

ODYCCEUS. (2019a). ODYCCEUS project home page. https://www.odycceus.eu/project/

ODYCCEUS. (2019b). ODYCCEUS publications. https://www.odycceus.eu/publications/

Pearce, Holmberg, Hellsten, & Nerlich (2014). Climate change on twitter: Topics, communities and conversations about the 2013 IPCC Working Group 1 Report. PLOS ONE, 9(4), Article e94785.

Pearce, Niederer, Özkula, S. M., & Sánchez Querubín (2019). The social media life of climate change: Platforms, publics, and future imaginaries. Wiley Interdisciplinary Reviews: Climate Change, 10(2), Article e569.

Peeters & Hagen (2018). 4CAT: Capture and Analysis Toolkit (Version 0(5)) [Computer software]. https://4cat.oilab.nl/

Penelope. (2019). The Penelope platform. https://penelope.vub.be/

Rogers (2013). Digital methods. MIT Press.

Rogers (2019). Doing digital methods. SAGE.

Rogers (2018). Social media research after the fake news debacle. Partecipazione e Conflitto, 11(2), 557–570.

Romele, Severo, & Furia (2018). Digital hermeneutics. AI & Society: Knowledge, Culture and Communication. https://hal.archives-ouvertes.fr/hal-01824173

Schnapp (2014). How does computer science intersect the humanities? http://serious-science.org/digital-humanities-673

Singer, P. W., & Brooking, E. T. (2018). LikeWar: The weaponization of social media. Eamon Dolan Books.

Stede, Schneider, & Hirst (2018). Argumentation mining (Synthesis Lectures on Human Language Technologies). Morgan & Claypool. https://books.google.be/books?id=Z3WBDwAAQBAJ

Steels (2011). Design patterns in fluid construction grammar. John Benjamins Publishing.

Steels (2017). Basics of fluid construction grammar. Constructions and Frames, 9(2), 178–225.

Sunstein, C. R. (2018). #Republic: Divided democracy in the age of social media. Princeton University Press.

Törnberg (2018). Echo chambers and viral misinformation: Modeling fake news as complex contagion. PLOS ONE, 13(9), Article e0203958.

Törnberg (2018). The limits of computation: A philosophical critique of contemporary Big Data research. Big Data and Society. https://doi.org/10.1177/2053951718811843

Törnberg & Uitermark (2018). Beyond social physics: Manifesto for a new complexity science of the social world. ODYCCEUS Work package 5, deliverable 5.1 Reflections on Europe’s cultural conflicts. An interdisciplinary dialogue on new methods and new data.

Uitermark, Traag, V. A., & Bruggeman (2016). Dissecting discursive contention: A relational analysis of the Dutch debate on minority integration, 1990–2006. Social Networks, 47, 107–115.

Veale & Cook (2018). Twitterbots: Making machines that make meaning. MIT Press.

Veltri, G. A., & Atanasova (2017). Climate change on twitter: Content, media ecology and information sharing behaviour. Public Understanding of Science, 26(6), 721–737.

Watts, D. J. (2013). Computational social science: Exciting progress and future directions. The Bridge on Frontiers of Engineering, 43(4), 5–10.

Wilkinson, M. D., Dumontier, Aalbersberg, I. J., Appleton, Axton, Baak, Blomberg, Boiten, J. W., da Silva Santos, L. B., Bourne, P. E., Bouwman, Brookes, A. J., Clark, Crosas, Dillo, Dumon, Edmunds, Evelo, C. T., Finkers, ... Mons (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, Article 160018.