Abstract

The multilevel model of meme diffusion conceptualizes how mediated messages diffuse over time and space. As a pilot application of the meme diffusion model, we developed the social media analytics and research testbed to monitor Twitter messages and track the diffusion of information in and across different cities and geographic regions. The social media analytics and research testbed is an online geo-targeted search and analytics tool that includes an automatic data processing procedure at the backend and an interactive frontend user interface. It was initially designed to facilitate (1) searching and geo-locating tweet topics and terms in different cities and geographic regions; (2) filtering noise from raw data (such as removing redundant retweets and using machine learning methods to improve precision); (3) analyzing social media data from a spatiotemporal perspective; and (4) visualizing social media data in diagnostic ways (such as weekly and monthly trends, trend maps, top media, top retweets, top mentions, or top hashtags). The testbed provides researchers and domain experts with a tool that can efficiently facilitate the refinement, formalization, and testing of research hypotheses or questions. Three case studies (flu outbreaks, Ebola epidemic, and marijuana legalization) are introduced to illustrate how the predictions of meme diffusion can be examined and to demonstrate the potential and key functions of the social media analytics and research testbed.

Keywords

Big Data, social media, human dynamics, spatiotemporal analysis, data mining

Introduction

Careful mining of social media messages can reflect the trends of human dynamics, such as where seasonal flu is spreading (Aslam et al., 2014; Nagel et al., 2013), when social movement ideology is diffusing (Tsou et al., 2013), and how urban mobility patterns reveal geospatial and social functions (Hawelka et al., 2014; McKenzie et al., 2015). Despite the new and evolving nature of social media, various models and theories are emerging to account for the ways in which message content and exchange dynamics both reflect and drive human processes in realspace (Cheung and Thadani, 2012; Wang and Zhang, 2012). Integrating several traditional theories (e.g., framing theory, narrative theory, diffusion of innovations, information theory, communicative competence theory, evolutionary theory), the multilevel model of meme diffusion (M³D) conceptualizes how mediated messages diffuse over time and space (Spitzberg, 2014). Memes are any replicable unit of cultural transfer of information. Thus, all digital social media messages are potential memes, and their replicability is a key distinction between memes and everyday face-to-face communication. It also makes memes a unique source of information that can be monitored and modeled at varying levels of granularity. The degree to which the diffusion of memes through social media networks can be modeled becomes a major theoretical challenge for businesses, governments, celebrities, and all those seeking to understand the human dynamic of communicative influence (Kamel Boulos et al., 2011; Sasahara et al., 2013; Segev et al., 2015).

The M³D anticipates that certain features of (a) memes or social media messages (e.g., novelty, repetition, etc.); (b) communicators (e.g., source credibility, network centrality, etc.); (c) structural (e.g., network span, homophily, etc.) and subjective network structures (e.g., counter-memes, thresholds, etc.); (d) societal processes (e.g., publicity campaigns, stage of diffusion, etc.); and (e) geo-technical factors (e.g., geospatial proximity, population density, etc.) predict or moderate M³D dynamics. In any given context, such processes may reveal unique diffusion maps or patterns across time and space, varying by rapidity of diffusion, exhaustion and duration of diffusion, and evolution of message variation. To the extent such unique patterns reveal distinct correspondence to realspace activities, they provide an important window into developing surveillance and intervention programs to serve the public interest in diffusing time and space sensitive information (e.g., disease diffusion and/or treatment, drug abuse or diffusion, natural disaster or crisis response). In other contexts, such surveillance may map important applications for organizations concerned with mapping or stimulating institutional reputation or product diffusion (e.g., academic reputation or product reviews).

As a pilot application of implementing the M³D in studying human dynamics and social phenomena, we developed a web-based social media analytics and research testbed (SMART) integrating social media application programming interface (API), machine learning, and geographic information system (GIS) (Figure 1). SMART is designed with the goal of providing researchers with a platform to quickly test hypotheses and to refine research questions. The analytics and statistics presented on SMART may not directly provide answers to the research questions, but the trend visualization and aggregated analytics can suggest possible outlooks and potential spatiotemporal diffusion patterns for the selected topics and keywords in the region(s) of interest. Researchers can utilize the frontend user interface of SMART—the SMART dashboard—to visualize, characterize, and predict trends in these topics in different cities over time. SMART can also provide broader applications beyond academia. Taking the public health and epidemiology domain as an example, local healthcare providers, hospital staff, government officials, first responders, and other stakeholders can access this online web-based dashboard without local desktop installation to keep track of trending information related to specific disease outbreaks. The following features of SMART underscore its significance and innovativeness:

SMART captures and updates the spatial nature of social media messages on a daily basis, evaluates patterns of messages in diverse cities, and geographically tracks patterns of diffusion.

The frontend of SMART, the SMART dashboard, displays the dynamic temporal trends of social media messages (daily, weekly, and monthly) with interactive selection tools.

The backend of SMART is built around geo-locating, data filtering, and machine learning components to remove noise and errors in different subjects. These procedures facilitate more accurate analysis and tracking of topics, such as disease outbreaks and drug abuse.

Users can select a target city to analyze the temporal trends and top messages from individual cities, or use the default view to display the aggregated trend analysis by combining messages from all cities in the monitoring list.

The visualization map of tweet messages displayed in the online interface helps users to understand the differences of social media messages between cities. This provides information for users to track the spatial and temporal evolution of a topic.

Figure 1.

The system design of social media analytics and research testbed (SMART).

The development of SMART initially focuses on five topics (Influenza, Whooping Cough, Wildfire, Drugs, and Aztecs) because they represent a diverse range of patterns of use, targeted users, required analysis functions, and social values. Communication in general, and social media in particular, have been identified as key elements in each type of context: flu (Culotta, 2013; Nagar et al., 2014), whooping cough (Olsen, 2013), wildfire (Helsloot and Groenendaal, 2013; Slavkovikj et al., 2014; Sutton et al., 2013), drugs (Hanson et al., 2013; Young and Shoptaw, 2013), and institutional affiliation and reputation (Miao et al., 2014; Snoeijers et al., 2014). As anticipated by the M³D and numerous other theories, there is likely to be some degree of reciprocal influence and representation between social media communication and the activities that involve such communication. Mapping such media content and dynamics is the first step in revealing the degree of correspondence in socially relevant contexts of human activity, and that is the fundamental concept behind the development of SMART.

Web and social media analytics platforms

The high daily volume of user-generated content posted to social media and the Internet has brought great opportunity to social media and Big Data analytics (Tsou et al., 2013b). To facilitate such analytics and visual presentations, many research groups have devoted their efforts to analyzing information from social media and the Internet, and developing web-based interactive platforms to display analytics (Brownstein et al., 2008; Diakopoulos et al., 2010; Ginsberg et al., 2009; Kamel Boulos et al., 2010; MacEachren et al., 2011; McKelvey and Menczer, 2013). Applications in this domain include, but are not limited to, epidemiology surveillance, crisis management, and trend detection and monitoring. The design of SMART emphasizes the geospatial characteristics of social media data by giving users the ability to target topics in specific geographical regions and analyze the spatiotemporal patterns of social media messages. Here, we introduce and compare several applications that inspired the development of SMART.

With the release of public APIs from social media companies, there are many commercial platforms that tap into the data stream and provide social media analytics. Topsy¹ was a social media analytics system and a certified partner of Twitter that indexed all the tweets since 2006. Because Topsy had direct access to the Twitter firehose database, typing a search term in its Social Search interface looked up tweets from 2006 onward that were associated with the term and presented the most popular content at different temporal resolutions. The Topsy Social Analytic function allowed users to select multiple keywords and compare their trends over the past 30 days. The function of displaying the most popular content and the interactive temporal trend visualization informed the design of SMART. Unlike Topsy, SMART not only shows the overall top content in each module but also implements a map view that can present analytics at regional scales. Also, instead of analyzing single keywords or finding the trend of a URL being mentioned, SMART focuses more on topic-level analytics.

Electronic event-based biosurveillance systems are examples where near real-time information from the Internet is an important source of epidemiologic intelligence (Gajewski et al., 2014). These systems validate data sources from the Internet using epidemiological approaches against an official comparison dataset to generate statistics and predictions (Barboza et al., 2014). Google Flu Trends² (GFT) (Ginsberg et al., 2009) attempted to make accurate predictions about flu activity across regions in the U.S. based on aggregated Google Search query data. Its prediction model indicates that the frequency of certain Google Search queries is highly correlated with the percentage of physician visits in which a patient presents with influenza-like illness symptoms. Users of the GFT application can view the overall flu activity in line charts or in a map view that shows the intensity of flu activity using a color gradient. GFT implemented an automated query term selection process to generate a list of the highest scoring search queries to represent flu activity. SMART and GFT differ in two ways. First, GFT automatically aggregates multiple search terms for a topic and focuses on the search queries of Google Search users, while SMART composes each topic from keywords defined by domain experts and tracks the conversation on social media related to the topic. Second, GFT focuses on a prediction model for influenza, whereas SMART, besides monitoring influenza outbreaks, can also be used to track the trends of different topics.

HealthMap (Bahk et al., 2015; Brownstein et al., 2008; Huston et al., 2015; Schwind et al., 2014) was created for disease outbreak monitoring and real-time surveillance of emerging public health threats. The application monitors multiple data sources including online news, validated official reports, and witness reports to reveal the current global state of infectious diseases. Users can pick from the predefined list of diseases and also have the option to specify a location and date range for the search. HealthMap presents the search results with an interactive dot map where data sources are aggregated to city or country level. The application also contains time series charts showing the temporal trend of disease alerts in the past 12 months. To calculate the disease activity index, the HealthMap system uses a filtering procedure to classify data sources into disease-related alerts and nonrelated information. While SMART also supports disease outbreak surveillance if a topic is defined with disease-related keywords, it differs from HealthMap in that SMART users can monitor an outbreak in any region of interest.

Many other applications apply visual analytics and mapping to support analysis tasks including identifying structure and semantics about media events (Diakopoulos et al., 2010), revealing city dynamics about citizens (Andrienko et al., 2013), and detecting abnormal events (Chae et al., 2012; Thom et al., 2012). SensePlace2 was developed to support the understanding of spatial and temporal patterns of activities, events, and attitudes (MacEachren et al., 2011). With a user-selected keyword, the application extracts the top 500 related tweets and displays them in the list view and the map view. The map feature of SensePlace2 displays tweets both as gridded density surfaces and as geo-located points to support analysis of tweets. The time plot and control function allow users to examine related tweets posted during a certain time period. Unlike SensePlace2, SMART (a) normalizes conversations about each topic by the population of the user-defined regions of interest, and (b) implements a machine learning procedure to filter out nonrelevant conversations for the monitored topics.

Technological framework and key components of SMART

SMART is a web application utilizing GIS, social media APIs, and machine learning tools for Twitter data collection, normalization, and analytics. Figure 2 illustrates the two-tier system framework design of the SMART system. The server side of SMART is a data processing and analysis engine mainly developed with Node.js³ and Python. Twitter data is collected with the official Twitter Search APIs, processed with our geo-locating engine, and then stored in the open-source NoSQL MongoDB⁴ database. To filter out noise in social media conversations and retain only related messages, a machine learning procedure was developed in Python with the scikit-learn library that allows the SMART system to better differentiate tweets representing real-world instances of diseases from false positives. Analytics about the monitored topics are also generated by the automatic SMART backend engine. On the client side, analysis results and spatiotemporal patterns are presented to the users in an interactive dashboard-like interface that contains several visualization modules. This SMART dashboard is a web application built around Bootstrap,⁵ a popular open-source frontend framework with HTML, CSS, and optional JavaScript extensions. For the visualization modules, we use dygraph.js⁶ to construct the temporal graphs for its interactive functions and D3.js⁷ to build word clouds for visualizing top keywords. The official Twitter JavaScript widget is embedded in the dashboard to display media content embedded in the tweets. The trend maps module is developed with the open-source Leaflet.js⁸ and basemap layers from Esri ArcGIS Online. The SMART client side communicates with the server side by making RESTful API calls. The server returns the response to each client-side API call in the JavaScript Object Notation (JSON) format. Details about the SMART server-side engine designs and client-side user interface designs are introduced as follows.

Figure 2.

The client-side and server-side technological frameworks of SMART.

Social media retrieval (APIs)

SMART was originally designed to work with Twitter data. Twitter provides developers with two official APIs to access the tweet messages from the timeline. The Streaming API establishes a live stream connection with the global Twitter stream, while the Search API is mainly for conducting singular queries to the Twitter timeline. With the Twitter Search APIs, SMART can retrieve tweets posted within the targeted area during the preceding six to nine days. This retrospective search function is useful for monitoring the social media responses to unexpected events, such as earthquakes. In addition, the design of SMART takes advantage of the spatial filtering methods provided by the Twitter Search APIs. With the geocode parameter set, the geo-search first identifies tweets with latitude/longitude coordinates that fall within the query areas, and then also finds tweets created by users whose profile location can be geo-coded to latitude/longitude coordinates that fall within the same area. By polling the Search API once per day, the system can access large datasets for specific geographic areas, which allows for detailed analysis of the data at municipal, regional, and national scales. Both geo-tagged tweets and non-geo-tagged tweets are retrieved with the Search API.⁹ In terms of data management, the SMART system stores all retrieved social media attributes in MongoDB, an open-source NoSQL database, considering its scalability and read/write performance (van der Veen et al., 2012). MongoDB is one of the most popular databases for Big Data, with its capability for high-performance queries, full-text search, and spatial queries over very large datasets.
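As a rough illustration of the geo-targeted query described above, the following sketch builds the parameters for a keyword search restricted to a circle around a city center. It makes no network call. The parameter names (`q`, `geocode`, `count`) follow the Twitter Search API of that period as we understand it; the function name, the OR-joined keyword query, and the San Diego coordinates are assumptions made for this example rather than details of the SMART implementation.

```python
from urllib.parse import urlencode


def build_geo_search_params(keywords, lat, lon, radius_miles, count=100):
    """Build query parameters for a keyword search limited to a circular area."""
    if radius_miles <= 0:
        raise ValueError("radius_miles must be positive")
    return {
        # Join the topic keywords so that a tweet matching any of them is returned
        "q": " OR ".join(keywords),
        # The geocode value is "latitude,longitude,radius" with a unit suffix
        "geocode": f"{lat},{lon},{radius_miles}mi",
        # Number of tweets requested per page (assumed page size)
        "count": count,
    }


# Hypothetical example: flu-related tweets within 17 miles of downtown San Diego
params = build_geo_search_params(["flu", "influenza"], 32.7157, -117.1611, 17)
query_string = urlencode(params)
```

A daily polling job could issue one such query per monitored city and store the returned tweets.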

Geo-locating social media conversations and normalization

When searching for tweets with keywords, the Twitter Search API can potentially return tweets with two types of spatial information: (1) geo-tagged locations provided by GPS-enabled devices, and (2) self-reported locations specified in user profiles. Geo-tagged locations are latitude and longitude pairs created by mobile devices with built-in GPS receivers or by users engaging with the geo-location features when posting a tweet. Self-reported locations are specified by users in their profile and can be changed at any time. Using Twitter messages with geo-tagged locations can potentially decrease the location uncertainty in analysis. However, such tweets remain a small fraction of the data harvested with the Twitter Search API, ranging from only 0.7% (Graham et al., 2014) to 6% (Takhteyev et al., 2012) depending on the topic. Many efforts have been made to infer the location of social media users using user profiles (Hecht et al., 2011; Sakaki et al., 2010), text content and timestamps (Li et al., 2011), and models that combine multiple indicators (Ghahremanlou et al., 2014; Mahmud et al., 2012; Rodrigues et al., 2013).

For tweets that do not have attached latitude/longitude coordinates, SMART implements a placename geo-coding procedure using the GeoNames database (GeoNames, 2015). GeoNames is a community-driven geographical database that contains more than 10 million geographic names including alternative names. The backend system of SMART retrieves text from the user profiles and tries to find matches from the GeoNames database. In the cases when GeoNames gazetteers return multiple matching locations, the SMART geo-locating engine goes further and assigns the most populated location as a default to the tweet.
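The placename matching and the most-populated default can be sketched as follows. This is a minimal illustration that assumes a tiny in-memory gazetteer and simple substring matching; the real GeoNames lookup, its alternative names, and the SMART matching rules are not reproduced here, and the population figures are rough placeholder values.

```python
from collections import namedtuple

Place = namedtuple("Place", "name lat lon population")

# Stand-in gazetteer for illustration only; populations are approximate
GAZETTEER = [
    Place("San Diego", 32.7157, -117.1611, 1_300_000),
    Place("Springfield", 39.7817, -89.6501, 115_000),  # Illinois
    Place("Springfield", 37.2090, -93.2923, 165_000),  # Missouri
]


def geolocate_profile(profile_location, gazetteer=GAZETTEER):
    """Return the best gazetteer match for a free-text profile location."""
    text = profile_location.strip().lower()
    if not text:
        return None
    # Collect every gazetteer entry whose name appears in the profile text
    matches = [place for place in gazetteer if place.name.lower() in text]
    if not matches:
        return None
    # Ambiguous names default to the most populated candidate
    return max(matches, key=lambda place: place.population)


best = geolocate_profile("Springfield")
```

In this toy gazetteer the ambiguous name "Springfield" resolves to the entry with the larger population, which mirrors the default rule described above but also shows how such a rule can misplace users from smaller namesake cities.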

Normalization by population is another important task in the design of the SMART backend system. SMART analyzes social media messages at the city level and implements the block group centroid demographic retrieval approach to normalize the number of tweets in each city by its population size. Traditional city population size is based on administrative boundaries and census tracts, which do not match well with the geo-search circles of the Twitter Search APIs. We used GIS software to calculate the estimated population for each circle based on 2010 census tracts, from the center of downtown out to the radius defined in the Twitter Search API query. Each of these city point buffers was joined with census tract centroids to determine which tracts should be included in our population calculations (Figure 3). Using the fine-grained census data gives a more accurate estimate of population, which greatly improves the ability to normalize tweet counts accurately for individual cities. Most of our city searches use a 17-mile radius buffer to cover the major metropolitan areas of U.S. cities without overlapping each other (Tsou et al., 2013). Some cities use a larger or smaller radius, such as Phoenix (40 miles) or Anaheim/Irvine (10 miles), in order to adjust to the size and closeness of surrounding cities. The normalization function is limited to the U.S. in the current implementation of SMART; adding census data for other countries would extend this function to more regions.

Figure 3.

Calculating population within the geo-search areas of targeted U.S. cities.
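The centroid-in-buffer population estimate and the resulting normalized tweeting rate can be sketched as below. This is an approximation under stated assumptions: great-circle (haversine) distance stands in for the GIS buffer join, and the tract centroids, populations, and tweet count are hypothetical values, not census data.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def buffer_population(center, radius_miles, tracts):
    """Sum the population of tracts whose centroid falls inside the buffer."""
    lat0, lon0 = center
    return sum(
        pop
        for lat, lon, pop in tracts
        if haversine_miles(lat0, lon0, lat, lon) <= radius_miles
    )


def tweets_per_million(tweet_count, population):
    """Normalize a tweet count to a rate per one million residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return tweet_count * 1_000_000 / population


# Hypothetical tract centroids as (latitude, longitude, population)
tracts = [
    (32.72, -117.16, 4000),  # close to the city center
    (32.80, -117.10, 6000),  # a few miles away, still inside the buffer
    (33.80, -117.90, 5000),  # far outside a 17-mile buffer
]
center = (32.7157, -117.1611)
population = buffer_population(center, 17, tracts)
rate = tweets_per_million(25, population)
```

Only the first two tracts fall inside the 17-mile buffer, so the third does not contribute to the denominator of the rate.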

Data filtering and machine learning classification

Data filtering and cleaning procedures are very important for social media analytics due to the ubiquitous presence of noise. We developed comprehensive data filtering procedures on top of our previous research (Tsou et al., 2014; Tsou and Leitner, 2013). The first step is to exclude retweets and tweets containing URL links, as our previous studies have shown that these messages are more likely to be irrelevant to the study outcomes (Nagel et al., 2013). We also integrate a machine learning classification procedure for filtering noise from the tweets with scikit-learn, a Python library for data mining and data analysis. The goal of this procedure is to identify tweets that do not appear to indicate real-world cases of the monitoring topic/event so that they can be omitted from the statistical analysis. For classification, a linear support vector machine (SVM) (Joachims, 1998) was used, as this algorithm has demonstrated effectiveness at supervised learning tasks involving text. Twitter messages are transformed to numerical values using a term frequency-inverse document frequency model (Salton and Buckley, 1988).
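The first filtering step and the term-weighting step can be illustrated with a small standard-library sketch. It assumes that each tweet is a dictionary with a `text` field and, for retweets, a `retweeted_status` field; it uses the plain tf × log(N/df) weighting rather than the smoothed, length-normalized variant that scikit-learn's `TfidfVectorizer` applies by default. The SVM itself is not shown, and the tokenizer and sample tweets are assumptions for this example.

```python
import math
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)


def is_noise(tweet):
    """Flag retweets and tweets that contain a URL."""
    text = tweet["text"]
    is_retweet = "retweeted_status" in tweet or text.startswith("RT @")
    return is_retweet or bool(URL_PATTERN.search(text))


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(texts):
    """Weight each term by its count times log(N / document frequency)."""
    docs = [Counter(tokenize(text)) for text in texts]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once per document
    doc_freq = Counter(term for doc in docs for term in doc)
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in doc.items()}
        for doc in docs
    ]


# Hypothetical sample tweets
tweets = [
    {"text": "RT @user: flu season is here"},
    {"text": "Flu shot info http://example.com"},
    {"text": "home sick with the flu"},
    {"text": "flu shot done, arm is sore"},
]
kept = [t["text"] for t in tweets if not is_noise(t)]
vectors = tfidf_vectors(kept)
```

In this sample the word "flu" occurs in every retained tweet and therefore receives zero weight, while rarer words such as "sick" carry the signal that a classifier would learn from.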

For example, for the influenza topic, 1500 randomly sampled tweets from the 2012 to 2013 flu season containing the keyword “flu” were used to train the SVM classifier. Each of these tweets was manually inspected and classified as valid or invalid according to the likelihood that the message indicated an actual case of influenza, and this labeled data was used to train the SVM. The text classification model was validated using standard machine learning measures: recall, precision, and the F1 score. The test set was created by sampling 1000 influenza-related tweets at random. The validation procedure resulted in a precision score of 0.671, a recall score of 0.949, and an F1 score of 0.786. The high recall score indicates that the model correctly classifies the majority of valid tweets in the test set; however, the lower precision score indicates that invalid tweets in the test set are sometimes incorrectly categorized as valid.
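The three validation measures are related by F1 = 2PR / (P + R), so the reported F1 score can be checked directly from the reported precision and recall. The sketch below also shows how the measures follow from confusion counts; the counts used in the second example are hypothetical, since the confusion matrix is not reported here.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from_precision_recall(precision, recall)


# Consistency check on the reported scores
reported_f1 = round(f1_from_precision_recall(0.671, 0.949), 3)

# Hypothetical counts: 8 true positives, 4 false positives, 2 false negatives
precision, recall, f1 = precision_recall_f1(tp=8, fp=4, fn=2)
```

The check reproduces the reported F1 value of 0.786 from the reported precision and recall.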

Statistics and analytics for topics

After data filtering, SMART uses descriptive statistical methods to summarize analytics such as the most popular retweets, the most popular URLs, the most popular hashtags, and the most popular mentions. As the pseudo code (Figure 4) demonstrates, each new tweet is validated with the pretrained SVM text classification model mentioned in the previous section. Valid tweets are then inspected and entities (i.e., hashtags, user mentions, media, or retweets) are extracted from the tweet text. Total counts of all unique entity values are maintained in order to derive a list of the top 10 most frequent values for each entity.

Figure 4.

Pseudo code of the filter and classification procedures.
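The procedure summarized in Figure 4 can be approximated by the following sketch, which validates each tweet and then accumulates entity counts. It is not the pseudo code from the figure: the tweet layout, the entity field names, and the keyword-based stand-in for the trained classifier are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Assumed entity fields on each tweet record
ENTITY_TYPES = ("hashtags", "mentions", "urls", "media")


def update_entity_counts(counts, tweet, is_valid):
    """Add a tweet's entities to the running counts if the tweet is valid."""
    if not is_valid(tweet["text"]):
        return False
    for kind in ENTITY_TYPES:
        # Missing entity fields are treated as empty lists
        counts[kind].update(tweet.get(kind, []))
    return True


def top_entities(counts, kind, n=10):
    """Return the n most frequent values for one entity type."""
    return counts[kind].most_common(n)


def stub_classifier(text):
    """Placeholder for the pretrained SVM: accepts text mentioning flu."""
    return "flu" in text.lower()


# Hypothetical tweets with pre-extracted entities
tweets = [
    {"text": "Got my flu shot", "hashtags": ["flu", "vaccine"], "mentions": ["CDCgov"]},
    {"text": "Flu is going around", "hashtags": ["flu"]},
    {"text": "Bieber fever", "hashtags": ["fever"]},
]
counts = defaultdict(Counter)
accepted = [update_entity_counts(counts, t, stub_classifier) for t in tweets]
top_hashtag = top_entities(counts, "hashtags", 1)
```

Because the third tweet is rejected by the stand-in classifier, its hashtag never enters the counts.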

The backend of SMART utilizes Node.js and Express¹⁰ to build a RESTful API, which is an interface to receive requests and send responses. When the SMART frontend client connects to the SMART server, a request is sent to the RESTful API to ask for the latest analytics for a specific topic. In addition to top content, the SMART server also returns overall spatiotemporal trends when the frontend queries a specific time period of interest. Table 1 presents the statistics and analytics modules that are implemented in the SMART dashboard.

Table 1.

Descriptions of statistics and analytics modules in SMART dashboard.

Module	Description
Trend	The total number of daily, weekly, and monthly tweets
Top URL	The top 10 most referred Web pages
Top retweet	The top 10 most popular retweets
Top media	The top 10 pictures or videos included in tweets
Top keyword	The most frequent words used in tweets
Top hashtag	The top 10 hashtags people used in tweets
Top mention	The top 10 Twitter users mentioned in tweets
Tweets in cities	The web map of tweeting rates in different cities
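The Trend module in Table 1 amounts to counting tweets per time bin. A minimal sketch is given below; it assumes tweet timestamps are already parsed into `datetime` objects and uses ISO week numbering, which may differ from the week definition used by SMART or by CDC reporting. The function name and sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Map each supported resolution to a function that labels a timestamp's bin
BIN_LABELS = {
    "daily": lambda t: t.strftime("%Y-%m-%d"),
    "weekly": lambda t: "{}-W{:02d}".format(*t.isocalendar()[:2]),
    "monthly": lambda t: t.strftime("%Y-%m"),
}


def trend_counts(timestamps, resolution="daily"):
    """Count tweets per day, ISO week, or month."""
    if resolution not in BIN_LABELS:
        raise ValueError(f"unknown resolution: {resolution}")
    label = BIN_LABELS[resolution]
    return Counter(label(t) for t in timestamps)


# Hypothetical tweet timestamps
stamps = [
    datetime(2015, 1, 5, 9),
    datetime(2015, 1, 5, 18),
    datetime(2015, 1, 7, 12),
    datetime(2015, 2, 1, 8),
]
daily = trend_counts(stamps, "daily")
weekly = trend_counts(stamps, "weekly")
monthly = trend_counts(stamps, "monthly")
```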

Geo-targeted regional selection and multilevel spatial analysis

SMART can implement multiscale spatial analysis functions depending on the spatial resolution of the regions of interest and on the spatial granularity of social media content. For geo-tagged social media data that come with latitude/longitude coordinates, SMART can aggregate them at different levels of map scale. For social media data that is not geo-tagged in the first place, SMART can geo-locate these messages using user profiles and then analyze them at aggregated spatial resolutions, including the city level or state level. SMART currently provides several aggregation modes for its spatial analysis results. In a city-level trends map, analysis results are aggregated for each city and a circle on a map represents each city (Figure 5, right side). The graduated color in each circle shows the tweeting intensity in each city. In a region-level trends map, social media messages are aggregated by administrative boundaries (e.g., state boundaries, county boundaries) or bounding boxes (Figure 5, left side). Analysis results displayed on the trends map can also be overlaid with external data sources such as land use or census data to give users additional information for a preliminary explanation of the distribution of topics.

Figure 5.

Aggregated multilevel spatial analysis (state level, bounding boxes, and city-level analysis) in the SMART dashboard.

Interactive user interface design

Different from traditional web map applications, the design of the SMART dashboard provides interactive display maps with multimedia content, such as pictures, videos, text messages, and charts (Figure 6). The web-based user interface is built with open-source programming libraries, jQuery and Leaflet APIs, to visualize and query tweets from a server-side database.

Figure 6.

The user interface design in SMART dashboard.

The following key features are included in the SMART dashboard to provide interactive query and visualization functions:

Top index numbers show the number of tweets collected from one day, one week, or one month.

The left panel provides the list of targeted cities and shortcuts to different functions (Word Cloud (Kuo et al., 2007), Trend, Top URL, Top Media, Top Cities, etc.).

The Trend function allows interactive queries of the actual tweet texts by clicking on a point on the line chart. Users can switch the view between Daily, Weekly, and Monthly modes.

The Word Cloud function shows the most prominent conversation keywords in tweets from the past day, the past seven days, the past 30 days, or all dates combined.

Tweets in Cities function shows the normalized tweeting rates in each city using graduated cartographic symbols.

The dashboard also provides the top 10 list of top URLs (web pages), Hashtags (subjects), Retweets (forwarded messages), Mentions (opinion leaders), and Media (Pictures).

Figure 7 illustrates an example of web maps for the combined “flu” and “influenza” tweeting rates among the top 31 U.S. cities. This web mapping tool was created by using open-source GIS mapping functions (Leaflet APIs) and basemap layers from Esri ArcGIS Online. The size of the red circles indicates the tweeting rates per one million people, normalized by the method described earlier. Users can select each city interactively to view detailed information about the tweeting rate calculation. Analysis results are also shown in different types of web maps depending on the spatial granularity of interest, as mentioned earlier in the “Geo-targeted regional selection and multilevel spatial analysis” section (Figure 5).

Figure 7.

Web map of SMART dashboard for showing the combined “flu” and “influenza” tweeting rates in U.S. cities.

The ultimate value of such dashboards is their ability to elucidate human dynamics in realspace. In order to evaluate the value of this dashboard, its preliminary utility in revealing a variety of social dynamics is illustrated by three independent case study exemplars.

Three case studies of SMART

When users access the SMART dashboard, at present they can select one topic from the topic list (Figure 8) to analyze the trends and popular messages. This topic list is potentially scalable. Here we present three case studies to illustrate the practical uses of the SMART system and the SMART dashboard in monitoring influenza outbreak, public responses to the Ebola outbreak, and online discussions of marijuana legalization. These analytic functions of the SMART dashboard provide researchers with methods for formalizing and testing their research hypotheses and questions. However, these analytics results should not be used to give a direct answer or action for complicated research issues. Since the current analytic capability of SMART is still at a preliminary stage, the analysis results are simplified and generalized.

Figure 8.

Topic selection options in the SMART dashboard (http://vision.sdsu.edu/hdma/smart/).

Case #1: Monitoring influenza outbreaks in U.S. cities

The M³D predicts that some types of tweets will propagate more than others, based on a number of predictable factors. Furthermore, realspace events sometimes stimulate tweets (event-based or evememic), and sometimes social media activate realspace events (symbol-based or etymemic). In the case of flu, people are expected to tweet employers, friends, or family about their symptoms, or comment upon the flu symptoms of their social network associates. Furthermore, such tweets are likely to be viewed as personally relevant, credible, and readily diffused within social networks. Finally, given geotechnical contexts of “seasonal flu” and public health “vaccination” media campaigns, such tweets are expected to capitalize on a public attention readiness. Flu tweets, therefore, should reflect a dynamic evememic process.

The influenza monitoring case study builds upon our previous work (Aslam et al., 2014; Nagel et al., 2013) by using the two keywords “flu” and “influenza” to collect related tweets from 31 major cities in the United States. The dashboard can display multilevel temporal trend analysis at the daily, weekly, and monthly scales for the 2014/2015 flu season (Figure 9). The system also includes historical data as supporting background information (Figure 9c). We compared the SMART trend graph with the official, manually collected flu records curated by the Centers for Disease Control and Prevention (CDC). The weekly SMART flu tweeting trend is highly correlated with the CDC weekly FluView Influenza Positive Tests Reports (Figure 10).

Figure 9.

The multilevel temporal trend analysis at (a) daily, (b) weekly, and (c) monthly scales for 2014/2015 flu season monitoring tasks.

Figure 10.

Comparing the (a) SMART filtered flu tweeting trend (weekly) with (b) CDC FluView Influenza Positive Tests Reported to CDC, National Summary (http://gis.cdc.gov/grasp/fluview/fluportaldashboard.html).
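The weekly comparison described above can be sketched in a few lines. The following is a minimal illustration, assuming daily tweet counts are available as a date-to-count mapping and that the reference series is aligned to the same ISO weeks. The function names and all numbers are hypothetical and are not SMART or CDC data, and the Pearson coefficient is used here only as one common way to quantify such a correlation.

```python
from collections import defaultdict
from datetime import date
from math import sqrt


def weekly_counts(daily_counts):
    """Sum daily tweet counts into ISO (year, week) buckets."""
    weekly = defaultdict(int)
    for day, count in daily_counts.items():
        iso_year, iso_week, _ = day.isocalendar()
        weekly[(iso_year, iso_week)] += count
    return dict(weekly)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative numbers (assumption): not SMART or CDC data.
daily = {
    date(2015, 1, 5): 10, date(2015, 1, 6): 14,    # ISO week 2
    date(2015, 1, 12): 30, date(2015, 1, 14): 25,  # ISO week 3
    date(2015, 1, 19): 20,                         # ISO week 4
}
tweets_by_week = weekly_counts(daily)
weeks = sorted(tweets_by_week)
tweet_series = [tweets_by_week[w] for w in weeks]
reference_series = [120, 260, 110]  # hypothetical positive-test counts, same weeks
r = pearson_r(tweet_series, reference_series)
```

A coefficient near 1 would indicate that the two weekly series rise and fall together; it does not by itself establish that tweets predict reported cases.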

Case #2: Understanding public responses to the Ebola outbreak in the U.S. and West Africa

Flu outbreaks represent a situation expected to stimulate routine social media expressions. In contrast, Ebola represents a potential moral and existential panic that could reveal self-expressions of apprehension about the disease. Although flu kills thousands of people per year, it is also a routine experience. Ebola, by comparison, is widely viewed as an exotic, novel, and univocally feared disease with an extremely high mortality rate. Media exposure of the few domestic Ebola cases, and the prolonged outbreak abroad, provided an episodic context in which to observe a moral panic in its dynamic social media presence. There are likely to be few counter-campaigns or memes competing for the attention of the public on such topics in this context.

The emergence of the lethal Ebola virus in the U.S. and West Africa caused the public significant fear and concern regarding the risk of an Ebola epidemic (Seltzer et al., 2015). Public health policies aim to minimize the impact of disease outbreaks, yet heightened fear and anxiety can often drive population behavior. An effective intervention must anticipate how populations will respond. In the case of Ebola, the difference between containing a single isolated infection and combating a major outbreak can hinge on ensuring that the population understands how to respond to an infection. Some diseases may require immediate treatment, whereas others might involve sheltering and isolation in place, or contact chaining (Brennan, 2014; SteelFisher et al., 2015). Although our current health system and the CDC have good tools for reporting, monitoring, and measuring the spread of disease infection cases, we do not have effective tools to measure public perception, fear, and response during significant public health crises. Such panics can be counterproductive, diverting needed resources and attention away from more effective interventions (Towers et al., 2015). Thus, it is within the interest of both local and federal public health officials to understand the nature of public perceptions of risks regarding Ebola, and perhaps mine the myths that are diffusing regarding public health response to such outbreaks and moral panics.

One potential use of SMART is to monitor public opinions and responses during or after disease outbreaks or disaster events. During the Ebola outbreak in 2014, we used SMART to track public responses in West Africa and in the U.S. Twelve cities in West Africa, three U.S. cities with 20-mile buffer zones (New York, Dallas, and Cleveland), and five U.S. airports with 5-mile buffer zones (JFK, Newark Liberty, Washington Dulles, Atlanta Hartsfield Jackson, and Chicago O’Hare) were monitored (Figure 11). Figure 12 illustrates the most popular media (pictures) shared on Twitter in these regions. The temporal trends of Ebola-related tweets in U.S. cities are very different from those in the 12 West African cities. The most popular pictures shared on Twitter in U.S. cities are very negative or sarcastic (such as “Enjoy Ebola”). In contrast, the most popular pictures and messages shared in West African cities mainly concern disease prevention or medical aid from international organizations. This knowledge can be used to gauge public opinion on the topic and to devise response strategies accordingly. This case study demonstrates that SMART can be used to listen to public opinion during crisis events or disease outbreaks.
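The buffer-zone selection described above can be illustrated with a simple distance test. This is a minimal sketch, assuming each buffer zone is a circle around a center point and that distance is measured along a great circle with the haversine formula; the source does not state how SMART builds its buffers. The function names are hypothetical, and the center coordinates are approximate values chosen for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


def regions_containing(lat, lon, regions):
    """Names of all circular buffer zones that contain the given point."""
    return [
        name
        for name, (center_lat, center_lon, radius_miles) in regions.items()
        if haversine_miles(lat, lon, center_lat, center_lon) <= radius_miles
    ]


# Approximate centers (assumption), with the radii quoted in the text.
REGIONS = {
    "Dallas (city, 20 mi)": (32.7767, -96.7970, 20.0),
    "JFK (airport, 5 mi)": (40.6413, -73.7781, 5.0),
}

tweet_near_dallas = regions_containing(32.78, -96.80, REGIONS)
tweet_in_fort_worth = regions_containing(32.7555, -97.3308, REGIONS)
```

A circular buffer is the simplest choice; a production system might instead use projected coordinates or polygon boundaries, which this sketch does not attempt.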

Figure 11.

SMART dashboard for public responses to Ebola in selected regions.

Figure 12.

Temporal trends of public responses to the Ebola outbreaks in (a) West Africa and (b) U.S. cities.

Case #3: Online discussion about marijuana legalization

Our third case study used SMART to monitor online discussions about marijuana legalization. The M³D anticipates that sometimes realspace events stimulate social media communications, and these communications in turn activate realspace events (i.e., polymemic events). Social movements use social media strategically and depend upon the self-generating and self-reinforcing features of such media campaigns to further facilitate the movement. Such movements, however, also evoke counter-campaigns and counter-framed messages. As such, tracking social movements focused on policy reform provides a unique opportunity to ascertain the role of social media in policy formation.

Working with domain experts from the public health department at San Diego State University, we selected 12 keywords related to the marijuana legalization movement in the U.S. (legalization, pot, legalizing, anti-pot, anti-marijuana, medicalmarijuana, #marijuanamedicine, #weed, #LegalizeIt, #cdb, #thc, #cannabis). The SMART system served as the Twitter observatory of the marijuana legalization discussion. As shown in Figure 13, SMART was able to pick up several peak days on which activity on this topic was high, especially around 20 April 2015 (“420”), which has become a countercultural observance day in North America on which celebrations and events related to cannabis are hosted (O’Sullivan, 2005). The SMART dashboard provides the opportunity to extract text content and embedded media, enabling inspection and interpretation of the major activities of the Twitter user community on that specific day. In addition to the temporal trends, SMART also highlighted the most active Twitter users in this topic community. This information could help researchers, political consultants, community organizers, NGOs, and government agencies to explore the opinion leaders and influential conversations in the topic-related social network, which can inform further strategic and policy planning. For example, SMART was able to identify the Twitter user @THEMMEXCHANGE (The Medical Marijuana Exchange) in the conversation related to marijuana legalization. The handler of the @THEMMEXCHANGE account is a medical marijuana services and advertising company that is active and well recognized in the marijuana legalization discussions on Twitter.
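Two of the analytics mentioned above, peak days and most active users, can be approximated with simple counting. The sketch below is illustrative only: the z-score threshold is an assumed rule for flagging peaks, not the method SMART uses, and the function names, user handles, and counts are all made up.

```python
from collections import Counter
from statistics import mean, pstdev


def peak_days(daily_counts, z_threshold=2.0):
    """Days whose count lies more than z_threshold standard deviations above the mean."""
    values = list(daily_counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # a flat series has no peaks
    return sorted(
        day for day, count in daily_counts.items()
        if (count - mu) / sigma > z_threshold
    )


def most_active_users(tweets, top_n=3):
    """The top_n (user, tweet_count) pairs, most active first."""
    return Counter(tweet["user"] for tweet in tweets).most_common(top_n)


# Made-up daily counts (assumption) with a spike on 20 April.
daily_counts = {
    "2015-04-15": 100, "2015-04-16": 110, "2015-04-17": 95,
    "2015-04-18": 105, "2015-04-19": 120, "2015-04-20": 400,
    "2015-04-21": 115, "2015-04-22": 100, "2015-04-23": 90,
    "2015-04-24": 105,
}
peaks = peak_days(daily_counts)

# Hypothetical user handles, not real accounts.
tweets = [
    {"user": "user_a"}, {"user": "user_b"}, {"user": "user_a"},
    {"user": "user_c"}, {"user": "user_a"}, {"user": "user_b"},
]
top_users = most_active_users(tweets, top_n=2)
```

A fixed threshold over a short window is sensitive to the window chosen; a longer baseline or a seasonal decomposition would be more robust, at the cost of more data.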

Figure 13.

Monitoring the marijuana legalization movement and public discussions with SMART.

Evaluation of user interface and the system

Shneiderman (1997) suggests that the successful design of user interfaces depends on three aspects: (a) guidelines, documents, and processes, which provide a clear set of principles for developers to follow and ensure harmony in design across products; (b) user interface software tools, which are used for prototyping and implementing the actual design; and (c) expert reviews and usability testing, which ensure a successful product for users. To evaluate the design and usability of SMART, we invited nine graduate students from San Diego State University, with domain expertise including geography, computer science, public health, communication, and sociology, to serve as testing users and conduct expert reviews of the user interface design and functionalities. After a brief introduction to SMART, each reviewer was given seven days to test SMART, from 9 February to 15 February. The review and testing focused on identifying usability issues and on pointing out the positive and negative user experiences that should be maintained or improved. Based on the aggregated expert review reports, the identified high-priority usability issues included:

Loading of the page can be slower for topics with a large number of messages.

Users cannot directly add new keywords inside the SMART dashboard.

The updating frequencies on analysis results are different for topics.

It would be nice to have a quick tutorial page before entering the system.

Users cannot customize the display of modules in the SMART dashboard.

In addition to collecting suggestions on significant usability issues, we asked the reviewers to focus on three categories of the system and user interface design: efficient navigation, module organization and consistency, and matching of user expectations. Based on the aggregated suggestions and comments, key items in each category are described as follows.

Efficient navigation

Not very intuitive to trigger the popup windows for detailed tweets;

Feedback submission form exceeding screen extent if using low viewing resolutions;

Several modules contain an “option” button, but its purpose is unclear without clicking it.

Module organization and consistency

There are many analysis modules on the same page, making it somewhat visually cluttered;

The time selection option is consistent across many analysis modules;

The analysis modules are not aligned perfectly in the two columns;

The title text for the Top URL module is not aligned with the time selection options.

Matching user expectations

The Trend (timeline) module shows the changes of topic-related tweets in different temporal resolutions;

Clicking on any tag of the Word Cloud modules lists all the tweets containing the tag, which matches the expectation;

The order of the module items in the left panel does not match the order of the actual module windows.

The expert reviews provide several important suggestions on how we could improve the functionalities and user experience in future developments. Overall, SMART received positive feedback on its design, as recognized in the review summary: (a) the graphic user interface design is friendly and intuitive; (b) the system functions as an easy-to-use tool to capture tweets; (c) the variety of analysis modules supports exploration of different topics.

Discussion and conclusion

The multilevel M³D model anticipates that digital messages such as tweets serve a significant function as a transmitter of cultural change in society. As such, the model conceptualizes multiple levels of memetic influence, each of which is reflected to varying degrees in SMART, from (a) recovering the content of memes in tweets; (b) to the communicator level identified by users; (c) to the social network level recoverable from follower links, retweets, and URL links; (d) to the societal level of identifying the trends of other topics competing for attention in cyberspace; and finally (e) to the geotechnical level of urban and population differences.

There are four major challenges in developing and implementing SMART, based on our own experiences: (1) control of and access to social media data, (2) concerns over user privacy, (3) spatial uncertainty, and (4) biased user populations. First, similar to many web observatory applications, SMART extracts user-generated content through public APIs from a commercial company, Twitter. Researchers have no control over, or service guarantee for, these public APIs, which are administered by private companies. Situations such as breakdowns of the Twitter geo-search function are inevitable and can cause significant loss of data for researchers. Adding multiple social media data sources is one possible solution to this challenge. SMART was originally designed to consume social media messages from Twitter. It is possible to extend the system design and data framework to collect data from other social media sources (such as Flickr, Instagram, Weibo, Foursquare, and YouTube). Adding more social media data sources might reduce the risk of missing key data due to API failure or denial of access. However, integrating multiple social media sources under a unified database framework will be another challenge for system implementation.

Second, privacy concerns (Tsou, 2015) from users and the public may prevent the adoption of SMART in some sensitive research areas, such as drug abuse and drinking problems. The current design of SMART has taken several steps to protect the privacy of social media users. For example, the data used in the analysis and the analytics presented on the SMART dashboard are restricted to the public information that users agreed to share with the social media services. In the analysis stages, geo-tagged messages are analyzed at aggregated spatial resolutions, such as cities or regions, to minimize the exposure of the precise locations of social media users. At the visualization stage, to further enhance protection, our privacy policy also implements a feedback procedure through which users can request that specific conversations be removed from the SMART dashboard. Another possible privacy protection method is to convert all users to anonymous IDs. However, the user anonymization process may prevent future analysis of social networks from these messages. Future work needs to strike a viable balance between the protection of user privacy and the usefulness of social media messages.
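One way to picture the anonymous-ID idea is keyed pseudonymization, sketched below. This is an assumption for illustration and is not part of the SMART design described here: each user handle is replaced by a keyed hash (HMAC-SHA256), so the same user always maps to the same opaque ID and mention or retweet links keep their structure, while the handle itself is not stored. The function names, the key, and the sample handles are hypothetical. Pseudonymization of this kind is weaker than full anonymization, because anyone holding the key can recompute the mapping.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Map a user handle to a stable opaque ID using a keyed hash."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "u_" + digest[:16]


def pseudonymize_edges(edges, secret_key):
    """Replace both endpoints of each (source, target) link with opaque IDs."""
    return [
        (pseudonymize(source, secret_key), pseudonymize(target, secret_key))
        for source, target in edges
    ]


# Hypothetical key and handles (assumption); a real key must be kept secret.
key = b"example-key-not-for-production"
mention_edges = [("alice", "bob"), ("carol", "bob")]
opaque_edges = pseudonymize_edges(mention_edges, key)
```

In this sketch both links still point at the same opaque ID for "bob", so network measures such as in-degree remain computable without exposing the handle.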

Third, the spatial uncertainty in social media data affects the spatial analysis results of SMART. Social media messages are collected with different levels of location precision. Location information in the messages can vary from the most precise latitude/longitude coordinate points, to street addresses, to city names, to regions or place names that might not have clear boundaries at all. The procedure of aggregating geo-location information to different spatial granularities can introduce uncertainty into the final analysis results. Even though the geo-locating engine can geo-locate many incoming messages, validating actual user locations remains a difficult task in SMART.
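The effect of mixed location precision on aggregation can be made concrete with a small sketch. It assumes, purely for illustration, that the precision levels named above form a strict order from finest to coarsest and that a message can be counted at a given level only if it is located at least that precisely. The class, the function names, and the sample data are hypothetical.

```python
from enum import IntEnum


class Precision(IntEnum):
    """Location precision levels, ordered from finest (0) to coarsest."""
    COORDINATES = 0     # exact latitude/longitude point
    STREET_ADDRESS = 1
    CITY = 2
    REGION = 3
    VAGUE_PLACE = 4     # place name without a clear boundary


def usable_at(message_precision, target):
    """True if a message is located at least as precisely as the target level."""
    return message_precision <= target


def coverage_by_level(precisions):
    """Share of messages that can be aggregated at each precision level."""
    total = len(precisions)
    if total == 0:
        return {}
    return {
        level.name: sum(usable_at(p, level) for p in precisions) / total
        for level in Precision
    }


# Made-up sample of five messages (assumption).
sample = [
    Precision.COORDINATES, Precision.CITY, Precision.CITY,
    Precision.REGION, Precision.VAGUE_PLACE,
]
coverage = coverage_by_level(sample)
```

Reporting such coverage alongside a map makes explicit how much of the collected data actually supports the chosen spatial granularity.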

Fourth, according to the Pew Research Center (Duggan, 2015), young adults (age 15–24) make up the majority of users of most social media services. This could introduce demographic bias into the analysis of social media messages. To adjust for this demographic bias, one potential direction is to use a variety of keywords as sampling tools to collect opinions from different age groups of a population. For example, the keyword #healthcare may collect a majority of its messages from senior citizens rather than teenage users. However, more validation studies are needed to verify the association between different keywords and different user groups.

Capturing human activities in real time using social media is now at the forefront of behavioral measurement and understanding human dynamics (Alshamsi et al., 2015). In this regard, the development of SMART is intended to provide a broader impact for both theory testing and future social response and monitoring. By following how the general population responds to trending topics, the role of public messaging in affecting population behaviors can be ascertained (Zubiaga et al., 2015). Such access will ultimately enable enhanced modeling of information propagation that can account for population-level behavioral responses and allow policymakers to tailor their communications to the specific needs of a given event or topic. This is especially relevant in public health cases such as influenza or other disease outbreaks, where population responses, such as increased rates of vaccination, social isolation, and hand-washing, can have a direct impact on the course of the outbreak. Further, knowledge of geographic disease diffusion and the communicative contents that correspond to such diffusion can be vital to designing and guiding public health campaigns.

To summarize, SMART can provide real-time spatiotemporal analytic functions for monitoring trending topics and online discussions by adding geo-targeting capability for cities and regions. Different cities and regions may reveal different patterns of conversations and trends. By linking space, time, and the content of social media messages, the SMART system is designed to discover meaningful patterns in, and insights into, events such as disease outbreaks or social activities. The three case studies presented demonstrate the value of SMART in tracking disease outbreaks, monitoring public discussion, and measuring public perception during significant public health crises. With the capability of adding new topics and keywords to the monitoring system in the future, SMART is not limited to the public health domain and can be extended to other domains such as disaster response, social movement studies, marketing, and political campaigns.

This article is part of a special theme on Social Media & Society 2014. To see a full list of all articles in this special theme, please click here: http://bds.sagepub.com/content/social-media-society.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This material is based upon work supported by the National Science Foundation under Grant No. 1416509, project titled “Spatiotemporal Modeling of Human Dynamics Across Social Media and Social Networks.” Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. The authors thank other HDMA team members’ contributions to the development of SMART.

References

Alshamsi, Pianesi, Lepri (2015) Beyond contagion: Reality mining reveals complex patterns of social influence. PLoS ONE 10(8): e0135740.

Andrienko, Bosch (2013) Thematic patterns in georeferenced tweets through space-time visual analytics. Computing in Science and Engineering 15(3): 72–82.

Aslam, Tsou M-H, Spitzberg (2014) The reliability of tweets as a supplementary method of seasonal influenza surveillance. Journal of Medical Internet Research 16(11): e250.

Bahk, Scales, Mekaru (2015) Comparing timeliness, content, and disease severity of formal and informal source outbreak reporting. BMC Infectious Diseases 15: 135.

Barboza, Vaillant, Le Strat (2014) Factors influencing performance of internet-based biosurveillance systems used in epidemic intelligence for early detection of infectious diseases outbreaks. PLoS ONE 9(3): e90536.

Brennan (2014) Reflections on the Ebola outbreak: Perspectives from the frontline. Journal of Communication in Healthcare 7(4): 248–249.

Brownstein, Freifeld, Reis (2008) Surveillance sans frontières: Internet-based emerging infectious disease intelligence and the HealthMap project. PLoS Med 5: 1–6.

Chae J, Thom D, Bosch H, et al. (2012) Spatiotemporal social media analytics for abnormal event detection and examination using seasonal-trend decomposition. In: IEEE Conference on Visual Analytics Science and Technology 2012, VAST 2012–Proceedings, New York, NY: IEEE Computer Society Press, pp.143–152.

Cheung CMK, Thadani (2012) The impact of electronic word-of-mouth communication: A literature analysis and integrative model. Decision Support Systems 54(1): 461–470.

Culotta (2013) Lightweight methods to estimate influenza rates and alcohol sales volume from Twitter messages. Language Resources and Evaluation 47(1): 217–238.

Diakopoulos N, Naaman M and Kivran-Swaine F (2010) Diamonds in the rough: Social media visual analytics for journalistic inquiry. In: IEEE Conference on Visual Analytics Science and Technology 2010, VAST 2010–Proceedings, New York, NY: IEEE Computer Society Press, pp.115–122.

Duggan M (2015) Mobile Messaging and Social Media – 2015. Pew Research Center. Available at: http://www.pewinternet.org/2015/08/19/mobile-messaging-and-social-media-2015/.

Gajewski, Peterson, Chitale (2014) A review of evaluations of electronic event-based biosurveillance systems. PLoS ONE 9(10): e111222.

GeoNames (2015) GeoNames (online). Available at: http://geonames.org/ (accessed 27 September 2015).

Ghahremanlou, Sherchan, Thom (2014) Geotagging twitter messages in crisis management. The Computer Journal 58(9): 1937–1954.

Ginsberg, Mohebbi, Patel (2009) Detecting influenza epidemics using search engine query data. Nature 457(7232): 1012–1014.

Graham M, Hale SA and Gaffney D (2014) Where in the world are you? Geolocation and language identification in Twitter. The Professional Geographer 66(4): 568–578.

Hanson, Burton, Giraud-Carrier (2013) Tweaking and tweeting: Exploring Twitter for nonmedical use of a psychostimulant drug (Adderall) among college students. Journal of Medical Internet Research 15(4): e62.

Hawelka, Sitko, Beinat (2014) Geo-located Twitter as proxy for global mobility patterns. Cartography and Geographic Information Science 41(3): 260–271.

Hecht B, Hong L, Suh B, et al. (2011) Tweets from Justin Bieber’s heart: The dynamics of the location field in user profiles. In: Proceedings of the SIGCHI conference on human factors in computing systems. New York, NY: ACM, pp.237–246.

Helsloot, Groenendaal (2013) Twitter: An underutilized potential during sudden crises? Journal of Contingencies and Crisis Management 21(3): 178–183.

Huston, Mekaru, Kluberg (2015) Searching the Web for influenza vaccines: HealthMap Vaccine Finder. American Journal of Public Health 105(8): e134–e139.

Joachims (1998) Text categorization with support vector machines: Learning with many relevant features. Machine Learning: ECML-98 1398: 137–142.

Kamel Boulos, Resch, Crowley (2011) Crowdsourcing, citizen sensing and sensor web technologies for public and environmental health surveillance and crisis management: Trends, OGC standards and application examples. International Journal of Health Geographics 10(1): 67.

Kamel Boulos, Sanfilippo, Corley (2010) Social Web mining and exploitation for serious applications: Technosocial predictive analytics and related technologies for public health, environmental and national security surveillance. Computer Methods and Programs in Biomedicine 100(1): 16–23.

Kuo BY-L, Hentrich T, Good BM, et al. (2007) Tag clouds for summarizing web search results. In: Proceedings of the 16th international conference on World Wide Web – WWW ’07. New York, NY: ACM, pp.1203–1204.

Li W, Serdyukov P, de Vries AP, et al. (2011) The where in the tweet. In: Proceedings of the 20th ACM international conference on Information and knowledge management – CIKM ’11. New York, NY: ACM, p.2473.

MacEachren AM, Robinson AC, Jaiswal A, et al. (2011) SensePlace2: GeoTwitter analytics support for situational awareness. In: IEEE Conference on Visual Analytics Science and Technology 2011, VAST 2011–Proceedings. New York, NY: IEEE Computer Society Press, pp.181–190.

McKelvey K and Menczer F (2013) Design and prototyping of a social media observatory. In: Proceedings of the 22nd international conference on World Wide Web. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee. New York: ACM, pp.1351–1358.

McKenzie, Janowicz, Gao (2015) POI pulse: A multi-granular, semantic signature–based information observatory for the interactive visualization of big geosocial data. Cartographica: The International Journal for Geographic Information and Geovisualization 50(2): 71–85.

Mahmud J, Nichols J and Drews C (2014) Home location identification of Twitter users. ACM Trans Intell Syst Technol 5(3): 47:1–47:21.

Miao, Feng, Zhou (2014) The construction and application research of crisis early warning mechanism of reputation of colleges and universities in the new media environment. Journal of Chemical and Pharmaceutical Research 6(6): 202.

Nagar, Yuan, Freifeld (2014) A case study of the New York City 2012/2013 influenza season with daily geocoded Twitter data from temporal and spatiotemporal perspectives. Journal of Medical Internet Research 16(10): e236.

Nagel, Tsou M-H, Spitzberg (2013) The complex relationship of realspace events and messages in cyberspace: Case study of influenza and pertussis using tweets. Journal of Medical Internet Research 15(10): e237.

Olsen J (2013) Infodemiology to improve public health situational awareness: An investigation of 2010 pertussis outbreaks in California, Michigan and Ohio. PhD dissertation. The University of North Carolina at Chapel Hill, ProQuest, UMI Dissertations Publishing, 2013, p.3562785.

O’Sullivan (2005) It’s 420: Do you know where your children are? Adolescent substance use and misuse. Clinical Excellence for Nurse Practitioners 9(3): 127–129.

Rodrigues E, Assuncao R, Pappa GL, et al. (2013) Uncovering the location of Twitter users. In: 2013 Brazilian Conference on Intelligent Systems (BRACIS). Washington, DC: IEEE Computer Society.

Sakaki T, Okazaki M and Matsuo Y (2010) Earthquake shakes Twitter users: Real-time event detection by social sensors. In: WWW’10: Proceedings of the 19th international conference on World wide web. New York, NY: ACM, p.851.

Salton, Buckley (1988) Term-weighting approaches in automatic text retrieval. Information Processing and Management 24(5): 513–523.

Sasahara, Hirata, Toyoda (2013) Quantifying collective attention from tweet stream. PLoS ONE 8(4): 1–10.

Schwind, Wolking, Brownstein (2014) Evaluation of local media surveillance for improved disease recognition and monitoring in global hotspot regions. PLoS ONE 9(10): e110236.

Segev, Nissenbaum, Stolero (2015) Families and networks of internet memes: The relationship between cohesiveness, uniqueness, and quiddity concreteness. Journal of Computer-Mediated Communication 20: n/a–n/a.

Seltzer, Jean, Kramer-Golinkoff (2015) The content of social media’s shared images about Ebola: A retrospective study. Public Health 129(9): 1273–1277.

Shneiderman (1997) Designing the User Interface: Strategies for Effective Human-Computer Interaction, 3rd ed. Boston, MA: Addison-Wesley Longman Publishing Co., Inc.

Slavkovikj, Verstockt, Van Hoecke (2014) Review of wildfire detection using social media. Fire Safety Journal 68: 109–118.

Snoeijers, Poels, Nicolay (2014) #universitycrisis: The impact of social media type, source, and information on student responses toward a university crisis. Social Science Computer Review 32(5): 647–661.

Spitzberg (2014) Toward a model of meme diffusion (M³D). Communication Theory 24(3): 311–339.

SteelFisher, Blendon, Lasala-Blanco (2015) Ebola in the United States – Public reactions and implications. New England Journal of Medicine 373(9): 789–791.

Sutton, Spiro, Johnson (2013) Warning tweets: Serial transmission of messages during the warning phase of a disaster event. Information, Communication and Society 17(6): 765–787.

Takhteyev, Gruzd, Wellman (2012) Geography of Twitter networks. Social Networks 34(1): 73–81.

Thom D, Bosch H, Koch S, et al. (2012) Spatiotemporal anomaly detection through visual analysis of geolocated Twitter messages. In: IEEE Pacific Visualization Symposium 2012, PacificVis 2012–Proceedings. Washington, DC: IEEE Computer Society, pp.41–48.

Towers, Afzal, Bernal (2015) Mass media and the contagion of fear: The case of Ebola in America. PLoS ONE 10(6): e0129179.

Tsou M-H (2015) Research challenges and opportunities in mapping social media and Big Data. Cartography and Geographic Information Science 42(supp1): 70–74.

Tsou M-H, Kim I-H, Wandersee (2014) Mapping ideas from cyberspace to realspace: Visualizing the spatial context of keywords from web page search results. International Journal of Digital Earth 7(4): 316–335.

Tsou M-H, Leitner (2013) Visualization of social media: Seeing a mirage or a message? Cartography and Geographic Information Science 40(2): 55–60.

Tsou M-H, Yang J-A, Lusher (2013) Mapping social activities and concepts with social media (Twitter) and web search engines (Yahoo and Bing): A case study in 2012 US Presidential Election. Cartography and Geographic Information Science 40(4): 337–348.

van der Veen JS, van der Waaij B and Meijer RJ (2012) Sensor data storage performance: SQL or NoSQL, physical or virtual. In: 2012 IEEE fifth international conference on cloud computing, pp.431–438.

Wang, Zhang (2012) The evolution of social commerce: The people, management, technology, and information dimensions. Communications of the Association for Information Systems 31: Article 5.

Young, Shoptaw (2013) Stimulant use among African American and Latino MSM social networking users. Journal of Addictive Diseases 32(1): 39–45.

Zubiaga, Spina, Martínez (2015) Real-time classification of Twitter trends. Journal of the Association for Information Science and Technology 66(3): 462–473.

Social media analytics and research testbed (SMART): Exploring spatiotemporal patterns of human dynamics with geo-targeted social media messages
