Abstract
The Internet has forever changed the way people access information and make decisions about their healthcare needs. Patients now share information about their health at unprecedented rates on social networking sites such as Twitter and Facebook and on medical discussion boards. In addition to the information about health conditions that they share explicitly through posts, patients reveal their inner fears and desires about health when they search for health-related keywords on search engines. Data are also generated by mobile phone applications that track users' health behaviors (e.g., eating and exercise habits) and give medical advice. The data generated through these applications are mined and repackaged by surveillance systems developed by academics, companies, and governments alike to provide insight to patients and healthcare providers for medical decisions. Until recently, most Internet research in public health has focused on surveillance, that is, on monitoring health behaviors. Only recently have researchers interacted with the crowd to ask questions and collect health-related data. In the future, we expect to move from this surveillance focus to the “ideal” of Internet-based patient-level interventions, in which healthcare providers help patients change their health behaviors. In this article, we highlight the results of our prior research on crowd surveillance and make suggestions for the future.
Introduction
Collecting data through these means and mining the data for insights is called online crowd surveillance. Most Internet research in the field of public health has until now focused on monitoring health behaviors; however, researchers have recently begun to interact with users to collect a wider variety of health-related data. In the near future, we expect to move from a largely surveillance focus to the “ideal” of Internet-based patient-level interventions, where healthcare providers actually help patients to change their health behaviors, for example, by helping them eat more healthfully or stop smoking. In this article, we highlight the results of our prior research on online crowd surveillance, using a unique dataset to illustrate one of its limitations and provide suggestions for how “big data” might be utilized in the public health field in the future.
Surveillance
The Centers for Disease Control 1 referred to surveillance as “The systematic, ongoing, collection, management and interpretation of these data to public health programs to stimulate public health action.” For health policy researchers, the attractiveness of the Internet as a tool for online crowd surveillance lies in its population-level scale and in the access it gives to the uncensored thoughts of patients, all at minimal cost. In essence, Internet users constitute a far larger “crowd” focus group than traditional methods make practicable, one in which the “voices of millions” can be heard. Given the massive amounts of data this makes available, it is no surprise that researchers have used the Internet for surveillance. 2
Indeed, through surveillance, researchers have access to surprisingly rich public health-related data, generated when patients congregate, seek information, and discuss their concerns and outcomes. 3 Twitter especially has proven to be an abundant source of such information. For example, although many postings on Twitter communicate seemingly mundane accounts of everyday life and experiences, this chatter often also includes disclosure of emotional and physical well-being.4–10 Recent studies have suggested that 8.5% of English-language tweets relate to disease of some type, and 16.6–25.1% relate to health. 11 This information can be downloaded, geocoded, and characterized by researchers for content and demographics. 12
Twitter has served as a source of health-related data in numerous novel ways. In particular, Twitter's immediacy has permitted real-time assistance during natural disasters (hurricanes and earthquakes, for instance) by allowing the wide-scale broadcast of available resources, enabling people in need of medical assistance to locate help.10,13,14 This immediacy also allows much quicker surveillance for targeting infection “hot spots” in pandemic situations, as was done by companies such as Google in the H1N1 crisis.9,15,16 However, the potential application is much broader than emergency situations or healthcare: linguists and sociologists, among others, have mined tweets for their research, succeeding in distinguishing local dialects and forecasting the moods and opinions of populations in specific geographic regions.17,18
In nonemergency healthcare, many studies offer important public health insights by linking the origin of sadness and depression to a number of serious medical conditions, and new methods of identifying them are always welcome. For example, researchers have recently been able to link changes in tweeting behavior to postpartum depression. 19 Others have used Twitter to quantify medical misconceptions (e.g., sequelae of concussions) and the spread of poor medical compliance (e.g., antibiotic use).8,20 In our recent work, 21 we have used Twitter to understand how people communicate online about cardiovascular health. Specifically, we sought to characterize how Twitter users seek and share information related to cardiac arrest, a time-sensitive cardiovascular condition in which initial treatment often relies on public knowledge and response. This project demonstrated that tweets about cardiovascular health could be identified, sorted, and characterized by content and by the person generating that content. Twitter offers promise as a research tool not only because of its immense scale, but also because the content of messages can be systematically searched. 22 The immediacy of Twitter offers another great advantage as a research tool. For example, emergency departments in Boston learned about the 2013 marathon bombings through Twitter before announcements arrived from conventional sources such as the media or established emergency service communication channels. 23 While terrorist attacks are an extreme case, the general principle holds.
Surveillance opportunities extend far beyond Twitter, however, with the Internet offering significant opportunities for researchers and public health officials alike. Patients discuss their health with others on medical discussion boards and review sites, which provide a test-bed for public health surveillance. In our work,24–27 for instance, we used medical discussion board data to successfully link drugs and homeopathic remedies to relevant side effects. 27 We developed a methodology for establishing a corpus of medical message board posts, anonymizing the corpus and successfully extracting information on potential adverse drug effects discussed by users. In addition, we used these data to determine the extent to which patients use social media to discuss side effects related to medications. In addition to linking drug use to side effects, we also focused our research more specifically on discussions by breast cancer patients related to using aromatase inhibitors (AIs), with particular emphasis on AI-related arthralgia, and sought to understand the frequency and content of side effects and associated adherence behaviors. We found that online discussions of AI-related side effects are common and often relate to drug switching and discontinuation. 24 Obviously, physicians would benefit from awareness of the implications of these discussions and should promote optimal adherence by guiding patients in managing side effects effectively. It is this type of awareness—of what the “person in the street” is saying—that research such as ours can provide to an unparalleled extent.
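As a rough illustration of how drug and side-effect mentions might be paired in a message board corpus, the sketch below counts posts in which a drug term and a side-effect term appear together. It is a deliberately simple co-occurrence count and not the methodology used in the work cited above; the function name, the placeholder terms "drugA" and "drugB", and the three example posts are all invented for illustration.

```python
import re
from collections import Counter
from itertools import product


def cooccurrence_counts(posts, drugs, effects):
    """Count posts that mention both a drug term and a side-effect term.

    Matching is case-insensitive and on whole words or phrases only.
    Co-occurrence suggests a candidate association; it does not show causation.
    """
    def mentions(term, text):
        pattern = r"\b" + re.escape(term) + r"\b"
        return re.search(pattern, text, re.IGNORECASE) is not None

    counts = Counter()
    for post in posts:
        found_drugs = [d for d in drugs if mentions(d, post)]
        found_effects = [e for e in effects if mentions(e, post)]
        # Each (drug, effect) pair is counted at most once per post.
        for pair in product(found_drugs, found_effects):
            counts[pair] += 1
    return counts


# Invented, anonymized example posts (hypothetical data).
posts = [
    "Since starting drugA my joint pain has been awful",
    "DrugA gave me hot flashes and joint pain",
    "drugB has been fine so far",
]
drugs = ["drugA", "drugB"]
effects = ["joint pain", "hot flashes"]

pair_counts = cooccurrence_counts(posts, drugs, effects)
for (drug, effect), n in pair_counts.most_common():
    print(f"{drug} + {effect}: {n} post(s)")
```

A real pipeline would also need anonymization, synonym handling (e.g., lay terms for arthralgia), and a way to separate reports of experienced side effects from questions or secondhand mentions.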
In addition to posting information about their health, patients search for solutions on the Internet and often click on links to health-related websites. When collected, these link data are useful indicators of public health. Data resulting from search queries have been found to be highly predictive of a wide range of population-level health behaviors. For example, trends in Google and Yahoo search queries can be used to predict epidemics of illnesses such as flu and dengue fever, 28 the seasonality of mental health, depression and suicide,29,30 the prevalence of Lyme disease, 31 the incidence of kidney stones, 31 and the prevalence of smoking and electronic cigarette use. 32 Web logs, which serve as histories of where people click, are predictive of individual characteristics such as mental health and dietary preferences. 33 While the availability of vast amounts of information about health on the Web means that people will find information when they search, we have found that search keyword selection is critical for arriving at reliable curated health content. 34
Limitations to Surveillance
While the collection and analysis of Internet data is a promising path to better understanding of health behaviors, this strategy suffers from several limitations. First, eavesdropping on such communication involves privacy concerns that have not been fully resolved. People have an expectation of and right to privacy, particularly when they discuss health-related issues. Internet-based data gathering thus presents both logistic challenges (e.g., how to get people to opt in to share their Facebook status updates) and potential ethical dilemmas (if one predicts that someone is at risk for suicide based on his/her posts, should one intervene in some way?). Second, such data are obtained without context; they do not include a patient's health history or medical outcomes, merely a snapshot of their daily lives. (Health history is almost impossible to come by if one only collects anonymized tweets or posts.) In the absence of context, causal claims about specific behaviors and health conditions are difficult to substantiate. Third, Internet-based data are seldom curated; with no distinction between genuine and spurious information, it becomes increasingly important to develop methodologies for isolating “the signal from the noise.” Fourth, a commonly expressed concern about data from Twitter and similar services relates to defining the sample populations. Twitter users do not represent a random sample of the population; for instance, the elderly and young children are less likely to use Twitter than people between the ages of 18 and 40. Although studies have shown that Twitter represents broad demographic segments of the population,35–37 drawing conclusions without considering the populations can be problematic. In our current work, we seek to understand how bias in the representation of Internet users impacts the conclusions drawn at the population level.
To illustrate the severity of the problem of relying on tweet data to draw population-level conclusions, we present below results from a large-scale survey of U.S. households, the Simmons National Consumer Study, which is administered annually to more than 12,000 adults over the age of 18. The survey asks respondents about all aspects of their daily lives, including product purchases, news consumption, Internet usage, opinions, and health. In Table 1, we combine the survey's answers about Internet usage and health to show the problems that can arise when results are generalized to the entire population without special care to poststratify the sample to match that population. Table 1 presents the number of people in the U.S. population over age 18 who have the diseases or conditions queried about in the Simmons survey in 2011 and 2012. For each year, we present the estimated counts of people in the population with the disease and of people on Twitter with the disease. These data come directly from the Simmons survey: because respondents were asked about both their health conditions and whether they used Twitter, we can cross-tabulate them by both characteristics. When we rank the conditions by prevalence, some obvious differences appear. Conditions more prevalent in the elderly, such as hypertension, arthritis, and high cholesterol, appear in the top five for the population but not for Twitter users. Conversely, conditions that skew young, such as acne and anxiety, rank higher in prevalence on Twitter.
Note to Table 1: population aged 18 and over.
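The poststratification mentioned above can be sketched in a few lines: prevalence is estimated within each demographic stratum of the sample and then reweighted by that stratum's share of the general population. The code below is a minimal sketch under stated assumptions; the function names, the two age strata, the sample, and the population shares are all invented for illustration and are not values from the Simmons survey.

```python
from collections import defaultdict


def raw_prevalence(sample):
    """Unadjusted share of sampled users who report the condition."""
    return sum(1 for _, has_condition in sample if has_condition) / len(sample)


def poststratified_prevalence(sample, population_shares):
    """Reweight per-stratum prevalence by each stratum's population share.

    sample: list of (stratum, has_condition) pairs.
    population_shares: dict mapping stratum -> share of the population
    (shares are assumed to sum to 1).
    """
    tallies = defaultdict(lambda: [0, 0])  # stratum -> [with condition, total]
    for stratum, has_condition in sample:
        tallies[stratum][0] += int(has_condition)
        tallies[stratum][1] += 1

    estimate = 0.0
    for stratum, share in population_shares.items():
        with_condition, total = tallies[stratum]
        if total == 0:
            # No sampled users in this stratum, so nothing to reweight.
            raise ValueError(f"no sample observations for stratum {stratum!r}")
        estimate += share * with_condition / total
    return estimate


# Hypothetical sample that over-represents younger users.
sample = (
    [("18-40", True)] * 2 + [("18-40", False)] * 6
    + [("65+", True)] * 1 + [("65+", False)] * 1
)
# Hypothetical population shares for the same strata.
population_shares = {"18-40": 0.5, "65+": 0.5}

raw = raw_prevalence(sample)
adjusted = poststratified_prevalence(sample, population_shares)
print(f"raw: {raw:.3f}, poststratified: {adjusted:.3f}")
```

In this toy sample the condition is twice as common in the older stratum, so reweighting raises the estimate from 0.300 to 0.375. Poststratification corrects only for the characteristics used to define the strata; it cannot repair a stratum that is absent from the sample, which is why the sketch raises an error in that case.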
Much more serious problems than the differences between Twitter and population demographics, however, arise from the facts that words are ambiguous (e.g., “heart attack” or “MI” mostly do not refer to heart attacks) and that people mention diseases without necessarily experiencing them. Thus, counts of keywords on Twitter do not necessarily represent the incidence of specific medical problems accurately. For example, Table 2 shows the number of tweets about the 10 most prevalent diseases as well as the rank of each disease in the US population. We collected the tweets during the week of August 7–13, 2013, by simply searching Twitter for the listed keywords and counting the resulting tweets. We see again that the Twitter ranking by keywords differs greatly from the ranking by prevalence in the population. For example, the most tweeted-about terms among the names of the top 10 symptoms and conditions were anxiety and depression, whereas these are at the bottom of the top 10 list in terms of prevalence. It is also important to note that the proportion of individuals tweeting about certain conditions is very low. For example, very few individuals tweet about arthritis or use the word “obese”; instead, most of the tweets containing these words come from health organizations. Finally, with the exception of backache, very few people are tweeting about having the condition themselves. Instead, they are sharing news or using the related terms to mean something other than the health condition. It is likely that no one factor accounts for this; a variety of factors, including word ambiguity, omission of synonyms, stigma about the disease, the geographic location and demographics of Tweeters, and differing levels of government and NGO involvement with each disease, all affect the tweet rate. In ongoing work, we are studying how to correct for biases introduced by these and other factors.
Note to Table 2: US population 18 years and older.
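The keyword search and count described above can be sketched as follows. This is a hedged illustration rather than the collection procedure actually used: it runs over an in-memory list of invented tweets instead of querying Twitter, and the function names and keywords are assumptions made for the example.

```python
import re


def count_keyword_tweets(tweets, keywords):
    """Count the tweets that contain each keyword (case-insensitive, whole phrase)."""
    counts = {}
    for keyword in keywords:
        pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        counts[keyword] = sum(1 for tweet in tweets if pattern.search(tweet))
    return counts


def rank_by_count(counts):
    """Rank keywords from most to least mentioned; ties are broken alphabetically."""
    ordered = sorted(counts, key=lambda keyword: (-counts[keyword], keyword))
    return {keyword: position + 1 for position, keyword in enumerate(ordered)}


# Invented example tweets (hypothetical data, not collected from Twitter).
tweets = [
    "My anxiety is through the roof before this exam",
    "Exam anxiety again",
    "New arthritis guidelines released today",
    "That movie nearly gave me a heart attack",
]
keywords = ["anxiety", "arthritis", "heart attack"]

counts = count_keyword_tweets(tweets, keywords)
ranks = rank_by_count(counts)
for keyword in keywords:
    print(f"{keyword}: {counts[keyword]} tweet(s), rank {ranks[keyword]}")
```

The last invented tweet is counted under "heart attack" although it describes no cardiac event, and the arthritis tweet reads like an organizational announcement rather than a personal report. A plain keyword count cannot tell these cases apart, which is the limitation the paragraph above describes.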
Calling the Crowd to Action
While much of our work has been focused on mining social media data, there are other ways to employ Internet users to help solve public health–related challenges, for example, through crowd-sourcing. The Internet provides access to millions of users who can potentially answer a call for action, as has been demonstrated by the success of crowd-sourcing projects in many areas, including health challenges. As mentioned above, we see the opportunity for public health officials to move from simple surveillance to using the power of crowd-sourcing to collect public health data.38–58 During a recent literature review, we found that in addition to surveillance, crowd-sourcing was frequently used for problem solving, data processing, and surveying. 59
Crowd-sourcing has been used for data processing in a wide range of health-related tasks, including classifying polyps in computed tomography colonography images 54 and then providing feedback to help optimize presentation of the polyps 53 ; annotating public webcam images to determine how the addition of a bike lane changed the modes of transportation observed in the images 57 ; and examining red blood cells for the presence of infection51,52 or thick blood smears containing malaria parasites (Plasmodium falciparum). 50 In one study, workers on Amazon.com's Mechanical Turk were surveyed for malarial symptoms to assess the prevalence of malaria in India. 46 Another study provided a mobile phone application that allowed users to report potential flulike symptoms along with GPS coordinates and other details. The response data enabled researchers to chart an incidence of flu symptoms that matched Centers for Disease Control data relatively well. 40
Crowd-sourcing can be used both as a way of gathering public health data and as a way of getting “crowd-sourced workers” (e.g., Mechanical Turk) to sift through and locate health data. In our work, we sought to determine the feasibility of using mobile workforce technology to validate locations of automated external defibrillators (AEDs), which are an emergency public health resource. We developed a crowd-sourcing application, the MyHeartMap Challenge, to organize the public reporting of AED locations throughout a major U.S. metropolitan area. This study had three purposes. First, we wanted to investigate the capacity of crowd-sourcing and social media for collecting meaningful public health data regarding an underutilized health-related technology. Second, we wanted to determine the locations of existing AEDs and build a serviceable inventory of AEDs within a defined region for use by laypeople and municipal service providers during life-threatening emergencies. The study provided a baseline snapshot of AED locations at a particular point in time. This will serve as the foundation for updating and maintaining a database of the devices over time. The third purpose was to evaluate the survey process of data collection itself, including the demographics and motivations of participants who submitted the crowd-sourced information, as well as the validity of the data submitted. Although we used the crowd, we noted that as with other Internet studies, participants were demographically limited. A major challenge when calling a crowd to action is incentivizing participation for a survey population with certain health conditions from across all walks of life. Nevertheless, despite its problems, the crowd-sourcing of health information presents tremendous opportunities, since the available survey population is still much larger than the traditional focus groups that were employed for health-related studies in the past.
The Future Is Intervention
What should we expect in the near future? Certainly, there will be further advances in healthcare surveillance methodology that integrates information from disparate sources such as Tweets, Facebook posts, medical records, purchases, and cell phone data. The forms in which data are available are also diversifying as patients increasingly gather health information from sources such as YouTube videos and their personal electronic medical records, and self-monitor their health behaviors using devices such as Nike wristbands or other medical measuring devices that are linked to smart phones. Additionally, we expect crowd-sourcing to play a major role in gathering health information. The data generated will be useful to both researchers and individuals. Researchers will better understand patients and patients will better understand themselves as they become more proactive about their health.
The biggest change, however, will be the shift from merely monitoring people's activities to actually using this information to induce behavioral changes that can impact individual health-related practices. Many of the most actionable health issues involve individual behaviors that can be modulated by feedback and social influence; these include exercise, obesity, smoking, drunk driving, lack of medication compliance, and seeking treatment for problems such as depression. Access to a wealth of personal health information, together with the ability to deliver interventions via cell phones or social networking sites, opens up a multitude of ways to improve the health-related behaviors of the general population.
Over the last decade, the doctor–patient relationship has shifted. Patients now routinely use the Internet to obtain medical information as well as a second (or sometimes first) opinion on their healthcare options. For example, upon learning that a relative, such as one's mother, has cancer, a common first response is to Google the illness in order to understand the treatment options and potential outcomes. Patients then bring this knowledge, factual or not, to their next meeting with their doctor. While patients generally perceive physicians and other clinicians as highly credible and influential sources of health-related information, it is believed that people are also highly influenced by the opinions of friends and by information obtained from the Internet, whether or not these can be verified. The effect of these often nonprofessional opinions can be misinformation. This observation becomes even more significant when one compares the amount of time the average person spends in a clinical setting in direct communication with a health professional with the amount of time s/he spends communicating with other people. Most individuals spend less than 2 hours a year with a physician, compared with the 5,000 hours a year spent in communication with others. Because repetition and convenient access make information more likely to be retained (the spacing effect), nonclinical methods of imparting health information are likelier to have an effect than visits to a clinician, despite the latter's greater authority. Therefore, it is critical to provide reliable health information on the Web for patients.
This use of the Internet for health information goes beyond the management of one's health that has typically been the doctor's purview: people want to know not only how to best treat illnesses, but also, increasingly, how to be healthier and happier in general. For example, research has overwhelmingly shown that exercise has significant health benefits, as do being happy and having good relationships. This being the case, it is evident that attaining positive health outcomes involves a host of small daily decisions, many of which can be supported through mechanisms such as phone and social network reminders and support groups. The move from healthcare surveillance to actually helping people take control of their health presents healthcare professionals with a plethora of exciting opportunities. Data mining will play a crucial role in this effort by helping to determine which interventions are effective, at which times, and for which people. Further refinement of data mining abilities will doubtless increase the possibilities, and it will then be possible, thanks to these data, not only to see which interventions work, but also to plan new ones with a higher likelihood of success.
Acknowledgments
We would like to thank our many collaborators and research assistants on our prior work discussed in this article. The prior work was supported by the National Library of Medicine (RC1LM010342) and K23 grant 10714038. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Library of Medicine or the National Institutes of Health. The funding source did not play any role in the study design; in the collection, analysis, and interpretation of data; in the writing of the manuscript; or in the decision to submit the manuscript for publication.
Disclosure Statement
No competing financial interests exist.
