Abstract
The design and reporting of data-driven studies seeking to measure misinformation are patchy and inconsistent, and these studies rarely measure associations with, or effects on, behaviour. The consequence is that data-driven misinformation studies are not yet useful as an empirical basis for guiding when to act on emerging misinformation threats, or for deciding when it is more appropriate to do nothing to avoid inadvertently amplifying misinformation. In a narrative review focused on examples of health-related misinformation, we take a critical perspective on data-driven misinformation studies. To address this problem, we propose a curated and open library of misinformation examples and describe its structure and how it might be used to support actionable surveillance. We draw on experiences with other curated repositories to speculate on the likely challenges related to achieving critical mass and maintaining data consistency. We conclude that an open library of misinformation could help improve the consistency of data-driven misinformation study design and reporting, as well as provide an empirical basis from which to make decisions about how to act on new and emerging misinformation threats.
This article is part of a special theme on Studying the COVID-19 Infodemic at Scale. To see a full list of all articles in this special theme, please click here: https://journals.sagepub.com/page/bds/collections/studyinginfodemicatscale
Background
Misinformation studies have been largely centred on politics, the environment and health. Vaccination is a current and important area where misinformation and misinformed beliefs are common. In 2018, Larson (2018) described how misinformation has the potential to erode trust in vaccines, predicting that a lack of access to vaccines would not be the reason for the next major outbreak. This prediction may not have matched the reality of the global pandemic in 2020, but misinformation may still be an important barrier to pandemic management once vaccines are widely available.
In what follows, we define misinformation as any testable claim that is contrary to current evidence, noting that evidence can change and that where evidence is emerging, discordant or where a claim cannot be tested, misinformation often relies on experts and expertise (Vraga and Bode, 2020b). Examples from health include claims that SARS-CoV-2 was engineered in a laboratory (Kupferschmidt, 2020), or spinach is a good source of nutritional iron (Rekdal, 2014). This definition considers examples of misinformation at a level of granularity similar to what is used by the fact-checking website Snopes, which has been used as a dataset for studies in fake news and misinformation (Popat et al., 2017; Torabi Asr and Taboada, 2019). We include disinformation and fake news as types of misinformation where they include testable claims but exclude others like negative but factual representations of vaccines.
Misinformation is assumed to have a substantial detrimental impact on health but estimating harms associated with misinformation exposure remains a key challenge in the area. Retrospective reviews of cases such as the false link between the MMR vaccine and autism (Deer, 2011) or the use of bleach to treat COVID-19 clearly show that health misinformation can lead to harm, including reducing vaccine coverage (Smith et al., 2008) and what appears to have been an increase in poisonings from the misuse of drug treatments (Rivera et al., 2020).
But these cases also show that misinformation is dynamic and connections to harm are complicated. Evidence changes as it accumulates, claims are only revealed to be clearly false much later, and the claims themselves can adapt and change over time. They are also suggestive of the indirect and circuitous pathways between the presence of misinformation in online communities and its impact on behaviour. This makes it difficult to design studies that connect measurements of misinformation with measurements of impact, or to synthesise the data-driven misinformation studies to inform decision-making tools that can tell us if and when to act on new misinformation threats.
We present a critical view of current data-driven misinformation research and describe the current barriers to using this literature to inform when to act on emerging misinformation threats. We then propose the creation of a curated and open misinformation library, describing the set of characteristics that would need to be recorded consistently across misinformation examples for them to be useful, when synthesised, to inform actionable surveillance.
The problem: Misinformation studies are disconnected from actions
Data-driven misinformation studies
Some of the earliest data-driven studies on social media include a study of immunisation content on YouTube from 2007 (Keelan et al., 2007), and a content analysis of H1N1 and a description of mechanisms for spreading disinformation on Twitter from 2009 (Chew and Eysenbach, 2010; Chamberlain, 2010). There has since been massive growth in the number of published studies attempting to characterise the content of what social media users post, including a substantial number that study misinformation.
Most misinformation studies analysing social media data characterise how and to what extent misinformation is shared. The insights from these analyses are intended to help prioritise resources by focusing interventions on places or communities where misinformation is more common or spreads more easily. A major challenge in the area is the large number of studies that characterise misinformation by counting posts that include misinformation, groups where misinformation is shared, and the users who post misinformation. A common study design issue is making assumptions about the influence of misinformation after only sampling posts or groups that present misinformation (Johnson et al., 2020; Wang et al., 2019), without considering how often users engage with that content or measuring engagement with other relevant content that competes with it for attention. Like an inverse information deficit model, the flawed assumption is that the existence of misinformation is an indicator of impact regardless of how much other information is circulating, how often users are exposed to this other information, and the impact (if any) of this information on their behaviour. This is especially problematic given that studies estimating differences in exposure show that misinformation makes up a small proportion of what people see (Dunn et al., 2020; Grinberg et al., 2019).
Another issue comes from assuming or implying that belonging to a group or community means that a social media user also ascribes to the beliefs expressed by the most vocal users in the community (Johnson et al., 2020; Smith and Graham, 2019), which happens often when examining pages or groups on Facebook. Social media users watch and participate in communities for diverse reasons without necessarily sharing the opinions of the most vocal people in those communities. Studies that assume the volume of posts within a community represent the distribution of views can also draw incorrect conclusions if they are unable to observe or survey the attitudes of the users, especially where some users are responsible for most of the content and most users are responsible for very little.
While there is nothing inherently wrong with studies that count the number of misinformation posts and characterise the users that post misinformation, the problem appears when the results are used to speculate on exposure or impact without measuring it. This often manifests in data-driven social media studies as spin in the conclusions (Chiu et al., 2017), and incorrect assumptions about exposure and impact are then established in the literature and the public domain through citation distortion (Greenberg, 2009).
We think that data-driven misinformation studies that approach analysis from the perspective of the information consumer rather than the information producer can be especially useful for understanding the potential risks associated with misinformation. These kinds of studies focus on finding and tracking the spread of posts that include or link to misinformation, or model how misinformation examples spread and cluster across online communities without purposively sampling communities. Some make use of network structure information in social media platforms to observe how misinformation examples spread (Shao et al., 2018a, 2018b; Tambuscio et al., 2015; Vosoughi et al., 2018; Wang et al., 2019), as well as how promotion of misinformation examples can become concentrated within certain communities (Surian et al., 2016; Schmidt et al., 2017; Wu and Liu, 2018). Studies that construct models of population-level outcomes using measures of information exposure or engagement are extremely rare – examples include models of cardiovascular mortality and vaccine coverage (Dunn et al., 2017; Eichstaedt et al., 2015).
At the end of the spectrum closest to measuring impact, other data-driven studies focus on health attitudes and behaviours to look for associations between what individual social media users say or see on social media platforms and individual outcomes. These are often retrospective cohort study designs, where representations of a social media user’s information are used in models to explain or predict a health outcome from linked data such as validated questionnaires (De Choudhury et al., 2013), diagnosis codes (Eichstaedt et al., 2018) or voter registration (Grinberg et al., 2019). Because these studies connect an individual user’s information exposure or engagement with reliable measures of health outcomes, they provide a more direct link between misinformation and health outcomes and could be used to support new forms of digital interventions (Dunn et al., 2018).
Interventions
While data-driven studies can reveal the presence and spread of misinformation, intervention studies can provide insights into how to respond to misinformation. Prominent debunking interventions target individuals with specialised messaging and aim to alter their beliefs (Chan et al., 2017; Walter et al., 2020).
Evidence for debunking interventions comes primarily from experiments in psychology and communications, where the how, when, who and what of messages are varied and the beliefs and intentions of participants are measured as outcomes (Nyhan et al., 2014; Nyhan and Reifler, 2015). The constraints of the experimental designs needed to test for effects of interventions mean that most experiments only measure short-term changes in attitudes and beliefs, use artificial scenarios and test participants in isolation away from the usual social spaces where patterns of information consumption, attention, trust and timing matter. Evidence from these experiments currently forms the basis of guidelines and best practice on how to act on misinformation, which are mostly focused on interacting with individuals (Brewer et al., 2017; Lewandowsky et al., 2017, 2020).
Less well tested are the interventions deployed on social media platforms to limit the spread of misinformation by adding friction to sharing. For example, Twitter has hidden and flagged problematic posts, and in some cases has removed the option to retweet certain posts or share certain links, with the intention of limiting the spread of misinformation without suspending influential users. Other types of interventions that involve corrections have been tested in environments made to mimic social media interactions (Bode and Vraga, 2015, 2018; Vraga and Bode, 2020a). To date it remains unclear what flow-on effects the automated, platform-wide interventions might have on beliefs and outcomes in real world scenarios.
Wasted resources and unintended consequences
Because of the inconsistency in what is measured and the lack of evidence connecting measures of misinformation to measures of impact, data-driven studies are not yet especially useful for guiding when to intervene. Where data-driven studies speculate beyond their results about the importance or potential impact of misinformation, it can lead to problematic recommendations about intervening when it is not appropriate.
Deploying interventions for misinformation examples that are unlikely to have any impact on behaviours may represent a waste of resources. But beyond this, deploying interventions where they are not needed could also lead to unintended consequences. While it appears to be rare, debunking efforts may sometimes lead to backfire effects where misinformed beliefs become further entrenched (Betsch and Sachse, 2013; Ecker et al., 2020; Schmid and Betsch, 2019; Swire-Thompson et al., 2020). During the COVID-19 pandemic, misinformation experts have also raised concerns about the potential to unintentionally amplify misinformation by responding to it – giving oxygen to small fires that would otherwise extinguish on their own (Donovan, 2020).
Interventions that remove posts or deplatform users may also be less effective than alternatives in some circumstances. Recent evidence suggests that responding to misinformation where it appears on social media can reduce misperceptions among the audience of the misinformation content (Vraga and Bode, 2020a). Removing posts and deplatforming certain users may be a missed opportunity to have a positive influence on exactly the audience that is more susceptible to misinformation and for whom interventions would be well-targeted.
The solution: Consistent and complete reporting to improve synthesis
The heterogeneity of data-driven misinformation studies makes it hard to synthesise their results in ways that could be used to inform tools for actionable surveillance. Scoping reviews and systematic reviews could be used to map out gaps in evidence and estimate the prevalence and potential reach of misinformation examples but require consistently reported studies to be feasible. Meta-analysis techniques cannot be applied to such heterogeneous sets of studies to construct models that infer the likely impact of an emerging misinformation example from its observed characteristics. To bring the field to a point where actionable surveillance is possible, new approaches to consistently recording and tracking the impact of misinformation examples are now needed.
A library of misinformation examples
To address this gap, we propose an open library of misinformation. Its purpose is to map misinformation examples into a taxonomy by standardising how their epidemiology is observed and reported. Stakeholders most likely to submit examples include researchers and public health organisations, potentially with support from social media platforms. We envisage the library will accept submissions that are required to provide information for a set of common data elements, mapped to standard vocabularies where possible, and vetted for data consistency.
Common data elements that should be included in submissions as ‘minimum data’ include sources – the social media platforms or purposively sampled pages that have been searched or monitored for content. Like a formal search strategy in a systematic review, submissions should also include details of exactly what was searched for and when. A critical part of the taxonomy relates to the set of outcome measures used to characterise misinformation and its impact. Wherever possible, investigators should be encouraged to design studies that go beyond counting posts to include measures of information exposure and engagement, details of the audience population including location and demographics, social network structure and outcome measures related to behaviours and health outcomes.
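To make the minimum-data requirement concrete, a submission record could be represented as a small structured schema with a validation step. The sketch below is illustrative only: the field names and the validation rule are our assumptions about what a minimum-data check might look like, not a finalised standard for the proposed library.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical minimum-data record for a library submission.
# Field names are illustrative, not a finalised common data element set.
@dataclass
class MisinformationExample:
    claim: str                       # the testable claim being tracked
    sources: List[str]               # platforms or purposively sampled pages searched
    search_terms: List[str]          # exactly what was searched for
    search_period: Tuple[str, str]   # (start_date, end_date) of monitoring
    outcome_measures: List[str]      # e.g. post counts, exposure, engagement
    audience_location: Optional[str] = None      # optional audience detail
    audience_demographics: Optional[dict] = None  # optional audience detail

    def meets_minimum_data(self) -> bool:
        """Vetting step: accept a submission only if the core elements are present."""
        return bool(self.claim and self.sources and self.search_terms
                    and self.outcome_measures)
```

A curation workflow could reject or flag submissions where `meets_minimum_data()` is false, while still allowing optional elements such as audience demographics to vary across examples.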
For example, a data-driven misinformation study examining misinformation related to COVID-19 vaccines on Twitter could go beyond counting the number of posts that promote a vaccine conspiracy to include designs that focus on audiences – information from a survey about the intentions and beliefs of a sample of Twitter users linked to the proportion of relevant posts they might have been exposed to that promoted those conspiracies. Motivation for such a study may come from recent work showing that social media use is associated with lower rates of intention to vaccinate compared to traditional media use (Allington et al., 2021). Examples of outcome measures might include attitude measures using vaccine confidence scales and health-related outcomes such as vaccination status (Dyda et al., 2020).
To be useful, the library would also need to be initially populated with examples. These could initially be identified via a systematic review, where summary data from published articles are used to populate a set of examples if they meet the requirements for minimum data. Once established, ongoing submissions from the community of researchers undertaking data-driven misinformation studies could be encouraged by making submission relatively easy in terms of time costs and additional funding support for curation to ensure data consistency. Other ways to encourage submissions might include providing digital object identifiers for submissions to make attribution easy when the data are used directly in syntheses.
Using the library
A library with a critical mass of misinformation examples could be used to inform surveillance and new data-driven tools to guide decisions about when it is appropriate to act on new sources of misinformation as they emerge. We expect that users will be able to develop new methods and tools that learn from the characteristics of the examples in the library to model and predict the potential impact of emerging misinformation examples early in their trajectory. Stakeholders that could make use of the guidance to prioritise resources and guide the development of debunking interventions include social media platforms, governments and public health organisations.
Together with a set of tools for modelling the potential impact of new and emerging misinformation examples, the library could then be used to help prioritise efforts on misinformation threats that are more likely to cause harm. For example, resources could be prioritised for emerging misinformation examples where the characteristics, context and trajectory look most like the early trajectories of library examples that went on to exhibit broader spread, engagement and evidence of harm.
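One simple way to operationalise this matching of early trajectories against library examples is a nearest-neighbour estimate: score an emerging example by the recorded outcomes of the library examples it most resembles early on. The sketch below is a minimal illustration under assumed data; the feature values, harm scores and example names are entirely hypothetical, and a real tool would need validated features and far more examples.

```python
import math

# Hypothetical library entries: early trajectory features (e.g. normalised
# spread, engagement and audience-size measures) plus the harm eventually
# observed. All values here are invented for illustration.
library = [
    {"name": "example_A", "early_features": [0.9, 0.7, 0.8], "observed_harm": 0.9},
    {"name": "example_B", "early_features": [0.1, 0.2, 0.1], "observed_harm": 0.1},
    {"name": "example_C", "early_features": [0.6, 0.5, 0.7], "observed_harm": 0.6},
]

def predicted_harm(emerging_features, library, k=2):
    """k-nearest-neighbour estimate: average the observed harm of the k
    library examples whose early trajectories are most similar (Euclidean)."""
    ranked = sorted(
        library,
        key=lambda ex: math.dist(emerging_features, ex["early_features"]),
    )
    return sum(ex["observed_harm"] for ex in ranked[:k]) / k
```

An emerging example whose early features sit close to past high-harm examples would receive a high score and could be prioritised for intervention, while low-scoring examples might be left alone to avoid inadvertent amplification.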
Challenges and limitations
Other specialised repositories that record structured and semi-structured information about studies have been successful but faced challenges. For example, PROSPERO for systematic review protocols (Booth et al., 2012), and clinical trial registries like ClinicalTrials.gov (Tse et al., 2018; Zarin et al., 2011) are recognised for improving transparency and reducing reporting bias and redundancy (Stewart et al., 2012). Some of the challenges these databases face relate to data consistency and trying to fit a diverse set of examples into a single common structure (Booth et al., 2013). This suggests that a balance needs to be struck between allowing as many diverse examples as possible and driving the consistency and completeness of reporting by requiring strict adherence to a minimum set of common data elements.
Other examples of large databases that aggregate community contributions like Wikipedia and arXiv tend to be much broader and less structured, but researchers have also found new uses for the information, including prediction or modelling of outcomes such as citation counts, drug safety outcomes, or movie success (Bar-Ilan and Aharony, 2014; Davis and Fromerth, 2007; Ma and Weng, 2016; Mestyán et al., 2013; Moat et al., 2013). Though it might not seem similar, patient data stored in medical records are another example where structured and unstructured information are contributed by a decentralised group and then pooled for the purpose of decision support – including predicting risks by aggregating from the most similar past examples (Longhurst et al., 2014). Designing an open misinformation library with actionable surveillance in mind from the start should help avoid some of the major challenges that researchers have faced when trying to aggregate from messy sources of data for new purposes.
A limitation of the proposed library is that it can only guide users on whether to intervene, based on past experience; it may not extend to deciding which interventions are most likely to work for a given misinformation example.
Conclusion
Despite the massive growth in the number of data-driven misinformation studies available, study designs and reporting vary substantially in quality and many are not useful on their own for supporting actionable surveillance. Only a handful of social media studies have connected information about engagement or exposure to measures of behaviour or health outcomes. An open library of misinformation examples with a set of required data elements could be used as an empirical basis from which to infer or extrapolate about the potential reach and impact of emerging misinformation threats early in their trajectory.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
