Abstract
Highlights
The Making Numbers Meaningful project conducted a comprehensive systematic review of experimental and quasi-experimental research that compared 2 or more formats for presenting quantitative health information to patients or other lay audiences. The current article focuses on probability information.
Guided by a conceptual taxonomy, we classified studies according to the cognitive tasks required of participants and assessed 14 distinct possible outcomes.
Our review identified 316 articles involving probability communications that generated 1,119 distinct research findings, each of which was reviewed by multiple experts for credibility.
The overall pattern of findings highlights which probability communication questions have been well researched and which have not. For example, there has been far more research on communicating single probabilities than on communicating more complex information such as trends over time, and there has been a large amount of research on the effect of communication approaches on behavioral intentions but relatively little on behaviors.
1. Introduction
Much of the information patients need for decision making and disease self-management contains quantitative concepts, such as the chance of getting a disease and the harms and benefits of therapeutic options. Quantitative concepts can be challenging to understand and apply, especially for those with low numeracy or health numeracy. Because low numeracy is so prevalent, with nearly 30% of United States adults having difficulty working with fractions and decimals, 1 it is reasonable to assume that it is common in any patient or public audience.
To address the challenges of communicating numbers, a rich and diverse research literature has demonstrated that comprehension can be affected by the data presentation format, that is, the way in which a quantitative concept is conveyed. For example, for conveying probabilities, data presentation formats include numbers (e.g., 10%, 1 in 10, 10 in 100, and 0.10), graphics (such as icon arrays, bar charts, and number lines), and verbal descriptions of probability (such as “common” or “rare”).
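To make these format distinctions concrete, the following minimal sketch (illustrative only, not taken from any reviewed study) shows how a single probability could be rendered in several of the number formats named above; the function name and output labels are our own.

```python
def render_formats(p: float, denominator: int = 100) -> dict:
    """Render one probability in several common number formats (illustrative labels)."""
    return {
        "percent": f"{p * 100:g}%",                                               # e.g., "10%"
        "decimal": f"{p:g}",                                                      # e.g., "0.1"
        "1 in X": f"1 in {round(1 / p)}" if p > 0 else None,                      # e.g., "1 in 10"
        f"rate per {denominator}": f"{round(p * denominator)} in {denominator}",  # e.g., "10 in 100"
    }

print(render_formats(0.10))
# {'percent': '10%', 'decimal': '0.1', '1 in X': '1 in 10', 'rate per 100': '10 in 100'}
```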
The Making Numbers Meaningful project seeks to comprehensively review and synthesize the evidence on the impact of data presentation formats (numbers, words, or graphics) on perceptual, affective, cognitive, or behavioral outcomes across all domains of health and medical decisions. As described elsewhere, 2 this was necessary for several reasons. First, although review articles, expert panel summaries, and guidance instruments are available to provide guidance in specific domains or for specific purposes, no single project has integrated evidence across all types of data, data presentation formats, and clinical domains. Second, as part of this project, we have developed a standard conceptual model and terminologies that will help readers and researchers make sense of diverse research literature that uses different terminologies and approaches. 2 Third, the project is also producing a freely available interactive guide to help disseminate this newly generated evidence to professional communicators and others outside the research world.
Although the broader Making Numbers Meaningful project encompasses research on the communication of both probabilities and quantities (such as laboratory results, nutrition data, or quality-of-life metrics), the current article details the methods used for the systematic review and evidence synthesis for probability communication only. We consider a separate methods article to be necessary because the methods are detailed and because the project is producing several closely related results articles that share the same methods. In this methodology article, we describe the literature search and review methods, our novel classification and data extraction approaches, and the evidence synthesis methods we applied systematically to all articles reporting methods for communicating probabilities. We also summarize the outcomes and data presentation format comparisons to provide a snapshot of the relative volume and scarcity of research examining different types of health probability communication questions.
2. Foundational Work
Our guiding conceptual model for this research represents quantitative communication (including communication of probabilities) as the presentation of a stimulus to a respondent, who performs cognitive tasks while making sense of the stimulus, resulting in perceptual, affective, cognitive, or behavioral responses that can be measured with outcome measures such as questionnaire instruments or observations (Figure 1). We are interested in the effect of data presentation format on these outcomes, and therefore, we consider only stimuli that consist of quantitative data such as probabilities represented in 1 or several data presentation formats.

In this conceptual model, people perform cognitive tasks to make sense of stimuli, resulting in perceptual, affective, cognitive, and behavioral responses that can be measured with outcome measures. Types of data structures, data presentation formats, tasks, and outcomes are described in section 2 of this article.
Applying this model to the literature was challenging because, as described in Ancker et al., 2 vocabularies used in the research literature were ambiguous and redundant. Therefore, we created 4 taxonomies to classify the literature in line with this model, ensuring that we were using standard terminologies and comparing similar studies with each other. The first taxonomy is the list of possible data structures (such as absolute probabilities versus probability comparisons and probabilities measured at 1 time/time interval versus probabilities measured multiple times over time). The second is the taxonomy of data presentation formats (including number formats such as percentages and frequencies, different data graphics, and verbal descriptions of probabilities). These 2 taxonomies are used to describe the stimuli in the studies. These taxonomies ensure that, for example, “pictograph” and “icon array” are recognized as synonyms for the same graphic format, and that “1 in X” (for example, 1 in 20) and “rate per 100” (for example, 5 in 100) are recognized as 2 different frequency formats. 2
Our next taxonomy is the list of cognitive tasks conducted by a reader to make sense of a stimulus and extract information from it. For example, to answer a question about perceived risk from some graphic, the reader may examine a single point on the graphic; we label this task a “point task.” However, to assess the effectiveness of a treatment, the reader may need to perform the more complex cognitive task of evaluating the difference between 2 points on the graphic; we call this a “difference task.” The task taxonomy enumerates the cognitive actions performed on the stimulus as the reader seeks the answer to some informational question, allowing us to distinguish between simple and more complex tasks.
We also developed a taxonomy of outcomes, that is, measures of the construct that the research was designed to affect. Outcomes include recall, health behavior, probability perceptions (when responses to a probability are measured on a quantitative scale such as one ranging from very small to very large), and probability feelings (when responses to a probability are measured on an affective scale capturing, for example, worry or concern). 2 This taxonomy allows a classification that demonstrates how a single data presentation format can be good for one purpose and bad for another. For example, this classification allows us to demonstrate that formatting a probability as 1-in-X (e.g., 1 in 10) rather than a rate per 10ⁿ (e.g., 100 in 1,000) tends to increase readers’ perception of the size of the probability. 3 It also allows us to capture the fact that presenting several probabilities in 1-in-X format makes it more difficult for readers to identify the larger of 2 quantities. 4 Tasks and outcomes are closely linked; some outcomes are measured after the reader examines a small part of a stimulus, whereas other outcomes are linked to tasks that involve consideration of several portions of the stimulus or the entire stimulus holistically.
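One way to picture these 4 taxonomies is as controlled vocabularies attached to each study record. The sketch below is our own illustration, with deliberately partial category lists drawn from the terms in this section; it is not the project's extraction software or its complete taxonomy.

```python
from dataclasses import dataclass
from enum import Enum

class DataStructure(Enum):            # taxonomy 1 (partial)
    ABSOLUTE_PROBABILITY = "absolute probability"
    PROBABILITY_COMPARISON = "probability comparison"

class PresentationFormat(Enum):       # taxonomy 2 (partial)
    PERCENT = "percent"
    ONE_IN_X = "1 in X"
    ICON_ARRAY = "icon array"
    VERBAL = "verbal probability"

class Task(Enum):                     # taxonomy 3 (partial)
    POINT = "point task"
    DIFFERENCE = "difference task"

class Outcome(Enum):                  # taxonomy 4 (partial)
    PROBABILITY_PERCEPTION = "probability perception"
    EFFECTIVENESS_PERCEPTION = "effectiveness perception"
    IDENTIFICATION_RECALL = "identification/recall"

@dataclass
class Stimulus:
    data_structure: DataStructure
    formats: list[PresentationFormat]

# Example: a stimulus stating a single risk as "1 in 10", with perceived risk measured afterward
stimulus = Stimulus(DataStructure.ABSOLUTE_PROBABILITY, [PresentationFormat.ONE_IN_X])
measurement = (Task.POINT, Outcome.PROBABILITY_PERCEPTION)
```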
More details about these taxonomies are presented below as part of the systematic review.
3. Systematic Review
Guided by these taxonomies, we conducted a systematic literature review of the research on communicating health-related probabilities and amounts to nonexpert audiences.
3.1 Literature Search and Screening
We conducted a systematic literature search following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement (PROSPERO registration #CRD42018086270). Professional librarians constructed a search strategy including selected vocabulary (the search strategy is available in the Methodology Files folder at MNM Project at OSF: https://osf.io/rvxf2/) and applied it to MEDLINE, Embase, CINAHL (EBSCO), the Cochrane Library (Wiley), PsycINFO (EBSCO), ERIC (ProQuest), and the ACM Digital Library, from inception to September 10, 2020. In addition, we hand-searched the tables of contents for the 4 most common journals occurring in the initial results (Medical Decision Making, Patient Education and Counseling, Risk Analysis, and Journal of Health Communication), starting in 2008 because this appeared to be a year when the numbers of relevant citations increased. Inclusion criteria were 1) original research articles that 2) presented quantitative health information, 3) compared 2 or more data presentation formats for this information, 4) included adult patients or laypeople as participants, and 5) quantitatively measured 1 or more perceptual, affective, cognitive, or behavioral outcomes. Verbal information was eligible for inclusion only if it was a verbal description of a probability (such as “rare,” “common,” or “unlikely”) or a magnitude (“large”). Because of our focus on the effect of data presentation format, we excluded research that assessed the effects of presenting different sorts of information to the viewer; we therefore excluded studies that compared arms with different reference classes and those that compared information and no-information arms (such as pre-post assessments of knowledge with a decision aid). As a result, we can draw conclusions about the effects of data presentation format but cannot draw broader conclusions about the effects of providing information or the effects of providing different information.
Literature search results were imported into the Covidence web app (Covidence.org, Melbourne, Australia). A pool of 6 reviewers screened titles and abstracts using Covidence, with 2 independent reviews for each title and abstract. This process resulted in a set of articles for full-text review, each of which was also reviewed by 2 independent reviewers. At both stages, disagreements were resolved by consensus. No language restrictions were imposed; we had 4 publications professionally translated before full-text review. (However, we excluded studies [n = 1] of verbal probability terms when the stimulus terms were not in English. This was because we were not confident that terms would be reliably translated to the same verbal probability term in English or that perceptual, affective, cognitive, and behavioral responses to a word in another language would be the same as the responses to the translated term.) For all included articles, we also retrieved and screened reference lists and citing articles from Scopus.
3.2 Data Extraction
We constructed a custom online data extraction instrument (Qualtrics.com, Seattle, WA) with branching logic to apply our taxonomy concepts to different types of research (the data extraction instrument is available in the Methodology Files folder at MNM Project at OSF: https://osf.io/rvxf2/). Three researchers (J.S.A., N.C.B., M.M.S.) independently conducted data extraction; each extraction record was reviewed for accuracy twice, first by a second member of the team during data extraction and second by J.S.A. and B.J.Z.-F. during the construction of the evidence table. When a single publication described 2 or more studies, each of these studies was assessed for eligibility and (if included) underwent its own data extraction. This was necessary because studies in a single publication often had different samples, data presentation formats, and outcome measures.
As described below, the data extraction allowed for classification of the stimulus (containing the data and the data presentation formats), the task required to extract the relevant information from the stimulus, and the outcome measures.
3.2.1 Participants, settings, study characteristics
For each study, we extracted sample size, a description of the sample (e.g., Internet panel, patients recruited from a cancer clinic), and whether or not education, literacy, and numeracy were measured as covariates. Any statistically significant interactions between data presentation format and education, literacy, or numeracy were noted in free text (Appendix 2).
3.2.2 Structure of data in the stimulus
Data were classified as probability (e.g., chance of getting a disease or chance of side effects) or quantity (e.g., laboratory results or environmental contaminants). The current methods article describes only the probability class of research and introduces a series of results articles on probability communication. (Quantity articles will be published separately.)
As shown in Table 1 and Ancker et al., 2 stimuli could include variables assessed at 1 time or time interval (such as the 10-y chance of a cardiovascular event), variables assessed multiple times over time (such as survival curves or time trends), or sets of option attributes intentionally presented together to provide standardized information about choice options (such as the chance of benefit and the chance of harm for several therapeutic options). In addition, the data in the stimulus could be absolute probabilities (such as the chance of an event for an individual or population) or calculated probability comparisons (such as absolute risk differences, relative risk differences, relative risks, odds ratios, or numbers needed to treat or harm).
Taxonomy of Data Structures for Stimuli in Studies
Probability comparisons include both absolute and relative differences. We classified a data structure as a probability comparison if the comparison was computed for the reader (e.g., as an absolute risk difference, relative risk difference, relative risk, odds ratio, number needed to treat, or number needed to harm). We classified it as a pair of absolute probabilities if the probability comparison was not computed (e.g., “This treatment reduces the chance of recurrence from 12% to 8%”).
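As a worked illustration of the distinction, the sketch below derives the computed probability comparisons from the pair of absolute probabilities in the example above (recurrence falling from 12% to 8%); the helper function is ours, and the labels are standard epidemiological abbreviations.

```python
def probability_comparisons(p_without: float, p_with: float) -> dict:
    """Derive common computed comparisons from two absolute probabilities."""
    ard = p_without - p_with            # absolute risk difference
    rr = p_with / p_without             # relative risk
    rrr = 1 - rr                        # relative risk reduction
    nnt = 1 / ard                       # number needed to treat
    return {"ARD": ard, "RR": rr, "RRR": rrr, "NNT": nnt}

# "This treatment reduces the chance of recurrence from 12% to 8%"
print(probability_comparisons(0.12, 0.08))
# ARD ≈ 0.04 (4 percentage points), RR ≈ 0.67, RRR ≈ 0.33, NNT ≈ 25
```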
3.2.3 Data presentation format in the stimulus
For each study, we classified the data presentation formats used in the stimuli as numerals, graphics, or words. Each stimulus could contain 1 or more data presentation formats.
We also captured an enumerated list of modifications and manipulations commonly applied to data presentation formats.
3.2.4 Data presentation format comparisons
As specified in the inclusion criteria, all studies involved 1 or more head-to-head comparisons of different data presentation format groups. For example, a factorial design might contrast viewer perceptions of risk from 3 stimuli: a list of numbers, an icon array, and a bar chart. In this case, comparisons could be made between the numerical format and the 2 graphical ones or between the 2 types of graphics. On the basis of the study design, we placed each study record into 1 or more of these data presentation format comparison groups:
Numbers versus numbers
Graphics versus graphics
Numbers versus graphics
Numbers versus verbal probabilities
Context manipulations
Framing manipulations
Uncertainty manipulations
Denominator or unit manipulation
Animation and interactivity interventions
Time period manipulations
Within these larger comparison groups, we used the more specific information about the type of number, type of graphic, and so forth described in section 3.2.3 above for subgroups in the evidence synthesis.
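The sketch below illustrates, under our own simplifying assumptions, how the pairwise comparison groups could be enumerated for the factorial example above (a list of numbers, an icon array, and a bar chart); it is not the project's actual classification code.

```python
from itertools import combinations

# Each stimulus arm tagged with the broad family of its data presentation format
arms = {
    "list of numbers": "numbers",
    "icon array": "graphics",
    "bar chart": "graphics",
}

CANONICAL_ORDER = ["numbers", "graphics", "verbal probabilities"]

def comparison_group(family_a: str, family_b: str) -> str:
    """Name the comparison group for a pair of format families."""
    a, b = sorted([family_a, family_b], key=CANONICAL_ORDER.index)
    return f"{a} versus {b}"

for (arm_a, fam_a), (arm_b, fam_b) in combinations(arms.items(), 2):
    print(f"{arm_a} vs. {arm_b}: {comparison_group(fam_a, fam_b)}")
# list of numbers vs. icon array: numbers versus graphics
# list of numbers vs. bar chart: numbers versus graphics
# icon array vs. bar chart: graphics versus graphics
```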
3.2.5 Tasks
As described above, a reader conducts a cognitive task to make sense of a stimulus (its data structure and data presentation format) and extract information. Broadly, sometimes a participant focuses on only a small part of a stimulus, whereas other times the participant may consider several portions of the stimulus or the entire stimulus holistically.
For example, consider a bar chart representing the chance of disease with and without a vaccine (an absolute probability data structure). A viewer interested in learning about the chance of disease without the vaccine would perform the cognitive task of locating and examining the disease chance without vaccination. As this involves only a single element of the stimulus, we term this a point task.
Alternately, the viewer could use the same bar chart to estimate how effective the vaccine is by performing the cognitive task of contrasting the disease chances with and without vaccination and evaluating the difference between them. As this task involves examining the difference between numbers, we term this a difference task (Table 2).
Task Definition and Examples
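With made-up numbers for the vaccine example, the difference between the two tasks can be shown in a few lines; the values below are purely hypothetical.

```python
# Hypothetical chances of disease shown in the bar chart (made-up values)
chance_of_disease = {"without vaccine": 0.20, "with vaccine": 0.05}

# Point task: locate and read a single element of the stimulus
chance_unvaccinated = chance_of_disease["without vaccine"]                       # 0.20

# Difference task: relate two elements to judge how effective the vaccine is
absolute_reduction = chance_of_disease["without vaccine"] - chance_of_disease["with vaccine"]  # 0.15
relative_reduction = absolute_reduction / chance_of_disease["without vaccine"]   # 0.75
```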
3.2.6 Outcome measures
As described in Ancker et al., 2 we classified the constructs measured in the study in an information-processing framework. Because “gist understanding” was operationalized in different ways by different researchers, we avoided the term and instead identified the specific outcome in the study that was used to assess understanding (e.g., identifying which of 2 treatments had lower risk or assessing whether a value was in the above-average category or the below-average category).
For each outcome, we recorded the effects of each data presentation format, with nonsignificant differences considered ties. We also wrote a free-text summary of the effect sizes (Table 3).
Outcome Measures
Articles were frequently unclear about whether the participants answered the questions while viewing the stimulus or after the stimulus was removed. As a result, we were forced to combine identification and recall outcomes in the evidence tables and results publications.
Studies of actual health behaviors were so rare that we combined them with the behavioral intention studies in the evidence tables and results publications.
The outcome question presented to the participant would influence which cognitive task the participant needed to perform. For example, as described above, a question asking for a probability perception about a single absolute probability would prompt the participant to pay attention to one probability and perform a point task, whereas a question about how effective a treatment appeared (an effectiveness perception outcome) would prompt the participant to evaluate the size of the treatment effect via a difference task. We linked outcomes to tasks as follows:
People asked to report probability perceptions or probability feelings conduct point tasks to assess absolute probabilities.
People asked to report effectiveness perceptions and effectiveness feelings conduct a difference task to assess either probability comparisons (such as relative risk or absolute risk difference) or the relationship between 2 absolute probabilities (such as a pair of before- and after-treatment probabilities).
People asked questions about behavioral intention will perform a point task if given information about a single absolute probability. However, if given either a probability comparison data structure (e.g., a set of relative risks) or an absolute probability data structure with variables from several populations, they conduct a difference task of comparing the probabilities.
People asked questions about preference for or trust in the information will conduct a point task if they have access to an absolute probability data structure with a single probability. However, with either a probability comparison data structure or an absolute probability data structure with variables from several populations, they conduct a difference task of comparing the probabilities.
The identification-recall, contrast, categorization, discrimination, and computation outcomes could involve either point tasks or difference tasks, based on which information needed to be extracted from the stimulus. For example, finding the larger of 2 absolute quantities is a contrast outcome requiring a point task of looking at 2 absolute probabilities, but finding the larger of 2 absolute difference statistics is a contrast outcome requiring a difference task of looking at 2 probability comparisons. As another example, if a participant was given a stimulus presenting an absolute probability data structure and asked to compute some outcome, the task would be a point task if individual rates were needed for the calculation or a difference task if the inputs were probability comparisons (e.g., relative risk reduction statistics).
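The linkage rules above can be read as a small decision procedure. The sketch below restates them as a function using our own shorthand labels; it simplifies to point versus difference tasks and is not the project's extraction instrument.

```python
def task_for(outcome: str, data_structure: str, n_probabilities: int = 1) -> str:
    """Map an outcome and stimulus data structure to the implied cognitive task,
    following (in simplified form) the rules listed above."""
    if outcome in {"probability perception", "probability feeling"}:
        return "point task"
    if outcome in {"effectiveness perception", "effectiveness feeling"}:
        return "difference task"
    if outcome in {"behavioral intention", "preference", "trust"}:
        if data_structure == "absolute probability" and n_probabilities == 1:
            return "point task"
        return "difference task"
    # identification-recall, contrast, categorization, discrimination, computation:
    # depends on whether the needed inputs are absolute probabilities or comparisons
    return "point task" if data_structure == "absolute probability" else "difference task"

assert task_for("probability perception", "absolute probability") == "point task"
assert task_for("behavioral intention", "probability comparison") == "difference task"
assert task_for("computation", "probability comparison") == "difference task"
```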
3.2.7 Findings
Our data extraction procedure resulted in multiple findings for each study. Each finding was defined uniquely by a task, an outcome, and a data presentation format comparison. For example, imagine a study contrasting icon arrays, bar charts, and 1-in-X numbers to illustrate the chances of colon cancer with and without regular screening, with measurements of the effect of the stimulus on the perceived risk of colorectal cancer and the perceived effectiveness of screening. Data extraction would produce 4 findings from this study (Table 4).
Four Findings from a Hypothetical Study Comparing Numbers and Graphics
Findings would then be placed in the evidence table defined by the unique combination of task and outcome, with subtables for each data presentation format comparison. Thus, findings 1 and 2 in Table 4 would go into the point task × probability perception table (and its subtables for graphics v. graphics and numbers v. graphics), and findings 3 and 4 would be allocated to the difference task × effectiveness perception table (and its subtables for graphics v. graphics and numbers v. graphics).
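A minimal sketch of how the 4 hypothetical findings in Table 4 could be represented and routed to their evidence tables and subtables; the field names are illustrative and do not reflect the project's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    task: str
    outcome: str
    comparison: str   # data presentation format comparison group

findings = [
    Finding("point task", "probability perception", "graphics versus graphics"),
    Finding("point task", "probability perception", "numbers versus graphics"),
    Finding("difference task", "effectiveness perception", "graphics versus graphics"),
    Finding("difference task", "effectiveness perception", "numbers versus graphics"),
]

# Evidence table = unique (task, outcome) pair; subtable = format comparison group
evidence_tables: dict = defaultdict(lambda: defaultdict(list))
for f in findings:
    evidence_tables[(f.task, f.outcome)][f.comparison].append(f)

print(len(evidence_tables))                                             # 2 evidence tables
print(len(evidence_tables[("point task", "probability perception")]))  # 2 subtables
```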
3.3 Risk-of-Bias Assessment
We adapted risk-of-bias criteria from the AHRQ Methods Guide for Effectiveness and Comparative Effectiveness Reviews 5 and the Cochrane Handbook for Systematic Reviews of Interventions. 6 For each study, pairs of reviewers independently evaluated and scored sample selection (with the highest scoring for educationally diverse or ecologically valid samples), randomization, protocol deviations, whether demographics and other covariates were reported, missing data, and presence of other potential biases (Appendix 3). Scoring conflicts were resolved in consensus meetings. This process allowed us to classify each study as having low, moderate, or high risk of bias. The study risk of bias (S-ROB) rubric is available in the Methodology Files section of the MNM Project at OSF (https://osf.io/rvxf2/).
3.4 Evidence Synthesis
3.4.1 Grouping findings into evidence tables
This set of data extraction tasks resulted in multiple findings for each article as portrayed in Figure 2.

After articles were identified for inclusion, data extraction was performed on each study within the article as described above, producing multiple findings per article. Data extraction steps, assessment/evaluation steps, and classification steps were performed at different stages of the analysis.
To synthesize the evidence, we grouped the findings into evidence tables united by task and outcome, containing subtables for each data presentation format comparison. In other words, each unique combination of task and outcome was a unique table, containing several subtables for different data presentation format comparisons.
This produced, for example, a point task × probability perception evidence table, with a graphics versus graphics subtable that assembles all findings about the effects of different graphics upon the estimated probability of single absolute probabilities, a numbers versus graphics subtable that assembles findings about comparisons between number formats and graphic formats, and a numbers versus numbers subtable that contains only comparisons between different number formats. Similarly, the point task × behavioral intention table assembles findings from studies about formats for single absolute probabilities in their effect upon behavior and behavioral intentions, with subtables for numbers v. numbers, numbers v. graphics, graphics v. graphics, numbers v. verbal probabilities, and so on.
3.4.2 Credibility assessment of findings
Because the ROB assessment at the study level was not granular enough to be applied to each of up to several dozen distinct findings from that study, we implemented a systematic evaluation process based on the judgments of 2 expert reviewers with more than 35 y of combined experience conducting research on communicating numbers in health (authors J.S.A. and B.J.Z.-F.). A third reviewer (N.C.B.) substituted for one of the other two in cases of coauthorship or other conflict of interest.
For each finding in each evidence table, the experts independently summarized the takeaway and assigned a credibility score based on the risk of bias, the sample size, potential for confounding, and the face validity of the comparisons. Face validity judgments assessed whether the experimental stimuli were comparable to each other and embodied the factors claimed to have been tested. Credibility ratings could be reduced by lack of hypothesis testing for a specific comparison, noncomparable study arms (such as one stimulus containing more extensive information about a risk than its comparator), or mismatches between measurements and stimuli (e.g., stimuli that did not give enough information to allow respondents to answer the questions asked). Typically, a finding would receive a high credibility score if it had low risk of bias, more than 100 participants per arm, stimuli with high face validity, and statistically significant results by credible hypothesis-testing approaches. A finding might receive a low credibility score if a failure to find statistically significant differences could be due to inadequate sample size, if there was confounding of several data presentation format manipulations that reduced confidence that the effect could be attributed to any one of them, or if a particular comparison of interest was a secondary outcome that was not subjected to hypothesis testing.
After independent ratings, the experts reviewed each other’s takeaways to come up with a consensus takeaway, with differences resolved by discussion. This process also resulted in regular methodological discussions to ensure consistent consideration of different factors in credibility judgments. The reviewers sometimes changed their credibility scores after the discussion but continued to assign individual credibility scores, which were averaged for the final score on each finding.
Because of this process, a single study often produced findings with different credibility scores. For example, in a well-designed and adequately powered study, the finding from the primary comparison might receive a high credibility score, but the finding for a secondary comparison not subjected to hypothesis testing might receive a low credibility score. For each finding, average credibility scores across reviewers from 0 to 4 were labeled low credibility, 4.5 to 6.5 moderate, and 7 to 10 high credibility. Low-credibility findings were reported but did not contribute to the evidence synthesis.
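A minimal sketch of the scoring arithmetic just described: 2 reviewers' 0-to-10 credibility scores are averaged and the average mapped to a label using the cut points in the text. The function name is ours.

```python
def credibility(score_reviewer_1: int, score_reviewer_2: int) -> tuple[float, str]:
    """Average two 0-10 credibility scores and label the result
    (averages fall in 0.5-point steps, so the cut points below are exhaustive)."""
    avg = (score_reviewer_1 + score_reviewer_2) / 2
    if avg <= 4:
        label = "low"        # reported but excluded from the evidence synthesis
    elif avg <= 6.5:
        label = "moderate"
    else:
        label = "high"
    return avg, label

print(credibility(8, 9))   # (8.5, 'high')
print(credibility(5, 6))   # (5.5, 'moderate')
print(credibility(3, 4))   # (3.5, 'low')
```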
3.4.3 Evidence synthesis, consistency, and strength within evidence tables
For each evidence table, we created evidence syntheses that summarized the relevant findings. For each synthesis, we assessed consistency of the evidence (adapted from the GRADE criteria) as high (all the studies in a group point toward the same direction or show no effect), moderate (most show an effect in the same direction or no effect), or low (effects in different directions). Consistency was not applicable to categories that included only 1 finding.
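The consistency rule can be sketched as follows; the +1/−1/0 direction coding and the majority rule for "most" are our own operationalization of the description above.

```python
def consistency(directions: list[int]) -> str:
    """Rate consistency of a group of findings.
    directions: one entry per finding, +1 or -1 for effect direction, 0 for no effect."""
    if len(directions) < 2:
        return "not applicable"                  # only 1 finding in the group
    nonzero = {d for d in directions if d != 0}
    if len(nonzero) <= 1:
        return "high"                            # all same direction or no effect
    dominant = max(nonzero, key=directions.count)
    agreeing = sum(1 for d in directions if d in (dominant, 0))
    return "moderate" if agreeing > len(directions) / 2 else "low"

print(consistency([1, 1, 0, 1]))   # high
print(consistency([1, 1, -1]))     # moderate
print(consistency([1, -1]))        # low
```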
The experts then used the credibility scores and the evidence consistency to rate the strength of evidence for the evidence synthesis.
3.4.4 Guidance statements within evidence tables
Each evidence synthesis was additionally translated into simpler guidance statements and examples for use in our interactive evidence guide for professional communicators. For example, a guidance statement on the effect of using 1-in-X number formats on perceived probability reads, “Strong evidence: People will perceive a probability as larger if it is stated as 1-in-X rather than a rate per 10ⁿ or a percent” (Figure 3).

Sample evidence subtable, synthesis, and guidance statements. The evidence subtable (top) lists findings in which numbers were compared with verbal probabilities for point tasks when identify/recall outcomes were measured. Findings were synthesized in text form (middle) and used to generate guidance (bottom) according to the procedures outlined in this article.
4. Results
As shown in Figure 4, we screened more than 35,000 unique articles to identify eligible research. We completed the full data extraction process described above for eligible articles reporting research about communicating probabilities (n = 314), quantities (n = 36), and both probabilities and quantities (n = 2). The current article focuses on probability communication research from the 316 probability articles.3,4,7–320

More than 35,000 articles were screened to identify 314 articles about communicating probabilities, 36 about quantities, 2 about both, 21 about medication instructions, and an additional 18 about verbal probabilities (PRISMA diagram). Articles that met the inclusion criteria only for publications a and b above are not discussed in the current article, which focuses on the 314 probability articles and the 2 articles that studied both probabilities and quantities. (a) Published in 2021 as Andreadis et al., Journal of General Internal Medicine 2021;36(12):3820–9. (b) Published in 2022 as Sharko et al., Patient Education and Counseling 2022;105(7):1888–903.
Data extraction from the 316 articles about probability communication produced 1,119 findings relevant to probabilities, which we organized into 37 unique evidence tables by task-outcome combination as shown in Table 5, with each evidence table containing subtables for each data presentation format comparison (summarized in Table 6A–E). Some evidence tables are very large, reflecting deep research in that domain. For example, within the point task × perceived probability evidence table, we identified 193 unique findings (summarized in Table 6A, fifth row): 40 study records for numbers versus numbers comparisons, 30 for graphics versus graphics, 34 for numbers versus graphics, 15 for numbers versus verbal probabilities, 19 for framing, 30 for context, 12 for uncertainty, 5 for denominator manipulation, 7 for animation and interactivity, and 1 for time period. By contrast, certain other evidence tables are much smaller or even nonexistent.
Numbers of Probability Findings by Task and Outcome, across All Data Types and Data Presentation Formats
Each task × outcome combination defines a unique evidence table. The 37 nonzero cells represent 37 evidence tables. The number in each cell is the count of unique findings in that evidence table.
NA, not applicable. In these cases, the outcome-task combination was impossible. Probability perceptions and feelings outcomes are responses to individual probabilities, which means they require point tasks; occasionally, these outcomes were also measured after posterior probability synthesis tasks. By contrast, effectiveness perceptions and feelings are responses to differences between probabilities and so can involve difference, synthesis, and time-trend tasks.
In these cases, cell counts were 0 because we did not identify any literature in the category, reflecting the fact that some research questions have been studied far more often than others.
Point Tasks Only: Numbers of Findings by Outcome and Data Presentation Format Comparison
Each row counts findings within a single evidence subtable.
Difference Tasks Only: Numbers of Findings by Outcome and Data Presentation Format Comparison
Each row counts findings within a single evidence subtable.
Synthesis Tasks Only: Numbers of Findings by Outcome and Data Presentation Format Comparison
Each row counts findings within a single evidence subtable.
Time-Trend Tasks Only: Numbers of Findings by Outcome and Data Presentation Format Comparison
Each row counts findings within a single evidence subtable.
Synthesis Bayesian Tasks Only: Numbers of Findings by Outcome and Data Presentation Format Comparison
Each row counts findings within a single evidence subtable.
Overall, as shown in Tables 5 and 6A through E, there were substantial numbers of findings in most point tasks. Evidence related to difference tasks focused on effectiveness perceptions, behavioral intentions, and preference for formats, while evidence related to synthesis tasks was almost entirely about behavioral intentions and preferences. The evidence regarding posterior-probability synthesis tasks focused on computation outcomes, while the literature regarding time-trend tasks was quite sparse.
5. Discussion
The Making Numbers Meaningful project identifies, organizes, and synthesizes evidence on how to present numbers to patients. The challenges of this evidence synthesis have led us to develop novel methods for classifying the evidence, and understanding these methods is essential to understanding the evidence reports from the project.
In brief, we applied standard systematic literature search methods to identify articles comparing the effects of different data presentation formats on quantitatively assessed, comprehension-related outcomes in adult, lay audiences. The resulting articles were classified along several dimensions: the stimulus (data and data presentation format), the outcomes assessed by the experimenters, and the cognitive tasks performed by respondents to make sense of the stimuli and respond to the outcome questions. The data extraction and classification approach described in this article allowed us to create evidence tables that grouped studies that were comparable along these dimensions. This approach ensures that comparable articles are compared with each other and not with articles studying other phenomena. Our evidence synthesis and quality rating approach, based on AHRQ and GRADE criteria and on expert assessment, then allowed us to develop evidence syntheses that were weighted toward higher-credibility studies. Overall, this systematic process allows us to develop guidance with strength-of-evidence estimates about the effects of specific data presentation formats upon communication outcomes.
In synthesizing this evidence, we encountered a number of challenges. The lack of standard vocabulary caused difficulties classifying the literature, which we addressed in some cases with a more granular classification vocabulary. Most notably, the terms “gist understanding” and “comprehension” were operationalized differently in each study according to the aspect of the message considered the most important by the researchers. For example, depending on the purpose of a study, researchers might consider a respondent’s “gist comprehension” to be good if they successfully identified a number in a stimulus, selected the most effective of several therapies, recognized that their situation carried “low risk” rather than “no risk,” or correctly translated a percentage into a rate per 100. Given the wide variety of operationalizations of “comprehension,” we avoided that term and instead classified the narrower outcome assessed; in the previous sentence, the first outcome was classified as “identification,” the second as “contrast,” the third as “categorization,” and the fourth as “computation.” It may be helpful for future researchers to provide additional specification to the term “gist comprehension” to help readers better understand what gist message or messages were considered the most important.
Conversely, we sometimes decided to group outcomes because of a lack of clarity in methods reporting. In particular, studies were frequently unclear about whether the participants could view the stimulus while answering outcome questions (in which case the outcome measured would be “identification”) or not (in which case the outcome measured would be “recall”). We thus decided to collapse these in our evidence tables. We also grouped studies assessing behavior with those assessing behavioral intention because the former were so rare.
Methodological and reporting limitations in the studies also affected this work. We noted cases in which researchers failed to illustrate or describe the stimuli, making it challenging to know what aspect of the stimulus was responsible for the effects reported. In other cases, comparison arms differed on several dimensions at once, making it difficult to attribute effects to any individual manipulation.
The Making Numbers Meaningful project highlights the uneven nature of the literature on risk communication in health and medicine. For well-studied (even overstudied) questions, the evidence is strong and consistent, so our work suggests that it is time to move away from additional experiments and toward dissemination and implementation. For other questions, our work shows that research is sparse. We hope that health and risk communications researchers can use our work as a guide to identify the most fruitful areas for further inquiry.
This project has a number of limitations. Given the ambiguities of the research vocabulary, we optimized the literature search for sensitivity rather than specificity and were forced to manually screen a very large number of candidate articles to remove false positives. Reviewer error could have led us to mistakenly exclude relevant studies, although we tried to address this with 2 independent reviewers screening each article. The detailed data extraction needed to create the individual-level findings required more manual work than is typically needed for a systematic review, which both made it difficult to work quickly and created challenges to replicating the work. The credibility assessment for the findings, while intended to help ensure that higher-quality studies contributed more to the evidence, involved 2 experts applying experienced judgment. The use of only 2 primary reviewers may have led to greater consistency or, alternatively, to potential biases. Expert reviewers were needed to assess this highly heterogeneous literature. For example, stimuli such as bar graphs could vary by axis design, bar clustering or stacking, colors, shape, data labels, and use of additional or explanatory text; reviewers needed to decide issues such as whether 2 such stimuli were similar enough to be grouped together and whether differences between stimuli could have confounded comparisons. Such judgments, although necessary, were also a potential source of bias. We addressed this concern by first conducting independent initial evaluations (to allow differences in perspective to arise) followed by consensus discussions (to require reviewers to justify decisions). Our hope is that by making the data transparently available, we can enable future work that might incorporate a larger community of expert perspectives as a further check on any potential bias.
The use of our novel taxonomies is a potential limitation. Although intended to standardize comparisons and ensure that like was compared with like, the taxonomies are not externally validated. One controversial aspect is that they avoid some terms commonly accepted in the literature, such as “gist comprehension.” We focused on numbers communicated to patients and the public for health-related decisions, so our findings may not apply to the communication of numbers in other domains; our review also omits lessons learned about communicating other types of numbers that may nonetheless apply to communicating health-related probabilities and quantities. We excluded studies of health professionals and medical students, although there is an additional interesting body of research about numerical communication with these expert audiences. We also excluded formative studies such as purely qualitative focus group feedback on the development of decision aids as well as studies that compared numerical information against no-information and no-numbers controls, despite the important contributions that these studies have made. Also, our highly granular classification approach prioritizes comparing very similar studies for generating evidence specific to individual types of data, data formats, tasks, and outcomes, rather than general conclusions that might be more broadly applicable to multiple communication problems.
Finally, our focus on the effects of data presentation formats led us to exclude potentially interesting comparisons as out of scope, such as before-and-after assessments of knowledge after using a decision aid or manipulations of proportions involving changing the reference class for the computation (such as describing mortality as the percentage dead of those with cancer v. the percentage of all women). As a result, we can draw conclusions only about the effects of the data presentation format used to convey quantitative information and not about the effects of providing quantitative information in general.
6. Conclusion
A comprehensive effort to synthesize the evidence about the effects of data presentation formats on comprehension-related outcomes required conducting a large systematic review, developing innovative taxonomies for classifying the literature, and organizing multiple evidence tables to generate evidence-based guidance that is specific to individual communication problems. Because results are too extensive for a single article, this methods article is accompanied by a set of results articles, each presenting a bounded set of findings for specific problems in the communication of numbers in health.
We anticipate that this project will make several contributions to the future of research into health numbers communication. Research will be more easily translated into evidence if researchers adopt standard terms for the constructs they measure. More comprehensive reporting of methods (such as whether the stimulus was visible when a participant answered questions) will also help achieve this goal. The current project can also help direct future research away from relatively well-studied questions toward more novel questions needed to fill in gaps in the current evidence. Overall, these innovations should facilitate more productive health and risk communication research, strengthening the evidence base for improving health numbers communication in practice.
Supplemental Material
Supplemental material for this article (files sj-docx-1-mpp-10.1177_23814683241255334, sj-docx-2-mpp-10.1177_23814683241255334, and sj-docx-3-mpp-10.1177_23814683241255334) is available online in MDM Policy & Practice.
Acknowledgements
We thank the Numeracy Expert Panel for contributions to conceptualizing the MNM project (Cynthia Baur, Sara Cjaza, Angela Fagerlin, Carolyn Petersen, Rima Rudd, Michael Wolf, and Steven Woloshin). We are grateful to Marianne Sharko, MD, MS, Andrew Z. Liu, MPH, and Lisa Grossman Liu, MD, PhD, for contributions to article screening and risk-of-bias assessment. We also thank Jordan Brutus for assisting with data management. Abstracted data, methods, and materials are available to other researchers upon request to Jessica S. Ancker.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Financial support for this study was provided entirely by a grant from the National Library of Medicine (R01 LM012964, Ancker PI). The funding agreement ensured the authors’ independence in designing the study, interpreting the data, writing, and publishing the reports.
Availability of Research Resources
All research resources are available at the Making Numbers Meaningful Project at OSF (https://osf.io/rvxf2/). This project includes a Methodology Files folder (containing the search strategy, the data extraction instrument, and the study risk of bias [S-ROB] rubric), the list of each included article mapped to the Making Numbers Meaningful review article that covers it, and a Probability Findings folder displaying the extracted findings for each of the Making Numbers Meaningful review articles.
References