Abstract
Nursing practice must be evidence-based, yet grappling with statistics proves challenging, hampering the integration of research into patient care. Statistics, a branch of mathematics, is often treated in various disciplines, including nursing, merely as a method for analyzing and drawing inferences from measurable data. As with other methods, it is easy to overlook its underlying assumptions. Statistical significance and p-values are among the most commonly used and misused concepts in statistics. While extensively discussed elsewhere, these issues receive less attention in nursing. Papers addressing them often adopt practical approaches, overlooking the root cause: a misunderstanding of their meaning. To address this, I aim to clarify the fundamental concept underpinning these issues – probability – by defining its two main interpretations, each carrying implications for statistical significance and p-values. In conclusion, I advocate for the continued use of p-values in nursing but emphasize the need for critical considerations in doing so.
Keywords
Introduction
Nurses play a crucial role in integrating scientific research into healthcare, ensuring that patient care is based on the best evidence available. However, the complexity of statistics and statistical analysis presents a hurdle to adopting research findings in everyday clinical practice.1,2 Accurately interpreting statistical data is not just a technical skill but a cornerstone of quality patient care.3 This becomes increasingly important as healthcare organizations rely more on ‘Big Data’ for quality improvements, and as nurses use statistics and hypothesis testing, for instance, to evaluate the effectiveness of care interventions through experimental or quasi-experimental study designs.4 The issue is also reflected in academia, with research indicating that over 90% of nursing journals feature articles that misrepresent p-values.5 While this may seem inconsequential, the misuse of significance testing is recognized as a primary contributor to the replication crisis in fields like psychology.6 Surprisingly, nursing scholars have been largely absent from this debate.7,8 Papers often tackle these issues practically but tend to overlook the core problem: a lack of understanding of what these concepts mean. It is crucial to clarify probability and its different definitions, since they underlie all of these issues. Furthermore, the key metrics that should accompany p-values need to be elaborated in more detail than usual to ensure a proper understanding of their complementary functions from a probabilistic perspective.
Statistics as a method and different probability interpretations
At its essence, statistics is about gaining insights from data. It allows us to extract meaningful information from raw data by combining mathematical rigor with a framework for understanding real-world complexities. When we gather data – whether through lab measurements or field surveys – we are essentially solving a puzzle, unveiling patterns within the population we are studying. These data points, or samples, represent real-world phenomena organized into what we call a sample space. Hypotheses are our educated guesses about these phenomena. In nursing, statistics serves as a method, using mathematical tools to evaluate hypotheses with sample data. While we cannot predict events in clinical settings with absolute certainty, probability helps us quantify their likelihood, ranging from zero to one (or corresponding percentages). Zero indicates impossibility, one denotes certainty, and values in between show varying degrees of likelihood.6 Probability theory enables us to evaluate the credibility of hypotheses, the reliability of decision-making processes, and the level of support provided by data, thereby empowering us to make informed judgments.
This emphasizes the importance of probability, which carries different meanings depending on the context, impacting statistical significance and p-values. Therefore, defining probability is crucial. Probability theory is a branch of mathematics. This paper explores two main interpretations aligning with specific definitions of probability in relation to statistical methods, to clarify the widely used concepts of statistical significance and p-values.
Classical and frequentist probability
Classical statistical methods are characterized by assigning probabilities to possible outcomes within a sample space. 9 Here, probability reflects real-world events, often referred to as ‘chance’. For instance, in a clinical trial for hypertension medication, outcomes could be a positive response (blood pressure reduction) or no significant change. Clinically, classical statistics assign probabilities based on trial data. For example, if historical data show that 70% of similar patients respond positively, classical statistics will assign a 70% probability for a positive response and 30% for no significant change for future patients under similar conditions. This interpretation aligns with the frequentist approach to statistics, where probability is understood in terms of long-term frequencies in repeated experiments. 9
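The frequentist reading of the 70% figure above can be made concrete with a short simulation (the numbers and function names are my own illustration, not from any cited trial): the observed relative frequency of positive responses stabilizes around the underlying probability as the number of hypothetical repetitions grows.

```python
import random

def simulate_response_rate(true_rate: float, n_patients: int, seed: int = 42) -> float:
    """Simulate n_patients independent trials in which each patient responds
    positively with probability true_rate; return the observed frequency."""
    rng = random.Random(seed)
    responses = sum(rng.random() < true_rate for _ in range(n_patients))
    return responses / n_patients

# The frequentist reading: as the number of (hypothetical) repetitions grows,
# the observed relative frequency settles near the underlying probability.
for n in (10, 100, 10_000):
    print(n, simulate_response_rate(0.70, n))
```

With few patients the observed rate can stray far from 0.70; with many, it settles close to it, which is exactly the long-run-frequency interpretation described above.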
Two key procedures in classical statistics, central to our main concepts, are significance testing and hypothesis testing. Significance testing, pioneered by Ronald Fisher, assesses the evidence against a single null hypothesis and introduced the p-value for this purpose. Hypothesis testing, developed by Jerzy Neyman and Egon Pearson, compares two pre-specified statistical hypotheses. The two frameworks differ in philosophy and application.10 Neyman and Pearson's framework introduces Type I and Type II errors, which involve incorrectly rejecting a true null hypothesis and failing to reject a false one, respectively, and it requires specifying both null and alternative hypotheses. In practice, both approaches are often combined into null hypothesis significance testing (NHST), with the choice depending on the research context and the goals of the statistical analysis.
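The Neyman–Pearson notion of a Type I error rate can likewise be checked by simulation. The sketch below (a hypothetical one-sample z-test with known standard deviation; all names are my own) repeatedly samples from a population in which the null hypothesis is true and counts how often it is nonetheless rejected at α = 0.05; the long-run rejection rate should sit near 5%.

```python
import random
from statistics import NormalDist, mean

def z_test_p_value(sample: list[float], mu0: float, sigma: float) -> float:
    """Two-sided p-value for H0: population mean == mu0, with known sigma."""
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def type_i_error_rate(n_studies: int = 2000, n: int = 30, alpha: float = 0.05,
                      seed: int = 1) -> float:
    """Fraction of simulated studies that reject a TRUE null at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # null is true: mu = 0
        if z_test_p_value(sample, mu0=0.0, sigma=1.0) < alpha:
            rejections += 1
    return rejections / n_studies

print(type_i_error_rate())  # close to the nominal 0.05
```

In other words, even when nothing is going on, about one in twenty tests will come out ‘significant’ at the 0.05 level, which is what motivates the multiple-comparison corrections discussed later.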
Bayesian probability or evidential probability
In contrast to classical probability, probability can also be understood from an epistemic standpoint, where it represents the degree of confidence or support for the likelihood of an event's occurrence. Transitioning from frequentist to Bayesian inference broadens our perspective on probabilities, accommodating uncertainties beyond repeatable events. Bayesian inference integrates prior knowledge into the estimation process, departing from the framework of classical statistics.11–14 In clinical settings, we often quantify beliefs, such as expressing a 90% chance that a specific treatment will alleviate pain better than doing nothing. These beliefs are subjective and not solely based on historical data. It is crucial to clarify that this assertion does not mean 90% of patients will respond positively to the specified treatment or that it is consistently superior in 90% of cases. Instead, it represents a high level of confidence.
Understanding these definitions of probability is important for interpreting p-values and making well-informed decisions regarding statistical significance. It is a blend of theory, empirical observation over time, and personal beliefs and knowledge that navigates us through the complex realm of statistics in clinical settings.
In short, thus far:
- Probability theory aids in assessing a hypothesis's credibility, decision-making reliability, and data support.
- In Classical/Frequentist probability, p-values are crucial as they indicate the likelihood of observing something as extreme or more extreme than what was observed, especially in Fisher's approach.
- In Bayesian statistics, p-values are optional but can be used alongside Bayes factor or interpreted with their associated Bayes factor bounds.
Presenting and applying p-values
Because I argue that p-values remain meaningful, what follows first are recommendations for presenting and applying them accurately.
Report exact p-values
Rather than relying solely on the binary classification of results as significant or not based on p < 0.05, I propose reporting exact p-values. This approach offers a deeper understanding, allowing readers to interpret findings with nuance, considering the unique context of the research setting or study design. Viewing p-values as a spectrum enables more meaningful comparisons. A p-value of 0.052, often disregarded, may hold clinical relevance, making it useful to recognize values under 0.10 as ‘trending towards significance’, particularly in studies with limited samples.15
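As a minimal illustration of this recommendation (the z statistic of 1.94 is invented for the example), reporting the exact p-value preserves information that the binary label ‘not significant’ would discard:

```python
from statistics import NormalDist

def two_sided_p_from_z(z: float) -> float:
    """Exact two-sided p-value for a standard-normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def report_p(p: float) -> str:
    """Report the exact p-value to three decimals ('p < 0.001' below that),
    rather than collapsing it to 'significant' / 'not significant'."""
    return "p < 0.001" if p < 0.001 else f"p = {p:.3f}"

# A hypothetical test statistic of z = 1.94 gives p just above 0.05 --
# information that a bare 'not significant' would hide.
print(report_p(two_sided_p_from_z(1.94)))   # p = 0.052
```

The reader, not the 0.05 cutoff, then decides how much weight a value such as 0.052 deserves in context.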
Lowering the alpha (α) level in certain situations
Lowering the alpha (α) level from, for example, 0.05 to 0.005 has been suggested by a group of 72 statisticians and researchers to enhance study reproducibility.16 However, this adjustment comes with its own considerations.17 Such a change might bring challenges, especially regarding statistical power and the feasibility of conducting studies with the necessary sample sizes. The proposed adjustment could increase the required sample size by up to 70% to maintain 80% power, potentially limiting smaller, exploratory studies due to resource constraints. While larger studies offer robust findings, smaller-scale investigations are crucial for exploring emerging research areas. The same trade-off arises in multiple comparison testing: each situation calls for a careful balance between statistical guidelines and the unique aspects of the research.
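The roughly 70% figure can be reproduced with the standard normal-approximation sample-size formula, in which the required n is proportional to (z_(α/2) + z_β)². The sketch below is an approximation, not an exact power calculation:

```python
from statistics import NormalDist

def required_n_factor(alpha_old: float, alpha_new: float, power: float = 0.80) -> float:
    """Approximate factor by which the per-group sample size grows when the
    two-sided alpha level is tightened, at fixed power.

    Normal approximation: n is proportional to (z_{alpha/2} + z_{beta})**2."""
    z = NormalDist().inv_cdf
    z_beta = z(power)
    old = (z(1 - alpha_old / 2) + z_beta) ** 2
    new = (z(1 - alpha_new / 2) + z_beta) ** 2
    return new / old

# Tightening alpha from 0.05 to 0.005 at 80% power:
print(round(required_n_factor(0.05, 0.005), 2))  # about 1.70, i.e. ~70% more participants
```

The calculation confirms the figure cited above: at 80% power, moving from α = 0.05 to α = 0.005 inflates the required sample by a factor of about 1.7.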
For instance, a limited number of pre-planned comparisons or exploratory analyses designed for hypothesis generation may not require stringent corrections.18 Conversely, corrections are especially helpful in three scenarios: 1) when preventing a Type I error is crucial; 2) when conducting a single test for non-significant results; and 3) when multiple tests are exploratory without specific hypotheses.18,19
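When corrections are warranted, the Holm step-down procedure is a common choice that is less conservative than plain Bonferroni while still controlling the family-wise Type I error rate. A minimal sketch (the three p-values are invented):

```python
def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjustment for multiple comparisons.

    Returns adjusted p-values in the original order; an adjusted value below
    alpha indicates significance with the family-wise error rate controlled."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        adj = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, adj)  # enforce monotonicity of adjusted values
        adjusted[i] = running_max
    return adjusted

# Three hypothetical comparisons from one study:
print(holm_adjust([0.010, 0.040, 0.030]))
```

Here the smallest p-value is multiplied by 3, the next by 2, and so on, so a raw 0.040 no longer clears the 0.05 line once the other tests are taken into account.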
From my perspective, it is wise to consider multiple-comparison adjustments so that we neither dismiss the null hypothesis too soon nor act on the results too hastily. While some may debate the need for such adjustments, highlighting a potential over-reliance on p-values,20 it seems clear from nursing research that overlooking the issue of multiple testing altogether is the more common problem.21,22 Still, blindly relying on significance thresholds oversimplifies the complexities of research design and context. Instead, it is important to embrace a nuanced approach: Fisher introduced the p-value to assess how far data deviate from a null hypothesis, but interpreting significance solely against a p < 0.05 cutoff oversimplifies matters.
In short, thus far:
- Provide the exact p-values (up to two or three decimal places).
- Consider lowering the alpha (α) depending on the research design and purpose, particularly distinguishing between exploratory and hypothesis-testing approaches.
Clear and accurate communication of statistical significance
Understanding and conveying statistical information accurately is crucial in enhancing the quality of nursing education and research. It is about making sense of numbers and what they imply for healthcare practices.
Use precise language
The concept of p-values can be elusive, often inviting misconception rather than the clear-cut evidential reading suggested by likelihood theory.13 This confusion contributes to the miscommunication of statistical significance, underscoring the importance for nursing educators and researchers of prioritizing precision in language. Greenland puts it aptly: many criticisms of p-values are not about their inherent flaws but about misunderstandings due to poor teaching or incorrect terminology.23
Misinterpretations, such as the belief that p-values overstate evidence against the null hypothesis, are common.6 Yet, p-values simply measure how compatible the observed data are with a specified (null) hypothesis, not the probability of that hypothesis itself. Statistics is about interpreting data distributions and understanding the associated uncertainty and variability. While statistical tests are not definitive proof, they offer a measure of confidence in our estimates. It is a common misconception that p-values can tell us the probability that a null hypothesis is true. This is not the case, as classical statistics do not assign probabilities to hypotheses.9,24 Nor do p-values reflect the magnitude of an effect. Also, unlike effect size measures, p-values do not directly translate to associations. For that reason, referring to p-values when discussing the strength of associations is not recommended.
Do not confuse statistical with clinical significance or importance
Understanding the distinction between statistical and clinical significance is crucial for nursing scholars. When the concept of statistical significance was introduced by Francis Edgeworth in the late 19th century, it was originally intended to identify results that merited a closer look, not necessarily to confirm their scientific importance.25 Significance alone should highlight the inherent value of research findings, while statistical significance concerns how unlikely findings of the observed magnitude would be under chance alone.26 It is essential to recognize that these two types of significance are not interchangeable. Despite its apparent clarity, confusion between them persists in health-related research.27
Clinical significance is assessed by the tangible benefits or relevance of a healthcare intervention, and it may or may not take into account statistical significance. Nurses must critically evaluate both the methodology and outcomes of research, weighing its clinical significance and its implications for patient care.28,29 Some scholars advocate for a patient-centered approach to clinical significance, focusing on the impact of research findings on patients’ lives. 30 In contrast, practical significance deals with the extent of the effects as understood by researchers or clinicians. This nuanced understanding is vital in applying research to real-world clinical practice.
In light of the discussion, it has been suggested that ‘importance’ might be a more fitting term than ‘clinical significance’ for discussing effects.31 I find merit in this view. It involves embracing the diverse perspectives of everyone involved, from patients and their families to healthcare providers and researchers, and recognizing that patients’ insights might differ from those of medical professionals. To truly gauge what is clinically important from a patient's standpoint, it is advised to employ validated surveys that capture both collective measures, such as effect size, and personal thresholds, such as the minimal important change.32
In short, thus far:
- Instead of stating statistical tests ‘confirm’ your idea, quantify how much uncertainty has been reduced.
- Do not use p-values to describe the strength of associations.
- Differentiate between significance, clinical significance, and statistical significance.
- Significance highlights the value of research findings, while clinical significance indicates relevance in terms of positive health or healthcare outcomes.
- Consider using “importance” instead of “clinical significance” for discussing effects, as it emphasizes the value of diverse perspectives from patients, families, healthcare providers, and researchers.
Always include confidence intervals and effect size parameters
In nursing research, it is crucial to communicate the reliability and impact of our findings effectively. This is where the use of confidence intervals (CIs) and effect size measures comes into play. CIs and effect sizes provide a richer, more nuanced picture than p-values, as they tell us not just if a finding is statistically significant, but also the magnitude of the effect, and the precision with which we have estimated it.
While p-values and CIs are based on the same mathematical framework, they tell us different things. 33 Think of a CI as a net that captures the true value of what we are measuring – most of the time, if we are using a 95% CI, we can trust that net to catch the true value in 95 out of 100 repeated samples, assuming we have steered clear of biases. Whether it is a t-test, chi-square, ANOVA, or regression, these metrics are your allies in conveying the significance of your work with both precision and clarity. 29 CIs offer a practical range for clinical measurements, such as blood pressure, making them easy to interpret. They also show how reliable our estimates are; a wide CI suggests uncertainty, while a narrow one indicates more confidence in the findings. CIs help determine if results are statistically significant; for instance, a 95% CI that includes zero suggests a p-value > 0.05, indicating no significant difference. They are invaluable in meta-analyses too, allowing for comparisons across studies and quantifying uncertainty in our estimates.
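The ‘net’ metaphor can be demonstrated directly: simulate many studies drawn from a population with a known mean and count how many 95% CIs catch it. The sketch below uses a z-based interval with known σ and invented blood-pressure-like numbers:

```python
import random
from statistics import NormalDist, mean

def ci_covers_truth(true_mean: float, sigma: float, n: int, rng: random.Random,
                    level: float = 0.95) -> bool:
    """Draw one sample and check whether its z-based CI 'catches' the true mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)       # 1.96 for a 95% interval
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    half_width = z * sigma / n ** 0.5
    return abs(mean(sample) - true_mean) <= half_width

# Simulate 1000 hypothetical studies of systolic blood pressure
# (true mean 120 mmHg, sigma 15 mmHg, n = 50 per study):
rng = random.Random(7)
covered = sum(ci_covers_truth(120.0, 15.0, 50, rng) for _ in range(1000))
print(f"{covered} of 1000 intervals caught the true mean")  # close to 950
```

Roughly 95% of the simulated nets catch the true value, which is precisely what the 95% label promises over repeated sampling, and no more than that for any single interval.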
CIs offer valuable insights beyond what p-values can tell us, yet they fall short in quantifying the actual impact of nursing interventions. This is where effect size measurements step in, providing a clearer picture of an intervention's effectiveness. Using measures such as Pearson's r for correlations or Cohen's d for group differences, we can gauge the strength of an effect. Effect sizes can be standardized, like Cohen's d, which is ideal for combining study results or comparing diverse variables, or unstandardized, such as the unstandardized beta coefficient in regression models, which is specific to the context of the study. Standardized measures, being unit-less, are preferable when results from several studies are combined, as in meta-analyses. For similar reasons, they are also used when comparing the effects of variables measured in different units in multivariate analyses, such as multiple regressions.34
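Cohen's d itself is simple to compute: the difference in group means divided by the pooled standard deviation. A minimal sketch with invented pain scores:

```python
from statistics import mean, variance

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Standardized mean difference: (mean_a - mean_b) / pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical pain scores (0-10 scale) after two interventions:
intervention = [3.0, 4.0, 2.5, 3.5, 4.5, 3.0]
control = [5.0, 4.5, 6.0, 5.5, 4.0, 5.0]
print(round(cohens_d(intervention, control), 2))  # about -2.19
```

Because the result is expressed in standard-deviation units, the same function works whether the raw outcome is pain scores, blood pressure, or length of stay, which is exactly why standardized measures travel well across studies.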
To gauge an effect's relative magnitude, we typically compare it to benchmarks from similar research or set standards.23 Take Cohen's d: values of 0.2, 0.5, and 0.8 mark small, medium, and large effects, respectively.35 Yet, these are not absolute; the thresholds are context-dependent and based on statistical distributions.31 For instance, a ‘small’ effect size might be crucial if it is linked to a significant health issue affecting many people (e.g. blood pressure). Moreover, unlike p-values, effect sizes generally remain consistent across different sample sizes. They are also crucial for power analysis, where, for example, Cohen's d helps determine the necessary sample size for reliable results.34 The goal is to balance sample size with factors like the significance level (α) and statistical power to ensure robust findings.
Taken together, effect sizes, preferably standardized, should be reported to determine if the results have clinical significance. If the effect is not significant, the data remain crucial for power analysis and for planning the sample size of future research. Lastly, the term ‘effect’ suggests a cause-and-effect link, which holds for experimental studies; in observational studies, we are looking at ‘associations’ rather than true ‘effects’.31
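The link between effect size and power analysis mentioned above can be sketched with the normal-approximation formula for a two-group comparison; an exact t-based calculation gives slightly larger samples, so treat these as ballpark figures:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-group comparison with
    standardized effect size d (normal approximation:
    n = 2 * ((z_{alpha/2} + z_{beta}) / d)**2)."""
    z = NormalDist().inv_cdf
    n = 2 * ((z(1 - alpha / 2) + z(power)) / d) ** 2
    return ceil(n)

# Smaller expected effects demand much larger samples:
for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

A study powered to detect a ‘large’ effect (d = 0.8) needs roughly 25 participants per group, whereas a ‘small’ effect (d = 0.2) pushes the requirement towards 400 per group, which is why honest effect-size estimates matter before data collection begins.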
In short, thus far:
- Always include confidence intervals (CIs) and effect sizes for a nuanced view of statistical measurement quality, surpassing p-values alone.
- Effect sizes indicate both statistical significance and reveal effect magnitude.
- Standardized effect sizes facilitate meta-analyses and cross-cultural comparisons.
- CIs offer a more precise estimation of effect size precision than p-values.
Consider using the Bayes factor in isolation or together with p-values
According to Bayes's theorem, the Bayes factor is the ratio of the probabilities of the observed data under two competing hypotheses. It serves as a measure of the strength of evidence in favor of one hypothesis over the other, quantifying evidence for both null and alternative hypotheses.36 In nursing research, the Bayes factor offers several advantages.37,38
First, when adopting the Bayes factor, it is possible to use p-values simultaneously, interpret p-values alongside their associated Bayes factor bounds (BFB), or transition entirely to Bayesian statistics.39 Second, it eliminates the need for the binary decision of rejecting or not rejecting null hypotheses, instead providing evidence supporting each hypothesis under consideration, including data favoring the null hypothesis. In addition, it facilitates the evaluation of multiple hypotheses, not just one rival hypothesis. During data collection, support for hypotheses can be continually updated through Bayesian updating: if data-based support for hypotheses seems unconvincing during the study, it is acceptable within the Bayesian framework to gather more data and reassess.39 This follows from the main theorem below.
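To make the Bayes factor concrete, consider a hypothetical binomial outcome (responder vs non-responder). With a conjugate Beta prior, the marginal likelihood under H1 has a closed form, so BF10 against a point null of θ = 0.5 can be computed directly. All numbers below are invented for illustration:

```python
from math import lgamma, exp, log

def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function, via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bf10_binomial(k: int, n: int, theta0: float = 0.5,
                  a: float = 1.0, b: float = 1.0) -> float:
    """Bayes factor BF10 for k successes in n trials:
    H1: theta ~ Beta(a, b)  versus  H0: theta = theta0 (point null).
    The binomial coefficient cancels in the ratio, so it is omitted."""
    log_m1 = log_beta(a + k, b + n - k) - log_beta(a, b)   # marginal likelihood, H1
    log_m0 = k * log(theta0) + (n - k) * log(1 - theta0)   # likelihood under H0
    return exp(log_m1 - log_m0)

# 35 responders out of 50 patients (hypothetical): evidence against theta = 0.5?
print(round(bf10_binomial(35, 50), 2))  # roughly 9.8 in favor of H1
```

Rather than a reject/retain verdict, the output reads as graded evidence: the data are about ten times more probable under the alternative (with a uniform prior) than under the point null, and the same machinery can equally well return values below 1, i.e., evidence favoring the null.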
The central equation underlying Bayesian inference is: “Known information + Data = Total information.” Simply put, “Known information” refers to prior knowledge related to the event, while “Data” comprises empirical observations. “Total information” integrates both the researcher's existing knowledge and the insights gained from the data. When there is no known information about the phenomenon, Bayesian inference resembles classical analysis in theory.
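For a simple proportion, the “Known information + Data = Total information” scheme corresponds to conjugate beta-binomial updating: the prior's pseudo-counts and the observed counts simply add. A minimal sketch with invented numbers:

```python
def update_beta(prior_a: float, prior_b: float, successes: int, failures: int):
    """Conjugate Bayesian updating for a proportion:
    prior Beta(a, b) + binomial data -> posterior Beta(a + s, b + f)."""
    return prior_a + successes, prior_b + failures

def beta_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# 'Known information': a prior belief equivalent to 7 responders in 10 patients.
a, b = 7.0, 3.0
# 'Data': 12 responders among 20 new patients.
a, b = update_beta(a, b, successes=12, failures=8)
# 'Total information': the posterior blends both sources of information.
print(round(beta_mean(a, b), 3))  # 0.633
```

The posterior mean (0.633) lies between the prior belief (0.70) and the observed rate (0.60), weighted by how much information each carries; with a flat Beta(1, 1) prior the posterior would track the data alone, mirroring the remark above that uninformative priors bring Bayesian results close to classical ones.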
Bayes factors also provide a coherent method for determining whether non-significant results support a null hypothesis over a theory or merely signal data insensitivity (e.g., when a study lacks statistical power to detect a true effect, leading to inconclusive results). Unlike p-values, which can be influenced by factors like high standard errors, Bayes factors leverage both the data's sensitivity in distinguishing theories and the most straightforward aspects of a theory's predictions. However, researchers must define the prior distribution, and this subjective choice affects the analysis outcomes. While default prior distributions are commonly recommended, uncritical adherence may introduce issues analogous to default α values in p-value assessments. Alternative solutions include refining priors through elicitation or conducting sensitivity analyses to assess the impact of prior assumptions. In elicitation, experts provide input on prior probabilities based on knowledge and experience. In sensitivity analyses, researchers test how varying prior assumptions affect the final results.39
In short, thus far:
- Bayesian statistics focus on belief and degree of certainty, often proving more intuitive.
- The Bayes factor can accompany or replace p-values.
- It eliminates the binary decision of accepting or rejecting null hypotheses.
- However, introducing Bayesian statistics to nursing scholars may pose challenges due to their general familiarity with classical probability statistics.
Conclusions and implications
We must transform our view of statistics in nursing beyond a mere method focused solely on metrics, delving deeper into conceptual understanding. Nurses practicing evidence-based care should scrutinize research results with attention to study design, sample size, statistical power, data analysis, and conclusions. When teaching inferential statistics to nursing students, it is imperative to emphasize effect sizes and confidence intervals as a bare minimum, complemented by practical insights gleaned from p-values. 27 Enhancing statistical literacy among nurses is vital, commencing at the undergraduate level to cultivate critical thinking skills. 40 Practically, effective teaching methods should incorporate real-life examples, visual aids, group activities, and journal clubs featuring research articles, notably reminiscent of Florence Nightingale's effective pedagogical techniques.17,41
Statistical methods of today are rooted in modern probability theory, offering a unified language for formulating hypotheses and interpreting data, thereby bolstering credibility by involving concepts from measure theory, stochastic processes, and limit theorems. Thus, recognizing statistics as a discipline rather than merely a tool is crucial for nursing scholars. Fostering conceptual understanding through reflective educational strategies promotes long-term comprehension. In addition, there is a growing need for Bayesian statistics education among nursing scholars and students, presenting clinical advantages worth exploring and integrating into curricula.
Footnotes
Author contribution
The author confirms sole responsibility for the following: study conception and design, literature review, analysis and interpretation, and manuscript preparation.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
