Abstract
Artificial effect-size magnification (ESM) may occur in underpowered studies, where effects are reported only because they or their associated p-values have passed some threshold. Ioannidis (2008, Epidemiology 19: 640–648) and Gelman and Carlin (2014, Perspectives on Psychological Science 9: 641–651) have suggested that the plausibility of findings for a specific study can be evaluated by computation of ESM, which requires statistical simulation. In this article, we present a new command called emagnification that allows researchers to perform such simulations and thereby estimate ESM for epidemiological studies.
1 Introduction
Effect-size magnification (ESM) is a phenomenon by which low-powered studies that detect an effect characteristically tend to exaggerate its size whenever the effect is required to pass some threshold, such as the p < 0.05 criterion often used for statistical significance. To see how ESM arises, imagine a thought experiment in which a trial is run thousands of times at each of several sample sizes (in practice, of course, an experiment is generally conducted once with one fixed sample size). Over the thousands of runs, a broad distribution of observed effect sizes will be seen, and the median of these estimates is expected to be close to the true effect size regardless of sample size. However, the smaller trials will systematically produce wider variation in observed effect sizes than the larger trials, and only a small proportion of the observed effects in these small, low-power studies will pass any given statistical threshold of significance, popularly p < 0.05. Thus, when associations deemed “statistically significant” are found in such studies, they are likely to overestimate the true size of the effect. This is the ESM phenomenon, and it can lead to inflated results from an experiment or observational study being misinterpreted as important or “discovered” scientific findings. Stated mathematically: conditional on a result passing some predetermined threshold of statistical significance, test level, or magnitude, the estimated effect size is a biased estimate of the true effect size, with the magnitude of this bias inversely related to the power of the study.
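As a concrete, simplified sketch of this thought experiment, the following Python snippet (with illustrative numbers of our own choosing, not drawn from any study: a true mean difference of 0.2, unit standard deviation, and 25 subjects per group) draws estimated effects directly from their sampling distribution and compares the median of all estimates with the median of only those that pass |z| > 1.96:

```python
import random
import statistics
from math import sqrt

def magnification_demo(true_effect, sd, n_per_group, n_sims=20000, seed=1):
    """Simulate the estimated mean difference from many repetitions of a
    two-group trial and compare the median of all estimates with the
    median of only the 'statistically significant' (|z| > 1.96) ones."""
    rng = random.Random(seed)
    se = sd * sqrt(2 / n_per_group)  # standard error of the mean difference
    estimates = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [e for e in estimates if abs(e / se) > 1.96]
    return statistics.median(estimates), statistics.median(significant)

all_med, sig_med = magnification_demo(true_effect=0.2, sd=1.0, n_per_group=25)
```

The unconditional median lands near the true effect of 0.2, while the median of the “significant” estimates is substantially larger — ESM in miniature.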
The remainder of this article introduces the new emagnification command. Section 2 illustrates the ESM phenomenon, section 3 describes the command's syntax, options, and stored results, section 4 demonstrates the command by re-creating published simulation results, and section 5 discusses how the resulting intervals should be interpreted.
2 ESM: Illustrating the phenomenon
For illustrative purposes and as an introduction to the issue, we draw on the concrete example from the work of Ioannidis (2008) discussed in the introduction and appearing in table 2 of his article. This section shows how to replicate this analysis with the new command.
Ioannidis’s table 2 is excerpted here as table 1. Ioannidis generated this from a series of simulations designed to illustrate the ESM phenomenon in which a low study power can be seen to lead to exaggerated effect sizes for those results that are statistically significant. As shown in the first data row of table 1, Ioannidis begins by assuming a true odds ratio (OR) for an association of 1.10 and that the proportion of exposed individuals in the control (or nondiseased) group is 30%. It follows then, mathematically, that the expected proportion of exposed individuals in the case group would be 0.3204.
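That expected case-group proportion follows directly from converting the control-group proportion to odds, multiplying by the true OR, and converting back to a proportion; a quick check (in Python, for convenience) confirms the arithmetic:

```python
def case_proportion(p_control, true_or):
    """Implied exposure proportion in the case group, given the control-group
    exposure proportion and a hypothesized true odds ratio."""
    odds_control = p_control / (1 - p_control)  # 0.30 -> odds of 3/7
    odds_case = true_or * odds_control          # scale the odds by the OR
    return odds_case / (1 + odds_case)          # odds back to a proportion

p_case = case_proportion(0.30, 1.10)  # ≈ 0.3204, matching Ioannidis
```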
Ioannidis then simulates a set of epidemiological studies in which i) the control group in each simulated study includes 1,000 subjects and the number of exposed subjects within the control group is randomly drawn from a binomial distribution with probability 0.3000 representing the control group proportion; and ii) the case group in each simulated study includes 1,000 subjects and the number of exposed subjects within the case group is randomly drawn from a binomial distribution with probability of 0.3204 representing the case group proportion. The observed OR of each of many simulated studies in which “n” samples are drawn per group is then computed and stored. The median OR of these simulated studies is expected to be equal to the true OR value of 1.10 (as can be calculated with the new emagnification command).
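The simulation Ioannidis describes is easy to sketch outside of Stata as well. The following Python snippet (an illustrative re-implementation, not the emagnification code itself) draws binomial exposure counts for each group with the probabilities 0.30 and 0.3204 given above, computes the observed OR and a Woolf-type significance test, and drops iterations with zero cells:

```python
import math
import random
import statistics

def simulate_ors(n_per_group, p_control, p_case, n_sims=2000, seed=12345):
    """Return a list of (estimated OR, significant at p < 0.05) pairs from
    simulated case-control studies. Iterations with zero cells are dropped,
    mirroring how non-convergent iterations are discarded."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        a = sum(rng.random() < p_case for _ in range(n_per_group))     # exposed cases
        b = sum(rng.random() < p_control for _ in range(n_per_group))  # exposed controls
        if a in (0, n_per_group) or b in (0, n_per_group):
            continue
        or_hat = (a / (n_per_group - a)) / (b / (n_per_group - b))
        se = math.sqrt(1 / a + 1 / (n_per_group - a) + 1 / b + 1 / (n_per_group - b))
        results.append((or_hat, abs(math.log(or_hat)) / se > 1.96))
    return results

# Large study (1,000 per group): the median OR tracks the true OR of 1.10.
large = simulate_ors(1000, 0.30, 0.3204)
median_or_large = statistics.median(o for o, _ in large)

# Small study (50 per group): among results that are significant and in the
# direction of the true effect, the median OR is inflated well above 1.10.
small = simulate_ors(50, 0.30, 0.3204)
median_sig_small = statistics.median(o for o, sig in small if sig and o > 1)
```

With 1,000 subjects per group, the median OR across simulations sits near the true value of 1.10; with only 50 per group, the median of the significant, correctly signed ORs is inflated far above the true value, which is the phenomenon the table quantifies.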
Table 1. Simulations for Effect Sizes Passing the Threshold of Formal Statistical Significance (P = 0.05)
IQR indicates interquartile range.
While table 1 above is useful to illustrate the ESM phenomenon, demonstrate how it arises, and quantify it, the tabulated conditions and numbers in table 1 are necessarily fixed, and it would be useful to be able to generate such a table “on the fly” with inputs that are specific to a study of a researcher's particular interest. This is what the new emagnification command provides.
3 The emagnification command
3.1 Syntax
Effect-size magnification for proportions
Effect-size magnification for rates
3.2 Description
The emagnification command estimates the expected magnification of effect sizes (ORs or rate ratios) that pass a user-specified significance threshold, given hypothesized true effect sizes, group sizes, and reference-group proportions or rates. It is offered as a tool for performing the design calculations described in this article for both proportions and rates.
Iterations that do not converge (for example, when zero events are generated, which may happen with small counts) are dropped, with the number of valid (completed) iterations shown in the results table at the end of the Stata run.
3.3 Options
The pseudo-random-number generator seed is stored with the command's other saved results so that any simulation can be reproduced.
3.4 Stored results
3.5 Example inputs
The following examples illustrate the emagnification command's syntax.
Estimate the effect-size magnification for a proportion
Estimate the effect-size magnification for a rate
Estimate the effect-size magnification for a proportion in multiple scenarios using 0.1 as the level of significance and showing the inflation factor for only the median statistically significant result
Show the stored results of the latest estimation
4 ESM: Illustrating the emagnification command
To illustrate the use of the emagnification command, we draw the required inputs from the Ioannidis example: the number of subjects in the reference group; the number of subjects in the comparison group; the specific proportion or rate of interest in the reference group (here, the proportion of exposed subjects in the control group, because the Ioannidis example uses ORs); and the assumed (true) ORs (here) or rate ratios of interest.
In the first simulation (the first data row of table 1), there are 1,000 subjects per group, the proportion of exposed subjects in the control group is 0.30, and the assumed true OR is 1.10.
We insert these values into the emagnification command.
As can be seen, the values estimated by emagnification closely approximate those provided by Ioannidis in the corresponding row of table 1.
Using similar syntax and taking advantage of the ability of the emagnification command to evaluate multiple scenarios in a single call, we can re-create further rows of table 1.
Similarly, the values estimated here by Stata (medians) of 1.481 and 1.519 and respective IQRs of (1.423–1.563) and (1.440–1.623) approximate reasonably well those provided by Ioannidis in table 1. The remaining two simulations from Ioannidis (corresponding to the third and fifth rows in table 1) can be re-created using the following two commands:
Similarly, these last two commands correspond to the simulation values generated by Ioannidis. Importantly, this series of simulations illustrates, as he did, that the more underpowered (and generally smaller) a study is, and the smaller the true effect size that the study is investigating, the more inflated the observed effect sizes that pass some preestablished statistical threshold (or are by other means “discovered”) will be. Here we see the median inflation vary from 3% with a (moderate) true OR of 1.25 and a large sample of 1,000 to a near doubling of the true OR with a smaller sample size of only 50.
5 Discussion
The above examples have demonstrated that ESM has the potential to be considerable when the power of a study is low. From a practical perspective, these simulation results demonstrate that ESM should be of interest to those evaluating statistically significant results from low-powered studies and that any large effect sizes observed from such studies should be interpreted cautiously.
One question the reader may ask is how these estimated e-magnification intervals differ from or relate to the typical confidence intervals around point estimates that populate much of the literature. In addition, the reader may ask what advantages there are to considering both the (classic) 95% confidence interval around the effect size typically reported in any literature study and the effect-size magnification interval derived through emagnification.
First, the (classic) confidence interval around an estimated effect size (such as a mean or mean difference, or an OR, a rate ratio, or a hazard ratio) is an interval that is expected to contain the true parameter (or effect size) over an infinite number of repetitions of the study with a frequency no less than the confidence level, if the underlying statistical model is correct and there is no bias (Rothman, Greenland, and Lash 2008). This confidence interval can be interpreted as a plausible range of estimated ORs if the observational study were repeated many times in the exact same way and if all differences in results in those study replications could be ascribed entirely to the random nature of the Bernoulli and binomial data-generation process in the case of an OR or a rate ratio.
What the e-magnification intervals represent, in contrast, is the spread of effect sizes expected among only those simulated results that pass the chosen significance threshold, under a hypothesized true effect size.
Consider as an example the last row of table 1 simulated here with Stata:
The resulting e-magnification interval conveys how inflated a “significant” OR from a study of this size would tend to be if the true OR were as hypothesized.
Some readers may question these ESM calculations, which focus on and emphasize the power of a study, and consider them to be simply a variant of (discredited) post hoc power calculations. They are not.
Instead, ESM calculations can be considered calculations related to the “design calculations” or “post-data design analysis” advocated by Gelman and Carlin (2014) and discussed further, and more recently in greater mathematical detail, in Lu, Qiu, and Deng (2019). Gelman and Carlin (2014) advocate using power calculations, reemphasized and named “design calculations” to focus on errors in magnitude and sign instead of declarations of statistical significance, after the data have been collected to help inform a statistical data summary.
Although Gelman and Carlin (2014) focus on continuous outcomes rather than the categorical and contingency-table outcomes considered here with emagnification, the underlying design-calculation principle is the same.
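For readers who want the continuous-outcome version, a design calculation in the spirit of Gelman and Carlin (2014) can be sketched as follows. This is a Python analogue of their approach, not their published code: the function name and the median-based exaggeration summary are our choices, and the example values (a hypothesized true effect of 2 units with a standard error of 8.1) are purely illustrative.

```python
import random
from statistics import NormalDist, median

def design_calc(true_effect, se, alpha=0.05, n_sims=50000, seed=7):
    """Post-data design analysis: power, the probability that a significant
    result has the wrong sign (Type S), and the typical exaggeration of
    significant estimates (a median-based Type M summary)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    lam = true_effect / se  # true effect in standard-error units
    power = (1 - nd.cdf(z_crit - lam)) + nd.cdf(-z_crit - lam)
    type_s = nd.cdf(-z_crit - lam) / power  # P(wrong sign | significant)
    rng = random.Random(seed)
    significant = [est for est in (rng.gauss(true_effect, se) for _ in range(n_sims))
                   if abs(est / se) > z_crit]
    exaggeration = median(abs(e) for e in significant) / true_effect
    return power, type_s, exaggeration

# Illustrative inputs: hypothesized true effect of 2 units, standard error 8.1.
power, type_s, exaggeration = design_calc(true_effect=2.0, se=8.1)
```

In this low-power setting, the probability of a sign error among significant results is appreciable, and significant estimates exaggerate the hypothesized true effect several-fold.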
Finally, it is important when conducting simulations to use realistic hypothesized true effect sizes based on information that is external to the study under review. Specifically, Gelman and Carlin (2014) indicate that ranges of plausible effect sizes can be developed from auxiliary data, from direct literature, from meta-analyses derived from a systematic review, or from general subject-matter expertise or knowledge. However, they acknowledge that some fields may have very unclear effect sizes and that in other cases, the investigational area may be brand new, with estimates of effect size not readily available; in these cases, they recommend that researchers consider a broad range of possible effect sizes and perform calculations such as those illustrated here with emagnification.
While we emphasize that low-powered studies tend to produce greater degrees of ESM in results found to be statistically significant (or that pass other threshold criteria) than higher-powered studies, in the context of the ORs or rate ratios typically found in epidemiology, we note that the ESM phenomenon is a principle applicable to discovery science in general and is not a specific affliction or malady of epidemiology (Ioannidis 2005, 2008; Yarkoni 2009; Lehrer 2010; Button et al. 2013; Button 2013; Reinhart 2015). Hence, it is applicable to any science in which studies tend to be underpowered and emphasize the use of p-values to “discover” an effect, and it is often seen in studies in pharmacology, in gene studies, in psychological studies, and in oft-cited medical literature.

In short, any discovered associations from an underpowered study that are highlighted or focused upon on the basis of passing a statistical or similar threshold will be systematically biased away from the null. The potential degree of this inflation or bias away from the null will depend on several factors, including the background rate of the outcome of interest, the sample size of the study, and the effect size of interest. It follows that low-powered epidemiological studies investigating small or weak effects in populations with a low background rate of the (health) outcome of interest will tend toward the greatest degree of ESM. Note that this is an issue of how studies are interpreted by users, not one intrinsic to, or the fault of, the study design; nor does it reflect a departure from good scientific principles or practices.
6 Summary and conclusion
In sum, the ESM phenomenon is real, is important for more appropriately interpreting underpowered studies, and in many ways is underrecognized and underappreciated in the research community and among regulators and decision makers. The phenomenon is not specific to epidemiology and is applicable to any science in which studies tend to be underpowered and emphasize the use of p-values to “discover” an effect, and it is important that users of statistical study results recognize this issue and its potential interpretational consequences. The new emagnification command makes the ESM phenomenon straightforward to quantify for the study designs and effect sizes of interest to a particular researcher.
7 Additional notes
Some material presented here was originally generated by two of the authors, who served in various capacities on an EFSA panel on PPR that, in turn, followed up on findings of the external scientific report “Literature review of epidemiological studies linking exposure to pesticides and health effects” (Ntzani et al. 2013; University of Ioannina Medical School; EFSA-Q-2014-00481). As part of their work on the PPR, the authors contributed to the review and writing of “Scientific opinion of the PPR panel on the follow-up of the findings of the external scientific report ‘Literature review of epidemiological studies linking exposure to pesticides and health effects’” and its Annex D, where much of this material originally appeared. The PPR panel report is published in the EFSA Journal (EFSA Panel on Plant Protection Products and their Residues [PPR] 2017), an official publication of EFSA. This Stata Journal article introduces the new emagnification command, which implements and extends that material.
The analysis described in this article has been reviewed by the U.S. Environmental Protection Agency’s Office of Chemical Safety and Pollution Prevention and approved for publication. Approval does not signify that the contents necessarily reflect the views, policies, or determinations of the Agency, nor does the mention of trade names of commercial products constitute endorsement or recommendation for use.
Supplemental Material
Supplemental Material, st0608 for emagnification: A tool for estimating effect-size magnification and performing design calculations in epidemiological studies, by David J. Miller, James T. Nguyen, and Matteo Bottai, in The Stata Journal.
8 Programs and supplemental materials
To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type
