Abstract
Meta-analyses are essential for cumulative science, but their validity can be compromised by publication bias. To mitigate the impact of publication bias, one may apply publication-bias-adjustment techniques such as precision-effect test and precision-effect estimate with standard errors (PET-PEESE) and selection models. These methods, implemented in JASP and R, allow researchers without programming experience to conduct state-of-the-art publication-bias-adjusted meta-analysis. In this tutorial, we demonstrate how to conduct a publication-bias-adjusted meta-analysis in JASP and R and interpret the results. First, we explain two frequentist bias-correction methods: PET-PEESE and selection models. Second, we introduce robust Bayesian meta-analysis, a Bayesian approach that simultaneously considers both PET-PEESE and selection models. We illustrate the methodology on an example data set, provide an instructional video (https://bit.ly/pubbias) and an R-markdown script (https://osf.io/uhaew/), and discuss the interpretation of the results. Finally, we include concrete guidance on reporting the meta-analytic results in an academic article.
Meta-analyses are a powerful tool for evidence synthesis. After a large body of literature has accumulated, researchers may want to conduct a meta-analysis to assess the overall evidence for a claim. This might be because they wish to estimate the size of the effect more precisely or because they want to test whether there is an aggregate nonzero effect in this line of investigation at all.
However, these meta-analytic inferences can be frustrated by publication bias—the preferential publishing of statistically significant studies. This bias leads to an overestimation of effect sizes when evidence across a set of primary studies is accumulated (Kvarven et al., 2020; Rosenthal & Gaito, 1964). Some researchers have claimed that most research findings might never be published but instead languish in researchers’ file drawers (e.g., Ioannidis, 2005; Rosenthal, 1979). Even if the true extent of publication bias were less severe than these researchers have suggested, it would remain a formidable threat to the validity of meta-analyses (Borenstein et al., 2009). Indeed, there have been cases in which entire paradigms were possibly based on spurious results, caused in part by publication bias (e.g., Bartoš et al., 2022; Carter & McCullough, 2014; Haaf et al., 2020; Klein et al., 2019; Maier et al., 2022).
In this tutorial, we first introduce two frequentist methods to adjust for publication bias in meta-analysis: precision-effect test and precision-effect estimate with standard errors (PET-PEESE) and selection models. First, PET-PEESE is a meta-analytic estimator that adjusts for the correlation between effect sizes and standard errors. Second, selection models form a set of meta-analytic estimators that correct for different publication probabilities across different p-value intervals (for other methods and their implementation, see Table 1).
Summary of Publication-Bias-Adjustment Methods and Implementations
For an accessible overview and simulation studies, see Carter et al. (2019) and Hong and Reed (2020).
The WAAP-WLS (a hybrid of weighted average of the adequately powered studies and weighted least squares) can be implemented using a sequence of two lm() calls. The first lm() call estimates the unadjusted meta-analytic effect-size estimate using a weighted least squares regression, and the second lm() call reestimates the weighted least squares regression using only the studies that have sufficient power to detect the unadjusted effect-size estimate.
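The two lm() calls can be sketched in base R as follows. All numbers are made up, and the power cutoff (SE no larger than the absolute estimate divided by 2.8, corresponding to roughly 80% power for a two-sided test at α = .05) is one common operationalization, shown here only for illustration.

```r
# Illustrative WAAP-WLS sketch in base R; all numbers are made up.
yi  <- c(0.45, 0.30, 0.20, 0.25, 0.50, 0.18)  # toy effect sizes
sei <- c(0.20, 0.15, 0.05, 0.10, 0.25, 0.06)  # toy standard errors

# Step 1: unadjusted WLS estimate of the mean effect (intercept-only model,
# inverse-variance weights).
fit_wls <- lm(yi ~ 1, weights = 1/sei^2)
mu_hat  <- coef(fit_wls)[1]

# Step 2: re-estimate using only "adequately powered" studies; here, studies
# with SE <= |mu_hat| / 2.8 (roughly 80% power at alpha = .05, two-sided).
powered  <- sei <= abs(mu_hat) / 2.8
fit_waap <- lm(yi[powered] ~ 1, weights = 1/sei[powered]^2)
coef(fit_waap)[1]  # WAAP-WLS estimate
```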
At the time of writing, CRAN did not feature an R package that implements p-curve. A web application from the original authors can be found at http://p-curve.com/.
Extensive simulation studies have shown that these methods often come to different conclusions, depending on the data-generating process (Carter et al., 2019; Hong & Reed, 2020; McShane et al., 2016). The usual recommendation for accommodating the differences between methods is to apply multiple methods simultaneously (e.g., Carter et al., 2019; Hong & Reed, 2020; McShane et al., 2016). This can be done by fitting the different methods and subjectively comparing their results. However, it is unclear how to combine the estimates across methods or what to conclude if some methods find evidence for publication bias and others do not. The substantial differences between the estimates of different methods (see, e.g., the Meta Explore app by Carter et al., 2019; https://tellmi.psy.lmu.de/felix/metaExplorer/) make it difficult to derive robust conclusions from publication-bias-adjusted meta-analysis. In addition, researchers may unwittingly succumb to the temptation of “cherry-picking” the one method that does not show publication bias in their specific setting.
Here, we outline a more formal way to combine inferences from different methods using Bayesian model averaging (Bartoš, Gronau, et al., 2021; Carter & McCullough, 2018; Gronau et al., 2021; Hinne et al., 2020; Hoeting et al., 1999). Bayesian model averaging is a technique that allows researchers to specify different models simultaneously and agnostically lets the data guide inferences using different models proportional to how well they predict the data. We combine PET-PEESE, selection models, and naive fixed- or random-effects meta-analysis in a model-averaging framework called robust Bayesian meta-analysis (RoBMA). We also implemented RoBMA in JASP (JASP Team, 2021; Ly et al., 2021), which is a free and open-source statistical-software package that uses a graphical user interface. In this tutorial, we explain how to use RoBMA in R and JASP.
In the next sections, we first briefly introduce the example data set, a meta-analysis on acculturation mismatch (Lui, 2015). Second, we provide an accessible explanation of PET-PEESE and selection models. Third, we introduce RoBMA, which combines selection models and PET-PEESE under one model-averaging umbrella. None of these methods had previously been implemented in graphical-user-interface software, which limited their accessibility to researchers without programming experience. Therefore, we provide guidance on using these methods in both R and JASP. We show how to apply the methods, interpret the results, and provide an example report of a results section (Appendix A). We further accompany the tutorial with an R-markdown file (https://osf.io/uhaew/) and recorded tutorial videos (https://bit.ly/pubbias) to facilitate the application of the implemented methods. Detailed documentation describing the options of the JASP analyses is accessible via the blue “i” icon in the analysis input headings.
Running Example: Acculturation Mismatch and Intergenerational Cultural Conflict
Lui (2015) studied how acculturation mismatch (i.e., the contrast between the collectivist cultures of Asian and Latin immigrant groups and the individualistic culture of the United States) correlates with intergenerational cultural conflict by meta-analyzing 18 independent studies. A standard random-effects reanalysis, calculated with a restricted-maximum-likelihood estimator and using Fisher z values transformed from correlation coefficients (for more information about effect-size transformations, see Appendix C), indicates a significant positive relationship between acculturation mismatch and intergenerational cultural conflict,
The data set for the following analyses can be downloaded from the OSF repository at https://osf.io/mgu7v/. The first part of the R-markdown file explains how to set up the R environment and packages, load the data set, and perform the effect-size transformation required for meta-analysis (handled automatically in JASP).
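The effect-size transformation mentioned above (handled automatically in JASP) can be sketched in base R. The correlations and sample sizes below are made up for illustration.

```r
# Fisher z transformation of correlations (see Appendix C); values are made up.
r <- c(0.35, 0.42, 0.28)  # toy correlation coefficients
n <- c(120, 85, 200)      # toy sample sizes

z   <- atanh(r)           # equivalently 0.5 * log((1 + r) / (1 - r))
sez <- 1 / sqrt(n - 3)    # standard error of Fisher z

tanh(mean(z))  # back-transform a z-scale estimate to the correlation scale
```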
PET-PEESE
Theoretical background
PET-PEESE is a publication-bias-adjustment method that corrects for the correlation between effect sizes and standard errors or between effect sizes and squared standard errors (Stanley & Doucouliagos, 2014). It can be considered one member of a broader class of funnel-plot-based methods that adjust for the relationship between effect sizes and standard errors (e.g., Duval & Tweedie, 2000; Egger et al., 1997). Because the standard error of a standardized effect size depends on the sample size, the term “small-study effects” is often used to refer to the overestimation of the meta-analytic mean effect size that is due to less precise studies.
The general idea behind PET-PEESE, and the other funnel-plot-based methods, is that the effect sizes and standard errors ought to be unrelated in the absence of publication bias—information about the standard error of any given study should not inform about the effect-size estimate of the study. Publication bias can introduce a relationship between the effect sizes and standard errors; studies with large sample sizes will usually get published, whereas small studies will be published only if they reach statistical significance. Therefore, the presence of publication bias often results in a relative increase of imprecise studies with inflated effect-size estimates.
However, there might be explanations for a relationship between effect sizes and standard errors other than publication bias (Lau et al., 2006). For example, in the case of a heterogeneous population of effect sizes, researchers might conduct power analyses and target smaller effects with larger studies.
PET-PEESE corrects for the effect-size inflation by using a two-step procedure. In the first step, the PET model, a weighted least squares regression that predicts the effect sizes from the standard errors, is estimated and used to test for the presence of the effect with
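A minimal base-R sketch of the two-step procedure follows; the data are made up, and the α = .05 threshold for the first-step test is an illustrative choice.

```r
# Two-step PET-PEESE sketch on made-up data.
yi  <- c(0.10, 0.35, 0.22, 0.48, 0.15, 0.30)  # toy effect sizes
sei <- c(0.05, 0.18, 0.10, 0.25, 0.07, 0.15)  # toy standard errors

# Step 1 (PET): WLS regression of effect sizes on standard errors; the
# intercept estimates the effect size of a hypothetical study with SE = 0.
pet   <- lm(yi ~ sei, weights = 1/sei^2)
pet_p <- summary(pet)$coefficients["(Intercept)", "Pr(>|t|)"]

# Step 2: if the PET test of the effect is significant (alpha = .05 here is an
# illustrative choice), report the PEESE intercept (regression on squared
# standard errors); otherwise report the PET intercept.
peese    <- lm(yi ~ I(sei^2), weights = 1/sei^2)
estimate <- if (pet_p < .05) coef(peese)[1] else coef(pet)[1]
```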
Application to the running example
The first part of the video (at the 5-min 30-s mark) shows how to perform the PET-PEESE analysis in JASP (Fig. 1). The corresponding analysis with R is outlined in the second part of the R-markdown file.

Results from Lui (2015) using the precision-effect test and precision-effect estimate with standard errors (PET-PEESE) analysis in JASP. Screenshot from the JASP graphical user interface when analyzing the data of Lui (2015). The analysis settings are specified in the left panel (click the blue “i” icon for a description of the controls), and the associated output is shown in the right panel. The output shows (1) a test of effect size based on the precision-effect test (PET), (2) a test of publication bias based on PET, (3) effect-size estimates from PET and PEESE, and (4) the estimated PET regression model visualizing the relationship between standard errors and effect sizes.
To interpret the results, we first focused on the test of effect size based on the precision-effect test (PET) model under the “Test of Effect” table in the upper right part of Figure 1. We found that the test of effect size is not significant, so we proceeded to interpret the effect-size estimate on the basis of the PET model under the “Estimates” table in the middle right part of Figure 1. We found that the adjusted mean-effect-size estimate is practically zero,
We can further visualize the PET metaregression estimate of the relationship between the effect sizes and standard errors displayed in the bottom right part of Figure 1. The figure illustrates the relationship between standard errors (x-axis) and effect sizes (y-axis). The adjusted estimate then corresponds to the intercept with the y-axis. Figure 1 is similar to a funnel plot except that the funnel plot visualizes the effect sizes on the x-axis and the standard errors on the y-axis. Showing the standard errors on the x-axis, as here, highlights the PET-PEESE signature of publication bias (i.e., that less precise studies show larger effect sizes).
Selection Models
Theoretical background
Selection models are publication-bias-adjustment methods that use weighted likelihood to account for studies that are missing because of publication bias. Selection models are well established among statisticians (e.g., Iyengar & Greenhouse, 1988; Larose & Dey, 1998; Vevea & Hedges, 1995) and can accommodate realistic assumptions regarding publication bias and heterogeneity (e.g., the chances of publication depend on reported p values; Citkowicz & Vevea, 2017).
Selection models offer multiple ways of defining the relationship between p values and the relative publication probabilities via the weight function. Parameters of the weight function are estimated simultaneously with the rest of the model, which allows selection models to correct for the missing studies. Here, we focus on the step-weight functions that specify distinct p-value intervals, each governed by a different relative publication probability (for information about other implementations of selection models, see Box 1). The step-weight-function selection models are the most popular, arguably because of their simplicity, accessibility, and good performance across multiple simulation studies (e.g., Carter et al., 2019; Hong & Reed, 2020; Maier, Bartoš, & Wagenmakers, 2022; McShane et al., 2016).
Box 1.
Selection Models and Weight Functions
Throughout the article, we focus exclusively on selection models specified with a step-weight function that is estimated from the data. This type of selection model allows researchers to adjust for publication bias operating on p-value thresholds, with the relative publication probabilities estimated simultaneously with the meta-analytic model.
However, there exist many other types of selection models and different use cases. First, selection models offer a wide variety of weight functions that can be associated with p values, standard errors, or additional variables (for more details, see Citkowicz & Vevea, 2017; Iyengar & Greenhouse, 1988; McShane et al., 2016; Patil & Taillie, 1989; Preston et al., 2004). Second, selection models can be used with prespecified weight functions to perform a sensitivity analysis under different assumptions about the degree of publication bias (e.g., Mathur & VanderWeele, 2020; Vevea & Woods, 2005).
In R, many of the approaches above are implemented via the selmodel() function in the metafor package (Viechtbauer, 2010). Use ?metafor::selmodel in R for detailed documentation and examples.
To apply selection models with step-weight functions, researchers specify p-value intervals with different publication probabilities, for example, “statistically significant” p values (
Step-weight-function selection models can be specified flexibly in several ways. First, researchers can decide between one-sided and two-sided selection. One-sided selection means that only significant effects in the expected direction are more likely to be published. Commonly, significant positive effects are more likely to be published, although in some cases, significant negative effect sizes might be more likely to be published as well. Researchers can specify the direction of selection flexibly. Two-sided selection means that the probability of publication does not depend on the direction of the effect; in other words, positive and negative effects have the same probability of being published given that they fall in the same p-value interval.
Second, researchers may also specify different intervals for different publication probabilities. For example, to account for the fact that marginally significant results (
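The step-weight function described above can be sketched as a simple lookup from one-sided p values to relative publication probabilities. The cutoffs (.05 and .10) follow the example, but the ω values below are made-up illustrations, not estimates.

```r
# Step-weight function sketch: cutoffs at one-sided p = .05 and .10 define
# three intervals, each with its own relative publication probability (omega).
# The omega values below are made-up illustrations, not estimates.
step_weight <- function(p, cuts = c(.05, .10), omega = c(1, 0.6, 0.2)) {
  # findInterval() maps each p value to its interval: p <= .05 gets omega[1],
  # .05 < p <= .10 gets omega[2], and p > .10 gets omega[3].
  omega[findInterval(p, cuts, left.open = TRUE) + 1]
}

step_weight(c(0.01, 0.07, 0.40))  # relative publication probabilities
```

In an actual selection model, these ω parameters are estimated jointly with the meta-analytic model by weighting each study's likelihood contribution.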
Application to the running example
The middle section of the video (at the 7-min mark) shows how to perform the selection-model analysis in JASP (Fig. 2), which uses the weightr R package (Coburn & Vevea, 2019). The corresponding analysis with R is then outlined in the third part of the R-markdown file.

Results from Lui (2015) using the selection-models analysis in JASP. Screenshot from the JASP graphical user interface when analyzing the data of Lui (2015). The analysis settings are specified in the left panel (click the blue “i” icon for a description of the controls), and the associated output is shown in the right panel. The output shows (1) a test of heterogeneity, (2) a test of publication bias, and (3) adjusted and unadjusted effect-size estimates for the random-effects models.
To interpret the results, we first focused on the test of heterogeneity under the “Test of Heterogeneity” table in the upper right part of Figure 2.
We found that the test of heterogeneity was significant, so we proceeded to interpret the test for publication bias assuming heterogeneity in the “Test of Publication Bias” table, the second table on the right side of Figure 2. We found that the test for publication bias assuming heterogeneity is significant only when using
The above procedure involved an initial test for heterogeneity and an initial test for publication bias; on the basis of the outcomes of these tests, we then applied the random-effects selection model. However, some researchers have argued that random-effects models are to be preferred over fixed-effects models under almost all circumstances (Borenstein et al., 2010). Furthermore, the test for publication bias is often underpowered (Rothstein et al., 2005), and the presence of publication bias can greatly affect the heterogeneity estimate and its test (Augusteijn et al., 2019; Jackson, 2006). Therefore, one may argue that it is prudent to use the adjusted effect-size estimate from the random-effects selection model regardless of the tests for heterogeneity and publication bias (which would not change the result in our case).
We can further visualize the estimated weight function based on the random-effects model.
The bottom right part of Figure 2 shows the estimated publication probabilities (y-axis) for the different p-value intervals (x-axis). The x-axis is rescaled to show equal distance between p-value cut points. This rescaling facilitates readability when the p-value cut points are defined to be relatively close. The first p-value interval
Limitations of PET-PEESE and Selection Models
Although PET-PEESE and selection models provide a powerful adjustment in several situations, the frequentist methods outlined above have several shortcomings.
The first limitation is that frequentist Neyman-Pearson point-null hypothesis significance tests (NHSTs) are based on binary accept/reject decisions. When the number of primary studies is small, the methods might have insufficient power, compromising the reliability of the accept/reject decisions (cf. Robinson, 2019). Insufficient power is a considerable problem for the test of publication bias. From a frequentist point of view, not rejecting the point-null hypothesis does not imply that there is evidence in its favor. A single frequentist significance test against a point-null hypothesis cannot distinguish between absence of evidence (i.e., the data are uninformative) and evidence of absence (i.e., the data support the null hypothesis; Keysers et al., 2020; Wagenmakers et al., 2016). This problem was highlighted in the selection-models example—it was unclear whether nonsignificance at the .05 level indicated evidence of absence or absence of evidence regarding publication bias. A closely related limitation is that selection models cannot be estimated when there are too few p values in the specified p-value intervals, which is highly likely when the number of primary studies is small.
The second limitation is accumulation bias (ter Schure & Grünwald, 2019). Consider meta-analyzing k primary studies with a frequentist method. At a later point in time, an additional study
A third limitation is that one needs to decide between different methods. PET-PEESE and selection models will sometimes arrive at different results. Although it is advised to fit multiple adjustment methods that are suitable under the given conditions for sensitivity analysis (Carter et al., 2019; McShane et al., 2016), it is less clear what to conclude if the different methods disagree. Ideally, one would want to combine different models into a single method that bases the inference on multiple models simultaneously depending on how well they account for the data.
To overcome these limitations, we developed robust Bayesian meta-analysis (RoBMA; Bartoš, Maier, et al., 2021; Maier, Bartoš, & Wagenmakers, 2022), which combines selection models and PET-PEESE using Bayesian model averaging. In the next sections, we explain RoBMA conceptually and show how it alleviates the shortcomings of frequentist selection models. In addition, we illustrate the workings of the JASP and R implementation in practice.
RoBMA
Theoretical background
RoBMA is a meta-analytic framework that uses Bayesian model averaging to adjust for publication bias (Bartoš, Maier, et al., 2021; Maier, Bartoš, & Wagenmakers, 2022). RoBMA allows researchers to simultaneously estimate different models and base the results on a weighted combination of their estimates. The models can be generally divided into three different pairs:
models assuming the null hypothesis to be true versus models assuming the alternative hypothesis to be true (i.e.,
models assuming fixed effects versus models assuming random effects (i.e.,
models assuming publication bias versus models assuming no publication bias (i.e.,
The models assuming publication bias encompass the different publication-bias adjustments. We specify both the PET-PEESE adjustment (however, instead of selecting either PET or PEESE, we model-average across both) and the selection-model adjustment. For the selection models, we specify the following weight functions:
1. Two-sided
(a) p-value cutoffs = .05
(b) p-value cutoffs = .05 and .10
2. One-sided
(a) p-value cutoffs = .05
(b) p-value cutoffs = .025 and .05
(c) p-value cutoffs = .05 and .50
(d) p-value cutoffs = .025, .05, and .50
Overall, RoBMA contains eight distinct ways of adjusting for publication bias (PET, PEESE, and six weight functions). The complete model ensemble is then constructed as a combination of all possible components, 2 (Effect vs. No Effect) × 2 (Heterogeneity vs. No Heterogeneity) × 9 (Publication Bias [8] vs. No Publication Bias), resulting in 36 different models. For further details, see “Appendix A: Model Specifications” in Bartoš, Maier, et al. (2021).
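The 2 × 2 × 9 ensemble can be enumerated in base R; the component labels below are ours, not identifiers from the RoBMA package.

```r
# Enumerate the RoBMA ensemble: 2 (effect) x 2 (heterogeneity) x 9 (bias).
# The component labels are ours, used only for illustration.
ensemble <- expand.grid(
  effect        = c("null", "alternative"),
  heterogeneity = c("fixed", "random"),
  bias          = c("none", "PET", "PEESE",
                    "two-sided: .05", "two-sided: .05, .10",
                    "one-sided: .05", "one-sided: .025, .05",
                    "one-sided: .05, .50", "one-sided: .025, .05, .50")
)
nrow(ensemble)  # 36 distinct models
```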
Prior distributions
To complete the specification of RoBMA, we need to specify prior parameter distributions (see Boxes 2 and 3) and set the prior model probabilities. We use the default settings outlined and tested in a simulation study by Bartoš, Maier, et al. (2021). The simulation study verified that the prior specification performs well in terms of mean square error and bias of the estimates as well as the evidence in favor of the null and alternative hypotheses across a range of scenarios considered typical for psychology. Furthermore, the prior specification outperformed a variety of other publication-bias-adjustment methods on real data examples (Bartoš, Maier, et al., 2021).
Box 2.
Prior Distributions (Part I)
A core part of every Bayesian analysis is the specification of appropriate prior distributions. This Box outlines the default and alternative prior distributions. The R package internally transforms the specified prior distributions to the Fisher z scale that is used for estimating the models. Users can change the scale for setting the priors in both R and JASP. The suggested alternative prior distributions can be used for a robust Bayesian meta-analysis (RoBMA) sensitivity analysis, that is, an assessment of the degree to which the reported conclusions are robust to alternative specification of the prior distributions.
Prior Distributions on Effect Size
By default, we use a standard normal distribution on the effect size,
δ ~ Cauchy(location = 0, scale = 0.707)—a default prior distribution in Bayes factor testing, appropriate when large effects cannot be ruled out (Morey & Rouder, 2015).
δ ~ Student−t[0, ∞](location = 0.35, scale = 0.102, df = 3)—an informed prior distribution for small- to medium-sized effects, called the “Oosterwijk prior distribution” after the expert from whom the distribution was elicited (Gronau et al., 2020).
Prior Distributions on Heterogeneity
We suggest the
Box 3.
Prior Distributions (Part II)
Prior Distributions for Precision-Effect Test and Precision-Effect Estimate With Standard Errors
By default, we suggest half-Cauchy distributions on the PET,
βPET ~ Gamma(shape = 2.84, rate = 2.19)
βPEESE ~ Gamma(shape = 2.32, rate = 0.86)
Prior Distributions for Publication Bias Weights
The default prior distributions for the publication-bias weights are unit cumulative Dirichlet distributions. These encode the intuitive assumption that studies with statistically significant p values have a higher relative publication probability than studies with marginally significant p values, and that studies with marginally significant p values have a higher relative publication probability than studies with statistically nonsignificant p values (for an illustration supporting this assumption with a collection of more than 1 million test statistics from Medline, see van Zwet & Cator, 2021, Fig. 1). This assumption allows a more efficient use of information about the publication process, which is especially relevant when the number of studies is small and some p-value intervals contain only a few or no studies. We suggest the following alternative priors on the publication weights ω, obtained from simulations (Bartoš, Maier, et al., 2021, Appendix B):
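The ordering assumption encoded by a unit cumulative Dirichlet distribution can be illustrated by construction in base R. This is a conceptual sketch, not the internals of the RoBMA package.

```r
# Conceptual sketch of a unit cumulative Dirichlet draw for three p-value
# intervals (significant, marginal, nonsignificant); not the RoBMA internals.
set.seed(1)
g <- rgamma(3, shape = 1)  # three unit-exponential draws...
d <- g / sum(g)            # ...normalized: a draw from Dirichlet(1, 1, 1)

# Reverse cumulative sums yield nonincreasing weights with weight 1 for the
# most significant interval, encoding the ordering assumption.
omega <- rev(cumsum(rev(d)))
omega
```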
We split the prior model probabilities equally across the different model pairs. In other words, we assign 50% prior model probability to models that assume the presence of an effect, 50% prior model probability to models that assume the presence of heterogeneity, and 50% prior model probability to models that assume the presence of publication bias. This division of prior model probabilities reflects a position of equipoise and puts the models on an equal footing before the arrival of the data (e.g., Gronau et al., 2021; Hinne et al., 2020; Jeffreys, 1939; Madigan et al., 1994; Madigan & Raftery, 1994; Raftery, 1995; for alternatives, see Castillo et al., 2015; Scott & Berger, 2006, 2010; Wilson et al., 2010).
However, we point out that these are only the default settings, and researchers can specify different priors if they so desire. For instance, the prior distribution on the effect-size parameter of the null hypothesis can be modified to specify a test against a perinull hypothesis (i.e., the spike can be changed to a narrow “slab”; e.g., Berger & Delampady, 1987; Cornfield, 1966; George & McCulloch, 1993), and the prior distribution on the alternative hypothesis can be changed to be more informed or directional (Bartoš, Gronau, et al., 2021; Gronau et al., 2017, 2020; see Boxes 2 and 3). Prior knowledge can also be incorporated into the prior model probabilities. Researchers interested in effect-size estimation (e.g., McElreath, 2020) may remove models that assume the effect is absent (i.e., assign these models zero prior probability; but see van den Bergh et al., 2021). Other researchers may for theoretical reasons include only random-effects models and assign zero prior probability to fixed-effects models (e.g., Rothstein et al., 2005; but for empirical evidence from medicine showing that fixed-effects models show relatively good predictive performance, see Bartoš, Gronau, et al., 2021). In addition, it is sometimes argued that models based on the correlation between effect sizes and standard errors might find spurious evidence for publication bias, for example, when researchers take into account heterogeneity by studying small effects with larger samples (Lau et al., 2006). To test this possibility, one may omit the PET-PEESE models (i.e., assigning them zero prior probability) and assess the extent to which this affects the overall conclusions.
Another reason for omitting some of the models from the ensemble is when they are clearly inappropriate (e.g., the PET-PEESE publication-bias-adjustment models require variability in the standard errors/sample sizes of the original studies; Stanley, 2017)—if all conducted studies had the same standard error, the relationship between the standard errors and effect sizes cannot be estimated. Appendix B shows how RoBMA can be adjusted to compare a perinull hypothesis with an informed alternative hypothesis.
Bayesian model averaging
After we specified the prior distributions and estimated the individual models, we updated the posterior model probabilities of the individual models using Bayes’s rule. In other words, models that predict the data well receive a boost in posterior probability, whereas models that predict the data poorly suffer a decline (Wagenmakers, 2020; Wagenmakers et al., 2016). When comparing only two models, we can describe their relative predictive performance using Bayes factors (BFs; Etz & Wagenmakers, 2017; Jeffreys, 1961; Kass & Raftery, 1995; Rouder & Morey, 2019; Wrinch & Jeffreys, 1921). The BF equals the change from prior to posterior odds. If both hypotheses are equally likely a priori, the posterior odds equal the BF. This relationship is illustrated in Equation 1 for two models that both assume the presence of heterogeneity and the absence of publication bias; however, one model assumes the presence of the effect, whereas the other assumes its absence:
where
More than two models can be compared using the “inclusion Bayes factor.” This BF allows researchers to quantify the evidence for a meta-analytical effect, the evidence for heterogeneity, and the evidence for publication bias. When we compare the class of models assuming publication bias with the class of models assuming no publication bias, the inclusion BF can be calculated as in Equation 2:
In other words, the inclusion BF for publication bias is obtained by contrasting the prediction accuracy of all models that assume publication bias to all models that assume no publication bias. The inclusion BF for effect size and heterogeneity can be calculated analogously. One advantage of BFs is that they can distinguish between absence of evidence and evidence of absence. In addition, they can quantify evidence on a continuous scale and can be updated sequentially as studies accumulate, which is not advisable when using a conventional NHST approach. The following rule of thumb can aid the interpretation of BFs:
After updating the models according to their posterior probability, the final effect-size estimate is obtained by Bayesian model averaging (e.g., Hinne et al., 2020; Hoeting et al., 1999). In Bayesian model averaging, the effect size from each individual model is weighted by its posterior probability. Because those models that predicted the data best have the highest posterior probability, the final estimate is based most strongly on the most appropriate models.
Bayesian model averaging is especially relevant in the context of publication-bias adjustment in meta-analyses. Carter et al. (2019) and Hong and Reed (2020) identified the conditions under which particular publication-bias methods perform best (e.g., when heterogeneity is high, selection models generally outperform most other methods); in practical applications, however, it usually remains unclear which condition holds for the data set at hand. A similar logic applies to other assumptions—for instance, the degree of variability in the standard errors of the individual studies warranting the use of the PET-PEESE publication-bias adjustment is hard to define (i.e., it is unclear below what degree of variability PET-PEESE should no longer be used). Bayesian model averaging applies PET, PEESE, and selection models to the data simultaneously, weighting their relative impact by the extent to which the rival publication-bias methods predicted the observed data. If the variability of the standard errors is too low, the PET-PEESE models will predict the data poorly and thus contribute little to the inference. In other words, an assumption violation is often equivalent to a poor description of the data by a model; therefore, Bayesian model averaging makes models more robust to misspecification. Finally, whereas a large number of studies will often yield clear evidence for a single model, a low number of studies usually yields evidence that is less conclusive. In such cases, Bayesian model averaging allows the uncertainty about the most appropriate model to be incorporated in a coherent manner, providing optimal estimates that do not suffer from overconfidence (Hoeting et al., 1999).
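The inclusion BF (Equation 2) and the model-averaged estimate can be sketched with hypothetical numbers; the posterior model probabilities and per-model estimates below are made up for a toy ensemble of four models.

```r
# Inclusion BF (Equation 2) and model-averaged estimate for a toy ensemble of
# four models; all probabilities and estimates are made up.
prior_prob     <- c(0.25, 0.25, 0.25, 0.25)    # equal prior model probabilities
posterior_prob <- c(0.10, 0.15, 0.45, 0.30)    # hypothetical posteriors
has_bias       <- c(FALSE, FALSE, TRUE, TRUE)  # models assuming publication bias

# Inclusion BF: change from prior to posterior inclusion odds.
inclusion_bf <- (sum(posterior_prob[has_bias]) / sum(posterior_prob[!has_bias])) /
                (sum(prior_prob[has_bias])     / sum(prior_prob[!has_bias]))

# Model-averaged effect size: per-model estimates weighted by posterior
# model probabilities.
estimates <- c(0.02, 0.10, 0.25, 0.40)
averaged  <- sum(posterior_prob * estimates)
```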
RoBMA’s benefits
RoBMA overcomes the limitations of frequentist selection models and PET-PEESE in several ways. First, the BF allows researchers to quantify relative evidence for the null hypothesis and thus distinguish between absence of evidence and evidence of absence.
Second, the model averaging obviates the need to select a single model in an all-or-none fashion. Therefore, if there is uncertainty regarding the presence of publication bias, RoBMA can base the inference on both the “normal” models and the publication-bias-adjusted models instead of needing to commit fully to a single model.
Third, the prior distributions allow the selection models to be estimated even when some p-value intervals contain few p values, a situation in which frequentist selection models fail; RoBMA will not fail to converge under these conditions. Especially in this context, however, it is important to specify the prior distributions carefully and to check the robustness of the results to different prior specifications. Concretely, we recommend using the distributions in Boxes 2 and 3 in addition to the default priors to check whether the conclusions are robust to the choice of prior.
Fourth, BFs allow for sequential updating (Rouder, 2014; Rouder & Morey, 2011; Wagenmakers et al., 2016), meaning that new studies can be added to the set and the analysis can be updated without having to worry about accumulation bias. At every point in time, RoBMA quantifies evidence using the relative predictive performance of the rival models for the observed data.
Application to the running example
The last part of the video (at the 15-min 40-s mark) shows how to perform the robust Bayesian meta-analysis in JASP (Fig. 3), which uses the RoBMA R package (Bartoš & Maier, 2020). The corresponding analysis with R is outlined in the fourth part of the R-markdown file. In contrast to the previous methods, RoBMA is estimated via Markov chain Monte Carlo (MCMC), the convergence of which ought to be checked. Both JASP and R return automatic convergence warnings (further diagnostics can be obtained in the “MCMC Diagnostics” menu in JASP; the R-markdown file contains more details for the R implementation) that prompt the user to adjust the MCMC fitting process (for details, see the JASP or R help files).

Results from Lui (2015) using robust Bayesian meta-analysis (RoBMA) in JASP. Screenshot from the JASP graphical user interface when we analyzed the data of Lui (2015). The analysis settings are specified in the left panel (use the blue “i” icon for a description of the controls), and the associated output is shown in the right panel. The output displays (1) a summary of the model components, (2) model-averaged estimates of the effect size and heterogeneity, and (3) the prior (gray) and posterior (black) model-averaged distributions for the effect-size estimate. The arrows centered at zero correspond to the point-probability mass allocated to the null hypothesis (secondary y-axis), and the smooth densities correspond to the distributions under the alternative hypothesis.
To interpret the results, we first focused on the model-components summary in the “Model Summary” table in the upper right part of Figure 3. We found an absence of evidence for the presence of the effect.
Summary of Meta-Analytic Estimates for Lui (2015) Based on Different Models
Note: PET = precision-effect test; PEESE = precision-effect estimate with standard errors; RoBMA-PSMA = robust Bayesian meta-analysis specified as in Bartoš, Maier et al. (in press).
Furthermore, the bottom right part of Figure 3 visualizes the model-averaged prior (gray lines) and posterior (black lines) distributions of the effect-size estimate (on the Cohen’s d scale). The black vertical arrow is slightly lower than the gray vertical arrow, reflecting a slight decrease in probability for models that assume the effect is absent—note that the secondary y-axis indexes the prior and posterior probability mass allocated to the null hypothesis. The continuous prior and posterior distributions are associated with models that assume the effect is present.
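The relation between the arrow heights and the BF can be made concrete with a small calculation. The sketch below uses hypothetical numbers (not the values from the Lui, 2015, analysis): with equal prior probabilities for the presence and absence of the effect, a BF near 1 barely moves the point mass on the null hypothesis, which is why the black arrow sits only slightly below the gray one:

```python
# Sketch: converting a Bayes factor into a posterior null probability
# (hypothetical values, equal prior model probabilities).
prior_null = 0.5   # gray arrow: prior probability of "no effect"
bf10 = 1.2         # hypothetical BF for the presence of the effect

prior_odds_null = prior_null / (1 - prior_null)    # 1.0
post_odds_null = prior_odds_null / bf10            # divide by BF10 (= multiply by BF01)
post_null = post_odds_null / (1 + post_odds_null)  # black arrow height
print(round(post_null, 3))  # 0.455
```

A modest BF thus shifts the null's probability from .50 to only about .45, mirroring the small drop between the gray and black arrows described above.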
This example highlights the Bayesian benefit of taking all uncertainty into account. In the frequentist framework, it was unclear whether and how to adjust for publication bias. In contrast, RoBMA does not require an all-or-none decision on the presence of publication bias. Instead, all models are taken into account simultaneously, and the effect-size estimate is based on a weighted average across the various models, with weights determined by the support that each model receives from the data. Taking all models into account, we still found an absence of evidence for an effect. In other words, more primary studies are needed to learn about the relationship between intergenerational cultural conflict and acculturation mismatch. When new studies are conducted, RoBMA allows researchers to update the evidence continuously.
Concluding Comments
In this article, we introduced three approaches to adjust for publication bias in meta-analysis, all implemented in JASP and R. First, we discussed PET-PEESE, a regression-based estimator with low bias that has been shown to perform well on empirical examples. Second, we discussed frequentist selection models, which have been demonstrated to work well even under high heterogeneity. Third, we discussed RoBMA, a Bayesian approach for combining complementary publication-bias-adjustment methods according to how well they describe the data at hand. RoBMA allows researchers to move beyond single-model inference and incorporate model-selection uncertainty into the meta-analytic estimates; in addition, RoBMA allows researchers to conduct multimodel tests for the presence of the effect, for heterogeneity, and for publication bias.
The RoBMA ensemble is highly modifiable: prior parameter distributions may be adjusted to reflect different background knowledge; prior model probabilities can be set to reflect theoretical preferences or expectations (e.g., entire classes of models can be excluded when deemed inappropriate on theoretical grounds); and, more generally, researchers who do not wish to engage in model averaging may inspect the parameter estimates and posterior model probabilities for each individual model separately.
When the data-generating process and the type of publication bias are known, the Meta Explorer app (https://tellmi.psy.lmu.de/felix/metaExplorer/) by Carter et al. (2019) can be used to select the most appropriate method for a given situation. When there is uncertainty about the data-generating process or about the presence or type of publication bias, RoBMA allows researchers to combine the adjustment approaches according to their predictive performance for the observed data.
However, we note that RoBMA also has several limitations. First, whereas averaging over a set of models alleviates problems caused by model misspecification, the meta-analytic estimate might still suffer from over- or underestimation if none of the models approximates the data-generating process well. Second, whereas RoBMA’s performance has been demonstrated across multiple simulation environments and empirical examples, it can overcorrect the effect-size estimates under moderate and strong questionable research practices, as simulated by Carter et al. (2019). Third, RoBMA has a considerably longer fitting time than frequentist approaches; however, for educational purposes or when sharing results with colleagues, one can distribute a .JASP file with models that have already been fitted to illustrate the interpretation of the results.
To conclude, publication-bias-adjusted meta-analysis in JASP allows researchers without programming experience to conduct state-of-the-art, publication-bias-corrected meta-analyses in an intuitive and user-friendly way. We hope that this methodology will improve the inferences researchers draw when conducting meta-analyses.
Transparency
Action Editor: Daniel J. Simons
Editor: Daniel J. Simons
Author Contributions
F. Bartoš and M. Maier are joint first authors. All of the authors approved the final manuscript for submission.
