Sage Journals: Discover world-class research

Abstract

Statistical inference in psychology has traditionally relied heavily on p-value significance testing. This approach to drawing conclusions from data, however, has been widely criticized, and two types of remedies have been advocated. The first proposal is to supplement p values with complementary measures of evidence, such as effect sizes. The second is to replace p-value-based inference with Bayesian measures of evidence, such as the Bayes factor. The authors provide a practical comparison of p values, effect sizes, and default Bayes factors as measures of statistical evidence, using 855 recently published t tests in psychology. The comparison yields two main results. First, although p values and default Bayes factors almost always agree about which hypothesis is better supported by the data, the measures often disagree about the strength of this support; for 70% of the data sets for which the p value falls between .01 and .05, the default Bayes factor indicates that the evidence is only anecdotal. Second, effect sizes can provide additional evidence to p values and default Bayes factors. The authors conclude that the Bayesian approach is comparatively prudent, preventing researchers from overestimating the evidence in favor of an effect.
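To make the three measures concrete, the following is a minimal sketch that computes a two-sided p value, a standardized effect size, and a default Bayes factor from a single one-sample t statistic. The abstract does not say which default Bayes factor was used, so the sketch assumes the JZS form with a unit-scale Cauchy prior on effect size. All function names are hypothetical, the integrals are approximated with a simple midpoint rule, and only the Python standard library is used.

```python
import math


def midpoint(f, a, b, n):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df):
    """Two-sided p value: probability mass outside [-|t|, |t|]."""
    return 1.0 - 2.0 * midpoint(lambda x: t_pdf(x, df), 0.0, abs(t), 2000)


def effect_size_d(t, n):
    """Standardized effect size for a one-sample t test: d = t / sqrt(n)."""
    return t / math.sqrt(n)


def jzs_bf10(t, n):
    """Bayes factor (alternative over null) for a one-sample t test.

    Assumption: JZS prior, i.e. a Cauchy prior with scale 1 on effect size,
    written as a normal prior whose variance g is inverse chi-square(1).
    """
    df = n - 1
    null_like = (1 + t * t / df) ** (-(df + 1) / 2)

    def integrand(u):
        # Substitution g = u / (1 - u) maps (0, 1) onto (0, infinity).
        g = u / (1 - u)
        a = 1 + n * g
        log_val = (-0.5 * math.log(a)
                   - (df + 1) / 2 * math.log1p(t * t / (a * df))
                   - 0.5 * math.log(2 * math.pi)
                   - 1.5 * math.log(g) - 1 / (2 * g))
        return math.exp(log_val) / (1 - u) ** 2  # includes dg/du

    return midpoint(integrand, 0.0, 1.0, 20000) / null_like


if __name__ == "__main__":
    t, n = 2.086, 21  # hypothetical result sitting right at p = .05
    print("p    =", round(two_sided_p(t, n - 1), 4))
    print("d    =", round(effect_size_d(t, n), 3))
    print("BF10 =", round(jzs_bf10(t, n), 3))
```

Under these assumptions, a t value that sits right at the conventional .05 threshold yields a Bayes factor between 1/3 and 3, the range usually labeled anecdotal, which illustrates the kind of disagreement about strength of evidence that the abstract reports.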

Keywords

hypothesis testing, effect size, Bayes factor



Statistical Evidence in Experimental Psychology
