Statistical Rituals: The Replication Delusion and How We Got Here

General Article
Advances in Methods and Practices in Psychological Science
DOI: 10.1177/2515245918771329

Gerd Gigerenzer
Harding Center for Risk Literacy, Max Planck Institute for Human Development, Berlin, Germany
Abstract

The “replication crisis” has been attributed to misguided external incentives gamed by researchers (the strategic-game hypothesis). Here, I want to draw attention to a complementary internal factor, namely, researchers’ widespread faith in a statistical ritual and associated delusions (the statistical-ritual hypothesis). The “null ritual,” unknown in statistics proper, eliminates judgment precisely at points where statistical theories demand it. The crucial delusion is that the p value specifies the probability of a successful replication (i.e., 1 – p), which makes replication studies appear to be superfluous. A review of studies with 839 academic psychologists and 991 students shows that the replication delusion existed among 20% of the faculty teaching statistics in psychology, 39% of the professors and lecturers, and 66% of the students. Two further beliefs, the illusion of certainty (e.g., that statistical significance proves that an effect exists) and Bayesian wishful thinking (e.g., that the probability of the alternative hypothesis being true is 1 – p), also make successful replication appear to be certain or almost certain, respectively. In every study reviewed, the majority of researchers (56%–97%) exhibited one or more of these delusions. Psychology departments need to begin teaching statistical thinking, not rituals, and journal editors should no longer accept manuscripts that report results as “significant” or “not significant.”

Keywords

replication, p-hacking, illusion of certainty, p value, null ritual
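The abstract's central fallacy, that a significant p value of .01 implies a 99% chance of successful replication, can be made concrete with a short calculation. The sketch below is my own illustration, not part of the article: it assumes a one-sided z test and, generously, that the true effect exactly equals the observed one. Even then, the probability that an exact replication reaches significance at the 5% level is only about 75%, far from 1 – p = .99.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

# Observed result: one-sided p = .01 corresponds to z_obs = Phi^-1(.99)
z_obs = Z.inv_cdf(0.99)   # about 2.33

# Criterion for a "successful" replication: one-sided p < .05
z_crit = Z.inv_cdf(0.95)  # about 1.64

# Assume (generously) that the true standardized effect equals z_obs.
# The replication's z statistic is then Normal(z_obs, 1), so the chance
# of a significant replication is P(z > z_crit) = 1 - Phi(z_crit - z_obs).
p_replicate = 1 - Z.cdf(z_crit - z_obs)

print(f"{p_replicate:.2f}")  # prints 0.75, not the .99 the delusion predicts
```

Under more realistic assumptions, for instance when the published effect overestimates the true one because of selection for significance, the replication probability drops further (see Cumming, 2008, cited later in the article).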
Every couple of weeks, the media proclaim the discovery of a new tumor marker that promises to improve personalized diagnosis or even treatment of cancer. As swift a pace as this seems, tumor research in fact produces many more discoveries. On average, four or five studies on cancer markers are published daily, almost all of them reporting at least one statistically significant prognostic marker (Ioannidis et al., 2014). Nonetheless, few of these results have been replicated and translated into clinical practice. When a team of 100 scientists at biotech company Amgen tried to replicate the findings of 53 “landmark” articles, they succeeded for only 6. Similarly, when the pharmaceutical company Bayer examined 67 projects on oncology, women’s health, and cardiovascular medicine, they were able to replicate the results in only 14 cases (Mullard, 2011; Prinz, Schlange, & Asadullah, 2011).1 In the United States alone, irreproducible preclinical research slowing down the discovery of life-saving therapies and cures has been estimated as costing $28 billion annually (Freedman, Cockburn, & Simcoe, 2015). The recently discovered fact that so many published results are apparently false alarms has been baptized the “replication crisis.”

Corresponding Author: Gerd Gigerenzer, Director, Harding Center for Risk Literacy, Max Planck Institute for Human Development, Lentzeallee 94, 14195 Berlin, Germany. E-mail: [email protected]

In the social sciences, replication studies were rarely published until recently, so the problem has lurked below the surface for many decades. An early analysis of 362 articles in psychology found no attempted single replication study for any of them (Sterling, 1959), a subsequent analysis found replications for fewer than 1% of more than 1,000 articles (Bozarth & Roberts, 1972), and a 2012 analysis found replications for 1% of original articles (Makel, Plucker, & Hegarty, 2012). By
2010, the crisis began to shake up the social sciences, after it was reported that highly publicized results could not be replicated (e.g., Open Science Collaboration, 2015; Pashler, Coburn, & Harris, 2012; Scheibehenne, Greifeneder, & Todd, 2010). At the beginning of the 21st century, one of the most cited claims in the social and biomedical sciences was that most scientific results are false (Ioannidis, 2005; Szucs & Ioannidis, 2017). Is science on its last legs? Its detractors certainly appear keen to cash in on the crisis. On March 29, 2017, the news Web site Breitbart.com headlined a claim by Wharton School professor Scott Armstrong that “fewer than 1 percent of papers in scientific journals follow scientific method” (Bokhari, 2017), which was widely spread across other conservative opinion and news Web sites. For politicians and citizens skeptical of science, such messages provide a perfect reason to justify cuts in funding for research that conflicts with their beliefs and values.

Only a Matter of Incentives?

The replication crisis has led to heated debates on how much waste is produced, in what fields this occurs, and what exactly counts as a replication in the first place (e.g., Pashler & Wagenmakers, 2012; Stroebe & Strack, 2014). In this article, I do not deal with these highly debated issues but rather consider the question of how we got here. Little is known about the causes of the problem. Some people have suggested that truth simply wears off, as when the efficacy of antidepressants plummeted drastically from study to study. In the 1930s, Joseph B. Rhine concluded that the extrasensory perception of his students declined over the years. And one cognitive psychologist, sincerely puzzled by how a much-publicized effect that he discovered could have faded away, invoked the concept of “cosmic habituation,” a natural law postulating that once a claim for an effect is published, the world habituates to it so that it begins shrinking away (Lehrer, 2010; see also Schooler, 2011).

More seriously, the general approach has been to explain the replication crisis by blaming wrong economic and reputational incentives in the academic career system. Specifically, the blame has been laid on the “publish or perish” culture, which values quantity over quality, and on the practice of evaluating researchers by impact measures such as h-index and, more recently, “altmetrics” such as the number of tweets and mentions in blogs (Colquhoun, 2014). Richard Horton (2016), editor-in-chief of The Lancet, complained: “No-one is incentivized to be right” (p. 1380). In this view, powerful incentives for career advancement actively encourage, reward, and propagate poor research methods and abuse of statistics (e.g., Smaldino & McElreath, 2016). The various symptoms of wrong incentives include hyped-up press releases and misleading abstracts that claim discoveries even when they are not supported by the data; publication bias, that is, the reluctance of journals and authors to publish negative results; lack of willingness to share data; financial conflicts of interest (particularly in clinical research; Schünemann, Ghersi, Kreis, Antes, & Bousquet, 2011); and commodification and privatization of research (Mirowski, 2011). Following Smaldino and McElreath (2016), I refer to these explanations as the strategic-game hypothesis, according to which science is considered a game that scientists play strategically to maximize their chances of acquiring publications and other trophies (Bakker, van Dijk, & Wicherts, 2012). To counter gaming, preregistration and data sharing are now encouraged or required by several journals, and radical changes to the reward system have been proposed. One drastic proposal to revamp the current system is that only publications whose findings are replicated, not publication per se, should count in the future and that large grants should count not positively but negatively, unless the recipient delivers proportionally high-quality science (Ioannidis, 2014).

In this article, I present an analysis that goes beyond the important role of external incentives, be they economic or reputational. I discuss the hypothesis that the replication crisis is also fueled by an internal factor: the replacement of good scientific practice by a statistical ritual that researchers perform not simply on the grounds of opportunism or incentives but because they have internalized the ritual and genuinely believe in it. I refer to this hypothesis as the statistical-ritual hypothesis. As is the case with many social rituals, devotion entails delusions, which in the present case block judgment about how to conduct good research, that is, inhibit researchers’ common sense. I use the term common sense because the delusions in question here do not concern sophisticated statistical technicalities but instead concern the very basics. If the cause were merely strategic behavior, common sense would not likely be sacrificed.

This article comprises two main parts. In the first, I show how textbook writers created the null ritual as an apparently objective procedure to distinguish a true cause from mere chance. This procedure is aimed at yes/no conclusions from single studies, neglects replication and other principles of good scientific practice, and has become dominant in precisely those sciences involved in the replication crisis. It is worth noting that editors in the natural sciences, which do not practice the ritual, generally endorse replication and do not separate it from original research (Madden, Easley, & Dunn, 1995). In the second part of this article, I review studies on four phenomena that are implied by the
statistical-ritual hypothesis but are not explained (or, in one case, are only partially explained) by the strategic-game hypothesis. In general, the strategic-game hypothesis implies that researchers play the game without necessarily being in the grip of systematic delusions or acting against their own interests. The statistical-ritual hypothesis, in contrast, implies that researchers engage in delusions about the meaning of the null ritual, and above all about its sacred number, the p value. Otherwise, they would realize that the p value does not answer their research questions and would abandon its pursuit. The first phenomenon is the replication delusion, which makes replication appear virtually certain and further studies superfluous. The second concerns the illusion of certainty, and the third Bayesian wishful thinking; both of these lead to the same conclusions about replication. A review of studies with 839 academic psychologists shows that the majority believed in one or more of these three fallacies. Finally, there is the phenomenon of low statistical power. According to the strategic-game hypothesis, researchers should be careful to design experiments that have a good chance of detecting an effect and thus lead to the desired result, statistical significance. The statistical-ritual hypothesis, however, implies that researchers pay little attention to statistical power because it is not part of the null ritual. Consistent with the latter prediction, studies of the psychological literature show that the median statistical power for detecting a medium effect size is generally low and has not improved in the past 50 years. The statistical-ritual hypothesis also accounts for why the specific incentives that strategic behavior exploits were set in the first place.

The Idol of Automatic Inference and the Elimination of Judgment

Statistical methods are not simply applied to a discipline but can transform it entirely. Consider medicine. For centuries, its lifeblood was physicians’ “medical tact” and the search for deterministic causes. All that changed when, in the second half of the 20th century, probabilities from randomized trials and p values replaced causes with chances. Or think of parapsychology, in which statistical tests became common half a century earlier. Once the study of unique messages from the dear departed, extrasensory perception is now the study of repetitive card guessing; marvels have been replaced with statistical significance (Gigerenzer et al., 1989).

Psychology has also been transformed by the probabilistic revolution (Gigerenzer, 1987). For the present topic, two aspects of this remarkable event are of relevance. First, as in medicine, the probabilistic revolution forged an unexpected marriage between two previously antagonistic tribes: experimenters and statisticians. In fact, statistical inference became the hallmark of experimentation, and experiments without statistical inference were soon virtually unthinkable. This change was so profound that quite a few social scientists today would be surprised to learn that Isaac Newton, for instance, used no statistical tests in his experiments. You might object that Newton was not familiar with statistical inference, but in fact, he was: In his role as the master of the London Royal Mint, he used statistical tests to make sure that the amount of gold in the coins was neither too small nor too large (Stigler, 1999). The Trial of the Pyx involved random samples of coins, a null hypothesis to be tested (that the tested coin conformed to the standard coin), a two-sided alternative hypothesis, and a test statistic. In Newton’s view, statistical tests were useful for quality control, but not for science. Similarly, 19th-century medicine saw professional rivalry between experimenters, who looked for causes, and statisticians, who looked for chances. For instance, physiologist Claude Bernard, one of the first to suggest blind experiments, opposed the use of averages or proportions as unscientific. For him, averages were no substitutes for a complete investigation of the conditions that cause variability (Gigerenzer et al., 1989, pp. 126–130). The British biologist and statistician Sir Ronald A. Fisher (1890–1962) was highly influential in ending this antagonism and forging a marriage between experimenters and statisticians.

Second, and most relevant for this article, psychologists reinterpreted this marriage in their own way. Early textbook writers struggled to create a supposedly objective method of statistical inference that would distinguish a cause from a chance in a mechanical way, eliminating judgment. The result was a shotgun wedding between some of Fisher’s ideas and those of his intellectual opponents, the Polish statistician Jerzy Neyman (1894–1981) and the British statistician Egon S. Pearson (1895–1980). The essence of this hybrid theory (Gigerenzer, 1993) is the null ritual. The null ritual does not exist in statistics proper, although researchers have been made to believe so.

The inference revolution

The term inference revolution refers to a change in scientific practice that happened in psychology between 1940 and 1955 and subsequently in other social sciences and biomedical research (Gigerenzer & Murray, 1987). The inference from sample to population came to be considered the sine qua non of good research, and statistical significance came to be considered the means of distinguishing between true cause and mere
chance. Common scientific standards such as minimizing measurement error, conducting double-blind experiments, replicating experiments, providing detailed descriptive statistics, and formulating bold hypotheses in the first place were pushed into the background (Danziger, 1990; Gigerenzer, 1993). The focus on inference from sample to population is remarkable or even perplexing given that in most psychological experiments, researchers do not draw random samples from a population or define a population in the first place. Thus, the assumptions for the procedure are not met in most cases, and one does not know the population to which an inference such as µ1 ≠ µ2 actually refers.

The term revolution underscores the dramatic impact of this change. In the earlier experimental tradition, the unit of analysis was the individual, who was well trained and often held a Ph.D. After the revolution, the unit became the aggregate, often consisting of minors or undergraduates. In the United States, this change was already under way during the 1930s, when psychologists tried to promote their field as socially relevant, and educational administrators were the key source of funding. For administrators, the relevant question was whether a new teaching method led to better performance on average; to that end, psychologists created a new unit, the treatment group (Danziger, 1987). But when a mean difference between two groups was observed, they had to use their judgment about whether it was “real.” When psychologists learned about Fisher’s method of statistical inference (see the section on the null ritual), it seemed like a dream come true: an apparently objective method to tell a cause from a chance. Accordingly, early adopters of significance testing were concentrated in educational psychology and parapsychology, while the core scientific disciplines, such as perceptual psychology, resisted the movement for years and continued to focus on individuals. It is instructive that the situation in Germany was different (Danziger, 1990). German professors did not feel pressured to prove the usefulness of their research for education but instead considered themselves scientists responsible for unraveling the laws of the individual mind. Accordingly, the inference revolution happened much later in German psychology, in its post–World War II assimilation into U.S. psychology. The changes in research practice spawned by the inference revolution were so fundamental that psychologists trained in its wake can hardly imagine that research could entail anything other than analyzing means of aggregates for statistical significance.

What did psychologists do before the inference revolution?

A typical article in the Journal of Experimental Psychology around 1925 would report single-case data in detail, using means, standard deviations, correlations, and various descriptive statistics tailored to the problem. The participants were trained staff members with Ph.D.s or graduate students (Danziger, 1990). In the tradition of Wilhelm Wundt’s Leipzig laboratory, the researcher who ran the experiment was typically a technician, which is why it could happen that the participant, not the researcher, published the article. When a larger number of individuals were studied, as in Jean Piaget’s work on the development of cognition, results were typically presented individually rather than as aggregates—you would not have caught Piaget calculating a t test. The overall picture is that past researchers knew their raw data well, reported descriptive statistics in detail, and had a comparatively flexible attitude toward the issue of statistical inference from sample to population, which was considered to be incidental rather than central. When treatment groups began to be used in the applied research of the 1930s in the United States, the term significant was already widely in use, but the evaluation of whether two means differed was based on judgment—taking the error, the size of the effect, and previous studies into account—rather than on a mechanical rule, except for the use of critical ratios (the ratio of the obtained difference to its standard deviation; Gigerenzer & Murray, 1987, chap. 1). This practice led to virtually all of the classical discoveries in psychology. Without calculating p values or Bayes factors, Wolfgang Köhler developed the Gestalt laws of perception, Ivan P. Pavlov the principles of classical conditioning, B. F. Skinner those of operant conditioning, George Miller his magical number seven plus or minus two, and Herbert A. Simon his Nobel Prize–winning work on bounded rationality.

The null ritual

In the 1920s, Fisher had forged the marriage between experiments and inferential statistics in agriculture and genetics, an event that went largely unnoticed by psychologists (for a discussion of Fisher’s antecedents, see Gigerenzer et al., 1989, pp. 70–90). Through statisticians George W. Snedecor at Iowa State College and Harold Hotelling at Columbia University, among others, Fisher’s theory of null-hypothesis testing soon spread in the United States. By 1961, Snedecor’s (1937) Statistical Methods became the most cited book according to the Science Citation Index (Gigerenzer & Murray, 1987, p. 21). Psychologists began to cleanse Fisher’s message of its agricultural odor—the effect of manure, soil fertility, and the weight of pigs—as well as of its mathematical sophistication, and wrote a new genre of textbooks. The most widely read of these was probably Fundamental Statistics in Psychology and Education, which was published in 1942 and written by J. P. Guilford, a
psychologist at the University of Southern California and later president of the American Psychological Association. Like Guilford, authors of best-selling statistical textbooks in psychology have typically been nonstatisticians.

Soon things got complicated, however. The statistical theory of Neyman and Pearson also became known, particularly after World War II. The essential differences relevant for our purpose are threefold: First, whereas Fisher proposed testing a single specified hypothesis, the null, against an unspecified alternative, Neyman and Pearson questioned this logic and called for testing against a second specified hypothesis. Second, with only a null and no specified alternative, Fisher had no measure of statistical power. Moreover, he believed that calculating power made sense in quality control but not in science—according to him, there is no place for cost-benefit trade-offs such as between power and alpha when one is seeking the truth. (Alpha is the probability that the null hypothesis is rejected if it is true, and beta, calculated as 1 – power, is the probability that the alternative hypothesis is rejected if it is true. Alpha and beta are also called Type I and Type II error rates, respectively.) Neyman and Pearson, in contrast, required power and alpha to be balanced and set before the experiment so that the probability of Type I and Type II errors would be known. Third, Fisher interpreted a significant effect in terms of subjective confidence in the result, whereas Neyman (albeit not Pearson) interpreted significance in strictly behavioristic terms as a decision, not a belief. For example, in quality control, misses matter, and a significant result can lead to the decision to stop production and look for a possible error, even when one believes that most likely nothing is wrong. Neyman regarded his theory as an improvement of null-hypothesis testing, but Fisher disagreed. The conceptual differences between these two systems of statistical inference were amplified by a fierce personal debate. Fisher branded Neyman’s theory as “childish” and “horrifying [for] the intellectual freedom of the west,” while Neyman countered that some of Fisher’s tests were “worse than useless” because their power was smaller than their alpha level (see Gigerenzer et al., 1989, chap. 3).

How should textbook writers have coped with the fundamental disagreement between these two camps? The obvious solution would have been to present both approaches and discuss the situations in which each might be more appropriate. But such a toolbox approach would have required relinquishing dearly coveted objectivity and opened the door to researchers’ judgment. Although a few textbooks did take this approach and taught both theories (e.g., R. L. Anderson & Bancroft, 1952), the great majority fused the two antagonistic theories into a hybrid theory, of which neither Fisher nor Neyman and Pearson would have approved. In addition to the idol of automatic inference, another decisive factor appears to have been the commercialization of textbooks, accompanied by publishers’ requests for single-recipe cookbooks instead of a toolbox (Gigerenzer, 2004, pp. 587–588). To this end, virtually all textbooks presented the hybrid theory anonymously, without mentioning that its concepts stem from different theories, detailing the conflicting ideas of the theories’ authors, or even disclosing their identities. For instance, although the 1965 edition of Guilford’s best-selling Fundamental Statistics in Psychology and Education cites some 100 authors in its index, the names of Neyman and Pearson are left out.

The essence of this hybrid theory is the null ritual (Gigerenzer, 2004):

1. Set up a null hypothesis of “no mean difference” or “zero correlation.” Do not specify the predictions of your own research hypothesis.
2. Use 5% as a convention for rejecting the null hypothesis. If the test is significant, accept your research hypothesis. Report the test result as p < .05, p < .01, or p < .001, whichever level is met by the obtained p value.
3. Always perform this procedure.

The null ritual does not exist in statistics proper. This point is not always understood; even its critics sometimes confuse it with Fisher’s theory of null-hypothesis testing and call it “null-hypothesis significance testing.” In fact, the ritual is an incoherent mishmash of ideas from Fisher on the one hand and Neyman and Pearson on the other, spiked with a characteristically novel contribution: the elimination of researchers’ judgment.

Elimination of judgment

Consider Step 1 of the null ritual. To specify the null hypothesis but not an alternative hypothesis follows Fisher’s logic but violates that of Neyman and Pearson, which necessitates specifying both (this is one of the reasons why they thought of their theory as an improvement over Fisher’s). Yet Step 1 follows Fisher only to a point, because Fisher at least thought that judgment would be necessary for choosing a proper null hypothesis. As Fisher emphasized, his intent was to test whether a hypothesis should be nullified, and he did not mean to imply that this hypothesis postulates a nil difference (Fisher, 1955, 1956). In his approach, researchers should use their judgment to select a proper null hypothesis, which could be a nonzero difference or correlation. In the hands of the textbook writers in
psychology, however, null grew to mean “no difference,” period. No judgment was required.

Step 2 blatantly contradicts Fisher, who in the 1950s advised researchers against using 5% in a mechanical way:

No scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas. (Fisher, 1956, p. 42)2

Using a conventional 5% level of significance is also alien to Neyman and Pearson, who in their earliest writings already emphasized that their tests should be “used with discretion and understanding,” depending on the context (Neyman & Pearson, 1933, p. 296). Understanding includes making a judgment about the balance between Type I and Type II errors; this judgment was also omitted in the null ritual.

The practice of rounding up p values to the next convenient “significance level” (p < .05, p < .01, or p < .001) is supported neither by Fisher nor by Neyman and Pearson. In their approaches, a level of significance is set before the experiment, not calculated post hoc from data, and p values calculated from data should be reported as exact values (e.g., p = .004, not p < .01) so that they are not confused with significance levels. Step 2 is also inconsistent with Fisher’s (1955) argument against making binary reject/not-reject decisions, and appears to follow Neyman and Pearson. Yet it does so only partially and neglects the rest of their theory, which requires two precise hypotheses and a judgment about the balance between alpha and beta for determining the decision criterion. In the null ritual, there is no concern with beta and consequently no concern with the statistical power of a test, which is the complement of beta.

Finally, Step 3 embodies the idol of automatic inference that does not require a judgment about the validity of the assumptions underlying each statistical test. This is alien to both camps, and to statistical science in general. Fisher (1955, 1956) asserted that constructive imagination and much experience are prerequisites of good statistics, such as for deciding which null hypotheses are worth testing and which test statistic to choose. Neyman and Pearson emphasized that the statistical part of inference has to be supplemented by a subjective part. In Pearson’s (1962) words:

We left in our mathematical model a gap for the exercise of a more intuitive process of personal judgment in such matters—to use our terminology—as the choice of the most likely class of admissible hypotheses, the appropriate significance level, the magnitude of worthwhile effects and the balance of utilities. (pp. 395–396)

Throughout their heated personal debates, each side accused the other of advocating mechanical statistical inference. At least here, they saw eye to eye: Statistical inference should not be automatic. Yet it has become not only automatic but also what I call ritualistic, embodying social emotions, including fear, wishful thinking, and delusions, along with mechanical repetition.

I use the term ritual for the core of the hybrid theory to highlight its similarity to social rites (Dulaney & Fiske, 1994). A ritual is a collective or solemn ceremony consisting of actions performed in a prescribed order. It typically includes the following elements:

• sacred numbers or colors,
• repetition of the same action,
• fear about being punished when one stops performing these actions, and
• wishful thinking and delusions.

The null ritual contains all these features: a fixation on the 5% number (or on colors, as in functional MRI images), repetitive behavior resembling compulsive hand washing, fear of sanctions by editors or advisors, and delusions about the meaning of the p value (as I discuss in the next section). It spread fast. Before 1940, few articles reported t tests, analyses of variance, or other significance tests. By 1955 and 1956, more than 80% of articles in four leading journals were doing so, and almost all reported a significant effect (Sterling, 1959). Today, the figure is close to 100%.

Systematic Delusions About Replication

In this section, I put forward the hypothesis that the null ritual is key to understanding the replication crisis in the social and biomedical sciences. As I mentioned earlier, one explanation of this crisis has been that researchers are given the wrong incentives. These do, of course, exist, creating the fear of being punished for not conforming, the third aspect of a social ritual. Yet incentives are only part of the explanation, as is demonstrated by a rare case in which an editor actually eliminated them. When Geoffrey Loftus became editor-elect of Memory & Cognition, he made it clear in his introductory editorial that he did not want authors to submit manuscripts with routine calculations of p values, but instead wanted adequate figures with descriptive statistics and confidence intervals (Loftus, 1993). During his editorship, I asked him how his campaign was
going. Loftus bitterly complained that many researchers stubbornly refused the opportunity, experienced deep anxiety at the prospect of abandoning p values, and insisted on their p values and yes/no significance decisions (Gigerenzer, 2004, pp. 598–599). Similarly, after the Task Force for Statistical Inference's recommendations calling for various alternatives to the null ritual were incorporated in the fifth edition of the American Psychological Association's (2001) publication manual, a study found little change in researchers' behavior (Hoekstra, Finch, Kiers, & Johnson, 2006).

My hypothesis is that, beyond incentives, the key issue is researchers' internalized belief in the ritual. If researchers only opportunistically adapt their behavior to the incentives in order to get published and promoted, then their common sense with regard to statistical thinking should remain intact. If, however, researchers have internalized the ritual and believe in it, conflicting common sense should be repressed. The most basic and crucial test of the statistical-ritual hypothesis is whether researchers actually understand the desired product: a significant p value.

Probability of replication = 1 – p

A p value is a statement about the probability of data, assuming the null hypothesis is true. More precisely, the term data refers to a test statistic (a statistical summary of data, such as a t statistic), and the term null hypothesis refers to a statistical model. For instance, a result with a p value of .05 means that if the null hypothesis—including all assumptions made by the underlying model—is true, the probability of obtaining such a result or a more extreme one is 5%. The p value does not tell us much else. Specifically, a p value of .05 does not imply that the probability that the result can be replicated is 95%.

Consider a simple example: A researcher designs an experiment with 50% power to detect an effect of a medium size (50% power or below is a typical figure, as I discuss later) and obtains a significant difference between means, p < .05. Does this imply that one can expect to find a significant result in 95% (or more) of exact replications of this experiment? No, and the reason why not can be easily understood. If the alternative hypothesis is true, then the probability of getting another significant result equals the statistical power of the test, that is, 50%. If, however, the null hypothesis is true, the probability of getting another significant result would still be only 5%. Or consider an even simpler illustration: A die, which could be fair or loaded, is thrown twice and shows a “six” both times, which results in a p value of .03 (1/36) under the null hypothesis of a fair die. Yet this does not imply that one can expect two sixes in 97% of all further throws. In general, the chance of replicating a finding depends on many factors (e.g., Cumming, 2008, 2014; Greenwald, Gonzalez, Harris, & Guthrie, 1996), most of which the researcher cannot know for sure, such as whether the null or the alternative hypothesis is true. The belief that an obtained p value implies a probability of 1 – p that an exact replication of the same experiment would lead to a significant result is known as the replication delusion or replication fallacy (Rosenthal, 1993).

Note that this delusion is easy to see through, and experienced researchers should not hold this belief even if they follow the null ritual for opportunistic motives. In contrast, if they believe in the ritual, they are likely to exhibit the replication delusion because it provides a justification for performing the ritual, even if it amounts to wishful thinking. Thus, the empirical question is, do experienced researchers actually suffer from the replication delusion?

The first study to answer this question appears to have been conducted by Oakes (1986). He asked 70 British lecturers, research fellows, and postgraduate students with at least 2 years of research experience whether the following statement is true or false, that is, whether it logically follows from a significant result (p = .01):

You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. (p. 80)

Sixty percent of the psychologists answered “true.” Table 1 shows that this replication delusion is also endemic beyond academic psychologists in the United Kingdom. Aggregating across all the studies listed in Table 1 leads to the following general picture: Among psychologists who taught statistics or methodology, 20% (23 of 115) believed in the replication delusion; among academic psychologists whose special field was not statistics, this number was larger, 39% (282 of 724); and among students, it was even larger, 66% (659 of 991).

The existence of the replication delusion is consistent with the hypothesis that a substantial number of researchers follow the null ritual not simply for strategic reasons but also because they believe in the ritual and its associated delusions. According to the replication delusion, given p = .01, a study's results can be replicated in 99% of all trials, which means that replication studies are superfluous.

The limits of the present analysis are the small number of studies that have been conducted and the low response rates reported in some of the studies. Badenes-Ribera, Frias-Navarro, Monterde-i-Bort, and Pascual-Soler (2015) sent their survey to 4,066 academic psychologists
Table 1

Study | Description of group | Country | N | Statistic tested | Respondents exhibiting the replication delusion (%)

Professional samples
Oakes (1986) | Academic psychologists | United Kingdom | 70 | p = .01 | 60
Haller & Krauss (2002) | Statistics teachers in psychology | Germany | 30 | p = .01 | 37
Haller & Krauss (2002) | Professors of psychology | Germany | 39 | p = .01 | 49
Badenes-Ribera, Frias-Navarro, Monterde-i-Bort, & Pascual-Soler (2015) | Academic psychologists: personality, evaluation, psychological treatments | Spain | 98 | p = .001 | 35
Badenes-Ribera et al. (2015) | Academic psychologists: methodology | Spain | 67 | p = .001 | 16
Badenes-Ribera et al. (2015) | Academic psychologists: basic psychology | Spain | 56 | p = .001 | 36
Badenes-Ribera et al. (2015) | Academic psychologists: social psychology | Spain | 74 | p = .001 | 39
Badenes-Ribera et al. (2015) | Academic psychologists: psychobiology | Spain | 29 | p = .001 | 28
Badenes-Ribera et al. (2015) | Academic psychologists: developmental and educational psychology | Spain | 94 | p = .001 | 46
Badenes-Ribera, Frias-Navarro, Iotti, Bonilla-Campos, & Longobardi (2016) | Academic psychologists: methodology | Italy, Chile | 18 | p = .001 | 6
Badenes-Ribera et al. (2016) | Academic psychologists: other areas | Italy, Chile | 146 | p = .001 | 13
Hoekstra, Morey, Rouder, & Wagenmakers (2014) | Researchers in psychology (Ph.D. students and faculty) | Netherlands | 118 | 95% CI | 58

Student samples
Haller & Krauss (2002) | German psychology students who had passed two statistics courses | Germany | 44 | p = .01 | 41
Hoekstra et al. (2014) | First-year psychology students who had not taken an inferential-statistics course | Netherlands | 442 | 95% CI | 66
Hoekstra et al. (2014) | Master's psychology students who had taken an inferential-statistics course | Netherlands | 34 | 95% CI | 79
Garcia-Pérez & Alcalá-Quintana (2016) | First-year psychology students who had not taken an inferential-statistics course | Spain | 313 | 95% CI | 71
Garcia-Pérez & Alcalá-Quintana (2016) | Master's psychology students who had taken an inferential-statistics course (a) | Spain | 158 | 95% CI | 63

Note: For the studies in which the replication delusion was tested with respect to p values, the numbers in the last column indicate the percentage of respondents who erroneously believed that p = .01 or p = .001 implies that the probability of a significant result in a replication study is .99 or .999, respectively. Haller and Krauss (2002) used the same question used by Oakes (1986; see the text). The problem posed by Badenes-Ribera et al. (2015, 2016) read: “Let's suppose that a research article indicates a value of p = 0.001 in the results section (alpha = 0.05). Mark which of the following statements are true (T) or false (F)” (Badenes-Ribera et al., 2015, p. 291). The statement expressing the replication fallacy read: “A later replication would have a probability of 0.999 (1 – 0.001) of being significant” (p. 291). For the studies in which the replication delusion was tested with respect to confidence intervals (CIs), the numbers in the last column indicate the percentage of respondents who wrongly believed that a 95% CI ranging from x to y would imply that if the experiment were repeated over and over, the true mean would fall between x and y 95% of the time.
(a) A subsample of 88 of these students were considered by the authors to have provided “informed responses” (p. 10). Of that subsample, 72% exhibited the replication delusion.
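The arithmetic behind the replication delusion can be checked with a short Monte Carlo simulation. This is my own illustrative sketch, not material from the studies above: it uses a one-sample two-sided z-test with an arbitrary sample size and seed, and the true effect is chosen so that the design has exactly 50% power.

```python
import random

random.seed(1)
N_SIM = 100_000
n = 25                     # observations per study (arbitrary assumption)
delta = 1.96 / n ** 0.5    # true effect sized so that power is exactly 50%

def significant():
    """Run one study: two-sided z-test of H0: mu = 0 at alpha = .05."""
    sample_mean = random.gauss(delta, 1 / n ** 0.5)  # sd of the mean is 1/sqrt(n)
    return abs(sample_mean * n ** 0.5) > 1.96

# Share of exact replications of the experiment that come out significant:
rate = sum(significant() for _ in range(N_SIM)) / N_SIM
print(round(rate, 2))  # ~0.50, i.e., the power of the design, not 1 - p
```

Whatever p value a single significant run happens to produce, the chance that an exact replication is again significant stays at the power of the design, here about 50%.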
in Spain and had a response rate of 10.3%; the study finding the lowest rate of the replication delusion, conducted by Badenes-Ribera, Frias-Navarro, Iotti, Bonilla-Campos, and Longobardi (2016), also had the lowest response rate, 7% (164 out of 2,321 academic psychologists in Italy and Chile); response rates for the other studies were not reported. Thus, the results may be subject to a selection bias: Assuming that individuals with better training in statistics were more likely to respond, the numbers in Table 1 are probably underestimates of the true frequency of the replication delusion.

A study with members of the Mathematical Psychology Group and the American Psychological Association (not included in Table 1 because the survey asked different kinds of questions) also found that most of them trusted in small samples and had high expectations about the replicability of significant results (Tversky & Kahneman, 1971). A glance into textbooks and editorials reveals that the delusion was already promoted as early as the 1950s. For instance, in her textbook Differential Psychology, Anastasi (1958) wrote: “The question of statistical significance refers primarily to the extent to which similar results would be expected if an investigation were to be repeated” (p. 9). In his Introduction to Statistics for Psychology and Education, Nunnally (1975) stated: “If the statistical significance is at the 0.05 level . . . the investigator can be confident with odds of 95 out of 100 that the observed difference will hold up in future investigations” (p. 195). Similarly, former editor of the Journal of Experimental Psychology A. W. Melton (1962) explained that he took the level of significance as a measure of the “confidence that the results of the experiment would be repeatable under the conditions described” (p. 553).

The illusion of certainty and Bayesian wishful thinking

As I have mentioned, a p value is a statement about the probability of a statistical summary of data, assuming that the null hypothesis is true. It delivers probability, not certainty. It does not tell us the probability that a hypothesis—whether the null or the alternative—is true; it is not a Bayesian posterior probability. I refer to the belief that statistical significance delivers certainty as the illusion of certainty and to the belief that p is the probability that the null hypothesis is true or that 1 – p is the probability that the alternative hypothesis is true as Bayesian wishful thinking (also known as inverse probability error). After a few hours of statistical training at a major university, any person of average intelligence should understand that these beliefs are incorrect. By contrast, if the statistical-ritual hypothesis is true, researchers' thinking should be partly blocked and they should endorse these beliefs about the importance of significant results.

Table 2 reviews the relevant studies that have been conducted. In the British study mentioned earlier, Oakes (1986, p. 80) asked academic psychologists what a significant result (p = .01) means:

Suppose you have a treatment that you suspect may alter performance on a certain task. You compare the means of your control and experimental groups (say, 20 subjects in each sample). Furthermore, suppose you use a simple independent means t-test and your result is significant (t = 2.7, df = 18, p = .01). Please mark each of the statements below as “true” or “false.” “False” means that the statement does not follow logically from the above premises. Also note that several or none of the statements may be correct.

(1) You have absolutely disproved the null hypothesis (i.e., there is no difference between the population means).
(2) You have found the probability of the null hypothesis being true.
(3) You have absolutely proved your experimental hypothesis (that there is a difference between the population means).
(4) You can deduce the probability of the experimental hypothesis being true.
(5) You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision.
(6) You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions.

Each of the six beliefs is false, a possibility explicitly stated in the instruction. Beliefs 1 and 3 are illusions of certainty: significance tests provide probabilities, not certainties. Beliefs 2, 4, and 5 are versions of Bayesian wishful thinking. Belief 2 is incorrect because a p value is not the probability that the null hypothesis is true but rather the probability of the given data (or more extreme data), assuming the truth of the null hypothesis. For the same reason, one cannot deduce the probability that the experimental (alternative) hypothesis is true, as stated in Belief 4. Belief 5 makes essentially the same claim as Belief 2 because a wrong decision to reject the null hypothesis amounts to the null hypothesis actually being true, and again, the p value does not specify the probability that the null is true. Belief 6 has been dealt with in the previous section. Note that all six delusions
Table 2. Studies on Systematic Delusions About What a Significant p Value Means: Percentage of Respondents Who Endorsed the Illusion of Certainty, Bayesian Wishful Thinking, and the Replication Delusion

[Table body not legible in this copy. Its columns were the following samples: Great Britain — academic psychologists (N = 70); Germany — statistics teachers in psychology (N = 30), professors of psychology (N = 39), students who had passed two statistics courses (N = 44); Spain — academic psychologists in personality, evaluation, and psychological treatments (N = 98), methodology (N = 67), basic psychology (N = 56), social psychology (N = 74), psychobiology (N = 29), and developmental and educational psychology (N = 94); Italy and Chile — academic psychologists in methodology (N = 18) and other areas (N = 146).]

Note: Beliefs 1 and 3 are instances of the illusion of certainty and make successful replication appear certain when the p value is significant. Beliefs 2, 4, and 5 are instances of Bayesian wishful thinking and suggest that the p value indicates the probability that the null or the alternative hypothesis is true. Belief 6 is the replication delusion (see Table 1); it is included here so that results for all six delusions are summarized in one place. The question that respondents answered referred to p = .01 in the studies conducted in Great Britain (Oakes, 1986) and Germany (Haller & Krauss, 2002) and to p = .001 in the studies conducted in Spain (Badenes-Ribera, Frias-Navarro, Monterde-i-Bort, & Pascual-Soler, 2015) and Italy and Chile (Badenes-Ribera, Frias-Navarro, Iotti, Bonilla-Campos, & Longobardi, 2016).
(a) Badenes-Ribera et al. (2015) did not report the percentage of respondents who endorsed at least one delusion separately for each subsample in their study, so this is the percentage for all respondents.
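Bayes' rule shows why Beliefs 2, 4, and 5 are wishful thinking: the probability that the null hypothesis is true given a significant result depends on the prior probability of the hypotheses and on the power, and it is neither the p value nor alpha. A minimal calculation with assumed numbers (the prior and power here are hypothetical illustrations, not figures from the studies reviewed):

```python
# False-positive risk under assumed prior odds and power: the share of
# significant results for which the null hypothesis is nonetheless true.
alpha = 0.05      # significance level
power = 0.50      # probability of a significant result when H1 is true
prior_h1 = 0.50   # assumed prior probability that the alternative is true

p_sig = (1 - prior_h1) * alpha + prior_h1 * power   # P(significant result)
p_h0_given_sig = (1 - prior_h1) * alpha / p_sig     # P(H0 | significant)
print(round(p_h0_given_sig, 3))  # 0.091 -- not alpha, and not the p value
```

Under these assumptions, about 9% of all significant results come from true null hypotheses; a researcher who reads p < .05 as "the null is true with probability less than .05" has confused this posterior probability with the p value.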
err in the same direction of wishful thinking: They overestimate what can be concluded from a p value.

What is most important for the topic of replication is that each of these beliefs makes replication appear to be superfluous. Consider the illusion of certainty (Beliefs 1 and 3): If one concludes that the experimental hypothesis has been absolutely proven to be true or that the null hypothesis has been absolutely disproven, then replicating the study appears to be a waste of time. Similarly, if one incorrectly believes that the p value specifies the probability that the null hypothesis is true (Beliefs 2 and 5) or that the alternative hypothesis is true (Belief 4), then one presumably already knows the Bayesian posterior probability. For instance, if p = .01, then someone with these beliefs would misperceive the probability of the alternative hypothesis being true as 99%, which would make further replication attempts appear to be unnecessary. All six delusions lead to a vast overestimation of the probability of successful replications.

Do trained academic psychologists believe in the illusion of certainty and engage in Bayesian wishful thinking? The studies indicate that they do, to varying extents:

• Belief 1. The null hypothesis has been shown to be false: Table 2 shows that in every study and subgroup, some professionals held this elementary illusion of certainty. The highest percentages were obtained among Spanish academic psychologists (55%–66%), and the lowest percentage was found among British academic psychologists (1%). Even among professionals teaching statistics or methodology, 10% of German, 28% of Italian and Chilean, and 36% of Spanish experts shared this illusion of certainty.
• Belief 2. The probability of the null hypothesis being true is known: Among psychologists who taught methods or statistics, 17% to 58% believed that this conclusion was correct. Among the other professionals, the range was slightly higher, from a minimum of 23% to a maximum of 68%.
• Belief 3. The alternative hypothesis has been shown to be true: This second illusion of certainty was included in only a few studies. In all of these cases, it was endorsed by a small percentage of psychologists, including 10% of those who taught statistics.
• Belief 4. The probability of the alternative hypothesis being true is known: In every study, some academic psychologists shared this delusion. The percentage who endorsed this belief ranged from 6% to 33% among those who taught methodology and from 12% to 66% among those who did not.
• Belief 5. The probability of incorrectly rejecting the null hypothesis is known: Presented in only a few studies, this statement received the highest percentage of endorsements when it was included. The percentage of respondents who agreed with this statement ranged from 67% to 86%, and even a majority (73%) of statistics teachers endorsed it.

Belief 6 (the replication delusion; see Table 1) is included in Table 2 to provide a summary of the results for all six delusions. The last row of the table shows the percentage of respondents who were in the grip of at least one of the delusions: 97% of the British academic psychologists, 80% of the German statistics teachers, 90% of the German psychology professors and lecturers who did not teach statistics, and 100% of the German students who had successfully passed two statistics courses (Table 2, last row). The students appear to have inherited the delusions from their teachers. Among the Spanish academic psychologists, 94% endorsed at least one of the delusions, whereas 56% and 74%, respectively, of the methodology instructors and other academic researchers in Italy and Chile did the same. For the German study, Haller and Krauss (2002) also reported the average number of delusions in each group: 1.9 among statistics teachers, 2.0 among other academic psychologists, and 2.5 among psychology students.

Hoekstra, Morey, Rouder, and Wagenmakers (2014) adapted Oakes's (1986) six-item questionnaire to examine delusions regarding confidence intervals. They reported that the majority of 118 researchers, 34 master's students, and 442 first-year students in psychology relied on similar wishful thinking about confidence intervals; in all three groups, the median number of delusions endorsed was 3 to 4. Only 3% of the researchers were able to identify all the statements as delusions.

Similarly, Falk and Greenbaum (1995) reported that 87% of 53 Israeli psychology students believed in at least one of the first four delusions listed in Table 2, even though the correct alternative (“None of the answers 1–4 is correct”) was added to the response options.

In a study in France, Lecoutre, Poitevineau, and Lecoutre (2003) presented participants with a vignette about a drug that had a significant effect of a small size and found that psychological researchers were more impressed by the efficacy of the drug than were statisticians from pharmaceutical companies. Thus, the
psychological researchers confused statistical significance with substantial significance.

Delusions about significance among medical doctors and researchers

One could argue that delusions about statistical inferences and replicability are embarrassing but largely inconsequential in many areas of psychology: These delusions do not harm the general public's health or wealth. The situation is different in medicine, where manipulating statistics and lack of understanding can lead to death, morbidity, and waste of resources (Welch, 2011). Medical professionals are expected to read medical journals and understand the statistics in order to provide the best treatments to patients.

Do medical doctors and researchers exhibit the same misconceptions, despite these possible adverse consequences? A literature search revealed very few studies on the topic of physicians' understanding of statistical significance. I begin with the one that is closest to those listed in Table 2.

At three major academic U.S. hospitals—the Barnes Jewish Hospital, Brigham & Women's Hospital, and Massachusetts General Hospital—a total of 246 physicians were given the following problem (Westover, Westover, & Bianchi, 2011, p. 1):

Consider a typical medical research study, for example designed to test the efficacy of a drug, in which a null hypothesis H0 (‘no effect’) is tested against an alternative hypothesis H1 (‘some effect’). Suppose that the study results pass a test of statistical significance (that is P-value <0.05) in favor of H1. What has been shown?

1. H0 is false.
2. H0 is probably false.
3. H1 is true.
4. H1 is probably true.
5. Both (1) and (3)
6. Both (2) and (4)
7. None of the above.

Note that the first four statements correspond to the first four beliefs in Table 2, though the wording differs and no precise probability is attached to the null or alternative hypothesis; in addition, a correct answer (7) is offered. Nevertheless, only 6% of the physicians recognized the correct answer, and the remaining 94% believed that a p value less than .05 meant that the null hypothesis was false or probably false, or that the alternative hypothesis was true or probably true. Specifically, 4% endorsed the first response option, 31% endorsed the second, none endorsed the third, 20% endorsed the fourth, 3% endorsed the fifth, and 36% endorsed the sixth.

B. L. Anderson, Williams, and Schulkin (2013) tested U.S. obstetrics-gynecology residents (i.e., beginning doctors) participating in the Council for Resident Education in Obstetrics and Gynecology In-Training Examination. Obstetrics-gynecology is a prestigious specialty that attracts students with very good grades in medical school. The response rate to the survey was 95% (4,713 out of 4,961 residents). The delusion question Anderson et al. presented to the residents was similar to Belief 2 in Table 2: “True or False: The P value is the probability that the null hypothesis is correct” (p. 273). Forty-two percent of the respondents correctly answered “false,” 12% did not answer, and 46% incorrectly said “true.” Nevertheless, 63% of the respondents rated their statistical literacy as adequate, 8% rated it as excellent, and only 22% rated it as inadequate (7% did not respond).

Wulff, Andersen, Brandenhoff, and Guttler (1987) tested 148 Danish doctors (randomly sampled) and 97 participants in a postgraduate course in research methods, mainly junior hospital doctors. When asked what it means if a controlled trial shows that a new treatment is significantly better than placebo (p < .05), 20% of the doctors in the random sample and 6% of the participants in the postgraduate course said that “it has been proved that the treatment is better than placebo”; 51% and 54%, respectively, believed that the probability of the null hypothesis being true is less than .05 (Bayesian wishful thinking); and 18% and 2%, respectively, said that they did not know what p values mean. Only 13% of the doctors in the random sample and 39% of the participants in the course could identify the correct answer (“If the treatment is not effective, there is less than a 5 per cent chance of obtaining such results,” p. 5).

These few available studies suggest that in medicine, where decisions are consequential, the same delusions regarding the null ritual appear to persist. This interpretation is supported by other tests of physicians' statistical literacy (Gigerenzer, Gaissmaier, Kurz-Milcke, Schwartz, & Woloshin, 2007; Wegwarth, Schwartz, Woloshin, Gaissmaier, & Gigerenzer, 2012). The systematic errors “have been encouraged if not licensed by unjustified, lax, or erroneous traditions and training in the field at large” (Greenland, 2011, p. 228).

Discussion

Taken together, the studies indicate that a substantial proportion of academic researchers wrongly believe that the p value obtained in a study implies that the probability of finding another significant effect in a
replication is 1 – p. In the view of researchers who share the other delusions I have been discussing, replication studies are superfluous because the p value already provides certainty or at least high probability of a successful replication. Thus, for these researchers, a failed replication may come as a total surprise.

If a problem of incentives only were at the heart of the replication crisis, these delusions would not likely exist. Conversely, the thesis that many researchers have internalized a statistical ritual implies the existence of delusions that maintain the ritual.

The results in Tables 1 and 2 are supported by studies of how significant results are interpreted in published articles. Finch, Cumming, and Thomason (2001) reviewed the articles published in the Journal of Applied Psychology over 60 years and concluded that in 38% of these articles, nonsignificance was interpreted as demonstrating that the null hypothesis was true (illusion of certainty). Similarly, an analysis of 259 articles in the Psychonomic Bulletin & Review revealed that in 19% of the articles, the authors presented statistical significance as certainty (Hoekstra et al., 2006). These values are consistent with those in Table 2.

Statistical power and replication

In order to investigate whether an effect exists, one should design an experiment that has a reasonable chance of detecting it. I take this insight as common sense. In statistical language, an experiment should have sufficient statistical power.

Yet the null ritual knows no statistical power. Early textbook writers such as Guilford declared that the concept of power was “too difficult to discuss” (1956, p. 217). The fourth edition of the Publication Manual of the American Psychological Association was the first to mention that power should be taken seriously (American Psychological Association, 1994), but no practical guidelines were given. Nor did the subsequent fifth and sixth editions provide guidelines (American Psychological Association, 2001, 2010), which is odd in a manual that instructs authors on minute details such as how to format variables and when to use a semicolon.

Statistical power is a concept from Neyman-Pearson theory: Power is the probability of accepting the alternative hypothesis if it is true. For instance, if the alternative hypothesis is true and the power is 90%, and if the experiment is repeated many times, one will correctly conclude in 90% of the cases that this hypothesis is true. Power is directly relevant for the probability of a successful replication in two respects. First, if the alternative hypothesis is correct but the power is low, then the chances of replicating a significant finding are low. Second, if the power is low, then significant findings overestimate the size of the effect, which is one of the reasons why effects—even those that exist—tend to “fade away” (Button et al., 2013).

Statistical power provides another test of whether the incentive structure of “publish or perish” is sufficient to explain the replication crisis or whether a substantial part of this crisis is due to the null ritual and its associated delusions. Given the incentive structure for producing statistically significant results, it should be in the interest of every researcher to design experiments that have a reasonable chance of detecting an effect. Thus, according to the strategic-game hypothesis, we would expect researchers to strategically design experiments with high or at least reasonable power (with the exception of experiments for which the effect size is expected to be small; see the Discussion section). A minimal criterion for “reasonable” would be “substantially better than a coin toss.” That is, if one performs a chance experiment by tossing a coin and accepts the alternative hypothesis if “heads” comes up, this “experiment” has power of 50% to correctly “detect” an effect if there is one. Any psychological experiment should be designed to have better power. However, to the degree that researchers have internalized the null ritual, which does not know power, we would expect both inattention to power, which results in small power, and unawareness of this problem. Thus, the statistical-ritual hypothesis predicts that researchers act against their own best interest.

Better than a coin flip?

Cohen (1962) estimated the power of studies published in the Journal of Abnormal and Social Psychology for detecting what he called small, medium, and large effect sizes (corresponding to Pearson correlations of .2, .4, and .6, respectively). He reported that the median power to detect a medium-sized effect was only 46%. A quarter of a century later, Sedlmeier and I checked whether Cohen's study on power had had an effect on the power of studies in the Journal of Abnormal Psychology (Sedlmeier & Gigerenzer, 1989). It had not; the median power to detect a medium-sized effect had decreased to 37%.³ The decline was a result of the introduction of alpha-adjustment procedures, reflecting the focus of the null ritual on the p value. Low power appeared to go unnoticed: Only 2 of 64 reports mentioned power at all. Subsequently, we checked the years 2000 through 2002 of the same journal and found that just 9 out of 220 empirical articles included statements about how the researchers determined the power of their tests (Gigerenzer, Krauss, & Vitouch, 2004). Bakker, Hartgerink, Wicherts, and van der Maas (2016) reported that 89% of 214 authors overestimated the power of research designs. Other analyses showed that only 3% of 271 psychological articles reporting significance tests
explicitly discussed power as a consideration for designing an experiment (Bakker & Wicherts, 2011), and only 2% of 436 articles in the Journal of Speech, Language, and Hearing Research in 2009 through 2012 reported statistical power (Rami, 2014). A meta-analysis of 44 reviews published in the social and behavioral sciences, beginning with Cohen's 1962 study, found that power had not increased over half a century (Smaldino & McElreath, 2016). Instead, the average power had remained consistently low, and the mean power for detecting a small-sized effect (Cohen's d = 0.2) was 24%, assuming α = .05. An analysis of 3,801 cognitive neuroscience and psychology articles published between 2011 and 2014 found that the median power to detect small, medium, and large effects was 12%, 44%, and 73%, respectively; in other words, there had been no improvement since the first power studies were conducted (Szucs & Ioannidis, 2017).

To estimate the power of experiments, an alternative route is to estimate the main factors that affect power: sample size and effect size. Given the median total sample size of 40 in four representative journals (Journal of Abnormal Psychology, Journal of Applied Psychology, Journal of Experimental Psychology: Human Perception and Performance, and Developmental Psychology; Marszalek, Barber, Kohlhart, & Holmes, 2011) and the average meta-analytic effect size (d) of 0.50, Bakker et al. (2012) estimated that the typical power of

that they are following the null ritual, which knows no power.

However, there may be a strategic element in designing low-power studies if the expected effect size is small and one runs multiple studies. In this situation, the chance of at least one significant result in N low-powered studies with sample size n/N can be higher than the chance of a significant result in one high-powered study with sample size n (Bakker et al., 2012). Although it is unlikely that most researchers strategically reason this way, given the general lack of thinking about power, positive experience could well reinforce such behavior. There is also the possibility that researchers sometimes design a single low-powered study so as to engineer nonsignificance, such as when they want to “demonstrate” the absence of adverse side effects of drugs (Greenland, 2012). Thus, the typical lack of statistical power is implied by the statistical-ritual hypothesis, but it may also have a strategic element.

At the same time, analyses showing that studies are frequently low powered raise a new question. Why do more than 90% of published articles in major psychological journals report significant results, despite notoriously low power (Sterling, Rosenbaum, & Weinkam, 1995)? The answer appears to be that many researchers compensate for their blind spot regarding power by violating good scientific practice in order to neverthe-
psychology studies is around 35%. less produce significant results. In one study, 2,155
In the neurosciences, power appears to be excep- academic psychologists at major U.S. universities
tionally low. An analysis of meta-analyses that included agreed to report anonymously whether they had per-
730 individual neuroscience studies—on the genetics sonally engaged in questionable research practices;
of Alzheimer’s disease, brain-volume abnormalities, half of the psychologists received incentives to answer
cancer biomarkers, and other topics—revealed that the honestly ( John, Loewenstein, & Prelec, 2012). To con-
median statistical power of the individual studies to be trol for reporting bias and estimate the true prevalence,
able to detect an effect of the summary effect size the researchers also asked all the psychologists to esti-
reported in their corresponding meta-analyses was 21% mate the percentage of other psychologists who had
(Button et al., 2013). The distribution of power was engaged in the same questionable behaviors and,
bimodal. A few, mostly with a neurological focus, had among those who had, the percentage who would
a power greater than 90%, whereas the power of the actually admit to having done so. For the group with
461 structural and volumetric MRI studies was strikingly incentives to report honestly, the seven most frequent
low, 8%. Among the animal-model studies included in questionable practices were as follows (each practice
this analysis, the average power was 20% to 30%. This is followed by the percentage of the group who admit-
suggests the possibility that animals’ lives are typically ted to engaging in it and, in parentheses, the estimated
sacrificed in poorly designed experiments that have low true prevalence):
chances of finding an effect. Simply flipping a coin
would be a better strategy, sparing both the animals 1. Failing to report all dependent measures: 67%
and the resources involved. (78%)
2. Collecting more data after seeing whether results
Discussion. If statistical significance is so devoutly were significant: 58% (72%)
desired by behavioral scientists, why do they design 3. Selectively reporting studies that “worked”: 50%
experiments that typically have such low chances of find- (67%)
ing a significant result? Consistently low power over some 4. Excluding data after looking at the impact of
50 years is not expected under the hypothesis that doing so on the results: 43% (62%)
researchers strategically aim at achieving statistically sig- 5. Reporting an unexpected finding as having been
nificant results. But it is consistent with the hypothesis predicted from the start: 35% (54%)
212 Gigerenzer
6. Failing to report all of a study's conditions: 27% (42%)
7. Rounding down a p value (e.g., reporting .054 as less than .05): 23% (39%)

By violating the statistical model on which the p value depends, each of these practices makes a significant result noninterpretable. The first practice increases the chance of a significant result from the nominal 5% if the null hypothesis is correct to a larger value, depending on the number of dependent variables (Simmons, Nelson, & Simonsohn, 2011). Among the 2,155 psychologists, 67% admitted that they had not reported all the measures they had used, and the estimated true value was higher. The other practices serve the same goal: to inflate the production of significant results. The last practice, rounding down p values to make them appear significant, is clearly cheating. It can be independently uncovered because it produces a systematic gap in the distribution of p values: too few just above .05 and too many just under .05. This pattern was found, for instance, in reports concluding that food ingredients cause cancer (Schoenfeld & Ioannidis, 2013). The practice of rounding down p values can also be detected from inconsistencies between reported p values and reported test statistics. Among articles published in 2001 in the British Medical Journal and in Nature, 25% and 38%, respectively, reported that results were statistically significant even though the test statistics revealed that they were not (García-Berthou & Alcaraz, 2004). Among the psychologists in the study by John et al. (2012), a similar percentage admitted to rounding down p values. The R package statcheck can help detect inconsistent p values (Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016).

The percentages reported by John et al. (2012) are likely conservative, given that out of some 6,000 psychologists originally contacted by the authors, only 36% responded. The low response rate could reflect a self-selection bias that resulted in more honest researchers being more likely to participate in the survey. Despite this possibility, a total of 94% of the researchers admitted that they had engaged in at least one questionable research practice.

General Discussion

The argument

I have argued that the replication crisis in psychology and the biomedical sciences is not only a matter of wrong incentives that are gamed by researchers (the strategic-game hypothesis) but also a consequence of researchers' belief in the null ritual and its associated delusions (the statistical-ritual hypothesis). In the first section of this article, I reconstructed the creation of the "null ritual" by textbook writers who merged two competing statistical theories into one hybrid theory, whose core is the null ritual and whose desired product is statistical significance. This ritual eventually replaced good standards of scientific practice with a single convenient surrogate: the p value.

In the second section, I tested four predictions of the statistical-ritual hypothesis. The first of these is that a substantial proportion of academic researchers should share the replication delusion. A review of the available studies with 839 academic psychologists and 991 psychology students showed that 20% of the faculty teaching statistics in psychology, 39% of the professors and lecturers, and 66% of the students did so.

The second and third predictions are that a substantial proportion of researchers should share the illusion of certainty and Bayesian wishful thinking, respectively. In the studies I reviewed, between 56% and 80% of statistics and methodology instructors in psychology departments believed in one or more of these three delusions; this range increased to 74% to 97% for professors and lecturers who were not methodology specialists. To see through these delusions does not require understanding of high-level statistics; in other contexts, researchers themselves study whether their participants are subject to the illusion of certainty or the inverse probability error (e.g., Hafenbrädl & Hoffrage, 2015).

The fourth prediction of the statistical-ritual hypothesis is that researchers should be largely blind to statistical power because it is not part of the ritual. The available meta-analyses in psychology show that the median power to detect a medium-sized effect is around 50% or below, which amounts to the power of tossing a coin. There has been no noticeable improvement since the first power analysis in the 1960s.

The statistical-ritual hypothesis also explains why an estimated 94% of academic psychologists engage in questionable research practices to obtain significant results (John et al., 2012). Significance, that is, rejection of the null hypothesis, is the primary goal of the null ritual, relegating good scientific practice to a secondary role. Researchers do not engage in questionable practices to minimize measurement error or to derive precise predictions from competitive theories; they engage in these practices solely in order to achieve statistical significance.

What to do

Various proposals have been made to prevent questionable practices by changing the incentive structure and introducing measures such as preregistration of studies (e.g., Gigerenzer & Muir Gray, 2011; Ioannidis, 2014). Trust in science could be improved by preregistration,
but given the fixation on significance, this important measure has already been gamed: In medicine, systematic discrepancies between registered and published experiments are common, registration often occurs after the data have been obtained, reviewers do not take the time to compare the registered protocol with the submitted article, and, despite preregistration, selective reporting of outcomes to achieve significance is frequent (C. W. Jones, Keil, Holland, Caughey, & Platts-Mills, 2015; Walker, Stevenson, & Thornton, 2014). As an alternative measure, a group of 72 researchers proposed redefining the criterion for statistical significance as p < .005 rather than p < .05 (Benjamin et al., 2017). The authors concluded: "The new significance threshold will help researchers and readers to understand and communicate evidence more accurately" (p. 11). Although this measure would be useful for reducing false positives, I do not see how it would improve understanding and eradicate the delusions documented in Tables 1 and 2. For researchers who believe in the replication fallacy, a p value less than .005 means that the results can be replicated with a probability of 99.5%.

I now provide a complementary proposal that follows from the present analysis. This proposal consists of four steps, the first of which would serve as a minimal solution by eliminating the surrogate goal. The second through fourth steps would extend this solution by refocusing research on good scientific method. The ultimate goal of this proposal is to support statistical thinking instead of statistical rituals.

Step 1: editors should no longer accept manuscripts that report results as "significant" or "not significant." If p values are reported, they should be reported as exact p values—for example, p = .04 or .06—as are other continuous measures, such as effect sizes. Decisions about accepting or rejecting an article should be based solely on its theoretical and methodological qualities, regardless of the p values.

This measure would eliminate the surrogate goal of reaching "significance" and the associated pressure to sacrifice proper scientific method in order to get a significant result. Science is a cumulative endeavor, not a yes/no decision based on a single empirical study. This step is a minimal proposal because it is easy to implement, but more radical than the alternative proposal to lower the level of significance (Benjamin et al., 2017). Lowering p values does not eliminate the surrogate goal but only makes it more difficult to attain. Although this increased difficulty might make some forms of p-hacking less effective, it may encourage even more concentration on the p value and increase questionable research practices used to attain significant results, thereby diverting attention from good scientific practice.

Step 2: editors should make a distinction between research aimed at developing hypotheses and research aimed at testing hypotheses. Editors should require that researchers clearly distinguish between developing hypotheses (e.g., looking through a correlation matrix to find large correlations) and testing hypotheses (e.g., running a second experiment in which this large correlation is stated as a hypothesis and subsequently tested). This distinction is also known as the distinction between exploratory and confirmatory research (Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). In order to make the distinction transparent, authors should not report any p values or similar inferential statistics when they report research aimed at developing hypotheses. Editors should encourage researchers to report on both hypothesis development and hypothesis testing.

A clear distinction between these two types of research would encourage direct replication attempts (i.e., independent tests of interesting observations made in prior experiments). It would also relieve researchers from the pressure to present unexpected findings as having been predicted from the start (see John et al., 2012).

Step 3: editors should require competitive-hypothesis testing, not null-hypothesis testing. Editors should require that a new research hypothesis be tested against the best competitors available. Null-hypothesis testing, in contrast, is noncompetitive: Typically, the prediction of the research hypothesis remains unspecified and is tested against a null effect only.

Competitive testing requires precise research hypotheses and thus encourages building mathematical models of psychological processes. Competition would make null-hypothesis testing obsolete. It could also improve the impoverished approach to theory in psychology, which is partly due to the focus on null-hypothesis testing (Gigerenzer, 1998). If the competing models use free parameters, these need to be tested in prediction (e.g., out-of-sample prediction, such as cross-validation), never by data fitting alone (e.g., Brandstätter, Gigerenzer, & Hertwig, 2008).

Step 4: psychology departments should teach the statistical toolbox, not a statistical ritual. Psychology departments need to begin teaching the statistical toolbox. This toolbox includes techniques to visualize the descriptive statistics of the sample, Tukey's exploratory data analysis, meta-analysis, estimation, Fisher's null-hypothesis testing (which is not the same as the null ritual), Neyman-Pearson decision theory, and Bayesian inference. Most important, the toolbox approach requires using informed judgment to select the appropriate tool for a given problem.
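Step 3's requirement that models with free parameters be tested in prediction, never by data fitting alone, can be made concrete with a minimal sketch (hypothetical data, not from any study cited here): a "model" flexible enough to reproduce its training sample perfectly wins any data-fitting contest, yet loses to a one-parameter model as soon as both must predict new observations, as in cross-validation.

```python
import random
import statistics

rng = random.Random(3)

def sample(n):
    # Hypothetical noisy observations around a true mean of 1.0.
    return [1.0 + rng.gauss(0.0, 1.0) for _ in range(n)]

train, test = sample(200), sample(200)

def mse(predictions, data):
    # Mean squared error of predictions against observed data.
    return statistics.fmean((p - y) ** 2 for p, y in zip(predictions, data))

# Model A: one free parameter (the sample mean), fitted to the training data.
mean_model = statistics.fmean(train)

# Model B: a maximally flexible "model" with one free parameter per data
# point; it reproduces the training sample exactly.
fit_a = mse([mean_model] * len(train), train)  # close to the noise variance
fit_b = mse(train, train)                      # exactly 0.0: B "wins" by data fitting

# Out-of-sample prediction reverses the ranking: B's memorized values
# predict fresh data worse than the single-parameter model.
pred_a = mse([mean_model] * len(test), test)
pred_b = mse(train, test)

print(fit_b < fit_a)    # True: data fitting favors the flexible model
print(pred_b > pred_a)  # True: prediction favors the simple model
```

The point is not that simple models always win, but that fitting error alone systematically rewards free parameters; only out-of-sample prediction can distinguish a precise theory from flexible curve fitting.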
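The trade-off described earlier from Bakker et al. (2012)—that with a small expected effect, the chance of at least one significant result in N low-powered studies with sample size n/N can exceed the chance of a significant result in one high-powered study with sample size n—can be checked numerically. The sketch below assumes a one-sided one-sample z-test with known variance and α = .05; this is a simplification chosen for illustration, not the design used in the studies reviewed above.

```python
import math

def power(d, n, z_crit=1.645):
    """Power of a one-sided one-sample z-test (alpha = .05, sigma = 1 known)
    when the true standardized effect size is d."""
    # Under the alternative, the test statistic is Normal(d * sqrt(n), 1),
    # so power = P(Z > z_crit - d * sqrt(n)) for a standard normal Z.
    x = d * math.sqrt(n) - z_crit
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

d = 0.2              # small effect (Cohen's d)
n_total, N = 100, 4  # 100 participants: one big study vs. four studies of 25

p_one_big = power(d, n_total)                # one study with all participants
p_small = power(d, n_total // N)             # each low-powered study
p_at_least_one = 1.0 - (1.0 - p_small) ** N  # chance of >= 1 significant result

print(round(p_one_big, 3))       # ~0.64
print(round(p_small, 3))         # ~0.26
print(round(p_at_least_one, 3))  # ~0.70
```

Each small study has only about 26% power, yet the chance that at least one of the four comes out "significant" exceeds the power of the single larger study—the mechanism by which selectively reporting studies that "worked" (practice 3 above) can make low power privately rewarding.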
The emphasis on judgment would mean taking the assumptions of statistical models seriously. Editors should not require routine statistical inference in situations in which it is unclear whether the assumptions of a statistical model actually hold but rather should encourage proper use of descriptive statistics or exploratory data analysis. The toolbox approach replaces statistical rituals with statistical thinking and includes principles of good scientific method, such as minimizing measurement error and conducting double-blind studies.

The key challenge in the toolbox approach is to develop informed judgment about the kind of problems that each tool can handle best—a process similar to learning that hammers are for nails and screwdrivers are for screws. Fisher's null-hypothesis testing is useful (if at all) solely for new problems for which little information is available and one does not even have a precise alternative hypothesis. If two competing hypotheses are known, Neyman-Pearson theory is the better choice. If, in addition, priors are known, as in cancer screening, Bayes' rule is likely the preferred method; of the three tools, it is also the only one designed to estimate probabilities that hypotheses are true. Most important, however, high-quality descriptive statistics and exploratory data analysis (Breiman, 2001; L. V. Jones & Tukey, 2000) are good candidates for the scores of situations in which no random samples have been drawn from defined populations and the assumptions of the statistical models are not in place (Greenland, 1990). Such a toolbox approach is the opposite of an automatic inference procedure. It requires good judgment about when to use each tool, which is exactly what Fisher, Neyman and Pearson, and Jones and Tukey emphasized.

The toolbox approach can correct the historical error of considering statistical inference from sample to population as the sine qua non of good scientific practice. This has been an extraordinary blunder, and for two reasons. First, as mentioned before, the assumptions underlying the model of statistical inference are typically not met. For instance, typically no population has been defined, and no random samples have been drawn. Thus, unlike in quality control or polling, nobody knows the population to which a significant result actually refers, which makes the entire p-value procedure a nebulous exercise. Second, and most important, obtaining statistical significance has become a surrogate for good scientific practice, pushing principles such as formulating precise theories, conducting double-blind experiments, minimizing measurement error, and replicating findings into the sidelines. These principles have often not even been mentioned in reports on phenomena that later proved difficult to replicate, including reports on priming and too-much-choice experiments. W. S. Gosset, who published an article on the t test in 1908, recognized this long ago with respect to measurement error: "Obviously the important thing . . . is to have a low real error, not to have a 'significant' result at a particular station [level]. The latter seems to me to be nearly valueless in itself" (quoted in Pearson, 1939, p. 247).

Incentives

Finally, the established incentives themselves need explanation. Why do they arbitrarily focus on statistical significance, which by itself is one of the least important signs of good scientific research? In theory, researchers could be rewarded for quite a number of practices, including demonstrating the stability of effects through replications and designing clever experiments that discriminate between two or more competing hypotheses. For instance, physicists are rewarded for designing tools that minimize measurement error, and neoclassical economists are rewarded for developing mathematical models, theorems, and proofs. Whereas the strategic-game hypothesis takes the incentives as given, the statistical-ritual hypothesis provides a deeper explanation of the roots of the replication crisis. Researchers are incentivized to aim for the product of the null ritual, statistical significance, not for goals that are ignored by it, such as high power, replication, and precise competing theories and proofs. The statistical-ritual hypothesis provides the rationale for the very incentives chosen by editors, administrators, and committees. Obtaining significant results became the surrogate for good science.

Surrogate science: the elimination of scientific judgment

The null ritual can be seen as an instance of a broader movement toward replacing judgment about the quality of research with quantitative surrogates. Search committees increasingly tend to look at applicants' h-indices, citation counts, and numbers of articles published as opposed to actually reading and judging the arguments and evidence provided in these articles. Assessment of research is also increasingly left to administrators who do not understand the content of the research, a practice encouraged by the rising commercialization of universities and of academic publishing.

One positive aspect of the replication crisis is that it has increased awareness that we need to take action and protect science from being transformed into a mass production of studies that pursue surrogate goals. What we need are not more but fewer and better publications. In order to ensure that future generations of scientists remain innovative risk takers, educators, journal editors, and researchers themselves need to revert
to the original goals of science. It is time to combat the present system of false incentives and eliminate the null ritual from scientific practice.

Action Editor

Daniel J. Simons served as action editor for this article.

Author Contributions

G. Gigerenzer is the sole author of this article and is responsible for its content.

Acknowledgments

I would like to thank the ABC Research Group, Henry Cowles, Geoff Cumming, Sander Greenland, Deborah Mayo, Rona Unrau, and E.-J. Wagenmakers for their helpful comments. This article is based on a seminar I presented at the Davis Center, History Department, Princeton University, October 2016, and a talk I gave at the Philosophy of Science meeting, Atlanta, Georgia, November 2016.

Declaration of Conflicting Interests

The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.

Notes

1. Reporting replication results in yes/no terms can be just as problematic as reporting significance in yes/no terms. If an original conclusion that an effect exists was based on statistical significance alone and the replicability of the effect is again determined by significance, then the flaws of relying on p values alone carry over to the replication studies.
2. During his career, Fisher changed his ideas on this and other questions, in part motivated by his controversy with Neyman and Pearson. Earlier, he had proposed .05 as a convenient level of significance, but in the 1950s he rejected the routine use of such a constant level. Thus, Fisher himself may have contributed to the confusion. Similarly, from reading his Design of Experiments (Fisher, 1935), one might gain the impression that null-hypothesis testing is fairly mechanical, but Fisher later made it quite clear that this was not his intention (see Gigerenzer et al., 1989, chap. 3).
3. In this replication study, we used Cohen's original definition of a medium effect size to facilitate comparison with his original study, although Cohen (1969) later changed his definition of small, medium, and large effect sizes to correspond to Pearson correlations of .1, .3, and .5, respectively. This systematic lowering of the effect-size convention has the effect of slightly lowering the power, too. These assumed effect sizes may still be larger than those typical in some fields. In applied psychology, tertile effect sizes (calculated by dividing the distribution of effect sizes into three equal parts) have been reported to be only half or a third as large as Cohen's revised values (Bosco, Aguinis, Singh, Field, & Pierce, 2015). In what follows, I report power estimates as published without discussing the details, which would go beyond the focus of this article (for a discussion of counterintuitive issues involving power, see Greenland, 2012). But a word of caution is necessary. Not all estimated power values can be directly compared because they may be based on differing assumptions.

References

American Psychological Association. (1994). Publication manual of the American Psychological Association (4th ed.). Washington, DC: Author.
American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.
Anastasi, A. (1958). Differential psychology (3rd ed.). New York, NY: Macmillan.
Anderson, B. L., Williams, S., & Schulkin, J. (2013). Statistical literacy of obstetrics-gynecology residents. Journal of Graduate Medical Education, 5, 272–275. doi:10.4300/JGME-D-12-00161.1
Anderson, R. L., & Bancroft, T. A. (1952). Statistical theory in research. New York, NY: McGraw-Hill.
Badenes-Ribera, L., Frias-Navarro, D., Iotti, B., Bonilla-Campos, A., & Longobardi, C. (2016). Misconceptions of the p-value among Chilean and Italian academic psychologists. Frontiers in Psychology, 7, Article 1247. doi:10.3389/fpsyg.2016.01247
Badenes-Ribera, L., Frias-Navarro, D., Monterde-i-Bort, H., & Pascual-Soler, M. (2015). Interpretation of the p value: A national survey study in academic psychologists from Spain. Psicothema, 27, 290–295. doi:10.7334/psicothema2014.283
Bakker, M., Hartgerink, C. H. J., Wicherts, J. M., & van der Maas, H. L. J. (2016). Researchers' intuitions about power in psychological research. Psychological Science, 27, 1069–1077. doi:10.1177/0956797616647519
Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7, 543–554. doi:10.1177/1745691612459060
Bakker, M., & Wicherts, J. M. (2011). The (mis)reporting of statistical results in psychology journals. Behavior Research Methods, 43, 666–678. doi:10.3758/s13428-011-0089-5
Benjamin, D. J., Berger, J., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk, R., . . . Johnson, V. (2017). Redefine statistical significance. Retrieved from psyarxiv.com/mky9j
Bokhari, A. (2017, March 29). J Scott Armstrong: Fewer than 1 percent of papers in scientific journals follow scientific method. breitbart.com. Retrieved from http://www.breitbart.com/tech/2017/03/29/j-scott-armstrong-fraction-1-papers-scientific-journals-follow-scientific-method/
Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100, 431–449. doi:10.1037/a0038047
Bozarth, J. D., & Roberts, R. R. (1972). Signifying significant significance. American Psychologist, 27, 774–775. doi:10.1037/h0038034
Brandstätter, E., Gigerenzer, G., & Hertwig, R. (2006). The priority heuristic: Making choices without trade-offs. Psychological Review, 113, 409–432. doi:10.1037/0033-295X.113.2.409
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16, 199–231. doi:10.1214/ss/1009213726
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376. doi:10.1038/nrn3475
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153. doi:10.1037/h0045186
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York, NY: Academic Press.
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), Article 140216. doi:10.1098/rsos.140216
Cumming, G. (2008). Replication and p intervals: p values predict the future only vaguely, but confidence intervals do much better. Perspectives on Psychological Science, 3, 286–300. doi:10.1111/j.1745-6924.2008.00079.x
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25, 7–29. doi:10.1177/0956797613504966
Danziger, K. (1987). Statistical method and the historical development of research practice in American psychology. In L. Krüger, G. Gigerenzer, & M. S. Morgan (Eds.), The probabilistic revolution, Vol. 2: Ideas in the sciences (pp. 35–47). Cambridge, MA: MIT Press.
Danziger, K. (1990). Constructing the subject: Historical origins of psychological research. Cambridge, England: Cambridge University Press.
Dulaney, S., & Fiske, A. P. (1994). Cultural rituals and obsessive-compulsive disorder: Is there a common psychological mechanism? Ethos, 22, 243–283. doi:10.1525/eth.1994.22.3.02a00010
Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard. Theory & Psychology, 5, 75–98. doi:10.1177/0959354395051004
Finch, S., Cumming, G., & Thomason, N. (2001). Reporting of statistical inference in the Journal of Applied Psychology: Little evidence of reform. Educational and Psychological Measurement, 61, 181–210. doi:10.1177/00131640121971167
Fisher, R. A. (1935). The design of experiments. Edinburgh, Scotland: Oliver & Boyd.
Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, Series B, 17, 69–78.
Fisher, R. A. (1956). Statistical methods and scientific inference. Edinburgh, Scotland: Oliver & Boyd.
Freedman, L. P., Cockburn, I. A., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLOS Biology, 13(6), Article e1002165. doi:10.1371/journal.pbio.1002165
García-Berthou, E., & Alcaraz, C. (2004). Incongruence between test statistics and P values in medical papers. BMC Medical Research Methodology, 4, Article 13. doi:10.1186/1471-2288-4-13
García-Pérez, M. A., & Alcalá-Quintana, R. (2016). The interpretations of scholars' interpretations of confidence intervals: Criticism, replication, and extension of Hoekstra et al. (2014). Frontiers in Psychology, 7, Article 1042. doi:10.3389/fpsyg.2016.01042
Gigerenzer, G. (1987). Probabilistic thinking and the fight against subjectivity. In L. Krüger, G. Gigerenzer, & M. S. Morgan (Eds.), The probabilistic revolution, Vol. 2: Ideas in the sciences (pp. 11–33). Cambridge, MA: MIT Press.
Gigerenzer, G. (1993). The Superego, the Ego, and the Id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 313–339). Hillsdale, NJ: Erlbaum.
Gigerenzer, G. (1998). Surrogates for theories. Theory & Psychology, 8, 195–204. doi:10.1177/0959354398082006
Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33, 587–606. doi:10.1016/j.socec.2004.09.033
Gigerenzer, G., Gaissmaier, W., Kurz-Milcke, E., Schwartz, L. M., & Woloshin, S. (2007). Helping doctors and patients to make sense of health statistics. Psychological Science in the Public Interest, 8, 53–96. doi:10.1111/j.1539-6053.2008.00033.x
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about null hypothesis testing but were afraid to ask. In D. Kaplan (Ed.), Handbook on quantitative methods in the social sciences (pp. 391–408). Thousand Oaks, CA: Sage.
Gigerenzer, G., & Muir Gray, J. A. (Eds.). (2011). Better doctors, better patients, better decisions: Envisioning health care 2020. Cambridge, MA: MIT Press.
Gigerenzer, G., & Murray, D. J. (1987). Cognition as intuitive statistics. Hillsdale, NJ: Erlbaum.
Gigerenzer, G., Swijtink, Z., Porter, T., Daston, L., Beatty, J., & Krüger, L. (1989). The empire of chance: How probability changed science and everyday life. Cambridge, England: Cambridge University Press.
Greenland, S. (1990). Randomization, statistics, and causal inference. Epidemiology, 1, 421–429.
Greenland, S. (2011). Null misinterpretation in statistical testing and its impact on health risk assessment. Preventive Medicine, 53, 225–228.
Greenland, S. (2012). Nonsignificance plus high power does not imply support for the null over the alternative. Annals of Epidemiology, 22, 364–368. doi:10.1016/j.annepidem.2012.02.007
Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175–183. doi:10.1111/j.1469-8986.1996.tb02121.x
Guilford, J. P. (1942). Fundamental statistics in psychology and education (1st ed.). New York, NY: McGraw-Hill.
Guilford, J. P. (1956). Fundamental statistics in psychology Makel, M. C., Plucker, J. A., & Hegarty, B. (2012). Replications
and education (3rd ed.). New York, NY: McGraw-Hill. in psychology research: How often do they really occur?
Guilford, J. P. (1965). Fundamental statistics in psychology Perspectives on Psychological Science, 7, 537–542.
and education (4th ed.). New York, NY: McGraw-Hill. doi:10.1177/1745691612460688
Hafenbrädl, S., & Hoffrage, U. (2015). Toward an ecological Marszalek, J. M., Barber, C., Kohlhart, J., & Holmes, C. B.
analysis of Bayesian inference: How task characteristics (2011). Sample size in psychological research over the
influence responses. Frontiers in Psychology, 6, Article past 30 years. Perceptual and Motor Skills, 112, 331–348.
939. doi:10.3389/fpsyg.2015.00939 doi:10.2466/03.11.pms.112.2.331-348
Haller, H., & Krauss, S. (2002). Misinterpretations of sig- Melton, A. W. (1962). Editorial. Journal of Experimental
nificance: A problem students share with their teach- Psychology, 64, 553–557. doi:10.1037/h0045549
ers? Methods of Psychological Research Online, 7(1), Mirowski, P. (2011). Science-mart: Privatizing American
1–20. Retrieved from https://www.metheval.uni-jena.de/ science. Cambridge, MA: Harvard University Press.
lehre/0405-ws/evaluationuebung/haller.pdf Mullard, A. (2011). Reliability of ‘new drug target’ claims
Hoekstra, H., Finch, S., Kiers, H. A. L., & Johnson, A. (2006). called into question. Nature Reviews Drug Discovery, 10,
Probability as certainty: Dichotomous thinking and the 643–644. doi:10.1038/nrd3545
misuse of p values. Psychonomic Bulletin & Review, 13, Neyman, J., & Pearson, E. S. (1933). On the problem of the
1033–1037. doi:10.3758/BF03213921 most efficient tests of statistical hypotheses. Philosophical
Hoekstra, H., Morey, R. D., Rouder, J. N., & Wagenmakers, Transactions of the Royal Society of London. Series A,
E.-J. (2014). Robust misinterpretation of confidence in- 231, 289–337. doi:10.1098/rsta.1933.0009
tervals. Psychonomic Bulletin & Review, 21, 1157–1164. Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M.,
doi:10.3758/s13423-013-0572-3 Epskamp, S., & Wicherts, J. M. (2016). The prevalence
Horton, R. (2016). Offline: What is medicine’s 5 sigma? The of statistical reporting errors in psychology (1985–2013).
Lancet, 385, 1380. Behavior Research Methods, 48, 1205–1226. doi:10.3758/
Ioannidis, J. P. A. (2005). Why most published research s13428-015-0664-2
findings are false. PLOS Medicine, 2(8), Article e124. Nunnally, J. C. (1975). Introduction to statistics for psychol-
doi:10.1371/journal.pmed.0020124 ogy and education. New York, NY: McGraw-Hill.
Ioannidis, J. P. A. (2014). How to make more published Oakes, M. (1986). Statistical inference: A commentary for
research true. PLOS Medicine, 11(10), Article e1001747. the social and behavioral sciences. New York, NY: Wiley.
doi:10.1371/journal.pmed.1001747 Open Science Collaboration. (2015). Estimating the repro-
Ioannidis, J. P. A., Greenland, S., Hlatky, M. A., Khoury, ducibility of psychological science. Science, 349, Article
M. J., Macleod, M. R., Moher, D., . . . Tibshirani, R. aac4716. doi:10.1126/science.aac4716
(2014). Increasing value and reducing waste in research Pashler, H., Coburn, N., & Harris, C. R. (2012). Priming of
design, conduct, and analysis. The Lancet, 383, 166– social distance? Failure to replicate effects on social
175. doi:10.1016/s0140-6736(13)62227-8 and food judgments. PLOS ONE, 7(8), Article e42510.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring doi:10.1371/journal.pone.0042510
the prevalence of questionable research practices with Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ intro-
incentives for truth telling. Psychological Science, 23, duction to the special section on replicability in psy-
524–532. doi:10.1177/0956797611430953 chological science: A crisis of confidence? Perspectives
Jones, C. W., Keil, L. G., Holland, W. C., Caughey, M. C., on Psychological Science, 7, 528–530. doi:10.1177/17456
& Platts-Mills, T. F. (2015). Comparison of registered 91612465253
and published outcomes in randomized controlled tri- Pearson, E. S. (1939). “Student” as statistician. Biometrika, 30,
als: A systematic review. BMC Medicine, 13, Article 282. 210–250. doi:10.2307/2332648
doi:10.1186/s12916-015-0520-3 Pearson, E. S. (1962). Some thoughts on statistical infer-
Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of ence. Annals of Mathematical Statistics, 33, 394–403.
the significance test. Psychological Methods, 5, 411–414. doi:10.1214/aoms/1177704566
doi:10.1037/1082-989X.5.4.411 Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or
Lecoutre, M. P., Poitevineau, J., & Lecoutre, B. (2003). Even not: How much can we rely on published data on poten
statisticians are not immune to misinterpretations of Null tial drug targets? Nature Reviews Drug Discovery, 10,
Hypothesis Significance Tests. International Journal of 712. doi:10.1038/nrd3439-c1
Psychology, 38, 37–45. doi:10.1080/00207590244000250 Rami, M. K. (2014). Power and effect size measures: A cen-
Lehrer, J. (2010, December 13). The truth wears off: Is there sus of articles published from 2009-2012 in the Journal
something wrong with the scientific method? The New of Speech, Language, and Hearing Research. American
Yorker. Retrieved from https://www.newyorker.com/ International Journal of Social Science, 3, 13–19.
magazine/2010/12/13/the-truthwears-off Rosenthal, R. (1993). Cumulating evidence. In G. Keren &
Loftus, G. R. (1993). Editorial comment. Memory & Cognition, C. Lewis (Eds.), A handbook for data analysis in the
21, 1–3. doi:10.3758/BF03211158 behavioral sciences: Methodological issues (pp. 519–559).
Madden, C. S., Easley, R. W., & Dunn, M. G. (1995). How journal Hillsdale, NJ: Erlbaum.
editors view replication research. Journal of Advertising, 24, Scheibehenne, B., Greifeneder, R., & Todd, P. M. (2010). Can
77–87. there ever be too many options? A meta-analytic review
218 Gigerenzer
of choice overload. Journal of Consumer Research, 37, Stigler, S. M. (1999). Statistics on the table: The history of sta-
409–425. doi:10.1086/651235 tistical concepts and methods. Cambridge, MA: Harvard
Schoenfeld, J. D., & Ioannidis, J. P. A. (2013). Is everything University Press.
we eat associated with cancer? A systematic cookbook Stroebe, W., & Strack, F. (2014). The alleged crisis and the
review. American Journal of Clinical Nutrition, 97, illusion of exact replication. Perspectives on Psycholog-
127–134. doi:10.3945/ajcn.112.047142 ical Science, 9, 59–71. doi:10.1177/1745691613514450
Schooler, J. (2011). Unpublished results hide the decline Szucs, D., & Ioannidis, J. P. A. (2017). Empirical assess-
effect. Nature, 470, 437. doi:10.1038/470437 ment of published effect sizes and power in the recent
Schünemann, H., Ghersi, D., Kreis, J., Antes, G., & Bousquet, J. cognitive neuroscience and psychology literature. PLOS
(2011). Reporting of research: Are we in for better health Biology, 15(3), Article e2000797. doi:10.1371/journal
care by 2020? In G. Gigerenzer & J. A. Muir Gray (Eds.), .pbio.2000797
Better doctors, better patients, better decisions: Envision Tversky, A., & Kahneman, D. (1971). Belief in the law of
ing health care 2020 (pp. 83–102). Cambridge, MA: MIT small numbers. Psychological Bulletin, 76, 105–110.
Press. doi:10.1037/h0031322
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of sta- Wagenmakers, E.-J., Wetzels, R., Borsboom, D., van der Maas,
tistical power have an effect on the power of studies? H. L. J., & Kievit, R. A. (2012). An agenda for purely
Psychological Bulletin, 105, 309–316. doi:10.1037/0033- confirmatory research. Perspectives on Psychological
2909.105.2.309 Science, 7, 632–638. doi:10.1177/1745691612463078
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). Walker, K. F., Stevenson, G., & Thornton, J. G. (2014).
False-positive psychology: Undisclosed flexibility in Discrepancies between registration and publication of
data collection and analysis allows presenting anything randomised controlled trials: An observational study.
as significant. Psychological Science, 22, 1359–1366. Journal of the Royal Society of Medicine Open, 5(5).
doi:10.1177/0956797611417632 doi:10.1177/2042533313517688
Smaldino, P. E., & McElreath, R. (2016). The natural selection Wegwarth, O., Schwartz, L. M., Woloshin, S., Gaissmaier,
of bad science. Royal Society Open Science, 3(9), Article W., & Gigerenzer, G. (2012). Do physicians understand
160384. doi:10.1098/rsos.160384 cancer screening statistics? A national survey of primary
Snedecor, G. W. (1937). Statistical methods (1st ed.). Ames: care physicians in the United States. Annals of Internal
Iowa State Press. Medicine, 156, 340–349. doi:10.7326/0003-4819-156-5-
Sterling, T. D. (1959). Publication decisions and their possible 201203060-00005
effects on inferences drawn from tests of significance—or Welch, H. G. (2011). Overdiagnosed: Making people sick in
vice versa. Journal of the American Statistical Associ the pursuit of health. Boston, MA: Beacon Press.
ation, 54, 30–34. doi:10.2307/2282137 Westover, M. B., Westover, K. D., & Bianchi, M. T. (2011).
Sterling, T. D., Rosenbaum, W., & Weinkam, J. (1995). Significance testing as perverse probabilistic reasoning.
Publication decisions revisited: The effect of the outcome BMC Medicine, 9, Article 20. doi:10.1186/1741-7015-9-20
of statistical tests on the decision to publish and vice versa. Wulff, H. R., Andersen, B., Brandenhoff, P., & Guttler, F.
The American Statistician, 49, 108–112. doi:10.2307/ (1987). What do doctors know about statistics? Statistics
2684823 in Medicine, 6, 3–10.