
Advances in Methods and Practices in Psychological Science
January–March 2025, Vol. 8, No. 1, pp. 1–9
© The Author(s) 2025
Article reuse guidelines: sagepub.com/journals-permissions
DOI: https://doi.org/10.1177/25152459241296397
www.psychologicalscience.org/AMPPS

General Article

With Low Power Comes Low Credibility? Toward a Principled Critique of Results From Underpowered Tests

Lukas L. Lengersdorff and Claus Lamm
Social, Cognitive, and Affective Neuroscience Unit, Department of Cognition, Emotion, and Methods in Psychology, Faculty of Psychology, University of Vienna, Vienna, Austria

Corresponding Author: Lukas L. Lengersdorff, Social, Cognitive, and Affective Neuroscience Unit, Department of Cognition, Emotion, and Methods in Psychology, Faculty of Psychology, University of Vienna, Vienna, Austria. Email: [email protected]

Abstract
Researchers should be motivated to adequately power statistical tests because tests with low power have a low probability
of detecting true effects. However, it is also often claimed that significant results obtained by underpowered tests are
less likely to reflect a true effect. Here, we critically discuss this “low-power/low-credibility” (LPLC) critique from both
frequentist and Bayesian perspectives. Although the LPLC critique is first and foremost a critique of frequentist tests, it
is itself not consistent with frequentist theory. In particular, it demands that researchers have some information on the
probability that a hypothesis is true before they test it. However, such prior probabilities are dismissed as meaningless in
frequentist inference, and we demonstrate that they cannot be meaningfully introduced into frequentist thinking. Even
when adopting a Bayesian perspective, however, significant results from tests with low power can provide a nonnegligible
amount of support for the tested hypothesis. We conclude that even though low power reduces the chances to obtain
significant findings, there is little justification to dismiss already obtained significant findings on the basis of low power
alone. However, concerns about low power might often reflect suspicions that findings were produced by questionable
research practices. If this is the case, we suggest that communicating these issues transparently rather than using low
power as a proxy concern will be more appropriate. We conclude by providing suggestions on how results from tests
with low power can be critiqued for the correct reasons and in a constructive manner.

Keywords
inference, questionable research practices

Received 2/8/24; Revision accepted 10/4/24

Psychological science is in a period of reform. In the wake of the replication crisis, empirical researchers and methodologists have raised awareness about questionable research practices (QRPs), pointing out that many analytical habits decrease the credibility of scientific output (Bakker et al., 2012; Open Science Collaboration, 2015). Most reform efforts address the (mis)use of frequentist inference, especially in the form of significance testing and p values (McShane et al., 2019; Wicherts et al., 2016). One of the biggest concerns raised is that many of the conducted statistical tests suffer from low power (Button et al., 2013; Ioannidis, 2005; Lindsay, 2015). By definition, a statistical test with low power has a low probability of producing a significant result even when the tested hypothesis is true. On these grounds alone, every researcher should be discouraged from conducting studies with sample sizes that are too small to detect plausible effect sizes because they will likely result in a waste of resources and time. However, it now seems a widely held belief that underpowered tests are problematic for another reason: Not only does low power decrease the probability of a significant result when the hypothesis is true, but (so goes the notion) it also makes the significant results that are obtained less credible because they have a lower probability of reflecting a true finding.

This idea is prominently voiced in Ioannidis's (2005) influential article as one of the reasons "why most published research findings are false" and cited and further expanded in numerous other articles (e.g., Button et al., 2013; Colquhoun, 2014; but for a critical evaluation, see Wagenmakers et al., 2015). From now on, we refer to this as the "low-power/low-credibility" (LPLC) critique.

In this article, we sketch out the development of the LPLC critique and discuss the assumptions one needs to accept to use it from both frequentist and Bayesian points of view. We also debate whether results from tests with low power are particularly susceptible to being invalidated by the use of QRPs and evaluate the LPLC critique in the presence of publication bias. We expect that researchers with a general interest in statistical inference will benefit from reading this article (if only by the opportunity to reflect on why they disagree with our points). We would like to achieve two things: first, to make scientists reflect, question, and openly communicate their own assumptions when they use the LPLC critique and second, to empower scientists to demand the same level of reflection and open communication from peers who use the LPLC critique when assessing their work, for example, during peer review. At the end of this article, we provide recommendations on how this reflection and communication could be achieved in practice.

We would like to explicitly emphasize here that only an ill-informed reading of our arguments could be used as an "excuse" for deliberately conducting studies with suboptimal sample sizes. After all, tests that have low power for even the largest plausible effect size are very likely to produce nonsignificant results. We expect that in the presence of the reflection and open communication we advocate, such studies will be justifiably assessed as irrelevant or inefficient to achieve scientific progress.

What Does "Underpowered" Mean?

The recent debate about the prevalence of "underpowered" tests can create the impression that power is an objective property of a test. Indeed, in our view, calling a test or study "underpowered" is often used as a shorthand to say that the sample size is too small relative to some explicit or implicit standard. However, one test with one given sample size does not have "one" power but, rather, many different levels of power for different possible effect sizes of interest (Morey & Lakens, 2016). However, what effect sizes are "interesting" or "plausible" must ultimately be decided by scientists and their peers, and it is through this decision that a substantial amount of subjective consideration enters the assessment of a test's power. Thus, a fair and transparent critique must include a statement for which effect sizes the test's power was too low. As Morey (2019) put it succinctly, "'This sample size feels small' is not good enough." Whenever we use phrases such as "low power" or "underpowered test" in this article, we take the point of view of scientists who assess the power of a test to be lower than the standard they would find acceptable given the range of effect sizes they would find interesting and plausible in the given situation.

Frequentist and Bayesian Probability

The aim of the present article is to assess the LPLC critique from the perspective of the two dominating schools of statistical inference: frequentist and Bayesian statistics. These two frameworks differ fundamentally in their understanding of probability, which is why we first need to briefly explain these differences. According to the frequentist interpretation, probability is the relative frequency with which repeatable processes produce outcomes (Fisher, 1935; Neyman & Pearson, 1933). This is consistent with many real-life applications of probability, such as games of chance. However, the frequentist interpretation does not allow the assignment of a probability to nonrepeatable, singular events. In particular, from a frequentist perspective, one cannot assign a probability to the truth of a certain hypothesis. In contrast, the Bayesian interpretation treats probability as a way to quantify knowledge (e.g., De Finetti, 1974; Jaynes, 2003). Consequently, Bayesian probability can make statements about singular events and hypotheses.

Consider the following example: A friend tells you that she will toss a fair coin. Before the toss, she asks you about the probability of the coin showing heads. What do you answer? In this situation, the obvious answer of 50% is a valid statement in both the frequentist and Bayesian frameworks. Now, your friend tosses the coin and covers the result with her hand. What is the probability that the coin under her hand is showing heads? As a Bayesian, you still have every reason to say 50%—you know that a fair coin toss results in heads in 50% of cases, and since this particular toss happened, you gained no additional knowledge about its outcome. From the frequentist point of view, however, you find it impossible to give a meaningful answer. You cannot see through your friend's hand, but you are certain that beneath it, the coin shows either heads or tails. Thus, the only "probability" you could assign to this event would be 100% or 0%.

The same difference arises whenever the truth of scientific hypotheses is discussed. In frequentist inference, the probability of some particular hypothesis being true is not meaningful because its truth is seen as a fixed state of the world. In contrast, Bayesian inference allows the assignment of probability to a hypothesis to quantify the existing evidence for its truth.

The Original Formulation of the LPLC Critique

The LPLC critique originated in an attempt to transfer the logic of diagnostic medical screening to tests of epistemic inference (Colquhoun, 2014; Mayo & Morey, 2017). Its central concept is the "positive predictive value" (PPV), a term frequently used in epidemiology. In epidemiology, the PPV of a test for some disease is the proportion of individuals who really have the disease among those individuals for which the test gave a positive result (Altman & Bland, 1994). The PPV is calculated, using Bayes's theorem, as

PPV = P(D | +) = [P(+ | D) P(D)] / [P(+ | D) P(D) + (1 − P(− | not D)) (1 − P(D))],

where the prevalence P(D) is the probability that any randomly selected member of the population of interest has the disease, the sensitivity P(+ | D) is the probability of a positive result given that the patient has the disease, and the specificity P(− | not D) is the probability of a negative result when the patient does not have the disease. Note that here, the use of Bayes's theorem is valid within the frequentist framework because all involved quantities can be interpreted as relative frequencies.

The central step in the formulation of the LPLC critique is to apply the same logic to significance tests. In this framework, hypotheses take the role of "individuals," and the truth of hypotheses takes the role of the "disease." Sensitivity and specificity are identified with the corresponding quantities from frequentist statistics: The power of the test, typically written as 1 − β, is equated with the sensitivity, whereas the Type I error probability α is understood to be the complement of the test's specificity. However, the prevalence has no direct counterpart in frequentist terminology. Therefore, the quantity P(H_true) is introduced, defined as the proportion of true hypotheses among some larger "population" of hypotheses, for example, all hypotheses tested in a certain research field. Most formulations of the LPLC critique represent P(H_true) by the corresponding odds ratio R = P(H_true) / (1 − P(H_true)), defined as the "ratio of the number of 'true relationships' to 'no relationships' among those tested in the field" (Ioannidis, 2005, paragraph 2, "Modeling the Framework for False Positive Findings") or "the odds that a probed effect is indeed non-null among the effects being probed" (Button et al., 2013, p. 366). The PPV can then be calculated as (cf. Ioannidis, 2005)

PPV = [(1 − β) P(H_true)] / [(1 − β) P(H_true) + α (1 − P(H_true))] = [(1 − β) R] / [(1 − β) R + α].   (1)

To complete the derivation of the LPLC critique, it is easily shown that for any fixed value of α and P(H_true), a decrease in power 1 − β leads to a decrease in the PPV. In the words of the LPLC critique's proponents, this relation shows that low power comes with a low "post-study probability" that the "claimed research finding is true" (Ioannidis, 2005, paragraph 2, "Modeling the Framework for False Positive Findings") and that "low power . . . reduces the likelihood that a statistically significant result reflects a true effect" (Button et al., 2013, p. 365).

Note that in the publications of Ioannidis (2005) and others (e.g., Button et al., 2013), the focus lies on the effects of low power on the evidence supplied by populations of published studies. However, the authors also invited the application of their critique to individual tests. This is seen in statements such as "the probability that a research finding is indeed true depends on . . . the statistical power of the study" (Ioannidis, 2005, paragraph 2, "Modeling the Framework for False Positive Findings") and "a study with low statistical power has a reduced chance of detecting a true effect, but it is less well appreciated that low power also reduces the likelihood that a statistically significant result reflects a true effect" (Button et al., 2013, p. 365). Indeed, the LPLC critique is often also used in the assessment of single studies, and it is on this use that we focus here. With small adaptations, our arguments can also be extended to the population-level LPLC critique, however.

The LPLC Critique in Frequentist Inference

The LPLC critique takes a peculiar position within frequentist inference. Concepts such as power and Type I errors are central to its formulation. However, by introducing the prevalence P(H_true), the LPLC critique becomes decidedly "unfrequentist." Frequentist tests are designed to perform inference without any reference to the probability of hypotheses, which is even dismissed as a meaningless concept (e.g., Fisher, 1935; Neyman & Pearson, 1933). In the same vein, many recent publications describe the fallacies that arise whenever researchers erroneously assume that frequentist tests tell anything about the probability of a hypothesis being true (Greenland et al., 2016; Morey et al., 2016; Wasserstein & Lazar, 2016).

Why, then, is the LPLC critique widely accepted by a predominantly frequentist field instead of being dismissed as another such fallacy? Its similarity to diagnostic screening is very appealing, but it is easily overlooked that one cannot treat tested hypotheses like screened patients without making very strong assumptions about the nature of the scientific process. For the LPLC critique to be consistent with frequentist theory, one would need to assume that tested hypotheses are randomly sampled from some population of hypotheses, just as screened individuals are sampled from a population of people.
This assumption is best illustrated by a variation of the classic urn model, in which the urn is "filled" with true and false hypotheses (Colquhoun, 2014; Mayo & Morey, 2017). Researchers do not know of any particular hypothesis whether it is true, but they do have knowledge of P(H_true), the percentage of hypotheses in the urn that are true. Researchers should now imagine that when they perform scientific inference, they randomly draw a hypothesis from the urn, perform the significance test with a specified Type I error and power, and accept or reject the hypothesis based on the test's result. Then, the proportion of true hypotheses among all accepted hypotheses will be the PPV. However, this is clearly not a useful model of the scientific process, for a number of reasons.

First, hypotheses are not randomly sampled. Scientists do not randomly draw their hypotheses from some larger set of distinguishable hypotheses, all of which are equally likely to be selected. Rather, new research questions are derived from existing findings, and the results of one study change the likelihood that another study will be started.

Second, it is unclear which population of hypotheses should be considered. The proponents of the LPLC critique propose that P(H_true) should reflect the number of true hypotheses in the research field (Button et al., 2013; Ioannidis, 2005). But how does one define the field? For example, as psychologists, should we consider all possible hypotheses within psychology or only within our discipline or subdiscipline? Do we consider all hypotheses that have ever been tested in this field, or only recent hypotheses, or maybe all hypotheses that could possibly be tested in the future? Do we consider even the most absurd hypotheses, or do we restrict the population to "reasonable" ones? If the latter, how do we define "reasonable"? These questions demonstrate that the LPLC critique suffers heavily from the so-called reference-class problem (Hájek, 2007; Mayo & Morey, 2017). The choice of an appropriate reference class of hypotheses must ultimately be based on the assumptions and goals of the individuals performing or assessing the analysis, introducing an element just as subjective as the priors elicited in Bayesian analyses.

Third, P(H_true) cannot be known or estimated. The situation implied by the LPLC critique is rather peculiar: Researchers do not know of any particular hypothesis whether it is true, yet they do know the proportion of true hypotheses in the field. It is not clear how researchers could have such substantial prior knowledge about hypotheses that have not yet been tested without resorting to such Bayesian solutions as expert knowledge or subjective belief.

But even if one solved (or ignored) all of these issues, the PPV would still not give the probability that any particular significant finding is true. Although the PPV has been equated with "the post-study probability that [a research finding] is true" (Ioannidis, 2005, paragraph 2, "Modeling the Framework for False Positive Findings"), the frequentist probability of any particular hypothesis being true is 100% or 0% independent of the test's result or its power. Rather, the PPV is the probability that a randomly selected hypothesis that produced a significant result is true. Mistaking the latter for the former is similar to the common misunderstanding that a given 95% confidence interval contains the true parameter with a probability of 95% (when it is rather the case that in the long run, 95% of confidence intervals will contain the true parameter; Morey et al., 2016).

Ultimately, these problems arise because frequentist inference pursues a different objective than the one implied by the LPLC critique. Significance tests have been designed with the intention to achieve optimal error control in repeated decision situations without any consideration of prior information. Researchers run into inconsistencies, however, when they demand that significance tests inform them about the probability that claimed findings are true.

The LPLC Critique in Bayesian Inference

Accepting the Bayesian interpretation of probability makes the LPLC critique much simpler—one does not need to identify P(H_true) with the proportion of true hypotheses in some hard-to-define population. Instead, researchers may define it as the "proper" prior probability that the hypothesis in question is true, reflecting all available knowledge. One may then also interpret the PPV as the posterior probability that the hypothesis is true after observing the significant result, PPV = P(H_true | sig.). It is mathematically correct to say that lower power leads to lower values of P(H_true | sig.). However, this relationship should not lead one to dismiss significant findings from underpowered tests entirely. Rather, it prompts one to determine exactly how much (or little) evidence the results can provide (see also Wagenmakers et al., 2015).

Figure 1 illustrates the posterior probability P(H_true | sig.) after a test significant at α = .05 for different prior probabilities P(H_true) and different levels of power. We show that P(H_true | sig.) does decrease with decreasing power. However, whether low power makes the result not credible depends heavily on its prior probability and on what values of P(H_true | sig.) one considers credible or not credible. If, for example, we preassign a prior probability of 50% to the hypothesis, significant results from tests with power levels of .80, .50, and .20 provide posterior probabilities of 94.1%, 90.9%, and 80.0%, respectively. Should the difference between 94.1% and 90.9% prompt us to accept the result of the test with the conventional power of .80 but to dismiss the result of the test with the low power of .50? And can we dismiss the significant result of a test with the very low power of .20 given that it provides the (perhaps surprisingly high) posterior probability of 80%? The answers to these questions depend on the thresholds that we deem appropriate and on the costs of wrong decisions. We note, however, that in contrast to frequentist significance testing, Bayesian inference does not demand making clear-cut (i.e., above or below a significance threshold) decisions between hypotheses.

[Fig. 1. The relationship between prior probability P(H_true), power, and posterior probability P(H_true | sig.) after a significant result (α = .05) is obtained. The curves show power levels of 1, .8, .5, and .2.]

The relative effect of power on the posterior probability changes with the prior probability. What, then, is an appropriate value of P(H_true)? Despite developments regarding objective priors on the level of model parameters (Bayarri et al., 2016; Jaynes, 2003), there is no generally agreed-on way to assign objective priors to hypotheses themselves. A common choice is 50%, implying prior indifference toward truth and falsehood of the hypothesis (Berger, 2003). But this prior might be inappropriate when the tested hypothesis appears very plausible or implausible. In such cases, it is inevitable that P(H_true) incorporates personal experience or subjective opinion and that different stakeholders assign different priors to the same hypothesis.

Bayesian inference is often based not on the posterior probability itself but on the Bayes factor (BF; Kass & Raftery, 1995). The BF is the ratio of the probability of the observed data under the assumption that the hypothesis is true to the data's probability under the assumption that the hypothesis is false. It plays an important role in updating the hypothesis's prior odds to its posterior odds after observing the data:

P(H_true | Data) / P(H_false | Data) = [P(Data | H_true) / P(Data | H_false)] × [P(H_true) / P(H_false)],

where the first factor on the right-hand side is the BF. The BF is independent of P(H_true) and can therefore be interpreted as an objective measure of how strongly the data support the tested hypothesis. In a fully Bayesian analysis, the BF would be determined by assessing the probability of the full data under both H_true and H_false (Wagenmakers et al., 2015). However, if we reduce the data to the outcome of a significance test (meaning that the only information we use is whether the test was significant or not), we obtain

P(H_true | sig.) / P(H_false | sig.) = [P(Data | H_true) / P(Data | H_false)] × [P(H_true) / P(H_false)] = BF_sig × [P(H_true) / P(H_false)],

where BF_sig, the BF obtained through a significant result, is given by

BF_sig = P(sig. | H_true) / P(sig. | H_false) = (1 − β) / α.

For a given α, BF_sig is a simple linear function of the power 1 − β. For the typical α of .05, the threshold of BF_sig = 3, widely seen as indicating moderate support for the hypothesis (Kass & Raftery, 1995; Keysers et al., 2020), is obtained for 1 − β = .15, a power level likely to be judged as extremely low by most scientists. A BF_sig of 10, the conventional threshold for strong evidence, is obtained for 1 − β = .50, a power level that is still far removed from the typically desired level of .80. Clearly, these observations should not lead to the conclusion that experiments do not have to be designed to achieve high power—after all, underpowered tests are unlikely to produce significant findings to begin with. However, we expect these examples to challenge scientists' intuitions about whether and to what extent low power devalues those results that have already been obtained.

We emphasize again that a fully Bayesian analysis would base inference on the full data (i.e., the BF), not on the dichotomous outcome of a significance test (i.e., the BF_sig). However, there is no straightforward relationship between frequentist power and the BF. Here, we constrained our attention to BF_sig to facilitate the evaluation of significance tests from a Bayesian perspective. For more details on the synthesis of frequentist and Bayesian inference, see Bayarri et al. (2016).
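As a quick numerical check of the BF_sig thresholds discussed above, the sketch below (R; the helper name bf_sig() is ours) evaluates BF_sig = (1 − β)/α for several power levels at α = .05.

```r
# Bayes factor obtained from knowing only that the test was significant:
# BF_sig = (1 - beta) / alpha, i.e., power divided by the Type I error rate.
bf_sig <- function(power, alpha = .05) power / alpha

bf_sig(c(.15, .50, .80, 1.00))
# -> 3, 10, 16, 20: a power of .15 already yields "moderate" evidence
#    (BF_sig = 3), a power of .50 yields "strong" evidence (BF_sig = 10),
#    and even a perfectly powered test cannot exceed 1 / alpha = 20.
```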

Power and QRPs

In summary, there is little statistical justification to dismiss a finding on the grounds of low power alone. However, everything written above relied on the assumption that the significant result in question was obtained in an "honest way," in the sense that the reported test was the only test performed to assess the hypothesis (or if not, that this is transparently communicated and that appropriate corrections for multiple comparisons have been performed). But significant results reported in psychological science are often produced by QRPs, such as selective reporting and p-hacking (e.g., John et al., 2012). Moreover, it has been suggested that the results of studies reporting low-powered tests are more likely to be "tainted" by QRPs than those of studies with higher power (Bakker et al., 2012; Schimmack, 2012). The reasoning behind this is that researchers who do not obtain a significant result will, with a certain probability, use QRPs to nevertheless produce and report "publishable" significant findings. Researchers who conduct underpowered studies are more likely to obtain insignificant results and thus to be tempted to use QRPs than researchers who conduct appropriately powered studies. Crucially, however, this reasoning holds only in cases in which the alternative hypothesis is, indeed, true. When the alternative hypothesis is false, low- and high-powered studies have the same probability (1 − α) of producing nonsignificant results and are at the same risk of being affected by QRPs. Thus, maybe surprisingly, low power increases the likelihood that a significant result was produced by QRPs only in those cases in which it reflects a true finding anyway.

Does this mean that QRPs are no issue after all? To the contrary, QRPs vastly dilute the evidence one can gain from studies of low and excellent power alike. For illustration, consider the following two-step model for the generation of research "findings." First, a researcher conducts the originally planned hypothesis test. If this produces a significant result, it is reported as such. If the test is nonsignificant, the researcher will, with a certain probability Q, use QRPs to produce and report a significant result anyway. Now, when a researcher reports a significant result, it can reflect either a questionable or nonquestionable finding, depending on whether QRPs were used; and it can be either a true positive (TP) or a false positive (FP), depending on whether the tested hypothesis is true or false. The resulting four possible events have the following probabilities:

P(Nonquestionable TP) = (1 − β) × P(H_true)
P(Nonquestionable FP) = α × P(H_false)
P(Questionable TP) = Q × β × P(H_true)
P(Questionable FP) = Q × (1 − α) × P(H_false).

To demonstrate the effects of QRPs on evidence, it is illustrative to consider the BF produced by a reported significant finding,

BF_sig.rep = [P(Nonquestionable TP) + P(Questionable TP)] / [P(Nonquestionable FP) + P(Questionable FP)] = (1 − β + Qβ) / (α + Q(1 − α)).

The relationship between Q and BF_sig.rep is depicted in Figure 2. For all levels of power, evidence decreases sharply with increasing Q. If we assume a high probability that nonsignificant findings will be "embellished" with QRPs, the evidence provided by even highly powered tests is negligibly small. If, for example, we assume Q larger than 0.30, even a test with maximal power cannot produce a BF_sig.rep larger than 3, the conventional threshold for moderate evidence.

[Fig. 2. The effects of questionable research practices (QRPs) on the evidence provided by a reported significant result (α = .05), as measured by the Bayes factor (BF_sig.rep), plotted against the probability of QRPs (Q) for power levels of 1, .8, .5, and .2. The dashed line marks the conventional threshold BF_sig.rep = 3.]

To conclude, the use of QRPs can completely nullify the evidence gained from results. Power plays only a side role in this. A study that shows clear signs of p-hacking or selective reporting cannot be saved by a large sample size (for simulation results showing that large sample sizes are not a protective factor against most p-hacking strategies, see also Stefan & Schönbrodt, 2023). At the same time, low power gives little reason to question the results of an otherwise well-designed, transparently reported, and perhaps preregistered study.
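The numbers behind these statements can be reproduced with a few lines of R (a minimal sketch; the function name and the chosen values of Q are ours): BF_sig.rep drops quickly as the probability Q of "embellishing" a nonsignificant result increases, even when power is maximal.

```r
# Bayes factor of a *reported* significant result when a nonsignificant
# test is, with probability Q, turned into a significant one by QRPs:
# BF_sig.rep = (1 - beta + Q * beta) / (alpha + Q * (1 - alpha)).
bf_sig_rep <- function(power, Q, alpha = .05) {
  beta <- 1 - power
  (1 - beta + Q * beta) / (alpha + Q * (1 - alpha))
}

# Even a test with maximal power provides little evidence once QRPs
# become likely:
round(bf_sig_rep(power = 1, Q = c(0, .10, .30, .50)), 2)
# -> approximately 20.00, 6.90, 2.99, 1.90

# With Q = 0, the expression reduces to BF_sig = power / alpha:
bf_sig_rep(power = .50, Q = 0)   # -> 10
```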
The LPLC Critique in the Presence of Publication Bias

Another phenomenon that severely affects the credibility of research output is publication bias (Franco et al., 2014). It is well established that significant findings are far more likely to be published, leading to an overestimation of effects and even the widespread acceptance of false theories because of spurious evidence. This relates to the application of the LPLC critique inasmuch as it might unduly increase scientists' estimates of the prior probability and subsequently also the posterior probability of research hypotheses. These issues highlight that one's belief in a hypothesis should be informed not only by the number of significant findings reported but also by the plausibility of the hypothesis, its fit into well-established theories, and the quality of previous research.

In view of the ubiquity of publication biases, one could argue that decision makers such as editors and reviewers should apply a "preventive" LPLC critique, denying publication to studies whose tests are, by some standard, considered underpowered. The goal of such a policy would be to prevent an "overflow effect," in which false positives from a large number of small sample studies (which are quicker and cheaper to conduct than large sample studies) lead to the establishment of spurious new theories. This is an arguable position, but we note two crucial points. First, this would necessitate an elaborate and clearly communicated policy to decide when power will be considered too low: Given the whole range of possible effect sizes of interest, is the power of a study's tests so low that the field would not want studies like this to be published for preventive reasons? Second, this form of the LPLC critique would inevitably prioritize the strict execution of the policy over the thorough assessment of the evidence that a single study can provide. In our opinion, the issue of publication biases is better addressed by recent developments ensuring that all research findings are published, such as preregistration, Registered Reports, and an increased acceptance of null findings (Nosek et al., 2012, 2018).

Practical Recommendations for a Principled Critique of Low Power

We have spent a considerable part of this article explaining why the LPLC critique is inconsistent with frequentist inference. However, we have also demonstrated that it is partially valid when one adopts a Bayesian perspective. Why should the applied researcher care about the inconsistencies of the critique in frequentist inference if it is so easily "saved"? This question is especially salient now that Bayesian statistics are becoming more prevalent in psychology. However, central decision makers, such as journal editors, reviewers, and grant committees, are still binding researchers to the quality criteria prescribed by frequentist inference, such as Type I error control, correction for multiple comparisons, and rejection of optional stopping. Arguably, the increasing popularity of Bayesian statistics will ultimately oblige scientific fields to discuss and reassess whether statistical inference should be guided by frequentist or Bayesian principles or a mixture thereof. Until then, however, the "worst of both worlds" would be if decision makers obliged scientists to fulfill frequentist criteria, only to then judge their work on grounds that are incompatible with those same criteria.

In the following, we describe several steps that will help decision makers such as reviewers to critique results from ostensibly underpowered studies in a principled, fair, and constructive manner. Scientists who find their work subjected to the LPLC critique should be allowed to demand that decision makers follow these steps and openly communicate their considerations.

0. Acknowledge that any critique of low power depends on subjective aspects

Statements about the credibility of results must, at least to some degree, depend on subjective beliefs and knowledge (e.g., estimates of prior probabilities and effect sizes). We suggest that you be open about the subjective aspects of your reasoning.

An example statement is, "I have tried to consider all available objective information in my assessment. However, the following estimates are still partially based on my subjective experience, knowledge, and reading of the literature."

1. Acknowledge that your critique comes from a Bayesian point of view

If the reviewed work is to be solely assessed on frequentist criteria, there is no space for invoking the LPLC critique. In other cases, you can invoke it from a Bayesian point of view. Then, however, you should transparently acknowledge this.

For example,

In my opinion, the present test(s) was(were) underpowered to detect the effect(s) of interest. I am aware that questioning significant results on the basis of low power alone is not consistent with the goals of significance tests. However, from a Bayesian point of view, I have the following concerns. . . .

2. Explain the reasons why you think the study was underpowered and what power you would find appropriate

Any analysis can be called underpowered by assuming that the effect of interest is arbitrarily small or by demanding that power be arbitrarily close to 1. For constructive feedback to the researchers, specify the minimal effect size of interest that you would expect in the investigated situation, the power resulting from this effect size and the study's sample size, and the power that you would deem acceptable in this situation.

For example,

Based on the literature, I would expect that the size of the effect, if it really existed, would not be much larger than d = 0.3. The authors report a two-sample t test with a sample size of 15 observations per group. This test would have a power of around .20, which I consider too low. In this situation, I would be willing to accept a power of .80 or larger.
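Power statements of this kind can be checked directly with base R's power.t.test(). The sketch below is only an illustration under the assumptions stated in the example (d = 0.3, n = 15 per group, α = .05); the exact value depends on whether a one- or two-sided test is assumed, which the example leaves open.

```r
# Power of a two-sample t test with n = 15 per group for an effect of
# d = 0.3 (sd = 1, so delta equals Cohen's d) at alpha = .05.
power.t.test(n = 15, delta = 0.3, sd = 1, sig.level = .05,
             type = "two.sample", alternative = "one.sided")$power
# -> roughly .19 for a one-sided test (close to the .20 quoted above);
#    a two-sided test gives roughly .12.

# Sample size needed to reach the acceptable power of .80 named above:
power.t.test(delta = 0.3, sd = 1, sig.level = .05, power = .80,
             type = "two.sample", alternative = "one.sided")$n
# -> roughly 139 observations per group (about 176 for a two-sided test).
```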

3a. If you are concerned that the current result came about by the use of QRPs, communicate that concern

Sometimes, you might suspect that a study's significant finding results from the use of QRPs. Low power might be one of the factors that led you to this suspicion because the analysis had a low prior probability to find the effect it claims to detect. However, there should also be other causes for this concern, such as signs of selective reporting, abuse of researcher degrees of freedom, lack of or imprecise preregistration, and so on. If this is the case, it is important that you clearly explain the reasons for your concerns. You might also name your conditions for accepting the current results.

For example,

I am concerned that the reported result could have been obtained through questionable research practices, especially given that the presented analysis had such a low probability to detect an effect even if it were there. My concerns are further corroborated by [further evidence for QRPs]. I kindly ask the authors to provide further evidence for their claims. As it is now, a preregistered replication would be necessary to convince me of this result.

3b. If you are not concerned about QRPs, assess the result in view of all available information

In the absence of concerns about QRPs, a significant result from an ostensibly underpowered test should be assessed at face value. At this stage, you should openly discuss if, in your view, the result provides convincing evidence for the truth of the investigated hypothesis. Here, you could also make use of Equation 1 to assess the posterior probability.

For example,

Given previous results, I would say that this hypothesis had a 50% chance of being true. With a power of about .5, the test was certainly not optimally powered. However, its significant result still gives the hypothesis a posterior probability of about 90%. This is enough evidence to make this result worthwhile for future research.

I would give this research hypothesis only a 10% chance of being true. The significant result from the present test, which I assume to have had a power of .2, raises this probability only to about 30%, which is not enough to convince me. With a power of .80, a significant result would raise the posterior probability to above 60%, which would make me consider it further.

I am highly skeptical of the claims the authors are making, as they are not at all supported by previous research. I would give the authors' research hypothesis a prior probability of being true of at most 1%. The significant result from the present test, which I assume to have had a power of .20, raises this probability only to about 4%. But, to be fair, even the significant result from a perfectly powered test would give this hypothesis a posterior probability of only ≈17%. In my opinion, the authors need to provide much more evidence, such as additional experiments with more stringent tests, to make the claims they are making.
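The posterior probabilities quoted in these example statements can be checked with the same relation as before (Equation 1); a minimal sketch in R, with the helper name ours:

```r
# Posterior probability of the hypothesis after a significant result
# (Equation 1), used here to check the figures quoted in the statements.
post <- function(power, prior, alpha = .05) {
  power * prior / (power * prior + alpha * (1 - prior))
}

round(post(power = .5, prior = .50), 2)   # ~0.91: "about 90%"
round(post(power = .2, prior = .10), 2)   # ~0.31: "about 30%"
round(post(power = .8, prior = .10), 2)   # ~0.64: "above 60%"
round(post(power = .2, prior = .01), 3)   # ~0.039: "about 4%"
round(post(power = 1,  prior = .01), 3)   # ~0.168: "only about 17%"
```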
Conclusion

There is no question that the ubiquity of underpowered tests is a pressing problem for psychological science. However, we consider it crucial that low power is criticized for the correct reasons. Whenever a study can be constructively critiqued before its implementation, low power should be pointed out and improvements demanded. However, as we have outlined in this article, it is less straightforward to justify the critique of low power after a positive result has been obtained. Most important, low power should not be used as a proxy concern when there are deeper concerns about the trustworthiness of the reported results and the possible use of QRPs. When suspicions about QRPs arise, they should be directly communicated as such whenever possible. This way, it is ensured that issues of scientific integrity and transparency are not mistaken for issues of statistical power.

Transparency

Action Editor: Yasemin Kisbu-Sakarya
Editor: David A. Sbarra
Author Contributions
Lukas L. Lengersdorff: Conceptualization; Formal analysis; Visualization; Writing – original draft; Writing – review & editing.
Claus Lamm: Conceptualization; Supervision; Writing – review & editing.
Declaration of Conflicting Interests
The author(s) declared that there were no conflicts of interest with respect to the authorship or the publication of this article.
Open Practices
This article has received the badge for Open Materials. More information about the Open Practices badges can be found at http://www.psychologicalscience.org/publications/badges

ORCID iDs
Lukas L. Lengersdorff: https://orcid.org/0000-0002-8750-5057
Claus Lamm: https://orcid.org/0000-0002-5422-0653

Acknowledgments
We would like to thank Angelika Stefan for valuable feedback on an earlier version of this article. R code to reproduce the figures is available on OSF: https://osf.io/7et23.

References
Altman, D. G., & Bland, J. M. (1994). Statistics notes: Diagnostic tests 2: Predictive values. BMJ, 309(6947), Article 102. https://doi.org/10.1136/bmj.309.6947.102
Bakker, M., van Dijk, A., & Wicherts, J. M. (2012). The rules of the game called psychological science. Perspectives on Psychological Science, 7(6), 543–554. https://doi.org/10.1177/1745691612459060
Bayarri, M. J., Benjamin, D. J., Berger, J. O., & Sellke, T. M. (2016). Rejection odds and rejection ratios: A proposal for statistical practice in testing hypotheses. Journal of Mathematical Psychology, 72, 90–103. https://doi.org/10.1016/j.jmp.2015.12.007
Berger, J. O. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing? Statistical Science, 18(1), 1–32. https://doi.org/10.1214/ss/1056397485
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475
Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1(3), Article 140216. https://doi.org/10.1098/rsos.140216
De Finetti, B. (1974). Theory of probability. John Wiley & Sons.
Fisher, R. A. (1935). The design of experiments. Oliver and Boyd.
Franco, A., Malhotra, N., & Simonovits, G. (2014). Publication bias in the social sciences: Unlocking the file drawer. Science, 345(6203), 1502–1505.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3
Hájek, A. (2007). The reference class problem is your problem too. Synthese, 156, 563–585.
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), Article e124. https://doi.org/10.1371/journal.pmed.0020124
Jaynes, E. T. (2003). Probability theory: The logic of science. Cambridge University Press.
John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524–532. https://doi.org/10.1177/0956797611430953
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.1080/01621459.1995.10476572
Keysers, C., Gazzola, V., & Wagenmakers, E. J. (2020). Using Bayes factor hypothesis testing in neuroscience to establish evidence of absence. Nature Neuroscience, 23(7), 788–799. https://doi.org/10.1038/s41593-020-0660-4
Lindsay, D. S. (2015). Replication in psychological science. Psychological Science, 26(12), 1827–1832. https://doi.org/10.1177/0956797615616374
Mayo, D. G., & Morey, R. D. (2017). A poor prognosis for the diagnostic screening critique of statistical tests. OSF. https://doi.org/10.31219/osf.io/ps38b
McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2019). Abandon statistical significance. American Statistician, 73(Suppl. 1), 235–245. https://doi.org/10.1080/00031305.2018.1527253
Morey, R., & Lakens, D. (2016). Why most of psychology is statistically unfalsifiable. Zenodo. https://doi.org/10.5281/zenodo.838685
Morey, R. D. (2019). Why you shouldn't say "this study is underpowered." Towards Data Science. https://towardsdatascience.com/why-you-shouldnt-say-this-study-is-underpowered-627f002ddf35
Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E. J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin and Review, 23(1), 103–123. https://doi.org/10.3758/s13423-015-0947-8
Neyman, J., & Pearson, E. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 231(694–706), 289–337. https://doi.org/10.1098/rsta.1933.0009
Nosek, B. A., Ebersole, C. R., DeHaven, A. C., & Mellor, D. T. (2018). The preregistration revolution. Proceedings of the National Academy of Sciences, USA, 115(11), 2600–2606. https://doi.org/10.1073/pnas.1708274114
Nosek, B. A., Spies, J. R., & Motyl, M. (2012). Scientific Utopia: II. Restructuring incentives and practices to promote truth over publishability. Perspectives on Psychological Science, 7(6), 615–631. https://doi.org/10.1177/1745691612459058
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), Article aac4716. https://doi.org/10.1126/science.aac4716
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551–566. https://doi.org/10.1037/a0029487
Stefan, A. M., & Schönbrodt, F. D. (2023). Big little lies: A compendium and simulation of p-hacking strategies. Royal Society Open Science, 10(2), Article 220346. https://doi.org/10.1098/rsos.220346
Wagenmakers, E.-J., Verhagen, J., Ly, A., Bakker, M., Lee, M. D., Matzke, D., Rouder, J. N., & Morey, R. D. (2015). A power fallacy. Behavior Research Methods, 47, 913–917.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA's statement on p-values: Context, process, and purpose. American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
Wicherts, J. M., Veldkamp, C. L. S., Augusteijn, H. E. M., Bakker, M., van Aert, R. C. M., & van Assen, M. (2016). Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking. Frontiers in Psychology, 7, Article 1832. https://doi.org/10.3389/fpsyg.2016.01832
