Introduction To Hypothesis Testing, Power Analysis and Sample Size Calculations
and
$$\operatorname{Var}\{\bar{x}\} = \operatorname{Var}\left\{\frac{\sum_{i=1}^{n} X_i}{n}\right\} = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}.$$
If $X_i \sim N\{\mu, \sigma^2\}$, then we know from earlier results that $\bar{x} \sim N\{\mu, \sigma^2/n\}$.
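As a quick illustration (not part of the original notes), a short simulation can confirm that the variance of the sample mean shrinks like σ²/n; the distribution parameters below are arbitrary choices:

```python
import random
import statistics

# Sketch (parameters are arbitrary): simulate many samples of size n from
# N(mu, sigma^2) and check that Var{x_bar} is close to sigma^2 / n.
random.seed(42)
mu, sigma, n, reps = 10.0, 2.0, 25, 20000

means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(reps)
]

var_of_mean = statistics.pvariance(means)
print(var_of_mean)  # close to sigma**2 / n = 0.16
```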
Additionally, even if the data do not come from a normal distribution
$$\lim_{n\to\infty} P\left\{\frac{\sqrt{n}\,(\bar{x} - \mu)}{\sigma} \le x\right\} = \Phi(x).$$
Hence, even if our data are not normal, for a large enough sample size,
we can calculate probabilities for x̄ by applying the Central Limit Theorem,
and our answers will be close enough.
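A small simulation sketch (with an arbitrary choice of skewed, exponential data) illustrates the Central Limit Theorem at work:

```python
import math
import random
import statistics

# Sketch: for skewed Exp(1) data (mean 1, sd 1), the standardized sample mean
# sqrt(n) * (x_bar - mu) / sigma is approximately standard normal for large n.
random.seed(0)
mu = sigma = 1.0
n, reps = 200, 20000

z = [
    math.sqrt(n) * (statistics.fmean(random.expovariate(1.0) for _ in range(n)) - mu) / sigma
    for _ in range(reps)
]

frac_below_1 = sum(zi <= 1.0 for zi in z) / reps
print(frac_below_1)  # close to Phi(1) ≈ 0.8413
```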
2. Hypothesis Testing
Hypothesis testing is a formal statistical procedure that attempts to answer
the question “Is the observed sample consistent with a given hypothesis?”
This boils down to calculating the probability of the sample given the hy-
pothesis, P {X|H}. To set up the procedure, scientists propose what is called
a null hypothesis. The null hypothesis is usually of the form: these data
were generated by strictly random processes, with no underlying mechanism.
The null hypothesis is always the opposite of the hypothesis that we are
actually interested in. The scientist then sets up a hypothesis test to compare
the null hypothesis to the mechanistic alternative consistent with their
scientific theory.
Example 1 Examples of null hypotheses are:
1. no difference in the response of patients to a drug versus a placebo,
The null and alternative hypotheses should be specified before the test is
conducted and before the data are observed. The investigators also need to
specify a value for P {X|H0 } at which they will reject H0 . The idea is that if
the data are quite unlikely under the null hypothesis, then we conclude that
they are inconsistent with the null, and hence accept the alternative. Notice
that the null and the alternative are mutually exclusive and exhaustive–that
is, one or the other must be true, but it’s impossible that both are.
The probability that we reject the null when it is true is denoted α and is
called the size (or significance level) of the test. Its complement, 1 − α, is
called the confidence level of the test, though sometimes you will see these
terms used interchangeably.
Note that we reject H0 whenever P {X|H0 } falls at or below α. Hence,
α = P {we reject H0 when H0 is true}, also called the probability of a Type I
error.
The probability of a Type II error is given by P {we fail to reject H0 when
H0 is false} = β.
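These error rates can be checked by simulation. The sketch below assumes a one-sided z-test with known σ; the specific numbers (µ0 = 5, σ = 3, n = 5) anticipate the emissions example that follows:

```python
import math
import random
import statistics

# Sketch: under H0 (data from N(mu0, sigma^2)), a one-sided z-test of size
# alpha should reject about alpha of the time. Parameters anticipate the
# emissions example (mu0 = 5, sigma = 3, n = 5).
nd = statistics.NormalDist()
mu0, sigma, n, alpha, reps = 5.0, 3.0, 5, 0.05, 20000
c = mu0 + nd.inv_cdf(1 - alpha) * sigma / math.sqrt(n)  # rejection cutoff for x_bar

random.seed(1)
rejections = sum(
    statistics.fmean(random.gauss(mu0, sigma) for _ in range(n)) > c
    for _ in range(reps)
)
print(rejections / reps)  # close to alpha = 0.05
```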
The values under the normal curve that are equal or more extreme than
our test statistic constitute the rejection region.
Let’s begin with an example. Say that regulators desire a high certainty
that emissions are below 5 parts per billion for a particular contaminant, and
the regulatory limit is 8 parts per billion. They may conduct the following
test:
H0 : µ < 5 ppb
versus
Ha : µ ≥ 5 ppb
At what value will we reject H0 ? Say we would like to reject the null
hypothesis at the 95% confidence level. This means we wish to fix the
probability of falsely rejecting H0 (Type I error) at no greater than 5%. Here,
under H0 , µ can be fixed at µ = 5 without altering the size
of the test. Now we need to find the rejection region, i.e. the value of x̄ at
which we can reject H0 at 95% confidence.
With σ = 3 and n = 5, we need to find a c such that
$$P\left\{\bar{x} > c \mid H_0\right\} = P\left\{\frac{\bar{x} - 5}{3/\sqrt{5}} > \frac{c - 5}{3/\sqrt{5}}\right\} = P\left\{z > \frac{c - 5}{3/\sqrt{5}}\right\} = 0.05. \tag{2.2}$$
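A sketch of the cutoff calculation, using Python's `statistics.NormalDist` for the normal quantile:

```python
import math
from statistics import NormalDist

# Sketch: with mu0 = 5, sigma = 3, n = 5 and alpha = 0.05, solve
# (c - mu0) / (sigma / sqrt(n)) = z_{0.95} for the cutoff c.
z95 = NormalDist().inv_cdf(0.95)   # ≈ 1.6449
c = 5 + z95 * 3 / math.sqrt(5)
print(round(c, 2))                 # ≈ 7.21, so reject H0 when x_bar > 7.2
```

So the managers reject H0 whenever the sample mean exceeds roughly 7.2 ppb, the cutoff that reappears in the power calculations of the next section.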
3. Power Calculations
Let’s continue with our example. In the event that the managers fail to reject
H0 , that is, they conclude that there is insufficient evidence that emissions
are above 5 ppb, they and their stakeholders may want to ask the question:
“Was there sufficient information in our sample (i.e., is the lack of evidence
due to insufficient sample size) to have detected a difference of 3 ppb?” Hence,
they need to calculate the power of the test when µ = 8 ppb.
The power of a test is defined as the probability that we correctly reject
the null hypothesis, given that a particular alternative is true. Power can
also be defined as 1 − β, the complement of the Type II error probability.
We need
$$P\left\{\bar{x} > 7.2 \mid \mu = 8\right\} = P\left\{\frac{\bar{x} - 8}{3/\sqrt{5}} > \frac{7.2 - 8}{3/\sqrt{5}}\right\} = P\left\{z > -0.5963\right\} = 0.7245. \tag{3.2}$$
The power of this test at the specified alternative is then 0.7245. Alterna-
tively, we can say that the probability of type II error, or the probability that
we failed to reject the null when the true mean was 8 is 1 − 0.7245 = 0.2755.
We can conduct a full power analysis by plotting the power at a wide variety
of alternatives, or distances from µ0 , assuming that the standard deviation
remains constant across all concentrations.
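The power curve just described can be computed directly; the sketch below fixes the cutoff at 7.2 and σ = 3, n = 5 as in the example:

```python
import math
from statistics import NormalDist

# Sketch: power of the test "reject when x_bar > 7.2" at alternatives mu_a,
# holding sigma = 3 and n = 5 fixed as in the example.
nd = NormalDist()
se = 3 / math.sqrt(5)

def power(mu_a):
    # P{x_bar > 7.2 | mu = mu_a} = P{z > (7.2 - mu_a) / se}
    return 1 - nd.cdf((7.2 - mu_a) / se)

for mu_a in (6, 8, 10, 12, 14):
    print(mu_a, round(power(mu_a), 4))  # power(8) ≈ 0.7245, as in (3.2)
```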
[Figure: power curve for the test, plotted against alternative means from 6 to 14 ppb.]
4. Sample Size Calculations
Often, a failure to detect a difference reflects nothing
more than our inability to design a decent experiment. If we have any reason-
able estimate of the variability and a scientifically justifiable and interesting
alternative, or even a range of alternatives, we can estimate beforehand
whether or not the experiment is worth doing given the limitations on our
time and budget.
Say we would like to set the probability of a type I error at no greater
than 5% and of a type II error at no greater than 10%, what sample size
would we need for the test shown above? We saw that we rejected H0 at
$$\bar{x} \ge z_{1-\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) + \mu_0.$$
Now consider the desired power. We need to repeat the same process as
we did above for the α level, but this time solving for c using the z value for
the corresponding power.
$$\bar{x} \ge z_{\beta}\left(\frac{\sigma}{\sqrt{n}}\right) + \mu_a.$$
Now recall that zβ = −z1−β . Setting the two expressions for x̄ equal to
one another we have
$$z_{1-\alpha}\left(\frac{\sigma}{\sqrt{n}}\right) + \mu_0 = -z_{1-\beta}\left(\frac{\sigma}{\sqrt{n}}\right) + \mu_a.$$
Solving for $n$ gives
$$n = \left(\frac{(z_{1-\alpha} + z_{1-\beta})\,\sigma}{\mu_a - \mu_0}\right)^2.$$
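Under the stated error rates (α = 0.05, β = 0.10) and the emissions example (σ = 3, a 3 ppb difference), the required sample size can be computed as a sketch:

```python
import math
from statistics import NormalDist

# Sketch: n = ((z_{1-alpha} + z_{1-beta}) * sigma / (mu_a - mu0))**2,
# rounded up to the next whole observation.
def sample_size(mu0, mu_a, sigma, alpha=0.05, beta=0.10):
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha) + nd.inv_cdf(1 - beta)
    return math.ceil((z * sigma / (mu_a - mu0)) ** 2)

print(sample_size(5, 8, 3))  # 9 observations for the emissions example
```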
Of course, often we will have no preliminary data from which to estimate
a standard deviation. In this case, we must use a conservative “best guess”
for the variance. In practice, we may also not know exactly what is a sci-
entifically meaningful alternative. However, as practitioners of science we
should be working to move our community towards more careful planning of
experiments and more careful thinking about our questions before we begin
the experiment.