Biostatistics of HKU MMEDSC Session 7

The document outlines a lecture on designing studies and calculating appropriate sample sizes. It discusses key concepts like type I and II errors, significance levels, and power. It covers the perspectives of Fisher and Neyman-Pearson on hypothesis testing and how current practice synthesizes aspects of both approaches. An example randomized controlled trial comparing two drugs for migraine is presented to illustrate statistical hypothesis testing.

Outline

Designing studies
CMED6100 – Session 7

ST Ali

School of Public Health


The University of Hong Kong

30 October 2021

sli.do/#hkubiostat21

ST Ali CMED6100 – Session 7 Slide 2



Announcements

• Practical 3 (sample size calculations) next week.

• Ask course-related questions via the Moodle forums only (not via email)
  – Clearly mention the question numbers, slide numbers, and assignment or practical session details
  – Post your queries to the relevant forum
• For anything that needs to be discussed in person, come to office hours
• Don’t ask questions you are supposed to answer yourself (e.g. assignment questions before the deadline)

Objectives

After the lecture, students should be able to:

• Define type I and II errors, significance levels, and power;

• Describe the determinants of power;


• Calculate or comment on appropriate sample sizes in a variety of settings.
  – Formulas for 2-group calculations will be provided if needed
• Practical 3 will cover the use of an online sample size calculation tool.


Statistical hypothesis: an assertion or statement about a population characteristic (µ)

Null hypothesis: the hypothesis of no difference (H0: µ = µ0)
Alternative hypothesis: the hypothesis complementary to the null hypothesis (H1 or HA: µ ≠ µ0)

The critical value separates the rejection and non-rejection regions when testing H0: µ = µ0 against H1: µ = µ1 (> µ0).
Pr(Type I error) = α (level of significance)
Pr(Type II error) = β (1 − β is the power of the test)

Practical outcomes
You should be able to:
• Define key terms such as type I and II errors, power, significance level, confidence level, etc.
• Describe the determinants of power

• In a given scenario, classify the type of sample size calculation required: one or two groups; dichotomous or continuous outcome.
• Perform the appropriate sample size calculation using
computer software (not by hand)
Introduction Errors Power Summary

Perspective of scientific investigation

Recall the lecture on hypothesis testing.


Karl Popper (1902–1994) stated that a scientist who wishes to prove an effect should follow two logical steps:

1. Set up a null hypothesis that NO effect exists.

2. Try to build sufficient evidence to DISPROVE the null hypothesis.

His idea was that no theory is completely correct, but if not falsified, it can be accepted as truth.


Fisher’s approach
• Set a null hypothesis (usually ‘no effect’ or ‘no difference
between two groups’)
• Perform the study, collect and analyse data.

• Calculate the p-value as the probability of observing such unusual data (or more unusual data) if the null hypothesis is true.
• Flexible interpretation of p-value
– 0.05 is a reasonable threshold but should not be strict

• Take action accordingly.



The Neyman-Pearson approach

• Set a null hypothesis (not necessarily ‘no difference’)
• Set an alternative hypothesis
  – e.g. H0: relative risk = 1 (null hypothesis)
  – e.g. H1: relative risk ≠ 1 (alternative hypothesis)

• Decide what action will be taken if your study provides evidence (1) in support of, or (2) against the null hypothesis.

• Set a decision rule as a ‘significance level’, α, for the p-value, e.g. α = 0.05.


The Neyman-Pearson approach (cont.)


• Perform the study, collect and analyse data.

• Calculate the p-value as the probability of observing such unusual data (or more unusual data) if the null hypothesis is true.

• Strict interpretation of the p-value:
  – If p < α, reject the null hypothesis as false; otherwise, if p > α, accept the null hypothesis as true (the exact p-value is not important).
  – (Note – an echo of this approach is seen in some journals which quote only “ns”, “p < 0.05” etc. instead of giving exact p-values)

• Take action accordingly.

• What are the consequences of the resulting action?



Errors in drawing conclusions:

                          Is there a true difference?
                          Yes               No
Reject H0 with p ≤ α      —                 Type I error
Accept H0 with p > α      Type II error     —

The type I error (α) corresponds to the false-positive risk.
The type II error (β) corresponds to the false-negative risk.



Costs/benefits associated with each decision

                          Is there a true difference?
                          Yes    No
Reject H0 with p ≤ α      A      B
Accept H0 with p > α      C      D

To determine the appropriate values of α and β, we need to know something about the values A, B, C, D and the likelihood that there really is a true difference.



The clash of titans!


• Clearly these two approaches to statistical inference are different.
• Fisher’s approach only involves rejecting or failing to reject a null hypothesis of no difference.
• The Neyman-Pearson approach involves rejecting or accepting a null hypothesis vs an alternative hypothesis
  – Decision-making approach (focus on controlled experiments?)
  – Cost vs benefit of correct decisions, and of type I and type II errors, should be taken into account.
• What’s the best way?

Synthesis of approaches

• These two approaches are often confused and mixed up in textbooks and the literature.
• Current practice in medical research is a kind of synthesis:
  – Set a significance level before the start of the experiment (usually α = 0.05).
  – Run the study, collect and analyse the data.
  – Use a null hypothesis of no difference between groups.
  – If p < α reject the null hypothesis, otherwise if p > α fail to reject the null hypothesis.



Example experiment 1
• A small RCT comparing drug A and drug B for migraine (drug A vs.
drug B).

• Randomise 32 patients to each arm.

• After conducting the study, we analyse the data and estimate the
relative risk of headache on A vs B.

• On average 25% of patients would experience the primary outcome of this study (1+ severe headache within one month of randomisation), i.e. 25% is the true value for the risk on drug A and drug B, and the true RR is 1.

• We use α = 0.05 as the level for statistical significance.



Example experiment 1 (v1)

• Maybe these were the results:

  Drug     Sample size    Headache, n (%)
  Drug A   32             8 (25%)
  Drug B   32             8 (25%)

• Then RR = 0.25/0.25 = 1.00
• The 95% confidence interval for this RR is (0.43, 2.34)
• The p-value under the null hypothesis RR = 1 is 1.00.



Example experiment 1 (v2)


• If someone somewhere else had done the experiment, maybe these were the results:

  Drug     Sample size    Headache, n (%)
  Drug A   32             4 (13%)
  Drug B   32             10 (31%)

• Then RR = 0.40, with 95% CI (0.14, 1.14)
• The p-value under the null hypothesis RR = 1 is 0.09.
• We correctly fail to reject the null hypothesis.

Example experiment 1 (v3)

• Or in another scenario, maybe these were the results:

  Drug     Sample size    Headache, n (%)
  Drug A   32             14 (44%)
  Drug B   32             4 (13%)

• Then RR = 3.50, with 95% CI (1.29, 9.49)
• The p-value under the null hypothesis RR = 1 is 0.01.
• This time we incorrectly reject the null hypothesis.



Example experiment 1

• It is possible (but unlikely) that by chance we could observe such unusual results that we mistakenly reject a true null hypothesis, at any given level of statistical significance.
• At the 5% significance level, we will make this mistake in 5% of studies in which the null hypothesis is true.
• The results of 40 repetitions of this experiment are shown on the next slide ...



Example experiment 1 (v1-40)

[Figure: RR estimates and 95% confidence intervals from 40 repetitions of the experiment.] Some of the 95% confidence intervals don’t even include the ‘true’ value 1.0. Approximately 5% of these intervals will not include 1, since we pre-specified a significance level of 0.05.
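The repeated experiments above can also be sketched in code. Below is a minimal simulation, assuming a pooled two-proportion z-test as the analysis method (the slides do not name the exact test used, and all function names here are ours), that estimates how often a true null hypothesis is rejected:

```python
import math
import random

def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # identical all-or-nothing arms: no evidence of a difference
    z = (x1 / n1 - x2 / n2) / se
    # Two-sided tail probability from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def type_i_error_rate(p_true=0.25, n=32, alpha=0.05, reps=5000, seed=1):
    """Simulate RCTs in which the null is true; count false positives."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x_a = sum(rng.random() < p_true for _ in range(n))
        x_b = sum(rng.random() < p_true for _ in range(n))
        if two_prop_p_value(x_a, n, x_b, n) < alpha:
            rejections += 1
    return rejections / reps

# With a true risk of 25% in both arms and n = 32 per arm, the estimated
# rejection rate should be close to the nominal alpha of 0.05.
print(type_i_error_rate())
```

The z-test here is only one reasonable choice; an exact test would give slightly different small-sample behaviour.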

Type I error (α)


• The significance level is also called the type I error risk (α) and is defined
as the probability of incorrectly rejecting a true null hypothesis.
• We must be quite sure before we claim the existence of a real effect. To
do otherwise would be dangerous.
• A type I error risk of α = 5% is traditionally used in most medical
research, occasionally α =1% or even 0.1%.
• For α = 5%, this means that we will only make a mistake by rejecting a
true null hypothesis 5 times out of 100.
• Why aren’t we more concerned about making a mistake by rejecting a true null hypothesis? – we could definitely make fewer mistakes if we set a lower type I error risk (e.g. 1%) ...

Example experiment 2
• Another small RCT comparing two treatments for migraine (drug A
vs. drug C). Outcome is the same as before (1+ severe headache
within one month of randomisation).

• Drug A is the same as before, i.e. 25% is the true value for the risk of the outcome on drug A. However, drug C is much less effective, with a 75% risk of the outcome, and the true RR for A vs C is 0.33.

• Randomise 32 patients to each arm.

• After conducting the study, we analyse the data and estimate the
relative risk of headache on A vs C.

• We use p < 0.05 as the level for statistical significance.



Example experiment 2 (v1)

• Maybe these were the results:

  Drug     Sample size    Headache, n (%)
  Drug A   32             8 (25%)
  Drug C   32             24 (75%)

• Then RR = 0.33, with 95% CI (0.18, 0.67)
• The p-value under the null hypothesis RR = 1 is < 0.001.
• So we reject the null hypothesis and conclude A is better.



Example experiment 2 (v2)


• But what if someone else had done the study and these were the results:

  Drug     Sample size    Headache, n (%)
  Drug A   32             17 (53%)
  Drug C   32             24 (75%)

• Then RR = 0.71, with 95% CI (0.48, 1.04)
• The p-value under the null hypothesis RR = 1 is 0.08.
• So we fail to reject the null hypothesis.

Example experiment 2 (v1-40)

[Figure: RR estimates and 95% confidence intervals from 40 repetitions of the experiment.] Quite a few (10/40) of the 95% confidence intervals include 1, and in these experiments we wouldn’t be justified in rejecting the null hypothesis. Sometimes our hypothesis tests will produce false negative conclusions.

Type II error (β)


• The type II error risk (β) is defined as the probability of
incorrectly failing to reject a false null hypothesis.
• This is analogous to a false-negative risk.
• 1 − β is known as the power of a hypothesis test, and is the
probability of correctly rejecting a false null hypothesis.
• Phase 3 RCTs will typically be designed to have a power of
70%–90% for effect sizes that are considered clinically
important.
• Earlier we asked why we should use a significance level of 5%
rather than a lower level of (say) 1% ...
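Power, as defined above, can also be estimated by simulation: generate many trials in which the alternative is true and count how often the null is rejected. A minimal sketch, again assuming a pooled two-proportion z-test as the analysis (the slides do not specify the test behind their power figures, so exact values may differ; the function names are ours):

```python
import math
import random

def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # identical all-or-nothing arms
    z = (x1 / n1 - x2 / n2) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def estimated_power(p1, p2, n, alpha=0.05, reps=5000, seed=1):
    """Fraction of simulated two-arm trials that reject the null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        x1 = sum(rng.random() < p1 for _ in range(n))
        x2 = sum(rng.random() < p2 for _ in range(n))
        if two_prop_p_value(x1, n, x2, n) < alpha:
            rejections += 1
    return rejections / reps

# For a fixed true difference (25% vs 75%), power grows with sample size.
print(estimated_power(0.25, 0.75, n=8))
print(estimated_power(0.25, 0.75, n=32))
```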

Relationship between α and β

[Figure: power (1 − β) plotted against significance level (α) from 0 to 0.15.] Our study had power ≈ 0.75 to detect the true RR of 0.33 at a significance level of α = 0.05. If we had specified a stricter α of 0.01, we would have had very low power.



Balance between α and β

• Typically we will choose α smaller than β.
• For example, many randomised trials have α = 0.05 and β = 0.2 (i.e. 80% power).
• It is usually considered preferable to make a false negative error (e.g. fail to approve a new beneficial intervention) than to make a false positive error (e.g. licence a new (expensive!) drug which is no more effective than the existing therapy).



What about the effect size?

[Figure: power vs significance level for effect sizes RR = 0.33 or 3.0, RR = 0.50 or 2.0, and RR = 0.67 or 1.5.] We would have lower power to detect a smaller effect size (here, an effect which is closer to the null RR of 1.0) at any given significance level.



What about the sample size?

[Figure: power vs significance level for n = 32, 20, and 10 per arm.] Example experiment 2 had n = 32 in each arm and a true RR of 0.33. A smaller sample would reduce our power to detect this effect size, at a given significance level.



We will have higher power if . . .

• The significance level is less strict (larger α).
• The effect we want to detect is larger (we are not trying to detect small effects).
• The sample size is larger.
• Our continuous outcome is less variable between patients (see examples later).
• Also note that some statistical methods are inherently more powerful than others
  – e.g. parametric tests are more powerful than non-parametric tests; most sample size calculations are based on parametric tests unless there is a strong reason to believe this is inappropriate.

Summary table of α and β

                      Is there a true difference?
                      Yes                       No
Reject H0             1 − β (power)             α (type I error risk / significance)
Fail to reject H0     β (type II error risk)    1 − α (confidence)


Background Continuous outcome Dichotomous outcome Exercise Odds ratios Other issues

Part II

Sample size calculations


Ethics in RCTs

In randomised controlled trials (RCTs), it is particularly important to get the right sample size:

“A small study with no chance of detecting a clinically significant difference between treatments is unfair to all the subjects put to the risk and discomfort of the trial.”

“A study which is too large may be unfair if one treatment is proven to be more effective and so a large number of patients receive inferior treatment.”

(Altman & Gore, 1982)

Research Ethics

The previous comments are not only true for RCTs ...

In a questionnaire survey of doctors, is it ethical to send a questionnaire to 10,000 doctors, when it would have been sufficient to question only 1,000?

Alternatively, is it ethical to send a questionnaire to only 500 patients, when you would need at least 5,000 returns to be able to answer your research question?

What information do we need to calculate sample size?

• The desired significance level (α), and the desired power (1 − β);
  – Higher confidence and power require a bigger sample
• The size of the effect we want to detect (perhaps a ‘clinically important difference’?)
  – We need a bigger sample to detect a smaller effect
• The likely variability of the measurements
  – If measurements are more variable, we need a bigger sample



What other information might we need?


• Background knowledge (literature review);

• Research objectives;

• Outcome measures;

• Proposed statistical methods of analysis;


• An estimate of the resources required;
– Money;
– Time;

• An idea of the proportion of eligible participants that will agree to participate.

Different types of data

• We will look at sample size calculations for two general situations:
  – Continuous outcome variables (not necessarily following a Normal distribution);
  – Dichotomous (binary) outcome variables;
• For each outcome variable we will look at two types of calculation:
  – One-group calculations
  – Two-group calculations





Continuous outcome (1 group)


• We want to produce an estimate of the mean of a continuous
outcome variable, to a particular fixed precision.
• How can we calculate the required sample size n?
• Determine the likely variability, σ, of the measurements.
– Usually based on previous studies in the literature, or on our
own pilot studies.
• Decide what significance level (α) we want, for example
α = 0.05 will produce a 95% confidence interval.
• Decide how wide (w ) we want the final confidence interval to
be, in terms of the units of the outcome variable.

Continuous outcome (1 group) – cont

• For the standard choice of α = 0.05, the formula is:

      n = 16σ² / w²

• Why? A 95% confidence interval is the estimate ± 1.96σ/√n, so the width is w = 2 × 1.96σ/√n.
• For the standard choice α = 0.05, (2 × 1.96)² ≈ 16.
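As a quick sketch, this rule of thumb is easy to wrap in a small helper (the function name is ours, and the result is rounded up because sample sizes must be whole numbers). The blood-pressure example on the next slide (σ = 10, w = 4) gives n = 100:

```python
import math

def one_group_continuous_n(sigma, width):
    """Sample size for estimating a mean to a given 95% CI width.

    Rule of thumb from the slide: n = 16 * sigma^2 / w^2, since (2 * 1.96)^2 ~ 16.
    """
    return math.ceil(16 * sigma ** 2 / width ** 2)

# Blood pressure example: SD 10 mmHg, desired CI width 4 mmHg (i.e. +/- 2 mmHg).
print(one_group_continuous_n(sigma=10, width=4))  # → 100
```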



Example (continuous, 1 group)


• Study to estimate the mean systolic blood pressure of elderly
care home residents.
• When I analyse the data, I want my 95% confidence interval
to have a margin of error of ±2mmHg (or, equivalently, to
have width 4mmHg).
• From a pilot study, I think that the subjects’ systolic blood
pressures will follow a Normal distribution with a standard
deviation of approximately 10mmHg.
• How many patients (n) do I need to measure?
• From above, we specify α = 0.05, w = 4, σ = 10.

Example (continuous, 1 group) – cont

• Using the formula given before,

      n = 16σ² / w² = (16 × 10²) / 4² = 100.

• I need to measure the systolic blood pressures of 100 patients.



Continuous outcome (2 groups)


• We want to detect a difference in a continuous outcome variable between two equally-sized groups;
• How can we calculate the required sample size n of each of the two groups?
• Determine the means µ1 and µ2 that we expect to see;
• Or determine µ1 and then choose µ2 so that the difference between µ2 and µ1 is a ‘clinically important difference’.
• Determine the variability, σ, of the measurements.
• Usually, our ‘estimates’ are based on previous studies in the literature, or on our own pilot studies.

Continuous outcome (2 groups) – cont

For standard choices of power 1 − β = 0.80 and significance level α = 0.05, the formula for the number of subjects required in each group is:

      n = 16 / Δ²,   where   Δ = (µ1 − µ2) / σ.
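A minimal helper for this formula (again the naming is ours, rounding up to a whole number per group). The diet-pill example on the following slides (µ1 = −4, µ2 = −5, σ = 2.5) gives n = 100 per group:

```python
import math

def two_group_continuous_n(mu1, mu2, sigma):
    """Per-group sample size for 80% power at alpha = 0.05.

    Rule of thumb from the slide: n = 16 / Delta^2, Delta = (mu1 - mu2) / sigma.
    """
    delta = (mu1 - mu2) / sigma
    return math.ceil(16 / delta ** 2)

# Diet pill example: mean weight change -4 kg (placebo) vs -5 kg (pill), SD 2.5 kg.
print(two_group_continuous_n(mu1=-4, mu2=-5, sigma=2.5))  # → 100
```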



Example (continuous, 2 groups)


• I want to test a ‘magic diet pill’.

• I will take a group of people, and randomly assign them to take either this diet pill, or a placebo.
• Patients should take the pill for 6 months, and at the end of this period I will compare the weight changes in the two groups of patients.
• I hope that the diet pill will be more effective in reducing weight, but I should set up the experiment to test the null hypothesis µ1 = µ2.

Example (continuous, 2 groups) – cont

• From my background knowledge of the patient population, I think that the mean weight loss in the placebo group is likely to be approximately 4 kg, with standard deviation 2.5 kg.
• I would like to detect a difference if the diet pill reduces weight by at least 1 kg on top of the 4 kg in the placebo group.
• So I specify µ1 = −4, µ2 = −5, and σ = 2.5.



Example (continuous, 2 groups) – cont

• I want to have 80% power and a significance level of 5%

      Δ = (µ1 − µ2) / σ = (−4 − (−5)) / 2.5 = 0.4
      n = 16 / Δ² = 16 / 0.4² = 100

• Therefore I would need 100 patients in each group, or a total of 200 patients.





Dichotomous outcome (1 group)


• We want to produce an estimate of the prevalence of a dichotomous outcome variable, to a particular fixed precision. This could be the prevalence of a particular condition.
• How can we calculate the required sample size n?
• Determine the probability p that the outcome variable will take the value 1 rather than 0, or equivalently the prevalence.
• Decide what significance level (α) we want; for example α = 0.05 will produce a 95% confidence interval.
• Decide how wide (w) we want the confidence interval to be.



Dichotomous outcome (1 group) – cont

• For the standard choice of α = 0.05, the formula is:

      n = 16p(1 − p) / w²

• Why? The width of a 95% confidence interval is w = 2 × 1.96σ/√n (i.e. the estimate ± 1.96σ/√n).
• We can approximate σ² by p(1 − p).
• (2 × 1.96)² ≈ 16.
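A small helper for this formula (our own naming; the intermediate result is rounded to clear floating-point noise before taking the ceiling). The smoking-prevalence example on the next slide (p = 0.08, w = 0.02) gives n = 2944:

```python
import math

def one_group_prevalence_n(p, width):
    """Sample size for estimating a prevalence to a given 95% CI width.

    Rule of thumb from the slide: n = 16 * p * (1 - p) / w^2.
    """
    n = 16 * p * (1 - p) / width ** 2
    return math.ceil(round(n, 6))  # round off float noise before the ceiling

# Smoking prevalence example: expected p = 8%, CI width 2% (i.e. +/- 1%).
print(one_group_prevalence_n(p=0.08, width=0.02))  # → 2944
```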



Example (dichotomous, 1 group)

• I want to investigate the prevalence p of smoking in the Hong Kong population in individuals aged 20–30.
• I want my 95% confidence interval to be precise to ±1% (or, equivalently, to have a width of 2%).
• I think that approximately 8% of the people in this age group are current smokers.
• How many individuals do I need to question?
• From above, we specify α = 0.05, w = 0.02, p = 0.08.



Example (dichotomous, 1 group) – cont

• Using the formula given before,

      n = 16p(1 − p) / w² = (16 × 0.08 × 0.92) / 0.02² = 2944.

• I need to question 2944 individuals about their current smoking status.
• ... extra note ... if I want to use a questionnaire, and I expect that 50% of subjects will complete the survey, I need to send out 6000 questionnaires.



Dichotomous outcome (2 groups)


• We want to detect a difference in a dichotomous outcome variable (e.g. prevalence) between two equally-sized groups;
• How can we calculate the required sample size n of each of the two groups?
• Determine the probabilities p1 and p2 of the binary outcome variable being 1 in each of the two groups;
• Or determine p1 and then choose p2 so that the difference between p2 and p1 is a ‘clinically important difference’.
• Decide what power and significance level we want to use.

Dichotomous outcome (2 groups) – cont

• For standard choices of power 1 − β = 0.80 and significance level α = 0.05, the formula for the number of subjects in each group is:

      n = 8 / Δ²,   where   Δ² = (p1 − p2)² / [p1(1 − p1) + p2(1 − p2)].
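The same rule of thumb as a helper (our own naming; the intermediate result is rounded to clear floating-point noise before the ceiling). The appointment-reminder example on the following slides (p1 = 0.70, p2 = 0.80) gives n = 296 per group:

```python
import math

def two_group_proportions_n(p1, p2):
    """Per-group sample size for 80% power at alpha = 0.05.

    Rule of thumb from the slide: n = 8 / Delta^2, where
    Delta^2 = (p1 - p2)^2 / (p1 * (1 - p1) + p2 * (1 - p2)).
    """
    delta_sq = (p1 - p2) ** 2 / (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(round(8 / delta_sq, 6))  # round off float noise

# Appointment-reminder example: attendance 70% (control) vs 80% (telephoned).
print(two_group_proportions_n(p1=0.70, p2=0.80))  # → 296
```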



Example (dichotomous, 2 groups)

• Patients often don’t turn up to their appointments.
• I want to investigate whether a telephone reminder will make patients more likely to attend.
• For a group of 2n patients in my surgery, I will randomise half (n) of them to be reminded about their appointment by telephone. The other n will not be reminded by telephone.
• How big should n be?



Example (dichotomous, 2 groups) – cont

• Define group 1 as the non-telephoned control group, and group 2 as the telephoned group.
• From my background knowledge of the patient population, I know that approximately 70% of patients attend their appointments anyway.
• I think that a telephone reminder will cause an extra 10% of patients to attend their appointment.
• So I specify p1 = 0.70, and p2 = 0.80.



Example (dichotomous, 2 groups) – cont

• I want to have 80% power and a significance level of 5%

      Δ² = (p1 − p2)² / [p1(1 − p1) + p2(1 − p2)] = 0.10² / (0.7 × 0.3 + 0.8 × 0.2) = 0.01 / 0.37 ≈ 0.0270
      n = 8 / Δ² = 8 / 0.0270 ≈ 296

• Therefore I would need 296 patients in each group, or a total of 592 patients.





Sample sizes for odds ratios

• For a rule-of-thumb calculation, odds ratios can be rephrased as ratios of probabilities, and then we can use the calculations for binary outcomes in two groups.
• Recall that

      OR = [p1 / (1 − p1)] / [p2 / (1 − p2)],

  where p1 = proportion exposed in the cases, and p2 = proportion exposed in the controls.



Example (odds ratio)

• In a case-control study, we would like to compare the immunisation coverage in a group of tuberculosis cases (group 1) to a group of controls (group 2).
• I will find n cases and n controls.
• How big should n be?



Example (odds ratio) – cont

• A pilot study has suggested that approximately 30% of the controls are vaccinated.
• An odds ratio of 2 would be considered an important difference.
• So I specify p2 = 0.30, and calculate p1 = 0.46.

Note – For the calculation: OR = 2 = [p1 / (1 − p1)] / (0.3 / 0.7), so p1 / (1 − p1) = 2 × 0.3 / 0.7 ≈ 0.857, and then p1 = 0.857 / 1.857 ≈ 0.46.
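The whole odds-ratio calculation can be sketched end to end (function names are ours; the final step is the same n = 8/Δ² rule used for two-group binary outcomes):

```python
import math

def p1_from_odds_ratio(odds_ratio, p2):
    """Convert a target odds ratio and control proportion p2 into p1."""
    odds1 = odds_ratio * p2 / (1 - p2)
    return odds1 / (1 + odds1)

def two_group_proportions_n(p1, p2):
    """Per-group sample size for 80% power at alpha = 0.05 (n = 8 / Delta^2)."""
    delta_sq = (p1 - p2) ** 2 / (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(round(8 / delta_sq, 6))  # round off float noise

# TB case-control example: 30% of controls vaccinated, target OR = 2.
p1 = p1_from_odds_ratio(odds_ratio=2, p2=0.30)
print(round(p1, 2))                       # → 0.46
print(two_group_proportions_n(p1, 0.30))  # → 141 per group
```

Note that using the unrounded p1 = 6/13 ≈ 0.4615 gives a slightly smaller n than the slide’s 143, which plugs in the rounded value p1 = 0.46.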



Example (odds ratio) – cont

• I want to have 80% power and a significance level of 5%

      Δ² = (p1 − p2)² / [p1(1 − p1) + p2(1 − p2)] = 0.16² / (0.46 × 0.54 + 0.3 × 0.7) = 0.0256 / 0.4584 ≈ 0.0558
      n = 8 / Δ² = 8 / 0.0558 ≈ 143

• Therefore I would need 143 patients in each group, or a total of 286 patients.



Other effect measures

• It is also possible, though not so straightforward, to calculate sample sizes for other outcome measures, such as:
  – Categorical data;
  – Correlation coefficients;
  – Count data;
  – Regression coefficients for regression models;
  – Survival data;
• Note that for regression models (except with survival data) it is usually the case that adjusting for covariates will improve the power of a study, compared to a crude comparison between two groups.



Unbalanced designs

• Sometimes we may prefer to use unbalanced two-group designs (i.e. one group is bigger than the other):
  – One intervention is much more expensive?
  – In a case-control study, cases are very rare?
• Again, this is possible.
• There is one example of this in practical 3.



Practical 3

• Other levels of power.

• Other scenarios (e.g. fixed sample size).

• Use computer rather than calculating by hand.

• Presentation of sample size calculations.


Post-hoc power evaluation

• Suppose we conduct a cohort study, and find that the association between smoking and breast cancer is fairly strong but non-significant, with risk ratio 2.0 (95% CI: 0.91, 3.32; p = 0.09).
• Would it be a good idea to check what the power of our study was to detect a risk ratio of at least 2.0, 3.0 or 4.0? . . .



Post-hoc power evaluation

• . . . not really. We already know what the answer will be,

• The power must be fairly low to detect a risk ratio of 2.0 or


smaller with statistical significance since our study did not
detect this risk ratio with statistical significance.

• Risk ratios as high as 4.0 are not consistent with our data
(the upper limit of our 95% confidence interval was 3.32).


Post-hoc power evaluation

• All relevant information is contained in the estimate and
confidence interval – our best estimate of the risk ratio is 2.0
and our findings are most consistent with risk ratios between
0.91 and 3.32.

• Power will be low to detect effects between the point estimate
and the null value (e.g. 1.5 in our example), and higher to
detect effects further away from the null value (e.g. 2.5).
Effects outside the confidence interval are not very consistent
with our data.

Relative effect sizes

• In our example sample size calculations for continuous
outcome variables, we were careful to specify the effect size of
interest (µ1 − µ2), and to consider also the standard deviation
of measurements (σ).

• There is an alternative approach where we instead describe
the power to detect various effect sizes of the form
∆ = (µ1 − µ2)/σ.


Cohen’s effect sizes


• Jacob Cohen has defined ‘small’ (0.2), ‘medium’ (0.5) and ‘large’
(0.8) effect sizes.

• An effect size of 0.8 would correspond to a difference between
groups of 0.8 standard deviations.

• With effect sizes, we avoid having to know anything about σ.

• Generally it is not a good idea to use these in sample size
calculations:
– because we should be honest about the clinical significance of the
effect size our study is powered to detect (1 mmHg, 0.1 kg/m2)
rather than a vague statement about ‘a medium effect.’
– because effect sizes depend on the variability of measurements.
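As an illustration, the per-group sample size implied by each of Cohen's conventional effect sizes follows from the two-group formula n = 2(z1−α/2 − zβ)²/∆². A minimal sketch in Python (not part of the original slides; it assumes the standard α = 0.05 and 80% power):

```python
import math

def n_per_group(delta, z_alpha=1.96, z_beta=-0.84):
    """Per-group sample size to detect a standardised effect size
    delta = (mu1 - mu2) / sigma in a two-group comparison."""
    return 2 * (z_alpha - z_beta) ** 2 / delta ** 2

# Cohen's conventional effect sizes at alpha = 0.05, power = 0.80
for label, delta in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(f"{label}: n = {n_per_group(delta):.1f} per group")
# small: ~392, medium: ~62.7, large: ~24.5 (round up in practice)
```

Note how quickly n grows as ∆ shrinks – one reason a vague ‘small effect’ target can quietly multiply a study's size.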

Danger of effect sizes in power calculations

• If I want to compare the efficacy of two stimulants in
increasing pulse rate, my measurements would be more precise
(less variable) if I use an electronic pulse monitor (EPM) than
if I simply use a wristwatch.

• The mean pulse rate of the two groups will be compared with
a 2 sample t-test.

• I could state that I wish to be able to detect a medium effect
size (∆ = 0.5) . . .


Danger of effect sizes in power calculations


• Study should have 80% power to detect a medium effect size
of ∆ = 0.5 (i.e. µ1 − µ2 = 0.5σ).
• Using either measurement device, we include 128 participants.
• But what absolute difference will we have 80% power to
detect? 1bpm, 5bpm, 10bpm?

Choice of σ Absolute Sample size


method effect required
Wristwatch ? ? 128
EPM ? ? 128

Danger of effect sizes in power calculations


• After further research, I decide that if I use the wristwatch I
will have σ1 = 4 bpm while if I use the EPM σ2 = 2 bpm.
• If I use the wristwatch I can only detect a difference of at
least 2 bpm, whereas using the EPM I can detect a difference
of 1 bpm.

Choice of σ Absolute Sample size


method effect required
Wristwatch 4 2 128
EPM 2 1 128

Danger of effect sizes in power calculations


• What if I decide that I want to detect a difference of at least
1 bpm?

• We would require a much larger sample if we chose to use the
less precise assessment method:

Choice of method    σ    Absolute effect    Sample size required
Wristwatch          4    1                  510
EPM                 2    1                  128
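The table's sample sizes can be checked against the two-group formula n = 2(z1−α/2 − zβ)²σ²/(µ1 − µ2)² per group. A sketch in Python (α = 0.05 and 80% power assumed; the totals come out near the 510 and 128 quoted above, with small differences due to rounding conventions):

```python
import math

def n_per_group(sigma, diff, z_sum=2.8):
    """Per-group n, rounded up; z_sum = z_{1-alpha/2} - z_beta
    = 1.96 + 0.84 for alpha = 0.05 and 80% power."""
    return math.ceil(2 * z_sum ** 2 * sigma ** 2 / diff ** 2)

# To detect an absolute difference of 1 bpm:
print("wristwatch:", 2 * n_per_group(sigma=4, diff=1), "participants in total")
print("EPM:       ", 2 * n_per_group(sigma=2, diff=1), "participants in total")
```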


Try to avoid relative effect sizes


• The variance (or precision) of measurements can have a huge
effect on the power and sample size calculation, and should
not be ignored.
• If you are planning a study but really have no idea of what
absolute effect size it might be important to detect
– do further research, speak to some experts.
• If you are planning a study but really have no idea about the
variance of measurements
– make conservative assumptions, and provide power calculations
that allow for a range of plausible values.
Pragmatic Multiple testing Review Further reading Appendix – Continuous Appendix – Dichotomous

Pragmatic approaches
• Norman et al. BMJ 2012; 345:e5278

• Note the difficulty in estimating important parameters

• Note the logistical and financial constraints (often we choose the
largest study that we can afford)

• Propose ‘pragmatic’ choice of sample size based on other studies in
the literature, and logistical/financial constraints.

• Criticisms:
– Ethics of inappropriate sample size (but small studies can still be
useful if included in meta analyses).
– Justification for conducting studies without enough background

Comment on multiple hypothesis tests


• Example: we conduct an RCT of drug A vs drug B for the
treatment of migraine. We perform a hypothesis test on the results,
with α = 0.05.

• The results are not statistically significant, i.e. no evidence that
either drug is superior.

• Then we split the data into two groups, a high-risk and a low-risk
group, and perform a hypothesis test on each group, each test with
α = 0.05.

• What is the chance of incorrectly rejecting a true null hypothesis in
the second analysis?

Multiple hypothesis tests


• In a test with a type I error risk of 0.05 and when the null
hypothesis is true, we have a 95% chance to correctly fail to reject
the null hypothesis.

• In two such tests (assuming independence), the chance to correctly
fail to reject the null hypotheses in both tests is 0.95 × 0.95 ≈ 0.90.

• Therefore the chance of incorrectly rejecting at least one true null
hypothesis is increased to around 10%.

• In general for κ tests at α = 0.05 each with a true null hypothesis,
the chance of incorrectly rejecting at least one true null hypothesis
is 1 − 0.95^κ.
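This inflation is easy to verify numerically; a small sketch (writing κ as k):

```python
def family_wise_error(k, alpha=0.05):
    """P(at least one false positive) across k independent tests,
    each at level alpha, when every null hypothesis is true."""
    return 1 - (1 - alpha) ** k

for k in (1, 2, 5, 10):
    print(k, round(family_wise_error(k), 4))
# 1 test: 0.05; 2 tests: ~0.0975; 5 tests: ~0.2262; 10 tests: ~0.4013
```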

The Bonferroni correction

• When α is small, we can approximate (1 − α)^κ ≈ 1 − κα.
– E.g. 0.95^κ ≈ 1 − 0.05κ, and 1 − 0.95^κ ≈ 0.05κ.

• Then we can simply divide our desired overall significance level by κ for
each hypothesis test.

• E.g. if we perform 5 significance tests, and we want an overall 5% type I
error risk, then each hypothesis test would have to give a p-value below
0.01 (=0.05/5) to be considered statistically significant.

• This is known as the Bonferroni correction.
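A sketch of the correction in Python (assuming independent tests; the achieved family-wise error then sits just under the target):

```python
def bonferroni_alpha(overall_alpha, k):
    """Per-test threshold keeping the family-wise type I error
    at roughly overall_alpha across k tests."""
    return overall_alpha / k

per_test = bonferroni_alpha(0.05, 5)
print(round(per_test, 4))                 # 0.01 per test
print(round(1 - (1 - per_test) ** 5, 4))  # achieved FWER, about 0.049
```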


Review

• The significance level, or type I error risk (α), is a false
positive risk.

• The type II error risk (β) is a false negative risk.

• The power of a hypothesis test is (1 − β).

• Our test will have increased power if we:
– increase the sample size;
– increase the type I error risk (α);
– specify that we want to detect a larger difference.
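Each of these levers can be checked with a quick normal-approximation power calculation (a sketch only; it uses the two-sample z-test formula from the appendix, and the values diff = 5 and sigma = 10 are illustrative):

```python
import math

def norm_cdf(x):
    """Standard Normal CDF, computed from the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power(n, diff, sigma, z_crit=1.96):
    """Approximate power of a two-sample z-test with n per group."""
    se = sigma * math.sqrt(2 / n)
    return norm_cdf(abs(diff) / se - z_crit)

for n in (20, 40, 80):                      # larger n -> more power
    print(n, round(power(n, diff=5, sigma=10), 2))
print(round(power(40, diff=7.5, sigma=10), 2))              # larger difference
print(round(power(40, diff=5, sigma=10, z_crit=1.645), 2))  # alpha = 0.10
```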


Review (cont.)

• In determining the sample size for a study, we need to know:
– The objectives of the study and the study design;
– The outcome variables and the method of analysis;
– The variability of measurements;
– Depending on whether we have 1 or 2 groups:
• the width of the confidence interval that we want to produce;
• the magnitude of any treatment effect and the power required;
– The significance level that we want to use;
– The resources available; the estimated response fraction.


Different effect sizes

• Sometimes we will quote the power of a hypothesis test to
detect various effect sizes.

• E.g. “Our study has 95%, 80% and 60% power to detect
relative risks of 2.0, 1.5 and 1.25, respectively.”

• Or we may present a short table (see practical 3).


Two important tips

• I would like to estimate mean height precisely, where the
width of the confidence interval is no more than 1cm
– mean ±1cm?
– mean ±0.5cm?
• I calculated sample size for a comparison of two groups, and
came up with n = 500.
– is that 500 in each group?
– or 500 overall (250 in each group)?


Not covered

• Sample size calculations for more complex studies.
– e.g. cluster randomized designs

• The 2-group approaches discussed today are meant for studies
that aim to detect a difference between two groups. Different
calculations are needed for studies that aim to demonstrate
equivalence between two groups.


Summary

• There are ethical considerations in study design, particularly in
choosing appropriate sample sizes.
– underpowered studies are a waste of time and can also lead to
misleading research findings (Ioannidis, 2005, PLoS Med)

• Consider consulting a statistician if you have difficulty with a
sample size calculation.


References
• Bland JM. The tyranny of power: is there a better way to
calculate sample size? BMJ, 2009; 339:b3985.

• Gore SM, Altman DG (eds). Statistics in Practice: Articles
Published in the British Medical Journal. (1982).

• Ioannidis JP. Why most published research findings are false.
PLoS Med, 2005; 2(8):e124.

• Norman G, et al. Sample size calculations: should the
emperor’s clothes be off the peg or made to measure? BMJ,
2012; 345:e5278.

Further reading

• BMJ Statistics at square one (chapter)

– http://bmj.bmjjournals.com/collections/statsbk

• Altman and Bland. Absence of evidence is not evidence of absence

– http://bmj.bmjjournals.com/cgi/content/full/311/7003/485
• Lenth RV. Some practical considerations for effective sample size
determination. American Statistician, 2001; 55(3): 187–193.

• Hoenig JM and Heisey DM. The abuse of power: the pervasive fallacy of
power calculations for data analysis. American Statistician, 2001; 55(1):
19–24.


Appendix – Continuous outcome (2 groups)


• (No need to remember these formulae!)

• Where did the formula come from? Using Normal distribution


theory, the distribution of the sample difference X¯1 − X¯2
between the two groups will be Normal with mean µ1 − µ2 ,
and standard error, s, where
r
σ2 σ2
s= +
n n

• Or equivalently, s 2 = 2σ 2 /n

• Notice that s depends inversely on n.



Continuous outcome (2 groups) – cont

• The null hypothesis (µ1 − µ2 = 0) will be rejected if:

|X̄1 − X̄2| / s > z1−α/2

• Since the sample difference X̄1 − X̄2 follows a Normal
distribution, we can consider the ‘standardised’ distribution of
the sample difference:

(X̄1 − X̄2 − (µ1 − µ2)) / s = (X̄1 − X̄2)/s − (µ1 − µ2)/s


Continuous outcome (2 groups) – cont

• The power of the test (1 − β) is given by 1 − Φ(u), where u is
the value of the standardised statistic at which the null
hypothesis is just rejected. In other words,

Φ(z1−α/2 − (µ1 − µ2)/s) = β

• Note that if Φ(u) = β, then u = zβ.


Continuous outcome (2 groups) – cont

• Therefore we want to have s such that

|µ1 − µ2| / s = z1−α/2 − zβ

• Squaring and re-arranging gives:

(µ1 − µ2)² / s² = n(µ1 − µ2)² / (2σ²) = (z1−α/2 − zβ)²

• Then rearrange even further ...


Continuous outcome (2 groups) – cont

• We get

n = 2(z1−α/2 − zβ)² σ² / (µ1 − µ2)²

• For the standard choices of α = 0.05 and 1 − β = 0.80, the
values of z1−α/2 and zβ are 1.96 and −0.84.

• Then 2(z1−α/2 − zβ)² = 15.68 ≈ 16

• (No need to remember these formulae!)
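The formula, and the handy ‘rule of 16’ approximation that follows from it (n ≈ 16σ²/(µ1 − µ2)² per group), can be sketched in Python (α = 0.05 and 80% power assumed; the inputs mu_diff = 5 and sigma = 10 are illustrative):

```python
import math

def n_continuous(mu_diff, sigma, z_alpha=1.96, z_beta=-0.84):
    """Per-group n = 2 (z_{1-alpha/2} - z_beta)^2 sigma^2 / (mu1 - mu2)^2,
    rounded up."""
    return math.ceil(2 * (z_alpha - z_beta) ** 2 * sigma ** 2 / mu_diff ** 2)

def n_rule_of_16(mu_diff, sigma):
    """Shortcut using 2 (z_{1-alpha/2} - z_beta)^2 = 15.68, rounded to 16."""
    return math.ceil(16 * sigma ** 2 / mu_diff ** 2)

print(n_continuous(mu_diff=5, sigma=10))  # 63 per group
print(n_rule_of_16(mu_diff=5, sigma=10))  # 64 per group
```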


Dichotomous outcome (2 groups) – cont

• For standard choices of (power) 1 − β = 0.80, and
(significance level) α = 0.05, the formula for the number of
subjects in each group is:

n = 8 / ∆², where ∆² = (p1 − p2)² / [p1(1 − p1) + p2(1 − p2)]

• We can call this the ‘rule of 8’.


Dichotomous outcome (2 groups) – cont

• The derivation of this formula is very similar to the derivation
of the continuous outcome sample size formula. A more
detailed formula, for significance level α and power (1 − β) is:

n = (z1−α/2 − zβ)² [p1(1 − p1) + p2(1 − p2)] / (p1 − p2)²
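The shortcut and the detailed formula can be compared directly; a sketch in Python, with illustrative proportions p1 = 0.2 and p2 = 0.4:

```python
import math

def n_detailed(p1, p2, z_alpha=1.96, z_beta=-0.84):
    """Per-group n for comparing two proportions (normal approximation)."""
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha - z_beta) ** 2 * var / (p1 - p2) ** 2)

def n_rule_of_8(p1, p2):
    """'Rule of 8' shortcut, rounded to the nearest whole number."""
    delta_sq = (p1 - p2) ** 2 / (p1 * (1 - p1) + p2 * (1 - p2))
    return round(8 / delta_sq)

print(n_detailed(0.2, 0.4))   # 79 per group
print(n_rule_of_8(0.2, 0.4))  # 80 per group
```

The two agree closely, as expected, since 8 is just (1.96 + 0.84)² ≈ 7.84 rounded up.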
