
Introduction to Probability Theory and Statistics for Psychology
and
Quantitative Methods for Human Sciences

David Steinsaltz
University Lecturer at the Department of Statistics, University of Oxford
(Lectures 1-8 based on an earlier version by Jonathan Marchini)

Lectures 1-8: MT 2011
Lectures 9-16: HT 2012

Contents

1 Describing Data . . . 1
  1.1 Example: Designing experiments . . . 2
  1.2 Variables . . . 3
    1.2.1 Types of variables . . . 4
    1.2.2 Ambiguous data types . . . 5
  1.3 Plotting Data . . . 7
    1.3.1 Bar Charts . . . 8
    1.3.2 Histograms . . . 9
    1.3.3 Cumulative and Relative Cumulative Frequency Plots and Curves . . . 14
    1.3.4 Dot plot . . . 14
    1.3.5 Scatter Plots . . . 16
    1.3.6 Box Plots . . . 17
  1.4 Summary Measures . . . 17
    1.4.1 Measures of location (Measuring the center point) . . . 19
    1.4.2 Measures of dispersion (Measuring the spread) . . . 24
  1.5 Box Plots . . . 29
  1.6 Appendix . . . 32
    1.6.1 Mathematical notation for variables and samples . . . 32
    1.6.2 Summation notation . . . 33

2 Probability I . . . 35
  2.1 Why do we need to learn about probability? . . . 35
  2.2 What is probability? . . . 38
    2.2.1 Definitions . . . 39
    2.2.2 Calculating simple probabilities . . . 39
    2.2.3 Example 2.3 continued . . . 40
    2.2.4 Intersection . . . 40
    2.2.5 Union . . . 41
    2.2.6 Complement . . . 42
  2.3 Probability in more general settings . . . 43
    2.3.1 Probability Axioms (Building Blocks) . . . 43
    2.3.2 Complement Law . . . 44
    2.3.3 Addition Law (Union) . . . 44

3 Probability II . . . 47
  3.1 Independence and the Multiplication Law . . . 47
  3.2 Conditional Probability Laws . . . 51
    3.2.1 Independence of Events . . . 53
    3.2.2 The Partition law . . . 55
  3.3 Bayes' Rule . . . 56
  3.4 Probability Laws . . . 59
  3.5 Permutations and Combinations (Probabilities of patterns) . . . 59
    3.5.1 Permutations of n objects . . . 59
    3.5.2 Permutations of r objects from n . . . 61
    3.5.3 Combinations of r objects from n . . . 62
  3.6 Worked Examples . . . 62

4 The Binomial Distribution . . . 69
  4.1 Introduction . . . 69
  4.2 An example of the Binomial distribution . . . 69
  4.3 The Binomial distribution . . . 72
  4.4 The mean and variance of the Binomial distribution . . . 73
  4.5 Testing a hypothesis using the Binomial distribution . . . 75

5 The Poisson Distribution . . . 81
  5.1 Introduction . . . 81
  5.2 The Poisson Distribution . . . 83
  5.3 The shape of the Poisson distribution . . . 86
  5.4 Mean and Variance of the Poisson distribution . . . 86
  5.5 Changing the size of the interval . . . 87
  5.6 Sum of two Poisson variables . . . 88
  5.7 Fitting a Poisson distribution . . . 89
  5.8 Using the Poisson to approximate the Binomial . . . 90
  5.9 Derivation of the Poisson distribution (non-examinable) . . . 95
    5.9.1 Error bounds (very mathematical) . . . 96

6 The Normal Distribution . . . 99
  6.1 Introduction . . . 99
  6.2 Continuous probability distributions . . . 101
  6.3 What is the Normal Distribution? . . . 102
  6.4 Using the Normal table . . . 103
  6.5 Standardisation . . . 108
  6.6 Linear combinations of Normal random variables . . . 110
  6.7 Using the Normal tables backwards . . . 113
  6.8 The Normal approximation to the Binomial . . . 114
    6.8.1 Continuity correction . . . 115
  6.9 The Normal approximation to the Poisson . . . 117

7 Confidence intervals and Normal Approximation . . . 121
  7.1 Confidence interval for sampling from a normally distributed population . . . 121
  7.2 Interpreting the confidence interval . . . 123
  7.3 Confidence intervals for probability of success . . . 126
  7.4 The Normal Approximation . . . 127
    7.4.1 Normal distribution . . . 128
    7.4.2 Poisson distribution . . . 128
    7.4.3 Bernoulli variables . . . 130
  7.5 CLT for real data . . . 130
    7.5.1 Quebec births . . . 132
    7.5.2 California incomes . . . 133
  7.6 Using the Normal approximation for statistical inference . . . 135
    7.6.1 An example: Average incomes . . . 136

8 The Z Test . . . 139
  8.1 Introduction . . . 139
  8.2 The logic of significance tests . . . 139
    8.2.1 Outline of significance tests . . . 142
    8.2.2 Significance tests or hypothesis tests? Breaking the .05 barrier . . . 143
    8.2.3 Overview of Hypothesis Testing . . . 145
  8.3 The one-sample Z test . . . 146
    8.3.1 Test for a population mean . . . 147
    8.3.2 Test for a sum . . . 148
    8.3.3 Test for a total number of successes . . . 148
    8.3.4 Test for a proportion . . . 150
    8.3.5 General principles: The square-root law . . . 151
  8.4 One and two-tailed tests . . . 152
  8.5 Hypothesis tests and confidence intervals . . . 153

9 The χ² Test . . . 155
  9.1 Introduction: Test statistics that aren't Z . . . 155
  9.2 Goodness-of-Fit Tests . . . 157
    9.2.1 The χ² distribution . . . 158
    9.2.2 Large d.f. . . . 161
  9.3 Fixed distributions . . . 162
  9.4 Families of distributions . . . 167
    9.4.1 The Poisson Distribution . . . 168
    9.4.2 The Binomial Distribution . . . 169
  9.5 Chi-squared Tests of Association . . . 173

10 The T distribution and Introduction to Sampling . . . 177
  10.1 Using the T distribution . . . 177
    10.1.1 Using t for confidence intervals: Single sample . . . 178
    10.1.2 Using the T table . . . 181
    10.1.3 Using t for Hypothesis tests . . . 183
    10.1.4 When do you use the Z or the T statistics? . . . 183
    10.1.5 Why do we divide by n - 1 in computing the sample SD? . . . 184
  10.2 Paired-sample t test . . . 185
  10.3 Introduction to sampling . . . 186
    10.3.1 Sampling with and without replacement . . . 186
    10.3.2 Measurement bias . . . 187
    10.3.3 Bias in surveys . . . 188
    10.3.4 Measurement error . . . 192

11 Comparing Distributions . . . 195
  11.1 Normal confidence interval for difference between two population means . . . 195
  11.2 Z test for the difference between population means . . . 197
  11.3 Z test for the difference between proportions . . . 198
  11.4 t confidence interval for the difference between population means . . . 199
  11.5 Two-sample test and paired-sample test . . . 200
    11.5.1 Schizophrenia study: Two-sample t test . . . 200
    11.5.2 The paired-sample test . . . 201
    11.5.3 Is the CLT justified? . . . 203
  11.6 Hypothesis tests for experiments . . . 204
    11.6.1 Quantitative experiments . . . 204
    11.6.2 Qualitative experiments . . . 206

12 Non-Parametric Tests, Part I . . . 209
  12.1 Introduction: Why do we need distribution-free tests? . . . 209
  12.2 First example: Learning to Walk . . . 210
    12.2.1 A first attempt . . . 210
    12.2.2 What could go wrong with the T test? . . . 211
    12.2.3 How much does the non-normality matter? . . . 213
  12.3 Tests for independent samples . . . 215
    12.3.1 Median test . . . 215
    12.3.2 Rank-Sum test . . . 218
  12.4 Tests for paired data . . . 221
    12.4.1 Sign test . . . 221
    12.4.2 Breastfeeding study . . . 222
    12.4.3 Wilcoxon signed-rank test . . . 224
    12.4.4 The logic of non-parametric tests . . . 225

13 Non-Parametric Tests Part II, Power of Tests . . . 227
  13.1 Kolmogorov-Smirnov Test . . . 227
    13.1.1 Comparing a single sample to a distribution . . . 227
    13.1.2 Comparing two samples: Continuous distributions . . . 234
    13.1.3 Comparing two samples: Discrete samples . . . 235
    13.1.4 Comparing tests to compare distributions . . . 237
  13.2 Power of a test . . . 237
    13.2.1 Computing power . . . 238
    13.2.2 Computing trial sizes . . . 239
    13.2.3 Power and non-parametric tests . . . 241

14 ANOVA and the F test . . . 245
  14.1 Example: Breastfeeding and intelligence . . . 245
  14.2 Digression: Confounding and the adjusted means . . . 245
  14.3 Multiple comparisons . . . 247
    14.3.1 Discretisation and the χ² test . . . 247
    14.3.2 Multiple t tests . . . 248
  14.4 The F test . . . 249
    14.4.1 General approach . . . 249
    14.4.2 The breastfeeding study: ANOVA analysis . . . 252
    14.4.3 Another Example: Exercising rats . . . 253
  14.5 Multifactor ANOVA . . . 254
  14.6 Kruskal-Wallis Test . . . 255

15 Regression and correlation: Detecting trends . . . 259
  15.1 Introduction: Linear relationships between variables . . . 259
  15.2 Scatterplots . . . 261
  15.3 Correlation: Definition and interpretation . . . 262
  15.4 Computing correlation . . . 263
    15.4.1 Brain measurements and IQ . . . 265
    15.4.2 Galton parent-child data . . . 266
    15.4.3 Breastfeeding example . . . 267
  15.5 Testing correlation . . . 269
  15.6 The regression line . . . 270
    15.6.1 The SD line . . . 270
    15.6.2 The regression line . . . 271
    15.6.3 Confidence interval for the slope . . . 274
    15.6.4 Example: Brain measurements . . . 277

16 Regression, Continued . . . 281
  16.1 R² . . . 281
    16.1.1 Example: Parent-Child heights . . . 282
    16.1.2 Example: Breastfeeding and IQ . . . 282
  16.2 Regression to the mean and the regression fallacy . . . 284
  16.3 When the data don't fit the model . . . 287
    16.3.1 Transforming the data . . . 287
    16.3.2 Spearman's Rank Correlation Coefficient . . . 287
    16.3.3 Computing Spearman's rank correlation coefficient . . . 288

Lecture 1

Describing Data
Uncertain knowledge
+ knowledge about the extent of uncertainty in it
= Useable knowledge
    C. R. Rao, statistician

"As we know, there are known knowns. There are things we know we know. We also know there are known unknowns. That is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know."
    Donald Rumsfeld, US Secretary of Defense

Observations and measurements are at the centre of modern science. We put our ideas to the test by comparing them to what we find out in the world. This is easier said than done, because all observation and measurement is uncertain. Some of the reasons are:

Sampling: Our observations are only a small sample of the range of possible observations, the population.

Errors: Every measurement suffers from errors.

Complexity: The more observations we have, the more difficult it becomes to make them tell a coherent story.

We can never observe everything, nor can we make measurements without error. But, as the quotes above suggest, uncertainty is not such a problem if it can be constrained, that is, if we know the limits of our uncertainty.
1.1 Example: Designing experiments

"If a newborn infant is held under his arms and his bare feet are permitted to touch a flat surface, he will perform well-coordinated walking movements similar to those of an adult [...]. Normally, the walking and placing reflexes disappear by about 8 weeks." [ZZK72] This raises the question: would exercising this reflex enable children to acquire the ability to walk independently more quickly? How would we resolve this question?

Of course, we could perform an experiment. We could do these exercises with an infant, starting from when he or she was a newborn, and follow up every week for about a year, to find out when this baby starts walking. Suppose it is 10 months. Have we answered the question then?

The obvious problem is that we don't know at what age this baby would have started walking without exercise. One solution would be to take another infant, observe this one at the same weekly intervals without doing any special exercises, and see which one starts walking first. We call this other infant the control. Suppose this one starts walking aged 11.50 months (that is, 11 months and 2 weeks). Now, have we answered the question?

It is clear that we're still not done, because children start walking at all different ages. It could be that we happened to pick a slow child for the exercises, and a particularly fast-developing child for the control. How can we resolve this?

Obviously, the first thing we need to do is to understand how much variability there is in the age at first walking, without imposing an exercise regime. For that, there is no alternative to looking at multiple infants. Here several questions must be considered:

- How many?
- How do we summarise the results of multiple measurements?
- How do we answer the original question: do the special exercises make the children learn to walk sooner?
In the original study, the authors had six infants in the treatment group (the formal name for the ones who received the exercise, also called the experimental group), and six in the control group. (In fact, they had a second control group, subject to an alternative exercise regime, but that's a complication for a later date.) The results are tabulated in Table 1.1. We see that most of the treatment children did start walking earlier than most of the control children. But not all. The slowest child from the treatment group in fact started walking later than four of the six control children. Should we still be convinced that the treatment is effective? If not, how many more subjects do we need before we can be confident? How would we decide?
Treatment    9.00    9.50    9.75   10.00   13.00    9.50
Control     11.50   12.00    9.00   11.50   13.25   13.00

Table 1.1: Age (in months) at which infants were first able to walk independently. Data from [ZZK72].
The answer is, we can't know for sure. The results are consistent with believing that the treatment had an effect, but they are also consistent with believing that we happened to get a particularly slow group of treatment children, or a fast group of control children, purely by chance. What we need now is a formal way of looking at these results, to tell us how to draw conclusions from data ("the exercise helped children walk sooner") and how to properly estimate the confidence we should have in our conclusions ("how likely is it that we might have seen a similar result purely by chance, if the exercise did not help?"). We will use graphical tools, mathematical tools, and logical tools.
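As a first informal look at Table 1.1 (a minimal sketch in plain Python; code is an illustration added here, not part of the original study's analysis), we can at least compare the mean age at first walking in the two groups:

```python
# Group means for Table 1.1 (age in months at first independent walking).
treatment = [9.00, 9.50, 9.75, 10.00, 13.00, 9.50]
control = [11.50, 12.00, 9.00, 11.50, 13.25, 13.00]

mean_treatment = sum(treatment) / len(treatment)  # 10.125 months
mean_control = sum(control) / len(control)        # about 11.71 months
print(mean_treatment, mean_control)
```

The means differ, but as noted above, the overlap between the two groups means this difference alone settles nothing; we need the formal tools developed in the rest of the course.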

1.2 Variables

The datasets that Psychologists and Human Scientists collect will usually
consist of one or more observations on one or more variables.
A variable is a property of an object or event that can take on different values.
For example, suppose we collect a dataset by measuring the hair colour,
resting heart rate and score on an IQ test of every student in a class. The
variables in this dataset would then simply be hair colour, resting heart
rate and score on an IQ test, i.e. the variables are the properties that we
measured/observed.

Quantitative (measurement)
  - Discrete (counts): number of offspring; size of vocabulary at 18 months
  - Continuous: height; weight; tumour mass; brain volume

Qualitative (categorical)
  - Ordinal: birth order (firstborn, etc.); degree classification; "How offensive is this odour?" (1 = not at all, 5 = very)
  - Nominal
      - Binary: smoking (yes/no); sex (M/F); place of birth (home/hospital)
      - Non-binary: hair colour; ethnicity; cause of death

Figure 1.1: A summary of the different data types with some examples.

1.2.1 Types of variables

There are 2 main types of data/variable (see Figure 1.1):

Measurement / Quantitative Data occur when we measure objects/events to obtain some number that reflects the quantitative trait of interest, e.g. when we measure someone's height or weight.

Categorical / Qualitative Data occur when we assign objects into labelled groups or categories, e.g. when we group people according to hair colour or race. Often the categories have a natural ordering. For example, in a survey we might group people depending upon whether they agree / neither agree nor disagree / disagree with a statement. We call such ordered variables Ordinal variables. When the categories are unordered, e.g. hair colour, we have a Nominal variable.

It is also useful to make the distinction between Discrete and Continuous variables (see Figure 1.2). Discrete variables, such as the number of children in a family, or the number of peas in a pod, can take on only a limited set of values. (Categorical variables are, of course, always discrete.) Continuous variables, such as height and weight, can take on (in principle) an unlimited set of values.
Discrete Data: e.g. the number of students late for a lecture. There is only a limited set of distinct values/categories; we can't have exactly 2.23 students late, only integer values are allowed.

Continuous Data: e.g. the time spent studying statistics (hrs), such as 3.76 or 5.67. In theory there is an unlimited set of possible values, with no discrete jumps between possible values.

Figure 1.2: Examples of Discrete and Continuous variables.

1.2.2 Ambiguous data types

The distinctions between the data types described in Section 1.2.1 are not always clear-cut. Sometimes the type isn't inherent to the data, but depends on how you choose to look at it. Consider the experiment described in Section 1.1. Think about how the results may have been recorded in the lab notebooks. For each child, it was recorded which group (treatment or control) the child was assigned to, which is clearly a (binary) categorical variable. Then, you might find an entry for each week, recording the result of the walking test: yes or no, the child could or couldn't walk that week. In principle, this is a long sequence of categorical variables. However, it would be wise to notice that this sequence consists of a long run of "no" followed by a single "yes". No information is lost, then, if we simply look at the length of the sequence of noes, which is now a quantitative variable. Is it discrete or continuous? In principle, the age at which a child starts walking is a continuous variable: there is no fixed set of times at which this can occur. But the variable we have is not the exact time of first walking, but the week of the follow-up visit at which the experimenters found that the child could walk, reported in the shorthand that treats 1 week as being 1/4 month. In fact, then, the possible outcomes are a discrete set: 8.0, 8.25, 8.5, .... What this points up, though, is simply that there is no sharp distinction between continuous and discrete. What "continuous" means, in practice, is usually just that there are a large number of possible outcomes. The methods for analysing discrete variables aren't really distinct from those for continuous variables. Rather, we may have special approaches to analysing a binary variable, or one with a handful of possible outcomes. As the number of outcomes increases, the benefits of considering the discreteness disappear, and the methods shade off into the continuous methods.
One important distinction, though, is between categorical and quantitative data. It is obvious that if you have listed each subject's hair colour, that needs to be analysed differently from their blood pressure. Where it gets confusing is the case of ordinal categories. For example, suppose we are studying the effect of family income on academic achievement, as measured by degree classification. The possibilities are: first, upper second, lower second, third, pass, and fail. It is clear that they are ordered, so that we want our analysis to take account of the fact that a third is between a fail and a first. The designation even suggests assigning numbers: 1, 2, 2.5, 3, 4, 5, and this might be a useful shorthand for recording the results. But once we have these numbers, it is tempting to do with them the things we do with numbers: add them up, compute averages, and so on. Keep in mind, though, that we could have assigned any other numbers as well, as long as they have the same order. Do we want our conclusions to depend on the implication that a third is midway between a first and a fail? Probably not.

Suppose you have asked subjects to sniff different substances, and rate them 0, 1, or 2, corresponding to unpleasant, neutral, or pleasant. It's clear that this is the natural ordering: neutral is between unpleasant and pleasant. The problem comes when you look at the numbers and are tempted to do arithmetic with them. If we had asked subjects how many living grandmothers they have, the answers could be added up to get the total number of grandmothers, which is at least a meaningful quantity. Does the total pleasant-unpleasant smell score mean anything? What about the average score? Is neutral mid-way between unpleasant and pleasant? If half the subjects find it pleasant and half unpleasant, do they have on average a neutral response? The answers to these questions are not obvious, and require careful consideration in each specific instance. Totalling and averaging of arbitrary numbers attached to ordinal categories is a common practice, often carried out heedlessly. It should be done only with caution.

1.3 Plotting Data

One of the most important stages in a statistical analysis can be simply to look at your data right at the start. By doing so you will be able to spot characteristic features, trends and outlying observations that enable you to carry out an appropriate statistical analysis. Also, it is a good idea to look at the results of your analysis using a plot. This can help identify if you did something that wasn't a good idea!

DANGER!! It is easy to become complacent and analyse your data without looking at it. This is a dangerous (and potentially embarrassing) habit to get into and can lead to false conclusions in a given study. The value of plotting your data cannot be stressed enough.

Given that we accept the importance of plotting a dataset, we now need the tools to do the job. There are several methods that can be used, which we will illustrate with the help of the following dataset.
The baby-boom dataset

Forty-four babies (a new record) were born in one 24-hour period at the Mater Mothers' Hospital in Brisbane, Queensland, Australia, on December 18, 1997. For each of the 44 babies, The Sunday Mail recorded the time of birth, the sex of the child, and the birth weight in grams. The data are shown in Table 1.2, and will be referred to as the Baby-boom dataset.

While we did not collect this dataset based on a specific hypothesis, if we wished we could use it to answer several questions of interest. For example,

- Do girls weigh more than boys at birth?
- What is the distribution of the number of births per hour?
- Is birth weight related to the time of birth?

Time (min)  Sex  Weight (g)    Time (min)  Sex  Weight (g)
      5      F      3837            847     F      3480
     64      F      3334            873     F      3116
     78      M      3554            886     F      3428
    115      M      3838            914     M      3783
    177      M      3625            991     M      3345
    245      F      2208           1017     M      3034
    247      F      1745           1062     F      2184
    262      M      2846           1087     M      3300
    271      M      3166           1105     F      2383
    428      M      3520           1134     M      3428
    455      M      3380           1149     M      4162
    492      M      3294           1187     M      3630
    494      F      2576           1189     M      3406
    549      F      3208           1191     M      3402
    635      M      3521           1210     F      3500
    649      F      3746           1237     M      3736
    653      F      3523           1251     M      3370
    693      M      2902           1264     M      2121
    729      M      2635           1283     M      3150
    776      M      3920           1337     F      3866
    785      M      3690           1407     F      3542
    846      F      3430           1435     F      3278

Table 1.2: The Baby-boom dataset.


- Is gender related to the time of birth?
- Are these observations consistent with boys and girls being equally likely?

These are all questions that you will be able to test formally by the end of this course. First, though, we can plot the data to see what they might be telling us about these questions.

1.3.1 Bar Charts

A Bar Chart is a useful method of summarising Categorical Data. We represent the counts/frequencies/percentages in each category by a bar. Figure
1.3 is a bar chart of gender for the baby-boom dataset. Notice that the bar
chart has its axes clearly labelled.

Figure 1.3: A Bar Chart showing the gender distribution in the Baby-boom dataset. (Vertical axis: Frequency; categories: Girl, Boy.)

1.3.2 Histograms

An analogy: a Bar Chart is to Categorical Data as a Histogram is to Measurement Data.

A histogram shows us the distribution of the numbers along some scale. A histogram is constructed in the following way:

- Divide the measurements into intervals (sometimes called bins);
- Determine the number of measurements within each bin;
- Draw a bar for each bin whose height represents the count in that bin.

The art in constructing a histogram is how to choose the number of bins and the boundary points of the bins. For small datasets, it is often feasible to simply look at the values and decide upon sensible boundary points.
For the baby-boom dataset we can draw a histogram of the birth weights (Figure 1.4). To draw the histogram I found the smallest and largest values:

smallest = 1745    largest = 4162

There are only 44 weights, so it seems reasonable to take 6 bins:

1500-2000, 2000-2500, 2500-3000, 3000-3500, 3500-4000, 4000-4500

Using these bins works well: the histogram shows us the shape of the distribution, and we notice that the distribution has an extended left tail.

Figure 1.4: A Histogram showing the birth weight distribution in the Baby-boom dataset.
Too few categories and the details are lost. Too many categories and the
overall shape is obscured by haphazard details (see Figure 1.5).
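The counting step behind Figure 1.4 is purely mechanical. Here is a minimal sketch (plain Python, with the 44 weights typed in from Table 1.2; code is illustrative, not part of the original notes). Treating each bin as including its upper boundary reproduces the counts that appear later in Table 1.5:

```python
# Count the 44 birth weights (g) from Table 1.2 into the 6 bins of the text.
# Bins are treated as (lo, hi], which matches the counts in Table 1.5.
weights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]
edges = [1500, 2000, 2500, 3000, 3500, 4000, 4500]

counts = [sum(1 for w in weights if lo < w <= hi)
          for lo, hi in zip(edges, edges[1:])]
print(counts)   # [1, 4, 4, 19, 15, 1]
```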
In Figure 1.6 we show some examples of the different shapes that histograms can take. One can learn quite a lot about a set of data just by looking at the shape of the histogram. For example, Figure 1.6(c) shows the percentage of the tuberculosis drug isoniazid that is acetylated in the livers of 323 patients after 8 hours. Unacetylated isoniazid remains longer in the blood, and can contribute to toxic side effects. It is interesting, then, to notice that there is a wide range of rates of acetylation, from patients who acetylate almost all of the drug, to those who acetylate barely one fourth of it in 8 hours. Note that there are two peaks; this kind of distribution is called bimodal, which points to the fact that there is a subpopulation lacking a functioning copy of the relevant gene for efficiently carrying through this reaction.

Figure 1.5: Histograms with too few and too many categories respectively.
So far, we have taken the bins to all have the same width. Sometimes we might choose to have unequal bins, and more often we may be forced to have unequal bins by the way the data are delivered. For instance, suppose we did not have the full table of data, but were only presented with the following table (Table 1.3). What is the best way to make a histogram from these data?
Bin                 1500-2500g   2500-3000g   3000-3500g   3500-4500g
Number of Births         5            4            19           16

Table 1.3: Baby-boom weight data, allocated to unequal bins.

We could just plot rectangles whose heights are the frequencies. We then end up with the picture in Figure 1.7(a). Notice that the shape has changed substantially, owing to the large boxes that correspond to the widened bins. In order to preserve the shape, which is the main goal of a histogram, we want the area of a box to correspond to the contents of the bin, rather than the height. Of course, this is the same when the bin widths are equal. Otherwise, we need to switch from the frequency scale to the density scale, in which the height of a box is not the number of observations in the bin, but the number of observations per unit of measurement. This gives us the picture in Figure 1.7(b), which has a very similar shape to the histogram with equal bin widths.

Figure 1.6: Examples of different shapes of histograms. (a) Left-skewed: weights from the Baby-boom dataset. (b) Right-skewed: 1999 California household incomes, in $thousands (from www.census.gov). (c) Bimodal: percentage acetylation of isoniazid in 8 hours. (d) Bell-shaped: serum cholesterol (mg/100 ml) of 10-year-old boys [Abr78].

Figure 1.7: The same data plotted in (a) frequency scale and (b) density scale (babies/g). Note that the density-scale histogram has the same shape as the plot from the data with equal bin widths.
Thus, for the data in Table 1.3 we would calculate the height of the first rectangle as

Density = (Number of births)/(width of bin) = 5 babies / 1000 g = 0.005.

The complete computations are given in Table 1.4, and the resulting histogram is in Figure 1.7(b).
Bin                 1500-2500g   2500-3000g   3000-3500g   3500-4500g
Number of Births         5            4            19           16
Density                0.005        0.008        0.038        0.016

Table 1.4: Computing a histogram in density scale.


A histogram in density scale is constructed in the following way:

- Divide the measurements into bins (unless these are already given);
- Determine the number of measurements within each bin (note: the number can also be a percentage; often the exact numbers are unavailable, but you can simply act as though there were 100 observations);
- For each bin, compute the density, which is simply the number of observations divided by the width of the bin;
- Draw a bar for each bin whose height represents the density in that bin. The area of the bar will correspond to the number of observations in the bin.
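A minimal sketch (plain Python; illustrative, not part of the original notes) of the density computation, using the bin edges and counts from Table 1.3:

```python
# Density = (number of observations in bin) / (width of bin).
bins = [(1500, 2500, 5), (2500, 3000, 4), (3000, 3500, 19), (3500, 4500, 16)]

for lo, hi, count in bins:
    density = count / (hi - lo)  # babies per gram
    print(f"{lo}-{hi}g: {count} births, density = {density:.3f}")
# 1500-2500g: 5 births, density = 0.005   ... and so on, as in Table 1.4
```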

1.3.3 Cumulative and Relative Cumulative Frequency Plots and Curves

A cumulative frequency plot is very similar to a histogram. In a cumulative frequency plot, the height of the bar in each interval represents the total count of observations within the interval and lower than the interval (see Figure 1.8).

In a cumulative frequency curve, the cumulative frequencies are plotted as points at the upper boundaries of each interval. It is usual to join up the points with straight lines (see Figure 1.8).

Relative cumulative frequencies are simply cumulative frequencies divided by the total number of observations (so relative cumulative frequencies always lie between 0 and 1). Relative cumulative frequency plots and curves thus just use relative cumulative frequencies rather than cumulative frequencies. Such plots are useful when we wish to compare two or more distributions on the same scale.

Consider the histogram of birth weight shown in Figure 1.4. The frequencies, cumulative frequencies and relative cumulative frequencies of the intervals are given in Table 1.5.
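The cumulative columns of Table 1.5 (below) are just running totals of the frequency column. A small sketch in plain Python (illustrative, not part of the original notes):

```python
# Cumulative and relative cumulative frequencies for the Table 1.5 bins.
freqs = [1, 4, 4, 19, 15, 1]   # counts per 500g bin, from Table 1.5
total = sum(freqs)             # 44 babies

cum = []
running = 0
for f in freqs:
    running += f
    cum.append(running)

rel_cum = [round(c / total, 3) for c in cum]
print(cum)      # [1, 5, 9, 28, 43, 44]
print(rel_cum)  # [0.023, 0.114, 0.205, 0.636, 0.977, 1.0]
```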

Interval                         1500-2000  2000-2500  2500-3000  3000-3500  3500-4000  4000-4500
Frequency                            1          4          4         19         15          1
Cumulative Frequency                 1          5          9         28         43         44
Relative Cumulative Frequency      0.023      0.114      0.205      0.636      0.977       1.0

Table 1.5: Frequencies and cumulative frequencies for the histogram in Figure 1.4.

Figure 1.8: Cumulative frequency plot and curve of birth weights for the baby-boom dataset.

1.3.4 Dot plot

A Dot Plot is a simple and quick way of visualising a dataset. This type of plot is especially useful if data occur in groups and you wish to quickly visualise the differences between the groups. For example, Figure 1.9 shows
a dot plot of birth weights grouped by gender for the baby-boom dataset.
The plot suggests that girls may be lighter than boys at birth.

Figure 1.9: A Dot Plot showing the birth weights grouped by gender for the baby-boom dataset.

1.3.5 Scatter Plots

Scatter plots are useful when we wish to visualise the relationship between two measurement variables. To draw a scatter plot we:

- Assign one variable to each axis.
- Plot one point for each pair of measurements.

For example, we can draw a scatter plot to examine the relationship between birth weight and time of birth (Figure 1.10). The plot suggests that there is little relationship between birth weight and time of birth.

Figure 1.10: A Scatter Plot of birth weights versus time of birth (mins since 12pm) for the baby-boom dataset.

1.3.6 Box Plots

Box Plots are probably the most sophisticated type of plot we will consider. To draw a Box Plot we need to know how to calculate certain summary measures of the dataset, covered in the next section. We return to discuss Box Plots in Section 1.5.

1.4 Summary Measures

In the previous section we saw how to use various graphical displays to explore the structure of a given dataset. From such plots we were able to observe the general shape of the distribution of a dataset and compare visually the shapes of two or more datasets.

Figure 1.11: Comparing shapes of histograms.

Consider the histograms in Figure 1.11. Comparing the first and second histograms, we see that the distributions have the same shape or spread, but that the center of the distribution is different. Roughly, by eye, the centers differ in value by about 10. Comparing the first and third histograms, we see


that the distributions seem to have roughly the same center, but that the data plotted in the third are more spread out than in the first. Obviously, comparing the second and the third, we observe differences in both the center and the spread of the distribution.

While it is straightforward to compare two distributions by eye, placing the two histograms next to each other, it is clear that this would be difficult to do with ten or a hundred different distributions. For example, Figure 1.6(b) shows a histogram of 1999 incomes in California. Suppose we wanted to compare incomes among the 50 US states, or see how incomes developed annually from 1980 to 2000, or compare these data to incomes in 20 other industrialised countries. Laying out the histograms and comparing them would not be very practical.

Instead, we would like to have single numbers that measure

- the center point of the data;
- the spread of the data.

These two characteristics of a set of data (the center and spread) are the simplest measures of its shape. Once calculated, we can make precise statements about how the centers or spreads of two datasets differ. Later we will

learn how to go a stage further and test whether two variables have the same center point.

1.4.1 Measures of location (Measuring the center point)

There are 3 main measures of the center of a given set of (measurement) data.

The Mode of a set of numbers is simply the most common value, e.g. the mode of the following set of numbers

1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6, 6, 7, 8, 10, 13

is 3. If we plot a histogram of this data, we see that the mode is the peak of the distribution and is a reasonable representation of the center of the data. If we wish to calculate the mode of continuous data, one strategy is to group the data into adjacent intervals and choose the modal interval, i.e. draw a histogram and take the modal peak. This method is sensitive to the choice of intervals, so care should be taken that the histogram provides a good representation of the shape of the distribution.

The Mode has the advantage that it is always a score that actually occurred, and it can be applied to nominal data, properties not shared by the median and mean. A disadvantage of the mode is that there may be two or more values that share the largest frequency. In the case of two modes we would report both and refer to the distribution as bimodal.
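As a small sketch (plain Python; illustrative, not part of the original notes), the mode can be found by counting values; reporting every value that attains the top count handles the bimodal case just described:

```python
# Find all modes of a dataset: the value(s) with the largest frequency.
from collections import Counter

data = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6, 6, 7, 8, 10, 13]
counts = Counter(data)
top = max(counts.values())
modes = [value for value, c in counts.items() if c == top]
print(modes)   # [3]
```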

The Median can be thought of as the middle value, i.e. the value below which 50% of the data fall when arranged in numerical order. For example, consider the numbers

15, 3, 9, 21, 1, 8, 4.

When arranged in numerical order,

1, 3, 4, 8, 9, 15, 21,

we see that the median value is 8. If there were an even number of scores, e.g.

1, 3, 4, 8, 9, 15,

then we take the midpoint of the two middle values. In this case the median is (4 + 8)/2 = 6. In general, if we have N data points then the median location is defined as follows:

Median Location = (N + 1)/2

For example, the median location of 7 numbers is (7 + 1)/2 = 4, and the median location of 8 numbers is (8 + 1)/2 = 4.5, i.e. between observations 4 and 5 (when the numbers are arranged in order).

A major advantage of the median is the fact that it is unaffected by extreme scores (a point it shares with the mode). We say the median is resistant to outliers. For example, the median of the numbers

1, 3, 4, 8, 9, 15, 99999

is still 8. This property is very useful in practice, as outlying observations can and do occur (data is messy, remember!).
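The median-location rule translates directly into code. A minimal sketch (plain Python; illustrative, not part of the original notes), interpolating when (N + 1)/2 falls between two observations:

```python
# Median via the location rule: sort, then read off the (N + 1)/2-th value.
def median(data):
    xs = sorted(data)
    n = len(xs)
    loc = (n + 1) / 2                  # median location, 1-indexed
    i = int(loc)
    frac = loc - i                     # 0 if loc is a whole number
    lower = xs[i - 1]
    upper = xs[min(i, n - 1)]          # next observation (same one if whole)
    return lower + frac * (upper - lower)

print(median([15, 3, 9, 21, 1, 8, 4]))        # 8.0
print(median([1, 3, 4, 8, 9, 15]))            # 6.0
print(median([1, 3, 4, 8, 9, 15, 99999]))     # 8.0, resistant to the outlier
```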
The Mean of a set of scores is the sum of the scores (the total when we add them all up) divided by the number of scores. For example, the mean of

1, 3, 4, 8, 9, 15

is

(1 + 3 + 4 + 8 + 9 + 15)/6 = 6.667 (to 3 d.p.)

In mathematical notation, the mean of a set of n numbers x_1, ..., x_n is denoted by \bar{x}, where

\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}    or    \bar{x} = \frac{\sum x}{n}  (in the formula book)

See the appendix for a brief description of the summation notation (Σ).

The mean is the most widely used measure of location. Historically, this is because statisticians can write down equations for the mean and derive nice theoretical properties for it, which is much harder for the mode and median. A disadvantage of the mean is that it is not resistant to outlying observations. For example, the mean of

1, 3, 4, 8, 9, 15, 99999

is 14291.29, whereas the median (from above) is 8.
Sometimes discrete measurement data are presented in the form of a frequency table, in which the frequencies of each value are given. Remember, the mean is the sum of the data divided by the number of observations. To calculate the sum of the data we simply multiply each value by its frequency and sum; the number of observations is calculated by summing the frequencies. For example, consider the following frequency table:

Data (x)        1   2   3   4   5   6
Frequency (f)   2   4   6   7   4   1

Table 1.6: A frequency table.

We calculate the sum of the data as

(2 × 1) + (4 × 2) + (6 × 3) + (7 × 4) + (4 × 5) + (1 × 6) = 82

and the number of observations as

2 + 4 + 6 + 7 + 4 + 1 = 24.

The mean is then given by

\bar{x} = 82/24 = 3.42 (2 d.p.)

In mathematical notation, the formula for the mean of frequency data is given by

\bar{x} = \frac{\sum_{i=1}^{n} f_i x_i}{\sum_{i=1}^{n} f_i}    or    \bar{x} = \frac{\sum f x}{\sum f}
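A quick check (plain Python; illustrative, not part of the original notes) of the frequency-table mean for Table 1.6:

```python
# Mean from a frequency table: sum of f*x divided by sum of f.
values = [1, 2, 3, 4, 5, 6]
freqs  = [2, 4, 6, 7, 4, 1]

total = sum(f * x for f, x in zip(freqs, values))  # 82
n = sum(freqs)                                     # 24 observations
print(total / n)                                   # 3.4166... ≈ 3.42
```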


The relationship between the mean, median and mode

In general, these three measures of location will differ, but for certain datasets with characteristic shapes we will observe simple patterns among the three measures (see Figure 1.12).

- If the distribution of the data is symmetric, then the mean, median and mode will be very close to each other: MODE ≈ MEDIAN ≈ MEAN.
- If the distribution is positively skewed or right skewed, i.e. the data has an extended right tail, then MODE < MEDIAN < MEAN. Income data, as in Figure 1.6(b), tends to be right-skewed.
- If the distribution is negatively skewed or left skewed, i.e. the data has an extended left tail, then MEAN < MEDIAN < MODE.

The mid-range

There is actually a fourth measure of location that can be used (but rarely is). The Mid-Range of a set of data is halfway between the smallest and largest observations, i.e. the midpoint of the range of the data. For example, the mid-range of

1, 3, 4, 8, 9, 15, 21

is (1 + 21)/2 = 11. The mid-range is rarely used because it is not resistant to outliers and, by using only 2 observations from the dataset, it takes no account of how spread out most of the data are.

Figure 1.12: The relationship between the mean, median and mode, for symmetric, positively skewed, and negatively skewed distributions.

1.4.2 Measures of dispersion (Measuring the spread)

The Interquartile Range (IQR) and Semi-Interquartile Range (SIQR)

The Interquartile Range (IQR) of a set of numbers is defined to be the range of the middle 50% of the data (see Figure 1.13). The Semi-Interquartile Range (SIQR) is simply half the IQR.

We calculate the IQR in the following way:

- Calculate the 25% point (1st quartile) of the dataset. The location of the 1st quartile is defined to be the (N + 1)/4-th data point.
- Calculate the 75% point (3rd quartile) of the dataset. The location of the 3rd quartile is defined to be the 3(N + 1)/4-th data point. (The 2nd quartile is the 50% point of the dataset, i.e. the median.)
- Calculate the IQR as: IQR = 3rd quartile - 1st quartile.

Example 1. Consider the set of 11 numbers (which have been arranged in order)

10, 15, 18, 33, 34, 36, 51, 73, 80, 86, 92.

The 1st quartile is the (11 + 1)/4 = 3rd data point = 18.
The 3rd quartile is the 3(11 + 1)/4 = 9th data point = 80.
IQR = 80 - 18 = 62; SIQR = 62/2 = 31.

What do we do when the number of points + 1 isn't divisible by 4? We interpolate, just like with the median. Suppose we take off the last data point, so the list of data becomes

10, 15, 18, 33, 34, 36, 51, 73, 80, 86.

What is the 1st quartile? We're now looking for the (10 + 1)/4 = 2.75th data point. This should be 3/4 of the way from the 2nd data point to the 3rd. The distance from 15 to 18 is 3; 1/4 of the way is 0.75, and 3/4 of the way is 2.25. So the 1st quartile is 17.25.
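The same locate-and-interpolate rule can be written once and reused for both quartiles. A minimal sketch (plain Python; illustrative, not part of the original notes), reproducing Example 1 and the interpolated case:

```python
# The q-th quartile sits at location q*(N + 1)/4 (1-indexed), interpolating
# between data points when the location is not a whole number.
def quartile(data, q):
    xs = sorted(data)
    loc = q * (len(xs) + 1) / 4
    i = int(loc)
    frac = loc - i
    if frac == 0:
        return xs[i - 1]
    return xs[i - 1] + frac * (xs[i] - xs[i - 1])

data = [10, 15, 18, 33, 34, 36, 51, 73, 80, 86, 92]
q1, q3 = quartile(data, 1), quartile(data, 3)
print(q1, q3, q3 - q1)          # 18 80 62 (the IQR)
print(quartile(data[:-1], 1))   # 17.25, the interpolated example
```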

Figure 1.13: The Interquartile Range: the range covering the middle 50% of the distribution, from the 25% point to the 75% point.


The Mean Deviation

To measure the spread of a dataset, it seems sensible to use the deviation of each data point from the mean of the distribution (see Figure 1.14): the deviation of a data point is simply the data point minus the mean.

Figure 1.14: The relationship between spread and deviations (small spread gives small deviations; large spread gives large deviations).

For example, the deviations of the set of numbers

10, 15, 18, 33, 34, 36, 51, 73, 80, 86, 92,

which have mean 48, are given in Table 1.7.

The Mean Deviation of a set of numbers is simply the mean of the deviations. In mathematical notation this is written as

\frac{\sum_{i=1}^{n}(x_i - \bar{x})}{n}

At first this sounds like a good way of assessing the spread, since you might think that a large spread gives rise to larger deviations and thus a larger mean deviation. In fact, though, the mean deviation is always zero: the positive and negative deviations cancel each other out exactly. Even so, the deviations still contain useful information about the spread; we just have to find a way of using the deviations in a sensible way.
Mean Absolute Deviation (MAD)

We solve the problem of the deviations summing to zero by considering the absolute values of the deviations. The absolute value of a number is the value of that number with any minus sign removed, e.g. |-3| = 3. We then measure spread using the mean of the absolute deviations, denoted MAD. This can be written in mathematical notation as

\text{MAD} = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}    or    \frac{\sum |x - \bar{x}|}{n}

(the second formula is just a shorthand version of the first; see the Appendix). We calculate the MAD in the following way (see Table 1.7 for an example):

Data x    Deviations x - x̄    |Deviations| |x - x̄|    Deviations² (x - x̄)²
  10       10 - 48 = -38              38                    1444
  15       15 - 48 = -33              33                    1089
  18       18 - 48 = -30              30                     900
  33       33 - 48 = -15              15                     225
  34       34 - 48 = -14              14                     196
  36       36 - 48 = -12              12                     144
  51       51 - 48 =   3               3                       9
  73       73 - 48 =  25              25                     625
  80       80 - 48 =  32              32                    1024
  86       86 - 48 =  38              38                    1444
  92       92 - 48 =  44              44                    1936

Sum:   Σx = 528    Σ(x - x̄) = 0    Σ|x - x̄| = 284    Σ(x - x̄)² = 9036

Table 1.7: Deviations, Absolute Deviations and Squared Deviations.


- Calculate the mean of the data, x̄.
- Calculate the deviations by subtracting the mean from each value, x - x̄.
- Calculate the absolute deviations by removing any minus signs from the deviations, |x - x̄|.
- Sum the absolute deviations, Σ|x - x̄|.
- Calculate the MAD by dividing the sum of the absolute deviations by the number of data points, Σ|x - x̄|/n.

From Table 1.7 we see that the sum of the absolute deviations of the numbers in Example 1 is 284, so

MAD = 284/11 = 25.818 (to 3 d.p.)

The Sample Variance (s²) and Population Variance (σ²)

Another way to ensure the deviations don't sum to zero is to look at the squared deviations, since the square of a number is always positive. Thus another way of measuring the spread is to consider the mean of the squared deviations, called the variance.

If our dataset consists of the whole population (a rare occurrence), then we can calculate the population variance σ² (said "sigma squared") as

\sigma^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}    or    \sigma^2 = \frac{\sum (x - \bar{x})^2}{n}

When we just have a sample from the population (most of the time), we can calculate the sample variance s² as

s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}    or    s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}

Note: we divide by n - 1 when calculating the sample variance because s² is then a better estimate of the population variance σ² than if we had divided by n. We will see why later.
For frequency data (see Table 1.6) the formula is given by

s^2 = \frac{\sum_{i=1}^{n} f_i (x_i - \bar{x})^2}{\sum_{i=1}^{n} f_i - 1}    or    s^2 = \frac{\sum f (x - \bar{x})^2}{\sum f - 1}
We can calculate s² in the following way (see Table 1.7):

- Calculate the deviations by subtracting the mean from each value, x - x̄.
- Calculate the squared deviations, (x - x̄)².
- Sum the squared deviations, Σ(x - x̄)².
- Divide by n - 1: Σ(x - x̄)²/(n - 1).

From Table 1.7 we see that the sum of the squared deviations of the numbers in Example 1 is 9036, so

s² = 9036/(11 - 1) = 903.6

The Sample Standard Deviation (s) and Population Standard Deviation (σ)

Notice how the sample variance in Example 1 is much higher than the SIQR and the MAD:

SIQR = 31    MAD = 25.818    s² = 903.6

This happens because we have squared the deviations, transforming them to a quite different scale. We can recover the scale of the original data by simply taking the square root of the sample (population) variance. Thus we define the sample standard deviation s as

s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}

and we define the population standard deviation σ as

\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}}

Returning to Example 1, the sample standard deviation is

s = \sqrt{903.6} = 30.06 (to 2 d.p.),

which is comparable with the SIQR and the MAD.
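A short sketch (plain Python; illustrative, not part of the original notes) reproducing all three spread measures for the Example 1 data:

```python
# MAD, sample variance s², and sample standard deviation s for Example 1.
from math import sqrt

data = [10, 15, 18, 33, 34, 36, 51, 73, 80, 86, 92]
n = len(data)
mean = sum(data) / n                                  # 48.0

mad = sum(abs(x - mean) for x in data) / n            # 284/11 ≈ 25.818
s2 = sum((x - mean) ** 2 for x in data) / (n - 1)     # 9036/10 = 903.6
s = sqrt(s2)                                          # ≈ 30.06

print(round(mad, 3), s2, round(s, 2))                 # 25.818 903.6 30.06
```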

1.5 Box Plots

As we mentioned earlier, a Box Plot (sometimes called a Box-and-Whisker Plot) is a relatively sophisticated plot that summarises the distribution of a given dataset. A box plot consists of three main parts:

- A box that covers the middle 50% of the data. The edges of the box are the 1st and 3rd quartiles. A line is drawn in the box at the median value.
- Whiskers that extend out from the box to indicate how far the data extend on either side of the box. The whiskers should extend no further than 1.5 times the length of the box, i.e. the maximum length of a whisker is 1.5 times the IQR.
- All points that lie outside the whiskers are plotted individually as outlying observations.

Figure 1.15: A Box Plot of birth weights for the baby-boom dataset, showing the main parts of the plot: upper and lower whiskers, 3rd quartile, median, 1st quartile, and outliers.
Plotting box plots of measurements in different groups side by side can be illustrative. For example, Figure 1.16 shows box plots of birth weight for each gender side by side, and indicates that the distributions have quite different shapes.

Figure 1.16: A Box Plot of birth weights by gender for the baby-boom dataset.

Box plots are particularly useful for comparing multiple (but not very many!) distributions. Figure 1.17 shows data from 14 years of the total
number of births each day (5113 days in total) in Quebec hospitals. By
summarising the data in this way, it becomes clear that there is a substantial
difference between the numbers of births on weekends and on weekdays. We
see that there is a wide variety of numbers of births, and considerable overlap
among the distributions, but the medians for the weekdays are far outside
the range of most of the weekend numbers.

Figure 1.17: Box plots of the daily numbers of births in Quebec hospitals, 1 Jan 1977 to 31 Dec 1990, by day of week.

1.6 Appendix

1.6.1 Mathematical notation for variables and samples

Mathematicians are lazy. They can't be bothered to write everything out in full, so they have invented a language/notation in which they can express what they mean in a compact, quick-to-write-down fashion. This is a good thing. We don't have to study maths every day to be able to use a bit of the language and make our lives easier. For example, suppose we are interested in comparing the resting heart rate of 1st year Psychology and Human Sciences students. Rather than keep referring to the variables "the resting heart rate of 1st year Psychology students" and "the resting heart rate of 1st year Human Sciences students", we can simply denote

X = the resting heart rate of 1st year Psychology students
Y = the resting heart rate of 1st year Human Sciences students

and refer to the variables X and Y instead.

In general, we use capital letters to denote variables. If we have a sample of a variable, then we use small letters to denote the sample. For example, if we go and measure the resting heart rate of 1st year Psychology and Human Sciences students in Oxford, we could denote the p measurements on Psychologists as

x_1, x_2, x_3, ..., x_p

and the h measurements on Human Scientists as

y_1, y_2, y_3, ..., y_h

1.6.2

Summation notation

One of the most common letters in the Mathematician's alphabet is the
Greek letter sigma (Σ), which is used to denote summation. It translates
to "add up what follows". Usually, the limits of the summation are written
below and above the symbol. So,

    Σ_{i=1}^{5} xi    reads "add up the xi's from i = 1 to i = 5"

or

    Σ_{i=1}^{5} xi = (x1 + x2 + x3 + x4 + x5)

If we have some actual data then we know the values of the xi's

    x1 = 3    x2 = 2    x3 = 1    x4 = 7    x5 = 6

and we can calculate the sum as

    Σ_{i=1}^{5} xi = (3 + 2 + 1 + 7 + 6) = 19

If the limits of the summation are obvious within context then the
notation is often abbreviated to

    Σ x = 19
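In code, summation notation is just a loop or a built-in sum. A one-line
Python check of the calculation above (an illustration, not part of the
original notes):

```python
x = [3, 2, 1, 7, 6]
print(sum(x))   # 19, i.e. "add up the xi's from i = 1 to i = 5"
```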


Lecture 2

Probability I
In this and the following lecture we will learn about

- why we need to learn about probability
- what probability is
- how to assign probabilities
- how to manipulate probabilities and calculate probabilities of complex
  events

2.1 Why do we need to learn about probability?

In Lecture 1 we discussed why we need to study statistics, and we saw that


statistics plays a crucial role in the scientific process (see figure 2.1). We saw
that we use a sample from the population in order to test our hypothesis.
There will usually be a very large number of possible samples we could have
taken and the conclusions of the statistical test we use will depend on the
exact sample we take. It might happen that the sample we take leads us to
make the wrong conclusion about the population. Thus we need to know
what the chances are of this happening. Probability can be thought of as
the study of chance.


Figure 2.1: The scientific process and the role of statistics in this
process (hypothesis about a population; design and propose an experiment;
take a sample; statistical test; examine results).

Example 2.1: Random controlled experiment

The Anturane Reinfarction Trial (ART) was a famous study of


a drug treatment (anturane) for the aftereffects of myocardial
infarction [MDF+ 81]. Out of 1629 patients, about half (813)
were selected to receive anturane; the other 816 patients received
an ineffectual (placebo) pill. The results are summarised in
Table 2.1.
Table 2.1: Results of the Anturane Reinfarction Trial.

                   Treatment     Control
                   (anturane)    (placebo)
    # patients        813           816
    deaths             74            89
    % mortality       9.1%         10.9%

Imagine the following dialogue:

Drug Company: Every hospital needs to use anturane. It saves
patients' lives.

Skeptical Bureaucrat: The effect looks pretty small:
15 out of about 800 patients. And the drug is pretty
expensive.

DC: Is money all you bean counters can think about?
We reduced mortality by 16%.

SB: It was only 2% of the total.

DC: We saved 2% of the patients! What if one of them
was your mother?

SB: I'm all in favour of saving 2% more patients. I'm
just wondering: You flipped a coin to decide which
patients would get the anturane. What if the coins
had come up differently? Might we just as well be
here talking about how anturane had killed 2% of the
patients?
How can we resolve this argument? 163 patients died. Suppose
anturane has no effect. Could the apparent benefit of anturane
simply reflect the random way the coins fell? Or would such a
series of coin flips have been simply too unlikely to countenance?
To answer this question, we need to know how to measure the
likelihood (or probability) of sequences of coin flips.
Imagine a box, with cards in it, each one having written on it
one way in which the coin flips could have come out, and the patients allocated to treatments. How many of those coinflip cards
would have given us the impression that anturane performed
well, purely because many of the patients who died happened to
end up in the Control (placebo) group? It turns out that it's
more than 20% of the cards, so it's really not very unlikely at
all.
To figure this out, we are going to need to understand

1. How to enumerate all the ways the coins could come up.
How many ways are there? The number depends on the
exact procedure, but if we flip one coin for each patient, the
number of cards in the box would be 2^1629, which is vastly
more than the number of atoms in the universe. Clearly,
we don't want to have to count up the cards individually.

2. How coin flips get associated with a result, as measured in
apparent success or failure of anturane. Since the number
of cards is so large, we need to do this without having to
go through the results one by one.


Example 2.2: Baby-boom

Consider the Baby-boom dataset we saw in Lecture 1. Suppose
we have a hypothesis that in the population boys weigh more
than girls at birth. We can use our sample of boys and girls to
examine this hypothesis. One intuitive way of doing this would
be to calculate the mean weights of the boys and girls in the
sample and compare the difference between these two means:

    Sample mean of boys' weights  = x̄_boys  = 3375
    Sample mean of girls' weights = x̄_girls = 3132

    D = x̄_boys - x̄_girls = 3375 - 3132 = 243

Does this allow us to conclude that in the population boys are
born heavier than girls? On what scale do we assess the size
of D? Maybe boys and girls weigh the same at birth and we
obtained a sample with heavier boys just by chance. To be able
to conclude confidently that boys in the population are heavier
than girls we need to know what the chances are of obtaining
a difference between the means that is 243 or greater, i.e. we
need to know the probability of getting such a large value of D.
If the chances are small then we can be confident that in the
population boys are heavier than girls on average at birth.

2.2 What is probability?

The examples we have discussed so far look very complicated. They aren't
really, but in order to see the simple underlying structure, we need to
introduce a few new concepts. To do so, we want to work with a much
simpler example:


Example 2.3: Rolling a die

Consider a simple experiment of rolling a fair six-sided die.
When we toss the die there are six possible outcomes, i.e. 1,
2, 3, 4, 5 and 6. We say that the sample space of our
experiment is the set S = {1, 2, 3, 4, 5, 6}.

The outcome "the top face shows a three" is the sample point 3.

The event A1, that the die shows an even number, is the subset
A1 = {2, 4, 6} of the sample space.

The event A2, that the die shows a number larger than 4, is
the subset A2 = {5, 6} of S.

2.2.1 Definitions

The example above introduced some terminology that we will use repeatedly
when we talk about probabilities.

- An experiment is some activity with an observable outcome.
- The set of all possible outcomes of the experiment is called the
  sample space.
- A particular outcome is called a sample point.
- A collection of possible outcomes is called an event.

2.2.2 Calculating simple probabilities

Simply speaking, the probability of an event is a number between 0 and 1,
inclusive, that indicates how likely the event is to occur.

In some settings (like the example of the fair die considered above) it is
natural to assume that all the sample points are equally likely.

In this case, we can calculate the probability of an event A as

    P(A) = |A| / |S|,

where |A| denotes the number of sample points in the event A.

2.2.3 Example 2.3 continued

It is often useful in simple examples like this to draw a diagram (known as


a Venn diagram) to represent the sample space, and then label specific
events in the diagram by grouping together individual sample points.
S = {1, 2, 3, 4, 5, 6}    A1 = {2, 4, 6}    A2 = {5, 6}

    P(A1) = |A1| / |S| = 3/6 = 1/2

    P(A2) = |A2| / |S| = 2/6 = 1/3

2.2.4 Intersection

What about P(face is even, and larger than 4)?

We can write this event in set notation as A1 ∩ A2.
This is the intersection of the two events A1 and A2,
i.e. the set of elements which belong to both A1 and A2.

    A1 ∩ A2 = {6}

    P(A1 ∩ A2) = |A1 ∩ A2| / |S| = 1/6

2.2.5 Union

What about P(face is even, or larger than 4)?

We can write this event in set notation as A1 ∪ A2.
This is the union of the two events A1 and A2,
i.e. the set of elements which belong to either A1 or A2 or both.

    A1 ∪ A2 = {2, 4, 5, 6}

    P(A1 ∪ A2) = |A1 ∪ A2| / |S| = 4/6 = 2/3

2.2.6 Complement

What about P(face is not even)?

We can write this event in set notation as A1^c.
This is the complement of the event A1,
i.e. the set of elements which do not belong to A1.

    A1^c = {1, 3, 5}

    P(A1^c) = |A1^c| / |S| = 3/6 = 1/2

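Python's built-in sets mirror this event algebra directly. The sketch
below is a minimal illustration (not part of the original notes)
reproducing the die calculations: `&` is intersection, `|` is union, and
`-` gives the complement relative to S.

```python
from fractions import Fraction

S  = {1, 2, 3, 4, 5, 6}      # sample space of one die roll
A1 = {2, 4, 6}               # face is even
A2 = {5, 6}                  # face is larger than 4

def prob(event):
    """P(A) = |A| / |S|, valid when all sample points are equally likely."""
    return Fraction(len(event), len(S))

print(prob(A1))              # 1/2
print(prob(A2))              # 1/3
print(prob(A1 & A2))         # intersection A1 ∩ A2: 1/6
print(prob(A1 | A2))         # union A1 ∪ A2: 2/3
print(prob(S - A1))          # complement A1^c: 1/2
```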

2.3 Probability in more general settings

In many settings, either the sample space is infinite or all possible outcomes
of the experiment are not equally likely. We still wish to associate probabilities with events of interest. Luckily, there are some rules/laws that allow
us to calculate and manipulate such probabilities with ease.

2.3.1 Probability Axioms (Building Blocks)

Before we consider the probability rules we need to know about the axioms
(or mathematical building blocks) upon which these rules are built. There
are three axioms which we need in order to develop our laws.

(i). 0 ≤ P(A) ≤ 1 for any event A.

This axiom says that probabilities must lie between 0 and 1.

(ii). P(S) = 1.

This axiom says that the probability of everything in the sample space is
1. This says that the sample space is complete and that there are no
sample points or events outside the sample space that can occur in our
experiment.

(iii). If A1, ..., An are mutually exclusive events, then

    P(A1 ∪ ... ∪ An) = P(A1) + ... + P(An).

A set of events are mutually exclusive if at most one of the events
can occur in a given experiment.

This axiom says that to calculate the probability of the union of
mutually exclusive events we can simply add their individual
probabilities.

2.3.2 Complement Law

If A is an event, the set of all outcomes that are not in A is called the
complement of the event A, denoted A^c (pronounced "A complement").
The rule is

    P(A^c) = 1 - P(A)        (Complement Law)

Example 2.4: Complements

Let S (the sample space) be the set of students at Oxford. We
are picking a student at random.

Let A = The event that the randomly selected student suffers
from depression.

We are told that 8% of students suffer from depression, so
P(A) = 0.08. What is the probability that a student does not
suffer from depression?

The event {student does not suffer from depression} is A^c. If
P(A) = 0.08 then P(A^c) = 1 - 0.08 = 0.92.

2.3.3 Addition Law (Union)

Suppose,

A = The event that a randomly selected student from the class has brown eyes
B = The event that a randomly selected student from the class has blue eyes

What is the probability that a student has brown eyes OR blue eyes?
This is the union of the two events A and B, denoted A ∪ B (pronounced
"A or B").

We want to calculate P(A ∪ B). In general, for two events,

    P(A ∪ B) = P(A) + P(B) - P(A ∩ B)        (Addition Law)

To understand this law, consider a Venn diagram of the situation in which
we have two events A and B. The event A ∪ B is represented in such a
diagram by the combined sample points enclosed by A or B. If we simply
add P(A) and P(B) we will count the sample points in the intersection
A ∩ B twice, and thus we need to subtract P(A ∩ B) from P(A) + P(B) to
calculate P(A ∪ B).

Example 2.5: SNPs

Single nucleotide polymorphisms (SNPs) are nucleotide positions
in a genome which exhibit variation amongst individuals in a
species. In some studies in humans, SNPs are discovered in
European populations. Suppose that of such SNPs, 70% also show
variation in an African population, 80% show variation in an
Asian population and 60% show variation in both the African
and Asian population.

Suppose one such SNP is chosen at random; what is the
probability that it is variable in either the African or the
Asian population?

Write A for the event that the SNP is variable in Africa, and
B for the event that it is variable in Asia. We are told

    P(A) = 0.7
    P(B) = 0.8
    P(A ∩ B) = 0.6.

We require P(A ∪ B). From the addition rule:

    P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
             = 0.7 + 0.8 - 0.6
             = 0.9.


Lecture 3

Probability II

3.1 Independence and the Multiplication Law

If the probability that one event A occurs doesn't affect the probability
that the event B also occurs, then we say that A and B are independent.
For example, it seems clear that one coin doesn't know what happened to
the other one (and if it did know, it wouldn't care), so if A1 is the
event that the first coin comes up heads, and A2 the event that the second
coin comes up heads, then A1 and A2 are independent:
P(A1 ∩ A2) = P(A1)P(A2).

Example 3.1: One die, continued

Continuing from Example 2.3, with event A1 = {face is even} =
{2, 4, 6} and A2 = {face is greater than 4} = {5, 6}, we see
that A1 ∩ A2 = {6} and

    P(A1) = 3/6 = 0.5,
    P(A2) = 2/6 = 0.33,
    P(A1 ∩ A2) = 1/6 = P(A1) × P(A2).

On the other hand, if A3 = {4, 5, 6}, then A3 and A1 are not
independent.


Example 3.2: Two dice

Suppose we roll two dice. The sample space may be represented
as pairs (first roll, second roll):

    (1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
    (2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
    (3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
    (4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
    (5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
    (6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

There are 36 points in the sample space. These are all equally
likely. Thus, each point has probability 1/36. Consider the
events

    A = {First roll is even},
    B = {Second roll is bigger than 4},
    A ∩ B = {First roll is even and Second roll is bigger than 4}.

Figure 3.1: Events A = {First roll is even} and
B = {Second roll is bigger than 4} marked on the sample space.

We see from Figure 3.1 that A contains 18 points and B contains
12 points, so that P(A) = 18/36 = 1/2, and P(B) = 12/36 = 1/3.
Meanwhile, A ∩ B contains 6 points, so P(A ∩ B) = 6/36 = 1/6 =
1/2 × 1/3. Thus A and B are independent. This should not
be surprising: A depends only on the first roll, and B depends
on the second. These two have no effect on each other, so the
events must be independent.
This points up an important rule:

    Events that depend on experiments that can't influence each
    other are always independent.

Thus, two (or more) coin flips are always independent. But this is also
relevant to analysing experiments such as those of Example 2.1. If the
drug has no effect on survival, then events like {patient # 612 survived}
are independent of events like {patient # 612 was allocated to the control
group}.
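Independence claims like these are easy to verify by brute-force
enumeration, since the sample space has only 36 equally likely points.
The following sketch is an illustration (not part of the original notes);
it checks the events A and B above, and also the events C and D of
Example 3.3 below:

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))       # all 36 (first, second) rolls

def prob(event):
    """P(E) = |E| / |S|, counting the outcomes where `event` holds."""
    return Fraction(sum(1 for s in S if event(s)), len(S))

A = lambda s: s[0] % 2 == 0                    # first roll is even
B = lambda s: s[1] > 4                         # second roll is bigger than 4
C = lambda s: s[0] + s[1] > 8                  # sum is bigger than 8
D = lambda s: s[0] + s[1] == 9                 # sum is exactly 9

for name, E in [("B", B), ("C", C), ("D", D)]:
    independent = prob(lambda s: A(s) and E(s)) == prob(A) * prob(E)
    print(name, independent)                   # B True, C False, D True
```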

Example 3.3: Two dice

Suppose we roll two dice. Consider the events

    A = {First roll is even},
    C = {Sum of the two rolls is bigger than 8},
    A ∩ C = {First roll is even and sum is bigger than 8}.

Then we see from Figure 3.2 that P(C) = 10/36, and
P(A ∩ C) = 6/36 ≠ 10/36 × 1/2. On the other hand, if we replace
C by D = {Sum of the two rolls is exactly 9}, then we see from
Figure 3.3 that P(D) = 4/36 = 1/9, and P(A ∩ D) = 2/36 =
1/9 × 1/2, so the events A and D are independent. We see that
events may be independent, even if they are not based on separate
experiments.


Figure 3.2: Events A = {First roll is even} and
C = {Sum is bigger than 8} marked on the sample space.

Figure 3.3: Events A = {First roll is even} and
D = {Sum is exactly 9} marked on the sample space.

Example 3.4: DNA Fingerprinting

A simplified description of how DNA fingerprinting works is this:
The police find biological material connected to a crime. In the
laboratory, they identify certain SNPs, finding out which version
of each SNP the presumed culprit has. Then, they either search
a database for someone who has the same versions of all the
SNPs, or compare these SNPs to those of a suspect.


Searching a database can be potentially problematic. Imagine
that the laboratory has found 12 SNPs, at which the crime-scene
DNA has rare versions, each of which is found in only 10% of the
population. They then search a database and find someone with
all the same SNP versions. The expert then comes and testifies,
to say that the probability of any single person having the same
SNPs is (1/10)^12 = 1 in 1 trillion. There are only 60 million
people in the UK, so the probability of there being another
person with the same SNPs is only about 60 million/1 trillion =
0.00006, less than 1 in ten thousand. So it can't be mistaken
identity.

Except... The events of having particular variants at different
SNPs are not independent. For one thing, some SNPs in one
population (Europeans, for example) may not be SNPs in another
population (Asians, for example), where everyone may have the
same variant. Thus, the 10% of the population that has each of
these different rare SNP variants could in fact be the same 10%,
and they may have all of these dozen variants in common because
they all come from the same place, where everyone has just those
variants.

And don't forget to check whether the suspect has a monozygotic
twin! More than 1 person in a thousand has one, and in that
case, the twins will have all the same rare SNPs, because their
genomes are identical.

3.2 Conditional Probability Laws

Suppose,

A = The event that a randomly selected student from the class has a bike
B = The event that a randomly selected student from the class has blue eyes

and P(A) = 0.36, P(B) = 0.45 and P(A ∩ B) = 0.13.

What is the probability that a student has a bike GIVEN that the student
has blue eyes? In other words: considering just students who have blue
eyes, what is the probability that a randomly selected student has a bike?


This is a conditional probability.

We write conditional probabilities as P(B|A) (pronounced "probability of
B given A"). Think of P(B|A) as how much of A is taken up by B. Then we
see that

    P(B|A) = P(A ∩ B) / P(A)        (Conditional Probability Law)

For the question above, the probability that a student has a bike given
blue eyes is P(A|B) = P(A ∩ B) / P(B) = 0.13 / 0.45 = 0.29 (to 2 decimal
places).

Example 3.5: SNPs again

We return to the setting of Example 2.5. What is the probability
that a SNP is variable in the African population given that it is
variable in the Asian population?

We have that

    P(A) = 0.7
    P(B) = 0.8
    P(A ∩ B) = 0.6.

We want

    P(A|B) = P(A ∩ B) / P(B) = 0.6 / 0.8 = 0.75

We can rearrange the conditional probability law to obtain a general
Multiplication Law:

    P(B|A) = P(A ∩ B) / P(A)   so   P(B|A)P(A) = P(A ∩ B)

Similarly, P(A|B)P(B) = P(A ∩ B), and so

    P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B)        (Multiplication Law)

Example 3.6: Multiplication Law

If P(B) = 0.2 and P(A|B) = 0.36, what is P(A ∩ B)?

    P(A ∩ B) = P(A|B)P(B) = 0.36 × 0.2 = 0.072

3.2.1 Independence of Events

Definition: Two events A and B are said to be independent if

    P(A ∩ B) = P(A)P(B).

Note that in this case (provided P(B) > 0), if A and B are independent,

    P(A|B) = P(A ∩ B) / P(B) = P(A)P(B) / P(B) = P(A),

and similarly P(B|A) = P(B) (provided P(A) > 0).

So for independent events, knowledge that one of the events has occurred
does not change our assessment of the probability that the other event
has occurred.

Example 3.7: Snails

In a population of a particular species of snail, individuals
exhibit different forms. It is known that 45% have a pink
background colouring, while 55% have a yellow background
colouring. In addition, 30% of individuals are striped, and 20%
of the population are pink and striped.

1. Is the presence or absence of striping independent of
background colour?
2. Given that a snail is pink, what is the probability that it
will have stripes?

Define the events: A, B, that a snail has a pink, respectively
yellow, background colouring, and S for the event that it has
stripes.

Then we are told P(A) = 0.45, P(B) = 0.55, P(S) = 0.3, and
P(A ∩ S) = 0.2.

For part (1), note that

    0.2 = P(A ∩ S) ≠ 0.135 = P(A)P(S),

so the events A and S are not independent.

For part (2),

    P(S|A) = P(S ∩ A) / P(A) = 0.2 / 0.45 = 0.44.

Thus, knowledge that a snail has a pink background colouring
increases the probability that it is striped. (That P(S|A) ≠ P(S)
also establishes that background colouring and the presence of
stripes are not independent.)

3.2.2 The Partition Law

The partition law is a very useful rule that allows us to calculate the
probability of an event by splitting it up into a number of mutually
exclusive events. For example, suppose we know that P(A ∩ B) = 0.52 and
P(A ∩ B^c) = 0.14; what is P(A)?

P(A) is made up of two parts: (i) the part of A contained in B, and
(ii) the part of A contained in B^c.

So we have the rule

    P(A) = P(A ∩ B) + P(A ∩ B^c)

and P(A) = 0.52 + 0.14 = 0.66.

More generally, events are mutually exclusive if at most one of the
events can occur in a given experiment. Suppose E1, ..., En are mutually
exclusive events, which together form the whole sample space:
E1 ∪ E2 ∪ ... ∪ En = S. (In other words, every possible outcome is in
exactly one of the E's.) Then

    P(A) = Σ_{i=1}^{n} P(A ∩ Ei) = Σ_{i=1}^{n} P(A|Ei)P(Ei)


Example 3.8: Mendelian segregation

At a particular locus in humans, there are two alleles A and B,
and it is known that the population frequencies of the genotypes
AA, AB, and BB, are 0.49, 0.42, and 0.09, respectively. An AA
man has a child with a woman whose genotype is unknown.
What is the probability that the child will have genotype AB?
We assume that as far as her genotype at this locus is concerned
the woman is chosen randomly from the population.

Use the partition rule, where the partition corresponds to the
three possible genotypes for the woman. Then

    P(child AB) = P(child AB and mother AA)
                  + P(child AB and mother AB)
                  + P(child AB and mother BB)
                = P(mother AA)P(child AB|mother AA)
                  + P(mother AB)P(child AB|mother AB)
                  + P(mother BB)P(child AB|mother BB)
                = 0.49 × 0 + 0.42 × 0.5 + 0.09 × 1
                = 0.3.

3.3 Bayes' Rule

One of the most common situations in science is that we have some
observations, and we need to figure out what state of the world is likely
to have produced those observations. For instance, we observe that a
certain number of vaccinated people contract polio, and a certain number
of unvaccinated people contract polio, and we need to figure out how
effective the vaccine is. The problem is, our theoretical knowledge goes
in the wrong direction: it tells us, if the vaccine is so effective, how
many people will contract polio. Bayes' Rule allows us to turn the
inference around.


Example 3.9: Medical testing

In a medical setting we might want to calculate the probability
that a person has a disease D given they have a specific symptom
S, i.e. we want to calculate P(D|S). This is a hard probability
to assign as we would need to take a random sample of people
from the population with the symptom.

A probability that is much easier to calculate is P(S|D), i.e.
the probability that a person with the disease has the symptom.
This probability can be assigned much more easily as medical
records for people with serious diseases are kept.

The power of Bayes' Rule is its ability to take P(S|D) and
calculate P(D|S).

We have already seen a version of Bayes' Rule:

    P(B|A) = P(A ∩ B) / P(A)

Using the Multiplication Law we can rewrite this as

    P(B|A) = P(A|B)P(B) / P(A)        (Bayes' Rule)

Suppose P(S|D) = 0.12, P(D) = 0.01 and P(S) = 0.03. Then

    P(D|S) = (0.12 × 0.01) / 0.03 = 0.04

Example 3.10: Genetic testing

A gene has two possible types A1 and A2. 75% of the population
have A1. B is a disease that has 3 forms: B1 (mild), B2 (severe)
and B3 (lethal). A1 is a protective gene, with the probabilities
of having the three forms given A1 as 0.9, 0.1 and 0
respectively. People with A2 are unprotected and have the three
forms with probabilities 0, 0.5 and 0.5 respectively.

What is the probability that a person has gene A1 given they
have the severe disease?

The first thing to do with such a question is decode the
information, i.e. write it down in a compact form we can work
with:

    P(A1) = 0.75        P(A2) = 0.25

    P(B1|A1) = 0.9      P(B2|A1) = 0.1      P(B3|A1) = 0
    P(B1|A2) = 0        P(B2|A2) = 0.5      P(B3|A2) = 0.5

We want P(A1|B2).

From Bayes' Rule we know that

    P(A1|B2) = P(B2|A1)P(A1) / P(B2)

We know P(B2|A1) and P(A1), but what is P(B2)? We can use the
Partition Law, since A1 and A2 are mutually exclusive:

    P(B2) = P(B2 ∩ A1) + P(B2 ∩ A2)              (Partition Law)
          = P(B2|A1)P(A1) + P(B2|A2)P(A2)        (Multiplication Law)
          = 0.1 × 0.75 + 0.5 × 0.25
          = 0.2

We can now use Bayes' Rule to calculate P(A1|B2):

    P(A1|B2) = (0.1 × 0.75) / 0.2 = 0.375
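The same calculation can be organised as a short Python function. This is
a minimal sketch (the dictionary and function names are my own, not from
the notes) of Bayes' Rule with the Partition Law supplying the
denominator:

```python
# Numbers from Example 3.10: gene types A1, A2 and disease forms B1, B2, B3
p_gene = {"A1": 0.75, "A2": 0.25}
p_form_given_gene = {
    "A1": {"B1": 0.9, "B2": 0.1, "B3": 0.0},
    "A2": {"B1": 0.0, "B2": 0.5, "B3": 0.5},
}

def posterior(gene, form):
    """P(gene | form) via Bayes' Rule; P(form) comes from the Partition Law."""
    p_form = sum(p_form_given_gene[g][form] * p_gene[g] for g in p_gene)
    return p_form_given_gene[gene][form] * p_gene[gene] / p_form

print(posterior("A1", "B2"))   # 0.375, as computed above
```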

3.4 Probability Laws
    P(A^c) = 1 - P(A)                              (Complement Law)

    P(A ∪ B) = P(A) + P(B) - P(A ∩ B)              (Addition Law)

    P(B|A) = P(A ∩ B) / P(A)                       (Conditional Probability Law)

    P(A ∩ B) = P(B|A)P(A) = P(A|B)P(B)             (Multiplication Law)

    If E1, ..., En are a set of mutually exclusive events then
    P(A) = Σ_{i=1}^{n} P(A ∩ Ei)
         = Σ_{i=1}^{n} P(A|Ei)P(Ei)                (Partition Law)

    P(B|A) = P(A|B)P(B) / P(A)                     (Bayes' Rule)

3.5 Permutations and Combinations (Probabilities of patterns)

In some situations we observe a specific pattern from a large number of


possible patterns. To calculate the probability of the pattern we need to
count the total number of patterns. This is why we need to learn about
permutations and combinations.

3.5.1 Permutations of n objects

Consider 2 objects, A and B.

Q. How many ways can they be arranged? i.e. how many permutations are
there?

A. 2 ways:    AB    BA

Consider 3 objects, A, B and C.

Q. How many ways can they be arranged (permuted)?

A. 6 ways:    ABC   ACB   BAC   BCA   CAB   CBA

Consider 4 objects, A, B, C and D.

Q. How many ways can they be arranged (permuted)?

A. 24 ways:

    ABCD  ABDC  ACBD  ACDB  ADBC  ADCB
    BACD  BADC  BCAD  BCDA  BDAC  BDCA
    CABD  CADB  CBAD  CBDA  CDAB  CDBA
    DABC  DACB  DBAC  DBCA  DCAB  DCBA

There is a pattern emerging here.

    No. of objects         2    3    4     5      6     ...
    No. of permutations    2    6    24    120    720   ...

Can we find a formula for the number of permutations of n objects?

A good way to think about permutations is to think of putting objects
into boxes. Suppose we have 5 objects. How many different ways can we
place them into 5 boxes?

There are 5 choices of object for the first box. There are now only 4
objects to choose from for the second box. There are 3 choices for the
3rd box, 2 for the 4th and 1 for the 5th box.

Thus, the number of permutations of 5 objects is 5 × 4 × 3 × 2 × 1.

In general, the number of permutations of n objects is

    n(n - 1)(n - 2) ... (3)(2)(1)

We write this as n! (pronounced "n factorial"). There should be a button
on your calculator that calculates factorials.

3.5.2 Permutations of r objects from n

Now suppose we have 4 objects and only 2 boxes. How many permutations of
2 objects are there when we have 4 to choose from?

There are 4 choices for the first box and 3 choices for the second box,
so there are 4 × 3 = 12 permutations of 2 objects from 4. We write this as

    4P2 = 12

We say there are nPr permutations of r objects chosen from n. The formula
for nPr is given by

    nPr = n! / (n - r)!

To see why this works, consider the example above, 4P2. Using the formula
we get

    4P2 = 4! / 2! = (4 × 3 × 2 × 1) / (2 × 1) = 4 × 3 = 12


3.5.3 Combinations of r objects from n

Now consider the number of ways of choosing 2 objects from 4 when the
order doesn't matter. We just want to count the number of possible
combinations.

We know that there are 12 permutations when choosing 2 objects from 4.
These are

    AB  AC  AD  BC  BD  CD
    BA  CA  DA  CB  DB  DC

Notice how the permutations are grouped in 2s which are the same
combination of letters. Thus there are 12/2 = 6 possible combinations:

    AB  AC  AD  BC  BD  CD

We write this as

    4C2 = 6

We say there are nCr combinations of r objects chosen from n. The formula
for nCr is given by

    nCr = n! / ((n - r)! r!)

Another way of writing this formula that makes it clearer is

    nCr = nPr / r!

Effectively this says we count the number of permutations of r objects
from n and then divide by r!, because the nPr permutations will occur in
groups of r! that are the same combination.
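These formulas translate directly into code. A quick illustrative check
in Python, using only the standard library (`math.comb` computes nCr
directly):

```python
import math

def n_P_r(n, r):
    """Number of ordered arrangements (permutations) of r objects from n."""
    return math.factorial(n) // math.factorial(n - r)

def n_C_r(n, r):
    """Number of unordered selections: nPr divided by the r! orderings."""
    return n_P_r(n, r) // math.factorial(r)

print(n_P_r(4, 2))                       # 12
print(n_C_r(4, 2))                       # 6
print(n_C_r(4, 2) == math.comb(4, 2))    # True: math.comb agrees
```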

3.6 Worked Examples

(i). Four letters are chosen at random from the word RANDOMLY. Find
the probability that all four letters chosen are consonants.

There are 8 letters: 6 consonants and 2 vowels.

    P(all four are consonants) = (no. of ways of choosing 4 consonants)
                                 / (no. of ways of choosing 4 letters)

    No. of ways of choosing 4 consonants = 6C4 = 6! / (4! 2!) = 15
    No. of ways of choosing 4 letters    = 8C4 = 8! / (4! 4!) = 70

    P(all four are consonants) = 15/70 = 3/14

(ii). A bag contains 8 white counters and 3 black counters. Two counters
are drawn, one after the other. Find the probability of drawing one
white and one black counter, in any order
(a) if the first counter is replaced
(b) if the first counter is not replaced
What is the probability that the second counter is black (assume that
the first counter is replaced after it is taken)?
A useful way of tackling many probability problems is to draw a probability tree. The branches of the tree represent different possible events.
Each branch is labelled with the probability of choosing it given what
has occurred before. The probability of a given route through the tree
can then be calculated by multiplying all the probabilities along that
route (using the Multiplication Rule)
(a) With replacement
Let
W1 be the event a white counter is drawn first,
W2 be the event a white counter is drawn second,
B1 be the event a black counter is drawn first,
B2 be the event a black counter is drawn second,

P(one white and one black counter) = P(W1 ∩ B2) + P(B1 ∩ W2)
                                   = 24/121 + 24/121
                                   = 48/121

The four routes through the tree have probabilities

    P(W1 ∩ W2) = (8/11)(8/11) = 64/121
    P(W1 ∩ B2) = (8/11)(3/11) = 24/121
    P(B1 ∩ W2) = (3/11)(8/11) = 24/121
    P(B1 ∩ B2) = (3/11)(3/11) = 9/121

(b) Without replacement

Now the second-stage probabilities change, because one counter has been
removed:

    P(W1 ∩ W2) = (8/11)(7/10) = 56/110
    P(W1 ∩ B2) = (8/11)(3/10) = 24/110
    P(B1 ∩ W2) = (3/11)(8/10) = 24/110
    P(B1 ∩ B2) = (3/11)(2/10) = 6/110

so

P(one white and one black counter) = P(W1 ∩ B2) + P(B1 ∩ W2)
                                   = 24/110 + 24/110
                                   = 48/110

For the final part (with replacement, as stated):

P(second counter is black) = P(W1 ∩ B2) + P(B1 ∩ B2)
                           = 24/121 + 9/121
                           = 33/121 = 3/11
(iii). From 2001 TT Prelim Q1: Two drugs that relieve pain are available
to treat patients. Drug A has been found to be effective in
three-quarters of all patients; when it is effective, the patients have
relief from pain one hour after taking this drug. Drug B acts quicker but
only works with one half of all patients: those who benefit from this
drug have relief of pain after 30 mins. The physician cannot decide which
patients should be prescribed which drug so he prescribes randomly.
Assuming that there is no variation between patients in the times taken
to act for either drug, calculate the probability that:

(a) a patient is prescribed drug B and is relieved of pain;
(b) a patient is relieved of pain after one hour;
(c) a patient who was relieved of pain after one hour took drug A;
(d) two patients receiving different drugs are both relieved of pain
after one hour;
(e) out of six patients treated with the same drug, three are relieved
of pain after one hour and three are not.

Let
R30 = The event that a patient is relieved of pain within 30 mins
R60 = The event that a patient is relieved of pain within 60 mins
A = Event that a patient takes drug A
B = Event that a patient takes drug B

    P(R60|A) = 0.75    P(R30|B) = 0.5    P(A) = P(B) = 0.5

(a) P(R30 ∩ B) = P(R30|B)P(B) = 0.5 × 0.5 = 0.25

(b) P(R60) = P(R60|A)P(A) + P(R60|B)P(B), since R30 ⊂ R60:
    P(R60) = 0.75 × 0.5 + 0.5 × 0.5 = 0.625

(c) P(A|R60) = P(A ∩ R60) / P(R60)
    P(A ∩ R60) = P(R60|A)P(A) = 0.75 × 0.5 = 0.375
    P(A|R60) = 0.375 / 0.625 = 0.6

(d) P(R60|A) × P(R60|B) = 0.75 × 0.5 = 0.375,
    assuming the events are independent.

(e) n = 6, X = no. of patients relieved of pain after 1 hr.
    For A, p = P(R60|A) = 0.75:
        P(X = 3|A) = 6C3 (0.75)^3 (0.25)^3 = 0.1318
    For B, p = P(R60|B) = 0.5:
        P(X = 3|B) = 6C3 (0.5)^3 (0.5)^3 = 0.3125
    P(X = 3) = P(X = 3|A)P(A) + P(X = 3|B)P(B)
             = 0.1318 × 0.5 + 0.3125 × 0.5 = 0.2222

(iv). In the National Lottery you need to choose 6 balls from 49.
What is the probability that I choose all 6 balls correctly?

There are 2 ways of answering this question:
(i) using permutations and combinations
(ii) using a tree diagram


Method 1 - using permutations and combinations

    P(6 correct) = (No. of ways of choosing the 6 correct balls)
                   / (No. of ways of choosing 6 balls)

                 = 6P6 / 49P6

                 = (6!/0!) / (49!/43!)

                 = (6 × 5 × 4 × 3 × 2 × 1) / (49 × 48 × 47 × 46 × 45 × 44)

                 = 0.0000000715112    (about 1 in 14 million)

Method 2 - using a tree diagram

Consider the first ball I choose: the probability it is correct is 6/49.
The second ball I choose is correct with probability 5/48. The third ball
I choose is correct with probability 4/47, and so on. Thus the
probability that I get all 6 balls correct is

    (6/49) × (5/48) × (4/47) × (3/46) × (2/45) × (1/44) = 0.0000000715112

(1 in 14 million)
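A third way, for those with a computer rather than a calculator to hand,
is a two-line Python check (a small illustration using the standard
library):

```python
import math

tickets = math.comb(49, 6)     # number of ways of choosing 6 balls from 49
print(tickets)                 # 13983816, i.e. about 14 million
print(1 / tickets)             # about 7.15e-08, matching the answer above
```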


Lecture 4

The Binomial Distribution

4.1 Introduction

In Lecture 2 we saw that we need to study probability so that we can
calculate the chance that our sample leads us to the "wrong" conclusion
about the population. To do this in practice we need to model the process
of taking the sample from the population. By "model" we mean a
description of the process of taking the sample in terms of the
probability of obtaining each possible sample. Since there are many
different types of data and many different ways we might collect a sample
of data, we need lots of different probability models. The Binomial
distribution is one such model that turns out to be very useful in many
experimental settings.

4.2 An example of the Binomial distribution

Suppose we have a box with a very large number(1) of balls in it: 2/3 of
the balls are black and the rest are red. We draw 5 balls from the box.
How many black balls do we get? We can write

    X = No. of black balls in 5 draws.

X can take on any of the values 0, 1, 2, 3, 4 and 5.
X is a discrete random variable.

(1) We say a very large number when we want to ignore the change in
probability that comes from drawing without replacement. Alternatively,
we could have a small number of balls (2 black and 1 red, for instance)
but replace the ball (and mix well!) after each draw.


Some values of X will be more likely to occur than others. Each value of
X will have a probability of occurring. What are these probabilities?

Consider the probability of obtaining just one black ball, i.e. X = 1.
One possible way of obtaining one black ball is if we observe the pattern
BRRRR. The probability of obtaining this pattern is

    P(BRRRR) = (2/3) × (1/3) × (1/3) × (1/3) × (1/3)

There are 32 possible patterns of black and red balls we might observe:

    BBBBB  BBBBR  BBBRB  BBBRR
    BBRBB  BBRBR  BBRRB  BBRRR
    BRBBB  BRBBR  BRBRB  BRBRR
    BRRBB  BRRBR  BRRRB  BRRRR
    RBBBB  RBBBR  RBBRB  RBBRR
    RBRBB  RBRBR  RBRRB  RBRRR
    RRBBB  RRBBR  RRBRB  RRBRR
    RRRBB  RRRBR  RRRRB  RRRRR

5 of the patterns contain just one black ball (BRRRR, RBRRR, RRBRR,
RRRBR, RRRRB). These 5 patterns all have the same probability, so the
probability of obtaining one black ball in 5 draws is

    P(X = 1) = 5 × (2/3) × (1/3)^4 = 0.0412 (to 4 decimal places)
What about P(X = 2)? This probability can be written as

    P(X = 2) = (No. of patterns) × (Probability of a pattern)
             = 5C2 × (2/3)^2 × (1/3)^3
             = 10 × (4/243)
             = 0.165

It's now just a small step to write down a formula for this specific
situation in which we draw 5 balls:

    P(X = x) = 5Cx (2/3)^x (1/3)^(5-x)

We can use this formula to tabulate the probabilities of each possible
value of X:

    P(X = 0) = 5C0 (2/3)^0 (1/3)^5 = 0.0041
    P(X = 1) = 5C1 (2/3)^1 (1/3)^4 = 0.0412
    P(X = 2) = 5C2 (2/3)^2 (1/3)^3 = 0.1646
    P(X = 3) = 5C3 (2/3)^3 (1/3)^2 = 0.3292
    P(X = 4) = 5C4 (2/3)^4 (1/3)^1 = 0.3292
    P(X = 5) = 5C5 (2/3)^5 (1/3)^0 = 0.1317

These probabilities are plotted in Figure 4.1 against the values of X.
This shows the distribution of probabilities across the possible values
of X. This situation is a specific example of a Binomial distribution.

Figure 4.1: A plot of the Binomial(5, 2/3) probabilities.
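The probabilities tabulated above can be reproduced with a few lines of
Python. This is an illustrative sketch, not part of the original notes;
`math.comb(n, x)` plays the role of nCx:

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p): nCx * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

for x in range(6):
    print(x, round(binom_pmf(x, 5, 2/3), 4))
# 0 0.0041, 1 0.0412, 2 0.1646, 3 0.3292, 4 0.3292, 5 0.1317
```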


Note It is important to make a distinction between the probability distribution shown in Figure 4.1 and the histograms of specific datasets seen
in Lecture 1. A probability distribution represents the distribution of values we expect to see in a sample. A histogram is used to represent the
distribution of values that actually occur in a given sample.


4.3 The Binomial distribution

The key components of a Binomial distribution

In general a Binomial distribution arises when we have the following 4
conditions:

- n identical trials, e.g. 5 coin tosses
- 2 possible outcomes for each trial ("success" and "failure"), e.g.
  Heads or Tails
- Trials are independent, e.g. each coin toss doesn't affect the others
- P(success) = p is the same for each trial, e.g. P(Black) = 2/3 is the
  same for each trial

Binomial distribution probabilities

If we have the above 4 conditions then, letting

    X = No. of successes,

the probability of observing x successes out of n trials is given by

    P(X = x) = nCx p^x (1 - p)^(n-x),    x = 0, 1, ..., n

If the probabilities of X are distributed in this way, we write

    X ~ Bin(n, p)

n and p are called the parameters of the distribution. We say X follows
a Binomial distribution with parameters n and p.

Examples

With this general formula we can calculate many different probabilities.

(i). Suppose X ~ Bin(10, 0.4); what is P(X = 7)?

    P(X = 7) = 10C7 (0.4)^7 (1 - 0.4)^(10-7)
             = 120 × (0.4)^7 × (0.6)^3
             = 0.0425

(ii). Suppose Y ~ Bin(8, 0.15); what is P(Y < 3)?

    P(Y < 3) = P(Y = 0) + P(Y = 1) + P(Y = 2)
             = 8C0 (0.15)^0 (0.85)^8 + 8C1 (0.15)^1 (0.85)^7
               + 8C2 (0.15)^2 (0.85)^6
             = 0.2725 + 0.3847 + 0.2376
             = 0.8948

(iii). Suppose W ~ Bin(50, 0.12); what is P(W > 2)?

    P(W > 2) = P(W = 3) + P(W = 4) + ... + P(W = 50)
             = 1 - P(W ≤ 2)
             = 1 - (P(W = 0) + P(W = 1) + P(W = 2))
             = 1 - (50C0 (0.12)^0 (0.88)^50 + 50C1 (0.12)^1 (0.88)^49
                    + 50C2 (0.12)^2 (0.88)^48)
             = 1 - (0.00168 + 0.01142 + 0.03817)
             = 0.94874

4.4 The mean and variance of the Binomial distribution

Different values of n and p lead to different distributions with
different shapes (see Figure 4.2). In Lecture 1 we saw that the mean and
standard deviation can be used to summarize the shape of a dataset. In
the case of a probability distribution we have no data as such, so we
must use the probabilities to calculate the expected mean and standard
deviation. In other words, the mean and standard deviation of a random
variable is the mean and standard deviation that a collection of data
would have if the numbers appeared in exactly the proportions given by
the distribution. The mean of a distribution is also called the
expectation or expected value of the distribution.

Consider the example of the Binomial distribution we saw above:

    x          0      1      2      3      4      5
    P(X = x)   0.004  0.041  0.165  0.329  0.329  0.132

The expected mean value of the distribution, denoted μ, can be calculated
as

    μ = 0(0.004) + 1(0.041) + 2(0.165) + 3(0.329) + 4(0.329) + 5(0.132)
      = 3.333


Figure 4.2: 3 different Binomial distributions (n = 10 with p = 0.5,
p = 0.1, and p = 0.7).


In general, there is a formula for the mean of a Binomial distribution. There
is also a formula for the standard deviation, .
If X Bin(n, p) then
= np

=
npq

where q = 1 p

In the example above, X Bin(5, 2/3) and so the mean and standard
deviation are given by
= np = 5 (2/3) = 3.333
and
=

npq = 5 (2/3) (1/3) = 1.111

Shapes of Binomial distributions

The skewness of a Binomial distribution will also depend upon the values
of n and p (see Figure 4.2). In general,

- if p < 0.5 the distribution will exhibit POSITIVE SKEW
- if p = 0.5 the distribution will be SYMMETRIC
- if p > 0.5 the distribution will exhibit NEGATIVE SKEW

However, for a given value of p, the skewness goes down as n increases.
All binomial distributions eventually become approximately symmetric for
large n. This will be discussed further in Lecture 6.

4.5 Testing a hypothesis using the Binomial distribution

Consider the following simple situation: You have a six-sided die, and
you have the impression that it's somehow been weighted so that the
number 1 comes up more frequently than it should. How would you decide
whether this impression is correct? You could do a careful experiment,
where you roll the die 60 times, and count how often the 1 comes up.

Suppose you do the experiment, and the 1 comes up 30 times (and other
numbers come up 30 times all together). You might expect the 1 to come
up one time in six, so 10 times; so 30 times seems high. But is it too
high? There are two possible hypotheses:

(i). The die is biased.
(ii). Just by chance we got more 1s than expected.

How do we decide between these hypotheses? Of course, we can never prove
that any sequence of throws couldn't have come from a fair die. But we
can find that the results we got are extremely unlikely to have arisen
from a fair die, so that we should seriously consider whether the
alternative might be true.

Since the probability of a 1 on each throw is 1/6, we apply the formula
for the binomial distribution with n = 60 and p = 1/6.

Now we summarise the general approach:

- posit a hypothesis
- design and carry out an experiment to collect a sample of data
- test to see if the sample is consistent with the hypothesis

Hypothesis The die is fair. All 6 outcomes have the same probability.


Experiment We roll the die.

Sample We obtain 60 outcomes of a die roll.

Testing the hypothesis Assuming our hypothesis is true, what is the
probability that we would have observed such a sample, or a sample more
extreme? I.e., is our sample quite unlikely to have occurred under the
assumptions of our hypothesis?

Assuming our hypothesis is true, the experiment we carried out satisfies
the conditions of the Binomial distribution:

- n identical trials, i.e. 60 die rolls.
- 2 possible outcomes for each trial: "1" and "not 1".
- Trials are independent.
- P(success) = p is the same for each trial, i.e. P(1 comes up) = 1/6
  is the same for each trial.

We define X = No. of 1s that come up.

We observed X = 30. Which samples are more extreme than this? Under our
hypothesis we would expect X = 10, so X ≥ 30 are the samples as or more
extreme than X = 30. We can calculate each of these probabilities using
the Binomial probability formula:

    P(# 1s is exactly 30) = 60C30 (1/6)^30 (5/6)^(60-30) = 2.25 × 10^-9.

    P(# 1s is at least 30) = Σ_{x=30}^{60} 60Cx (1/6)^x (5/6)^(60-x)
                           = 2.78 × 10^-9.

Which is the appropriate probability? The strange event, from the
perspective of the fair die, was not that 1 came up exactly 30 times, but
that it came up so many times. So the relevant number is the second one,
which is a little bigger. Still, the probability is less than 3 in a
billion. In other words, if you were to perform one of these experiments
once a second, continuously, you might expect to see a result this
extreme once in 10 years. So you either have to believe that you just
happened to get that one-in-10-years outcome the one time you tried it,
or you have to believe that there really is something biased about the
die. In the language of hypothesis testing we say we would reject the
hypothesis.
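As a check on these numbers, the tail probability can be computed by
summing the Binomial formula over x = 30, ..., 60. A short sketch (an
illustration reusing the hypothetical `binom_pmf` helper from earlier):

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

# Probability of 30 or more 1s in 60 rolls of a fair die
p_tail = sum(binom_pmf(x, 60, 1/6) for x in range(30, 61))
print(p_tail)   # roughly 2.8e-09, i.e. less than 3 in a billion
```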

Example 4.1: Analysing the Anturane Reinfarction Trial

From Table 2.1, we know that there were 163 patients who died,
out of a total of 1629. Now, suppose the study works as follows:
Patients come in the door, we flip a coin, and allocate them to
the treatment group if heads comes up, or to the control group
if tails comes up. (This isn't exactly how it was done, but close
enough. Next term, we'll talk about other ways of getting the
same results.)

We had a total of 813 heads out of 1629, which is pretty close
to half, which seems reasonable. On the other hand, if we look
at the 163 coin flips for the patients who died, we only had 74
heads, which seems pretty far from half (which would be 81.5).
It seems there are two plausible hypotheses:

(i). Anturane works. In that case, it's perfectly reasonable to
expect that fewer patients who died got the anturane treatment.
(ii). Purely by chance we had fewer heads than expected.

One way of thinking about this formally is to use Bayes' rule:

    P(coin heads | patient died)
        = P(coin heads) × P(patient died | coin heads) / P(patient died)
        = (1/2) × P(patient died | anturane treatment) / P(patient died).

If the conditional probability of dying is lowered by anturane,


then retrospectively the coin flips for the patients who died have
less than probability 1/2 of coming up heads; but if anturane has
no effect, then these are fair coin flips, and should have come out
50% heads.



So how do we figure out which possibility is true? Of course,
we can never conclusively rule out the possibility of hypothesis
2. Any number of heads is possible. But we can say that
some numbers are extremely unlikely, indicating that it would
be advisable to accept hypothesis 1 anturane works rather
than believe that we had gotten such an exceptional result from
the coin flips.
So. . . 74 heads in 163 flips is not the most likely outcome. But
how unlikely is it? Lets consider three different probabilities:
P (# heads is exactly 74)
P (# heads is at most 74)
P (# heads is at most 74 or at least 89).
Which is the probability we want? Its pretty clearly not the first
one. After all, any particular number of heads is pretty unlikely
(and getting exactly 50% heads is impossible, since the number
of tosses was odd). And if we had gotten 73 heads, that would
have been considered even better evidence for hypothesis 1.
Choosing between the other two probabilities isnt so clear, though.
After all, if we want to answer the question How likely would it
be to get such a strange outcome purely by chance? we probably should consider all outcomes that would have seemed equally
strange, and 89 is as far away from the expected number of
heads as 74. There isnt a definitive answer to choosing between
these one-tailed and two-tailed tests, but we will have more
to say about this later in the course.
Now we compute the probabilities:

    P(# heads is exactly 74) = 163C74 (1/2)^74 (1/2)^(163-74) = 0.0314,

    P(# heads is at most 74)
        = Σ_{i=0}^{74} 163Ci (1/2)^i (1/2)^(163-i)
        = 163C0 (1/2)^0 (1/2)^163 + 163C1 (1/2)^1 (1/2)^162
          + ... + 163C74 (1/2)^74 (1/2)^89
        = 0.136,

    P(# heads at most 74 or at least 89)
        = Σ_{i=0}^{74} 163Ci (1/2)^i (1/2)^(163-i)
          + Σ_{i=89}^{163} 163Ci (1/2)^i (1/2)^(163-i)
        = 0.272.

Note that the two-tailed probability is exactly twice the one-tailed. We
show these probabilities on the probability histogram of Figure 4.3.

Figure 4.3: The tail probabilities for testing the hypothesis that
anturane has no effect. The lower tail (X ≤ 74) and the upper tail
(X ≥ 89) each have probability 0.136; the mean is 81.5, and we observe 74.

What is the conclusion? Even 0.136 doesn't seem like such a small
probability. This says that we would expect, even if anturane does
nothing at all, that we would see such a strong apparent effect about one
time out of seven. Our conclusion is that this experiment does not
provide significant evidence that anturane is effective. In the language
of hypothesis testing we say that we do not reject the hypothesis that
anturane is ineffective.
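The one-tailed and two-tailed probabilities above can be checked the same
way; a minimal sketch (again assuming the illustrative `binom_pmf` helper
defined earlier):

```python
import math

def binom_pmf(x, n, p):
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

lower = sum(binom_pmf(i, 163, 0.5) for i in range(0, 75))     # X <= 74
upper = sum(binom_pmf(i, 163, 0.5) for i in range(89, 164))   # X >= 89
print(round(lower, 3))          # 0.136, the one-tailed probability
print(round(lower + upper, 3))  # 0.272, the two-tailed probability
```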



Lecture 5

The Poisson Distribution

5.1 Introduction

Example 5.1: Drownings in Malta

The book [Mou98] cites data from the St. Luke's Hospital Gazette,
on the monthly number of drownings on Malta, over a period of
nearly 30 years (355 consecutive months). Most months there
were no drownings. Some months there was one person who
drowned. One month had four people drown. The data are
given as counts of the number of months in which a given number
of drownings occurred, and we repeat them here as Table 5.1.

Looking at the data in Table 5.1, we might suppose that one of
the following hypotheses is true:

- Some months are particularly dangerous;
- Or, on the contrary, when one person has drowned, the
  surrounding publicity makes others more cautious for a while,
  preventing drownings;
- Or, drownings are simply independent events.

How can we use the data to decide which of these hypotheses
is true? We might reasonably suppose that the first hypothesis
would predict that there would be more months with high numbers
of drownings than the independence hypothesis; the second



Table 5.1: Monthly counts of drownings in Malta.

    No. of drowning        Frequency (No. of
    deaths per month       months observed)
           0                     224
           1                     102
           2                      23
           3                       5
           4                       1
           5+                      0

hypothesis would predict fewer months with high numbers of
drownings. The problem is, we don't know how many we should
expect, if independence is correct.

What we need is a model: A sensible probability distribution,
giving the probability of a month having a certain number of
drownings, under the independence assumption. The standard
model for this sort of situation is called the Poisson distribution.
drownings, under the independence assumption. The standard
model for this sort of situation is called the Poisson distribution. 
The Poisson distribution is used in situations when we observe the counts
of events within a set unit of time, area, volume, length etc. For example,
The number of cases of a disease in different towns;
The number of mutations in given regions of a chromosome;
The number of dolphin pod sightings along a flight path through a
region;
The number of particles emitted by a radioactive source in a given
time;
The number of births per hour during a given day.
In such situations we are often interested in whether the events occur randomly in time or space. Consider the Babyboom dataset (Table 1.2), that
we saw in Lecture 1. The birth times of the babies throughout the day are
shown in Figure 5.1(a). If we divide up the day into 24 hour intervals and


count the number of births in each hour we can plot the counts as a histogram in Figure 5.1(b). How does this compare to the histogram of counts
for a process that isnt random? Suppose the 44 birth times were distributed
in time as shown in Figure 5.1(c). The histogram of these birth times per
hour is shown in Figure 5.1(d). We see that the non-random clustering of
events in time causes there to be more hours with zero births and more
hours with large numbers of births than the real birth times histogram.
This example illustrates that the distribution of counts is useful in uncovering whether the events might occur randomly or non-randomly in time
(or space). Simply looking at the histogram isnt sufficient if we want to
ask the question whether the events occur randomly or not. To answer
this question we need a probability model for the distribution of counts of
random events that dictates the type of distributions we should expect to
see.

5.2 The Poisson Distribution

The Poisson distribution is a discrete probability distribution for the
counts of events that occur randomly in a given interval of time (or
space).

If we let X = the number of events in a given interval, then, if the mean
number of events per interval is λ, the probability of observing x events
in a given interval is given by

    P(X = x) = e^(-λ) λ^x / x!        x = 0, 1, 2, 3, 4, ...

Note e is a mathematical constant, e ≈ 2.718282. There should be a button
on your calculator (e^x) that calculates powers of e.

If the probabilities of X are distributed in this way, we write

    X ~ Po(λ)

λ is the parameter of the distribution. We say X follows a Poisson
distribution with parameter λ.

Figure 5.1: Representing the babyboom data set (upper two panels) and a
non-random hypothetical collection of birth times (lower two panels):
(a) Babyboom data birth times; (b) histogram of Babyboom birth times per
hour; (c) non-random birth times; (d) histogram of non-random birth times
per hour.

Note A Poisson random variable can take on any non-negative integer
value. In contrast, the Binomial distribution always has a finite upper
limit.

Example 5.2: Hospital births

Births in a hospital occur randomly at an average rate of 1.8
births per hour.

What is the probability of observing 4 births in a given hour
at the hospital?

Let X = No. of births in a given hour.
(i) Events occur randomly; (ii) mean rate λ = 1.8.
So X ~ Po(1.8).

We can now use the formula to calculate the probability of
observing exactly 4 births in a given hour:

    P(X = 4) = e^(-1.8) × 1.8^4 / 4! = 0.0723

What about the probability of observing more than or equal to
2 births in a given hour at the hospital?

We want P(X ≥ 2) = P(X = 2) + P(X = 3) + ...,
i.e. an infinite number of probabilities to calculate. But

    P(X ≥ 2) = P(X = 2) + P(X = 3) + ...
             = 1 - P(X < 2)
             = 1 - (P(X = 0) + P(X = 1))
             = 1 - (e^(-1.8) × 1.8^0/0! + e^(-1.8) × 1.8^1/1!)
             = 1 - (0.16529 + 0.29753)
             = 0.537
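The Poisson formula is just as easy to evaluate in code as on a
calculator. A minimal Python check of the two calculations in this
example (the helper function is an illustration, not part of the notes):

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Po(lam): e^(-lam) * lam^x / x!"""
    return math.exp(-lam) * lam**x / math.factorial(x)

print(round(poisson_pmf(4, 1.8), 4))   # 0.0723
p_ge_2 = 1 - poisson_pmf(0, 1.8) - poisson_pmf(1, 1.8)
print(round(p_ge_2, 3))                # 0.537
```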


Example 5.3: Disease incidence


Suppose there is a disease whose average incidence is 2 per
million people. What is the probability that a city of 1 million
people has at least twice the average incidence?

Twice the average incidence would be 4 cases. We can reasonably
suppose the random variable X = # cases in 1 million people has
a Poisson distribution with parameter λ = 2. Then

    P(X ≥ 4) = 1 - P(X ≤ 3)
             = 1 - (e^(-2) 2^0/0! + e^(-2) 2^1/1!
                    + e^(-2) 2^2/2! + e^(-2) 2^3/3!)
             = 0.143.


5.3 The shape of the Poisson distribution

Using the formula we can calculate the probabilities for a specific
Poisson distribution and plot the probabilities to observe the shape of
the distribution. For example, Figure 5.2 shows 3 different Poisson
distributions. We observe that the distributions

(i). are unimodal;
(ii). exhibit positive skew (that decreases as λ increases);
(iii). are centred roughly on λ;
(iv). have variance (spread) that increases as λ increases.

5.4 Mean and Variance of the Poisson distribution

In general, there is a formula for the mean of a Poisson distribution.
There is also a formula for the standard deviation, σ, and variance, σ².

    If X ~ Po(λ) then

        μ = λ        σ² = λ

Figure 5.2: Three different Poisson distributions: Po(3), Po(5) and
Po(10).

5.5 Changing the size of the interval

Suppose we know that births in a hospital occur randomly at an average
rate of 1.8 births per hour.

What is the probability that we observe 5 births in a given 2 hour
interval?

Well, if births occur randomly at a rate of 1.8 births per 1 hour
interval, then births occur randomly at a rate of 3.6 births per 2 hour
interval.

Let Y = No. of births in a 2 hour period. Then Y ~ Po(3.6), and

    P(Y = 5) = e^(-3.6) × 3.6^5 / 5! = 0.13768

This example illustrates the following rule:

    If X ~ Po(λ) on a 1 unit interval,
    then Y ~ Po(kλ) on k unit intervals.


5.6 Sum of two Poisson variables

Now suppose we know that in hospital A births occur randomly at an
average rate of 2.3 births per hour and in hospital B births occur
randomly at an average rate of 3.1 births per hour.

What is the probability that we observe 7 births in total from the two
hospitals in a given 1 hour period?

To answer this question we can use the following rule:

    If X ~ Po(λ1) on a 1 unit interval,
    and Y ~ Po(λ2) on a 1 unit interval,
    then X + Y ~ Po(λ1 + λ2) on a 1 unit interval.

So if we let X = No. of births in a given hour at hospital A and
Y = No. of births in a given hour at hospital B, then X ~ Po(2.3),
Y ~ Po(3.1) and X + Y ~ Po(5.4), so

    P(X + Y = 7) = e^(-5.4) × 5.4^7 / 7! = 0.11999

Example 5.4: Disease Incidence, continued

Suppose disease A occurs with incidence 1.7 per million, and
disease B occurs with incidence 2.9 per million. Statistics are
compiled in which these diseases are not distinguished, but
simply are all called cases of disease "AB". What is the
probability that a city of 1 million people has at least 6 cases
of AB?

If Z = # cases of AB, then Z ~ Po(4.6). Thus,

    P(Z ≥ 6) = 1 - P(Z ≤ 5)
             = 1 - e^(-4.6) (4.6^0/0! + 4.6^1/1! + 4.6^2/2!
                             + 4.6^3/3! + 4.6^4/4! + 4.6^5/5!)
             = 0.314.


5.7 Fitting a Poisson distribution

Consider the two sequences of birth times we saw in Section 1. Both of these
examples consisted of a total of 44 births in 24 hour intervals.
Therefore the mean birth rate for both sequences is 44/24 = 1.8333.
What would be the expected counts if birth times were really random, i.e.
what is the expected histogram for a Poisson random variable with mean
rate λ = 1.8333?
Using the Poisson formula we can calculate the probabilities of obtaining
each possible value:¹

x          0        1        2        3        4        5        6
P(X = x)   0.15989  0.29312  0.26869  0.16419  0.07525  0.02759  0.01127

Then if we observe 24 one-hour intervals we can calculate the expected frequencies as 24 × P(X = x) for each value of x.

x                    0      1      2      3      4      5      6
Expected frequency   3.837  7.035  6.448  3.941  1.806  0.662  0.271

We say we have fitted a Poisson distribution to the data.
This consisted of 3 steps:
(i). Estimating the parameters of the distribution from the data
(ii). Calculating the probability distribution
(iii). Multiplying the probability distribution by the number of observations
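These three steps are easy to carry out by computer. A minimal sketch in Python, assuming scipy is available; the numbers are those of the tables above, and the final category groups the values 6 and above, as in footnote 1:

from scipy.stats import poisson

n_intervals = 24
rate = 44 / 24                          # step (i): estimate the rate from the data

probs = [poisson.pmf(x, rate) for x in range(6)]
probs.append(poisson.sf(5, rate))       # group the tail into a final "6 or more" category
expected = [n_intervals * p for p in probs]   # step (iii)

print([round(p, 5) for p in probs])     # 0.15989, 0.29312, ..., 0.01127
print([round(e, 3) for e in expected])  # 3.837, 7.035, ..., 0.271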
Once we have fitted a distribution to the data we can compare the expected
frequencies to those we actually observed from the real Babyboom dataset.
We see that the agreement is quite good.

x          0      1      2      3      4      5      6
Expected   3.837  7.035  6.448  3.941  1.806  0.662  0.271
Observed   3      8      6      4      3      0      0

¹ In practice we group values with low probability into one category.


When we compare the expected frequencies to those observed from the non-random clustered sequence in Section 1 we see that there is much less agreement.

x          0      1      2      3      4      5      6
Expected   3.837  7.035  6.448  3.941  1.806  0.662  0.271
Observed   12     3      0      2      2      4      1

In Lecture 9 we will see how we can formally test for a difference between
the expected and observed counts. For now it is enough just to know how
to fit a distribution.

5.8  Using the Poisson to approximate the Binomial

The Binomial and Poisson distributions are both discrete probability distributions. In some circumstances the distributions are very similar. For
example, consider the Bin(100, 0.02) and Po(2) distributions shown in Figure 5.3. Visually these distributions are almost identical.
In general,
If n is large (say n > 50) and p is small (say p < 0.1) then a
Bin(n, p) can be approximated with a Po(λ) where λ = np.

Example 5.5: Counting lefties

Given that 5% of a population are left-handed, use the Poisson
distribution to estimate the probability that a random sample of
100 people contains 2 or more left-handed people.
X = No. of left handed people in a sample of 100
X ∼ Bin(100, 0.05)
Poisson approximation: X ∼ Po(λ) with λ = 100 × 0.05 = 5


Figure 5.3: A Binomial, Bin(100, 0.02), and a Poisson, Po(2), distribution that are very similar. [Plots of P(X) against x, for x up to 10.]
We want P(X ≥ 2).

P(X ≥ 2) = 1 − P(X < 2)
= 1 − (P(X = 0) + P(X = 1))
≈ 1 − (e^(−5) 5^0/0! + e^(−5) 5^1/1!)
≈ 1 − 0.040428
≈ 0.9596

If we use the exact Binomial distribution we get the answer 0.9629.
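As a check, here is a sketch of the same comparison in Python (scipy assumed):

from scipy.stats import binom, poisson

# P(X >= 2): Poisson approximation versus the exact Binomial
print(poisson.sf(1, 5))          # 0.9596, using Po(5)
print(binom.sf(1, 100, 0.05))    # 0.9629, exact Bin(100, 0.05)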
The idea of using one distribution to approximate another is widespread
throughout statistics and one we will meet again. Why would we use an
approximate distribution when we actually know the exact distribution?
(i). The exact distribution may be hard to work with.
(ii). The exact distribution may have too much detail. There may be some
features of the exact distribution that are irrelevant to the questions
we want to answer. By using the approximate distribution, we focus
attention on the things we're really concerned with.

For example, consider the Babyboom data, discussed in Example 5.2.


We said that random birth times should yield numbers of births in each
hour that are Poisson distributed. Why? Consider the births between 6
am and 7 am. When we say that the births are random, we probably mean
something like this: The times are independent of each other, and have equal
chances of happening at any time. Any given one of the 44 births has 24
hours when it could have happened. The probability that it happens during
this hour is p = 1/24 = 0.0417. The births between 6 am and 7 am should
thus have about the Bin(44, 0.0417) distribution. This distribution is about
the same as Po(1.83), since 1.83 = 44 × 0.0417.

Example 5.6: Drownings in Malta, continued


We now analyse the data on the monthly numbers of drowning
incidents in Malta. Under the hypothesis that drownings have
nothing to do with each other, and have causes that don't change
in time, we would expect the number X of drownings in a month
to have a Poisson distribution. Why is that? We might imagine
that there are a large number n of people in the population, each
of whom has an unknown probability p of drowning in any given
month. Then the number of drownings in a month has Bin(n, p)
distribution. In order to use this model, we need to know what
n and p are. That is, we need to know the size of the population,
which we don't really care about.
On the other hand, the expected (mean) number of monthly
drownings is np, and that can be estimated from the observed
mean number of drownings. If we approximate the binomial
distribution by Po(λ), where λ = np, then we don't have to
worry about n and p separately.
We estimate λ as (total number of drownings)/(number of months).
The total number of drownings is 0 × 224 + 1 × 102 + 2 × 23 + 3 × 5 + 4 × 1 = 167,
so we estimate λ = 167/355 = 0.47. We show the probabilities
for the different possible outcomes in the last column of Table 5.2.
In the third column we show the expected
number of months with a given number of drownings, assuming
number of months with a given number of drownings, assuming

The Poisson Distribution

93

Table 5.2: Monthly counts of drownings in Malta, with Poisson fit.

No. of drowning     Frequency (No.      Expected frequency    Probability
deaths per month    months observed)    (Poisson λ = 0.47)
0                   224                 221.9                 0.625
1                   102                 104.3                 0.294
2                   23                  24.5                  0.069
3                   5                   3.8                   0.011
4                   1                   0.45                  0.001
5+                  0                   0.04                  0.0001

the independence assumption (and hence the Poisson model)
is true. This is computed by multiplying the last column by
355. After all, if the probability of no drownings in any given
month is 0.625, and we have 355 months of observations, we
expect 0.625 × 355 ≈ 222 months with 0 drownings.
We see that the observations (in the second column) are pretty
close to the predictions of the Poisson model (in the third column), so the data do not give us strong evidence to reject the
neutral assumption, that drownings are independent of one another, and have a constant rate in time. In Lecture 9 we will
describe one way of testing this hypothesis formally. 

Example 5.7: Swine flu vaccination


In 1976, fear of an impending swine flu pandemic led to a mass
vaccination campaign in the US. The pandemic never materialised, but there were concerns that the vaccination may have
led to an increase in a rare and serious neurological disease,
Guillain-Barre Syndrome (GBS). It was difficult to determine
whether the vaccine was really at fault, since GBS may arise
spontaneously (about 1 person in 100,000 develops GBS in a
given year) and the number of cases was small.
Consider the following data from the US state of Michigan: Out
of 9 million residents, about 2.3 million were vaccinated. Of
those, 48 developed GBS between July 1976 and June 1977. We
might have expected

2.3 million × 10^(−5) cases/person-year = 23 cases.
How likely is it that, purely by chance, this population would
have experienced 48 cases in a single year? If Y is the number
of cases, it would then have Poisson distribution with parameter
23, so that

P(Y ≥ 48) = 1 − Σ_{i=0}^{47} e^(−23) 23^i/i! = 3.5 × 10^(−6).
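This sum is tedious by hand, but quick to check by computer. A sketch in Python, assuming scipy:

from scipy.stats import poisson

# P(Y >= 48) for Y ~ Po(23); sf(47) gives P(Y > 47)
print(poisson.sf(47, 23))   # about 3.5e-06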

So, such an extreme number of cases is likely to happen less than
1 year in 100,000. Does this prove that the vaccine caused GBS?
The people who had the vaccine are people who chose to be
vaccinated. They may differ from the rest of the population in
multiple ways in addition to the elementary fact of having been
vaccinated, and some of those ways may have predisposed them
to GBS. What can we do? The paper [BH84] takes the following
approach: If the vaccine were not the cause of the GBS cases,
we would expect no connection between the timing of the vaccine and the onset of GBS. In fact, though, there seemed to be a
particularly large number of cases in the six weeks following vaccination. Can we say that this was more than could reasonably
be expected by chance?
The data are given in Table 5.3. Each of the 40 GBS cases
was assigned a time, which is the number of weeks after vaccination when the disease was diagnosed. (Thus week 1 is a
different calendar week for each subject.) If the cases are evenly
distributed, the number in a given week should be Poisson distributed with parameter 40/30 = 1.33. Using this parameter, we
compute the probabilities of 0, 1, 2, . . . cases in a week, which we
give in row 3 of Table 5.3. Multiplying these numbers by 30 gives
the expected frequencies in row 4 of the table. It is clear that the
observed and expected frequencies are very different. One way
of seeing this is to consider the standard deviation. The Poisson
distribution has SD √1.33 = 1.15 (as discussed in Section 5.4),
while the data have SD

s = √( [16(0 − 1.33)^2 + 7(1 − 1.33)^2 + 3(2 − 1.33)^2 + 2(4 − 1.33)^2 + 1(9 − 1.33)^2 + 1(10 − 1.33)^2] / (30 − 1) ) = 2.48.

Table 5.3: Cases of GBS, by weeks after vaccination

# cases per week     0      1      2      3      4      5      6+
observed frequency   16     7      3      0      2      0      2
probability          0.264  0.352  0.234  0.104  0.034  0.009  0.003
expected frequency   7.9    10.6   7.0    3.1    1.0    0.3    0.1

5.9  Derivation of the Poisson distribution (non-examinable)

This section is not officially part of the course, but is optional, for those
who are interested in more mathematical detail. Where does the formula in
section 5.2 come from?
Think of the Poisson distribution as in section 5.8, as an approximation
to a binomial distribution. Let X be the (random) number of successes in
a collection of independent random trials, where the expected number of
successes is λ. This will, of course, depend on the number of trials, but
we show that when the number of trials (call it n) gets large, the exact
number of trials doesn't matter. In mathematical language, we say that
the probability converges to a limit as n goes to infinity. But how large is
"large"? We would like to know how good the approximation is, for real
values of n, of the sort that we are interested in.
Let X_n be the random number of successes in n independent trials, where
the probability of each success is λ/n. Thus, the probability of success goes
down as the number of trials goes up, and the expected number of successes is
always the same λ. Then

P{X_n = x} = nCx (λ/n)^x (1 − λ/n)^(n−x).


Now, those of you who have learned some calculus at A-levels may remember
the Taylor series for e^z:

e^z = 1 + z + z^2/2! + z^3/3! + · · · .

In particular, for small z we have e^(−z) ≈ 1 − z, and the difference (or error
in the approximation) is no bigger than z^2/2. The key idea is that if z is
very small (as it is when z = λ/n, and n is large), then z^2 is a lot smaller
than z.
Using a bit of algebra, we have

P{X_n = x} = nCx (λ/n)^x (1 − λ/n)^n (1 − λ/n)^(−x)
= [n(n − 1) · · · (n − x + 1)/x!] (λ^x/n^x) (1 − λ/n)^n (1 − λ/n)^(−x)
= (λ^x/x!) · [(1)(1 − 1/n) · · · (1 − (x − 1)/n)/(1 − λ/n)^x] · (1 − λ/n)^n.

Now, if we're not concerned about the size of the error, we can simply
say that n is much bigger than λ or x (because we're thinking of a fixed λ
and x, and n getting large). So we have the approximations

(1)(1 − 1/n) · · · (1 − (x − 1)/n) ≈ 1;
(1 − λ/n)^(−x) ≈ 1;
(1 − λ/n)^n ≈ (e^(−λ/n))^n = e^(−λ).

Thus

P{X_n = x} ≈ (λ^x/x!) e^(−λ).

5.9.1  Error bounds (very mathematical)

In the long run, X_n has a distribution very close to the Poisson distribution
defined in section 5.2. But how long is the long run? Do we need 10
trials? 1000? a billion?
If you just want the answer, it's approximately this: The error that you'll
make by taking the Poisson distribution instead of the binomial is no more
than about 1.6 λ̃^2/n^(3/2), where λ̃ = max{λ, 1}. In Example 5.5, where n = 100 and λ = 5, this
says the error won't be bigger than about 0.04, which is useful information,
although in reality the maximum error is about 10 times smaller than this.
On the other hand, if n = 400,000 (about the population of Malta), and
λ = 0.47, then the error will be only about 10^(−8).

Let's assume that n is at least 4λ^2, so λ ≤ √n/2. Define the approximation error to be

ε := max_x | P{X_n = x} − P{X = x} |.

(The bars | · | mean that we're only interested in how big the difference is,
not whether it's positive or negative.) Then
P{X_n = x} − P{X = x}
= (λ^x/x!) [(1)(1 − 1/n) · · · (1 − (x − 1)/n)/(1 − λ/n)^x] (1 − λ/n)^n − (λ^x/x!) e^(−λ)
= (λ^x/x!) e^(−λ) [ (1)(1 − 1/n) · · · (1 − (x − 1)/n) (1 − λ/n)^(−x) ((1 − λ/n)/e^(−λ/n))^n − 1 ].

If x is bigger than √n, then P{X = x} and P{X_n = x} are both tiny; we
won't go into the details here, but we will consider only x that are smaller
than this. Now we have to do some careful approximation. Basic algebra
tells us that if a and b are positive,

(1 − a)(1 − b) = 1 − (a + b) + ab > 1 − (a + b).

We can extend this to (1 − a)(1 − b)(1 − c) > (1 − (a + b))(1 − c) > 1 − (a + b + c).
And so, finally, if a, b, c, . . . , z are all positive, then

1 > (1 − a)(1 − b)(1 − c) · · · (1 − z) > 1 − (a + b + c + · · · + z).
Thus

1 > (1 − 1/n)(1 − 2/n) · · · (1 − (x − 1)/n) > 1 − Σ_{k=0}^{x−1} k/n > 1 − x^2/(2n),

and

1 > (1 − λ/n)^x > 1 − λx/n.

Again applying some calculus, we turn this into

1 < (1 − λ/n)^(−x) < 1 + λx/(n − λx).



We also know that

1 − λ/n < e^(−λ/n) < 1 − λ/n + λ^2/(2n^2),   and   1 − λ^2/(2(n^2 − λn)) < (1 − λ/n)/e^(−λ/n) < 1,

which means that

1 − λ^2/(2(n − λ)) < ( 1 − λ^2/(2(n^2 − λn)) )^n < ( (1 − λ/n)/e^(−λ/n) )^n < 1.

Now we put together all the overestimates on one side, and all the underestimates on the other:

−(λ^x/x!) e^(−λ) ( λ^2/(2(n − λ)) + x^2/(2n) ) ≤ P{X_n = x} − P{X = x} ≤ (λ^x/x!) e^(−λ) λx/(n − λx).

So, finally, as long as n ≥ 4λ^2, we get

ε ≤ max_x e^(−λ) (1/x!) ( λ^(x+1)/n + λ^x x/n + λ^x x/(n(1 − λx/(2√n))) ).
We need to find the maximum over all possible x. If x < √n then this
becomes

ε ≤ (1/n) max_x e^(−λ) (λ̃^(x+1)/x!) (λ + 3x) ≤ 4λ̃^2/(n √(2πn)) ≈ 1.6 λ̃^2/n^(3/2)

(by a formula known as Stirling's formula), where λ̃ = max{λ, 1}.

Lecture 6

The Normal Distribution


6.1  Introduction

In previous lectures we have considered discrete datasets and discrete probability distributions. In practice many datasets that we collect from experiments consist of continuous measurements. For example, there are the
weights of newborns in the babyboom data set (Table 1.2). The plots in
Figure 6.1 show histograms of real datasets consisting of continuous measurements. From such samples of continuous data we might want to test
whether the data is consistent with a specific population mean value, or
whether there is a significant difference between 2 groups of data. To answer these questions we need a probability model for the data. Of course,
there are many different possible distributions that quantities could have. It
is therefore a startling fact that many different quantities that we are commonly interested in (heights, weights, scores on intelligence tests, serum
potassium levels of different patients, measurement errors of distance to the
nearest star) all have distributions which are close to one particular shape.
This shape is called the Normal or Gaussian¹ family of distributions.

¹ Named for German mathematician Carl Friedrich Gauss, who first worked out the
formula for these distributions, and used them to estimate the errors in astronomical
computations. Until the introduction of the euro, Gauss's picture and the Gaussian
curve were on the German 10 mark banknote.

Figure 6.1: Histograms of some continuous data. (a) Babyboom birthweights, in grams; (b) Petal measurements (petal length) in a species of flower; (c) Brain sizes of 40 Psychology students; (d) Serum potassium measurements from 152 healthy volunteers.

6.2  Continuous probability distributions


When we considered the Binomial and Poisson distributions we saw that the
probability distributions were characterized by a formula for the probability
of each possible discrete value. All of the probabilities together sum up to
1. We can visualize the density by plotting the probabilities against the
discrete values (Figure 6.2). For continuous data we don't have equally
spaced discrete values so instead we use a curve or function that describes
the probability density over the range of the distribution (Figure 6.3). The
curve is chosen so that the area under the curve is equal to 1. If we observe a
sample of data from such a distribution we should see that the values occur
in regions where the density is highest.

Figure 6.2: A discrete probability distribution. [Plot of P(X) against x, for x up to 20.]

Figure 6.3: A continuous probability distribution. [Plot of a density curve over the range 60–140.]

6.3  What is the Normal Distribution?

There are many, many possible probability density functions over a continuous range of values. The Normal distribution describes a special class
of such distributions that are symmetric and can be described by the distribution mean μ and the standard deviation σ (or variance σ^2). 4 different
Normal distributions are shown in Figure 6.4 together with the values of μ
and σ. These plots illustrate how changing the values of μ and σ alters the
positions and shapes of the distributions.
If X is Normally distributed with mean μ and standard deviation σ, we
write

X ∼ N(μ, σ^2).

μ and σ are the parameters of the distribution.


The probability density of the Normal distribution is given by

f(x) = (1/(σ√(2π))) exp( −(x − μ)^2/(2σ^2) ).

For the purposes of this course we do not need to use this expression. It is
included here for future reference.

Figure 6.4: 4 different Normal distributions: μ = 100, σ = 10; μ = 100, σ = 5; μ = 100, σ = 15; and μ = 130, σ = 10. [Density curves plotted over the range 50–150.]

6.4  Using the Normal table

For a discrete probability distribution we calculate the probability of being


less than some value z, i.e. P (Z < z), by simply summing up the probabilities of the values less than z.


For a continuous probability distribution we calculate the probability of
being less than some value z, i.e. P(Z < z), by calculating the area under
the curve to the left of z.
For example, suppose Z ∼ N(0, 1) and we want to calculate P(Z < 0).
[Diagram: standard normal curve with the area P(Z < 0), to the left of 0, shaded.]
For this example we can calculate the required area, as we know the distribution is symmetric and the total area under the curve is equal to 1, i.e.
P(Z < 0) = 0.5.
What about P(Z < 1.0)? [Diagram: standard normal curve with the area P(Z < 1) shaded.]
Calculating this area is not easy² and so we use probability tables. Probability tables are tables of probabilities that have been calculated on a computer.

² For those Mathematicians who recognize this area as a definite integral and try to do
the integral by hand, please note that the integral cannot be evaluated analytically.
All we have to do is identify the right probability in the table and copy it
down! Obviously it is impossible to tabulate all possible probabilities for all
possible Normal distributions so only one special Normal distribution, N(0,
1), has been tabulated.
The N(0, 1) distribution is called the standard Normal distribution.
The tables allow us to read off probabilities of the form P (Z < z). Most of
the table in the formula book has been reproduced in Table 6.1. From this
table we can identify that P (Z < 1.0) = 0.8413 (this probability has been
highlighted with a box).
z     0.00    0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0   0.5000  5040   5080   5120   5160   5199   5239   5279   5319   5359
0.1   0.5398  5438   5478   5517   5557   5596   5636   5675   5714   5753
0.2   0.5793  5832   5871   5910   5948   5987   6026   6064   6103   6141
0.3   0.6179  6217   6255   6293   6331   6368   6406   6443   6480   6517
0.4   0.6554  6591   6628   6664   6700   6736   6772   6808   6844   6879
0.5   0.6915  6950   6985   7019   7054   7088   7123   7157   7190   7224
0.6   0.7257  7291   7324   7357   7389   7422   7454   7486   7517   7549
0.7   0.7580  7611   7642   7673   7704   7734   7764   7794   7823   7852
0.8   0.7881  7910   7939   7967   7995   8023   8051   8078   8106   8133
0.9   0.8159  8186   8212   8238   8264   8289   8315   8340   8365   8389
1.0   0.8413  8438   8461   8485   8508   8531   8554   8577   8599   8621
1.1   0.8643  8665   8686   8708   8729   8749   8770   8790   8810   8830

Table 6.1: N(0, 1) probability table


Once we know how to read tables we can calculate other probabilities.

Example 6.1: Upper tails


The table gives P(Z < z). Suppose we want P(Z > 0.92).
We know that P(Z > 0.92) = 1 − P(Z < 0.92) and we can calculate P(Z < 0.92) from the tables.
Thus, P(Z > 0.92) = 1 − 0.8212 = 0.1788.



Example 6.2: Negative z

The table only includes positive values of z. What about negative values? Compute P(Z > −0.5).
The Normal distribution is symmetric, so we know that P(Z > −0.5) = P(Z < 0.5) = 0.6915.
We can use the symmetry of the Normal distribution to calculate

P(Z < −0.76) = P(Z > 0.76)
= 1 − P(Z < 0.76)
= 1 − 0.7764
= 0.2236.


Example 6.3: Intervals

How do we compute P(−0.64 < Z < 0.43)? We can calculate this using

P(−0.64 < Z < 0.43) = P(Z < 0.43) − P(Z < −0.64)
= 0.6664 − (1 − 0.7389)
= 0.4053.



Example 6.4: Interpolation

How would we compute P(Z < 0.567)?
From tables we know that P(Z < 0.56) = 0.7123 and P(Z < 0.57) = 0.7157.
To calculate P(Z < 0.567) we interpolate between these two values:

P(Z < 0.567) = 0.3 × 0.7123 + 0.7 × 0.7157 = 0.7146
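In practice the table, and the interpolation, can be delegated to a computer. A sketch in Python, assuming scipy:

from scipy.stats import norm

print(norm.cdf(1.0))    # 0.8413, the entry from Table 6.1
print(norm.cdf(0.567))  # 0.7146, with no interpolation needed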


6.5  Standardisation

All of the probabilities above were calculated for the standard Normal distribution N(0, 1). If we want to calculate probabilities from different Normal
distributions we convert the probability to one involving the standard Normal distribution. This process is called standardisation.
Suppose X ∼ N(3, 4) and we want to calculate P(X < 6.2). We convert
this probability to one involving the N(0, 1) distribution by
(i). Subtracting the mean
(ii). Dividing by the standard deviation
Subtracting the mean re-centers the distribution on zero. Dividing by the
standard deviation re-scales the distribution so it has standard deviation 1.
If we also transform the boundary point of the area we wish to calculate
we obtain the equivalent boundary point for the N(0, 1) distribution. This
process is illustrated in the figure below. In this example, P(X < 6.2) =
P(Z < 1.6) = 0.9452, where Z ∼ N(0, 1).
This process can be described by the following rule:
If X ∼ N(μ, σ^2) and Z = (X − μ)/σ, then Z ∼ N(0, 1).

Example 6.5: Birth weights

[Diagram: standardising X ∼ N(3, 4) to Z ∼ N(0, 1); the boundary point 6.2 maps to (6.2 − 3)/2 = 1.6.]

Suppose we know that the birth weight of babies is Normally distributed with mean 3500g and standard deviation 500g. What
is the probability that a baby is born that weighs less than 3100g?
That is, X ∼ N(3500, 500^2) and we want to calculate P(X < 3100).
We can calculate the probability through the process of standardisation.
Drawing a rough diagram of the process can help you to avoid
any confusion about which probability (area) you are trying to
calculate.

[Diagram: X ∼ N(3500, 500^2) standardised to Z ∼ N(0, 1) via Z = (X − 3500)/500; the area P(X < 3100) maps to P(Z < −0.8), since (3100 − 3500)/500 = −0.8.]

P(X < 3100) = P( (X − 3500)/500 < (3100 − 3500)/500 )
= P(Z < −0.8)   where Z ∼ N(0, 1)
= 1 − P(Z < 0.8)
= 1 − 0.7881
= 0.2119
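The same answer can be obtained directly by computer; a sketch in Python (scipy assumed):

from scipy.stats import norm

# Either standardise by hand ...
print(norm.cdf(-0.8))                        # 0.2119
# ... or pass the mean and SD directly
print(norm.cdf(3100, loc=3500, scale=500))   # 0.2119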


6.6  Linear combinations of Normal random variables

Suppose two rats A and B have been trained to navigate a large maze.
The time it takes rat A is normally distributed with mean 80 seconds and
standard deviation 10 seconds. The time it takes rat B is normally distributed with mean 78 seconds and standard deviation 13 seconds. On any
given day, what is the probability that rat A runs the maze faster than rat B?

X = Time of run for rat A,  X ∼ N(80, 10^2)
Y = Time of run for rat B,  Y ∼ N(78, 13^2)


Let D = X − Y be the difference in times of rats A and B.
If rat A is faster than rat B then D < 0, so we want P(D < 0).
To calculate this probability we need to know the distribution of D. To
do this we use the following rule:
If X and Y are two independent normal variables such that
X ∼ N(μ1, σ1^2) and Y ∼ N(μ2, σ2^2),
then

X − Y ∼ N(μ1 − μ2, σ1^2 + σ2^2).

In this example,

D = X − Y ∼ N(80 − 78, 10^2 + 13^2) = N(2, 269).
We can now calculate this probability through standardisation.
[Diagram: standardising D ∼ N(2, 269) to Z ∼ N(0, 1) via Z = (D − 2)/16.40; P(D < 0) maps to P(Z < −0.122), since (0 − 2)/16.40 = −0.122.]

P(D < 0) = P( (D − 2)/√269 < (0 − 2)/√269 )
= P(Z < −0.122)   where Z ∼ N(0, 1)
= 1 − (0.8 × 0.5478 + 0.2 × 0.5517)
= 0.45142
Other rules that are often used are:
If X and Y are two independent normal variables such that
X ∼ N(μ1, σ1^2) and Y ∼ N(μ2, σ2^2),
then

X + Y ∼ N(μ1 + μ2, σ1^2 + σ2^2)
aX ∼ N(aμ1, a^2 σ1^2)
aX + bY ∼ N(aμ1 + bμ2, a^2 σ1^2 + b^2 σ2^2)
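These rules, and the maze calculation above, can be checked numerically. A sketch in Python (scipy assumed):

from scipy.stats import norm

mu = 80 - 78                      # mean of D = X - Y
sd = (10**2 + 13**2) ** 0.5       # SD of D: sqrt(269) = 16.40
print(norm.cdf(0, loc=mu, scale=sd))   # about 0.4515; the table gave 0.45142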

Example 6.6: Maze-running times

Suppose two rats A and B have been trained to navigate a large
maze. The time it takes rat A is normally distributed with mean
80 seconds and standard deviation 10 seconds. The time it takes
rat B is normally distributed with mean 78 seconds and standard
deviation 13 seconds. On any given day, what is the probability
that the average time the rats take to run the maze is greater
than 82 seconds?

X = Time of run for rat A,  X ∼ N(80, 10^2)
Y = Time of run for rat B,  Y ∼ N(78, 13^2)

Let A = (X + Y)/2 be the average time of rats A and B.
Then A ∼ N( (1/2)×80 + (1/2)×78, (1/2)^2×10^2 + (1/2)^2×13^2 ) = N(79, 67.25).
We want P(A > 82).

[Diagram: standardising A ∼ N(79, 67.25) to Z ∼ N(0, 1) via Z = (A − 79)/8.20; P(A > 82) maps to P(Z > 0.366), since (82 − 79)/8.20 = 0.366.]

P(A > 82) = P( (A − 79)/√67.25 > (82 − 79)/√67.25 )
= P(Z > 0.366)   where Z ∼ N(0, 1)
= 1 − (0.4 × 0.6406 + 0.6 × 0.6443)
= 0.35718


6.7  Using the Normal tables backwards

Example 6.7: Exam scores

The marks of 500 candidates in an examination are normally
distributed with a mean of 45 marks and a standard deviation
of 20 marks.
If 20% of candidates obtain a distinction by scoring x marks
or more, estimate the value of x.
We have X ∼ N(45, 20^2) and we want x such that P(X > x) = 0.2,
i.e. P(X < x) = 0.8.

[Diagram: X ∼ N(45, 400) standardised to Z ∼ N(0, 1) via Z = (X − 45)/20; P(X < x) = 0.8 corresponds to P(Z < 0.84) = 0.8, so (x − 45)/20 = 0.84.]

Standardising this probability we get

P( (X − 45)/20 < (x − 45)/20 ) = 0.8
P( Z < (x − 45)/20 ) = 0.8

From the tables we know that P(Z < 0.84) ≈ 0.8, so

(x − 45)/20 ≈ 0.84
x ≈ 45 + 20 × 0.84 = 61.8
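Reading the table backwards corresponds to the quantile (inverse cdf) function; a sketch in Python (scipy assumed):

from scipy.stats import norm

z = norm.ppf(0.8)       # 0.8416: the z with P(Z < z) = 0.8
print(45 + 20 * z)      # 61.8, the distinction mark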

6.8  The Normal approximation to the Binomial

Under certain conditions we can use the Normal distribution to approximate
the Binomial distribution. This can be very useful when we need to sum up
a large number of Binomial probabilities to calculate the probability that
we want.
For example, Figure 6.5 compares a Bin(300, 0.5) and a N(150, 75), which
both have the same mean and variance. The figure shows that the distributions are very similar.
Figure 6.5: Comparison of a Bin(300, 0.5) and a N(150, 75) distribution. [Plots of P(X = x) and the corresponding normal density over the range 100–200.]


In general,
If X ∼ Bin(n, p) then μ = np and σ^2 = npq, where q = 1 − p.
For large n, and p not too small or too large,

X ≈ N(np, npq)

(roughly, n > 10 suffices for p near 1/2, while n > 30 is needed as p moves away from 1/2).

6.8.1  Continuity correction

Suppose X ∼ Bin(12, 0.5); what is P(4 ≤ X ≤ 7)?


For this distribution we have

μ = np = 6,    σ^2 = npq = 3.

So we can use a N(6, 3) distribution as an approximation.
Unfortunately, it's not quite so simple. We have to take into account the
fact that we are using a continuous distribution to approximate a discrete
distribution. This is done using a continuity correction. The continuity
correction appropriate for this example is illustrated in the figure below.
In this example, P(4 ≤ X ≤ 7) transforms to P(3.5 < X < 7.5).
[Diagram: the bars of Bin(12, 0.5) from 4 to 7 covered by the normal density between 3.5 and 7.5.]

P(3.5 < X < 7.5) = P( (3.5 − 6)/√3 < (X − 6)/√3 < (7.5 − 6)/√3 )
= P(−1.443 < Z < 0.866)   where Z ∼ N(0, 1)
= 0.732


The exact answer is 0.733 so in this case the approximation is very good.
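A quick numerical check of both calculations, sketched in Python (scipy assumed):

from scipy.stats import binom, norm

# Exact: P(4 <= X <= 7) for X ~ Bin(12, 0.5)
exact = binom.cdf(7, 12, 0.5) - binom.cdf(3, 12, 0.5)
# Normal approximation with continuity correction: P(3.5 < X < 7.5)
approx = norm.cdf(7.5, 6, 3**0.5) - norm.cdf(3.5, 6, 3**0.5)
print(exact, approx)   # about 0.733 and 0.732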

6.9  The Normal approximation to the Poisson

We can also use the Normal distribution to approximate a Poisson distribution under certain conditions.
In general,
If X ∼ Po(λ) then μ = λ and σ^2 = λ.
For large λ (say λ > 20),

X ≈ N(λ, λ).

Example 6.8: Radioactive emission

A radioactive source emits particles at an average rate of 25 particles per second. What is the probability that in 1 second the
count is less than 27 particles?
X = No. of particles emitted in 1 second,  X ∼ Po(25).
So we can use a N(25, 25) as an approximate distribution.
Again, we need to make a continuity correction, so P(X < 27)
transforms to P(X < 26.5).

P(X < 26.5) = P( (X − 25)/5 < (26.5 − 25)/5 )
= P(Z < 0.3)   where Z ∼ N(0, 1)
= 0.6179


Example 6.9: ESP Experiment

This example is adapted from [FPP98].
In the 1970s, the psychologist Charles Tart tried to test whether
people might have the power to see into the future. His "Aquarius machine" was a device that would flash four different lights
in random orders. Subjects would press buttons to predict which
of the 4 lights will come on next.
15 different subjects each ran a trial of 500 guesses, so 7500
guesses in total. They produced 2006 correct guesses and 5494
incorrect. What should we conclude?
We might begin by hypothesising that without any power to
predict the future, a subject has just a 1/4 chance of guessing right each time, independent of any other outcomes. Thus,
the number of correct guesses X has Bin(7500, 1/4) distribution.
This has mean μ = 7500/4 = 1875, and standard deviation
√(7500 × 0.25 × 0.75) = 37.5. The result is thus above the mean:
we plausibly say that the difference from the expectation is just
chance variation?
We want to know how likely a result this extreme would be if
X really has this binomial distribution. We could compute this


directly from the binomial distribution as

P(X ≥ 2006) = Σ_{x=2006}^{7500} C(7500, x) (0.25)^x (0.75)^(7500−x)
= C(7500, 2006)(0.25)^2006 (0.75)^5494
+ C(7500, 2007)(0.25)^2007 (0.75)^5493 + · · ·
+ C(7500, 7500)(0.25)^7500 (0.75)^0.
This is not only a lot of work, it is also not very illuminating.
More useful is to treat X as a continuous variable that is approximately normal.
We sketch the relevant normal curve in Figure 6.7. This is the
normal distribution with mean 1875 and SD 37.5. Because of
the continuity correction, the probability we are looking for is
P(X > 2005.5). We convert x = 2005.5 into standard units:

z = (x − μ)/σ = (2005.5 − 1875)/37.5 = 3.48.

(Note that with such a large SD, the continuity correction makes
hardly any difference.) We have then P(X > 2005.5) ≈ P(Z > 3.48),
where Z has standard normal distribution. Since most of
the probability of the standard normal distribution is between
−2 and 2, and nearly all between −3 and 3, we know this is a
small probability. The relevant piece of the normal table is given
in Figure 6.6. (Notice that the table has become less refined for
z > 2, giving only one place after the decimal point in z.) From
the table we see that

P(X > 2005.5) = P(Z > 3.48) = 1 − P(Z < 3.48),

which is between 0.0002 and 0.0003. (Using a more refined table,
we would see that P(Z > 3.48) = 0.000250 . . . .) This may be
compared to the exact binomial probability 0.000274.
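Both tail probabilities are easy to verify by computer; a sketch in Python (scipy assumed):

from scipy.stats import binom, norm

n, p = 7500, 0.25
mu, sd = n * p, (n * p * (1 - p)) ** 0.5   # 1875 and 37.5
print(norm.sf(2005.5, mu, sd))   # 0.000250, normal approximation
print(binom.sf(2005, n, p))      # 0.000274, exact P(X >= 2006)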


Figure 6.6: Normal table used to compute the tail probability for the Aquarius experiment. [Excerpt of the N(0, 1) table; for z > 2 it gives only one decimal place in z, e.g. P(Z < 3.4) = 0.9997 and P(Z < 3.5) = 0.9998.]

Figure 6.7: Normal approximation to the distribution of correct guesses in the Aquarius experiment, under the binomial hypothesis. The solid line is the mean (1875), and the dotted lines show 1 and 2 SDs away from the mean; the observed count was 2006.

Lecture 7

Confidence intervals and Normal Approximation

7.1  Confidence interval for sampling from a normally distributed population

In Lecture 8 we will learn about significance testing, which is one way to
quantify uncertainty. Consider the following example: Data set #231 from
[HDL+94] includes height measurements from 198 men and women, sampled
at random from the much larger 1980 OPCS (Office of Population Census and Survey) survey. The sample of 198 men's heights averages 1732mm,
with an SD of 68.8mm. What does this tell us about the average height of
British men?
We only measured 198 of the many millions of men in the country, but
if we can assume that this was a random sample, this allows us to draw
conclusions about the entire population from which it was sampled. We
reason as follows: Imagine a box with millions of cards in it, on each of
which is written the height of one British man. These numbers are normally
distributed with mean μ and variance σ^2. We get to look at 198 of these
cards, sampled at random. Call the cards we see X1, . . . , X198. Our best
guess for μ will of course be the sample mean

X̄ = (1/198)(X1 + · · · + X198) = 1732mm.

"Best guess" is not a very precise statement, though. To make a precise
statement, we use the sampling distribution of the estimation error X̄ − μ.
The average of independent normal random variables is normal. The

expectation is the average of the expectations, while the variance is

Var(X̄) = (1/n^2) Var(X1 + · · · + X198) = (1/n^2)( Var(X1) + · · · + Var(X198) ) = σ^2/n,

since the variance of a sum of independent variables is always the sum of
the variances. Thus, we can standardise X̄ by writing

Z = (X̄ − μ)/(σ/√n),

which is a standard normal random variable (that is, with expectation 0 and
variance 1).
A tiny bit of algebra gives us

μ = X̄ − (σ/√n) Z.

This expresses the unknown quantity μ in terms of known quantities and a
random variable Z with known distribution. Thus we may use our standard
normal tables to generate statements like "the probability is 0.95 that Z is
in the range −1.96 to 1.96", implying that

the probability is 0.95 that μ is in the range X̄ − 1.96 σ/√n to X̄ + 1.96 σ/√n.

(Note that we have used the fact that the normal distribution is symmetric
about 0.) We call this interval a 95% confidence interval for the unknown
population mean.

The quantity / n, which determines the scale of the confidence interval, is called the Standard Error for the sample mean, commonly abbreviated SE. If we take to be the sample standard deviation
more about this
assumption in chapter 10 the Standard Error is 69mm/ 198 4.9mm.
The 95% confidence interval for the population mean is then 1732 9.8mm,
so (1722, 1742)mm. In place of our vague statement about a best guess for
, we have an interval of width 20 mm in which we are 95% confident that
the true population mean lies.
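The computation is a one-liner in practice; a sketch in Python:

from math import sqrt

n, mean, sd = 198, 1732, 68.8
se = sd / sqrt(n)                           # about 4.9 mm
print(mean - 1.96 * se, mean + 1.96 * se)   # roughly (1722, 1742) mm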
General procedure for normal confidence intervals: Suppose X1, . . . , Xn
are independent samples from a normal distribution with unknown mean μ,
and known variance σ^2. Then a (symmetric) c% normal confidence interval
for μ is the interval

(X̄ − z* SE, X̄ + z* SE),  which we also write as X̄ ± z* SE,

where SE = σ/√n, and z* is the appropriate quantile of the standard normal distribution. That is, it is the number such that (100 − c)/2% of the
probability in the standard normal distribution is above z*. Thus, if we're
looking for a 95% confidence interval, we take X̄ ± 2 SE, whereas a 99%
confidence interval would be X̄ ± 2.6 SE, since we see on the normal table
that P(Z < 2.6) = 0.9953, so P(Z > 2.6) = 0.0047 ≈ 0.5%. (Note: The
central importance of the 95% confidence interval derives primarily from its
convenient correspondence to a z value of 2. More precisely, it is 1.96, but
we rarely need, or indeed can justify, such precision.)
Level            68%    90%    95%    99%    99.7%
z*               1.0    1.64   1.96   2.6    3.0
Prob. above z*   0.16   0.05   0.025  0.005  0.0015

Table 7.1: Parameters for some commonly used confidence intervals.


In other situations, as we will see, we use the same formula for a normal
confidence interval for a parameter . The only thing that changes from
and the standard error.
problem to problem is the point estimate X,

7.2  Interpreting the confidence interval

But what does "confidence" mean? The quantity μ is a fact, not a random quantity, so we cannot say "The probability is 0.95 that μ is between
1722mm and 1742mm."¹ The randomness is in our estimate μ̂ = X̄. The
true probability statement

P( X̄ ∈ (μ − 1.96 SE, μ + 1.96 SE) ) = 0.95

is equivalent, by simple arithmetic, to

P( μ ∈ (X̄ − 1.96 SE, X̄ + 1.96 SE) ) = 0.95.

The latter statement looks like something different, a probability statement
about μ, but really it is a probability statement about the random
interval: 95% of the time, the random interval generated according to this
recipe will cover the true parameter.
Definition 7.2.1. A (γ × 100)% confidence interval (also called a confidence interval with confidence coefficient or confidence level γ) for a
parameter θ, based on observations X := (X1, . . . , Xn), is a pair of statistics
(that is, quantities you can compute from the data X) A(X) and B(X), such
that

P( A(X) ≤ θ ≤ B(X) ) = γ.

¹ An alternative approach to statistics, called Bayesian statistics, does allow us to make
precise sense of probability statements about unknown parameters, but we will not be
considering it in this course.


The quantity P A(X) B(X) is called the coverage probability
for . Thus, a confidence interval for with confidence coefficient is
precisely a random interval with coverage probability . In many cases, it is
not possible to find an interval with exactly the right coverage probability.
We may have to content ourselves with an approximate confidence interval
(with coverage probability ) or a conservative confidence interval (with
coverage probability ). We usually make every effort not to overstate our
confidence about statistical conclusions, which is why we try to err on the
side of making the coverage probability hence the interval too large.
An illustration of this problem is given in Figure 7.1. Suppose we are
measuring systolic blood pressure on 100 patients, where the true blood
pressure is 120 mmHg, but the measuring device makes normally distributed
errors with mean 0 and SD 10 mmHg. In order to reduce the errors, we
take four measures on each patient and average them. Then we compute
a confidence interval. The measures are shown in Figure 7.1(a). In Figure
7.1(b) we have shown a 95% confidence interval for each patient, computed
by taking the average of the patient's four measurements, plus and minus 10.
Notice that there are 6 patients (shown by red Xs for their means) where
the true measure, 120 mmHg, lies outside the confidence interval. In
Figures 7.1(c) and 7.1(d) we show 90% and 68% confidence intervals, which
are narrower, and hence miss the true value more frequently.
A 90% confidence interval tells you that 90% of the time the true value
will lie in this range. In fact, we find that there are exactly 90 out of 100
cases where the true value is in the confidence interval. The 68% confidence
intervals do a bit better than would be expected on average: 74 of the 100
trials had the true value in the 68% confidence interval.

Figure 7.1: Confidence intervals for 100 patients' blood pressure, based on four measurements. Panel (a) shows each patient's four measurements; panels (b), (c), and (d) show 95%, 90%, and 68% confidence intervals respectively. The true BP in each case is 120, and the measurement errors are normally distributed with mean 0 and SD 10.

7.3  Confidence intervals for probability of success

We discussed in section 6.8 that the binomial distribution can be well approximated by a normal distribution. This means that if we are estimating
the probability of success p from some observations of successes and failures,
we can use the same methods as above to put a confidence interval on p.
For instance, the Gallup organisation carried out a poll in October, 2005,
of Americans' attitudes about guns (see http://www.gallup.com/poll/20098/gun-ownership-use-america.aspx).
They surveyed 1,012 Americans, chosen at random. Of these, they found that 30% said they personally
owned a gun. But, of course, if they'd picked different people, purely by
chance they would have gotten a somewhat different percentage. How different could it have been? What does this survey tell us about the true
fraction (call it p) of Americans who own guns?
We can compute a 95% confidence interval as 0.30 ± 1.96 SE. All we need
to know is the SE for the proportion p, which is the same as the standard
deviation for the observed proportion of successes. We know from section
6.8 (and discussed again at length in section 8.3) that the standard error is

SE = √( p(1 − p)/n ),

where n is the number of samples. In this case, we get SE = √(0.3 × 0.7/1012) = 0.0144. So a

95% confidence interval for p is 0.30 ± 0.029 = (0.271, 0.329).


Loosely put, we can be 95% confident that the true proportion supporting EPP is between 27% and 33%. A 99% confidence interval comes from
multiplying by 2.6 instead of 1.96: it goes from 26.3% to 33.7%.
Notice that the Standard Error for a proportion is a maximum when
p = 0.5. Thus, we can always get a conservative confidence interval an
interval where the probability of finding the true parameter
p in it is at least
95% (or whatever the level is) by taking the SE to be .25/n. The 95%
confidence interval then has the particularly simple form sample mean

1/ n.
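A sketch of both intervals in Python:

from math import sqrt

n, p_hat = 1012, 0.30
se = sqrt(p_hat * (1 - p_hat) / n)            # about 0.0144
print(p_hat - 1.96 * se, p_hat + 1.96 * se)   # about (0.27, 0.33)

se_cons = sqrt(0.25 / n)                      # conservative SE, taking p = 0.5
print(p_hat - 1.96 * se_cons, p_hat + 1.96 * se_cons)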

7.4  The Normal Approximation


Approximation Theorems in Probability
Suppose X1, X2, . . . , Xn are independent samples from a
probability distribution with mean μ and variance σ^2.
Then:
Law of Large Numbers (LLN): For n large, X̄ will be
close to μ.
Central Limit Theorem (CLT): For n large, the error
in the LLN is close to a normal distribution, with variance
σ^2/n. That is, using our standardisation procedure for the
normal distribution,

Z = (X̄ − μ)/(σ/√n)        (7.1)

is close to having a standard normal distribution. Equivalently, X1 + · · · + Xn has approximately N(nμ, nσ^2) distribution.
So far, we have been assuming that our data are sampled from a population with a normal distribution. What justification do we have for this
assumption? And what do we do if the data come from a different distribution? One of the great early discoveries of probability theory is that many
different kinds of random variables come close to a normal distribution when
you average enough of them. You have already seen examples of this phenomenon in the normal approximation to the binomial distribution and the
Poisson.
In probability textbooks such as [Fel71] you can find very precise statements about what it means for the distribution to be close. For our purposes, we will simply treat Z as being actually a standard normal random
variable. However, we also need to know what it means for n to be large.
For most distributions that you might encounter, 20 is usually plenty, while
2 or 3 are not enough. The key rules of thumb are that the approximation
works best when the distribution of X
(i). is reasonably symmetric: Not skewed in either direction.


(ii). has thin tails: Most of the probability is close to the mean, not many
SDs away from the mean.
More specific indications will be given in the following examples.

7.4.1  Normal distribution

Suppose the Xi are drawn from a normal distribution with mean μ and variance
σ^2, i.e. N(μ, σ^2). We know that X1 + · · · + Xn has N(nμ, nσ^2) distribution,
and so X̄ has N(μ, σ^2/n) distribution. A consequence is that Z, as
defined in (7.1), in fact has exactly the standard normal distribution. In
fact, this is an explanation for why the CLT works: The normal distribution
is the only distribution such that when you average multiple copies of it,
you get another distribution of the same sort.² Other distributions are not
stable under averaging, and naturally converge to the distributions that are.

7.4.2  Poisson distribution

Suppose the Xi are drawn from a Poisson distribution with parameter λ. The
variance is then also λ. We know that X1 + · · · + Xn has Poisson distribution
with parameter nλ. The CLT tells us, then, that for n large enough, the
Po(nλ) distribution is very close to the N(nλ, nλ) distribution; or, in other
words, Po(λ) is approximately the same as N(λ, λ) for λ large. How large
should it be?
The Poisson distribution is shown in Figure 7.2 for different values of λ,
together with the approximating normal density curve. One way of seeing
the failure of the approximation for small λ is to note that when λ is not
much bigger than σ = √λ ("much bigger" meaning a factor of 2.5 or so, so
λ < 6.2), the normal curve will have substantial probability below −0.5. Since
this is supposed to approximate the probability of the corresponding Poisson
distribution below 0, this manifestly represents a failure. For instance, when
λ = 1, the Po(1) distribution is supposed to be approximated by N(1, 1),
implying
implying






P( Po(1) < 0 ) ≈ P( N(1, 1) < −0.5 ) = P( N(0, 1) < −1.5 ) = 0.067.

In general, the threshold −0.5 corresponds to Z = (−0.5 − λ)/√λ. The
corresponding values for other parameters are given in Table 7.2.

² Technically, it is the only distribution with finite variance for which this is true.

Figure 7.2: Normal approximations to Po(λ), for λ = 1, 4, 10, 20. Shaded region is the implied approximate probability of the Poisson variable < 0.


λ     Standardised Z    Normal probability
1     −1.5              0.067
4     −2.25             0.012
10    −3.32             0.00045
20    −4.58             0.0000023

Table 7.2: Probability below −0.5 in the normal approximation to Poisson random variables with different parameters λ.

7.4.3  Bernoulli variables

"Bernoulli variables" is the name for random variables that are 1 or 0,
with probability p or 1 − p respectively. Then B = X1 + · · · + Xn is the
number of successes in n trials, with success probability p each time;
that is, a Bin(n, p) random variable. Again, we have already discussed
that binomial random variables may be approximated by normal random
variables. Xi has expectation p and variance p(1 − p). The CLT then
implies that Bin(n, p) ≈ N(np, np(1 − p)) for large values of n. Note that
B/n is the proportion of successes in n trials, and this has approximately
N(p, p(1 − p)/n) distribution. In other words, the observed proportion will
be close to p, but will be off by a small multiple of the SD, which shrinks as
σ/√n, where σ = √(p(1 − p)). This is exactly the same thing we discussed
in section 7.3.
How large does n need to be? As in the case of the Poisson distribution, discussed in section 7.4.2, a minimum requirement is that the mean be
substantially larger than the SD; in other words, np ≫ √(np(1 − p)), so that
n ≫ 1/p. (The condition is symmetric, so we also need n ≫ 1/(1 − p).) This
fits with our rule of thumb that n needs to be bigger when the distribution
of X is skewed, which is the case when p is close to 0 or 1.
In Figure 7.3 we see that when p = 0.5 the normal approximation is
quite good, even when n is only 10; on the other hand, when p = 0.1 we
have a good normal approximation when n = 100, but not when n is 25.
(Note, by the way, that Bin(25, 0.1) is approximately Po(2.5), so this is
closely related to the observations we made in section 7.4.2.)

7.5  CLT for real data

We show how the CLT is applied to understand the mean of samples from
real data. It permits us to apply our Z and t tests for testing population
means and computing confidence intervals for the population mean (as well
as for differences in means) even to data that are not normally distributed.
(Caution: Remember that t is an improvement over Z only when the number
of samples being averaged is small. Unfortunately, the CLT itself may not
apply in such a case.) We have already applied this idea when we did the Z
test for proportions, and the CLT was also hidden in our use of the χ² test.

Figure 7.3: Normal approximations to Bin(n, p), for p = 0.5 and p = 0.1 and n = 3, 10, 25, 100. Shaded region is the implied approximate probability of the Binomial variable < 0 or > n.

7.5.1  Quebec births

We begin with an example that is well suited to fast convergence. We
have a list of 5,113 numbers, giving the number of births recorded each
day in the Canadian province of Quebec over a period of 14 years, from 1
January, 1977 through 31 December, 1990. (The data are available at the
Time Series Data Library http://www.robjhyndman.com/TSDL/, under the
rubric "demography".) A histogram of the data is shown in Figure 7.4(a).
The mean number of births is μ = 251, and the SD is σ = 41.9.
Suppose we were interested in the average number of daily births, but
couldn't observe data for all of the days. How many days would we need
to observe to get a reasonable estimate? Obviously, if we observed just a
single day's data, we would be seeing a random pick from the histogram
7.4(a), which could be far off of the true value. (Typically, it would be off
by about the SD, which is 41.9.) Suppose we sample the data from n days,
obtaining counts x1, . . . , xn, which average to x̄. How far off might this be?
The normal approximation tells us that a 95% confidence interval for μ will
be x̄ ± 1.96 × 41.9/√n. For instance, if n = 10, and we find the mean of
our 10 samples to be 245, then a 95% confidence interval will be (219, 271).
If there had been 100 samples, the confidence interval would be (237, 253).
Put differently, the average of 10 samples will lie within 26 of the true mean
95% of the time, while the average of 100 samples will lie within 8 of the
true mean 95% of the time.
This computation depends upon n being large enough to apply the CLT.
Is it? One way of checking is to perform a simulation: We let a computer
pick 1000 random samples of size n, compute the means, and then look at
the distribution of those 1000 means. The CLT predicts that they should
have a certain normal distribution, so we can compare them and see. If
n = 1, the result will look exactly like Figure 7.4(a), where the curve in red
is the appropriate normal approximation predicted by the CLT. Of course,
there is no reason why the distribution should be normal for n = 1. We see
that for n = 2 the true distribution is still quite far from normal, but by
n = 10 the normal is already starting to fit fairly closely, and by n = 100


the fit has become extremely good.
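A sketch of such a simulation in Python with numpy; here "births" stands in for the array of 5,113 daily counts, which we do not reproduce:

import numpy as np

rng = np.random.default_rng(0)

def sample_means(births, n, reps=1000):
    # means of 'reps' random samples of size n drawn from the data
    return [rng.choice(births, size=n, replace=False).mean() for _ in range(reps)]

# Compare, e.g., a histogram of sample_means(births, 10) with the
# N(251, 41.9**2 / 10) density predicted by the CLT.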


Suppose we sample 100 days at random. What is the probability that
the total number of births is at least 25500? Let S = Σ_{i=1}^{100} Xi. Then S
is normally distributed with mean 25100, and SD 41.9 × √100 = 419. We
compute by standardising:

P( S > 25500 ) = P( (S − 25100)/419 > (25500 − 25100)/419 ) = P( Z > 0.95 ),

where Z = (S − 25100)/419. By the CLT, Z has approximately the standard
normal distribution, so we can look up its probabilities on the table, and see
that P(Z > 0.95) = 1 − P(Z ≤ 0.95) = 0.171.
Bonus question: S comes in whole number values, so
shouldn't we have made the cutoff 25500.5? Or should it
be 25499.5? If we want the probability that S is strictly
bigger than 25500, then the cutoff should be 25500.5. If we
want the probability that S is at least 25500, then the cutoff
should be 25499.5. If we don't have a specific preference, then 25500
is a reasonable compromise. Of course, in this case, it only
makes a difference of about 0.002 in the value of Z, which
is negligible.
This is of course the same as the probability that the average number of
births is at least 255. We could also compute this by reasoning that X̄ =
S/100 is normally distributed with mean 251 and SD 41.9/√100 = 4.19.
Thus,

P( X̄ > 255 ) = P( (X̄ − 251)/4.19 > (255 − 251)/4.19 ) = P( Z > 0.95 ),

which comes out to the same thing.

Figure 7.4: Normal approximations to averages of n samples from the Quebec birth data, for n = 1, 2, 5, 10, 50, 100. [Histograms of the sample means, with the normal approximation predicted by the CLT drawn in red.]

7.5.2  California incomes

A standard example of a highly skewed distribution (hence a poor candidate for applying the CLT) is household income. The mean is much
greater than the median, since there are a small number of extremely high
incomes. It is intuitively clear that the average of incomes must be hard to
predict. Suppose you were sampling 10,000 Americans at random (a very
large sample) whose average income is $30,000. If your sample happens

to include Bill Gates, with annual income of, let us say, $3 billion, then his
income will be ten times as large as the total income of the entire remainder
of the sample. Even if everyone else has zero income, the sample mean will
be at least $300,000. The distribution of the mean will not converge, or will
converge only very slowly, if it can be substantially affected by the presence
or absence of a few very high-earning individuals in the sample.
Figure 7.5(a) is a histogram of household incomes, in thousands of US
dollars, in the state of California in 1999, based on the 2000 US census (see
www.census.gov). We have simplified somewhat, since the final category is
"more than $200,000", which we have treated as being the range $200,000 to
$300,000. (Remember that histograms are on a density scale, with the area
of a box corresponding to the number of individuals in that range. Thus,
the last three boxes all correspond to about 3.5% of the population, despite
their different heights.) The mean income is about μ = $62,000, while the
median is $48,000. The SD of the incomes is σ = $55,000.
Figures 7.5(b)-7.5(f) show the effect of averaging 2, 5, 10, 50, and 100
randomly chosen incomes, together with a normal distribution (in green) as
predicted by the CLT, with mean μ and variance σ^2/n. We see that the
convergence takes a little longer than it did with the more balanced birth
data of Figure 7.4 (averaging just 10 incomes is still quite skewed), but
by the time we have reached the average of 100 incomes the match to the
predicted normal distribution is remarkably good.

7.6  Using the Normal approximation for statistical inference

There are many implications of the Central Limit Theorem. We can use it to estimate the probability of obtaining a total of at least 400 in 100 rolls of a fair six-sided die, for instance, or the probability of a subject in an ESP experiment, guessing one of four patterns, obtaining 30 correct guesses out of 100 purely by chance. These were discussed in lecture 6 of the first set of lectures. It also suggests an explanation for why height and weight, and any other quantity that is affected by many small random factors, should end up being normally distributed.
Here we discuss one crucial application: The CLT allows us to compute normal confidence intervals and apply the Z test to data that are not themselves normally distributed.
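To make the die-rolls computation concrete, here is a minimal sketch in Python (our own illustration, using numpy and scipy; it is not part of the original lectures). It treats "a total of at least 400" as a cutoff of 399.5, in line with the continuity-correction discussion above:

    import numpy as np
    from scipy.stats import norm

    mu, var = 3.5, 35/12             # mean and variance of a single die roll
    n = 100
    se_sum = np.sqrt(n * var)        # SD of the sum, about 17.1
    z = (399.5 - n * mu) / se_sum    # continuity correction: cutoff 399.5
    print(norm.sf(z))                # upper-tail probability, about 0.002

    # Simulation check of the CLT approximation
    rolls = np.random.default_rng(0).integers(1, 7, size=(100_000, n))
    print((rolls.sum(axis=1) >= 400).mean())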

7.6.1 An example: Average incomes

Suppose we take a random sample of 400 households in Oxford, and find that they have an average income of £36,200, with an SD of £26,400. What can we infer about the average income of all households in Oxford?
Answer: Although the distribution of incomes is not normal (and if we weren't sure of that, we could see it from the fact that the SD is not much smaller than the mean), the average of 400 incomes will be normally distributed. The SE for the mean is 26400/√400 = 1320, so a 95% confidence interval for the average income in the population will be 36200 ± 1.96 × 1320 ≈ (33610, 38790). A 99% confidence interval is 36200 ± 2.58 × 1320 ≈ (32800, 39600).
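The same interval arithmetic, as a short Python sketch (again our own illustration, not from the original notes):

    import numpy as np
    from scipy.stats import norm

    mean, sd, n = 36200, 26400, 400
    se = sd / np.sqrt(n)                   # SE of the mean: 1320
    for level in (0.95, 0.99):
        z = norm.ppf(1 - (1 - level) / 2)  # 1.96 and 2.58
        print(level, (mean - z * se, mean + z * se))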

Figure 7.5: Normal approximations to averages of n samples from the California income data, for n = 1, 2, 5, 10, 50, and 100. The horizontal axes show income in $thousands, the vertical axes density. The green curve shows a normal density with mean µ and variance σ²/n.

Lecture 8

The Z Test

8.1 Introduction

In Lecture 1 we saw that statistics has a crucial role in the scientific process, and that we need a good understanding of statistics in order to avoid reaching invalid conclusions from the experiments that we do. In Lectures 2 and 3 we saw how the use of statistics necessitates an understanding of probability. This led us to study how to calculate and manipulate probabilities using a variety of probability rules. In Lectures 4, 5 and 6 we considered three specific probability distributions that turn out to be very useful in practical situations. Effectively, all of these previous lectures have provided us with the basic tools we need to use statistics in practical situations.
The goal of statistical analysis is to draw reasonable conclusions from the
data and, perhaps even more important, to give precise statements about the
level of certainty that ought to be attached to those conclusions. In lecture
7 we used the normal distribution to derive one form that these precise
statements can take: a confidence interval for some population mean. In
this lecture we consider an alternative approach to describing very much the
same information: Significance tests.

8.2 The logic of significance tests

Example 8.1: Baby-boom hypothesis test


Consider the following hypothetical situation: Suppose we think that UK newborns are heavier than Australian newborns. We know from large-scale studies that UK newborns average 3426g, with an SD of 538g. (See, for example, [NNGT02].) The weights are approximately normally distributed. We think that maybe babies in Australia have a mean birth weight smaller than 3426g, and we would like to test this hypothesis.

Intuitively we know how to go about testing our hypothesis. We need to take a sample of babies from Australia, measure their birth weights, and see if the sample mean is significantly smaller than 3426g. Now, we have a sample of 44 Australian newborns, presented in Table 1.2, with histogram presented in Figure 8.1. (Ignore for the moment that these are not really a sample of all Australian babies. . . )

Figure 8.1: A histogram showing the birth weight distribution (in g) in the Babyboom dataset.
We observe that the sample mean of these 44 weights is 3276g. So we might just say that we're done. The average of these weights is smaller than 3426g, which is what we wanted to show. "But wait!" a skeptic might say. "You might have just happened by chance to get a particularly light group of newborns. After all, even in England lots of newborns are lighter than 3276g. And 44 isn't such a big sample."


How do we answer the skeptic? Is 44 big enough to conclude that there is a real difference in weights between the Australian sample and the known English average? We need to distinguish between
The research hypothesis: Australian newborns have a mean weight less than 3426g;
and
The null hypothesis: There's no difference in mean weight; the apparent difference is purely due to chance.
How do we decide which is true? We put the null hypothesis to the test. It says that the 44 observed weights are just like what you might observe if you picked 44 at random from the UK newborn population; that is, from a normal distribution with mean 3426g and SD 538g.
Let X1, . . . , X44 be 44 weights picked at random from a N(3426, 538²) distribution, and let X̄ = (1/44)(X1 + · · · + X44) be their mean. How likely is it that X̄ is as small as 3276? Of course, it's never impossible, but we want to know how plausible it is.
We know from section 6.6 that

X1 + · · · + X44 ∼ N(3426 × 44, 538² × 44), and

X̄ = (1/44)(X1 + · · · + X44) ∼ N(3426, 538²/44) = N(3426, 81²).

Thus,

P(X̄ ≤ 3276) = P(Z ≤ (3276 − 3426)/81) = P(Z ≤ −1.85) = 1 − P(Z < 1.85),

where Z = (X̄ − 3426)/81 has a standard normal distribution. Looking this up on the standard normal table, we see that the probability is about 0.032. □
The probability 0.032 that we compute at the end of Example 8.1 is called the p-value of the test. It tells us how likely it is that we would observe such an extreme result if the null hypothesis were true. The lower the p-value, the stronger the evidence against the null hypothesis. We are faced with the alternative: Either the null hypothesis is false, or we have by chance happened to get a result that would only happen about one time in 30. This seems unlikely, but not impossible.
Pay attention to the double negative that we commonly use for significance tests: We have a research hypothesis, which we think would be interesting if it were true. We don't test it directly, but rather we use the data to challenge a less interesting null hypothesis, which says that the apparently interesting differences that we've observed in the data are simply the result of chance variation. We find out whether the data support the research hypothesis by showing that the null hypothesis is false (or unlikely).
If the null hypothesis passes the test, then we know only that this particular challenge was inadequate. We haven't proven the null hypothesis. After all, we may just not have found the right challenger; a different experiment might show up the weaknesses of the null. (The potential strength of the challenge is called the power of the test, and we'll learn about that in section 13.2.)
What if the challenge succeeds? We can then conclude with confidence (how much confidence depends on the p-value) that the null was wrong. But in a sense, this is shadow boxing: We don't exactly know who the challenger is. We have to think carefully about what the plausible alternatives are. (See, for instance, Example 8.2.)

8.2.1 Outline of significance tests

The basic steps carried out in Example 8.1 are common to most significance tests:
(i). Begin with a research (alternative) hypothesis.
(ii). Set up the null hypothesis.
(iii). Collect a sample of data.
(iv). Calculate a test statistic from the sample of data.
(v). Compare the test statistic to its sampling distribution under the null hypothesis and calculate the p-value. The smaller the p-value, the stronger the evidence against the null hypothesis.

8.2.2 Significance tests or hypothesis tests? Breaking the .05 barrier

We use p-values to weigh scientific evidence. What if we need to make a decision?
One common situation is that the null hypothesis is being compared to an alternative that implies a definite course of action. For instance, we may be testing whether daily doses of vitamin C prevent colds: We take 100 subjects, give them vitamin C supplements every day for a year and no vitamin C supplement for another year, and compare the numbers of colds. At the end, we have two alternatives: either make a recommendation for vitamin C, or not.
The standard approach is to start by saying: The neutral decision is to make no recommendation, and we associate that with the null hypothesis, which says that any difference observed may be due to chance. In this system, the key goal is to control the likelihood of falsely making a positive recommendation (because we have rejected the null hypothesis). This situation, where we incorrectly reject the null hypothesis, is called a Type I Error. The opposite situation, where we retain the null hypothesis although it is false, is called a Type II Error.
By definition, if the null hypothesis is true, the probability that the p-value is less than a given number α is exactly α. Thus, we begin our hypothesis test by fixing α, the probability of a Type I error, to be some tolerably low number. We call this the significance level of the test. (A common choice is α = 0.05, but the significance level can be anything you choose. If the consequences of a Type I Error would be extremely serious (for instance, if we are testing a new and very expensive cancer drug, with the expectation that we will move to prescribing this drug for all patients at great expense if it is shown to be significantly better) we might choose a smaller value of α.)
In our current example, the p-value is about 0.032, which is lower than 0.05. In this case, we would conclude that
there is significant evidence against the null hypothesis at the 5% level.
Another way of saying this is that
we reject the null hypothesis at the 5% level.
If the p-value for the test were much larger, say 0.23, then we would conclude that

                                          Truth
  Decision       H0 True                            H0 False
  Retain H0      Correct (prob. 1 − α)              Type II Error (prob. = β)
  Reject H0      Type I Error (prob. = level = α)   Correct (prob. = Power = 1 − β)

Table 8.1: Types of errors

the evidence against the null hypothesis is not significant at the 5% level.
Another way of saying this is that
we cannot reject the null hypothesis at the 5% level.
Note that the conclusion of a hypothesis test, strictly speaking, is binary: We either reject or retain the null hypothesis. There are no gradations, no strong rejection or borderline rejection or barely retained. A p-value of 10⁻⁶ ought not to be taken, retrospectively, as stronger evidence against the null hypothesis than a p-value of 0.04 would have been.
By this strict logic the data are completely used up in the test: If we are testing at the 0.05 level and the p-value is 0.06, we cannot then collect more data to see if we can get a lower p-value. We would have to throw away the data, and start a new experiment. Needless to say, this is not what scientists really do, which makes even the apparently clear-cut yes/no decision set-up of the hypothesis test in reality rather difficult to interpret.
It is also quite common to confuse this situation with using significance
tests to judge scientific evidence. So common, in fact, that many scientific
journals impose the 0.05 significance threshold to decide whether results are
worth publishing. An experiment that resulted in a statistical test with
a p-value of 0.10 is considered to have failed, even if it may very well be
providing reasonable evidence of something important; if it resulted in a
statistical test with a p-value of 0.05 then it is a success, even if the effect
size is minuscule, and even though 1 out of 20 true null hypotheses will fail
the test at significance level 0.05.
Another way of thinking about hypothesis tests is that there is some critical region of values such that if the test statistic lies in this region then we will reject H0. If the test statistic lies outside this region we will not reject H0. In our example, using a 5% level of significance, this set of values will be the most extreme 5% of values in the left-hand tail of the distribution, since the alternative hypothesis points to small values. Using our tables backwards we can calculate that the boundary of this region, called the critical value, will be −1.645. The value of our test statistic is −1.85, which lies in the critical region, so we reject the null hypothesis at the 5% level.

[Figure: the N(0, 1) density, with the critical region of total area 0.05 shaded in the tail beyond the critical value.]

8.2.3 Overview of Hypothesis Testing

Hypothesis tests are identical to significance tests, except for the choice of a significance level at the beginning, and the nature of the conclusions we draw at the end:
(i). Begin with a research (alternative) hypothesis and decide upon a level of significance for the test.
(ii). Set up the null hypothesis.
(iii). Collect a sample of data.
(iv). Calculate a test statistic from the sample of data.
(v). Compare the test statistic to its sampling distribution under the null hypothesis and calculate the p-value, or equivalently, calculate the critical region for the test.
(vi). Reject the null hypothesis if the p-value is less than the level of significance, or equivalently, if the test statistic lies in the critical region. Otherwise, retain the null hypothesis.

8.3 The one-sample Z test

A common situation in which we use hypothesis tests is when we have multiple independent observations from a distribution with unknown mean, and we can make a test statistic that is normally distributed. The null hypothesis should then tell us what the mean and standard error are, so that we can normalise the test statistic. The normalised test statistic is then commonly called Z. We always define Z by

Z = (observation − expectation) / standard error.        (8.2)

The expectation and standard error are the mean and the standard deviation of the sampling distribution: that is, the mean and standard deviation that the observation has when seen as a random variable, whose distribution is given by the null hypothesis. Thus, Z has been standardised: its distribution is standard normal, and the p-value comes from looking up the observed value of Z on the standard normal table.
We call this a one-sample test because we are interested in testing the mean of samples from a single distribution. This is as opposed to the two-sample test (discussed in a later lecture), in which we are testing the difference in means between two populations.

8.3.1 Test for a population mean

We know from Lecture 6 that if X1 ∼ N(µ, σ²) and X2 ∼ N(µ, σ²) are independent, then

X̄ = (1/2)X1 + (1/2)X2 ∼ N( (1/2)µ + (1/2)µ, (1/2)²σ² + (1/2)²σ² ),

that is,

X̄ ∼ N(µ, σ²/2).

In general,

If X1, X2, . . . , Xn are n independent and identically distributed random variables from a N(µ, σ²) distribution then

X̄ ∼ N(µ, σ²/n).
Thus, if we are testing the null hypothesis
H0: The Xi have N(µ, σ²) distribution,
the expectation is µ, and the standard error is σ/√n. Thus,

When testing the sample mean of n normal samples, with known SD σ, for the null hypothesis mean = µ, the test statistic is

Z = (sample mean − µ) / (σ/√n).

Thus, under the assumption of the null hypothesis, the sample mean of 44 values from a N(3426, 538²) distribution is

X̄ ∼ N(3426, 538²/44) = N(3426, 81²).
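Here is the whole baby-boom test as a minimal Python sketch (our own illustration; the numbers, but not the code, come from the example above):

    import numpy as np
    from scipy.stats import norm

    mu0, sigma, n = 3426, 538, 44    # null-hypothesis mean and SD, sample size
    xbar = 3276                      # observed sample mean
    se = sigma / np.sqrt(n)          # standard error, about 81
    z = (xbar - mu0) / se            # test statistic, about -1.85
    print(z, norm.cdf(z))            # one-tailed p-value, about 0.032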

8.3.2 Test for a sum

Under some circumstances it may seem more intuitive to work with the sum of observations rather than the mean. If S = X1 + · · · + Xn, where the Xi are independent with N(µ, σ²) distribution, then S ∼ N(nµ, nσ²). That is, the expectation is nµ and the standard error is σ√n.


When testing the sum of n normal samples, with known SD σ, for the null hypothesis mean = µ, the test statistic is

Z = (observed sum of samples − nµ) / (σ√n).

8.3.3 Test for a total number of successes

Suppose we are observing independent trials, each of which has unknown probability of success p. We observe X successes, giving the estimate p̂ = X/n. Suppose we have some possible value p0 of interest, and we wish to test the null hypothesis
H0: p = p0
against the alternative
H1: p > p0.
We already observed in section 6.8 that the random variable X has distribution very close to normal, with mean np and standard error √(np(1 − p)), as long as n is reasonably large. We have then the test statistic

When testing the number of successes in n trials, for the null hypothesis P(success) = p0, the test statistic is

Z = (observed number of successes − np0) / √(np0(1 − p0)).

Example 8.2: The Aquarius Machine, continued

We repeat the computation of Example 6.9. The null hypothesis, corresponding to no extrasensory powers, is
H0: p = p0 = 0.25;
the alternative hypothesis, Tart's research hypothesis, is
H1: p > 0.25.
With n = 7500, the expected number of successes under the null hypothesis is 7500 × (1/4) = 1875, and the standard error is √(7500 × (1/4) × (3/4)) = 37.5. We compute the test statistic

Z = (observed number of successes − expected number of successes) / standard error
  = (2006 − 1875) / 37.5
  = 3.49.

(This is slightly different from the earlier computation (z = 3.48) because we conventionally ignore the continuity correction when computing test statistics.) Thus we obtain from the standard normal table a p-value of 0.0002.
So it is extraordinarily unlikely that we would get a result this extreme purely by chance, if the null hypothesis holds. If p0 = 1/4, then Tart happened to obtain a result that one would expect to see just one time in 5000. Must we then conclude that p > 1/4? And must we then allow that at least some subjects had precognitive powers? Actually, in this case we know what happened to produce this result. It seems that there were defects in the random number generator, making the same light less likely to come up twice in a row. Subjects presumably cued in to this pattern after a while (they were told after each guess whether they'd been right) and made use of it for their later guesses. Thus, the binomial distribution did not hold (the outcomes of different tries were not independent, and did not all have probability 1/4), but not in the way that Tart supposed. Thus, one needs always to keep in mind: Statistical tests can tell us that the data did not come from our chance model, but it doesn't necessarily follow that our favourite alternative is true.
Some people use the term Type III error to refer to the mistake of correctly rejecting the null hypothesis, but for the wrong reason. Thus, to infer that the subjects had extrasensory powers from these data would have been a Type III error. □


8.3.4 Test for a proportion

When testing for probability of success in independent trials, it often seems natural to consider the proportion of successes rather than the number of successes as the fundamental object. Under the null hypothesis
H0: p = p0
the expected proportion of successes X/n is p0, and the standard error is √(p0(1 − p0)/n).

When testing the proportion of successes in n trials, for the null hypothesis P(success) = p0, the test statistic is

Z = (proportion of successes − p0) / √(p0(1 − p0)/n).

Z has standard normal distribution.


The test statistic will come out exactly the same, regardless of whether we work with numbers of successes or proportions.

Example 8.3: The Aquarius machine, again

We repeat the computations of Example 8.2, treating the proportion of correct guesses as the basic object. The observed proportion of successes is p̂ = k/n = 2006/7500 = 0.26747. The standard error for the proportion is

SE = √(p0(1 − p0)/n) = √((1/4)(3/4)/7500) = 0.005.

Thus, the test statistic is

Z = (0.26747 − 0.25)/0.005 = 3.49,

which is exactly the same as what we computed before. □
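A quick check in Python (our own sketch, not from the notes) that the two forms of the statistic agree, using the Aquarius numbers:

    import numpy as np

    n, k, p0 = 7500, 2006, 0.25
    z_count = (k - n * p0) / np.sqrt(n * p0 * (1 - p0))   # successes form
    z_prop = (k / n - p0) / np.sqrt(p0 * (1 - p0) / n)    # proportions form
    print(z_count, z_prop)                                # both about 3.49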

Example 8.4: GBS and swine flu vaccine

In Example 5.7 we fit a Poisson distribution to the number of GBS cases by week after vaccination. We noted that the fit, given in Table 5.3, didn't look very good, and concluded that GBS cases were not independent of the time of vaccination. But we did not test this goodness of fit formally.
In the current formalism, the null hypothesis formalises the notion that GBS is independent of vaccination, so that numbers of GBS cases are Poisson distributed, with parameter λ = 1.33. We test this by looking at the number of weeks with 0 GBS cases, which was observed to be 16. The formal null hypothesis is
H0: P(0 cases in a week) = e^{−1.33} = 0.2645.
The alternative hypothesis is
H1: P(0 cases in a week) ≠ 0.2645.
The observed proportion is 16/40 = 0.4. The standard error is

SE = √(0.2645 × 0.7355/40) = 0.0697.

Thus, we may compute

Z = (0.4 − 0.2645)/0.0697 = 1.94.

Looking this up on the table, we see that P(Z > 1.94) = 1 − P(Z < 1.94) = 0.026. Since we have a two-sided alternative, the p-value is twice this, or 0.052.
If we were doing a significance test at the 0.05 level (or any lower level), we would simply report that the result was not significant at the 0.05 level, and retain the null hypothesis. Otherwise, we simply report the p-value and let the reader make his or her own judgement. □

8.3.5 General principles: The square-root law

The fundamental fact which makes statistics work is that when we add up n independent observations, the expected value increases by a factor of n, while the standard error increases only by a factor of √n. Thus, when we divide by n to obtain a mean (or a proportion), the standard error ends up shrinking by a factor of √n. This corresponds to our intuition that the average of many independent samples will tend to be closer to the true value than any single measurement. If the standard deviation of the population is σ, the standard error of the sample mean is σ/√n. Intuitively, the standard error tells us about how far off the sample mean will be from the true population mean (or true probability of success): we will almost never be off by more than 3 SEs.
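A small simulation in Python (our own sketch) illustrating the square-root law: the SD of the sample mean matches σ/√n:

    import numpy as np

    rng = np.random.default_rng(1)
    sigma = 538                      # population SD, as in Example 8.1
    for n in (10, 40, 160):
        means = rng.normal(3426, sigma, size=(50_000, n)).mean(axis=1)
        print(n, means.std(), sigma / np.sqrt(n))   # the two agree closely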

8.4 One and two-tailed tests

In Example 8.1 we wanted to test the research hypothesis that mean birth
weight of Australian babies was less than 3426g. This suggests that we
had some prior information that the mean birth weight of Australian babies
was definitely not higher than 3426g, and that the interesting question was
whether the weight was lower. If this were not the case then our research
hypothesis would be that the mean birth weight of Australian babies was
different from 3426g. This allows for the possibility that the mean birth
weight could be less than or greater than 3426g.
In this case we would write our hypotheses as
H0: µ = 3426g
H1: µ ≠ 3426g
As before we would calculate our test statistic as −1.85. The p-value is different, though. We are not looking only at the probability that Z is this small (in the negative direction), but at the probability that Z is at least this big in either direction; so P(|Z| > 1.85), or P(Z > 1.85) + P(Z < −1.85) = 2P(Z > 1.85). Because of symmetry,

For a Z test, the two-tailed p-value is always twice as big as the one-tailed p-value.

In this case we allow for the possibility that the mean value is greater than 3426g by setting our critical region to be the lowest 2.5% and the highest 2.5% of the distribution. In this way the total area of the critical region remains 0.05, and so the level of significance of our test remains 5%. In this example, the critical values are −1.96 and 1.96. Thus if our test statistic is less than −1.96 or greater than 1.96 we would reject the null hypothesis. Here the test statistic, −1.85, does not lie in the critical region, so we cannot reject the null hypothesis at the 5% level.


This is an example of a two-sided test, as opposed to the previous example, which was a one-sided test. The prior information we have in a specific situation dictates what we use as our alternative hypothesis, which in turn dictates the type of test that we use.

[Figure: the N(0, 1) density, with two-tailed critical regions of area 0.025 each, below −1.96 and above 1.96.]

Fundamentally, though, the distinction between one-tailed and two-tailed


tests is important only because we set arbitrary p-values such as 0.05 as hard
cutoffs. We should be cautious about significant results that depend for
their significance on the choice of a one-tailed test, where a two-tailed test
would have produced an insignificant result.

8.5 Hypothesis tests and confidence intervals

You may notice that we do a lot of the same things to carry out a statistical test that we do to compute a confidence interval. We compute a standard error and look up a value on a standard normal table. For instance, in Example 8.1 we might have expressed our uncertainty about the average Australian birthweight in the form of a confidence interval. The calculation would have been just slightly different:
We start with the mean observed birthweight in the Australian sample: 3276g. The standard error is σ/√44, where σ is the (unknown) SD of Australian birthweights. Since we don't know σ, we substitute the SD of the sample, which is s = 528g. So we use

SE = 528g/√44 = 80g.

Then a 95% confidence interval for the mean birthweight of Australian babies is 3276 ± 1.96 × 80g ≈ (3119, 3433)g; a 99% confidence interval would be 3276 ± 2.58 × 80g ≈ (3070, 3482)g. (Again, remember that we are making the not particularly realistic assumption that the observed birthweights are a random sample of all Australian birthweights.) This is consistent with the observation that the Australian birthweights would just barely pass a two-tailed test, at the 0.05 significance level, for having the same mean as the UK average 3426g.
In fact, it is almost true to say that the symmetric 95% confidence interval contains exactly the possible means µ0 such that the data would pass a test at the 0.05 significance level for having mean equal to µ0. What's the difference? It's in how we compute the standard error. In computing a confidence interval we estimate the parameters of the distribution from the data. When we perform a statistical test, we take the parameters (as far as possible) from the null hypothesis. In this case, that means that it makes sense to test based on the presumption that the standard deviation of weights is the SD of the UK births, which is the null hypothesis distribution. Here this makes only a tiny difference: between 538g (the UK SD) and 528g (the SD of the Australian sample).

Lecture 9

The χ² Test

9.1 Introduction: Test statistics that aren't Z

In lecture 8 we learned how to test hypotheses that a population has a certain mean, or that the probability of an event has a certain value. We base our test on the Z statistic, defined as

Z = (observed − expected) / standard error,

which we then compare to a standard normal distribution. This Z has three properties that are crucial for making a test statistic:
(i). We compute Z from the data;
(ii). Extreme values of Z correspond to what we intuitively think of as important failures of the null hypothesis. The larger |Z| is, the more extreme the failure of the null hypothesis;
(iii). Under the null hypothesis, we know the distribution of Z.
In this lecture and many of the following ones we will learn about other
statistical tests, for testing other sorts of scientific claims. The procedure
will be largely the same: Formulate the claim in terms of the truth or falsity
of a null hypothesis, find an appropriate test statistic (according to the
principles enumerated above), and then judge the null hypothesis according
to whether the p-value you compute is high (good for the null!) or low (bad
for the null, good for the alternative!).
The Z test was used for testing whether the mean of quantitative data
could have a certain value. In this lecture we consider categorical data.
These don't have a mean. We're usually interested in some claim about the distribution of the data among the categories. The most basic tool we have for testing whether the data we observe really could have come from a certain distribution is called the χ² test.

Example 9.1: Suicide and month of birth


A recent paper [SCB06] attempts to determine whether people
born in certain months have higher risk of suicide than people
born in other months. They gathered data for 26,886 suicides
in England and Wales between 1979 and 2001, by people born
between 1955 and 1966, for which birth dates were recorded.
Looking just at the women in the sample, the data are summarised in Table 9.1, where the second column gives the fraction of dates that are in that month (that is, 31 or 30 or 28.25 divided by 365.25).

  Month    Prob      Female     Male     Total
  Jan      0.0849       527     1774      2301
  Feb      0.0773       435     1639      2074
  Mar      0.0849       454     1939      2393
  Apr      0.0821       493     1777      2270
  May      0.0849       535     1969      2504
  Jun      0.0821       515     1739      2254
  Jul      0.0849       490     1872      2362
  Aug      0.0849       489     1833      2322
  Sep      0.0821       476     1624      2100
  Oct      0.0849       474     1661      2135
  Nov      0.0821       442     1568      2010
  Dec      0.0849       471     1690      2161
  Total    1.0000      5801    21085     26886
  Mean                483.4   1757.1    2240.5

Table 9.1: Suicides 1979-2001 by month of birth in England and Wales 1955-66, taken from [SCB06].

The number of suicides is not the same every month, as you would inevitably expect from chance variation. But is there a pattern? Are there some months whose newborns are more likely than others to take their own lives 20 to 40 years later? There seem to be more suicides among spring babies. We might formulate a hypothesis as follows:
H0: a suicide has equal chances (1/365.25) of having been born on any date of the year.
The alternative hypothesis is that people born on some dates are more likely to commit suicide. (We count 29 February as 0.25 days, since it is present in one year out of four.)
How might we test this? One way would be to split up the suicides into two categories: "Spring births" (March through June) and "Others". We then have 9421 suicides among the spring births, and 17465 among the others. Now, 122 days are in the spring-birth category, so our null hypothesis is
H0: p0 = Probability of a suicide being spring-birth = 122/365.25 = 0.334.
The expected number of spring-birth suicides (under the null hypothesis) is 26886 × p0 = 8980, and the standard error is

√(p0(1 − p0)n) = √(0.334 × 0.666 × 26886) = 77.3.

We then perform a Z test with

Z = (9421 − 8980)/77.3 = 5.71.
The Z statistic runs off the end of our table, indicating a p-value below 0.0001. (In fact, the p-value is below 10⁻⁸, or about 1 chance in 100 million.)
Is this right, though? Not really. We are guilty here of data snooping: We looked at the data in order to choose which way to break up the year into two categories. If we had split up the year differently, we might have obtained a very different result. (For example, if we had defined spring-birth to include only April through June, we would have obtained a Z statistic of only 4.65.) It would be helpful to have an alternative approach that could deal with multiple categories as they are, without needing to group them arbitrarily into two. □

9.2 Goodness-of-Fit Tests

We are considering the following sort of situation. We observe n realisations of a categorical random variable. There are k categories. We have a null hypothesis that tells us that these categories have probabilities p1, . . . , pk, so that the expected numbers of observations in the categories are np1, . . . , npk. The observed numbers in each category are n1, . . . , nk.
We want to have a test statistic that measures how far off the observed are, in total, from the expected. We won't fall into the trap of summing up ni − npi (which will always be 0). We might instead add up the squared differences (ni − npi)². But that seems like a problem, too: If we were expecting to have just 1 outcome in some category, and we actually got 11, that seems a lot more important than if we were expecting 1000 and actually got 1010, even though the difference is 10 in each case. So we want a difference to contribute more to the test statistic if the expected number is small.
By this reasoning we arrive at the χ² (chi-squared, pronounced "keye squared") statistic

X² := (n1 − np1)²/np1 + · · · + (nk − npk)²/npk = Σ (observed − expected)²/expected.    (9.1)

9.2.1 The χ² distribution

The statistic X² has properties (i) and (ii) for a good test statistic: we can compute it from the data, and bigger values of X² correspond to data that are farther away from what you would expect under the null hypothesis. But what about (iii)? We can't do a statistical test unless we know the distribution of X² under the null hypothesis. Fortunately, the distribution is known. The statistic X² has approximately (for large n) one of a family of distributions, called the χ² distribution. There is a positive integer, called the number of degrees of freedom (abbreviated d.f.), which tells us which χ² distribution we are talking about. One of the tricky things about using the χ² test statistic is figuring out the number of degrees of freedom. This depends on the number of categories, and how we picked the null hypothesis that we are testing. In general,

degrees of freedom = # categories − # parameters fit from the data − 1.

When is n large enough? The rule of thumb is that the expected number in every category must be at least about 5. So what do we do if some of the expected numbers are too small? Very simple: We group categories together until the problem disappears. We will see examples of this in sections 9.4.1 and 9.4.2.
The χ² distribution with d degrees of freedom is a continuous distribution¹ with mean d and variance 2d. In Figure 9.1 we show the density of the chi-squared distribution for some choices of the degrees of freedom. We note that these distributions are always right-skewed, but the skew decreases as d increases. For large d, the χ² distribution becomes close to the normal distribution with mean d and variance 2d.
As with the standard normal distribution, we rely on standard tables with precomputed values for the χ² distribution. We could simply have a separate table for each number of degrees of freedom, and use these exactly like the standard normal table for the Z test. This would take up quite a bit of space, though. (Potentially infinite, but for large numbers of degrees of freedom see section 9.2.2.) Alternatively, we could use a computer programme that computes p-values for arbitrary values of X² and d.f. (In the R programming language the function pchisq does this.) This is an ideal solution, except that you don't have computers to use on your exams.
Instead, we rely on a traditional compromise approach, taking advantage of the fact that the most common use of the tables is to find the critical value for hypothesis testing at one of a few levels, such as 0.05 and 0.01.
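For completeness, here is the computer alternative as a Python sketch of ours (scipy's chi2.sf plays the role of R's pchisq with lower.tail = FALSE, and chi2.ppf gives critical values):

    from scipy.stats import chi2

    print(chi2.sf(12.1, df=5))    # upper-tail p-value, about 0.033
    print(chi2.ppf(0.95, df=5))   # 0.05 critical value, 11.07
    print(chi2.ppf(0.99, df=5))   # 0.01 critical value, 15.09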

Example 9.2: Using the χ² table

We perform a χ² test to test a certain null hypothesis at the 0.05 significance level, and find an observed X² of 12.1 with 5 degrees of freedom. We look on the table and see in the row for 5 d.f. that the critical value is 11.07. Since our observed value is above this, we reject the null hypothesis. On the other hand, we would have retained the null hypothesis had we been testing at the 0.01 level, since the critical value at level 0.01 is 15.09. (The actual p-value is 0.0334.) We show, in Figure 9.2, the region corresponding to the 0.01 significance level in green, and the remainder of the 0.05 critical region in red. We have observed 12.1, so the logic tells us that either the data did not come from the null hypothesis, or we happen, purely by chance, to have made a pick that wound up in the tiny red region.

¹ It happens to be the same as the distribution of the sum of d independent random variables, each of which is the square of a standard normal. In particular, the χ² with 1 degree of freedom is the same as the square of a standard normal random variable.

Figure 9.1: The density of χ² distributions with various degrees of freedom. Panel (a): small d.f. (1 to 7); panel (b): moderate d.f. (5, 10, 20, 30).

  d.f.   P = 0.05   P = 0.01       d.f.   P = 0.05   P = 0.01
   1       3.84       6.63          17     27.59      33.41
   2       5.99       9.21          18     28.87      34.81
   3       7.81      11.34          19     30.14      36.19
   4       9.49      13.28          20     31.41      37.57
   5      11.07      15.09          21     32.67      38.93
   6      12.59      16.81          22     33.92      40.29
   7      14.07      18.48          23     35.17      41.64
   8      15.51      20.09          24     36.42      42.98
   9      16.92      21.67          25     37.65      44.31
  10      18.31      23.21          26     38.89      45.64
  11      19.68      24.72          27     40.11      46.96
  12      21.03      26.22          28     41.34      48.28
  13      22.36      27.69          29     42.56      49.59
  14      23.68      29.14          30     43.77      50.89
  15      25.00      30.58          40     55.76      63.69
  16      26.30      32.00          60     79.08      88.38

Table 9.2: A χ² table

Figure 9.2: χ² density with 5 degrees of freedom. The green region represents 1% of the total area. The red region represents a further 4% of the area, so that the tail above 11.07 is 5% in total.

9.2.2 Large d.f.

Table 9.2 only gives values for up to 60 degrees of freedom. What do we do when the problem gives us more than 60? Looking at Figure 9.1(b) you probably are not surprised to hear that the χ² distribution gets ever closer to a normal distribution when the number of degrees of freedom gets large. Which normal distribution? We already know the mean and variance of the χ². For large d, then,

χ²(d) is approximately the same as N(d, 2d).

Thus, the p-value for a given observed X² is found by taking

Z = (X² − d)/√(2d),

and looking it up on the standard normal table. Conversely, if we want to know the critical value for rejecting X² with d degrees of freedom, with d large, at significance level α, we start by finding the appropriate z for the level (if we were doing a Z test). The critical value for X² is then

√(2d) · z + d.

For example, if we were testing at the 0.01 level, with 60 d.f., we would first look for 0.995 on the standard normal table, finding that this corresponds to z = 2.58. (Remember that in a two-tailed test at the 0.01 level, the probability above z is 0.005.) We conclude that the critical value for χ² with 60 d.f. at the 0.01 level is about

2.58 × √120 + 60 = 88.26.

The exact value, given on the table, is 88.38. For larger values of d.f. we simply rely on the approximation.
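A quick check of this approximation in Python (our own sketch):

    import numpy as np
    from scipy.stats import chi2, norm

    d = 60
    z = norm.ppf(0.995)              # 2.58, for the 0.01 level
    print(np.sqrt(2 * d) * z + d)    # normal approximation: about 88.2
    print(chi2.ppf(0.99, df=d))      # exact critical value: about 88.4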

9.3 Fixed distributions

We start with a simple example. Suppose we have a six-sided die, and we are wondering whether it is fair, that is, whether each side is equally likely to come up. We roll it 60 times, and tabulate the number of times each side comes up. The results are given in Table 9.3.
Table 9.3: Outcome of 60 die rolls

  Side                  1    2    3    4    5    6
  Observed Frequency   16   15    4    6   14    5
  Expected Frequency   10   10   10   10   10   10

It certainly appears that sides 1, 2, and 5 come up more often than they should, and sides 3, 4, and 6 less frequently. On the other hand, some deviation is expected, due to chance. Are the deviations we see here too extreme to be attributed to chance?
Suppose we wish to test the null hypothesis
H0 : Each side comes up with probability 1/6
at the 0.01 significance level. In addition to the Observed Frequency of
each side, we have also indicated, in the last row of Table 9.3, the Expected
Frequency, which is the probability (according to H0 ) of the side coming
up, multiplied by the number of trials, which is 60. Since each side has
probability 1/6 (under the null hypothesis), the expected frequencies are
all 10. (Note that the observed and expected frequencies both add up to
exactly the number of trials.)
We now plug these numbers into our formula for the chi-squared statistic:

X² = Σ (observed − expected)²/expected
   = (16 − 10)²/10 + (15 − 10)²/10 + (4 − 10)²/10 + (6 − 10)²/10 + (14 − 10)²/10 + (5 − 10)²/10
   = 3.6 + 2.5 + 3.6 + 1.6 + 1.6 + 2.5
   = 15.4.
Now we need to decide whether this number is a big one, by comparing it to the appropriate χ² distribution. Which one is the appropriate one? When testing observations against a single fixed distribution, the number of degrees of freedom is always one fewer than the number of categories.² There are six categories, so five degrees of freedom. Looking in the row for 5 d.f. on Table 9.2, we see that the critical value at the 0.01 level is 15.09. Since 15.4 is (just barely) above this, we reject the null hypothesis at the 0.01 significance level.

² Why minus one? You can think of degrees of freedom as saying, how many different numbers did we really observe? We observed six numbers (the frequency counts for the six sides), but they had to add up to 60, so any five of them determine the sixth one. So we really only observed five numbers.
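The same computation as a short Python sketch (ours, not part of the original notes); scipy's chisquare function carries out exactly the arithmetic of equation (9.1) and looks up the p-value with k − 1 degrees of freedom:

    import numpy as np
    from scipy.stats import chisquare

    observed = np.array([16, 15, 4, 6, 14, 5])
    expected = np.full(6, 10.0)                # 60 rolls, probability 1/6 each
    stat, p = chisquare(observed, expected)    # 5 d.f. by default
    print(stat, p)                             # X^2 = 15.4, p about 0.009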

Example 9.3: Suicides, continued

We complete Example 9.1. We want to test whether the observed data in Table 9.1 could plausibly have come from the null hypothesis, with the probability for each month given in column 2. In Table 9.4 we add columns for expected numbers of suicides. For example, if we look in the column Female/Exp, and the row January, we find the number 493. This is obtained from multiplying 5801 (the total number of women in the study) by 0.0849 (the probability of a woman having been born in January, under the null hypothesis). (The numbers in the expected columns don't add up to exactly the same as those in the corresponding observed column, because of rounding.)
                      Female          Male           Combined
  Month     Prob.     Obs    Exp      Obs     Exp    Obs     Exp
  Jan       0.0849    527    493     1774    1790    2301    2283
  Feb       0.0773    435    448     1639    1630    2074    2078
  Mar       0.0849    454    493     1939    1790    2393    2283
  Apr       0.0821    493    476     1777    1731    2270    2207
  May       0.0849    535    493     1969    1790    2504    2283
  Jun       0.0821    515    476     1739    1731    2254    2207
  Jul       0.0849    490    493     1872    1790    2362    2283
  Aug       0.0849    489    493     1833    1790    2322    2283
  Sep       0.0821    476    476     1624    1731    2100    2207
  Oct       0.0849    474    493     1661    1790    2135    2283
  Nov       0.0821    442    476     1568    1731    2010    2207
  Dec       0.0849    471    493     1690    1790    2161    2283
  Total     1.0000   5801   5803    21085   21084   26886   26887

Table 9.4: Suicides 1979-2001 by month of birth in England and Wales 1955-66, taken from [SCB06].
We now test the null hypothesis for the combined sample of men and women, at the 0.01 significance level. We plug these columns into equation (9.1), obtaining

X² = (2301 − 2283)²/2283 + (2074 − 2078)²/2078 + (2393 − 2283)²/2283 + · · · + (2161 − 2283)²/2283
   = 71.9.

Since there are 12 categories, the number of degrees of freedom is 12 − 1 = 11. We look on the χ² table in row 11 d.f., column P = 0.01, and find the critical value is 24.72. Since the observed value is higher, we reject the null hypothesis, and conclude that the difference between the observed distribution of suicides' birth months and what would be expected purely by chance is highly significant. (The actual p-value is less than 10⁻¹⁰, so there is less than one chance in ten billion that we would observe such a deviation purely by chance, if the null hypothesis were true.)
If we considered only the female data, on the other hand, the χ² value that we would compute from those two columns is 17.4, which is below the critical value. Considered on their own, the female data do not provide strong evidence that suicides' birth months differ in distribution from what would be expected by chance. □

Example 9.4: Distribution of births

Did we use the right null hypothesis in Example 9.1? We assumed that all birthdates should be equally likely.
At www.statistics.gov.uk/downloads/theme_population/FM1_32/FM1no32.pdf we have official UK government statistics for births in England and Wales. Table 2.4 there gives the data for birth months, and we repeat them in the column "Observed" in Table 9.5. The "Expected" column tells us how many births would have been expected in that month under the null hypothesis that all dates are equally likely. We compute X² = Σ (Obs_i − Exp_i)²/Exp_i = 477. The critical value is still the same as before, 24.72, so this is far bigger. (In fact, the p-value is far below 10⁻¹⁶.) So births are definitely not evenly distributed through the year.
  Month    Prob.     Observed    Expected
  Jan      0.0849      29415       30936
  Feb      0.0767      26890       27942
  Mar      0.0849      30122       30936
  Apr      0.0822      30284       29938
  May      0.0849      31516       30936
  Jun      0.0822      30571       29938
  Jul      0.0849      32678       30936
  Aug      0.0849      31008       30936
  Sep      0.0822      31557       29938
  Oct      0.0849      31659       30936
  Nov      0.0822      29358       29938
  Dec      0.0849      29186       30936
  Total    1.0000     364244      364246

Table 9.5: Observed frequency of birth months, England and Wales, 1993.

We might then decide to try testing a new null hypothesis:
H0: Suicides have the same distribution of birth month as the rest of the population.
In Table 9.6 we show the corresponding calculations. In the column "Prob." we give the observed empirical fraction of births for that month, as tabulated in Table 9.5. Thus, the probability for January is 0.0808, which is 29415/364244. The "Observed" column copies the final observed column from Table 9.4, and the "Expected" column is obtained by multiplying the "Prob." column by 26886, the total number of suicides in the sample. Using these two columns, we compute X² = 91.8, which is even larger than the value computed before. Thus, when we change to this more appropriate null hypothesis, the evidence that something interesting is going on becomes even stronger. □

  Month    Prob.     Observed    Expected
  Jan      0.0808       2301        2171
  Feb      0.0738       2074        1985
  Mar      0.0827       2393        2223
  Apr      0.0831       2270        2235
  May      0.0865       2504        2326
  Jun      0.0839       2254        2257
  Jul      0.0897       2362        2412
  Aug      0.0851       2322        2289
  Sep      0.0866       2100        2329
  Oct      0.0869       2135        2337
  Nov      0.0806       2010        2167
  Dec      0.0801       2161        2154
  Total    1.0000      26886       26885

Table 9.6: Birth months of suicides compared with observed frequencies of birth months in England and Wales, from Table 9.5.

9.4 Families of distributions

In many situations, it is not that we want to know whether the data came from a single distribution, but whether it may have come from any one of a whole family of distributions. For example, in Lecture 5 we considered several examples of data that might be Poisson distributed, and for which the failure of the Poisson hypothesis would have serious scientific significance. In such a situation, we modify our χ² hypothesis test procedure slightly:
(1). Estimate the parameters in the distribution. Now we have a particular distribution to represent the null hypothesis.
(2). Compute the expected occupancy in each cell as before.
(3). Using these expected numbers and the observed numbers (the data), compute the χ² statistic.
(4). Compare the computed statistic to the critical value. Important: The degrees of freedom are reduced by one for each parameter that has been estimated.
Thus, if we estimated a single parameter (e.g., Poisson distribution) we look in the row for (# categories − 2) d.f. If we estimated two parameters (e.g., normal distribution) we look in the row for (# categories − 3) d.f.

9.4.1 The Poisson Distribution

Consider Example 5.7. We argued there that the distribution of Guillain-Barré Syndrome (GBS) cases by weeks should have been Poisson distributed if the flu vaccine was not responsible for the disease. The data are given in Table 5.3, showing the number of weeks in which different numbers of cases were observed.
If some Poisson distribution were correct, what would be the parameter? We estimated that λ, which is the expected number of cases per week, should be estimated by the observed average number of cases per week, which is 40/30 = 1.33. We computed the probabilities for different numbers of cases, assuming a Poisson distribution with parameter 1.33, and multiplied these probabilities by 40 to obtain expected numbers of weeks. We compared the observed to these expected numbers, and expressed the impression that these distributions were different. But the number of observations is small. Could it be that the difference is purely due to chance?
We want to test the null hypothesis H0: The data came from a Poisson distribution with a χ² test. We can't use the numbers from Table 5.3 directly, though. As discussed in section 9.2.1, the approximation we use for the χ² distribution depends on the categories all being large enough, the rule of thumb being that the expected numbers under the null hypothesis should be at least 5. The last three categories are all too small. So we group the last four categories together, to obtain the new Table 9.7. The last category includes everything 3 and higher.
Table 9.7: Cases of GBS, by weeks after vaccination

  # cases per week         0       1       2      3+
  observed frequency      16       7       3       4
  probability           0.264   0.352   0.234   0.150
  expected frequency     10.6    14.1     9.4     6.0

We now compute

X² = (16 − 10.6)²/10.6 + (7 − 14.1)²/14.1 + (3 − 9.4)²/9.4 + (4 − 6.0)²/6.0 = 11.35.

Suppose we want to test the null hypothesis at the 0.01 significance level. In order to decide on a critical value, we need to know the correct number of degrees of freedom. The reduced Table 9.7 has only 4 categories. There would thus be 3 d.f., were it not for our having estimated a parameter to decide on the distribution. This reduces the d.f. by one, leaving us with 2 degrees of freedom. Looking in the appropriate row, we see that the critical value is 9.21, so we do reject the null hypothesis (the true p-value is 0.0034), and conclude that the data did not come from a Poisson distribution.
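The same test as a Python sketch (our own illustration; the scaling of the expected counts follows Table 9.7). Note the extra degree of freedom lost to the estimated parameter:

    import numpy as np
    from scipy.stats import poisson, chi2

    observed = np.array([16, 7, 3, 4])         # weeks with 0, 1, 2, 3+ cases
    lam = 1.33                                 # estimated cases per week
    probs = np.append(poisson.pmf([0, 1, 2], lam),
                      poisson.sf(2, lam))      # P(3 or more) closes the list
    expected = 40 * probs                      # scaling as in Table 9.7
    X2 = ((observed - expected) ** 2 / expected).sum()
    df = 4 - 1 - 1                             # categories - parameters - 1
    print(X2, chi2.sf(X2, df))                 # about 11.3, p about 0.0035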

9.4.2 The Binomial Distribution

In 1889, A. Geissler published data on births recorded in Saxony over a 10-year period, tabulating the numbers of boys and girls. (The complete data set, together with analysis and interpretation, may be found in [Edw58].) Table 9.8 shows the number of girls in the 6115 families with 12 children. If the genders of successive children are independent, and the probabilities remain constant over time, the number of girls born to a particular family of 12 children should be a binomial random variable with 12 trials and an unknown probability p of success.
Table 9.8: Numbers of girls in families with 12 children, from Geissler.

  # Girls          0       1       2       3       4       5       6
  Frequency        7      45     181     478     829    1112    1343
  Expected       2.3    26.1   132.8   410.0   854.2  1265.6  1367.3
  Probability 0.0004  0.0043  0.0217  0.0670  0.1397  0.2070  0.2236

  # Girls          7       8       9      10      11      12
  Frequency     1033     670     286     104      24       3
  Expected    1085.2   628.1   258.5    71.8    12.1     0.9
  Probability 0.1775  0.1027  0.0423  0.0117  0.0020  0.0002
We can use a chi-squared test to test the hypothesis that the data follow a binomial distribution:
H0: The data follow a Binomial distribution
H1: The data do not follow a Binomial distribution
At this point we also decide upon a 0.05 significance level.
From the data we know that n = 6115, and we can estimate p as

p̂ = (7(0) + 45(1) + · · · + 3(12)) / (12 × 6115) = 0.4808.

Thus we can fit a Bin(12, 0.4808) distribution to the data to obtain the expected frequencies (E) alongside the observed frequencies (O). The probabilities are shown in Table 9.8, and the expectations are found by multiplying the probabilities by 6115. The first and last categories have expectations smaller than 5, so we absorb them into the next categories, yielding Table 9.9.
Table 9.9: Modified version of Table 9.8, with small categories grouped together.

  # Girls        0,1       2       3       4       5       6
  Frequency       52     181     478     829    1112    1343
  Expected      28.4   132.8   410.0   854.2  1265.6  1367.3
  Probability 0.0047  0.0217  0.0670  0.1397  0.2070  0.2236

  # Girls          7       8       9      10   11,12
  Frequency     1033     670     286     104      27
  Expected    1085.2   628.1   258.5    71.8    13.0
  Probability 0.1775  0.1027  0.0423  0.0117  0.0022

The test statistic can then be calculated as

X² = (52 − 28.4)²/28.4 + · · · + (27 − 13.0)²/13.0 = 105.95.

The degrees of freedom are given by

d.f. = (k − 1) − p = (11 − 1) − 1 = 9,

where k = 11 is the number of categories and p = 1 is the number of parameters estimated. Thus, the critical region for the test is X² > 16.92.
The test statistic lies well within the critical region, so we conclude that there is significant evidence against the null hypothesis at the 5% level. We conclude that the sex of newborns in families with 12 children is NOT binomially distributed.
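The parameter estimate and fitted expectations can be checked with a short Python sketch (ours, not from the original notes):

    import numpy as np
    from scipy.stats import binom

    girls = np.arange(13)
    freq = np.array([7, 45, 181, 478, 829, 1112, 1343,
                     1033, 670, 286, 104, 24, 3])
    n_fam = freq.sum()                            # 6115 families
    p_hat = (girls * freq).sum() / (12 * n_fam)   # about 0.4808
    expected = n_fam * binom.pmf(girls, 12, p_hat)
    print(p_hat, expected.round(1))               # compare with Table 9.8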


Explaining the Geissler data

Non-examinable. Just if you're interested.
So what is going on? A good discussion may be found in [LA98]. A brief summary is that two things happened:
(1). Large numbers of children tended to appear in families which started out unbalanced, particularly if they had all girls. That is, a family with three girls would be more likely to have another child than a family with two boys and a girl.
(2). The value of p really doesn't seem to be the same between families. Some families have a tendency to produce more boys, others more girls.
Note that (1) is consistent with our original hypothesis, that babies all have the same probability p of being female: We have just pushed the variability from small families to large ones. Think of it this way: Suppose there were a rule that said: "Stop when you have 3 children, unless the children are all boys or all girls. Otherwise, keep trying to get a balanced family." Then the small families would be more balanced than you would have expected, and the big families more unbalanced; for instance, half of the four-child families would have all boys or all girls. Of course, it's more complicated than that: Different parents have different ideas about the ideal family. But this effect does seem to explain some of the deviation from the binomial distribution.
The statistical analysis in [LA98] tries to pull these effects apart (and also to take into account the small effect of identical twins), finding that there is an SD of about 0.16 in the value of p, the probability of a girl, and furthermore that there is some evidence that some parents produce nothing but girls, or at most have a very small probability of producing boys.
The Normal Distribution

The following table gives the heights in cm of 100 students. In such a situation we might be interested in testing whether the data follow a Normal distribution or not.

  Height (cm)   155-160   161-166   167-172   173-178   179-184   185-190
  Frequency           5        17        38        25         9         6

We can use a chi-squared test to test the hypothesis that the data follow a Normal distribution.

H0: The data follow a Normal distribution
H1: The data do not follow a Normal distribution
At this point we also decide upon a 0.05 significance level.
From the data we can estimate the mean and standard deviation using the sample mean and sample standard deviation:

x̄ = 172,    s = 7.15.

To fit a Normal distribution with this mean and variance we need to calculate the probability of each interval. This is done in five straightforward steps:
(i). Calculate the upper end point of each interval (u)
(ii). Standardize the upper end points (z)
(iii). Calculate the probability P(Z < z)
(iv). Calculate the probability of each interval
(v). Calculate the expected cell counts

  Height (cm)        155-160   161-166   167-172   173-178   179-184   185-190
  Endpoint (u)         160.5     166.5     172.5     178.5     184.5        ∞
  Standardized (z)     -1.61     -0.77      0.07      0.91      1.75        ∞
  P(Z < z)             0.054     0.221     0.528     0.818     0.960     1.00
  P(a < Z < b)         0.054     0.167     0.307     0.290     0.142    0.040
  Expected               5.4      16.7      30.7      29.0      14.2      4.0
  Observed                 5        17        38        25         9        6

From this table we see that there is one cell with an expected count less than 5, so we group it together with the nearest cell. (A single cell with expected count 4.0 is on the borderline; we could just leave it. We certainly don't want any cell with expected count less than about 2, nor more than one expected count under 5.)

  Height (cm)   155-160   161-166   167-172   173-178   179-190
  Expected          5.4      16.7      30.7      29.0      18.2
  Observed            5        17        38        25        15

We can then calculate the test statistic as

X² = (5 − 5.4)²/5.4 + · · · + (15 − 18.2)²/18.2 = 2.88.

The degrees of freedom are given by

d.f. = (# categories − 1) − # parameters estimated = (5 − 1) − 2 = 2.

Thus, the critical region for the test is X² > 5.99.
The test statistic lies outside the critical region, so we conclude that the evidence against the null hypothesis is not significant at the 0.05 level.

9.5 Chi-squared Tests of Association

This section develops a chi-squared test that is very similar to the one of the preceding section, but aimed at answering a slightly different question. To illustrate the test we use an example, which we borrow from [FPP98, section 28.2].
Table 9.10 gives data on Americans aged 25-34 from the NHANES survey, a random sample of Americans. Among other questions, individuals were asked their age, sex, and handedness.
Table 9.10: NHANES handedness data for Americans aged 25-34

                   Men    Women
  right-handed     934     1070
  left-handed      113       92
  ambidextrous      20        8

Looking at the data, it looks as though the women are more likely to be right-handed. Someone might come along and say: "The left cerebral hemisphere controls the right side of the body, as well as rational thought. This proves that women are more rational than men." Someone else might say, "This shows that women are under more pressure to conform to society's expectations of normality." But before we consider this observation as evidence of anything important, we have to pose the question: Does this reflect a difference in the underlying population, or could it merely be a random effect of sampling?
The research hypothesis in this situation is that a person's handedness is associated with their sex. In this situation, the null hypothesis would
be that there is no association between the two variables. In other words,
the null hypothesis is that the two variables are independent.

H0 : The two variables are independent.
H1 : The two variables are associated.

Or, to put it differently, each person gets placed in one of the six cells of the
table. The null hypothesis says that which row you're in is independent of
the column. This is a lot like the problems we had in section 9.4: here the
family of distributions we're interested in is all the distributions in which
the rows are independent of the columns. The procedure is essentially the
same:
(1). Estimate the parameters in the null distribution. This means we estimate the probability of being in each row and each column.
(2). Compute the expected occupancy in each cell: This is the number of
observations times the row probability times the column probability.
(3). Using these expected numbers and the observed numbers (the data),
compute the χ² statistic.
(4). Compare the computed statistic to the critical value. The number of
degrees of freedom is (r − 1)(c − 1), where r is the number of rows
and c the number of columns. (Why? The number of cells is rc. We
estimated r − 1 parameters to determine the row probabilities and c − 1
parameters to determine the column probabilities. So we have r + c − 2
parameters in all. By our standard formula,

    d.f. = rc − 1 − (r + c − 2) = (r − 1)(c − 1).)

We wish to test the null hypothesis at the 0.01 significance level. We extend the table to show the row and column fractions in Table 9.11. Thus, we
see that 89.6% of the sample were right-handed, and 47.7% were male. The
fraction that were right-handed males would be 0.896 × 0.477 = 0.427 under
the null hypothesis. Multiplying this by 2237, the total number of observations, we obtain 956, the expected number of right-handed men under the
null hypothesis. We repeat this computation for all six categories, obtaining
the results in Table 9.12. (Note that the row totals and the column totals
are identical to the original data.) We now compute the χ² statistic, taking
the six observed counts from Table 9.10, and the six expected counts
from Table 9.12:

                 Men     Women   Total   Fraction
right-handed     934      1070    2004      0.896
left-handed      113        92     205      0.092
ambidextrous      20         8      28      0.013
Total           1067      1170    2237
Fraction       0.477     0.523

Table 9.11: NHANES handedness data for Americans aged 25-34.

                 Men   Women
right-handed     956    1048
left-handed       98     107
ambidextrous      13      15

Table 9.12: Expected counts for NHANES handedness data, computed from
fractions in Table 9.11.

    X² = (934 − 956)²/956 + (1070 − 1048)²/1048 + ... + (8 − 15)²/15 = 12.4.

The degrees of freedom are (3 − 1)(2 − 1) = 2, so we see on Table 9.2 that the
critical value is 9.21. Since the observed value is higher, we reject the null
hypothesis, and conclude that the difference in handedness between men
and women is not purely due to chance.
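
The same test is available in scipy as a one-line call; the sketch below (our own addition, not from the notes) uses the counts of Table 9.10. Note that scipy works with unrounded fractions, so it gets a slightly smaller statistic than the hand computation above.

    # A sketch of the association test with scipy's built-in function.
    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[934, 1070],    # right-handed
                      [113,   92],    # left-handed
                      [ 20,    8]])   # ambidextrous

    X2, pvalue, df, expected = chi2_contingency(table)
    print(X2, df, pvalue)  # about 11.8 with 2 d.f.; the notes get 12.4 from rounded fractions
    print(expected)        # close to Table 9.12
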
Remember, this does not say anything about the reason for the difference! It may be something interesting (e.g., women are more rational) or it
may be something dull (e.g., men were more likely to think the interviewer
would be impressed if they said they were left-handed). All the hypothesis
test tells us is that we would most likely have found a difference in handedness between men and women if we had surveyed the whole population.

Lecture 10

The T distribution and Introduction to Sampling

10.1  Using the T distribution

You may have noticed a hole in the reasoning we used about the husbands'
heights in section 7.1. Our computations depended on the fact that

    Z := (X̄ − μ) / (σ/√n)

has standard normal distribution, where μ is the population mean and σ is
the population standard deviation. But we don't know σ! We did a little
sleight of hand, and substituted the sample standard deviation

    S := √[ (1/(n − 1)) Σ_{i=1}^n (Xᵢ − X̄)² ]

for the (unknown) value of σ. But S is only an estimate for σ: it's a random
variable that might be too high, and might be too low. So, what we were
calling Z is not really Z, but a quantity that we should give another name
to:

    T := (X̄ − μ) / (S/√n).    (10.1)

If S is too big, then Z > T, and if S is too small then Z < T. On average,
you might suppose, Z and T would be about the same, and in this you
would be right. Does the distinction matter then?

Since T has an extra source of error in the denominator, you would expect it to be more widely scattered than Z. That means that if you compute
T from the data, but look it up on a table computed from the distribution
of Z (the standard normal distribution), you would underestimate the
probability of a large value. The probability of rejecting a true null hypothesis (Type I error) will be larger than you thought it was, and the confidence
intervals that you compute will be too narrow. This is very bad! If we
make an error, we always want it to be on the side of underestimating our
confidence.

Fortunately, we can compute the distribution of T (sometimes called
Student's t, after the pseudonym under which the statistician William Gosset
published his first paper on the subject, in 1908). While the mathematics
behind this is beyond the scope of this course, the results can be found in
tables. These are a bit more complicated than the normal tables, because
there is an extra parameter: not surprisingly, the distribution depends on
the number of samples. When the estimate is based on very few samples
(so that the estimate of SD is particularly uncertain) we have a distribution which is far more spread out than the normal. When the number of
samples is very large, the estimate s varies hardly at all from σ, and the
corresponding t distribution is very close to normal. As with the χ² distribution, this parameter is called degrees of freedom. For the T statistic,
the number of degrees of freedom is just n − 1, where n is the number of
samples being averaged. Figure 10.1 shows the density of the t distribution
for different degrees of freedom, together with that of the normal. Note that
the t distribution is symmetric around 0, just like the normal distribution.

Table 10.1 gives the critical values for a level 0.05 hypothesis test when
Z is replaced by t with different numbers of degrees of freedom. In other
words, if we define t_α(d) to be the number such that P(T < t_α) = α when
T has the Student distribution with d degrees of freedom, Table 10.1(a)
gives values of t_0.95, and Table 10.1(b) gives values of t_0.975. Note that the
values of t_α(d) decrease as d increases, approaching a limit, which is
z_α = t_α(∞).

10.1.1  Using t for confidence intervals: Single sample

Suppose we have observations x₁, . . . , xₙ from a normal distribution, where
the mean and the SD are both unknown. To compute confidence intervals


[Figure 10.1: The standard normal density together with densities for the t
distribution with 1, 2, 4, 10, and 25 degrees of freedom.]

Table 10.1: Cutoffs for hypothesis tests at the 0.95 level, using the t statistic
with different degrees of freedom. The ∞ level is the limit for a very large
number of degrees of freedom, which is identical to the distribution of the Z
statistic.

degrees of    (a) one-tailed test    (b) two-tailed test
freedom       critical value         critical value
 1                 6.31                  12.7
 2                 2.92                  4.30
 4                 2.13                  2.78
10                 1.81                  2.23
50                 1.68                  2.01
 ∞                 1.64                  1.96


with the t statistic, we follow the same procedures as in section 7.1, substituting s for σ, and the quantiles of the t distribution for the quantiles of the
normal distribution: that is, where we looked up a number z on the normal
table, such that P(Z < z) was a certain probability, we substitute a number
t such that P(T < t) is that same probability, where T has the Student T
distribution with the right number of degrees of freedom. Thus, if we want
a 95% confidence interval, we take

    X̄ ± t s/√n,

where t is found in the column marked P = 0.05 on the T-distribution
table, 0.05 being the probability beyond ±t that we are excluding. It corresponds, of course, to P(T < t) = 0.975.

Example 10.1: Heights of British men


In section 7.1 we computed a confidence interval for the heights
of married British men, based on a sample of size 198. Since
we were using the sample SD to estimate the population SD, we
should have used the t quantiles with 197 degrees of freedom,
rather than the Z quantiles. If you look on a table of the t
distribution you won't find a row corresponding to 197 degrees
of freedom, though. Why not? The t distribution with 197
degrees of freedom is almost indistinguishable from the normal
distribution. To give an example, the multiplier for a symmetric
90% normal confidence interval is z = 1.645; the corresponding
t quantile is t(197) = 1.653, so the difference is less than 1%.
There is no real application where you are likely to be able to
notice an error of that magnitude. □

Example 10.2: Kidney dialysis

A researcher measured the level of phosphate in the blood
of dialysis patients on six consecutive clinical visits. (This example is adapted from [MM98, p. 529], where it was based on a
Master's thesis of Joan M. Susic at Purdue University.) It is important to maintain the levels of various nutrients in appropriate
bounds during dialysis treatment. The values are known to vary
approximately according to a normal distribution. For one patient, the measured values (in mg/dl) were 5.6, 5.1, 4.6, 4.8,
5.7, 6.4. What is a symmetric 99% confidence interval for the
patient's true phosphate level?

We compute

    X̄ = (1/6)(5.6 + 5.1 + 4.6 + 4.8 + 5.7 + 6.4) = 5.4 mg/dl

    s = √[ (1/5)( (5.6 − 5.4)² + (5.1 − 5.4)² + (4.6 − 5.4)²
                + (4.8 − 5.4)² + (5.7 − 5.4)² + (6.4 − 5.4)² ) ]
      = 0.67 mg/dl.
The number of degrees of freedom is 5. Thus, the symmetric
confidence interval will be

    ( 5.4 − t·0.67/√6 , 5.4 + t·0.67/√6 ) mg/dl,

where t is chosen so that the T variable with 5 degrees of freedom
has probability 0.01 of being bigger than t in absolute value. □

10.1.2  Using the T table

T tables are like the χ² table. For Z, the table in your official booklet allows
you to choose your value of Z, and gives you the probability of finding Z
below this value. Thus, if you were interested in finding z_α, you would look
to find α inside the table, and then check which index Z corresponds to it. In
principle, we could have a similar series of T tables, one for each number of
degrees of freedom. To save space, though, and because people are usually
not interested in the entire t distribution, but only in certain cutoffs, the
T tables give much more restricted information. The rows of the T table
represent degrees of freedom, and the columns represent cutoff probabilities.
The values in the table are then the values of t that give the cutoffs at those
probabilities. One peculiarity of these tables is that, whereas the Z table
gives one-sided probabilities, the t table gives two-sided probabilities. This
makes things a bit easier when you are computing symmetric confidence
intervals, which is all that we will do here.
The probability we are looking for is 0.01, which is the last column of
the table, so looking in the row for 5 d.f. (see Figure 10.2) we see that


Figure 10.2: Excerpt from the official t table, p. 21.


the appropriate value of t is 4.03. Thus, we can be 99% confident that the
patient's true average phosphate level is between 4.3 mg/dl and 6.5 mg/dl.
Note that if we had known the SD for the measurements to be 0.67, instead
of having estimated it from the observations, we would have used z = 2.6
(corresponding to a one-sided probability of 0.995) in place of t = 4.03,
yielding a much narrower confidence interval.

Summary

If you want to compute a 100(1 − α)% confidence interval for the population
mean of a normally distributed population based on n samples you do the
following:

(1). Compute the sample mean x̄.

(2). Compute the sample SD s.

(3). Look on the table to find the number t in the row corresponding to
n − 1 degrees of freedom and the column corresponding to α.

(4). The confidence interval is from x̄ − st/√n to x̄ + st/√n. In other
words, we are 100(1 − α)% confident that μ is in this range.
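
As a concrete illustration of this recipe, here is a minimal Python sketch (our own addition, not part of the notes), applied to the dialysis data of Example 10.2 with α = 0.01; numpy and scipy are assumed.

    # A minimal sketch of the four steps above.
    import numpy as np
    from scipy.stats import t

    x = np.array([5.6, 5.1, 4.6, 4.8, 5.7, 6.4])   # mg/dl
    alpha = 0.01
    xbar = x.mean()                                 # step (1)
    s = x.std(ddof=1)                               # step (2): divides by n - 1
    tcrit = t.ppf(1 - alpha / 2, df=len(x) - 1)     # step (3): about 4.03

    half = tcrit * s / np.sqrt(len(x))              # step (4)
    print(xbar - half, xbar + half)                 # roughly (4.3, 6.5)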

10.1.3  Using t for Hypothesis tests

We continue Example 10.2. Suppose 4.0 mg/dl is a dangerous level of phosphate, and we want to be 99% sure that the patient is, on average, above
that level. Of course, all of our measurements are above that level, but they
are also quite variable. It could be that all six of our measurements were
exceptionally high. How do we make a statistically precise test?
Let H0 be the null hypothesis, that the patient's phosphate level is actually μ₀ = 4.0 mg/dl. The alternative hypothesis is that it is a different
value, so this is a two-sided test. Suppose we want to test, at the 0.01 level,
whether the null hypothesis could be consistent with the observations. We
compute

    T = (x̄ − μ₀) / (s/√n) = 5.12.
This statistic T has the t distribution with 5 degrees of freedom. The critical
value is the value t such that the probability of |T| being bigger than t is
0.01. This is the same value that we looked up in Example 10.2, which is
4.03. Since our T value is 5.12, we reject the null hypothesis. That is, T is
much too big: the probability of such a high value is smaller than 0.01. (In
fact, it is about 0.002.) Our conclusion is that the true value of μ is not 4.0.

In fact, though, we're likely to be concerned, not with a particular value
of μ, but just with whether μ is too big or too small. Suppose we are
concerned to be sure that the average phosphate level is really at least μ₀ =
4.0 mg/dl. In this case, we are performing a one-sided test, and we will reject
T values that are too large (meaning that x̄ is too large to have plausibly
resulted from sampling a distribution with mean μ₀). The computation of
T proceeds as before, but now we have a different cutoff, corresponding to
a probability twice as big as the level of the test, so 0.02. (This is because
of the peculiar way the table is set up. We're now only interested in the
probability in the upper tail of the t distribution, which is 0.01, but the table
is indexed according to the total probability in both tails.) This is t = 3.36,
meaning that we would have been more likely to reject the null hypothesis.
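
The two-sided test is also available directly in scipy; the sketch below (our own addition) works from the raw data, so it uses the unrounded mean and gets a T near 5.0 rather than the 5.12 obtained above from the rounded x̄ = 5.4.

    # A sketch of the two-sided test with scipy's one-sample t test.
    import numpy as np
    from scipy.stats import ttest_1samp

    x = np.array([5.6, 5.1, 4.6, 4.8, 5.7, 6.4])
    T, p = ttest_1samp(x, popmean=4.0)   # two-sided p-value by default
    print(T, p)                          # p about 0.004, below 0.01: reject H0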

10.1.4  When do you use the Z or the T statistics?

When testing or making a confidence interval for the population mean:

• If you know the population variance, use Z.

• If you estimate the population variance from the sample, use T.

• Exception: Use Z when estimating a proportion.

• Another exception: If the number of samples is large there is no difference between Z and t. You may as well use Z, which is conceptually
a bit simpler. For most purposes, n = 50 is large enough to do only Z
tests.

10.1.5  Why do we divide by n − 1 in computing the sample SD?

This section is not examinable. The population variance is defined to be the
average of the squared deviations from the mean (and the SD is defined to
be the square root of that):

    σ_x² = Var(x) = (1/n) Σ_{i=1}^n (xᵢ − x̄)²

Why is it, then, that we estimate variance and SD by using a sample variance
and sample SD in which n in the denominator is replaced by n − 1?

    s_x² = (1/(n − 1)) Σ_{i=1}^n (xᵢ − x̄)².

The answer is that, if x₁, . . . , xₙ are random samples from a distribution
with variance σ², then s_x² is a better estimate for σ² than is σ_x². Better in
what sense? The technical word is unbiased, which simply means that
over many trials it will turn out to be correct on average. In other words, σ_x² is, on
average, a bit too small, by exactly a factor of (n − 1)/n. It makes sense to
expect it to be too small, since you would expect

    (1/n) Σ_{i=1}^n (xᵢ − μ)²

to be just right, on average, if only we knew what μ was. Replacing μ by
the estimate x̄ will make it smaller. (In fact, for any numbers x₁, . . . , xₙ,
the number a that makes Σ(xᵢ − a)² as small as possible is a = x̄. Can you
see why?)

As an example, consider the case n = 2, and let X₁, X₂ be two random
choices from the distribution. Then X̄ = (X₁ + X₂)/2, and

    (1/2)[ (X₁ − X̄)² + (X₂ − X̄)² ] = ( (X₁ − X₂)/2 )²
        = (1/4)( (X₁ − μ) + (μ − X₂) )²
        = (1/4)[ (X₁ − μ)² + (μ − X₂)² + 2(X₁ − μ)(μ − X₂) ].

How big is this on average? The first two terms in the brackets will average
to σ² (the technical term is, their expectation is σ²), while the last term
averages to 0. The total averages then to just σ²/2.
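
A quick simulation makes the bias visible; the sketch below is our own illustration (not from the notes), assuming numpy.

    # Dividing by n underestimates sigma^2 by a factor of (n - 1)/n on
    # average; dividing by n - 1 does not.
    import numpy as np

    rng = np.random.default_rng(0)
    n, trials = 2, 100_000
    x = rng.normal(0.0, 1.0, size=(trials, n))   # true variance is 1

    print(x.var(axis=1, ddof=0).mean())   # near 0.5 = sigma^2 * (n-1)/n
    print(x.var(axis=1, ddof=1).mean())   # near 1.0 = sigma^2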

10.2  Paired-sample t test

A study [Lev73] (discussed in [Ric95]) was carried out to examine the effect of cigarette smoking on blood
clotting. Some health problems that smokers are prone to are a result of
abnormal blood clotting. Blood was drawn from 11 individuals before and
after they smoked a cigarette, and researchers measured the percentage of
blood platelets (the factors responsible for initiating clot formation)
that aggregated when exposed to a certain stimulus. The results are shown
in Table 10.2.

We see that the After numbers tend to be larger than the Before
numbers. But could this be simply a result of random variation? After all,
there is quite a lot of natural variability in the numbers.

Imagine that we pick a random individual, who has a normally distributed Before score Xᵢ. Smoking a cigarette adds a random effect (also
normally distributed) Dᵢ, to make the After score Yᵢ. It's a mathematical
fact that, if X and D are independent, and Y = X + D, then

    Var(Y) = Var(X) + Var(D).
We are really interested to know whether Di is positive on average, which we
do by comparing the observed average value of di to the SD of di . But when
we did the computation, we did not use the SD of di in the denominator;
we used the SD of xi and yi , which is much bigger. That is, the average
difference between Before and After numbers was found to be not large

Before   After   Difference
  25       27         2
  25       29         4
  27       37        10
  44       56        12
  30       46        16
  67       82        15
  53       57         4
  53       80        27
  52       61         9
  60       59        -1
  28       43        15

Table 10.2: Percentage of blood platelets that aggregated in 11 different
patients, before and after smoking a cigarette.

enough relative to the SD to be statistically significant, but it was the wrong
SD. Most of the variability that we found was variation between individuals
in their Before scores, not variability in the change due to smoking.

We can follow exactly the same procedure as in section 10.1.3, applied
now to the differences. We find that the mean of the differences is d̄ = 10.3,
and the SD is s_d = 7.98. The t statistic is

    T = 10.3 / (7.98/√11) = 4.28.

The cutoff for p = 0.05 at 10 d.f. is found from the table to be 2.23, so we can
certainly reject the null hypothesis at the 0.05 level. The difference between
the Before and After measurements is found to be statistically significant.
(In fact, the p-value may be calculated to be about 0.002.)
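
The paired test is built into scipy; the sketch below (our own addition, not from the notes) uses the data of Table 10.2.

    # ttest_rel is equivalent to a one-sample t test on the differences.
    import numpy as np
    from scipy.stats import ttest_rel

    before = np.array([25, 25, 27, 44, 30, 67, 53, 53, 52, 60, 28])
    after  = np.array([27, 29, 37, 56, 46, 82, 57, 80, 61, 59, 43])

    T, p = ttest_rel(after, before)
    print(T, p)   # T about 4.3, p about 0.002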

10.3  Introduction to sampling

10.3.1  Sampling with and without replacement

It may seem odd that the computations in section 7.1 take no account of
the size of the population that we are sampling from. After all, if these 200
men were all the married men in the UK, there would be no sampling error
at all. And if the total population were 300, so that we had sampled 2 men
out of 3, there would surely be less random error than if there are 20 million
men in total, and we have sampled only 1 man out of 100,000. Indeed, this
is true, but the effect vanishes quite quickly as the size of the population
grows.
Suppose we have a box with N cards in it, each of which has a number,
and we sample n cards without replacement, drawing numbers X₁, . . . , Xₙ.
Suppose that the cards in the box were themselves drawn from a normal
distribution with variance σ², and let μ be the population mean, that is,
the mean of the numbers in the box. The sample mean X̄ is still normally
distributed, with expectation μ, so the only question now is to determine
the standard error. Call this standard error SE_NR (NR = no replacement),
and the SE computed earlier SE_WR (WR = with replacement). It turns out
that the standard error is precisely

    SE_NR = SE_WR · √( 1 − (n − 1)/(N − 1) ) = (σ/√n) · √( 1 − (n − 1)/(N − 1) ).    (10.2)

Thus, if we had sampled 199 out of 300, the SE (and hence also the width of
all our confidence intervals) would be multiplied by a factor of √(101/299) =
0.58, so would be barely half as large. If the whole population is 1000, so
that we have sampled 1 out of 5, the correction factor has gone up to 0.89,
so the correction is only by about 10%. And if the population is 10,000, the
correction factor is 0.99, which is already negligible for nearly all purposes.
Thus, if the 199 married men had been sampled from a town with just
300 married men, the 95% confidence interval for the average height of married men in the town would be 1732mm ± 2 × 0.58 × 4.9mm = 1732mm ± 5.7mm,
so about (1726, 1738)mm, instead of the 95% confidence interval computed
earlier for sampling with replacement, which was (1722, 1742)mm.
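
The correction factors above are easy to check; this short sketch (our own addition, assuming numpy) evaluates equation (10.2) for the example's numbers.

    # Finite-population correction for n = 199 samples, SE_WR = 4.9mm.
    import numpy as np

    n, se_wr = 199, 4.9
    for N in (300, 1000, 10_000):
        factor = np.sqrt(1 - (n - 1) / (N - 1))
        print(N, factor, se_wr * factor)   # factors about 0.58, 0.90, 0.99
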
The size of the sample matters far more than the size of the population (unless you are sampling a large fraction of the population
without replacement).

10.3.2  Measurement bias

Bias is a crucial piece of the picture. This is the piece of the error that is
systematic: For instance, if you are measuring plots of land with a metre
stick, some of your measures will by chance be too big, and some will be too
small. The random errors will tend to be normally distributed with mean
0. Averaging more measurements produces narrower confidence bounds.


Suppose your metre stick is actually 101 cm long, though. Then all of your
measurements will start out about 1% too short before you add random
error on to them. Taking more measurements will not get you closer to the
true value, but rather to the ideal biased measure. The important lesson is:
Statistical analysis helps us to estimate the extent of random error. Bias
remains.
Of course, statisticians are very concerned with understanding the sources
of bias; but bias is very subject-specific. The bias that comes in conducting
a survey is very different from the bias that comes from measuring the speed
of blink reflexes in a psychology experiment.
Better measurement procedures, and better sampling procedures,
can reduce bias. Increasing numbers of measurements, or larger
samples, reduce the random error. Both cost time and effort. The
trick is to find the optimum tradeoff.

10.3.3  Bias in surveys

An excellent discussion of the sources of bias in surveys may be found in the
book Statistics, by Freedman, Pisani, and Purves. Here are some types of
bias characteristic of surveys that researchers have given names to:

Selection bias
The distribution you sample from may actually differ from the distribution
you thought you were sampling from. In the simplest (and most common)
type of survey, you mean to be doing a so-called simple random sample of
the population: each individual has the same chance of being in the sample.
It's easy to see how this assumption could go wrong. How do you pick a
random set of 1000 people from all 60 million people in Britain? Do you
dial a random telephone number? But some people have multiple telephone
lines, while others have none. And what about mobiles? Some people are
home more than others, so are more likely to be available when you call.
And so on. All of these factors can bias a survey. If you survey people on
the street, you get only the people who are out and about.

Early in the 20th century, it was thought that surveys needed to be huge
to be accurate: the larger the better. Then some innovative pollsters, like
George Gallup in the US, realised that for a given amount of effort, you
would get a better picture of the true population distribution by taking
a smaller sample, but putting more effort into making sure it was a good
random sample. More random error, but less bias. The biggest advantage
is that you can compute how large the random error is likely to be, whereas
bias is almost entirely unknowable.

In section 7.1 we computed confidence intervals for the heights of British
men (in 1980) on the basis of 199 samples from the OPCS survey. In fact,
the description we gave of the data was somewhat misleading in one respect:
the data set we used actually gave the paired heights of husbands and wives.
Why does this matter? This sample is potentially biased because the only
men included are married. It is not inconceivable that unmarried men have
different average height from married men. In fact, the results from the
complete OPCS sample are available [RSKG85], and the average height was
found to be 1739mm, which is slightly higher than the average height for
married men that we found in our selective subsample, but still within the
95% confidence interval.

The most extreme cases of selection bias arise when the sample is self-selected. For instance, if you look on a web site for a camera you're interested in, and see that 27 buyers said it was good and 15 said it was bad,
what can you infer about the true percentage of buyers who were satisfied?
Essentially nothing. We don't know what motivated those particular people
to make their comments, or how they relate to the thousands of buyers who
didn't comment (or commented on another web site).

Non-response bias
If you're calling people at home to survey their opinions about something,
they might not want to speak to you, and the people who speak to you
may be different in important respects from the people who don't speak
to you. If you distribute a questionnaire, some will send it back and some
won't. Again, the two groups may not be the same.

As an example, consider the health questionnaire that was mailed to
a random sample of 6009 residents of Somerset Health District in 1992.
The questionnaire consisted of 43 questions covering smoking habits, eating
patterns, alcohol use, physical activity, previous medical history and demographic and socio-economic details. 57.6% of the surveys were returned,
and on this basis the health authorities could estimate, for instance, that
24.2% of the population were current smokers, or that 44.3% engage in no
moderate or vigorous activity. You might suspect something was wrong
when you see that 45% of the respondents were male as compared with
just under 50% of the population of Somerset (known from the census). In
fact, a study [HREG97] evaluated the nonresponse bias by attempting to
contact a sample of 437 nonrespondents by telephone, and asking them some
of the same questions. 236 were reached and agreed to take part. It turned
out that 57.5% of them were male; 32.3% of the contacted nonrespondents
were current smokers, and 67.8% of them reported no moderate or vigorous
activity. Thus, nonresponse bias had led to substantial underestimation of
these important risk factors in the population.

Response bias
Sometimes subjects don't respond. Sometimes they do, but they don't tell
the truth. Response bias is the name statisticians give to subjects giving
an answer that they think is more acceptable than the true answer. For
instance, one 1973 study [LSS73] asked women to express their opinion (from
1 = strongly agree to 5 = strongly disagree) on feminist or anti-feminist
statements. When the interviewer was a woman, the average response to the
statement "The woman's place is in the home" was 3.09, essentially neutral,
but this shifted to clear disagreement (3.80) when the interviewer was a
man. Similarly for "Motherhood and a career should not be mixed" (2.96
as against 3.62). On the other hand, those interviewed by women averaged
close to neutral (2.78) on the statement "A completely liberalized abortion
law is right", whereas those interviewed by men were close to unanimous
strong agreement (1.31 average).

On a similar note, in the preceding presidential election (held 2 November),
just over 60% of registered voters cast ballots; when Gallup polled the public less than three weeks later, 80% said they had voted. Another well-known
anomaly is the difference between heterosexual men and women in the number of lifetime sexual partners they report (which logically must be the same,
on average) [BS].

Example 10.3: Tsunami donations


Carl Bialik [Bia05] pointed to a 2005 poll by the Gallup organisation, in which Americans were asked whether they had donated
to aid victims of the recent Asian tsunami. A stunning 33%
said they had. The pollsters went on to ask those who said they
had donated how much they had donated. The results, available from the Gallup web site, are given in Table 10.3.

$Amount    N      $Amount    N      $Amount    N
      0    3          100   73          500    9
      1    6          120    1          550    1
      5    7          150    4          750    0
     10   19          151    2         1000    4
     15    1          180    1         1100    1
     20   20          200   15         2000    1
     25   22          250    5         2500    1
     30    7          300    8         5000    3
     35    1          325    0        9999+    2
     40    4          400    1
     50   28
     55    1
     75    1
     78    1
     79    1
     80    0

Table 10.3: The amounts claimed to have been donated to tsunami relief by
254 respondents to a Gallup survey in January 2005, out of 1008 queried.

Putting aside the two very large donors, we see that 1/4 of households
donated an average of $192, meaning $48 per household. Since
there are about 110 million households in the US, that brings
us to a total of more than $5 billion donated. Bialik noted that
the Chronicle of Philanthropy, a trade publication, reported a
total of $745 million donated by private sources, leaving a gap
of more than $4 billion between what people said they donated,
and what relief organisations received.

Ascertainment bias
You analyse the data you have, but don't know which data you never got to
observe. This can be a particular problem in a public health context, where
you only get reports on the illnesses serious enough for people to seek medical
treatment. Before the recent outbreak of swine
flu, a novel bird flu was making public health experts nervous as it spread
through the world from its site of origin in East Asia. While it mainly affects
waterfowl, occasional human cases have occurred. Horrific mortality rates,
on the order of 50%, have been reported. Thorson et al. [TPCE06] pointed
out, though, that most people with mild cases of flu never come to the
attention of the medical system, particularly in poor rural areas of Vietnam
and China, where the disease is most prevalent. They found evidence of a
high rate of flulike illness associated with contact with poultry among the
rural population in Vietnam. Quite likely, then, many of these illnesses were
mild cases of the avian influenza. Mortality rate is the probability of cases of
disease resulting in death, which we estimate from the fraction of observed
cases resulting in death. In this case, though, the sample of observed cases
was biased: a severe case of flu was more likely to be observed than a
mild case. Thus, while the fraction of observed cases resulting in death was
quite high in Vietnam (in 2003-5 there were 87 confirmed cases, of which
38 resulted in death), this likely does not reflect accurately the fraction of
deaths among all cases in the population. In all likelihood, the 38 includes
nearly all the deaths, but the 87 represents only a small fraction of the cases.

During World War II the statistician Abraham Wald worked with the US
air force to analyse the data on damage to military airplanes from enemy fire.
The question: given the patterns of damage that we observe, what would
be the best place to put extra armour plating to protect the aircraft? (You
can't put too much armour on, because it makes the aircraft too heavy.) His
answer: put armour in places where you never see a bullet hole. Why? The
bullet holes you see are on the planes that made it back. If you never see
a bullet hole in some part of the plane, that's probably because the planes
that were hit there didn't make it back. (His answer was more complicated
than that, of course, and involved some careful statistical calculations. For
a discussion, see [MS84].)

10.3.4  Measurement error

An old joke says, "If you have one watch you know what time it is. If
you have two watches you're never sure." What do you do when multiple
measurements of the same quantity give different answers? One possibility
is to try to find out which one is right. Anders Hald, in his history of
statistics [Hal90], wrote:

    The crude instruments used by astronomers in antiquity and the
    Middle Ages could lead to large [. . . ] errors. By planning their
    observations astronomers tried to balance positive and negative
    systematic errors. If they made several observations of the same
    object, they usually selected the best as estimator of the true
    value, the best being defined from such criteria as the occurrence of good observational conditions, special care having been
    exerted, and so on.

One of the crucial insights that spurred the development of statistical theory
is that the sampling and error problems are connected. Each measurement
can be thought of as a sample from the population of possible measurements,
and the whole sample tells you more about the population, hence about
the hidden true value, than any single measurement could.

There are no measurements without error.


Of course, in most real settings measurement error is mixed with population
variation. Furthermore, effort put into perfecting the measurement might
be more beneficially put into acquiring more samples.

A very basic model is the Gauss model of errors, which breaks the error
down into two pieces, the chance error, which is a random quantity
with expectation 0, and the bias:

    measurement = true value + chance error + bias.    (10.3)

This is a bit of a fiction, whereby we simply define chance error to be zero
on average, and call any trend that is left over bias.

In a study where you measure each individual once, population variability and random measurement error are mixed up together: if your measurements have a high SD, you don't know whether that is because the
population is really variable, or because each measurement had a large error
attached to it. You may not care. If you do care, you can make multiple
measurements for each individual. Then the methods of chapter 14 will
show you how to separate out the two pieces.


Lecture 11

Comparing Distributions

11.1  Normal confidence interval for the difference between two population means

Consider again the sample of 198 men's heights, which we discussed in sections 7.1 and 10.3.3. As mentioned there, the data set gives paired heights
of husbands and wives, together with their ages, and the age of the husband
at marriage. This might allow us to pose a different sort of question. For
instance: what is the average difference in height between men who married
early and men who married late? We summarise the data in Table 11.1,
defining early-married to mean before age 30.

What does this tell us? We know that the difference in our sample is
19mm, but does this reflect a true difference in the population at large, or
could it be a result of mere random selection variation? To put it differently,
how sure are we that if we took another sample of 199 and measured them,
we wouldn't find a very different pattern?

          early (< 30)   late (≥ 30)   unknown   total
number         160            35            3      198
mean          1735          1716         1758     1732
SD              67            78           59       69

Table 11.1: Summary statistics for heights in mm of 198 married men,
stratified by age at marriage: early (before age 30), late (age 30 or later),
or unknown.


Let μ_X be the true average in the population of the heights of early-married men, and μ_Y the true average for late-marrieds. The parameter we
are interested in is μ_{X−Y} := μ_X − μ_Y. Obviously the best estimate for μ_{X−Y}
will be X̄ − Ȳ, which will be normally distributed with the right mean, so
that a symmetric level α confidence interval will be (X̄ − Ȳ) ± z·SE. But
what is the appropriate standard error for the difference?

Since the variance of a sum of independent random variables is the sum
of their variances, we see that

    SE²_{X̄−Ȳ} = Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) = σ_X²/n_X + σ_Y²/n_Y,

where σ_X is the standard deviation for the X variable (the height of early-marrieds) and σ_Y is the standard deviation for the Y variable (the height of
late-marrieds); n_X and n_Y are the corresponding numbers of samples. This
gives us the standard formula:

    SE_{X̄−Ȳ} = √( σ_X²/n_X + σ_Y²/n_Y ) = σ·√( 1/n_X + 1/n_Y )  if σ_X = σ_Y = σ.

Formula 11.1: Standard error for the difference between two normally distributed variables

Thus, X̄ − Ȳ = μ_{X−Y} + SE_{X̄−Ȳ}·Z, where Z has standard normal distribution. Suppose now we want to compute a 95% confidence interval for the
difference in heights of the early- and late-marrieds. The point estimate we
know is +19mm, and the SE is

    √( 67²/160 + 78²/35 ) ≈ 14mm.

The confidence interval for the difference ranges then from −9mm to +47mm.
Thus, while our best guess is that the early-marrieds are on average 19mm
taller than the late-marrieds, all that we can say with 95% confidence, on the
basis of our sample, is that the difference in height is between −9mm and
+47mm. That is, heights are so variable that a sample of this size might
easily be off by 28mm either way from the true difference in the population.
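
The whole computation fits in a few lines of Python; this sketch (our own addition, assuming numpy) applies Formula 11.1 to the summary statistics of Table 11.1.

    # 95% confidence interval for the difference of two means.
    import numpy as np

    mean_x, sd_x, n_x = 1735, 67, 160   # early-married
    mean_y, sd_y, n_y = 1716, 78, 35    # late-married

    se = np.sqrt(sd_x**2 / n_x + sd_y**2 / n_y)   # about 14 mm
    diff = mean_x - mean_y                        # +19 mm
    print(diff - 1.96 * se, diff + 1.96 * se)     # roughly (-9, +47) mm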

11.2  Z test for the difference between population means

Suppose we wish to make an important argument about the way people
choose marriage partners, and base it on the observation that men who
marry young tend to be taller, hence that taller men marry younger. But
is this true? Or could it be just the particular sample we happened to get,
and might we have come to the opposite conclusion from a different sample?
One way of answering this is to point out that the 95% confidence interval
we computed in section 11.1 includes 0. Another way of expressing exactly
the same information is with a significance test.

We have assumed that our samples come from normal distributions, with
known (and distinct) variances σ_X² and σ_Y², and unknown (and possibly
distinct) means μ_X and μ_Y. (In fact, the variances have been estimated
from the data, but the number of observations is large enough that we can
ignore this limitation. For smaller numbers of observations, see sections 11.4
and 11.5.1.) The null hypothesis, which says "nothing interesting happened,
it's just chance variation", is

    H0: μ_X = μ_Y.

The two-sided alternative is μ_X ≠ μ_Y.


Using our results from section 11.1 we compute the test statistic

    Z = (X̄ − Ȳ) / SE_{X̄−Ȳ} = 19mm/14mm = 1.4.

If we were testing at the 0.05 level, we would reject values of Z bigger
than 1.96 (in absolute value), so we do not reject the null hypothesis. Even
testing at the 0.10 level we would not reject the null, since the cutoff is 1.6.
Our conclusion is that the difference in heights between the early-married
and late-married groups is not statistically significant. Notice that this is
precisely equivalent to our previous observation that the symmetric 95%
confidence interval includes 0.

If we wish to test H0 against the alternative hypothesis μ_X > μ_Y, we are
performing a one-sided test: we use the same test statistic Z, but we reject
values of Z which correspond to large values of X̄ − Ȳ, so large positive
values of Z. Large negative values of Z, while they are unlikely for the null
hypothesis, are even more unlikely for the alternative. The cutoff for testing
at the 0.05 level is z_0.95 = 1.64. Thus, we do not reject the null hypothesis.

11.3  Z test for the difference between proportions

Proportions are analysed in exactly the same way as population means,
where the population consists only of the numbers 0 and 1 in some unknown
proportion. We sample two populations, x and y, with n_x samples and n_y
samples respectively, and observe k_x and k_y successes respectively. Under
the null hypothesis, the population SD is √(p(1 − p)), where p is the common
proportion of 1s (successes) in the population. We substitute the estimate
from the sample p̂ = (k_x + k_y)/(n_x + n_y).

Consider the results of a study that was carried out in Rakai, Uganda, to
test the theory that circumcision would reduce infection rates for HIV. While
the procedure seemed to succeed in its primary goal of reducing infection
rates of the men who were circumcised, there was some evidence that it
actually increased the likelihood of the men's partners becoming infected.
The results (reported in The International Herald Tribune, 6 March 2007)
showed that among 70 men with HIV who were circumcised, 11 of their
partners became infected in the month following surgery; among 54 controls
who were not circumcised, only 4 of the partners became infected in the first
month. Writing subscripts c and u for circumcised and uncircumcised
respectively, we have then the estimated proportions infected p̂_c = 11/70 =
0.157, and p̂_u = 4/54 = 0.074. Could the difference be simply due to chance?
We perform a Z test at the 0.05 significance level.

The joint estimate of the proportion is p̂ = 15/124 = 0.121, giving us a
sample SD of √(p̂(1 − p̂)) = 0.326. The standard error for the difference is then

    SE_{p̂u−p̂c} = 0.326 · √( 1/54 + 1/70 ) = 0.059.

The z statistic is then

    Z = (p̂_c − p̂_u) / SE = 0.083/0.059 = 1.41.

The cutoff for rejecting Z at the 0.05 level is 1.96. Since the observed Z is
smaller than this, we do not reject the null hypothesis, that the infection
rates are in fact equal. The difference in infection rates is not statistically
significant, as we cannot be confident that the difference is not simply due
to chance.
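
A minimal Python sketch of this two-proportion test (our own addition, not from the notes; numpy assumed):

    # Two-proportion Z test with the pooled estimate of p.
    import numpy as np

    k_c, n_c = 11, 70   # circumcised: partners infected
    k_u, n_u = 4, 54    # uncircumcised controls

    p_pool = (k_c + k_u) / (n_c + n_u)             # 0.121
    se = np.sqrt(p_pool * (1 - p_pool)) * np.sqrt(1 / n_u + 1 / n_c)
    Z = (k_c / n_c - k_u / n_u) / se
    print(Z)   # about 1.4, below 1.96: do not reject H0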

11.4  t confidence interval for the difference between population means

Consider the following study [SCT+90], discussed in [RS02, Chapter 2]: researchers measured the volume of the left hippocampus of the brains of 30
men, of whom 15 were schizophrenic. The goal is to determine whether
there is a difference in the size of this brain region between schizophrenics
and unaffected individuals. The data are given in Table 11.2. The average size among the unaffected subjects (in cm³) is 1.76, while the mean for
the schizophrenic subjects is 1.56. The sample SDs are 0.24 and 0.3 respectively. What can we infer about the populations that these individuals
were sampled from? Do schizophrenics have smaller hippocampal volume,
on average?

We make the modeling assumption that these individuals are a random
sample from the general population (of healthy and schizophrenic men, respectively; more about this in section 11.5.2). We also assume that the
underlying variance of the two populations is the same, but unknown: the
difference between the two groups (potentially) is in the population means
μ_x (healthy) and μ_y (schizophrenic), and we want a confidence interval for
the difference.

Since we don't know the population SD in advance, and since the number
of samples is small, we use the T distribution for our confidence intervals
instead of the normal. (Since we don't know that the population is normally
distributed, we are relying on the normal approximation, which may be
questionable for averages of small samples. For more about the validity of this
assumption, see section 11.5.3.) As always, the symmetric 95% confidence
interval is of the form

    Estimate ± t·SE,

where t is the number such that 95% of the probability in the appropriate
T distribution is between −t and t (that is, the number in the P = 0.05
column of your table). We need to know:

(1). How many degrees of freedom?

(2). What is the SE?

The first is easy: we add the degrees of freedom, to get n_x + n_y − 2;
in this case, 28.

The second is sort of easy: like for the Z test, when σ_x = σ_y = σ (that's
what we're assuming), we get

    SE = σ·√( 1/n_x + 1/n_y ).

The only problem is that we don't know what σ is. We have our sample
SDs s_x and s_y, each of which should be approximately σ. The bigger the
sample, the better the approximation should be. This leads us to the pooled
sample variance s_p², which simply averages these estimates, counting the
bigger sample more heavily:

    pooled sample variance:  s_p² = ( (n_x − 1)s_x² + (n_y − 1)s_y² ) / (n_x + n_y − 2),

    SE = s_p·√( 1/n_x + 1/n_y ).

Plugging in the data, we get s_p = 0.27, so that the SE becomes 0.099.
The table gives us t = 2.05 (with 28 d.f.), so that the 95% confidence interval
for μ_x − μ_y becomes

    0.20 ± 2.05 × 0.099 = (−0.003, 0.403).

Table 11.2: Data from the Suddath [RS90] schizophrenia experiment. Hippocampus volumes in cm³.

Unaffected:    1.94, 1.44, 1.56, 1.58, 2.06, 1.66, 1.75, 1.77, 1.78, 1.92,
               1.25, 1.93, 2.04, 1.62, 2.08
Schizophrenic: 1.27, 1.63, 1.47, 1.39, 1.93, 1.26, 1.71, 1.67, 1.28, 1.85,
               1.02, 1.34, 2.02, 1.59, 1.97

11.5  Two-sample test and paired-sample test

11.5.1  Schizophrenia study: Two-sample t test

We observe that the confidence interval computed in section 11.4 includes 0,
meaning that we are not 95% confident that the difference is not 0. If this is
the question of primary interest, we can formulate this as a hypothesis test.


Is the difference in hippocampal volume between the two groups statistically significant? Can the observed difference be due to chance? We perform
a t test at significance level 0.05. Our null hypothesis is that μ_x = μ_y, and
our two-tailed alternative is μ_x ≠ μ_y. The standard error is computed exactly as before, to be 0.099. The T test statistic is

    T = (X̄ − Ȳ) / SE = 0.20/0.099 = 2.02.

We then observe that this is not above the critical value 2.05, so we RETAIN the null hypothesis, and say that the difference is not statistically
significant.

If we had decided in advance that we were only interested in whether
μ_x > μ_y (the one-tailed alternative) we would use the same test statistic
T = 2.02, but now we draw our critical value from the P = 0.10 column,
which gives us 1.70. In this case, we would reject the null hypothesis. On the
other hand, if we had decided in advance that our alternative was μ_x < μ_y,
we would have a critical value −1.70, with rejection region anything below
that, so of course we would retain the null hypothesis.
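
The pooled two-sample test is built into scipy; this sketch (our own addition, not from the notes) uses the data of Table 11.2.

    # ttest_ind with equal_var=True uses the pooled variance, as above.
    import numpy as np
    from scipy.stats import ttest_ind

    unaffected = np.array([1.94, 1.44, 1.56, 1.58, 2.06, 1.66, 1.75, 1.77,
                           1.78, 1.92, 1.25, 1.93, 2.04, 1.62, 2.08])
    schizophrenic = np.array([1.27, 1.63, 1.47, 1.39, 1.93, 1.26, 1.71, 1.67,
                              1.28, 1.85, 1.02, 1.34, 2.02, 1.59, 1.97])

    T, p = ttest_ind(unaffected, schizophrenic, equal_var=True)
    print(T, p)   # T about 2.0, two-sided p just above 0.05: retain H0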

11.5.2  The paired-sample test

It may seem disappointing that we can't do more with our small samples.
The measurements in the healthy group certainly seem bigger than those of
the schizophrenic group. The problem is, there's so much noise, in the form
of general overall variability among the individuals, that we can't be sure if
the difference between the groups is just part of that natural variation.

It might be nice if we could get rid of some of that noise. For instance,
suppose the variation between people was in three parts (call them A, B,
and S), so your hippocampal volume is A + B + S. S is the effect of having
schizophrenia, and A is random stuff we have no control over. B is another
individual effect, but one that we suppose is better defined: for instance,
the effect of genetic inheritance. Suppose we could pair up schizophrenic and
non-schizophrenic people with the same B score, and then look at the
difference between individuals within a pair. Then B cancels out, and S
becomes more prominent. This is the idea of the matched case-control
study.

In fact, that's just what was done in this study. The real data are given
in Table 11.3. The 30 subjects were, in fact, 15 pairs of monozygotic twins,
of whom one was schizophrenic and the other not. The paired-sample T
test is exactly the one that we described in section 10.2. The mean of the
differences is 0.20, and the sample SD of the fifteen differences is 0.238. The
SE for these differences is then 0.238/√15 = 0.0615. In other words, while
the average differences between 15 independent pairs of schizophrenic and
healthy subject hippocampus volumes would vary by about 0.099, the differences in our sample vary by only about 0.0615, so about 40% less,
because some of the variability has been excluded by matching the individuals in a pair.

We compute then T = 0.20/0.0615 = 3.25. Since the critical value is 2.15,
for T with 14 degrees of freedom at the 0.05 level, we clearly reject the null
hypothesis, and conclude that there is a significant difference between the
schizophrenic and unaffected brains. (Of course, we do have to ask whether
results about twins generalise to the rest of the population.)

Table 11.3: Data from the Suddath [RS90] schizophrenia experiment. Hippocampus volumes in cm³.

Unaffected   Schizophrenic   Difference
  1.94           1.27           0.67
  1.44           1.63          -0.19
  1.56           1.47           0.09
  1.58           1.39           0.19
  2.06           1.93           0.13
  1.66           1.26           0.40
  1.75           1.71           0.04
  1.77           1.67           0.10
  1.78           1.28           0.50
  1.92           1.85           0.07
  1.25           1.02           0.23
  1.93           1.34           0.59
  2.04           2.02           0.02
  1.62           1.59           0.03
  2.08           1.97           0.11

General rule: Suppose you wish to do a Z or T test for the difference
between the means of two normally distributed populations. If the data are
naturally paired up, so that the two observations in a pair are positively
correlated (see chapter 15 for precise definitions), then it makes sense to
compute the differences first, and then perform a one-sample test on the
differences. If not, then we perform the two-sample Z or T test, depending
on the circumstances.

You might imagine that you are sampling at random from a box full of
cards, each of which has an X and a Y side, with numbers on each, and you
are trying to determine from the sample whether X and Y have the same
average. You could write down your sample of Xs, then turn over all the
cards and write down your sample of Ys, and compare the means. If the
X and Y numbers tend to vary together, though, it makes more sense to
look at the differences X − Y over the cards, rather than throw away the
information about which X goes with which Y. If the Xs and Ys are not
actually related to each other then it shouldn't matter.

11.5.3  Is the CLT justified?

In section 11.5.2 we supposed that the average difference between the unaffected and schizophrenic hippocampus volumes would have a nearly normal
distribution. The CLT tells us that the average of a very large number of such differences, picked at random from the same distribution, would
have an approximately normal distribution. But is that true for just 15?
We would like to test this supposition.

One way to do this is with a random experiment. We sample 15 volume
differences at random from the whole population of twins, and average them.
Repeat 1000 times, and look at the distribution of averages we find. Are
these approximately normal?

But wait! We don't have access to any larger population of
twins; and if we did, we would have included their measurements in our study. The trick (which is widely used in modern
statistics, but is not part of this course), called the bootstrap,
is instead to resample from the data we already have, picking
15 samples with replacement from the 15 we already have,
so some will be counted several times, and some not at all. It
sounds like cheating, but it can be shown mathematically that it
works.
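
A minimal bootstrap sketch (our own illustration of the resampling idea, not part of the notes) applied to the 15 differences of Table 11.3, assuming numpy:

    # Resample the 15 differences with replacement, 1000 times.
    import numpy as np

    diffs = np.array([0.67, -0.19, 0.09, 0.19, 0.13, 0.40, 0.04, 0.10,
                      0.50, 0.07, 0.23, 0.59, 0.02, 0.03, 0.11])

    rng = np.random.default_rng(1)
    means = np.array([rng.choice(diffs, size=15, replace=True).mean()
                      for _ in range(1000)])
    print(means.mean(), means.std())   # should come out near 0.20 and 0.06
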
A histogram of the results is shown in Figure 11.2(a), together with the
appropriate normal curve. The fit looks pretty good, which should reassure
us of the appropriateness of the test we have applied. Another way of seeing
that this distribution is close to normal is Figure 11.2(b), which shows a so-called Q-Q plot. (The Q-Q plot is not examinable as such. In principle, it is
the basis for the Kolmogorov-Smirnov test, which we describe in section 13.1.
This will give us a quantitative answer to the question: is this distribution
close to the normal distribution?) The idea is very simple: we have 1000
numbers that we think might have been sampled from a normal distribution.
We look at the normal distribution these might have been sampled from
(the one with the same mean and variance as this sample), take 1000
numbers evenly spaced from the normal distribution, and plot them against
each other. If the sample really came from the normal distribution, then the
two should be about equal, so the points will all lie on the main diagonal.
Figure 11.2(c) shows a Q-Q plot for the original 15 samples, which clearly
do not fit the normal distribution very well.

[Figure 11.2: Comparisons of the normal distribution to means of resampled
schizophrenia data and original schizophrenia data. (a) Histogram of 1000
resampled means; (b) Q-Q plot of 1000 resampled means; (c) Q-Q plot of
the 15 original measurements.]

11.6  Hypothesis tests for experiments

11.6.1  Quantitative experiments

276 women were enrolled in a study to evaluate a weight-loss intervention
program [SKK91]. They were allocated at random to one of two different
groups: the 171 women in the intervention group received nutrition counseling
and behaviour modification treatment to help them reduce the fat in their
diets. The 105 women in the control group were urged to maintain their
current diet.

After 6 months, the intervention group had lost 3.2kg on average, with
an SD of 3.5kg, while the control group had lost 0.4kg on average, with
an SD of 2.8kg. Is the difference in weight loss between the two groups
statistically significant?


Let us first apply a Z test at the 0.05 significance level, without thinking
too deeply about what it means. (Because of the normal approximation,
it doesn't really matter what the underlying distribution of weight loss is.)
We compute first the pooled sample variance:

    s_p² = (170 × 3.5² + 104 × 2.8²) / 274 = 10.6,

so s_p = 3.25kg. The standard error is s_p·√(1/n + 1/m) = 0.40kg. Our test
statistic is then

    Z = (x̄ − ȳ) / SE = 2.8/0.40 = 7.0.

This exceeds the rejection threshold of 1.96 by a large margin, so we conclude that the difference in weight loss between the groups is statistically
significant.
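
Only summary statistics are needed here; the sketch below (our own addition, assuming numpy) reproduces the pooled-variance computation.

    # Pooled-variance Z statistic for the weight-loss study.
    import numpy as np

    n1, mean1, sd1 = 171, 3.2, 3.5   # intervention
    n2, mean2, sd2 = 105, 0.4, 2.8   # control

    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)  # 10.6
    se = np.sqrt(sp2) * np.sqrt(1 / n1 + 1 / n2)                   # 0.40 kg
    print((mean1 - mean2) / se)                                    # about 7
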
But are we justified in using this hypothesis test? What is the random
sampling procedure that defines the null hypothesis? What is the "just by
chance" that could have happened, and that we wish to rule out? These
276 women were not randomly selected from any population, at least
not according to any well-defined procedure. The randomness is in the
assignment of women to the two groups: we need to show that the difference
between the two groups is not merely a chance result of the women that we
happened to pick for the intervention group, which might have turned out
differently if we happened to pick differently.

In fact, this model is a lot like the model of section 11.5. Imagine a box
containing 276 cards, one for each woman in the study. Side A of the card
says how much weight the woman would lose if she were in the intervention
group; side B says how much weight she would lose if she were in the control
group. The null hypothesis states, then, that the average of the As is the
same as the average of the Bs. The procedure of section 11.5 says that we
compute all the values A-B from our sample, and test whether these could
have come from a distribution with mean 0. The problem is that we never
get to see A and B from the same card.
Instead, we have followed the procedure of section 11.5.1, in which we
take a sample of As and a sample of Bs, and test for whether they could
have come from distributions with the same mean. But there are some
important problems that we did not address there:

• We sample the As (the intervention group) without replacement. Furthermore, this is a large fraction of the total population (in the box).
We know from section 10.3.1 that this makes the SE smaller.


• The sample of Bs is not an independent sample; it's just the complement of the sample of As. A bit of thought makes clear that this
tends to make the SE of the difference larger.

This turns out to be one of those cases where two wrongs really do make
a right. These two errors work in opposite directions, and pretty much
cancel each other out. Consequently, in analysing experiments we ignore
these complications and proceed with the Z- or t-test as in section 11.5.1,
as though they were independent samples.

11.6.2 Qualitative experiments

A famous experiment in the psychology of choice was carried out by A. Tversky and D. Kahneman [TK81], to address the following question: Do people make economic decisions by a rational calculus, where they measure the perceived benefit against the cost, and then choose the course of action with the highest return? Or do they apply more convoluted decision procedures? They decided to find out whether people would give the same answer to the following two questions:
Question A: Imagine that you have decided to see a play where admission is $10 per ticket. As you enter the theatre you discover that you have lost a $10 bill. Would you still pay $10 for a ticket for the play?

Question B: Imagine that you have decided to see a play and paid the admission price of $10 per ticket. As you enter the theatre you discover that you have lost the ticket. The seat was not marked and the ticket cannot be recovered. Would you pay $10 for another ticket?

From a rational economic perspective, the two situations are exactly identical from the point of view of the subject: she is at the theatre, the play that she wants to see is about to start, but she doesn't have a ticket and would have to buy one. But maybe people still see these two situations differently. The problem is, you can't just show people both questions and ask them whether they would answer the same to both. Instead, Tversky and Kahneman posed Question A to about half the subjects (183 people) and Question B to the other half (200 people). The results are given in Table 11.4.
Table 11.4: Answers to the two questions.

                 Yes    No
  Question A     161    22
  Question B      92    54

It certainly appears that people are more likely to answer yes to A than to B (88% vs. 46%), but could this difference be merely due to chance? As usual, we need to ask: what is the chance model? It is not about the sample of 383 people that we are studying. They are not a probability sample from

any population, and we have no idea how representative they may or may not be of the larger category of Homo sapiens. The real question is: among these 383 people, how likely is it that we would have found a different result had we by chance selected a different group of 200 people to pose Question B to? We want to do a significance test at the 0.01 level.
The model is then: 383 cards in a box. On one side is that person's answer to Question A, on the other side the same person's answer to Question B (coded as 1 = yes, 0 = no). The null hypothesis is that the average on the A side is the same as the average on the B side (which includes the more specific hypothesis that the As and the Bs are identical).
We pick 183 cards at random, and add up their side As, coming to 161;
from the other 200 we add up the side Bs, coming to 92. Our procedure is
then:
(1). The average of the sampled side As is $\bar{X}_A = 0.88$, while the average of the sampled side Bs is $\bar{X}_B = 0.46$.

(2). The standard deviation of the A sides is estimated at $\sigma_A = \sqrt{p(1-p)} = 0.32$, while the standard deviation of the B sides is estimated at $\sigma_B = \sqrt{p(1-p)} = 0.50$.

(3). The standard error for the difference is estimated at

$$SE_{A-B} = \sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}} = \sqrt{\frac{0.32^2}{183} + \frac{0.5^2}{200}} = 0.043.$$

(4). $Z = (\bar{X}_A - \bar{X}_B)/SE_{A-B} = 9.77$. The cutoff for a two-sided test at the 0.01 level is $z_{0.995} = 2.58$, so we clearly do reject the null hypothesis.
The conclusion is that the difference in answers between the two questions was not due to the random sampling. Again, this tells us nothing directly about any larger population from which these 383 individuals might have been drawn.
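For reference, here is the same four-step calculation carried out in Python. This sketch is an addition to these notes, not part of the original; it uses only the counts from Table 11.4.

    from math import sqrt

    n_a, yes_a = 183, 161   # Question A: sample size and number of "yes"
    n_b, yes_b = 200, 92    # Question B

    p_a, p_b = yes_a / n_a, yes_b / n_b          # step 1: 0.88 and 0.46
    sd_a = sqrt(p_a * (1 - p_a))                 # step 2: about 0.32
    sd_b = sqrt(p_b * (1 - p_b))                 #         about 0.50
    se = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)     # step 3: about 0.043
    z = (p_a - p_b) / se                         # step 4: about 9.8
    print(z, abs(z) > 2.58)   # compare with the two-sided 0.01 cutoff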


Lecture 12

Non-Parametric Tests, Part I

12.1 Introduction: Why do we need distribution-free tests?

One of the most common problems we need to deal with in statistics is to compare two population means: we have samples from two different populations, and we want to determine whether the populations they were drawn from could have had the same mean, or we want to compute a confidence interval for the difference between the means. The methods described earlier in the course began with the assumption that the populations under consideration were normally distributed, and only the means (and perhaps the variances) were unknown. But most data that we consider do not come from a normal distribution. What do we do then?
In section 7.4 we discussed how we can use a mathematical result, the Central Limit Theorem, to justify treating data as though they had come from a normal distribution, as long as enough independent random samples are being averaged. But in some cases we don't have enough samples to invoke the Central Limit Theorem. In other cases (such as that of section 11.5.1) the experimental design seems to lack the randomisation that would justify the normal approximation.
In this lecture and the next we describe alternatives to the standard hypothesis tests of previous lectures, alternatives which are independent of any assumption about the underlying distribution of the data. In the following lectures we describe new problems (partitioning variance into different sources, and describing the strength of relationship between different variables), and in sections 14.6 and 16.3.2 we will present non-parametric versions of these.

The advantage of these non-parametric tests lies in their robustness: the significance level is known, independent of assumptions about the distribution from which the data were drawn, a valuable guarantee, since we can never really confirm these assumptions with certainty. In addition, the non-parametric approach can be more logically compelling for some common experimental designs, something we will discuss further in section 12.4.4.
Of course, there is always a tradeoff. The reliability of the non-parametric approach comes at the expense of power: the non-parametric test is always less powerful than the corresponding parametric test. (Or, to put it the other way, if we know what the underlying distribution is, we can use that knowledge to construct a test at the same level that is more powerful than the generic test that works for any distribution.)

12.2 First example: Learning to Walk

12.2.1 A first attempt

We recall the study of infant walking that we described way back in section 1.1. Six infants were given exercises to maintain their walking reflex, and six control infants were observed without any special exercises. The ages (in months) at which the infants were first able to walk independently are recapitulated in Table 12.1.

  Treatment    9.00   9.50   9.75  10.00  13.00   9.50
  Control     11.50  12.00   9.00  11.50  13.25  13.00

Table 12.1: Age (in months) at which infants were first able to walk independently. Data from [ZZK72].
As we said then, the Treatment numbers seem generally smaller than the Control numbers, but not entirely, and the number of observations is small. Could we merely be observing sampling variation, where we happened to get six (five, actually) early walkers in the Treatment group, and late walkers in the Control group?
Following the approach of Lecture 11, we might perform a two-sample T test for equality of means. We test the null hypothesis $\mu_{TREAT} = \mu_{CON}$ against the one-tailed alternative $\mu_{TREAT} < \mu_{CON}$, at the 0.05 level. To find the critical value, we look in the column for P = 0.10, with 6 + 6 − 2 d.f., obtaining 1.81. The critical region is then $\{T < -1.81\}$. The relevant


summary statistics are given in Table 12.2. We compute the pooled sample standard deviation

$$s_p = \sqrt{\frac{(6-1)\,1.45^2 + (6-1)\,1.52^2}{6+6-2}} = 1.48,$$

so the standard error is

$$SE = s_p \sqrt{\frac{1}{6} + \frac{1}{6}} = 0.85.$$

We have then the T statistic

$$T = \frac{\bar{X} - \bar{Y}}{SE} = -1.85.$$

So we reject the null hypothesis, and say that the difference between the two groups is statistically significant.

              Mean    SD
  Treatment   10.1   1.45
  Control     11.7   1.52

Table 12.2: Summary statistics from Table 12.1.
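The same T statistic can be computed with SciPy's pooled two-sample t test. This is a sketch added to the notes, not part of the original; it assumes a reasonably recent SciPy for the alternative argument.

    import numpy as np
    from scipy import stats

    treatment = np.array([9.00, 9.50, 9.75, 10.00, 13.00, 9.50])
    control = np.array([11.50, 12.00, 9.00, 11.50, 13.25, 13.00])

    # Pooled-variance two-sample t test, one-tailed: treatment < control
    t, p = stats.ttest_ind(treatment, control, equal_var=True,
                           alternative='less')
    print(t, p)  # t is about -1.85; reject at the 0.05 level since p < 0.05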

12.2.2 What could go wrong with the T test?

We may wonder about the validity of this test, though, particularly as the observed T was just barely inside the critical region. After all, the T test depends on the assumption that when $X_1, \ldots, X_6$ and $Y_1, \ldots, Y_6$ are independent samples from a normal distribution with unknown mean and variance, the observed T statistic will be below −1.81 just 5% of the time. But the data we have, sparse though they are, don't look like they come from a normal distribution. They look more like they come from a bimodal distribution, like the one sketched in Figure 12.1.
So, there might be early walkers and late walkers, and we just happened
to get mostly early walkers for the Treatment group, and late walkers for
the Control group. How much does this matter? We present one way of
seeing this in section 12.2.3. For the time being, we simply note that there
is a potential problem, since the whole idea of this statistical approach was
to develop some certainty about the level of uncertainty.


Figure 12.1: Sketch of a distribution from which the walking times of Table 12.1 might have been drawn, if they all came from the same distribution. The actual measurements are shown as a rug plot along the bottom (green for Treatment, red for Control); the marks have been adjusted slightly to avoid exact overlaps.


We would like to have an alternative procedure for testing hypotheses about equality of distributions, one which will give correct significance levels without depending on (possibly false) assumptions about the shape of the distributions. Of course, you never get something for nothing. In return for having procedures that are more generally applicable, and give the correct significance level when the null hypothesis is true, we will lose some power: the probability of rejecting a false null hypothesis will be generally lower, so we will need larger samples to attain the same level of confidence. Still, true confidence is better than deceptive confidence! We call these alternatives distribution-free or non-parametric tests. Some possibilities are described in section 12.3.

12.2.3 How much does the non-normality matter? (This section is optional.)


Suppose the null hypothesis were true, that the samples really came all from
the same distribution, but that distribution were not normal, but distributed
like the density in Figure 12.1. What would the distribution of the T statistics then look like? We can find this out by simulation: We pick 2 groups of
six at random from this distribution, and call one of them Treatment, and
the other Control. We compute T from these 12 observations, exactly as
we did on the real data, and then repeat this 10,000 times.
The histogram of these 10,000 simulated T values is shown in Figure
12.2. Notice that this is similar to the superimposed Student T density,
but not exactly the same. In fact, in the crucial tails there are substantial
differences. Table 12.3 compares the critical values at different significance
levels for the theoretical T distribution and the simulated (non-normal) T
distribution. Thus, we see in Table 12.3(b) that if we do a one-tailed test
at the 0.05 significance level by rejecting T < 1.81, instead of making the
(correct) 10% type I errors, we will make 10.4% type I errors. Similarly,
if we reject |T | > 3.17, to make what we think is a two-tailed test at level
0.01, we will in fact be making 1.1% type I errors when the null hypothesis
is true. Thus, we would be overstating our confidence.


Figure 12.2: Histogram of 10,000 simulated T values, computed from pairs of samples of six values each from the distribution sketched in Figure 12.1. The Student T density with 10 degrees of freedom is superimposed in red.
Table 12.3: Comparison of the tail probabilities of the T distribution simulated from non-normal samples, with the standard T distribution.

(a) Real vs. standard critical values

  Level    Standard   Simulated
  0.10       1.81       1.83
  0.05       2.23       2.25
  0.01       3.17       3.22
  0.001      4.59       5.14

(b) Real vs. standard tail probabilities

  Critical value   Standard   Simulated
       1.81         0.100      0.104
       2            0.073      0.076
       2.5          0.031      0.035
       3.17         0.010      0.011
       4            0.0025     0.0033
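The simulation behind Table 12.3 is easy to reproduce in outline. The sketch below is not from the original notes: it draws both groups from a two-humped mixture whose parameters are guesses meant to resemble Figure 12.1, not values taken from the text.

    import numpy as np

    rng = np.random.default_rng(1)

    def bimodal_sample(n):
        # 50/50 mixture of "early" and "late" walkers (assumed parameters)
        late = rng.random(n) < 0.5
        return np.where(late, rng.normal(12.5, 0.8, n), rng.normal(9.5, 0.6, n))

    def t_stat(x, y):
        sp2 = ((len(x) - 1) * x.var(ddof=1) + (len(y) - 1) * y.var(ddof=1)) \
              / (len(x) + len(y) - 2)
        return (x.mean() - y.mean()) / np.sqrt(sp2 * (1/len(x) + 1/len(y)))

    sims = np.array([t_stat(bimodal_sample(6), bimodal_sample(6))
                     for _ in range(10_000)])
    for cut in (1.81, 2.23, 3.17):  # nominal two-sided levels 0.10, 0.05, 0.01
        print(cut, (np.abs(sims) > cut).mean())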

Why are these tests called non-parametric? The more basic hypothesis tests that you have already learned are parametric, because they start from the assumption that both populations have distributions that fall into the same general class (generally normal), and the specific member of that class is determined by a single number, or a few numbers: the parameters. We are testing to decide only whether the parameters are the same. In the new tests, we drop this assumption, allowing a priori that the two population distributions could be anything at all.


12.3 Tests for independent samples

12.3.1 Median test

Suppose we have samples $x_1, \ldots, x_{n_x}$ and $y_1, \ldots, y_{n_y}$ from two distinct distributions whose medians are $m_x$ and $m_y$. We wish to test the null hypothesis

$$H_0: \quad m_x = m_y$$

against a two-tailed alternative

$$H_{alt}: \quad m_x \neq m_y,$$

or a one-tailed alternative

$$H_{alt}: \quad m_x < m_y, \qquad \text{or} \qquad H_{alt}: \quad m_x > m_y.$$

The idea of this test is straightforward: let M be the median of the combined sample $\{x_1, \ldots, x_{n_x}, y_1, \ldots, y_{n_y}\}$. If the medians are the same, then the xs and the ys should have an equal chance of being above M. Let $P_x$ be the proportion of xs that are above M, and $P_y$ the proportion of ys that are above M. It turns out that we can treat these as though they were the proportions of successes in $n_x$ and $n_y$ trials respectively. Analysing these results is not entirely straightforward; we get a reasonable approximation by using the Z test for differences between proportions, as in section 11.3.
Consider the case of the infant walking study, described in section 1.1. The 12 measurements, marked here C for control and T for treatment (in the original they were coloured red and green respectively), are

9.0 (T), 9.0 (C), 9.5 (T), 9.5 (T), 9.75 (T), 10.0 (T), 11.5 (C), 11.5 (C), 12.0 (C), 13.0 (T), 13.0 (C), 13.25 (C).

The median is 10.75, and we see that there are 5 control results above the median, and one treatment.
Calculating the p-value: Exact method

Imagine that we have $n_x$ red balls and $n_y$ green balls in a box. We pick half of them at random (these are the above-median outcomes) and get $k_x$ red and $k_y$ green. We expected to get about $n_x/2$ red and $n_y/2$ green. What is the probability of such an extreme result? This is reasonably straightforward to compute, though slightly beyond what we're doing in this course. We'll just go through the calculation for this one special case.


We pick 6 balls from 12, where 6 were red and 6 green. We want P(at least 5 red). Since the null hypothesis says that all picks are equally likely, this is simply the fraction of ways that we could make our picks which happen to have 5 or 6 red. That is,

$$P(\text{at least 5 R}) = \frac{\#\text{ ways to pick 5 R, 1 G} + \#\text{ ways to pick 6 R, 0 G}}{\text{total }\#\text{ ways to pick 6 balls from 12}}.$$

The number of ways to pick 6 balls from 12 is what we call $\binom{12}{6} = 924$. The number of ways to pick 6 red and 0 green is just 1: we have to take all the reds, we have no choice. The only slightly tricky one is the number of ways to pick 5 red and 1 green. A little thought shows that we have $\binom{6}{5} = 6$ ways of choosing the red balls, and $\binom{6}{1} = 6$ ways of choosing the green one, so 36 ways in all. Thus, the p-value comes out to $37/924 = 0.040$, so we still reject the null hypothesis at the 0.05 level. (Of course, the p-value for a two-tailed test is twice this, or 0.080.)
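This counting argument is one line of code with Python's binomial-coefficient function (a sketch, not part of the original notes):

    from math import comb

    ways = comb(6, 5) * comb(6, 1) + comb(6, 6) * comb(6, 0)  # 36 + 1 = 37
    total = comb(12, 6)                                       # 924
    print(ways / total)                                       # about 0.040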
Calculating the p-value: Approximate method

Slightly easier is to use the normal approximation to compute the p-value. This is the method described in your formula booklets. But it needs to be done with some care.
The idea is the following: we have observed a certain number $n_+$ of above-median outcomes. (This will be about half the total number of samples $n = n_x + n_y$, but slightly adjusted if the number is odd, or if there are ties.) We have observed proportions $p_x = k_x/n_x$ and $p_y = k_y/n_y$ of above-median samples from the two groups, respectively. We apply a Z test (as in sections 11.3 and 11.6.2) to test whether these proportions could really be the same (as they should be under the null hypothesis).¹
We compute the standard error as

$$SE = \sqrt{p(1-p)}\sqrt{\frac{1}{n_x} + \frac{1}{n_y}},$$
¹ If you're thinking carefully, you should object at this point that we can't apply the Z test for difference between proportions, because that requires that the two samples be independent. That's true. The fact that they're dependent raises the standard error; but the fact that we're sampling without replacement lowers the standard error. The two effects more or less cancel out, just as we described in section 11.6. Another way of computing this result is to use the χ² test for independence, writing this as a 2 × 2 contingency table. In principle this gives the same result, but it's hard to apply the continuity correction there. In a case like this one, the expected cell occupancies are smaller than we allow.


where $p = n_+/n$ is the proportion of samples above the median, which is 1/2 or a bit below. Then we compute the test statistic

$$Z = \frac{p_x - p_y}{SE},$$

and find the p-value by looking this up on the standard normal table.
Applying this to our infant walking example, we obtain $p = 1/2$, and

$$SE = \sqrt{\frac{1}{2}\cdot\frac{1}{2}\left(\frac{1}{6} + \frac{1}{6}\right)} = 0.289.$$

Then we get $Z = 0.667/0.289 = 2.3$. The normal table gives us a p-value of 0.01 for the one-tailed test. This is a long way off our exact computation of 0.04. What went wrong?
The catch is that we are approximating a discrete distribution with a continuous one (the normal), and that can mean substantial errors when the numbers involved are small, as they are liable to be when we are interested in applying the median test. We need a continuity correction (as discussed in section 6.8.1). We are asking for the probability of having at least 5 out of 6 from the Control group. But if we are thinking of this as a continuous measurement, we have to represent it as at least 4.5 out of 6. (A picture of the binomial probability with the normal approximation is given in Figure 12.3.) Similarly, the extreme fraction of treatment samples in the above-median sample must be seen to start at 1.5/6 = 0.25, rather than at 1/6. Thus, we have the test statistic

$$Z = \frac{0.25 - 0.75}{0.289} = -1.73.$$

If we look up 1.73 on the normal table, we get the value 0.9582, meaning that the one-tailed probability below −1.73 is 1 − 0.9582 = 0.0418, which is very close to the exact value computed above. Thus, we see that this normal approximation can work very well, if we remember to use the continuity correction.
To compute the observed significance level (p-value) for the median test, we use the Z test for differences between proportions, applied to the proportions $p_x$ and $p_y$ of the two samples that are above the median: $Z = (p_x - p_y)/SE$, with $SE = \sqrt{p(1-p)}\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}$. As a continuity correction, if $p_x$ is the larger proportion, we adjust the proportions to

$$p_x = \frac{k_x - 0.5}{n_x} \qquad \text{and} \qquad p_y = \frac{k_y + 0.5}{n_y}.$$

Figure 12.3: The exact probabilities from the binomial distribution for the extreme results of number of red (control) and green (treatment) in the infant walking experiment. The corresponding normal approximations are shaded. Note that the upper tail starts at 4.5/6 = 0.75, not at 5/6; and the lower tail starts at 1.5/6 = 0.25, rather than at 1/6.

There are many defects of the median test. One of them is that the results are discrete (there are at most n/2 + 1 different possible outcomes to the test), while the analysis with Z is implicitly continuous. This is one of the many reasons why the median test, while it is sometimes seen, is not recommended. (For more about this, see [FG00].) The rank-sum test is almost always preferred.
Note that this method requires that the observations be all distinct. There is a version of the median test that can be used when there are ties among the observations, but we do not discuss it in this course.

12.3.2 Rank-Sum test

The median test is obviously less powerful than it could be, because it considers only how many of each group are above or below the median, but not how far above or below. In the example of section 12.2, while 5 of the 6 treatment samples are below the median, the one that is above the median is near the top of the whole sample; and the one control sample that is below the median is in fact near the bottom. It seems clear that we should want to take this extra information into account. The idea of the rank-sum test (also called the Mann-Whitney test) is that we consider not just yes/no, above/below the median, but the exact relative ranking.
Continuing with this example, we list all 12 measurements in order, and replace them by their ranks:
  measurements     9.0   9.0   9.5   9.5   9.75  10.0  11.5  11.5  12.0  13.0  13.0  13.25
  ranks             1     2     3     4     5     6     7     8     9    10    11    12
  modified ranks   1.5   1.5   3.5   3.5    5     6    7.5   7.5    9   10.5  10.5   12

When measurements are tied, we average the ranks (we show this in the row labelled "modified ranks"). We wish to test the null hypothesis H0: control and treatment came from the same distribution, against the alternative hypothesis that the controls are generally larger.
We compute a test statistic R, which is just the sum of the ranks in the smaller sample. (In this case, the two samples have the same size, so we can take either one. We will take the treatment sample.) The idea is that these should, if H0 is true, be like a random sample from the numbers $1, \ldots, n_x + n_y$. If R is too big or too small we take this as evidence to reject H0. In the one-tailed case, we reject R for being too small (if the alternative hypothesis is that the corresponding group has smaller values) or for being too large (if the alternative hypothesis is that the corresponding group has larger values).
In this case, the alternative hypothesis is that the group under consideration, the treatment group, has smaller values, so the rejection region consists of R below a certain threshold. It only remains to find the appropriate threshold. These thresholds are given on the Mann-Whitney table (Table 5 in the formula booklet). The layout of this table is somewhat complicated. The table lists critical values corresponding only to P = 0.05 and P = 0.10. We look in the row corresponding to the size of the smaller sample, and the column corresponding to the larger. For a two-tailed test we look in the (sub-)row corresponding to the desired significance level; for a one-tailed test we use the row corresponding to double the desired level.
The sum of the ranks for the treatment group in our example is R = 30. Since we are performing a one-tailed test with the alternative being that the treatment values are smaller, our rejection region will be of the form


$R \le$ some critical value. We find the critical value in Table 12.4, in the cell for two samples of size 6. For a one-tailed test at the 0.05 level we take the upper values, 28 and 50. Hence, we would reject if $R \le 28$. Since R = 30, we retain the null hypothesis in this test.
If we were performing a two-tailed test instead, we would reject if $R \le 26$ or $R \ge 52$.
The table you are given goes up only as far as larger sample size equal to 10. For larger samples, we use a normal approximation:

$$z = \frac{R - \mu}{\sigma}, \qquad \mu = \frac{1}{2} n_x (n_x + n_y + 1), \qquad \sigma = \sqrt{\frac{n_x n_y (n_x + n_y + 1)}{12}},$$

where $n_x$ is the size of the smaller sample. As usual, we compare this z to the probabilities on the normal table. Thus, for instance, for a two-tailed test at the 0.05 level, we reject the null hypothesis if |z| > 1.96.
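In practice the ranking and the rank sum can be delegated to software. The following sketch is my own illustration, not from the notes; SciPy's mannwhitneyu reports the equivalent U statistic, U = R − n(n+1)/2, rather than R itself.

    import numpy as np
    from scipy import stats

    treatment = [9.00, 9.50, 9.75, 10.00, 13.00, 9.50]
    control = [11.50, 12.00, 9.00, 11.50, 13.25, 13.00]

    ranks = stats.rankdata(treatment + control)  # ties get averaged ranks
    R = ranks[:6].sum()                          # rank sum, treatment group
    print(R)                                     # 30, as in the text

    print(stats.mannwhitneyu(treatment, control, alternative='less'))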

Table 12.4: Critical values for the Mann-Whitney rank-sum test. In each cell, the upper pair of values is for P = 0.10 (two-tailed) and the lower pair for P = 0.05 (two-tailed).

                        larger sample size, n2
  n1      4       5       6       7       8       9      10
   4    12,24   13,27   14,30   15,33   16,36   17,39   18,42
        11,25   12,28   12,32   13,35   14,38   15,41   16,44
   5            19,36   20,40   22,43   23,47   25,50   26,54
                18,37   19,41   20,45   21,49   22,53   24,56
   6                    28,50   30,54   32,58   33,63   35,67
                        26,52   28,56   29,61   31,65   33,69
   7                            39,66   41,71   43,76   46,80
                                37,68   39,73   41,78   43,83
   8                                    52,84   54,90   57,95
                                        49,87   51,93   54,98
   9                                            66,105  69,111
                                                63,108  66,114
  10                                                    83,127
                                                        79,131

12.4 Tests for paired data

As with the t test, when the data fall naturally into matched pairs, we can improve the power of the test by taking this into account. We are given data in pairs $(x_1, y_1), \ldots, (x_n, y_n)$, and we wish to test the null hypothesis H0: x and y come from the same distribution. In fact, the null hypothesis may be thought of as being even broader than that. As we discuss in section 12.4.4, there is no reason, in principle, why the data need to be randomly sampled at all. The null hypothesis says that the xs and the ys are indistinguishable from a random sample from the complete set of xs and ys together. We don't use the precise numbers, which depend upon the unknown distributions of the xs and ys, but only basic reasoning about the relative sizes of the numbers. Thus, if the xs and ys come from the same distribution, it is equally likely that $x_i > y_i$ as that $x_i < y_i$.

12.4.1 Sign test

The idea of the sign test is quite straightforward. We wish to test the null hypothesis that paired data came from the same distribution. If that is the case, then which one of the two observations is the larger should be just like a coin flip. So we count up the number of times (out of n pairs) that the first observation in the pair is larger than the second, and compute the probability of getting that many heads in n fair coin flips. If that probability is below the chosen significance level α, we reject the null hypothesis.
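In code, the sign test is just a binomial calculation. A minimal sketch (not from the notes; it uses SciPy's binomtest, available in recent versions):

    from scipy.stats import binomtest

    def sign_test(diffs):
        """Two-sided sign test for paired differences; zero differences
        are dropped, as is conventional."""
        nonzero = [d for d in diffs if d != 0]
        plus = sum(d > 0 for d in nonzero)
        return binomtest(plus, n=len(nonzero), p=0.5).pvalue

Applied to the schizophrenia differences described next (14 '+' out of 15), sign_test returns a p-value just under 0.001.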

Schizophrenia study

Consider the schizophrenia study, discussed in section 11.5.2. We wish to test, at the 0.05 level, whether the schizophrenic twins have different brain measurements than the unaffected twins, against the null hypothesis that the measurements are really drawn from the same distribution. We focus now on the differences. Instead of looking at their values (which depend upon the underlying distribution) we look only at their signs, as indicated in Table 12.5. The idea is straightforward: under the null hypothesis, the difference has an equal chance of being positive or negative, so the number of positive signs should be like the number of heads in fair coin flips. In this case, we have 14 '+' out of 15, which is obviously highly unlikely.


Table 12.5: Data from the Suddath [SCT+90] schizophrenia experiment. Hippocampus volumes in cm³.

  Unaffected   Schizophrenic   Difference   Sign
     1.94          1.27           0.67        +
     1.44          1.63          -0.19        -
     1.56          1.47           0.09        +
     1.58          1.39           0.19        +
     2.06          1.93           0.13        +
     1.66          1.26           0.40        +
     1.75          1.71           0.04        +
     1.77          1.67           0.10        +
     1.78          1.28           0.50        +
     1.92          1.85           0.07        +
     1.25          1.02           0.23        +
     1.93          1.34           0.59        +
     2.04          2.02           0.02        +
     1.62          1.59           0.03        +
     2.08          1.97           0.11        +

Formally, we analyse this with a single-sample Z test, comparing the proportion of '+' to 0.5. We have n = 15 trials, so the SE is $\sqrt{0.5 \times 0.5/15} = 0.129$, while the observed proportion of '+' is 14/15 = 0.933. The Z statistic is

$$Z = \frac{0.933 - 0.5}{0.129} = 3.36,$$

which is far above the cutoff level 1.96 for rejecting the null hypothesis at the 0.05 level. In fact, the probability of getting such an extreme result (the so-called p-value) is less than 0.001.
What if we had ignored the pairing and applied the median test instead? We find that the median is 1.665. There are 9 unaffected and 6 schizophrenic twins with measures above the median, so $p_u = 0.6$ and $p_s = 0.4$. The difference is small, as the standard error is $SE = \sqrt{0.5 \times 0.5\,(1/15 + 1/15)} = 0.18$. We compute Z = 1.10, which does not allow us to reject the null hypothesis at any level. The rank-sum test also turns out to be too weak to reject the null hypothesis.

12.4.2 Breastfeeding study

We adopt an example from [vBFHL04], discussing a study by Brown and Hurlock on the effectiveness of three different methods of preparing breasts for breastfeeding. Each mother treated one breast and left the other untreated, as a control. The two breasts were rated daily for level of discomfort,
on a scale of 1 to 4. Each method was used by 19 mothers, and the average differences between the treated and untreated breast for the 19 mothers who used the "toughening" treatment were: −0.525, 0.172, −0.577, 0.200, 0.040, −0.143, 0.043, 0.010, 0.000, −0.522, 0.007, −0.122, −0.040, 0.000, −0.100, 0.050, −0.575, 0.031, −0.060.
The original study performed a one-tailed t test at the 0.05 level of the null hypothesis that the true difference between treated and untreated breasts was 0: the cutoff is then −1.73 (so we reject the null hypothesis for any value of T below −1.73). We have $\bar{x} = -0.11$ and $s_x = 0.25$. We compute then

$$T = \frac{\bar{x} - 0}{s_x/\sqrt{n}} = -1.95,$$

leading us to reject the null. We should, however, be suspicious of this marginal result, which depends upon the choice of a one-tailed test: for a two-tailed test the cutoff would have been −2.10.
In addition, we note that the assumption of normality is drastically violated, as we see from the histogram of the observed values in Figure 12.4. To apply the sign test, we see that there are 8 positive and 9 negative values, which is as close to an even split as we could have with 17 nonzero differences, and so we conclude that the sign test gives no evidence of a difference between the treated and untreated breasts. (Formally, we could compute $\hat{p} = 8/17 = 0.47$ and $Z = (0.47 - 0.50)/(0.5/\sqrt{17}) = -0.247$, which is nowhere near the cutoff of 1.96 for the Z test at the 0.05 level.)

Figure 12.4: Histogram of the average difference between treated and untreated breasts, for 19 subjects.


Historical note: This study formed part of the rediscovery of breastfeeding in the 1970s, after a generation of its being disparaged by the medical community. The overall conclusion was that the traditional treatments were ineffective, the "toughening" marginally so. The emphasis on breastfeeding being fundamentally uncomfortable reflects the discomfort that the medical research community felt about nursing at the time.

12.4.3 Wilcoxon signed-rank test

As with the two-sample test in section 12.3.2, we can strengthen the paired-sample test by considering not just which number is bigger, but the relative ranks. The idea of the Wilcoxon (or signed-rank) test is that we might have about equal numbers of positive and negative values, but if the positive values are much bigger than the negative (or vice versa) that will still be evidence that the distributions are different. For instance, in the breastfeeding study, the t test produced a marginally significant result because several of the very large values are all negative.
The mechanics of the test are the same as for the two-sample rank-sum test, only the two samples are not the xs and the ys, but the positive and negative differences. In a first step, we rank the differences by their absolute values; then we carry out a rank-sum test on the positive and negative differences. To apply the Wilcoxon test here, we first drop the two 0 values, and then rank the remaining 17 numbers by their absolute values:

  Diff    0.007   0.010   0.031   0.040  -0.040   0.043   0.050  -0.060  -0.100
  Rank      1       2       3      4.5     4.5      6       7       8       9

  Diff   -0.122  -0.143   0.172   0.200  -0.522  -0.525  -0.575  -0.577
  Rank     10      11      12      13      14      15      16      17

The ranks corresponding to positive values are 1, 2, 3, 4.5, 6, 7, 12, 13, which sum to $R_+ = 48.5$, while the negative values have ranks 4.5, 8, 9, 10, 11, 14, 15, 16, 17, summing to $R_- = 104.5$. The Wilcoxon statistic is defined to be $T = \min\{R_+, R_-\} = 48.5$. We look in the appropriate table (given in Figure 12.5). We see that in order for the difference to be significant at the 0.05 level, we would need to have $T \le 34$. Consequently, we still conclude that the effect of the treatment is not statistically significant.
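SciPy implements the same procedure. A sketch, not from the notes: wilcoxon drops the zero differences itself and, for a two-sided test, reports min(R+, R−) as its statistic.

    from scipy.stats import wilcoxon

    diffs = [-0.525, 0.172, -0.577, 0.200, 0.040, -0.143, 0.043, 0.010,
             0.000, -0.522, 0.007, -0.122, -0.040, 0.000, -0.100, 0.050,
             -0.575, 0.031, -0.060]

    res = wilcoxon(diffs)             # statistic is 48.5, matching the text
    print(res.statistic, res.pvalue)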


Figure 12.5: Critical values for Wilcoxon test.

12.4.4 The logic of non-parametric tests

One advantage of the non-parametric test is that it avoids the assumption that the means are sufficiently normal to fit the t distribution. In fact, though, we presented reasonable evidence in section 11.5.3 that the sample means are quite close to normal. Is there any other reason to use a non-parametric test? Yes. The non-parametric test also avoids logically suspect assumptions which underlie the parametric test.
As in section 11.6, there seems to be a logical hole in the application of the t test in the schizophrenia study of sections 11.5.2 and 11.5.3. Suppose we accept that the average of a random sample of 15 differences like the ones we've observed should have an approximately normal distribution, with mean equal to the population mean and variance equal to the population variance. Do we actually have such a random sample? The answer is, almost certainly, no. There is no population that these fifteen twin pairs were randomly sampled from. There is no register of all schizophrenia-discordant identical twin pairs from which these fifteen could have been randomly sampled.
We also cannot apply the logic of section 11.6, which would say that everyone has a "schizophrenic" and an "unaffected" measurement, and that we randomly decide whether we observe the one or the other. Is there some logic that we can apply? Yes. Imagine that some prankster had tampered with our data,
randomly swapping the labels on the data, flipping a coin to decide which
measurement would be called schizophrenic and which would be called
unaffected. In other words, she randomly flips the signs on the differences
between positive and negative. Our null hypothesis would say that we should
not be able to recognise this tampering, because the two measurements are
indistinguishable overall. What the sign test and the Wilcoxon test tell us
is that our actual observations are not consistent with this hypothesis.
Note that the null hypothesis is now not tied to any probability model
of the source of the actual data. Rather, it expresses our intuitive notion
that the difference is due to chance in terms of a randomisation that
could be performed. We have 15 cards, one for every twin pair in the study.
One side has the measurement for the schizophrenic twin, the other has
the measurement for the unaffected twin, but we have not indicated which
is which. The null hypothesis says: The schizophrenic and the unaffected
measurements have the same distribution, which means that we are just as
likely to have either side of the card be the schizophrenic measurement. The
differences we observe should be just like the differences we would observe if
we labelled the sides schizophrenic and unaffected purely by flipping a coin.
The sign test shows that this is not the case, forcing us to reject the null hypothesis. Statistical analysis depends on seeing the data we observe in
the context of the entire array of equivalent observations that we could have
made.

Lecture 13

Non-Parametric Tests Part II, Power of Tests

13.1 Kolmogorov-Smirnov Test

13.1.1 Comparing a single sample to a distribution

In Figure 11.2 we compared some data to a normal distribution: in Figure 11.2(c) these were 15 differences in brain measurements between schizophrenic and unaffected subjects, while in Figure 11.2(b) they were simulated data, obtained by randomly resampling and averaging the original 15. While these Q-Q plots seem to tell us something, it is hard to come to a definitive conclusion from them. Is the fit to the line close enough or not? It is this question that the Kolmogorov-Smirnov test is supposed to answer.
The basic setup of the Kolmogorov-Smirnov test is that we have some observations $X_1, X_2, \ldots, X_n$ which we think may have come from a given population (probability distribution) P. You can think of P as a box with tickets in it, the numbers on the tickets representing the values that you might sample. We wish to test the null hypothesis

H0: The samples came from P

against the general alternative that the samples did not come from P. To do this, we need to create a test statistic whose distribution we know, and which will be big when the data are far away from a typical sample from the population P.
You already know one approach to this problem, using the χ² test. To do this, we split up the possible values into K ranges, and compare the
number of observations in each range with the number that would have been predicted. For instance, suppose we have 100 samples which we think should have come from a standard normal distribution. The data are given in Table 13.1. The first thing we might do is look at the mean and variance of the sample: in this case, the mean is 0.06 and the sample variance 1.06, which seems plausible. (A z test for the mean would not reject the null hypothesis of 0 mean, and the test for variance, which you have not learned, would be satisfied that the variance is 1.) We might notice that the largest value is 3.08, and the minimum value is −3.68, which seem awfully large. We have to be careful, though, about scanning the data first, and then deciding what to test after the fact: this approach, sometimes called "data snooping", can easily mislead, since every collection of data is likely to have something that seems wrong with it, purely by chance. (This is the problem of multiple testing, which we discuss further in section 14.3.)
  -0.16   1.30  -0.17   0.13  -1.97  -1.52   0.29   0.65  -0.88  -0.23
  -0.68  -0.13   1.29  -1.94  -0.37  -0.06   0.58  -0.26  -0.03  -1.16
  -0.32   0.80   0.47   0.78   3.08  -1.02   0.02  -0.17   0.56   0.22
  -0.85  -0.75  -1.23   0.19  -0.40   1.06   2.18  -1.53  -3.68  -1.68
   0.89   0.28   0.21  -0.12   0.80   0.60  -0.04  -1.69   2.40   0.50
  -2.28  -1.00  -0.04  -0.19   0.01   1.15  -0.13  -1.60   0.62  -0.35
   0.63   0.14   0.07   0.76   1.32   1.92  -0.79   0.09   0.52  -0.35
   0.41  -1.38  -0.08  -1.48  -0.47  -0.06  -1.28  -1.11  -1.25  -0.33
   0.15  -0.04   0.32  -0.01   2.29  -0.19  -1.41   0.30   0.85  -0.24
   0.74  -0.25  -0.17   0.20  -0.26   0.67  -0.23   0.71  -0.09   0.25

Table 13.1: Data possibly from a standard normal distribution.


The X 2 statistic for this table is 7.90. This does not exceed the threshold
of 9.49 for rejecting the null hypothesis at the 0.05 level.
There are two key problems with this approach:
(1). We have thrown away some of the information that we had to begin
with, by forcing the data into discrete categories. Thus, the power
to reject the null hypothesis is less than it could have been. The
bottom category, for instance, does not distinguish between 2 and
the actually observed extremely low observation 3.68.
(2). We have to draw arbitrary boundaries between categories, and we may
question whether the result of our significance test would have come


Figure 13.1: QQ plot of data from Table 13.1.

out differently if we had drawn the boundaries otherwise.
An alternative approach is to work directly from the intuition of Figure 13.1: our test statistic is effectively the maximum distance of the cumulative probability from the diagonal line. One important concept is the distribution function (or cumulative distribution function, or cdf), written F(x). This is one way of describing a probability distribution: for every x, we define F(x) to be the probability of all values ≤ x. Thus, if we are looking at the probability distribution of a single fair die roll, we get the cdf shown in Figure 13.2(a); the cdf of a normal distribution is in Figure 13.2(b). Note that the cdf always starts at 0, ends at 1, and has jumps when the distribution is discrete.
The empirical distribution function $F_{obs}(x)$ of some data is simply the function that tells you what fraction of the data are below x. We show this for our (possibly) normal data in Figure 13.3(a). The expected distribution function $F_{exp}(x)$ is the cdf predicted by the null hypothesis, in this case the standard normal cdf. For each number x, we let $F_{exp}(x)$ be the fraction of the numbers in our target population that are below x, and $F_{obs}(x)$ the observed fraction below x. We then compute the Kolmogorov-Smirnov statistic, which is simply the maximum value of $|F_{exp}(x) - F_{obs}(x)|$. The practical approach is as follows: first, order the sample. We order our normal sample in Table 13.3(a) and give the corresponding normal probabilities (from the normal table) in Table 13.3(b).
Figure 13.2: Examples of cumulative distribution functions: (a) cdf for a fair die; (b) cdf for a normal distribution.


  Category    Lower    Upper    Observed    Expected
      1         -∞      -1.5        9          6.7
      2       -1.5      -0.5       15         24.2
      3       -0.5       0.5       49         38.3
      4        0.5       1.5       22         24.2
      5        1.5        ∞         5          6.7

Table 13.2: χ² table for data from Table 13.1, testing its fit to a standard normal distribution.
These probabilities need to be compared to the corresponding probabilities for the sample, which are just 0.01, 0.02, ..., 1.00. This procedure is represented graphically in Figure 13.3.
The Kolmogorov-Smirnov statistic is the maximum difference, shown in Table 13.3(c). This is $D_n = 0.092$. For a test at the 0.05 significance level, we compare this to the critical value, which is

$$D_{crit} = \frac{1.36}{\sqrt{n}}.$$

In this case, with n = 100, we get $D_{crit} = 0.136$. Since our observed $D_n$ is smaller, we do not reject the null hypothesis.
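The whole procedure takes a few lines in Python. The sketch below is not from the notes, and it generates a stand-in sample rather than retyping Table 13.1; note that a careful computation compares the expected cdf with the empirical cdf both just before and just after each data point.

    import numpy as np
    from scipy import stats

    data = np.random.default_rng(2).normal(size=100)  # stand-in for Table 13.1

    x = np.sort(data)
    n = len(x)
    f_exp = stats.norm.cdf(x)                       # expected cdf at each point
    d_hi = np.abs(np.arange(1, n + 1) / n - f_exp)  # empirical cdf from above
    d_lo = np.abs(np.arange(0, n) / n - f_exp)      # ... and from below
    D = max(d_hi.max(), d_lo.max())
    print(D, 1.36 / np.sqrt(n))                     # statistic vs. critical value

    print(stats.kstest(data, 'norm'))               # the same test in one call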
Of course, if you wish to compare the data to the normal distribution with mean μ and variance σ², the easiest thing to do is to standardise: the hypothesis that the $x_i$ come from the $N(\mu, \sigma^2)$ distribution is equivalent to saying that the $(x_i - \mu)/\sigma$ come from the standard N(0, 1) distribution. (Of course, if μ and σ are estimated from the data, we get a Student t distribution in place of the standard normal.)
One final point: in point of fact, the data are unlikely to have come from a normal distribution. One way of seeing this is to look at the largest (negative) data point, which is −3.68. The probability of a sample from a standard normal distribution being at least this large is about 0.0002. In 100 observations, the probability of observing such a large value at least once is no more than 100 times as big, or 0.02. We could have a goodness-of-fit test based on the largest observation, which would reject the hypothesis that this sample came from a normal distribution. The Kolmogorov-Smirnov test, on the other hand, is indifferent to the size of the largest observation.
Figure 13.3: Computing the Kolmogorov-Smirnov statistic for the data of Table 13.1. (a) $F_{obs}$ shown in black circles, and $F_{exp}$ (the normal distribution function) in red; the green segments show the difference between the two distribution functions. (b) Plot of $|F_{obs} - F_{exp}|$.


(a) Ordered form of the data from Table 13.1 (reading across rows)

  -3.68 -2.28 -1.97 -1.94 -1.69 -1.68 -1.60 -1.53 -1.52 -1.48
  -1.41 -1.38 -1.28 -1.25 -1.23 -1.16 -1.11 -1.02 -1.00 -0.88
  -0.85 -0.79 -0.75 -0.68 -0.47 -0.40 -0.37 -0.35 -0.35 -0.33
  -0.32 -0.26 -0.26 -0.25 -0.24 -0.23 -0.23 -0.19 -0.19 -0.17
  -0.17 -0.17 -0.16 -0.13 -0.13 -0.12 -0.09 -0.08 -0.06 -0.06
  -0.04 -0.04 -0.04 -0.03 -0.01  0.01  0.02  0.07  0.09  0.13
   0.14  0.15  0.19  0.20  0.21  0.22  0.25  0.28  0.29  0.30
   0.32  0.41  0.47  0.50  0.52  0.56  0.58  0.60  0.62  0.63
   0.65  0.67  0.71  0.74  0.76  0.78  0.80  0.80  0.85  0.89
   1.06  1.15  1.29  1.30  1.32  1.92  2.18  2.29  2.40  3.08

(b) Normal probabilities corresponding to the entries of (a)

  0.000 0.011 0.024 0.026 0.045 0.047 0.055 0.064 0.064 0.070
  0.080 0.084 0.101 0.107 0.110 0.123 0.133 0.154 0.158 0.189
  0.198 0.215 0.226 0.249 0.321 0.343 0.356 0.362 0.363 0.369
  0.375 0.396 0.399 0.400 0.407 0.409 0.410 0.423 0.425 0.432
  0.432 0.434 0.437 0.447 0.449 0.453 0.464 0.468 0.476 0.477
  0.484 0.484 0.485 0.490 0.496 0.505 0.508 0.526 0.535 0.553
  0.557 0.560 0.577 0.577 0.582 0.588 0.597 0.610 0.614 0.617
  0.627 0.658 0.680 0.692 0.698 0.711 0.720 0.727 0.732 0.735
  0.743 0.748 0.761 0.771 0.777 0.783 0.788 0.789 0.803 0.812
  0.854 0.874 0.902 0.903 0.907 0.973 0.985 0.989 0.992 0.999
(c) Difference between entry #i in (b) and i/100; the largest value, 0.092, is the Kolmogorov-Smirnov statistic

  0.010 0.009 0.006 0.014 0.004 0.014 0.015 0.017 0.026 0.031
  0.031 0.036 0.030 0.034 0.041 0.037 0.037 0.026 0.031 0.011
  0.012 0.005 0.003 0.008 0.069 0.085 0.086 0.083 0.073 0.071
  0.064 0.077 0.067 0.061 0.055 0.049 0.039 0.045 0.035 0.033
  0.023 0.013 0.006 0.008 0.002 0.008 0.006 0.012 0.014 0.024
  0.026 0.036 0.046 0.052 0.054 0.056 0.062 0.052 0.054 0.048
  0.054 0.060 0.055 0.061 0.067 0.073 0.071 0.070 0.076 0.082
  0.084 0.061 0.049 0.049 0.052 0.048 0.051 0.054 0.058 0.064
  0.068 0.071 0.069 0.070 0.074 0.078 0.082 0.092 0.088 0.087
  0.055 0.045 0.029 0.037 0.043 0.013 0.015 0.009 0.002 0.001

Table 13.3: Computing the Kolmogorov-Smirnov statistic for testing the fit
of data to the standard normal distribution.

There are many different ways to test a complicated hypothesis, such as the equality of two distributions, because there are so many different ways for distributions to differ. We need to choose a test that is sensitive to the differences that we expect (or fear) to find between reality and our null hypothesis. We already discussed this in the context of one- and two-tailed t and Z tests. If the null hypothesis says $\mu = \mu_0$, and the alternative is that $\mu > \mu_0$, then we can increase the power of our test against this alternative by taking a one-sided alternative. On the other hand, if the reality is that $\mu < \mu_0$, then the test will have essentially no power at all. Similarly, if we think the reality is that the distributions of X and Y differ on some scattered intervals, we might opt for a χ² test.

13.1.2 Comparing two samples: Continuous distributions

Suppose an archaeologist has measured the distances of various settlements from an important cult site at two different periods: let us say, $X_1, \ldots, X_{10}$ are the distances (in km) for the early settlements, and $Y_1, \ldots, Y_8$ are the distances for the late settlements. She is interested to know whether there has been a change in the settlement pattern, and believes that these settlements represent a random sample from a large number of such settlements in these two periods. The measurements are:

X: 1.2, 1.4, 1.9, 3.7, 4.4, 4.8, 9.7, 17.3, 21.1, 28.4
Y: 5.6, 6.5, 6.6, 6.9, 9.2, 10.4, 10.6, 19.3.

We put these values in order to compute the cdfs (which we also plot in Figure 13.4):

        1.2   1.4   1.9   3.7   4.4   4.8   5.6   6.5   6.6
  Fx    0.1   0.2   0.3   0.4   0.5   0.6   0.6   0.6   0.6
  Fy    0.0   0.0   0.0   0.0   0.0   0.0   0.1   0.2   0.4

        6.9   9.2   9.7  10.4  10.6  17.3  19.3  21.1  28.4
  Fx    0.6   0.6   0.7   0.7   0.7   0.8   0.8   0.9   1.0
  Fy    0.5   0.6   0.6   0.8   0.9   0.9   1.0   1.0   1.0

Table 13.4: Archaeology (imaginary) data, tabulation of cdfs.

Figure 13.4: Cumulative distribution functions computed from the archaeology data, as tabulated in Table 13.4.

The Kolmogorov-Smirnov statistic is D = 0.6. We estimate the critical value from this formula:

Kolmogorov-Smirnov critical value:
$$D_{crit,0.05} = 1.36\sqrt{\frac{1}{n_x} + \frac{1}{n_y}} = 1.36\sqrt{\frac{n_x + n_y}{n_x n_y}}.$$

With $n_x = 10$ and $n_y = 8$, this yields $D_{crit,0.05} = 0.645$, so that we just miss rejecting the null hypothesis. (As it happens, the approximate critical value is too high here. The true critical value is exactly 0.6. A table of exact critical values is available at http://www.soest.hawaii.edu/wessel/courses/gg313/Critical_KS.pdf.)
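SciPy's two-sample version is a one-liner; recent versions use the exact null distribution for samples this small, rather than the 1.36 approximation. (A sketch, not from the notes.)

    from scipy import stats

    x = [1.2, 1.4, 1.9, 3.7, 4.4, 4.8, 9.7, 17.3, 21.1, 28.4]
    y = [5.6, 6.5, 6.6, 6.9, 9.2, 10.4, 10.6, 19.3]

    print(stats.ks_2samp(x, y))  # statistic = 0.6, as computed by hand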

13.1.3 Comparing two samples: Discrete samples

The Kolmogorov-Smirnov test is designed to deal with samples from a continuous distribution, without having to make arbitrary partitions. It is not appropriate for samples from discrete distributions, or when the data are presented in discrete categories. Nonetheless, it is often used for such applications. We present here one example.
One method that anthropologists use to study the health and lives of ancient populations is to estimate the age at death from skeletons found in gravesites, and compare the distributions of ages. The paper [Lov71] compares the age distribution of remains found at two different sites in Virginia, called Clarksville and Tollifero. The data are tabulated in Table 13.5.
Table 13.5: Ages at death for skeletons found at two Virginia sites, as described in [Lov71].

  Age range    Clarksville    Tollifero
    0-3             6             13
    4-6             0              6
    7-12            2              4
    13-17           1              9
    18-20           0              2
    21-35          12             29
    35-45          15              8
    45-55           1              8
    55+             0              0
  Total            37             79

The authors used the Kolmogorov-Smirnov test to compare two empirical distributions, in exactly the same way as one compares an empirical distribution with a theoretical distribution. We compute the cumulative distributions of the two samples, just as before (sometimes these are denoted $F_1$ and $F_2$), and we look for the maximum difference between them. The calculations are shown in Table 13.6. We see that the maximum difference is at age class 21-35, with D = 0.23.
It remains to interpret the result. We are testing the null hypothesis that the Clarksville and Tollifero ages were sampled from the same age distribution, against the alternative that they were sampled from different distributions. With $n_x = 37$ and $n_y = 79$, we get $D_{crit,0.05} = 0.27$. Since the observed D is smaller than this, we do not reject the null hypothesis.

Table 13.6: Calculations for the Kolmogorov-Smirnov test on the data from Table 13.5.

               Cumulative Distribution
  Age range    Clarksville    Tollifero    Difference
      3           0.16           0.16         0
      6           0.16           0.24         0.08
     12           0.22           0.29         0.07
     17           0.24           0.41         0.17
     20           0.24           0.43         0.19
     35           0.57           0.80         0.23
     45           0.97           0.90         0.07
     55           1              1            0

This is a common application, but it has to be said that it doesn't entirely make sense. What we have done is effectively to take the maximum difference not over all possible points in the distribution, but only at eight specially chosen points. This inevitably makes the maximum smaller. The result is to make it harder to reject the null hypothesis, so the nominal significance level overstates the true probability of a type I error. We should compensate by lowering the critical value.

13.1.4 Comparing tests to compare distributions

Note that the Kolmogorov-Smirnov test differs from χ² in another important way: the chi-squared statistic doesn't care what order the categories come in, while order is crucial to the Kolmogorov-Smirnov statistic. This may be good or bad, depending on the kinds of deviation from the null hypothesis you think are important. (Thus, in our example here, Kolmogorov-Smirnov would record a larger deviation from the null hypothesis if the Clarksville site had lower mortality at all juvenile age classes than if it had lower mortality in age classes 0-3 and 35-55, and higher mortality elsewhere.) There are alternative tests

13.2 Power of a test

What is a hypothesis test? We start with a null hypothesis H0, which for our purposes is a distribution that the observed data may have come from, and a level α, and we wish to determine whether the data could have had at least a chance α of being observed if H0 were true. A test of H0 at level α is a critical region R, a set of possible observations for which you would reject H0, which must satisfy $P_0\{\text{data in } R\} = \alpha$. That is, if H0 is true, the probability is just α that we will reject the null hypothesis. We reject H0 if the observations are such as would be unlikely if H0 were true.
How about this for a simple hypothesis test: have your computer's random number generator pick a random number Y uniformly between 0 and 1. No matter what the data are, you reject the null hypothesis if Y < α. This satisfies the rule: the probability of rejecting the null hypothesis is always α. Unfortunately, the probability of rejecting the null hypothesis is also only α if the alternative (any alternative) is true.
Of course, a good test isn't just any procedure that satisfies the probability rule. We also want it to have a higher probability of rejecting the null hypothesis when the null is false. This probability is called the power of the test, and is customarily denoted 1 − β. Here β is the probability of making a Type II Error: not rejecting the null hypothesis, even though it is false. All things being equal, a test with higher power is preferred, and if the power is too low the experiment is not worth doing at all. This is why power computations ought to take place at a very early stage of planning an experiment.
                                  Truth
  Decision             H0 True                  H0 False
  Don't Reject H0      Correct                  Type II Error
                       (Prob. = 1 − α)          (Prob. = β)
  Reject H0            Type I Error             Correct
                       (Prob. = level = α)      (Prob. = Power = 1 − β)

13.2.1 Computing power

Power depends on the alternative: it is always the power to reject the null, given that a specific alternative is actually the case. Consider, for instance, the following idealised experiment: we make measurements $X_1, \ldots, X_{100}$ of a quantity that is normally distributed with unknown mean μ and known variance 1. The null hypothesis is that $\mu = \mu_0 = 0$, and we will test it with a two-sided Z test at the 0.05 level, against a simple alternative $\mu = \mu_{alt}$. That is, we assume that the data were actually generated with mean $\mu_{alt}$.
What is the power? Once we have the data, we compute $\bar{x}$, and then $z = \bar{x}/0.1$. We reject the null hypothesis if z > 1.96 or z < −1.96. The power is the probability that this happens. What is this probability? It is the same

as the probability that $\bar{X} > 0.196$ or $\bar{X} < -0.196$. We know that $\bar{X}$ has the $N(\mu, 0.1)$ distribution. If we standardise this, we see that $Z' := 10(\bar{X} - \mu)$ is standard normal. A very elementary probability computation shows us that

$$P(\bar{X} < -0.196) = P(Z' < -1.96 - 10\mu) = \Phi(-1.96 - 10\mu),$$
$$P(\bar{X} > 0.196) = P(Z' > 1.96 - 10\mu) = 1 - \Phi(1.96 - 10\mu).$$

Thus

$$\text{Power} = \Phi(-1.96 - 10\mu_{alt}) + 1 - \Phi(1.96 - 10\mu_{alt}), \qquad (13.1)$$

where Φ is the probability we look up on the normal table.


More generally, suppose we have n measures of a quantity with unknown distribution, unknown mean μ, but known variance σ², and we wish to use a two-tailed Z test for the null hypothesis $\mu = \mu_0$ against the alternative $\mu = \mu_{alt}$. The same computation as above shows that

$$\text{Power} = \Phi\left(-z + \frac{\sqrt{n}}{\sigma}(\mu_0 - \mu_{alt})\right) + 1 - \Phi\left(z + \frac{\sqrt{n}}{\sigma}(\mu_0 - \mu_{alt})\right), \qquad (13.2)$$

where z is the two-sided critical value ($z = 1.96$ for a test at the 0.05 level).

To put this differently, the power is the probability of getting a decisive result (rejecting the null hypothesis) if the hidden real mean is $\mu_{alt}$.
You don't need to memorise this formula, but it is worth paying attention to the qualitative behaviour of the power. Figure 13.5 shows the power for a range of alternatives, numbers of samples, and levels of test. Note that increasing the number of samples increases the power, while lowering the level of the test decreases the power. Note, too, that the power approaches the level of the test as the alternative $\mu_{alt}$ approaches the null $\mu_0$.
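Formula (13.2) is simple to evaluate numerically. The following sketch (not from the notes) reproduces the curves of Figure 13.5 pointwise; the value 0.2 passed in the example is an arbitrary choice of alternative.

    from math import sqrt
    from scipy.stats import norm

    def power_z(mu0, mu_alt, sigma, n, alpha=0.05):
        """Power of the two-sided Z test, formula (13.2)."""
        z = norm.ppf(1 - alpha / 2)
        shift = sqrt(n) / sigma * (mu0 - mu_alt)
        return norm.cdf(-z + shift) + 1 - norm.cdf(z + shift)

    # The idealised experiment above (n = 100, sigma = 1), at mu_alt = 0.2:
    print(power_z(0, 0.2, 1, 100))   # about 0.52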

13.2.2 Computing trial sizes

Suppose now that you are planning an experiment to test a new blood-pressure medication. Medication A has been found to lower blood pressure by 10 mmHg, with an SD of 8 mmHg, and we want to test it against the new medication B. We will recruit some number of subjects and randomly divide them into control and experiment groups; the control group receives A, the experiment group receives B. At the end we will perform a Z test at the 0.05 level to decide whether B really lowered the subjects' blood pressure more than A; that is, we will let $\mu_A$ and $\mu_B$ be the true mean effects of drugs A and B respectively on blood pressure, and test the null hypothesis $\mu_A = \mu_B$ against the two-tailed alternative $\mu_A \neq \mu_B$, or the one-tailed alternative $\mu_A < \mu_B$. Following the discussion of section 11.6, we can act as though these were simple random samples of size n/2 from one population receiving A and another receiving B, with the test statistic

$$Z = \frac{\bar{b} - \bar{a}}{\sigma\sqrt{\frac{1}{n/2} + \frac{1}{n/2}}}.$$

Figure 13.5: Power for the Z test with different gaps between the null and alternative hypotheses for the mean, for given sizes of study (n) and significance levels α. (Curves are shown for n = 100 and n = 1000, at α = .05 and α = .01.)
Is it worth doing this experiment? Given our resources, there is a limited number of subjects we can afford to recruit, and that determines the power of the test we do at the end. Our estimate of the power also depends, of course, on what we think the difference between the two population means

is. We see from the formula (13.2) that if the gap between A and B
becomes half as big, we need 4 times as many subjects to keep the same
power. If the power is only 0.2, say, then it is hardly worth starting in on
the experiment, since the result we get is unlikely to be conclusive.
Figure 13.6 shows the power for experiments where the true experimental
effect (the difference between $\mu_A$ and $\mu_B$) is 10 mmHg, 5 mmHg, and 1
mmHg, performing one-tailed and two-tailed significance tests at the 0.05
level.

Notice that the one-tailed test is always more powerful when $\mu_B - \mu_A$
is on the right side ($\mu_B$ bigger than $\mu_A$, so B is superior), but essentially 0
when B is inferior; the power of the two-tailed test is symmetric. If we are
interested to discover only evidence that B is superior, then the one-tailed
test obviously makes more sense.
Suppose now that we have sufficient funding to enroll 50 subjects in our
study, and we think the study would be worth doing only if we have at least
an 80% chance of finding a significant positive result. In that case, we see
from Figure 13.6(b) that we should drop the project unless we expect the
difference in average effects to be at least 5 mmHg. On the other hand, if
we can afford 200 subjects, we can justify hunting for an effect only half
as big, namely 2.5 mmHg. With 1000 subjects we have a good chance of
detecting a difference between the drugs as small as 1 mmHg. On the other
hand, with only 10 subjects we would be unlikely to find the difference to
be statistically significant, even if the true difference is quite large.
Important lesson: The difference between a statistically
significant result and a non-significant result may be just the size
of the sample. Even an insignificant difference in the usual sense
(i.e., tiny) has a high probability (power) of producing a
statistically significant result if the sample size is large enough.
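
Trial-size calculations like these can be automated by inverting the power computation. The sketch below (Python; the function name and the choice of a one-tailed test with an 80% target power are our own assumptions, so its output only roughly matches the values read off Figure 13.6) solves for the total number of subjects needed to detect a given mean difference:

    from math import ceil
    from scipy.stats import norm

    def total_subjects(delta, sigma, alpha=0.05, power=0.80):
        """Rough total trial size (two equal groups) for a one-tailed
        two-sample Z test to detect a mean difference delta, when each
        group has known SD sigma.  Inverts the power computation for
        Z = (b_bar - a_bar) / (sigma * sqrt(1/(n/2) + 1/(n/2)))."""
        z_alpha = norm.ppf(1 - alpha)   # one-tailed critical value, 1.645
        z_power = norm.ppf(power)       # 0.84 for 80% power
        n = (2 * sigma * (z_alpha + z_power) / delta) ** 2
        return ceil(n)

    for delta in (10, 5, 2.5, 1):
        print(delta, total_subjects(delta, 8))   # 16, 64, 254, 1583

Note how halving the detectable difference quadruples the required number of subjects, as remarked above.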

13.2.3 Power and non-parametric tests

The cost of dispensing with questionable assumptions is a reduction in the power
to reject a false null hypothesis in cases where the assumptions do hold. To
see this, we consider one very basic example: we observe 10 samples from each
of two normal distributions with variance 1, whose means are unknown; we do
not assume the variance known either. We wish to test the null hypothesis $H_0$
that the two means are equal, against the alternative hypothesis that they are not.
We perform our test at the 0.05 level.

[Figure 13.6: Power of the BP experiment, depending on number of subjects and true difference between the two means. Panel (a): power as a function of trial size (10 to 10000 subjects, log scale), for one-tailed and two-tailed tests with $\mu_B - \mu_A$ = 10, 5, 1 mmHg. Panel (b): power as a function of $\mu_B - \mu_A$, for one-tailed and two-tailed tests with $n$ = 10, 50, 200, 1000.]

In Figure 13.7 we show a plot of the probability (estimated from simulations)
of rejecting the null hypothesis, as a function of the true (unobserved) difference in means.
We compare the two-tailed t test with the median test and the rank-sum
test. Notice that the median test performs far worse than the others, but
that the Mann-Whitney test is only slightly less powerful than the t test,
despite being far more general in its assumptions.

[Figure 13.7: Estimated power for three different tests (t test, Mann-Whitney test, median test), where the underlying distributions are normal with variance 1, as a function of the true difference in means. The test is based on ten samples from each distribution.]
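
Plots like Figure 13.7 are produced by simulation. A minimal sketch (Python; the replication count is arbitrary, and the median test of the figure is omitted, since it is not a standard scipy routine):

    import numpy as np
    from scipy.stats import ttest_ind, mannwhitneyu

    rng = np.random.default_rng(0)

    def estimated_power(mu, n=10, reps=2000, alpha=0.05):
        """Estimate by simulation the power of the two-sample t test and
        the Mann-Whitney test, for two normal samples of size n (variance 1)
        whose means differ by mu."""
        t_rej = mw_rej = 0
        for _ in range(reps):
            x = rng.normal(0.0, 1.0, n)
            y = rng.normal(mu, 1.0, n)
            if ttest_ind(x, y).pvalue < alpha:
                t_rej += 1
            if mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha:
                mw_rej += 1
        return t_rej / reps, mw_rej / reps

    print(estimated_power(1.5))   # t test slightly more powerful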


Lecture 14

ANOVA and the F test


14.1 Example: Breastfeeding and intelligence

A study was carried out of the relationship between duration of breastfeeding


and adult intelligence. The subjects were part of the Copenhagen Perinatal Cohort, 9125 individuals born at the Copenhagen University Hospital
between October 1959 and December 1961. As reported in [MMSR02], a
subset of the cohort was contacted for follow-up as adults (between ages 20
and 34). 983 subjects completed the Danish version of the Wechsler Adult
Intelligence Scale (WAIS).
Table 14.1 shows the average scores for 3 tests, for 5 classes of breastfeeding duration. Look first at the rows marked "Unadjusted mean". We
notice immediately that 1) longer breastfeeding was associated with higher
mean intelligence scores, but 2) the longest breastfeeding (more than 9
months) was associated with lower mean scores. We ask whether either or
both of these associations is reliably linked to the duration of breastfeeding,
or whether they could be due to chance.

14.2 Digression: Confounding and the adjusted means

Before we can distinguish between these two possibilities (causation or
chance association) we need to address another possibility: confounding.
Suppose, for instance, that mothers who smoke are less likely to breastfeed.
Since mothers' smoking is known to reduce the child's IQ scores, this would
produce higher IQ scores for the breastfed babies, irrespective of any causal
influence of the milk. The gold standard for eliminating confounding is
the double-blind random controlled experiment.

Table 14.1: Intelligence scores (WAIS) by duration of breastfeeding.

                               Duration of Breastfeeding (months)
  Test                       ≤1       2-3      4-6      7-9      >9
  N                          272      305      269      104      23
  Verbal IQ
    Unadjusted mean          98.2     101.7    104.0    108.2    102.3
    SD                       16.0     14.9     15.7     13.3     15.2
    Adjusted mean            99.7     102.3    102.7    105.7    103.0
  Performance IQ
    Unadjusted mean          98.5     100.5    101.8    106.3    102.6
    SD                       15.8     15.2     15.6     13.9     14.9
    Adjusted mean            99.1     100.6    101.3    105.1    104.4
  Full Scale IQ
    Unadjusted mean          98.1     101.3    103.3    108.2    102.8
    SD                       15.9     15.2     15.7     13.1     14.4
    Adjusted mean            99.4     101.7    102.3    106.0    104.0

Subjects are assigned at
random to receive the treatment or not, so that the only difference between
the two groups is whether they received the treatment. ("Double blind"
refers to the use of protocols that keep the subjects and the experimenters
from knowing who has received the treatment and who is a control. Without
blinding, the two groups would differ in their knowledge of having received
the treatment or not. We then might be unable to distinguish between effects
which are actually due to the treatment, and those that come from believing
you have received the treatment, in particular, the so-called "placebo"
effects.)
Of course, it is usually neither possible nor ethical to randomly assign
babies to different feeding regimens. What we have here is an observational
study. The next best solution is then to try to remove the confounding. In
this case, the researchers looked at all the factors that they might expect to
have an effect on adult intelligence (maternal smoking, maternal height,
parents' income, infant's birthweight, and so on) and adjusted the scores
for each category to compensate for a preponderance of characteristics that
might be expected to raise or lower IQ in that category, regardless of infant
nutrition. Thus, we see that the first and last categories both had their
means adjusted substantially upward, which must mean that the infants


who were nursed more than 9 months and those nursed less than 1 month
both had, on average, characteristics (whether their own or their mothers')
that would seem to predispose them to lower IQ. For the rest of this chapter
we will work with the adjusted means.

The statistical technique for doing this, called multiple regression, is
outside the scope of this course, but it is fairly straightforward, and most
textbooks on statistical methods that go beyond the most basic techniques
will describe it. Modern statistical software makes it particularly easy to
adjust data with multiple regression.

14.3 Multiple comparisons

Let us consider the adjusted Full Scale IQ scores. We wish to determine
whether the scores of individuals in the different breastfeeding classes might
have come from the same distribution, with the differences being solely due
to random variation.

14.3.1 Discretisation and the $\chi^2$ test

One approach would be to group the IQ scores into categories: low, medium,
and high, say. If these were categorical data (proportions of subjects in each
breastfeeding class who scored high and low, for instance) we could produce
an incidence table such as that in Table 14.2. (The data shown here are purely
invented, for illustrative purposes.) You have learned how to analyse such a
table to determine whether the vertical categories (IQ score) are independent
of the horizontal categories (duration of breastfeeding), using the $\chi^2$ test.

The problem with this approach is self-evident: we have thrown away
some of the information that we had to begin with, by forcing the data into
discrete categories. Thus, the power to reject the null hypothesis is less than
it could have been. Furthermore, we have to draw arbitrary boundaries between categories, and we may question whether the result of our significance
test would have come out differently if we had drawn the boundaries otherwise. (These are the same problems, you may recall, that led us to prefer
the Kolmogorov-Smirnov test over $\chi^2$. The $\chi^2$ test has the virtue of being
wonderfully general, but it is often not quite the best choice.)


Table 14.2: Hypothetical incidence table, if IQ data were categorised into
low, medium, and high.

                        Breastfeeding months
  Full IQ score    ≤1     2-3    4-6    7-9    >9
  high             100    115    120    40     9
  medium           72     85     69     35     9
  low              100    115    80     29     5

14.3.2 Multiple t tests

Alternatively, we can compare the mean IQ scores between two different
breastfeeding categories, using the t test (effectively, this is the z test, since
the number of degrees of freedom is so large, but we still need to pool the
variance, because one of the categories has a fairly small number of samples).
The large number of samples also allows us to be reasonably confident
in treating the mean as normally distributed, as discussed in section 7.4.
For instance, suppose we wish to compare the children breastfed less than
1 month with those breastfed more than 9 months. We want to test for
equality of means, at the 0.05 level.
We compute the pooled standard deviation by

$$s_p = \sqrt{\frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}} = \sqrt{\frac{271 \times 15.9^2 + 22 \times 14.4^2}{293}} = 15.8,$$

and the standard error by

$$SE_{diff} = s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}} = 3.43.$$

This yields a t statistic of

$$t = \frac{\bar{x} - \bar{y}}{SE_{diff}} = \frac{-4.6}{3.43} = -1.34.$$

Since the cutoff is $\pm 1.96$, we do not reject the null hypothesis.


If we repeat this test for all 10 pairs of categories, we get the results
shown in Table 14.3. We see that 4 out of the 10 pairwise comparisons
show statistically significant differences. But what story are these telling together? Remember that if the null hypothesis were true if the population

ANOVA

249

means were in fact all the same 1 out of 20 comparisons should yield a
statistically significant difference at the 0.05 level. How many statistically
significant differences do we need before we can reject the overall null hypothesis of identical population means? And what if none of the differences
were individually significant, but they all pointed in the same direction?
Table 14.3: Pairwise t statistics for comparing all 10 pairs of categories.
Those that exceed the significance threshold for the 0.05 level (|t| > 1.96)
are marked with an asterisk.

          2-3     4-6     7-9     >9
  ≤1      1.78    2.14*   3.78*   1.34
  2-3             0.47    2.58*   0.70
  4-6                     2.14*   0.50
  7-9                             0.65

14.4 The F test

We will see in lecture 15 how to treat the covariate (the duration of
breastfeeding) as a quantitative rather than categorical variable; that is,
how to measure the effect (if any) per unit time breastfeeding. Here we
concern ourselves only with the question: is there a nonrandom difference
between the mean intelligence in the different categories? As we discussed
in section 14.3, we want to reduce the question to a single test.

What should the test statistic look like? Fundamentally, a test statistic
should have two properties: 1) It measures significant deviation from the
null hypothesis; that is, one can recognise from the test statistic whether
the null hypothesis has been violated substantially. 2) We can compute the
distribution of the statistic under the null hypothesis.

14.4.1 General approach

Suppose we have independent samples from K different normal distributions,
with means $\mu_1, \dots, \mu_K$ and variance $\sigma^2$ (so the variances are all the same).
We call these K groups levels (or sometimes treatments). We have $n_k$
samples from distribution k, which we denote $X_{k1}, X_{k2}, \dots, X_{kn_k}$. The goal


is to determine from these samples whether the K treatment effects $\mu_k$
could be all equal.

We let $N = \sum_{k=1}^{K} n_k$ be the total number of observations. The average
of all the observations is $\bar{X}$, while the average within level k is

$$\bar{X}_k = \frac{1}{n_k}\sum_{j=1}^{n_k} X_{kj}.$$

The idea of analysis of variance (ANOVA) is that under the null hypothesis,
which says that the observations from different levels really all are coming
from the same distribution, the observations should be about as far (on
average) from their own level mean as they are from the overall mean of the
whole sample; but if the means are different, observations should be closer
to their level mean than they are to the overall mean.
We define the Between Groups Sum of Squares, or BSS, to be the
total square difference of the group means from the overall mean; and the
Error Sum of Squares, or ESS, to be the total squared difference of
the samples from the means of their own groups. (The term error refers
to a context in which the samples can all be thought of as measures of
the same quantity, and the variation among the measurements represents
random error; this piece is also called the Within-Group Sum of Squares.)
And then there is the Total Sum of Squares, or TSS, which is simply
the total square difference of the samples from the overall mean, if we treat
them as one sample.

$$BSS = \sum_{k=1}^{K} n_k(\bar{X}_k - \bar{X})^2;$$

$$ESS = \sum_{k=1}^{K}\sum_{j=1}^{n_k} (X_{kj} - \bar{X}_k)^2 = \sum_{k=1}^{K}(n_k - 1)s_k^2, \quad \text{where } s_k \text{ is the SD of observations in level } k;$$

$$TSS = \sum_{k,j} (X_{kj} - \bar{X})^2 = (N-1)s^2, \quad \text{where } s \text{ is the sample SD of all observations together.}$$

The initials BMS and EMS stand for Between Groups Mean Squares
and Error Mean Squares respectively.


The analysis of variance (ANOVA) is based on two mathematical
facts. The first is the identity

$$TSS = ESS + BSS.$$

In other words, all the
variability among the data can be divided into two pieces: the variability
within groups, and the variability among the means of different groups.
Our goal is to evaluate the apportionment, to decide if there is too much
between-group variability to be purely due to chance.
Of course, BSS and ESS involve different numbers of observations in
their sums, so we need to normalise them. We define

$$BMS = \frac{BSS}{K-1}, \qquad EMS = \frac{ESS}{N-K}.$$

This brings us to the second mathematical fact: if the null hypothesis is
true, then EMS and BMS are both estimates for $\sigma^2$. On the other hand,
interesting deviations from the null hypothesis (in particular, where the
populations have different means) would be expected to increase BMS relative to EMS. This leads us to define the deviation from the null hypothesis
as the ratio of these two quantities:

$$F = \frac{BMS}{EMS} = \frac{N-K}{K-1}\cdot\frac{BSS}{ESS}.$$

We reject the null hypothesis when F is too large: that is, if we obtain a
value f such that $P\{F \geq f\}$ is below the significance level of the test.
Table 14.4: Tabular representation of the computation of the F statistic.

                                SS        d.f.       MS               F
  Between Treatments            BSS (A)   K−1 (B)    BMS (X = A/B)    X/Y
  Errors (Within Treatments)    ESS (C)   N−K (D)    EMS (Y = C/D)
  Total                         TSS       N−1

Under the null hypothesis, the F statistic computed in this way has a
known distribution, called the F distribution with (K 1, N K) degrees
of freedom. We show the density of F for K = 5 different treatments and
different values of N in Figure 14.1.
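
The computation of Table 14.4 can be expressed compactly in code. A sketch in plain Python (the function name is our own):

    def anova_f(groups):
        """One-way ANOVA: return (F, df_between, df_within) for a list of
        samples, following the BSS/ESS decomposition above."""
        N = sum(len(g) for g in groups)
        K = len(groups)
        grand = sum(sum(g) for g in groups) / N
        bss = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
        ess = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
        return (bss / (K - 1)) / (ess / (N - K)), K - 1, N - K

    print(anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]))
    # (13.0, 2, 6) for this toy example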

[Figure 14.1: Density of the F distribution for different values of K and N (K = 5 with N = 10, 20, ∞; K = 3 with N = 6, ∞).]

14.4.2 The breastfeeding study: ANOVA analysis

In this case, since we do not know the individual observations, we cannot
compute TSS directly. We compute

$$ESS = \sum_{k=1}^{5}(n_k - 1)s_k^2 = 271 \times 15.9^2 + 304 \times 15.2^2 + 268 \times 15.7^2 + 103 \times 13.1^2 + 22 \times 14.4^2 = 227000;$$

$$BSS = \sum_{k=1}^{5} n_k(\bar{x}_k - \bar{x})^2 = 272(99.4 - \bar{x})^2 + 305(101.7 - \bar{x})^2 + 269(102.3 - \bar{x})^2 + 104(106.0 - \bar{x})^2 + 23(104.0 - \bar{x})^2 = 3579,$$

where $\bar{x} = 101.74$ is the overall mean.
We complete the computation in Table 14.5, obtaining F = 3.81. The
numbers of degrees of freedom are (4, 968). The table in the official booklet is
quite small (after all, there is one distribution for each pair of integers); it
gives the cutoff only for select values of $(d_1, d_2)$ at the 0.05 level.
For parameters in between one needs to interpolate, and for parameters
above the maximum we go to the row or column marked ∞. Looking on
the table in Figure 14.2, we see that the cutoff for F(4, ∞) is 2.37. Using
a computer, we can compute that the cutoff for F(4, 968) at level 0.05 is
actually 2.38; and the cutoff at level 0.01 would be 3.34.
Table 14.5: ANOVA table for breastfeeding data: Full Scale IQ, Adjusted.

                             SS                  d.f.       MS                F
  Between Samples            3579 (A)            4 (B)      894.8 (X = A/B)   3.81
  Errors (Within Samples)    227000 (C)          968 (D)    234.6 (Y = C/D)
  Total                      230600 (TSS=A+C)    972 (N−1)

14.4.3 Another Example: Exercising rats

We consider the following example, adapted from [MM98, Chapter 15]. A
study was carried out of the effect of exercise on bone density in rats.
30 rats were divided into three groups of ten: the first group carried out
ten high jumps (60 cm) a day for eight weeks; the second group carried
out ten low jumps (30 cm) a day for eight weeks; the third group had no
special exercise. At the end of the treatment period, each rat's bone density
was measured. The results are given in Table 14.6.

We wish to test the null hypothesis that the different groups have the
same mean bone density, against the alternative hypothesis that they have
different bone densities. We first carry out the ANOVA analysis. The total
sum of squares is 20013.4. The error sum of squares (ESS) is computed as
$9s_1^2 + 9s_2^2 + 9s_3^2 = 12579.5$. The between-groups sum of squares (BSS) is
computed as $10(638.7 - 617.4)^2 + 10(612.5 - 617.4)^2 + 10(601.1 - 617.4)^2 =
7433.9$ (here the overall mean is 617.4). Note that indeed $TSS = ESS +
BSS$. We complete the computations in Table 14.7, obtaining F = 7.98.
Looking in the column for 2 and the row for 30 (since there is no row on your
table for 27), we see that the cutoff at level 0.05 is 3.32. Thus, we conclude
that the difference in means between the groups is statistically significant.

[Figure 14.2: Table of the F distribution; finding the cutoff at level 0.05 for the breastfeeding study.]
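
As a check on the arithmetic, one can feed the raw data of Table 14.6 to a library routine; the sketch below uses scipy's one-way ANOVA, which implements the same F test:

    from scipy.stats import f_oneway

    high = [626, 650, 622, 674, 626, 643, 622, 650, 643, 631]
    low = [594, 599, 635, 605, 632, 588, 596, 631, 607, 638]
    control = [614, 569, 653, 593, 611, 600, 603, 593, 621, 554]

    F, p = f_oneway(high, low, control)
    print(round(F, 2), round(p, 4))   # F close to 7.98, p well below 0.05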

14.5 Multifactor ANOVA

The procedure described in section 14.4 leads to obvious extensions. We
have observations

$$x_{ki} = \mu_k + \epsilon_{ki}, \qquad k = 1, \dots, K; \quad i = 1, \dots, n_k,$$

where the $\epsilon_{ki}$ are the normally distributed errors, and $\mu_k$ is the true mean
for group k. Thus, in the example of section 14.4.3, there were three groups,
corresponding to three different exercise regimens, and ten different samples
for each regimen. The obvious estimate for $\mu_k$ is

$$\bar{x}_k = \frac{1}{n_k}\sum_{i=1}^{n_k} x_{ki},$$

and we use the F test to determine whether the differences among the means
are genuine. We decompose the total variance of the observations into the

Table 14.6: Bone density of rats after given exercise regime, in mg/cm³.

(a) Full data

  High     626  650  622  674  626  643  622  650  643  631
  Low      594  599  635  605  632  588  596  631  607  638
  Control  614  569  653  593  611  600  603  593  621  554

(b) Summary statistics

  Group    Mean   SD
  High     638.7  16.6
  Low      612.5  19.3
  Control  601.1  27.4

portion that is between groups and the portion that is within groups. If the
between-group variance is too big, we reject the hypothesis of equal means.
Many experiments naturally lend themselves to a two-way layout. For
instance, there may be three different exercise regimens and two different
diets. We represent the measurements as
$$x_{kji} = \mu_k + \nu_j + \epsilon_{kji}, \qquad k = 1, 2, 3; \quad j = 1, 2; \quad i = 1, \dots, n_{kj}.$$

It is then slightly more complicated to isolate the exercise effect $\mu_k$ and the
diet effect $\nu_j$. We test for equality of these effects by again splitting the
variance into pieces: the total sum of squares falls naturally into four pieces,
corresponding to the variance over diets, variance over exercise regimens,
variance over joint diet and exercise, and the remaining variance within
each group. We then test for whether the ratios of these pieces are too far
from the ratio of the degrees of freedom, as determined by the F distribution.
Multifactor ANOVA is quite common in experimental practice, but will
not be covered in this course.

14.6 Kruskal-Wallis Test

Just as there is the non-parametric rank-sum test, similar to the t and z


tests for equality of means, there is a non-parametric version of the F test,
called the Kruskal-Wallis test. As with the rank-sum test, the basic idea is

Table 14.7: ANOVA table for rat exercise data.

                             SS                 d.f.      MS               F
  Between Samples            7434 (A)           2 (B)     3717 (X = A/B)   7.98
  Errors (Within Samples)    12580 (C)          27 (D)    466 (Y = C/D)
  Total                      20014 (TSS=A+C)    29 (N−1)

simply to substitute ranks for the actual observed values. This avoids the
assumption that the data were drawn from a normal distribution.

In Table 14.8 we duplicate the data from Table 14.6, replacing the measurements by the numbers 1 through 30, representing the ranks of the data:
the lowest measurement is number 1, and the highest is number 30. In other
words, suppose we have observed K different groups, with $n_i$ observations
in each group. We order all the observations in one large sequence of length
N, from lowest to highest, and assign to each one its rank. (In case of ties,
we assign the average rank.) We then sum the ranks in group i, obtaining
numbers $R_1, \dots, R_K$. The Kruskal-Wallis test statistic is then

$$H = \frac{12}{N(N+1)}\sum_{i=1}^{K}\frac{R_i^2}{n_i} - 3(N+1).$$

Under the null hypothesis, that all the samples came from the same distribution, H has the $\chi^2$ distribution with K−1 degrees of freedom.

In the rat exercise example, we have the values of $R_i$ given in Table
14.8(b), yielding H = 10.7. If we are testing at the 0.05 significance level,
the cutoff for $\chi^2$ with 2 degrees of freedom is 5.99. Thus, we conclude again
that there is a statistically significant difference among the distributions of
bone density in the three groups.
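
In code, the statistic is a one-liner once the rank sums are known. A sketch (the function name is our own; it assumes the ranks 1..N have already been assigned, with ties averaged):

    def kruskal_wallis_h(rank_sums, group_sizes):
        """Kruskal-Wallis statistic H from the rank sums R_i, per the
        formula above."""
        N = sum(group_sizes)
        s = sum(R * R / n for R, n in zip(rank_sums, group_sizes))
        return 12.0 / (N * (N + 1)) * s - 3 * (N + 1)

    # Rat exercise data, rank sums from Table 14.8(b):
    print(kruskal_wallis_h([226.5, 136.5, 102], [10, 10, 10]))  # about 10.7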


Table 14.8: Ranks of data in Table 14.6.

(a) Full data

  High     18.5  27.5  16.5  30   18.5  25.5  16.5  27.5  25.5  20.5
  Low      6     8     23    11   22    3     7     20.5  12    24
  Control  14    2     29    4.5  13    9     10    4.5   15    1

(b) Summary statistics

  Group    Sum
  High     226.5
  Low      136.5
  Control  102


Lecture 15

Regression and correlation: Detecting trends
15.1 Introduction: Linear relationships between variables

In lectures 11 and 14 we have seen several examples of statistical problems
which could be stated in this form: we have pairs of observations $(x_i, y_i)$,
where $y_i$ is numerical and $x_i$ categorical (that is, $x_i$ is an assignment
to one of a few possible categories), and we seek to establish the extent
to which the distribution of $y_i$ depends on the category $x_i$. For example,
in section 11.1, $y_i$ is a height and $x_i$ is either "early-married" or "late-married". In section 14.4.3 the $y_i$ are measures of bone density, and the
$x_i$ are the experimental category (control, high exercise, low exercise). In
section 14.4.2 the $y_i$ are IQ measures, and the $x_i$ are categories of length of
breastfeeding.

The last example in particular points up the limitation of the approach
we have taken so far. If we think that breastfeeding affects IQ, we would
like to know if there is a linear relationship, and if so, how strong it is. That
is, the data would be telling a very different story if group 5 infants (>9
months) wound up with the highest IQ and group 1 (≤1 month) with the
lowest, than if the high IQs were in groups 1, 2 and 4, with lower IQs in
groups 3 and 5. But ANOVA only gives one answer: the group means are
different.

The linear relationship is the simplest one we can test for, when the
covariate $x_i$ is numerical. Intuitively, it says that each increase in $x_i$ by
one unit effects a change in $y_i$ by the same fixed amount. Formally, we write


Regression Model: $y_i = \beta x_i + \alpha + \epsilon_i$.


We think of this as a set of observations from random variables (X, Y)
that satisfy the relationship $Y = \beta X + \alpha + E$, where E is independent
of X. We call this a linear relation because the pairs of points $(x_i, y_i)$
with $y_i = \beta x_i + \alpha$ lie approximately on a straight line. Here the $\epsilon_i$ are
"noise" or "error" terms. They are the random variation in measurement
of Y that prevents us from seeing the data lying exactly (and obviously) on
a line $y = \beta x + \alpha$. They may represent actual random errors from a "true"
value of Y that is exactly $\beta X + \alpha$, or it may mean that $\beta X + \alpha$ is an overall
trend, with E reflecting the contribution of other factors that have nothing
to do with X. For instance, in the breastfeeding example, we are interested
to see whether children who received more breastmilk had higher IQs,
but we don't expect children who were nursed for the same length of time
to end up all with the same IQ, as they will differ genetically, socially, and
in simple random proclivities. (As mentioned in section 14.4.2, the change
from raw to adjusted mean IQ scores is intended to compensate for some of
the more systematic contributions to the "error" term: mother's smoking,
parents' IQ and income, and so forth. The effect is to reduce the size of the
$\epsilon_i$ terms, and so (one hopes) to make the true trend more apparent.) Of
course, if the $\epsilon_i$ are large on average (if E has a high variance relative to
the variance of $\beta X$), then the linear relationship will be drowned in a sea of
noise.

[Figure 15.1: Plot of data from the breastfeeding-IQ study in Table 14.1: adjusted mean Full Scale IQ against months breastfeeding. Stars represent the mean for the class, boxes represent mean ± 2 standard errors.]

15.2 Scatterplots

The most immediate thing that we may wish to do is to get a picture of the
data with a scatterplot. Some examples are shown in Figure 15.2.

[Figure 15.2: Examples of scatterplots. (a) Infant birth weight against maternal smoking; (b) Galton's parent-child height data; (c) IQ and brain surface area; (d) Brain volume and brain surface area; (e) Brain weight and body weight (62 species of land mammals); (f) log Brain weight and log body weight (62 species of land mammals).]

15.3 Correlation: Definition and interpretation

Given two paired random variables (X, Y), with means $\mu_X$ and $\mu_Y$, we define
the covariance of X and Y to be

$$\mathrm{Cov}(X, Y) = \mathrm{mean}\bigl((X - \mu_X)(Y - \mu_Y)\bigr).$$

For n paired observations $(x_i, y_i)$, with means $\bar{x}$ and $\bar{y}$, we define the population covariance

$$c_{xy} = \mathrm{mean}\bigl((x_i - \bar{x})(y_i - \bar{y})\bigr) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$$

As with the SD, we usually work with the sample covariance, which is
just

$$s_{xy} = \frac{n}{n-1}c_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$$

This is a better estimate for the covariance of the random variables that $x_i$
and $y_i$ are sampled from.
Notice that the means of $x_i - \bar{x}$ and $y_i - \bar{y}$ are both 0: on average, $x_i$ is
neither higher nor lower than $\bar{x}$. Why is the covariance then not also 0? If
X and Y are independent, then each value of X will come with, on average,
the same distribution of Y's, so the positives and negatives will cancel out,
and the covariance will indeed be 0. On the other hand, if high values of $x_i$
tend to come with high values of $y_i$, and low values with low values, then
the product $(x_i - \bar{x})(y_i - \bar{y})$ will tend to be positive, making the covariance
positive.
While positive and negative covariance have obvious interpretations, the
magnitude of covariance does not say anything straightforward about the
strength of connection between the covariates. After all, if we simply measure heights in millimetres rather than centimetres, all the numbers will
become 10 times as big, and the covariance will be multiplied by 100. For
this reason, we normalise the covariance by dividing it by the product of the
two standard deviations, producing the quantity called correlation:

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$
Of course, we estimate the correlation in a corresponding way:

Sample correlation:

$$r_{xy} = \frac{s_{xy}}{s_x s_y}.$$

It is easy to see that correlation does not change when we rescale the data,
for instance by changing the unit of measurement. If $x_i$ were universally
replaced by $\tilde{x}_i = \lambda x_i + \delta$ (with $\lambda > 0$), then $s_{xy}$ becomes $\lambda s_{xy}$, and $s_x$ becomes
$\lambda s_x$. Since the extra factor of $\lambda$ appears in the numerator and in the
denominator, the final result $r_{xy}$ remains unchanged. In fact, it turns
out that $r_{xy}$ is always between −1 and 1. A correlation of −1 means that
there is a perfect linear relationship between x and y with negative sign;
correlation of +1 means that there is a perfect linear relationship between
x and y with positive sign; and correlation 0 means no linear relationship
at all.
In Figure 15.3 we show some samples of standard normally distributed
pairs of random variables with different correlations. As you can see, high
positive correlation means the points lie close to an upward-sloping line;
high negative correlation means the points lie close to a downward-sloping
line; and correlation close to 0 means the points lie scattered about a disk.

15.4 Computing correlation

There are several alternative formulae for the covariance, which may be more
convenient than the standard formula:

$$s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1}\sum_{i=1}^{n}x_i y_i - \frac{n}{n-1}\bar{x}\bar{y} = \frac{1}{4}\bigl(s_{x+y}^2 - s_{x-y}^2\bigr),$$

where $s_{x+y}$ and $s_{x-y}$ are the sample SDs of the collections $(x_i + y_i)$ and
$(x_i - y_i)$ respectively.
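
These identities are easy to check numerically. The following sketch (Python with numpy; the five data pairs are the toy example of Table 15.4 below) computes the sample covariance all three ways:

    import numpy as np

    x = np.array([6.0, 2.0, 5.0, 7.0, 4.0])
    y = np.array([3.0, 2.0, 4.0, 6.0, 5.0])
    n = len(x)

    s1 = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)
    s2 = (x * y).sum() / (n - 1) - n / (n - 1) * x.mean() * y.mean()
    s3 = ((x + y).std(ddof=1) ** 2 - (x - y).std(ddof=1) ** 2) / 4
    print(s1, s2, s3)   # all equal 2.0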

[Figure 15.3: Examples of pairs of standard normal random variables with different correlations: 0.95, 0.8, 0.4, and 0.]

15.4.1 Brain measurements and IQ

A study was done [TLG+ 98] to compare various brain measurements and IQ
scores among 20 subjects.1 In Table 15.1 we give some of the data, including
a measure of full-scale IQ and MRI estimates of total brain volume and total
brain surface area.
Table 15.1: Brain measurement data.

  Subject   Brain volume (cm³)   Brain surface area (cm²)   IQ
  1         1005                 1914                       96
  2         963                  1685                       89
  3         1035                 1902                       87
  4         1027                 1860                       87
  5         1281                 2264                       101
  6         1272                 2216                       103
  7         1051                 1867                       103
  8         1079                 1851                       96
  9         1034                 1743                       127
  10        1070                 1709                       126
  11        1173                 1690                       101
  12        1079                 1806                       96
  13        1067                 2136                       93
  14        1104                 2019                       88
  15        1347                 1967                       94
  16        1439                 2155                       85
  17        1029                 1768                       97
  18        1100                 1828                       114
  19        1204                 1774                       113
  20        1160                 1972                       124

  Mean      1125                 1906                       101
  SD        125                  175                        13.2

In Figures 15.2(c) and 15.2(d) we see scatterplots of IQ against surface


¹In fact, the 20 subjects comprised 5 pairs of male and 5 pairs of female monozygous
twins, so there are plenty of interesting possibilities for factorial analysis. The data are
available in a convenient table at http://lib.stat.cmu.edu/datasets/IQ_Brain_Size.


area and volume against surface area, respectively. There seems to be a
negative relationship between IQ and surface area, and a positive relationship between volume and surface area; the latter, while not surprising,
is hardly inevitable. Given the triples of numbers it is straightforward
to compute all three paired correlations. If we denote the volume, surface
area, and IQ by $x_i$, $y_i$, and $z_i$, we can compute

$$s_{xy} = 13130, \qquad s_{yz} = -673, \qquad s_{xz} = -105.$$

This leads us to

$$r_{xy} = \frac{13130}{s_x s_y} = \frac{13130}{125 \times 175} = 0.601.$$

Similarly,

$$r_{yz} = -0.291, \qquad r_{xz} = -0.063.$$

15.4.2 Galton parent-child data

There is a famous collection of data collected by Francis Galton, measuring
heights and lengths of various body parts for various related individuals.
(This was one of the earliest quantitative studies of human inheritance.)
We consider the heights of parents and children.² Parent height is given
as the average height of the two parents.³ The data were given in units
of whole inches, so we have "jittered" the data to make the scatterplot of
Figure 15.2(b). That is, we have added some random noise to each of the
values when it was plotted, so that the dots don't lie right on top of each
other.

Suppose we have the list of parent heights and child heights, and also the
lists of differences and sums of parent and child heights. We know the
variances for each of these (Table 15.2). Then we have

$$s_{xy} = \frac{1}{4}(13.66 - 5.41) = 2.07,$$

and

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{2.07}{2.52 \times 1.79} = 0.459.$$

²Available as the dataset galton of the UsingR package of the R programming
language, or directly from http://www.bun.kyoto-u.ac.jp/~suchii/galton86.html.
³We do not need to pay attention to the fact that Galton multiplied all female heights
by 1.08.

Table 15.2: Variances for different combinations of the Galton height data.

                SD     Variance
  Parent        1.79   3.19
  Child         2.52   6.34
  Sum           3.70   13.66
  Difference    2.33   5.41

15.4.3 Breastfeeding example

This example is somewhat more involved than the others, and than the kinds
of covariance computations that appear on exams.

Consider the breastfeeding-IQ data (Table 14.1), which we summarise in
the top rows of Table 15.3. We don't have the original data, but we can use
the above formulas to estimate the covariance, and hence the correlation,
between number of months breastfeeding and adult IQ. Here $x_i$ is the number
of months individual i was breastfed, and $y_i$ the adult IQ.
Table 15.3: Computing breastfeeding-Full-IQ covariance for the Copenhagen
infant study.

                                         Duration class
                                ≤1      2-3     4-6     7-9     >9      Total
  Average months breastfeeding  1       3       5.5     8.5     12
  N                             272     305     269     104     23      973
  SD                            15.9    15.2    15.7    13.1    14.4    15.7
  Adjusted Mean                 99.4    101.7   102.3   106.0   104.0   101.74
  Contribution to $\sum x_i y_i$  27037   93056   151353  93704   28704

In the bottom row of the table we give the contribution that the individuals in each class made to the total $\sum x_i y_i$. Since the $x_i$ are all about the same in
any column, we treat them as though they were all equal
to the average x value in the column. (We give these averages in the first
row. This involves a certain amount of guesswork, particularly for the last
column; on the other hand, there are very few individuals in that column.)

The sum of the y values in any column is the average y value multiplied
by the relevant number of samples. Consider the first column:

$$\sum_{i \text{ in first column}} x_i y_i \approx 1 \times \sum_{i \text{ in first column}} y_i = 1 \times N_1 \bar{y}_1 = 1 \times 272 \times 99.4 = 27037.$$

Similarly, the second column contributes $3 \times 305 \times 101.7 = 93056$, and so on.
Adding these contributions yields $\sum x_i y_i = 393853$.

We next estimate $\bar{x}$, treating it as though there were 272 $x_i = 1$, 305
$x_i = 3$, and so on, yielding

$$\bar{x} = \frac{1}{973}(272 \times 1 + 305 \times 3 + 269 \times 5.5 + 104 \times 8.5 + 23 \times 12) = 3.93.$$
To estimate $\bar{y}$, we take

$$\sum_i y_i = \sum_{\text{column 1}} y_i + \sum_{\text{column 2}} y_i + \sum_{\text{column 3}} y_i + \sum_{\text{column 4}} y_i + \sum_{\text{column 5}} y_i.$$

The sum in a column is just the average in the column multiplied by the
number of observations in the column, so we get

$$\bar{y} = \frac{1}{973}(272 \times 99.4 + 305 \times 101.7 + 269 \times 102.3 + 104 \times 106.0 + 23 \times 104.0) = 101.74.$$

Thus, we get

$$s_{xy} = \frac{393853}{972} - \frac{973}{972} \times 101.74 \times 3.93 = 4.95.$$

To compute the correlation we now need to estimate the variance of x
separately. (The variance of y could be computed from the other given data,
using the individual column variances to compute $\sum y_i^2$, but we are
given that $s_y = 15.7$.) For the x values, we continue to treat them as though
they were all the same in a column, and use the formula for grouped data.
We get then

$$s_x^2 = \frac{1}{972}\Bigl(272(1 - 3.93)^2 + 305(3 - 3.93)^2 + 269(5.5 - 3.93)^2 + 104(8.5 - 3.93)^2 + 23(12 - 3.93)^2\Bigr) = 7.13.$$

Thus, we have

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{4.95}{2.67 \times 15.7} = 0.118.$$

15.5 Testing correlation

One thing we want to do is to test whether the correlation between two
variables is different from 0, or whether the apparent correlation could be
merely due to chance variation. After all, if we sample X and Y at random,
completely independently, the correlation won't come out to be exactly 0.

We suppose we have a sample $(x_i, y_i)$, $i = 1, \dots, n$, from random variables (X, Y), which we assume are normally distributed, but with unknown
means, variances, and correlation. We formulate the null hypothesis
$H_0: \rho_{XY} = 0$, and test it against the alternative hypothesis that $\rho_{XY} \neq 0$. As usual, we
need to find a test statistic R that has two properties:

(1) Extreme values of R correspond to extreme failure of the null hypothesis (relative to the alternative). In other words, R should tend to take
on more extreme values when $\rho_{XY}$ is farther away from 0.

(2) We know the distribution of R (at least approximately).
In this case, our test statistic is

$$R = \frac{r_{xy}\sqrt{n-2}}{\sqrt{1 - r_{xy}^2}}.$$

It can be shown that, under the null hypothesis, this R has the Student
t distribution with n−2 degrees of freedom. Thus, we can look up the
appropriate critical value, and reject the null hypothesis if |R| is above this
cutoff. For example, in the brain measurement experiments of section 15.4.1
we found the correlation between brain volume and surface area to be 0.601 from
20 samples, which produces R = 3.18, well above the threshold value for t
with 18 degrees of freedom at the 0.05 level, which is 2.10. On the other
hand, the correlation −0.291 for surface area against IQ yields R = −1.29,
which does not allow us to reject the null hypothesis that the true underlying
population correlation is 0; and the correlation −0.063 between volume and
IQ yields |R| of only 0.255.

270

Regression

Note that for a given n and choice of level, the threshold in t translates
directly to a threshold in r. If t is the appropriate threshold value in t,
then we
p reject the null hypothesis when our sample correlation r is larger
than t /(n 2 + t ). In particular, for large n and = 0.05, we have a
threshold
p for t very close to 2, so that we reject the null hypothesis when
|r| > 2/n.
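
This test is simple enough to code directly. A sketch (Python; the function name is ours, and scipy is used only for the t quantile):

    from math import sqrt
    from scipy.stats import t

    def correlation_test(r, n, alpha=0.05):
        """Test H0: rho = 0 via R = r*sqrt(n-2)/sqrt(1-r^2), which has a
        t distribution with n-2 degrees of freedom under H0."""
        R = r * sqrt(n - 2) / sqrt(1 - r * r)
        cutoff = t.ppf(1 - alpha / 2, n - 2)
        return R, cutoff, abs(R) > cutoff

    # Brain volume vs surface area (section 15.4.1): r = 0.601, n = 20.
    print(correlation_test(0.601, 20))   # R about 3.19, cutoff 2.10: reject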

15.6 The regression line

15.6.1 The SD line

One way of understanding the relationship between X and Y is to ask: if
you know X, how much does that help you to predict the corresponding
value of Y? For instance, in section 15.4.2 we might want to know what the
predicted adult height should be for a child, given the height of the parents.
Presumably, average-height parents (68.3 inches) should have average-height
children (68.1 inches). But what about parents who average 70 inches (remember, the parent height is an average of the mother and father): they
are a bit taller than average, so we expect them to have children who are
taller than average, but by how much?

The most naïve approach would be to say that parents who are 1.7 inches
above the mean should have children who are 1.7 inches above the mean.
This can't be right, though, because there is considerably more spread in
the children's heights than in the parents' heights:⁴ the parents' SD is
1.79, where the children's SD is 2.50.
One approach would be to say: the parents are about 1 SD above the
mean for parents (actually, 0.95 SD), so the children should be about 1 SD
above the mean for children. If we make that prediction for all parents, and
plot the corresponding prediction for their children, we get the green line
in Figure 15.4. We follow [FPP98] in calling this the SD line. This is a
line that passes through the point corresponding to the means of the two
variables, and rises one SD in Y for every SD in X. If we think of the cloud
of points as an oval, the SD line runs down the long axis of the oval. The
formula for the SD line is then

$$Y - \bar{y} = \frac{s_y}{s_x}(X - \bar{x}). \qquad (15.1)$$

⁴This is because the parent height is the average of two individuals. We note that the
parents' SD is almost exactly $1/\sqrt{2}$ times the children's SD.

This can be rearranged to

$$Y = \frac{s_y}{s_x}X + \left(\bar{y} - \bar{x}\,\frac{s_y}{s_x}\right).$$

[Figure 15.4: Galton parent-child heights, with SD line in green. The point of the means is shown as a red circle.]

15.6.2 The regression line

On further reflection, though, it becomes clear that the SD line can't be the
best prediction of Y from X. If we look at Figure 15.3, we see that the line
running down the middle of the cloud of points is a good predictor of Y from
X if the correlation is close to 1. When the correlation is 0, though, we'd
be better off ignoring X and predicting Y to be $\bar{y}$ always; and, of course,
when the correlation is negative, the line really needs to slope in the other
direction.

What about intermediate cases, like the Galton data, where the correlation is 0.46? One way of understanding this is to look at a narrow
range of X values (parents' heights), and consider what the corresponding range of Y values is. In Figure 15.5, we sketch in rectangles showing
the approximate range of Y values corresponding to X = 66, 68, 70, and
72 inches. As you can see, the middle of the X = 66 range is substantially above the SD line, whereas the middle of the X = 72 range is below
the SD line. This makes sense, since the connection between the parents'


and children's height is not perfect. If the linear relation $Y = \beta X + \alpha$
held exactly, then $SD(Y) = \beta\,SD(X)$, and if $X = \bar{X} + SD(X)$, then
$Y = \bar{Y} + \beta\,SD(X) = \bar{Y} + SD(Y)$ exactly. But when the Y values are
spread out around $\beta X + \alpha$, SD(Y) gets inflated by irrelevant noise
that has nothing to do with X. This means that an increase of 1 SD in X
won't contribute a full SD to Y, on average, but something less than that.
The line that runs approximately through the midpoints of the columns,
shown in blue in Figure 15.5, is called the regression line. So how many
SDs of Y is a 1 SD change in X worth? It turns out, this is exactly the
correlation.

Figure 15.5: The parent-child heights, with an oval representing the general
range of values in the scatterplot. The SD line is green, the regression
line blue, and the rectangles represent the approximate span of Y values
corresponding to X = 66, 68, 70, 72 inches.

The formula for the regression line is

$$Y - \bar{y} = r_{xy}\frac{s_y}{s_x}(X - \bar{x}) = \frac{s_{xy}}{s_x^2}(X - \bar{x}).$$

This is equivalent to

$$Y = bX + a, \quad\text{where } b = \frac{s_{xy}}{s_x^2} \text{ and } a = \bar{y} - b\bar{x}.$$

Another way of understanding the regression line is as the answer to
the question: suppose we want to predict $y_i$ from $x_i$ with a linear relation
$y = bx + a$; which choice of a and b will make the total squared error as
small as possible? That is, the regression line makes $\sum\bigl(y_i - (bx_i + a)\bigr)^2$ as
small as possible. Note that the choice of the total square error as the way
to score the prediction errors implies a certain choice about how much we
care about a few big errors relative to a lot of small errors. The regression
line tries, up to a point, to make the biggest error as small as possible, at
the expense of making some of the smaller errors bigger than they might
have been.

Example 15.1: Regression line(s) for Galton's data

For Galton's parent-child height data, the correlation was 0.46,
so that the slope of the regression line is

$$b = r_{xy}\frac{s_y}{s_x} = 0.649, \qquad a = \bar{y} - b\bar{x} = 23.8.$$

Thus, if we have parents with average height 72 inches, we would
predict that their child should have, on average, height

$$y_{pred} = 72 \times 0.649 + 23.8 = 70.5 \text{ inches}.$$

Suppose we reverse the question: given a child of height 70.5
inches, what should we predict for the average of his or her parents' heights? You might suppose it is 72 inches, simply reversing
the previous calculation. In fact, though, we need to redo the
calculation. All predictions must be closer to the (appropriate)
mean than the observations of the independent variable on which
the prediction is based. The child is (70.5 − 68.1)/2.52 = 0.95
SDs away from the mean, and the prediction for the parents
must be closer to the parents' mean than that. In fact, if we
write the coefficients for the regression line predicting parents
from children as $b'$ and $a'$, we have

$$b' = \frac{s_{xy}}{s_y^2} = 0.326, \qquad a' = \bar{x} - b'\bar{y} = 45.9.$$

Thus, the prediction for the parents' heights is $45.9 + 0.326 \times 70.5 = 68.9$ inches. □

This may be seen in the following simple example. Suppose we have five
paired observations:

Table 15.4: Simple regression example

                              1       2       3       4       5       mean   SD
  x                           6       2       5       7       4       4.8    1.9
  y                           3       2       4       6       5       4.0    1.58
  prediction 0.54x + 1.41     4.65    2.49    4.11    5.19    3.57
  residual                    −1.65   −0.49   −0.11   0.81    1.43

We compute the sample covariance as

$$s_{xy} = \frac{1}{5-1}\Bigl(\sum x_i y_i - 5\bar{x}\bar{y}\Bigr) = \frac{1}{4}(104 - 5 \times 4.8 \times 4.0) = 2.0,$$

yielding

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = 0.66.$$

The regression line then has coefficients

$$b = \frac{s_{xy}}{s_x^2} = 0.54, \qquad a = \bar{y} - b\bar{x} = 1.41.$$

This is the line plotted in blue in Figure 15.6; in green is the SD line. We
have shown with dashed lines the size of the errors that would accrue if
one or the other line were used for predicting y from x. Note that some of
the errors are smaller for the SD line, but the largest errors are made larger
still, which is why the regression line has a smaller total squared error. In
Figure 15.8 we do the same thing for the brain volumes and surface areas of
Table 15.1.
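
The same computation in code, as a check on the example above (Python with numpy; ddof=1 gives the sample variance with the n−1 divisor):

    import numpy as np

    x = np.array([6.0, 2.0, 5.0, 7.0, 4.0])
    y = np.array([3.0, 2.0, 4.0, 6.0, 5.0])

    sxy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
    b = sxy / x.var(ddof=1)          # slope s_xy / s_x^2
    a = y.mean() - b * x.mean()      # intercept
    print(round(b, 2), round(a, 2))  # 0.54 and 1.41, as above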

15.6.3 Confidence interval for the slope

Suppose the observations $(x_i, y_i)$ are independent observations of normal
random variables (X, Y), which satisfy

$$Y = \beta X + \alpha + \epsilon,$$
[Figure 15.6: A scatterplot of the hypothetical data from Table 15.4. The regression line is shown in blue, the SD line in green. The dashed lines show the prediction errors for each data point corresponding to the two lines.]

[Figure 15.7: Regression lines for (a) predicting surface area from volume and (b) predicting IQ from surface area. The pink shaded region shows the confidence interval for the slope of the regression line.]

[Figure 15.8: A scatterplot of brain surface area against brain volume, from the data of Table 15.1. The regression line is shown in blue, the SD line in green. The dashed lines show the prediction errors for each data point corresponding to the two lines.]

where $\epsilon$ is a normal error term independent of X. The idea is that Y gets a
contribution from X, and then there is the contribution from $\epsilon$, representing
everything else that has nothing to do with X. The regression coefficients
b and a will be estimates for the true slope and intercept $\beta$ and $\alpha$. We then
ask the standard question: how certain are we of these estimates?

One thing we may want to do is to test whether there is really a nonzero
slope, or whether the apparent difference from $\beta = 0$ might be purely due
to chance variation. Since the slope is zero exactly when the correlation is
zero, this turns out to be exactly the same as the hypothesis test for zero
correlation, as described in section 15.5.
The same ideas can be used to produce a confidence interval for $\beta$. Under
the normality assumption, it can be shown that the random estimate b has
standard error

$$SE(b) \approx \frac{b}{r}\cdot\frac{\sqrt{1-r^2}}{\sqrt{n-2}},$$

and that

$$t := \frac{b - \beta}{SE(b)}$$

has approximately the t distribution with n−2 degrees of freedom. Thus,
a $(1-\alpha) \times 100\%$ confidence interval for $\beta$ is

$$b \pm T^* \cdot SE(b),$$

where $T^*$ is chosen from the t distribution table to be the threshold value
for the test at level $\alpha$ with n−2 degrees of freedom; in other words, $T^* =
t_{1-\alpha/2}(n-2)$, the value such that the t distribution has a total of $\alpha/2$
probability above $T^*$ (and another $\alpha/2$ probability below $-T^*$).
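
A sketch of this interval in code (Python; the function name is our own), which can be checked against the brain-measurement example of the next section:

    from math import sqrt
    from scipy.stats import t

    def slope_ci(b, r, n, alpha=0.05):
        """Approximate (1-alpha) confidence interval for the true slope,
        using SE(b) = (b/r) * sqrt(1-r^2) / sqrt(n-2)."""
        se = (b / r) * sqrt(1 - r * r) / sqrt(n - 2)
        T = t.ppf(1 - alpha / 2, n - 2)
        return b - T * se, b + T * se

    # Brain volume vs surface area: b = 0.841, r = 0.601, n = 20.
    print(slope_ci(0.841, 0.601, 20))   # roughly (0.29, 1.40)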

15.6.4 Example: Brain measurements

Suppose we know an individual's brain volume x, and wish to make a best
guess about that individual's brain surface area. We have already computed
the correlation to be 0.601 in section 15.4.1. Combining this with the means
and SDs from Table 15.1, we get the regression coefficients

$$b = r_{xy}\frac{s_y}{s_x} = 0.601 \times \frac{175}{125} = 0.841, \qquad a = \bar{y} - b\bar{x} = 959.$$

The standard error for b is

$$SE(b) \approx \frac{b}{r}\cdot\frac{\sqrt{1-r^2}}{\sqrt{n-2}} = \frac{0.841}{0.601}\cdot\frac{\sqrt{1-0.601^2}}{\sqrt{18}} = 0.264.$$

We look in the t-distribution table in the row for 18 degrees of freedom and
the column for p = 0.05 (see Figure 15.9): we see that 95% of the probability
lies between −2.10 and +2.10, so that a 95% confidence interval for $\beta$ is

$$b \pm 2.10\,SE(b) = 0.841 \pm 2.10 \times 0.264 = (0.287, 1.40).$$

The scatterplot is shown with the regression line $y = 0.841x + 959$ in Figure
15.7(a), and the range of slopes corresponding to the 95% confidence interval
is shown by the pink shaded region. (Of course, to really understand the
uncertainty of the estimates, we would have to consider simultaneously the
random error in estimating the means, hence the intercept of the line. This
leads to the concept of a two-dimensional confidence region, which is beyond
the scope of this course.)
Similarly, for predicting IQ from surface area we have

$$b = r_{yz}\frac{s_z}{s_y} = -0.291 \times \frac{13.2}{175} = -0.022, \qquad a = \bar{z} - b\bar{y} = 143.$$

The standard error for b is

$$SE(b) \approx \frac{b}{r}\cdot\frac{\sqrt{1-r^2}}{\sqrt{n-2}} = \frac{0.022}{0.291}\cdot\frac{\sqrt{1-0.291^2}}{\sqrt{18}} = 0.017.$$

A 95% confidence interval for the true slope is then given by $-0.022 \pm
2.10 \times 0.017 = (-0.058, 0.014)$. The range of possible predictions of IQ from
surface area (pretending, again, that we know the population means exactly) is
given in Figure 15.7(b).
What this means is that each change of 1 cm³ in brain volume is typically
associated with a change of 0.841 cm² in brain surface area. A person of
average brain volume (1126 cm³) would be expected to have average
brain surface area (1906 cm²), but if we know that someone has brain
volume 1226 cm³, we would do well to guess that he has a brain surface
area of 1990 cm² (1990 = 1906 + 0.841 × 100 = 0.841 × 1226 + 959). However,
given sampling variation, the number of cm² typically associated with a 1 cm³
change in volume might really be as low as 0.287 or as high as 1.40, with
95% confidence.

Similarly, a person of average brain surface area (1906 cm²) might be
predicted to have average IQ of 101, but someone whose brain surface area
is 100 cm² above average might be predicted to have IQ below average by 2.2
points, so 98.8. At the same time, we can only be 95% certain that the
change associated with a 100 cm² increase in brain surface area is between
−5.8 and +1.4 points; hence, it might just as well be 0. We say that the
correlation between IQ and brain surface area is not statistically significant,
or that the slope of the regression line is not significantly different from 0.

[Figure 15.9: t table for confidence intervals for slopes computed in section 15.6.4.]

Lecture 16

Regression, Continued
16.1 R²

What does it mean to say that $bx_i + a$ is a good predictor of $y_i$ from $x_i$?
One way of interpreting this would be to say that we will typically make
smaller errors by using this predictor than if we tried to predict $y_i$ without
taking account of the corresponding value of $x_i$.

Suppose we have our standard regression probability model $Y = \beta X +
\alpha + E$; this means that the observations are

$$y_i = \beta x_i + \alpha + \epsilon_i.$$

Of course, we don't really get to observe the $\epsilon$ terms: they are only inferred
from the relationship between $x_i$ and $y_i$. But if we have X independent of
E, then we can use our rules for computing variance to see that

$$Var(Y) = \beta^2 Var(X) + Var(E). \qquad (16.1)$$

If we think of variance as being a measure of uncertainty, then this says that
the uncertainty about Y can be divided up into two parts: one part that
comes from the uncertainty about X, and would be cleared up once we know
X, and a residual uncertainty that remains independent of X.¹ From our
formula for the regression coefficients, we see that $\beta^2 = r^2 Var(Y)/Var(X)$.
This means that the first term on the right in expression (16.1) becomes
$r^2 Var(Y)$. In other words, the portion of the variance that is due to variability in X is $r^2 Var(Y)$, and the residual variance (the variance of the
error term) is $(1-r^2)Var(Y)$. Often this relation is summarised by the
statement: X explains $r^2 \times 100\%$ of the variance.

¹If this sounds a lot like the ANOVA approach from lecture 14, that's because it is.
Formally, they're variants of the same thing, though developing this equivalence is beyond
the scope of this course.
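
The decomposition (16.1) is easy to see in simulation. In the sketch below (Python with numpy; the choices $\beta = 2$, $\alpha = 1$ and the sample size are arbitrary), the explained variance $r^2 Var(Y)$ matches $\beta^2 Var(X)$, and the residual variance $(1-r^2)Var(Y)$ matches $Var(E)$:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(0, 1, 10000)
    e = rng.normal(0, 1, 10000)     # noise independent of x
    y = 2.0 * x + 1.0 + e           # beta = 2, alpha = 1

    r = np.corrcoef(x, y)[0, 1]
    print(r**2 * y.var(), 2.0**2 * x.var())   # explained variance, both ~4
    print((1 - r**2) * y.var(), e.var())      # residual variance, both ~1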

16.1.1 Example: Parent-child heights

From the Galton data of section 15.4.2, we see that the variance of the child's
height is 6.34. Since $r^2 = 0.21$, we say that the parents' heights "explain"
21% of the variance in child's height. We expect the residual variance
to be about 6.34 × 0.79 = 5.01. What this means is that the variance among
children whose parents were all about the same height should be 5.01. In
Figure 16.1 we see histograms of the heights of children whose parents all
had heights in the same range of ±1 inch. Not surprisingly, the shapes of
these histograms vary somewhat, but the variances are all substantially
smaller than 6.34, varying between 4.45 and 5.75.

16.1.2 Example: Breastfeeding and IQ

In section 15.4.3 we computed the correlation between number of months
breastfeeding and adult IQ to be 0.118. This gives us $r^2 = 0.014$, so we
say that the length of breastfeeding accounts for about 1.4% of the variance in
adult IQ. The variance in IQ among children who were nursed for about the same
length of time is about 1% less than the overall variance in IQ among the
population. Not a very big effect, in other words. Is the effect real at all, or
could it be an illusion, due to chance variation? We perform a hypothesis
test, computing

$$R = \frac{r\sqrt{973 - 2}}{\sqrt{1 - r^2}} = 3.70.$$

The threshold for rejecting t with 971 degrees of freedom (the row
essentially the same as the normal distribution) at the $\alpha = 0.01$ significance
level is 2.58. Hence, the correlation is highly significant. This is a good
example of where "significant" in the statistical sense should not be confused
with "important". The difference is significant because the sample is so large
that it is very unlikely that we would have seen such a correlation purely by
chance if the true correlation were zero. On the other hand, explaining 1%
of the variance is unlikely to be seen as a highly useful finding. (At the same
time, it might be at least theoretically interesting to discover that there is
any detectable effect at all.)


[Figure 16.1: Histograms of Galton's data for children's heights, partitioned into classes whose parents all had the same height ±1 inch (parent heights 65, 67, 69, and 71 inches); the variances within the classes range between 4.45 and 5.75.]

16.2 Regression to the mean and the regression fallacy

We have all heard stories of a child coming home proudly from school with a score of 99 out of 100 on a test, and the strict parent who points out that he or she had 100 out of 100 on the last test, and what happened to the other point? Of course, we instinctively recognise the parent's response as absurd. Nothing happened to the other point (in the sense of the child having fallen down in his or her studies); that's just how test scores work. Sometimes they're better, sometimes worse. It is unfair to hold someone to the standard of the last perfect score, since the next score is unlikely to be exactly the same, and there's nowhere to go but down.

Of course, this is true of any measurements that are imperfectly correlated: if |r| is substantially less than 1, the regression equation tells us that those individuals who have extreme values of x tend to have values of y that are somewhat less extreme. If r = 0.5, those with x values that are 2 SDs above the mean will tend to have y values that are still above average, but only 1 SD above the mean. There is nothing strange about this: if we pick out those individuals who have exceptionally voluminous brains, for instance, it is not surprising that the surface areas of their brains are less extreme. While athletic ability certainly carries over from one sport to another, we do not expect the world's finest footballers to also win gold medals in swimming. Nor does it seem odd that great composers are rarely great pianists, and vice versa.
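
The arithmetic of the previous paragraph is easy to reproduce by simulation. The following R sketch is our own illustration: two noisy measurements of the same underlying ability, correlated at about 0.5, in which the individuals who scored highest the first time are, on average, only about half as extreme the second time.

    set.seed(1)
    n <- 10000
    ability <- rnorm(n)             # stable component, common to both tests
    test1 <- ability + rnorm(n)     # first noisy measurement
    test2 <- ability + rnorm(n)     # second noisy measurement
    cor(test1, test2)               # roughly 0.5
    top <- test1 > 2 * sd(test1)    # pick out the top scorers on test 1
    mean(test1[top]) / sd(test1)    # about 2.4 SDs above the mean
    mean(test2[top]) / sd(test2)    # only about 1.2 SDs above the mean

No one in the top group has "slacked off" between tests; their fall-back is built into the imperfect correlation.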
And yet, when the x and y are successive events in time (for instance, the same child's performance on two successive tests), there is a strong tendency to attribute causality to this imperfect correlation. Since there is a random component to performance on the test, we expect that the successive scores will be correlated, but not exactly the same. The plot of score number n against score number n + 1 might look like the upper left scatterplot in Figure 15.3. If she had an average score last time, she's likely to score about the same this time. But if she did particularly well last time, this time is likely to be less good. Consider, though, how easy it would be to look at these results and say, "Look, she did well, and as a result she slacked off, and did worse the next time"; or "It's good that we punish her when she does poorly on a test by not letting her go outside for a week, because that always helps her focus, and she does better the next time." Galton noticed that children of exceptionally tall parents were closer to average than the parents were, and called this "regression to mediocrity" [Gal86].


Some other examples:

•  Speed cameras tend to be sited at intersections where there have been high numbers of accidents. Some of those high numbers are certainly inherent to the site, but sometimes the site was just unlucky in one year. You would expect the numbers to go down the next year regardless of whether you made any changes, just as you would expect an intersection with very few accidents to have more the next year. Some experts have pointed out that this can lead to overestimating the effect of the cameras. As described in The Times, 15 December 2005 [http://www.timesonline.co.uk/tol/news/uk/article766659.ece]:

       The Department for Transport [. . . ] published a study which found that cameras saved only half as many lives as claimed. This undermines the Government's main justification for increasing speed camera penalties five-fold, from 340,000 in 1997 to 1.8 million in 2003.

       Safe Speed, the anti-camera campaign, has argued for years that the policy of siting cameras where there have been recent clusters of crashes makes it impossible to attribute any fall in collisions to the presence of a camera. Collisions would be expected to fall anyway as they reverted from the temporary peak to the normal rate.

       The department commissioned the Department of Engineering at Liverpool University to study this effect, which is known as regression to the mean. The study concluded that most of the fall in crashes could be attributed to regression to the mean. The presence of the camera was responsible for as little as a fifth of the reduction in casualties.

   The report goes on to say that "The department put the results of the study close to the bottom of a list of appendices at the back of a 160-page report which claims that cameras play an important role in saving lives."
•  Suppose you are testing a new blood pressure medication. As we have described in section 11.5, it is useful to compare the same individual's blood pressure before and after taking the medication. So we take a group of subjects, measure their blood pressure (x_i), give them the medication for two weeks, then measure again (y_i), and test to see whether y_i − x_i is negative, on average. We can't give blood-pressure-lowering medication to people who have normal blood pressure, though, so we start by restricting the study to those whose first measurement x_i is in the hypertension range, x_i > 140 mmHg. Since they are above average, and since there is significant random fluctuation in blood pressure measurements, those individuals would be expected to have lower blood pressure measurements the second time, purely by chance. If you are not careful, you will find yourself overestimating the effectiveness of the medication.
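
   A simulation makes the danger concrete. In the following R sketch (our own illustration, with made-up numbers), every subject's true blood pressure is identical before and after, yet the enrolled group still shows an apparent drop:

       set.seed(2)
       true_bp <- rnorm(1000, mean = 130, sd = 15)   # each subject's long-run level
       x <- true_bp + rnorm(1000, sd = 10)           # first measurement
       y <- true_bp + rnorm(1000, sd = 10)           # second measurement, no treatment
       enrolled <- x > 140                           # enrol only apparent hypertensives
       mean(y[enrolled] - x[enrolled])               # clearly negative, by selection alone

   This is one reason why a proper trial compares the treated subjects with an equally selected control group, rather than only with their own first measurements.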

•  The behavioural economists Amos Tversky and Daniel Kahneman tell the story of teaching psychology to air force flight instructors, and of trying to explain to them that there was a great deal of evidence that positive reinforcement (praise for good performance) is more effective than negative reinforcement (criticism of poor performance).

       After some experience with this training approach, the instructors claimed that contrary to psychological doctrine, high praise for good execution of complex maneuvers typically results in a decrement of performance on the next try [. . . ] Regression is inevitable in flight maneuvers because performance is not perfectly reliable and progress between successive maneuvers is slow. Hence, pilots who did exceptionally well on one trial are likely to deteriorate on the next, regardless of the instructor's reaction to the initial success. The experienced flight instructors actually discovered the regression but attributed it to the detrimental effect of positive reinforcement. This true story illustrates a saddening aspect of the human condition. We normally reinforce others when their behavior is good and punish them when their behavior is bad. By regression alone, therefore, they are most likely to improve after being punished and most likely to deteriorate after being rewarded. Consequently, we are exposed to a lifetime schedule in which we are most often rewarded for punishing others, and punished for rewarding. [Tve82]

16.3  When the data don't fit the model

As part of a cross-species study of sleep behaviour, Allison and Cicchetti [AC76] presented a table of brain and body weights for 62 different species of land mammal. We show these data in Table 16.1. We have already shown a scatterplot of brain weights against body weights in Figure 15.2(e). It is clear from looking at the data that there is some connection between brain and body weights, but it is also clear that we have some difficulty in applying the ideas of this lecture. These are based, after all, on a model in which the variables are normally distributed, so that they are scattered about in some kind of approximately oval scatterplot. The correlation is supposed to represent a summary of the relation between all x values and the corresponding y's. Here, though, the high correlation (0.93) is determined almost entirely by the elephants, which have brain and body sizes far above the mean.

16.3.1  Transforming the data

One approach to understanding the correlation in this data set is illustrated in Figure 15.2(f), where we have plotted log of brain weight against log of body weight. The result now looks quite a bit like our standard regression scatter plots. We can compute that the correlation of these two measures is actually even a bit higher: 0.96. Thus, we can say that 92% (= 0.96²) of the variance in log brain weight is explained by log body weight. What is more, the regression line now seems to make some sense.
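
These computations are easy to carry out in R, where the data are available as the object mammals (in the MASS package), with columns body (in kg) and brain (in g):

    library(MASS)                                  # provides the 'mammals' data frame
    cor(mammals$body, mammals$brain)               # about 0.93, dominated by the elephants
    cor(log(mammals$body), log(mammals$brain))     # about 0.96
    plot(log(mammals$body), log(mammals$brain))    # now a standard oval scatter plot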

16.3.2  Spearman's Rank Correlation Coefficient

Taking logarithms may seem somewhat arbitrary. After all, there are a lot of ways we might have chosen to transform the data. Another approach to dealing with such blatantly nonnormal data is to follow the same approach that we have taken in all of our nonparametric methods: We replace the raw numbers by ranks. Important: The ranking takes place within a variable. We have shown in Table 16.1, in columns 3 and 5, what the ranks are: The highest body weight (the African elephant) gets 62, the next gets 61, and so on down to the lesser short-tailed shrew, which gets rank 1. Then we start over again with the brain weights. (When two or more individuals are tied, we average the ranks.) The correlation that we compute between the ranks is called Spearman's rank correlation coefficient, denoted r_s. It tells us quantitatively whether high values of one variable tend to go with high values of the other, without relying on assumptions of normality or otherwise being dominated by a few very extreme values. Since it is really just a correlation, we test it (for being different from 0) the same way we test any correlation, by comparing r_s √(n − 2) / √(1 − r_s²) with the t distribution with n − 2 degrees of freedom.
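
In R the rank correlation and its test can be obtained directly, again using the mammals data frame from the MASS package:

    library(MASS)
    rs <- cor(mammals$body, mammals$brain, method = "spearman")   # about 0.954
    n <- nrow(mammals)                                            # 62
    rs * sqrt(n - 2) / sqrt(1 - rs^2)       # compare with t on n - 2 = 60 df
    cor.test(mammals$body, mammals$brain, method = "spearman")    # packaged version

(With tied ranks, cor.test warns that it cannot compute an exact p-value and falls back on an approximation; the t statistic above is the version described in the text.)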

16.3.3  Computing Spearman's rank correlation coefficient

There is a slightly quicker way of computing the Spearman coefficient. We first list the rankings of the two variables in two parallel rows, and let d_i be the difference between the x_i ranking and the y_i ranking. We show this calculation in Table 16.2. We then have D = Σ d_i² = 1846.5, and hence the formula

    r_s = 1 − 6 Σ d_i² / (n(n² − 1)) = 1 − 6 × 1846.5 / (62 × (62² − 1)) = 0.954.

This is exactly what we get when we compute the correlation between the two lists of ranks, as described in section 15.4.
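
As a check, the shortcut formula can be computed in a few lines of R; rank(), like the text, averages tied ranks:

    library(MASS)
    d <- rank(mammals$body) - rank(mammals$brain)   # rank differences d_i
    D <- sum(d^2)                                   # about 1846.5
    1 - 6 * D / (62 * (62^2 - 1))                   # about 0.954

When there are ties, the shortcut formula agrees with the correlation of the two lists of ranks only approximately, but here the discrepancy is negligible.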

Species                       Body weight (kg)   Body rank   Brain weight (g)   Brain rank
African elephant                      6654.00          62            5712.00           62
African giant pouched rat                1.00          21               6.60           22
Arctic fox                               3.38          32              44.50           37
Arctic ground squirrel                   0.92          20               5.70           19
Asian elephant                        2547.00          61            4603.00           61
Baboon                                  10.55          42             179.50           50
Big brown bat                            0.02           4               0.30            3
Brazilian tapir                        160.00          53             169.00           47
Cat                                      3.30          31              25.60           35
Chimpanzee                              52.16          47             440.00           56
Chinchilla                               0.42          14               6.40           21
Cow                                    465.00          58             423.00           55
Desert hedgehog                          0.55          16               2.40           10
Donkey                                 187.10          54             419.00           54
Eastern American mole                    0.07           7               1.20            8
Echidna                                  3.00          30              25.00           34
European hedgehog                        0.79          18               3.50           14
Galago                                   0.20          12               5.00           17
Genet                                    1.41          25              17.50           32
Giant armadillo                         60.00          49              81.00           41
Giraffe                                529.00          60             680.00           59
Goat                                    27.66          44             115.00           44
Golden hamster                           0.12          10               1.00            6
Gorilla                                207.00          56             406.00           53
Gray seal                               85.00          51             325.00           52
Gray wolf                               36.33          46             119.50           45
Ground squirrel                          0.10           8               4.00           16
Guinea pig                               1.04          22               5.50           18
Horse                                  521.00          59             655.00           58
Jaguar                                 100.00          52             157.00           46
Kangaroo                                35.00          45              56.00           39
Lesser short-tailed shrew                0.01           1               0.14            1
Little brown bat                         0.01           2               0.25            2
Man                                     62.00          50            1320.00           60
Mole rat                                 0.12          11               3.00           13
Mountain beaver                          1.35          23               8.10           23
Mouse                                    0.02           4               0.40            5
Musk shrew                               0.05           5               0.33            4
N. American opossum                      1.70          27               6.30           20
Nine-banded armadillo                    3.50          34              10.80           24
Okapi                                  250.00          57             490.00           57
Owl monkey                               0.48          15              15.50           30
Patas monkey                            10.00          41             115.00           44
Phalanger                                1.62          26              11.40           25
Pig                                    192.00          55             180.00           51
Rabbit                                   2.50          29              12.10           26
Raccoon                                  4.29          39              39.20           36
Rat                                      0.28          13               1.90            9
Red fox                                  4.24          38              50.40           38
Rhesus monkey                            6.80          40             179.00           49
Rock hyrax (Hetero. b)                   0.75          17              12.30           28
Rock hyrax (Procavia hab)                3.60          35              21.00           33
Roe deer                                14.83          43              98.20           42
Sheep                                   55.50          48             175.00           48
Slow loris                               1.40          24              12.50           29
Star-nosed mole                          0.06           6               1.00            6
Tenrec                                   0.90          19               2.60           12
Tree hyrax                               2.00          28              12.30           28
Tree shrew                               0.10           9               2.50           11
Vervet                                   4.19          37              58.00           40
Water opossum                            3.50          34               3.90           15
Yellow-bellied marmot                    4.05          36              17.00           31

mean                                      199                            283
SD                                        899                            930

Table 16.1: Brain and body weights for 62 different land mammal species. Available at http://lib.stat.cmu.edu/datasets/sleep, and as the object mammals in the statistical language R.


body:   62  21  32  20  61  42   4  53  31  47  14  58  16  54   7  30  18  12  25  49  60
brain:  62  22  37  19  61  50   3  47  35  56  21  55  10  54   8  34  14  17  32  41  59
diff.:   0   1   5   1   0   8   0   6   4   9   7   3   6   0   1   4   4   5   7   8   1

body:   44  10  56  51  46   8  22  59  52  45   1   2  50  11  23   4   5  27  34  57  15
brain:  44   6  53  52  45  16  18  58  46  39   1   2  60  13  23   5   4  20  24  57  30
diff.:   0   4   3   1   1   8   4   1   6   6   0   0  10   2   0   2   1   7  10   0  15

body:   41  26  55  29  39  13  38  40  17  35  43  48  24   6  19  28   9  37  34  36
brain:  44  25  51  26  36   9  38  49  28  33  42  48  29   6  12  28  11  40  15  31
diff.:   2   1   4   3   3   4   0   9  10   2   1   0   5   0   7   0   2   3  18   5

Table 16.2: Ranks for body and brain weights for 62 mammal species, from Table 16.1, and the difference in ranks between body and brain weights.

Bibliography

[Abr78]     Sidney Abraham. Total serum cholesterol levels of children, 4-17 years, United States, 1971-1974. Number 207 in Vital and Health Statistics: Series 11, Data from the National Health Survey. National Center for Health Statistics, 1978.

[AC76]      Truett Allison and Domenic V. Cicchetti. Sleep in mammals: Ecological and constitutional correlates. Science, 194:732-4, November 12, 1976.

[BH84]      Joel G. Breman and Norman S. Hayner. Guillain-Barré Syndrome and its relationship to swine influenza vaccination in Michigan, 1976-1977. American Journal of Epidemiology, 119(6):880-9, 1984.

[Bia05]     Carl Bialik. When it comes to donations, polls don't tell the whole story. The Wall Street Journal, 2005.

[BS]        Norman R. Brown and Robert C. Sinclair. Estimating number of lifetime sexual partners: Men and women do it differently. http://www.ualberta.ca/~nrbrown/pubs/BrownSinclair1999.pdf.

[Edw58]     A. W. F. Edwards. An analysis of Geissler's data on the human sex ratio. Annals of Human Genetics, 23(1):6-15, 1958.

[Fel71]     William Feller. An Introduction to Probability and its Applications, volume 2. John Wiley & Sons, New York, 1971.

[FG00]      Boris Freidlin and Joseph L. Gastwirth. Should the median test be retired from general use? The American Statistician, 54(3):161-4, August 2000.

[FPP98]     David Freedman, Robert Pisani, and Roger Purves. Statistics. Norton, 3rd edition, 1998.

[Gal86]     Francis Galton. Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute, 15:246-63, 1886.

[Hal90]     Anders Hald. History of Probability and Statistics and their Applications before 1750. John Wiley & Sons, New York, 1990.

[HDL+94]    D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski. A Handbook of Small Data Sets. Chapman & Hall, 1994.

[HREG97]    Anthony Hill, Julian Roberts, Paul Ewings, and David Gunnell. Non-response bias in a lifestyle survey. Journal of Public Health, 19(2), 1997.

[LA98]      J. K. Lindsey and P. M. E. Altham. Analysis of the human sex ratio by using overdispersion models. Journal of the Royal Statistical Society, Series C (Applied Statistics), 47(1):149-157, 1998.

[Lev73]     P. H. Levine. An acute effect of cigarette smoking on platelet function. Circulation, 48:619-23, 1973.

[Lov71]     C. Owen Lovejoy. Methods for the detection of census error in palaeodemography. American Anthropologist, New Series, 73(1):101-9, February 1971.

[LSS73]     Judson R. Landis, Daryl Sullivan, and Joseph Sheley. Feminist attitudes as related to sex of the interviewer. The Pacific Sociological Review, 16(3):305-14, July 1973.

[MDF+81]    G. S. May, D. L. DeMets, L. M. Friedman, C. Furberg, and E. Passamani. The randomized clinical trial: bias in analysis. Circulation, 64:669-673, 1981.

[MM98]      David S. Moore and George P. McCabe. Introduction to the Practice of Statistics. W. H. Freeman, New York, 3rd edition, 1998.

[MMSR02]    Erik Lykke Mortensen, Kim Fleischer Michaelsen, Stephanie A. Sanders, and June Machover Reinisch. The association between duration of breastfeeding and adult intelligence. JAMA, 287(18):2365-71, May 8, 2002.

[Mou98]     Richard Francis Mould. Introductory Medical Statistics. CRC Press, 3rd edition, 1998.

[MS84]      Marc Mangel and Francisco J. Samaniego. Abraham Wald's work on aircraft survivability. Journal of the American Statistical Association, 79(386):259-267, 1984.

[NNGT02]    Mark J. Nieuwenhuijsen, Kate Northstone, Jean Golding, and the ALSPAC Study Team. Swimming and birth weight. Epidemiology, 13(6):725-8, 2002.

[Ric95]     John A. Rice. Mathematical Statistics and Data Analysis. Duxbury Press, 1995.

[RS90]      Luigi M. Ricciardi and Shunsuke Sato. Diffusion processes and first-passage-time problems. In Luigi M. Ricciardi, editor, Lectures in Applied Mathematics and Informatics, pages 206-285. Manchester University Press, Manchester, 1990.

[RS02]      Fred L. Ramsey and Daniel W. Schafer. The Statistical Sleuth: A Course in Methods of Data Analysis. Duxbury Press, 2nd edition, 2002.

[RSKG85]    S. Rosenbaum, R. K. Skinner, I. B. Knight, and J. S. Garrow. A survey of heights and weights of adults in Great Britain, 1980. Annals of Human Biology, 12(2):115-27, 1985.

[SCB06]     Emad Salib and Mario Cortina-Borja. Effect of month of birth on the risk of suicide. British Journal of Psychiatry, 188:416-22, 2006.

[SCT+90]    R. L. Suddath, G. W. Christison, E. F. Torrey, M. F. Casanova, and D. R. Weinberger. Anatomical abnormalities in the brains of monozygotic twins discordant for schizophrenia. New England Journal of Medicine, 322(12):789-94, March 22, 1990.

[SKK91]     L. Sheppard, A. R. Kristal, and L. H. Kushi. Weight loss in women participating in a randomized trial of low-fat diets. The American Journal of Clinical Nutrition, 54:821-8, 1991.

[TK81]      Amos Tversky and Daniel Kahneman. The framing of decisions and the psychology of choice. Science, 211(4481):453-8, January 30, 1981.

[TLG+98]    M. J. Tramo, W. C. Loftus, R. L. Green, T. A. Stukel, J. B. Weaver, and M. S. Gazzaniga. Brain size, head size, and intelligence quotient in monozygotic twins. Neurology, 50:1246-52, 1998.

[TPCE06]    Anna Thorson, Max Petzold, Nguyen Thi Kim Chuc, and Karl Ekdahl. Is exposure to sick or dead poultry associated with flulike illness? A population-based study from a rural area in Vietnam with outbreaks of highly pathogenic avian influenza. Archives of Internal Medicine, 166:119-23, 2006.

[Tve82]     Amos Tversky. On the psychology of prediction. In Daniel Kahneman, Paul Slovic, and Amos Tversky, editors, Judgment under Uncertainty: Heuristics and Biases. Cambridge University Press, 1982.

[vBFHL04]   Gerald van Belle, Lloyd D. Fisher, Patrick J. Heagerty, and Thomas S. Lumley. Biostatistics: A Methodology for the Health Sciences. Wiley-IEEE, 2nd edition, 2004.

[ZZK72]     Philip R. Zelazo, Nancy Ann Zelazo, and Sarah Kolb. "Walking" in the newborn. Science, 176(4032):314-5, April 21, 1972.
