Humansci 1
Probability Theory and Statistics for Psychology
and
Quantitative Methods for Human Sciences
David Steinsaltz
University of Oxford
(Lectures 1-8 based on an earlier version by Jonathan Marchini)
Contents

1 Describing Data
  1.1 Example: Designing experiments
  1.2 Variables
    1.2.1 Types of variables
    1.2.2 Ambiguous data types
  1.3 Plotting Data
    1.3.1 Bar Charts
    1.3.2 Histograms
    1.3.3 Cumulative and Relative Cumulative Frequency Plots and Curves
    1.3.4 Dot plot
    1.3.5 Scatter Plots
    1.3.6 Box Plots
  1.4 Summary Measures
    1.4.1 Measures of location (Measuring the center point)
    1.4.2 Measures of dispersion (Measuring the spread)
  1.5 Box Plots
  1.6 Appendix
    1.6.1 Mathematical notation for variables and samples
    1.6.2 Summation notation

2 Probability I
  2.1 Why do we need to learn about probability?
  2.2 What is probability?
    2.2.1 Definitions
    2.2.2 Calculating simple probabilities
    2.2.3 Example 2.3 continued
    2.2.4 Intersection
    2.2.5 Union
    2.2.6 Complement
  2.3 Probability in more general settings
    2.3.1 Probability Axioms (Building Blocks)
    2.3.2 Complement Law
    2.3.3 Addition Law (Union)

3 Probability II
  3.1 Independence and the Multiplication Law
  3.2 Conditional Probability Laws
    3.2.1 Independence of Events
    3.2.2 The Partition Law
  3.3 Bayes' Rule
  3.4 Probability Laws
  3.5 Permutations and Combinations (Probabilities of patterns)
    3.5.1 Permutations of n objects
    3.5.2 Permutations of r objects from n
    3.5.3 Combinations of r objects from n
  3.6 Worked Examples

4 The Binomial Distribution
  4.1 Introduction
  4.2 An example of the Binomial distribution
  4.3 The Binomial distribution
  4.4 The mean and variance of the Binomial distribution
  4.5 Testing a hypothesis using the Binomial distribution

5 The Poisson Distribution
  5.1 Introduction
  5.2 The Poisson Distribution
  5.3 The shape of the Poisson distribution
  5.4 Mean and Variance of the Poisson distribution
  5.5 Changing the size of the interval
  5.6 Sum of two Poisson variables
  5.7 Fitting a Poisson distribution
  5.8 Using the Poisson to approximate the Binomial
  5.9 Derivation of the Poisson distribution (non-examinable)
    5.9.1 Error bounds (very mathematical)

6 The Normal Distribution
  6.1 Introduction
  6.2 Continuous probability distributions
  6.3 What is the Normal Distribution?
  6.4 Using the Normal table
  6.5 Standardisation
  6.6 Linear combinations of Normal random variables
  6.7 Using the Normal tables backwards
  6.8 The Normal approximation to the Binomial
    6.8.1 Continuity correction
  6.9 The Normal approximation to the Poisson

[. . . ]

9 The χ² Test
  9.1 Introduction: Test statistics that aren't Z
  9.2 Goodness-of-Fit Tests
    9.2.1 The χ² distribution
    9.2.2 Large d.f.
  9.3 Fixed distributions
  9.4 Families of distributions
    9.4.1 The Poisson Distribution
    9.4.2 The Binomial Distribution
  9.5 Chi-squared Tests of Association

[. . . ]

16 Regression, Continued
  16.1 R²
    16.1.1 Example: Parent-Child heights
    16.1.2 Example: Breastfeeding and IQ
  16.2 Regression to the mean and the regression fallacy
  16.3 When the data don't fit the model
    16.3.1 Transforming the data
    16.3.2 Spearman's Rank Correlation Coefficient
    16.3.3 Computing Spearman's rank correlation coefficient
Lecture 1
Describing Data

Uncertain knowledge
+ knowledge about the extent of uncertainty in it
= Useable knowledge
C. R. Rao, statistician

As we know, there are known knowns. There are things we know we know. We also know there are known unknowns. That is to say we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know.
Donald Rumsfeld, US Secretary of Defense
1.1 Example: Designing experiments
If a newborn infant is held under his arms and his bare feet are permitted to touch a flat surface, he will perform well-coordinated walking movements similar to those of an adult [. . . ] Normally, the walking and placing reflexes disappear by about 8 weeks. [ZZK72] The question raised is whether exercising this reflex would enable children to acquire the ability to walk independently more quickly. How would we resolve this question?

Of course, we could perform an experiment. We could do these exercises with an infant, starting from when he or she was a newborn, and follow up every week for about a year, to find out when this baby starts walking. Suppose it is 10 months. Have we answered the question then?

The obvious problem, then, is that we don't know at what age this baby would have started walking without the exercise. One solution would be to take another infant, observe this one at the same weekly intervals without doing any special exercises, and see which one starts walking first. We call this other infant the control. Suppose this one starts walking aged 11.50 months (that is, 11 months and 2 weeks). Now, have we answered the question?

It is clear that we're still not done, because children start walking at all different ages. It could be that we happened to pick a slow child for the exercises, and a particularly fast-developing child for the control. How can we resolve this?

Obviously, the first thing we need to do is to understand how much variability there is in the age at first walking, without imposing an exercise regime. For that, there is no alternative to looking at multiple infants. Here several questions must be considered:
How many?
How do we summarise the results of multiple measurements?
How do we answer the original question: do the special exercises make the children learn to walk sooner?
In the original study, the authors had six infants in the treatment group (the formal name for the ones who received the exercise, also called the experimental group), and six in the control group. (In fact, they had a second control group, which was subject to an alternative exercise regime. But that's a complication for a later date.) The results are tabulated in Table 1.1. We see that most of the treatment children did start walking earlier than most of the control children. But not all. The slowest child from the treatment group in fact started walking later than four of the six control children. Should we still be convinced that the treatment is effective? If not, how many more subjects do we need before we can be confident? How would we decide?
    Treatment   Control
      9.00       11.50
      9.50       12.00
      9.75        9.00
     10.00       11.50
     13.00       13.25
      9.50       13.00

Table 1.1: Age (in months) at which infants were first able to walk independently. Data from [ZZK72].
The answer is, we can't know for sure. The results are consistent with believing that the treatment had an effect, but they are also consistent with believing that we happened to get a particularly slow group of treatment children, or a fast group of control children, purely by chance. What we need now is a formal way of looking at these results, to tell us how to draw conclusions from data ("The exercise helped children walk sooner") and how properly to estimate the confidence we should have in our conclusions ("How likely is it that we might have seen a similar result purely by chance, if the exercise did not help?"). We will use graphical tools, mathematical tools, and logical tools.
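As a first, informal look at the data, one can simply compute each group's mean and median. The short Python sketch below is an added illustration (not part of the original notes); it reads the Table 1.1 values into two lists and summarises them with the standard library.

```python
# Quick numerical summary of Table 1.1 (age at first walking, in months)
from statistics import mean, median

treatment = [9.00, 9.50, 9.75, 10.00, 13.00, 9.50]
control   = [11.50, 12.00, 9.00, 11.50, 13.25, 13.00]

print(mean(treatment), median(treatment))   # 10.125  9.625
print(mean(control),   median(control))     # 11.708  11.75
```

The treatment group looks a little faster on average, but as the text explains, a summary alone cannot tell us whether the difference is more than chance.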
1.2 Variables
The datasets that Psychologists and Human Scientists collect will usually
consist of one or more observations on one or more variables.
A variable is a property of an object or event that can take on different values.
For example, suppose we collect a dataset by measuring the hair colour,
resting heart rate and score on an IQ test of every student in a class. The
variables in this dataset would then simply be hair colour, resting heart
rate and score on an IQ test, i.e. the variables are the properties that we
measured/observed.
  Quantitative
    Discrete (counts): number of offspring; size of vocabulary at 18 months
    Continuous: height; weight; tumour mass; brain volume
  Qualitative
    Ordinal: birth order (firstborn, etc.); degree classification; "How offensive is this odour?" (1 = not at all, 5 = very)
    Nominal
      Binary: smoking (yes/no); sex (M/F); place of birth (home/hospital)
      Non-binary: hair colour; ethnicity; cause of death

Figure 1.1: A summary of the different data types with some examples.
1.2.1 Types of variables
It is also useful to make the distinction between Discrete and Continuous variables (see Figure 1.2). Discrete variables, such as number of
children in a family, or number of peas in a pod, can take on only a limited
set of values. (Categorical variables are, of course, always discrete.) Continuous variables, such as height and weight, can take on (in principle) an
unlimited set of values.
[Figure 1.2: illustration of discrete data versus continuous data (such as the value 5.67).]
1.2.2 Ambiguous data types
The distinction between the data types described in Section 1.2.1 is not always clear-cut. Sometimes the type isn't inherent to the data, but depends on how you choose to look at it. Consider the experiment described in Section 1.1. Think about how the results may have been recorded in the lab notebooks. For each child, it was recorded which group (treatment or control) the child was assigned to, which is clearly a (binary) categorical variable. Then, you might find an entry for each week, recording the result of the walking test: yes or no, the child could or couldn't walk that week. In principle, this is a long sequence of categorical variables. However, it would [. . . ]
1.3 Plotting Data
    Time (min)  Sex  Weight (g)     Time (min)  Sex  Weight (g)
        5        F     3837            847       F     3480
       64        F     3334            873       F     3116
       78        M     3554            886       F     3428
      115        M     3838            914       M     3783
      177        M     3625            991       M     3345
      245        F     2208           1017       M     3034
      247        F     1745           1062       F     2184
      262        M     2846           1087       M     3300
      271        M     3166           1105       F     2383
      428        M     3520           1134       M     3428
      455        M     3380           1149       M     4162
      492        M     3294           1187       M     3630
      494        F     2576           1189       M     3406
      549        F     3208           1191       M     3402
      635        M     3521           1210       F     3500
      649        F     3746           1237       M     3736
      653        F     3523           1251       M     3370
      693        M     2902           1264       M     2121
      729        M     2635           1283       M     3150
      776        M     3920           1337       F     3866
      785        M     3690           1407       F     3542
      846        F     3430           1435       F     3278

Table 1.2: The baby-boom dataset: time of birth (min), sex and birth weight (g) for 44 babies.
1.3.1 Bar Charts
A Bar Chart is a useful method of summarising Categorical Data. We represent the counts/frequencies/percentages in each category by a bar. Figure
1.3 is a bar chart of gender for the baby-boom dataset. Notice that the bar
chart has its axes clearly labelled.
Figure 1.3: A Bar Chart showing the gender distribution in the baby-boom dataset.
1.3.2 Histograms
An analogy: A Bar Chart is to Categorical Data as a Histogram is to Measurement Data.

A histogram shows us the distribution of the numbers along some scale.
A histogram is constructed in the following way:
Divide the measurements into intervals (sometimes called bins);
Determine the number of measurements within each category;
Draw a bar for each category whose height represents the count in that category.
The art in constructing a histogram is how to choose the number of bins
and the boundary points of the bins. For small datasets, it is often feasible
to simply look at the values and decide upon sensible boundary points.
For the baby-boom dataset we can draw a histogram of the birth weights (Figure 1.4). To draw the histogram I found the smallest and largest values:
smallest = 1745        largest = 4162
and chose the intervals 1500-2000, 2000-2500, 2500-3000, 3000-3500, 3500-4000 and 4000-4500.

Using these categories works well: the histogram shows us the shape of the distribution, and we notice that the distribution has an extended left tail.

Figure 1.4: A Histogram showing the birth weight distribution in the baby-boom dataset.
Too few categories and the details are lost. Too many categories and the
overall shape is obscured by haphazard details (see Figure 1.5).
Figure 1.5: Histograms with too few and too many categories respectively.

In Figure 1.6 we show some examples of the different shapes that histograms can take. One can learn quite a lot about a set of data by looking just at the shape of the histogram. For example, Figure 1.6(c) shows the percentage of the tuberculosis drug isoniazid that is acetylated in the livers of 323 patients after 8 hours. Unacetylated isoniazid remains longer in the blood, and can contribute to toxic side effects. It is interesting, then, to notice that there is a wide range of rates of acetylation, from patients who acetylate almost all, to those who acetylate barely one fourth of the drug in 8 hours. Note that there are two peaks (this kind of distribution is called bimodal), which points to the fact that there is a subpopulation that lacks a functioning copy of the relevant gene for efficiently carrying through this reaction.
So far, we have taken the bins to all have the same width. Sometimes we might choose to have unequal bins, and more often we may be forced to have unequal bins by the way the data are delivered. For instance, suppose we did not have the full table of data, but were only presented with the following table (Table 1.3). What is the best way to make a histogram from these data?

    Bin           Number of Births
    1500-2500g           5
    2500-3000g           4
    3000-3500g          19
    3500-4500g          16

Table 1.3: Birth weights from the baby-boom dataset, grouped into unequal bins.
We could just plot rectangles whose heights are the frequencies. We then end up with the picture in Figure 1.7(a). Notice that the shape has changed substantially, owing to the large boxes that correspond to the widened bins. In order to preserve the shape (which is the main goal of a histogram) we want the area of a box to correspond to the contents of the bin, rather than the height. Of course, this is the same when the bin widths are equal. Otherwise, we need to switch from the frequency scale to the density scale, in which the height of a box is not the number of observations in the bin, but the number of observations per unit of measurement. This gives us the picture in Figure 1.7(b), which has a very similar shape to the histogram with equal bin widths.

Figure 1.6: Examples of histogram shapes. (a) Left-skewed: weights from the baby-boom data set. (b) Right-skewed: 1999 California household incomes (from www.census.gov). (c) Bimodal: percentage of isoniazid acetylated in 8 hours. (d) Bell-shaped: serum cholesterol of 10-year-old boys [Abr78].

Figure 1.7: The same data plotted in frequency scale and in density scale (babies per gram). Note that the density-scale histogram has the same shape as the plot from the data with equal bin widths.
Thus, for the data in Table 1.3 we would calculate the height of the first rectangle as

    Density = Number of births / width of bin = 5 babies / 1000 g = 0.005.

The complete computations are given in Table 1.4, and the resulting histogram is in Figure 1.7(b).
    Bin           Number of Births    Density
    1500-2500g           5             0.005
    2500-3000g           4             0.008
    3000-3500g          19             0.038
    3500-4500g          16             0.016

Table 1.4: Density-scale calculations for the binned birth-weight data.
To draw a histogram in the density scale:
Divide the measurements into bins (which need not have equal widths);
Determine the number of measurements within each bin (Note: the number can also be a percentage. Often the exact numbers are unavailable, but you can simply act as though there were 100 observations);
For each bin, compute the density, which is simply the number of observations divided by the width of the bin;
Draw a bar for each bin whose height represents the density in that bin. The area of the bar will then correspond to the number of observations in the bin.
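The short Python sketch below (an added illustration, not from the original notes) carries out the density calculation of Table 1.4 for the unequal bins of Table 1.3.

```python
# Density-scale bar heights: density = count / bin width,
# so the area of each bar equals the number of observations in the bin.
bins   = [(1500, 2500), (2500, 3000), (3000, 3500), (3500, 4500)]   # grams
counts = [5, 4, 19, 16]

for (lo, hi), n in zip(bins, counts):
    density = n / (hi - lo)          # babies per gram
    print(f"{lo}-{hi}g: density = {density:.3f}")
# 0.005, 0.008, 0.038, 0.016  (as in Table 1.4)
```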
1.3.3 Cumulative and Relative Cumulative Frequency Plots and Curves
A cumulative frequency plot is very similar to a histogram. In a cumulative frequency plot the height of the bar in each interval represents the total count of observations within that interval and all lower intervals (see Figure 1.8).

In a cumulative frequency curve the cumulative frequencies are plotted as points at the upper boundaries of each interval. It is usual to join up the points with straight lines (see Figure 1.8).

Relative cumulative frequencies are simply cumulative frequencies divided by the total number of observations (so relative cumulative frequencies always lie between 0 and 1). Thus relative cumulative frequency plots and curves just use relative cumulative frequencies rather than cumulative frequencies. Such plots are useful when we wish to compare two or more distributions on the same scale.
Consider the histogram of birth weight shown in Figure 1.4. The frequencies,
cumulative frequencies and relative cumulative frequencies of the intervals
are given in Table 1.5.
    Interval     Frequency   Cumulative Frequency   Relative Cumulative Frequency
    1500-2000        1                1                       0.023
    2000-2500        4                5                       0.114
    2500-3000        4                9                       0.205
    3000-3500       19               28                       0.636
    3500-4000       15               43                       0.977
    4000-4500        1               44                       1.000

Table 1.5: Frequencies, cumulative frequencies and relative cumulative frequencies for the birth-weight intervals.

Figure 1.8: Cumulative frequency curve and plot of birth weights for the baby-boom dataset.

1.3.4 Dot plot

A Dot Plot is a simple and quick way of visualising a dataset. This type of plot is especially useful if data occur in groups and you wish to quickly visualise the differences between the groups. For example, Figure 1.9 shows
a dot plot of birth weights grouped by gender for the baby-boom dataset.
The plot suggests that girls may be lighter than boys at birth.
Figure 1.9: A Dot Plot showing the birth weights grouped by gender for the baby-boom dataset.
1.3.5 Scatter Plots
Scatter plots are useful when we wish to visualise the relationship between
two measurement variables.
To draw a scatter plot we
Assign one variable to each axis.
Plot one point for each pair of measurements.
For example, we can draw a scatter plot to examine the relationship between
birth weight and time of birth (Figure 1.10). The plot suggests that there
is little relationship between birth weight and time of birth.
Figure 1.10: A Scatter Plot of birth weights versus time of birth for the baby-boom dataset.
1.3.6 Box Plots
Box Plots are probably the most sophisticated type of plot we will consider.
To draw a Box Plot we need to know how to calculate certain summary
measures of the dataset covered in the next section. We return to discuss
Box Plots in Section 1.5.
1.4 Summary Measures
In the previous section we saw how to use various graphical displays in order to explore the structure of a given dataset. From such plots we were able to observe the general shape of the distribution of a given dataset and compare visually the shape of two or more datasets.

Consider the histograms in Figure 1.11. Comparing the first and second histograms, we see that the distributions have the same shape or spread but that the center of the distribution is different: roughly, by eye, the centers differ in value by about 10. Comparing the first and third histograms, we see [. . . ] In later lectures we will learn how to go a stage further and test whether two variables have the same center point.
1.4.1 Measures of location (Measuring the center point)

The Mode is the most frequently occurring value in the dataset. Plotting the data as a histogram, we see that the mode is the peak of the distribution and is a reasonable representation of the center of the data. If we wish to calculate the mode of continuous data, one strategy is to group the data into adjacent intervals and choose the modal interval, i.e. draw a histogram and take the modal peak. This method is sensitive to the choice of intervals, and so care should be taken so that the histogram provides a good representation of the shape of the distribution.

The Mode has the advantage that it is always a score that actually occurred, and it can be applied to nominal data, properties not shared by the median and mean. A disadvantage of the mode is that there may be two or more values that share the largest frequency. In the case of two modes we would report both and refer to the distribution as bimodal.
The Median can be thought of as the middle value, i.e. the value below which 50% of the data fall when arranged in numerical order. For example, consider the numbers

    15, 3, 9, 21, 1, 8, 4.

When arranged in numerical order,

    1, 3, 4, 8, 9, 15, 21,

we see that the median value is 8. If there were an even number of scores, e.g.

    1, 3, 4, 8, 9, 15,

then we take the midpoint of the two middle values; in this case the median is (4 + 8)/2 = 6. In general, if we have N data points then the median location is defined as

    Median Location = (N + 1)/2.

The Mean is the sum of the observations divided by the number of observations. For the six numbers above, the mean is

    (1 + 3 + 4 + 8 + 9 + 15)/6 = 6.667   (to 3 dp).
(See the appendix for a brief description of the summation notation Σ.)

The mean is the most widely used measure of location. Historically, this is because statisticians can write down equations for the mean and derive nice theoretical properties for the mean, which are much harder for the mode and median. A disadvantage of the mean is that it is not resistant to outlying observations. For example, the mean of

    1, 3, 4, 8, 9, 15, 99999

is 14291.3, whereas the median (from above) is still 8.
Sometimes discrete measurement data are presented in the form of a frequency table in which the frequencies of each value are given. Remember,
the mean is the sum of the data divided by the number of observations.
To calculate the sum of the data we simply multiply each value by its frequency and sum. The number of observations is calculated by summing the
frequencies.
For example, consider the following frequency table (Table 1.6):

    Data (x)        1   2   3   4   5   6
    Frequency (f)   2   4   6   7   4   1

The mean is

    x̄ = Σfx / Σf = 82/24 = 3.42   (2 dp).
Figure 1.12: The relationship between the mean, median and mode for symmetric, positively skewed and negatively skewed distributions. For a symmetric distribution, mean = median = mode.
1.4.2 Measures of dispersion (Measuring the spread)

[. . . ] The 2nd quartile is the 50% point of the dataset, i.e. the median.

[Figure: the inter-quartile range (IQR) is the distance between the 25% point (1st quartile) and the 75% point (3rd quartile) of the distribution.]
    Data     Deviations       |Deviations|    Deviations²
     x        x − x̄            |x − x̄|        (x − x̄)²
    10       10 − 48 = −38        38             1444
    15       15 − 48 = −33        33             1089
    18       18 − 48 = −30        30              900
    33       33 − 48 = −15        15              225
    34       34 − 48 = −14        14              196
    36       36 − 48 = −12        12              144
    51       51 − 48 =   3         3                9
    73       73 − 48 =  25        25              625
    80       80 − 48 =  32        32             1024
    86       86 − 48 =  38        38             1444
    92       92 − 48 =  44        44             1936
    Sum: Σx = 528   Σ(x − x̄) = 0   Σ|x − x̄| = 284   Σ(x − x̄)² = 9036

Table 1.7: Calculating the deviations from the mean (x̄ = 528/11 = 48).

The mean absolute deviation is

    MAD = Σ|x − x̄| / n = 284/11 = 25.818   (to 3 dp).
Another way of measuring the spread is to consider the mean of the squared deviations, called the variance.

If our dataset consists of the whole population (a rare occurrence) then we can calculate the population variance σ² (said "sigma squared") as

    σ² = Σ from i=1 to n of (xᵢ − x̄)² / n,   or simply   σ² = Σ(x − x̄)² / n.

When we just have a sample from the population (most of the time) we can calculate the sample variance s² as

    s² = Σ from i=1 to n of (xᵢ − x̄)² / (n − 1),   or simply   s² = Σ(x − x̄)² / (n − 1).

Note: We divide by n − 1 when calculating the sample variance because s² is then a better estimate of the population variance σ² than if we had divided by n. We will see why later.
For frequency data (see Table 1.6) the formula is given by

    s² = Σ fᵢ(xᵢ − x̄)² / (Σ fᵢ − 1),   or simply   s² = Σ f(x − x̄)² / (Σf − 1).

We can calculate s² in the following way (see Table 1.7):
Calculate the deviations by subtracting the mean from each value, x − x̄;
Square the deviations, (x − x̄)²;
Sum the squared deviations and divide by n − 1:

    s² = 9036 / (11 − 1) = 903.6.
For this dataset, then, MAD = 25.818 and s² = 903.6.
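A short Python sketch (added here for illustration; the variable names are arbitrary) reproduces the Table 1.7 calculations.

```python
# Mean, mean absolute deviation and sample variance for the Table 1.7 data
data = [10, 15, 18, 33, 34, 36, 51, 73, 80, 86, 92]
n = len(data)                                          # 11
xbar = sum(data) / n                                   # 528 / 11 = 48
mad = sum(abs(x - xbar) for x in data) / n             # 284 / 11  = 25.818
s2  = sum((x - xbar) ** 2 for x in data) / (n - 1)     # 9036 / 10 = 903.6
print(xbar, round(mad, 3), s2)
```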
1.5 Box Plots
All points that lie outside the whiskers are plotted individually as outlying observations.

Figure 1.15: A Box Plot of birth weights for the baby-boom dataset, showing the main features of the plot: the median, the 1st and 3rd quartiles, the lower and upper whiskers, and the outliers.
Plotting box plots of measurements in different groups side by side can be
illustrative. For example, Figure 1.16 shows box plots of birth weight for
each gender side by side and indicates that the distributions have quite
different shapes.
Figure 1.16: A Box Plot of birth weights by gender for the baby-boom dataset.

Box plots are particularly useful for comparing multiple (but not very many!) distributions. Figure 1.17 shows data from 14 years of the total
number of births each day (5113 days in total) in Quebec hospitals. By
summarising the data in this way, it becomes clear that there is a substantial
difference between the numbers of births on weekends and on weekdays. We
see that there is a wide variety of numbers of births, and considerable overlap
among the distributions, but the medians for the weekdays are far outside
the range of most of the weekend numbers.
[Figure 1.17: Box plots of the number of births per day in Quebec hospitals, by day of week (Sun-Sat).]
1.6 Appendix

1.6.1 Mathematical notation for variables and samples

[. . . ]
1.6.2 Summation notation

The summation sign Σ is shorthand for adding up a list of values. For a sample of five observations x₁, . . . , x₅,

    Σ from i=1 to 5 of xᵢ = x₁ + x₂ + x₃ + x₄ + x₅.

If the observations are 3, 2, 1, 7 and 6, then

    Σ from i=1 to 5 of xᵢ = 3 + 2 + 1 + 7 + 6 = 19.

If the limits of the summation are obvious within context the notation is often abbreviated to

    Σx = 19.
Lecture 2
Probability I

In this and the following lecture we will learn about:
why we need to learn about probability;
what probability is;
how to assign probabilities;
how to manipulate probabilities and calculate probabilities of complex events.
2.1 Why do we need to learn about probability?
Figure 2.1: The scientific process and the role of statistics in this process.
                   Treatment (anturane)    Control (placebo)
    # patients            813                    816
    deaths                 74                     89
    % mortality           9.1%                  10.9%
Probability I
Imagine the following dialogue:
Drug Company: Every hospital needs to use anturane. It saves patients lives.
Skeptical Bureaucrat: The effect looks pretty small:
15 out of about 800 patients. And the drug is pretty
expensive.
DC: Is money all you bean counters can think about?
We reduced mortality by 16%.
SB: It was only 2% of the total.
DC: We saved 2% of the patients! What if one of them
was your mother?
SB: Im all in favour of saving 2% more patients. Im
just wondering: You flipped a coin to decide which
patients would get the anturane. What if the coins
had come up differently? Might we just as well be
here talking about how anturane had killed 2% of the
patients?
How can we resolve this argument? 163 patients died. Suppose
anturane has no effect. Could the apparent benefit of anturane
simply reflect the random way the coins fell? Or would such a
series of coin flips have been simply too unlikely to countenance?
To answer this question, we need to know how to measure the
likelihood (or probability) of sequences of coin flips.
Imagine a box, with cards in it, each one having written on it
one way in which the coin flips could have come out, and the patients allocated to treatments. How many of those coinflip cards
would have given us the impression that anturane performed
well, purely because many of the patients who died happened to
end up in the Control (placebo) group? It turns out that its
more than 20% of the cards, so its really not very unlikely at
all.
To figure this out, we are going to need to understand
1. How to enumerate all the ways the coins could come up.
How many ways are there? The number depends on the
exact procedure, but if we flip one coin for each patient, the
number of cards in the box would be 21629 , which is vastly
37
38
Probability I
more than the number of atoms in the universe. Clearly,
we don't want to have to count up the cards individually.
2. How coin flips get associated with a result, as measured in
apparent success or failure of anturane. Since the number
of cards is so large, we need to do this without having to
go through the results one by one.
2.2 What is probability?

The examples we have discussed so far look very complicated. They aren't really, but in order to see the simple underlying structure, we need to introduce a few new concepts. To do so, we want to work with a much simpler example: [. . . ]
2.2.1 Definitions

The example above introduced some terminology that we will use repeatedly when we talk about probabilities.
An experiment is some activity with an observable outcome.
The set of all possible outcomes of the experiment is called the sample space.
A particular outcome is called a sample point.
A collection of possible outcomes is called an event.
2.2.2 Calculating simple probabilities

When all the outcomes in the sample space S are equally likely, the probability of an event A is the number of sample points in A divided by the total number of sample points:

    P(A) = |A| / |S|.
2.2.3 Example 2.3 continued

Take the sample space S = {1, 2, 3, 4, 5, 6} of outcomes of rolling a die, and consider the events A1 = {2, 4, 6} and A2 = {5, 6}. Then

    P(A1) = |A1| / |S| = 3/6 = 1/2,
    P(A2) = |A2| / |S| = 2/6 = 1/3.

2.2.4 Intersection

The intersection A1 ∩ A2 is the event that both A1 and A2 occur. Here A1 ∩ A2 = {6}, so

    P(A1 ∩ A2) = |A1 ∩ A2| / |S| = 1/6.

2.2.5 Union

The union A1 ∪ A2 is the event that A1 or A2 (or both) occurs. Here A1 ∪ A2 = {2, 4, 5, 6}, so

    P(A1 ∪ A2) = |A1 ∪ A2| / |S| = 4/6 = 2/3.

2.2.6 Complement

The complement A1ᶜ is the event that A1 does not occur. Here A1ᶜ = {1, 3, 5}, so

    P(A1ᶜ) = |A1ᶜ| / |S| = 3/6 = 1/2.
2.3 Probability in more general settings
In many settings, either the sample space is infinite or all possible outcomes
of the experiment are not equally likely. We still wish to associate probabilities with events of interest. Luckily, there are some rules/laws that allow
us to calculate and manipulate such probabilities with ease.
2.3.1 Probability Axioms (Building Blocks)

Before we consider the probability rules we need to know about the axioms (or mathematical building blocks) upon which these rules are built. There are three axioms which we need in order to develop our laws.

(i). For any event A, 0 ≤ P(A) ≤ 1.

(ii). P(S) = 1.
This axiom says that the probability of everything in the sample space is 1. It says that the sample space is complete, and that there are no sample points or events outside the sample space that can occur in our experiment.

(iii). If A1, A2, . . . are mutually exclusive events, then P(A1 ∪ A2 ∪ . . .) = P(A1) + P(A2) + . . .
A set of events are mutually exclusive if at most one of the events can occur in a given experiment. This axiom says that to calculate the probability of the union of distinct events we can simply add their individual probabilities.
2.3.2 Complement Law

If A is an event, the set of all outcomes that are not in A is called the complement of the event A, denoted Aᶜ (pronounced "A complement"). The rule is

    P(Aᶜ) = 1 − P(A).   (Complement Law)
2.3.3 Addition Law (Union)

Suppose
    A = the event that a randomly selected student from the class has brown eyes,
    B = the event that a randomly selected student from the class has blue eyes.
What is the probability that a student has brown eyes OR blue eyes? This is the union of the two events A and B, denoted A ∪ B (pronounced "A or B"). We want to calculate P(A ∪ B). In general, for two events,

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B).   (Addition Law)
Example: Suppose that 70% of a set of SNPs show variation in an African population, 80% show variation in an Asian population and 60% show variation in both the African and Asian populations. Suppose one such SNP is chosen at random; what is the probability that it is variable in either the African or the Asian population?

Write A for the event that the SNP is variable in Africa, and B for the event that it is variable in Asia. We are told

    P(A) = 0.7,   P(B) = 0.8,   P(A ∩ B) = 0.6.

By the Addition Law,

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.7 + 0.8 − 0.6 = 0.9.
Lecture 3
Probability II

3.1 Independence and the Multiplication Law

If the probability that one event A occurs doesn't affect the probability that the event B also occurs, then we say that A and B are independent. For example, it seems clear that one coin doesn't know what happened to the other one (and if it did know, it wouldn't care), so if A1 is the event that the first coin comes up heads, and A2 the event that the second coin comes up heads, then A1 and A2 are independent, and

    P(A1 ∩ A2) = P(A1) × P(A2) = 1/2 × 1/2 = 1/4.
For an example based on a single experiment, roll a die twice. The sample space is

    (1,1) (1,2) (1,3) (1,4) (1,5) (1,6)
    (2,1) (2,2) (2,3) (2,4) (2,5) (2,6)
    (3,1) (3,2) (3,3) (3,4) (3,5) (3,6)
    (4,1) (4,2) (4,3) (4,4) (4,5) (4,6)
    (5,1) (5,2) (5,3) (5,4) (5,5) (5,6)
    (6,1) (6,2) (6,3) (6,4) (6,5) (6,6)

There are 36 points in the sample space. These are all equally likely. Thus, each point has probability 1/36. Consider the events

    A = {First roll is even},
    B = {Second roll is bigger than 4},
    A ∩ B = {First roll is even and Second roll is bigger than 4}.

From Figure 3.1, P(A) = 18/36 = 1/2, P(B) = 12/36 = 1/3 and P(A ∩ B) = 6/36 = 1/6 = P(A) × P(B), so A and B are independent.
Figure 3.1: Events A = {First roll is even} and B = {Second roll is bigger than 4}.
Thus, two (or more) coin flips are always independent. But this is also relevant to analysing experiments such as those of Example 2.1. If the drug has no effect on survival, then events like {patient # 612 survived} are independent of events like {patient # 612 was allocated to the control group}.

Now let C = {Sum of the two rolls is bigger than 8}. We see from Figure 3.2 that P(C) = 10/36, and P(A ∩ C) = 6/36 ≠ 10/36 × 1/2, so A and C are not independent. On the other hand, if we replace C by D = {Sum of the two rolls is exactly 9}, then we see from Figure 3.3 that P(D) = 4/36 = 1/9, and P(A ∩ D) = 2/36 = 1/18 = 1/9 × 1/2, so the events A and D are independent. We see that events may be independent, even if they are not based on separate experiments.
Figure 3.2: Events A = {First roll is even} and C = {Sum is bigger than 8}.

Figure 3.3: Events A = {First roll is even} and D = {Sum is exactly 9}.
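These independence checks can be verified by brute-force enumeration of the 36 equally likely outcomes. The Python sketch below is an added illustration, not part of the original notes.

```python
# Enumerate the two-roll sample space and compare P(A and E) with P(A)P(E)
from itertools import product

S = list(product(range(1, 7), repeat=2))      # all (first, second) rolls
A = {s for s in S if s[0] % 2 == 0}           # first roll even
B = {s for s in S if s[1] > 4}                # second roll bigger than 4
C = {s for s in S if sum(s) > 8}              # sum bigger than 8
D = {s for s in S if sum(s) == 9}             # sum exactly 9

def P(E):
    return len(E) / len(S)

print(P(A & B), P(A) * P(B))   # 1/6 = 1/6   -> A and B independent
print(P(A & C), P(A) * P(C))   # 1/6 != 5/36 -> A and C not independent
print(P(A & D), P(A) * P(D))   # 1/18 = 1/18 -> A and D independent
```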
3.2 Conditional Probability Laws
Suppose
    A = the event that a randomly selected student from the class has a bike,
    B = the event that a randomly selected student from the class has blue eyes.
The conditional probability of B given A, written P(B|A), is the probability that the student has blue eyes given that we already know they have a bike. It is given by

    P(B|A) = P(A ∩ B) / P(A).
We have that

    P(A) = 0.7,   P(B) = 0.8,   P(A ∩ B) = 0.6.

We want

    P(A|B) = P(A ∩ B) / P(B) = 0.6 / 0.8 = 0.75.

We can rearrange the conditional probability law to obtain a general Multiplication Law:

    P(B|A) = P(A ∩ B) / P(A)   so   P(B|A) P(A) = P(A ∩ B).

Similarly,

    P(A|B) P(B) = P(A ∩ B).
3.2.1 Independence of Events

Note that in this case (provided P(B) > 0), if A and B are independent then

    P(A|B) = P(A ∩ B) / P(B) = P(A) P(B) / P(B) = P(A),

i.e. knowing that B occurred does not change the probability of A.

[. . . ] = P(S ∩ A) / P(A) = 0.2 / 0.45 = 0.44.
3.2.2 The Partition Law

The partition law is a very useful rule that allows us to calculate the probability of an event by splitting it up into a number of mutually exclusive events. For example, suppose we know that P(A ∩ B) = 0.52 and P(A ∩ Bᶜ) = 0.14; what is P(A)?

P(A) is made up of two parts: (i) the part of A contained in B, and (ii) the part of A contained in Bᶜ. These are mutually exclusive, so

    P(A) = P(A ∩ B) + P(A ∩ Bᶜ) = 0.52 + 0.14 = 0.66.
3.3 Bayes' Rule

One of the most common situations in science is that we have some observations, and we need to figure out what state of the world is likely to have produced those observations. For instance, we observe that a certain number of vaccinated people contract polio, and a certain number of unvaccinated people contract polio, and we need to figure out how effective the vaccine is. The problem is, our theoretical knowledge goes in the wrong direction: it tells us how many people will contract polio if the vaccine has a given effectiveness. Bayes' Rule allows us to turn the inference around.
    P(B|A) = P(A|B) P(B) / P(A).   (Bayes' Rule)

[. . . ] = (0.12 × 0.01) / 0.03 = 0.04.
Example: A disease comes in three forms, B1, B2 (severe) and B3 (lethal), and a certain gene comes in two forms, A1 and A2, carried by 75% and 25% of the population respectively. A1 is a protective gene, with the probabilities of having the three forms of the disease given A1 being 0.9, 0.1 and 0 respectively. People with A2 are unprotected and have the three forms with probabilities 0, 0.5 and 0.5 respectively.

What is the probability that a person has gene A1 given they have the severe disease?

The first thing to do with such a question is decode the information, i.e. write it down in a compact form we can work with:

    P(A1) = 0.75,      P(A2) = 0.25,
    P(B2|A1) = 0.1,    P(B3|A1) = 0,
    P(B2|A2) = 0.5,    P(B3|A2) = 0.5.

Then

    P(B2) = P(B2|A1) P(A1) + P(B2|A2) P(A2)   (Partition Law, Multiplication Law)
          = 0.1 × 0.75 + 0.5 × 0.25 = 0.2,

and by Bayes' Rule

    P(A1|B2) = P(B2|A1) P(A1) / P(B2) = (0.1 × 0.75) / 0.2 = 0.375.
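The same calculation in a short Python sketch (added here for illustration):

```python
# Bayes' Rule for the gene/disease example
p_A1, p_A2 = 0.75, 0.25
p_B2_given_A1, p_B2_given_A2 = 0.1, 0.5      # probability of the severe form

p_B2 = p_B2_given_A1 * p_A1 + p_B2_given_A2 * p_A2   # 0.2   (Partition Law)
p_A1_given_B2 = p_B2_given_A1 * p_A1 / p_B2          # 0.375 (Bayes' Rule)
print(p_B2, p_A1_given_B2)
```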
3.4 Probability Laws

    P(Aᶜ) = 1 − P(A)                          (Complement Law)
    P(A ∪ B) = P(A) + P(B) − P(A ∩ B)         (Addition Law)
    P(A ∩ B) = P(B|A) P(A) = P(A|B) P(B)      (Multiplication Law)
    P(B|A) = P(A|B) P(B) / P(A)               (Bayes' Rule)

3.5 Permutations and Combinations (Probabilities of patterns)

3.5.1 Permutations of n objects
Consider 2 objects, A and B.
Q. How many ways can they be arranged? i.e. how many permutations are there?
A. 2 ways: AB, BA.

Consider 3 objects, A, B and C. There are 6 permutations:

    ABC  ACB  BAC  BCA  CAB  CBA

Consider 4 objects, A, B, C and D. There are 24 permutations:

    ABCD  ABDC  ACBD  ACDB  ADBC  ADCB
    BACD  BADC  BCAD  BCDA  BDAC  BDCA
    CABD  CADB  CBAD  CBDA  CDAB  CDBA
    DABC  DACB  DBAC  DBCA  DCAB  DCBA

In general, think of placing the objects one at a time into boxes. For 5 objects there are 5 choices for the 1st box, 4 for the 2nd, then 3 choices for the 3rd box, 2 for the 4th and 1 for the 5th box, giving 5 × 4 × 3 × 2 × 1 = 120 permutations. The number of permutations of n objects is n! ("n factorial"):

    Number of objects         2    3    4     5     6
    Number of permutations    2    6   24   120   720
3.5.2 Permutations of r objects from n

Now suppose we have 4 objects and only 2 boxes. How many permutations of 2 objects are there when we have 4 to choose from? There are 4 choices for the first box and 3 choices for the second box, so

    4P2 = 4 × 3 = 12.

In general, the number of permutations of r objects from n is

    nPr = n! / (n − r)!.

For example,

    4P2 = 4!/2! = (4 × 3 × 2 × 1)/(2 × 1) = 4 × 3 = 12.
3.5.3 Combinations of r objects from n

Now consider the number of ways of choosing 2 objects from 4 when the order doesn't matter. We just want to count the number of possible combinations. We know that there are 12 permutations when choosing 2 objects from 4. These are

    AB  BA  AC  CA  AD  DA  BC  CB  BD  DB  CD  DC

Notice how the permutations are grouped in 2s which are the same combination of letters. Thus there are 12/2 = 6 possible combinations:

    AB  AC  AD  BC  BD  CD

We write this as 4C2 = 6. In general, the number of combinations of r objects from n is

    nCr = n! / ((n − r)! r!),

or equivalently

    nCr = nPr / r!.
3.6 Worked Examples

(i). Four letters are chosen at random from the word RANDOMLY. Find the probability that all four letters chosen are consonants.

RANDOMLY has 8 letters: 6 consonants and 2 vowels. The number of ways of choosing 4 letters from 8 is

    8C4 = 8! / (4! 4!) = 70,

and the number of ways of choosing 4 consonants from the 6 available is

    6C4 = 6! / (4! 2!) = 15.

So the probability that all four letters are consonants is

    15/70 = 3/14.
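Python's math module provides comb, perm and factorial, which can be used to check the counting results above; the sketch below is an added illustration.

```python
# Counting checks for Sections 3.5-3.6
from math import comb, perm, factorial

print(factorial(5))      # 120 permutations of 5 objects
print(perm(4, 2))        # 12  ordered choices of 2 from 4
print(comb(4, 2))        # 6   unordered choices of 2 from 4

# Example (i): probability that all four chosen letters are consonants
print(comb(6, 4), comb(8, 4), comb(6, 4) / comb(8, 4))   # 15  70  0.2142... = 3/14
```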
(ii). A bag contains 8 white counters and 3 black counters. Two counters
are drawn, one after the other. Find the probability of drawing one
white and one black counter, in any order
(a) if the first counter is replaced
(b) if the first counter is not replaced
What is the probability that the second counter is black (assume that
the first counter is replaced after it is taken)?
A useful way of tackling many probability problems is to draw a probability tree. The branches of the tree represent different possible events.
Each branch is labelled with the probability of choosing it given what
has occurred before. The probability of a given route through the tree
can then be calculated by multiplying all the probabilities along that
route (using the Multiplication Rule).
(a) With replacement
Let
W1 be the event a white counter is drawn first,
W2 be the event a white counter is drawn second,
B1 be the event a black counter is drawn first,
B2 be the event a black counter is drawn second,
(a) With replacement:

    P(W1 ∩ W2) = (8/11) × (8/11) = 64/121
    P(W1 ∩ B2) = (8/11) × (3/11) = 24/121
    P(B1 ∩ W2) = (3/11) × (8/11) = 24/121
    P(B1 ∩ B2) = (3/11) × (3/11) =  9/121

So P(one white and one black, in any order) = 24/121 + 24/121 = 48/121.

(b) Without replacement:

    P(W1 ∩ W2) = (8/11) × (7/10) = 56/110
    P(W1 ∩ B2) = (8/11) × (3/10) = 24/110
    P(B1 ∩ W2) = (3/11) × (8/10) = 24/110
    P(B1 ∩ B2) = (3/11) × (2/10) =  6/110

So P(one white and one black, in any order) = 24/110 + 24/110 = 48/110 = 24/55.

Finally, with replacement, P(second counter is black) = P(W1 ∩ B2) + P(B1 ∩ B2) = 24/121 + 9/121 = 33/121 = 3/11.
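The tree probabilities can be checked with a few lines of Python (an added sketch, not from the original notes).

```python
# Counters example: multiply probabilities along each branch of the tree
w, b = 8, 3                                         # white and black counters

# (a) with replacement: one of each colour, in any order
p_one_each_a = (w/11)*(b/11) + (b/11)*(w/11)        # 48/121 = 0.397
# (b) without replacement
p_one_each_b = (w/11)*(b/10) + (b/11)*(w/10)        # 48/110 = 0.436
# P(second counter is black), with replacement
p_B2 = (w/11)*(b/11) + (b/11)*(b/11)                # 33/121 = 3/11
print(p_one_each_a, p_one_each_b, p_B2)
```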
[. . . ] P(R30|B) = 0.5 [. . . ]

    P(A ∩ R60) / P(R60) = 0.375 / 0.625 = 0.6.
[. . . ] The probability of matching all six of the six numbers drawn from 49 is

    P(6 correct) = (6 × 5 × 4 × 3 × 2 × 1) / (49 × 48 × 47 × 46 × 45 × 44)
                 = 0.0000000715112   (about 1 in 14 million).
Lecture 4
The Binomial Distribution

4.1 Introduction

[. . . ]

4.2 An example of the Binomial distribution

Suppose we have a box with a very large number¹ of balls in it: 2/3 of the balls are black and the rest are red. We draw 5 balls from the box. How many black balls do we get? We can write

    X = No. of black balls in 5 draws.

X can take on any of the values 0, 1, 2, 3, 4 and 5. X is a discrete random variable.

¹ We say "a very large number" when we want to ignore the change in probability that comes from drawing without replacement. Alternatively, we could have a small number of balls (2 black and 1 red, for instance) but replace the ball (and mix well!) after each draw.
Some values of X will be more likely to occur than others. Each value of X will have a probability of occurring. What are these probabilities? Consider the probability of obtaining just one black ball, i.e. X = 1. One possible way of obtaining one black ball is if we observe the pattern BRRRR. The probability of obtaining this pattern is

    P(BRRRR) = (2/3) × (1/3) × (1/3) × (1/3) × (1/3).
There are 32 possible patterns of black and red balls we might observe:

    BBBBB  BBBBR  BBBRB  BBBRR  BBRBB  BBRBR  BBRRB  BBRRR
    BRBBB  BRBBR  BRBRB  BRBRR  BRRBB  BRRBR  BRRRB  BRRRR
    RBBBB  RBBBR  RBBRB  RBBRR  RBRBB  RBRBR  RBRRB  RBRRR
    RRBBB  RRBBR  RRBRB  RRBRR  RRRBB  RRRBR  RRRRB  RRRRR

Five of the patterns contain just one black ball: BRRRR, RBRRR, RRBRR, RRRBR and RRRRB. These all have the same probability, so the probability of obtaining one black ball in 5 draws is

    P(X = 1) = 5 × (2/3) × (1/3)^4 = 0.0412   (to 4 decimal places).

What about P(X = 2)? This probability can be written as

    P(X = 2) = (No. of patterns with two black balls) × (probability of each such pattern)
             = 5C2 × (2/3)^2 × (1/3)^3
             = 10 × 4/243
             = 0.165.

It's now just a small step to write down a formula for this specific situation, in which we draw 5 balls with probability 2/3 of a black ball on each draw:

    P(X = x) = 5Cx (2/3)^x (1/3)^(5 − x).

We can use this formula to tabulate the probabilities of each possible value of X. These probabilities are plotted in Figure 4.1 against the values of X. This shows the distribution of probabilities across the possible values of X.
    P(X = 0) = 5C0 (2/3)^0 (1/3)^5 = 0.0041
    P(X = 1) = 5C1 (2/3)^1 (1/3)^4 = 0.0412
    P(X = 2) = 5C2 (2/3)^2 (1/3)^3 = 0.1646
    P(X = 3) = 5C3 (2/3)^3 (1/3)^2 = 0.3292
    P(X = 4) = 5C4 (2/3)^4 (1/3)^1 = 0.3292
    P(X = 5) = 5C5 (2/3)^5 (1/3)^0 = 0.1317

    x           0       1       2       3       4       5
    P(X = x)    0.004   0.041   0.165   0.329   0.329   0.132
4.3 The Binomial distribution

In general, suppose we have n independent trials, each of which is a "success" with probability p, and let X be the total number of successes. Then X has a Binomial distribution with parameters n and p, written X ∼ Bin(n, p), and

    P(X = x) = nCx p^x (1 − p)^(n − x),   x = 0, 1, . . . , n.

Examples
With this general formula we can calculate many different probabilities.

(i). Suppose X ∼ Bin(10, 0.4); what is P(X = 7)?

    P(X = 7) = 10C7 (0.4)^7 (1 − 0.4)^(10 − 7) = 120 × (0.4)^7 × (0.6)^3 = 0.0425.
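A small Python sketch (added for illustration) evaluates the Binomial formula directly.

```python
# Binomial probabilities from P(X = x) = nCx p^x (1-p)^(n-x)
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(round(binom_pmf(7, 10, 0.4), 4))                    # 0.0425
print([round(binom_pmf(x, 5, 2/3), 4) for x in range(6)])
# [0.0041, 0.0412, 0.1646, 0.3292, 0.3292, 0.1317]
```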
[Figure: Binomial distributions with n = 10 and p = 0.1, p = 0.5 and p = 0.7.]

4.4 The mean and variance of the Binomial distribution
If X ∼ Bin(n, p), the mean and standard deviation of X are

    μ = np   and   σ = √(npq),   where q = 1 − p.

In the example above, X ∼ Bin(5, 2/3), and so the mean and standard deviation are given by

    μ = np = 5 × (2/3) = 3.333   and   σ = √(5 × (2/3) × (1/3)) = 1.054.
4.5 Testing a hypothesis using the Binomial distribution
Consider the following simple situation: You have a six-sided die, and you have the impression that it's somehow been weighted so that the number 1 comes up more frequently than it should. How would you decide whether this impression is correct? You could do a careful experiment, where you roll the die 60 times, and count how often the 1 comes up.

Suppose you do the experiment, and the 1 comes up 30 times (and the other numbers come up 30 times all together). You might expect the 1 to come up one time in six, so 10 times, so 30 times seems high. But is it too high? There are two possible hypotheses:

(i). The die is biased.
(ii). Just by chance we got more 1s than expected.

How do we decide between these hypotheses? Of course, we can never prove that any sequence of throws couldn't have come from a fair die. But we can find that the results we got are extremely unlikely to have arisen from a fair die, so that we should seriously consider whether the alternative might be true.

The probability of a 1 on each throw of a fair die is 1/6, so we apply the formula for the binomial distribution with n = 60 and p = 1/6.
Now we summarise the general approach:
posit a hypothesis;
design and carry out an experiment to collect a sample of data;
test to see if the sample is consistent with the hypothesis.

Hypothesis: The die is fair. All 6 outcomes have the same probability.

Under this hypothesis the number of 1s in 60 throws is Bin(60, 1/6), so

    P(exactly 30 ones) = 60C30 (1/6)^30 (5/6)^30,
    P(at least 30 ones) = Σ from x=30 to 60 of 60Cx (1/6)^x (5/6)^(60 − x).
Which is the appropriate probability? The strange event, from the perspective of the fair die, was not that 1 came up exactly 30 times, but that it came up so many times. So the relevant number is the second one, which is a little bigger. Still, the probability is less than 3 in a billion. In other words, if you were to perform one of these experiments once a second, continuously, you might expect to see a result this extreme about once in 10 years. So you either have to believe that you just happened to get that one-in-10-years outcome the one time you tried it, or you have to believe that there really is something biased about the die. In the language of hypothesis testing, we say we would reject the hypothesis.
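Assuming SciPy is available, the tail probability for the die example can be computed as follows (an added sketch, not part of the original notes).

```python
# P(at least 30 ones in 60 throws of a fair die)
from scipy.stats import binom

p_exactly_30  = binom.pmf(30, 60, 1/6)
p_at_least_30 = binom.sf(29, 60, 1/6)    # P(X >= 30) = 1 - P(X <= 29)
print(p_exactly_30, p_at_least_30)       # both of order 1e-9 ("3 in a billion")
```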
78
163
74 16374
1
1
C74
= 0.0314,
2
2
74
X
i 163i
1
1
P (# heads is at most 74) =
Ci
2
2
i=0
0 1630
1 1631
1
1
1
1
163
163
= C0
+ C1
2
2
2
2
74 16374
1
1
+ + 163 C74
2
2
= 0.136,
163
79
74
X
i 163i
1
1
P (# heads at most 74 or at least 89) =
Ci
2
2
i=0
163
X
1 i 1 163i
163
+
Ci
2
2
163
i=89
= 0.272.
Note that the two-tailed probability is exactly twice the onetailed. We show these probabilities on the probability histogram of Figure 4.3.
0.04
upper tail
X89
prob.=0.136
0.01
0.02
0.03
lower tail
X74
prob.=0.136
0.00
Probability
0.05
0.06
mean=81.5
60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
Number of Heads
We observe 74
Figure 4.3: The tail probabilities for testing the hypothesis that anturane
has no effect.
80
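The one- and two-tailed probabilities quoted above can be reproduced with SciPy (an added sketch; SciPy is assumed to be available).

```python
# Tail probabilities for Bin(163, 1/2)
from scipy.stats import binom

lower = binom.cdf(74, 163, 0.5)      # P(at most 74 heads)  ~ 0.136
upper = binom.sf(88, 163, 0.5)       # P(at least 89 heads) ~ 0.136
print(lower, upper, lower + upper)   # two-tailed ~ 0.272
```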
Lecture 5
The Poisson Distribution

5.1 Introduction
[. . . ]

    No. of drowning deaths per month      0     1    2   3   4   5+
    Frequency (No. of months observed)   224   102   23   5   1   0
[. . . ] If we count the number of births in each hour, we can plot the counts as a histogram in Figure 5.1(b). How does this compare to the histogram of counts for a process that isn't random? Suppose the 44 birth times were distributed in time as shown in Figure 5.1(c). The histogram of these birth times per hour is shown in Figure 5.1(d). We see that the non-random clustering of events in time causes there to be more hours with zero births and more hours with large numbers of births than the real birth times histogram.

This example illustrates that the distribution of counts is useful in uncovering whether the events might occur randomly or non-randomly in time (or space). Simply looking at the histogram isn't sufficient if we want to ask the question whether the events occur randomly or not. To answer this question we need a probability model for the distribution of counts of random events that dictates the type of distributions we should expect to see.
Figure 5.1: Representing the babyboom data set (upper two panels) and a non-random hypothetical collection of birth times (lower two panels).

5.2 The Poisson Distribution

Suppose events occur randomly in time at an average rate of λ per unit time interval, and let X be the number of events in one such interval. Then X follows a Poisson distribution with parameter λ, written X ∼ Po(λ), and

    P(X = x) = e^(−λ) λ^x / x!,   x = 0, 1, 2, 3, 4, . . .

Note: A Poisson random variable can take on any non-negative integer value (0, 1, 2, . . .). In contrast, the Binomial distribution always has a finite upper limit.
Example: Births at a hospital occur randomly at an average rate of λ = 1.8 births per hour. What is the probability of observing 4 births in a given hour?

    P(X = 4) = e^(−1.8) (1.8)^4 / 4! = 0.0723.

What about the probability of observing more than or equal to 2 births in a given hour at the hospital? We want

    P(X ≥ 2) = P(X = 2) + P(X = 3) + . . . ,

i.e. an infinite number of probabilities to calculate. But

    P(X ≥ 2) = 1 − P(X < 2)
             = 1 − (P(X = 0) + P(X = 1))
             = 1 − (e^(−1.8) (1.8)^0/0! + e^(−1.8) (1.8)^1/1!)
             = 1 − (0.16529 + 0.29753)
             = 0.537.
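A short Python sketch (added for illustration) evaluates the Poisson probabilities above directly from the formula.

```python
# Poisson probabilities for a rate of 1.8 births per hour
from math import exp, factorial

def pois_pmf(x, lam):
    return exp(-lam) * lam**x / factorial(x)

print(round(pois_pmf(4, 1.8), 4))                           # 0.0723
print(round(1 - (pois_pmf(0, 1.8) + pois_pmf(1, 1.8)), 3))  # 0.537
```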
5.3 The shape of the Poisson distribution

Using the formula we can calculate the probabilities for a specific Poisson distribution and plot the probabilities to observe the shape of the distribution. For example, Figure 5.2 shows 3 different Poisson distributions. We observe that the distributions
(i). are unimodal;
(ii). exhibit positive skew (that decreases as λ increases);
(iii). are centred roughly on λ;
(iv). have variance (spread) that increases as λ increases.

5.4 Mean and Variance of the Poisson distribution

If X ∼ Po(λ), then the mean and variance of X are both equal to λ:

    μ = λ   and   σ² = λ.
[Figure 5.2: Three Poisson distributions: Po(3), Po(5) and Po(10).]
5.5 Changing the size of the interval

Suppose we now ask: what is the probability of observing 5 births in a given 2-hour interval? Births occur at an average rate of 1.8 per hour, so over 2 hours the average is 2 × 1.8 = 3.6 births. If Y is the number of births in a 2-hour interval, then Y ∼ Po(3.6) and

    P(Y = 5) = e^(−3.6) (3.6)^5 / 5! = 0.13768.

This example illustrates the following rule:
If X ∼ Po(λ) on a 1 unit interval, then Y ∼ Po(kλ) on k unit intervals.
5.6 Sum of two Poisson variables

Now suppose we know that in hospital A births occur randomly at an average rate of 2.3 births per hour and in hospital B births occur randomly at an average rate of 3.1 births per hour. What is the probability that we observe 7 births in total from the two hospitals in a given 1-hour period?

To answer this question we can use the following rule:
If X ∼ Po(λ1) on a 1 unit interval, and Y ∼ Po(λ2) on a 1 unit interval, then X + Y ∼ Po(λ1 + λ2) on a 1 unit interval.

So if we let X = No. of births in a given hour at hospital A, and Y = No. of births in a given hour at hospital B, then X ∼ Po(2.3), Y ∼ Po(3.1) and X + Y ∼ Po(5.4), so

    P(X + Y = 7) = e^(−5.4) (5.4)^7 / 7! = 0.11999.
5.7 Fitting a Poisson distribution

Consider the two sequences of birth times we saw in Section 5.1. Both of these examples consisted of a total of 44 births in 24 hour intervals. Therefore the mean birth rate for both sequences is 44/24 = 1.8333 births per hour.

What would be the expected counts if birth times were really random, i.e. what is the expected histogram for a Poisson random variable with mean rate λ = 1.8333?

Using the Poisson formula we can calculate the probabilities of obtaining each possible value:

    x           0        1        2        3        4        5        6
    P(X = x)    0.15989  0.29312  0.26869  0.16419  0.07525  0.02759  0.01127

Then, since we observe 24 hour-long intervals, we can calculate the expected frequencies as 24 × P(X = x) for each value of x, and compare them with the observed frequencies from the babyboom data:

    x                                0      1      2      3      4      5      6
    Expected frequency 24 P(X=x)     3.837  7.035  6.448  3.941  1.806  0.662  0.271
    Observed frequency               3      8      6      4      3      0      0
When we compare the expected frequencies to those observed from the non-random clustered sequence in Section 5.1, we see that there is much less agreement:

    x          0      1      2      3      4      5      6
    Expected   3.837  7.035  6.448  3.941  1.806  0.662  0.271
    Observed   12     3      0      2      2      4      1
In Lecture 9 we will see how we can formally test for a difference between
the expected and observed counts. For now it is enough just to know how
to fit a distribution.
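The fitting calculation can be reproduced with a few lines of Python (an added sketch, not from the original notes).

```python
# Fit a Poisson distribution to the 24 hourly birth counts
from math import exp, factorial

lam = 44 / 24                                    # mean rate 1.8333 births per hour
probs = [exp(-lam) * lam**x / factorial(x) for x in range(7)]
expected = [24 * p for p in probs]
print([round(p, 5) for p in probs])      # 0.15989, 0.29312, 0.26869, ...
print([round(e, 3) for e in expected])   # 3.837, 7.035, 6.448, 3.941, ...
```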
5.8 Using the Poisson to approximate the Binomial

The Binomial and Poisson distributions are both discrete probability distributions. In some circumstances the distributions are very similar. For example, consider the Bin(100, 0.02) and Po(2) distributions shown in Figure 5.3. Visually these distributions are virtually identical.

In general, if n is large (say > 50) and p is small (say < 0.1) then a Bin(n, p) can be approximated with a Po(λ) where λ = np.
Figure 5.3: A Binomial distribution (Bin(100, 0.02)) and a Poisson distribution (Po(2)) that are very similar.
Example 5.5: Suppose X ∼ Bin(100, 0.05), so that λ = np = 5 and X is approximately Po(5). We want P(X ≥ 2):

    P(X ≥ 2) = 1 − P(X < 2)
             = 1 − (P(X = 0) + P(X = 1))
             ≈ 1 − e^(−5) (5^0/0! + 5^1/1!)
             ≈ 1 − 0.040428
             ≈ 0.9596.

If we use the exact Binomial distribution we get the answer 0.9629.
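Assuming SciPy is available, the approximation can be compared with the exact Binomial answer as follows (an added illustration).

```python
# Poisson approximation versus exact Binomial for n = 100, p = 0.05 (lambda = 5)
from scipy.stats import binom, poisson

approx = 1 - (poisson.pmf(0, 5) + poisson.pmf(1, 5))              # ~ 0.9596
exact  = 1 - (binom.pmf(0, 100, 0.05) + binom.pmf(1, 100, 0.05))  # ~ 0.9629
print(round(approx, 4), round(exact, 4))
```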
The idea of using one distribution to approximate another is widespread throughout statistics and one we will meet again. Why would we use an approximate distribution when we actually know the exact distribution?
The exact distribution may be hard to work with.
The exact distribution may have too much detail. There may be some features of the exact distribution that are irrelevant to the questions we want to answer.
[. . . ] For the drowning data above, the total number of deaths is 0×224 + 1×102 + 2×23 + 3×5 + 4×1 = 167 over 355 months, giving a mean rate of λ = 167/355 = 0.47 deaths per month. The Poisson probabilities for λ = 0.47 are:

    No. of drowning deaths per month      0      1      2      3      4      5+
    Frequency (No. of months observed)   224    102     23     5      1      0
    Poisson probability                  0.625  0.294  0.069  0.011  0.001  0.0001
[. . . ]

    P(Y ≥ 48) = 1 − Σ from i=0 to 47 of e^(−23) 23^i / i! = 3.5 × 10^(−6).
[. . . ]

    x                    0      1      2      3      4      5      6+
    observed frequency   16     7      3      0      2      0      2
    probability          0.264  0.352  0.234  0.104  0.034  0.009  0.003
    expected frequency   7.9    10.6   7.0    3.1    1.0    0.3    0.1
5.9 Derivation of the Poisson distribution (non-examinable)

This section is not officially part of the course, but is optional, for those who are interested in more mathematical detail. Where does the formula in section 5.2 come from?

Think of the Poisson distribution, as in section 5.8, as an approximation to a binomial distribution. Let X be the (random) number of successes in a collection of independent random trials, where the expected number of successes is λ. This will, of course, depend on the number of trials, but we show that when the number of trials (call it n) gets large, the exact number of trials doesn't matter. In mathematical language, we say that the probability converges to a limit as n goes to infinity. But how large is "large"? We would like to know how good the approximation is for real values of n, of the sort that we are interested in.

Let Xn be the random number of successes in n independent trials, where the probability of each success is λ/n. Thus, the probability of success goes down as the number of trials goes up, and the expected number of successes is always the same λ. Then

    P{Xn = x} = nCx (λ/n)^x (1 − λ/n)^(n − x).
Now, those of you who have learned some calculus at A-levels may remember the Taylor series for e^z:

    e^z = 1 + z + z^2/2! + z^3/3! + . . .
We can rewrite the binomial probability as

    P{Xn = x} = nCx (λ/n)^x (1 − λ/n)^n (1 − λ/n)^(−x)
              = [n(n − 1) . . . (n − x + 1) / x!] (λ^x / n^x) (1 − λ/n)^n (1 − λ/n)^(−x)
              = (λ^x / x!) (1)(1 − 1/n)(1 − 2/n) . . . (1 − (x − 1)/n) (1 − λ/n)^n / (1 − λ/n)^x.

Now, if we're not concerned about the size of the error, we can simply say that n is much bigger than λ or x (because we're thinking of a fixed λ and x, and n getting large). So we have the approximations

    (1)(1 − 1/n) . . . (1 − (x − 1)/n) ≈ 1;
    (1 − λ/n)^x ≈ 1;
    (1 − λ/n)^n ≈ (e^(−λ/n))^n = e^(−λ).

Thus

    P{Xn = x} ≈ (λ^x / x!) e^(−λ).

5.9.1 Error bounds (very mathematical)
In the long run, Xn has a distribution very close to the Poisson distribution defined in section 5.2. But how long is the long run? Do we need 10 trials? 1000? A billion?

If you just want the answer, it's approximately this: the error that you'll make by taking the Poisson distribution instead of the binomial is no more than about 1.6 λ²/n^(3/2). In Example 5.5, where n = 100 and λ = 5, this says the error won't be bigger than about 0.04, which is useful information, although in reality the maximum error is about 10 times smaller than this. On the other hand, if n = 400,000 (about the population of Malta), and λ = 0.47, then the error will be only about 10^(−8).
Lets assume that n is at least 42 , so < n/2. Define the approximation error to be
:= maxP {Xn = x} P {X = x}.
(The bars | | mean that were only interested in how big the difference is,
not whether its positive or negative.) Then
    P{Xn = x} − P{X = x}
      = (λ^x/x!) (1)(1 − 1/n)···(1 − (x−1)/n) (1 − λ/n)^n (1 − λ/n)^(−x) − (λ^x/x!) e^(−λ)
      = (λ^x/x!) e^(−λ) [ (1)(1 − 1/n)···(1 − (x−1)/n) (1 − λ/n)^(−x) ((1 − λ/n)/e^(−λ/n))^n − 1 ].

Now

    (1)(1 − 1/n)···(1 − (x−1)/n) = Π_{k=0}^{x−1} (1 − k/n) > 1 − Σ_{k=0}^{x−1} k/n > 1 − x²/(2n),

and

    1 > (1 − λ/n)^x > 1 − λx/n.
    1 − λ/n < e^(−λ/n) < 1 − λ/n + λ²/(2n²),

so

    1 − λ²/(2(n² − λn)) < (1 − λ/n)/e^(−λ/n) < 1,

and

    1 − λ²/(2(n − λ)) < ( (1 − λ/n)/e^(−λ/n) )^n < 1.
Now we put together all the overestimates on one side, and all the underestimates on the other.
    −(λ^x/x!) e^(−λ) [ λ²/(2(n − λ)) + x²/(2n) ] ≤ P{Xn = x} − P{X = x} ≤ (λ^x/x!) e^(−λ) λx/(n − λx).
So, finally, as long as n ≥ 4λ², we get

    Δ ≤ max_x (λ^x/x!) e^(−λ) [ x²/(2n) + λ²/(2(n − λ)) + λx/(n(1 − x/(2√n))) ].

We need to find the maximum over all possible x. If x < √n, then this becomes

    Δ ≤ max_x (1/n) (λ^(x+1)/x!) e^(−λ) (λ + 3x),
Lecture 6
Introduction
In previous lectures we have considered discrete datasets and discrete probability distributions. In practice many datasets that we collect from experiments consist of continuous measurements. For example, there are the weights of newborns in the babyboom data set (Table 1.2). The plots in Figure 6.1 show histograms of real datasets consisting of continuous measurements. From such samples of continuous data we might want to test whether the data are consistent with a specific population mean value, or whether there is a significant difference between two groups of data. To answer these questions we need a probability model for the data. Of course, there are many different possible distributions that quantities could have. It is therefore a startling fact that many different quantities that we are commonly interested in (heights, weights, scores on intelligence tests, serum potassium levels of different patients, measurement errors of the distance to the nearest star) all have distributions which are close to one particular shape. This shape is called the Normal or Gaussian¹ family of distributions.
¹ Named for the German mathematician Carl Friedrich Gauss, who first worked out the formula for these distributions, and used them to estimate the errors in astronomical computations. Until the introduction of the euro, Gauss's picture and the Gaussian curve were on the German 10-mark banknote.
Figure 6.1: Histograms of real datasets of continuous measurements; the panels include petal lengths and brain sizes, with frequency on the vertical axis.
6.2
When we considered the Binomial and Poisson distributions we saw that the
probability distributions were characterized by a formula for the probability
of each possible discrete value. All of the probabilities together sum up to
1. We can visualize the density by plotting the probabilities against the
discrete values (Figure 6.2). For continuous data we don't have equally spaced discrete values, so instead we use a curve or function that describes
the probability density over the range of the distribution (Figure 6.3). The
curve is chosen so that the area under the curve is equal to 1. If we observe a
sample of data from such a distribution we should see that the values occur
in regions where the density is highest.
6.3
There are many, many possible probability density functions over a continuous range of values. The Normal distribution describes a special class of such distributions that are symmetric and can be described by the distribution mean µ and the standard deviation σ (or variance σ²). Four different Normal distributions are shown in Figure 6.4 together with the values of µ and σ. These plots illustrate how changing the values of µ and σ alters the positions and shapes of the distributions.
If X is Normally distributed with mean µ and standard deviation σ, we write

    X ∼ N(µ, σ²).

µ and σ are the parameters of the distribution.
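A minimal R sketch of Figure 6.4, using the four (µ, σ) pairs shown in its panels:

    x <- seq(50, 150, length.out = 501)
    plot(x, dnorm(x, mean = 100, sd = 10), type = "l", ylab = "density")  # N(100, 10^2)
    lines(x, dnorm(x, mean = 100, sd = 5))     # N(100, 5^2)
    lines(x, dnorm(x, mean = 100, sd = 15))    # N(100, 15^2)
    lines(x, dnorm(x, mean = 130, sd = 10))    # N(130, 10^2)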
Figure 6.4: Four Normal densities: µ = 100, σ = 10; µ = 100, σ = 5; µ = 100, σ = 15; µ = 130, σ = 10.
6.4
For this example we can calculate the required area as we know the distribution is symmetric and the total area under the curve is equal to 1, i.e.
P (Z < 0) = 0.5.
What about P (Z < 1.0)?
Calculating this area is not easy² and so we use probability tables. Probability tables are tables of probabilities that have been calculated on a computer. All we have to do is identify the right probability in the table and copy it down! Obviously it is impossible to tabulate all possible probabilities for all possible Normal distributions, so only one special Normal distribution, N(0, 1), has been tabulated.
² For those mathematicians who recognize this area as a definite integral and try to do the integral by hand: please note that the integral cannot be evaluated analytically.
The N(0, 1) distribution is called the standard Normal distribution.
The tables allow us to read off probabilities of the form P (Z < z). Most of
the table in the formula book has been reproduced in Table 6.1. From this
table we can identify that P(Z < 1.0) = 0.8413 (the entry in the z = 1.0 row, 0.00 column).
Table 6.1: The standard Normal table, giving P(Z < z). The first column gives z to one decimal place; the remaining columns give the second decimal place.

z     0.00    0.01   0.02   0.03   0.04   0.05   0.06   0.07   0.08   0.09
0.0   0.5000  5040   5080   5120   5160   5199   5239   5279   5319   5359
0.1   0.5398  5438   5478   5517   5557   5596   5636   5675   5714   5753
0.2   0.5793  5832   5871   5910   5948   5987   6026   6064   6103   6141
0.3   0.6179  6217   6255   6293   6331   6368   6406   6443   6480   6517
0.4   0.6554  6591   6628   6664   6700   6736   6772   6808   6844   6879
0.5   0.6915  6950   6985   7019   7054   7088   7123   7157   7190   7224
0.6   0.7257  7291   7324   7357   7389   7422   7454   7486   7517   7549
0.7   0.7580  7611   7642   7673   7704   7734   7764   7794   7823   7852
0.8   0.7881  7910   7939   7967   7995   8023   8051   8078   8106   8133
0.9   0.8159  8186   8212   8238   8264   8289   8315   8340   8365   8389
1.0   0.8413  8438   8461   8485   8508   8531   8554   8577   8599   8621
1.1   0.8643  8665   8686   8708   8729   8749   8770   8790   8810   8830
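In R, the function pnorm plays the role of this table; for example:

    pnorm(1.0)    # 0.8413, the entry for z = 1.0
    pnorm(0.43)   # 0.6664, the z = 0.4 row, 0.03 column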
6.5
Standardisation
All of the probabilities above were calculated for the standard Normal distribution N(0, 1). If we want to calculate probabilities from different Normal
distributions we convert the probability to one involving the standard Normal distribution. This process is called standardisation.
Suppose X ∼ N(3, 4) and we want to calculate P(X < 6.2). We convert this probability to one involving the N(0, 1) distribution by
(i). Subtracting the mean
(ii). Dividing by the standard deviation
Subtracting the mean re-centers the distribution on zero. Dividing by the
standard deviation re-scales the distribution so it has standard deviation 1.
If we also transform the boundary point of the area we wish to calculate
we obtain the equivalent boundary point for the N(0, 1) distribution. This
process is illustrated in the figure below. In this example, P (X < 6.2) =
P (Z < 1.6) = 0.9452 where Z N(0,1)
This process can be described by the following rule
If X N(, 2 ) and Z =
then Z N(0, 1)
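A short R check of the standardisation rule for this example (X ∼ N(3, 4), so σ = 2):

    pnorm(6.2, mean = 3, sd = 2)   # P(X < 6.2) = 0.9452
    pnorm((6.2 - 3) / 2)           # the same, via Z = (X - mu)/sigma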
Suppose we know that the birth weight of babies is Normally distributed with mean 3500g and standard deviation 500g. What
is the probability that a baby is born that weighs less than 3100g?
That is, X ∼ N(3500, 500²) and we want to calculate P(X < 3100).
We can calculate the probability through the process of standardization.
Drawing a rough diagram of the process can help you to avoid
any confusion about which probability (area) you are trying to
calculate.
P(X < 3100) = P( (X − 3500)/500 < (3100 − 3500)/500 )
            = P(Z < −0.8)    where Z ∼ N(0, 1)
            = 1 − P(Z < 0.8)
            = 1 − 0.7881
            = 0.2119
6.6
Suppose two rats A and B have been trained to navigate a large maze.
The time it takes rat A is normally distributed with mean 80 seconds and
standard deviation 10 seconds. The time it takes rat B is normally distributed with mean 78 seconds and standard deviation 13 seconds. On any
given day what is the probability that rat A runs the maze faster than rat B?
X = Time of run for rat A
Y = Time of run for rat B
X ∼ N(80, 10²)
Y ∼ N(78, 13²)
    X − Y ∼ N(µ1 − µ2, σ1² + σ2²)

In this example,

    D = X − Y ∼ N(80 − 78, 10² + 13²) = N(2, 269)
We can now calculate this probability through standardisation
P(D < 0) = P( (D − 2)/√269 < (0 − 2)/√269 )
         = P(Z < −0.122)    where Z ∼ N(0, 1)
    X + Y ∼ N(µ1 + µ2, σ1² + σ2²)
    aX ∼ N(aµ1, a²σ1²)
    aX + bY ∼ N(aµ1 + bµ2, a²σ1² + b²σ2²)
Let A = (X + Y)/2, where

    X ∼ N(80, 10²)
    Y ∼ N(78, 13²)
    A ∼ N(79, 67.25)

P(A > 82) = P( (A − 79)/√67.25 > (82 − 79)/√67.25 )
          = P(Z > 0.366)    where Z ∼ N(0, 1)
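Both of these probabilities can be computed directly, without tables; a brief R sketch:

    pnorm(0, mean = 2, sd = sqrt(269))          # P(D < 0) for the difference X - Y
    1 - pnorm(82, mean = 79, sd = sqrt(67.25))  # P(A > 82) for the average (X + Y)/2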
6.7
Suppose X ∼ N(45, 400) and we want to find the value x such that P(X < x) = 0.8. Standardising,

P(X < x) = P( (X − 45)/20 < (x − 45)/20 ) = 0.8,

so

P( Z < (x − 45)/20 ) = 0.8,    where Z ∼ N(0, 1).

From the normal table, (x − 45)/20 ≈ 0.84, so

x ≈ 45 + 20 × 0.84 = 61.8.
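The same value can be found with the Normal quantile function in R:

    qnorm(0.8, mean = 45, sd = 20)   # about 61.8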
6.8
(Figure: the Bin(300, 0.5) probability distribution P(X = x) alongside its approximating Normal density.)
If X ∼ Bin(n, p), then for large n, X is approximately N(np, npq), where q = 1 − p.

6.8.1 Continuity correction
For example, suppose X ∼ Bin(12, ½), so that X has mean np = 6 and standard deviation √(npq) = √3, and we want P(4 ≤ X ≤ 7). With the continuity correction,

P(4 ≤ X ≤ 7) ≈ P(3.5 < X < 7.5)
            = P( (3.5 − 6)/√3 < (X − 6)/√3 < (7.5 − 6)/√3 )
            ≈ P(−1.44 < Z < 0.87)    where Z ∼ N(0, 1)
            ≈ 0.73.
The exact answer is 0.733 so in this case the approximation is very good.
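A brief R comparison (assuming, as above, X ∼ Bin(12, ½)):

    pbinom(7, 12, 0.5) - pbinom(3, 12, 0.5)          # exact P(4 <= X <= 7): 0.733
    pnorm(7.5, 6, sqrt(3)) - pnorm(3.5, 6, sqrt(3))  # continuity-corrected approximation: about 0.73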
6.9
We can also use the Normal distribution to approximate a Poisson distribution under certain conditions. In general,

    If X ∼ Po(λ), then µ = λ and σ² = λ. For large λ (say λ > 20),
    X is approximately N(λ, λ).
Suppose X ∼ Po(25). Then

P(X < 26.5) = P( (X − 25)/5 < (26.5 − 25)/5 )
            = P(Z < 0.3)    where Z ∼ N(0, 1)
            = 0.6179
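A short R check of the approximation for this example:

    ppois(26, 25)        # exact P(X <= 26) = P(X < 26.5): about 0.63
    pnorm(26.5, 25, 5)   # Normal approximation: 0.6179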
√(7500 × 0.25 × 0.75) = 37.5. The result is thus above the mean:
There were more correct guesses than would be expected. Might
we plausibly say that the difference from the expectation is just
chance variation?
We want to know how likely a result this extreme would be if
X really has this binomial distribution. We could compute this
    Σ_{x=2006}^{7500} C(7500, x) (0.25)^x (0.75)^(7500−x)
      = C(7500, 2006) (0.25)^2006 (0.75)^5494
      + C(7500, 2007) (0.25)^2007 (0.75)^5493 + ···
      + C(7500, 7500) (0.25)^7500 (0.75)^0.
This is not only a lot of work, it is also not very illuminating.
More useful is to treat X as a continuous variable that is approximately normal.
We sketch the relevant normal curve in Figure 6.7. This is the
normal distribution with mean 1875 and SD 37.5. Because of
the continuity correction, the probability we are looking for is
P (X > 2005.5). We convert x = 2005.5 into standard units:
    z = (x − µ)/σ = (2005.5 − 1875)/37.5 = 3.48.

(Note that with such a large SD, the continuity correction makes hardly any difference.) We have then P(X > 2005.5) ≈ P(Z > 3.48), where Z has standard normal distribution. Since most of the probability of the standard normal distribution is between −2 and 2, and nearly all between −3 and 3, we know this is a small probability. The relevant piece of the normal table is given in Figure 6.6. (Notice that the table has become less refined for z > 2, giving only one place after the decimal point in z.) From the table we see that

    P(X > 2005.5) = P(Z > 3.48) = 1 − P(Z < 3.48),
which is between 0.0002 and 0.0003. (Using a more refined table,
we would see that P (Z > 3.48) = 0.000250 . . . .) This may be
compared to the exact binomial probability 0.000274.
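Both numbers can be reproduced numerically; a minimal R sketch:

    1 - pbinom(2005, 7500, 0.25)    # exact Binomial tail: about 0.000274
    1 - pnorm(2005.5, 1875, 37.5)   # Normal approximation: about 0.00025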
Figure 6.6: Normal table used to compute the tail probability for the Aquarius experiment.

Figure 6.7: The approximating normal curve, with mean 1875 and SD 37.5, and the observed value x = 2006.
Lecture 7
Confidence intervals
    (X̄ − µ)/(σ/√n),

which is a standard normal random variable (that is, with expectation 0 and variance 1).
A tiny bit of algebra gives us

    µ = X̄ − (σ/√n)Z.

This expresses the unknown quantity µ in terms of known quantities and a random variable Z with known distribution. Thus we may use our standard normal tables to generate statements like "the probability is 0.95 that Z is in the range −1.96 to 1.96", implying that
the probability is 0.95 that µ is in the range X̄ − 1.96σ/√n to X̄ + 1.96σ/√n.
(Note that we have used the fact that the normal distribution is symmetric
about 0.) We call this interval a 95% confidence interval for the unknown
population mean.
The quantity σ/√n, which determines the scale of the confidence interval, is called the Standard Error for the sample mean, commonly abbreviated SE. If we take σ to be the sample standard deviation (more about this assumption in chapter 10) the Standard Error is 69mm/√198 ≈ 4.9mm. The 95% confidence interval for the population mean is then 1732 ± 9.8mm, so (1722, 1742)mm. In place of our vague statement about a best guess for µ, we have an interval of width 20mm in which we are 95% confident that the true population mean lies.
General procedure for normal confidence intervals: Suppose X1, . . . , Xn are independent samples from a normal distribution with unknown mean µ, and known variance σ². Then a (symmetric) c% normal confidence interval for µ is the interval

    (X̄ − z·SE, X̄ + z·SE), which we also write as X̄ ± z·SE,

where SE = σ/√n, and z is the appropriate quantile of the standard normal distribution. That is, it is the number such that (100 − c)/2% of the probability in the standard normal distribution is above z. Thus, if we're looking for a 95% confidence interval, we take X̄ ± 2·SE, whereas a 99% confidence interval would be X̄ ± 2.6·SE, since we see on the normal table that P(Z < 2.6) = 0.9953, so P(Z > 2.6) = 0.0047 ≈ 0.5%. (Note: the central importance of the 95% confidence interval derives primarily from its convenient correspondence to a z value of 2. More precisely, it is 1.96, but we rarely need, or indeed can justify, such precision.)
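A small R sketch of this procedure, using the summary figures quoted above for the husbands' heights (mean 1732mm, SD 69mm, n = 198):

    xbar <- 1732; s <- 69; n <- 198
    se <- s / sqrt(n)                 # standard error, about 4.9 mm
    xbar + c(-1, 1) * 1.96 * se       # 95% confidence interval, about (1722, 1742)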
Level           68%    90%    95%    99%    99.7%
z               1.0    1.64   1.96   2.6    3.0
Prob. above z   0.16   0.05   0.025  0.005  0.0015
7.2
But what does "confidence" mean? The quantity µ is a fact, not a random quantity, so we cannot say "The probability is 0.95 that µ is between 1722mm and 1742mm."¹ The randomness is in our estimate µ̂ = X̄.
(that is, quantities you can compute from the data X) A(X) and B(X), such that

    P( A(X) ≤ µ ≤ B(X) ) = γ.

The quantity P( A(X) ≤ µ ≤ B(X) ) is called the coverage probability for µ. Thus, a confidence interval for µ with confidence coefficient γ is precisely a random interval with coverage probability γ. In many cases, it is not possible to find an interval with exactly the right coverage probability. We may have to content ourselves with an approximate confidence interval (with coverage probability ≈ γ) or a conservative confidence interval (with coverage probability ≥ γ). We usually make every effort not to overstate our confidence about statistical conclusions, which is why we try to err on the side of making the coverage probability (and hence the interval) too large.
An illustration of this problem is given in Figure 7.1. Suppose we are
measuring systolic blood pressure on 100 patients, where the true blood
pressure is 120 mmHg, but the measuring device makes normally distributed
errors with mean 0 and SD 10 mmHg. In order to reduce the errors, we
take four measures on each patient and average them. Then we compute
a confidence interval. The measures are shown in figure 7.1(a). In Figure
7.1(b) we have shown a 95% confidence interval for each patient, computed
by taking the average of the patients four measurements, plus and minus 10.
Notice that there are 6 patients (shown by red Xs for their means) where
the true measure 120 mmHg lies outside the confidence interval. In
Figures 7.1(c) and 7.1(d) we show 90% and 68% confidence intervals, which
are narrower, and hence miss the true value more frequently.
A 90% confidence interval tells you that 90% of the time the true value
will lie in this range. In fact, we find that there are exactly 90 out of 100
cases where the true value is in the confidence interval. The 68% confidence
intervals do a bit better than would be expected on average: 74 of the 100
trials had the true value in the 68% confidence interval.
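The simulation behind Figure 7.1 is easy to sketch in R (the numbers 120, 10, and 4 are those described above; the seed is arbitrary):

    set.seed(1)
    covered <- replicate(100, {
      m  <- mean(rnorm(4, mean = 120, sd = 10))   # average of four noisy measurements
      ci <- m + c(-1, 1) * 1.96 * 10 / sqrt(4)    # 95% confidence interval
      ci[1] < 120 && 120 < ci[2]                  # does it cover the true value?
    })
    sum(covered)   # typically about 95 of the 100 intervals cover 120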
Figure 7.1: Confidence intervals for 100 patients' blood pressure, based on four measurements. Each column of Figure 7.1(a) shows a single patient's four measurements. The true BP in each case is 120, and the measurement errors are normally distributed with mean 0 and SD 10.
7.3
We discussed in section 6.8 that the binomial distribution can be well approximated by a normal distribution. This means that if we are estimating
the probability of success p from some observations of successes and failures,
we can use the same methods as above to put a confidence interval on p.
For instance, the Gallup organisation carried out a poll in October, 2005,
of Americans attitudes about guns (see http://www.gallup.com/poll/
20098/gun-ownership-use-america.aspx). They surveyed 1,012 Americans, chosen at random. Of these, they found that 30% said they personally
owned a gun. But, of course, if they'd picked different people, purely by chance they would have gotten a somewhat different percentage. How different could it have been? What does this survey tell us about the true fraction (call it p) of Americans who own guns?
We can compute a 95% confidence interval as 0.30 ± 1.96·SE. All we need to know is the SE for the proportion p, which is the same as the standard deviation for the observed proportion of successes. We know from section
6.8 (and discussed again at length in section 8.3) that the standard error is

    SE = √( p(1 − p)/n ) ≈ √(0.3 × 0.7/1012) ≈ 0.014.
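A brief R sketch of the resulting interval:

    phat <- 0.30; n <- 1012
    se <- sqrt(phat * (1 - phat) / n)   # about 0.014
    phat + c(-1, 1) * 1.96 * se         # 95% confidence interval, roughly 0.27 to 0.33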
7.4
Law of Large Numbers (LLN): For n large, X̄ will be close to µ.

Central Limit Theorem (CLT): For n large, the error in the LLN is close to a normal distribution, with variance σ²/n. That is, using our standardisation procedure for the normal distribution,

    Z = (X̄ − µ)/(σ/√n)        (7.1)

is close to having a standard normal distribution. Equivalently, X1 + ··· + Xn has approximately N(nµ, nσ²) distribution.
So far, we have been assuming that our data are sampled from a population with a normal distribution. What justification do we have for this
assumption? And what do we do if the data come from a different distribution? One of the great early discoveries of probability theory is that many
different kinds of random variables come close to a normal distribution when
you average enough of them. You have already seen examples of this phenomenon in the normal approximation to the binomial distribution and the
Poisson.
In probability textbooks such as [Fel71] you can find very precise statements about what it means for the distribution to be close. For our purposes, we will simply treat Z as being actually a standard normal random
variable. However, we also need to know what it means for n to be large.
For most distributions that you might encounter, 20 is usually plenty, while
2 or 3 are not enough. The key rules of thumb are that the approximation
works best when the distribution of X
(i). is reasonably symmetric: Not skewed in either direction.
(ii). has thin tails: Most of the probability is close to the mean, not many
SDs away from the mean.
More specific indications will be given in the following examples.
7.4.1
Normal distribution
Suppose the Xi are drawn from a normal distribution with mean µ and variance σ², N(µ, σ²). We know that X1 + ··· + Xn has N(nµ, nσ²) distribution, and so that X̄ has N(µ, σ²/n) distribution. A consequence is that Z, as defined in (7.1), in fact has exactly the standard normal distribution. In fact, this is an explanation for why the CLT works: the normal distribution is the only distribution such that when you average multiple copies of it, you get another distribution of the same sort.² Other distributions are not stable under averaging, and naturally converge to the distributions that are.
7.4.2
Poisson distribution
² Technically, it is the only distribution with finite variance for which this is true.
(Figure: Poisson distributions with λ = 1, 4, 10, 20, standardised and compared with the standard normal density.)
λ                    1       4       10        20
Standardised Z       −1.5    −2.25   −3.32     −4.58
Normal probability   0.067   0.012   0.00045   0.0000023
7.4.3
Bernoulli variables
7.5
We show how the CLT is applied to understand the mean of samples from
real data. It permits us to apply our Z and t tests for testing population
means and computing confidence intervals for the population mean (as well as for differences in means) even to data that are not normally distributed. (Caution: remember that t is an improvement over Z only when the number of samples being averaged is small. Unfortunately, the CLT itself may not apply in such a case.) We have already applied this idea when we did the Z test for proportions, and the CLT was also hidden in our use of the χ² test.

Figure 7.3: Normal approximations to Binom(n, p), for p = 0.5 and p = 0.1 and n = 3, 10, 25. Shaded region is the implied approximate probability of the Binomial variable < 0 or > n.
7.5.1
Quebec births
be x̄ ± 1.96 × 41.9/√n. For instance, if n = 10, and we find the mean of our 10 samples to be 245, then a 95% confidence interval will be (219, 271).
If there had been 100 samples, the confidence interval would be (237, 253).
Put differently, the average of 10 samples will lie within 26 of the true mean
95% of the time, while the average of 100 samples will lie within 8 of the
true mean 95% of the time.
This computation depends upon n being large enough to apply the CLT.
Is it? One way of checking is to perform a simulation: We let a computer
pick 1000 random samples of size n, compute the means, and then look at
the distribution of those 1000 means. The CLT predicts that they should
have a certain normal distribution, so we can compare them and see. If
n = 1, the result will look exactly like Figure 7.4(a), where the curve in red
is the appropriate normal approximation predicted by the CLT. Of course,
there is no reason why the distribution should be normal for n = 1. We see
that for n = 2 the true distribution is still quite far from normal, but by
n = 10 the normal is already starting to fit fairly closely; by n = 100 the fit is very close.
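A sketch of that simulation in R, assuming the daily birth counts are available in a vector called births (the name is hypothetical):

    n <- 10
    means <- replicate(1000, mean(sample(births, n, replace = TRUE)))
    hist(means, freq = FALSE)
    curve(dnorm(x, mean = mean(births), sd = sd(births) / sqrt(n)), add = TRUE, col = "red")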
7.5.2
California incomes
A standard example of a highly skewed distribution, hence a poor candidate for applying the CLT, is household income. The mean is much greater than the median, since there are a small number of extremely high incomes. It is intuitively clear that the average of incomes must be hard to predict. Suppose you were sampling 10,000 Americans at random (a very large sample) whose average income is $30,000. If your sample happens
Figure 7.4: Normal approximations to averages of n samples from the Quebec birth data, for n = 1, 2, 5, 10, 50, 100.
to include Bill Gates, with annual income of, let us say, $3 billion, then his income will be ten times as large as the total income of the entire remainder of the sample. Even if everyone else has zero income, the sample mean will be at least $300,000. The distribution of the mean will not converge, or will converge only very slowly, if it can be substantially affected by the presence or absence of a few very high-earning individuals in the sample.
Figure 7.5(a) is a histogram of household incomes, in thousands of US dollars, in the state of California in 1999, based on the 2000 US census (see www.census.gov). We have simplified somewhat, since the final category is "more than $200,000", which we have treated as being the range $200,000 to $300,000. (Remember that histograms are on a density scale, with the area of a box corresponding to the number of individuals in that range. Thus, the last three boxes all correspond to about 3.5% of the population, despite their different heights.) The mean income is about µ = $62,000, while the median is $48,000. The SD of the incomes is σ = $55,000.
Figures 7.5(b) to 7.5(f) show the effect of averaging 2, 5, 10, 50, and 100 randomly chosen incomes, together with a normal distribution (in green) as predicted by the CLT, with mean µ and variance σ²/n. We see that the convergence takes a little longer than it did with the more balanced birth data of Figure 7.4 (averaging just 10 incomes is still quite skewed), but by the time we have reached the average of 100 incomes the match to the predicted normal distribution is remarkably good.
7.6
There are many implications of the Central Limit Theorem. We can use it
to estimate the probability of obtaining a total of at least 400 in 100 rolls of
a fair six-sided die, for instance, or the probability of a subject in an ESP
experiment, guessing one of four patterns, obtaining 30 correct guesses out
of 100 purely by chance. These were discussed in lecture 6 of the first set
of lectures. It suggests an explanation for why height and weight, and any
other quantity that is affected by many small random factors, should end
up being normally distributed.
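For instance, the die-total probability can be sketched with the CLT and checked by simulation in R (the mean and variance of one roll of a fair die are 3.5 and 35/12):

    mu <- 100 * 3.5; sigma <- sqrt(100 * 35 / 12)
    1 - pnorm(399.5, mu, sigma)    # CLT estimate of P(total >= 400): about 0.002
    mean(replicate(10000, sum(sample(1:6, 100, replace = TRUE)) >= 400))   # simulation check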
Here we discuss one crucial application: The CLT allows us to compute normal confidence intervals and apply the Z test to data that are not
themselves normally distributed.
7.6.1
Confidence intervals
Figure 7.5: Normal approximations to averages of n samples from the California income data, for n = 1, 2, 5, 10, 50, 100. The green curve shows a normal density with mean µ and variance σ²/n.
Lecture 8
The Z Test
8.1
Introduction
In Lecture 1 we saw that statistics has a crucial role in the scientific process
and that we need a good understanding of statistics in order to avoid reaching invalid conclusions concerning the experiments that we do. In Lectures 2
and 3 we saw how the use of statistics necessitates an understanding of probability. This led us to study how to calculate and manipulate probabilities using a variety of probability rules. In Lectures 4, 5 and 6 we considered three
specific probability distributions that turn out to be very useful in practical
situations. Effectively, all of these previous lectures have provided us with
the basic tools we need to use statistics in practical situations.
The goal of statistical analysis is to draw reasonable conclusions from the
data and, perhaps even more important, to give precise statements about the
level of certainty that ought to be attached to those conclusions. In lecture
7 we used the normal distribution to derive one form that these precise
statements can take: a confidence interval for some population mean. In
this lecture we consider an alternative approach to describing very much the
same information: Significance tests.
8.2
know from large-scale studies that UK newborns average 3426g,
with an SD of 538g. (See, for example, [NNGT02].) The weights
are approximately normally distributed. We think that maybe
babies in Australia have a mean birth weight smaller than 3426g
and we would like to test this hypothesis.
Figure 8.1: A Histogram showing the birth weight distribution in the Babyboom dataset.
We observe that the sample mean of these 44 weights is 3276g.
So we might just say that were done. The average of these
weights is smaller than 3426g, which is what we wanted to show.
"But wait!" a skeptic might say. "You might have just happened by chance to get a particularly light group of newborns. After all, even in England lots of newborns are lighter than 3276g. And 44 isn't such a big sample."
    X̄ = (1/44)(X1 + ··· + X44) ∼ N(3426, 538²/44) = N(3426, 81²).

Thus,

    P(X̄ ≤ 3276) = P( Z ≤ (3276 − 3426)/81 ) = P(Z ≤ −1.81) = 1 − P(Z < 1.81),

where Z = (X̄ − µ)/σ = (X̄ − 3426)/81 has standard normal distribution. Looking this up on the standard normal table, we see that the probability is about 0.0351.
The probability 0.0351 that we compute at the end of Example 8.1 is
called the p-value of the test. It tells us how likely it is that we would
observe such an extreme result if the null hypothesis were true. The lower
the p-value, the stronger the evidence against the null hypothesis. We are
faced with the alternative: Either the null hypothesis is false, or we have by
chance happened to get a result that would only happen about one time in
30. This seems unlikely, but not impossible.
Pay attention to the double negative that we commonly use for significance tests: We have a research hypothesis, which we think would be interesting if it were true. We don't test it directly, but rather we use the data to challenge a less interesting null hypothesis, which says that the apparently interesting differences that we've observed in the data are simply the result of chance variation. We find out whether the data support the research hypothesis by showing that the null hypothesis is false (or unlikely). If the null hypothesis passes the test, then we know only that this particular challenge was inadequate. We haven't proven the null hypothesis. After all, we may just not have found the right challenger; a different experiment might show up the weaknesses of the null. (The potential strength of the challenge is called the power of the test, and we'll learn about that in section 13.2.)
What if the challenge succeeds? We can then conclude with confidence (how much confidence depends on the p-value) that the null was wrong. But in a sense, this is shadow boxing: We don't exactly know who the challenger is. We have to think carefully about what the plausible alternatives are.
(See, for instance, Example 8.2.)
8.2.1
The basic steps carried out in Example 8.1 are common to most significance
tests:
(i). Begin with a research (alternative) hypothesis.
(ii). Set up the null hypothesis.
(iii). Collect a sample of data.
(iv). Calculate a test statistic from the sample of data.
(v). Compare the test statistic to its sampling distribution under the
null hypothesis and calculate the p-value. The strength of the evidence is larger, the smaller the p-value.
8.2.2
                                        Truth
Decision       H0 True                              H0 False
Retain H0      Correct (Prob. 1 − α)                Type II Error (Prob. = β)
Reject H0      Type I Error (Prob. = level = α)     Correct (Prob. = Power = 1 − β)
"the evidence against the null hypothesis is not significant at the 5% level".
Another way of saying this is that
"we cannot reject the null hypothesis at the 5% level".
Note that the conclusion of a hypothesis test, strictly speaking, is binary:
We either reject or retain the null hypothesis. There are no gradations, no
strong rejection or borderline rejection or barely retained. The fact that our
p-value was 10⁻⁶ ought not to be taken, retrospectively, as stronger evidence
against the null hypothesis than a p-value of 0.04 would have been.
By the strict logic imposed the data are completely used up in the test:
If we are testing at the 0.05 level and the p-value is 0.06, we cannot then
collect more data to see if we can get a lower p-value. We would have to
throw away the data, and start a new experiment. Needless to say, this
is not what scientists really do, which makes even the apparently clear-cut
yes/no decision set-up of the hypothesis test in reality rather difficult to
interpret.
It is also quite common to confuse this situation with using significance
tests to judge scientific evidence. So common, in fact, that many scientific
journals impose the 0.05 significance threshold to decide whether results are
worth publishing. An experiment that resulted in a statistical test with
a p-value of 0.10 is considered to have failed, even if it may very well be
providing reasonable evidence of something important; if it resulted in a
statistical test with a p-value of 0.05 then it is a success, even if the effect
size is minuscule, and even though 1 out of 20 true null hypotheses will fail
the test at significance level 0.05.
Another way of thinking about hypothesis tests is that there is some
critical region of values such that if the test statistic lies in this region
then we will reject H0 . If the test statistic lies outside this region we will
not reject H0 . In our example, using a 5% level of significance this set of
values will be the most extreme 5% of values in the right hand tail of the
distribution. Using our tables backwards we can calculate that the boundary
of this region, called the critical value, will be 1.645. The value of our
test statistic is 3.66 which lies in the critical region so we reject the null
hypothesis at the 5% level.
8.2.3
Hypothesis tests are identical to significance tests, except for the choice of
a significance level at the beginning, and the nature of the conclusions we
draw at the end:
(i). Begin with a research (alternative) hypothesis and decide upon
a level of significance for the test.
(ii). Set up the null hypothesis.
(iii). Collect a sample of data.
(iv). Calculate a test statistic from the sample of data.
(v). Compare the test statistic to its sampling distribution under the
null hypothesis and calculate the p-value,
or equivalently,
Calculate the critical region for the test.
(vi). Reject the null hypothesis if
the p-value is less than the level of significance,
or equivalently,
the test statistic lies in the critical region.
Otherwise, retain the null hypothesis.
8.3
A common situation in which we use hypothesis tests is when we have multiple independent observations from a distribution with unknown mean, and
we can make a test statistic that is normally distributed. The null hypothesis should then tell us what the mean and standard error are, so that we can
normalise the test statistic. The normalised test statistic is then commonly
called Z. We always define Z by
    Z = (observation − expectation) / (standard error).        (8.2)
The expectation and standard error are the mean and the standard deviation
of the sampling distribution: that is, the mean and standard deviation that
the observation has when seen as a random variable, whose distribution is
given by the null hypothesis. Thus, Z has been standardised: its distribution
is standard normal, and the p-value comes from looking up the observed
value of Z on the standard normal table.
We call this a one-sample test because we are interested in testing the
mean of samples from a single distribution. This is as opposed to the two-
sample test (discussed in section ??), in which we are testing the difference
in means between two populations.
8.3.1
If X1 ∼ N(µ, σ²) and X2 ∼ N(µ, σ²), then

    X̄ = ½X1 + ½X2 ∼ N( ½µ + ½µ, (½)²σ² + (½)²σ² ),

that is,

    X̄ ∼ N(µ, σ²/2).

In general,

    If X1, X2, . . . , Xn are n independent and identically distributed random variables from a N(µ, σ²) distribution, then
    X̄ ∼ N(µ, σ²/n).
Thus, if we are testing the null hypothesis

    H0: the Xi have N(µ, σ²) distribution,

the test statistic is

    Z = (sample mean − µ) / (σ/√n).

Thus, under the assumption of the null hypothesis, the sample mean of 44 values from a N(3426, 538²) distribution is

    X̄ ∼ N(3426, 538²/44) = N(3426, 81²).
8.3.2
Under some circumstances it may seem more intuitive to work with the sum of observations rather than the mean. If S = X1 + ··· + Xn, where the Xi are independent with N(µ, σ²) distribution, then S ∼ N(nµ, nσ²). That is,

    Z = (S − nµ) / (σ√n).

8.3.3
p > p0.
We already observed in section 6.8 that the random variable X has distribution very close to normal, with mean np and standard error √(np(1 − p)), as long as n is reasonably large. We have then the test statistic:

    When testing the number of successes in n trials, for the null hypothesis P(success) = p0, the test statistic is
    Z = (number of successes − np0) / √(np0(1 − p0)).
8.3.4
Equivalently, working with the proportion of successes, the test statistic is

    Z = (proportion of successes − p0) / √(p0(1 − p0)/n).

For the Aquarius ESP experiment, with 2006 correct guesses in 7500 trials,

    Z = (0.26747 − 0.25)/0.005 = 3.49.
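A minimal R version of this calculation:

    phat <- 2006 / 7500              # observed proportion, 0.26747
    se <- sqrt(0.25 * 0.75 / 7500)   # standard error under H0: 0.005
    z <- (phat - 0.25) / se          # about 3.49
    1 - pnorm(z)                     # one-tailed p-value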
    Z = (0.2645 − 0.4)/0.0697 = −1.94
8.3.5
The fundamental fact which makes statistics work is the fact that when we add up n independent observations, the expected value increases by a factor of n, while the standard deviation increases only by a factor of √n.
8.4
In Example 8.1 we wanted to test the research hypothesis that mean birth
weight of Australian babies was less than 3426g. This suggests that we
had some prior information that the mean birth weight of Australian babies
was definitely not higher than 3426g, and that the interesting question was
whether the weight was lower. If this were not the case then our research
hypothesis would be that the mean birth weight of Australian babies was
different from 3426g. This allows for the possibility that the mean birth
weight could be less than or greater than 3426g.
In this case we would write our hypotheses as
    H0: µ = 3426g
    H1: µ ≠ 3426g
As before we would calculate our test statistic as −1.81. The p-value is different, though. We are not looking only at the probability that Z is less than −1.81 (in the negative direction), but at the probability that Z is at least this big in either direction; so P(|Z| > 1.81), or P(Z > 1.81) + P(Z < −1.81) = 2P(Z > 1.81). Because of symmetry,
For a Z test, the two-tailed p-value is always twice as
big as the one-tailed p-value.
In this case we allow for the possibility that the mean value is greater
than 3426g by setting our critical region to be lowest 2.5% and highest 2.5%
of the distribution. In this way the total area of the critical region remains
0.05 and so the level of significance of our test remains 5%. In this example, the critical values are -1.96 and 1.96. Thus if our test statistic is less
than -1.96 or greater than 1.96 we would reject the null hypothesis. In this
example, the value of the test statistic (−1.81) does not lie in the critical region, so we cannot reject the null hypothesis at the 5% level.
8.5
You may notice that we do a lot of the same things to carry out a statistical
test that we do to compute a confidence interval. We compute a standard
error and look up a value on a standard normal table. For instance, in
Example 8.1 we might have expressed our uncertainty about the average
Australian birthweight in the form of a confidence interval. The calculation
would have been just slightly different:
    SE = 528g/√44 ≈ 80g.

Then a 95% confidence interval for the mean birthweight of Australian babies is 3276 ± 1.96 × 80g = (3120, 3432)g; a 99% confidence interval would be 3276 ± 2.58 × 80g = (3071, 3481)g. (Again,
remember that we are making the not particularly realistic assumption that the observed birthweights are a random sample of all Australian
birthweights.) This is consistent with the observation that the Australian
birthweights would just barely pass a test for having the same mean as the
UK average 3426g, at the 0.05 significance level.
In fact, it is almost true to say that the symmetric 95% confidence interval contains exactly the possible means µ0 such that the data would pass a test at the 0.05 significance level for having mean equal to µ0. What's the difference? It's in how we compute the standard error. In computing
a confidence interval we estimate the parameters of the distribution from
the data. When we perform a statistical test, we take the parameters (as
far as possible) from the null hypothesis. In this case, that means that it
makes sense to test based on the presumption that the standard deviation of
weights is the SD of the UK births, which is the null hypothesis distribution.
In this case, this makes only a tiny difference between 538g (the UK SD)
and 528g (the SD of the Australian sample).
Lecture 9
The χ² Test
9.1
    (observed − expected) / standard error,
These don't have a mean. We're usually interested in some claim about the distribution of the data among the categories. The most basic tool we have for testing whether data we observe really could have come from a certain distribution is called the χ² test.
Month    Prob.     Female   Male     Total
Jan      0.0849    527      1774     2301
Feb      0.0773    435      1639     2074
Mar      0.0849    454      1939     2393
Apr      0.0821    493      1777     2270
May      0.0849    535      1969     2504
Jun      0.0821    515      1739     2254
Jul      0.0849    490      1872     2362
Aug      0.0849    489      1833     2322
Sep      0.0821    476      1624     2100
Oct      0.0849    474      1661     2135
Nov      0.0821    442      1568     2010
Dec      0.0849    471      1690     2161
Total    1.0000    5801     21085    26886
Mean               483.4    1757.1   2240.5
9.2
Goodness-of-Fit Tests
9.2.1
    X² = (n1 − np1)²/(np1) + ··· + (nk − npk)²/(npk) = Σ (observed − expected)²/expected.    (9.1)
The χ² distribution
The statistic X² has properties (i) and (ii) for a good test statistic: we can compute it from the data, and bigger values of X² correspond to data that are farther away from what you would expect from the null hypothesis. But what about (iii)? We can't do a statistical test unless we know the distribution of X² under the null hypothesis. Fortunately, the distribution is known. The statistic X² has approximately (for large n) one of a family of distributions, called the χ² distribution. There is a positive integer, called the number of degrees of freedom (abbreviated d.f.), which tells us which χ² distribution we are talking about. One of the tricky things about using the χ² test statistic is figuring out the number of degrees of freedom. This depends on the number of categories, and how we picked the null hypothesis that we are testing. In general,
until the problem disappears. We will see examples of this in sections 9.4.1
and 9.4.2.
The χ² distribution with d degrees of freedom is a continuous distribution¹ with mean d and variance 2d. In Figure 9.1 we show the density of the chi-squared distribution for some choices of the degrees of freedom. We note that these distributions are always right-skewed, but the skew decreases as d increases. For large d, the χ² distribution becomes close to the normal distribution with mean d and variance 2d.
As with the standard normal distribution, we rely on standard tables with precomputed values for the χ² distribution. We could simply have a separate table for each number of degrees of freedom, and use these exactly like the standard normal table for the Z test. This would take up quite a bit of space, though. (Potentially infinite, but for large numbers of degrees of freedom see section 9.2.2.) Alternatively, we could use a computer programme that computes p-values for arbitrary values of X² and d.f. (In the R programming language the function pchisq does this.) This is an ideal solution, except that you don't have computers to use on your exams. Instead, we rely on a traditional compromise approach, taking advantage of the fact that the most common use of the tables is to find the critical value for hypothesis testing at one of a few levels, such as 0.05 and 0.01.
Figure 9.1: Densities of the χ² distribution for various degrees of freedom (1-7 d.f. in the upper panel; 5, 10, 20, and 30 d.f. in the lower panel).
Table 9.2: Critical values of the χ² distribution.

d.f.   P = 0.05   P = 0.01      d.f.   P = 0.05   P = 0.01
1      3.84       6.63          17     27.59      33.41
2      5.99       9.21          18     28.87      34.81
3      7.81       11.34         19     30.14      36.19
4      9.49       13.28         20     31.41      37.57
5      11.07      15.09         21     32.67      38.93
6      12.59      16.81         22     33.92      40.29
7      14.07      18.48         23     35.17      41.64
8      15.51      20.09         24     36.42      42.98
9      16.92      21.67         25     37.65      44.31
10     18.31      23.21         26     38.89      45.64
11     19.68      24.72         27     40.11      46.96
12     21.03      26.22         28     41.34      48.28
13     22.36      27.69         29     42.56      49.59
14     23.68      29.14         30     43.77      50.89
15     25.00      30.58         40     55.76      63.69
16     26.30      32.00         60     79.08      88.38
9.2.2
Large d.f.
For large d, the standardised statistic (X² − d)/√(2d) is approximately standard normal,
Figure 9.2: χ² density with 5 degrees of freedom. The green region represents 1% of the total area. The red region represents a further 4% of the area, so that the tail above 11.07 is 5% in total.
so an approximate critical value for large d is √(2d)·z + d.
For example, if we were testing at the 0.01 level, with 60 d.f., we would first look for 9950 on the standard normal table, finding that this corresponds to z = 2.58. (Remember that in a two-tailed test at the 0.01 level, the probability above z is 0.005.) We conclude that the critical value for χ² with 60 d.f. at the 0.01 level is about √120 × 2.58 + 60 ≈ 88.3, close to the exact value 88.38 in Table 9.2.
9.3
Fixed distributions
to come up. We roll it 60 times, and tabulate the number of times each side
comes up. The results are given in Table 9.3.
Table 9.3: Outcome of 60 die rolls

Side                  1    2    3    4    5    6
Observed Frequency    16   15   4    6    14   5
Expected Frequency    10   10   10   10   10   10
It certainly appears that sides 1,2, and 5 come up more often than they
should, and sides 3, 4, and 6 less frequently. On the other hand, some
deviation is expected, due to chance. Are the deviations we see here too
extreme to be attributed to chance?
Suppose we wish to test the null hypothesis
H0 : Each side comes up with probability 1/6
at the 0.01 significance level. In addition to the Observed Frequency of
each side, we have also indicated, in the last row of Table 9.3, the Expected
Frequency, which is the probability (according to H0 ) of the side coming
up, multiplied by the number of trials, which is 60. Since each side has
probability 1/6 (under the null hypothesis), the expected frequencies are
all 10. (Note that the observed and expected frequencies both add up to
exactly the number of trials.)
We now plug these numbers into our formula for the chi-squared statistic:

X² = Σ (observed − expected)²/expected
   = (16 − 10)²/10 + (15 − 10)²/10 + (4 − 10)²/10 + (6 − 10)²/10 + (14 − 10)²/10 + (5 − 10)²/10
   = 3.6 + 2.5 + 3.6 + 1.6 + 1.6 + 2.5
   = 15.4.
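The same computation in R, either by hand or with chisq.test:

    observed <- c(16, 15, 4, 6, 14, 5)
    expected <- rep(10, 6)
    sum((observed - expected)^2 / expected)   # 15.4
    chisq.test(observed, p = rep(1/6, 6))     # same statistic, with its p-value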
Now we need to decide whether this number is a big one, by comparing
it to the appropriate χ² distribution. Which one is the appropriate one?
When testing observations against a single fixed distribution, the number
of degrees of freedom is always one fewer than the number of categories.2
There are six categories, so five degrees of freedom.
² Why minus one? You can think of degrees of freedom as saying how many different cell counts are free to vary: once the total number of rolls is fixed, only five of the six observed frequencies can be chosen freely.
Month    Prob.     Female Obs   Female Exp   Male Obs   Male Exp   Combined Obs   Combined Exp
Jan      0.0849    527          493          1774       1790       2301           2283
Feb      0.0773    435          448          1639       1630       2074           2078
Mar      0.0849    454          493          1939       1790       2393           2283
Apr      0.0821    493          476          1777       1731       2270           2207
May      0.0849    535          493          1969       1790       2504           2283
Jun      0.0821    515          476          1739       1731       2254           2207
Jul      0.0849    490          493          1872       1790       2362           2283
Aug      0.0849    489          493          1833       1790       2322           2283
Sep      0.0821    476          476          1624       1731       2100           2207
Oct      0.0849    474          493          1661       1790       2135           2283
Nov      0.0821    442          476          1568       1731       2010           2207
Dec      0.0849    471          493          1690       1790       2161           2283
Total    1.0000    5801         5803         21085      21084      26886          26887
Month    Prob.     Observed   Expected
Jan      0.0849    29415      30936
Feb      0.0767    26890      27942
Mar      0.0849    30122      30936
Apr      0.0822    30284      29938
May      0.0849    31516      30936
Jun      0.0822    30571      29938
Jul      0.0849    32678      30936
Aug      0.0849    31008      30936
Sep      0.0822    31557      29938
Oct      0.0849    31659      30936
Nov      0.0822    29358      29938
Dec      0.0849    29186      30936
Total    1.0000    364244     364246

Table 9.5: Observed frequency of birth months, England and Wales, 1993.
Month    Prob.     Observed   Expected
Jan      0.0808    2301       2171
Feb      0.0738    2074       1985
Mar      0.0827    2393       2223
Apr      0.0831    2270       2235
May      0.0865    2504       2326
Jun      0.0839    2254       2257
Jul      0.0897    2362       2412
Aug      0.0851    2322       2289
Sep      0.0866    2100       2329
Oct      0.0869    2135       2337
Nov      0.0806    2010       2167
Dec      0.0801    2161       2154
Total    1.0000    26886      26885
9.4
Families of distributions
In many situations, it is not that we want to know whether the data came
from a single distribution, but whether it may have come from any one
of a whole family of distributions. For example, in Lecture 5 we considered
several examples of data that might be Poisson distributed, and for which the
failure of the Poisson hypothesis would have serious scientific significance.
In such a situation, we modify our 2 hypothesis test procedure slightly:
(1). Estimate the parameters in the distribution. Now we have a particular distribution to represent the null hypothesis.
(2). Compute the expected occupancy in each cell as before.
(3). Using these expected numbers and the observed numbers (the data)
compute the 2 statistic.
(4). Compare the computed statistic to the critical value. Important: The
degrees of freedom are reduced by one for each parameter that has
been estimated.
9.4.1
Consider Example 5.7. We argued there that the distribution of Guillain-Barré Syndrome (GBS) cases by weeks should have been Poisson distributed
Table 5.3, showing the number of weeks in which different numbers of cases
were observed.
If some Poisson distribution were correct, what would be the parameter?
We estimated that λ, which is the expected number of cases per week, should
be estimated by the observed average number of cases per week, which is
40/30 = 1.33. We computed the probabilities for different numbers of cases,
assuming a Poisson distribution with parameter 1.33, and multiplied these
probabilities by 40 to obtain expected numbers of weeks. We compared
the observed to these expected numbers, and expressed the impression that
these distributions were different. But the number of observations is small.
Could it be that the difference is purely due to chance?
We want to test the null hypothesis H0: "The data came from a Poisson distribution" with a χ² test. We can't use the numbers from Table 5.3 directly, though. As discussed in section 9.2.1, the approximation we use for the χ² distribution depends on the categories all being large enough, the
rule of thumb being that the expected numbers under the null hypothesis
should be at least 5. The last three categories are all too small. So, we
group the last four categories together, to obtain the new Table 9.7. The
last category includes everything 3 and higher.
Table 9.7: Cases of GBS, by weeks after vaccination

No. of cases          0       1       2       3+
observed frequency    16      7       3       4
probability           0.264   0.352   0.234   0.150
expected frequency    10.6    14.1    9.4     6.0
We now compute

    X² = (16 − 10.6)²/10.6 + (7 − 14.1)²/14.1 + (3 − 9.4)²/9.4 + (4 − 6.0)²/6.0 ≈ 11.4.
Suppose we want to test the null hypothesis at the 0.01 significance level.
In order to decide on a critical value, we need to know the correct number
of degrees of freedom. The reduced Table 9.7 has only 4 categories. There
would thus be 3 d.f., were it not for our having estimated a parameter to
decide on the distribution. This reduces the d.f. by one, leaving us with 2
degrees of freedom. Looking in the appropriate row, we see that the critical
value is 9.21, so we do reject the null hypothesis (the true p-value is 0.0034),
and conclude that the data did not come from a Poisson distribution.
9.4.2
Table 9.8: Number of girls in 6115 families with 12 children each: observed frequency, expected frequency under the fitted Binomial distribution, and fitted probability.

# Girls   Frequency   Expected   Probability
0         7           2.3        0.0004
1         45          26.1       0.0043
2         181         132.8      0.0217
3         478         410.0      0.0670
4         829         854.2      0.1397
5         1112        1265.6     0.2070
6         1343        1367.3     0.2236
7         1033        1085.2     0.1775
8         670         628.1      0.1027
9         286         258.5      0.0423
10        104         71.8       0.0117
11        24          12.1       0.0020
12        3           0.9        0.0002
We can use a Chi-squared test to test the hypothesis that the data follow
a Binomial distribution.
H0 : The data follow a Binomial distribution
H1 : The data do not follow a Binomial distribution
At this point we also decide upon a 0.05 significance level.
From the data we know that n = 6115 and we can estimate p as

    p̂ = (total number of girls)/(12 × 6115) = 0.4808.
Thus we can fit a Bin(12, 0.4808) distribution to the data to obtain the
expected frequencies (E) alongside the observed frequencies (O). The probabilities are shown at the bottom of Table 9.8, and the expectations are
found by multiplying the probabilities by 6115. The first and last categories have expectations smaller than 5, so we absorb them into the next
categories, yielding Table 9.9.
Table 9.9: Modified version of Table 9.8, with small categories grouped
together.
# Girls   Frequency   Expected   Probability
0, 1      52          28.4       0.0047
2         181         132.8      0.0217
3         478         410.0      0.0670
4         829         854.2      0.1397
5         1112        1265.6     0.2070
6         1343        1367.3     0.2236
7         1033        1085.2     0.1775
8         670         628.1      0.1027
9         286         258.5      0.0423
10        104         71.8       0.0117
11, 12    27          13.0       0.0022

    X² = (52 − 28.4)²/28.4 + (181 − 132.8)²/132.8 + ··· + (27 − 13.0)²/13.0 ≈ 106.

There are 11 categories, and we estimated one parameter (p), so there are 11 − 1 − 1 = 9 degrees of freedom, with critical value 16.92 at the 0.05 level. Since 106 is far larger than this, we reject the null hypothesis: the data do not follow a Binomial distribution.
Height (cm)   155-160   161-166   167-172   173-178   179-184   185-190
Frequency     5         17        38        25        9         6
We can use a Chi-squared test to test the hypothesis that the data follow a
Normal distribution.
Height (cm)   Upper limit   z       P(Z < z)   Probability   Expected   Observed
155-160       160.5         -1.61   0.054      0.054         5.4        5
161-166       166.5         -0.77   0.221      0.167         16.7       17
167-172       172.5         0.07    0.528      0.307         30.7       38
173-178       178.5         0.91    0.818      0.290         29.0       25
179-184       184.5         1.75    0.960      0.142         14.2       9
185-190       —             —       1.00       0.040         4.0        6
From this table we see that there is one cell with an expected count less
than 5, so we group it together with the nearest cell. (A single cell with expected count 4 is on the borderline; we could just leave it. We certainly don't want any cell with expected count less than about 2, and not more than one expected count under 5.)
Height (cm)   155-160   161-166   167-172   173-178   179-190
Expected      5.4       16.7      30.7      29.0      18.2
Observed      5         17        38        25        15
X² = (5 − 5.4)²/5.4 + (17 − 16.7)²/16.7 + (38 − 30.7)²/30.7 + (25 − 29.0)²/29.0 + (15 − 18.2)²/18.2 ≈ 2.9.

There are 5 categories and two estimated parameters (µ and σ), so 5 − 1 − 2 = 2 degrees of freedom, with critical value 5.99 at the 0.05 level. Since 2.9 is well below this, we do not reject the hypothesis that the data follow a Normal distribution.
9.5
This section develops a Chi-squared test that is very similar to the one of
the preceding section but aimed at answering a slightly different question.
To illustrate the test we use an example, which we borrow from [FPP98,
section 28.2].
Table 9.10 gives data on Americans aged 25-34 from the NHANES survey, a random sample of Americans. Among other questions, individuals
were asked their age, sex, and handedness.
Table 9.10: NHANES handedness data for Americans aged 25-34

                Men    Women
right-handed    934    1070
left-handed     113    92
ambidextrous    20     8
Looking at the data, it looks as though the women are more likely to be right-handed. Someone might come along and say: "The left cerebral hemisphere controls the right side of the body, as well as rational thought. This proves that women are more rational than men." Someone else might say, "This shows that women are under more pressure to conform to society's expectations of normality." But before we consider this observation
as evidence of anything important, we have to pose the question: does this difference reflect anything more than chance variation in which people happened to be sampled?
null hypothesis. We repeat this computation for all six categories, obtaining
the results in Table 9.12. (Note that the row totals and the column totals
are identical to the original data.) We now compute the χ² statistic, taking
the six observed counts from the black Table 9.10, and the six expected
counts from the red Table 9.12:
Table 9.11: Handedness data with row and column totals and fractions.

                Men     Women   Total   Fraction
right-handed    934     1070    2004    0.896
left-handed     113     92      205     0.092
ambidextrous    20      8       28      0.013
Total           1067    1170    2237
Fraction        0.477   0.523
                Men    Women
right-handed    956    1048
left-handed     98     107
ambidextrous    13     15

Table 9.12: Expected counts for NHANES handedness data, computed from fractions in Table 9.11.
X² = (934 − 956)²/956 + (113 − 98)²/98 + (20 − 13)²/13 + (1070 − 1048)²/1048 + (92 − 107)²/107 + (8 − 15)²/15 ≈ 12.4.
The degrees of freedom are (3 − 1)(2 − 1) = 2, so we see on Table 9.2 that the
critical value is 9.21. Since the observed value is higher, we reject the null
hypothesis, and conclude that the difference in handedness between men
and women is not purely due to chance.
Remember, this does not say anything about the reason for the difference! It may be something interesting (e.g., women are more rational) or it
may be something dull (e.g., men were more likely to think the interviewer
would be impressed if they said they were left-handed). All the hypothesis
test tells us is that we would most likely have found a difference in handedness between men and women if we had surveyed the whole population.
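A sketch of this test in R, entering the counts from Table 9.10 as a matrix:

    handedness <- matrix(c(934, 1070,
                           113,   92,
                            20,    8),
                         nrow = 3, byrow = TRUE,
                         dimnames = list(c("right", "left", "ambidextrous"),
                                         c("Men", "Women")))
    chisq.test(handedness)   # X-squared about 12 on 2 d.f.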
Lecture 10
You may have noticed a hole in the reasoning we used about the husbands' heights in section 7.1. Our computations depended on the fact that

    Z := (X̄ − µ)/(σ/√n)

has standard normal distribution, when µ is the population mean and σ is the population standard deviation. But we don't know σ! We did a little sleight of hand, and substituted the sample standard deviation

    S := √( (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)² )

for the (unknown) value of σ. But S is only an estimate for σ: it's a random variable that might be too high, and might be too low. So, what we were calling Z is not really Z, but a quantity that we should give another name to:

    T := (X̄ − µ)/(S/√n).        (10.1)

If S is too big, then Z > T, and if S is too small then Z < T. On average, you might suppose, Z and T would be about the same, and in this you would be right. Does the distinction matter then?
Since T has an extra source of error in the denominator, you would expect it to be more widely scattered than Z. That means that if you compute T from the data, but look it up on a table computed from the distribution of Z (the standard normal distribution), you would underestimate the probability of a large value. The probability of rejecting a true null hypothesis (Type I error) will be larger than you thought it was, and the confidence intervals that you compute will be too narrow. This is very bad! If we make an error, we always want it to be on the side of underestimating our confidence.
Fortunately, we can compute the distribution of T (sometimes called Student's t, after the pseudonym under which the statistician William Gosset published his first paper on the subject, in 1908). While the mathematics behind this is beyond the scope of this course, the results can be found in tables. These are a bit more complicated than the normal tables, because there is an extra parameter: not surprisingly, the distribution depends on the number of samples. When the estimate is based on very few samples (so that the estimate of SD is particularly uncertain) we have a distribution which is far more spread out than the normal. When the number of samples is very large, the estimate s varies hardly at all from σ, and the corresponding t distribution is very close to normal. As with the χ² distribution, this parameter is called degrees of freedom. For the T statistic, the number of degrees of freedom is just n − 1, where n is the number of samples being averaged. Figure 10.1 shows the density of the t distribution for different degrees of freedom, together with that of the normal. Note that the t distribution is symmetric around 0, just like the normal distribution.
Table 10.1 gives the critical values for a level 0.05 hypothesis test when Z is replaced by t with different numbers of degrees of freedom. In other words, if we define t_α(d) to be the number such that P(T < t_α(d)) = α when T has the Student distribution with d degrees of freedom, Table 10.1(a) gives values of t_0.95, and Table 10.1(b) gives values of t_0.975. Note that the values of t_α(d) decrease as d increases, approaching a limiting value, which is z_α = t_α(∞).
10.1.1
Figure 10.1: The standard normal density together with densities for the t distribution with 1, 2, 4, 10, and 25 degrees of freedom.
Table 10.1: Cutoffs for hypothesis tests at the 0.05 level, using the t statistic with different degrees of freedom. The ∞ row is the limit for a very large number of degrees of freedom, which is identical to the distribution of the Z statistic.

(a) one-tailed test                  (b) two-tailed test

degrees of    critical               degrees of    critical
freedom       value                  freedom       value
1             6.31                   1             12.7
2             2.92                   2             4.30
4             2.13                   4             2.78
10            1.81                   10            2.23
50            1.68                   50            2.01
∞             1.64                   ∞             1.96
To compute confidence intervals and carry out tests with the t statistic, we follow the same procedures as in section 7.1, substituting s for σ, and the quantiles of the t distribution for the quantiles of the normal distribution: that is, where we looked up a number z on the normal table, such that P(Z < z) was a certain probability, we substitute a number t such that P(T < t) is that same probability, where T has the Student T distribution with the right number of degrees of freedom. Thus, if we want a 95% confidence interval, we take
X̄ ± t s/√n,
where t is found in the column marked P = 0.05 on the T-distribution table, 0.05 being the total probability in the two tails that we are excluding. It corresponds, of course, to P(T < t) = 0.975.
The following example is adapted from [MM98, p. 529], where it was based on a Master's thesis of Joan M. Susic at Purdue University.
Example 10.2. A patient's blood phosphate level varies from measurement to measurement, approximately according to a normal distribution. For one patient, the values (in mg/dl) were measured as 5.6, 5.1, 4.6, 4.8, 5.7, 6.4. What is a symmetric 99% confidence interval for the patient's true phosphate level?
We compute
X̄ = (1/6)(5.6 + 5.1 + 4.6 + 4.8 + 5.7 + 6.4) = 5.4 mg/dl,
s = √( (1/5)[ (5.6 − 5.4)² + (5.1 − 5.4)² + (4.6 − 5.4)² + (4.8 − 5.4)² + (5.7 − 5.4)² + (6.4 − 5.4)² ] ) = 0.67 mg/dl.
The number of degrees of freedom is 5. Thus, the symmetric confidence interval will be
( 5.4 − t × 0.67/√6 , 5.4 + t × 0.67/√6 ) mg/dl,
where t is chosen so that a T variable with 5 degrees of freedom has probability 0.01 of being bigger than t in absolute value.
10.1.2
T tables are like the χ² table. For Z, the table in your official booklet allows you to choose your value of Z, and gives you the probability of finding Z below this value. Thus, if you were interested in finding z_α, you would look to find α inside the table, and then check which value of z corresponds to it. In
principle, we could have a similar series of T tables, one for each number of
degrees of freedom. To save space, though, and because people are usually
not interested in the entire t distribution, but only in certain cutoffs, the
T tables give much more restricted information. The rows of the T table
represent degrees of freedom, and the columns represent cutoff probabilities.
The values in the table are then the values of t that give the cutoffs at those
probabilities. One peculiarity of these tables is that, whereas the Z table
gives one-sided probabilities, the t table gives two-sided probabilities. This
makes things a bit easier when you are computing symmetric confidence
intervals, which is all that we will do here.
The probability we are looking for is 0.01, which is the last column of the table, so looking in the row for 5 d.f. (see Figure 10.2) we see that t = 4.03. The 99% confidence interval is therefore 5.4 ± 4.03 × 0.67/√6 ≈ 5.4 ± 1.1, or about (4.3, 6.5) mg/dl.
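As a cross-check, a short Python sketch (not from the notes; the variable names are ours) reproduces this interval, with scipy's t.ppf playing the role of the printed T table.

    import numpy as np
    from scipy import stats

    x = np.array([5.6, 5.1, 4.6, 4.8, 5.7, 6.4])   # phosphate measurements (mg/dl)
    n = len(x)
    xbar = x.mean()                                 # about 5.37
    s = x.std(ddof=1)                               # sample SD, about 0.67

    t_crit = stats.t.ppf(0.995, df=n - 1)           # about 4.03 for 5 d.f.
    half_width = t_crit * s / np.sqrt(n)
    print(xbar - half_width, xbar + half_width)     # roughly (4.3, 6.5)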
10.1.3
We continue Example 10.2. Suppose 4.0 mg/dl is a dangerous level of phosphate, and we want to be 99% sure that the patient is, on average, above that level. Of course, all of our measurements are above that level, but they are also quite variable. It could be that all six of our measurements were exceptionally high. How do we make a statistically precise test?
Let H0 be the null hypothesis, that the patient's phosphate level is actually μ_0 = 4.0 mg/dl. The alternative hypothesis is that it is a different value, so this is a two-sided test. Suppose we want to test, at the 0.01 level, whether the null hypothesis could be consistent with the observations. We compute
T = (x̄ − μ_0)/(s/√n) = 5.12.
This statistic T has the t distribution with 5 degrees of freedom. The critical value is the value t such that the probability of |T| being bigger than t is 0.01. This is the same value that we looked up in Example 10.2, which is 4.03. Since our T value is 5.12, we reject the null hypothesis. That is, T is much too big: the probability of such a high value is smaller than 0.01. (In fact, it is about 0.002.) Our conclusion is that the true value of μ is not 4.0.
In fact, though, we're likely to be concerned not with a particular value of μ, but just with whether μ is too big or too small. Suppose we are concerned to be sure that the average phosphate level is really at least μ_0 = 4.0 mg/dl. In this case, we are performing a one-sided test, and we will reject T values that are too large (meaning that x̄ is too large to have plausibly resulted from sampling a distribution with mean μ_0). The computation of T proceeds as before, but now we have a different cutoff, corresponding to a probability twice as big as the level of the test, so 0.02. (This is because of the peculiar way the table is set up. We're now only interested in the probability in the upper tail of the t distribution, which is 0.01, but the table is indexed according to the total probability in both tails.) This is t = 3.36, meaning that we would have been more likely to reject the null hypothesis.
10.1.4
Another exception: If the number of samples is large there is no difference between Z and t. You may as well use Z, which is conceptually
a bit simpler. For most purposes, n = 50 is large enough to do only Z
tests.
10.1.5
σ_x² = Var(x) = (1/n) Σ_{i=1}^n (x_i − x̄)².
Why is it, then, that we estimate variance and SD by using a sample variance and sample SD in which the n in the denominator is replaced by n − 1?
s_x² = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)².
If we knew the true mean μ, we could instead average the squared deviations from μ,
(1/n) Σ_{i=1}^n (x_i − μ)²,
and this would be right on average. The trouble comes from substituting x̄ for the unknown μ.
To see what goes wrong, consider the simplest case, n = 2. Then x̄ = (x_1 + x_2)/2, and the naive variance estimate is
σ̂² = ( (x_1 − x_2)/2 )² = (1/4)[ (x_1 − μ)² + (x_2 − μ)² − 2(x_1 − μ)(x_2 − μ) ].
How big is this on average? The first two terms in the brackets will average to σ² (the technical term is, their expectation is σ²), while the last term averages to 0. The total averages then to just σ²/2: half the true variance, which is exactly the shortfall that dividing by n − 1 = 1 instead of n = 2 corrects.
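A small simulation sketch (not in the notes; the seed and the true variance are arbitrary) makes the same point numerically: with samples of size n = 2, dividing by n gives an estimate that averages to about half the true variance, while dividing by n − 1 is right on average.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2 = 4.0                                                   # true variance
    samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, 2))  # many samples of size n = 2

    naive = samples.var(axis=1, ddof=0)        # divide by n
    corrected = samples.var(axis=1, ddof=1)    # divide by n - 1

    print(naive.mean())       # close to sigma2 / 2 = 2
    print(corrected.mean())   # close to sigma2 = 4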
10.2
Paired-sample t test
A study was carried out on the effect of cigarette smoking on blood clotting. Some health problems that smokers are prone to are a result of abnormal blood clotting. Blood was drawn from 11 individuals before and after they smoked a cigarette, and researchers measured the percentage of blood platelets (the factors responsible for initiating clot formation) that aggregated when exposed to a certain stimulus. The results are shown in Table 10.2.
We see that the After numbers tend to be larger than the Before numbers. But could this be simply a result of random variation? After all, there is quite a lot of natural variability in the numbers.
Imagine that we pick a random individual, who has a normally distributed Before score X_i. Smoking a cigarette adds a random effect (also normally distributed) D_i, to make the After score Y_i. It's a mathematical fact that, if X and D are independent, and Y = X + D, then
Var(Y) = Var(X) + Var(D).
We are really interested to know whether D_i is positive on average, which we do by comparing the observed average value of d_i to the SD of d_i. But if we did the computation as a two-sample comparison, we would not use the SD of d_i in the denominator; we would use the SDs of x_i and y_i, which are much bigger. That is, the average difference between Before and After numbers would be found to be not large compared with the variability of the individual measurements.
Table 10.2: Percentage of blood platelets aggregating, before and after smoking a cigarette, for 11 individuals.

Before:       25   25   27   44   30   67   53   53   52   60   28
After:        27   29   37   56   46   82   57   80   61   59   43
Difference:    2    4   10   12   16   15    4   27    9   -1   15
The mean of the 11 differences is 10.3 and their SD is 7.98, so the paired-sample T statistic is
T = 10.3 / (7.98/√11) = 4.28.
The cutoff for p = 0.05 at 10 d.f. is found from the table to be 2.23, so we can
certainly reject the null hypothesis at the 0.05 level. The difference between
the Before and After measurements is found to be statistically significant.
(In fact, the p-value may be calculated to be about 0.002.)
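A short Python sketch (not part of the notes; the variable names are ours) reproduces the paired computation from the data in Table 10.2, both by hand and with scipy's paired t test.

    import numpy as np
    from scipy import stats

    before = np.array([25, 25, 27, 44, 30, 67, 53, 53, 52, 60, 28])
    after  = np.array([27, 29, 37, 56, 46, 82, 57, 80, 61, 59, 43])

    d = after - before
    t_by_hand = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # about 4.28

    t_scipy, p_value = stats.ttest_rel(after, before)          # two-sided p about 0.002
    print(t_by_hand, t_scipy, p_value)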
10.3
Introduction to sampling
10.3.1
It may seem odd that the computations in section 7.1 take no account of
the size of the population that we are sampling from. After all, if these 200
men were all the married men in the UK, there would be no sampling error
at all. And if the total population were 300, so that we had sampled 2 men
out of 3, there would surely be less random error than if there are 20 million
men in total, and we have sampled only 1 man out of 100,000. Indeed, this
is true, but the effect vanishes quite quickly as the size of the population
grows.
Suppose we have a box with N cards in it, each of which has a number, and we sample n cards without replacement, drawing numbers X_1, . . . , X_n. Suppose that the cards in the box were themselves drawn from a normal distribution with variance σ², and let μ be the population mean, that is, the mean of the numbers in the box. The sample mean X̄ is still normally distributed, with expectation μ, so the only question now is to determine the standard error. Call this standard error SE_NR (NR = no replacement), and the SE computed earlier SE_WR (WR = with replacement). It turns out that the standard error is precisely
SE_NR = SE_WR √( 1 − (n − 1)/(N − 1) ) = (σ/√n) √( 1 − (n − 1)/(N − 1) ).    (10.2)
Thus, if we had sampled 199 out of 300, the SE (and hence also the width of all our confidence intervals) would be multiplied by a factor of √(101/299) = 0.58, so would be barely half as large. If the whole population is 1000, so that we have sampled 1 out of 5, the correction factor has gone up to 0.89, so the correction is only by about 10%. And if the population is 10,000, the correction factor is 0.99, which is already negligible for nearly all purposes.
Thus, if the 199 married men had been sampled from a town with just 300 married men, the 95% confidence interval for the average height of married men in the town would be 1732mm ± 2 × 0.58 × 4.9mm = 1732mm ± 5.7mm, so about (1726, 1738)mm, instead of the 95% confidence interval computed earlier for sampling with replacement, which was (1722, 1742)mm.
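A tiny helper (illustrative only; the function name is ours) evaluates the correction factor in equation (10.2) for the examples just given.

    import math

    def fpc(n, N):
        # Factor by which the with-replacement SE shrinks when n of N units are sampled without replacement.
        return math.sqrt(1 - (n - 1) / (N - 1))

    print(fpc(199, 300))     # about 0.58
    print(fpc(200, 1000))    # about 0.89
    print(fpc(199, 10_000))  # about 0.99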
The size of the sample matters far more than the size of the population (unless you are sampling a large fraction of the population
without replacement).
10.3.2
Measurement bias
Bias is a crucial piece of the picture. This is the piece of the error that is
systematic: For instance, if you are measuring plots of land with a metre
stick, some of your measures will by chance be too big, and some will be too
small. The random errors will tend to be normally distributed with mean
0. Averaging more measurements produces narrower confidence bounds.
Suppose your metre stick is actually 101 cm long, though. Then all of your
measurements will start out about 1% too short before you add random
error on to them. Taking more measurements will not get you closer to the
true value, but rather to the ideal biased measure. The important lesson is:
Statistical analysis helps us to estimate the extent of random error. Bias
remains.
Of course, statisticians are very concerned with understanding the sources
of bias; but bias is very subject-specific. The bias that comes in conducting
a survey is very different from the bias that comes from measuring the speed
of blink reflexes in a psychology experiment.
Better measurement procedures, and better sampling procedures,
can reduce bias. Increasing numbers of measurements, or larger
samples, reduce the random error. Both cost time and effort. The
trick is to find the optimum tradeoff.
10.3.3
Bias in surveys
a smaller sample, but putting more effort into making sure it was a good
random sample. More random error, but less bias. The biggest advantage
is that you can compute how large the random error is likely to be, whereas
bias is almost entirely unknowable.
In section 7.1 we computed confidence intervals for the heights of British
men (in 1980) on the basis of 199 samples from the OPCS survey. In fact,
the description we gave of the data was somewhat misleading in one respect:
The data set we used actually gave the paired heights of husbands and wives.
Why does this matter? This sample is potentially biased because the only
men included are married. It is not inconceivable that unmarried men have
different average height from married men. In fact, the results from the complete OPCS sample are available [RSKG85], and the average height was found to be 1739mm, which is slightly higher than the average height for married men that we found in our selective subsample, but still within the 95% confidence interval.
The most extreme cases of selection bias arise when the sample is self-selected. For instance, if you look on a web site for a camera you're interested in, and see that 27 buyers said it was good and 15 said it was bad, what can you infer about the true percentage of buyers who were satisfied? Essentially nothing. We don't know what motivated those particular people to make their comments, or how they relate to the thousands of buyers who didn't comment (or commented on another web site).
Non-response bias
If you're calling people at home to survey their opinions about something, they might not want to speak to you, and the people who speak to you may be different in important respects from the people who don't speak to you. If you distribute a questionnaire, some will send it back and some won't. Again, the two groups may not be the same.
As an example, consider the health questionnaire that was mailed to
a random sample of 6009 residents of Somerset Health District in 1992.
The questionnaire consisted of 43 questions covering smoking habits, eating
patterns, alcohol use, physical activity, previous medical history and demographic and socio-economic details. 57.6% of the surveys were returned,
and on this basis the health authorities could estimate, for instance, that
24.2% of the population were current smokers, or that 44.3% engage in no
moderate or vigorous activity. You might suspect something was wrong
when you see that 45% of the respondents were male as compared with
just under 50% of the population of Somerset (known from the census).
$Amount    N      $Amount    N      $Amount    N
0          3      78         1      400        1
1          6      79         1      500        9
5          7      80         0      550        1
10        19      100       73      750        0
15         1      120        1      1000       4
20        20      150        4      1100       1
25        22      151        2      2000       1
30         7      180        1      2500       1
35         1      200       15      5000       3
40         4      250        5      9999+      2
50        28      300        8
55         1      325        0
75         1
Table 10.3: The amounts claimed to have been donated to tsunami relief by
254 respondents to a Gallup survey in January 2005, out of 1008 queried.
The respondents listed in Table 10.3 claimed to have donated an average of $192, meaning $48 per household across all those surveyed. Since there are about 110 million households in the US, that brings us to a total of more than $5 billion donated. Bialik noted that the Chronicle of Philanthropy, a trade publication, reported a total of $745 million donated by private sources, leaving a gap of more than $4 billion between what people said they donated and what relief organisations received.
Ascertainment bias
You analyse the data you have, but don't know which data you never got to observe. This can be a particular problem in a public health context, where you only get reports on the illnesses serious enough for people to seek medical treatment. Before the recent outbreak of swine flu, a novel bird flu was making public health experts nervous as it spread through the world from its site of origin in East Asia. While it mainly affects
flu, a novel bird flu was making public health experts nervous as it spread
through the world from its site of origin in East Asia. While it mainly affects
waterfowl, occasional human cases have occurred. Horrific mortality rates,
on the order of 50%, have been reported. Thorson et al. [TPCE06] pointed
out, though, that most people with mild cases of flu never come to the
attention of the medical system, particularly in poor rural areas of Vietnam
and China, where the disease is most prevalent. They found evidence of a
high rate of flulike illness associated with contact with poultry among the
rural population in Vietnam. Quite likely, then, many of these illnesses were
mild cases of the avian influenza. Mortality rate is the probability of cases of
disease resulting in death, which we estimate from the fraction of observed
cases resulting in death. In this case, though, the sample of observed cases
was biased: A severe case of flu was more likely to be observed than a
mild case. Thus, while the fraction of observed cases resulting in death was quite high in Vietnam (in 2003–5 there were 87 confirmed cases, of which 38 resulted in death), this likely does not reflect accurately the fraction of deaths among all cases in the population. In all likelihood, the 38 includes nearly all the deaths, but the 87 represents only a small fraction of the cases.
During World War II the statistician Abraham Wald worked with the US air force to analyse the data on damage to military airplanes from enemy fire. The question: given the patterns of damage that we observe, what would be the best place to put extra armour plating to protect the aircraft? (You can't put too much armour on, because it makes the aircraft too heavy.) His answer: put armour in places where you never see a bullet hole. Why? The bullet holes you see are on the planes that made it back. If you never see a bullet hole in some part of the plane, that's probably because the planes that were hit there didn't make it back. (His answer was more complicated than that, of course, and involved some careful statistical calculations. For a discussion, see [MS84].)
10.3.4
Measurement error
An old joke says, "If you have one watch you know what time it is. If you have two watches you're never sure." What do you do when multiple measurements of the same quantity give different answers? One possibility is to try to find out which one is right. Anders Hald, in his history of statistics [Hal90], wrote
statistics [Hal90], wrote
The crude instruments used by astronomers in antiquity and the
Middle Ages could lead to large [. . . ] errors. By planning their
observations astronomers tried to balance positive and negative
systematic errors. If they made several observations of the same
object, they usually selected the best as estimator of the true
value, the best being defined from such criteria as the occurrence of good observational conditions, special care having been
exerted, and so on.
One of the crucial insights that spurred the development of statistical theory is that the sampling and error problems are connected. Each measurement can be thought of as a sample from the population of possible measurements, and the whole sample tells you more about the population (hence about the hidden true value) than any single measurement could.
There are no measurements without error.
Lecture 11
Comparing Distributions
11.1
Consider again the sample of 198 men's heights, which we discussed in sections 7.1 and 10.3.3. As mentioned there, the data set gives paired heights of husbands and wives, together with their ages, and the age of the husband at marriage. This might allow us to pose a different sort of question. For instance, what is the average difference in height between men who married early and men who married late? We summarise the data in Table 11.1, defining early-married to mean before age 30.
What does this tell us? We know that the difference in our sample is 19mm, but does this reflect a true difference in the population at large, or could it be a result of mere random selection variation? To put it differently, how sure are we that if we took another sample of 199 and measured them, we wouldn't find a very different pattern?
Table 11.1: Heights (mm) of the 198 husbands, by age at marriage.

                 number   mean   SD
early (< 30)       160    1735   67
late (≥ 30)         35    1716   78
unknown              3    1758   59
total              198    1732   69
Let μ_X be the true average height in the population of early-married men, and μ_Y the true average for late-marrieds. The parameter we are interested in is μ_{X−Y} := μ_X − μ_Y. Obviously the best estimate for μ_{X−Y} will be X̄ − Ȳ, which will be normally distributed with the right mean, so that a symmetric level-α confidence interval will be (X̄ − Ȳ) ± z · SE. But what is the appropriate standard error for the difference?
Since the variance of a sum of independent random variables is the sum of their variances, we see that
SE²_{X̄−Ȳ} = Var(X̄ − Ȳ) = Var(X̄) + Var(Ȳ) = σ_X²/n_X + σ_Y²/n_Y,
where σ_X is the standard deviation for the X variable (the height of early-marrieds) and σ_Y is the standard deviation for the Y variable (the height of late-marrieds); n_X and n_Y are the corresponding numbers of samples. This gives us the standard formula:
SE_{X̄−Ȳ} = √( σ_X²/n_X + σ_Y²/n_Y ) = σ √( 1/n_X + 1/n_Y )   if σ_X = σ_Y = σ.
Formula 11.1: Standard error for the difference between two normally distributed variables
Thus, X̄ − Ȳ = μ_{X−Y} + SE_{X̄−Ȳ} · Z, where Z has a standard normal distribution. Suppose now we want to compute a 95% confidence interval for the difference in heights of the early- and late-marrieds. The point estimate, we know, is +19mm, and the SE is
√( 67²/160 + 78²/35 ) ≈ 14mm.
The confidence interval for the difference ranges then from −9mm to +47mm. Thus, while our best guess is that the early-marrieds are on average 19mm taller than the late-marrieds, all that we can say with 95% confidence, on the basis of our sample, is that the difference in height is between −9mm and +47mm. That is, heights are so variable that a sample of this size might easily be off by 28mm either way from the true difference in the population.
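The same interval can be reproduced in a few lines of Python (a sketch, not from the notes; the variable names are ours), using only the summary figures in Table 11.1.

    import math

    n_x, mean_x, sd_x = 160, 1735, 67   # early-married
    n_y, mean_y, sd_y = 35, 1716, 78    # late-married

    diff = mean_x - mean_y                          # +19 mm
    se = math.sqrt(sd_x**2 / n_x + sd_y**2 / n_y)   # about 14 mm
    print(diff - 1.96 * se, diff + 1.96 * se)       # roughly (-9, +47) mm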
11.2
Suppose we want to test the null hypothesis that there is no difference between the two population means, H0: μ_X = μ_Y. The natural test statistic is
Z = (X̄ − Ȳ)/SE_{X̄−Ȳ} = 19 mm / 14 mm = 1.4.
If we were testing at the 0.05 level, we would not reject the null hypothesis: we reject only values of Z bigger than 1.96 (in absolute value). Even testing at the 0.10 level we would not reject the null, since the cutoff is 1.64.
Our conclusion is that the difference in heights between the early-married
and late-married groups is not statistically significant. Notice that this is
precisely equivalent to our previous observation that the symmetric 95%
confidence interval includes 0.
If we wish to test H0 against the alternative hypothesis μ_X > μ_Y, we are performing a one-sided test: We use the same test statistic Z, but we reject values of Z which correspond to large values of X̄ − Ȳ, so large positive values of Z. Large negative values of Z, while they are unlikely for the null hypothesis, are even more unlikely for the alternative. The cutoff for testing at the 0.05 level is z_0.95 = 1.64. Thus, we do not reject the null hypothesis.
11.3
The standard error works out to
SE = 0.326 √( 1/n_u + 1/n_c ) = 0.326 √( 1/54 + 1/70 ) = 0.059,
so the test statistic is
Z = (p̂_u − p̂_c)/SE = 0.083/0.059 = 1.41.
0.059
The cutoff for rejecting Z at the 0.05 level is 1.96. Since the observed Z is
smaller than this, we do not reject the null hypothesis, that the infection
rates are in fact equal. The difference in infection rates is not statistically
significant, as we cannot be confident that the difference is not simply due
to chance.
11.4
Consider the following study [SCT+ 90], discussed in [RS02, Chapter 2]: Researchers measured the volume of the left hippocampus of the brains of 30
men, of whom 15 were schizophrenic. The goal is to determine whether
there is a difference in the size of this brain region between schizophrenics
and unaffected individuals. The data are given in Table 11.2. The average size among the unaffected subjects (in cm³) is 1.76, while the mean for the schizophrenic subjects is 1.56. The sample SDs are 0.24 and 0.30 respectively. What can we infer about the populations that these individuals
were sampled from? Do schizophrenics have smaller hippocampal volume,
on average?
We make the modeling assumption that these individuals are a random sample from the general population (of healthy and schizophrenic men, respectively; more about this in section 11.5.2). We also assume that the underlying variance of the two populations is the same, but unknown: the difference between the two groups (potentially) is in the population means μ_x (healthy) and μ_y (schizophrenic), and we want a confidence interval for the difference.
Since we don't know the population SD in advance, and since the number of samples is small, we use the T distribution for our confidence intervals instead of the normal. (Since we don't know that the population is normally distributed, we are relying on the normal approximation, which may be questionable for averages of small samples. For more about the validity of this assumption, see section 11.5.3.) As always, the symmetric 95% confidence interval is of the form
Estimate ± t × SE,
where t is the number such that 95% of the probability in the appropriate T distribution is between −t and t (that is, the number in the P = 0.05 column of your table). We need to know
(1). How many degrees of freedom?
(2). What is the SE?
The first is easy: We add the degrees of freedom, to get n_x + n_y − 2; in this case, 28.
The second is sort of easy: Like for the Z test, when σ_x = σ_y (that's a common value we call σ), the standard error is
SE = σ √( 1/n_x + 1/n_y ).
The only problem is that we don't know what σ is. We have our sample SDs s_x and s_y, each of which should be approximately σ. The bigger the sample, the better the approximation should be. This leads us to the pooled sample variance s_p², which simply averages these estimates, counting the bigger sample more heavily:
s_p² = ( (n_x − 1)s_x² + (n_y − 1)s_y² ) / (n_x + n_y − 2),
and then
SE = s_p √( 1/n_x + 1/n_y ).
Here (n_x − 1)s_x² = 14 × 0.24² ≈ 0.81 for the unaffected group and (n_y − 1)s_y² = 14 × 0.30² = 1.26 for the schizophrenic group, so s_p² = (0.81 + 1.26)/28 ≈ 0.074, s_p ≈ 0.27, and SE = 0.27 × √(1/15 + 1/15) ≈ 0.099.
11.5
11.5.1
Is the difference in hippocampal volume between the two groups statistically significant? Can the observed difference be due to chance? We perform a t test at significance level 0.05. Our null hypothesis is that μ_x = μ_y, and our two-tailed alternative is μ_x ≠ μ_y. The standard error is computed exactly as before, to be 0.099. The T test statistic is
T = (X̄ − Ȳ)/SE = 0.20/0.099 = 2.02.
We then observe that this is not above the critical value 2.05, so we RETAIN the null hypothesis, and say that the difference is not statistically significant.
If we had decided in advance that we were only interested in whether μ_x > μ_y (the one-tailed alternative), we would use the same test statistic T = 2.02, but now we draw our critical value from the P = 0.10 column, which gives us 1.70. In this case, we would reject the null hypothesis. On the other hand, if we had decided in advance that our alternative was μ_x < μ_y, we would have a critical value of −1.70, with rejection region anything below that, so of course we would retain the null hypothesis.
11.5.2
It may seem disappointing that we can't do more with our small samples. The measurements in the healthy group certainly seem bigger than those of the schizophrenic group. The problem is, there's so much noise, in the form of general overall variability among the individuals, that we can't be sure if the difference between the groups is just part of that natural variation.
It might be nice if we could get rid of some of that noise. For instance, suppose the variation between people was in three parts (call them A, B, and S), so your hippocampal volume is A + B + S. S is the effect of having schizophrenia, and A is random stuff we have no control over. B is another individual effect, but one that we suppose is better defined (for instance, the effect of genetic inheritance). Suppose we could pair up schizophrenic and non-schizophrenic people with the same B score, and then look at the difference between individuals within a pair. Then B cancels out, and S becomes more prominent. This is the idea of the matched case-control study.
In fact, that's just what was done in this study. The real data are given in Table 12.5. The 30 subjects were, in fact, 15 pairs of monozygotic twins, of whom one was schizophrenic and the other not. The paired-sample T test is exactly the one that we described in section 10.2. The mean of the 15 within-pair differences is 0.199 and their SD is 0.238, so the paired T statistic is T = 0.199/(0.238/√15) ≈ 3.2. With 14 degrees of freedom the two-tailed critical value at the 0.05 level is 2.14, so the paired test does find the difference statistically significant.
One way to think about the paired test is this: imagine a box of cards, each of which has an X and a Y side, with numbers on each, and you are trying to determine from the sample whether X and Y have the same average. You could write down your sample of Xs, then turn over all the cards and write down your sample of Ys, and compare the means. If the X and Y numbers tend to vary together, though, it makes more sense to look at the differences X − Y over the cards, rather than throw away the information about which X goes with which Y. If the Xs and Ys are not actually related to each other, then it shouldn't matter.
11.5.3
In section 11.5.2 we supposed that the average difference between the unaffected and schizophrenic hippocampus volumes would have a nearly normal distribution. The CLT tells us that the average of a very large number of such differences, picked at random from the same distribution, would have an approximately normal distribution. But is that true for just 15?
We would like to test this supposition.
One way to do this is with a random experiment. We sample 15 volume
differences at random from the whole population of twins, and average them.
Repeat 1000 times, and look at the distribution of averages we find. Are
these approximately normal?
But wait! We don't have access to any larger population of twins; and if we did, we would have included their measurements in our study. The trick (which is widely used in modern statistics, but is not part of this course), called the bootstrap, is instead to resample from the data we already have, picking 15 samples with replacement from the 15 we already have, so some will be counted several times, and some not at all. It sounds like cheating, but it can be shown mathematically that it works.
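A minimal sketch of the resampling just described (not part of the notes; the seed and the choice of 1000 resamples are arbitrary), using the 15 twin-pair differences from Table 12.5:

    import numpy as np

    diffs = np.array([0.67, -0.19, 0.09, 0.19, 0.13, 0.40, 0.04, 0.10,
                      0.50, 0.07, 0.23, 0.59, 0.02, 0.03, 0.11])

    rng = np.random.default_rng(1)
    boot_means = np.array([rng.choice(diffs, size=len(diffs), replace=True).mean()
                           for _ in range(1000)])

    print(boot_means.mean(), boot_means.std())   # close to the observed mean (0.20) and its SE (about 0.06)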
A histogram of the results is shown in Figure 11.2(a), together with the
appropriate normal curve. The fit looks pretty good, which should reassure
us of the appropriateness of the test we have applied. Another way of seeing
that this distribution is close to normal is Figure 11.2(b), which shows a so-called Q-Q plot. (The Q-Q plot is not examinable as such. In principle, it is
the basis for the Kolmogorov-Smirnov test, which we describe in section 13.1.
This will give us a quantitative answer to the question: Is this distribution
close to the normal distribution?) The idea is very simple: We have 1000
numbers that we think might have been sampled from a normal distribution.
We look at the normal distribution these might have been sampled from (the one with the same mean and variance as this sample), take 1000 numbers evenly spaced from that normal distribution, and plot them against each other. If the sample really came from the normal distribution, then the two should be about equal, so the points will all lie on the main diagonal.
Figure 11.2(c) shows a Q-Q plot for the original 15 samples, which clearly
do not fit the normal distribution very well.
Figure 11.2: (a) Histogram of 1000 resampled means; (b) Q-Q plot of 1000 resampled means; (c) Q-Q plot of the 15 original measurements.
11.6
11.6.1
Quantitative experiments
Let us first apply a Z test at the 0.05 significance level, without thinking too deeply about what it means. (Because of the normal approximation, it doesn't really matter what the underlying distribution of weight loss is.)
We compute first the pooled sample variance:
s_p² = (170 × 3.5² + 104 × 2.8²) / 274 = 10.6,
so s_p = 3.25 kg. The standard error is s_p √(1/n + 1/m) = 0.40 kg. Our test statistic is then
Z = (x̄ − ȳ)/SE = 3.1/0.4 = 7.7.
This exceeds the rejection threshold of 1.96 by a large margin, so we conclude that the difference in weight loss between the groups is statistically
significant.
But are we justified in using this hypothesis test? What is the random sampling procedure that defines the null hypothesis? What is the "just by chance" that could have happened, and that we wish to rule out? These 276 women were not randomly selected from any population, at least not according to any well-defined procedure. The randomness is in the assignment of women to the two groups: we need to show that the difference between the two groups is not merely a chance result of the women that we happened to pick for the intervention group, which might have turned out differently if we happened to pick differently.
In fact, this model is a lot like the model of section 11.5. Imagine a box
containing 276 cards, one for each woman in the study. Side A of the card
says how much weight the woman would lose if she were in the intervention
group; side B says how much weight she would lose if she were in the control
group. The null hypothesis states, then, that the average of the As is the
same as the average of the Bs. The procedure of section 11.5 says that we
compute all the values A-B from our sample, and test whether these could
have come from a distribution with mean 0. The problem is that we never
get to see A and B from the same card.
Instead, we have followed the procedure of section 11.5.1, in which we
take a sample of As and a sample of Bs, and test for whether they could
have come from distributions with the same mean. But there are some
important problems that we did not address there:
We sample the As (the intervention group) without replacement. Furthermore, this is a large fraction of the total population (in the box).
We know from section 10.3.1 that this makes the SE smaller.
206
Comparing Distributions
The sample of Bs is not an independent sample; it's just the complement of the sample of As. A bit of thought makes clear that this tends to make the SE of the difference larger.
This turns out to be one of those cases where two wrongs really do make
a right. These two errors work in opposite directions, and pretty much
cancel each other out. Consequently, in analysing experiments we ignore
these complications and proceed with the Z- or t-test as in section 11.5.1,
as though they were independent samples.
11.6.2
Qualitative experiments
Table 11.4

         Question A   Question B
Yes          161           92
No            22           54
These 383 subjects were not sampled from any population, and we have no idea how representative they may or may
not be of the larger category of Homo sapiens. The real question is, among
these 383 people, how likely is it that we would have found a different result
had we by chance selected a different group of 200 people to pose question
B to. We want to do a significance test at the 0.01 level.
The model is then: 383 cards in a box. On one side is that person's answer to Question A, on the other side the same person's answer to Question B (coded as 1=yes, 0=no). The null hypothesis is that the average on the A side is the same as the average on the B side (which includes the more specific hypothesis that the As and the Bs are identical).
We pick 183 cards at random, and add up their side As, coming to 161;
from the other 200 we add up the side Bs, coming to 92. Our procedure is
then:
(1). The average of the sampled side As is X̄_A = 0.88, while the average of the sampled side Bs is X̄_B = 0.46.
(2). The standard deviation of the A sides is estimated at σ_A = √(p(1 − p)) = 0.32, while the standard deviation of the B sides is estimated at σ_B = √(p(1 − p)) = 0.50.
(3). The standard error for the difference is estimated at
SE_{A−B} = √( σ_A²/n_A + σ_B²/n_B ) = √( 0.32²/183 + 0.50²/200 ) = 0.043.
(4). Z = (X̄_A − X̄_B)/SE_{A−B} = 9.77. The cutoff for a two-sided test at the 0.01 level is z_0.995 = 2.58, so we clearly do reject the null hypothesis.
The conclusion is that the difference in answers between the two questions was not due to the random sampling. Again, this tells us nothing
directly about the larger population from which these 383 individuals were
sampled.
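The four numbered steps can be reproduced in a few lines (a sketch, not from the notes; the variable names are ours).

    import math

    n_a, yes_a = 183, 161
    n_b, yes_b = 200, 92

    p_a, p_b = yes_a / n_a, yes_b / n_b               # about 0.88 and 0.46
    sd_a = math.sqrt(p_a * (1 - p_a))                 # about 0.32
    sd_b = math.sqrt(p_b * (1 - p_b))                 # about 0.50
    se = math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b)     # about 0.043
    print((p_a - p_b) / se)                           # about 9.8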
Lecture 12
Non-Parametric Tests I
12.2
12.2.1
A first attempt
We recall the study of infant walking that we described way back in section
1.1. Six infants were given exercises to maintain their walking reflex, and
six control infants were observed without any special exercises. The ages
(in months) at which the infants were first able to walk independently are
recapitulated in Table 12.1.
Treatment   Control
 9.00       11.50
 9.50       12.00
 9.75        9.00
10.00       11.50
13.00       13.25
 9.50       13.00

Table 12.1: Age (in months) at which infants were first able to walk independently. Data from [ZZK72].
As we said then, the Treatment numbers seem generally smaller than the Control numbers, but not entirely, and the number of observations is small. Could we merely be observing sampling variation, where we happened to get six (five, actually) early walkers in the Treatment group, and late walkers in the Control group?
Following the approach of Lecture 11, we might perform a two-sample T test for equality of means. We test the null hypothesis μ_TREAT = μ_CON against the one-tailed alternative μ_TREAT < μ_CON, at the 0.05 level. To find the critical value, we look in the column for P = 0.10, with 6 + 6 − 2 = 10 d.f., obtaining 1.81. The critical region is then {T < −1.81}. The relevant
summary statistics are given in Table 12.2. We compute the pooled sample
variance
s_p = √( ((6 − 1) × 1.45² + (6 − 1) × 1.52²) / (6 + 6 − 2) ) = 1.48,
so the standard error is
SE = s_p √( 1/6 + 1/6 ) = 0.85.
The test statistic is then
T = (X̄ − Ȳ)/SE = −1.6/0.85 = −1.85.
So we reject the null hypothesis, and say that the difference between the
two groups is statistically significant.
             Mean    SD
Treatment    10.1    1.45
Control      11.7    1.52

Table 12.2: Summary statistics for the infant walking data of Table 12.1.
12.2.2
We may wonder about the validity of this test, though, particularly as the observed T was just barely inside the critical region. After all, the T test depends on the assumption that when X_1, . . . , X_6 and Y_1, . . . , Y_6 are independent samples from a normal distribution with unknown mean and variance, then the observed T statistic will be below −1.81 just 5% of the time. But the data we have, sparse though they are, don't look like they come from a normal distribution. They look more like they come from a bimodal distribution, like the one sketched in Figure 12.1.
So, there might be early walkers and late walkers, and we just happened
to get mostly early walkers for the Treatment group, and late walkers for
the Control group. How much does this matter? We present one way of
seeing this in section 12.2.3. For the time being, we simply note that there
is a potential problem, since the whole idea of this statistical approach was
to develop some certainty about the level of uncertainty.
Figure 12.1: Sketch of a distribution from which the walking times of Table 12.1 might have been drawn, if they all came from the same distribution. The actual measurements are shown as a rug plot along the bottom (green for Treatment, red for Control). The marks have been adjusted slightly to avoid exact overlaps.
12.2.3
[Figure: histogram of simulated values of the T statistic.]
Critical values for a given two-tailed probability P:

P:            0.10    0.05    0.01    0.001
Standard:     1.81    2.23    3.17    4.59
Simulated:    1.83    2.25    3.22    5.14

Two-tailed tail probability for a given value of |T|:

|T|:          1.81    2       2.5     3.17    4
Standard:     0.100   0.073   0.031   0.010   0.0025
Simulated:    0.104   0.076   0.035   0.011   0.0033
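The comparison above can be reproduced by simulation: draw repeated Treatment and Control samples of size 6 from the same bimodal distribution, compute the two-sample T statistic each time, and compare the simulated tail probabilities with the standard t table. Here is a minimal sketch (not from the notes; the mixture used as the bimodal distribution is purely illustrative, not the one actually used).

    import numpy as np

    rng = np.random.default_rng(2)

    def draw_bimodal(size):
        # Illustrative mixture: 'early walkers' around 9.5 months, 'late walkers' around 12.5.
        early = rng.random(size) < 0.5
        return np.where(early, rng.normal(9.5, 0.5, size), rng.normal(12.5, 0.7, size))

    def t_stat(x, y):
        sp2 = ((len(x) - 1) * x.var(ddof=1) + (len(y) - 1) * y.var(ddof=1)) / (len(x) + len(y) - 2)
        return (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / len(x) + 1 / len(y)))

    sims = np.array([t_stat(draw_bimodal(6), draw_bimodal(6)) for _ in range(10_000)])
    print(np.mean(np.abs(sims) > 2.23))   # compare with the nominal two-tailed 0.05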
12.3
12.3.1
Median test
Suppose we have samples x_1, . . . , x_{n_x} and y_1, . . . , y_{n_y} from two distinct distributions whose medians are m_x and m_y. We wish to test the null hypothesis
H0: m_x = m_y
against the two-tailed alternative H_alt: m_x ≠ m_y, or a one-tailed alternative H_alt: m_x < m_y or H_alt: m_x > m_y.
The idea of this test is straightforward: Let M be the median of the combined sample {x1 , . . . , xnx , y1 , . . . , yny }. If the medians are the same, then
the xs and the ys should have an equal chance of being above M . Let Px
be the proportion of xs that are above M , and Py the proportion of ys that
are above M . It turns out that we can treat these as though they were the
proportions of successes in nx and ny trials respectively. Analysing these
results is not entirely straightforward; we get a reasonable approximation
by using the Z test for differences between proportions, as in section 11.3.
Consider the case of the infant walking study, described in section 1.1. The 12 measurements, in increasing order, are
9.0 (T), 9.0 (C), 9.5 (T), 9.5 (T), 9.75 (T), 10.0 (T), 11.5 (C), 11.5 (C), 12.0 (C), 13.0 (T), 13.0 (C), 13.25 (C),
where C marks a Control result and T a Treatment result. The median is 10.75, and we see that there are 5 control results above the median, and one treatment.
Calculating the p-value: Exact method
Imagine that we have n_x red balls and n_y green balls in a box. We pick half of them at random (these are the above-median outcomes) and get k_x red and k_y green. We expected to get about n_x/2 red and n_y/2 green. What is the probability of such an extreme result? This is reasonably straightforward to compute, though slightly beyond what we're doing in this course. We'll just go through the calculation for this one special case.
We pick 6 balls from 12, where 6 were red and 6 green. We want P(at least 5 red). Since the null hypothesis says that all picks are equally likely, this is simply the fraction of ways that we could make our picks which happen to have 5 or 6 red. That is,
P(at least 5 R) = (# ways to pick 5 red and 1 green + # ways to pick 6 red) / (# ways to pick 6 from 12).
The number of ways to pick 6 balls from 12 is what we call 12C6 = 924. The number of ways to pick 6 red and 0 green is just 1: we have to take all the reds, we have no choice. The only slightly tricky one is the number of ways to pick 5 red and 1 green. A little thought shows that we have 6C5 = 6 ways of choosing the red balls, and 6C1 = 6 ways of choosing the green one, so 36 ways in all. Thus, the p-value comes out to 37/924 ≈ 0.04, so we still reject the null hypothesis at the 0.05 level. (Of course, the p-value for a two-tailed test is twice this, or about 0.08.)
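The counting argument can be checked directly (a sketch, not from the notes).

    from math import comb

    total_ways = comb(12, 6)                 # ways to choose the 6 above-median slots: 924
    ways_5_red = comb(6, 5) * comb(6, 1)     # 5 control and 1 treatment above the median: 36
    ways_6_red = comb(6, 6) * comb(6, 0)     # all 6 control above the median: 1
    print((ways_5_red + ways_6_red) / total_ways)   # about 0.04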
Calculating the p-value: Approximate method
Slightly easier is to use the normal approximation to compute the p-value. This is the method described in your formula booklets. But this needs to be done with some care.
The idea is the following: We have observed a certain number n_+ of above-median outcomes. (This will be about half the total number of samples n = n_x + n_y, but slightly adjusted if the number is odd, or if there are ties.) We have observed proportions p_x = k_x/n_x and p_y = k_y/n_y of above-median samples from the two groups, respectively. We apply a Z test (as in sections 11.3 and 11.6.2) to test whether these proportions could be really the same (as they should be under the null hypothesis).¹
We compute the standard error as
SE = √( p(1 − p) ( 1/n_x + 1/n_y ) ).
¹ If you're thinking carefully, you should object at this point that we can't apply the Z test for difference between proportions, because that requires that the two samples be independent. That's true. The fact that they're dependent raises the standard error; but the fact that we're sampling without replacement lowers the standard error. The two effects more or less cancel out, just as we described in section 11.6. Another way of computing this result is to use the χ² test for independence, writing this as a 2 × 2 contingency table. In principle this gives the same result, but it's hard to apply the continuity correction there. In a case like this one, the expected cell occupancies are smaller than we allow.
With a continuity correction, the proportions compared are taken to be
p_x = (k_x − 0.5)/n_+   and   p_y = (k_y + 0.5)/n_+.
Figure 12.3: The exact probabilities from the binomial distribution for extreme results of the number of red (control) and green (treatment) in the infant walking experiment, plotted against the fraction red and the fraction green. The corresponding normal approximations are shaded. Note that the upper tail starts at 4.5/6 = 0.75, not at 5/6; and the lower tail starts at 1.5/6 = 0.25, rather than at 1/6.
There are many defects of the median test. One of them is that the results are discrete (there are at most n/2 + 1 different possible outcomes to the test) while the analysis with Z is implicitly continuous. This is one
of the many reasons why the median test, while it is sometimes seen, is not
recommended. (For more about this, see [FG00].) The rank-sum test is
almost always preferred.
Note that this method requires that the observations be all distinct.
There is a version of the median test that can be used when there are ties
among the observations, but we do not discuss it in this course.
12.3.2
Rank-Sum test
The median test is obviously less powerful than it could be, because it
considers only how many of each group are above or below the median, but
not how far above or below. In the example of section 12.2, while 5 of the 6 treatment samples are below the median, the one that is above the median is
near the top of the whole sample; and the one control sample that is below
the median is in fact near the bottom. It seems clear that we should want
to take this extra information into account. The idea of the rank-sum test
(also called the Mann-Whitney test) is that we consider not just yes/no,
above/below the median, but the exact relative ranking.
Continuing with this example, we list all 12 measurements in order, and
replace them by their ranks:
measurements   ranks   modified ranks
 9.0             1         1.5
 9.0             2         1.5
 9.5             3         3.5
 9.5             4         3.5
 9.75            5         5
10.0             6         6
11.5             7         7.5
11.5             8         7.5
12.0             9         9
13.0            10        10.5
13.0            11        10.5
13.25           12        12
When measurements are tied, we average the ranks (we show this in the
column labelled modified ranks.) We wish to test the null hypothesis
H0 : control and treatment came from the same distribution; against the
alternative hypothesis that the controls are generally larger.
We compute a test statistic R, which is just the sum of the ranks in
the smaller sample. (In this case, the two samples have the same size, so
we can take either one. We will take the treatment sample.) The idea is
that these should, if H0 is true, be like a random sample from the numbers
1, . . . , nx + ny . If R is too big or too small we take this as evidence to reject
H0. In the one-tailed case, we reject R for being too small (if the alternative hypothesis is that the corresponding group has smaller values), or for being too large (if the alternative hypothesis is that the corresponding group has larger values).
In this case, the alternative hypothesis is that the group under consideration, the treatment group, has smaller values, so the rejection region
consists of R below a certain threshold. It only remains to find the appropriate threshold. These are given on the Mann-Whitney table (Table 5 in
the formula booklet). The layout of this table is somewhat complicated.
The table lists critical values corresponding only to P = 0.05 and P = 0.10.
We look in the row corresponding to the size of the smaller sample, and
column corresponding to the larger. For a two-tailed test we look in the
(sub-) row corresponding to the desired significance level; for a one-tailed
test we double the p-value.
The sum of the ranks for the treatment group in our example is R = 30.
Since we are performing a one-tailed test with the alternative being that
the treatment values are smaller, our rejection region will be of the form
R ≤ some critical value. We find the critical value on Table 12.4: for two samples of size 6, the critical values are 26, 52 at P = 0.05 and 28, 50 at P = 0.10. For a one-tailed test at the 0.05 level we take the P = 0.10 values, 28 and 50. Hence, we would reject if R ≤ 28. Since R = 30, we retain the null hypothesis in this test.
If we were performing a two-tailed test instead, we would reject R ≤ 26 and R ≥ 52.
The table you are given goes up only as far as the larger sample size
equal to 10. For larger samples, we use a normal approximation:
z = (R − μ)/σ,   where   μ = n_x(n_x + n_y + 1)/2   and   σ = √( n_x n_y (n_x + n_y + 1)/12 ).
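As a sketch (not from the notes; the function name is ours), the approximation can be wrapped in a small helper. Applying it to the infant-walking numbers is only for illustration, since samples this small should use the exact table.

    import math

    def rank_sum_z(R, n_x, n_y):
        # z statistic for the rank sum R of the smaller sample (of size n_x).
        mu = n_x * (n_x + n_y + 1) / 2
        sigma = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
        return (R - mu) / sigma

    print(rank_sum_z(30, 6, 6))   # about -1.44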
Table 12.4: Critical values for the Mann-Whitney rank-sum test.

smaller sample                larger sample size n2
size n1                     7         8         9         10
4        P = 0.10        15, 33    16, 36    17, 39    18, 42
         P = 0.05        13, 35    14, 38    15, 41    16, 44
5        P = 0.10        22, 43    23, 47    25, 50    26, 54
         P = 0.05        20, 45    21, 49    22, 53    24, 56
6        P = 0.10        30, 54    32, 58    33, 63    35, 67
         P = 0.05        28, 56    29, 61    31, 65    33, 69
7        P = 0.10        39, 66    41, 71    43, 76    46, 80
         P = 0.05        37, 68    39, 73    41, 78    43, 83
8        P = 0.10                  52, 84    54, 90    57, 95
         P = 0.05                  49, 87    51, 93    54, 98
9        P = 0.10                            66, 105   69, 111
         P = 0.05                            63, 108   66, 114
10       P = 0.10                                      83, 127
         P = 0.05                                      79, 131
12.4
As with the t test, when the data fall naturally into matched pairs, we can
improve the power of the test by taking this into account. We are given data
in pairs (x1 , y1 ), . . . , (xn , yn ), and we wish to test the null hypothesis H0 : x
and y come from the same distribution. In fact, the null hypothesis may be
thought of as being even broader than that. As we discuss in section 12.4.4,
there is no reason, in principle, why the data need to be randomly sampled
at all. The null hypothesis says that the xs and the ys are indistinguishable
from a random sample from the complete set of xs and ys together. We don't use the precise numbers (which depend upon the unknown distribution of the xs and ys) but only basic reasoning about the relative sizes of the numbers. Thus, if the xs and ys come from the same distribution, it is equally likely that x_i > y_i as that x_i < y_i.
12.4.1
Sign test
The idea of the sign test is quite straightforward. We wish to test the null
hypothesis that paired data came from the same distribution. If that is the
case, then which one of the two observations is the larger should be just
like a coin flip. So we count up the number of times (out of n pairs) that
the first observation in the pair is larger than the second, and compute the
probability of getting that many heads in n coin flips. If that probability is
below the chosen significance level α, we reject the null hypothesis.
Schizophrenia study
Consider the schizophrenia study, discussed in section 11.5.2. We wish to
test, at the 0.05 level, whether the schizophrenic twins have different brain
measurements than the unaffected twins, against the null hypothesis that
the measurements are really drawn from the same distribution. We focus
now on the differences. Instead of looking at their values (which depends
upon the underlying distribution) we look only at their signs, as indicated
in Table 12.5. The idea is straightforward: Under the null hypothesis, the
difference has an equal chance of being positive or negative, so the number
of positive signs should be like the number of heads in fair coin flips. In this
case, we have 14 positive signs out of 15, which is obviously highly unlikely.
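The binomial probability can be computed directly (a sketch, not from the notes).

    from math import comb

    n = 15
    p_one_tailed = sum(comb(n, k) for k in range(14, n + 1)) / 2**n
    print(p_one_tailed, 2 * p_one_tailed)   # about 0.0005 one-tailed, 0.001 two-tailed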
Table 12.5: Data from the Suddath [SCT+90] schizophrenia experiment. Hippocampus volumes in cm³.

Unaffected   Schizophrenic   Difference   Sign
1.94         1.27             0.67         +
1.44         1.63            -0.19         -
1.56         1.47             0.09         +
1.58         1.39             0.19         +
2.06         1.93             0.13         +
1.66         1.26             0.40         +
1.75         1.71             0.04         +
1.77         1.67             0.10         +
1.78         1.28             0.50         +
1.92         1.85             0.07         +
1.25         1.02             0.23         +
1.93         1.34             0.59         +
2.04         2.02             0.02         +
1.62         1.59             0.03         +
2.08         1.97             0.11         +
12.4.2
Breastfeeding study
A study compared treatments intended to prepare ("toughen") the nipples for breastfeeding. Each mother treated one breast and left the other untreated, as a control. The two breasts were rated daily for level of discomfort, on a scale of 1 to 4. Each method was used by 19 mothers, and the average differences between the treated and untreated breast for the 19 mothers who used the toughening treatment were: -0.525, 0.172, -0.577, 0.200, 0.040, -0.143, 0.043, 0.010, 0.000, -0.522, 0.007, -0.122, -0.040, 0.000, -0.100, 0.050, -0.575, 0.031, -0.060.
The original study performed a one-tailed t test at the 0.05 level of the null hypothesis that the true difference between treated and untreated breasts was 0: the cutoff is then −1.73 (so we reject the null hypothesis on any value of T below −1.73). We have x̄ = −0.11, and s_x = 0.25. We compute then T = (x̄ − 0)/(s_x/√n) = −1.95, leading us to reject the null. We should, however, be suspicious of this marginal result, which depends upon the choice of a one-tailed test: for a two-tailed test the cutoff would have been −2.10.
In addition, we note that the assumption of normality is drastically violated, as we see from the histogram of the observed values (Figure 12.4). To apply the sign test, we see that there are 8 positive and 9 negative values, which is as close to an even split as we could have, and so conclude that there is no evidence in the sign test of a difference between the treated and untreated breasts. (Formally, we could compute p̂ = 8/17 = 0.47, and Z = (0.47 − 0.50)/(0.5/√17) = −0.247, which is nowhere near the cutoff of 1.96 in absolute value for the z test at the 0.05 level.)
[Figure 12.4: histogram of the 19 observed average differences.]
Historical note: This study formed part of the rediscovery of breastfeeding in the 1970s, after a generation of its being disparaged by the medical
community. Their overall conclusion was that the traditional treatments were
ineffective, the toughening marginally so. The emphasis on breastfeeding
being fundamentally uncomfortable reflects the discomfort that the medical
research community felt about nursing at the time.
12.4.3
As with the two-sample test in section 12.3.2, we can strengthen the paired-sample test by considering not just which number is bigger, but the relative
ranks. The idea of the Wilcoxon (or signed-rank) test is that we might have
about equal numbers of positive and negative values, but if the positive
values are much bigger than the negative (or vice versa) that will still be evidence that the distributions are different. For instance, in the breastfeeding
study, the t test produced a marginally significant result because several of
the very large values are all negative.
The mechanics of the test are the same as for the two-sample rank-sum
test, only the two samples are not the xs and the ys, but the positive and
negative differences. In a first step, we rank the differences by their absolute
values. Then, we carry out a rank-sum test on the positive and negative differences. To apply the Wilcoxon test, we first drop the two 0 values, and
then rank the remaining 17 numbers by their absolute values:
Diff    0.007   0.010   0.031   0.040   -0.040   0.043   0.050   -0.060   -0.100
Rank      1       2       3      4.5      4.5      6       7        8        9

Diff   -0.122   -0.143   0.172   0.200   -0.522   -0.525   -0.575   -0.577
Rank     10       11      12      13       14       15       16       17
The ranks corresponding to positive values are 1, 2, 3, 4.5, 6, 7, 12, 13, which sum to R+ = 48.5, while the negative values have ranks 4.5, 8, 9, 10, 11, 14, 15, 16, 17, summing to R− = 104.5. The Wilcoxon statistic is defined to be T = min{R+, R−} = 48.5. We look in the appropriate table (given in Figure 12.5). We see that in order for the difference to be significant at the 0.05 level, we would need to have T ≤ 34. Consequently, we still conclude that the effect of the treatment is not statistically significant.
12.4.4
Lecture 13
Kolmogorov-Smirnov Test
13.1.1
Suppose we have samples that we believe came from a particular probability distribution P. We wish to test this null hypothesis against the general alternative that the samples did not come from P. To do this, we need to create a test statistic whose distribution we know, and which will be big when the data are far away from a typical sample from the population P.
You already know one approach to this problem, using the 2 test. To
do this, we split up the possible values into K ranges, and compare the
227
228
number of observations in each range with the number that would have been
predicted. For instance, suppose we have 100 samples which we think should
have come from a standard normal distribution. The data are given in Table
13.1. The first thing we might do is look at the mean and variance of the
sample: In this case, the mean is 0.06 and the sample variance 1.06, which
seems plausible. (A z test for the mean would not reject the null hypothesis
of 0 mean, and the test for variance which you have not learned would
be satisfied that the variance is 1.) We might notice that the largest value is
3.08, and the minimum value is 3.68, which seem awfully large. We have
to be careful, though, about scanning the data first, and then deciding what
to test after the fact: This approach, sometimes called data snooping, can
easily mislead, since every collection of data is likely to have something that
seems wrong with it, purely by chance. (This is the problem of multiple
testing, which we discuss further in section 14.3.)
-0.16
1.30
-0.17
0.13
-1.97
-1.52
0.29
0.65
-0.88
-0.23
-0.68
-0.13
1.29
-1.94
-0.37
-0.06
0.58
-0.26
-0.03
-1.16
-0.32
0.80
0.47
0.78
3.08
-1.02
0.02
-0.17
0.56
0.22
-0.85
-0.75
-1.23
0.19
-0.40
1.06
2.18
-1.53
-3.68
-1.68
0.89
0.28
0.21
-0.12
0.80
0.60
-0.04
-1.69
2.40
0.50
-2.28
-1.00
-0.04
-0.19
0.01
1.15
-0.13
-1.60
0.62
-0.35
0.63
0.14
0.07
0.76
1.32
1.92
-0.79
0.09
0.52
-0.35
0.41
-1.38
-0.08
-1.48
-0.47
-0.06
-1.28
-1.11
-1.25
-0.33
0.15
-0.04
0.32
-0.01
2.29
-0.19
-1.41
0.30
0.85
-0.24
0.74
-0.25
-0.17
0.20
-0.26
0.67
-0.23
0.71
-0.09
0.25
229
1
0
-1
-3
-2
Sample Quantiles
-4
-2
Theoretical Quantiles
1/2
0
1/6
2/6
4/6
5/6
230
0.0
0.2
0.4
0.6
0.8
1.0
-2
-1
231
Category
Lower
Upper
Observed
Expected
1
2
3
4
5
-1.5
-0.5
0.5
1.5
-1.5
-0.5
0.5
1.5
9
15
49
22
5
6.7
24.2
38.3
24.2
6.7
Table 13.2: 2 table for data from Table 13.1, testing its fit to a standard
normal distribution.
table) in Table 13.2(b). These probabilities need to be compared to the
probabilities of the sample, which are just 0.01, 0.02, . . . , 1.00. This procedure is represented graphically in Figure 13.3.
The Kolmogorov-Smirnov statistic is the maximum difference, shown in
blue in Table 13.3. This is Dn = 0.092. For a test at the 0.05 significance
level, we compare this to the critical value, which is
1.36
Dcrit =
n
In this case, with n = 100, we get Dcrit = 0.136. Since our observed Dn is
smaller, we do not reject the null hypothesis.
Of course, if you wish to compare the data to the normal distribution
with mean and variance 2 , the easiest thing to do is to standardise: The
hypothesis that (xi ) come from the N (, 2 ) distribution is equivalent to
saying that (xi )/ come from the standard N (0, 1) distribution. (Of
course, if and are estimated from the data, we get a Student t distribution in place of the standard normal.)
One final point: In point of fact, the data are unlikely to have come
from a normal distribution. One way of seeing this is to look at the largest
(negative) data point, which is 3.68. The probability of a sample from a
standard normal distribution being at least this large is about 0.0002. In 100
observations, the probability of observing such a large value at least once is
no more than 100 times as big, or 0.02. We could have a goodness-of-fit test
based on the largest observation, which would reject the hypothesis that this
sample came from a normal distribution. The Kolmogorov-Smirnov test, on
the other hand, is indifferent to the size of the largest observation.
There are many different ways to test a complicated hypothesis, such as
the equality of two distributions, because there are so many different ways
1.0
232
0.0
0.2
0.4
F(x)
0.6
0.8
Fobs(x)
Fexp(x)
Fobs(x) Fexp(x)
-3
-2
-1
0.06
0.04
0.02
0.00
|Fobs(x) Fexp(x)|
0.08
(a) Fobs shown in black circles, and Fexp (the normal distribution function) in red. The green segments show the
difference between the two distribution functions.
-3
-2
-1
233
-2.28
-1.38
-0.79
-0.26
-0.17
-0.04
0.15
0.41
0.67
1.15
-1.97
-1.28
-0.75
-0.26
-0.16
-0.04
0.19
0.47
0.71
1.29
-1.94
-1.25
-0.68
-0.25
-0.13
-0.03
0.20
0.50
0.74
1.30
-1.69
-1.23
-0.47
-0.24
-0.13
-0.01
0.21
0.52
0.76
1.32
-1.68
-1.16
-0.40
-0.23
-0.12
0.01
0.22
0.56
0.78
1.92
-1.60
-1.11
-0.37
-0.23
-0.09
0.02
0.25
0.58
0.80
2.18
-1.53
-1.02
-0.35
-0.19
-0.08
0.07
0.28
0.60
0.80
2.29
-1.52
-1.00
-0.35
-0.19
-0.06
0.09
0.29
0.62
0.85
2.40
-1.48
-0.88
-0.33
-0.17
-0.06
0.13
0.30
0.63
0.89
3.08
0.011
0.084
0.215
0.396
0.434
0.484
0.560
0.658
0.748
0.874
0.024
0.101
0.226
0.399
0.437
0.485
0.577
0.680
0.761
0.902
0.026
0.107
0.249
0.400
0.447
0.490
0.577
0.692
0.771
0.903
0.045
0.110
0.321
0.407
0.449
0.496
0.582
0.698
0.777
0.907
0.047
0.123
0.343
0.409
0.453
0.505
0.588
0.711
0.783
0.973
0.055
0.133
0.356
0.410
0.464
0.508
0.597
0.720
0.788
0.985
0.064
0.154
0.362
0.423
0.468
0.526
0.610
0.727
0.789
0.989
0.064
0.158
0.363
0.425
0.476
0.535
0.614
0.732
0.803
0.992
0.070
0.189
0.369
0.432
0.477
0.553
0.617
0.735
0.812
0.999
(c) Difference between entry #i in Table 13.2(b) and i/100. Largest value shown
blue.
0.010
0.031
0.012
0.064
0.023
0.026
0.054
0.084
0.068
0.055
0.009
0.036
0.005
0.077
0.013
0.036
0.060
0.061
0.071
0.045
0.006
0.030
0.003
0.067
0.006
0.046
0.055
0.049
0.069
0.029
0.014
0.034
0.008
0.061
0.008
0.052
0.061
0.049
0.070
0.037
0.004
0.041
0.069
0.055
0.002
0.054
0.067
0.052
0.074
0.043
0.014
0.037
0.085
0.049
0.008
0.056
0.073
0.048
0.078
0.013
0.015
0.037
0.086
0.039
0.006
0.062
0.071
0.051
0.082
0.015
0.017
0.026
0.083
0.045
0.012
0.052
0.070
0.054
0.092
0.009
0.026
0.031
0.073
0.035
0.014
0.054
0.076
0.058
0.088
0.002
0.031
0.011
0.071
0.033
0.024
0.048
0.082
0.064
0.087
0.001
Table 13.3: Computing the Kolmogorov-Smirnov statistic for testing the fit
of data to the standard normal distribution.
234
t and Z tests. If the null hypothesis says = 0 , and the alternative is that
> 0 , then we can increase the power of our test against this alternative by
taking a one-sided alternative. On the other hand, if reality is that < 0
then the test will have essentially no power at all. Similarly, if we think
the reality is that the distributions of X and Y differ on some scattered
intervals, we might opt for a 2 test.
13.1.2
X:
1.2, 1.4, 1.9, 3.7, 4.4, 4.8, 9.7, 17.3, 21.1, 28.4
Y :
We put these values in order to compute the cdf (which we also plot in
Figure 13.4).
Fx
Fy
1.2
1.4
1.9
3.7
4.4
0.1
0.0
0.2
0.0
0.3
0.0
0.4
0.0
0.5
0.0
4.8
0.6
0.0
5.6
6.5
6.6
6.9
9.2
9.7
10.4
10.6
17.3
19.3
21.1
28.4
0.6
0.1
0.6
0.2
0.6
0.4
0.6
0.5
0.6
0.6
0.7
0.6
0.7
0.8
0.7
0.9
0.8
0.9
0.8
1.0
0.9
1.0
1.0
1.0
235
0.2
0.4
0.6
0.8
10
15
20
25
30
Figure 13.4: Cumulative distribution functions computed from archaeology data, as tabulated in Table 13.4.
13.1.3
The Kolmogorov-Smirnov test is designed to deal with samples from a continuous distribution without having to make arbitrary partitions. It is not
appropriate for samples from discrete distributions, or when the data are
236
presented in discrete categories. Nonetheless, it is often used for such applications. We present here one example.
One method that anthropologists use to study the health and lives of
ancient populations, is to estimate the age at death from skeletons found
in gravesites, and compare the distribution of ages. The paper [Lov71]
compares the age distribution of remains found at two different sites in
Virginia, called Clarksville and Tollifero. The data are tabulated in Table
13.5.
Table 13.5: Ages at death for skeletons found at two Virginia sites, as described in [Lov71].
Age range
Clarksville
Tollifero
03
46
712
1317
1820
2135
3545
4555
55+
6
0
2
1
0
12
15
1
0
13
6
4
9
2
29
8
8
0
37
79
237
Age range
3
6
12
17
20
35
45
55
Cumulative Distrib.
Clarksville Tollifero
0.16
0.16
0.22
0.24
0.24
0.57
0.97
1
0.16
0.24
0.29
0.41
0.43
0.8
0.9
1
Difference
0
0.08
0.07
0.17
0.19
0.23
0.07
0
entirely make sense. What we have done is effectively to take the maximum
difference not over all possible points in the distribution, but only at eight
specially chosen points. This inevitably makes the maximum smaller. The
result is to make it harder to reject the null hypothesis, so our significance
level is too high. We should compensate by lowering the critical value.
13.1.4
13.2
Power of a test
238
13.2.1
H0 True
H0 False
Correct
(Prob. 1 )
Type I Error
(Prob.=level=)
Type II Error
(Prob.=)
Correct
(Prob.=Power=1 )
Computing power
Power depends on the alternative: It is always the power to reject the null,
given that a specific alternative is actually the case. Consider, for instance,
the following idealised experiment: We make measurements X1 , . . . , X100 of
a quantity that is normally distributed with unknown mean and known
variance 1. The null hypothesis is that = 0 = 0, and we will test it with
a two-sided Z test at the 0.05 level, against a simple alternative = alt .
That is, we assume
What is the power? Once we have the data, we compute x
, and then
z=x
/0.1. We reject the null hypothesis if z > 1.96 or z < 1.96. The power
is the probability that this happens. What is this probability? It is the same
239
> 0.196 or X
< 0.196. We know that X
has
as the probability that X
0
(13.1)
n
Power = z
(0 alt )
(13.2)
n
(0 alt ) .
+1 z
13.2.2
Suppose now that you are planning an experiment to test a new bloodpressure medication. Medication A has been found to lower blood pressure
by 10 mmHg, with an SD of 8 mmHg, and we want to test it against the
new medication B. We will recruit some number of subjects, randomly divide
them into control and experiment groups; the control group receives A, the
experiment group receives B. At the end we will perform a Z test at the 0.05
level to decide whether B really lowered the subjects blood pressure more
than A; that is, we will let A and B be the true mean effect of drugs A
0.6
0.4
0.2
n=100, =.05
n=100, =.01
n=1000, =.05
n=1000, =.01
0.0
Power
0.8
1.0
240
-0.5
-0.4
-0.3
-0.2
-0.1
0.0
0.1
0.2
0.3
0.4
alt 0
Figure 13.5: Power for Z test with different gaps between the null and alternative hypotheses for the mean, for given sizes of study (n) and significance
levels .
Z= q
1
n/2
+
1
n/2
0.5
241
is. We see from the formula (13.2) that if the gap between A and B
becomes half as big, we need 4 times as many subjects to keep the same
power. If the power is only 0.2, say, then it is hardly worth starting in on
the experiment, since the result we get is unlikely to be conclusive.
Figure 13.6 shows the power for experiments where the true experimental
effect (the difference between A and B ) is 10 mmHg, 5 mmHg, and 1
mmHg), performing one-tailed and two-tailed significance tests at the 0.05
level.
Notice that the one-tailed test is always more powerful when B A
is on the right side (B bigger than A , so B is superior), but essentially 0
when B is inferior; the power of the two-tailed test is symmetric. If we are
interested to discover only evidence that B is superior, then the one-tailed
test obviously makes more sense.
Suppose now that we have sufficient funding to enroll 50 subjects in our
study, and we think the study would be worth doing only if we have at least
an 80% chance of finding a significant positive result. In that case, we see
from Figure 13.6(b) that we should drop the project unless we expect the
difference in average effects to be at least 5 mmHg. On the other hand, if
we can afford 200 subjects, we can justify hunting for an effect only half
as big, namely 2.5 mmHg. With 1000 subjects we have a good chance of
detecting a difference between the drugs as small as 1 mmHg. On the other
hand, with only 10 subjects we would be unlikely to find the difference to
be statistically significant, even if the true difference is quite large.
Important lesson: The difference between a statistically
significant result and a non-significant result may be just the size
of the sample. Even an insignificant difference in the usual sense
(i.e., tiny) has a high probability (power) of producing a
statistically significant result if the sample size is large enough.
13.2.3
0.6
0.4
0.0
0.2
Power
0.8
1.0
242
10
20
50
100
200
500
1000
2000
5000 10000
1.0
0.6
0.4
0.2
0.0
Power
0.8
one-tailed test,n=10
one-tailed test,n=50
one-tailed test,n=200
one-tailed test,n=1000
two-tailed test,n=10
two-tailed test,n=50
two-tailed test,n=200
two-tailed test,n=1000
-10
-5
B A
10
243
0.4
Power
0.6
0.8
1.0
0.0
0.2
t test
Mann-Whitney test
median test
-3
-2
-1
1 2
Figure 13.7: Estimated power for three different tests, where the underlying
distributions are normal with variance 1, as a function of the true difference
in means. The test is based on ten samples from each distribution.
244
Lecture 14
14.2
246
ANOVA
Test
N
272
305
269
104
23
Verbal IQ
Unadjusted mean
SD
Adjusted Mean
98.2
16.0
99.7
101.7
14.9
102.3
104.0
15.7
102.7
108.2
13.3
105.7
102.3
15.2
103.0
Performance IQ
Unadjusted mean
SD
Adjusted Mean
98.5
15.8
99.1
100.5
15.2
100.6
101.8
15.6
101.3
106.3
13.9
105.1
102.6
14.9
104.4
Full Scale IQ
Unadjusted mean
SD
Adjusted Mean
98.1
15.9
99.4
101.3
15.2
101.7
103.3
15.7
102.3
108.2
13.1
106.0
102.8
14.4
104.0
ANOVA
247
who were nursed more than 9 months and those nursed less than 1 month
both had, on average, characteristics (whether their own or their mothers)
that would seem to predispose them to lower IQ. For the rest of this chapter
we will work with the adjusted means.
The statistical technique for doing this, called multiple regression, is
outside the scope of this course, but it is fairly straightforward, and most
textbooks on statistical methods that go beyond the mast basic techniques
will describe it. Modern statistical software makes it particularly easy to
adjust data with multiple regression.
14.3
Multiple comparisons
14.3.1
One approach would be to group the IQ scores into groups low, medium,
and high, say. We would then have an incidence table If these were categorical data proportions of subjects in each breastfeeding class who scored
high and low, for instance we could produce an incidence table such
as that in Table 14.2. (The data shown here are purely invented, for illustrative purposes.) You have learned how to analyse such a table to determine
whether the vertical categories (IQ score) are independent of the horizontal
categories (duration of breastfeeding), using the 2 test.
The problem with this approach is self-evident: We have thrown away
some of the information that we had to begin with, by forcing the data into
discrete categories. Thus, the power to reject the null hypothesis is less than
it could have been. Furthermore, we have to draw arbitrary boundaries between categories, and we may question whether the result of our significance
test would have come out differently if we had drawn the boundaries otherwise. (These are the same problems, you may recall, that led us to prefer
the Kolomogorov-Smirnov test over 2 . The 2 test has the virtue of being
wonderfully general, but it is often not quite the best choice.)
248
ANOVA
Full IQ score
high
medium
low
14.3.2
Breastfeeding months
1
100
72
100
2-3
115
85
115
4-6
120
69
80
7-9
40
35
29
>9
9
9
5
Multiple t tests
1
1
+
= 3.43.
nx ny
4.6
x
y
=
= 1.34.
SEdiff
3.42
ANOVA
249
means were in fact all the same 1 out of 20 comparisons should yield a
statistically significant difference at the 0.05 level. How many statistically
significant differences do we need before we can reject the overall null hypothesis of identical population means? And what if none of the differences
were individually significant, but they all pointed in the same direction?
Table 14.3: Pairwise t statistics for comparing all 10 pairs of categories.
Those that exceed the significance threshold for the 0.05 level are shown in
red.
1
2-3
4-6
7-9
14.4
2-3
4-6
7-9
>9
1.78
2.14
0.47
3.78
2.58
2.14
1.34
0.70
0.50
0.65
The F test
14.4.1
General approach
250
ANOVA
The idea of analysis of variance (ANOVA) is that under the null hypothesis,
which says that the observations from different levels really all are coming
from the same distribution, the observations should be about as far (on
average) from their own level mean as they are from the overall mean of the
whole sample; but if the means are different, observations should be closer
to their level mean than they are to the overall mean.
We define the Between Groups Sum of Squares, or BSS, to be the
total square difference of the group means from the overall mean; and the
Error Sum of Squares, or ESS, to be the total squared difference of
the samples from the means of their own groups. (The term error refers
to a context in which the samples can all be thought of as measures of
the same quantity, and the variation among the measurements represents
random error; this piece is also called the Within-Group Sum of Squares.)
And then there is the Total Sum of Squares, or TSS, which is simply
the total square difference of the samples from the overall mean, if we treat
them as one sample.
BSS =
ESS =
K
X
k X)
2;
ni (X
i=1
ni
K X
X
i )2
(Xij X
i=1 j=1
K
X
=
(ni 1)s2i where si is the SD of observations in level i.
i=1
X
2
T SS =
(Xij X)
i,j
ANOVA
251
TSS=ESS+BSS.
variability among the data can be divided into two pieces: The variability
within groups, and the variability among the means of different groups.
Our goal is to evaluate the apportionment, to decide if there is too much
between group variability to be purely due to chance.
Of course, BSS and ESS involve different numbers of observations in
their sums, so we need to normalise them. We define
BM S =
BSS
K 1
EM S =
ESS
.
N K
BM S
N K BSS
=
.
EM S
K 1 ESS
We reject the null hypothesis when F is too large: That is, if we obtain a
value f such that P {F f } is below the significance level of the test.
Table 14.4: Tabular representation of the computation of the F statistic.
SS
d.f.
MS
BSS
(A)
K 1
(B)
BMS
(X = A/B)
X/Y
Errors (Within
Treatments)
ESS
(C)
N K
(D)
EMS
(Y = C/D)
Total
TSS
N 1
Between
Treatments
Under the null hypothesis, the F statistic computed in this way has a
known distribution, called the F distribution with (K 1, N K) degrees
of freedom. We show the density of F for K = 5 different treatments and
different values of N in Figure 14.1.
ANOVA
1.0
252
0.6
0.4
0.0
0.2
Density
0.8
K=5,N=10
K=5,N=20
K=5,N=
K=3,N=6
K=3,N=
14.4.2
5
X
(nk 1)s2k
k=1
5
X
nk (xk x
)2
k=1
ANOVA
253
quite small after all, there is one distribution for each pair of integers. The
table gives only the cutoff only for select values of (d1 , d2 ) at the 0.05 level.
For parameters in between one needs to interpolate, and for parameters
above the maximum we go to the row or column marked . Looking on
the table in Figure 14.2, we see that the cutoff for F (4, ) is 2.37. Using
a computer, we can compute that the cutoff for F (4, 968) at level 0.05 is
actually 2.38; and the cutoff at level 0.01 would be 3.34.
Table 14.5: ANOVA table for breastfeeding data: Full Scale IQ, Adjusted.
SS
d.f.
MS
Between
Samples
3597
(A)
4
(B)
894.8
(X = A/B)
3.81
Errors (Within
Samples)
227000
(C)
968
(D)
234.6
(Y = C/D)
Total
230600
(TSS=A+C)
972
(N 1)
14.4.3
254
ANOVA
Figure 14.2: Table of F distribution; finding the cutoff at level 0.05 for the
breastfeeding study.
table for 27), we see that the cutoff at level 0.05 is 3.32. Thus, we conclude
that the difference in means between the groups is statistically significant.
14.5
Multifactor ANOVA
k = 1, . . . , K;
i = 1, . . . , nk
where the ki are the normally distributed errors, and k is the true mean
for group k. Thus, in the example of section 14.4.3, there were three groups,
corresponding to three different exercise regimens, and ten different samples
for each regimen. The obvious estimate for k is
x
k =
nk
1 X
xki ,
nk
i=1
and we use the F test to determine whether the differences among the means
are genuine. We decompose the total variance of the observations into the
ANOVA
255
(a) Full data
High
Low
Control
626
594
614
650
599
569
622
635
653
674
605
593
626
632
611
643
588
600
622
596
603
650
631
593
643
607
621
631
638
554
Group
Mean
SD
High
Low
Control
638.7
612.5
601.1
16.6
19.3
27.4
Table 14.6: Bone density of rats after given exercise regime, in mg / cm3
portion that is between groups and the portion that is within groups. If the
between-group variance is too big, we reject the hypothesis of equal means.
Many experiments naturally lend themselves to a two-way layout. For
instance, there may be three different exercise regimens and two different
diets. We represent the measurements as
xkji = k + j + kji ,
k = 1, 2, 3;
j = 1, 2;
i = 1, . . . , nkj .
It is then slightly more complicated to isolate the exercise effect k and the
diet effect j . We test for equality of these effects by again splitting the
variance into pieces: the total sum of squares falls naturally into four pieces,
corresponding to the variance over diets, variance over exercise regimens,
variance over joint diet and exercise, and the remaining variance within
each group. We then test for whether the ratios of these pieces are too far
from the ratio of the degrees of freedom, as determined by the F distribution.
Multifactor ANOVA is quite common in experimental practice, but will
not be covered in this course.
14.6
Kruskal-Wallis Test
256
ANOVA
Table 14.7: ANOVA table for rat exercise data.
SS
d.f.
MS
Between
Samples
7434
(A)
2
(B)
3717
(X = A/B)
7.98
Errors (Within
Samples)
12580
(C)
27
(D)
466
(Y = C/D)
Total
20014
(TSS=A+C)
29
(N 1)
simply to substitute ranks for the actual observed values. This avoids the
assumption that the the data were drawn from a normal distribution.
In Table 14.8 we duplicate the data from Table 14.6, replacing the measurements by the numbers 1 through 30, representing the ranks of the data:
the lowest measurement is number 1, and the highest is number 30. In other
words, suppose we have observed K different groups, with ni observations
in each group. We order all the observations in one large sequence of length
N , from lowest to highest, and assign to each one its rank. (In case of ties,
we assign the average rank.) We then sum the ranks in group i, obtaining
numbers R1 , . . . , RK . Then
H=
X R2
12
i
3(N + 1).
N (N + 1) i=1 ni
Under the null hypothesis, that all the samples came from the same distribution, H has the 2 distribution with K 1 degrees of freedom.
In the rat exercise example, we have the values of Ri given in Table
14.7(b), yielding H = 10.7. If we are testing at the 0.05 significance level,
the cutoff for 2 with 2 degrees of freedom is 5.99. Thus, we conclude again
that there is a statistically significant difference among the distributions of
bone density in the three groups.
ANOVA
257
High
Low
Control
18.5
6
14
27.5
8
2
16.5
23
29
30
11
4.5
18.5
22
13
25.5
3
9
16.5
7
10
27.5
20.5
4.5
Group
Sum
High
Low
Control
226.5
136.5
102
25.5
12
15
20.5
24
1
258
ANOVA
Lecture 15
260
Regression
Regression Model yi = xi + + i .
110
100
*
*
80
90
120
10
12
Months breastfeeding
Figure 15.1: Plot of data from the breastfeeding IQ study in Table 14.1.
Stars represent mean for the class, boxes represent mean 2 Standard
Errors.
Regression
15.2
261
Scatterplots
The most immediate thing that we may wish to do is to get a picture of the
data with a scatterplot. Some examples are shown in Figure 15.2.
60
74
72
70
68
66
64
62
120
100
80
140
160
180
10
20
30
40
50
64
66
1900
2000
2100
2200
1700
1800
1900
2000
2100
2200
2
1
-1
72
110
100
Full-scale IQ
90
1800
70
120
1700
68
1000
2000
3000
4000
5000
6000
-2
-1
(f) log Brain weight and log body weight (62 species of
land mammals)
262
15.3
Regression
and Y , we define
Given two paired random variables (X, Y ), with means X
the covariance of X and Y , to be
n
1X
= mean (xi x
)(yi y) =
(xi x
)(yi y).
n
i=1
As with the SD, we usually work with the sample covariance, which is
just
n
n
1 X
sxy =
cxy =
(xi x
)(yi y).
n1
n1
i=1
This is a better estimate for the covariance of the random variables that xi
and yi are sampled from.
Notice that the means of xi x
and yi x
are both 0: On average, xi is
neither higher nor lower than x
. Why is the covariance then not also 0? If
X and Y are independent, then each value of X will come with, on average,
the same distribution of Y s, so the positives and negatives will cancel out,
and the covariance will indeed be 0. On the other hand, if high values of xi
tend to come with high values of yi , and low values with low values, then
the product (xi x
)(yi y) will tend to be positive, making the covariance
positive.
While positive and negative covariance have obvious interpretations, the
magnitude of covariance does not say anything straightforward about the
strength of connection between the covariates. After all, if we simply measure heights in millimetres rather than centimetres, all the numbers will
become 10 times as big, and the covariance will be multiplied by 100. For
this reason, we normalise the covariance by dividing it by the product of the
two standard deviations, producing the quantity called correlation:
Correlation
)
XY = Cov(X,Y
X Y .
Of course, we estimate the correlation in a corresponding way:
Regression
263
Sample correlation
s
rxy = sxxysy .
It is easy to see that correlation does not change when we rescale the data
for instance, by changing the unit of measurement. If xi were universally
replaced by xi = xi + , then sxy becomes sxy = sxy , and sx becomes
sx = sx . Since the extra factor of appears in the numerator and in the
denominator, the final result of rxy remains unchanged. In fact, it turns
out that rxy is always between 1 and 1. The correlation of 1 means that
there is a perfect linear relationship between x and y with negative sign;
correlation of +1 means that there is a perfect linear relationship between
x and y with positive sign; and correlation 0 means no linear relationship
at all.
In Figure 15.3 we show some samples of standard normally distributed
pairs of random variables with different correlations. As you can see, high
positive correlation means the points lie close to an upward-sloping line;
high negative correlation means the points lie close to a downward-sloping
line; and correlation close to 0 means the points lie scattered about a disk.
15.4
Computing correlation
There are several alternative formulae for the covariance, which may be more
convenient than the standard formula:
n
sxy
1 X
=
(xi x
)(yi y)
n1
i=1
1
n1
n
X
i=1
x i yi
n
x
y
n1
1 2
=
sx+y s2xy ,
4
where sx+y and sxy are the sample SDs of the collections (xi + yi ) and
(xi yi ) respectively.
264
Regression
Correlation=0.4
3
Correlation=0.95
2
1
z2
Correlation=0.8
Correlation=0
3
0
1
z3
2
1
y
0
1
2
3
0
2
z1
0
x
Regression
15.4.1
265
A study was done [TLG+ 98] to compare various brain measurements and IQ
scores among 20 subjects.1 In Table 15.1 we give some of the data, including
a measure of full-scale IQ and MRI estimates of total brain volume and total
brain surface area.
Table 15.1: Brain measurement data.
Subject
Brain volume
(cm3 )
Brain surface
area (cm2 )
IQ
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
1005
963
1035
1027
1281
1272
1051
1079
1034
1070
1173
1079
1067
1104
1347
1439
1029
1100
1204
1160
1914
1685
1902
1860
2264
2216
1867
1851
1743
1709
1690
1806
2136
2019
1967
2155
1768
1828
1774
1972
96
89
87
87
101
103
103
96
127
126
101
96
93
88
94
85
97
114
113
124
Mean
SD
1125
125
1906
175
101
13.2
In fact, the 20 subjects comprised 5 pairs of male and 5 pairs of female monozygous
twins, so there are plenty of interesting possibilities for factorial analysis. The data are
available in a convenient table at http://lib.stat.cmu.edu/datasets/IQ_Brain_Size.
266
Regression
syz = 673,
This leads us to
rxy =
sxz = 105.
13130
= 0.601.
sx sy
Similarly
ryz = 0.291,
15.4.2
rxz = 0.063.
sxy
2.07
=
= 0.459.
sx sy
2.52 1.79
2
Available as the dataset galton of the UsingR package of the R programming
language, or directly from http://www.bun.kyoto-u.ac.jp/~suchii/galton86.
html.
3
We do not need to pay attention to the fact that Galton multiplied all female heights
by 1.08.
Regression
267
Table 15.2: Variances for different combinations of the Galton height data.
Parent
Child
Sum
Difference
15.4.3
SD
Variance
1.79
2.52
3.70
2.33
3.19
6.34
13.66
5.41
Breastfeeding example
This example is somewhat more involved than the others, and than the kinds
of covariance computations that appear on exams.
Consider the breastfeedingIQ data (Table 14.1), which we summarise in
the top rows of Table 15.3. We dont have the original data, but we can use
the above formulas to estimate the covariance, and hence the correlation,
between number of months breastfeeding and adult IQ. Here xi is the number
of months individual i was breastfed, and yi the adult IQ.
Table 15.3: Computing breastfeedingFull-IQ covariance for Copenhagen
infant study.
>9
Total
272
305
269
104
23
973
15.9
99.4
27037
15.2
101.7
93056
15.7
102.3
151353
13.1
106.0
93704
14.4
104.0
28704
15.7
101.74
In the bottom row of thePtable we give the contribution that those individuals made to the total
xi yi . Since the xi are all about the same in
any column, we treat them as though they were all about the same, equal
to the average x value in the column. (We give these averages in the first
row. This involves a certain amount of guesswork, particularly for the last
column; on the other hand, there are very few individuals in that column.)
268
Regression
The sum of the y values in any column is the average y value multiplied
by the relevant number of samples. Consider the first column:
X
X
xi yi 1
yi
i in first column
i in first column
= 1 N1 y1
= 1 272 99.4
= 27037.
Similarly, the second column contributes
3 305 101.7 = 93065, and so on.
P
Adding these contributions yields
xi yi = 393853.
We next estimate x
, treating it as though there were 272 xi = 1, 305
xi = 3, and so on, yielding
x
=
1
(272 1 + 305 3 + 269 5.5 + 104 8.5 + 27 12) = 3.93.
973
To estimate y, we take
X
X
X
=
yi +
i
i in column 1
i in column 2
yi +
i in column 3
yi +
i in column 4
yi +
yi .
i in column 5
The sum in a column is just the average in the column multiplied by the
number of observations in the column, so we get
y =
1
(272 99.4 + 305 101.7 + 269 102.3 + 104 106.0 + 27 104.0) = 101.74.
973
Thus, we get
sxy =
393853 973
Regression
269
Thus, we have
rxy =
15.5
sxy
4.95
=
= 0.118.
sx sy
2.67 15.7
Testing correlation
rxy n 2
R= q
.
2
1 rxy
It can be shown that, under the null hypothesis, this R has the Student
t distribution with n 2 degrees of freedom. Thus, we can look up the
appropriate critical value, and reject the null hypothesis if |R| is above this
cutoff. For example, in the Brain measurement experiments of section 15.4.1
we have correlation between brain volume and surface area being 0.601 from
20 samples, which produces R = 3.18, well above the threshold value for t
with 18 degrees of freedom at the 0.05 level, which is 2.10. On the other
hand, the correlation 0.291 for surface area against IQ yields R = 1.29,
which does not allow us to reject the null hypothesis that the true underlying
population correlation is 0; and the correlation 0.063 between volume and
IQ yields R only 0.255.
270
Regression
Note that for a given n and choice of level, the threshold in t translates
directly to a threshold in r. If t is the appropriate threshold value in t,
then we
p reject the null hypothesis when our sample correlation r is larger
than t /(n 2 + t ). In particular, for large n and = 0.05, we have a
threshold
p for t very close to 2, so that we reject the null hypothesis when
|r| > 2/n.
15.6
15.6.1
The SD line
sy
(X x
).
sx
(15.1)
Regression
271
Figure 15.4: Galton parent-child heights, with SD line in green. The point
of the means is shown as a red circle.
15.6.2
On further reflection, though, it becomes clear that the SD line cant be the
best prediction of Y from X. If we look at Figure 15.3, we see that the line
running down the middle of the cloud of points is a good predictor of Y from
X if the correlation is close to 1. When the correlation is 0, though, wed
be better off ignoring X and predicting Y to be y always; and, of course,
when the correlation is negative, the line really needs to slope in the other
direction.
What about intermediate cases, like the Galton data, where the correlation is 0.46? One way of understanding this is to look at a narrow
range of X values (parents heights), and consider what the correspanding range of Y values is. In figure 15.5, we sketch in rectangles showing
the approximate range of Y values corresponding to X = 66, 68, 70, and
72 inches. As you can see the middle of the X = 66 range is substantially above the SD line, whereas the middle of the X = 72 range is below
the SD line. This makes sense, since the connection between the parents
272
Regression
Figure 15.5: The parent-child heights, with an oval representing the general
range of values in the scatterplot. The SD line is green, the regression
line blue, and the rectangles represent the approximate span of Y values
corresponding to X = 66, 68, 70, 72 inches.
sy
sxy
(X x) = 2 (X x).
sx
sx
This is equivalent to
sxy
Y = bX + a, where b = 2 and a =
sx
sy
y x .
sx
Regression
273
sxy
= 0.326,
sx sy
a0 = x
b0 y = 45.9.
274
Regression
Thus, the prediction for the parents heights is 45.9 0.326
70.5 = 68.9 inches.
This may be seen in the following simple example. Suppose we have five
paired observations:
Table 15.4: Simple regression example
x
y
prediction
0.54x + 1.41
residual
6
3
2
2
5
4
7
6
4
5
4.65
2.49
4.11
5.19
3.57
1.65
0.49
0.11
0.81
1.43
mean
4.8
4.0
SD
1.9
1.58
sxy
= 0.66.
sx sy
15.6.3
275
0 1 2 3 4 5 6 7
Regression
Regression line
SD line
130
120
110
80
90
100
Full-scale IQ
2200
2000
1800
1600
2400
Figure 15.6: A scatterplot of the hypothetical data from Table 15.4. The
regression line is shown in blue, the SD line in green. The dashed lines show
the prediction errors for each data point corresponding to the two lines.
900
Figure 15.7: Regression lines for predicting surface area from volume and
predicting IQ from surface area. Pink shaded region shows confidence interval for slope of the regression line.
Regression
2200
2000
1800
Regression line
SD line
1600
2400
276
900
1000
1100
1200
1300
1400
1500
Regression
277
b 1 r2
SE(b)
,
r n2
and that
t :=
b
SE(b)
15.6.4
sy
175
= 0.601
= 0.841
sx
125
a = y b
x = 959
b 1 r2
0.841 1 0.6012
SE(b)
=
= 0.264.
r n2
0.601 18
278
Regression
We look on the t-distribution table in the row for 18 degrees of freedom and
the column for p = 0.05 (see Figure 15.9) we see that 95% of the probability
lies between 2.10 and +2.10, so that a 95% confidence interval for is
b 2.10 SE(b) = 0.841 2.10 0.264 = (0.287, 1.40).
The scatterplot is shown with the regression line y = .841x + 959 in Figure
15.7(a), and the range of slopes corresponding to the 95% confidence interval
is shown by the pink shaded region. (Of course, to really understand the
uncertainty of the estimates, we would have to consider simultaneously the
random error in estimating the means, hence the intercept of the line. This
leads to the concept of a two-dimensional confidence region, which is beyond
the scope of this course.)
Similarly, for predicting IQ from surface area we have
sz
13.2
b = ryz = 0.291
= 0.031,
a = z b
y = 160.
sy
125
The standard error for b is
0.031 1 0.2912
b 1 r2
=
SE(b)
= 0.024.
r n2
0.291 18
A 95% confidence interval for the true slope is then given by 0.031
2.10 0.024 = (0.081, 0.019). The range of possible predictions of y from
x pretending, again, that we know the population means exactly is
given in Figure 15.7(b).
What this means is that each change of 1 cm3 in brain volume is typically
associated with a change of 0.841 cm2 in brain surface area. A person of
average brain volume 1126 cm3 would be expected to have average
brain surface area 1906 cm2 but if we know that someone has brain
volume 1226 cm3 , we would do well to guess that he has a brain surface
area of 1990 cm2 (1990 = 1906 + 0.841 100 = .841 1226 + 959). However,
given sampling variation the number of cm2 typically associated with 1 cm3
change in volume might really be as low as 0.287 or as high as 1.40, with
95% confidence.
Similarly, a person of average brain surface area 1906 cm2 might be
predicted to have average IQ of 101, but someone whose brain surface area
is found to be 100 cm2 might be predicted to have IQ below average by 3.1
points, so 97.9. At the same time, we can only be 95% certain that the
change associated with 100 cm2 increase in brain surface area is between
8.1 and +1.9 points hence, it might just as well be 0. We say that the
correlation between IQ and brain surface area is not statistically significant,
or that the slope of the regression line is not significantly different from 0.
Regression
279
Figure 15.9: T table for confidence intervals for slopes computed in section
15.6.4
280
Regression
Lecture 16
Regression, Continued
16.1
R2
(16.1)
If this sounds a lot like the ANOVA approach from lecture 14, thats because it is.
Formally, theyre variants of the same thing, though developing this equivalence is beyond
the scope of this course.
281
282
16.1.1
From the Galton data of section 15.4.2, we see that the variance of the childs
height is 6.34. Since r2 = 0.21, we say that the parents heights explain
21% of the variance in childs height. We expect the residual variance
to be about 6.34 0.79 = 5.01. What this means is that the variance among
children whose parents were all about the same height should be 5.01. In
Figure 16.1 we see histograms of the heights of children whose parents all
had heights in the same range of 1 inch. Not surprisingly, there is some
variation in the shapes of these histograms, vary somewhat, but the variances
are all substantially smaller than 6.34, a varying between 4.45 and 5.75.
16.1.2
r 973 2
R=
= 3.70.
1 r2
The threshold for rejecting t with 971 degrees of freedom (the row
essentially the same as the normal distribution) at the = 0.01 significance
level is 2.58. Hence, the correlation is highly significant. This is a good
example of where significant in the statistical sense should not be confused
with important. The difference is significant because the sample is so large
that it is very unlikely that we would have seen such a correlation purely by
chance if the true correlation were zero. On the other hand, explaining 1%
of the variance is unlikely to be seen as a highly useful finding. (At the same
time, it might be at least theoretically interesting to discover that there is
any detectable effect at all.)
283
Variance= 4.45
30
0
10
20
Frequency
10
5
Frequency
40
50
15
Variance= 5.75
62
64
66
68
70
72
62
64
66
68
70
72
Variance= 5.6
Variance= 4.83
10
20
Frequency
40
20
Frequency
60
30
80
62 64 66 68 70 72 74
Heights of children with parent height 69 inches
60
64
68
72
284
16.2
We have all heard stories of a child coming home proudly from school with
a score of 99 out of 100 on a test, and the strict parent who points out that
he or she had 100 out of 100 on the last test, and what happened to the
other point? Of course, we instinctively recognise the parents response
as absurd. Nothing happened to the other point (in the sense of the child
having fallen down in his or her studies); thats just how test scores work.
Sometimes theyre better, sometimes worse. It is unfair to hold someone to
the standard of the last perfect score, since the next score is unlikely to be
exactly the same, and theres nowhere to go but down.
Of course, this is true of any measurements that are imperfectly correlated: If |r| is substantially less than 1, the regression equation tells us
that those individuals who have extreme values of x tend to have values
of y that are somewhat less extreme. If r = 0.5, those with x values that
are 2 SDs above the mean will tend to have y values that are still above
average, but only 1 SD above the mean. There is nothing strange about
this: If we pick out those individuals who have exceptionally voluminous
brains, for instance, it is not surprising that the surface areas of their brains
are less extreme. While athletic ability certainly carries over from one sport
to another, we do not expect the worlds finest footballers to also win gold
medals in swimming. Nor does it seem odd that great composers are rarely
great pianists, and vice versa.
And yet, when the x and y are successive events in time for instance,
the same childs performance on two successive tests there is a strong
tendency to attribute causality to this imperfect correlation. Since there
is a random component to performance on the test, we expect that the
successive scores will be correlated, but not exactly the same. The plot of
score number n against score number n + 1 might look like the upper left
scatterplot in Figure 15.3. If she had an average score last time, shes likely
to score about the same this time. But if she did particularly well last time,
this time is likely to be less good. But consider how easy it would be to look
at these results and say, Look, she did well, and as a result she slacked off,
and did worse the next time; or Its good that we punish her when she
does poorly on a test by not letting her go outside for a week, because that
always helps her focus, and she does better the next time. Galton noticed
that children of exceptionally tall parents were closer to average than the
parents were, and called this regression to mediocrity [Gal86].
285
286
After some experience with this training approach, the instructors claimed that contrary to psychological doctrine,
high praise for good execution of complex maneuvers typically results in a decrement of performance on the next
try[. . . ] Regression is inevitable in flight maneuvers because
performance is not perfectly reliable and progress between
successive maneuvers is slow. Hence, pilots who did exceptionally well on one trial are likely to deteriorate on the next,
regardless of the instructors reaction to the initial success.
The experienced flight instructors actually discovered the regression but attributed it to the detrimental effect of positive
reinforcement. This true story illustrates a saddening
aspect of the human condition. We normally reinforce
others when their behavior is good and punish them when
their behavior is bad. By regression alone, therefore, they
are most likely to improve after being punished and most
likely to deteriorate after being rewarded. Consequently,
we are exposed to a lifetime schedule in which we
are most often rewarded for punishing others, and
punished for rewarding. [Tve82]
16.3
287
16.3.1
16.3.2
Taking logarithms may seem somewhat arbitrary. After all, there are a lot
of ways we might have chosen to transform the data. Another approach to
dealing with such blatantly nonnormal data, is to follow the same approach
that we have taken in all of our nonparametric methods: We replace the raw
numbers by ranks. Important: The ranking takes place within a variable.
We have shown in Table 16.1, in columns 3 and 5, what the ranks are:
The highest body weight african elephant gets 62, the next gets 61,
down to the lesser short-tailed shrew, that gets rank 1. Then we start over
again with the brain weights. (When two or more individuals are tied, we
average the ranks.) The correlation that we compute between the ranks is
called Spearmans rank correlation coefficient, denoted rs . It tells us
quantitatively whether high values of one variable tend to go with high values
of the other, without relying on assumptions of normality or otherwise being
288
Bibliography
16.3.3
Bibliography
Species
African elephant
African giant
pouched rat
Arctic Fox
Arctic ground
squirrel
Asian elephant
Baboon
Big brown bat
Brazilian tapir
Cat
Chimpanzee
Chinchilla
Cow
Desert hedgehog
Donkey
Eastern American mole
Echidna
European hedgehog
Galago
Genet
Giant armadillo
Giraffe
Goat
Golden hamster
Gorilla
Gray seal
Gray wolf
Ground squirrel
Guinea pig
Horse
Jaguar
Kangaroo
Lesser shorttailed shrew
Little brown bat
Man
Mole rat
Mountain beaver
Mouse
Musk shrew
N. American
opossum
Nine-banded
armadillo
Okapi
Owl monkey
Patas monkey
Phanlanger
Pig
Rabbit
Raccoon
Rat
Red fox
Rhesus monkey
Rock hyrax
(Hetero. b)
Rock hyrax
(Procavia hab)
Roe deer
Sheep
Slow loris
Star nosed mole
Tenrec
Tree hyrax
Tree shrew
Vervet
Water opossum
Yellow-bellied marmot
mean
SD
289
Body weight (kg)
Body rank
Brain rank
6654.00
1.00
62
21
5712.00
6.60
62
22
3.38
0.92
32
20
44.50
5.70
37
19
2547.00
10.55
0.02
160.00
3.30
52.16
0.42
465.00
0.55
187.10
0.07
3.00
0.79
0.20
1.41
60.00
529.00
27.66
0.12
207.00
85.00
36.33
0.10
1.04
521.00
100.00
35.00
0.01
61
42
4
53
31
47
14
58
16
54
7
30
18
12
25
49
60
44
10
56
51
46
8
22
59
52
45
1
4603.00
179.50
0.30
169.00
25.60
440.00
6.40
423.00
2.40
419.00
1.20
25.00
3.50
5.00
17.50
81.00
680.00
115.00
1.00
406.00
325.00
119.50
4.00
5.50
655.00
157.00
56.00
0.14
61
50
3
47
35
56
21
55
10
54
8
34
14
17
32
41
59
44
6
53
52
45
16
18
58
46
39
1
0.01
62.00
0.12
1.35
0.02
0.05
1.70
2
50
11
23
4
5
27
0.25
1320.00
3.00
8.10
0.40
0.33
6.30
2
60
13
23
5
4
20
3.50
34
10.80
24
250.00
0.48
10.00
1.62
192.00
2.50
4.29
0.28
4.24
6.80
0.75
57
15
41
26
55
29
39
13
38
40
17
490.00
15.50
115.00
11.40
180.00
12.10
39.20
1.90
50.40
179.00
12.30
57
30
44
25
51
26
36
9
38
49
28
3.60
35
21.00
33
14.83
55.50
1.40
0.06
0.90
2.00
0.10
4.19
3.50
4.05
43
48
24
6
19
28
9
37
34
36
98.20
175.00
12.50
1.00
2.60
12.30
2.50
58.00
3.90
17.00
42
48
29
6
12
28
11
40
15
31
199
899
930
243
Table 16.1: Brain and body weights for 62 different land mammal species.
Available at http://lib.stat.cmu.edu/datasets/sleep, and as the object mammals in the statistical language sR.
290
Bibliography
body
brain
diff.
62
62
0
21
22
1
32
37
5
20
19
1
61
61
0
42
50
8
4
3
0
53
47
6
31
35
4
47
56
9
14
21
7
58
55
3
16
10
6
54
54
0
7
8
1
30
34
4
18
14
4
12
17
5
25
32
7
49
41
8
60
59
1
body
brain
diff.
44
44
0
10
6
4
56
53
3
51
52
1
46
45
1
8
16
8
22
18
4
59
58
1
52
46
6
45
39
6
1
1
0
2
2
0
50
60
10
11
13
2
23
23
0
4
5
2
5
4
1
27
20
7
34
24
10
57
57
0
15
30
15
body
brain
diff.
41
44
2
26
25
1
55
51
4
29
26
3
39
36
3
13
9
4
38
38
0
40
49
9
17
28
10
35
33
2
43
42
1
48
48
0
24
29
5
6
6
0
19
12
7
28
28
0
9
11
2
37
40
3
34
15
18
36
31
5
Table 16.2: Ranks for body and brain weights for 62 mammal species, from
Table 16.1, and the difference in ranks between body and brain weights.
Bibliography
[Abr78]
[AC76]
[BH84]
Joel G. Breman and Norman S. Hayner. Guillain-Barre Syndrome and its relationship to swine influenza vaccination in
Michigan, 1976-1977. American Journal of Epidemiology,
119(6):8809, 1984.
[Bia05]
[BS]
[Edw58]
[Fel71]
William Feller. An Introduction to Probability and its Applications, volume 2. John Wiley & Sons, New York, 1971.
[FG00]
292
Bibliography
[FPP98]
[Gal86]
[Hal90]
[HDL+ 94] D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski. A Handbook of small data sets. Chapman & Hall, 1994.
[HREG97] Anthony Hill, Julian Roberts, Paul Ewings, and David Gunnell.
Non-response bias in a lifestyle survey. Journal of Public Health,
19(2), 1997.
[LA98]
[Lev73]
[Lov71]
[LSS73]
Judson R. Landis, Daryl Sullivan, and Joseph Sheley. Feminist attitudes as related to sex of the interviewer. The Pacific
Sociological Review, 16(3):30514, July 1973.
Bibliography
293
[Mou98]
[MS84]
[RS90]
[RS02]
[SCT+ 90]
[SKK91]
[TK81]
294
Bibliography