
University of Cape Town

Department of Statistical Sciences

Inferential Statistics
Course Notes for STA3030F
Contents

1 One-sample problems
2 Two-sample and three-sample problems
3 Parameter estimation and inference
4 Bayesian inference
5 Generalized linear models
A Formulae sheet
B Bibliography
Preface

These notes form the core course material for STA3030F Inferential Statistics, a third-year semester-long course for students in the Applied Statistics stream.

The purpose of these notes, and the course, is to provide business students specializing in statistics and management science with a broad understanding of the principles underlying the practice of statistical inference. It is assumed that students entering this course have completed at least three semester courses in statistical methods (typically STA1000F/S, STA2020F and STA2030S at UCT), and are familiar with the following concepts:

• Basic principles of probability, random variables and probability distributions, including definitions of the (cumulative) distribution function, the probability function for discrete random variables (especially those of the binomial and Poisson distributions), and the probability density function for continuous random variables (especially those of the uniform, exponential and normal distributions); these principles are, however, revised in the beginning parts of Section 1.2;

• Basic principles of hypothesis testing and p-values, particularly in respect of two-sample tests, ANOVA and regression;

• Basic principles of interval estimation, i.e. the construction of confidence intervals.

These notes have been compiled over several years and several people have been involved in that process – this will sometimes be clear to the reader in a change of style, or way of presenting information graphically, or preference for a particular software package. Most of the material in the first four chapters was written by Prof. Theodor Stewart. The material in Chapter 5, on GLMs, was written by Dr. Birgit Erni for the course STA2007S. Other people who have worked on various parts of the notes include Dr. Freedom Gumedze, Dr. Juwa Nyirenda and Dr. Ian Durbach. The notes are updated from year to year, so if you have any comments or corrections, these are welcome.
A last note on software: this is a course in inferential statistics, often demonstrating concepts using computer simulation. A number of software packages are available to do this. In past years these notes have been based primarily on Microsoft Excel, whose spreadsheet environment is well-suited to many of the tasks we wish to do. More recently, we have also included code written in R, an open-source statistical software package. At the current time students are not exposed to R in previous courses (which is one of the motivations for including it here), but learning R is a highly useful skill and the exercises in these notes introduce R in a gentle way. Students are strongly advised to attempt the examples and exercises in both Excel and R, wherever possible, until they feel comfortable using either.
1 One-sample problems

1.1 Using simulation as a tool for solving inferential problems

In previous courses you would have been introduced to quite a large number of procedures for testing hypotheses and constructing associated confidence intervals (z-tests, t-tests, F-tests, Chi-squared tests, non-parametric tests, analysis of variance, ...). Often in first statistics courses these are taught by pairing data "types" with particular tests, often in a fairly formulaic way.[1] A complete understanding of the theoretical foundations which support the procedures and tests introduced in earlier courses requires a relatively high level of mathematics, beyond that with which most students in this course will be familiar. Our approach will be to develop a degree of understanding and insight by simulating sampling processes, hypothesis tests and estimation procedures on a computer (largely within a spreadsheet framework, such as Microsoft Excel). Some mathematical representations of the simulated phenomena will be introduced, without rigorous proofs, but with demonstration of consistency with empirical results. These mathematical representations then underpin more efficient and rigorous statistical methods.

[1] For example, if you wish to compare two group means and the population variance is unknown, you use a t-test. But then you need to know "which" t-test to use – if the data is paired, you use a paired t-test; if there's no pairing you use an unpaired t-test, but the degrees of freedom depend on whether the sample variances of the two groups are significantly different or not (which you test with an F-test!).

It is useful at this stage to illustrate our modus operandi in the context of one of the simplest statistical problems, namely that of drawing inferences about a single population mean.
A mining company has been concerned about its cash flow position, and one of the problem areas seems to be the delay in payments for bulk shipments of copper to industrial clients. A sample of 30 recent deliveries has been carefully followed up, and the days between invoicing and receipt of payment recorded in each case as follows:
37 42 38 44 39 35
43 41 42 38 36 34
37 42 36 38 41 39
39 37 34 34 41 41
40 38 38 46 38 42

The sample data can be summarized in any of the familiar ways, for example by a box-and-whisker plot, or by histogram. We can do this by loading the data into R using

pdelays <- c(37,42,38,44,39,35,43,41,42,38,36,34,37,42,
             36,38,41,39,39,37,34,34,41,41,40,38,38,46,38,42)

and creating the plots

boxplot(pdelays,ylab="Days",main="Boxplot")
hist(pdelays,breaks=10,xlab="Days",main="Histogram")

[Figure 1.1: Summary plots of sample data: copper payment delays. (a) Boxplot of days to payment; (b) histogram of days to payment.]

We can also compute a few summary statistics:

mean(pdelays)

[1] 39

sd(pdelays)

[1] 3.051286

median(pdelays)

[1] 38.5

The first question to be asked might relate to the mean time to payment, as in the long run this would impact on the cash reserves tied up. It is easily calculated that the mean and median of the sample data are 39 and 38.5 days respectively, but how representative are these of long-run conditions? Or, in other words, what differences might arise between the sample mean and the population mean (i.e. in the "long run")? One approach to answering this question is the so-called sampling theory approach to statistical inference, which focusses on the following fundamental questions:

What would happen if we were able to repeat the sampling process many times? How different would the sample mean and standard deviation be each time we re-sampled? Would that lead to different conclusions? How sure does this make us of our current conclusions?

It is precisely these questions which lead us to the familiar concepts of significance levels for hypothesis tests (e.g. are the data consistent with a claim that the long-run mean time does not exceed 40 days?) and of confidence intervals for the true population mean. The mathematics behind the standard tests can become quite intricate, but we can easily mimic, or simulate, the process of re-sampling on a computer (for example within a spreadsheet). One simple procedure for performing such a simulation is the method which has been termed "bootstrapping". In essence, this procedure is as follows:

(1) Start with the observed set of observations (a "random sample"), denoted by x_1, x_2, ..., x_n, and calculate any relevant summary statistics (e.g. sample mean and variance).

(2) Place all the observations "in a hat" and "shuffle" them. Then draw n items with replacement to form a new pseudo-sample (the "bootstrap sample").

(3) Recalculate the summary statistics for this new sample.

(4) Repeat the previous two steps as often as desired.

The above procedure would be equivalent to the process of repeatedly re-sampling from the population, if the original sample were an exact representation of the total population, i.e. with precisely 1/n of the population taking on each of the values x_1, x_2, ..., x_n. The reason for sampling with replacement is that the proportions in the population corresponding to each x_i should not change during the sampling process. Of course, the sample can never be an exact representation of the population, but it is usually close enough for purposes of assessing the ranges of variation which can arise through re-sampling.

The "shuffling" and "sampling with replacement" step of the bootstrap procedure is easily achieved algorithmically (i.e. within a computer program) by the following mechanism:

For i = 1, ..., n do:

• Draw a uniformly distributed random number u (e.g. by using Excel's RAND() function);

• Define k as 1+INT(n × u), where 'INT()' is the Excel spreadsheet function which returns the integer part of a real number;

• Set the i-th observation in the bootstrap sample to x_k.
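The same mechanism can be written directly in R (a minimal sketch, assuming the original observations are held in a vector x; R's sample() function, used later in this section, wraps this up for us):

# draw bootstrap indices "by hand"
n <- length(x)          # sample size
u <- runif(n)           # n uniform random numbers on [0,1]
k <- 1 + floor(n * u)   # indices 1, ..., n, each equally likely
boot <- x[k]            # the bootstrap pseudo-sample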

The process is illustrated in the following table, in which the sample is composed of the first eight observations only from the payment delay data above.

         Original sample    First bootstrap           Second bootstrap
         index   value      random   index   value    random   index   value
                            number                    number
         (1)     (2)        (3)      (4)     (5)      (6)      (7)     (8)
         1       37         0.4488   4       44       0.5033   5       39
         2       42         0.7895   7       43       0.7191   6       35
         3       38         0.5253   5       39       0.6484   6       35
         4       44         0.3476   3       38       0.3332   3       38
         5       39         0.6935   6       35       0.4328   4       44
         6       35         0.9859   8       41       0.5852   5       39
         7       43         0.4037   4       44       0.5374   5       39
         8       41         0.4019   4       44       0.3565   3       38
         mean    39.88                       41.00                     38.38
         std dev  3.14                        3.38                      2.83

In column (3) is shown a set of 8 random numbers generated using the Excel RAND() function, which are used as described in the box above to give the index numbers in column (4). The corresponding values from the original data set are given in column (5). This process is repeated in columns (6)–(8), in order to give a second bootstrap sample.

Note that in the first bootstrap sample, the 4th observation from the original data appears three times, while the first two do not appear at all. In the second bootstrap sample, only observations 3–6 of the original sample appear at all. If this process is repeated a large number of times, many different combinations of the original data will appear.

For each bootstrap sample, the sample mean and standard deviation can be calculated, giving an immediate impression of how variable these estimates are. (This is even evident from the two repetitions above!)
Such a process would be tedious to repeat for very large numbers of resamplings, so that some form of automated procedure is needed, i.e. a computer programme. We'll first implement this using the Visual Basic for Applications (VBA) facility in Excel, and then write our own R program. VBA is an extremely useful feature for many statistical and management science applications, and students are advised to become familiar with this package. However, we do make available for students a simple spreadsheet package (BootStrap.xls), in which such a macro has already been coded. [Students interested in the coding can examine the code by clicking Tools|Macro|Visual Basic Editor; feel free to experiment with changing the code!] If this file is opened under Excel, the user will see rows 1-6 and columns A-H of the spreadsheet illustrated in Figure 1.2.
In order to use the BootStrap.xls package, the user needs to do the following:

• Enter the sample data as a long row, in the spreadsheet row 6;

• Create the formulae to calculate any summary statistics (e.g. mean and standard deviation) to the right of the data, in the same row (row 6);

• Select (highlight) the sample data (not the summary statistics) in row 6, and press Ctrl-b; you will be prompted to enter the number of times the re-sampling of the data is to be carried out;

• The simulated bootstrap samples will appear row-wise, starting in row 8; copy the formulae for the summary statistics down to all the re-sampled rows;

• Analyze the variation in the summary statistics from sample to sample.

[Figure 1.2: Illustration of the BootStrap.xls package. The sheet's header reads: "Module for simple bootstrap simulations. Enter original data on one ROW – shaded row recommended! Select (highlight) data and press Ctrl-b to run bootstrap – simulated samples will start two rows below the original data." Row 6 holds the original eight observations with their Mean (39.88) and StdDev (3.14); rows 8 onwards hold the bootstrap samples with their means and standard deviations.]

For example, in Figure 1.2, the first 20 bootstrap samples (based on a sample of size 8) are shown, together with the means and standard deviations in each case. Of course, in order to obtain a meaningful understanding of the extent of variation in sample estimates (such as of the mean and standard deviation), you need to carry out many more than 20 repeated samples.
To do bootstrap sampling in R, we'll first load the full sample of 30 payment delays previously described:

pdelays <- c(37,42,38,44,39,35,43,41,42,38,36,34,37,42,
             36,38,41,39,39,37,34,34,41,41,40,38,38,46,38,42)

We then use the sample() function to create a single bootstrap sample, which we store in a new variable called boot.[2]

boot <- sample(pdelays,size=30,replace=TRUE)

[2] The parts inside the brackets are called the arguments of the sample function. The sample function takes 3 arguments: the original data (which we have called pdelays), the size of the new sample, and a logical variable (TRUE or FALSE) indicating whether sampling should be with replacement or not. In R you can get help on any function (say sample) by ?sample or help(sample).

To create many bootstrap samples we put the code above in a for loop, being careful to store the results.

# set up a matrix to store the bootstrap samples
all_boots <- matrix(NA,nrow=5000,ncol=30)
for(i in 1:5000){
  # draw a single bootstrap sample from pdelays
  boot <- sample(pdelays,size=30,replace=TRUE)
  # store that bootstrap sample in row i
  all_boots[i,] <- boot
}

Having created the bootstrap samples,[3] we can now extract the bootstrap means using the apply function.

bs_means <- apply(all_boots,1,mean)

[3] In general there are many ways to program a particular task. Another way of constructing a matrix containing the bootstrap samples is all_boots <- matrix(sample(pdelays, size=5000*30, replace=TRUE), nrow=5000, ncol=30). This way is slightly better because we avoid the for loop, which saves time, but the code is a bit trickier to understand.

However we construct the bootstrap sample means, we can now summarize these in a number of ways. Figure 1.3 shows a boxplot and histogram of the bootstrap sample means:

boxplot(bs_means,ylab="Days",main="Boxplot")
hist(bs_means,breaks=10,xlab="Days",main="Histogram")

[Figure 1.3: Summary plots of bootstrap sample mean copper payment delays. (a) Boxplot of the bootstrap means; (b) histogram of the bootstrap means.]

We can also compute a few summary statistics:

mean(bs_means)

[1] 38.99818

sd(bs_means)

[1] 0.5539776

min(bs_means)

[1] 36.83333

max(bs_means)

[1] 40.9

Figure 1.3 shows the range of variation in sample means based on 5000 bootstrap replicate samples drawn from the full sample of 30 payment delays previously described. Note that the re-sampled sample means range between 36.83 and 40.9 (in comparison with the original sample mean of 39). Thus we see that sampling errors in the estimation of the mean are at least of the order of ±2 days (although this could still grow as the number of re-samples is increased), when these estimates are based on a sample of size 30.

We can be more precise in analyzing sampling errors. It is simple to sort all the bootstrap sample means from smallest to largest. In R we would do this by

# sort the elements in bs_means from smallest to largest
sorted_bs_means = sort(bs_means, decreasing = FALSE)
# show the first six
sorted_bs_means[1:6]

[1] 36.83333 37.26667 37.30000 37.33333 37.33333 37.40000

For our particular simulation of 5000 repetitions of the sampling we can find, for example, the 125th and 4875th (sorted) means:

sorted_bs_means[125]

[1] 37.93333

sorted_bs_means[4875]

[1] 40.1

These are perhaps more interesting than the absolute extremes, and tell us that 4750 out of 5000 (i.e. 95%) of the bootstrapped means lay between 37.93 and 40.1. This is a useful observation, but has to be interpreted with some caution. It is wrong to interpret this result as a 95% probability that the actual mean lies between 37.93 and 40.1. (In this particular example, the numerical errors made by such a wrong interpretation are minor; but cases can arise in which the errors are substantial, and in any case we need always to ensure that conclusions we reach are theoretically justifiable.)
In order to properly interpret the result, we must recognize that the bootstrap samples have been generated from a hypothetical population in which the true (population) mean is precisely the original sample mean, i.e. 39.0. So our observation really says that in 95% of samples (in hypothetical re-sampling from the population), the deviation between the sampled mean and the true mean will lie between −1.07 (= 37.93 − 39) and 1.1 (= 40.1 − 39). IF (and this is the critical assumption) the same errors apply to the real sample, we may be "95% confident" that the error of estimation lies between −1.07 and 1.1. Based on our observed mean of 39.0, we argue that:

• If the error is as small as −1.07, then the population mean would actually be 39 + 1.07 = 40.07;

• If the error is as large as 1.1, then the population mean would actually be 39 − 1.1 = 37.9.

We would thus claim a 95% confidence interval (the "bootstrap confidence interval") for the mean as [37.9; 40.07].
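In R this interval can be read off directly (a minimal sketch; note that quantile() interpolates, so its values may differ very slightly from picking out the 125th and 4875th sorted means by hand):

# 2.5% and 97.5% points of the bootstrap means, expressed as sampling errors
err <- quantile(bs_means, c(0.025, 0.975)) - mean(pdelays)
# invert the errors around the observed mean of 39
c(mean(pdelays) - err[2], mean(pdelays) - err[1])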
Suppose that the true (population) mean is µ, while the sample mean is X̄. The above ideas can be expressed in algebraic terms by noting that the measurement error is X̄ − µ, so that the bootstrap results imply:

Pr[−1.07 ≤ X̄ − µ ≤ 1.1] = 0.95

which is sometimes written as:

Pr[X̄ − 1.1 ≤ µ ≤ X̄ + 1.07] = 0.95

which gives the same result as above for X̄ = 39. This last expression can, however, be misleading, as it looks like a probability statement about µ. In this context, µ is fixed (although unknown), while the probabilities of occurrence refer to the different realizations of the sample mean when the sampling process is repeated many times.
The confidence interval described in introductory statistics courses (e.g. STA1000) is in fact derived on precisely the same type of re-sampling argument. The only difference is that the assumption is made that the errors themselves are normally distributed with zero mean, i.e. that X̄ − µ has a normal distribution with mean 0 and standard deviation σ/√n, where σ is the population standard deviation and n the sample size. When σ is known, the standard confidence interval is written in the form:

Pr[X̄ − z_α σ/√n ≤ µ ≤ X̄ + z_α σ/√n] = 1 − 2α

where z_α is the α critical value of the normal distribution (e.g. z_0.025 = 1.960 for the usual 95% interval).
When σ is unknown and is replaced by its corresponding sample estimate s, then the standard result is to calculate the confidence interval from the following expression:

Pr[X̄ − t_{n−1,α} s/√n ≤ µ ≤ X̄ + t_{n−1,α} s/√n] = 1 − 2α

where t_{n−1,α} is the α critical value of the t-distribution with n − 1 degrees of freedom. We shall return later to the precise reasons for this result, but record for now that the resulting 95% confidence interval for the same sample data as above is:

39 ± (2.045 × 3.05)/√30 = [37.86; 40.14]

where the factor 2.045 is the 0.025 critical value for the t-distribution with 29 degrees of freedom. Note how close the bootstrapped and t-based confidence intervals are; this would suggest that the assumptions of the normal theory hold quite well in this case.
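The t-based interval is quickly verified in R (a minimal sketch; qt returns the t critical value):

# 95% t-interval for the mean: 39 +/- 2.045 * 3.05/sqrt(30)
n <- length(pdelays)
tcrit <- qt(0.975, df = n - 1)   # 2.045 for 29 degrees of freedom
mean(pdelays) + c(-1, 1) * tcrit * sd(pdelays)/sqrt(n)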

In a rather similar manner, we can also perform hypothesis tests on the basis of the bootstrap samples. Suppose, for example, that we wish to test the "null hypothesis" H0: µ ≤ 38 against the alternative H1: µ > 38. (Perhaps the company's current cash flow planning assumes that µ ≤ 38, and they are now worried in the light of the sample mean of 39.) If the (population) mean does not actually exceed 38, then the observed sample mean corresponds to a sampling error of at least +1.0 days. Since the bootstrap sample means were based on a "true" mean of 39, errors of 1.0 or more correspond to bootstrap sample means of 40.0 or more. We can find the number of values exceeding 40.0 by

# create a vector with elements TRUE if >40, else FALSE
mean_gt40 <- (bs_means > 40)
# count up the number of TRUEs (TRUE = 1, FALSE = 0)
sum(mean_gt40)

[1] 179

So we find that 179 of the 5000 values exceeded 40.0, i.e. 3.58% of the bootstrap means. This may be interpreted as an observed significance level (or p-value) of 0.036, which is significant at the conventional 5% level. Recall that for the standard t-test based on normal errors, the t-statistic for this test would be:

(39 − 38)/(3.05/√30) = 1.795

Since the 5% critical value from t tables is 1.699 for 29 degrees of freedom, we would conclude that the difference is "significant" (but not "highly significant"). We can calculate the associated p-value in R as follows:[4][5]

# calculate parts of the t-statistic
mpd <- mean(pdelays)
sdpd <- sd(pdelays)
npd <- length(pdelays)
# compute the t-statistic
tstat <- (mpd - 38) / (sdpd/sqrt(npd))
# enter the degrees of freedom
dof <- npd - 1
# get the p-value from a t-distribution
p <- 1 - pt(tstat,dof)
# display p
p

[1] 0.04153733

[4] The crucial part of the code above is pt, which calculates the value of the CDF of a t-distribution at the point 1.8. This gives us the probability of obtaining a value less than 1.8 (think of the definition of a CDF). For our (one-sided) hypothesis test we want the probability of getting more than 1.8, which is why we take 1 minus the value returned by pt. R has similar functions for lots of other distributions, see for example pnorm, pgamma, pbeta, pbinom. Note that pt is specifically for the t-distribution!

[5] Note that again you could do this in a single line of code: 1-pt((mean(pdelays)-38)/(sd(pdelays)/sqrt(length(pdelays))),length(pdelays)-1). It's just easier to understand, and harder to make a mistake, if you break it up a little. In fact, you can run the t-test directly using t.test(pdelays,mu=38,alternative="greater"). Try this, and note that you get the same p-value. For help type help(t.test). Using R's built-in t.test function is the way you would normally run a t-test in practice, but it doesn't give you any insight into sampling theory, which is the whole point of this section!

We obtain a p-value of 0.042, which is somewhat larger than (but of much the same order of magnitude as) the bootstrap estimate.

It is worth emphasizing two critical insights derived from the above arguments:

• A 95% confidence interval does not mean a probability of 0.95 that the population parameter is within the stated bounds ... it is the frequency (in repeated sampling) of stating erroneous bounds;

• A 5% significance level (p-value of 0.05) does not mean a 5% probability that the null hypothesis is correct ... it is the frequency (in repeated sampling) of "type I" errors (rejecting H0 when it is in fact true).

While the "bootstrap" concept does in principle allow us to simulate the effects of re-sampling, so as to generate confidence intervals or p-values, this can become quite tedious, especially for complicated sampling situations involving two or more different populations. For this reason, we seek means of solving essentially the same problems, at least approximately, by a more analytical approach. In the following chapter, we extend both the bootstrap and analytical approaches into more complex sampling situations involving two or more samples.

1.2 Simulating Random Variables

Recall that a random variable is defined such that its value is only determined by performing some form of experiment, measurement or observation. Examples in earlier courses would have included the number of heads in a fixed number of spins of a coin, the time to occurrence of some specified event (e.g. the breakdown of a machine), or rainfall at a particular point over a fixed period of time. The critical point is that prior to making the necessary observations, the value of the random variable is unknown, and can only be described probabilistically.

A convenient notation is to use upper case letters (e.g. X, Y, ...) to denote the random variable itself (prior to any observation), and to use lower case letters (e.g. a, b, ..., x, y, ...) to represent particular real values that might be taken on by the random variable. An expression such as Pr[X = x] would then denote the probability that when the random variable X is observed, it is found to take on the value x. Clearly, expressions such as Pr[X = y] or Pr[X = 10] are equally meaningful.

The distribution function (sometimes termed the cumulative distribution function) of the random variable X is a function F(x), such that for each real number x: F(x) = Pr[X ≤ x]. Clearly, as x increases, the probability cannot decrease, although it could remain constant over some range of x (since if X cannot take on values between a and b, say, then F(a) = Pr[X ≤ a] = Pr[X ≤ b] = F(b)). In other words, F(x) is a non-decreasing function of x. Furthermore, by the properties of probabilities, 0 ≤ F(x) ≤ 1. Figure 1.4 illustrates two possible forms of distribution function.
[Figure 1.4: Examples of distribution functions F(x): Distribution A (a continuous curve) and Distribution B (a step function), both rising from 0 to 1 over 0 ≤ x ≤ 5.]

Both distributions illustrated in Figure 1.4 relate to random variables taking on values between 0 and 5 only. The distribution function for Distribution A is continuous, which implies that any real number between 0 and 5 is at least possible. On the other hand, the function for Distribution B has a discrete number of jumps (at integer values of x), and is otherwise flat (horizontal). Thus, for example, F(1.8) = F(1.1), so that Pr[X ≤ 1.8] = Pr[X ≤ 1.1], implying that the probability of 1.1 < X ≤ 1.8 is zero. In fact, X can only take on the integer values 0, 1, ..., 5, where the probabilities are given by the magnitudes of the jumps. This leads us to the concept of the probability function, or probability mass function p(x) defined by p(x) = Pr[X = x].
It is usually convenient (although not entirely necessary) to define discrete random variables as taking on non-negative integer values. Then p(x) = 0 if x is not a non-negative integer, while for k = 0, 1, 2, ...:

p(k) = Pr[X = k] = F(0) for k = 0, and F(k) − F(k − 1) for k > 0.

Conversely, it is easy to see that:

F(k) = Σ_{i=0}^{k} p(i).

The probability function is often intuitively easier to understand than the more fundamental distribution function, as it can be interpreted as showing relative frequencies for each possible value for X. Unfortunately, it does not carry over to continuous distributions such as Distribution A in Figure 1.4. The problem is that the probability of any precise value is always zero (for example, Pr[X = π] in Figure 1.4), even though it is possible. We can always evaluate the probability associated with a range of values, i.e. Pr[a < X ≤ b] = F(b) − F(a), but of course the magnitude of this probability depends on the length of the interval. The probability density at any point x on the real line is then defined by the probability of X belonging to a small interval on the real line containing x, divided by the length of the interval, in the limit as this length tends to zero. Formally we express this in terms of the probability density function:

f(x) = lim_{h→0} Pr[x < X ≤ x + h]/h = lim_{h→0} (F(x + h) − F(x))/h = dF(x)/dx.
Clearly this also implies that:

F(x) = ∫_{−∞}^{x} f(u) du

so that F(x) is simply the area under the probability density function curve up to the point x.

It is conventional to express the lower bound as −∞ in general expressions for continuous random variables, although f(x) may be zero for some range of values (such as for x < 0 in many cases).

It also follows that

Pr[a < X ≤ b] = F(b) − F(a) = ∫_{a}^{b} f(x) dx

i.e. the area under the density function curve between the points a and b on the x-axis.
A notational convention: Sometimes it becomes necessary to identify to which random variable a particular distribution, probability or density function applies. In such cases we will label the functions by a subscript denoting the name of the random variable, e.g. F_X(x), p_X(x) or f_X(x). The subscripts will, however, be omitted if no confusion can arise.

You should be familiar with the following distributional forms:

• Binomial (discrete): p(x) = (n choose x) p^x (1 − p)^{n−x} for x = 0, 1, ..., n

• Poisson (discrete): p(x) = λ^x e^{−λ}/x! for x = 0, 1, ...

• Exponential (continuous): f(x) = λe^{−λx} for x > 0, so that F(x) = ∫_{u=0}^{x} f(u) du = 1 − e^{−λx}

• Normal: f(x) = (1/(σ√(2π))) e^{−(x−µ)²/2σ²}
It is worth noting here that a non-negative random variable X is
said to have the log-normal distribution if Y = log X follows a
normal distribution. The base to which the logarithm is taken is
irrelevant, but it is conventional to use natural logarithms (i.e. to
base e).
As we have seen in Section 1.1, it is possible to explore the behaviour of statistical sampling processes by means of numerical experimentation in computer simulations (often then termed a Monte Carlo approach). In order to implement such an approach, we often need to simulate realizations of random variables drawn from some specified distribution (such as the normal or Poisson with specific parameter values). Some computer systems allow us to do this directly, but it is useful to master the general principles, which are quite simple.
Suppose that we observe a sequence of random variables, say X_1, X_2, ..., drawn independently from the same probability distribution, and let x_1, x_2, ... be the actual values observed. Suppose now that we actually report the cumulative probabilities corresponding to each observation, i.e. u_1 = F(x_1); u_2 = F(x_2); .... It can be shown that the sequence of values u_1, u_2, ... arises in fact from the uniform distribution on [0,1]. We reverse this process to generate numbers from any desired distribution (where the distribution function F(x) is given) as follows:

• Generate a sequence of numbers from the uniform distribution, say u_1, u_2, .... Most computer software systems provide some facility for doing this. For example, in R the function runif(n) returns n values drawn from U[0,1]:

runif(4)

[1] 0.64219599 0.09792858 0.48317477 0.07788739

while in Excel the spreadsheet function RAND() does the same.

• For each u_i find the value x_i such that F(x_i) = u_i; the resulting sequence x_1, x_2, ... arises from the desired distribution.

The only complication may arise from having to find the x such that F(x) = u. For discrete distributions, the search for a solution can be carried out systematically: set X to the smallest non-negative integer k such that u ≤ Σ_{i=0}^{k} p(i). This is easily found by a set of nested IF functions.[6] For example, suppose we want to draw from a discrete distribution giving X = 0 with probability 0.2, X = 1 with probability 0.5, and X = 2 with probability 0.3. We could simulate 1000 values from this distribution as follows:

[6] Again there are different ways to program this. A better way, once you have generated your uniform random numbers, is to use x <- ifelse(u<0.2,0,ifelse(u<0.7,1,2)). This does the same thing as in the main text, but avoids the use of the for loop. Try help(ifelse) to get more information on the useful ifelse function.

# set up an empty vector to store the values
x <- c()
# generate 1000 U[0,1] variates
u <- runif(1000)
# use u to pick appropriate value of x
for(i in 1:length(u)){
  if(u[i] < 0.2){
    x[i] <- 0
  } else if(u[i] < 0.7){
    x[i] <- 1
  } else {
    x[i] <- 2
  }
}
# print out first 5 u values
round(u[1:5],2)

[1] 0.51 0.59 0.81 0.73 0.15

# print out first 5 x values
x[1:5]

[1] 1 1 2 2 0

A histogram of the generated values is shown below:

[Histogram of x: frequencies of the simulated values 0, 1 and 2.]

For continuous distributions, we have to solve the non-linear equation F(x) = u for x in terms of u. Formally we often represent this by an "inverse" function, i.e. as x = F^{−1}(u). The idea is illustrated in Figure 1.5.

[Figure 1.5: Finding x = F^{−1}(u): read the value u on the vertical axis across to the curve F(x), then down to x = F^{−1}(u) on the horizontal axis.]

In some cases the solution is relatively simple. For example, the exponential distribution has a distribution function given by 1 − e^{−λx}. Solving 1 − e^{−λx} = u for x gives x = −[ln(1 − u)]/λ. In order to generate a sequence of exponentially distributed random variables in Excel, we need therefore only to generate an array of uniform numbers (using the RAND() function), and then to convert these to the desired x's using the above function. In fact there is a little simplifying trick! If the random variable U has a uniform distribution on [0,1], then so has 1 − U. The expression x = −[ln(u)]/λ will therefore also produce a number from the desired exponential distribution.
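The same trick is easily checked in R (a minimal sketch; the rate λ = 0.5 is an arbitrary choice for illustration, and rexp is R's built-in exponential generator, included only for comparison):

lambda <- 0.5                    # assumed rate, for illustration
u <- runif(10000)                # uniform random numbers
x <- -log(u)/lambda              # inverse-transform exponential variates
mean(x)                          # should be close to 1/lambda = 2
mean(rexp(10000, rate=lambda))   # compare with R's built-in generator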
More generally, it may be difficult to solve the equation F(x) = u in a closed form. However, R provides inverse functions for many standard distributions, i.e. giving values for F^{−1}(u) directly. For example, the function qnorm(u,m,s) gives F^{−1}(u) for the normal distribution with mean m and standard deviation s. Another function, rnorm(n,m,s), directly generates n values from N(m, s).

# A few examples of qnorm and rnorm
qnorm(0.025,0,1)

[1] -1.959964

qnorm(0.975,0,1)

[1] 1.959964

qnorm(runif(1),1,3)

[1] -0.1832144

round(rnorm(4,10,1),2)

[1] 9.78 8.96 10.34 10.65

Excel also provides inverse functions for many distributions. For example, the spreadsheet function NORMINV(u, m, s) gives F^{−1}(u) for the normal distribution with mean m and standard deviation s.

Simulating the central limit theorem

As an illustration of the use of simulation in understanding statistical sampling, let us consider an extremely simple situation. Students are advised to repeat this exercise for themselves!

We start with the Bernoulli distribution, which is the Binomial distribution with n = 1, so that X = 1 with probability p, and X = 0 otherwise. The random variable is easily generated; simply obtain a uniformly distributed random number u, and set X = 1 if u < p, and X = 0 otherwise. For the experiments reported below, we arbitrarily chose p = 0.25.

A set of 200000 Bernoulli random variables was generated as described above, i.e.

u <- runif(200000)
x <- ifelse(u < 0.25, 1, 0)

Two sets of analyses were carried out using the generated sequence of 0's and 1's:

• The 200000 values were grouped into 20000 sets of 10 observations each, and the mean of each set recorded.

# arrange x into matrix of 20000 x 10
xmat1 <- matrix(x,20000,10)
# calculate mean of each row
mean1 <- apply(xmat1,1,mean)

Clearly these means must take on one of the values 0, 0.1, ..., 1.0, so that the distribution of the sample means is still very much discrete. The observed frequencies of each value of the mean are displayed in the histogram below:

hist(mean1,xlab="mean values",main="Histogram of means")

[Figure 1.6: Distribution of means of samples of size 10 (histogram of the 20000 mean values).]

Even for samples of size 10, the distribution of the mean is starting to look quite smooth, although somewhat skewed to the right rather than normally distributed.

• The same 200000 values were then grouped into 2000 sets of 100 observations each, once again calculating the means in each set.

# arrange x into matrix of 2000 x 100
xmat2 <- matrix(x,2000,100)
# calculate mean of each row
mean2 <- apply(xmat2,1,mean)

These means may now take on the values 0, 0.01, 0.02, ..., which are still not continuous, but are very nearly so to a reasonable approximation. Again, we can show the distribution of mean values in the form of a histogram:

hist(mean2,xlab="mean values",main="Histogram of means")

[Figure 1.7: Distribution of means of samples of size 100 (histogram of the 2000 mean values).]

Clearly the distribution of means is starting to take on a distinctly normal shape, even if still a little ragged.

What this type of numerical experiment shows is that even for random variables which are far from normal (the Bernoulli in this case), sample means tend to exhibit increasingly normal-like behaviour. This is the basic principle of the central limit theorem. We shall return to this theorem later, but it is useful to summarize the key concepts here:

• A random sample is a set of independent random variables drawn from the same probability distribution.

• The sample mean and sample variance from a random sample are also random variables, and thus have their own probability distributions.

• For almost all distributions of practical interest (discrete and continuous), the distribution of the sample mean approaches a normal distribution for large enough sample sizes (as in the above example).

• The central limit theorem thus justifies using normal distribution theory for any inference involving averages (which explains why the "bootstrap" and normal theory results in the introductory chapter were so similar).
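One quick visual check of the approximation for the samples of size 100 (a sketch, reusing mean2 from above; the central limit theorem suggests a normal curve with mean p = 0.25 and standard deviation √(p(1 − p)/100) ≈ 0.043):

# histogram on the density scale, with the CLT normal curve overlaid
hist(mean2, breaks=20, freq=FALSE, xlab="mean values",
     main="Means of samples of size 100")
curve(dnorm(x, mean=0.25, sd=sqrt(0.25*0.75/100)), add=TRUE)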

1.3 Order statistics

In Section 1.1, we have used simulation to re-examine some fairly straightforward inference problems involving the mean of a probability distribution function, of the kind that you would have encountered in a first-year statistics course. In this section, we use the same approach to look at similar inference problems involving statistics other than the mean, and in particular those called order statistics.
Order statistics are obtained by simply ordering a random sample from smallest to largest. Suppose that we have a random sample X_1, X_2, ..., X_n. We can order the observed values of the random sample from smallest to largest, and denote the sorted values by X_(1), X_(2), ..., X_(n), where:

X_(1) < X_(2) < ... < X_(n).

Prior to observing the random sample, we won't know the values of X_(1), X_(2), ..., X_(n), and we won't even know which observation will turn out to be the smallest, second smallest, etc. But for any given set of observations we can calculate the corresponding realizations of X_(1), X_(2), ..., X_(n). Thus each of these quantities satisfies the definition of a statistic: they are termed the order statistics of the sample.

Example: Suppose that we observe the following four numbers: 5, 8, 3, 10. These would usually be denoted x_1 = 5, x_2 = 8, x_3 = 3, x_4 = 10. That is, the subscript i in x_i just denotes the order in which the observations were recorded and does not indicate any ranking. The order statistics, however, would be denoted x_(1) = 3, x_(2) = 5, x_(3) = 8, x_(4) = 10. Here, the subscript (i) indicates that that observation is the ith smallest in the sample. The first order statistic X_(1) is always the minimum of the sample, that is

X_(1) = min{X_1, X_2, ..., X_n}

which is why, here, x_(1) = 3. For a sample of size n the nth order statistic X_(n) is always the maximum of the sample, that is

X_(n) = max{X_1, X_2, ..., X_n}

so that here x_(n) = 10. The sample range is the difference between the maximum and minimum, and so is expressed as a function of the order statistics:

Range = X_(n) − X_(1)

Here, the range is clearly 10 − 3 = 7.
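In R the order statistics of this small example are simply the sorted sample (a minimal sketch):

x <- c(5, 8, 3, 10)   # the example observations
sort(x)               # order statistics x_(1), ..., x_(4): 3 5 8 10
min(x); max(x)        # first and last order statistics: 3 and 10
diff(range(x))        # sample range: 10 - 3 = 7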

The "five-number summary" introduced at the start of the first year course consisted of certain order statistics, or averages of pairs of order statistics. Other useful summaries can also be derived from the order statistics, such as the range (which we have just seen), or the inter-quartile range, which are alternatives to the standard deviation as a measure of spread. As with other statistics, such as the sample mean and variance, we need to derive the distributions of the order statistics, if we are to use them for statistical inference. One way of doing this is to use a bootstrap approach, while another way is to use a mathematical analysis. We first consider the bootstrap approach.
Let us return to our earlier example examining the times to payment for the customers of a mining company. The original data, shown earlier, has been sorted from smallest to largest in the table below (note that this is not necessary for any of the bootstrap calculations, but makes it easier to see that the sample median is in fact 38.5 days).

34 34 34 35 36 36
37 37 37 38 38 38
38 38 38 39 39 39
40 41 41 41 41 42
42 42 42 43 44 46
Suppose that there is some concern from management that a few very late customers may be unfairly skewing the average payment times. In such a case, it might be a better approach to examine the median payment time, which is resistant to such outliers. We can approach the construction of a bootstrap confidence interval around the median in much the same way that we did for the mean. That is, we can generate a large number (5000 or whatever) of bootstrap samples as before, and for each bootstrap sample compute the median. We can then sort the bootstrap sample medians from smallest to largest. For one particular run of 5000 bootstrap replications, the following selection of sorted bootstrap medians was obtained:

Rank    1     25    125   250   500   1000  2500
Median  36    37    38    38    38    38    38.5
Rank    4000  4500  4750  4875  4975  5000
Median  39    40    40.5  41    41    42

We have used these values to graphically show the distribution of bootstrap medians in Figure 1.8.

[Figure 1.8: Distribution of bootstrapped medians (frequencies of the values 36 to 42, in steps of 0.5).]

Once again, we need to remember that these bootstrapped medians in fact refer to sampling errors that might be made in the estimation of the median. Our bootstrap samples were drawn from a population with known median of 38.5. Therefore, for a 95% confidence interval:

• the 125th sorted median of 38 represents an underestimate of 0.5 days, i.e. a sampling error of 38 − 38.5 = −0.5.

• the 4875th sorted median of 41 represents an overestimate of 2.5 days, i.e. a sampling error of 41 − 38.5 = 2.5.

Under the key assumption that the same sampling errors apply to the originally-taken sample, we can use the above statements about sampling errors to construct a bootstrap confidence interval around the median.

• If the sampling error is as small as −0.5 (i.e. if the true median can be underestimated by 0.5 days), then the population median would be 38.5 + 0.5 = 39.

• If the sampling error is as large as +2.5 (i.e. if the true median can be overestimated by 2.5 days), then the population median would be 38.5 − 2.5 = 36.

The 95% (bootstrap) confidence interval around the median is therefore given by 36–39. Note that this is precisely the approach we took in building a bootstrap confidence interval for the mean. All we do here is to use the median in place of the mean. In fact, the 95% confidence interval around the median is slightly wider than the confidence interval around the mean (calculated earlier to be 37.9–40.07). This will generally be the case. It is also quite straightforward to use our bootstrap approach to perform hypothesis tests on the median, although we will not do that here – the details are left to the interested student.
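The calculation reuses the earlier bootstrap code, swapping median for mean (a sketch, assuming the matrix all_boots from Section 1.1 is still in the workspace):

# bootstrap medians from the matrix of bootstrap samples
bs_medians <- apply(all_boots, 1, median)
sorted_bs_medians <- sort(bs_medians)
# 125th and 4875th sorted medians, expressed as sampling errors
err <- c(sorted_bs_medians[125], sorted_bs_medians[4875]) - median(pdelays)
# invert the errors around the observed median of 38.5: gives 36 and 39
c(median(pdelays) - err[2], median(pdelays) - err[1])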
Can we work out a confidence interval for the median without using a bootstrap approach? In fact, it turns out that we can, although the mathematics is somewhat more complicated. The p-th quantile (or, equivalently, the 100p-th percentile) of the distribution of X defined by F(x) is simply the quantity ξ_p such that:

Pr[X ≤ ξ_p] = F(ξ_p) = p.

In other words, 100p% of the population of X falls below ξ_p. Special cases are ξ_0.5, which is the median, ξ_0.25, which is the lower quartile, and ξ_0.75, which is the upper quartile. The order statistics can be viewed as estimates of the 1/n-th, 2/n-th, etc. quantiles.

Our concern here is to obtain a confidence interval for ξ_p for any arbitrary p, based on the sample observations, but not using any assumed distributional properties (apart from assuming that the distribution is continuous). For this purpose, the order statistics are useful. Our aim will be to find two integers (1 ≤ r < s ≤ n) such that:

Pr[X_(r) ≤ ξ_p < X_(s)] ≥ 1 − α

for some given α. We can't be sure that we can ever find r and s such that the above probability is exactly 1 − α, which is why we use the ≥. We would, however, prefer to seek the shortest interval for which the above applies.

For any given p, let us define the random variable Y as the number of observations in the sample which do not exceed ξ_p. Since we don't know ξ_p, we can never actually observe Y, but we can still state its probability distribution. For example, we know that the probability of a single observation drawn from the sample being less than the median ξ_0.5 is by definition 0.5, even if we don't know the value of ξ_0.5. Therefore the number of observations in the sample which are less than ξ_0.5 is a random variable which follows the binomial distribution (since there are multiple independent trials) with parameters n and p = 0.5. Generally, since by definition F(ξ_p) = p, Y has the binomial distribution with parameters n and p (i.e. the p defining the required quantile). Now {X_(r) ≤ ξ_p} is equivalent to Y ≥ r, while {ξ_p < X_(s)} is equivalent to Y < s. We then have:

Pr[X_(r) ≤ ξ_p < X_(s)] = Pr[r ≤ Y < s] = Σ_{y=r}^{s−1} (n choose y) p^y (1 − p)^{n−y}

Generally, we have to use trial and error to find values of r and s, as close together as possible, but for which the above expression evaluates to at least 1 − α. Fortunately, this trial and error is facilitated either by the use of binomial tables or even more easily by using a spreadsheet package like Microsoft Excel, as is illustrated in the next example.

Example: Returning to our example, suppose we wish to find the 95% confidence interval for the median ξ_0.5 based on our sample, which is of size 30. We set up the following calculations in a spreadsheet environment:

• In column A, set up all possible values of x. Here, there can be anywhere between 0 and 30 "successes", so we write the values 0 to 30 in the rows of column A.

• In some cell (we have used cell E2), set up the desired value of p. This is the "probability of success" in the usual calculation of a binomial probability, and here p = 0.5 since we are interested in the median.

• In another cell (we have used cell E3), set up the sample size n. This is the "number of trials" in the usual calculation of a binomial probability, and here n = 30.

• In column B, calculate the probability of achieving x out of n successes when the probability of an individual success is p by entering the following formula: =BINOMDIST(x,n,p,false)

  NOTE: the n, p and x's should refer to appropriate cells on the spreadsheet, e.g. in cell B2, type =BINOMDIST(A2,$E$3,$E$2,false). Then drag the formula down to compute the other values.

Following the setup of your spreadsheet calculations, the screen should look something like the following.

[Screenshot of the spreadsheet: column A holds x = 0, ..., 30, column B the corresponding binomial probabilities, with p = 0.5 in cell E2 and n = 30 in cell E3.]

The entries in column B give the probability of achieving x out of 30 successes with p = 0.5. We can now quite easily find values of r and s for which the sum of all the probabilities between r and s − 1 evaluates to at least 1 − α. Here, we find that the entries between 11 (Pr(X = 11) = 0.0519) and 24 (Pr(X = 24) = 0.0006) sum to 0.9505, which is just above 0.95. Thus we have found that r = 11 and s = 25 and that

Pr[11 ≤ Y < 25] = 0.9505

and therefore that [X_(11); X_(25)] is approximately a 95% confidence interval for ξ_0.5. This means that we can form a 95% confidence interval for the median by taking the 11th-ranked observation (which turns out to be 38) and the 25th-ranked observation (which turns out to be 42), i.e. 38–42. This interval is quite different to the results of our bootstrap experiments (which gave a confidence interval of 36–39), perhaps because of the bimodality of the original sample (see the histogram in Figure 1.1).
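The same trial-and-error search can be done in R instead of Excel (a minimal sketch; dbinom plays the role of BINOMDIST):

n <- 30; p <- 0.5
# binomial probabilities for x = 0, ..., 30 successes
round(dbinom(0:30, n, p), 4)
# check the chosen r = 11 and s = 25: Pr[11 <= Y < 25]
sum(dbinom(11:24, n, p))   # evaluates to about 0.9505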

For large values of n, it becomes easier to use the normal approximation to the binomial distribution for Y. In other words, we approximate the distribution of Y by the normal distribution with mean np and variance np(1 − p). In moving in this way from a discrete to a continuous distribution, we need to apply the "continuity correction". What this in effect means is that we approximate the probability that Y = i (for any integer i) by the probability implied by the normal distribution for the interval i − 1/2 < Y < i + 1/2. Thus the event {r ≤ Y < s} is replaced by {r − 1/2 < Y < s − 1/2}. In other words, by standardizing the normal distribution in the usual way, we approximate Pr[r ≤ Y < s] by:

Pr[ (r − 1/2 − np)/√(np(1 − p)) < Z < (s − 1/2 − np)/√(np(1 − p)) ]

where Z has the standard normal distribution. From normal tables, we can look up the critical value z_{α/2} such that:

Pr[−z_{α/2} < Z < +z_{α/2}] = 1 − α

and thus by equating corresponding terms above, we can solve for r and s. These will normally turn out to be fractions, which means that we have to widen the interval by moving out to the next integer values.

Example: Suppose we wish to find a 95% confidence interval for the median (i.e. ξ_0.5) of a distribution, based on a sample of size 30 (note: a sample size of 30 – as in our payment time example – is probably large enough to start using large sample approximations), but using the normal approximation to the binomial. Thus np = 15, np(1 − p) = 7.5 and √(np(1 − p)) = 2.74. We thus need to find r and s such that:

Pr[ (r − 15.5)/2.74 < Z < (s − 15.5)/2.74 ] = 0.95

The 2.5% critical value of the standard normal distribution is 1.96. We therefore require:

(r − 15.5)/2.74 = −1.96

which gives r = 10.1, and:

(s − 15.5)/2.74 = 1.96

which gives s = 20.8. Moving out to the next integers gives r = 10 and s = 21, and thus the desired confidence interval for the median is [X_(10); X_(21)]. This is reasonably similar to the confidence interval arrived at using the full binomial calculations, which gave [X_(11); X_(25)]. Checking back with the original data, we state that the 95% confidence interval using the normal approximation is given by 38–41.
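The normal-approximation bounds are easily computed in R (a minimal sketch; qnorm(0.975) returns the 1.96 critical value):

n <- 30; p <- 0.5
z <- qnorm(0.975)                     # 1.96
r <- n*p + 0.5 - z*sqrt(n*p*(1-p))    # 10.1
s <- n*p + 0.5 + z*sqrt(n*p*(1-p))    # 20.8
c(floor(r), ceiling(s))               # move out to the integers 10 and 21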

An important point in both these examples is that we have succeeded in the construction of distribution-free (i.e. applying no matter what the underlying form of the probability distribution function F(x)) confidence intervals for population quantiles or percentiles. Using a bootstrap approach, we even get the full sampling distribution of these order statistics. However, we can also work out the sampling distributions of order statistics using a more analytical approach, which is useful for some inferential problems. We will only consider the cases of the smallest and largest order statistics, X_(1) and X_(n) respectively, although the results can be extended to the intermediate order statistics too.
For ease of notation, let us denote the distribution function of X_(r) by F_(r)(x) = Pr[X_(r) ≤ x], with p.d.f. f_(r)(x). The case r = n (the largest order statistic) is particularly simple, as the event {X_(n) ≤ x} is simply the event that all n observations do not exceed x. Because the initial observations are independent, we can therefore write the distribution function of X_(n) as:

F_(n)(x) = Pr[X_(n) ≤ x] = [F(x)]^n.

The p.d.f. can easily be obtained by differentiation:

f_(n)(x) = n[F(x)]^{n−1} f(x)

The case r = 1 (the smallest order statistic) is almost as easy. In this case, we note that the event {X_(1) > x} is the event that all n observations are greater than x. Thus, once again by independence:

1 − F_(1)(x) = Pr[X_(1) > x] = [1 − F(x)]^n

i.e.:

F_(1)(x) = 1 − [1 − F(x)]^n.

Once again the distribution function can be differentiated with respect to x to obtain the p.d.f.:

f_(1)(x) = n[1 − F(x)]^{n−1} f(x)

Example: Let X be the lifetime of a single light bulb, which is exponentially distributed with a mean of 2000 hours. Six light bulbs are put into operation together (in some sort of bank of lights). What is:

• The probability that the time until the last bulb fails is greater than 8000 hours, assuming that no bulbs are replaced in the interim?

• The p.d.f. of the time until the last bulb fails?

We are thus concerned with properties of the distribution of X_(6). The distribution function is thus:

F_(6)(x) = [1 − e^{−x/2000}]^6

Substituting x = 8000 gives F_(6)(8000) = 0.8950, and thus Pr[X_(6) > 8000] = 1 − 0.8950 = 0.1050. Thus, the probability that any of the lightbulbs will be working after 8000 hours (remembering that they each have an average lifetime of 2000 hours) is 10.5%.

The p.d.f. is:

f_(6)(x) = (6/2000) e^{−x/2000} [1 − e^{−x/2000}]^5
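A quick numerical check in R (a minimal sketch; pexp and rexp use the rate parameterization, here rate = 1/2000):

# analytical: Pr[max of 6 exponential lifetimes > 8000 hours]
1 - pexp(8000, rate=1/2000)^6    # about 0.105
# simulation check over 100000 banks of 6 bulbs
maxlife <- apply(matrix(rexp(6*1e5, rate=1/2000), ncol=6), 1, max)
mean(maxlife > 8000)             # should also be about 0.105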

Tutorial Exercises

1. For each of the two data sets given below:

(a) Perform 1000 replications of a "bootstrap" simulation of resampling from the population.
(b) Use the simulation to find 95% confidence limits on the sampling error in determining the mean.
(c) Convert the error limit into a confidence interval for the population mean, and compare this with the usual limits based on the t-statistics.

Data Set A: The following data refer to numbers of vehicles arriving at a service station during each 5-minute interval over a total period of one-and-a-half hours:

11 0 2 0 3 4
1 3 10 2 8 6
4 5 0 0 6 1

Data Set B: Data were collected by a building contractor, in order to improve his tendering strategy. For each of the 25 contracts, Table 1.1 below shows the contractor's own cost estimates and the lowest (winning) bid for the contract. Also shown is the ratio of winning bid to estimated cost. Perform the analysis on this set of ratios only.

2. A random sample of size 10 from an unknown distribution with a mean µ has yielded the following observations:

8.54 2.11 6.78 8.06 9.57
5.52 0.05 5.40 21.89 2.08

The sample mean and standard deviation were calculated as 7.00 and 6.08 respectively.

(a) A total of 1000 "bootstrap samples" were generated from this data. Explain what is meant by this assertion.
(b) The sample averages for each of the 1000 bootstrap samples were calculated, and then ordered from smallest to largest. The following are some of the ordered results obtained:

Sample No.   1     10    25    50    100   250   500
Sample Ave.  2.39  3.30  3.93  4.42  4.86  5.71  6.82
Sample No.   750   900   950   975   990   1000
Sample Ave.  8.13  9.27  9.96  10.68 11.50 12.78

i. Estimate the p-value corresponding to a test of the null hypothesis µ ≤ 5 versus µ > 5.
ii. Construct a 95% confidence interval for µ based on the bootstrap data.

Table 1.1: Data Set B

Estimated Cost   Lowest Bid   Ratio
74600 90500 1.213
170600 195100 1.144
94500 101100 1.070
57800 73400 1.270
65400 76300 1.167
127200 133300 1.048
56200 77400 1.377
65400 61500 0.940
50600 67500 1.334
112500 117600 1.045
135200 160200 1.185
77800 92000 1.183
65000 71400 1.098
61900 68500 1.107
148800 196900 1.323
69600 76700 1.102
135200 114700 0.848
77000 79900 1.038
65800 59000 0.897
148700 138200 0.929
127100 109100 0.858
122900 122500 0.997
99900 98700 0.988
70300 85200 1.212

(c) Compare the results from the previous part with those
obtained from standard normal theory.

3. A random sample of size 8 from an unknown distribution with a


mean µ has yielded the following observations:

0.326 1.463 0.421 0.060


0.038 0.203 0.125 0.182

The sample mean and standard deviation were calculated as


0.352 and 0.437 respectively. The sample averages for each of the
2000 bootstrap samples were calculated, and then ordered from
smallest to largest. The following are some of the ordered results
obtained:

Sample No. 1 20 50 100 200 500 1000


Sample Ave. 0.093 0.109 0.126 0.141 0.168 0.223 0.343
Sample No. 1500 1800 1900 1950 1980 2000
Sample Ave. 0.456 0.553 0.665 0.704 0.733 0.950

(a) What is meant by a “bootstrap” sample?


(b) Estimate the p-value corresponding to a test of the null hy-
pothesis µ ≤ 0.15 versus µ > 0.15.
(c) Construct a 95% confidence interval for µ based on the boot-
strap data
(d) Compare the results from part (b) and (c) (i.e. both hypothe-
sis test and confidence interval) with the results that would be
obtained from standard normal theory. Use a 5% significance
level for the hypothesis test.
(e) What would you expect to happen to the following quantities
as the number of bootstrap samples increases?
i. The minimum bootstrap sample average?
ii. The median bootstrap sample average?
iii. The maximum bootstrap sample average?

4. The probability density function for the random variable X is


expressed in the following form:
f(x) = c/x²    for x > 2
for some constant c.

(a) Determine the value of c.


(b) The first three of a sequence of uniformly distributed ran-
dom variables has been generated as follows: 0.883; 0.167;
0.545. Use these to simulate the generation of three corre-
sponding values for X.

5. The distribution function of X is given by:


F(x) = (1/2) x²            for 0 < x < 1
F(x) = 1 − (1/2)(2 − x)²   for 1 < x < 2

(a) Calculate and sketch the pdf for X


(b) Suppose that you need to generate a simulated random
sample of values from X. Calculate such simulated values cor-
responding to the following three random numbers generated
by the RAND() function in Excel: 0.5924; 0.7374; 0.1504.

6. The probability density function of a random variable X has


been stated as follows:
f(x) = x        for 0 < x < k
f(x) = 2 − x    for k ≤ x < 2

(a) Show that k = 1


(b) Explain how you would generate numbers from the above
probability distribution, and illustrate your answer by
calculating values of x corresponding to the following three
random numbers generated by the RAND() function in Excel:
0.543; 0.344; 0.054

7. Derive the distribution functions of the distributions with the


following probability density functions (pdf’s):

(a) f(x) = 6x(1 − x) for 0 ≤ x ≤ 1

(b) f(x) = 5/x⁶ for x > 1
8. For the exponential distribution with a mean of 10 (i.e. λ =
0.1), and for the second distribution defined in the previous
question: Simulate the occurrence of 1000 sample values from
the distribution, group the results of the simulation into 100 sets
of 10 values each, and calculate the corresponding sample means
in each set. Plot a histogram of these means. Comment on the
results.

9. The probability density function (pdf) of a random variable X is


given by: f(x) = c(1 − x²) for −1 < x < 1.

(a) Determine the value of c


(b) Determine the distribution function for X

10. A random sample of size 10 from an unknown distribution


with a mean µ has yielded the following observations:

8.54 2.11 6.78 8.06 9.57


5.52 0.05 5.40 21.89 2.08

The sample medians for each of the 1000 bootstrap samples were
calculated, and then ordered from smallest to largest. The fol-
lowing are some of the ordered results obtained:

Sample No. 1 10 25 50 100 250 500


Sample Median 1.08 2.09 2.11 3.74 3.75 5.46 6.15
Sample No. 750 900 950 975 990 1000
Sample Median 7.42 8.06 8.54 8.81 9.05 15.73

(a) Construct a 95% confidence interval for the median based on


the bootstrap data
(b) Use the analytical approach (and binomial tables) to con-
struct a 90% confidence interval for ξ0.6, the 60th percentile
(c) Compare the two intervals obtained in the previous two
questions with those obtained using the normal approxima-
tion to the binomial distribution (which is not really appropri-
ate, since our n here is not at all large)

11. Let X1 , X2 , . . . , X15 be a random sample from the density func-


tion f(x) = 9x e^(−3x) for x > 0. What is

(a) The probability that X(15) > 2?


(b) The p.d.f. of the random variable representing the smallest of
the 15 observations?

12. Let T1 , T2 , . . . , T8 be a random sample from the density function


f(x) = 0.75 e^(−x) + 0.05 e^(−0.2x) for x > 0. What is

(a) The probability that T(1) ≤ 3?


(b) The p.d.f. of the random variable representing the largest
observation?

13. For each of the two data sets in Question 1:

(a) Perform 5000 replications of a “bootstrap” simulation of


resampling from the population
(b) Use the simulation to find 95% confidence limits on the
sampling error in determining the median
(c) Convert the error limit into a confidence interval for the
population median, and compare this with the limits based
on order statistics, making use of (i) probabilities obtained
using the binomial distribution (ii) the normal approximation
to the binomial
(d) Repeat parts (b) and (c), this time for the 35th percentile.
2 Two-sample and three-sample problems

2.1 Distributions of Sample Statistics from Normal Theory

We saw in Chapter 1 how important it is to develop an understand-


ing of sampling variations when, for example, using the sample
mean to represent the population mean. We saw further that some
such understanding can be obtained by simulations such as the
“bootstrap”. The Central Limit Theorem enables us to examine the
same problems more simply and elegantly, by making use of the
properties of the normal distribution.
Suppose again that X1 , X2 , . . . , Xn is a random sample from a
distribution (or population) with mean µ and variance σ2 , and let
X̄ be the sample mean. We know that if we repeated the sampling
many times, we would obtain different values for X̄. We want to
know how variable these sample means are, in order to be able
to draw conclusions from the currently observed mean. The CLT
tells us that for “large enough” n, the distribution of X̄ under such
repeated sampling is “approximately” normal with mean µ and
variance σ2 /n. In fact, for practical purposes, the approximation
can be pretty good for quite moderate n (perhaps as low as 20),
except for very heavy-tailed distributions.
Define zα to be the upper 100α percentile of the standard normal
distribution, i.e. such that Pr[ Z > zα ] = α when Z has the standard
normal distribution.
Provided that the population variance is known, we can use the
CLT and normal tables to address the following questions regard-
ing the population mean:

Confidence Intervals: Since

Pr[ −zα/2 ≤ (X̄ − µ)/(σ/√n) ≤ zα/2 ] = 1 − α

we are 100(1 − α)% “confident” that

−zα/2 ≤ (X̄ − µ)/(σ/√n) ≤ zα/2,

or (re-arranging terms) that

X̄ − zα/2 σ/√n ≤ µ ≤ X̄ + zα/2 σ/√n.
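As a minimal sketch of this interval in R, with made-up numbers (n = 25, σ = 12, and the seed are all hypothetical, not data from the notes):

set.seed(2)                            # hypothetical sample
x <- rnorm(25, mean = 50, sd = 12)
z <- qnorm(1 - 0.05/2)                 # z_{alpha/2} for a 95% interval
mean(x) + c(-1, 1) * z * 12/sqrt(25)   # the interval above, sigma known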

Significance Levels for Hypothesis Tests: Suppose we wish to test


a conjecture (alternative hypothesis) H1 : µ > µ0 (for some
specified number µ0 ) against the “null hypothesis” H0 : µ ≤ µ0 .
We would reject H0 if X̄ is too large, say if X̄ > c. Now if H0 is
true, then for any 0 < α < 1 it follows that:

Pr[ (X̄ − µ0)/(σ/√n) > zα ] ≤ α

for the stated µ0. Recall that we may either set the desired sig-
nificance level a priori (e.g. something like the conventional 0.05,
giving z0.05 = 1.645) and reject H0 if X̄ > µ0 + zα σ/√n; or we
could look up the value of α such that zα equates to the observed
value of (X̄ − µ0)/(σ/√n) (giving the “p-value”).

We note again that the above probabilities refer to sampling


variability in X̄ for given µ and σ. They are not probabilities on µ or
on the truth of H0 .
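Again purely as a sketch with hypothetical numbers (the sample, the value µ0 = 100 and σ = 15 are all invented for illustration), the one-sided test above can be carried out in R as follows when σ is known:

set.seed(3)                               # hypothetical data: n = 30, sigma = 15
x <- rnorm(30, mean = 105, sd = 15)
zobs <- (mean(x) - 100) / (15/sqrt(30))   # test statistic for H0: mu <= 100
1 - pnorm(zobs)                           # p-value for the alternative mu > 100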
But what do we do if the population variance is unknown (which
would seem to be the rule rather than the exception)? To get some
sense of the problem, we carry out the following exercise:

• Generate a large number of random numbers (say 45000, as


we shall want the number to be divisible by 9), drawn from a
normal distribution with mean 0 and variance σ2 = 9

x <- rnorm(45000,0,3)
hist(x)

Figure 2.1: Distribution of simulated values (histogram of x).
• Cluster the results into groups of 9 each, and obtain the sample
mean and sample variance within each group

xmat <- matrix(x,nrow=5000,ncol=9)


mymean <- apply(xmat,1,mean)
myvar <- apply(xmat,1,var)

• The group sample means should have the standard normal dis-
tribution (why?). We check this by plotting a histogram of the
sample means. This shows how sample means do vary from
sample to sample, as we can view each group as a sample of size
9.

hist(mymean,main="Histogram of sample means")

Figure 2.2: Distribution of means from constructed samples of size 9 (histogram of mymean). We know that these means should be distributed around the value of zero that we used to generate the random numbers.

• If we had not known the population variance (σ2 ), we would


have had to estimate it by the sample variance S2 . We would
expect, of course, that S2 would vary around the true value of 9.
Below we plot a histogram of the sample standard deviations S
(which should vary around 3)

hist(sqrt(myvar),main="Histogram of sample std devs")

Figure 2.3: Distribution of standard deviations from constructed samples of size 9 (histogram of sqrt(myvar)). We see that the sample standard deviation estimates range from considerably below to considerably above the true value of 3.

• If we used data from any one group to draw inferences about


the true mean, then our analyses would need to be based on the

standardized form: T = X̄/(S/3). We calculate these values for


each group, and plot the corresponding histogram.

myt <- mymean/(sqrt(myvar)/3)


hist(myt,main="Histogram of sample t-stats")
# overlay a scaled N(0,3) dbn for comparison
xn <- seq(-8,8,length=1000)
yn <- 5000*dnorm(xn,0,3)
lines(xn,yn,col="red")

Figure 2.4: Distribution of t-statistics from constructed samples of size 9 (histogram of myt), with a scaled N(0,3) distribution shown for comparison. Note that to get the N(0,3) onto the same scale as the histogram, we have to multiply the p.d.f. at each point by 5000 (the number of values used to create the histogram).

• Note how the histogram of T differs from a normal distribution;


it has much higher kurtosis. The reason for this is that large
values of T can derive from two sources, viz. large values of X̄ or
small values of S

At first sight, the distribution of the T-values in Figure 2.4 may


appear fairly normal. Closer examination, however, reveals the
following features:

• Only 27 out of 5000 sample means (0.54%) exceeded an absolute
value of 2.5 (compared to the theoretical probability of 1.24%
for the standard normal distribution), while 76 of the 5000 T-
values (1.52%) exceeded an absolute value of 2.5 (compared to
a theoretical probability of 3.70% for the t-distribution with 8
degrees of freedom).

• Only 4 out of 5000 sample means (0.08%) exceeded an absolute
value of 3 (compared to the theoretical probability of 0.27% for
the standard normal distribution), while 86 of the 5000 T-
values (1.72%) exceeded an absolute value of 3 (compared to
a theoretical probability of 1.71% for the t-distribution with 8
degrees of freedom).

• 0 out of 5000 sample means exceeded 3.5 in absolute value (the


largest value being 3.25). On the other hand, 41 of the 5000 T-
values exceeded 3.5 in absolute value, with a maximum absolute


value of 5.23.

• The sample estimate of the kurtosis for the T’s was 4.0.

The above points indicate that the distribution of the T’s has a dis-
tinct tendency towards having heavier tails than the normal distri-
bution. This can seriously bias estimates of p-values or confidence
intervals, if the sample estimate is used in place of the true popula-
tion standard deviation, but critical values from the normal distribution are still
used.
We can deal with the additional variation due to using sample
estimates of the variance in quite a simple manner. We simply
express the “t-statistic” in the following way:

T = (X̄ − µ)/(S/√n) = [(X̄ − µ)/(σ/√n)] · (σ/S) = Z · (σ/S).

We know that Z = (X̄ − µ)/(σ/√n) has the standard normal
distribution. The usual sample variance estimator is given by

S² = ∑_{i=1}^n (Xi − X̄)² / (n − 1)

so that U defined by:

U = (n − 1)S²/σ² = ∑_{i=1}^n [(Xi − X̄)/σ]²

has the χ² distribution with n − 1 degrees of freedom. We can thus
express T in the form:

T = Z / √(U/(n − 1))

where we know the distributions of Z (standard normal) and U (χ2


with n − 1 degrees of freedom).
The precise derivation of the distribution of T from the distribu-
tions of Z and U requires a bit of intricate mathematics, which will
be omitted here. We state, however, the following theorem which is
expressed in a slightly more general form.

Theorem 1. Let Z and U be independent random variables, having the


standard normal distribution and the χ2 distribution with d degrees of
freedom respectively. Define:

T = Z / √(U/d).

Then the probability density function for T is given by:

f(t) = [Γ((d + 1)/2) / (√(πd) Γ(d/2))] · 1/[1 + t²/d]^((d+1)/2)

The distribution defined by the p.d.f. defined in Theorem 1 is


called the t-distribution with d degrees of freedom, tables for which

are widely available. Values for the cumulative distribution can


be obtained from Excel’s TDIST() spreadsheet function. Excel also
provides the TINV() function for the inverse function.
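For those working in R rather than Excel, the corresponding functions are pt() and qt() (a side note, with illustrative values):

pt(2.5, df = 8, lower.tail = FALSE)       # one-tailed area, cf. TDIST(2.5,8,1)
2 * pt(2.5, df = 8, lower.tail = FALSE)   # two-tailed area (the 3.70% quoted earlier)
qt(1 - 0.05/2, df = 8)                    # two-sided 5% point, cf. TINV(0.05,8)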
The value of Theorem 1 is that it allows the statistician to dis-
cover many sampling situations in which the distribution of rele-
vant statistics follows a t-distribution. One can then derive p-values
or confidence intervals just by using tables of the t-distribution. The
trick is to recognize which is the normally distributed and which
the χ2 distributed random variables, as illustrated in the following
examples.

Example: Suppose that X, Y and Z are independent random vari-


ables, all normally distributed with zero means and with stan-
dard deviations 3, 4 and 5 respectively. Thus X/3, Y/4 and Z/5
have the standard normal distribution. Furthermore:

Y²/16 + Z²/25

has the χ² distribution with 2 degrees of freedom. The theorem
thus tells us that:

(X/3) / √[(Y²/16 + Z²/25)/2] = √2 X / (3 √(Y²/16 + Z²/25))

has the t-distribution with 2 degrees of freedom. From tables,
therefore, we could conclude that:

Pr[ √2 X / (3 √(Y²/16 + Z²/25)) ≥ 2.920 ] = 0.05

i.e.:

Pr[ X ≥ 6.194 √(Y²/16 + Z²/25) ] = 0.05.
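A quick simulation sketch (the seed and sample size are arbitrary) confirms this conclusion:

set.seed(4)
X <- rnorm(1e5, 0, 3); Y <- rnorm(1e5, 0, 4); Z <- rnorm(1e5, 0, 5)
tstat <- sqrt(2) * X / (3 * sqrt(Y^2/16 + Z^2/25))
mean(tstat >= 2.920)   # should be close to 0.05
qt(0.95, df = 2)       # the critical value 2.920 used above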

Example: Suppose that X1 , X2 , . . . , X8 is a random sample from


N (0, 4σ2 ), while Y1 , . . . , Y5 is a random sample from N (0, σ2 ).
The two samples are independent of each other. What is the
probability that X̄ exceeds 0.5 √(∑_{i=1}^5 Yi²)?
We know that X̄ has the normal distribution with mean 0 and a
variance of 4σ²/8 = 0.5σ². Thus √2 X̄/σ has the standard normal
distribution.
Furthermore, ∑_{i=1}^5 Yi²/σ² has the χ² distribution with 5 (not 4!)
degrees of freedom, and is independent of X̄ (as it is indepen-
dent of all the Xi’s). We therefore conclude that:

W = (√2 X̄/σ) / √(∑_{i=1}^5 Yi²/5σ²) = √10 X̄ / √(∑_{i=1}^5 Yi²)

has the t-distribution with 5 degrees of freedom.



The probability that X̄ exceeds 0.5 √(∑_{i=1}^5 Yi²) is given by:

Pr[ X̄ / √(∑_{i=1}^5 Yi²) > 0.5 ] = Pr[ W > 0.5√10 = 1.581 ].
This last probability we can look up in t-tables, and it turns out
to be 8.74%.
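In R the same probability can be obtained directly, without tables (a one-line sketch):

pt(0.5 * sqrt(10), df = 5, lower.tail = FALSE)   # about 0.0874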

It is useful to record here another probability distribution which


arises naturally from functions of random variables having the χ2
distribution, especially (for example) in problems involving the
comparison of variances.
Theorem 2. Let U and V be independent random variables, having χ2
distributions with p and q degrees of freedom respectively. Define:
Y = (U/p) / (V/q).

Then the probability density function for Y is given by:

f(y) = [Γ((p + q)/2) / (Γ(p/2) Γ(q/2))] · (p/q)^(p/2) y^(p/2 − 1) / [1 + py/q]^((p+q)/2)
The distribution defined by the p.d.f. defined in Theorem 2 is
called the F-distribution with p and q degrees of freedom (some-
times referred to as the numerator and denominator degrees of
freedom respectively). You have already met the F-distribution in
tests for equality of variances in a number of contexts. Values for
the cumulative distribution can be obtained from Excel’s FDIST()
spreadsheet function. Excel also provides the FINV() function for
the inverse function.
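The corresponding R functions are pf() and qf() (again a side note; the degrees of freedom below are illustrative only):

pf(2.5, df1 = 5, df2 = 10, lower.tail = FALSE)   # upper-tail area, cf. FDIST(2.5,5,10)
qf(0.95, df1 = 5, df2 = 10)                      # 5% critical value, cf. FINV(0.05,5,10)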
In the remainder of this chapter, we return to a number of clas-
sical sampling situations, in order to examine how both bootstrap-
ping and the theoretical results above contribute to our understand-
ing of sampling variation.

2.2 Two-Sample Problems

Let us return to the cash flow problem illustrated in Chapter 1, but


now suppose that we wish to compare delays in invoice payments
in two different markets, say the copper and zinc markets. Once
again, suppose that 30 recent deliveries in each market have been
analyzed to determine numbers of days between invoicing and
receipt. (At this stage, equal sample sizes for the two populations is
not essential, but we shall keep to these for illustration). The data
were recorded as follows:
Copper 39 35 44 33 19 6 27 24 40 13
35 34 34 33 61 56 43 19 34 40
34 28 29 45 46 28 41 46 25 17
Zinc 39 32 22 32 55 34 46 37 31 45
41 40 46 66 36 42 37 43 35 62
48 47 34 42 43 33 47 41 34 41

We note that the average delay time¹ in the Zinc market sample
is 7.43 days larger than that in the Copper market. The questions
which arise are: (1) how large is the true difference in population
means? (2) Is the evidence for a larger delay in the Zinc market
convincing in the light of sampling variation?

¹ The mean of the numbers for copper is 33.60; the standard
deviation is 12.09. The mean of the numbers for zinc is 41.03; the
standard deviation is 9.08.
The first question can be structured in terms of a confidence
interval for the true difference in means. The second can be formu-
lated in hypothesis testing terms, viz. to test the “null hypothesis”
H0 that the means are equal.
The hypothesis test is easily addressed in a bootstrapping man-
ner. Suppose that we re-formulate the null hypothesis simply as:
“The distributions of times in the two populations are identical”.
If this H0 is true, then both sets of 30 observations come from the
same population; in other words all 60 observations arise from the
same population. We may then simulate the two-sample results as
follows:
• Place all 60 observations in a “hat” and “shuffle”

• Draw a sample of size 60 with replacement, and split arbitrarily


into two sets of 30 each

• Calculate the sample mean in each set, and the difference be-
tween the two sample means

• Repeat as often as needed, to obtain a distribution of the differ-


ences in means
The above process is easily implemented by using the macro in the
BootStrap.xls spreadsheet package. Simply enter and highlight all
60 values in row 6, and press Ctrl-b. The two sample means can
be calculated from the first and last 30 columns of the results. The
process is also easily programmed in R, using a small extension to
the code we used in Chapter 1.

# load the payment delay data


cop_pds <- c(39,35,44,33,19,6,27,24,40,13,35,34,34,
33,61,56,43,19,34,40,34,28,29,45,46,28,41,46,25,17)
zinc_pds <- c(39,32,22,32,55,34,46,37,31,45,41,40,
46,66,36,42,37,43,35,62,48,47,34,42,43,33,47,41,34,41)
# put all data together (in a hat)
pdelays2 <- c(cop_pds,zinc_pds)
# shuffle - this step is not needed (why?)
pdelays2 <- sample(pdelays2, size=60, replace=FALSE)
# set up a variable to store the bootstrap samples
all_boots <- matrix(NA,nrow=5000,ncol=60)
for(i in 1:5000){
# draw a single bootstrap sample from pdelays2
boot <- sample(pdelays2,size=60,replace=TRUE)
# store that bootstrap
all_boots[i,] <- boot
}

We can now extract the bootstrap means for “group 1”² using
the apply function, but now only applied to the first 30 columns of
all_boots.

² Although note that referring to group 1 and group 2 in the context
of the bootstrap is rather meaningless, since we have already
randomly shuffled all 60 observations.
bs_means1 <- apply(all_boots[,1:30],1,mean)

Similarly we extract the bootstrap means for “group 2” by using


the apply function on the last 30 columns of all_boots.

bs_means2 <- apply(all_boots[,31:60],1,mean)

Finally, we can compute the difference between the two group


means in each of our 5000 bootstraps samples, and plot the his-
togram of these differences:

# difference in means (note the order is arbitrary)


bs_diffs <- bs_means1 - bs_means2
hist(bs_diffs)

Figure 2.5: Distribution of bootstrapped differences in sample means (histogram of bs_diffs).

In the 5000 repetitions reported here, the difference in means
exceeded +7.43 on 34 occasions³, and was less than −7.43 on 21
occasions⁴. In other words the absolute magnitude of the difference
exceeded the originally observed difference on 55 out of 5000 occa-
sions, around 1%. This would give a p-value appropriate to a two-
sided test of approximately 0.01, so we would conclude that the
difference in means is significant.

³ In R, sum(bs_diffs > 7.43)
⁴ In R, sum(bs_diffs < -7.43)
The same bootstrap results can, with a little thought, also be
used to construct a confidence interval for the true difference in
population means. We sort the 5000 differences and find the 125th
and 4875th smallest difference (corresponding to the 2.5 and 97.5
percentile respectively of the distribution shown in Figure 2.5)

sorted_bs_diffs <- sort(bs_diffs)


# 2.5 percentile
sorted_bs_diffs[125]

[1] -5.6

# 97.5 percentile
sorted_bs_diffs[4875]

[1] 5.7

Since by construction, the bootstrap sampling was from a distri-


bution having a true difference in means of zero, we can conclude
as follows:

• With 95% “confidence”, the errors (defined as the deviation


between the observed and true differences in means) will lie
between −5.6 and +5.7;

• Since the actually observed difference in means was -7.43, an


error of −5.6 must mean that the true population difference is
−1.83 (= −7.43 + 5.6) – Note the direction of the signs!;

• Similarly, an error of 5.7 means that the true population differ-


ence must be −13.13 (= −7.43 − 5.7).

A 95% confidence interval could thus be stated as [−13.13 ; −1.83],


based on the bootstrap results.
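The same interval can be obtained a little more directly with R's quantile() function (a sketch reusing the bs_diffs, cop_pds and zinc_pds objects created earlier; quantile() interpolates, so the limits may differ very slightly from the sorted values above):

err <- quantile(bs_diffs, c(0.025, 0.975))   # sampling-error limits, about -5.6 and 5.7
obs_diff <- mean(cop_pds) - mean(zinc_pds)   # the observed difference, -7.43
c(obs_diff - err[2], obs_diff - err[1])      # 95% CI, about (-13.13, -1.83)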
Now, how may we use the central limit theorem and the theo-
retical results for sampling from a normal distribution to obtain
corresponding solutions to the same problems, perhaps more eas-
ily?
Let X1, X2, . . . , Xm and Y1, Y2, . . . , Yn be random samples from
populations with means and variances µ_X, σ_X², µ_Y and σ_Y² respec-
tively. By the CLT, X̄ and Ȳ are approximately normally distributed
with means µ_X and µ_Y, and variances σ_X²/m and σ_Y²/n respectively.
Our usual assumption is that the samples are independent, so that
X̄ − Ȳ is also approximately normal with mean µ_X − µ_Y, and vari-
ance σ_X²/m + σ_Y²/n (Note the sum!).
If zα is the critical point of the normal distribution such that
Pr[ Z > zα ] = α, then:

• Under a null hypothesis that µX = µY , the appropriate two-sided


test would be based on:
 
Pr[ |X̄ − Ȳ| / √(σ_X²/m + σ_Y²/n) > zα ] = 2α

(so that for a significance level of 5% we would need α = 0.025);

• Under a null hypothesis that µX ≤ µY (implying a priori infor-


mation or judgement that only situations in which µX > µY are
44

important or interesting), the appropriate one-sided test would


be based on:

Pr[ (X̄ − Ȳ) / √(σ_X²/m + σ_Y²/n) > zα ] = α

(since any observations in which X̄ < Ȳ, no matter how large the
difference, are consistent with the null hypothesis).

This immediately solves the problems of both hypothesis tests


and confidence intervals, provided that the two variances σX2 and σY2
are known. But again we have the problem of unknown variances.
For the two sample problem, it is useful to distinguish at least three
separate cases.

Equal variances: Suppose that σ_X² = σ_Y² = σ², say. The standardized
difference can thus be written as:

Z = (X̄ − Ȳ) / (σ √(1/m + 1/n))

which would be approximately normally distributed with mean 0 and
variance 1.
To the same levels of approximation:

∑im=1 ( Xi − X̄ )2 ∑in=1 (Yi − Ȳ )2


and
σ2 σ2
have χ2 distributions with degrees of freedom m − 1 and n − 1
respectively, so that their sum has the χ2 distribution with m + n − 2
degrees of freedom. This sum can be expressed as:

∑im=1 ( Xi − X̄ )2 + ∑in=1 (Yi − Ȳ )2 (m + n − 2)S2pool


=
σ2 σ2
where S2pool is the usual “pooled” variance estimate.
It therefore follows that:
X̄ − Ȳ σ
q = Z·
S pool m1 + 1 S pool
n

must have (approximately) the t-distribution with m + n − 2 degrees


of freedom, so that we can use t-tables in place of normal tables for
carrying out hypothesis tests or constructing confidence intervals.
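In R this is the pooled two-sample t-test; a minimal sketch using the payment-delay vectors defined earlier in the chapter:

t.test(cop_pds, zinc_pds, var.equal = TRUE)   # t on m + n - 2 = 58 degrees of freedom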
Recall that under the same assumption of equal variances, the
ratio:

{∑_{i=1}^m (Xi − X̄)² / [(m − 1)σ²]} / {∑_{i=1}^n (Yi − Ȳ)² / [(n − 1)σ²]}
    = [(n − 1)/(m − 1)] · [∑_{i=1}^m (Xi − X̄)² / ∑_{i=1}^n (Yi − Ȳ)²] = S_X²/S_Y²

has the F-distribution with m − 1 and n − 1 degrees of freedom.
Similarly:

[(m − 1)/(n − 1)] · [∑_{i=1}^n (Yi − Ȳ)² / ∑_{i=1}^m (Xi − X̄)²] = S_Y²/S_X²
has the F-distribution with n − 1 and m − 1 degrees of freedom.


These two facts can be used to test the hypothesis of equal vari-
ances using the appropriate F-test. In our current example, s_Y² =
12.09² and s_X² = 9.08², so that the ratio of the two is F = 1.77. This
should be compared to a critical value taken from an F-distribution
with m − 1 = 29 and n − 1 = 29 degrees of freedom. At a 5%
level of significance, this is approximately 1.86, and so we would
not have sufficient evidence to reject the null hypothesis of equal
variance at p = 0.05. An exact p-value can be found by typing
=FDIST(1.77,29,29) in Excel (it turns out to be 0.065).
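The same test is available in R as var.test() (a sketch; note that var.test() reports a two-sided p-value, roughly twice the one-tailed 0.065 quoted above):

var.test(cop_pds, zinc_pds)   # F = (12.09/9.08)^2 = 1.77 on 29 and 29 df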

A little further thought will make it clear that we could also


use a bootstrap approach to test the hypothesis of equal variances.
Suppose we again re-formulate the null hypothesis simply as: “The
distributions of times in the two populations are identical”. We may
then perform the same bootstrap sampling as for the two-sample
test of means, except that we compute the ratio of variances after
each replication:

• Place all 60 observations in a “hat” and “shuffle”

• Draw a sample of size 60 with replacement, and split arbitrarily


into two sets of 30 each

• Calculate the sample variance in each set, and the ratio of the two
variances

• Repeat as often as needed, to obtain a distribution of the ratio of


variances.

We can again use either the R code of a few pages above, or the
BootStrap.xls spreadsheet package (this is left to you as an exer-
cise!). A histogram of ratios of sample variances obtained using this
procedure, based on 5000 repetitions, is shown in Figure 2.6. In fact,
you can see for yourself that this (empirical) distribution closely
resembles the F-distributions of your earlier courses.

Figure 2.6: Bootstrap distribution of ratios of two sample variances (histogram of var1/var2).

In this particular experiment, the originally observed F-ratio


of 1.77 was exceeded 500 times out of 5000 repetitions. Under
H0 , such an occurrence would therefore appear to be fairly likely
(p=0.1) so that we would tend to accept (or at least say we cannot re-
ject) the null hypothesis of equal variances. This is consistent with
the results obtained using standard normal theory, although the
p-value is somewhat larger (0.1 vs. 0.065)

Paired observations: Suppose that we are able to pair up the Xi and


Yj observations in some way. This does of course mean that m = n.
In some cases the pairing will be natural, or even forced upon us,
for example when Xi and Yi are results of two different treatments
applied to the same subject. In other cases, where the Xi and Yj
observations are entirely independent of each other, but sample
sizes are the same, they may be paired in random fashion.
Now define Wi = Xi − Yi. The Wi are independent, with zero
mean under the null hypothesis, but with unknown variance (=
σ_X² + σ_Y², but we never really need to know the individual variances).
We can then simply apply the single sample procedure to test the
hypothesis that µ_W = 0. This approach brings with it an added
bonus! As long as the X1, X2, . . . , Xn are independent of each other,
and the Y1, Y2, . . . , Yn are independent of each other, it does not
matter if Xi and Yi are associated for the same i! The unknown variance
of W will be given by σ_W² = σ_X² + σ_Y² − 2σ_XY, but once again we do
not need separately to identify the components. In order to apply
the one sample analysis to the Wi, we simply need to estimate the
unknown variance σ_W² by the sample variance of the Wi.
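A minimal sketch of this idea in R, using hypothetical paired measurements invented purely for illustration:

x <- c(12.1, 10.4, 13.7, 11.9, 12.8)   # hypothetical X measurements
y <- c(11.6, 10.9, 12.5, 11.1, 12.0)   # hypothetical paired Y measurements
t.test(x - y, mu = 0)                  # one-sample test on W = X - Y;
                                       # equivalent to t.test(x, y, paired = TRUE)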

The Behrens-Fisher Problem: We now turn to the general case in


which σX ̸= σY and m ̸= n. We could of course pair up what we
can and throw away any unmatched observations. But discarding
sample information does not seem to be good statistics.
The obvious alternative would be to calculate separate sample
variances S_X² and S_Y², and to base inferences on an expression of the
form:

(X̄ − Ȳ) / √(S_X²/m + S_Y²/n).

But what is the probability distribution of this ratio? Struggle as we


may, it turns out to be impossible to massage the above expression
into anything to which Theorem 1 may be applied, so that no exact
t-distribution result can be derived (a problem first identified by
Behrens and Fisher). Nevertheless, from many simulation studies it
has emerged that the required distribution can be approximated by
a t-distribution with fractional “degrees of freedom” given by

(s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 + 1) + (s2²/n2)²/(n2 + 1) ] − 2

Note that the definition of the p.d.f. of the t-distribution in Theo-
rem 1 does not mathematically require d to be an integer.
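R's t.test() uses this fractional-df approach by default, though with the closely related Welch-Satterthwaite form of the degrees of freedom (n − 1 in place of n + 1, and no −2 term), so its answer will differ slightly from the formula above. A sketch on the payment-delay data:

t.test(cop_pds, zinc_pds)   # var.equal = FALSE (the Welch test) is the default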

2.3 One-Way Analysis of Variance

Analysis of Variance (ANOVA) problems can be seen as a general-


ization of the two sample problem to many samples. Consider for
example the following data on traffic counts (vehicles per hour) at
each of five intersections (which may, for example, be suggested
sites for a new petrol station). Figure 2.7 gives a box-and-whisker
plot for the same data, comparing the five samples.

Place Hourly Traffic Counts


I 344 382 353 395 207 312 407 421 366 222
II 365 391 538 471 431 450 299 371 442 343
III 261 429 402 391 239 295 129 301 317 386
IV 422 408 470 523 398 387 433 440
V 367 445 480 323 366 325 316 381 407 339

Figure 2.7: Box and whisker plots of the 5 samples of traffic counts (TRAFFIC by LOCATION, I–V).

The question at this stage is whether the population means


(the true long-run means at each intersection) do differ signifi-
cantly (or could the differences displayed in Figure 2.7 simply be
due to chance?). We could apply the two-sample approach to all
pairs of intersections, but this quickly becomes messy, and it some-
times happens that no single pairwise difference is significant even
though other tests (see below) indicate that some difference occurs
somewhere. It is useful, therefore, to base our analysis on some more
aggregated summary statistics.
Define Yij as the j-th observation from set (or “treatment” or
“population”) i. Suppose that there are k different populations (so
that k = 5 in the above example), and that there are ni observations

from population i. (In the above example, n1 = n2 = n3 = n5 = 10,


while n4 = 8.) Further define N = ∑_{i=1}^k n_i, i.e. the total number
of observations. Let µi denote the population mean for population
i, and suppose that the sampling variances are the same in each
population (= σ2 ).
We now define the following summary statistics:
• Sample mean for population i: Yi· = ∑_{j=1}^{n_i} Yij / n_i

• Overall sample mean: Y·· = ∑_{i=1}^k ∑_{j=1}^{n_i} Yij / N, which can also be
written in the form:

∑_{i=1}^k (n_i/N) Yi·

i.e. as a weighted average of the sample means for each popula-
tion.

• Error sum of squares: SSE = ∑_{i=1}^k ∑_{j=1}^{n_i} (Yij − Yi·)²

• “Treatment” sum of squares: SST = ∑_{i=1}^k n_i (Yi· − Y··)².


We note that SSE/(N − k) is an unbiased estimator of the sam-
pling variance σ² (which is a general result, not dependent upon
any assumptions of normality). Under the null hypothesis (H0) that
all means are equal, say µ1 = · · · = µk = µ, Yi· has mean µ and
variance σ²/n_i, so that √n_i (Yi· − µ) has mean 0 and variance σ². It
thus follows that:

E[ ∑_{i=1}^k n_i (Yi· − µ)² / k ] = σ².

It can once again be shown that replacement of the population
mean µ by the sample mean Y·· “loses” one degree of freedom, so
that:

E[ ∑_{i=1}^k n_i (Yi· − Y··)² / (k − 1) ] = E[ SST/(k − 1) ] = σ²

so that if the null hypothesis is true, then SST/(k − 1) is also an
unbiased estimator of σ².
Overall, this implies that if H0 is true, then the ratio:
F = [SST/(k − 1)] / [SSE/(N − k)]
should not significantly deviate from 1; but if there are any devi-
ations from H0 (i.e. if one or more means differ from others), then
the above ratio will tend to be larger than 1. In the case of the traffic
count data above, the F-ratio turns out to be 4.56.
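A sketch of this computation in R for the traffic data (transcribing the counts from the table above):

counts <- list(I   = c(344,382,353,395,207,312,407,421,366,222),
               II  = c(365,391,538,471,431,450,299,371,442,343),
               III = c(261,429,402,391,239,295,129,301,317,386),
               IV  = c(422,408,470,523,398,387,433,440),
               V   = c(367,445,480,323,366,325,316,381,407,339))
N <- sum(lengths(counts)); k <- length(counts)
grand <- mean(unlist(counts))                                 # overall sample mean
SST <- sum(sapply(counts, function(y) length(y) * (mean(y) - grand)^2))
SSE <- sum(sapply(counts, function(y) sum((y - mean(y))^2)))
(SST / (k - 1)) / (SSE / (N - k))                             # F-ratio, about 4.56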
For any given set of sample data, we can easily calculate the
observed value for F. If this is less than 1, then there is clearly no
evidence to support any difference between the means of the popu-
lations. If F > 1, we need (as usual) to ask whether the deviations
can be due to chance.
Once again, we can get an answer to this last question by means
of a bootstrap simulation. In this case, the procedure would be as
follows:

• Place all N observations in a “box” and “shuffle”

• Draw (with replacement) a sample of size N, and divide these


arbitrarily into k samples of sizes n1 , n2 , . . . , nk respectively.

• Calculate the SSE and SST, and the F-ratio as defined above.

• Repeat as many times as desired, to obtain a distribution of the


F-ratio under H0

• Compare the originally observed ratio with the empirical distri-


bution

This process can again be implemented using R and/or the boot-
strap macro in the BootStrap.xls spreadsheet package⁵. Results
from 5000 repetitions for the traffic count data are displayed in
Figure 2.8.

⁵ This is left to you as one final exercise for the chapter!

Figure 2.8: Bootstrap distribution of F-ratios for traffic count data, with the originally observed ratio marked.

In this experiment, the originally observed F-ratio of 4.56 was


exceeded only 15 times out of 5000 repetitions. Under H0 , such an
occurrence would appear to be highly unlikely (p=0.003) so that we
would tend to reject H0 .
Once again, we ask whether a similar conclusion can be reached
without such heavy computation. We now need to make the stronger
assumption that the individual Yij values are normally distributed.
If this is (at least approximately) true, then the Yi· must also be
normally distributed. The following properties then apply:

• For each population i,

∑_{j=1}^{n_i} (Yij − Yi·)² / σ²

has the χ² distribution with n_i − 1 degrees of freedom.



• Since observations from the different populations are assumed to


be independent, it follows that

U = SSE/σ² = ∑_{i=1}^k ∑_{j=1}^{n_i} (Yij − Yi·)² / σ²

has the χ² distribution with ∑_{i=1}^k (n_i − 1) = N − k degrees of
freedom.

• If H0 is true, then the Yi· have the same mean and variances
σ²/n_i. By a slight variation of previous results, it can be shown
that

V = SST/σ² = ∑_{i=1}^k (Yi· − Y··)² / (σ²/n_i)

has the χ² distribution with k − 1 degrees of freedom.

• It can also be shown that U and V are independent.

• The F-ratio can thus be expressed as

[V/(k − 1)] / [U/(N − k)]

which by Theorem 2 has the F-distribution with k − 1 and N − k


degrees of freedom.

We may thus use F-tables to reach a conclusion as to whether


the observed F-ratio is too large to be due to chance. In our traffic
count example, we had k = 5 and N = 48. From F-tables we can
find that the 1% critical value with 4 and 43 degrees of freedom is
3.79. Since the observed 4.56 > 3.79, we would reject H0 at the 1%
significance level. Using the Statistica probability calculator, or the
FDIST() function from Excel, we can also determine that the p-value
corresponding to an observation of 4.56 is 0.0037, which is close to
our Bootstrap approximation.
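The critical value and p-value quoted here are easily reproduced in R (a sketch):

qf(0.99, df1 = 4, df2 = 43)                       # 1% critical value, about 3.79
pf(4.56, df1 = 4, df2 = 43, lower.tail = FALSE)   # p-value, about 0.004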

Tutorial Exercises

1. For the example discussed at the start of Section 2.2 (for the
problem of comparing payment delays in two markets), com-
pare the results based on the two-sample t-test with pooled
variance, with those reported in the notes based on the bootstrap-
ping approach. Remember to first check the hypothesis of equal
variances, using the F-test, before applying the above t-test.

2. Relationships between normal, χ2 and t distributions:

(a) X has the normal distribution with mean 0 and variance 4;


V has the χ2 distribution with 5 degrees of freedom. X and
V are independent random variables. Making use of t-tables,
obtain an approximate value for the probability that X² > V.
(b) X1 , X2 , . . . , X10 is a random sample of values from a normal
distribution with mean 0 and variance 9. Y1 , Y2 , . . . , Y16 is a
random sample of values from a normal distribution with
mean 20 and variance 1. Let X̄ and Ȳ be the corresponding
sample means. Find the value z such that:

Pr[ X̄² / ∑_{i=1}^{16} (Yi − Ȳ)² ≥ z ] = 0.05

3. The following are independent random variables:

• X: Normal with mean 0 and variance 10;


• Y: Normal with mean 5 and standard deviation 3
• U: χ2 with 10 degrees of freedom

Answer the following:

(a) Identify the distribution of X − Y


(b) For what values of k and c does U + k(X − Y + c)² have a χ²
distribution? What degrees of freedom does it have?
(c) Find the relevant constants which would make the following
probability statements true:
i. Pr[ (Y − 5) / √(0.1X² + U) > k ] = 0.05
ii. Pr[ (X + Y + c) / √U > k ] = 0.95
iii. Pr[ (X − Y − c)² / U ≤ k ] = 0.9
iv. Pr[ (X²/10 + (Y − 5)²/9) / U ≤ k ] = 0.99
4. In a study of fuel consumption with two different fuel additives,
20 new cars were selected at random. Tests were conducted un-
der identically controlled conditions, with 10 of the cars using
petrol containing additive A, and 10 using petrol containing
additive B. The sample means and standard deviations of con-
sumption expressed as litres per 100km for the two groups were
recorded as follows:

Additive
A B
mean 6.89 7.19
std.dev. 0.374 0.475

A bootstrap simulation was performed in which 1000 data sets


of size 20 were regenerated from the original data, and split into
two groups of size 10. The resultant differences in means (A-
B) from the two groups were sorted from smallest to largest. A
summary of the differences observed at various positions in the
sorted list are given as follows:

Position No.: 1 10 25 50 75
Value -0.776 -0.477 -0.385 -0.308 -0.260
Position No.: 100 250 500 750 900
Value -0.234 -0.127 0.008 0.131 0.249
Position No.: 925 950 975 990 1000
Value 0.274 0.319 0.387 0.469 0.585

(a) Use the bootstrapped data to estimate the p-value for the
test of differences between the means. Compare this with the
corresponding value obtained on the assumption that the data
are normally distributed, and that the group variances are the
same.
(b) Use the bootstrapped data to construct a 95% confidence
interval for the difference between the means.

5. For each of the data sets attached to the end of this chapter, use
bootstrapping and standard normal theory to test the hypothesis
of no differences between the groups.

6. The following data represent numbers of defects observed in


product produced from three assembly lines for electronic equip-
ment. Each count represents the number of defects in one hour
of operation, where the hours selected for inspection were cho-
sen randomly. The question of interest is whether defect rates
differ between assembly lines.

Line 1 Line 2 Line 3


6 34 13
38 28 35
3 42 19
17 13 4
11 40 29
30 31 0
15 9 7
16 32 33
25 39 18
5 27 24

Using a simulation (“bootstrap”) approach, estimate the relevant


significance level, and compare this with that obtained from the
standard analysis of variance (F) test, and the non-parametric
(Kruskal-Wallis) test, both of which can be obtained through Sta-
tistica. Since the data are counts, they are unlikely to be normally
distributed. From the results of this exercise can you comment
on the robustness of the usual ANOVA F-test?

7. Independent random variables X, Z and W have the following


distributions:

• X is normally distributed with a mean of 10 and a standard


deviation of 5;
• Z has the standard normal distribution;
• W has the χ2 distribution with 9 degrees of freedom.
(a) For what values of a and b will a(X + Z − b)/√W have a t-
distribution? How many degrees of freedom will it have?
(b) Use tables to find the value of β such that

Pr[ ((X − 10)² + 25Z²) / W ≥ β ] = 0.05.
8. As part of an investigation into differences between fuel con-
sumptions on different trucks in a transport fleet, total fuel con-
sumptions (in litres) over the same fixed route were measured a
number of times for each of three trucks. Results obtained were
as follows:

Truck   Consumptions (Litres)   Yi·   ∑_{j=1}^{n_i} (Yij − Yi·)²
A 35.6 37.1 32.6 31.3 32.4 33.80 23.780
B 34.5 34.2 32.5 30.5 32.93 10.168
C 36.6 33.9 32.5 35.5 35.6 37.5 35.27 16.453

(a) Motivate and explain how a bootstrapping approach might


be used to test the hypothesis of no differences between mean
fuel consumptions of the three trucks.
(b) Such a bootstrapping approach was applied, based on 5000
bootstrap replications. For each replication, the SSE, SST and
the ratio SST/SSE were calculated. (NOTE: The ratio has not
been adjusted for degrees of freedom.) The SST/SSE ratios
were sorted from smallest to largest, and the following are a
selection of the observed values:
Sample No. 2500 4000 4500 4750 4875 4950
SST/SSE: 0.078 0.181 0.263 0.352 0.474 0.672
What conclusion should be drawn about any differences be-
tween the trucks?
(c) Compare the above answer with that obtained from the stan-
dard normal theory approach to ANOVA.

Data Sets for Exercises

Data Set C: A press used to remove water from copper-bearing


materials is being tested using two different types of filter plates.
These data are obtained on the percentage of moisture remaining
in the material after treatment.

Regular chamber (I) Diaphragm chamber (II)


8.10 8.16 8.16 7.58 7.65 7.69
7.96 7.98 7.93 7.66 7.67 7.67
7.97 8.08 8.06 7.58 7.62 7.65
8.02 7.87 7.94 7.65 7.58 7.71
7.82 8.11 7.92 7.63 7.54
8.15 7.91 8.00 7.46 7.40

Data Set D: It is thought that the gas mileage obtained by a partic-


ular model of automobile will be higher if unleaded premium
gasoline is used in the vehicle rather than regular unleaded gaso-
line. To gather evidence to support this contention 10 cars are
randomly selected from the assembly line and tested using a
specified brand of premium gasoline; 10 others are randomly
selected and tested using the brand’s regular gasoline. Tests are
conducted under identical controlled conditions. These data
result:

Premium Regular
35.4 31.7 29.7 34.8
34.5 35.4 29.6 34.6
31.6 35.3 32.1 34.8
32.4 36.6 35.4 32.6
34.8 36.0 34.0 32.2

Data Set E: A study of visual and auditory reaction time is con-


ducted for a group of college basketball players. Visual reaction
time is measured by time needed to respond to a light signal and
auditory reaction time is measured by time needed to respond to
the sound of an electric switch. Fifteen subjects were measured
with time recorded to the nearest millisecond.

Subject Visual Auditory


1 161 157
2 203 207
3 235 198
4 176 161
5 201 234
6 188 197
7 228 180
8 211 165
9 191 202
10 178 193
11 159 173
12 227 137
13 193 182
14 192 159
15 212 156

Is there evidence that the visual reaction time tends to be slower


than the auditory reaction time?

Data Set F: A firm has two possible sources for its computer hard-
ware. It is thought that supplier X tends to charge more than
supplier Y for comparable items. Do these data support this
contention at the α = 0.05 level?

Item X price ($) Y price ($)


1 6000 5900
2 575 580
3 15000 15000
4 150000 145000
5 76000 75000
6 5650 5600
7 10000 9975
8 850 870
9 900 890
10 3000 2900

Data Set G: Twenty randomly selected cars of the same make and
model were split into two groups of ten each. Premium grade
petrol was used in cars from the first group and regular grade
in the other group. Petrol consumptions over a standard set of
identically controlled conditions were measured as follows:

Premium Regular
6.71 7.49 8.00 6.82
6.88 6.71 8.02 6.86
7.52 6.73 7.40 6.82
7.33 6.49 6.71 7.29
6.82 6.60 6.99 7.38

Data Set H: These are the running times in minutes of films pro-
duced by two different directors. Is there a difference?

Director I 103 94 110 87 98


Director II 97 82 123 92 175 88 118

Data Set J: Ten samples of dried milk produced by Company A


were analyzed for fat content by the company’s own laboratory,
and by the laboratory of their main customer (Company B). As
each pair of analyses relate to the same original sample, they are
not independent. We must therefore use a paired test in both the
simulation and the t-test.

Sample Analysis by Analysis by


Number Company A Company B
1 0.50 0.79
2 0.58 0.71
3 0.90 0.82
4 1.17 0.82
5 1.14 0.73
6 1.25 0.77
7 0.75 0.72
8 1.22 0.79
9 0.74 0.72
10 0.80 0.91

Data Set K: A study on the tensile strength of aluminium rods is


conducted. Forty identical rods are randomly divided into four
groups each of size 10. Each group is subjected to a different
heat treatment and the tensile strength, in thousands of pounds
per square inch, of each rod is determined. The following data
result.

Treatment
1 2 3 4
18.9 18.3 21.3 15.9
20.0 19.2 21.5 16.0
20.5 17.8 19.9 17.2
20.6 18.4 20.2 17.5
19.3 18.8 21.9 17.9
19.5 18.6 21.8 16.8
21.0 19.9 23.0 17.7
22.1 17.5 22.5 18.1
20.8 16.9 21.7 17.4
20.7 18.0 21.9 19.0

Data Set L: Following a major accidental spill from a chemical


manufacturing plant near a river, a study was conducted to de-
termine whether certain species of fish caught from the river
differ in terms of the amounts of the chemical absorbed. If dif-
ferences are found, regulations on human consumption may

be recommended. Samples from catches of three major species


were measured in parts per million. The resulting data are given
below.

Species
A B C
18.1 29.1 26.6
16.5 15.8 16.1
21.0 20.4 18.8
18.7 23.5 25.0
7.4 18.5 21.8
12.4 21.3 15.4
16.1 23.1 19.9
17.9 23.8 15.5
20.1 21.1
11.9 25.5

Data Set M: Four brands of tyres are tested for tread wear. Since
different cars may lead to different amounts of wear, cars are
considered as blocks to reduce the effect of differences among
cars. An experiment is conducted with cars considered as blocks,
and brands of tyres randomly assigned to the four positions of
tyres on the cars. After a predetermined number of miles driven,
the amount of tread wear (in millimetres) is measured for each
tyre. The resulting data are given below.

Car Tyre brand


A B C D
1 8.9 6.6 5.6 4.2
2 7.2 6.9 7.3 6.9
3 3.1 6.2 7.2 4.1
4 7.1 8.3 6.3 5.8
5 6.7 6.4 5.9 9.4
6 5.3 6.7 8.0 7.9
7 2.4 5.5 6.1 3.1
8 5.7 9.2 9.6 4.2
3 Parameter estimation and inference

The art and science of statistical modelling involves inter alia the
following steps:

• The construction of a model to represent the real world process.


Typically this model would include both relationships between
variables (e.g. the linear relation between the response and pre-
dictor variables in regression) and random variations (expressed
in terms of a relevant probability distribution);

• Use of available sample data in order to obtain estimates of the


parameters of the model, so that the model can be used for pre-
diction, planning, etc.

Where the parameters represent something simple such as the


population mean or variance, we already have some understanding
of the use of the sample mean or variance in estimating the popula-
tion values. More generally, however, a model may be expressed in
the form of a probability mass or density function f (x|θ ), where:

• x represents a possibly multivariate observation of a variable or


variables;

• θ represents all relevant parameters, of which there may be


many.

The intention is thus to estimate θ on the basis of data consisting


of a set of n observations x1 , x2 , . . . , xn . Note that for ease of pre-
sentation, we shall from now on use the same notation f ( x ) for
probability density or probability mass functions. It should be clear
from context whether we are referring to continuous or discrete
variables.
Some examples are as follows:

• A single observation X is normally distributed with mean µ and


variance σ2 , so that θ = (µ, σ2 ). In this case, the “parameters” are
just the population mean and variance.

• X has a “Pareto” distribution with p.d.f. given by:

f(x|α, λ) = α λ^α / (λ + x)^(α+1)

for x > 0. This is probably an unfamiliar distribution, but is


clearly defined by the pair of parameters: θ = (α, λ). These pa-
rameters are, however, only indirectly related to the population
mean and variance. (Try to find the relationship, as an exercise!)

• Regression models include structural relationships as well as


error or noise terms. In this case, a single observation consists of
y, x1, x2, . . . , x_p, and the probability distribution of y has mean
β0 + ∑_{k=1}^p βk xk. In some cases the xi’s may be random as well.
Thus θ is the vector of parameters defined by β0, β1, . . . , β_p, the
variance of the distribution of y, plus any parameters related to
the distribution of the xi’s.

For the purposes of this presentation, we shall keep to the sim-


pler cases represented by the first two examples above, but the
principles carry through to more general models.

3.1 Methods of Moments and Quantiles

As we have seen, distributions can be characterized by their mo-


ments. An “obvious” approach to estimating parameter values
would therefore be as follows:

• Calculate the theoretical moments (usually the mean and the


centred forms for higher order moments) from the proposed
model, expressed as functions of the parameter(s) θ;

• Obtain the numerical sample estimates of the same moments;

• Find values of the parameters that equate the two sets;

• As we often only have two or three parameters to estimate, we


may have too many equations – thus we might use only the first
two or three moments for estimation, and then use the remainder
as a rough check of distributional fit.

Example – Gamma Distribution: The gamma distribution is often


used, for example, to model times between equipment failures. We
have seen that for the Gamma distribution with p.d.f. given by

f_X(x) = λ^α x^(α−1) e^(−λx) / Γ(α)    for 0 < x < ∞

or

f(x) = x^(α−1) e^(−x/β) / (β^α Γ(α))    (x > 0)

The mean and variance are given by µ = αβ and σ² = αβ². The
third and fourth centred moments are given by µ3 = 2αβ³ and
µ4 = 3(α + 2)αβ⁴, but we really only need two moments in order to
estimate the two parameters.

Thus let x̄ and s² be the sample mean and variance. We would
then estimate the two parameters by solving for α and β from the
two equations x̄ = αβ and s² = αβ². The resulting estimates are
then β̂ = s²/x̄ and α̂ = x̄/β̂. Note how we have placed a “hat”
above the variable names, to indicate that we are referring to an
estimate and not to the true parameter value. The goodness of
fit of the assumed gamma distribution model to the data may be
assessed by comparing the values of µ3 and µ4 derived from α̂ and
β̂ with the corresponding sample moments ∑_{i=1}^n (xi − x̄)³/n and
∑_{i=1}^n (xi − x̄)⁴/n.
The following two data sets contain data which are conjectured
to have arisen from two distinct gamma distributions.

Data Set 1. A first set of data is provided as follows:

12.52 3.30 10.90 11.36 6.72


2.24 12.63 14.87 15.57 9.88
6.20 7.32 9.85 11.34 8.80
17.15 4.97 25.06 6.79 11.29

The sample mean and variance are easily calculated to be x̄ =


10.44 and s2 = 27.52, which yield the estimates α̂ = 3.96 and
β̂ = 2.637. The resulting 3rd and 4th moments based on these
estimates are compared with the sample moments from the data in
the following table:

Theoretical Sample
Third 145 119
Fourth 3419 2882

The theoretical and sample moments are very much of the same
order of magnitude, giving some credence to the assumption of a
gamma distribution.
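A sketch of this method-of-moments calculation for data set 1 in R (note that var() uses the n − 1 divisor, matching the sample variance quoted above):

x <- c(12.52,3.30,10.90,11.36,6.72,2.24,12.63,14.87,15.57,9.88,
       6.20,7.32,9.85,11.34,8.80,17.15,4.97,25.06,6.79,11.29)
beta_hat  <- var(x) / mean(x)        # about 2.64
alpha_hat <- mean(x) / beta_hat      # about 3.96
c(2 * alpha_hat * beta_hat^3,        # theoretical third centred moment, about 145
  mean((x - mean(x))^3))             # sample third centred moment, about 119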

Data Set 2. Now, however, let us carry out the same analysis on the
following data:

15.82 13.60 13.23 8.85 4.75


10.11 0.05 5.12 13.61 11.88
2.13 15.72 12.09 6.04 7.44
9.82 8.09 4.53 9.42 10.09

Now we obtain x̄ = 9.12 and s² = 19.22, giving α̂ = 4.33 and


β̂ = 2.108. The comparison of the theoretical and sample moments
now gives:

Theoretical Sample
Third 81.0 -23.9
Fourth 1621 773

The discrepancies cast some doubt at least on the validity of the


gamma distribution in this context.

Goodness of Fit. Use of additional moments to check fit is a useful


“rule-of-thumb”, but is far from being a rigorous test. In principle,
the chi-squared goodness-of-fit test gives a better test, but does
require a substantial sample size in order to be applied effectively.
For example, in the case of data set 2 above, we could try to base a
chi-squared test on numbers falling into each of the four quartiles
of the distribution. For the estimated parameter values (α̂ = 4.33;
β̂ = 2.108), the theoretical lower quartile is 5.92, the median is 8.44,
and the upper quartile is 11.59. If the Gamma model is correct,
then we should expect 5 observations each in the intervals [0–5.92],
[5.92–8.44], [8.44–11.59] and [over 11.59]. The observed numbers are
4, 5, 4 and 7 respectively, giving a chi-squared statistic of 1.20 on
one degree of freedom (since two parameters have been estimated).
This would not be anywhere near significant, so that we could not
formally “reject” the hypothesis of a Gamma distribution, in spite
of serious reservations we might have from the moments.
A useful alternative is given by the concept of a probability plot.
We start by rank ordering the sample data from smallest to largest,
i.e. as x(1) < x(2) < · · · < x(n) . Clearly, for each k, the observed
proportion of the sample having values not exceeding x(k) is k/n,
which suggests that we should expect to have F ( x(k) ) ≈ k/n if
the assumed distribution is correct. Seeking such an equality is
going to cause problems at least for k = n, however (as in many
cases F (∞) = 1). In general, we probably should be comparing
F ( x(k) ) to some number between (k − 1)/n and k/n. A number of
possible comparative values have been suggested and tested in the
literature, one of which is based on checking whether:
1
k− 4
F ( x(k) ) ≈ 1
.
n+ 2

In order to obtain a simple visual test of how well an assumed


distribution matches the data, we could either:
• Plot F(x(k)) against (k − 1/4)/(n + 1/2) (sometimes called the
“probability-probability”, or P-P plot); or

• Plot x(k) values directly against:

F^(−1)( (k − 1/4)/(n + 1/2) )

(sometimes called the “quantile-quantile”, or Q-Q plot). We
recall that the inverse distribution function F^(−1)(u) is available
for most distributions as spreadsheet functions in Excel.
Theoretically, either of the above plots should give a straight line
through the origin. This is often easily judged by eye, although rig-
orous statistical tests can be used (e.g. the “Kolmogorov-Smirnov”
test.) Probability plots are generally provided by most statistical
packages such as Statistica.
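As an illustration, both plots can be produced in a few lines of R; a sketch under the gamma model fitted to data set 1 (it assumes a vector x1 containing those 20 observations):

    x <- sort(x1)                                # ordered sample
    n <- length(x)
    kstar <- ((1:n) - 0.25)/(n + 0.5)            # plotting positions (k - 1/4)/(n + 1/2)
    Fx <- pgamma(x, shape = 3.96, scale = 2.637)
    plot(kstar, Fx, main = "P-P plot"); abline(0, 1)
    q <- qgamma(kstar, shape = 3.96, scale = 2.637)
    plot(q, x, main = "Q-Q plot"); abline(0, 1)  # near the 45-degree line if the fit is good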
Let us return to the examples of the above two data sets. Table 3.1 shows the plotting positions and data needed for both types of plot, for both data sets. Note that we denote the adjusted value (k − 1/4)/(n + 1/2) by k*.

Table 3.1: Plotting positions and data points for probability plots

             Data Set 1                       Data Set 2
k   k*       x(k)   F(x(k))   F^(-1)(k*)      x(k)   F(x(k))   F^(-1)(k*)
1   0.0366   2.24   0.0119    3.19            0.05   0.0000    2.98
2   0.0854   3.30   0.0406    4.27            2.13   0.0118    3.92
3 0.1341 4.97 0.1277 5.07 4.53 0.1282 4.61
4 0.1829 6.20 0.2179 5.75 4.75 0.1456 5.19
5 0.2317 6.72 0.2602 6.37 5.12 0.1768 5.72
6 0.2805 6.79 0.2661 6.96 6.04 0.2624 6.22
7 0.3293 7.32 0.3107 7.54 7.44 0.4029 6.71
8 0.3780 8.80 0.4366 8.11 8.09 0.4674 7.19
9 0.4268 9.85 0.5219 8.68 8.85 0.5396 7.68
10 0.4756 9.88 0.5242 9.27 9.42 0.5903 8.17
11 0.5244 10.90 0.6005 9.88 9.82 0.6240 8.69
12 0.5732 11.29 0.6276 10.52 10.09 0.6457 9.22
13 0.6220 11.34 0.6310 11.21 10.11 0.6472 9.80
14 0.6707 11.36 0.6324 11.95 11.88 0.7682 10.42
15 0.7195 12.52 0.7051 12.77 12.09 0.7801 11.10
16 0.7683 12.63 0.7114 13.71 13.23 0.8367 11.88
17 0.8171 14.87 0.8190 14.82 13.60 0.8522 12.80
18 0.8659 15.57 0.8450 16.21 13.61 0.8526 13.95
19 0.9146 17.15 0.8922 18.13 15.72 0.9190 15.54
20 0.9634 25.06 0.9859 21.50 15.82 0.9214 18.32

The plots are shown graphically in Figures 3.1 and 3.2. The plots
for data set 1 are (apart from some random fluctuations) close ap-
proximations to straight lines through the origin. The plots for data
set 2, on the other hand, show a tendency to curvature, and the
Q-Q plot in particular does not appear to come close to the origin.
This reinforces our earlier conclusions.

Use of quantiles
Moments are sometimes quite difficult to calculate theoretically in terms of the parameters. In many cases, the cumulative distribution function may be more easily available, so that one can directly calculate population medians or quartiles. Instead of matching moments, then, one can choose parameter values to match the corresponding quantiles.
A good example of this situation is provided by the “Weibull dis-
tribution”, which is also often used to model equipment lifetimes in
reliability studies. The distribution function for the Weibull distri-
bution is defined by:
F(x) = 1 − e^(−c x^γ)

for x > 0. There are thus two parameters, c and γ, to be estimated (both strictly positive). Although it is relatively easy to obtain the probability density function, there are no simple closed-form expressions for the mean and variance in terms of c
and γ. On the other hand, the 100p-th percentile of the distribution is found by solving F(x) = p. If we define ξ_p to be this percentile,

Figure 3.1: P-P and Q-Q plots for data set 1 (P-P plot: F(observed) against probability; Q-Q plot: actual against predicted values).

Figure 3.2: P-P and Q-Q plots for data set 2 (P-P plot: F(observed) against probability; Q-Q plot: actual against predicted values).

then:

ξ_p = [−ln(1 − p)/c]^(1/γ)
Thus in order to estimate c and γ, we could (for example) match the lower and upper quartiles (p = 1/4 and p = 3/4 respectively), and then use the median (p = 1/2) to check goodness of fit.
The following data were generated (in Excel) from a Weibull distribution with parameters c = 0.2 and γ = 3:

0.457 1.289 1.314 1.488 1.523


1.533 1.537 1.819 1.953 1.973
2.030 2.071 2.215 2.680 2.781
The lower and upper quartiles of the sample are 1.506 and 2.051 respectively, while the median is 1.819. Matching of the quartiles requires that the parameter values satisfy:

[−ln(0.75)/c]^(1/γ) = 1.506    and    [−ln(0.25)/c]^(1/γ) = 2.051
respectively. The solution is easily found to be ĉ = 0.036 and
γ̂ = 5.09. The theoretical median corresponding to these param-
eter values is 1.788, not far from the sample median of 1.819. This
suggests that the data are well-modelled by the Weibull distribu-
tion, even though the parameter estimates turn out in this case to
be quite far from their true values.
The construction and interpretation of the relevant probability
plots for the above problem is left to the student as an exercise.
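In passing, the two quartile-matching equations can be solved in closed form, as the following R sketch (using the sample quartiles quoted above) shows:

    q1 <- 1.506; q3 <- 2.051                            # sample quartiles
    gamma.hat <- log(log(0.25)/log(0.75))/log(q3/q1)    # 5.09
    c.hat <- -log(0.75)/q1^gamma.hat                    # 0.036
    (-log(0.5)/c.hat)^(1/gamma.hat)                     # theoretical median, approx. 1.79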

3.2 Maximum Likelihood Estimation

The student will have noticed that the precise choice of estimator
has up to now been somewhat arbitrary. We could use moment
or quantile estimators. In either case, there may be a number of
different moments or quantiles which can be matched to the data.
The question now is: Is there some estimator (or estimators) with a
claim to being “better” than others in some sense?
In order to illustrate the question, we could conduct a simple
simulation exercise. Suppose that we wish to estimate the param-
eter λ in the exponential distribution. Since there is only one pa-
rameter, we could obtain a moments estimator by matching to the
sample mean; and a quantile estimator by matching to the sample
median. These estimators are easily seen to be given by:

Moment estimator: λ̂ = 1/ x̄

Quantile estimator: λ̂ = 0.693/(sample median)

The simulation can then proceed as follows (do it for yourself!):



• Generate 1000 samples of size 10 each, from the exponential


distribution with λ = 1 (say);

• Calculate both estimates for each sample of size 10;

• Calculate the mean and standard deviation of the 1000 estimates,


for each of the two formulae.

In the simulation which we conducted, we found the mean and standard deviation of the moment estimator to be 1.12 and 0.40, while those for the quantile estimator were 1.14 and 0.58. There is little to choose between the two estimates on average, but the quantile (median) estimator appears to exhibit greater variability. In this sense, we would probably recommend use of the moments estimator rather than the quantile estimator in similar contexts.
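A minimal R sketch of this simulation (the seed is arbitrary):

    set.seed(1)
    est <- replicate(1000, {
      x <- rexp(10, rate = 1)                   # sample of size 10, true lambda = 1
      c(moment = 1/mean(x), quantile = 0.693/median(x))
    })
    rowMeans(est)                               # means of the 1000 estimates
    apply(est, 1, sd)                           # and their standard deviations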
More generally, suppose that θ̂ is a sample-based estimator for a
population parameter θ.

• By definition, θ̂ is a function of the sample data, and is thus a


random variable with its own expectation and variance;

• Suppose that the true value of θ is given by θ (fixed, but un-


known). We then define the bias of the estimator by B(θ̂ ) =
E(θ̂ ) − θ, where the expectation is taken w.r.t. the sampling vari-
ation in X1 , X2 , . . . , Xn given the fixed value of θ. An estimator is
said to be unbiased if E(θ̂ ) = θ, i.e. if B(θ̂ ) = 0.
If E(θ̂ ) ̸= θ then θ̂ is said to be a biased estimator.

Example (Normal distribution): The MLEs are given by

μ̂ = x̄,    σ̂^2 = (1/n) Σ_{i=1}^n (x_i − x̄)^2

Show that μ̂ is an unbiased estimate but that σ̂^2 is a biased estimate.

We start with μ̂. To show this estimator is unbiased we need to show that E(μ̂) = μ. Substituting the MLE μ̂ = x̄ easily leads to

E(x̄) = (1/n) Σ_{i=1}^n E(X_i) = (1/n) Σ_{i=1}^n μ = μ

Thus μ̂ is an unbiased estimate of μ.

To investigate the bias of the variance estimator we examine E(σ̂^2). Substituting the MLE σ̂^2 = Σ_{i=1}^n (x_i − x̄)^2/n leads (a little bit less easily this time) to

E(σ̂^2) = E[(1/n) Σ_{i=1}^n (x_i − x̄)^2]
        = (1/n) E[Σ_{i=1}^n (x_i^2 − 2 x_i x̄ + x̄^2)]
        = (1/n) E[Σ_{i=1}^n x_i^2 − n x̄^2]
        = (1/n) [Σ_{i=1}^n E(x_i^2) − n E(x̄^2)]
        = (1/n) [Σ_{i=1}^n (σ^2 + μ^2) − n (σ^2/n + μ^2)]
        = ((n − 1)/n) σ^2
Therefore σ̂^2 is a biased estimate of σ^2; its bias is −σ^2/n. Clearly an unbiased estimate is the sample variance

s^2 = (1/(n − 1)) Σ_{i=1}^n (x_i − x̄)^2.

We now define mean squared error (MSE) as the expectation


(with respect to sampling variability) of (θ̂ − θ )2 . In other words,
MSE measures the long run magnitude of deviations between the
true value and its estimate. A bit of algebra yields the following:

MSE = E[(θ̂ − θ)^2]
    = E[(θ̂ − E[θ̂] + E[θ̂] − θ)^2]
    = E[(θ̂ − E[θ̂])^2] + 2(E[θ̂] − θ) E[θ̂ − E[θ̂]] + (E[θ̂] − θ)^2
    = var[θ̂] + 0 + (E[θ̂] − θ)^2
    = var(θ̂) + (B[θ̂])^2

• We conclude that the MSE is the sum of the variance and the square of the bias, where bias is defined as the difference between the long-run expectation of the estimator and the true parameter value.

Generally, we would say that if one estimator has a smaller MSE


than another, then it is a “better” estimate. A problem in applying
this principle is that in many cases the bias is difficult to estimate
without knowing the true parameter value. For this reason, statis-
ticians often (although not always!) restrict their search for good
estimators to those which are unbiased. Clearly, the best unbiased
estimator is that which minimizes variance within this class. Such
an estimator is termed the Minimum Variance Unbiased Estimator
(MVUE).
It turns out that the search for the MVUE is closely linked to the
concept of the maximum likelihood estimator (MLE) which we shall
now introduce.

The Likelihood Function. The value of the probability mass or probability density function f(x|θ), at a given point x, represents the relative likelihood of observing the value x for the random variable, relative to all other possible values. Now consider a random sample X_1, X_2, ..., X_n (i.e. n independent observations drawn from the same distribution). It then follows that the joint probability mass or density function, given by the product

∏_{i=1}^n f(x_i | θ)

must represent the relative likelihood of the full set of observations


x1 , x2 , . . . , xn , relative to all other possible samples. Clearly, this
likelihood will depend mathematically on θ; for some parameter
values, the same observations would be more likely than for others.
When we view the joint probability as a function of θ for a given set
of observed values, it is referred to as the likelihood function:
L(θ; x_1, x_2, ..., x_n) = ∏_{i=1}^n f(x_i | θ).    (3.1)

As an example, consider the normal distribution with θ = (μ, σ^2):

L(μ, σ^2; x_1, x_2, ..., x_n) = ∏_{i=1}^n (1/(√(2π) σ)) e^{−(x_i − μ)^2/2σ^2}
                             = (1/(√(2π) σ))^n e^{−Σ_{i=1}^n (x_i − μ)^2/2σ^2}
Suppose that for any set of sample values x1 , x2 , . . . , xn , we find
the parameter value(s) which maximize L(θ; x1 , x2 , . . . , xn ) with
respect to choice of θ. In an intuitive sense, this identifies the pa-
rameter value(s) with which the observed data are most consistent.
The maximizing value of θ is termed the maximum likelihood estima-
tor (MLE) for the parameter.
In practice, it is almost always easier to maximize the logarithm of the likelihood function (which must still give the same result), i.e. the log-likelihood:

ℓ(θ; x_1, x_2, ..., x_n) = ln L(θ; x_1, x_2, ..., x_n).

Once again, let us illustrate the idea by means of the normal distribution. The log-likelihood is easily obtained as:

ℓ(μ, σ^2; x_1, x_2, ..., x_n) = −(n/2) ln(2π) − n ln(σ) − Σ_{i=1}^n (x_i − μ)^2/2σ^2.

Note that this expression can be simplified somewhat by re-writing the quadratic term in the form:

Σ_{i=1}^n (x_i − μ)^2 = Σ_{i=1}^n (x_i − x̄ + x̄ − μ)^2 = s_XX^2 + n(x̄ − μ)^2

where we define s_XX^2 = Σ_{i=1}^n (x_i − x̄)^2. It is then evident that the likelihood function depends entirely on the two statistics x̄ and s_XX^2. These are termed sufficient statistics for the problem.
With the above definitions, the log-likelihood becomes:

ℓ(μ, σ^2; x_1, x_2, ..., x_n) = −(n/2) ln(2π) − n ln(σ) − s_XX^2/2σ^2 − n(x̄ − μ)^2/2σ^2.
We can now differentiate with respect to the two parameters (µ and
σ), and obtain the optimum values by setting the two derivatives to
zero and solving for µ and σ.
Differentiating with respect to μ gives

2n(x̄ − μ)/2σ^2 = 0

so that the MLE estimator for μ must satisfy x̄ − μ = 0, i.e. μ̂ = x̄, no matter what the value of σ^2. This hardly comes as any great surprise!
Differentiation with respect to σ gives

−n/σ + s_XX^2/σ^3 + n(x̄ − μ)^2/σ^3 = 0.

Recall, however, that the optimizing value for μ must make the third term 0, no matter how σ is chosen. The MLE for σ must therefore satisfy:

−n/σ + s_XX^2/σ^3 = 0
so that the MLE estimator for σ^2 is given by s_XX^2/n. We can, in fact, see that this estimator is biased (for small sample sizes n), as it has an expectation of (n − 1)σ^2/n. Conventionally, therefore, we correct the MLE for bias by using the standard sample variance estimate s_XX^2/(n − 1).
Derivation of the properties of the MLE in general requires some
quite sophisticated mathematics, beyond the scope of the current
course. It is essential, however, to have an understanding of some of
the key properties, as listed below.

• The MLE is asymptotically unbiased (i.e. for large n);

• Small sample biases may often be corrected by a simple factor, as


for the sample variance estimator;

• The variance of the MLE estimator tends to the absolute lower


bound on variances for all unbiased estimators; in other words,
for large enough sample sizes, the MLE is the MVUE;

• The MLE estimates are approximately normally distributed


for large n, and it is possible to estimate the variance of this
distribution from the likelihood function and its derivatives; this
allows the statistician to construct confidence intervals for any
parameter.

It is important to bear in mind that the MLE is itself a random


variable (because it is calculated from sample data – if the sam-
ple changes, so does the MLE). Therefore, the MLE has its own

probability distribution. The properties above tell us what kind of distribution this is. Suppose we estimate a population parameter (denoted θ) by its MLE (denoted θ̂). The properties tell us that θ̂ follows a normal distribution with mean θ (because it is unbiased) and some variance. We can express this in the following theorem.

Theorem 3. If θ̂ is the MLE of θ then the distribution of θ̂ tends to N(θ, I_n^{-1}(θ)) as n → ∞.

Proof of this theorem is beyond the scope of this course but relies
on a theorem known as the Cramer-Rao inequality. Essentially,
this states that no unbiased estimate can have a variance below a
certain bound. This bound, known as the Cramer-Rao lower bound,
is given by 1/In (θ ). From our properties above, we know that the
variance of the MLE tends to this lower bound as the sample size
gets bigger (Property 3). We therefore use the Cramer-Rao lower
bound to estimate the variance of the MLE. A proof of the theorem
is given on p.275 of Rice’s Mathematical Statistics and Data Analysis
(1995).
The crucial part of applying the normal distribution above is of course the I_n^{-1}(θ), which gives the variance of θ̂ and is therefore needed for any inference. We call I_n(θ) the “expected Fisher information” and it is computed using

I_n(θ) = E[(∂ℓ(θ)/∂θ)^2] = Σ_{i=1}^n E[(∂ ln f(X_i|θ)/∂θ)^2] = n E[(∂ ln f(X|θ)/∂θ)^2]

or

I_n(θ) = −E[∂^2 ℓ(θ)/∂θ^2] = −Σ_{i=1}^n E[∂^2 ln f(X_i|θ)/∂θ^2] = −n E[∂^2 ln f(X|θ)/∂θ^2]
Here we have made use of the fact that the likelihood function L(θ; x) = ∏_{i=1}^n f(X_i|θ), so the log-likelihood ℓ(θ; x) = ln[∏_{i=1}^n f(X_i|θ)] = Σ_{i=1}^n ln[f(X_i|θ)]. In the final (third) expression of each set of equations we have assumed that our random sample is independent and identically distributed.

Of course, we do not know the population parameter θ, but an approximation for I_n(θ) can be obtained by plugging in the MLE θ̂ for θ. We can write this as

I_n(θ̂) → I_n(θ) as n → ∞

This gives us a practical procedure for calculating I_n(θ), and so the variance of the MLE θ̂, from which we can construct confidence intervals, perform hypothesis tests, etc., in the usual way. For example, a 100(1 − α)% confidence interval around the MLE is given by

θ̂ ± z_{α/2} √(Var(θ̂)) = θ̂ ± z_{α/2} √(I_n^{-1}(θ̂))

where z_{α/2} is a quantile from the standard normal distribution (remember that, from the above properties, the MLE is normally distributed, so that we can make use of z_{α/2}).

It is not always possible to evaluate the expected Fisher information. Sometimes, it can be difficult or impossible to calculate the expectations in the previous expressions. In those cases, we can make use of a different kind of Fisher information known as observed Fisher information, which is defined as

J_n(θ) = (∂ℓ(θ)/∂θ)^2

or

J_n(θ) = −∂^2 ℓ(θ)/∂θ^2
Note that the observed Fisher information bears a strong resemblance to the expected Fisher information; we have just dropped the “expectations”. The observed Fisher information comes directly from the sample data – if we can evaluate the log-likelihood, we can evaluate the observed Fisher information. Under some additional assumptions

J_n(θ) → I_n(θ) as n → ∞

so that, for large sample sizes, we can use I_n(θ) and J_n(θ) interchangeably. As previously, we don't know θ but we can use θ̂ as a “plug in” estimator. Thus we can approximate the distribution of the MLE θ̂ using

θ̂ ∼ N(θ, I_n^{-1}(θ̂))    or    θ̂ ∼ N(θ, J_n^{-1}(θ̂))

Example 1 (Poisson): Suppose that we want to find a 100(1 − α)% confidence interval around the MLE for the parameter of a Poisson distribution, λ.

The probability mass function of a Poisson random variable is given by

f(x|λ) = λ^x e^{−λ}/x!

From before (chapter 3), we know that the MLE for λ is given by λ̂ = X̄. To find a confidence interval for λ, we work out the expected Fisher information I_n(λ) using the second of our equations above (any one will do):

I_n(λ) = −n E[∂^2 ln f(X|λ)/∂λ^2]

We know that

ln f(x|λ) = x ln λ − λ − ln x!

so that

∂ ln f(x|λ)/∂λ = x/λ − 1

and

∂^2 ln f(x|λ)/∂λ^2 = −x/λ^2

Therefore I_n(λ) is simply

−n E[−X/λ^2] = n E[X]/λ^2 = nλ/λ^2 = n/λ
Note here that we have made use of the fact that, for a Poisson random variable X, E[X] = λ. In other contexts, i.e. for other distributions, working out these expectations can take considerable effort! In any case, a 100(1 − α)% confidence interval for λ is therefore given by

λ̂ ± z_{α/2} √(1/I_n(λ))

which we approximate by plugging in λ̂ for λ, giving

λ̂ ± z_{α/2} √(λ̂/n)

and finally, because we know the MLE of λ is λ̂ = X̄,

X̄ ± z_{α/2} √(X̄/n)

Example 2 (Poisson application): Suppose that we want to find the MLE for the number of new companies registered each month at South Africa's Companies and Intellectual Properties Registration Office (CIPRO), and to construct a confidence interval around that estimate. The number of new companies registered each month can reasonably be expected to follow a Poisson distribution with parameter λ. Data for 2008 are given in the table below.

Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
 3    10   9    8    13   21   25   13   9    4    3    2

The MLE for λ is given by λ̂ = X̄ = 10, and a 95% confidence interval for λ is given by

X̄ ± z_{α/2} √(X̄/n) = 10 ± 1.96 √(10/12) = [8.21; 11.79]
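The same interval can be computed in a couple of lines of R:

    x <- c(3, 10, 9, 8, 13, 21, 25, 13, 9, 4, 3, 2)   # monthly registrations, 2008
    lambda.hat <- mean(x)                              # MLE: 10
    lambda.hat + c(-1, 1) * qnorm(0.975) * sqrt(lambda.hat/length(x))   # [8.21; 11.79]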

Example 3 (Normal distribution): For this distribution θ′ = (θ_1, θ_2) = (μ, σ^2). As θ is now a vector of parameters, rather than a single parameter, we need a vector version of the previous theorem:

If θ̂ is the MLE of θ then the distribution of θ̂ tends to N(θ, I_n^{-1}(θ)) as n → ∞.

In dealing with a vector of parameters everything proceeds as before, except that we need to be a little more careful when taking partial derivatives. The log-likelihood is given by:

ℓ(θ) = ℓ(μ, σ^2) = −(n/2) ln σ^2 − (n/2) ln 2π − (1/2σ^2) Σ_{i=1}^n (x_i − μ)^2.

Differentiating w.r.t. the two components of θ, we get:

∂ℓ(θ)/∂μ = (1/σ^2) Σ_{i=1}^n (x_i − μ)

∂ℓ(θ)/∂σ^2 = −n/2σ^2 + (1/2(σ^2)^2) Σ_{i=1}^n (x_i − μ)^2

As seen in the earlier example for the normal distribution, the MLEs are given by: μ̂ = x̄ and σ̂^2 = (1/n) Σ_{i=1}^n (x_i − x̄)^2.
We now obtain the elements of the information matrix from the second derivatives:

I_n11(θ) = −∂^2 ℓ(θ)/∂μ^2 = n/σ^2

I_n22(θ) = −∂^2 ℓ(θ)/∂(σ^2)^2 = −n/2σ^4 + Σ_{i=1}^n (x_i − μ)^2/σ^6

I_n12(θ) = I_n21(θ) = −∂^2 ℓ(θ)/∂μ∂σ^2 = n(x̄ − μ)/σ^4.
The information matrix is based on the true value θ_0 which is unknown. As an approximation, however, we could replace θ_0 by the MLE, which would give:

I_n12(θ̂) = I_n21(θ̂) = n(x̄ − x̄)/σ̂^4 = 0

and

I_n22(θ̂) = −n/2σ̂^4 + Σ_{i=1}^n (x_i − x̄)^2/σ̂^6 = −n/2σ̂^4 + nσ̂^2/σ̂^6 = n/2σ̂^4.
In matrix form:

I_n(μ̂, σ̂^2) = [ n/σ̂^2      0
                0          n/2σ̂^4 ]

with inverse given by:

I_n^{-1}(μ̂, σ̂^2) = [ σ̂^2/n      0
                     0          2σ̂^4/n ]    (3.2)

Thus asymptotically x̄ and σ̂2 are jointly normally distributed


and independent, with variances given by the diagonal elements
of (3.2). This seemingly important result turns out in this case,
however, not to be all that useful! For the normal distribution, we
already know, even for small sample sizes, that x̄ ∼ N (µ, σ2 /n)
(exactly) and that nσ̂2 /σ2 is independently distributed as χ2 with
n − 1 degrees of freedom, so that var(σ̂2 ) = 2σ4 (n − 1)/n2 which
tends to 2σ4 /n as n → ∞. Nevertheless, more generally (for non-
normal distributions), use of the above type of calculations for
MLEs can give very useful results.
The above properties have led to the MLE being the estimator
of choice in a very wide range of statistical problems. Perhaps the
only serious competitor to MLE would be the Bayesian estimates
discussed in the next chapter. Bayesian estimates tend to be biased,
but to have smaller mean squared errors than the MLEs.

Multiple Regression: For a fixed design matrix, each observation y_i (in the notation of the previous chapter) is independently normally distributed with a mean of β_0 + Σ_{k=1}^p β_k x_ik and a variance of σ^2. Recalling the matrix formulation, we can express the likelihood function as follows:

L(β; y_1, ..., y_n) = (1/(√(2π) σ))^n exp[−Σ_{i=1}^n (y_i − β_0 − Σ_{k=1}^p β_k x_ik)^2/2σ^2]
                   = (1/(√(2π) σ))^n exp[−(y − Xβ)′(y − Xβ)/2σ^2]

It is clear, therefore, that whatever the estimate for σ^2, the joint MLE for β must be that which minimizes the expression (y − Xβ)′(y − Xβ), i.e. the usual least squares estimator which was discussed in STA2020S and STA2030S.
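This equivalence is easily verified numerically; a minimal R sketch with simulated (purely hypothetical) data:

    set.seed(1)
    X <- cbind(1, runif(30), runif(30))       # hypothetical design matrix, intercept in column 1
    y <- drop(X %*% c(2, 1, -1) + rnorm(30, sd = 0.5))
    solve(t(X) %*% X, t(X) %*% y)             # least squares: (X'X)^{-1} X'y
    coef(lm(y ~ X[, -1]))                     # identical estimates from lm()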

Another Example: The Gamma Distribution. The likelihood function in this case is given by:

L(α, β; x_1, ..., x_n) = [∏_{i=1}^n x_i^{α−1}] e^{−Σ_{i=1}^n x_i/β} / (β^{nα} [Γ(α)]^n).

Taking logs, we obtain:

ℓ(α, β; x_1, ..., x_n) = (α − 1) Σ_{i=1}^n ln(x_i) − Σ_{i=1}^n x_i/β − nα ln(β) − n ln(Γ(α)).

Differentiating with respect to β gives one condition for optimality as:

Σ_{i=1}^n x_i/β^2 − nα/β = 0

or:

β = Σ_{i=1}^n x_i / (nα).    (3.3)

Differentiation of the log-likelihood with respect to α (in order to find the full optimality conditions) would involve us in the difficult task of attempting to differentiate the gamma function Γ(α). It is possible, however, to obtain a graphical solution to the problem using Excel. For any given value of α, the maximum value of ℓ(α, β; x_1, ..., x_n) is that obtained when β is given by (3.3). Thus, in a spreadsheet containing the sample data, we could insert a sequence of trial values for α. Next to each of these we could insert the formula giving β from (3.3), and then the formula giving ℓ(α, β; x_1, ..., x_n) for the given α and resultant β. Recall that the spreadsheet function GAMMALN() may be used to calculate ln(Γ(α)).
For example, using the data from the “data set 1” introduced earlier, we find that Σ_{i=1}^{20} x_i = 208.725 and Σ_{i=1}^{20} ln(x_i) = 44.229.
From these two statistics, values of β from (3.3), ln(Γ(α)) and ulti-
mately ℓ(α, β; x1 , . . . , xn ) can be calculated for each α. For example,
the following tabular extract shows the values calculated for α be-
tween 3 and 5:

α       Maximizing β    ln(Γ(α))    Max. log-likelihood
3.00 3.479 0.693 -60.20
3.25 3.211 0.936 -60.03
3.50 2.982 1.201 -59.92
3.75 2.783 1.487 -59.87
4.00 2.609 1.792 -59.87
4.25 2.456 2.114 -59.91
4.50 2.319 2.454 -59.98
4.75 2.197 2.809 -60.09
5.00 2.087 3.178 -60.23

It is then possible to plot a graph of the maximum values of


the log-likelihood against α, from which the value of α giving the
globally maximum log-likelihood can be read off. Such a plot is
given in Figure 3.3 for the case of data set 1. An approximate MLE
for α is thus 3.9, with the corresponding MLE for β then being 2.68.
In this case, the maximum likelihood and method of moments
estimators give very similar solutions.
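The same profile-likelihood search can be scripted in R, using lgamma() (the analogue of GAMMALN()); a sketch based on the two statistics quoted above:

    sum.x <- 208.725; sum.lnx <- 44.229; n <- 20
    alpha <- seq(1, 6, by = 0.05)
    beta <- sum.x/(n*alpha)                   # maximizing beta for each alpha, from (3.3)
    loglik <- (alpha - 1)*sum.lnx - sum.x/beta - n*alpha*log(beta) - n*lgamma(alpha)
    alpha[which.max(loglik)]                  # approximately 3.9
    plot(alpha, loglik, type = "l")           # reproduces Figure 3.3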

Numerical Maximization of the Log-Likelihood


In most non-trivial settings, it is not possible to find closed form
(analytical) solutions to the maximum likelihood problem. We thus
need to use some form of software for numerical optimization of
mathematical functions. One of the most accessible optimization
packages is provided by the Solver Tool in Excel. All that is re-
quired is to:

• Enter the sample data into Excel, and set up the formulae to
calculate any statistics;

• Provide a block of cells into which the parameter values may be entered;

Figure 3.3: Maximum values of the log-likelihood for various α (log-likelihood, −62.0 to −59.6, plotted against the α parameter, 1 to 6).

• Enter the formula for the log-likelihood as a function of the


sample statistics and the parameter values;

• Call Solver, to maximize the log-likelihood function by changing


the parameter values. Make sure that the “assume linear” option
is NOT selected. In many cases, the only constraints (if any) are
that one or more of the parameters need to be non-negative. In
such cases, rather than to check the “non-negative” option, it
is often more stable to explicitly enter a constraint forcing the
parameter to be greater than some small but strictly positive
number (e.g. something like θ ≥ 0.001).

As an example of this computer-based approach, let us return to


the Weibull distribution and the simulated data introduced earlier
in this chapter. By differentiating the distribution function with
respect to x, we obtain the probability density function as follows:

f(x) = cγ x^{γ−1} e^{−c x^γ}

The likelihood function for a sample of size n is thus:

L(c, γ; x_1, x_2, ..., x_n) = c^n γ^n [∏_{i=1}^n x_i^{γ−1}] e^{−c Σ_{i=1}^n x_i^γ}

and taking logs gives the log-likelihood function:

ℓ(c, γ; x_1, ..., x_n) = n ln(c) + n ln(γ) + (γ − 1) Σ_{i=1}^n ln(x_i) − c Σ_{i=1}^n x_i^γ

Figure 3.4 illustrates the set-up of a spreadsheet containing the same data as previously used (shown in Column A). Cells D4 and D5 make space for the parameters c and γ respectively; for non-linear optimization, it is advisable to enter some reasonable first guess for the parameter values, and for our runs we started with the values obtained from matching the quartiles (i.e. c = 0.036 and γ = 5.09). Further columns show the ln(x_i) and x_i^γ, together with their sums. These are used to set up the log-likelihood function in cell D7. It is strongly recommended that students set up the same spreadsheet for themselves, and proceed through with the optimization.

Figure 3.4: Spreadsheet set-up for the Weibull maximum likelihood estimation

A B C D E F G
1 Use same data as Weibull Sheet
2 ln X X^gamma
3 0.457 Estimates: -0.783072 0.062621
4 1.289 c 0.090964 0.253867 2.455251
5 1.314 gamma 3.53819 0.273076 2.627925
6 1.488 0.397433 4.080393
7 1.523 log-likel.: -12.69924 0.420682 4.430238
8 1.533 0.427227 4.534021
9 1.537 0.429832 4.576018
10 1.819 Count: 15 0.598287 8.304961
11 1.953 0.669367 10.67972
12 1.973 0.679555 11.07174
13 2.03 0.708036 12.24559
14 2.071 0.728032 13.14333
15 2.215 0.795252 16.6724
16 2.68 0.985817 32.72069
17 2.781 1.022811 37.29647
18
19 Sums 7.606201 164.9014

We now select Solver from the Tools menu to maximize the log-
likelihood by changing cells containing c and γ. The constraints
c ≥ 0.001 and γ ≥ 0.001 are also entered. Figure 3.5 illustrates the
Solver options which are selected. Clicking on Solve then yielded
the MLE estimates: ĉ = 0.091 and γ̂ = 3.54. Note that these are
quite a bit closer to the true values (which are known in this case)
than the estimates obtained from matching the quartiles.
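As an alternative to Solver, the same maximization can be done with R's general-purpose optimizer optim(); a sketch, starting from the quartile estimates as before (bounds could be imposed with method = "L-BFGS-B" if needed):

    x <- c(0.457, 1.289, 1.314, 1.488, 1.523, 1.533, 1.537, 1.819,
           1.953, 1.973, 2.030, 2.071, 2.215, 2.680, 2.781)
    negll <- function(p) {                    # p = (c, gamma); negative log-likelihood
      -(length(x)*log(p[1]) + length(x)*log(p[2]) +
        (p[2] - 1)*sum(log(x)) - p[1]*sum(x^p[2]))
    }
    optim(c(0.036, 5.09), negll)$par          # approximately c = 0.091, gamma = 3.54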

Figure 3.5: Solver options for ML estimation

3.3 Mean Square Error and Efficiency

Suppose now that we have two estimators for θ, say θ̂_1 and θ̂_2, either of which may be biased or unbiased. Then θ̂_1 is called a more efficient estimator for θ than θ̂_2 if

MSE(θ̂_1) < MSE(θ̂_2).    (3.4)

For two unbiased estimators θ̂1 and θ̂2 , θ̂1 is more efficient than θ̂2
if:
var(θ̂1 ) < var(θ̂2 ). (3.5)

The most efficient unbiased estimator is therefore that exhibiting


minimum variance, sometimes called the minimum variance unbiased
estimator (MVUE).

Relative Efficiency

For two unbiased estimators, θ̂1 and θ̂2 of the parameter θ, with
variances var(θ̂1 ) > var(θ̂2 ), (so that θ̂1 is less efficient than θ̂2 ) , we
define the efficiency of θ̂1 relative to θ̂2 by the ratio

Relative efficiency = var(θ̂_2)/var(θ̂_1)

(which by definition is less than 1).


The relative efficiency is often quoted as a percentage.

Example: Let y_1, y_2, y_3 be a random sample from a normal distribution with mean μ and variance σ^2. Consider the following two estimators for the population mean: μ̂_1 = (1/4)y_1 + (1/2)y_2 + (1/4)y_3 and μ̂_2 = (1/3)y_1 + (1/3)y_2 + (1/3)y_3.
 
E(μ̂_1) = E[(1/4)y_1 + (1/2)y_2 + (1/4)y_3]
       = (1/4)E(y_1) + (1/2)E(y_2) + (1/4)E(y_3)
       = (1/4)μ + (1/2)μ + (1/4)μ
       = μ

and

E(μ̂_2) = E[(1/3)y_1 + (1/3)y_2 + (1/3)y_3]
       = (1/3)E(y_1) + (1/3)E(y_2) + (1/3)E(y_3)
       = (1/3)μ + (1/3)μ + (1/3)μ
       = μ.

Thus both estimators are unbiased, but now consider the variances:

var(μ̂_1) = var[(1/4)y_1 + (1/2)y_2 + (1/4)y_3]
        = (1/16)var(y_1) + (1/4)var(y_2) + (1/16)var(y_3)
        = 3σ^2/8

while

var(μ̂_2) = var[(1/3)y_1 + (1/3)y_2 + (1/3)y_3]
        = (1/9)var(y_1) + (1/9)var(y_2) + (1/9)var(y_3)
        = 3σ^2/9.
Thus var(μ̂_2) < var(μ̂_1), so that μ̂_2 is more efficient than μ̂_1. The relative efficiency of μ̂_1 compared to μ̂_2 is:

(3σ^2/9)/(3σ^2/8) = 0.889

i.e. an efficiency of 89% relative to μ̂_2.

Example: Suppose we have a random sample X = (X_1, X_2) from a normal distribution with mean μ and variance σ^2.

• Show that the two estimators, μ̂_1 = (x_1 + 2x_2)/3 and μ̂_2 = (2x_1 + 3x_2)/5, are unbiased estimators of μ.

• Which is the more efficient estimator of μ?

3.4 Likelihood ratio test

The log-likelihood function is used for hypothesis testing, for example for testing the null hypothesis H_0: θ = θ_0 against the alternative hypothesis H_A: θ ≠ θ_0. The likelihood ratio test statistic for this hypothesis is defined as

Λ(x_1, x_2, ..., x_n) = L(θ_0)/L(θ̂).
We can reject H0 declaring it ‘unsupported by data’ if its likeli-
hood is ‘too small’, indicating there are other hypotheses which are
much better supported by the data. How small is too small can be
determined by use of a p-value from the sampling distribution of
Λ( x1 , x2 , . . . , xn ). Derivation of the theory behind the likelihood
ratio test and its distributional properties is beyond the scope of the
current course.
The essential idea of the likelihood ratio test defined above is that we can interpret the likelihood function or log-likelihood function in terms of ratios. If, for example, L(θ_1)/L(θ_2) > 1, then θ_1 explains the data better than θ_2. If L(θ_1)/L(θ_2) = k, then θ_1 explains the data k times better than θ_2. To illustrate¹, suppose students in a statistics class conduct a study to estimate the fraction of cars on Campus Drive that are red. Student A decides to observe the first 10 cars and record X, the number that are red. Student A observes

NR; R; NR; NR; NR; R; NR; NR; NR; R

and records X = 3. She did a Binomial experiment; her statistical model is X ∼ Binomial(10; θ), where θ = p; her likelihood function is L(θ) = (10 choose 3) θ^3 (1 − θ)^7. It is plotted in Figure 3.6. Because only ratios matter, the likelihood function can be rescaled by any arbitrary positive constant. In Figure 3.6 it has been rescaled so the maximum is 1. The interpretation of Figure 3.6 is that values of θ around θ ≈ 0.3 explain the data best, but that any value of θ close to 0.1 or 0.6 explains the data not too much worse than the best, i.e., θ ≈ 0.3 explains the data only about 10 times better than θ ≈ 0.1 or θ ≈ 0.6, and a factor of 10 is not really very much. On the other hand, values of θ less than about 0.05 or greater than about 0.7 explain the data much worse than θ ≈ 0.3.

¹ From: Lavine, M. Introduction to Statistical Thought, 2007.
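Figure 3.6 is easily reproduced; a minimal R sketch:

    theta <- seq(0, 1, by = 0.001)
    L <- dbinom(3, size = 10, prob = theta)      # binomial likelihood for X = 3, n = 10
    plot(theta, L/max(L), type = "l",
         xlab = "theta", ylab = "likelihood function")   # rescaled so the maximum is 1
    dbinom(3, 10, 0.3)/dbinom(3, 10, 0.1)        # ratio of support: theta = 0.3 vs theta = 0.1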

Figure 3.6: Likelihood function L(θ) for the proportion θ of red cars on Campus Drive (rescaled likelihood plotted against θ from 0 to 1).

Tutorial Exercises

1. Derive an expression for the maximum likelihood estimator for the parameter λ of the Poisson distribution (based on a random sample of size n).

2. The table below shows the number of times, X, that 356 students
switched majors during their under-graduate study.
Number of major changes 0 1 2 3
Observed frequency 237 90 22 7

(a) Assume that X follows a Poisson distribution. Find the ML estimate of λ, using the result from Exercise 1.
(b) Also construct a 95% asymptotic confidence interval on λ.

3. For the probability density function with parameter θ:

f(y|θ) = (1/θ^2) y e^{−y/θ},    y > 0, θ > 0
(a) Find the maximum likelihood estimator of θ.
(b) What is the mle of g(θ ) = 1θ ?
(c) Suppose a random sample of 6 observations from the above
pdf yielded the following observations:
9.2 5.6 18.3 12.1 10.7 11.5
What is the maximum likelihood estimate of θ?

4. Derive an expression for the maximum likelihood estimator for


the parameter θ in the probability density function given by:

f(x) = (θ + 1) x^θ for 0 < x < 1

Note that we must have θ > −1 for this to define a proper


probability distribution.

5. The Beta distribution is defined by the probability density function:

f(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1} for 0 < x < 1
where Γ() is the gamma function (see notes on the gamma distri-
bution). Derive the log-likelihood function for this distribution,
based on a random sample of size n.
The following data are assumed to have come from a beta distri-
bution: 0.593; 0.451; 0.359; 0.510; 0.302; 0.305; 0.442; 0.363; 0.635;
0.305; 0.494; 0.442; 0.323; 0.371; 0.324.
Use the Excel Solver to find the ML estimates for α and β. Con-
struct the probability plot to check this fit. (Note that Excel also
has a BETAINV() function available for this purpose.)

6. Suppose that the following data have been collected on mag-


nitudes of insured losses (in Rm) over a number of industrial
accidents:

4.26 3.18 3.09 1.89 2.07


1.74 1.53 5.85 4.23 3.12
1.77 1.26 1.29 1.17 0.81
0.48 1.26 1.29 2.67 0.51
1.98 6.15 1.17 3.15 1.35

Use method of moments and MLE estimation to fit a gamma


distribution to these data. Construct the probability plots to
check both fits.

7. For the data of the previous problem, use quartiles and MLE
estimation to fit a Weibull distribution to these data. Construct
the probability plots to check both fits.

8. The following data (which have been sorted from smallest to


largest) are supposed to have come from a distribution with
probability density function given by f(x) = θx^{θ−1} for 0 < x < 1:

0.713 0.717 0.727 0.801 0.868


0.912 0.924 0.928 0.938 0.997

The sample mean and standard deviation of the above data are
0.853 and 0.105 respectively.

(a) Estimate θ using both the method of moments and matching


of quantiles. Comment on the goodness of the fit, by reference
to other moments and/or quantiles
(b) Construct the probability plot needed to test the goodness of
fit using the method of moments estimator, and comment on
any conclusions you may draw.

9. A sample of size 16 has been collected from a random variable


whose distribution is thought to be defined by the following
probability density function:

f(x) = [Γ(α + β)/(Γ(α)Γ(β))] · x^{β−1}/(1 + x)^{α+β}

for 0 < x < ∞. The parameters α and β are strictly positive.


The following values were observed:

0.19 3.90 0.86 1.29


0.36 0.26 1.72 0.64
0.56 2.15 0.35 1.59
1.47 1.06 2.07 1.24

Use Excel Solver to obtain the maximum likelihood estimators


for α and β.

10. The following data refer to thousands of operating hours be-


tween breakdowns experienced on a particular piece of produc-
tion equipment (recorded in order of increasing values):

0.29 0.84 1.93 2.16 2.21 2.65 3.70 5.44 8.50 21.93

(a) By the process of matching moments, attempt to fit an exponential distribution (p.d.f. f(x) = λe^{−λx} for x > 0) to this data.
(b) Check the fit to the exponential distribution by construct-
ing and roughly plotting in your answer books the relevant
probability (“quantile-quantile”) plot. What conclusion do you
reach?
(c) The engineer has suggested that on the basis of the manufac-
turer’s information, these times should follow a generalized
Pareto distribution, with p.d.f. given by:

f(x) = αγ x^{γ−1} / (1 + x^γ)^{α+1}

where α, γ > 0. Find the maximum likelihood estimator for α,


on the assumption that γ = 1.
4
Bayesian inference

4.1 Introduction

The Bayesian approach to statistical inference differs from the con-


ventional “sampling theory” approach in that it makes use of prob-
ability measures to represent uncertainty concerning unknown
parameter values. You may recall that in the conventional approach,
probability distributions are used to describe sampling variability;
but the parameters are considered fixed (albeit at unknown values)
and not random. The Bayesian approach requires an extension of
our thinking to allow probability to imply any type of uncertainty,
and not just those implied by relative frequencies.
The justification for the Bayesian approach is often stated in
terms of “decision theory”, i.e. the analysis of decision making
under uncertainty. The following simplified example illustrates the
decision theory mode of analysis, and serves as motivation for what
follows, even though the example may not immediately resemble a
problem in statistical inference.

Steel Strike Example: A steel consumer has to decide on how much


steel should be stocked, in anticipation of a pending strike in
the steel industry. Let θ denote the (unknown) strike length,
measured in days. For simplicity, we suppose that the consumer
is only considering three possible scenarios to represent possi-
ble outcomes or “states”, namely that the strike will last for 0
(i.e. no strike), 30 or 60 days, and that therefore the actions to
be considered are the laying in of stocks for 0, 30 or 60 days.
The accountants have determined the nett costs (of storage and
working capital tied up, and of bringing in emergency stocks
from other areas) for each state, and these are displayed in the
Excel spreadsheet extract shown in Figure 4.1. Also indicated
are probabilities associated with each scenario or state. Note
that these state probabilities would typically need to be assessed
subjectively, possibly by discussion amongst management.
Expected losses for each action are easily calculated using the
SUMPRODUCT() function, and are shown in column E. It might
often be appropriate to select the action minimizing expected
loss, which in this case would be to lay in supplies for 30 days.

Figure 4.1: Costs and Probabilities for the Steel Strike Example

A B C D E
1 Estimated Costs (Rm):
2
3 Stock Scenario (Strike Duration) Expected
4 Ordered 0 30 60 Loss
5 0 0 2 6 2.60
6 30 1.5 1 2.5 1.60
7 60 3 2 1.5 2.15
8
9 Probability 0.3 0.4 0.3
10
11 Loss using
12 perfect information 0 1 1.5 0.85
13
14 Expected Value of Perfect Information: 0.75

We note in passing that minimization of expected loss is not


always the rational course of action. When potential gains or
losses are very large relative to the decision maker’s total assets,
risk aversion may start to play a role. For example, which of the
following might you prefer:
• to risk a loss of one million Rand with probability of one-in-
a-thousand (i.e. an expected loss of R1000); or
• to pay an insurance premium of R1200 for someone else to
carry the risk?
Many would buy the insurance, even though the “expected loss” is greater, because a million Rand loss is just too much to cope with. Decision theorists suggest that instead of expected loss we should consider expected utility, in which case if the utility associated with avoiding a million Rand loss is more than 1000 × the utility associated with avoiding a R1200 loss, then the insurance option is preferred.
There is a considerable literature on the topic of expected utility
theory, but which we do not have time in this course to explore.
We shall assume throughout that minimization of expected loss,
or maximization of expected gain, is always an appropriate
objective.

The question we need to ask, however, is whether the above answer is the end of the story. In particular, are the probabilities the last word in describing the uncertainty concerning θ? In many
situations we can reduce uncertainty about θ by obtaining addi-
tional information, either by statistical sampling methods (e.g.
market surveys and the like), or by consulting knowledgeable
“experts”. In the steel strike example, we could undertake some
form of survey amongst workers at the steel plant to judge the
mood, or we could consult a labour relations expert who knows
something about the lengths of strike either side can afford. Such
information is never free, and the first question is whether the

information is worth its cost.


As a first step towards assessing the value of information, con-
sider what would happen if the information provided were to
be perfect, i.e. the survey or expert could resolve our uncertainty
completely. For the steel strike example, if this information tells
us that the strike will last 0 days, we will lay in 0 stock at a cost
of 0; similarly, if the information is that the strike will be 30 days,
then we will lay in stock for 30 days for a loss of R1m; and so
on. These figures are shown in row 12 of the spreadsheet in Fig-
ure 4.1 (obtained by use of the MIN() function). The expectation
of the minimum loss using perfect information is obtained by
summing the products (SUMPRODUCT function) of the proba-
bilities in row 9 and the minimum losses in row 12. The differ-
ence between the best that can be achieved without information
(the minimum of E5:E7), and the expected minimum loss with
perfect information (E12), gives a measure of the EXPECTED
VALUE OF PERFECT INFORMATION, or EVPI (cell E14).
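For students who prefer R to Excel, the EVPI calculation can be sketched as follows (rows of L are the actions, columns the states):

    L <- rbind(c(0, 2, 6), c(1.5, 1, 2.5), c(3, 2, 1.5))   # losses: actions (0/30/60) x states
    prior <- c(0.3, 0.4, 0.3)
    exp.loss <- L %*% prior                  # 2.60, 1.60, 2.15; stocking for 30 days is best
    evpi <- min(exp.loss) - sum(apply(L, 2, min) * prior)  # 1.60 - 0.85 = 0.75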

In general terms, suppose we have a decision problem in which we face a choice between m potential actions a_i (i = 1, 2, ..., m), and in which there are p possible scenarios or states θ_r (r = 1, 2, ..., p). We can then set up a table of losses L(a_i, θ_r) corresponding to each a_i and θ_r (say with actions represented by rows, and scenarios or states by columns). Suppose further that we can associate probabilities π(θ_r) with each state. Setting up the data in this way in Excel, it is simple to calculate the expected loss (or an “expected utility” if that were appropriate) for each action, Σ_{r=1}^p L(a_i, θ_r)π(θ_r), by making use of the SUMPRODUCT function. The action giving the smallest value of the expected loss is optimal in the light of current information. For later reference, let us denote this action by a*.
The losses incurred under perfect information for each state are
identified by applying the MIN function to each column of the loss
table (i.e. separately for each scenario or state). The expected loss
when (optimally) using perfect information is given by:

Σ_{r=1}^p [ min_{i=1,...,m} L(a_i, θ_r) ] π(θ_r)

which is also calculated by means of the SUMPRODUCT function. The expected loss using current information is given by

Σ_{r=1}^p L(a*, θ_r)π(θ_r) = min_{i=1,...,m} Σ_{r=1}^p L(a_i, θ_r)π(θ_r).

The difference between the two expressions above gives the expected value of perfect information (EVPI):

EVPI = Σ_{r=1}^p L(a*, θ_r)π(θ_r) − Σ_{r=1}^p [ min_{i=1,...,m} L(a_i, θ_r) ] π(θ_r)

which must of necessity be non-negative. All of these calculations


are easily incorporated as formulae in Excel.

It is also possible to formulate decision problems in terms of


payoffs rather than losses, in which case signs change and minima
get replaced by maxima, but we shan’t digress on to this formu-
lation for now. (One can always simply write payoffs as negative
losses.)
In practical terms, information is seldom “perfect”, and thus
even if EVPI is greater than the cost of information, it is not neces-
sarily optimal to obtain the information before making the decision.
Further treatment of the value of information rests heavily upon
Bayes’ Theorem, and it is thus useful to remind ourselves of this im-
portant law of probability. We shall state the theorem in terms of a
notation that suits our needs here particularly well. But first, let us
illustrate the principle of Bayes’ theorem by a couple of numerical
examples.

Steel Strike Example (cont) Recall that we had three states (=dura-
tion of strike), with associated probabilities of 0.3, 0.4 and 0.3.
We shall now term these the prior probabilities, as they refer to
the probabilities applying prior to any information gathering.
Suppose now that an opinion polling organization specializing
in labour relations can be commissioned to conduct a survey of
labour opinions, on the basis of which they will produce a report
containing a simple statement to the effect either that the situa-
tion is “serious” or that it is “not serious”. Experience in similar
previous situations (or perhaps subjective assessment of the reli-
ability of the survey) may allow us to assess the probabilities of
each of the two possible survey results, for each possible state.
For example, from past experience with this organization in pre-
vious studies we may judge that if θ turns out ultimately to be 0
(in which case the “correct” statement should be “not serious”),
they are likely to get it right about 90% of the time. In other
words, we judge that Pr[ Result = “Not serious”|θ = 0] = 0.9,
so that Pr[ Result = “Serious”|θ = 0] = 0.1. In a similar manner,
we could judge reliability under the other two scenarios, to give
the full set of conditional probabilities displayed in Figure 4.2.
In standard statistical terms, these conditional probabilities rep-
resent sampling variability, i.e. the extent to which sample (or
survey) results may vary for each particular state. Note that the
conditional probabilities sum to 1 down the columns, but not
along the rows. Think why!
At an intuitive level, a survey result of “not serious” should in-
crease the probability associated with θ = 0, while a survey
result of “serious” should increase the probability associated
with θ = 60. But by how much should these change? The an-
swer is provided by Bayes’ theorem. The values in rows 16 and
17 of columns B-D of Figure 4.2 are obtained by multiplying the
conditional probabilities on the survey results (rows 12-13) by
the prior probabilities for the corresponding states (row 9). Rows
16-17 thus contain the joint probabilities of occurrence for each

Figure 4.2: Application of Bayes rule to the Steel Strike Example

     A                      B       C       D        E
1    Estimated Costs (Rm):
2
3    Stock                  Scenario (Strike Duration)    Expected
4    Ordered                0       30      60            Loss
5    0                      0       2       6             2.60
6    30                     1.5     1       2.5           1.60
7    60                     3       2       1.5           2.15
8
9    Probability            0.3     0.4     0.3
10
11   Conditional probabilities for survey result given the state:
12   Result="serious"       0.1     0.4     0.85
13   Result="not serious"   0.9     0.6     0.15
14
15   Joint probabilities of survey result and state:      Predictive probs.
16   Result="serious"       0.03    0.16    0.255         0.445
17   Result="not serious"   0.27    0.24    0.045         0.555
18
19   Conditional probabilities on state given the survey result:
20   Result="serious"       0.0674  0.3596  0.5730
21   Result="not serious"   0.4865  0.4324  0.0811

combination of state and survey result. For example, the prob-


ability that the strike duration will be 30 days and the survey
generates a “not serious” result is 0.4 × 0.6 = 0.24.
Summing the entries in each column in the block B16:D17 (Fig-
ure 4.2) simply returns the prior probabilities in row 9 (as it must
do by definition). The interesting exercise is to sum the entries
in each row of the same block. For example, SUM(B16:D16) gives
the value 0.445 in cell F16. This gives the total probability of a
result “serious”, taking all possible states into consideration. We
sometimes refer to this total probability as the “marginal” or
“predictive” probability. The latter term stems from the idea that
if we have to predict the result of the survey based on current
(“prior”) information, then we would associate a probability of
0.445 with a “serious” result.
The three entries in cells B16:D16 give the relative likelihoods of
each state when we know that the survey result is “serious”. In
order to turn these relative values into probabilities, we must
scale them so that they add to 1. Clearly this is achieved by
simply dividing by the sum, namely the value in cell F16. The
resultant conditional probabilities (probabilities for each state
conditional on the “serious” result) are given in row 20. Row 21
provides the same probabilities conditional on the “not serious”
result. These calculations represent precisely the implementation
of Bayes’ theorem, the full theory of which is summarized after
the next example. The resulting conditional probabilities for each
state given each possible survey result are sometimes referred
to as posterior probabilities, as they give the logical probabilities
to be associated with each state after (posterior to) the latest
information (the survey result in this case). The appropriate
posterior probabilities can be used to re-calculate the expected
losses for each action, to get the optimal decision conditional on the result obtained. We shall return to this shortly.

A Marketing Example In order to reinforce the ideas introduced


in the previous example, let us look at another. Suppose some
critical decisions depend on the marketability of a new product
nationally. It is possible, however, to undertake a test marketing
of the product within a small localized region, and to observe
what happens.
Suppose four outcome scenarios have been defined, designated
as “total success” (TS), “moderate success” (MS), “moderate
failure” (MF) and “total failure” (TF). The states refer to the un-
known outcomes nationally, which we shall denote by TSN,
MSN, MFN and TFN. The results of the information gathering
(the test market) can be denoted by TST, MST, MFT and TFT.
We cannot assume that the test market will behave precisely as
the national market (so that, for example, total success in the
test market will not imply total success nationally), but the mar-
keting experts have been prepared to associate probabilities on
the test market outcomes conditional on each (as yet unknown)
national market state. These probabilities and the resulting anal-
yses are recorded in the spreadsheet extract shown in Figure 4.3.
The confirmation and interpretation of the results are left as an
exercise.

Figure 4.3: Application of Bayes rule in the Marketing Example

STATES TSN MSN MFN TFN

Prior probabilities 0.2 0.3 0.4 0.1

Conditional probs
for test result:
TST 0.64 0.34 0.12 0.10
MST 0.20 0.40 0.35 0.25
MFT 0.10 0.16 0.29 0.30
TFT 0.06 0.10 0.24 0.34
Predictive
Joint probs Probs
TST 0.128 0.102 0.048 0.010 0.288
MST 0.040 0.120 0.140 0.025 0.325
MFT 0.020 0.048 0.116 0.030 0.214
TFT 0.012 0.030 0.096 0.034 0.172

Posterior probs
given test result of:
TST 0.444 0.354 0.167 0.035
MST 0.123 0.369 0.431 0.077
MFT 0.093 0.224 0.542 0.140
TFT 0.070 0.174 0.558 0.198

The application of Bayes theorem in the decision theory context


is so fundamental, that we need to state it in more formal terms
(which should, however, be easily understood by reference to the
above examples). The following are the key principles:

• We represent relevant uncertainties in terms of some underlying


but unobserved scenario, condition or “state of nature”, often
represented by a parameter θ. For the moment, we shall assume
that the true state (scenario, condition) is one and only one of the
p distinct possibilities {θ1 , θ2 , . . . , θ p }. In the absence of (or prior
to) any further sample information or observation, uncertainty
regarding θ is described by the prior probabilities π (θr ).

• Imperfect information regarding θ is to be gathered (which may


often represent some form of statistical sampling). Although the
results or observations from this process are random variables,
we do assume that the full set of possible outcomes is known.
For now, let us assume that there are only a finite number of n
distinct possibilities, which we shall denote by { x1 , x2 , . . . , xn }.
The possible outcomes xk may be numerical values (for example
the number of defective items found in a consignment of ma-
terial), or simply a set of pre-specified categories (for example,
“serious”; “not serious”).

• The sampling variability or reliability of the results obtained is modelled by the probabilities associated with each outcome (x_k) conditional on the true state θ, which are assumed known for every possible state θ_r. We denote these conditional probabilities by f(x_k|θ_r). By the relevant probability laws, f(x_k|θ_r) ≥ 0, while:

Σ_{k=1}^n f(x_k|θ_r) = 1

for each r = 1, 2, . . . , p. The conditional probabilities may be


established on the basis of fundamental statistical principles (see
the quality control example described in the next section), or
empirical evidence from earlier observations, or expert judgment.

Bayes’ theorem provides an expression for the posterior probabil-


ities (sometimes termed “inverse probabilities”) for the possible
causes (states) given a particular outcome. Formally, the theorem
can be expressed in terms of the above formulation as follows:

Bayes’ Theorem:
f ( x k | θr ) π ( θr )
π ( θr | x k ) =
m( xk )

where we define m( xk ) as the predictive probability that the out-


come will be xk , given by:
p
m( xk ) = ∑ f ( x k | θr ) π ( θr ).
r =1

There is no fundamental reason why we should use different


symbols for probabilities describing sampling variation ( f ( xk |θr )),
for probabilities representing knowledge about underlying states
(prior and posterior probabilities, π (θr ) and π (θr | xk )), and for the

predictive probabilities (m( xk )), and many writers do not differen-


tiate. The distinct notation is nevertheless helpful in understanding
the different components of Bayes theorem.
Bayes theorem is of fundamental importance in understanding
the rationality of evidence. Many fallacies and prejudices arise be-
cause people generally do not understand the difference between
sampling frequencies or probabilities (given in our terminology
by f ( xk |θr )), and posterior probabilities (inferential probabilities
on causes), i.e. π (θr | xk ). For example, if it is known to be very
likely that male accountants wear grey suits, then it may often be
assumed that the wearer of a grey suit is very likely to be an ac-
countant. This, however, assumes in effect that π (θr | xk ) ∝ f ( xk |θr ),
which clearly cannot be true in general. (The exercise on drunk
driving convictions at the end of these notes, exercise 3, gives some
sense of the issues.)
In decision theory, the primary role of Bayes theorem is to pro-
vide a logically coherent means of updating state probabilities as
new evidence becomes available, and to use these in re-assessing
the optimal actions a posteriori, i.e. after the outcomes of data col-
lections, surveys, etc. become available. Since we can perform such
calculations on a “what if?” basis even before observing the relevant
outcomes, the theorem thus also enables us to establish the worth
of information.
We have suggested that what we need to do is to perform the
calculations of the posterior probabilities for each possible result or
observation { x1 , x2 , . . . , xn }, in a series of “what if?” exercises. The
posterior probabilities can be applied to the loss functions, in order
to establish the expected losses to be minimized for each possible
xk . This is illustrated by returning to the steel strike example.

Steel Strike Example (Continued): It is convenient now to display the


probability calculations separately for each possible outcome of
the survey, of which in this example there are only two. This is
illustrated in the part of the spreadsheet shown in Figure 4.4,
which should be compared with that shown in Figure 4.2. The
entries have been transposed relative to Figure 4.2, so that each
outcome of the survey is now represented by a column.
The data down to row 18 simply repeat computations from Fig-
ure 4.2. In rows 21-23 we calculate for each outcome, the ex-
pected losses for each action based on the corresponding pos-
terior probabilities in rows 16-18. These expectations can again
be obtained by use of the SUMPRODUCT spreadsheet function.
Note, however, that we first need to store the loss table illus-
trated in Figure 4.1 in transposed form, with the states as rows.
(Alternatively, one can use Excel’s MMULT function, applied
to the original loss table.) The minimum conditional expected
losses conditional on each outcome are shown in row 25; these
correspond to a stock of 60 if the result is “serious”, and to a
stock of 0 otherwise.

Figure 4.4: EVSI for the Steel Strike Example

     A                            B          C             D    E
 1                                Results of Survey
 2                                Serious    Not Serious
 3   Probs conditional on state:
 4   0 days                       0.1        0.9
 5   30 days                      0.4        0.6
 6   60 days                      0.85       0.15
 7                                                              Prior
 8   Joint probabilities                                        Probabilities
 9   0 days                       0.03       0.27               0.3
10   30 days                      0.16       0.24               0.4
11   60 days                      0.255      0.045              0.3
12
13   Sum (= predictive prob.)     0.445      0.555
14
15   Posterior state probabilities:
16   0 days                       0.0674     0.4865
17   30 days                      0.3596     0.4324
18   60 days                      0.5730     0.0811
19
20   Posterior expected losses
21   0 day stock                  4.157      1.351
22   30 day stock                 1.893      1.365
23   60 day stock                 1.781      2.446               Predictive
24                                                               Expectation
25   Minimum expected loss        1.781      1.351               1.543
26
27   EVSI:                                                       0.058

What does this tell us? It says that if we conduct the survey, then:

• If the result is “serious”, which will happen with probability


0.445 (the predictive probability), we will lay in supply for 60
days with a conditional (posterior) expected loss of 1.781;
• If the result is “not serious”, which will happen with proba-
bility 0.555, we will lay in no additional supply with a condi-
tional (posterior) expected loss of 1.351.

Overall: if we conduct the survey and act optimally on the result,


our expected loss will be 0.445 × 1.781 + 0.555 × 1.351, which is
calculated in cell E25 of Figure 4.4. The resulting value (i.e. 1.543)
should be compared with the optimal expected loss without
further information (i.e. the 1.60 identified in Figure 4.1). The
difference between these two numbers gives the EXPECTED
VALUE OF SAMPLE INFORMATION or EVSI shown in cell
E27 of Figure 4.4. Note how much smaller this is than the EVPI
(0.058 vs. 0.75, an efficiency of only 7.7%).
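The arithmetic in cells E25 and E27 is easily verified, for example in R:

pred     <- c(0.445, 0.555)    # predictive probabilities (row 13)
min.loss <- c(1.781, 1.351)    # minimum conditional expected losses (row 25)
sum(pred * min.loss)           # predictive expectation: 1.543
1.60 - sum(pred * min.loss)    # EVSI: approximately 0.058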

The above concepts can be made quite general, by expressing


them in algebraic form (which may in turn be translated into for-
mulae in an Excel spreadsheet). Continuing with our earlier no-
tation, we recall that the optimal expected loss using only prior
information is simply the minimum value across all actions ai of
    ∑_{r=1}^{p} L(ai, θr) π(θr).
Now suppose that we have the opportunity to obtain some fur-
ther information (by statistical sampling or surveys, for example).
As before, we let the set of possible outcomes or observations be
{ x1 , x2 , . . . , xn }, and denote by f ( xk |θr ) the probabilities for each
outcome conditional on the state. Then for whatever outcome xk
which is observed, we use Bayes’ theorem to calculate the posterior
probabilities conditional on this outcome, and hence identify the

conditionally optimal decision minimizing the conditional expected
loss E[L(ai, θ) | xk], given by:

    ∑_{r=1}^{p} L(ai, θr) π(θr | xk).    (4.1)

Before the observations are made (i.e. while we are still considering
whether collecting the information will be worthwhile), we will
not, of course, know what xk will be observed, and thus will not
know what the corresponding expected loss is going to be. We can,
however, compute the posterior distributions and corresponding
optimal actions for every possible outcome xk . Let the optimal
action corresponding to the observation of xk be denoted by a(k) .
The prior, or predictive, expectation of the corresponding optimal
expected loss is then given by:
    ∑_{k=1}^{n} ∑_{r=1}^{p} L(a(k), θr) π(θr | xk) m(xk)
        = ∑_{k=1}^{n} [ min_{i=1,...,m} ∑_{r=1}^{p} L(ai, θr) π(θr | xk) ] m(xk)    (4.2)

Recalling that a∗ is defined as the action minimizing prior expected
loss, we obtain the expected value of sample information (EVSI) by
subtracting the predictive expectation given by (4.2) from this
optimal expected loss using only prior information. This yields:

    EVSI = ∑_{r=1}^{p} L(a∗, θr) π(θr) − ∑_{k=1}^{n} ∑_{r=1}^{p} L(a(k), θr) π(θr | xk) m(xk)

Clearly, the observations (surveys, samples, experiments, expert


evaluations) are justifiable if and only if the EVSI exceeds the cost
of obtaining the observations.
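These formulae translate directly into code. The following R function is a sketch of the full calculation (the function name and argument layout are my own convention, not a standard library interface):

# loss : matrix of losses, rows = actions, columns = states
# prior: vector of prior state probabilities pi(theta_r)
# lik  : matrix of f(xk | theta_r), rows = states, columns = outcomes
evsi <- function(loss, prior, lik) {
  joint <- prior * lik                       # joint probabilities
  pred  <- colSums(joint)                    # predictive probabilities m(xk)
  post  <- sweep(joint, 2, pred, "/")        # posteriors pi(theta_r | xk)
  prior.opt <- min(loss %*% prior)           # optimal expected loss, prior only
  cond.opt  <- apply(loss %*% post, 2, min)  # conditional optimum per outcome
  prior.opt - sum(pred * cond.opt)           # EVSI
}

Applied to the inputs of Figure 4.4 (together with the loss table of Figure 4.1), this should reproduce the EVSI of 0.058.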

4.2 Examples from Binomial Sampling

In the previous examples, the probability distributions on the ob-


servations conditional on the states were based to some extent at
least on subjective evaluation of the reliability of the observations
as indicators of the state. In many cases, however, the conditional
probabilities can be determined on a more objective basis from
standard statistical theory. The resulting analysis is then a form of
statistical inference, termed Bayesian statistics or Bayesian inference.
(The standard statistical analysis covered in the previous chapters is
termed frequentist.)
One of the simplest forms of statistical sampling is that of bino-
mial sampling, in which a set of n observations are classified into
two groups (e.g. “successes” and “failures”). The unknown state (or
parameter) θ may be defined by (say) the proportion of “successes”
in the total population. The number of observed successes in the
sample was seen in the first year course to follow the binomial dis-
tribution, provided either that the population is so large that the

extraction of the sample does not disturb the population propor-


tions, or that we sample “with replacement”. Binomial sampling,
and the use of decision theory to analyse results, is particularly
relevant within the field of quality control, where sampling infor-
mation must be used efficiently, and the cost of sampling needs to
be balanced against the value gained through improved quality.
Let us start by looking at a very simple example, which serves
also as revision of the concepts introduced in the previous sections.
Suppose that the production from some factory is packaged into
“batches” which are sold in that form. One possible measure of
the quality of a batch may then be the proportion (say θ) of items
in the batch which are “defective” in the sense of failing to meet
some desired standard. From the outside, all batches look the same
(i.e. there is no obvious manner in which the value of θ can be de-
termined by a cursory examination). Suppose further that there
are three different markets for the product, which we shall desig-
nate as the high (HQ), medium (MQ) and low (LQ) quality markets
respectively. The basic selling price (R per batch), the minimum re-
quired quality level (i.e. the maximum tolerable value of θ), and the
monetary penalty for failing to meet market quality requirements
(expressed as R per percentage point over the required quality
level) are given in the following table.

Market   Selling Price   Required Quality (as %)   Penalty per %
HQ           4000                  10                   500
MQ           2000                  15                   300
LQ           1000                  20                   200

For simplicity, suppose that θ can take on one of the three values
0.05 (i.e. 5%), 0.15 (15%) or 0.25 (25%) only, with associated prob-
abilities 0.5, 0.3 and 0.2 respectively. It is important at this stage
to make sure that you do not confuse the different probabilities
running around!

• The parameter θ is an unknown proportion, which can be interpreted as the


probability that a single item drawn from the batch will turn out
to be defective. This θ also becomes the probability parameter in
the binomial distribution of the number of defectives in a sample
of a given size (see below). Note that θ has a classical frequentist
interpretation and is in principle objectively knowable (although
currently still unknown).

• The probability π (0.05) is a subjective probability, indicating the


current assessment of how likely it is that the true value for θ
is 0.05. Similar interpretations apply to π(0.15) and π(0.25).
Note that these probabilities may change over time as other
information becomes available.

Where then should we sell the current batch? Just to empha-


size the importance of understanding concepts rather than blindly

plugging numbers into formulae, let us in this case focus on pay-


offs rather than losses associated with each action. These payoffs
and their expectations are easily calculated from the above table,
and are shown in columns A-E of the spreadsheet displayed in
Figure 4.5. Check these by manual calculation, or (preferably) by
setting up your own spreadsheet. Clearly, the decision maximizing
expected payoffs is the HQ market.
Suppose, however, that before making the decision we can take
a sample of three items from the batch, and test whether each of
these meet the standard. Let X be the observed number out of these
three which fail (are “defectives”). By standard first year statistics,
we know that X has the binomial distribution with parameters n =
3 and θ. [Strictly speaking, we need sampling “with replacement”,
but for large batches this is not really an issue.] Thus we know that
for x = 0, 1, 2, 3:
 
    f(x | θ) = Pr[X = x | θ] = (3 choose x) θ^x (1 − θ)^{3−x}.

The binomial probabilities can be obtained from Excel’s BINOMDIST


function, and are displayed in Figure 4.5, together with the Bayesian
calculations set out as before.

Figure 4.5: Spreadsheet for quality control example

     A        B        C      D      E        F  G  H        I        J         K
 1   States:  Prior    Actions (sell as:)           Number defective in sample:
 2            probs.   HQ     MQ     LQ             0        1        2         3
 3                                                  Probabilities conditional on state:
 4   0.05     0.5      4000   2000   1000           0.8574   0.1354   0.0071    0.0001
 5   0.15     0.3      1500   2000   1000           0.6141   0.3251   0.0574    0.0034
 6   0.25     0.2     -3500  -1000   0              0.4219   0.4219   0.1406    0.0156
 7
 8   Expected payoffs: 1750   1400   800            Joint probabilities:
 9                                   0.05           0.4287   0.0677   0.00356   0.00006
10   Maximum expected: 1750          0.15           0.1842   0.0975   0.01721   0.00101
11                                   0.25           0.0844   0.0844   0.02813   0.00313
12
13   Predictive probability:                        0.6973   0.2496   0.0489    0.0042
14
15   Posterior probabilities for each state:
16                                   0.05           0.6148   0.2712   0.0729    0.0149
17                                   0.15           0.2642   0.3908   0.3520    0.2411
18                                   0.25           0.1210   0.3380   0.5752    0.7440
19
20   Conditional expected payoffs:
21   HQ Market                                      2431.94  487.76  -1193.63  -2183.04
22   MQ Market                                      1636.99  985.88   274.54   -232.14
23   LQ Market                                      879.00   661.96   424.85    255.95
24
25   Optimum:                                       2431.94  985.88   424.85    255.95
26
27   Predictive expectation:                                 1963.72
28
29   EVSI:                                                   213.72

It is left as an exercise to work through the details in Figure 4.5,


preferably by setting up your own spreadsheet. The following
points may, however, be helpful:

(1) The joint probabilities in each of columns H-K are obtained by


multiplying the prior probabilities by the relevant conditional
(binomial) probabilities. The predictive probability in each col-
umn is the sum of the joint probabilities (shown in row 13).

(2) The posterior probabilities are obtained by dividing the joint


probabilities by the predictive probability.

(3) The conditional expected payoffs in each case are obtained


by applying the SUMPRODUCT function to the corresponding
payoffs (for each action) and the posterior probabilities (for the
relevant X). The resulting decision rule is to sell in HQ if X = 0,
in MQ if X = 1, and in LQ otherwise.

(4) The EVSI is obtained by applying SUMPRODUCT to the pre-


dictive probabilities in row 13 and the corresponding maximum
conditional expected payoffs in row 25 (giving the expectation
shown in cell I27), and subtracting from this the maximum ex-
pected payoff without the sample information (in cell C10). Note
which of I27 and C10 is the larger, as a result of our working
with payoffs rather than losses.
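For readers who prefer R to a spreadsheet, the calculations in Figure 4.5 can be reproduced in a few lines (a sketch; the variable names are my own):

theta  <- c(0.05, 0.15, 0.25)
prior  <- c(0.5, 0.3, 0.2)
payoff <- rbind(HQ = c(4000, 1500, -3500),     # rows = actions,
                MQ = c(2000, 2000, -1000),     # columns = states
                LQ = c(1000, 1000,     0))
lik   <- sapply(0:3, function(x) dbinom(x, 3, theta))  # f(x | theta)
joint <- prior * lik                 # joint probabilities (columns H-K)
pred  <- colSums(joint)              # predictive probabilities (row 13)
post  <- sweep(joint, 2, pred, "/")  # posterior probabilities (rows 16-18)
cond  <- payoff %*% post             # conditional expected payoffs (rows 21-23)
sum(pred * apply(cond, 2, max)) - max(payoff %*% prior)   # EVSI: 213.72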

More generally, let’s look further at situations in which the state


can be characterized by some proportion θ of a population which
satisfy some property of interest (which may be either desirable or
undesirable, i.e. either “successes” or “failures”). Suppose that the
number in a binomial sample of size n which satisfy this property
is x, so that the probability mass function of x conditional on θ is
given by:  
    f(x | θ) = Pr[X = x | θ] = (n choose x) θ^x (1 − θ)^{n−x}
for x = 0, 1, . . . , n. What can we then say about θ on the basis of a
particular observed value for x?
We could perhaps start by considering a large number of possi-
ble values for θ, say θ = 0.00, 0.01, 0.02, . . . , 0.99, 1.00 (i.e. θr = r/100
for r = 0, 1, . . . , 100), associating a prior probability with each value.
(This is only an approximation to reality, but is probably close
enough for most practical purposes.) Bayes’ theorem in this case
can be expressed as follows:

    Pr[θr | x] = π(θr | x) = (n choose x) θr^x (1 − θr)^{n−x} π(θr) / ∑_{k=0}^{100} (n choose x) θk^x (1 − θk)^{n−x} π(θk).

One simplifying feature is that the term (n choose x), which does not depend
on θ, appears in both numerator and denominator, and can thus be
cancelled out, giving:

    π(θr | x) = θr^x (1 − θr)^{n−x} π(θr) / ∑_{k=0}^{100} θk^x (1 − θk)^{n−x} π(θk).

This is easily set up in a spreadsheet, perhaps with rows corre-


sponding to each value of θ, and columns to each value of x.
As an illustration, suppose that we start with all values of θ
being equally likely a priori, i.e. π (θr ) = 1/101 for r = 0, 1, . . . , 100.
Suppose further that we have a binomial sample of size 100, and
observe x = 5. The resulting values for π (θr | x = 5) are displayed
graphically in Figure 4.6.
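The grid calculation that produces Figure 4.6 takes only a few lines in R (a sketch):

theta <- seq(0, 1, by = 0.01)
prior <- rep(1/101, 101)                    # uniform prior over the grid
post  <- theta^5 * (1 - theta)^95 * prior   # numerator of Bayes' theorem (x = 5, n = 100)
post  <- post / sum(post)                   # standardize to sum to 1
plot(theta, post, type = "h",
     xlab = "Parameter Value", ylab = "Posterior Probability")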
As may be expected, the posterior distribution of θ is centred ap-
proximately on θ = 0.05, the proportion of successes in the sample
(5 out of 100). The distribution is, however, somewhat skewed; by

[Figure 4.6: Posterior probabilities for θ given x = 5. A plot of posterior probability (vertical axis, 0.00 to 0.20) against parameter value (horizontal axis, 0.00 to 0.20).]

definition, θ > 0, but the effective upper bound is about 0.15, giving
about twice the spread above the most likely value as there is below
it.
The above results which we have generated numerically can in
fact be obtained algebraically. If all values of θ over the entire inter-
val 0 ≤ θ ≤ 1 are deemed to be possible, then we would require a
continuous prior probability distribution defined by its probability
density function. If we still consider all values to be equally likely a
priori, then the appropriate prior probability density function would
be that of the uniform distribution, namely:

π (θ ) = 1 for 0 ≤ θ ≤ 1.

The posterior probability density function would then be:

    π(θ | x) = (n choose x) θ^x (1 − θ)^{n−x} / ∫_0^1 (n choose x) θ^x (1 − θ)^{n−x} dθ.

Once again, the binomial term (n choose x) cancels out. Furthermore, the
denominator is simply a constant to ensure that the resulting density
integrates to 1, so that we can write π(θ | x) in the form:

    π(θ | x) = k θ^x (1 − θ)^{n−x}.

This can be shown (see the formula sheet at the back of the notes,
for example) to be the probability density of the beta distribution,
where the constant k would take on the value Γ(n + 2)/[Γ( x +
1)Γ(n − x + 1)] = (n + 1)!/[ x!(n − x )!]. Figure 4.6 does in fact
closely match the corresponding beta distribution for x = 5 and
n = 100.
In fact, the beta distribution can be used to model other prior
information concerning θ. For example, if the prior distribution is
not uniform, but is of the beta distribution form with probability

density function given by π(θ) = k θ^{a−1} (1 − θ)^{b−1}, then the same
argument as above gives:

    π(θ | x) = k θ^{a+x−1} (1 − θ)^{b+n−x−1}.

Appropriate values for the parameters a and b can be selected by
making use of the following properties of the prior distribution:

    E[θ] = a/(a + b),    E[1 − θ] = b/(a + b),    Var[θ] = ab / [(a + b)²(a + b + 1)].
Suppose, for example, that in a particular context we judge a priori
that θ is expected to be around 0.2 (the “prior expectation”), and
that we are fairly sure (say with something like 95% “confidence”)
that 0.1 ≤ θ ≤ 0.4. To match the prior expectation, we need a/(a +
b) = 0.2 and b/(a + b) = 0.8, i.e. ab/(a + b)² = 0.16. If the range
corresponds to something like 4 standard deviations (compare the
normal distribution), then we should have a standard deviation
σ ≈ 0.3/4, i.e. σ² ≈ [(0.4 − 0.1)/4]² = 0.005625, so that 0.16/(a +
b + 1) ≈ 0.005625. This yields a + b + 1 = 28.4, or a + b = 27.4. (We
do not have to use integers, as Γ(a), Γ(b) etc. are defined for non-
integer values, if we ever need to evaluate the constant k.) Thus we
need a = 0.2 × 27.4 = 5.48, and hence b = 27.4 − 5.48 = 21.92. It is
possible to plot this density using a statistical package’s probability
calculator (or Excel) to confirm that it properly represents prior
judgement, and to modify the parameters slightly if needed.
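This elicitation is also easy to script; a sketch in R of the calculation just described:

m <- 0.2                       # prior expectation for theta
s <- (0.4 - 0.1) / 4           # rough prior standard deviation
ab <- m * (1 - m) / s^2 - 1    # a + b, from Var = 0.16/(a + b + 1)
a  <- m * ab                   # a = 5.48 (approximately)
b  <- ab - a                   # b = 21.92 (approximately)
curve(dbeta(x, a, b), from = 0, to = 1)   # plot the implied prior density
qbeta(c(0.025, 0.975), a, b)              # implied central 95% range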

Example: It is useful to look briefly at some specific numerical val-


ues. Suppose that the prior distribution on θ is uniform (i.e.
a = b = 1). A simulated set of binomial trials with θ = 0.25 gen-
erated the following numbers of successes up to and including
the n-th trial for various n:

Cumulative Number   Number of
of Trials (n)       Successes (x)
       10                 4
       30                 8
      100                24

The posterior distributions for each of the three cases are dis-
played in Figure 4.7. These distributions are more-or-less centred
around the sample means (since the uniform prior is not very
informative), and gradually concentrate around the true value of
θ.
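Since a uniform prior gives a Beta(x + 1, n − x + 1) posterior, the three curves in Figure 4.7 can be drawn directly in R (a sketch):

curve(dbeta(x, 4 + 1, 10 - 4 + 1), from = 0, to = 1,
      xlab = "theta", ylab = "posterior density")          # 4 out of 10
curve(dbeta(x, 8 + 1, 30 - 8 + 1), add = TRUE, lty = 2)    # 8 out of 30
curve(dbeta(x, 24 + 1, 100 - 24 + 1), add = TRUE, lty = 3) # 24 out of 100
abline(v = 0.25)                                           # true value of theta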

4.3 General Principles of Bayesian Inference

In the general problem of statistical inference, we have a random


sample of n independent and identically distributed observations
x1 , x2 , . . . , xn , with their common distribution represented by the

[Figure 4.7: Posterior distributions from binomial sampling, for 4 successes out of 10 trials, 8 out of 30, and 24 out of 100. Horizontal axis: θ from 0.0 to 1.0.]

probability or probability density function f ( x |θ ). The joint prob-


ability or probability density for the full data set is given by the
likelihood function:
    L(θ; x1, x2, . . . , xn) = ∏_{i=1}^{n} f(xi | θ).

In order to apply Bayes theorem, we need firstly to specify the


prior distribution π (θ ) for θ. This requirement has been seen by
some as a strength, and by others as a weakness, of the Bayesian
approach to inference. On the one hand, we do often have real and
meaningful prior information which does need to be exploited
especially if the results of the statistical analysis are to be used as
a basis for decision making. On the other hand, some argue that
the prior information introduces an element of subjectivity which
may conflict with a desire for objectivity in the data analysis under
certain circumstances.
The posterior probability or probability density for θ is then
defined by:
    π(θ | x1, x2, . . . , xn) = L(θ; x1, x2, . . . , xn) π(θ) / m(x1, x2, . . . , xn)
where (in the case of continuous distributions) the predictive proba-
bility density function m( x1 , x2 , . . . , xn ) is given by:
    m(x1, x2, . . . , xn) = ∫_{−∞}^{∞} L(θ; x1, x2, . . . , xn) π(θ) dθ.

For purposes of calculating the posterior distribution, how-


ever, the only role of the m( x1 , x2 , . . . , xn ) term is to standardize
π (θ | x1 , x2 , . . . , xn ) so that it integrates to 1. We could just as well
write the posterior in the form kL(θ; x1 , x2 , . . . , xn )π (θ ), where k is
simply a normalization constant. In practice, therefore, it is suffi-
cient just to evaluate the numerator term L(θ; x1 , x2 , . . . , xn )π (θ ),
and to fix up the normalization once all calculations have been
done. We often emphasize this by expressing the posterior in the
form:
π (θ | x1 , x2 , . . . , xn ) ∝ L(θ; x1 , x2 , . . . , xn )π (θ ).

In many cases (see examples below) we may immediately recognize


the distributional form which emerges, so that the normalization
constant can be written down directly; otherwise, we might deter-
mine the constant by numerical approximation. A further advan-
tage of this approach is that any other factors in L(θ; x1 , x2 , . . . , xn )
or in π (θ ) which do not depend on θ can also be absorbed into the
normalization constant (and thus effectively ignored).
Before turning to two examples, we need to ask what we might
do with the posterior distribution for θ once it has been derived. We
consider briefly the three main areas of statistical inference to which
this distribution may be applied.
Hypothesis Testing: Suppose that we need to choose between two
simple hypotheses, namely H0 : θ = θ0 and H1 : θ = θ1 (where
θ0 and θ1 are specific numerical values). In effect, θ0 and θ1 are
the only values judged a priori to be possible. Bayesian analysis
requires a prior distribution, which in this case is fully specified
by the probability associated with one of the two hypothesized
values, say π0 = Pr[θ = θ0 ] (the prior probability that H0 is true).
The posterior probabilities can then be expressed as:
π ( θ0 | x1 , x2 , . . . , x n ) ∝ L ( θ0 ; x1 , x2 , . . . , x n ) π0
π (θ1 | x1 , x2 , . . . , xn ) ∝ L(θ1 ; x1 , x2 , . . . , xn )(1 − π0 ).
It is useful to express this in terms of the odds on θ0 being the
correct value, which is given by:
    [L(θ0; x1, x2, . . . , xn) / L(θ1; x1, x2, . . . , xn)] · [π0 / (1 − π0)].
This represents the product of the a priori odds π0 /(1 − π0 ) and
the “likelihood ratio” (sometimes called the “Bayes Factor”). The
Bayes Factor is a useful summary of the strength of data-based
evidence favouring H0 , even if it is difficult to specify π0 .
Suppose that we can associate costs of CI with a type-I error
(rejecting H0 when true) and CII with a type-II error (accepting
H0 when false). The decision minimizing expected cost is to
reject H0 if CI π(θ0 | x1, x2, . . . , xn) ≤ CII π(θ1 | x1, x2, . . . , xn). After
some algebraic re-arrangement of terms, it is easily seen that
this condition for rejection of H0 is equivalent to the “likelihood
ratio” test:

    L(θ1; x1, x2, . . . , xn) / L(θ0; x1, x2, . . . , xn) ≥ [π0 / (1 − π0)] · [CI / CII].
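As a small numerical illustration (with made-up data, not from the notes): suppose we observe 7 successes in 30 binomial trials and wish to compare H0: θ = 0.5 against H1: θ = 0.2. The Bayes factor is then just a ratio of two likelihood values:

x <- 7; n <- 30                               # hypothetical data
BF <- dbinom(x, n, 0.5) / dbinom(x, n, 0.2)   # Bayes factor in favour of H0
post.odds <- BF * (0.5 / 0.5)                 # posterior odds, prior odds = 1
post.odds / (1 + post.odds)                   # posterior probability of H0

Here BF is much less than 1, so the data favour H1.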
Interval Estimation: Once π (θ | x1 , x2 , . . . , xn ) has been identified, it is
in principle always possible (although perhaps only by numerical
approximation) to find two values, say θ L and θU such that:
    Pr[θL ≤ θ ≤ θU] = ∫_{θL}^{θU} π(θ | x1, x2, . . . , xn) dθ = 1 − α

for any specified probability level α. We are thus able to state


that θ belongs to the interval [θ L , θU ] with 100(1 − α)% probabil-
ity. Conventionally, this is termed a credibility interval, to differen-
tiate from the more conventional confidence interval. Note that

in contrast to conventional confidence intervals, the credibility


intervals do have a natural interpretation in terms of the proba-
bility (subjective in this case) of the parameter being within the
given interval.
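When the posterior is a recognized distribution, the limits come straight from its quantile function. For example, for the beta posterior of Section 4.2 (5 successes in 100 trials, uniform prior), a one-line sketch in R:

qbeta(c(0.025, 0.975), 5 + 1, 100 - 5 + 1)   # equal-tailed 95% credibility interval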

Point Estimation: Here we wish to specify an explicit estimate, say


θ̂ for θ. In order to apply decision theoretic concepts, we would
need to specify in some way the loss associated with error of
estimation. A convenient surrogate measure of such loss may
be the expectation of the squared error (θ − θ̂ )2 . Note that θ̂ is a
value to be selected, and is thus not a random variable; the only
expectation refers to the uncertainty in θ as represented by the
relevant posterior distribution. The expected loss can thus be
expressed as follows:

E[(θ − θ̂ )2 ] = E[θ 2 ] − 2θ̂ E[θ ] + θ̂ 2 .

Differentiating with respect to θ̂ yields the optimality condition


−2 E[θ ] + 2θ̂ = 0, i.e. θ̂ = E[θ ] (where, of course, the expectation
refers to the posterior expectation, conditional on the data). For
this reason, the posterior expectation of θ is often termed the
Bayesian estimate. Under certain circumstances, however, other
measures, such as the mode or median of the posterior distribution,
can be and have been used as estimators.

We now look at two further examples of the use of Bayesian


arguments.

Poisson sampling
Suppose that x1, x2, . . . , xn are independent observations from the
Poisson distribution with parameter λ, i.e. f(x | λ) = λ^x e^{−λ}/x!.
Ignoring the x! factor, which does not depend on the unknown
parameter λ, we can write the likelihood function in the form:

    L(λ; x1, x2, . . . , xn) ∝ λ^{∑_{i=1}^{n} xi} e^{−nλ}.

We know for sure that λ > 0, so that the prior distribution


should be restricted to positive values only. Suppose that we at-
tempt to approximate prior knowledge concerning λ by means of a
gamma distribution, with parameters α and ϕ, i.e.:

    π(λ) = ϕ^α λ^{α−1} e^{−ϕλ} / Γ(α).

Once more, it is convenient to ignore terms which do not involve λ,


to express the prior distribution in the form:

    π(λ) ∝ λ^{α−1} e^{−ϕλ}.

Note that α/ϕ represents the prior expectation of λ, in other words


the best estimate of λ available prior to any sample data.

The posterior probability density function for λ can thus be


expressed as:
    π(λ | x1, x2, . . . , xn) ∝ λ^{∑_{i=1}^{n} xi} e^{−nλ} · λ^{α−1} e^{−ϕλ} = λ^{α + ∑_{i=1}^{n} xi − 1} e^{−(ϕ+n)λ}

which is again a gamma distribution, but with parameters α + ∑_{i=1}^{n} xi
and ϕ + n. The posterior expectation, or Bayes estimate, for
λ is thus:

    (α + ∑_{i=1}^{n} xi) / (ϕ + n) = [ϕ/(ϕ + n)] · (α/ϕ) + [n/(ϕ + n)] · (∑_{i=1}^{n} xi / n)

which is a weighted average of the prior and sample estimates (α/ϕ


and x̄ = ∑_{i=1}^{n} xi / n). This is often, but not always, the result of a
Bayesian analysis.
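A numerical sketch of this update in R, with an illustrative prior (α = 3, ϕ = 2, so a prior estimate of 1.5) and hypothetical data:

alpha <- 3; phi <- 2                      # illustrative prior parameters
x <- c(2, 0, 3, 1, 2)                     # hypothetical Poisson sample
a.post <- alpha + sum(x)                  # posterior shape parameter
p.post <- phi + length(x)                 # posterior rate parameter
a.post / p.post                           # Bayes estimate of lambda
qgamma(c(0.025, 0.975), a.post, p.post)   # 95% credibility interval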

Normal sampling
Suppose now that x1 , x2 , . . . , xn are independent observations from
a normal distribution with unknown mean µ. For simplicity, we
restrict ourselves to the case of known variance σ2 . It will simplify
the algebra if we define τ = 1/σ2 (where τ is sometimes called the
“precision” of the distribution), for then the likelihood function can
be written in the form:
    L(µ; x1, x2, . . . , xn) = [τ/(2π)]^{n/2} e^{−τ ∑_{i=1}^{n} (xi − µ)²/2}.

Once again we can ignore factors which do not depend on the
unknown parameter µ, so that:
    L(µ; x1, x2, . . . , xn) ∝ e^{−τ ∑_{i=1}^{n} (xi − µ)²/2}.

We can simplify this expression for the likelihood even further by


expanding:
    ∑_{i=1}^{n} (xi − µ)² = ∑_{i=1}^{n} (xi − x̄ + x̄ − µ)²
                = ∑_{i=1}^{n} (xi − x̄)² + 2(x̄ − µ) ∑_{i=1}^{n} (xi − x̄) + n(x̄ − µ)²
                = SXX + 0 + n(µ − x̄)²

where x̄ is the usual sample mean and SXX = ∑_{i=1}^{n} (xi − x̄)². It then


follows that:
    e^{−τ ∑_{i=1}^{n} (xi − µ)²/2} = e^{−τ[SXX + n(µ − x̄)²]/2} = e^{−τ SXX/2} × e^{−nτ(µ − x̄)²/2}.

Yet again, the first term does not depend on the unknown parame-
ter µ, so that we can write:
    L(µ; x1, x2, . . . , xn) ∝ e^{−nτ(µ − x̄)²/2}.

Now suppose that we can represent prior information on µ by a


normal distribution with mean m and variance v2 . Once again, it is

useful to define (say) ξ = 1/v2 , so that the prior distribution on µ


(after dropping terms independent of µ) can be written as:
    π(µ) ∝ e^{−ξ(µ − m)²/2}

from which we obtain the following characterization of the poste-


rior probability density function:
    π(µ | x1, x2, . . . , xn) ∝ e^{−[ξ(µ − m)² + nτ(µ − x̄)²]/2}.

Completion of the square in µ gives:

    ξ(µ − m)² + nτ(µ − x̄)² = (ξ + nτ)µ² − 2(ξm + nτx̄)µ + terms independent of µ
        = (ξ + nτ) [ µ² − 2 ((ξm + nτx̄)/(ξ + nτ)) µ + ((ξm + nτx̄)/(ξ + nτ))² ] + terms independent of µ
        = (ξ + nτ)[µ − µ̂]² + terms independent of µ

where we define:
    µ̂ = (ξm + nτx̄) / (ξ + nτ).
It then finally follows that:
    π(µ | x1, x2, . . . , xn) ∝ e^{−(ξ + nτ)[µ − µ̂]²/2}

which defines a normal distribution with mean µ̂ and variance


1/(ξ + nτ ). The Bayes estimate for µ is thus µ̂ as defined above,
which can be written as:
    µ̂ = [ξ/(ξ + nτ)] m + [nτ/(ξ + nτ)] x̄

which is again a weighted average of the prior estimate and the


sample mean.
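A numerical sketch of the update in R (all values hypothetical, purely for illustration):

m <- 50; v2 <- 25                 # prior mean and prior variance
xbar <- 52.4; n <- 20; s2 <- 100  # sample mean, sample size, known sigma^2
xi  <- 1 / v2                     # prior precision
tau <- 1 / s2                     # sampling precision
mu.hat   <- (xi * m + n * tau * xbar) / (xi + n * tau)   # posterior mean
post.var <- 1 / (xi + n * tau)                           # posterior variance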

Tutorial Exercises

Wherever relevant, it is recommended that you set up and solve the


problems in a spreadsheet rather than by hand.

1. We have R10000 to invest, all of which has to be placed in one


of three investments (I1 , I2 or I3 ). Assume that we wish to maxi-
mize the value of the investment in one year from now, but that
the value of R10000 invested in each of the three alternatives de-
pends on the state of the economy over the next year, which can
be S1 , S2 or S3 , as given by the entries in the following table:

Investment: State S1 State S2 State S3


I1 R11000 R11000 R11000
I2 R10000 R11000 R12000
I3 R16000 R3000 R14000

Assume that the three states are equally likely. In which invest-
ment should the R10000 be placed? What is the EVPI?

2. A wildcat oilman must decide how to finance the drilling of a


new well. He has three options:

a1 : Finance everything himself for a nett total cost (including


interest charges etc.) of $100 000; he will then of course, retain
all profits.
a2 : Take in a partner who will share equally in profits; his nett
total cost will then be $30 000.
a3 : Obtain financing from an independent consortium; he will
have no initial expenses and will receive 10% of profits for
managing the venture.

Profits are expected to be $4 per barrel of oil obtained from the


well. Three possibilities (states of nature) exist:

θ1 : The well is empty.


θ2 : The well will produce 20 000 barrels of oil;
θ3 : The well will produce 50 000 barrels.

Prior probabilities on these three states are assessed as follows:


π (1) = 0.5; π (2) = 0.3; π (3) = 0.2. Identify the oilman’s optimal
decision, and assess the corresponding EVPI.

3. A simplified version of the way a law relating to drunken driv-


ing operates in a number of countries is as follows: A motorist
can be stopped by a policeman and asked to take a breath test.
If this is negative, no further action ensues. If the test is posi-
tive the motorist is taken to a police station where a second test
based on a blood test is given. If this second test is negative the
motorist is released, if positive the motorist is automatically
charged and convicted of drunken driving.

The two tests concerned are not entirely precise in their oper-
ation and their accuracy has been investigated by means of a
large-scale controlled trial on a probabilistic basis with the re-
sults shown in the table below (expressed as conditional proba-
bilities for each test result, given the true state of the driver, for
each test).

TEST Test Result Motorist’s true state


Drunk Sober
First Test: Positive 0.8 0.2
Negative 0.2 0.8
Second Test: Positive 0.9 0.05
Negative 0.1 0.95

(a) What is the probability that a motorist stopped who is in


reality drunk will be convicted under this law? Conversely,
what is the probability that a motorist who is stopped and is
in reality sober will be convicted?
(b) Past information suggests that the proportion P of those
stopped for the first test who are in reality drunk is 0.6. A
motorist is stopped for testing and subsequently convicted.
What is the probability that he was actually drunk? What is
the probability that he was sober? Comment on how the latter
probability varies with changes in P.

4. Consider the following payoff (profit) matrix:

θ1 θ2 θ3 θ4
a1 10 20 -20 13
a2 12 14 0 15
a3 7 2 18 9

The probabilities of θ1 , θ2 , θ3 and θ4 are respectively 0.2, 0.1, 0.3


and 0.4. An experiment is conducted and its outcomes x1 and x2
are described by the following probabilities:

θ1 θ2 θ3 θ4
x1 .1 .2 .7 .4
x2 .9 .8 .3 .6

(a) Determine the best action when no data are used.


(b) Determine the best action when the experimental data are
used.
(c) What is the expected value of sample information (E.V.S.I.)?

5. A group of medical professionals is considering the construc-


tion of a private clinic. If the medical demand is high (i.e. there
is a favourable market for the clinic), the doctors expect to re-
alize a nett profit of R4m. If the market is not favourable, they

can expect to lose R1.6m. In the light of their present informa-


tion, they estimate that there is a 50:50 chance that the market is
favourable.
The group has been approached by a market research firm that
offers to perform a study of the market for a fee of R200000. The
market researchers claim that from their experience:

(a) The probability that the study will be unfavourable given


that the market is favourable (i.e. a false negative) is 10%;
while
(b) The probability that the study will be favourable given that
the market is unfavourable (i.e. a false positive) is 20%.

What is the expected value of sample information? Is it worth hiring


the market research firm?

6. (a) In a manufacturing process, lots having 8, 10, 12 or 14% de-


fectives are produced according to the respective probabilities
.4, .3, .25, and .05. Three customers have contracts to receive
lots from the manufacturer. The contracts specify that the per-
centages of defectives in lots shipped to customers A, B, and
C should not exceed 8, 12, and 14, respectively. If a lot has
a higher percentage of defectives than stipulated, a penalty
of $100 per percentage point is incurred. On the other hand,
supplying better quality than required costs the manufacturer
$50 per percentage point. If the lots are not inspected prior to
shipment, which customer should have the highest priority for
receiving the order?
(b) Suppose that a sample of size n = 20 is inspected before each
lot is shipped to customers. If four defectives are found in the
sample, compute the posterior probabilities of the lot having
8, 10, 12 and 14% defectives. By using the new probabilities,
determine which customer has the lowest expected cost.

7. A firm must purchase a large quantity of a commodity either


today or tomorrow. Today’s price is $14.50 per unit. The firm
believes that tomorrow’s price is equally likely to be either $10.00
or $20.00. Letting P represent tomorrow’s price, the prior proba-
bilities are thus given by: π (10) = π (20) = 0.50.
A commodity market expert offers to give the firm his prediction
on tomorrow’s price. That is, he will say to the firm, “In my
opinion tomorrow’s price will be P” where P is either $10 or $20.
The firm knows that the market expert is right 60 per cent of the
time. For his prediction, he charges a commission which is equal
to $0.15 per unit of commodity purchased. Show that the firm
should be willing to employ the market expert.

8. A firm buys an important component for its product from a sup-


plier who sometimes has trouble with his production process.
The components are shipped in lots of size 50, and it is believed

that lots come from a process in which the mean per cent defec-
tive is either 3 or 8 per cent. At the outset these two possibilities
are taken to be equally likely. For analysis, the cost of testing a
component is taken to be $1000; the cost of accepting a defective
component is $15000, and the cost of rejecting a good component
is $1000.
A plan is proposed which involves testing two components from
a lot and accepting or rejecting the entire lot based on what is
learned from these two tests. Show that this particular plan is
not as good as simply accepting the lot without any tests.

9. In a binomial sampling trial with success probability θ, expert


judgement gives a best estimate of 1/3 for θ, together with a
statement that they are 95% sure that 0.2 ≤ θ ≤ 0.6. A total of
7 successes is observed in 30 trials. Develop a posterior distribu-
tion for θ:

• Numerically for a discrete set of possible values, for example at


intervals of 0.01; Note that probabilities at the discrete points
for θ may be made proportional to the probability densities at
these points, but then these need to be re-standardized to sum
to 1;
• Analytically, in the form of a posterior probability density
function.

Compare the two results.

10. Prior information concerning a parameter λ gives a best esti-


mate of 1.5. It is further stated that experts are 50% sure that the
value lies between 1 and 2. A sample of 10 observations from a
Poisson distribution with mean of λ generated a sample aver-
age of 2.3. Generate the Bayesian estimate for λ, and construct
a 95% credibility interval for λ. (It is suggested that you use the
Excel’s GAMMAINV() function in order to obtain the credibility
interval.)

11. Let θ be the true grade of ore in a proposed new gold mining
site (expressed in grams per ton). A series of 16 borehole sam-
ples has resulted in sample grades given by X1 , . . . , X16 , which
are assumed to be normally distributed with mean θ and a stan-
dard deviation of 4. Mine management has to decide whether or
not to mine this site. If they mine, the nett profit will be 3θ − 60
(millions of Rand). Note that a positive profit will only be made
if θ > 20 grams per ton. The company geologist, prior to see-
ing the sample grades, estimates θ to be 24 grams per ton (i.e.
payable). The average of the sixteen samples is however only 18.5
grams per ton.

(a) Motivate why we would attempt to represent the geologist’s


prior views by a normal distribution with a mean of 24.

(b) How large must the geologist’s prior probability on {θ > 20}
be, for it to be optimal to mine the site in spite of the sample
values.

Hint: First find values for the variance of the prior distribution
which would make it optimal to mine the site.

12. System design engineers have to choose between three equip-


ment configurations. Discounted life cycle costs have been cal-
culated for each of three configurations, and for three possible
scenarios regarding the unknown mean operating life of a critical
component. These costs (in Rm), and prior probabilities for each
scenario, are summarized as follows:

Configuration State (θ) = mean operating life


in thousands of hours
2 3 5
A 33 20 10
B 30 18 13
C 26 24 18
Prior probs (π (θ )) 0.15 0.40 0.45

There is time for an accelerated life test trial on a sample of two


of the critical component. The two items are placed on a test bed
for a fixed period of time, after which we observe X, the number
which fail (which may be 0, 1 or 2). The conditional probabilities
for each value of X under each state have been estimated as
follows:

State Number failing (X)


(θ) 0 1 2
2 0.16 0.48 0.36
3 0.30 0.50 0.20
5 0.56 0.38 0.06

(a) Determine the configuration minimizing expected life cycle


costs (without use of any life test data), and hence calculate
the expected value of perfect information.
(b) Using Bayes’ rule, find the optimal configuration conditional
upon the observations X = 0, X = 1 and X = 2 respectively.
(c) What is the maximum it is worth paying for the accelerated
life test?

13. During the six months immediately after changes to the road
construction, a total of 22 accidents occurred along a particular
stretch of highway. It may be assumed that the number of acci-
dents in any one month follow a Poisson distribution with mean
λ, i.e. having a probability function:

    p(x) = λ^x e^{−λ} / x!    for x = 0, 1, . . .

Over a long period of time prior to the new construction, it was


established that the mean rate of accidents per month was 5.2.
The engineers have asserted that the new construction should
halve the accident rate, but others have been skeptical.
Suppose that we wish to compare the two hypotheses regarding
the current situation (i.e. as represented by the last 6 months),
namely H0 : λ = 5.2 (i.e. no change) versus H1 : λ = 2.6 as
asserted by the engineers. Suppose further that the two hypothe-
ses are viewed as a priori equally likely. What is the posterior
probability that H1 is true given the recent data?
5 Generalized linear models

5.1 Introduction to generalized linear models

Generalized linear models (GLMs) are an extension of linear regres-


sion models to deal with non-normal response variables.
Many response variables that are regularly measured are actually
not normal, they may not even be continuous variables. Here are a
couple of examples:

• binary response (only two possible outcomes): 10 years after a


first visit, each of the n quiver trees has survived or not; a plant
is present or absent at a certain site; an animal was captured on
occasion j or not

• binomial response (number of successes out of n trials): the


number of bugs out of n that survive a certain dose of pesticide;
the number of germinated seeds out of n planted

The above data types are discrete (not continuous), and we will
see in this chapter how to construct models for these two types of
data.
Other types of response which we will not cover in this course,
but which are also not normal, and for which other types of gener-
alized linear models could be used, include:

• count response: the number of grass species per square km;


the number of whales spotted in one hour; the number of earth
quakes per year; the number of globular clusters in a galaxy; the
number of fairy circles per hectare

• survival time: time to death of an individual plant or animal;


time between earth quakes

• proportion / percentage data where n is not known: the percent-


age cover of a certain vegetation type at a certain site;

• strictly positive, positively skewed measurements: rainfall data

For many of the above response variables, normal linear regres-


sion models will just not be appropriate. And once we realize that
there are models that can handle such data, it is actually liberating

not to have to use normal linear regression or analysis of variance


for everything.
Generalized linear models are regression models for non-normal
response variables. We are still modelling the mean response as a
function of explanatory variables. As such, the linear regression
model is a special case of a generalized linear model.

5.1.1 The generalized linear model


In linear regression we modelled the mean response as a function
of one or more explanatory variables. We wrote the model as:

    Yi = β0 + β1 x1i + β2 x2i + . . . + ei

where the systematic part β0 + β1 x1i + β2 x2i + . . . is the mean µi, and the error term ei ∼ N(0, σ²).

This means that

Yi ∼ N (µi , σ2 )
where µi = β 0 + β 1 x1i + β 2 x2i + . . .
In other words, we model how the mean response µi is related to
covariates or explanatory variables. Once we have estimated the β
parameters, we have a model for how the mean response is related
to the explanatory variables. The normal distribution, with the
σ2 parameter, comes in to describe the scatter of the observations
around this mean response (often a line). There is only one variance
parameter, implying that for all values of the mean µi the scatter of
the observations around the mean (expected value) is the same.
So, when we use a normal linear regression model (or an ANOVA
model) we are assuming that there is normal scatter of observations
around the mean. For this to be a valid assumption, the response
variable should be something for which a normal distribution is
suitable to begin with. (A histogram is the easiest way to quickly
check the approximate normality of the response. Note, though,
that the raw response variable will be a mixture of many normal
distributions, because the observations possibly come from different
groups and have different means. Very skew histograms are
nevertheless a good indication that the response variable is not
normally distributed.)
Let’s look again at this different format of the linear regression
model:
Yi ∼ N (µi , σ2 )
and
µi = β 0 + β 1 x1i + β 2 x2i + . . .
The first part is the random or stochastic part of the model. It
describes the variability of the observations for a given mean. The
second part is the structural part of the model. Once we have esti-
mated the parameters it is fully known.
We will now generalize the above model to cope with other types
of response. We will change the distribution of the response to
something that is more suitable for count or binomial data. In the
second part we still model some parameter related to the mean of
the response in terms of explanatory variables. For example for
count data we model the average rate of events λi as a function of
explanatory variables, usually using:

log(λi ) = β 0 + β 1 x1i + β 2 x2i + . . .


and for binary or binomial data, we model the probability of
success pi as a function of the explanatory variables, usually using:

    log( pi / (1 − pi) ) = β0 + β1 x1i + β2 x2i + . . .
The relationship between the parameter and the explanatory
variables does not need to be linear. In fact, by using a log link we are
assuming that there is an exponential relationship between λ and
the explanatory variables, and by using a logit link we are assuming
that there is an S-shaped relationship between the probability of
success pi and the explanatory variables.
To summarize the above:

(1) The random component of the model specifies how the observations
    vary around a mean parameter. This is usually in the form of a
    probability distribution: normal for normal response variables,
    Poisson for count response variables, binomial for binomial
    response variables, etc.

(2) The systematic component of the model is a linear combination
    (function) of explanatory variables that are related to the mean
    parameter.

(3) The link function defines the form of the relationship between
    the mean parameter and the explanatory variables (linear
    predictor): linear (identity link), exponential (log link) or
    S-shaped (logit link).

(GLMs are linear models because the mean parameter, or some form of it, is still linearly related to the explanatory variables. Linear here refers to the β coefficients appearing in a linear combination, i.e. the terms are just added together. β0 + β1 x1i + β2 x2i + . . . + βp xpi is also called the linear predictor.)

Which model/distribution we choose for the response, depends


on the type of response data: we would choose a Poisson model for
count data, and a binomial model for binary or binomial data. The
reason we need a different model (or distribution) for the response
is that count data, binomial or binary data don’t have a normal or
even symmetrical distribution. So a model for the mean that assumes
normally distributed error (scatter of the observations around the mean),
as linear regression models do, often will not work very well for
non-normal data. Further problems are that the range of the re-
sponse is limited to integers ≥ 0 for count data, 0 or 1 for binary
data, and limited to integers between 0 and n for binomial data.
It therefore makes sense to use a model specifically for such data,
and the simplest models are a Poisson model for count data and a
binomial model for binary or binomial data.
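To see concretely what the last two link functions do, the following R sketch (illustrative only) evaluates their inverses, which map any value of the linear predictor back to a legal rate or probability:

eta <- seq(-4, 4, by = 1)     # some values of the linear predictor
exp(eta)                      # inverse of the log link: always a positive rate
exp(eta) / (1 + exp(eta))     # inverse of the logit link: always in (0, 1)
plogis(eta)                   # the same inverse logit, using R's built-in function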

5.1.2 Parameter estimation


Parameter estimation in generalized linear models is mostly via the
method of maximum likelihood. This chooses the parameter values
which maximize the likelihood (probability) of the occurrence of

the observed data. The method of least squares is usually not
appropriate1, but for the special case of normal linear regression,
least squares and maximum likelihood turn out to be equivalent.

1 Least squares treats all error terms (residuals) equally. But this is
not appropriate if the variance (or uncertainty) of the observations is
not constant, as is often the case in GLMs.
as is often the case in GLMs.
5.2 Logistic regression

Logistic regression is used to model the effect of explanatory vari-


ables on a binary or binomial response. Typical examples of binary
data are presence/absence data, data on whether an individual has
survived or not, whether an individual has a disease or not, etc.
Sometimes, binary observations are grouped. For example, we can
record whether we see a certain species of bird every time we go
to a certain grid cell (binary observations). If we do this for many
grid cells, we would group the observations in one grid cell into a
binomial observation: the number of times we sighted the bird out
of the total number of times we visited that grid cell. For each grid
cell we then have a binomial observation (assuming the individual
visits to a grid cell are independent).
Another example is a germination experiment. If we have dif-
ferent trays with ten seeds each, and we use different treatments
on the trays to see which treatment results in the
highest germination rate, we would obtain one binomial observa-
tion from every tray (number of seeds germinated out of ten). Here
the tray is the experimental unit, and we make it the observational
unit by taking one binomial observation from each tray.
In the above examples, we are usually interested in the prob-
ability of sighting a species, or of germination. And we want to
understand how this probability changes with different environ-
mental variables or different treatments (the explanatory variables).
We want to understand how the probability of success relates to
the explanatory variables. If we know the probability of success, we
can determine the expected number of successes (= np), the fitted
value.
In linear regression we assumed that the actual observations
(response) are ‘scattered’ around the mean. The same holds true for
binary and binomial data: the observed numbers of successes are
scattered around the mean number of successes (np), now according
to a binomial distribution. Whether we take data to be binary
or binomial often depends on which groups of observations have
the same covariate patterns. (In logistic regression, for binary or
binomial data, we are trying to understand and model the probability
of ‘success’, pi. To obtain the expected or fitted value, Ŷi, from this
p̂i, we multiply by the known ni, where ni is the total number of
visits to site i, or the total number of seeds in tray i; ni = 1 for
binary observations.)

5.2.1 Distribution of grazing lawn


These data are part of a larger study in the Hluhluwe Umfolozi
Park, South Africa, to investigate how fire and grazers and their
interactions influence the landscape, in particular the distribution
of grasslands2. (2 S Archibald, WJ Bond, WD Stock, and DHK
Fairbanks. Shaping the landscape: fire-grazer interactions in an
African savanna. Ecological Applications, 15(1):96–109, 2005.) The
pixels from a Landsat satellite image of the reserve were classified
as grazing lawn, tall bunch grass or other. [Archibald et al., 2005]
then used a logistic regression model to
Applications, 15(1):96–109, 2005
114

understand how the distribution of lawn grass was related to fire


frequency (number of fires during the last 40 years), topography
and geological variables. Here, each pixel or point in the park is
an observational unit, and we want to model the probability of lawn
grass in terms of explanatory variables (fire frequency, topography,
and soil). The response is whether a pixel is predominantly lawn
grass (1) or other vegetation (0), i.e. a binary response.

Exploratory data analysis


Let’s start by looking at some features of binary data, using the
lawn grass example.

> dat <- read.csv("lawn.bunch.grass.csv", skip = 6)


> head(dat)
Indiv_id Groundtrth Size_id Community_ Grass_heig Andropog_c Tree_densi
1 3 1 2 BG tall grass andropogonoid NA
2 4 2 1 BG patchy andropogonoid NA
3 5 3 1 patchy short grass unclear NA
4 6 4 1 BG patchy andropogonoid NA
5 7 5 2 BG short grass andropogonoid NA
6 8 6 2 BG short grass andropogonoid NA
Geol_id Xval Yval Fire5698 Altitude Slope Aspect Rsp Trmi Twi
1 2 72135 -3126461 7 240 4 89 15 41 0.23
2 2 72160 -3126499 7 234 7 90 15 41 0.19
3 2 72172 -3126540 7 234 7 89 15 40 0.19
4 0 72186 -3126581 7 235 7 88 15 40 0.18
5 0 72202 -3126626 7 228 8 85 15 42 -0.03
6 0 72217 -3126666 7 228 8 85 15 42 -0.03

> names(dat)
[1] "Indiv_id" "Groundtrth" "Size_id" "Community_" "Grass_heig"
[6] "Andropog c" "Tree_densi" "Geol_id"
_ "Xval" "Yval"
[11] "Fire5698" "Altitude" "Slope" "Aspect" "Rsp"
[16] "Trmi" "Twi"
>

The classification (lawn grass (LG), bunch grass (BG), or other),


is in the Community_ variable. This is a categorical variable, so we can
summarize it by using a frequency table: the number of observa-
tions (or pixels) that have been classified into each of the categories.
We use the table() function to obtain a frequency table, and the
function prop.table on this table to obtain the percentage of obser-
vations in each category:

> table(dat$Community_)

BG LG other patchy Patchy unclear wallow


645 192 1 89 9 3 9
> options(digits = 2)
> prop.table(table(dat$Community_))

     BG     LG  other patchy Patchy unclear wallow
 0.6804 0.2025 0.0011 0.0939 0.0095  0.0032 0.0095

(We start with the original data here. We think that sifting through the data, deciding which variables are required, exploratory data analysis, missing values and all these real-life problems are an essential part of data analysis. And we hope that you can learn some of these data manipulation and critical thinking skills when we don't start with pre-processed data.)
Roughly 20% (192) of all pixels were classified as lawn grass.
We want to understand which factors predict the presence of lawn
grass. For the moment, we are not so much interested in the other
categories. So, we transform the categorical variable into a binary
variable (presence of lawn grass = 1, absence of lawn grass = 0)3.

3 Whenever we are interested in the factors that predict presence (of
an animal, plant, etc.) we also need observations from sites where the
animal/plant was absent.

The following code defines a new binary variable LG, coded as 1
when the pixel was classified as lawn grass, else 0. We then add this
new variable to the data frame.
new variable to the data frame. animal/plant was absent.

dat$LG <- ifelse(dat$Community_ == "LG", 1, 0)

LG is the response variable. Let’s look at its relationship with


slope:

[Figure 5.1: Occurrence of lawn grass (0 or 1) plotted against slope. Because the response is binary, and the resolution of the slope variable is not very high, the points all lie on top of each other (LHS panel); for the RHS panel they have been jittered.]

It can be quite difficult to see the relationship in plots with a


binary response. Because the response is either 0 or 1, all values lie
on one of these two lines. We can see slightly more if we jitter the
points, so that they don’t all lie on top of each other. The RHS plot
was created with exactly the same data as the LHS plot, but now
we can see the individual data points more clearly.
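The jittered panel is easy to produce with R's jitter() function; a sketch (the amount values are my own choice):

plot(jitter(dat$Slope, amount = 0.3), jitter(dat$LG, amount = 0.05),
     xlab = "Slope", ylab = "lawn grass (jittered)")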
We can see that for small values of slope, there is a fair number
of 1’s, but that the proportion of 1’s (relative to the number of ze-
ros) decreases with increasing slope, until there are no 1’s at high
levels of slope. Remember that we are interested in the probability
of lawn grass and how this changes with slope. The probability of
lawn grass corresponds to the probability of a 1.
In this particular case it seems that there are only a few distinct
values of slope. This is not usually the case for continuous explana-
tory variables. But here we can exploit this feature and calculate the
proportion of 1s (lawn grass) at each value of slope, to get a better
idea of how proportion or probability of lawn grass changes with
slope.

t1 <- with(dat, table(LG, Slope))



(t2 <- prop.table(t1, 2))


par(mar = c(5.5, 5.5, 1, 1), mgp = c(4, 1, 0))
plot(0:19, t2[2,], las = 1, cex.axis = 1.5, cex.lab = 1.5,
ylab = "proportion lawn grass", pch = 19, col = "firebrick3",
xlab = "slope")


[Figure 5.2: Proportion of pixels with lawn grass at each value of slope (horizontal axis: slope, 0 to 15; vertical axis: proportion lawn grass, 0.00 to 0.35).]

This confirms our initial observation that the probability of lawn


grass decreases with increasing slope.

5.2.2 Logistic regression model for binary data


Binary data is a special case of binomial data: all ni = 1. The
response variable Y is of the form:

    Y = 1 for success, 0 for failure.

(If we are more interested in the failures, e.g. which factors make
failure more likely, we can code failure as 1 and non-failure as 0. In
any case, we construct a model for the probability of a 1.)
Y is still the number of successes (out of 1 trial).

Yi ∼ Bin(1, pi )
A binomial distribution with n = 1 is also called a Bernoulli
distribution, i.e. Yi ∼ Bernoulli ( pi ).
As we saw in the introduction to GLMs, we need to define a ran-
dom part (the distribution of the observations given the probability
of success), and a systematic part (which explanatory variables in-
fluence the parameter of interest). The parameter of interest is the
probability of success, pi .
The most natural random model for binary data is a Bernoulli
distribution, which is just a binomial distribution with n = 1.

Yi ∼ Bernoulli ( pi )

Yi is the ith observation (0 for failure, 1 for success), pi is the


probability of success for the ith observation. There is a subscript

on p, because each observation may have a different covariate pattern;


the probability of success may be different for every observation,
depending on the values of the covariates.
For the moment let's assume that we have a single continuous explanatory variable X. The probability of success usually is not linearly related to explanatory variables⁴, but the log-odds of success often are, so that the logit link function is a reasonable model:

    logit(pi) = log( pi / (1 − pi) ) = β0 + β1 Xi

Other link functions than the logit are possible, but the logit link is by far the most commonly used, and the only one we will cover in this course. This transformation or link function leads to the name logistic regression. Although logit(pi) is linearly related to X, the probability of success is not. logit(pi) = log( pi / (1 − pi) ) is called the logit transformation of pi, or the log-odds. The link function is a transformation of the parameter we want to model that can then be linearly related to the explanatory variables.
Above we have the three parts of the generalized linear model (a logistic model in this case): the model for the response, the link function, and the linear predictor (ηi = β0 + β1 Xi; η = eta).

⁴ It is not reasonable to assume that a probability increases or decreases linearly with any X variable, because it is restricted to lie between 0 and 1.

(In statistics, whenever we write log we mean, or imply, the natural logarithm, ln. So log-odds, log-likelihood, log-transformation, etc. almost always refer to natural logarithms; here log(p/(1 − p)) means ln(p/(1 − p)).)
The logit transformation of the parameter ensures that:

• when transforming back to the probability of success, all predicted probabilities lie between 0 and 1;

• logit(pi) can take on any value between −∞ and ∞, so no matter how extreme the x-values become, the predicted value will never be illegal on the log-odds (logit) scale;

• the logit transformation of pi (the log-odds) is often linearly related to the explanatory variables, whereas pi is not.
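In R the logit and its inverse are one-liners; the sketch below defines them by hand (base R's qlogis() and plogis() implement the same two functions):

----------------------------------------------------------
logit <- function(p) log(p / (1 - p))                 # log-odds; same as qlogis(p)
inv.logit <- function(eta) exp(eta) / (1 + exp(eta))  # back-transform; same as plogis(eta)

logit(c(0.1, 0.5, 0.9))    # -2.197  0.000  2.197
inv.logit(c(-5, 0, 5))     # approx. 0.007  0.500  0.993
----------------------------------------------------------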

In summary, the logistic model is:

Yi ∼ Bin(ni , pi )

logit( pi ) = β 0 + β 1 xi
By fitting this model to the data we mean that we search for
those values of β 0 and β 1 which best describe the observed change
in relative frequency in 1s compared to 0s with increasing slope
(slope is the explanatory variable in the lawn grass example). With
the logit link we specify that this relationship is s-shaped (between
probability of success and explanatory variable, and linear between
logit( p) and the explanatory variable). Estimates for the β param-
eters are found by using the method of maximum likelihood, i.e.
by finding those values of β 0 and β 1 that make the observed values
most likely, given the specified model.

5.2.3 Odds, odds ratios and log-odds


The β coefficients in logistic regression estimate the change in the
log-odds of success per unit increase in the explanatory variable.
exp( β i ) estimates the factor by which the odds (of success) change
per unit increase in the explanatory variable.
exp( β i ) is an odds-ratio because it compares odds at 2 levels (at
x0 + 1 vs x0 ). This odds ratio is assumed to be constant (by assum-
ing a logistic regression model), i.e. it is the same at every value
of X. The odds-ratio can be understood as the factor by which the
odds of success change for every one unit increase in the explana-
tory variable. The odds-ratio (exp( β i )) does not tell us anything
about the current probability of success, or the odds, only the
change in odds when the explanatory variable increases by one
unit. (The odds of an event occurring are defined as Odds = P(event)/(1 − P(event)) = p/(1 − p).)

    Prob   Odds             log(Odds)
    0.01   1:99 = 0.0101    -4.59
    0.1    1:9  = 0.1111    -2.20
    0.5    1:1  = 1           0
    0.6    3:2  = 1.5         0.41
    0.9    9:1  = 9           2.20
    0.99   99:1 = 99          4.59
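The table above can be reproduced in a few lines of R (a small sketch):

----------------------------------------------------------
p <- c(0.01, 0.1, 0.5, 0.6, 0.9, 0.99)
odds <- p / (1 - p)                       # odds of success
round(data.frame(p, odds, log.odds = log(odds)), 2)
----------------------------------------------------------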

5.2.4 Parameter estimation


Fitting a logistic model with multiple explanatory variables:

logit( pi ) = β 0 + β 1 X1i + β 2 X2i + ... + β k Xki


requires estimation of the k + 1 parameters, β 0 , β 1 , . . . , β k . Pa-
rameter estimation in generalized linear models is done using the
method of maximum likelihood. Note that there is no variance
parameter.

Likelihood
Likelihood is defined as the probability of the observed data given the assumed model, viewed as a function of the unknown parameters. For one binomial data point (let's say an observed 3 successes in 10 trials) the likelihood function would be

    L(p) = P(Y = y) = (10 choose 3) p³ (1 − p)^(10−3)

Figure 5.3 shows the likelihood function for the parameter p, when we have observed 3 out of 10 successes, assuming a binomial model for the outcome. The unknown parameter p is the probability of success.
For many independent observations, the probabilities can be multiplied and the likelihood function becomes:

    L(λ) = ∏_{i=1}^{n} P(yi) = ∏_{i=1}^{n} pi^(yi) (1 − pi)^(1−yi)

[Figure 5.3: Likelihood function for p, the probability of success, based on a binomial observation of 3 successes in 10 trials. The vertical line indicates the maximum likelihood and the corresponding maximum likelihood estimate (p̂ = 0.3).]
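A short sketch reproducing Figure 5.3, using dbinom() to evaluate the binomial probability as a function of p:

----------------------------------------------------------
p <- seq(0, 1, length = 200)
L <- dbinom(3, size = 10, prob = p)   # likelihood of 3 successes in 10 trials
plot(p, L, type = "l", xlab = "p", ylab = "L(p)")
abline(v = 3/10, lty = 2)             # maximum likelihood estimate, p-hat = 0.3
----------------------------------------------------------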

If the observations are a random sample, we can often assume independence of the observations. The λ in the likelihood function above may be a vector of parameters, e.g. (β0, β1).⁵ Taking logs gives the log-likelihood

    log(L(λ)) = ℓ(λ) = ∑_{i=1}^{n} [ yi log(pi) + (1 − yi) log(1 − pi) ]

⁵ For each pi we substitute the inverse logit transformation (or the model for pi): pi = exp(β0 + β1 xi) / (1 + exp(β0 + β1 xi)).

The log-likelihood is mathematically much easier to work with


(mainly because we can work with sums instead of products).
Maximum likelihood estimation involves choosing those values
of the parameters that maximize the likelihood or, equivalently,
the log-likelihood. Intuitively, this means that we have chosen the
parameter values which maximize the probability of occurrence of
the observations.
For the lawn grass example, and slope as the only explanatory
variable, we would write the model as follows:

Yi ∼ Bin(ni , pi )

logit( pi ) = β 0 + β 1 × slopei
where Yi is presence (1) or absence (0) of lawn grass, all ni = 1,
and pi denotes the probability of lawn grass in pixel i.
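To make "finding those values that make the observed values most likely" concrete, here is a minimal sketch that maximizes the Bernoulli log-likelihood numerically (assuming complete-case vectors y = dat$LG and x = dat$Slope); glm() does the same job with a more refined algorithm:

----------------------------------------------------------
negloglik <- function(beta, y, x) {
  p <- plogis(beta[1] + beta[2] * x)        # inverse logit of the linear predictor
  -sum(y * log(p) + (1 - y) * log(1 - p))   # negative Bernoulli log-likelihood
}
fit <- optim(c(0, 0), negloglik, y = y, x = x)
fit$par   # should be very close to coef(glm(y ~ x, family = binomial))
----------------------------------------------------------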

5.2.5 Logistic regression in R


To fit a logistic regression model in R we use the function glm (gen-
eralized linear model). The response variable (now binary) goes on
the left, the explanatory variables to the right of the ∼. We specify
the model for the response as family = binomial. The logit link
function is the default for this family. R’s glm() function is very
similar to its lm() function, the only extra bit is the family state-
ment.

----------------------------------------------------------
> m1 <- glm(LG ~ Slope, family = binomial, data = dat)
> summary(m1)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5461 0.1151 -4.74 2.1e-06
Slope -0.2773 0.0348 -7.96 1.7e-15

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 954.93 on 946 degrees of freedom


Residual deviance: 866.69 on 945 degrees of freedom
(1 observation deleted due to missingness)
AIC: 870.7

Number of Fisher Scoring iterations: 5


----------------------------------------------------------

Interpreting the coefficients


The coefficient estimates given in the R output are on the linear
predictor scale (logit or log-odds scale), i.e. estimates of β 0 and β 1 :

    LÔ = logit(p̂i) = log( p̂i / (1 − p̂i) ) = β̂0 + β̂1 × slopei
Assuming that the logistic regression model is valid, we can
interpret the above estimates as follows: the log-odds for lawn grass
decrease by 0.2773 per unit increase in slope (let’s assume one unit
corresponds to 1 degree).
The odds of lawn grass, pi/(1 − pi), i.e. the probability of lawn grass vs the probability of other vegetation, change by a factor exp(−0.2773) = 0.76 per degree increase in slope; i.e. the odds of a pixel being lawn grass decrease by 24% for every unit increase in slope. This means that lawn grass is more common in flat areas.
The log-odds decrease at a constant rate. Also the odds ratio
is constant, which means that the odds change by a constant fac-
tor. However, the probability does not decrease at a constant rate;
its rate of change depends on its value, according to an s-shaped
curve.
Therefore, most of the time we don’t directly calculate the ef-
fect of the coefficients on the probability, but rather on the odds or
log-odds. However, we often need to predict the probability of a
success (lawn grass) given values of the covariates. We saw above
how to calculate the odds and the log-odds. Predicting the proba-
bility is a bit more complicated:

    p̂i = exp(β̂0 + β̂1 × slopei) / (1 + exp(β̂0 + β̂1 × slopei))
or

    p̂ = e^LO / (1 + e^LO)
where LO denotes the log-odds.
To calculate the probability of lawn grass in pixels with zero slope, let's first calculate

ηi = β̂ 0 + β̂ 1 × slopei
= −0.5461 − 0.2773 × 0
= −0.5461

Then

    p̂ = exp(−0.5461) / (1 + exp(−0.5461)) = 0.37
So we would expect about 37% of the pixels with slope 0 to have
lawn grass. We have already seen from the plot that lawn grass is

very rare in pixels with high values of slope. But let’s calculate the
probability of lawn grass in a pixel with a slope of 5:

odds = exp(−0.5461 − 0.2773 × 5) = exp(−1.9326) = 0.145

and

    p̂ = 0.145 / (1 + 0.145) = 0.126
i.e. lawn grass is expected to be present in only about 13% of
pixels with slope 5. A good way to understand how the probability
of success is related to the explanatory variable is to add the fitted
line to the plot:

[Figure 5.4: Jittered observations of presence/absence of lawn grass, and fitted logistic regression line for the probability of lawn grass, in relation to slope.]

plot(jitter(LG) ~ jitter(Slope), data = dat, las = 1,


xlab = "Slope", pch = 20, col = "firebrick3",
ylab = "lawn grass", cex.lab = 1.5, cex.axis = 1.5, yaxt = "n")
axis(2, at = seq(0, 1, by = 0.1), cex.axis = 1.5, las = 1)

x <- seq(0, 20, length = 500)


y <- predict(m1, newdata = data.frame(Slope = x), type = "response")
lines(x, y, col = "blue", lwd = 2)
abline(h = 0, lty = 2)
abline(h = 1, lty = 2)

To obtain the fitted curve using R, we predict the probability for the entire range of slope values (here x). In the predict() function we specify type = "response" to obtain predictions on the probability scale. The name may be a bit misleading: it refers to the original parameter related to the response, pi, as opposed to the transformed parameter on the log-odds or linear predictor scale. The fitted curve does not look s-shaped, but this is only because we see only the lower end of the curve.
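As a quick check of the hand calculations above (a sketch, assuming the fitted model m1; output shown approximately):

----------------------------------------------------------
predict(m1, newdata = data.frame(Slope = c(0, 5)), type = "response")
##     1     2
## 0.367 0.126    # about 37% at slope 0 and 13% at slope 5
----------------------------------------------------------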

Confidence intervals
Maximum likelihood estimates are asymptotically normally
distributed. This means that the normal approximation is relatively
good in large data sets. In such cases we can construct a normal
confidence interval for the log-odds, and then, to obtain confidence
intervals for the odds or probability, we transform these confidence
limits to the wanted scale.
For example, a 95% confidence interval for the effect of slope on the log-odds would be:

    −0.2773 ± 1.96 × 0.0348 = [−0.35; −0.21]

So, a 95% confidence interval for the true change in log-odds per unit increase in slope is [−0.35; −0.21]. This means that we are fairly sure that the log-odds decrease by between 0.21 and 0.35 units per unit increase in slope, but there is a small chance that the true value lies outside of the interval. To obtain confidence limits for the corresponding odds ratio we exponentiate the above confidence limits:

    [exp(−0.35); exp(−0.21)] = [0.70; 0.81]

This means that we are 95% confident that the odds of lawn grass decrease by between 20 and 30% per unit increase in slope.

(A 95% CI for a parameter, based on a normally distributed estimate (e.g. a regression coefficient or MLE), can be calculated as estimate ± 1.96 × SE(estimate). In GLMs such intervals are called Wald intervals; "Wald" implies that we have assumed a normal distribution for the MLEs.)

Confidence intervals in R:

-----------------------------------------------
> ## Wald intervals
> confint.default(m1)
2.5 % 97.5 %
(Intercept) -0.77 -0.32
Slope -0.35 -0.21
> exp(confint.default(m1)) ## CI for the odds ratio
2.5 % 97.5 %
(Intercept) 0.46 0.73
Slope 0.71 0.81

> ## profile likelihood intervals


> confint(m1)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -0.77 -0.32
Slope -0.35 -0.21
> exp(confint(m1)) ## profile likelihood intervals
Waiting for profiling to be done... ## for the odds ratio
2.5 % 97.5 %
(Intercept) 0.46 0.72
Slope 0.71 0.81
-----------------------------------------------

Alternatively, we can construct profile likelihood intervals. These don't make the assumption of normally distributed MLEs, and are based directly on the likelihood function⁶. Profile likelihood intervals are always at least as good as Wald intervals and much better when the assumption of normality does not hold, e.g. with small sample sizes, or for parameter estimates that are close to the boundary, e.g. small probabilities. In the lawn grass example, Wald and profile likelihood confidence intervals are identical, at least to 2 decimal places (we had a large number of pixels, N > 700).

⁶ The profile likelihood of any one parameter is found by maximizing out the other parameters.

5.2.6 Model Checking


As in normal regression models, we need to check whether the
model we have chosen to fit to the data provides a reasonable de-
scription of the structure in the data, or a reasonable summary of
our data. If not, we need to specify a model that does.
Let’s first look at the assumptions we are making about the data
when we choose a logistic regression model, i.e. what exactly do we
assume when we assume that the response can be modelled using a
binomial distribution?

Assumptions made in logistic regression


When we assume a binomial distribution for a random variable
we are assuming that:

(1) there is a constant probability of success, i.e. all of the ni trials (of one binomial observation) have the same probability of success, and this does not change over time;

(2) the ni trials are independent, i.e. the outcome of one trial is not influenced by the outcome of the others;

(3) the final number of trials does not depend on the number of successes.

There are many ways in which the above assumptions (espe-


cially the first two) are violated in reality. For example, when the
response is the number of times a rare bird was sighted at a partic-
ular site, out of n visits, it could be that it becomes known in which
tree the bird is to be found, and then all visits will sight the bird,
until it moves to another tree later in the season. Or if one observer
knows where the bird is to be found, and several of the n trials are
from this observer. Violations of this type often show themselves in
over- or underdispersion. We will come back to this in Section 5.2.7.
The other assumptions refer more to the relationship between the
response and the explanatory variables.
As in linear regression, probably the most important assumption
to check is that the relationship between the response, or rather the
modelled parameter (probability of success in logistic regression),
and the explanatory variables is adequately captured by the model.
To check this in logistic regression we plot the observed proportions

(if we have proportions) against every explanatory variable, with the fitted line superimposed onto this plot. For binary data we mostly don't have proportions, even though it may be possible to calculate proportions for sections of the explanatory variable (as we did in the lawn grass example). If the fitted line does not describe the observed relationship very well, we must change our model. A bunch of outliers can also point to a misspecification of the relationship.

[Figure 5.5: Fitted logistic regression curve for the lawn grass data. The points are observed proportions of lawn grass out of all pixels for each distinct value of slope.]

Based on a visual comparison of the observed proportions and the fitted probability, the lawn grass model works very well (Figure 5.5).
As in linear regression, we need to check that there are no in-
fluential observations (a few single observations that have a large
influence on the parameter estimates).
Also, as in linear regression, the residuals must be independent,
which means that we must account for spatial, serial or block-
ing structures in the model. Often, the observations are a random
sample from a large population, in which case it is reasonable to
assume independent observations.

Residuals in Logistic Regression


For large n (or large counts) we can construct residuals in such a
way that they should be approximately normally distributed. These
don’t work for small n, especially not when n = 1 (as in binary
data).
Two types of residuals are commonly used in GLMs: Pearson
and deviance residuals.
1. Pearson residuals:

    ri = (observed − fitted) / SE(fitted)
2. Deviance residuals:
The deviance of a model is calculated as

D = −2 (ℓ(model) − ℓ(saturated model))


i.e. twice the difference in the log-likelihood between the current
model and the model that fits the data perfectly. The log-likelihood
is usually the sum of the log-likelihoods of the individual observa-
tions. So the deviance residuals are the contributions of the individ-
ual observations to the overall deviance.
Both Pearson and deviance residuals are further standardized by dividing by √(1 − hi), where hi is a measure of the leverage (extremity in x-variables). This ensures that the residuals are approximately distributed as N(0, 1), but the approximation is only good for large ni and counts.
In R, Pearson and deviance residuals can be extracted from the
fitted model as:

-------------------------------------------

residuals(model.x, type = "pearson")


residuals(model.x, type = "deviance")
-------------------------------------------

Residuals should not show systematic patterns or trends when


plotted against fitted values or explanatory variables. With binary
data, we cannot use the residuals for model checking, especially
with a single explanatory variable. The residuals come in two lines
(one for the 0s one for the 1s, see Figure 5.6), and this doesn’t really
tell us anything. Also, we don’t expect them to be normally dis-
tributed. We can look for outliers, but it may be easier to detect the
outliers in a plot of presence/absence vs the explanatory variables.

[Figure 5.6: Residual plots for the lawn grass vs slope model. Panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.]

With larger binomial n or large counts, the residual plots become more useful. Then, as in linear regression models, we check whether there is any remaining structure in the data that our model hasn't picked up: changes in the mean of the residuals indicate that the relationship between the parameter of interest and the explanatory variables has been misspecified; changes in the variance of the residuals indicate that the variance is not adequately described by the assumed model; skew distributions of the residuals indicate that the large observations are not adequately fitted by the model. And of course we need to check for influential observations.
Overdispersion and underdispersion is another form of misspec-
ification of the model common in logistic and Poisson regression

models. This is not a misspecification of the structural part and will


often not be picked up in residual plots.

5.2.7 Overdispersion and Underdispersion


Overdispersion is common for both binomial and count data. Both the binomial and Poisson distributions have a single parameter: the probability of success, and the average rate, respectively. They don't have an extra variance parameter to describe the variability separately, as the normal distribution does. Instead, the variance is restricted and directly related to the one parameter.
Real data will sometimes not behave according to such a very
restricted distribution, and will exhibit larger (overdispersion) or
smaller (underdispersion) variability than the model can account
for.
Overdispersion often comes from clustering, groups of obser-
vations with similar values. Underdispersion is less common, but
may occur, for example in a spatial context in counts per grid cell:
if the animals have territories, the count of animals per grid cell
will be much less variable than predicted by a Poisson distribution.
On the other hand, if animals move in packs and packs have large
territories, the count per grid cell will be much more variable than
predicted by a Poisson distribution (overdispersed).
In a previous section (assumptions of logistic regression model)
we have seen a few factors that can cause overdispersion in bino-
mial data.
In generalized linear models overdispersion shows itself in resid-
ual deviance much larger than the residual degrees of freedom.
However, large residual deviance also occurs when we have failed
to measure, and include in the model, some important explana-
tory variables, i.e. large residual deviance does not always mean
overdispersion.
When we assume a binomial model we assume that the variabil-
ity of the observations given a particular probability of success, pi
is

Var (Yi ) = ni pi (1 − pi )

In other words Yi ∼ Bin(ni , pi ) defines, through the pi parameter,


both the mean and variance of the observations. If the variability
(uncertainty) of the data is in reality larger than this we are over-
estimating the precision of our estimates (e.g. the fitted line for pi ),
i.e. the standard errors are too small.

What to do about overdispersion?


One solution to overdispersion or underdispersion is to fit a
quasi-binomial (or quasi-Poisson) model. This will estimate the
variability separately, and adjust estimates of standard errors ac-
cordingly, i.e. standard errors will incorporate the extra uncertainty,
and our estimates will be more conservative.

Quasi models do not define a model/distribution for the data,


only a mean-variance relationship. Therefore one cannot define a
likelihood function for quasi-models, and therefore no AIC values
can be calculated. However, they are a way to obtain better, or more
realistic, estimates of the uncertainty.
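A minimal sketch of such a quasi-binomial fit for the lawn grass model (the linear predictor is unchanged; only the variance assumption differs):

----------------------------------------------------------
mq <- glm(LG ~ Slope, family = quasibinomial, data = dat)
summary(mq)$dispersion   # estimated dispersion; values near 1 suggest no over/underdispersion
----------------------------------------------------------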

5.2.8 Goodness of Fit


To assess the overall fit of a logistic regression model we can compare the maximum likelihood under a full or saturated model⁷ with the maximum likelihood under the current model. The difference between the current and saturated model is measured by the deviance:

    D = −2(ℓc − ℓf)

where ℓc is the (maximized) log-likelihood of the current model, and ℓf is the log-likelihood of the saturated model.

⁷ A saturated model is one with the number of parameters equal to the number of observations; it hence fits the data perfectly, with no residuals.
This gives a rough idea of the extent to which the current model
adequately represents the data. The deviance will be large when
Lc is small relative to L f and will be small when the current model
explains the data nearly as well as the full model. It can be shown
that asymptotically D ≈ χ2n−k−1 , where n is the number of ob-
servations, and k + 1 is the number of parameters estimated. This
approximation holds only for large n, and is not a good measure of
fit for binary data.
So, a rough measure of fit (but not for small data sets, or small
ni ), is obtained by comparing the residual deviance to the corre-
sponding degrees of freedom. They should be roughly the same.
When the residual deviance is much larger than the degrees of
freedom, the model leaves much of the variability in the response
unexplained. When we checked for overdispersion (Section 5.2.7)
we used the same check.

Percentage deviance explained


One can also calculate the percentage of (total) deviance ex-
plained, similarly to R-squared in regression. The null deviance
measures the maximum (total) deviance: distance between a model
with only an intercept and the saturated model.

    % deviance explained ≈ (Dnull − Dcurrent) / Dnull
The change in deviance (Dnull − Dcurrent ) measures the amount
of deviance explained, or reduction in deviance, when adding some
extra terms to the model. Percentage deviance explained does not
require large-sample conditions, and can be used as a rough mea-
sure of goodness-of-fit for any model.
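From a fitted glm object this is one line of R (a sketch, using the model object m1 shown in the output below):

----------------------------------------------------------
(m1$null.deviance - m1$deviance) / m1$null.deviance   # proportion of deviance explained
----------------------------------------------------------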
Here is part of the summary from the logistic regression model
for presence of lawn grass on slope:

----------------------------------------------------------

> m1 <- glm(LG ~ Slope, family = binomial, data = dat)


> summary(m1)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5461 0.1151 -4.74 2.1e-06
Slope -0.2773 0.0348 -7.96 1.7e-15

Null deviance: 954.93 on 946 degrees of freedom


Residual deviance: 866.69 on 945 degrees of freedom
----------------------------------------------------------

This model explains

    % deviance explained = (954.93 − 866.69) / 954.93 = 0.09
roughly 9% of the total deviance (total distance to saturated
model). What this means is that slope alone helps to discriminate
between some of the 0s and 1s but by no means will give a perfect
prediction. It does not mean that only 9% of the data are correctly
predicted. However, often for binary logistic regression we would
like to have some estimate of how well we will be able to predict
presence/absence given specific values of the covariates, i.e. how
useful is the model for predicting whether a pixel will be covered
by lawn grass or not, i.e. what percentage is classified correctly,
what percentage incorrectly, given the slope?

Area under the curve (AUC)


Let's look at this prediction problem a bit more closely. From the above model we have learned that the probability of lawn grass is higher with small slopes. However, it is only about 0.35 when the slope is zero (Figure 5.5). How does one predict presence or absence of lawn grass given the probability of lawn grass? One could predict presence of lawn grass for cells with predicted probability > 0.5, and absence if the predicted probability is < 0.5. However, here the predicted probability is never so high, so we would always predict absence. This is not a useful prediction. But let's see how many pixels would be classified correctly. For this we construct a confusion matrix:
> library(SDMTools) ### Tools for species distribution models


>
> obs <- dat$LG[!is.na(dat$Slope)]
> pred <- predict(m1, type = "response" )
>
> mat <- confusion.matrix(obs, pred, threshold = 0.5)
> print(mat)
     obs
pred   0   1
   0 755 192
   1   0   0
attr(,"class")
[1] "confusion.matrix"

This matrix shows the observed number of pixels classified as 0 and 1, and the predicted number of pixels classified as 0 and 1, when the cutoff is 0.5. In this case all are predicted to be 0s. The proportion of correctly predicted pixels is 755/(755 + 192) = 0.80, i.e. 80%. Although this sounds good, this approach is not very useful: just by always predicting no lawn grass we are correct 80% of the time, because lawn grass is fairly rare. What about other thresholds?
We need to define 2 more terms:
Sensitivity is the proportion of true positives predicted correctly
as positives (0 out of 192 in table above).
Specificity is the proportion of true negatives predicted correctly
as negatives (755 out of 755 in table above).
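Both quantities are easy to compute directly in base R (a sketch, using the obs and pred vectors defined above):

----------------------------------------------------------
threshold <- 0.5
pred.class <- as.numeric(pred > threshold)
sens <- sum(pred.class == 1 & obs == 1) / sum(obs == 1)  # true positives / all positives
spec <- sum(pred.class == 0 & obs == 0) / sum(obs == 0)  # true negatives / all negatives
c(sensitivity = sens, specificity = spec)                # here: 0 and 1
----------------------------------------------------------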
Sensitivity and specificity change when we change the threshold. A receiver operating characteristic (ROC) curve plots sensitivity against specificity, for different probability thresholds. The area under the curve (AUC) is used as a measure of how well the classifier works. The 45 degree line from (0, 0) to (1, 1) corresponds to a random guess. A good classifier/model has a ROC curve that extends far into the top left corner of the plot. The AUC gives the probability that a randomly chosen positive is ranked higher (higher predicted probability) than a randomly chosen negative, or, if you have a negative and a positive, the proportion of time the model will classify/rank them correctly. Tests/models with an area under the curve of 0.5 to 0.7 have low accuracy, 0.7 to 0.9 moderate accuracy, and > 0.9 high accuracy⁸. A random guess model will have an AUC of 0.5.
There is a trade-off between specificity and sensitivity, i.e. if we always predict absence of lawn grass, sensitivity is zero, but specificity is 1. Often we want to choose a threshold probability that gives both high specificity and high sensitivity, i.e. the point on the curve closest to the top left corner.

⁸ Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006; and P Collinson. Of bombers, radiologists, and cardiologists: time to ROC. Heart, 80(3):215–217, 1998.

---------------------------------------------------------
> auc(obs, pred) # area under the curve
[1] 0.7

> library(pROC)

> roc1 <- roc(obs, pred)


> plot(roc1, xaxs = "i", yaxs = "i", las = 1,
cex.axis = 1.5, cex.lab = 1.5)

Call:
roc.default(response = obs, predictor = pred)

Data: pred in 755 controls (obs 0) < 192 cases (obs 1).
Area under the curve: 0.7
---------------------------------------------------------

AUC is a useful tool to evaluate the accuracy of a binary logistic regression model (which aims to predict presence/absence, or success/failure). Our model for lawn grass has low to medium accuracy. There are likely some important variables that determine presence or absence of lawn grass which we haven't yet included in the model.

[Figure 5.7: Receiver operating characteristic (ROC) curve for the logistic lawn grass model, with slope as the only explanatory variable. The area under the curve (AUC = 0.7) is a measure of how well the model correctly predicts presence/absence of lawn grass.]

5.2.9 Model selection

The same principles hold as for regression models when comparing or choosing between models. The two most commonly used methods for comparing GLMs are likelihood ratio tests and Akaike's Information Criterion, AIC (or similar criteria). The likelihood ratio test can only be used for nested models, and is valid only asymptotically, i.e. large ni and large N (number of observations). The AIC can be used for comparing non-nested models and is valid for small ni and small N; therefore we much prefer the AIC.

Likelihood ratio test
When one model contains terms that are additional to those in another model, the two models are said to be nested⁹. The difference in deviance between the two nested models measures the extent to which the additional terms improve the fit.
For example, let's say model 1 has:

    logit(p) = β0 + β1 x1 + β2 x2

and model 2 has:

    logit(p) = β0 + β1 x1 + β2 x2 + β3 x3 + β4 x4

Then model 1 is nested in model 2. If the deviance of model 1 is D1 with degrees of freedom p = n − 3, and the deviance of model 2 is D2 with degrees of freedom q = n − 5, then D1 − D2 measures the change in deviance due to the variables x3 and x4 (after x1 and x2 have already been included in the model).
The change in deviance has an approximate χ²(p − q) distribution, where p − q (here 2) is the difference in the number of parameters between the two models. Again, the approximation holds only for large ni and N.

⁹ The smaller model is nested in the larger model.
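R's anova(model1, model2, test = "LRT") carries out this test; the same computation by hand is a short sketch (for hypothetical nested fits fit1 and fit2):

----------------------------------------------------------
D.change <- deviance(fit1) - deviance(fit2)      # fit1 nested in fit2
df <- df.residual(fit1) - df.residual(fit2)      # difference in number of parameters
pchisq(D.change, df = df, lower.tail = FALSE)    # approximate p-value of the LRT
----------------------------------------------------------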

5.2.10 Presence of lawn grass: multiple predictors


The logistic model with only slope was not terribly good at pre-
dicting which pixels would be covered by lawn grass, even though
the presence of lawn grass clearly was related to slope. Let’s see
if we can improve the predictions for the presence of lawn grass if
we add more predictor or explanatory variables. [Archibald et al.,
2005] were interested in the role of fire (and environmental variables that are known to influence grazing) on the presence of lawn grass. Let's look at the relationship between presence of lawn grass and fire:

[Figure 5.8: Presence/absence of lawn grass vs fire frequency (per 40 yrs). In the centre plot both the response (0 or 1) and the fire frequency values were jittered to bring out all observations. On the RHS, for every distinct value of fire frequency, the proportion of pixels with lawn grass is plotted.]

--------------------------------------------------------------------------------------------
### presence/absence of lawn grass vs fire frequency
plot(LG ~ Fire5698, data = dat, las = 1, xlab = "fire frequency (per 40 yrs)",
ylab = "lawn grass", cex.lab = 1.5, cex.axis = 1.5, yaxt = "n")
axis(2, at = c(0, 1), cex.axis = 1.5, las = 1)

plot(jitter(LG) ~ jitter(Fire5698), data = dat, las = 1,


xlab = "fire frequency (per 40 yrs)", pch = 20, col = "firebrick3",
ylab = "lawn grass", cex.lab = 1.5, cex.axis = 1.5, yaxt = "n")
axis(2, at = c(0, 1), cex.axis = 1.5, las = 1)

t1 <- with(dat, table(LG, Fire5698)) ## frequency table


(t2 <- prop.table(t1, 2)) ## convert to proportions
xs <- c(0, 1, 3, 5:20)
plot(xs, t2[2,], las = 1, cex.axis = 1.5, cex.lab = 1.5, ylab = "proportion lawn grass",
xlab = "fire frequency (per 40 yrs)", pch = 16, col = "firebrick3", cex = 1.5)
--------------------------------------------------------------------------------------------

From the above plot it seems that lawn grass is absent or rare at both very low and very high values of fire frequency. From the RHS plot it seems that there is an intermediate fire frequency at which lawn grass is most common. This suggests a quadratic effect of fire frequency on the presence of lawn grass. Let's try all three models: with linear, quadratic and cubic effects. Just as in regression, all lower-order terms should always also be present in the model; e.g. when we add the cubic term we also keep the intercept, linear and quadratic terms in the model.

--------------------------------------------------------------------------------------------
## three models with linear, quadratic, cubic effect of fire

m1 <- glm(LG ~ Fire5698, family = binomial, data = dat)


summary(m1)

m2 <- glm(LG ~ Fire5698 + I(Fire5698^2), family = binomial, data = dat)


summary(m2)

m3 <- glm(LG ~ Fire5698 + I(Fire5698^2) + I(Fire5698^3), family = binomial, data = dat)


summary(m3)
--------------------------------------------------------------------------------------------

These three models are nested, so we can compare them using an analysis of deviance, which checks whether the deviance has decreased by more than it would anyway through using the extra parameters, i.e. whether the extra parameters contribute considerably to increasing the likelihood.

--------------------------------------------------------------------------------------------
> anova(m1, m2, m3, test = "LRT")
Analysis of Deviance Table

Model 1: LG ~ Fire5698
Model 2: LG ~ Fire5698 + I(Fire5698^2)
Model 3: LG ~ Fire5698 + I(Fire5698^2) + I(Fire5698^3)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 945 954
2 944 921 1 33.1 8.6e-09
3 943 921 1 0.0 0.86

> anova(m1, m2, m3, test = "Chisq")


--------------------------------------------------------------------------------------------

The likelihood ratio test compares the two models by checking the change in deviance when adding the additional parameters. The test = "Chisq" option produces exactly the same results as test = "LRT", because the likelihood ratio test is a chi-squared test. Model 1 is nested in model 2, and both model 1 and model 2 are nested in model 3; each model uses one extra parameter (Df). We now look at the change in the residual deviance (the larger model will always have smaller residual deviance). Adding the quadratic term reduces the deviance by 33.1 with 1 degree of freedom. This is a considerable change in deviance, also indicated by a very small p-value; it is strong evidence that the extra quadratic term greatly improves the model (helps to explain the observed values). The additional cubic term changes the deviance hardly at all, and is therefore not worth having in the model. So, from this analysis of deviance table we would select the quadratic model.
Next we plot the predicted probabilities for the presence of lawn grass against fire frequency; we do this for all three models, just to visualize these predictions.

--------------------------------------------------------------------------------------------

## calculate predicted lines for probability of lawn grass

x <- seq(0, 21, length = 500)


y <- predict(m1, newdata = data.frame(Fire5698 = x), type = "response")
y2 <- predict(m2, newdata = data.frame(Fire5698 = x), type = "response")
y3 <- predict(m3, newdata = data.frame(Fire5698 = x), type = "response")

lines(x, y, col = "blue", lwd = 2)


lines(x, y2, col = "darkgreen", lwd = 2)
lines(x, y3, lwd = 2)
--------------------------------------------------------------------------------------------

[Figure 5.9: Presence/absence of lawn grass vs fire frequency, with predicted lines for the probability of lawn grass from three logistic regression models: linear effect of fire frequency (blue line), quadratic effect (green line), and cubic effect (black line). The green and black lines virtually coincide. On the RHS, for every distinct value of fire frequency, the proportion of pixels with lawn grass is plotted, with the same lines as on the LHS.]

Of the three models fitted here, the quadratic model seems to


give the best description of the relationship between lawn grass
presence and fire frequency. In this example, the plot with pre-
dicted probabilities is very useful and probably much easier to
understand than direct interpretation of the coefficients.

-----------------------------------------------------------------
> summary(m2)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.8455 0.9727 -6.01 1.9e-09
Fire5698 1.0149 0.2082 4.87 1.1e-06
I(Fire5698^2) -0.0515 0.0104 -4.94 8.0e-07

Null deviance: 954.93 on 946 degrees of freedom


Residual deviance: 920.80 on 944 degrees of freedom
(1 observation deleted due to missingness)
AIC: 926.8
-----------------------------------------------------------------

Categorical explanatory variables


Geology usually has a strong influence on vegetation type at a
particular site. [Archibald et al., 2005] had data on the geological

type of each pixel, classified into 9 groups. Geology is a categorical


variable. Let’s see if this has an effect on the presence of lawn grass:

-----------------------------------------------------------------
> dat$Geol_id <- as.factor(dat$Geol_id)
> m5 <- glm(LG ~ Geol_id, family = binomial, data = dat)
> summary(m5)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7346 0.2087 -8.31 < 2e-16
Geol_id1 1.7346 1.4295 1.21 0.225
Geol_id2 0.2846 0.2446 1.16 0.245
Geol_id3 2.4277 1.2424 1.95 0.051
Geol_id4 0.6931 0.3161 2.19 0.028
Geol_id5 1.7126 0.2959 5.79 7.1e-09
Geol_id6 -1.0688 0.4694 -2.28 0.023
Geol_id7 -0.0233 0.5273 -0.04 0.965
Geol_id9 0.0606 0.4914 0.12 0.902

Null deviance: 954.02 on 944 degrees of freedom


Residual deviance: 887.04 on 936 degrees of freedom
(3 observations deleted due to missingness)
AIC: 905
-----------------------------------------------------------------

Here we have a reference category (Geology type 0). The coeffi-


cients in the above output measure the change in the log-odds relative
to the reference category. The intercept is the log-odds of lawn grass
with geology type 0. For some of the geology types the log-odds
(for presence of lawn grass) are not measurably different compared
to the reference type, but for types 3, 4 and 5 the data seem to sug-
gest that lawn grass is more common (compared to type 0), and for
type 6 there is evidence that lawn grass is less common (negative
coefficient) relative to geology type 0. Note that the p-value is not related to the size of the effect. Each of the above tests is a test of

    H0 : LO(i) − LO(0) = 0

i.e. that there is no difference in the log-odds for success between


category i and the baseline or reference category (geology type 0).
What about the probability of lawn grass in each of these geol-
ogy types?

-----------------------------------------------------------------
> xs <- c(0:7, 9)
> xs <- as.factor(xs)

> pred.geol <- predict(m5, newdata = data.frame(Geol_id = xs), type = "link",


+ se.fit = TRUE)

> est <- pred.geol$fit


> lcl <- est - 1.96 * pred.geol$se.fit
> ucl <- est + 1.96 * pred.geol$se.fit

> library(boot) ## for inv.logit function
> p.est <- inv.logit(est)
> p.lcl <- inv.logit(lcl)
> p.ucl <- inv.logit(ucl)

## Inverse logit: p-hat = inv.logit(LO) = exp(LO) / (1 + exp(LO)), where LO denotes the log-odds.

> data.frame(xs, est, lcl, ucl, p.est, p.lcl, p.ucl)


xs est lcl ucl p.est p.lcl p.ucl
1 0 -1.7e+00 -2.14 -1.33 0.150 0.105 0.21
2 1 6.7e-16 -2.77 2.77 0.500 0.059 0.94
3 2 -1.5e+00 -1.70 -1.20 0.190 0.154 0.23
4 3 6.9e-01 -1.71 3.09 0.667 0.154 0.96
5 4 -1.0e+00 -1.51 -0.58 0.261 0.181 0.36
6 5 -2.2e-02 -0.43 0.39 0.495 0.393 0.60
7 6 -2.8e+00 -3.63 -1.98 0.057 0.026 0.12
8 7 -1.8e+00 -2.71 -0.81 0.147 0.063 0.31
9 9 -1.7e+00 -2.55 -0.80 0.158 0.073 0.31
-----------------------------------------------------------------

In the above code we have predicted the probability of lawn


grass presence for each geology type, with a confidence interval.
Note again that the predict() function requires the explanatory
variables to appear exactly as in the model, so we need to convert
to a factor first.
To obtain confidence intervals on the probability scale, we need
to start with confidence intervals on the link/linear predictor, or
log-odds scale, because these estimates are usually more likely
to be normally distributed. We construct Wald intervals on this
scale (est ± 1.96 × SE(est)), then transform the estimate and both
confidence limits to the probability scale, by using the inverse logit
transformation (inv.logit() function in library boot). The last 3
columns in the data frame now give the estimated probability, with
95% confidence limits for the probability, of lawn grass in each of
the geology types (xs). The point estimate for type 3 is highest, but
also has large uncertainty, indicating that the true probability might
be as low as 0.15 or as high as 0.96, or even beyond these limits.
The smallest probability of lawn grass, with high precision, is in
type 6.

Model selection with AIC


Let’s now add a few more models with multiple predictors.

-----------------------------------------------------------------
dat2 <- na.omit(dat)

m1 <- glm(LG ~ Fire5698, family = binomial, data = dat2)



m2 <- glm(LG ~ Fire5698 + I(Fire5698^2), family = binomial, data = dat2)


m3 <- glm(LG ~ Fire5698 + I(Fire5698^2) + I(Fire5698^3), family = binomial, data = dat2)
m4 <- glm(LG ~ Slope, family = binomial, data = dat2)
m5 <- glm(LG ~ Geol_id, family = binomial, data = dat2)
m6 <- glm(LG ~ Trmi, family = binomial, data = dat2)
m7 <- glm(LG ~ Fire5698 + I(Fire5698^2) + Slope + Geol_id + Trmi,
family = binomial, data = dat2)
m8 <- glm(LG ~ Fire5698 + I(Fire5698^2) + Slope + Geol_id,
family = binomial, data = dat2)
-----------------------------------------------------------------
Table 5.1: Model selection table for logistic regression models for presence of lawn grass.

    model  terms                                    -2 loglik  numpar     AIC  delta.aics    wi
  1 m1     fire                                        925.42       2  929.42      143.43  0.00
  2 m2     fire + fire^2                               891.94       3  897.94      111.95  0.00
  3 m3     fire + fire^2 + fire^3                      891.94       4  899.94      113.95  0.00
  4 m4     slope                                       837.51       2  841.51       55.53  0.00
  5 m5     geology                                     851.14       9  869.14       83.16  0.00
  6 m6     trmi                                        924.12       2  928.12      142.13  0.00
  7 m7     fire + fire^2 + slope + geology + trmi      761.80      13  787.80        1.82  0.29
  8 m8     fire + fire^2 + slope + geology             761.98      12  785.98        0.00  0.71
When we use the AIC to compare models, and because we calculate the likelihood (the probability of the data), we need exactly the same data for the models to be comparable. This means that (1) we cannot have missing covariate values in one of the models (data points with missing covariate values are ignored), and (2) we cannot compare models with different transformations of the response. The latter is more of a problem in linear regression models than in GLMs. If the likelihood is calculated for the same data, the models are comparable, and a larger likelihood means a model that explains the data better.
In the lawn grass data there are a number of missing values for the covariates, not always on the same observations. To make the models comparable, we therefore remove all pixels/observations with any missing covariate values, and then compare the eight models using AIC.
One model clearly stands out: model 8, with a quadratic effect of fire, slope and geology. Trmi (topographic relative moisture index), a measure of surface water, hardly changed the likelihood when added as an extra term, so was not able to explain extra variation in the data (in addition to what is already explained by fire, slope and geology). Among these eight models the data favour model 8.

(In GLMs we assume a distribution for the (untransformed!) response variable. This distribution or model is the basis of calculating the likelihood. Even if we were to use a different model for the observations (e.g. a Beta-binomial distribution instead of a Binomial distribution) we would still find the probability of the original number of successes out of ni, i.e. for the raw observations.)
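The AIC, delta-AIC and Akaike weight (wi) columns of Table 5.1 can be reproduced along these lines (a sketch, assuming models m1 to m8 as fitted above):

----------------------------------------------------------
aics <- AIC(m1, m2, m3, m4, m5, m6, m7, m8)$AIC
delta <- aics - min(aics)                      # delta.aics column
w <- exp(-delta / 2) / sum(exp(-delta / 2))    # Akaike weights (wi column)
round(cbind(AIC = aics, delta.aics = delta, wi = w), 2)
----------------------------------------------------------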
Finally, we produce some conditional plots for model 7, both as a
tool for model checking (to see if fitted and observed relationships
correspond), and to understand the estimates (and their uncertain-
ties) a bit better. This is especially useful for the geology variable:
we can for example see that with type 6 there were no lawn grass
observations; with type 3 two-thirds were lawn grass, but there
were only 3 observations, so lots of uncertainty; and with type 5 about half were lawn grass, so, compared to other types, a much higher probability of lawn grass. And we can see that a linear effect of TRMI really does not help to explain where you would expect more lawn grass (1s). It might be worth checking, however, whether there is a quadratic effect of TRMI.

[Figure 5.10: Conditional plots for model 7 (with variables fire, fire², geology, slope and TRMI). These plots are on the linear predictor scale (log-odds), and show the residuals after having fitted the other variables.]

Finally, we plot the ROC curve for model 8, and can see that model 8 is a much better model for predicting where lawn grass is to be found (AUC = 0.78), relative to model 2.

[Figure 5.11: Receiver operating characteristic curve for model 8 (logistic regression model for presence of lawn grass with variables fire, slope and geology; black line, AUC = 0.78). The blue line represents the ROC curve for model 2 (quadratic model for fire), as a comparison (AUC = 0.63). The AUC for the latter model changed compared to the previous value, because we removed some observations with missing values (to make all 8 models AIC-comparable).]

One strategy for predicting which pixels will have lawn grass would be to predict presence for the 20% of pixels with the highest predicted probabilities.
A Formulae sheet

Probability Distributions

Name         Probability fn. / Density fn.               Range             MGF                  Mean      Variance
Binomial     (n choose x) p^x (1 − p)^(n−x)              x = 0, 1, ..., n  (1 − p + pe^t)^n     np        np(1 − p)
Poisson      λ^x e^(−λ) / x!                             x = 0, 1, ...     e^(λ(e^t − 1))       λ         λ
Geometric    p(1 − p)^x = pq^x                           x = 0, 1, ...     p / (1 − qe^t)       q/p       q/p²
Normal       (1/(√(2π) σ)) e^(−(x−µ)²/2σ²)               −∞ < x < ∞        e^(µt + t²σ²/2)      µ         σ²
Exponential  λe^(−λx)                                    0 < x < ∞         λ/(λ − t)            1/λ       1/λ²
Gamma        x^(α−1) e^(−x/β) / (β^α Γ(α))               0 < x < ∞         (1 − βt)^(−α)        αβ        αβ²
  or:        λ^α x^(α−1) e^(−λx) / Γ(α)                  0 < x < ∞         λ^α/(λ − t)^α        α/λ       α/λ²
Beta         (Γ(a+b)/(Γ(a)Γ(b))) x^(a−1) (1 − x)^(b−1)   0 < x < 1         —                    a/(a+b)   ab/((a+b)²(a+b+1))

Relations between Moments

µ3 = µ3′ − 3µ2′ µX + 2(µX)³
µ4 = µ4′ − 4µ3′ µX + 6µ2′ (µX)² − 3(µX)⁴

Var[aX + bY] = a²σX² + b²σY² + 2abσXY = a²σX² + b²σY² + 2abρXY σX σY.

Log expansion

ln(1 − r) = −r − r²/2 − r³/3 − ···

ANOVA

• Sample mean for population i: Yi· = ∑_{j=1}^{ni} Yij / ni

• Overall sample mean: Y·· = ∑_{i=1}^{k} ∑_{j=1}^{ni} Yij / N = ∑_{i=1}^{k} (ni/N) Yi·

• Error sum of squares: SSE = ∑_{i=1}^{k} ∑_{j=1}^{ni} (Yij − Yi·)²

• “Treatment” sum of squares: SST = ∑_{i=1}^{k} ni (Yi· − Y··)²

• F = [SST/(k − 1)] / [SSE/(N − k)]

Regression

• Least squares estimate: (X′X)⁻¹X′y

• Var[β̂k] = ξkk σ², where ξkℓ is the (k, ℓ)-th element of (X′X)⁻¹, and σ² is the sampling variance.

• Let Y(x) be the response corresponding to a design point x. It follows that E[Y(x)] = µ(x) = β′x. Define µ̂(x) = β̂′x. Then:

  ◦ E[(µ(x) − µ̂(x))²] = σ² x′(X′X)⁻¹x

  ◦ E[(Y(x) − µ̂(x))²] = σ²(1 + x′(X′X)⁻¹x)
B Bibliography

S Archibald, WJ Bond, WD Stock, and DHK Fairbanks. Shaping the landscape: fire-grazer interactions in
an African savanna. Ecological Applications, 15(1):96–109, 2005.

P Collinson. Of bombers, radiologists, and cardiologists: time to ROC. Heart, 80(3):215–217, 1998.

Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.
