Inferential Statistics
Course Notes for STA3030F
Contents
1 One-sample problems
4 Bayesian inference
A Formulae sheet
B Bibliography
Preface
These notes form the core course material for STA3030F Inferen-
tial Statistics, a third-year semester-long course for students in the
Applied Statistics stream.
The purpose of these notes, and the course, is to provide busi-
ness students specializing in statistics and management science
with a broad understanding of the principles underlying the prac-
tice of statistical inference. It is assumed that students entering this
course have completed at least three semester courses in statistical
methods (typically STA1000F/S, STA2020F and STA2030S at UCT),
and are familiar with the following concepts:
These notes have been compiled over several years and several
people have been involved in that process – this will sometimes be
clear to the reader in a change of style, or way of presenting infor-
mation graphically, or preference for a particular software package.
Most of the material in the first four chapters was written by Prof.
Theodor Stewart. The material in Chapter 5, on GLMs, was writ-
ten by Dr. Birgit Erni for the course STA2007S. Other people who
have worked on various parts of the notes include Dr. Freedom
Gumedze, Dr. Juwa Nyirenda and Dr. Ian Durbach. The notes are
updated from year-to-year, so if you have any comments or correc-
tions, these are welcome.
A last note on software: this is a course in inferential statistics,
often demonstrating concepts using computer simulation. A num-
ber of software packages are available to do this. In past years
these notes have been based primarily on Microsoft Excel, whose
boxplot(pdelays,ylab="Days",main="Boxplot")
hist(pdelays,breaks=10,xlab="Days",main="Histogram")
(Figure: (a) boxplot and (b) histogram of the payment delays, in days.)
mean(pdelays)
[1] 39
sd(pdelays)
[1] 3.051286
median(pdelays)
[1] 38.5
(2) Place all the observations “in a hat” and “shuffle” them. Then
draw n items with replacement to form a new pseudo-sample (the
“bootstrap sample”).
For i = 1, . . . , n do:
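In R, the whole bootstrap procedure might be implemented along the following lines (a sketch; it produces the vector bs_means of 5000 bootstrap sample means that is plotted and summarized below):

# number of bootstrap repetitions
B <- 5000

# for each repetition, draw n = length(pdelays) observations with
# replacement and store the mean of the resulting pseudo-sample
bs_means <- numeric(B)
for (b in 1:B) {
  bs_sample <- sample(pdelays, size = length(pdelays), replace = TRUE)
  bs_means[b] <- mean(bs_sample)
}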
boxplot(bs_means,ylab="Days",main="Boxplot")
hist(bs_means,breaks=10,xlab="Days",main="Histogram")
(Figure: (a) boxplot and (b) histogram of the 5000 bootstrap sample means, in days.)
mean(bs_means)
[1] 38.99818
sd(bs_means)
[1] 0.5539776
min(bs_means)
[1] 36.83333
max(bs_means)
[1] 40.9
sorted_bs_means[125]
[1] 37.93333
sorted_bs_means[4875]
[1] 40.1
which gives the same result as above for X̄ = 39. This last ex-
pression can, however, be misleading, as it looks like a probability
statement about µ. In this context, µ is fixed (although unknown),
while the probabilities of occurrence refer to the different realiza-
tions of the sample mean when the sampling process is repeated
many times.
The confidence interval described in introductory statistics
courses (e.g. STA1000) is in fact based on precisely the same type
of re-sampling argument. The only difference is that the assump-
tion is made that the errors themselves are normally distributed with
zero mean, i.e. that X̄ − µ has a normal distribution with mean 0
and standard deviation σ/√n, where σ is the population standard
deviation and n the sample size. When σ is known, the standard
confidence interval is written in the form:

Pr[ X̄ − zα σ/√n ≤ µ ≤ X̄ + zα σ/√n ] = 1 − 2α
39 ± (2.045 × 3.05)/√30 = [37.86; 40.14]
where the factor 2.045 is the 0.025 critical value for the t-distribution
with 29 degrees of freedom. Note how close the bootstrapped and
t-based confidence intervals are; this would suggest that the as-
sumptions of the normal theory hold quite well in this case.
[1] 179
so in fact found that 179 of the 5000 values exceeded 40.0, i.e.
3.58% of the data. This may be interpreted as an observed signifi-
cance level (or p-value) of 0.036, which is significant at the conven-
tional 5% level. Recall that for the standard t-test based on normal
errors, the t-statistic for this test would be:
(39 − 38)/(3.05/√30) = 1.795
Since the 5% critical value from t tables is 1.699 for 29 degrees of
freedom, we would conclude that the difference is “significant” (but
not “highly significant”). We can calculate the associated p-value in
R as follows:4,5

# calculate parts of the t-statistic
mpd <- mean(pdelays)
sdpd <- sd(pdelays)
npd <- length(pdelays)

# compute the t-statistic
tstat <- (mpd - 38) / (sdpd/sqrt(npd))

# enter the degrees of freedom
dof <- npd - 1

# get the p-value from a t-distribution
p <- 1 - pt(tstat, dof)

# display p
p

[1] 0.04153733

4 The crucial part of the code is pt, which calculates the value of the CDF of a t-distribution at the point 1.8. This gives us the probability of obtaining a value less than 1.8 (think of the definition of a CDF). For our (one-sided) hypothesis test we want the probability of getting more than 1.8, which is why we take 1 minus the value returned by pt. R has similar CDF functions for many other distributions, see for example pnorm, pgamma, pbeta and pbinom; note that pt is specifically for the t-distribution.

5 Note that again you could do this in a single line of code: 1-pt((mean(pdelays)-38)/(sd(pdelays)/sqrt(length(pdelays))), length(pdelays)-1). It's just easier to understand, and harder to make a mistake, if you break it up a little. In fact, you can run the t-test directly using t.test(pdelays, mu=38, alternative="greater"). Try this, and note that you get the same p-value. For help type help(t.test). Using R's built-in t.test function is the way you would normally run a t-test in practice, but it doesn't give you any insight into sampling theory, which is the whole point of this section!

We obtain a p-value of 0.042, which is somewhat larger than (but of much the same order of magnitude as) the bootstrap estimate.

It is worth emphasizing two critical insights derived from the above arguments:
Recall that a random variable is defined such that its value is only
determined by performing some form of experiment, measurement
or observation. Examples in earlier courses would have included
the number of heads in a fixed number of spins of a coin, the time
to occurrence of some specified event (e.g. the breakdown of a ma-
chine), or rainfall at a particular point over a fixed period of time.
The critical point is that prior to making the necessary observa-
tions, the value of the random variable is unknown, and can only
be described probabilistically.
A convenient notation is to use upper case letters (e.g. X, Y, . . . )
to denote the random variable itself (prior to any observation), and
to use lower case letters (e.g. a, b, . . . , x, y, . . . ) to represent particular
real values that might be taken on by the random variable. An
expression such as Pr[ X = x ] would then denote the probability
that when the random variable X is observed, it is found to take on
the value x. Clearly, expressions such as Pr[ X = y] or Pr[ X = 10]
are equally meaningful.
The distribution function (sometimes termed the cumulative distri-
bution function) of the random variable X is a function F ( x ), such
that for each real number x: F ( x ) = Pr[ X ≤ x ]. Clearly, as x in-
creases, the probability cannot decrease, although it could remain
constant over some range of x (since if X cannot take on values be-
tween a and b, say, then F ( a) = Pr[ X ≤ a] = Pr[ X ≤ b] = F (b)).
In other words, F ( x ) is a non-decreasing function of x. Further-
more, by the properties of probabilities, 0 ≤ F ( x ) ≤ 1. Figure 1.4
illustrates two possible forms of distribution function.
Both distributions illustrated in Figure 1.4 relate to random vari-
ables taking on values between 0 and 5 only. The distribution func-
(Figure 1.4: two possible forms of distribution function, Distr A and Distr B, for 0 ≤ x ≤ 5.)
F(k) = ∑_{i=0}^{k} p(i).
f(x) = lim_{h→0} Pr[x < X ≤ x + h]/h = lim_{h→0} [F(x + h) − F(x)]/h = dF(x)/dx.
Clearly this also implies that:

F(x) = ∫_{−∞}^{x} f(u) du

so that Pr[a < X ≤ b] = F(b) − F(a) = ∫_{a}^{b} f(x) dx, i.e. the area under the density function curve between the points a
and b on the x-axis.
A notational convention: Sometimes it becomes necessary to iden-
tify to which random variable a particular distribution, probability
or density function applies. In such cases we will label the func-
tions by a subscript denoting the name of the random variable, e.g.
F_X(x), p_X(x) or f_X(x). The subscripts will, however, be omitted if no
confusion can arise.
• Poisson (discrete): p(x) = λ^x e^{−λ}/x! for x = 0, 1, . . .

• Exponential (continuous): f(x) = λe^{−λx} for x > 0, so that

F(x) = ∫_{0}^{x} f(u) du = 1 − e^{−λx}

• Normal: f(x) = [1/(√(2π) σ)] e^{−(x−µ)²/2σ²}
It is worth noting here that a non-negative random variable X is
said to have the log-normal distribution if Y = log X follows a
normal distribution. The base to which the logarithm is taken is
irrelevant, but it is conventional to use natural logarithms (i.e. to
base e).
As we have seen in Section 1.1, it is possible to explore the be-
haviour of statistical sampling processes by means of numerical
experimentation in computer simulations (often then termed a
Monte Carlo approach). In order to implement such an approach, we
often need to simulate realizations of random variables drawn from
some specified distribution (such as the normal or Poisson with
runif(4)
The only complication may arise from having to find the x such
that F(x) = u. For discrete distributions, the search for a solution
can be carried out systematically: set X to the smallest non-negative
integer k such that u ≤ ∑_{i=0}^{k} p(i). This is easily found by a set of
nested if statements.6 For example, suppose we want to draw from a
discrete distribution giving X = 0 with probability 0.2, X = 1 with
probability 0.5, and X = 2 with probability 0.3. We could simulate
1000 values from this distribution as follows:

6 Again there are different ways to program this. A better way, once you have generated your uniform random numbers, is to use x <- ifelse(u<0.2, 0, ifelse(u<0.7, 1, 2)). This does the same thing as in the main text, but avoids the use of the for loop. Try help(ifelse) to get more information on the useful ifelse function.

# set up an empty vector to store the values
x <- c()

# generate 1000 U[0,1] variates
u <- runif(1000)

# use u to pick the appropriate value of x
for(i in 1:length(u)){
  if(u[i] < 0.2){
    x[i] <- 0
  } else if(u[i] < 0.7){
    x[i] <- 1
  } else {
    x[i] <- 2
  }
}
# print out first 5 u values
round(u[1:5],2)

# print out the corresponding first 5 x values
x[1:5]

[1] 1 1 2 2 0
(Figures: histogram of the 1000 simulated x values, and a plot of u against x illustrating the inverse-transform construction.)
x = F^{−1}(u)

For the exponential distribution, for example, solving F(x) = 1 − e^{−λx} = u gives x = −[ln(1 − u)]/λ, so we can generate uniform random numbers and
convert these to the desired x's using the above function. In fact
there is a little simplifying trick! If the random variable U has a
uniform distribution on [0,1], then so has 1 − U. The expression
x = −[ln(u)]/λ will therefore also produce a number from the
desired exponential distribution.
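As a quick numerical check of this inverse-transform idea (a sketch, assuming λ = 0.1):

# simulate 10000 exponential values with rate lambda via the inverse transform
lambda <- 0.1
u <- runif(10000)
x <- -log(u) / lambda

# compare with R's built-in exponential generator
summary(x)
summary(rexp(10000, rate = lambda))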
More generally, it may be difficult to solve the equation F ( x ) = u
in a closed form. However, R provides inverse functions for many
standard distributions, i.e. giving values for F −1 (u) directly. For
example, the function qnorm(u,m,s) gives F −1 (u) for the normal
distribution with mean m and standard deviation s. Another func-
tion, rnorm(n,m,s) directly generates n values from N (m, s).
qnorm(0.025,0,1)

[1] -1.959964
qnorm(0.975,0,1)
[1] 1.959964
qnorm(runif(1),1,3)
[1] -0.1832144
round(rnorm(4,10,1),2)
u <- runif(200000)
x <- ifelse(u < 0.25, 1, 0)
Two sets of analyses were carried out using the generated se-
quence of 0’s and 1’s:
Clearly these means must take on one of the values 0, 0.1, . . . , 1.0,
so that the distribution of the sample means is still very much
discrete. The observed frequencies of each value of the mean
are displayed in the histogram below.

(Figure: histogram of the sample means for the samples of size 10.)
Even for samples of size 10, the distribution of the mean is start-
ing to look quite smooth, although somewhat skewed to the
right rather than normally distributed.
• The same 200000 values were then grouped into 2000 sets of 100
observations each, once again calculating the means in each set.
(Figure: histogram of the 2000 sample means for the samples of size 100.)
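The whole experiment can be reproduced with a few lines of R (a sketch, using the 0/1 data generated as above):

# simulate 200000 Bernoulli(0.25) values, as above
u <- runif(200000)
x <- ifelse(u < 0.25, 1, 0)

# group into 20000 samples of size 10 and 2000 samples of size 100,
# and compute the mean of each sample (columns of a matrix)
means10  <- colMeans(matrix(x, nrow = 10))
means100 <- colMeans(matrix(x, nrow = 100))

# histograms of the two sets of sample means
par(mfrow = c(1, 2))
hist(means10,  main = "Samples of size 10",  xlab = "mean values")
hist(means100, main = "Samples of size 100", xlab = "mean values")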
X(1) = min{ X1 , X2 , . . . , Xn }
which is why, here, x(1) = 3. For a sample of size n the nth order
statistic X(n) is always the maximum of the sample, that is

X(n) = max{ X1 , X2 , . . . , Xn }
so that here x(n) = 10. The sample range is the difference between the maximum and minimum, and so is expressed as a
function of the order statistics: R = X(n) − X(1).
34 34 34 35 36 36
37 37 37 38 38 38
38 38 38 39 39 39
40 41 41 41 41 42
42 42 42 43 44 46
Suppose that there is some concern from management that a few
very late customers may be unfairly skewing the average payment
times. In such a case, it might be a better approach to examine the
median payment time, which is resistant to such outliers. We can
approach the construction of a bootstrap confidence interval around
the median in much the same way that we did for the mean. That
is, we can generate a large number (5000 or whatever) of bootstrap
samples as before, and for each bootstrap sample compute the me-
dian. We can then sort the bootstrap sample medians from smallest
to largest. For one particular run of 5000 bootstrap replications, the
following selection of sorted bootstrap medians was obtained
(Figure: histogram of the 5000 bootstrap sample medians, ranging from about 36 to 42 days.)
Under the key assumption that the same sampling errors apply to
the originally-taken sample, we can use the above statements about
sampling errors to construct a bootstrap confidence interval around
the median.
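A sketch of how this could be done in R (assuming the payment-delay data are again stored in pdelays):

# generate 5000 bootstrap medians from the payment-delay data
set.seed(1)
bs_medians <- replicate(5000, median(sample(pdelays, replace = TRUE)))

# sort them and read off the 2.5% and 97.5% points for a 95% interval
sorted_bs_medians <- sort(bs_medians)
c(sorted_bs_medians[125], sorted_bs_medians[4875])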
Pr[ X ≤ ξ p ] = F (ξ p ) = p.
for some given α. We can’t be sure that we can ever find r and s
such that the above probability is exactly 1 − α, which is why we
use the ≥. We would, however, prefer to seek the shortest interval
for which the above applies.
For any given p, let us define the random variable Y as the num-
ber of observations in the sample which do not exceed ξ p . Since
we don’t know ξ p , we can never actually observe Y, but we can
still state its probability distribution. For example, we know that
the probability of a single observation drawn from the sample be-
ing less than the median ξ 0.5 is by definition 0.5, even if we don’t
know the value of ξ 0.5 . Therefore the number of observations in the
sample which are less than ξ 0.5 is a random variable which follows
the binomial distribution (since there are multiple independent tri-
als) with parameters n and p = 0.5. Generally, since by definition
F (ξ p ) = p, Y has the binomial distribution with parameters n and p
(i.e. the p defining the required quantile). Now {X(r) ≤ ξ p } is equiv-
alent to Y ≥ r, while {ξ p < X(s) } is equivalent to Y < s. We then
have:
(r − 15.5)/2.74 = −1.96

which gives r = 10.1, and:

(s − 15.5)/2.74 = 1.96

which gives s = 20.8. Moving out to the next integers gives
r = 10 and s = 21, and thus the desired confidence interval for
the median is [X(10) ; X(21) ]. This is reasonably similar to the
confidence interval arrived at using the full binomial calcula-
tions, which gave [X(11) ; X(25) ]. Checking back with the original
data, we state that the 95% confidence interval using the normal
approximation is given by 38–41.
f_(n)(x) = n[F(x)]^{n−1} f(x)

Similarly, since the minimum exceeds x only if all n observations exceed x, the distribution function of the minimum is given by:

F_(1)(x) = 1 − [1 − F(x)]^n.
Once again the distribution function can be differentiated with
respect to x to obtain the p.d.f.
• The probability that the time until the last bulb fails is greater
than 8000 hours, assuming that no bulbs are replaced in the
interim?
• The p.d.f. of the time until the last bulb fails?
F_(6)(x) = [1 − e^{−x/2000}]^6
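As a quick numerical check of the first question (a sketch, assuming exponentially distributed lifetimes with mean 2000 hours, as implied by the formula above), the probability that the last of the 6 bulbs fails after 8000 hours is 1 − F_(6)(8000):

# Pr[time until the last of the 6 bulbs fails exceeds 8000 hours]
1 - (1 - exp(-8000/2000))^6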
Tutorial Exercises
(c) Compare the results from the previous section with that
obtained from standard normal theory.
(a) f(x) = 6x(1 − x) for 0 ≤ x ≤ 1

(b) f(x) = 5/x^6 for x > 1
8. For the exponential distribution with a mean of 10 (i.e. λ =
0.1), and for the second distribution defined in the previous
question: Simulate the occurrence of 1000 sample values from
the distribution, group the results of the simulation into 100 sets
of 10 values each, and calculate the corresponding sample means
in each set. Plot a histogram of these means. Comment on the
results.
The sample medians for each of the 1000 bootstrap samples were
calculated, and then ordered from smallest to largest. The fol-
lowing are some of the ordered results obtained:
Pr[ −zα/2 ≤ (X̄ − µ)/(σ/√n) ≤ zα/2 ] = 1 − α

The event inside the brackets is that

−zα/2 ≤ (X̄ − µ)/(σ/√n) ≤ zα/2,

or (re-arranging terms) that

X̄ − zα/2 σ/√n ≤ µ ≤ X̄ + zα/2 σ/√n.
Pr[ (X̄ − µ0)/(σ/√n) > zα ] ≤ α

for the stated µ0. Recall that we may either set the desired significance level a priori (e.g. something like the conventional 0.05,
giving z0.05 = 1.645) and reject H0 if X̄ > µ0 + zα σ/√n; or we
could look up the value of α such that zα equates to the observed
value of (X̄ − µ0)/(σ/√n) (giving the "p-value").
x <- rnorm(45000,0,3)
hist(x)
(Figure: histogram of the 45000 simulated values.)
• Cluster the results into groups of 9 each, and obtain the sample
mean and sample variance within each group
• The group sample means should have the standard normal dis-
tribution (why?). We check this by plotting a histogram of the
sample means. This shows how sample means do vary from
sample to sample, as we can view each group as a sample of size
9.
(Figures: histograms of the group sample means (mymean), the group sample standard deviations (sqrt(myvar)), and the resulting t-statistics (myt).)
• The sample estimate of the kurtosis for the T’s was 4.0.
The above points indicate that the distribution of the T’s has a dis-
tinct tendency towards having heavier tails than the normal distri-
bution. This can seriously bias estimates of p-values or confidence
intervals if the sample estimate of the standard deviation is used in
place of the true population value, but critical values from the
normal distribution are still used.
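The simulation just described can be reproduced in a few lines of R (a sketch; the object names mymean, myvar and myt match those used in the figures above):

# 45000 N(0, 3^2) values, grouped into 5000 samples of size 9 (matrix columns)
x <- rnorm(45000, 0, 3)
xmat <- matrix(x, nrow = 9)

# sample mean and sample variance within each group
mymean <- apply(xmat, 2, mean)
myvar  <- apply(xmat, 2, var)

# t-statistics: replace the known standard deviation by the sample value
myt <- mymean / (sqrt(myvar) / sqrt(9))

par(mfrow = c(1, 3))
hist(mymean, main = "Sample means")
hist(sqrt(myvar), main = "Sample std deviations")
hist(myt, main = "t-statistics")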
We can deal with the additional variation due to using sample
estimates of the variance in quite a simple manner. We simply
express the “t-statistic” in the following way:
T = (X̄ − µ)/(S/√n) = [(X̄ − µ)/(σ/√n)] (σ/S) = Z [σ/S].
We know that Z = (X̄ − µ)/(σ/√n) has the standard normal
distribution. The usual sample variance estimator is given by

S² = ∑_{i=1}^{n} (Xi − X̄)² / (n − 1)

so that U defined by:

U = (n − 1)S²/σ² = ∑_{i=1}^{n} [ (Xi − X̄)/σ ]²
T = Z / √(U/(n − 1))

More generally, if Z has the standard normal distribution and U independently has the χ² distribution with d degrees of freedom, we define:

T = Z / √(U/d).
Then the probability density function for T is given by:
f(t) = [ Γ((d + 1)/2) / (√(πd) Γ(d/2)) ] · 1/[1 + t²/d]^{(d+1)/2}
Y²/16 + Z²/25

has the χ² distribution with 2 degrees of freedom. The theorem
thus tells us that:

(X/3) / √[ (Y²/16 + Z²/25)/2 ] = √2 X / [ 3 √(Y²/16 + Z²/25) ]

has the t-distribution with 2 degrees of freedom, i.e. (using the 5% critical value of 2.920 for the t-distribution with 2 degrees of freedom):

Pr[ X ≥ 6.194 √(Y²/16 + Z²/25) ] = 0.05.
The probability that X̄ exceeds 0.5 √(∑_{i=1}^{5} Yi²) is given by:

Pr[ X̄/√(∑_{i=1}^{5} Yi²) > 0.5 ] = Pr[ W > 0.5√10 = 1.581 ].
This last probability we can look up in t-tables, and it turns out
to be 8.74%.
We note that the average delay time1 in the Zinc market sample is 7.43 days larger than that in the Copper market. The questions which arise are (1) how large is the true difference in population means? (2) Is the evidence for a larger delay in the Zinc market convincing in the light of sampling variation?

1 The mean of the numbers for copper is 33.60; the standard deviation is 12.09. The mean of the numbers for zinc is 41.03; the standard deviation is 9.08.
The first question can be structured in terms of a confidence
interval for the true difference in means. The second can be formu-
lated in hypothesis testing terms, viz. to test the “null hypothesis”
H0 that the means are equal.
The hypothesis test is easily addressed in a bootstrapping man-
ner. Suppose that we re-formulate the null hypothesis simply as:
“The distributions of times in the two populations are identical”.
If this H0 is true, then both sets of 30 observations come from the
same population; in other words all 60 observations arise from the
same population. We may then simulate the two-sample results as
follows:
• Place all 60 observations in a “hat” and “shuffle”
• Calculate the sample mean in each set, and the difference be-
tween the two sample means
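A sketch of how the shuffled bootstrap samples could be generated in R follows. This is one possible construction of the matrix all_boots used in the code below; the vectors copper and zinc, holding the two original samples of 30 delays, are assumed names.

# combine the two samples of 30 delays into one vector of 60 observations
delays <- c(copper, zinc)

# each row of all_boots is one re-sample of 60 observations, drawn with
# replacement from the combined "hat", following the one-sample recipe above
B <- 5000
all_boots <- matrix(NA, nrow = B, ncol = 60)
for (b in 1:B) {
  all_boots[b, ] <- sample(delays, size = 60, replace = TRUE)
}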
We can now extract the bootstrap means for “group 1”2 using the apply function, but now only applied to the first 30 columns of all_boots.

2 Although note that referring to group 1 and group 2 in the context of the bootstrap is rather meaningless, since we have already randomly shuffled all 60 observations.
bs_means1 <- apply(all_boots[,1:30],1,mean)
(Figure: histogram of the 5000 bootstrap differences in means, bs_diffs.)
# 2.5 percentile
sorted_bs_diffs[125]

[1] -5.6

# 97.5 percentile
sorted_bs_diffs[4875]

[1] 5.7
(since any observations in which X̄ < Ȳ, no matter how large the
difference, are consistent with the null hypothesis).
• Calculate the sample variance in each set, and the ratio of the two
variances
We can again use either the R code of a few pages above, or the
BootStrap.xls spreadsheet package (this is left to you as an exer-
cise!). A histogram of ratios of sample variances obtained using this
procedure, based on 5000 repetitions, is shown in Figure 2.6. In fact,
you can see for yourself that this (empirical) distribution closely
resembles the F-distributions of your earlier courses.
(Figure 2.6: histogram of the 5000 bootstrap ratios of sample variances, var1/var2.)
(s1²/n1 + s2²/n2)² / [ (s1²/n1)²/(n1 + 1) + (s2²/n2)²/(n2 + 1) ] − 2
(Figure: TRAFFIC values at the five locations I–V.)
• Calculate the SSE and SST, and the F-ratio as defined above.
(Figure 2.8: histogram of the bootstrap F-ratios, with the observed ratio indicated.)
U = SSE/σ² = ∑_{i=1}^{k} ∑_{j=1}^{ni} (Yij − Yi·)²/σ²
• If H0 is true, then the Yi· have the same mean and variances
σ2 /ni . By a slight variation of previous results, it can be shown
that
V = SST/σ² = ∑_{i=1}^{k} (Yi· − Y··)²/(σ²/ni)
[V/(k − 1)] / [U/(N − k)]
Tutorial Exercises
1. For the example discussed at the start of Section 2.2 (for the
problem of comparing payment delays in two markets), com-
pare the results based on the two-sample t-test with pooled
variance, with those reported in the notes based on bootstrap-
ping approach. Remember to first check the hypothesis of equal
variances, using the F-test, before applying the above t-test.
Additive
A B
mean 6.89 7.19
std.dev. 0.374 0.475
Position No.: 1 10 25 50 75
Value -0.776 -0.477 -0.385 -0.308 -0.260
Position No.: 100 250 500 750 900
Value -0.234 -0.127 0.008 0.131 0.249
Position No.: 925 950 975 990 1000
Value 0.274 0.319 0.387 0.469 0.585
(a) Use the bootstrapped data to estimate the p-value for the
test of differences between the means. Compare this with the
corresponding value obtained on the assumption that the data
are normally distributed, and that the group variances are the
same.
(b) Use the bootstrapped data to construct a 95% confidence
interval for the difference between the means.
5. For each of the data sets attached to the end of this chapter, use
bootstrapping and standard normal theory to test the hypothesis
of no differences between the groups.
Truck   Consumptions (Litres)   Yi·   ∑_{j=1}^{ni} (Yij − Yi·)²
A 35.6 37.1 32.6 31.3 32.4 33.80 23.780
B 34.5 34.2 32.5 30.5 32.93 10.168
C 36.6 33.9 32.5 35.5 35.6 37.5 35.27 16.453
Premium Regular
35.4 31.7 29.7 34.8
34.5 35.4 29.6 34.6
31.6 35.3 32.1 34.8
32.4 36.6 35.4 32.6
34.8 36.0 34.0 32.2
Data Set F: A firm has two possible sources for its computer hard-
ware. It is thought that supplier X tends to charge more than
supplier Y for comparable items. Do these data support this
contention at the α = 0.05 level?
Data Set G: Twenty randomly selected cars of the same make and
model were split into two groups of ten each. Premium grade
petrol was used in cars from the first group and regular grade
in the other group. Petrol consumptions over a standard set of
identically controlled conditions were measured as follows:
Premium Regular
6.71 7.49 8.00 6.82
6.88 6.71 8.02 6.86
7.52 6.73 7.40 6.82
7.33 6.49 6.71 7.29
6.82 6.60 6.99 7.38
Data Set H: These are the running times in minutes of films pro-
duced by two different directors. Is there a difference?
Treatment
1 2 3 4
18.9 18.3 21.3 15.9
20.0 19.2 21.5 16.0
20.5 17.8 19.9 17.2
20.6 18.4 20.2 17.5
19.3 18.8 21.9 17.9
19.5 18.6 21.8 16.8
21.0 19.9 23.0 17.7
22.1 17.5 22.5 18.1
20.8 16.9 21.7 17.4
20.7 18.0 21.9 19.0
Species
A B C
18.1 29.1 26.6
16.5 15.8 16.1
21.0 20.4 18.8
18.7 23.5 25.0
7.4 18.5 21.8
12.4 21.3 15.4
16.1 23.1 19.9
17.9 23.8 15.5
20.1 21.1
11.9 25.5
Data Set M: Four brands of tyres are tested for tread wear. Since
different cars may lead to different amounts of wear, cars are
considered as blocks to reduce the effect of differences among
cars. An experiment is conducted with cars considered as blocks,
and brands of tyres randomly assigned to the four positions of
tyres on the cars. After a predetermined number of miles driven,
the amount of tread wear (in millimetres) is measured for each
tyre. The resulting data are given below.
The art and science of statistical modelling involves inter alia the
following steps:
f(x|α, λ) = αλ^α / (λ + x)^{α+1}
f_X(x) = λ^α x^{α−1} e^{−λx} / Γ(α)   for 0 < x < ∞

or

f(x) = x^{α−1} e^{−x/β} / [β^α Γ(α)]   (x > 0)
Theoretical Sample
Third 145 119
Fourth 3419 2882
The theoretical and sample moments are very much of the same
order of magnitude, giving some credence to the assumption of a
gamma distribution.
Data Set 2. Now, however, let us carry out the same analysis on the
following data:
Theoretical Sample
Third 81.0 -23.9
Fourth 1621 773
types of plot, for both data sets. Note that we denote the adjusted
value (k − 1/4)/(n + 1/2) by k*.

(Table 3.1: Plotting positions and data points for probability plots. For each data set the columns give x(k), F(x(k)) and F^{−1}(k*) for each k and k*.)
The plots are shown graphically in Figures 3.1 and 3.2. The plots
for data set 1 are (apart from some random fluctuations) close ap-
proximations to straight lines through the origin. The plots for data
set 2, on the other hand, show a tendency to curvature, and the
Q-Q plot in particular does not appear to come close to the origin.
This reinforces our earlier conclusions.
Use of quantiles
Moments are sometimes quite difficult to calculate theoretically in
terms of the parameters. In many cases, the cumulative distribution
function may be more easily available, so that one can directly
calculate population medians or quartiles. Instead of matching
moments, then, one can choose parameter values to match up the
corresponding quantiles.
A good example of this situation is provided by the “Weibull dis-
tribution”, which is also often used to model equipment lifetimes in
reliability studies. The distribution function for the Weibull distri-
bution is defined by:
F(x) = 1 − e^{−cx^γ}
(Figure 3.1: probability plot (F(observed) against probability) and quantile plot (actual against predicted values) for Data Set 1.)
(Figure 3.2: probability plot (F(observed) against probability) and quantile plot (actual against predicted values) for Data Set 2.)
then:

ξp = [ −ln(1 − p) / c ]^{1/γ}

Thus in order to estimate c and γ, we could (for example) match
the theoretical upper and lower quartiles (p = ¼ and ¾ respectively) to the sample quartiles, and then use the median (p = ½) to check goodness of
fit.
The following data were generated (in Excel) from a Weibull
distribution with parameters c = 0.2 and γ = 3 respectively:
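As an illustration of this quartile-matching recipe in R (a sketch, using freshly simulated Weibull data with c = 0.2 and γ = 3 rather than the data referred to above; note that R's rweibull uses a shape/scale parameterization with c = scale^(−shape)):

# simulate data from F(x) = 1 - exp(-c x^gamma) with c = 0.2, gamma = 3
set.seed(1)
c.true <- 0.2; gamma.true <- 3
x <- rweibull(100, shape = gamma.true, scale = c.true^(-1/gamma.true))

# sample quartiles
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)

# solve xi_p = [-ln(1-p)/c]^(1/gamma) at p = 0.25 and p = 0.75 for gamma and c
gamma.hat <- log(log(1 - 0.75) / log(1 - 0.25)) / log(q3 / q1)
c.hat <- -log(1 - 0.25) / q1^gamma.hat

# check the fit at the median: observed vs fitted
median(x)
(-log(0.5) / c.hat)^(1/gamma.hat)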
The student will have noticed that the precise choice of estimator
has up to now been somewhat arbitrary. We could use moment
or quantile estimators. In either case, there may be a number of
different moments or quantiles which can be matched to the data.
The question now is: Is there some estimator (or estimators) with a
claim to being “better” than others in some sense?
In order to illustrate the question, we could conduct a simple
simulation exercise. Suppose that we wish to estimate the param-
eter λ in the exponential distribution. Since there is only one pa-
rameter, we could obtain a moments estimator by matching to the
sample mean; and a quantile estimator by matching to the sample
median. These estimators are easily seen to be given by:
Moment estimator: λ̂ = 1/x̄

Quantile estimator: λ̂ = ln(2)/m, where m is the sample median
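A small simulation along the lines suggested above (a sketch, assuming an exponential population with λ = 0.1 and samples of size 30):

# compare the moment and quantile estimators of lambda by simulation
set.seed(123)
lambda <- 0.1; n <- 30; nsim <- 5000

moment.est <- quantile.est <- numeric(nsim)
for (i in 1:nsim) {
  x <- rexp(n, rate = lambda)
  moment.est[i]   <- 1 / mean(x)          # match the sample mean
  quantile.est[i] <- log(2) / median(x)   # match the sample median
}

# compare the long-run behaviour (bias and spread) of the two estimators
c(mean(moment.est), sd(moment.est))
c(mean(quantile.est), sd(quantile.est))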
µ̂ = x̄

σ̂² = (1/n) ∑_{i=1}^{n} (xi − x̄)²
i =1
1
E( x̄ ) =
n ∑ E ( Xi )
1
=
n ∑µ = µ
Thus µ̂ is an unbiased estimate of µ.
MSE = E[(θ̂ − θ)²]
    = E[(θ̂ − E[θ̂] + E[θ̂] − θ)²]
    = E[(θ̂ − E[θ̂])²] + 2(E[θ̂] − θ) E[θ̂ − E[θ̂]] + (E[θ̂] − θ)²
    = var[θ̂] + 0 + (E[θ̂] − θ)²
    = var(θ̂) + (B[θ̂])²
• We conclude that the MSE is the sum of the variance and the
square of the bias, where bias is defined as the difference be-
tween the true parameter value and the long-run expectation of
the estimator.
ℓ(θ; x1 , x2 , . . . , xn ) = ln L(θ; x1 , x2 , . . . , xn ).
namely x̄ and s2XX . These are termed sufficient statistics for the prob-
lem.
With the above definitions, the log-likelihood becomes:
ℓ(µ, σ²; x1, x2, . . . , xn) = −(n/2) ln(2π) − n ln(σ) − s²_{XX}/(2σ²) − n(x̄ − µ)²/(2σ²).
We can now differentiate with respect to the two parameters (µ and
σ), and obtain the optimum values by setting the two derivatives to
zero and solving for µ and σ.
Differentiating with respect to µ gives
2n(x̄ − µ)/(2σ²) = 0
so that the MLE estimator for µ must satisfy x̄ − µ = 0, i.e. µ̂ = x̄,
no matter what the value of σ2 . This hardly comes as any great
surprise!
Differentiation with respect to σ gives
−n/σ + s²_{XX}/σ³ + n(x̄ − µ)²/σ³ = 0.
Recall, however, that the optimizing value for µ must make the
third term 0, no matter how σ is chosen. The MLE for σ must there-
fore satisfy:
−n/σ + s²_{XX}/σ³ = 0
so that the MLE estimator for σ2 is given by s2XX /n. We can, in fact,
see that this estimator is biased (for small sample sizes n), as it has
an expectation of (n − 1)σ2 /n. Conventionally, therefore, we correct
the MLE for bias by using the standard sample variance estimate
s2XX /(n − 1).
Derivation of the properties of the MLE in general requires some
quite sophisticated mathematics, beyond the scope of the current
course. It is essential, however, to have an understanding of some of
the key properties, as listed below.
Proof of this theorem is beyond the scope of this course but relies
on a theorem known as the Cramer-Rao inequality. Essentially,
this states that no unbiased estimate can have a variance below a
certain bound. This bound, known as the Cramer-Rao lower bound,
is given by 1/In (θ ). From our properties above, we know that the
variance of the MLE tends to this lower bound as the sample size
gets bigger (Property 3). We therefore use the Cramer-Rao lower
bound to estimate the variance of the MLE. A proof of the theorem
is given on p.275 of Rice’s Mathematical Statistics and Data Analysis
(1995).
The crucial part of applying the normal distribution above is of
course the In−1 (θ ), which gives the variance of θ̂ and is therefore
needed for any inference. We call In (θ ) the “expected Fisher infor-
mation” and it is computed using

In(θ) = E[ (∂ℓ(θ)/∂θ)² ] = ∑_{i=1}^{n} E[ (∂/∂θ ln f(Xi|θ))² ] = n E[ (∂/∂θ ln f(X|θ))² ]

or

In(θ) = −E[ ∂²ℓ(θ)/∂θ² ] = −∑_{i=1}^{n} E[ ∂²/∂θ² ln f(Xi|θ) ] = −n E[ ∂²/∂θ² ln f(X|θ) ]
Here we have made use of the fact that the likelihood function
L(θ; x) = ∏_{i=1}^{n} f(Xi|θ), so the log-likelihood ℓ(θ; x) = ln[∏_{i=1}^{n} f(Xi|θ)] =
∑_{i=1}^{n} ln[f(Xi|θ)]. In the final (third) line of each set of equations we
have assumed that our random sample is independent and identically distributed.
In (θ̂ ) → In (θ ) as n → ∞
λ^x e^{−λ} / x!
From before (chapter 3), we know that the MLE for λ is given by
λ̂ = X̄. To find a confidence interval for λ, we work out the expected Fisher information In(λ) using the second of our equations:

In(λ) = −nE[ ∂²/∂λ² ln f(Xi|λ) ]
We know that

ln f(x|λ) = x ln λ − λ − ln x!

so that

∂/∂λ ln f(x|λ) = x/λ − 1

and

∂²/∂λ² ln f(x|λ) = −x/λ²
Therefore In(λ) is simply

−nE[−X/λ²] = nE[X]/λ² = nλ/λ² = n/λ
Note here that we have made use of the fact that, for a Poisson
random variable X, E[X] = λ. In other contexts, i.e. for other distributions, working out these expectations can take considerable
effort! In any case, a 100(1 − α)% confidence interval for λ is therefore
given by

λ̂ ± zα/2 √(1/In(λ))
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
3 10 9 8 13 21 25 13 9 4 3 2
λ is given by λ̂ = X̄ = 10, and a 95% confidence interval for λ is
given by

X̄ ± zα/2 √(X̄/n) = 10 ± 1.96 √(10/12) = [8.21; 11.79]
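The same calculation is easily done in R (a sketch, using the monthly counts from the table above):

# monthly accident counts
x <- c(3, 10, 9, 8, 13, 21, 25, 13, 9, 4, 3, 2)
n <- length(x)

lambda.hat <- mean(x)                        # MLE of lambda
se <- sqrt(lambda.hat / n)                   # sqrt(1/I_n(lambda)), with lambda estimated
lambda.hat + c(-1, 1) * qnorm(0.975) * se    # 95% confidence interval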
• Enter the sample data into Excel, and set up the formulae to
calculate any statistics;
(Figure: the log-likelihood plotted against the alpha parameter.)
γ = 5.09). Further columns show the values of ln(xi) and xi^γ, together with
their sums. These are used to set up the log-likelihood function
in cell D7. It is strongly recommended that students set up the same
spreadsheet for themselves, and proceed through with the optimization.
We now select Solver from the Tools menu to maximize the log-
likelihood by changing cells containing c and γ. The constraints
c ≥ 0.001 and γ ≥ 0.001 are also entered. Figure 3.5 illustrates the
Solver options which are selected. Clicking on Solve then yielded
the MLE estimates: ĉ = 0.091 and γ̂ = 3.54. Note that these are
quite a bit closer to the true values (which are known in this case)
than the estimates obtained from matching the quartiles.
Suppose now that we have two estimators for θ, say θ̂1 and θ̂2, either of which may be biased or unbiased. Then θ̂1 is called a more
efficient estimator for θ than θ̂2 if its mean square error is smaller, i.e. if MSE(θ̂1) < MSE(θ̂2).
For two unbiased estimators θ̂1 and θ̂2 , θ̂1 is more efficient than θ̂2
if:
var(θ̂1 ) < var(θ̂2 ). (3.5)
Relative Efficiency
For two unbiased estimators, θ̂1 and θ̂2, of the parameter θ, with
variances var(θ̂1) > var(θ̂2) (so that θ̂1 is less efficient than θ̂2), we
define the efficiency of θ̂1 relative to θ̂2 by the ratio

Relative efficiency = var(θ̂2) / var(θ̂1)
and

E(µ̂2) = E[ (1/3)y1 + (1/3)y2 + (1/3)y3 ]
       = (1/3)E(y1) + (1/3)E(y2) + (1/3)E(y3)
       = (1/3)µ + (1/3)µ + (1/3)µ
       = µ.
Thus both estimators are unbiased, but now consider the vari-
ances:
var(µ̂1) = var[ (1/4)y1 + (1/2)y2 + (1/4)y3 ]
         = (1/16)var(y1) + (1/4)var(y2) + (1/16)var(y3)
         = 3σ²/8
while
var(µ̂2) = var[ (1/3)y1 + (1/3)y2 + (1/3)y3 ]
         = (1/9)var(y1) + (1/9)var(y2) + (1/9)var(y3)
         = 3σ²/9.
Thus var(µ̂2 ) < var(µ̂1 ), so that µ̂2 is more efficient than µ̂1 . The
relative efficiency of µ̂1 compared to µ̂2 is:
(3σ²/9) / (3σ²/8) = 0.889
i.e. an efficiency of 89% relative to µ̂2 .
• Show that the two estimators, µ̂1 = ( x1 + 2x2 )/3 and µ̂2 =
(2x1 + 3x2 )/5 are unbiased estimators of µ.
• Which is the more efficient estimator of µ?
Drive that are red. Student A decides to observe the first 10 cars
and record X, the number that are red. Student A observes
Tutorial Exercises
2. The table below shows the number of times, X, that 356 students
switched majors during their under-graduate study.
Number of major changes 0 1 2 3
Observed frequency 237 90 22 7
f(y|θ) = (1/θ²) y e^{−y/θ},   y > 0, θ > 0
(a) Find the maximum likelihood estimator of θ.
(b) What is the mle of g(θ ) = 1θ ?
(c) Suppose a random sample of 6 observations from the above
pdf yielded the following observations:
9.2 5.6 18.3 12.1 10.7 11.5
What is the maximum likelihood estimate of θ?
7. For the data of the previous problem, use quartiles and MLE
estimation to fit a Weibull distribution to these data. Construct
the probability plots to check both fits.
The sample mean and standard deviation of the above data are
0.853 and 0.105 respectively.
f(x) = [ Γ(α + β) / (Γ(α)Γ(β)) ] · x^{β−1} / (1 + x)^{α+β}
0.29 0.84 1.93 2.16 2.21 2.65 3.70 5.44 8.50 21.93
f(x) = αγ x^{γ−1} / (1 + x^γ)^{α+1}
4.1 Introduction
Steel Strike Example (cont) Recall that we had three states (=dura-
tion of strike), with associated probabilities of 0.3, 0.4 and 0.3.
We shall now term these the prior probabilities, as they refer to
the probabilities applying prior to any information gathering.
Suppose now that an opinion polling organization specializing
in labour relations can be commissioned to conduct a survey of
labour opinions, on the basis of which they will produce a report
containing a simple statement to the effect either that the situa-
tion is “serious” or that it is “not serious”. Experience in similar
previous situations (or perhaps subjective assessment of the reli-
ability of the survey) may allow us to assess the probabilities of
each of the two possible survey results, for each possible state.
For example, from past experience with this organization in pre-
vious studies we may judge that if θ turns out ultimately to be 0
(in which case the “correct” statement should be “not serious”),
they are likely to get it right about 90% of the time. In other
words, we judge that Pr[ Result = “Not serious”|θ = 0] = 0.9,
so that Pr[ Result = “Serious”|θ = 0] = 0.1. In a similar manner,
we could judge reliability under the other two scenarios, to give
the full set of conditional probabilities displayed in Figure 4.2.
In standard statistical terms, these conditional probabilities rep-
resent sampling variability, i.e. the extent to which sample (or
survey) results may vary for each particular state. Note that the
conditional probabilities sum to 1 down the columns, but not
along the rows. Think why!
At an intuitive level, a survey result of “not serious” should in-
crease the probability associated with θ = 0, while a survey
result of “serious” should increase the probability associated
with θ = 60. But by how much should these change? The an-
swer is provided by Bayes’ theorem. The values in rows 16 and
17 of columns B-D of Figure 4.2 are obtained by multiplying the
conditional probabilities on the survey results (rows 12-13) by
the prior probabilities for the corresponding states (row 9). Rows
16-17 thus contain the joint probabilities of occurrence for each
88
Figure 4.2: Application of Bayes' rule to the Steel Strike Example

     A                      B       C       D       E
1    Estimated Costs (Rm):
2
3    Stock                  Scenario (Strike Duration)     Expected
4    Ordered                0       30      60              Loss
5    0                      0       2       6               2.60
6    30                     1.5     1       2.5             1.60
7    60                     3       2       1.5             2.15
8
9    Probability            0.3     0.4     0.3
10
11   Conditional probabilities for survey result given the state:
12   Result="serious"       0.1     0.4     0.85
13   Result="not serious"   0.9     0.6     0.15
14
15   Joint probabilities of survey result and state:        Predictive probs.
16   Result="serious"       0.03    0.16    0.255           0.445
17   Result="not serious"   0.27    0.24    0.045           0.555
18
19   Conditional probabilities on state given the survey result:
20   Result="serious"       0.0674  0.3596  0.5730
21   Result="not serious"   0.4865  0.4324  0.0811
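The calculation in Figure 4.2 can also be checked with a few lines of R (a sketch reproducing the joint, predictive and posterior probabilities):

# prior probabilities for the three states (strike duration 0, 30, 60)
prior <- c(0.3, 0.4, 0.3)

# conditional probabilities of each survey result given the state (rows 12-13)
lik <- rbind("serious"     = c(0.1, 0.4, 0.85),
             "not serious" = c(0.9, 0.6, 0.15))

# joint probabilities (rows 16-17) and predictive probabilities (column E)
joint <- sweep(lik, 2, prior, "*")
predictive <- rowSums(joint)

# posterior probabilities of each state given the survey result (rows 20-21)
posterior <- joint / predictive
round(posterior, 4)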
Conditional probs for test result:
TST   0.64    0.34    0.12    0.10
MST   0.20    0.40    0.35    0.25
MFT   0.10    0.16    0.29    0.30
TFT   0.06    0.10    0.24    0.34

Joint probs (last column gives the predictive probs):
TST   0.128   0.102   0.048   0.010   0.288
MST   0.040   0.120   0.140   0.025   0.325
MFT   0.020   0.048   0.116   0.030   0.214
TFT   0.012   0.030   0.096   0.034   0.172

Posterior probs given test result of:
TST   0.444   0.354   0.167   0.035
MST   0.123   0.369   0.431   0.077
MFT   0.093   0.224   0.542   0.140
TFT   0.070   0.174   0.558   0.198
Bayes’ Theorem:

π(θr|xk) = f(xk|θr) π(θr) / m(xk)
What does this tell us? It says that if we conduct the survey, then:
Before the observations are made (i.e. while we are still considering
whether collecting the information will be worthwhile), we will
not, of course, know what xk will be observed, and thus will not
know what the corresponding expected loss is going to be. We can,
however, compute the posterior distributions and corresponding
optimal actions for every possible outcome xk . Let the optimal
action corresponding to the observation of xk be denoted by a(k) .
The prior, or predictive, expectation of the corresponding optimal
expected loss is then given by:
∑_{k=1}^{n} ∑_{r=1}^{p} L(a^{(k)}, θr) π(θr|xk) m(xk) = ∑_{k=1}^{n} min_{i=1,...,m} [ ∑_{r=1}^{p} L(ai, θr) π(θr|xk) ] m(xk)   (4.2)
EVSI = ∑_{r=1}^{p} L(a*, θr) π(θr) − ∑_{k=1}^{n} ∑_{r=1}^{p} L(a^{(k)}, θr) π(θr|xk) m(xk)
For simplicity, suppose that θ can take on one of the three values
0.05 (i.e. 5%), 0.15 (15%) or 0.25 (25%) only, with associated probabilities 0.5, 0.3 and 0.2 respectively. It is important at this stage
to make sure that you do not confuse the different probabilities
running around!

One simplifying feature is that the binomial coefficient C(n, x) = n!/[x!(n − x)!], which does not depend
on θ, appears in both numerator and denominator, and can thus be
cancelled out, giving:
(Figure: posterior probability plotted against each possible parameter value.)
definition, θ > 0, but the effective upper bound is about 0.15, giving
about twice the spread above the most likely value as there is below
it.
The above results which we have generated numerically can in
fact be obtained algebraically. If all values of θ over the entire inter-
val 0 ≤ θ ≤ 1 are deemed to be possible, then we would require a
continuous prior probability distribution defined by its probability
density function. If we still consider all values to be equally likely a
priori, then the appropriate prior probability density function would
be that of the uniform distribution, namely:
π (θ ) = 1 for 0 ≤ θ ≤ 1.
π(θ|x) = C(n, x) θ^x [1 − θ]^{n−x} / ∫_{0}^{1} C(n, x) θ^x [1 − θ]^{n−x} dθ.

Once again, the binomial coefficient C(n, x) cancels out. Furthermore, the denominator is simply a constant to ensure that the resulting density
integrates to 1, so that we can write π(θ|x) in the form:

π(θ|x) = kθ^x [1 − θ]^{n−x}.
This can be shown (see the formula sheet at the back of the notes,
for example) to be the probability density of the beta distribution,
where the constant k would take on the value Γ(n + 2)/[Γ( x +
1)Γ(n − x + 1)] = (n + 1)!/[ x!(n − x )!]. Figure 4.6 does in fact
closely match the corresponding beta distribution for x = 5 and
n = 100.
In fact, the beta distribution can be used to model other prior
information concerning θ. For example, if the prior distribution is
not uniform, but is of the beta distribution form with probability
E[θ] = a/(a + b)     E[1 − θ] = b/(a + b)     Var[θ] = ab / [(a + b)²(a + b + 1)].
Suppose, for example, that in a particular context we judge a priori
that θ is expected to be around 0.2 (the “prior expectation”), and
that we are fairly sure (say with something like 95% “confidence”)
that 0.1 ≤ θ ≤ 0.4. To match the prior expectation, we need a/( a +
b) = 0.2 and b/( a + b) = 0.8, i.e. ab/( a + b)2 = 0.16. If the range
corresponds to something like 4 standard deviations (compare the
normal distribution), then we should have a standard deviation
σ ≈ 0.3/4, i.e. σ2 ≈ [(0.4 − 0.1)/4]2 = 0.005625, so that 0.16/( a +
b + 1) ≈ 0.005625. This yields a + b + 1 = 28.4, or a + b = 27.4. (We
do not have to use integers, as Γ( a), Γ(b) etc. are defined for non-
integer values, if we ever need to evaluate the constant k.) Thus we
need a = 0.2 × 27.4 = 5.48. It is possible to plot this density using
a statistical package's probability calculator (or in Excel) to confirm that it
properly represents prior judgement, and to modify the parameters
slightly if needed.
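In R this check is a one-liner (a sketch, using the values a = 5.48 and a + b = 27.4 derived above):

# prior with mean 0.2 and a + b = 27.4
a <- 5.48
b <- 27.4 - 5.48

# plot the prior density and check that most of the mass lies in [0.1, 0.4]
curve(dbeta(x, a, b), from = 0, to = 1, xlab = "theta", ylab = "prior density")
qbeta(c(0.025, 0.975), a, b)   # central 95% prior interval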
Cumulative number of trials (n):      10    30    100
Cumulative number of successes (x):    4     8     24
The posterior distributions for each of the three cases are dis-
played in Figure 4.7. These distributions are more-or-less centred
around the sample means (since the uniform prior is not very
informative), and gradually concentrate around the true value of
θ.
(Figure 4.7: posterior distributions of θ after observing 4 successes out of 10, 8 out of 30, and 24 out of 100 trials.)
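These posterior densities are easy to draw in R, since under a uniform prior the posterior after x successes in n trials is the Beta(x + 1, n − x + 1) distribution derived above (a sketch):

# posterior under a uniform prior on theta is Beta(x + 1, n - x + 1)
ns <- c(10, 30, 100)
xs <- c(4, 8, 24)

curve(dbeta(x, xs[1] + 1, ns[1] - xs[1] + 1), from = 0, to = 1, ylim = c(0, 10),
      xlab = "theta", ylab = "posterior density")
curve(dbeta(x, xs[2] + 1, ns[2] - xs[2] + 1), add = TRUE, lty = 2)
curve(dbeta(x, xs[3] + 1, ns[3] - xs[3] + 1), add = TRUE, lty = 3)
legend("topright", c("4 out of 10", "8 out of 30", "24 out of 100"), lty = 1:3)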
Poisson sampling
Suppose that x1 , x2 , . . . , xn are independent observations from the
Poisson distribution with parameter λ, i.e. f ( x |λ) = λ x e−λ /x!.
Ignoring the x! factor which does not depend on the unknown
parameter λ, we can write the likelihood function in the form:
n
L(λ; x1 , x2 , . . . , xn ) ∝ λ∑i=1 xi e−nλ .
ϕα λα−1 e−ϕλ
π (λ) = .
Γ(α)
Normal sampling
Suppose now that x1 , x2 , . . . , xn are independent observations from
a normal distribution with unknown mean µ. For simplicity, we
restrict ourselves to the case of known variance σ2 . It will simplify
the algebra if we define τ = 1/σ2 (where τ is sometimes called the
“precision” of the distribution), for then the likelihood function can
be written in the form:
L(µ; x1, x2, . . . , xn) = [τ/(2π)]^{n/2} e^{−τ ∑_{i=1}^{n} (xi − µ)²/2}.
Once again we can ignore factors which do not depend on the
unknown parameter µ, so that:
L(µ; x1, x2, . . . , xn) ∝ e^{−τ ∑_{i=1}^{n} (xi − µ)²/2}.
Yet again, the first term does not depend on the unknown parame-
ter µ, so that we can write:
L(µ; x1, x2, . . . , xn) ∝ e^{−nτ(µ − x̄)²/2}.
= (ξ + nτ) [ µ² − 2 ((ξm + nτx̄)/(ξ + nτ)) µ + (ξm + nτx̄)²/(ξ + nτ)² ] + terms independent of µ

where we define:

µ̂ = (ξm + nτx̄)/(ξ + nτ).

It then finally follows that:

π(µ|x1, x2, . . . , xn) ∝ e^{−(ξ + nτ)[µ − µ̂]²/2}
Tutorial Exercises
Assume that the three states are equally likely. In which invest-
ment should the R10000 be placed? What is the EVPI?
The two tests concerned are not entirely precise in their oper-
ation and their accuracy has been investigated by means of a
large-scale controlled trial on a probabilistic basis with the re-
sults shown in the table below (expressed as conditional proba-
bilities for each test result, given the true state of the driver, for
each test).
θ1 θ2 θ3 θ4
a1 10 20 -20 13
a2 12 14 0 15
a3 7 2 18 9
θ1 θ2 θ3 θ4
x1 .1 .2 .7 .4
x2 .9 .8 .3 .6
that lots come from a process in which the mean per cent defec-
tive is either 3 or 8 per cent. At the outset these two possibilities
are taken to be equally likely. For analysis, the cost of testing a
component is taken to be $1000; the cost of accepting a defective
component is $15000, and the cost of rejecting a good component
is $1000.
A plan is proposed which involves testing two components from
a lot and accepting or rejecting the entire lot based on what is
learned from these two tests. Show that this particular plan is
not as good as simply accepting the lot without any tests.
11. Let θ be the true grade of ore in a proposed new gold mining
site (expressed in grams per ton). A series of 16 borehole sam-
ples has resulted in sample grades given by X1 , . . . , X16 , which
are assumed to be normally distributed with mean θ and a stan-
dard deviation of 4. Mine management has to decide whether or
not to mine this site. If they mine, the nett profit will be 3θ − 60
(millions of Rand). Note that a positive profit will only be made
if θ > 20 grams per ton. The company geologist, prior to see-
ing the sample grades, estimates θ to be 24 grams per ton (i.e.
payable). The average of the sixteen samples is however only 18.5
grams per ton.
(b) How large must the geologist’s prior probability on {θ > 20}
be, for it to be optimal to mine the site in spite of the sample
values.
Hint: First find values for the variance of the prior distribution
which would make it optimal to mine the site.
13. During the six months immediately after changes to the road
construction, a total of 22 accidents occurred along a particular
stretch of highway. It may be assumed that the number of acci-
dents in any one month follow a Poisson distribution with mean
λ, i.e. having a probability function:
p(x) = λ^x e^{−λ}/x!   for x = 0, 1, . . .
The above data types are discrete (not continuous), and we will
see in this chapter how to construct models for these two types of
data.
Other types of response which we will not cover in this course,
but which are also not normal, and for which other types of gener-
alized linear models could be used, include:
Yi = β0 + β1 x1i + β2 x2i + . . . + ei

where the systematic part β0 + β1 x1i + β2 x2i + . . . is the mean µi, and the error term ei ~ N(0, σ²). Equivalently,

Yi ∼ N(µi, σ²)
where µi = β0 + β1 x1i + β2 x2i + . . .
In other words, we model how the mean response µi is related to
covariates or explanatory variables. Once we have estimated the β
parameters, we have a model for how the mean response is related
to the explanatory variables. The normal distribution, with the
σ2 parameter, comes in to describe the scatter of the observations
around this mean response (often a line). There is only one variance
parameter, implying that for all values of the mean µi the scatter of
the observations around the mean (expected value) is the same.
So, when we use a normal linear regression model (or an anova model) we are assuming that there is normal scatter of observations around the mean. For this to be a valid assumption, the response variable should be something for which a normal distribution is suitable to begin with.

A histogram is the easiest way to quickly check the approximate normality of the response. But note that the raw response variable will be a mixture of lots of normal distributions, because the observations possibly come from different groups and have different means. However, very skew histograms are a good indication that the response variable is not normally distributed.

Let's look again at this different format of the linear regression model:
Yi ∼ N (µi , σ2 )
and
µi = β 0 + β 1 x1i + β 2 x2i + . . .
The first part is the random or stochastic part of the model. It
describes the variability of the observations for a given mean. The
second part is the structural part of the model. Once we have esti-
mated the parameters it is fully known.
We will now generalize the above model to cope with other types
of response. We will change the distribution of the response to
something that is more suitable for count or binomial data. In the
second part we still model some parameter related to the mean of
the response in terms of explanatory variables. For example for
count data we model the average rate of events λi as a function of
explanatory variables, usually using:
log(λi) = β0 + β1 x1i + β2 x2i + . . .

and for binomial data we model the probability of success pi using:

log( pi/(1 − pi) ) = β0 + β1 x1i + β2 x2i + . . .
The relationship between the parameter and the explanatory
variables does not need to be linear. In fact, by using a log link we are
assuming that there is an exponential relationship between λ and
the explanatory variables, and by using a logit link we are assuming
that there is an S-shaped relationship between the probability of
success pi and the explanatory variables.
To summarize the above:
(1) The random component of the model specifies how the observations vary around a mean parameter. This is usually in the form of a probability distribution: normal for normal response variables, Poisson for count response variables, binomial for binomial response variables, etc.

(2) The systematic component of the model is a linear combination (function) of explanatory variables that are related to the mean parameter.

(3) The link function defines the form of the relationship between the mean parameter and the explanatory variables (linear predictor): linear (identity link), exponential (log link) or S-shaped (logit link).

GLMs are linear models because the mean parameter (or some form of it) is still linearly related to the explanatory variables. Linear here refers to the β coefficients appearing in a linear combination, i.e. the terms are just added together. β0 + β1 x1i + β2 x2i + . . . + βp xpi is also called the linear predictor.
the observed data. The method of least squares is usually not appropriate1, but for the special case of normal linear regression, least squares and maximum likelihood turn out to be equivalent.

1 Least squares treats all error terms (residuals) equally. But this is not appropriate if the variance (or uncertainty) of observations is not constant, as is often the case in GLMs.
5.2 Logistic regression
> names(dat)
[1] "Indiv_id"   "Groundtrth" "Size_id"    "Community_" "Grass_heig"
[6] "Andropog_c" "Tree_densi" "Geol_id"    "Xval"       "Yval"
[11] "Fire5698"  "Altitude"   "Slope"      "Aspect"     "Rsp"
[16] "Trmi"      "Twi"
[16] "Trmi" "Twi"
>
> table(dat$Community)
(Figure 5.1: presence/absence (1/0) of lawn grass plotted against slope. Because the response is binary, and the resolution of the slope variable is not very high, the points are all on top of each other.)

(Figure 5.2: proportion of pixels with lawn grass at each value of slope.)
Yi ∼ Bin(1, pi )
A binomial distribution with n = 1 is also called a Bernoulli
distribution, i.e. Yi ∼ Bernoulli ( pi ).
As we saw in the introduction to GLMs, we need to define a ran-
dom part (the distribution of the observations given the probability
of success), and a systematic part (which explanatory variables in-
fluence the parameter of interest). The parameter of interest is the
probability of success, pi .
The most natural random model for binary data is a Bernoulli
distribution, which is just a binomial distribution with n = 1.
Yi ∼ Bernoulli ( pi )
Yi ∼ Bin(ni , pi )
logit( pi ) = β 0 + β 1 xi
By fitting this model to the data we mean that we search for
those values of β 0 and β 1 which best describe the observed change
in relative frequency in 1s compared to 0s with increasing slope
(slope is the explanatory variable in the lawn grass example). With
the logit link we specify that this relationship is s-shaped (between
probability of success and explanatory variable, and linear between
logit( p) and the explanatory variable). Estimates for the β param-
eters are found by using the method of maximum likelihood, i.e.
by finding those values of β 0 and β 1 that make the observed values
most likely, given the specified model.
Likelihood

Likelihood is defined as the probability of the observed data given the assumed model, as a function of unknown parameters. For one binomial data point (let's say an observed 3 successes in 10 trials) the likelihood function would be

L(p) = P(Y = y) = C(10, 3) p³ (1 − p)^{10−3}

(Figure: the likelihood function L(p) for 3 successes in 10 trials, plotted against p.)

For the logistic regression model, the probability of success for observation i is

pi = exp(β0 + β1 xi) / [1 + exp(β0 + β1 xi)]

and the log-likelihood for the n observations is

ℓ = ∑_{i=1}^{n} [ yi log(pi) + (1 − yi) log(1 − pi) ]
Yi ∼ Bin(ni , pi )
logit( pi ) = β 0 + β 1 × slopei
where Yi is presence (1) or absence (0) of lawn grass, all ni = 1,
and pi denotes the probability of lawn grass in pixel i.
----------------------------------------------------------
> m1 <- glm(LG ~ Slope, family = binomial, data = dat)
> summary(m1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5461 0.1151 -4.74 2.1e-06
Slope -0.2773 0.0348 -7.96 1.7e-15
LO = log( p̂i/(1 − p̂i) ) = β̂0 + β̂1 × slopei
Assuming that the logistic regression model is valid, we can
interpret the above estimates as follows: the log-odds for lawn grass
decrease by 0.2773 per unit increase in slope (let’s assume one unit
corresponds to 1 degree).
The odds of lawn grass, pi/(1 − pi), i.e. probability of lawn grass vs
probability of other vegetation, change by a factor exp(−0.2773) =
probability of other vegetation, change by a factor exp(−0.2773) =
0.76 per degree increase in slope, i.e. the odds of a pixel being lawn
grass decreases by 24% for every unit increase in slope. This means
that lawn grass is more common in flat areas.
The log-odds decrease at a constant rate. Also the odds ratio
is constant, which means that the odds change by a constant fac-
tor. However, the probability does not decrease at a constant rate;
its rate of change depends on its value, according to an s-shaped
curve.
Therefore, most of the time we don’t directly calculate the ef-
fect of the coefficients on the probability, but rather on the odds or
log-odds. However, we often need to predict the probability of a
success (lawn grass) given values of the covariates. We saw above
how to calculate the odds and the log-odds. Predicting the proba-
bility is a bit more complicated:
p̂i = exp(β̂0 + β̂1 × slopei) / [1 + exp(β̂0 + β̂1 × slopei)]

or

p̂ = e^{LO} / (1 + e^{LO})
where LO denotes the log-odds.
To calculate the probability of lawn grass in pixels with zero
slope, let's first calculate

ηi = β̂0 + β̂1 × slopei = −0.5461 − 0.2773 × 0 = −0.5461

Then

p̂ = exp(−0.5461) / [1 + exp(−0.5461)] = 0.37
So we would expect about 37% of the pixels with slope 0 to have
lawn grass. We have already seen from the plot that lawn grass is
121
very rare in pixels with high values of slope. But let’s calculate the
probability of lawn grass in a pixel with a slope of 5:
$$\eta_i = -0.5461 - 0.2773 \times 5 = -1.93$$
and, since $\exp(-1.93) = 0.145$,
$$\hat p = \frac{0.145}{1 + 0.145} = 0.126$$
i.e. lawn grass is expected to be present in only about 13% of
pixels with slope 5. A good way to understand how the probability
of success is related to the explanatory variable is to add the fitted
line to the plot:
[Figure: presence/absence of lawn grass, and the fitted logistic regression line for the probability of lawn grass, in relation to slope.]
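The fitted curve in the figure above could be drawn, and the predicted probabilities obtained, along the following lines (a sketch, assuming m1 and dat as before):
-----------------------------------------------------------------
## Predicted probabilities at slope 0 and slope 5 (type = "response" gives probabilities)
predict(m1, newdata = data.frame(Slope = c(0, 5)), type = "response")

## Plot the raw 0/1 data and superimpose the fitted logistic curve
plot(LG ~ Slope, data = dat, las = 1, ylab = "lawn grass")
slope.grid <- seq(min(dat$Slope), max(dat$Slope), length.out = 200)
lines(slope.grid,
      predict(m1, newdata = data.frame(Slope = slope.grid), type = "response"))
-----------------------------------------------------------------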
Confidence intervals
Maximum likelihood estimates are asymptotically normally
distributed, so the normal approximation is reasonable in large data
sets. In such cases we can construct a normal (Wald) confidence
interval for the log-odds, and then, to obtain confidence intervals
for the odds or probability, transform these confidence limits to the
desired scale.
For example, a 95% confidence interval for the effect of slope on
the log-odds would be
$$\hat\beta_1 \pm 1.96 \times SE(\hat\beta_1) = -0.2773 \pm 1.96 \times 0.0348 = [-0.35;\ -0.21]$$
Confidence intervals in R:
-----------------------------------------------
> ## Wald intervals
> confint.default(m1)
2.5 % 97.5 %
(Intercept) -0.77 -0.32
Slope -0.35 -0.21
> exp(confint.default(m1)) ## CI for the odds ratio
2.5 % 97.5 %
(Intercept) 0.46 0.73
Slope 0.71 0.81
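To transform confidence limits from the log-odds scale to the probability scale, one possibility is the following (a sketch, assuming m1 as above; plogis() is the inverse logit in base R):
-----------------------------------------------------------------
## 95% CI for the probability of lawn grass in a pixel with zero slope
pr <- predict(m1, newdata = data.frame(Slope = 0), se.fit = TRUE)   # log-odds scale
ci.logodds <- pr$fit + c(-1.96, 1.96) * pr$se.fit
plogis(ci.logodds)   # back-transform the limits to the probability scale
-----------------------------------------------------------------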
The binomial model assumes: (1) each of the $n_i$ trials has the same
probability of success $p_i$; (2) the $n_i$ trials are independent, meaning
that the outcome of one trial is not influenced by the outcome of the
others; and (3) the final number of trials does not depend on the
number of successes.
One useful check of the model is to plot the observed responses
(if we have proportions) against every explanatory variable, with
the fitted line superimposed onto this plot. For binary data we
can plot observed proportions for groups of observations, for
example for each distinct value of the explanatory variable (as we
did in the lawn grass example). If the fitted line does not describe
the observed relationship very well, we must change our model.
A bunch of outliers can also point to a misspecification of the
relationship.

[Figure 5.5: Fitted logistic regression curve for lawn grass data. The points are observed proportions of lawn grass out of all pixels for each distinct value of slope.]

Based on a visual comparison of the observed proportions and
the fitted probability, the lawn grass model works very well (Figure
5.2.6).
As in linear regression, we need to check that there are no in-
fluential observations (a few single observations that have a large
influence on the parameter estimates).
Also, as in linear regression, the residuals must be independent,
which means that we must account for spatial, serial or block-
ing structures in the model. Often, the observations are a random
sample from a large population, in which case it is reasonable to
assume independent observations.
1. Pearson residuals:
$$r_i = \frac{\text{observed} - \text{fitted}}{SE(\text{fitted})}$$
2. Deviance residuals:
The deviance of a model is calculated as
$$D = -2(\ell_c - \ell_f)$$
where $\ell_c$ is the (maximized) log-likelihood of the current model,
and $\ell_f$ is the log-likelihood of the saturated model.

[Figure: standard diagnostic plots for the fitted model — residuals vs fitted values, standardized deviance residuals, standardized Pearson residuals, and Cook's distance.]

(For the binomial model, $\text{Var}(Y_i) = n_i p_i (1 - p_i)$.)

This gives a rough idea of the extent to which the current model
adequately represents the data. The deviance will be large when
$L_c$ is small relative to $L_f$ and will be small when the current model
explains the data nearly as well as the full model. It can be shown
that asymptotically $D \approx \chi^2_{n-k-1}$, where $n$ is the number of
observations, and $k+1$ is the number of parameters estimated. This
approximation holds only for large $n$, and is not a good measure of
fit for binary data.
So a rough measure of fit (but not for small data sets, or small
$n_i$) is obtained by comparing the residual deviance to the corresponding
degrees of freedom. They should be roughly the same.
When the residual deviance is much larger than the degrees of
freedom, the model leaves much of the variability in the response
unexplained. When we checked for overdispersion (Section 5.2.7)
we used the same check.
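In R the two types of residuals, the rough deviance check, and the diagnostic plots shown above can be obtained as follows (a sketch, assuming m1 as before):
-----------------------------------------------------------------
## Pearson and deviance residuals
r.pearson  <- residuals(m1, type = "pearson")
r.deviance <- residuals(m1, type = "deviance")

## Rough check: residual deviance against its degrees of freedom
c(deviance(m1), df.residual(m1))

## Standard diagnostic plots (as in the figure above)
par(mfrow = c(2, 2))
plot(m1)
-----------------------------------------------------------------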
$$\text{% deviance explained} \approx \frac{D_{null} - D_{current}}{D_{null}}$$
The change in deviance (Dnull − Dcurrent ) measures the amount
of deviance explained, or reduction in deviance, when adding some
extra terms to the model. Percentage deviance explained does not
require large-sample conditions, and can be used as a rough mea-
sure of goodness-of-fit for any model.
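In R the percentage of deviance explained can be computed directly from the fitted object (a sketch, assuming m1 as above; the same quantities appear as "Null deviance" and "Residual deviance" in summary()):
-----------------------------------------------------------------
## Percentage of deviance explained by the current model
(m1$null.deviance - m1$deviance) / m1$null.deviance
-----------------------------------------------------------------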
Here is part of the summary from the logistic regression model
for presence of lawn grass on slope:
----------------------------------------------------------
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5461 0.1151 -4.74 2.1e-06
Slope -0.2773 0.0348 -7.96 1.7e-15
$$\text{% deviance explained} = \frac{954.93 - 866.69}{954.93} = 0.09$$
That is, slope explains roughly 9% of the total deviance (the total
distance to the saturated model). What this means is that slope
alone helps to discriminate
between some of the 0s and 1s but by no means will give a perfect
prediction. It does not mean that only 9% of the data are correctly
predicted. However, for binary logistic regression we often want
some estimate of how well we will be able to predict presence/absence
given specific values of the covariates: how useful is the model for
predicting whether a pixel will be covered by lawn grass or not, and
what percentage of pixels is classified correctly (and incorrectly),
given the slope?
[R output: confusion matrix of observed vs predicted classes; output truncated.]
---------------------------------------------------------
> auc(obs, pred)   # area under the curve
[1] 0.7
> library(pROC)
Call:
roc.default(response = obs, predictor = pred)
Data: pred in 755 controls (obs 0) < 192 cases (obs 1).
Area under the curve: 0.7
---------------------------------------------------------

[Figure: ROC curve for the lawn grass model, sensitivity plotted against specificity.]

The area under the ROC curve (AUC) summarizes how well the
predicted probabilities discriminate between the observed 0s and 1s,
and is therefore a useful measure of accuracy for a logistic
regression model (which aims to predict presence/absence, or
success/failure). Our model for lawn grass has low to medium
accuracy. There are likely some important variables that determine
presence or absence of lawn grass which we haven't yet included in
the model.
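A sketch of how such output could be produced, assuming a vector obs of observed 0/1 values and pred of predicted probabilities (the 0.5 cut-off below is an arbitrary illustration, not part of the notes):
-----------------------------------------------------------------
library(pROC)

## Simple confusion matrix using an (arbitrary) 0.5 classification threshold
pred.class <- as.numeric(pred > 0.5)
table(observed = obs, predicted = pred.class)

## ROC curve and area under the curve
r <- roc(response = obs, predictor = pred)
auc(r)
plot(r)   # sensitivity against specificity
-----------------------------------------------------------------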
Suppose we have two models:
$$\text{Model 1:} \quad \text{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$
$$\text{Model 2:} \quad \text{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$$
Then model 1 is nested in model 2. If the deviance of model 1 is
D1 with degrees of freedom p = n − 3, and the deviance of model
2 is D2 with degrees of freedom q = n − 5, then D1 − D2 measures
the change in deviance due to the variables x3 and x4 (after x1 and x2
have already been included in the model).
The change in deviance $D_1 - D_2$ has an approximate $\chi^2_{p-q}$ distribution,
where $p - q$ (here equal to 2) is the difference in the number of
parameters between the two models. Again the approximation holds
only for large $n_i$ and $N$.
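The test could be carried out by hand as in the sketch below (m.small and m.big are hypothetical nested glm fits; anova() gives the same result):
-----------------------------------------------------------------
## Change in deviance between two nested fits and its degrees of freedom
d.change  <- deviance(m.small) - deviance(m.big)
df.change <- df.residual(m.small) - df.residual(m.big)

## p-value from the chi-squared approximation
pchisq(d.change, df = df.change, lower.tail = FALSE)

## equivalently:
anova(m.small, m.big, test = "Chisq")
-----------------------------------------------------------------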
[Figure: presence/absence of lawn grass and observed proportions of lawn grass plotted against fire frequency (per 40 yrs), in three panels.]
--------------------------------------------------------------------------------------------
### presence/absence of lawn grass vs fire frequency
plot(LG ~ Fire5698, data = dat, las = 1, xlab = "fire frequency (per 40 yrs)",
ylab = "lawn grass", cex.lab = 1.5, cex.axis = 1.5, yaxt = "n")
axis(2, at = c(0, 1), cex.axis = 1.5, las = 1)
From the above plot it seems that lawn grass is absent or rare at
both very low values of fire frequency and at very high values of
fire frequency. From the RHS plot it seems that there is an inter-
mediate fire frequency at which lawn grass is most common. This
suggests a quadratic effect of fire frequency on the presence of lawn
grass. Let’s try all three models: with linear, quadratic and cubic
effects. Just like in regression, all lower order terms should always
also be present in the model, e.g. when we add the cubic term we
also keep intercept, linear and quadratic terms in the model.
--------------------------------------------------------------------------------------------
## three models with linear, quadratic, cubic effect of fire
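## (The fitting code is not shown here; one possible version, with the model
##  formulas taken from the anova output below, would be:)
m1 <- glm(LG ~ Fire5698, family = binomial, data = dat)
m2 <- glm(LG ~ Fire5698 + I(Fire5698^2), family = binomial, data = dat)
m3 <- glm(LG ~ Fire5698 + I(Fire5698^2) + I(Fire5698^3), family = binomial, data = dat)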
--------------------------------------------------------------------------------------------
> anova(m1, m2, m3, test = "LRT")
Analysis of Deviance Table
Model 1: LG ~ Fire5698
Model 2: LG ~ Fire5698 + I(Fire5698^2)
Model 3: LG ~ Fire5698 + I(Fire5698^2) + I(Fire5698^3)
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 945 954
2 944 921 1 33.1 8.6e-09
3 943 921 1 0.0 0.86
--------------------------------------------------------------------------------------------
[Figure: fitted models with a linear, quadratic and cubic effect of fire frequency; two of the fitted lines virtually coincide. On the RHS, for every distinct value of fire frequency, the observed proportion of pixels with lawn grass is shown.]
-----------------------------------------------------------------
> summary(m2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.8455 0.9727 -6.01 1.9e-09
Fire5698 1.0149 0.2082 4.87 1.1e-06
I(Fire5698^2) -0.0515 0.0104 -4.94 8.0e-07
-----------------------------------------------------------------
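Because the linear predictor is quadratic in fire frequency (with a negative coefficient on the squared term), the fitted probability peaks at $-\hat\beta_1 / (2\hat\beta_2)$. A quick sketch, assuming m2 from above:
-----------------------------------------------------------------
## Fire frequency at which the fitted probability of lawn grass is highest
b <- coef(m2)
-b["Fire5698"] / (2 * b["I(Fire5698^2)"])   # roughly 10 fires per 40 years
-----------------------------------------------------------------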
> dat$Geol_id <- as.factor(dat$Geol_id)
> m5 <- glm(LG ~ Geol_id, family = binomial, data = dat)
> summary(m5)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7346 0.2087 -8.31 < 2e-16
Geol_id1 1.7346 1.4295 1.21 0.225
Geol_id2 0.2846 0.2446 1.16 0.245
Geol_id3 2.4277 1.2424 1.95 0.051
Geol_id4 0.6931 0.3161 2.19 0.028
Geol_id5 1.7126 0.2959 5.79 7.1e-09
Geol_id6 -1.0688 0.4694 -2.28 0.023
Geol_id7 -0.0233 0.5273 -0.04 0.965
Geol_id9 0.0606 0.4914 0.12 0.902
Each z-test in the output above tests $H_0 : LO(i) - LO(0) = 0$, i.e. that geology class $i$ has the same log-odds of lawn grass as the reference class (Geol_id = 0).
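The estimated probability of lawn grass for each geology class can be obtained directly from the fitted model (a sketch, assuming m5 and dat as above):
-----------------------------------------------------------------
## Predicted probability of lawn grass for each level of Geol_id
newd <- data.frame(Geol_id = factor(levels(dat$Geol_id), levels = levels(dat$Geol_id)))
cbind(newd, p.hat = predict(m5, newdata = newd, type = "response"))
-----------------------------------------------------------------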
-----------------------------------------------------------------
> xs <- c(0:7, 9)
> xs <- as.factor(xs)
Inverse logit:
> p.est <- inv.logit(est)
> p.lcl <- inv.logit(lcl)
> p.ucl <- inv.logit(ucl)

$$\hat p = \text{inv.logit}(LO) = \frac{\exp(LO)}{1 + \exp(LO)}$$

where LO denotes the log-odds.
-----------------------------------------------------------------
dat2 <- na.omit(dat)
[Figure 5.10: Conditional plots of f(Fire5698), f(Slope), f(Geol_id) and f(Trmi) against Fire5698, Slope, Geol_id and Trmi.]
[Figure: ROC curve, sensitivity against specificity.]
Probability Distributions

Binomial: $P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$, $x = 0, 1, \dots, n$; MGF $(1 - p + pe^t)^n$; mean $np$; variance $np(1-p)$.

Poisson: $P(X = x) = \dfrac{\lambda^x e^{-\lambda}}{x!}$, $x = 0, 1, \dots$; MGF $e^{\lambda(e^t - 1)}$; mean $\lambda$; variance $\lambda$.

Geometric: $P(X = x) = p(1-p)^x = pq^x$, $x = 0, 1, \dots$; MGF $\dfrac{p}{1 - qe^t}$; mean $q/p$; variance $q/p^2$.

Normal: $f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2 / 2\sigma^2}$, $-\infty < x < \infty$; MGF $e^{\mu t + \frac{1}{2}t^2\sigma^2}$; mean $\mu$; variance $\sigma^2$.

Exponential: $f(x) = \lambda e^{-\lambda x}$, $0 < x < \infty$; MGF $\dfrac{\lambda}{\lambda - t}$; mean $1/\lambda$; variance $1/\lambda^2$.

Gamma: $f(x) = \dfrac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma(\alpha)}$, $0 < x < \infty$; MGF $(1 - \beta t)^{-\alpha}$; mean $\alpha\beta$; variance $\alpha\beta^2$.
  or: $f(x) = \dfrac{\lambda^\alpha x^{\alpha-1} e^{-\lambda x}}{\Gamma(\alpha)}$, $0 < x < \infty$; MGF $\dfrac{\lambda^\alpha}{(\lambda - t)^\alpha}$; mean $\alpha/\lambda$; variance $\alpha/\lambda^2$.

Beta: $f(x) = \dfrac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} x^{a-1}(1 - x)^{b-1}$, $0 < x < 1$; MGF ---; mean $\dfrac{a}{a+b}$; variance $\dfrac{ab}{(a+b)^2(a+b+1)}$.
Log expansion
$$\ln(1 - r) = -r - \frac{r^2}{2} - \frac{r^3}{3} - \cdots$$
ANOVA
• Sample mean for population i: $\bar Y_{i\cdot} = \sum_{j=1}^{n_i} Y_{ij} / n_i$
• Overall sample mean: $\bar Y_{\cdot\cdot} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} Y_{ij} / N = \sum_{i=1}^{k} \frac{n_i}{N} \bar Y_{i\cdot}$
• Error sum of squares: $SSE = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar Y_{i\cdot})^2$
Regression
• Least squares estimate: $(\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{y}$
S Archibald, WJ Bond, WD Stock, and DHK Fairbanks. Shaping the landscape: fire-grazer interactions in
an African savanna. Ecological Applications, 15(1):96–109, 2005.
P Collinson. Of bombers, radiologists, and cardiologists: time to ROC. Heart, 80(3):215–217, 1998.
Tom Fawcett. An introduction to ROC analysis. Pattern Recognition Letters, 27(8):861–874, 2006.