
Eric A. Suess and Bruce E. Trumbo

Introduction to Probability Simulation and Gibbs Sampling with R
Instructor Manual

Explanations and answers:
© 2011 by Bruce E. Trumbo and Eric A. Suess
All rights reserved.

Statements of problems:
© 2010 by Springer Science+Business Media, LLC
All rights reserved.

January 19, 2012

Springer
Berlin Heidelberg New York
Hong Kong London
Milan Paris Tokyo
Contents

1 Introductory Examples
2 Generating Random Numbers
3 Monte Carlo Integration
4 Applied Probability Models
5 Screening Tests
6 Markov Chains with Two States
7 Markov Chains with Larger State Spaces
8 Introduction to Bayesian Estimation
9 Using Gibbs Samplers to Compute Bayesian Posterior Distributions
11 Appendix: Getting Started with R


1 Introductory Examples

1.1 Based on Example 1.1, this problem provides some practice using R.
a) Start with set.seed(1). Then execute the function sample(1:100, 5)
five times and report the number of good chips (numbered ≤ 90) in each.
> set.seed(1) # For counts of good chips, see part (b).
> sample(1:100, 5)
[1] 27 37 57 89 20
> sample(1:100, 5)
[1] 90 94 65 62 6
> sample(1:100, 5)
[1] 21 18 68 38 74
> sample(1:100, 5)
[1] 50 72 98 37 75
> sample(1:100, 5)
[1] 94 22 64 13 26

b) Start with set.seed(1), and execute sum(sample(1:100, 5) <= 90)


five times. Report and explain the results.
> set.seed(1) # Ensures exactly the same samples as in (a)
> sum(sample(1:100, 5) <= 90)
[1] 5
> sum(sample(1:100, 5) <= 90)
[1] 4
> sum(sample(1:100, 5) <= 90)
[1] 5
> sum(sample(1:100, 5) <= 90)
[1] 4
> sum(sample(1:100, 5) <= 90)
[1] 4

d By setting the same seed as in part (a), we ensure that sample generates the
same sequence of samples of size five as in part (a). In each of the five runs, the
logical vector sample(1:100, 5) <= 90 has five elements—each of them either TRUE
or FALSE. When this vector is summed, TRUEs count as 1s and FALSEs count as 0s.
Thus, each of the five responses here counts the number of sampled values that are
less than or equal to 90. For example, in the second sample 94 exceeds 90, so only
four of the five results meet the criterion. c
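As a minimal illustration of this coercion, using the second sample from part (a):

pick = c(90, 94, 65, 62, 6)   # second sample reported in part (a)
pick <= 90                    # [1]  TRUE FALSE  TRUE  TRUE  TRUE
sum(pick <= 90)               # [1] 4, the count reported in part (b)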

c) Which two of the following four samples could not have been produced
using the function sample(1:90, 5)? Why not?
[1] 2 62 84 68 60 # OK
[1] 46 39 84 16 39 # No, two 39s (sampling without replacement)
[1] 43 20 79 32 84 # OK
[1] 68 2 98 20 50 # No, 98 exceeds 90

1.2 This problem relates to the program in Example 1.1.


a) Execute the statements shown below in the order given. Explain what
each statement does. Which ones produce output? What is the length of
each vector? (A number is considered a vector of length 1.) Which vectors
are logical, with possible elements TRUE or FALSE, and which are numeric?
pick = c(4, 47, 82, 21, 92); pick <= 90; sum(pick <= 90)
pick[1:90]; pick[pick <= 90]; length(pick[pick <= 90])
as.numeric(pick <= 90); y = numeric(5); y; y[1] = 10; y
w = c(1:5, 1:5, 1:10); mean(w); mean(w >= 5)

> pick = c(4, 47, 82, 21, 92) # defines ’pick’ (no output)
> pick <= 90 # logical 5-vector
[1] TRUE TRUE TRUE TRUE FALSE
# 4 <= 90, so 1st element ’TRUE’
> sum(pick <= 90) # four ’TRUE’s on line above
[1] 4
> pick[1:90] # only five elements in ’pick’
[1] 4 47 82 21 92 NA NA NA NA NA NA NA NA NA NA
[16] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[31] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[46] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[61] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# ’NA’s for undefined entries
> pick[pick <= 90] # elements of ’pick’ <= 90
[1] 4 47 82 21
> length(pick[pick <= 90]) # length of vector above is 4
[1] 4
> as.numeric(pick <= 90) # views ’pick <= 90’ as numeric
[1] 1 1 1 1 0 # 1 for ’TRUE’; 0 for ’FALSE’
> y = numeric(5); y # vector of five 0s
[1] 0 0 0 0 0
> y[1] = 10; y # change first element to 10
[1] 10 0 0 0 0
> w = c(1:5, 1:5, 1:10) # defines ’w’ (no output)
> mean(w) # average of twenty numbers
[1] 4.25
> mean(w >= 5) # fraction of them >= 5
[1] 0.4

b) In the program, propose a substitute for the second line of code within
the loop so that good[i] is evaluated in terms of length instead of sum.
Run the resulting program, and report your result.
d For good[i] = sum(pick <= 90), use good[i] = length(pick[pick <= 90]).
With this substitution and still using set.seed(1937), you should get exactly the
same answer as with the original program. c

1.3 The random variable X of Example 1.1 has a hypergeometric distri-


bution. In R, the hypergeometric probabilities P {X = x} can be computed
using the function dhyper. Its parameters are, in order, the number x of good
items seen in the sample, the number of good items in the population, the
number of bad items in the population, and the number of items selected
without replacement. Thus each of the statements dhyper(5, 90, 10, 5) and
dhyper(0, 10, 90, 5) returns 0.5837524.
a) What is the relationship between sample(1:100, 5) and choose(100, 5)?
d The former chooses a specific random sample of size five without replacement from
a population of size 100. The latter is the number of possible samples chosen in this
way. c
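A one-line check of the count of possible samples:

choose(100, 5)   # number of possible samples of 5 chips from 100
# [1] 75287520; each call to sample(1:100, 5) selects one such sample at random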

b) Compute P {X = 2} using dhyper and then again using choose.


> dhyper(2, 90, 10, 5)
[1] 0.006383528
> choose(90, 2)*choose(10, 3)/choose(100, 5)
[1] 0.006383528

c) Run the program of Example 1.1 followed by the statements mean(good)


and var(good). What are the numerical results of these two statements?
In terms of the random variable X, what do they approximate and how
good is the agreement?
# seed not specified; your results will differ slightly
m = 100000; good = numeric(m)
for (i in 1:m)
{
pick = sample(1:100, 5) # vector of 5 items from ith box
good[i] = sum(pick <= 90) # number Good in ith box
}
> mean(good == 5) # approximates P{X = 5} = .5838
[1] 0.58507

> mean(good) # approximates E(X) = 4.5


[1] 4.50168
> var(good) # approximates V(X)
[1] 0.4310215
> 5*.9*.1*(95/99) # see Notes
[1] 0.4318182

d) Execute sum((0:5)*dhyper(0:5, 90, 10, 5))? How many terms are be-
ing summed? What numerical result is returned? What is its connection
with part (c)?
> sum((0:5)*dhyper(0:5, 90, 10, 5)) # sums 6 terms to compute E(X)
[1] 4.5

Notes: If n items are drawn at random without replacement from a box with b Bad
items, g Good items, and T = g + b, then E(X) = ng/T and
V(X) = n(g/T)(1 − g/T)((T − n)/(T − 1)).
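A short sketch verifying these formulas for the chip example (N is used here for the population total T to avoid masking R's built-in T):

n = 5; g = 90; b = 10; N = g + b
n*g/N                                 # E(X) = 4.5
n*(g/N)*(1 - g/N)*((N - n)/(N - 1))   # V(X) = 0.4318182, as in part (c)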

1.4 Based on concepts in the program of Example 1.3, this problem pro-
vides practice using functions in R.
a) Execute the statements shown below in the order given. Explain what
each statement does. State the length of each vector. Which vectors are
numeric and which are logical?
a = c(5, 6, 7, 6, 8, 7); length(a); unique(a)
length(unique(a)); length(a) - length(unique(a))
duplicated(a); length(duplicated(a)); sum(duplicated(a))

> a = c(5, 6, 7, 6, 8, 7) # defines 6-vector ’a’


> length(a) # length of ’a’
[1] 6
> unique(a) # 4 unique elements of ’a’
[1] 5 6 7 8
> length(unique(a)) # counts unique elements
[1] 4
> length(a) - length(unique(a)) # matches
[1] 2
> duplicated(a) # logical 6-vector; ’T’s at matches
[1] FALSE FALSE FALSE TRUE FALSE TRUE
> length(duplicated(a)) # length of above
[1] 6
> sum(duplicated(a)) # matches
[1] 2

b) Based on your findings in part (a), propose a way to count redundant


birthdays that does not use unique. Modify the program to implement
this method and run it. Report your results.
d Inside the program loop, for the line x[i] = n - length(unique(b)), substitute
x[i] = sum(duplicated(b)). c

1.5 Item matching. There are ten letters and ten envelopes, a proper one
for each letter. A very tired administrative assistant puts the letters into the
envelopes at random. We seek the probability that no letter is put into its
proper envelope and the expected number of letters put into their proper
envelopes. Explain, statement by statement, how the program below approx-
imates these quantities by simulation. Run the program with n = 10, then
with n = 5, and again with n = 20. Report and compare the results.
m = 100000; n = 10; x = numeric(m)
for (i in 1:m) {perm = sample(1:n, n); x[i] = sum(1:n==perm)}
cutp = (-1:n) + .5; hist(x, breaks=cutp, prob=T)
mean(x == 0); mean(x); sd(x)

> mean(x == 0); mean(x); sd(x)


[1] 0.36968
[1] 0.99488
[1] 0.9975188

> table(x)/m # provides approximate distribution


x
0 1 2 3 4 5 6 7
0.36968 0.36817 0.18275 0.06070 0.01502 0.00310 0.00053 0.00005

d Above we show the result of one run for n = 10, omitting the histogram, and pro-
viding the approximate PDF of X instead. We did not set a seed for the simulation,
so your answers will differ slightly. For any n > 1, exact values are E(X) = V(X) = 1,
and values approximated by simulation are close to 1. See the Notes below. c
Notes: Let X be the number correct. For n envelopes, a combinatorial argument gives
P {X = 0} = 1/2! − 1/3! + · · · + (−1)^n/n!. [See Feller (1957) or Ross (1997).]
In R, i = 0:10; sum((-1)^i/factorial(i)). For any n > 1, P {X = n − 1} = 0,
P {X > n} = 0, E(X) = 1, and V (X) = 1. For large n, X is approximately POIS(1).
Even for n as small as 10, this approximation is good to two places; to verify this,
run the program above, followed by points(0:10, dpois(0:10, 1)).
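A quick check of the Poisson approximation using the series above (the simulation above gave 0.36968):

i = 0:10
sum((-1)^i/factorial(i))   # exact P{X = 0} for n = 10, about 0.3679
dpois(0, 1)                # POIS(1) value exp(-1), also about 0.3679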

1.6 A poker hand consists of five cards selected at random from a deck of
52 cards. (There are four Aces in the deck.)
a) Use combinatorial methods to express the probability that a poker hand
has no Aces. Use R to find the numerical answer correct to five places.
> choose(4, 0)*choose(48, 5)/choose(52, 5)
[1] 0.658842
> dhyper(0, 4, 48, 5)
[1] 0.658842

b) Modify the program of Example 1.1 to approximate the probability in


part (a) by simulation.

m = 100000; y = numeric(m) # y is number of Aces in hand


for (i in 1:m)
{
hand = sample(1:52, 5) # five-card poker hand
y[i] = sum(hand <= 4) # designate Aces as cards 1-4
}
mean(y == 0) # approximates P{Y = 0} = .6588

> mean(y == 0) # approximates P{Y = 0} = .6588


[1] 0.6583

# extra information not requested in problem


> table(y)/m # approx. dist’n of Y
y
0 1 2 3 4
0.65830 0.29997 0.03996 0.00174 0.00003
> round(dhyper(0:4, 4, 48, 5), 5) # exact dist’n of Y
[1] 0.65884 0.29947 0.03993 0.00174 0.00002

1.7 Martian birthdays. In his science fiction trilogy on the human coloniza-
tion of Mars, Kim Stanley Robinson (1996) arranges the 669 Martian days of
the Martian year into 24 months with distinct names. Imagine a time when
the Martian population consists entirely of people born on Mars and that
birthdays in the Martian-born population are uniformly distributed across
the year. Make a plot for Martians similar to Figure 1.2. (You do not need
to change n.) Use your plot to guess how many Martians there must be in a
room in order for the probability of a birthday match just barely to exceed 1/2.
Then find the exact number with min(n[p > 1/2]).
n = 1:60; p = numeric(60)
for (i in n)
{
q = prod(1 - (0:(i-1))/669); p[i] = 1 - q
}
plot(n, p) # plot of p against n (not shown)

> h = min(n[p > 1/2]); h # h is smallest n with p > 1/2


[1] 31
> cbind(n, p)[(h-3):(h+3), ] # values of n and p for n near h
n p
[1,] 28 0.4361277
[2,] 29 0.4597277
[3,] 30 0.4831476
[4,] 31 0.5063249 # smallest n with p > 1/2
[5,] 32 0.5292007
[6,] 33 0.5517202
[7,] 34 0.5738327

1.8 Nonuniform birthrates. In this problem, we explore the effect on birth-


day matches of the nonuniform seasonal pattern of birthrates in the United
States, displayed in Figure 1.5. In this case, simulation requires an additional
parameter prob of the sample function. A vector of probabilities is used to
indicate the relative frequencies with which the 366 days of the year are to
be sampled. We can closely reflect the annual distribution of U.S. birthrates
with a vector of 366 elements:
p = c(rep( 96,61), rep( 98,89), rep( 99,62),
rep(100,61), rep(104,62), rep(106,30), 25)

The days of the year are reordered for convenience. For example, February
29 appears last in our list, with a rate that reflects its occurrence only one
year in four. Before using it, R scales this vector so that its entries add to 1.
To simulate the distribution of birthday matches based on these birthrates,
we need to make only two changes in the program of Example 1.3. First,
insert the line above before the loop. Second, replace the first line within
the loop by b = sample(1:366, 25, repl=T, prob=p). Then run the mod-
ified program, and compare your results with those obtained in the example
[P {X = 0} = 0.4313 and E(X) ≈ 0.81].
m = 100000; n = 25; x = numeric(m)
p = c(rep( 96,61), rep( 98,89), rep( 99,62),
rep(100,61), rep(104,62), rep(106,30), 25)
for (i in 1:m)
{
b = sample(1:366, n, repl=T, prob=p)
x[i] = n - length(unique(b))
}
cutp = (0:(max(x)+1)) - .5
hist(x, breaks=cutp, prob=T) # histogram (not shown)

> mean(x == 0); mean(x) # approximates P{X = 0} and E(X)


[1] 0.42706 # no exact value from simple combinatorics
[1] 0.81011

1.9 Nonuniform birthrates (continued). Of course, if the birthrates vary too


much from uniform, the increase in the probability of birthday matches will
surely be noticeable. Suppose the birthrate for 65 days we call “midsummer”
is three times the birthrate for the remaining days of the year, so that the
vector p in Problem 1.8 becomes p = c(rep(3, 65), rep(1, 300), 1/4).
a) What is the probability of being born in “midsummer”?
d See Answer (a) below. c
b) Letting X be the number of birthday matches in a room of 25 randomly
chosen people, simulate P {X ≥ 1} and E(X).

m = 100000; n = 25; x = numeric(m)


p = c(rep(3, 65), rep(1, 300), 1/4)
for (i in 1:m) {
b = sample(1:366, n, repl=T, prob=p)
x[i] = n - length(unique(b)) }
cutp = (0:(max(x)+1)) - .5
hist(x, breaks=cutp, prob=T) # histogram (not shown)

> mean(x >= 1); mean(x)


[1] 0.66601
[1] 1.04375 # see Answer (b) below

Answers: (a) sum(p[1:65])/sum(p). Why? (b) Roughly 0.67 and 1.0, respectively.
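A brief sketch of the computation suggested in Answer (a):

p = c(rep(3, 65), rep(1, 300), 1/4)
sum(p[1:65])/sum(p)   # P(born in "midsummer") = 195/495.25, about 0.394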

1.10 Three problems are posed about a die that is rolled repeatedly. In
each case, let X be the number of different faces seen in the specified num-
ber of rolls. Using at least m = 100 000 iterations, approximate P {X = 1},
P {X = 6}, and E(X) by simulation. To do this write a program using the
one in Example 1.3 as a rough guide. In what way might some of your sim-
ulated results be considered unsatisfactory? To verify that your program is
working correctly, you should be able to find exact values for some, but not
all, of the quantities by combinatorial methods.
a) The die is fair and it is rolled 6 times.
m = 100000; x = numeric(m); n = 6
for (i in 1:m) x[i] = length(unique(sample(1:6, n, repl=T)))
mean(x); table(x)/m

> mean(x); table(x)/m


[1] 3.98961
x
1 2 3 4 5 6
0.00010 0.02054 0.23080 0.50250 0.23033 0.01573

> 6/6^6; factorial(6)/6^6


[1] 0.0001286 # Exact for P{X = 1}
[1] 0.0154321 # Exact for P{X = 6}

b) The die is fair and it is rolled 8 times.


m = 100000; x = numeric(m); n = 8
for (i in 1:m) x[i] = length(unique(sample(1:6, n, repl=T)))
mean(x); table(x)/m

> mean(x); table(x)/m


[1] 4.60594
x
2 3 4 5 6
0.00232 0.06984 0.36245 0.45036 0.11503

c) The die is biased and it is rolled 6 times. The bias of the die is such that
2, 3, 4, and 5 are equally likely but 1 and 6 are each twice as likely as 2.
m = 100000; x = numeric(m); n = 6; p = c(2, 1, 1, 1, 1, 2)
for (i in 1:m) x[i] = length(unique(sample(1:6, n, repl=T, prob=p)))
mean(x); table(x)/m

> mean(x); table(x)/m


[1] 3.84522
x
1 2 3 4 5 6
0.00055 0.03717 0.28686 0.47820 0.18637 0.01085

Answers: P {X = 1} = 1/6^5 in (a); the approximation has small absolute error, but
perhaps large percentage error. P {X = 6} = 6!/6^6 = 5/324 in (a), and P {X = 6} =
45/4096 = 0.0110 in (c).
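A short verification of these exact values; the last line assumes the face probabilities 2/8, 1/8, 1/8, 1/8, 1/8, 2/8 implied by prob=p in part (c):

6/6^6                          # P{X = 1} in (a)
factorial(6)/6^6               # P{X = 6} = 5/324 in (a)
factorial(6)*(2/8)^2*(1/8)^4   # P{X = 6} = 45/4096 in (c)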

1.11 Suppose 40% of the employees in a very large corporation are women.
If a random sample of 30 employees is chosen from the corporation, let X be
the number of women in the sample.
a) For a specific x, the R function pbinom(x, 30, 0.4) computes P {X ≤ x}.
Use it to evaluate P {X ≤ 17}, P {X ≤ 6}, and hence P {7 ≤ X ≤ 17}.
> p1 = pbinom(17, 30, 0.4); p2 = pbinom( 6, 30, 0.4)
> p1; p2
[1] 0.9787601
[1] 0.01718302
> p1 - p2 # X in closed interval [7, 17]
[1] 0.9615771
> diff(pbinom(c(6, 17), 30, 0.4)) # all in one line
[1] 0.9615771
> sum(dbinom(7:17, 30, 0.4)) # another method:
[1] 0.9615771 # summing 11 probabilities

b) Find µ = E(X) and σ = SD(X). Use the normal approximation to evaluate


P {7 ≤ X ≤ 17}. That is, take Z = (X − µ)/σ to be approximately
standard normal. It is best to start with P {6.5 < X < 17.5}. Why?
d Using the formulas for the mean and variance of a binomial random variable, we
have µ = nπ = 30(0.4) = 12 and σ² = nπ(1 − π) = 12(0.6) = 7.2. Then

    P{7 ≤ X ≤ 17} = P{6 < X ≤ 17} = P{6.5 ≤ X ≤ 17.5}
                  ≈ P{(6.5 − 12)/√7.2 ≤ (X − µ)/σ = Z ≤ (17.5 − 12)/√7.2}
                  = P{−2.0497 < Z ≤ 2.0497} = 0.9596,

where Z ∼ NORM(0, 1) and we used diff(pnorm(c(-2.0497, 2.0497))) in R to


get the numerical result. Two-place accuracy is typical of such uses of the normal
approximation to the binomial. Alternatively, we could evaluate this probability with
the code diff(pnorm(c(6.5, 17.5), 12, sqrt(7.2))), which also returns 0.9596.
Here, the second and third arguments of pnorm designate the mean and standard
deviation of NORM(µ, σ).
The use of half-integer endpoints is called the continuity correction, appro-
priate because the binomial distribution is discrete (taking only integer values)
whereas the normal distribution is continuous. By using diff(pnorm(c(7, 17),
12, sqrt(7.2))), we would obtain 0.9376, losing roughly half of each of the proba-
bility values P {X = 7} and P {X = 17}. The exact value is 0.9615 (rounded to four
places), from diff(pbinom(c(6,17), 30, .4)). c

c) Now suppose the proportion π of women in the corporation is unknown.


A random sample of 30 employees has 20 women. Do you believe π is as
small as 0.4? Explain.
d You shouldn’t. With n = 30 and π = 0.4, the number of women chosen lies
between 7 and 17 (inclusive) with probability about 0.96. So if π = 0.4, then it
seems very unlikely to get as many as 20 women in a random sample of 30. Because
we observed 20 women in our sample, it seems reasonable to assume that the value
of π is larger than 0.4. c
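A one-line check of this reasoning, not part of the original solution:

1 - pbinom(19, 30, 0.4)   # P{X >= 20} when pi = 0.4 is about 0.003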

d) In the circumstances of part (c), use formula (1.2) to find an approximate


95% confidence interval for π.
d The sample proportion p = X/n = 20/30 = 2/3 estimates the population propor-
tion π. The margin of error for a 95% confidence interval is
1.96√(p(1 − p)/n) = 1.96√(2/270) = 0.1687.

Thus the desired confidence interval is 0.6667 ± 0.1687 or (0.4980, 0.8354). c


Hints and comments: For (a) and (b), about 0.96; you should give 4-place accuracy.
The margin of error in (d) is about 0.17. Example 1.5 shows that the actual coverage
probability of the confidence interval in (d) may differ substantially from 95%; a
better confidence interval in this case is based on the Agresti-Coull adjustment of
Problem 1.16: (0.486, 0.808).
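A brief sketch reproducing the part (d) interval and the Agresti-Coull interval mentioned in the hint, for x = 20 Successes in n = 30 trials:

x = 20; n = 30; p = x/n; pm = c(-1, 1)
p + pm*1.96*sqrt(p*(1-p)/n)                  # traditional: about (0.4980, 0.8354)
p.ac = (x + 2)/(n + 4)
p.ac + pm*1.96*sqrt(p.ac*(1-p.ac)/(n + 4))   # Agresti-Coull: about (0.4864, 0.8077)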

1.12 Refer to Example 1.4 and Figure 1.5 on the experiment with a die.
a) Use formula (1.2) to verify the numerical values of the confidence intervals
explicitly mentioned in the example (for students 2, 10, and 12).
n = 30; pp = 1/6; pm = c(-1,1)
> x = 1; p = x/n; round(p + pm*1.96*sqrt(p*(1-p)/n), 3)
[1] -0.031 0.098 # Students 2 & 12
> x =10; p = x/n; round(p + pm*1.96*sqrt(p*(1-p)/n), 3)
[1] 0.165 0.502 # Student 10

b) In the figure, how many of the 20 students obtained confidence intervals


extending below 0? d Answer: 7. c

c) The most likely number of 6s in 20 rolls of a fair die is three. To verify this,
first use i = 0:20; b = dbinom(i, 20, 1/6), and then i[b==max(b)]
or round(cbind(i, b), 6). How many of the 20 students got five 6s?
d Answer: 4. c

i = 0:20; b = dbinom(i, 20, 1/6)


> i[b==max(b)]
[1] 3
> round(cbind(i, b), 6)[1:10, ] # brackets to print only 10 rows
i b
[1,] 0 0.026084
[2,] 1 0.104336
[3,] 2 0.198239
[4,] 3 0.237887 # maximum probability
[5,] 4 0.202204
[6,] 5 0.129410
[7,] 6 0.064705
[8,] 7 0.025882
[9,] 8 0.008412
[10,] 9 0.002243

1.13 Beneath the program in Example 1.1 on sampling computer chips, we


claimed that the error in simulating P {X = 5} rarely exceeds 0.004. Consider
that the sample proportion 0.58298 is based on a sample of size m = 100 000.
a) Use formula (1.2) to find the margin of error of a 95% confidence interval
for π = P {X = 5}. With such a large sample size, this formula is reliable.
d The margin of error is 1.96√(0.58298(1 − 0.58298)/100000) = 0.00306. c

b) Alternatively, after running the program, you could evaluate the margin
of error as 1.96*sqrt(var(good==5)/m). Why is this method essentially
the same as in part (a)? (Ignore the difference between dividing by m and
m − 1. Also, for a logical vector g, notice that sum(g) equals sum(g^2).)
d To begin, recall that the variance of a sample Y1, . . . , Yn is defined as

    s² = Σ_{i=1}^n (Yi − Ȳ)²/(n − 1) = [Σ_{i=1}^n Yi² − (1/n)(Σ_{i=1}^n Yi)²]/(n − 1),

where the expression at the right is often used in computation.
Denote the logical vector (good==5) as g, which we take to have numerical values
0 and 1. Then var(g) is the same as (sum(g^2) - sum(g)^2/m)/(m-1). For 0–1
data, Σ_i Yi = Σ_i Yi², so this reduces to (sum(g)/(m-1))*(1 - sum(g)/m).
For huge m, this is essentially mean(g)*(1 - mean(g)). But mean(g) is the sample
proportion p of instances where we see five good chips. So the argument of the square
root is essentially p(1 − p)/m. Here is a numerical demonstration with the program
of Example 1.1. c

m = 100000; good = numeric(m)


for (i in 1:m) {
pick = sample(1:100, 5)
good[i] = sum(pick <= 90) }
p = mean(good == 5); p
round(1.96*sqrt(var(good==5)/m), 5)
round(1.96*sqrt(p*(1 - p)/m), 5)

> p = mean(good == 5); p


[1] 0.58279
> round(1.96*sqrt(var(good==5)/m), 5)
[1] 0.00306
> round(1.96*sqrt(p*(1 - p)/m), 5)
[1] 0.00306

1.14 Modify the R program of Example 1.5 to verify that the coverage
probability corresponding to n = 30 and π = 0.79 is 0.8876. Also, for n = 30,
find the coverage probabilities for π = 1/6 = 0.167, 0.700, and 0.699. Then find
coverage probabilities for five additional values of π of your choice. From this
limited evidence, which appears to be more common—coverage probabilities
below 95% or above 95%? In Example 1.4, the probability of getting a 6 is
π = 1/6, and 18 of 20 confidence intervals covered π. Is this better, worse, or
about the same as should be expected?
d The code below permits confidence levels other than 100(1 − α)% = 95%, using κ
which cuts probability α/2 from the upper tail of NORM(0, 1). (You might want to
explore intervals with target confidence 90% or 99%.) Also, the changeable quantities
have been put into the first line of the program.
After the first run with n = 30 and π = 0.79, we show only the first line and
the true coverage probability. Most of the few instances here show true coverage
probabilities below 95%, a preliminary impression validated by Figure 1.6.
The last run, with n = 20 and π = 1/6, answers the final question above:
We obtain coverage probability 0.8583 < 18/20 = 0.9. So the number of intervals
covering π = 1/6 in Example 1.4 is about what one should realistically expect. c

n = 30; x = 0:n; sp = x/n; pp = .79; alpha = .05


kappa = qnorm(1-alpha/2); m.err = kappa*sqrt(sp*(1-sp)/n)
lcl = sp - m.err; ucl = sp + m.err
prob = dbinom(x, n, pp); cover = (pp >= lcl) & (pp <= ucl)
sum(dbinom(x[cover], n, pp)) # coverage probability

> sum(dbinom(x[cover], n, pp)) # coverage probability


[1] 0.8875662

n = 30; x = 0:n; sp = x/n; pp = 1/6; alpha = .05


...
> sum(dbinom(x[cover], n, pp)) # coverage probability
[1] 0.8904642

n = 30; x = 0:n; sp = x/n; pp = .70; alpha = .05


...
> sum(dbinom(x[cover], n, pp)) # coverage probability
[1] 0.9529077

n = 30; x = 0:n; sp = x/n; pp = .699; alpha = .05


...
> sum(dbinom(x[cover], n, pp)) # coverage probability
[1] 0.9075769

n = 20; x = 0:n; sp = x/n; pp = 1/6; alpha = .05


...
> sum(dbinom(x[cover], n, pp)) # coverage probability
[1] 0.8583264

1.15 Modify the program of Example 1.6 to display coverage probabilities


of traditional “95% confidence” intervals for n = 50 observations. Also, modify
the program to show results for nominal 90% and 99% confidence intervals
with n = 30 and n = 50. Comment on the coverage probabilities in each of
these five cases. Finally, compare these results with Figure 1.7.
d One possible modification of the program is shown (for the case n = 50, α = .05).
For simplicity, the adjustments for Agresti-Coull intervals, not needed here, are
omitted. Some embellishments for the graph are included. (The graph is omitted
here; you should run the program for yourself.) To produce the remaining four
graphs requested, make the obvious changes in the first line of the program. c

n = 50; alpha = .05; kappa = qnorm(1-alpha/2)


x = 0:n; sp = x/n; m.err = kappa*sqrt(sp*(1 - sp)/n)
lcl = sp - m.err; ucl = sp + m.err
m = 2000; pp = seq(1/n, 1 - 1/n, length=m); p.cov = numeric(m)
for (i in 1:m)
{
cover = (pp[i] >= lcl) & (pp[i] <= ucl)
p.rel = dbinom(x[cover], n, pp[i])
p.cov[i] = sum(p.rel)
}
plot(pp, p.cov, type="l", ylim=c(1-4*alpha,1), main=paste("n =",n),
xlab=expression(pi), ylab="Coverage Probability")
lines(c(.01,.99), c(1-alpha,1-alpha), col="darkgreen")

1.16 In the R program of Example 1.6, set adj = 2 and leave n = 30. This
adjustment implements the Agresti-Coull type of 95% confidence interval. The
formula is similar to (1.2), except that one begins by “adding two successes
and two failures” to the data. [Example: If we see 20 Successes in 30 trials,
the 95% Agresti-Coull interval is centered at 22/34 = 0.6471 with margin of
error 1.96√((22)(12)/34^3) = 0.1606, and the interval is (0.4864, 0.8077).]

Run the modified program, and compare your plot with Figures 1.6 (p12)
and 1.8 (p21). For what values of π are such intervals too “conservative”—too
long and with coverage probabilities far above 95%? Also make plots for 90%
and 99% and comment. (See Problem 1.17 for more on this type of interval.)
d The explicit formula for Agresti-Coull 95% confidence intervals is provided in Prob-
lem 1.17(c) and the related Comment. These intervals are based on the approxima-
tion κ = 2 ≈ 1.96, so they are most accurate at the 95% confidence level. Also,
they tend to be unnecessarily long for some values of π near 0 or 1. The program of
Example 1.6 requires only the minor change to use adj = 2. So we do not repeat the
program here. Figure 1.8 (p21 of the text) shows the resulting coverage probabilities
for a 95% CI when n = 30. c

1.17 Algebraic derivation of alternate types of confidence intervals. For con-
venience, denote the standard error of p as SE(p) = √(π(1 − π)/n) and its
estimated standard error as ŜE(p) = √(p(1 − p)/n).
a) Show that P{|p − π|/ŜE(p) < κ} = P{p − κ ŜE(p) < π < p + κ ŜE(p)}. The
extreme terms in the second inequality are the endpoints of the confidence
interval based on (1.2). The intended confidence level is 1 − α, and κ is
defined by P{|Z| < κ} = 1 − α for standard normal Z.
b) We can also “isolate” π between two terms computable from observed
data by using SE(p) instead of its estimate ŜE(p). Show that

    P{|p − π|/SE(p) < κ} = P{p̃ − E < π < p̃ + E},

where p̃ = (X + κ²/2)/(n + κ²), E = [κ/(n + κ²)]√(np(1 − p) + κ²/4), and κ is
as in part (a). Overall, the coverage probabilities of the Wilson confidence
interval p̃ ± E tend to be closer to 1 − α than those of an interval based
on (1.2). The Wilson interval uses the normal approximation, but it avoids
estimating the standard error SE(p).
c) If we define X̃ = X + κ²/2 and ñ = n + κ², then verify that p̃ = X̃/ñ agrees
with the p̃ of part (b). Also, show that E* = κ√(p̃(1 − p̃)/ñ) is larger than
E of part (b), but not by much. With these definitions of ñ and p̃, the
Agresti-Coull confidence interval p̃ ± E* is similar in form to the interval
of (1.2). But, for most values of π, its coverage probabilities are closer to
1 − α than are those of the traditional interval. (See Problem 1.16.)
d In many instances of statistical inference on a parameter θ, there is a kind of
duality between a test of the hypothesis H0: θ = θ0 against the two-sided alternative
Ha: θ ≠ θ0 at level α, on the one hand, and a 100(1 − α)% confidence interval for θ,
on the other hand. That is, values θ0 for which H0 is not rejected (“accepted”) are
precisely those values of θ contained in the confidence interval.
For example, this duality holds when n observations are chosen at random from
NORM(µ, σ) and we want to make inferences about unknown µ when the value of
σ is known. On the one hand, we accept H0: µ = µ0 against Ha: µ ≠ µ0 when
|X̄ − µ0|/(σ/√n) < 1.96. On the other hand, the usual 95% confidence interval for µ
is X̄ ± 1.96σ/√n. With a few steps of algebra we can “invert the test” to see that the
values of µ inside the interval are precisely those for which |X̄ − µ|/(σ/√n) < 1.96.
For binomial data with unknown π, this duality does not hold between the two-
sided test of H0: π = π0, which accepts when

    |p − π0|/√(π0(1 − π0)/n) < 1.96,

and the traditional 95% confidence interval p ± 1.96√(p(1 − p)/n). The task in part (b)
is to invert the test for the binomial success probability π to make a dual confidence
interval, called the Wilson interval.
We do not show the (routine, but admittedly somewhat tedious) algebra sug-
gested in the Hint to establish a general formula for the endpoints of the Wilson
interval. However, we include a demonstration in R for a particular case. We search
for values of π (pp in the code) that satisfy the criterion for accepting the null hy-
pothesis, and then we show they agree with the values of π in the Wilson interval.
We also show that the Agresti-Coull confidence interval in part (c) is only a little
longer than the Wilson interval for the values in our particular case. Of course, you
can change n, x, and α in this program to obtain analogous results for other cases. c

m = 100000; pp = seq(0, 1, length=m)


n = 50; x = 33; p = x/n
alpha = .05; kappa = qnorm(1 - alpha/2)
p.tilde = (x + kappa^2/2)/(n + kappa^2)
marg.err = (kappa/(n + kappa^2))*sqrt(n*p*(1-p)+ kappa^2/4)
lcl.wilson = p.tilde-marg.err
ucl.wilson = p.tilde + marg.err
marg.err.ac = kappa*sqrt(p.tilde*(1-p.tilde)/(n + kappa^2))
lcl.ac = p.tilde-marg.err.ac
ucl.ac = p.tilde+marg.err.ac
accept = (abs(p - pp)/sqrt(pp*(1-pp)/n) < kappa)
lcl.invert = min(pp[accept])
ucl.invert = max(pp[accept])
round(c(lcl.wilson, ucl.wilson), 4) # Wilson CI
round(c(lcl.invert, ucl.invert), 4) # CI from inverted test
round(c(lcl.ac, ucl.ac), 4) # Agresti-Coull CI

> round(c(lcl.wilson, ucl.wilson), 4) # Wilson CI


[1] 0.5215 0.7756
> round(c(lcl.invert, ucl.invert), 4) # CI from inverted test
[1] 0.5215 0.7756
> round(c(lcl.ac, ucl.ac), 4) # Agresti-Coull CI
[1] 0.5211 0.7761

Hints and comments: (b) Square and use the quadratic formula to solve for π. When
1 − α = 95%, one often uses κ = 1.96 ≈ 2 and thus p̃ ≈ (X + 2)/(n + 4). (c) The difference
between E and E* is of little practical importance unless p̃ is near 0 or 1. For a
more extensive discussion, see Brown et al. (2001).

1.18 For a discrete random variable X, the expected value (if it exists)
is defined as µ = E(X) = Σ_k k P{X = k}, where the sum is taken over all
possible values of k. Also, if X takes only nonnegative integer values, then one
can show that µ = Σ_k P{X > k}. In particular, if X ∼ BINOM(n, π), then
one can show that µ = E(X) = nπ.
Also, the mode (if it exists) of a discrete random variable X is defined
as the unique value k such that P{X = k} is greatest. In particular, if X is
binomial, then one can show that its mode is ⌊(n + 1)π⌋; that is, the greatest
integer in (n + 1)π. Except that if (n + 1)π is an integer, then there is a “double
mode”: values k = (n + 1)π and (n + 1)π − 1 have the same probability.
Run the following program for n = 6 and π = 1/5 (as shown); for n = 7
and π = 1/2; and for n = 18 and π = 1/3. Explain the code and interpret the
answers in terms of the facts stated above about binomial random variables. (If
necessary, use ?dbinom to get explanations of dbinom, pbinom, and rbinom.)
d Names of functions for the binomial distribution end in binom. Suppose the random
variable X ∼ BINOM(n, π). (Because pi is a reserved word in R, we use pp for π in
R code.)
• First letter d stands for the probability distribution function (PDF), so that
dbinom(k, n, pp) is P {X = k}.
• First letter p stands for the cumulative distribution function (CDF), so that
pbinom(k, n, pp) is P {X ≤ k}.
• First letter r stands for random sampling, so that rbinom(m, n, pp) generates
m independent observations from the distribution BINOM(n, π).

Below we have interspersed comments among the R statements to explain their


function. (Lengths of the vectors are given in parentheses.) c

n = 6; pp = 1/5 # parameters n and pi (one each)


k = 0:n # support of random variable X (n+1)
pdf = dbinom(k, n, pp) # vector of PDF values (n+1)
sum(k*pdf) # first formula for E(X) above (1)
cdf = pbinom(k, n, pp) # vector of CDF values (n+1)
sum(1 - cdf) # second formula for E(X) above (1)
mean(rbinom(100000, n, pp)) # mean of 100000 simulated values
# approximates E(X) (1)
n*pp # formula for E(X), ONLY for binomial (1)

round(cbind(k, pdf, cumsum(pdf), cdf), 4)


# matrix table of PDF, CDF
# as cumulative sum of PDF, and CDF
# (dimensions 7 rows x 4 columns)
k[pdf==max(pdf)]             # search for mode (length 1 or,
                             # if (n+1)*pp is an integer, 2)
floor((n+1)*pp) # formula for mode (1)

# run for n = 7 and pp = 1/2 below


> n = 7; pp = 1/2
> k = 0:n
> pdf = dbinom(k, n, pp)
> sum(k*pdf)
[1] 3.5
> cdf = pbinom(k, n, pp)
> sum(1 - cdf)
[1] 3.5
> mean(rbinom(100000, n, pp))
[1] 3.51137
> n*pp
[1] 3.5

> round(cbind(k, pdf, cumsum(pdf), cdf), 4)


k pdf cdf
[1,] 0 0.0078 0.0078 0.0078
[2,] 1 0.0547 0.0625 0.0625
[3,] 2 0.1641 0.2266 0.2266
[4,] 3 0.2734 0.5000 0.5000
[5,] 4 0.2734 0.7734 0.7734
[6,] 5 0.1641 0.9375 0.9375
[7,] 6 0.0547 0.9922 0.9922
[8,] 7 0.0078 1.0000 1.0000

> k[pdf==max(pdf)]
[1] 3 4
> floor((n+1)*pp)
[1] 4

1.19 Average lengths of confidence intervals. Problem 1.16 shows that, for
most values of π, Agresti-Coull confidence intervals have better coverage prob-
abilities than do traditional intervals based on formula (1.2). It is only rea-
sonable to wonder whether this improved coverage comes at the expense of
greater average length. For given n and π, the length of a confidence interval
is a random variable because the margin of error depends on the number of
Successes observed. The program below illustrates the computation and finds
the expected length.
n = 30; pp = .2 # binomial parameters
alpha = .05; kappa = qnorm(1-alpha/2) # level is 1 - alpha
adj = 0                                # 0 for traditional; 2 for Agresti-Coull
x = 0:n; sp = (x + adj)/(n + 2*adj)
CI.len = 2*kappa*sqrt(sp*(1 - sp)/(n + 2*adj))
Prob = dbinom(x, n, pp); Prod = CI.len*Prob
round(cbind(x, CI.len, Prob, Prod), 4) # displays computation
sum(Prod) # expected length

a) Explain each statement in this program, and state the length of each
named vector. (Consider a constant as a vector of length 1.)
d Objects in the first three lines each have 1 element; objects in the next three lines
each have n + 1 = 31 elements. The statement cbind binds together four column
vectors of length 31 to make a 31 × 4 matrix, the elements of which are rounded to
four places. c

b) Run the program as it is to find the average length of intervals based


on (1.2) when π = 0.1, 0.2, and 0.5. Then use adj = 2 to do the same for
Agresti-Coull intervals.
AVERAGE LENGTHS OF CIs: n = 30
(Compare with Figure 1.9)

pp Traditional Agresti-Coull Comparison


----------------------------------------------------------
.1 .2014 .2337 A-C longer
.2 .2782 .2805 about equal
.5 .3157 .3099 A-C shorter

c) Figure 1.9 was made by looping through about 200 values of π. Use it to
verify your answers in part (b). Compare the lengths of the two kinds of
confidence intervals and explain.
d Figure 1.9 shows that the traditional interval is shorter for values of π near 0 or 1—
roughly speaking, for π outside the interval (0.2, 0.8). See also the caption of the
figure and the answers to part (b). Thus the Agresti-Coull intervals tend to be longer
for the values of π where the traditional intervals tend to have less than the nominal
probability of coverage. (Perhaps they are even needlessly conservative for values
very near 0 and 1; in the answer to Problem 1.17(c) we noted that this is precisely
where the Wilson intervals would be shorter.) By contrast, the Agresti-Coull inter-
vals are shorter than traditional ones for values of π nearer 1/2. Generally speaking,
longer intervals provide better coverage. The “bottom line” is that the Agresti-Coull
intervals do not attain better coverage just by increasing interval length across all
values of π. c

d) Write a program to make a plot similar to Figure 1.9. Use the program
of Example 1.5 as a rough guide to the structure. You can use plot for
the first curve and lines to overlay the second curve.
n = 30; alpha = .05; kappa = qnorm(1-alpha/2)
# Traditional
adj = 0; x = 0:n; phat = (x + adj)/(n + 2*adj)
m.err = kappa*sqrt(phat*(1 - phat)/(n + 2*adj))
ci.len = 2*m.err # n + 1 possible lengths of CIs
m = 200; pp = seq(0,1, length=m); avg.ci.len = numeric(m)
for (i in 1:m) avg.ci.len[i] = sum(ci.len*dbinom(x, n, pp[i]))
plot(pp, avg.ci.len, type="l", ylim=c(0, 1.2*max(avg.ci.len)),
     lty="dashed", col="darkred", lwd=3,
ylab="Average Length of CI", xlab=expression(pi == P(Success)),
main=paste("Traditional (Solid), A-C (Dashed): n =",n))
# Agresti-Coull
adj = 2; x = 0:n; phat = (x + adj)/(n + 2*adj)
m.err = kappa*sqrt(phat*(1 - phat)/(n + 2*adj))
ci.len = 2*m.err # n + 1 possible lengths of CIs
m = 200; pp = seq(0,1, length=m); avg.ci.len = numeric(m)
for (i in 1:m) avg.ci.len[i] = sum(ci.len*dbinom(x, n, pp[i]))
lines(pp, avg.ci.len, lwd = 2, col="darkblue")
Note: This program includes the entire length of any CI extending outside (0, 1).
[Even if the part outside (0, 1) is ignored, the result is nearly the same.]
1.20 Bayesian intervals. Here is a confidence interval based on a Bayesian
method and using the beta family of distributions. If x successes are observed
in n binomial trials, we use the distribution BETA(x+1, n−x+1). An interval
with nominal coverage probability 1 − α is formed by cutting off probability
α/2 from each side of this beta distribution. For example, its 0.025 and 0.975
quantiles are the lower and upper limits of a 95% interval, respectively. In the
R program of Example 1.6, replace the lines for lcl and ucl with the code
below and run the program. Compare the coverage results for this Bayesian
interval with results for the 95% confidence interval based on formula (1.2),
which are shown in Figure 1.6. (Also, if you did Problem 1.16, compare it
with the results for the Agresti-Coull interval.)
lcl = qbeta(alpha/2, x + 1, n - x + 1)
ucl = qbeta(1 - alpha/2, x + 1, n - x + 1)

d The change in the program is trivial, and we do not show the modified program
here. The main difference between the Bayesian intervals and those of Agresti and
Coull is in the coverage probabilities near 0 and 1, where the Bayesian intervals
are more variable with changing π, but less extremely conservative. Because of the
discreteness of the binomial distribution, it is difficult to avoid large changes in
coverage probabilities for small changes in π. Perhaps a reasonable goal is that, if
oscillations in coverage probabilities are averaged over “nearby” values of π, then
the “smoothed” values lie close to the nominal level (here 95%). c
Notes: The mean of this beta distribution is (x + 1)/(n + 2), but this value need not
lie exactly at the center of the resulting interval. If 30 trials result in 20 successes,
then the traditional interval is (0.4980, 0.8354) and the Agresti-Coull interval is
(0.4864, 0.8077). The mean of the beta distribution is 0.65625, and a 95% Bayesian
interval is (0.4863, 0.8077), obtained in R with qbeta(c(.025, .975), 21, 11).
Bayesian intervals for π never extend outside (0, 1). (These Bayesian intervals are
based on a uniform prior distribution. Strictly speaking, the interpretation of such
“Bayesian probability intervals” is somewhat different than for confidence inter-
vals, but we ignore this distinction for now, pending a more complete discussion of
Bayesian inference in Chapter 8.)
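For concreteness, a short sketch reproducing the Bayesian interval and beta mean quoted in these Notes:

n = 30; x = 20; alpha = .05
qbeta(c(alpha/2, 1 - alpha/2), x + 1, n - x + 1)   # 95% Bayesian interval, about (0.4863, 0.8077)
(x + 1)/(n + 2)                                    # mean of BETA(21, 11): 0.65625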

Errors in Chapter 1
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p12 Example 1.5. Just below printout π = 0.80 (not 0.30) in two places. Also, in the
displayed equation for P (Cover), the term P {X = 19}, with value 0.0161, needs
to be added to the beginning of the sum. [Thanks to Tarek Dib.] The correct
display is:

    P(Cover) = P{X = 19} + P{X = 20} + P{X = 21} + · · · + P{X = 27}
             = 0.0160 + 0.0355 + 0.0676 + · · · + 0.0785 = 0.9463.

p18 Problem 1.11(a). Code pbinom(x, 25, 0.3) should be pbinom(x, 30, 0.4).

Eric A. Suess and Bruce E. Trumbo
Introduction to Probability Simulation and Gibbs Sampling with R: Instructor Manual
Chapter 1

Explanations and answers:
© 2011 by Bruce E. Trumbo and Eric A. Suess. All rights reserved.
Statements of problems:
© 2010 by Springer Science+Business Media, LLC. All rights reserved.

2 Generating Random Numbers

2.1 Before congruential generators became widely used, various mathemat-


ical formulas were tried in an effort to generate useful pseudorandom numbers.
The following unsuccessful procedure from the mid-1940s has been ascribed
to the famous mathematician John von Neumann, a pioneer in attempts to
simulate probability models.
Start with a seed number r1 having an even number of digits, square it,
pad the square with initial zeros at the left (if necessary) so that it has twice
as many digits as the seed, and take the center half of the digits in the result
as the pseudorandom number r2 . Then square this number and repeat the
same process to get the next number in the sequence, and so on.
To see why this method was never widely used, start with r1 = 23 as the
seed. Then 23² = 529; pad to get 0529 and r2 = 52. Show that the next
number is r3 = 70. Continue with values ri , for i = 3, 4 . . . until you discover
a difficulty. Also try starting with 19 as the seed.
d With seed 23, the first few values of ri are 23, 52, 70, 90, 10, and 10. After that, all
successive values are 10. The R program below automates the process, facilitating
exploration of other possible seeds. (The formatting of text output within the loop
illustrates some R functions that are not often used in the rest of this book.)

# Initialize
s = 23 # seed (even number of digits, 2 or more)
m = 10 # nr. of ’random’ numbers generated
r = numeric(m) # vector of ’random’ numbers
r[1] = s
t = character(m) # text showing the process for each iteration
nds = nchar(s) # number of digits in the seed
nds2 = 2*nds

# Check for proper seed


if (nds %% 2 != 0) {
warning("Seed must have even number of digits") }

# Generate
for (i in 2:m) {
r[i] = r[i-1]^2
lead.0 = paste(numeric(nds2-nchar(r[i])), sep="", collapse="")
temp1 = paste(lead.0, r[i], sep="", collapse="")
temp2 = substr(temp1, nds2-(3/2)*nds+1, nds2-(1/2)*nds)
r[i] = as.numeric(temp2)
#Loop diagnostics and output
msg = "OK"
if (r[i] == 0) {msg = "Zero"}
if (r[i] %in% r[1:(i-1)]) {msg = "Repeat"}
t[i] = paste(
format(r[i-1], width=nds),
format(paste(lead.0, r[i-1]^2, sep="", coll=""), width=nds2),
temp2,
msg)
}

t
summary(as.factor(r)) # counts > 1 indicate repetition
m - length(unique(r)) # nr. of repeats in m numbers generated

> t # narrow Session window


[1] "" # seed is first entry in next line
[2] "23 0529 52 OK"
[3] "52 2704 70 OK"
[4] "70 4900 90 OK"
[5] "90 8100 10 OK"
[6] "10 0100 10 Repeat"
[7] "10 0100 10 Repeat"
[8] "10 0100 10 Repeat"
[9] "10 0100 10 Repeat"
[10] "10 0100 10 Repeat"
> summary(as.factor(r)) # counts > 1 indicate repetition
10 23 52 70 90
6 1 1 1 1
> m - length(unique(r)) # nr. of repeats in m numbers generated
[1] 5

After a few steps (how many?), seed 19 gives the “random” number 2, which
has 0004 as its padded square; then the next step and all that follow are 0s. The
seeds 23 and 19 are by no means the only “bad seeds” for this method. With s = 45,
we get r1 = 45 and ri = 0, for all i > 1. With s = 2321 (and m = 100), we see
that all goes well until we get r82 = 6100, r83 = 2100, r84 = 4100, r85 = 8100, and
then this same sequence of four numbers repeats forever. (Acknowledgment: The
program and specific additional examples are based on a class project by Michael
Bissell, June 2011.) c
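As an aside, the update step can be packaged as a small function; a sketch assuming a two-digit seed:

midsquare = function(r) {
  sq = sprintf("%04d", as.integer(r^2))   # square, padded to 4 digits
  as.numeric(substr(sq, 2, 3))            # keep the middle two digits
}
r = 23; for (i in 1:5) {r = midsquare(r); print(r)}   # 52 70 90 10 10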

2.2 The digits of transcendental numbers such as π = 3.14159 . . . pass


many tests for randomness. A search of the Web using the search phrase pi
digits retrieves the URLs of many sites that list the first n digits of π for
very large n. We put the first 500 digits from one site into the vector v and
then used summary(as.factor(v)) to get the number of times each digit
appeared:
> summary(as.factor(v))
0 1 2 3 4 5 6 7 8 9
45 59 54 50 53 50 48 36 53 52

a) Always be cautious of information from unfamiliar websites, but for now


assume this information is correct. Use it to simulate 500 tosses of a coin,
taking even digits to represent Heads and odd digits to represent Tails. Is
this simulation consistent with 500 independent tosses of a fair coin?
d The fraction p = X/n = (45 + 54 + 53 + 48 + 53)/500 = 0.506 estimates the
probability π of Heads. We wish to test the null hypothesis H0 : π = π0 , where
π0 = 1/2, against the two-sided alternative. As discussed in Section 1.2, the test
statistic Z = (p − π0)/√(π0(1 − π0)/n) is very nearly standard normal for n as large
as 500 and π = 1/2. At the 5% significance level, we would reject H0 , concluding that
digits of π do not behave as random, if |Z| > 1.96. Here Z = 0.2683, so we do not
reject. One could also say that the P-value of the test is P {|Z| > 0.2683} ≈ 0.7886,
so that it is hardly unusual to observe a value of |Z| as extreme or more extreme
than what we get from our use of the first 500 digits of π.
An alternative approach is to use confidence intervals, as suggested in the Hint.
The traditional 95% CI for π is p ± 1.96√(p(1 − p)/n). For n = 500, the Agresti-Coull
interval (as discussed on pages 14 and 19 of the book) is not much different. The R
code below shows the test above and the two CIs (both of which cover π = 1/2). c

n = 500; x = 253
p = x/n; z = (p - .5)/sqrt(.25/n)
p.value = 1 - (pnorm(z) - pnorm(-z))
pm = c(-1, 1)
trad.CI = p + pm*1.96*sqrt(p*(1-p)/n)
p.ac = (x + 2)/(n + 4)
Agresti.CI = p.ac + pm*1.96*sqrt(p.ac*(1-p.ac)/(n + 4))
p; z; p.value
trad.CI; p.ac; Agresti.CI

> p; z; p.value
[1] 0.506
[1] 0.2683282
[1] 0.7884467
> trad.CI; p.ac; Agresti.CI
[1] 0.4621762 0.5498238
[1] 0.5059524
[1] 0.4623028 0.5496020

b) Repeat part (a) letting numbers 0 through 4 represent Heads and numbers
5 through 9 represent Tails.
d The only change from part (a) is that X = 261. The agreement of p = X/n = 0.522
with π = 1/2 is not quite as good here as in part (a), but it is nowhere near bad
enough to declare a statistically significant difference. Output from the code of
part (a), but with x = 261, is shown below. c

> p; z; p.value
[1] 0.522
[1] 0.98387
[1] 0.3251795
> trad.CI; p.ac; Agresti.CI
[1] 0.4782155 0.5657845
[1] 0.5218254
[1] 0.4782143 0.5654365

c) Why do you suppose digits of π are not often used for simulations?
d Rapidly computing or accessing the huge number of digits of π (or e) necessary
to do serious simulations is relatively difficult. Partly as a result of this, some in-
vestigators say that insufficient testing has been done to know whether such digits
really do behave essentially as random. Refereed results from researchers at Berkeley
and Purdue are among the many that can be retrieved with an Internet search for
digits of pi random. c
Hint: (a, b) One possible approach is to find a 95% confidence interval for P (Heads)
and interpret the result.
2.3 Example 2.1 illustrates one congruential generator with b = 0 and
d = 53. The program there shows the first m = 60 numbers generated.
Modify the program, making the changes indicated in each part below, using
length(unique(r)) to find the number of distinct numbers produced, and
using the additional code below to make a 2-dimensional plot. Each part re-
quires two runs of such a modified program. Summarize findings, commenting
on differences within and among parts.
u = (r - 1/2)/(d-1)
u1 = u[0:(m-1)]; u2 = u[2:m]
plot(u1, u2, pch=19)

a) Use a = 23, first with s = 21 and then with s = 5.


b) Use s = 21, first with a = 15 and then with a = 18.
c) Use a = 22 and then a = 26, each with a seed of your choice.
d As usual, we do not show the plots. Among the generators in the three parts of
the problem, you will find plots that differ greatly in appearance, as do those in
Figure 2.2. Briefly, here are the numbers of unique values of ri for parts (a) and (b):
(a) 4 and 4; (b) 13 and 52. In part (c): With a = 22, all possible seeds (1 through
52) give full period, as shown in the program below. With the obvious change in the
program, you can see that the same is true for the generator with a = 26. However,
the grid patterns with a = 22 are different from those with a = 26. (In what way?) c

a = 22; d = 53; s = 1:(d-1); q = numeric(d-1)


m = 60; r = numeric(m)
for (j in s) { # try all possible seeds
r[1] = s[j]
for (i in 1:(m-1)) {r[i+1] = (a * r[i]) %% d}
q[j] = length(unique(r)) }
mean(q==(d-1)) # prop. of seeds yielding full period

2.4 A Chi-squared test for Example 2.2. Sometimes it is difficult to judge


by eye whether the evenness of the bars of a histogram is consistent with
a uniform distribution. The chi-squared goodness-of-fit statistic allows us to
quantify the evenness and formally test the null hypothesis that results agree
with UNIF(0, 1). If the null hypothesis is true, then each ui is equally likely to
fall into any one of the h bins of the histogram, so that the expected number of
values in each bin is E = m/h. Let Nj denote the observed number of values
in the jth of the h bins. The chi-squared statistic is

    Q = Σ_{j=1}^{h} (Nj − E)²/E.

If the null hypothesis is true and E is large, as here, then Q is very nearly
distributed as CHISQ(h − 1), the chi-squared distribution with h − 1 degrees
of freedom. Accordingly, E(Q) = h − 1. For our example, h = 10, so values
of Q “near” 9 are consistent with uniform observations. Specifically, if Q falls
outside the interval [2.7, 19], then we suspect the generator is behaving badly.
The values 2.7 and 19 are quantiles 0.025 and 0.975, respectively, of CHISQ(9).
In some applications of the chi-squared test, we would reject the null hy-
pothesis only if Q is too large, indicating some large values of |Nj − E|. But
when we are validating a generator we are also suspicious if results are “too
perfect” to seem random. (One similarly suspicious situation occurs if a fair
coin is supposedly tossed 8000 times independently and exactly 4000 Heads
are reported. Another is shown in the upper left panel of Figure 2.1.)
a) Run the part of the program of Example 2.2 that initializes variables and
the part that generates corresponding values of ui . Instead of the part that
prints a histogram and 2-dimensional plot, use the code below, in which
the parameter plot=F suppresses plotting and the suffix $counts retrieves
the vector of 10 counts. What is the result, and how do you interpret it?
d The revised program (partly from Example 2.2 and partly from an elaboration of
the code given with this problem) and its output are shown below. The resulting
chi-squared statistic Q = 7.2 falls inside the “acceptance” interval given above, so
we find no fault in the generator here. c

# Initialize
a = 1093; b = 18257; d = 86436; s = 7
m = 1000; r = numeric(m); r[1] = s

# Generate
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d}
u = (r + 1/2)/d # values fit in (0,1)

# Compute chi-squared statistic


h = 10; E = m/h; cut = (0:h)/h
N = hist(u, breaks=cut, plot=F)$counts
diff = N - E
comp = (diff)^2/E # component of chi-sq
cbind(N, E, diff, comp)
Q = sum(comp); Q

> cbind(N, E, diff, comp)


N E diff comp
[1,] 98 100 -2 0.04
[2,] 98 100 -2 0.04
[3,] 114 100 14 1.96
[4,] 105 100 5 0.25
[5,] 99 100 -1 0.01
[6,] 105 100 5 0.25
[7,] 106 100 6 0.36
[8,] 102 100 2 0.04
[9,] 92 100 -8 0.64
[10,] 81 100 -19 3.61
> Q = sum(comp); Q
[1] 7.2
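The endpoints of the acceptance interval quoted in the problem can be checked directly:

qchisq(c(.025, .975), 9)   # about 2.70 and 19.02; Q = 7.2 lies inside [2.7, 19]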

b) Repeat part (a), but with m = 50 000 iterations.

d With a larger sample of numbers from this generator, we see that the histogram
bars are too nearly of the same height for the results of this generator to be consistent
with randomness. (The acceptance interval does not change for any large sample
size m. See the answer to Problem 2.7 for comments on tests with different numbers h
of histogram bins.) c

# Initialize
a = 1093; b = 18257; d = 86436; s = 7
m = 50000; r = numeric(m); r[1] = s
...
> cbind(N, E, diff, comp)
N E diff comp
[1,] 5007 5000 7 0.0098
[2,] 4995 5000 -5 0.0050
[3,] 5014 5000 14 0.0392
[4,] 4997 5000 -3 0.0018
[5,] 5003 5000 3 0.0018


[6,] 4991 5000 -9 0.0162
[7,] 5002 5000 2 0.0008
[8,] 4998 5000 -2 0.0008
[9,] 4995 5000 -5 0.0050
[10,] 4998 5000 -2 0.0008
> Q = sum(comp); Q
[1] 0.0812

c) Repeat part (a) again, but now with m = 1000 and b = 252. In this
case, also make the histogram and the 2-dimensional plot of the results
and comment. Do you suppose the generator with increment b = 252 is
useful? (Problem 2.6 below investigates this generator further.)
d With Q = 0.24 for only m = 1000 values, this generator fails the chi-squared test.
The 2-d plot (not shown here) reveals that the grid is far from optimal: all of the
points fall along about 20 lines. c

# Initialize
a = 1093; b = 252; d = 86436; s = 7
m = 1000; r = numeric(m); r[1] = s
...
> cbind(N, E, diff, comp)
N E diff comp
[1,] 98 100 -2 0.04
[2,] 103 100 3 0.09
[3,] 101 100 1 0.01
[4,] 102 100 2 0.04
[5,] 101 100 1 0.01
[6,] 99 100 -1 0.01
[7,] 99 100 -1 0.01
[8,] 99 100 -1 0.01
[9,] 99 100 -1 0.01
[10,] 99 100 -1 0.01
> Q = sum(comp); Q
[1] 0.24
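Extra: One way to draw the 2-d plot of successive pairs mentioned above (assuming
the vectors u and m from the run just shown are still in the workspace); the grid
structure appears as roughly 20 parallel lines of points.

u1 = u[1:(m-1)]; u2 = u[2:m]   # successive pairs (u_i, u_{i+1})
plot(u1, u2, pch=19)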

d) Repeat part (a) with the original values of a, b, d, and s, but change to
m = 5000 and add the step u = u^0.9 before computing the chi-squared
statistic. (We still have 0 < ui < 1.) Also, make and comment on the
histogram.
d As discussed in Section 2.4, if U ∼ UNIF(0, 1), then U^0.9 does not have a uniform
distribution. With m = 5000 values, we get Q = 46.724, which provides very strong
evidence of a bad fit to uniform. (However, with only m = 1000 values, there is not
enough information to detect the departure from uniform). c

a = 1093; b = 18257; d = 86436; s = 7


m = 5000; r = numeric(m); r[1] = s

for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d}


u = (r + 1/2)/d; u = u^0.9 # transformation
h = 10; E = m/h; cut = (0:h)/h
N = hist(u, breaks=cut, plot=F)$counts
diff = N - E; comp = (diff)^2/E
cbind(N, E, diff, comp)
Q = sum(comp); Q

> cbind(N, E, diff, comp)


N E diff comp
[1,] 394 500 -106 22.472
[2,] 448 500 -52 5.408
[3,] 463 500 -37 2.738
[4,] 491 500 -9 0.162
[5,] 511 500 11 0.242
[6,] 521 500 21 0.882
[7,] 541 500 41 3.362
[8,] 543 500 43 3.698
[9,] 542 500 42 3.528
[10,] 546 500 46 4.232
> Q = sum(comp); Q
[1] 46.724
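Extra: Because X = U^0.9 has CDF P{X ≤ x} = x^(10/9) on (0, 1), the expected bin
counts under this (non-uniform) distribution can be computed directly. A quick check,
assuming the objects m and cut from the program above are still in the workspace;
the expected counts increase across the bins, matching the pattern of the observed
counts N shown above.

round(m * diff(cut^(10/9)))   # expected counts for X = U^0.9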
e) Find and interpret the chi-squared goodness-of-fit statistic for the 10
counts given in the statement of Problem 2.2.
d Here, the h = 10 “bins” are the ten digits 0–9 that can occur in the first m = 500
digits of π. The expected count for each digit is E = 500/10 = 50. Below we compute
the goodness-of-fit statistic Q for these 500 digits of π. Because Q = 6.88 ∈ [2.7, 19]
they are consistent with randomness. c
N = c(45, 59, 54, 50, 53, 50, 48, 36, 53, 52); m = sum(N)
E = rep(m/10, 10); diff = N - E; comp = (diff)^2/E
cbind(N, E, diff, comp)
Q = sum(comp); Q

> cbind(N, E, diff, comp)


N E diff comp
[1,] 45 50 -5 0.50
[2,] 59 50 9 1.62
[3,] 54 50 4 0.32
[4,] 50 50 0 0.00
[5,] 53 50 3 0.18
[6,] 50 50 0 0.00
[7,] 48 50 -2 0.08
[8,] 36 50 -14 3.92
[9,] 53 50 3 0.18
[10,] 52 50 2 0.08
> Q = sum(comp); Q
[1] 6.88

d Extra: More about the chi-squared goodness-of-fit-test. Under the null hypothesis
that Ui are randomly sampled from UNIF(0, 1), we claim above that the statistic Q,
based on h bins, is approximately distributed as CHISQ(h − 1 = 9). Notice that the
statistic Q is computed from counts, and so it takes discrete values (ordinarily not
integers). By contrast, CHISQ(h − 1) is a continuous distribution. Roughly speaking,
the approximation is reasonably good, provided E is sufficiently large. (The provision
is not really an issue here because E = m/h and m is very large.)
This approximation is not easy to prove analytically, so we show a simulation
below for B = 10 000 batches (that is, values of Q). Each batch has m = 1000 pseudo-
random values U , generated using the high-quality random number generator runif
in R. The program below makes a histogram (not printed here) of the B values of Q
and compares it with the density function of CHISQ(9). (By changing the degrees
of freedom for pdf from h − 1 = 9 to h = 10, you easily see that CHISQ(10) is not
such a good fit to the histogram.)

We stress that this is not at all the same use of chi-squared distributions
as in Problems 2.12 and 2.13. Here a chi-squared distribution is used as an
approximation. Problems 2.12 and 2.13 show random variables that have
exact chi-squared distributions.

The first part of the text output shows that the simulated acceptance region
for Q is in good agreement with the acceptance region from cutting off 2.5% from
each tail of CHISQ(9). In the second part of the text output, a table compares areas
of bars in the histogram of Q with corresponding exact probabilities from CHISQ(9),
each expressed to 3-place accuracy. For judging tests of randomness, the accuracy of
the approximation of tail probabilities is more important than the fit in the middle
part of the distribution.

set.seed(1212)
B = 10000; Q = numeric(B)
m = 1000; h = 10; E = m/h; u.cut = (0:h)/h
for (i in 1:B)
{
u = runif(m)
N = hist(u, breaks=u.cut, plot=F)$counts
Q[i] = sum((N - E)^2/E)
}

qq = seq(0, max(Q), len=200); pdf = dchisq(qq, h - 1)


Q.cut = 0:ceiling(max(Q))
hist(Q, breaks=Q.cut, ylim=c(0, 1.1*max(pdf)),
prob=T, col="wheat")
lines(qq, pdf, lwd=2, col="blue")

quantile(Q, c(.025, .975)); qchisq(c(.025, .975), h-1)


bar.area = hist(Q, breaks=Q.cut, plot=F)$density

chisq.pr = diff(pchisq(Q.cut, h-1))


round(cbind(bar.area, chisq.pr), 3)

> quantile(Q, c(.025, .075)); qchisq(c(.025, .975), h-1)


2.5% 7.5%
2.72 3.86 # simulated acceptance region
[1] 2.700389 19.022768 # chisq approx. acceptance region

> round(cbind(bar.area, chisq.pr), 3) # Compare histogram bars


bar.area chisq.pr # with chi-sq probabilities
[1,] 0.001 0.001
[2,] 0.008 0.008
[3,] 0.025 0.027
[4,] 0.050 0.053
[5,] 0.078 0.077
[6,] 0.095 0.094
[7,] 0.105 0.103
[8,] 0.104 0.103
[9,] 0.095 0.097
[10,] 0.092 0.087
[11,] 0.075 0.075
[12,] 0.061 0.062
[13,] 0.050 0.051
[14,] 0.038 0.040
[15,] 0.030 0.031
[16,] 0.025 0.024
[17,] 0.018 0.018
[18,] 0.013 0.014
[19,] 0.009 0.010
[20,] 0.006 0.007
[21,] 0.007 0.005
[22,] 0.004 0.004
[23,] 0.002 0.003
[24,] 0.003 0.002
[25,] 0.001 0.001
[26,] 0.001 0.001
[27,] 0.000 0.001
[28,] 0.001 0.000 # remaining values all 0.000
...

When the hist parameter prob=T is used, the height of each histogram bar is its
density value. If, as here, the width of each bar is set to unity (with Q.cut), then
this height is also the area of each bar.
With seed 1212, it happens that the overall agreement of the histogram bars with
the density function is a little better than for some other seeds. You can experiment
with different seeds—and with alternative values of B, m, and h. Roughly speaking,
it is best to keep both h and E larger than about 5. c
Answers: In (a)–(e), Q ≈ 7, 0.1, 0.2, 47, and 7, respectively. Report additional
decimal places, and provide interpretation.

2.5 When beginning work on Trumbo (1969), the author obtained some
obviously incorrect results from the generator included in Applesoft BASIC
on the Apple II computer. The intended generator would have been mediocre
even if programmed correctly, but it had a disastrous bug in the machine-level
programming that led to periods of only a few dozen for some seeds (Sparks,
1983). A cure (proposed in a magazine for computer enthusiasts, Hare et
al., 1983) was to import the generator r_{i+1} = 8192 r_i (mod 67 099 547). This
generator has full period, matched the capabilities of the Apple II, and seemed
to give accurate results for the limited simulation work at hand.
a) Modify the program of Example 2.3 to make plots for this generator anal-
ogous to those in Figure 2.4. Use u = (r - 1/2)/(d - 1).
d The R code is shown below. The plots (omitted here) reveal no undesirable
structure—in either 2-d or 3-d. The first printing of the text suggested using
the code u = (r + 1/2)/d. But this is a multiplicative generator with b = 0 and
0 < ri < d. So, theoretically, u = (r - 1/2)/(d - 1) is the preferred way to spread
the ui over (0, 1). However, in the parts of this problem (with very large d), we have
seen no difference in results between the two formulas. c

a = 8192; b = 0; d = 67099547; s = 11
m = 20000; r = numeric(m); r[1] = s
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d}
u = (r - 1/2)/(d - 1)

u1 = u[1:(m-2)]; u2 = u[2:(m-1)]; u3 = u[3:m]


par(mfrow=c(1,2), pty="s")
plot(u1, u2, pch=19, xlim=c(0,.1), ylim=c(0,.1))
plot(u1[u3 < .01], u2[u3 < .01], pch=19, xlim=c(0,1), ylim=c(0,1))
par(mfrow=c(1,1), pty="m")

b) Perform chi-square goodness-of-fit tests as in Problem 2.4, based on 1000,


and then 100 000 simulated uniform observations from this generator.
d With m = 1000 and s = 11, we get Q = 9.04, as shown below; changing the seed
to s = 25 gives Q = 10.98. Either way, the generator passes the goodness-of-fit test
with 10 bins. With m = 100 000 and various seeds, we obtained various values of Q,
none leading to rejection. Also, with seed 11 (and several others) we got 100 000
distinct values of u. However, the generator cannot give more than d − 1 distinct
values, and so it is not suitable for large-scale simulations. c

a = 8192; b = 0; d = 67099547; s = 25
m = 1000; r = numeric(m); r[1] = s
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d}
u = (r - 1/2)/(d - 1)

h = 10; E = m/h; cut = (0:h)/h


N = hist(u, breaks=cut, plot=F)$counts
diff = N - E; comp = (diff)^2/E

cbind(N, E, diff, comp)


Q = sum(comp); Q

> cbind(N, E, diff, comp)


N E diff comp
[1,] 102 100 2 0.04
[2,] 109 100 9 0.81
[3,] 104 100 4 0.16
[4,] 92 100 -8 0.64
[5,] 96 100 -4 0.16
[6,] 89 100 -11 1.21
[7,] 116 100 16 2.56
[8,] 112 100 12 1.44
[9,] 91 100 -9 0.81
[10,] 89 100 -11 1.21
> Q = sum(comp); Q
[1] 9.04

Comment: (b) Not a bad generator. Q varies with seed.


2.6 Consider m = 50 000 values ui = (r + .5)/d from the generator with
a = 1093, b = 252, d = 86 436, and s = 6. We try using this generator to
simulate many tosses of a fair coin.

a) For a particular n ≤ m, you can use the code sum(u[1:n] < .5)/n to
simulate the proportion of heads in the first n tosses. If the values ui are
uniform in the interval (0, 1), then each of the n comparisons inside the
parentheses has probability one-half of being TRUE, and thus contribut-
ing 1 to the sum. Evaluate this for n = 10 000, 20 000, 30 000, 40 000, and
50 000. For each n, the 95% margin of error is about n−1/2 . Show that all
of your values are within this margin of the true value P {Head} = 0.5.
So, you might be tempted to conclude that the generator is working sat-
isfactorily. But notice that all of these proportions are above 0.5—and by
similar amounts. Is this a random coincidence or a pattern? (See part (c).)
a = 1093; b = 252; d = 86436; s = 6
m = 50000; r = numeric(m); r[1] = s
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d}
u = (r + 1/2)/d
n = c(1:5)*10000; p = numeric(5)
for (i in 1:5) {p[i] = sum(u[1:n[i]] < .5)/n[i]}
ME = 1/sqrt(n); diff = p - 1/2
cbind(n, p, diff, ME)

> cbind(n, p, diff, ME)


n p diff ME
[1,] 10000 0.5018000 0.001800000 0.010000000
[2,] 20000 0.5018500 0.001850000 0.007071068
[3,] 30000 0.5018667 0.001866667 0.005773503

[4,] 40000 0.5016750 0.001675000 0.005000000


[5,] 50000 0.5015200 0.001520000 0.004472136

b) This generator has serious problems. First, how many distinct values do
you get among m? Use length(unique(r)). So, this generator repeats
a few values many times in m = 50 000 iterations. Second, the period
depends heavily on the seed s. Report results for s = 2, 8 and 17.
s = c(2, 8, 17, 20, 111); k = length(s); distinct = numeric(k)
a = 1093; b = 252; d = 86436
m = 50000; r = numeric(m)
for (j in 1:k) {
r[1] = s[j]
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d }
distinct[j] = length(unique(r)) }
distinct

> distinct
[1] 1029 1029 147 1029 343

c) Based on this generator, the code below makes a plot of the proportion
of heads in n tosses for all values n = 1, 2 . . . m. For comparison, it does
the same for values from runif, which are known to simulate UNIF(0, 1)
accurately. Explain the code, run the program (which makes Figure 2.9),
and comment on the results. In particular, what do you suppose would
happen towards the right-hand edge of the graph if there were millions of
iterations m? (You will learn more about such plots in Chapter 3.)
a = 1093; b = 252; d = 86436; s = 6
m = 50000; n = 1:m
r = numeric(m); r[1] = s
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d}
u = (r + 1/2)/d; f = cumsum(u < .5)/n
plot(n, f, type="l", ylim=c(.49,.51), col="red") # ’ell’, not 1
abline(h=.5, col="green")

set.seed(1237) # Use this seed for exact graph shown in figure.


g = cumsum(sample(0:1, m, repl=T))/n; lines(n, g)

d See Figure 2.9 on p42 of the text for the plot. We used seed 1237 because it
illustrates the convergence of the good generator on a scale that also allows the
“sawtooth” nature of the trace of the bad generator to show clearly in print. For a
large number of iterations with any seed, the trace for the good generator will tend
toward 1/2 and the trace for the bad generator will continue to stay above 1/2. For
example, try several seeds with m = 100 000. c
Note: This generator “never had a chance.” It breaks one of the number-theoretic
rules for linear congruential generators, that b and d not have factors in common.
See Abramowitz and Stegun (1964).
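A quick arithmetic check of the shared factors mentioned in the note
(252 = 2^2 * 3^2 * 7 and 86436 = 2^2 * 3^2 * 7^4):

86436 / 252    # 343 = 7^3, so b = 252 divides d = 86436 exactly
86436 %% 252   # remainder 0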

2.7 In R, the statement runif(m) makes a vector of m simulated observa-


tions from the distribution UNIF(0, 1). Notice that no explicit loop is required
to generate this vector.
m = 10^6; u = runif(m) # generate one million from UNIF(0, 1)
u1 = u[1:(m-2)]; u2 = u[2:(m-1)]; u3 = u[3:m] # 3 dimensions

par(mfrow=c(1,2), pty="s") # 2 square panels per graph


plot(u1, u2, pch=".", xlim=c(0,.1), ylim=c(0,.1))
plot(u1[u3<.01], u2[u3<.01], pch=".", xlim=c(0,1), ylim=c(0,1))
par(mfrow=c(1,1), pty="m") # restore default plotting

a) Run the program and comment on the results. Approximately how many
points are printed in each graph?
d Both graphs show uniform random behavior within a “square.” The two plots are
similar to those shown in the second row of Figure 2.4. Specifically, the first plot
here is a “magnified” view of the lower-left 100th of the unit square; unlike the
lower-right plot in Figure 2.3, it reveals no grid structure. The second plot here
shows a “veneer” that is the front 100th of the unit cube; unlike the upper-right
plot in Figure 2.4, no concentration of points on parallel planes is evident. Each of
the plots here shows about m/100 = 106 /102 = 10 000 points. c

b) Perform chi-square goodness-of-fit tests as in Problem 2.4 based on these


one million simulated uniform observations.
d In the first program below, we use sample size m = 50 000 and h = 10 histogram
bins. Also, we set a seed so you can check your results against ours. Because we
get Q = 6.6728 ∈ [2.7, 19], this test does not detect any nonrandom behavior of the
random number generator in R. Several different seeds gave similar favorable results.
Of course, if you try enough seeds you will eventually find one (with probability 5%
or 1 in 20 for each seed) that gives a bogus unfavorable result. We found that one
such seed is 2011. c

set.seed(2012)
m = 50000; u = runif(m)
h = 10; E = m/h; cut = (0:h)/h
N = hist(u, breaks=cut, plot=F)$counts
diff = N - E; comp = (diff)^2/E
cbind(N, E, diff, comp)
Q = sum(comp); Q

> cbind(N, E, diff, comp)


N E diff comp
[1,] 4997 5000 -3 0.0018
[2,] 5033 5000 33 0.2178
[3,] 5069 5000 69 0.9522
[4,] 4955 5000 -45 0.4050
[5,] 5011 5000 11 0.0242

[6,] 4995 5000 -5 0.0050


[7,] 4875 5000 -125 3.1250
[8,] 4982 5000 -18 0.0648
[9,] 4987 5000 -13 0.0338
[10,] 5096 5000 96 1.8432
> Q = sum(comp); Q
[1] 6.6728

d To some extent, the choice h = 10 is arbitrary. Very roughly, we should have
E = m/h ≥ 5 and h ≥ 5. Then the acceptance interval for Q depends on the
number of bins h, not on the number of pseudorandom numbers m sampled (m = 1000,
5000, 10 000, and so on). Because Q has approximately the distribution CHISQ(h − 1),
the boundaries of the acceptance interval for h bins can be obtained from the R code
qchisq(c(.025, .975), h - 1). For h = 10, we have already seen that the interval
is [2.7, 19]. (See some related comments at the end of the answers for Problem 2.4.)
In the program below, h = 20, so the acceptance interval becomes [8.91, 32.9].
With the result Q = 18.25, the generator is again vindicated. We leave it to you to
try m = 10^6 with 10 or 20 bins.

set.seed(2001)
m = 50000; u = runif(m); h = 20; E = m/h; cut = (0:h)/h
N = hist(u, breaks=cut, plot=F)$counts
diff = N - E; comp = (diff)^2/E
cbind(N, E, diff, comp)
Q = sum(comp); Q

> cbind(N, E, diff, comp)


N E diff comp
[1,] 2465 2500 -35 0.4900
[2,] 2569 2500 69 1.9044
[3,] 2547 2500 47 0.8836
...
[19,] 2410 2500 -90 3.2400
[20,] 2516 2500 16 0.1024
> Q = sum(comp); Q
[1] 18.2472
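The acceptance interval quoted above for h = 20 bins can be confirmed directly:

qchisq(c(.025, .975), 19)   # approximately 8.91 and 32.85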

Extra. The few lines below implement a Kolmogorov-Smirnov goodness-of-fit test.


Roughly speaking, this test is based on the maximum distance between the empirical
cumulative distribution function (ECDF) of the numbers generated and the CDF of
UNIF(0, 1). (See Example 4.8 and Figure 4.7.) P-values very near 0 indicate a poor
fit to UNIF(0, 1) and P-values very near 1 may indicate a suspiciously good fit. c

> ks.test(runif(50000), punif)

One-sample Kolmogorov-Smirnov test

data: runif(50000)
D = 0.0047, p-value = 0.2115
alternative hypothesis: two.sided

> ks.test(seq(0,1,len=5000), punif) # fake perfect fit

One-sample Kolmogorov-Smirnov test

data: seq(0, 1, len = 5000)


D = 2e-04, p-value = 1
alternative hypothesis: two.sided

2.8 The code used to make the two plots in the top row of Figure 2.1 (p26) is
shown below. The function runif is used in the left panel to “jitter” (randomly
displace) two plotted points slightly above and to the right of each of the 100
grid points in the unit square. The same function is used more simply in the
right panel to put 200 points at random into the unit square.
set.seed(121); n = 100
par(mfrow=c(1,2), pty="s") # 2 square panels per graph
# Left Panel
s = rep(0:9, each=10)/10 # grid points
t = rep(0:9, times=10)/10
x = s + runif(n, .01, .09) # jittered grid points
y = t + runif(n, .01, .09)
plot(x, y, pch=19, xaxs="i", yaxs="i", xlim=0:1, ylim=0:1)
#abline(h = seq(.1, .9, by=.1), col="green") # grid lines
#abline(v = seq(.1, .9, by=.1), col="green")

# Right Panel
x=runif(n); y = runif(n) # random points in unit square
plot(x, y, pch=19, xaxs="i", yaxs="i", xlim=0:1, ylim=0:1)
par(mfrow=c(1,1), pty="m") # restore default plotting

a) Run the program (without the grid lines) to make the top row of Figure 2.1
for yourself. Then remove the # symbols at the start of the two abline
statements so that grid lines will print to show the 100 cells of your left
panel. See Figure 2.12 (p47).
b) Repeat part (a) several times without the seed statement (thus getting a
different seed on each run) and without the grid lines to see a variety of
examples of versions of Figure 2.1. Comment on the degree of change in
the appearance of each with the different seeds.
c) What do you get from a single plot with plot(s, t)?
d The lower-left corners of the 100 cells. c

d) If 100 points are placed at random into the unit square, what is the prob-
ability that none of the 100 cells of this problem are left empty? (Give
your answer in exponential notation with four significant digits.)
d This is very similar to the birthday matching problem of Example 1.2. The answer
is 100!/100^100 = 9.333 × 10^−43. Don’t wait around for this event to happen.

Below we show two methods of computing this quantity in R, the latter of which is
theoretically preferable. Factorials and powers grow so rapidly that some computer
packages would not be able to handle the first method without overflowing. The
second method, sometimes called “zippering,” involves the product of 100 factors,
each of manageable size. It turns out that R is up to the task of computing the first
method correctly, so you could use either. (But if there were 15^2 = 225 cells, you
wouldn’t have a choice.) c

> factorial(100)/100^100 > factorial(225)/225^225


[1] 9.332622e-43 [1] NaN # "Not a Number"

> prod((1:100)/100) > prod((1:250)/250)


[1] 9.332622e-43 [1] 1.058240e-107

Note: Consider nesting habits of birds in a marsh. From left to right in Figure 2.1, the
first plot shows territorial behavior that tends to avoid close neighbors. The second
shows random nesting in which birds choose nesting sites entirely independently of
other birds. The third shows a strong preference for nesting near the center of the
square. The last shows social behavior with a tendency to build nests in clusters.
2.9 (Theoretical) Let U ∼ UNIF(0, 1). In each part below, modify equa-
tion (2.2) to derive the cumulative distribution function of X, and then take
derivatives to find the density function.
a) Show that X = (b − a)U + a ∼ UNIF(a, b), for real numbers a and b with
a < b. Specify the support of X.
d The support of X is (a, b). For a < x < b,
    F_X(x) = P{X ≤ x} = P{(b − a)U + a ≤ x} = P{U ≤ (x − a)/(b − a)} = (x − a)/(b − a),

where the last equation follows from F_U(u) = u, for 0 < u < 1. Also, F_X(x) = 0,
for x ≤ a, and F_X(x) = 1, for x ≥ b. Then, taking the derivative of this piecewise
differentiable CDF, we have the density function f_X(x) = F'_X(x) = 1/(b − a), for
a < x < b, and 0 elsewhere. c
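Extra: A quick simulation check of this result (the values a = 2 and b = 5 below are
arbitrary illustrative choices, not part of the problem):

set.seed(1)
a = 2; b = 5; u = runif(100000); x = (b - a)*u + a   # should be UNIF(2, 5)
mean(x); (a + b)/2                # simulated and exact mean
mean(x <= 3); (3 - a)/(b - a)     # simulated and exact P{X <= 3}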

b) What is the distribution of X = 1 − U ? [Hints: Multiplying an inequality


by a negative number changes its sense (direction). P (Ac ) = 1 − P (A).
A continuous distribution assigns probability 0 to a single point.]
d The support of X is (0, 1). For 0 < x < 1,

    F_X(x) = P{X ≤ x} = P{1 − U ≤ x} = P{U ≥ 1 − x}
           = 1 − P{U ≤ 1 − x} = 1 − (1 − x) = x,

where we have used the 0 probability of a single point in moving from the first line
to the second. Outside the interval of support, F_X(x) = 0, for x ≤ 0, and F_X(x) = 1,
for x ≥ 1. Here again, the CDF is piecewise differentiable, and so we have the density
function f_X(x) = F'_X(x) = 1, for 0 < x < 1, and 0 elsewhere. Thus, X ∼ UNIF(0, 1)
also. c

2.10 In Example 2.4, we used the random R function runif to sample from
the distribution BETA(0.5, 1). Here we wish to sample from BETA(2, 1).
a) Write the density function, cumulative distribution function, and quantile
function of BETA(2, 1). According to the quantile transformation method,
explain how to use U ∼ UNIF(0, 1) to sample from BETA(2, 1).
d All distributions in the beta family have support (0, 1). Let 0 < x, y < 1. The
density function of X ∼ BETA(2, 1) is f_X(x) = 2x, because Γ(3) = 2! = 2 and
Γ(2) = Γ(1) = 1. Integrating, we have the CDF y = F_X(x) = x^2, so that the
quantile function is x = F_X^{-1}(y) = √y. Thus, X = F_X^{-1}(U) = √U ∼ BETA(2, 1). c
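Extra: As a quick numerical check, R’s quantile function for BETA(2, 1) agrees with
the square-root form derived above.

y = c(.1, .25, .5, .9)
cbind(sqrt(y), qbeta(y, 2, 1))   # the two columns should match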

b) Modify equation (2.2) as appropriate to this situation.


d For 0 < x < 1, we have

    F_X(x) = P{X ≤ x} = P{√U ≤ x} = P{U ≤ x^2} = x^2,

and f_X(x) = F'_X(x) = 2x, which is the density function of BETA(2, 1). We leave it
to you to supply the values of F and f outside (0, 1). For additional applications of
the method of equation (2.2), see Problem 2.9. c

c) Modify the program of Example 2.4 to illustrate the method of part (a).
Of course, you will need to change the code for x and cut.x and the code
used to plot the density function of BETA(2, 1). Also, change the code to
simulate a sample of 100 000 observations, and use 20 bars in each of the
histograms. Finally, we suggest changing the ylim parameters so that the
vertical axes of the histograms include the interval (0, 2). See Figure 2.10.
d The program below shows the required changes. Lines with changes are marked
with ##. Alternatives: define the cutpoints using cut.u = seq(0, 1, len=21) and
plot the density function of BETA(2, 1) using dbeta(xx, 2, 1). Except for some
embellishments to make a clearer image for publication, the program produces his-
tograms as in Figure 2.10 on p45 of the text. c

set.seed(3456)
m = 100000
u = runif(m)
x = sqrt(u) ##
xx = seq(0, 1, by=.001)
cut.u = (0:20)/20 ##
cut.x = sqrt(cut.u)

par(mfrow=c(1,2))
hist(u, breaks=cut.u, prob=T, ylim=c(0,2)) ##
lines(xx, dunif(xx), col="blue")
hist(x, breaks=cut.x, prob=T, ylim=c(0,2)) ##
lines(xx, 2*xx, col="blue") ##
par(mfrow=c(1,1))

2.11 The program below simulates 10 000 values of X ∼ EXP(λ = 1),


using the quantile transformation method. That is, X = − log(U )/λ, where
U ∼ UNIF(0, 1). A histogram of results is shown in Figure 2.11.
set.seed(1212)
m = 10000; lam = 1
u = runif(m); x = -log(u)/lam

cut1 = seq(0, 1, by=.1) # for hist of u, not plotted


cut2 = -log(cut1)/lam; cut2[1] = max(x); cut2 = sort(cut2)
hist(x, breaks=cut2, ylim=c(0,lam), prob=T, col="wheat")
xx = seq(0, max(x), by = .01)
lines(xx, lam*exp(-lam*xx), col="blue")

mean(x); 1/lam # simulated and exact mean


median(x); qexp(.5, lam) # simulated and exact median
hist(u, breaks=cut1, plot=F)$counts # interval counts for UNIF
hist(x, breaks=cut2, plot=F)$counts # interval counts for EXP

> mean(x); 1/lam # simulated and exact mean


[1] 1.008768
[1] 1
> median(x); qexp(.5, lam) # simulated and exact median
[1] 0.6933868
[1] 0.6931472
> hist(u, breaks=cut1, plot=F)$counts # interval counts for UNIF
[1] 1007 1041 984 998 971 1018 990 1004 994 993
> hist(x, breaks=cut2, plot=F)$counts # interval counts for EXP
[1] 993 994 1004 990 1018 971 998 984 1041 1007

a) If X ∼ EXP(λ), then E(X) = 1/λ (see Problem 2.12). Find the median of
this distribution by setting FX (x) = 1−e−λx = 1/2 and solving for x. How
accurately does your simulated sample of size 10 000 estimate the popu-
lation mean and median of EXP(1)? [The answer for λ = 1 is qexp(.5).]
d In general, FX (η) = 1 − e−λη = 1/2 implies η = − ln(.5)/λ. So if λ = 2, the median
is η = 0.3465736. In R this could also be obtained as qexp(.5, 2). For λ = 1, the
simulation above gives the desired result 0.693 correct to three places. c

b) The last two lines of the program (counts from unplotted histograms)
provide counts for each interval of the realizations of U and X, respec-
tively. Report the 10 counts in each case. Explain why their order gets
reversed when transforming from uniform to exponential. What is the
support of X? Which values in the support (0, 1) of U correspond to the
largest values of X? Also, explain how cut2 is computed and why.
d A straightforward application of the quantile method simulates X ∼ EXP(λ = 1)
as X = − ln(1 − U′), where U′ ∼ UNIF(0, 1). However, as shown in Problem 2.9(b),
we also have U = 1 − U′ ∼ UNIF(0, 1). So, as in the program above, one often
simplifies the simulation, using X = − ln(U) ∼ EXP(1). This substitution of U
for 1 − U′ causes a “reversal,” so that the smallest values of U get transformed into
the values farthest into the right tail of X.
The first element of cut2 = -log(cut1)/lam is infinite, so we use the maximum
sampled value of X instead. When the transformed cutpoints are sorted from small-
est to largest, this means we use our largest value of X as the upper end of the
rightmost bin of the histogram of simulated values of X. c

c) In Figure 2.11, each histogram bar represents about 1000 values of X, so


that the bars have approximately equal area. Make a different histogram of
these values of X using breaks=10 to get about 10 intervals of equal width
along the horizontal axis. (For most purposes, intervals of equal width are
easier to interpret.) Also, as an alternate method to overlay the density curve,
use dexp(xx, lam) instead of lam*exp(-lam*xx).
d In the function hist, equal spacing of cutpoints is the default. With the parameter
breaks=10 the hist function used 9 bins with integer endpoints.
The default number of bins (with no specified breaks parameter) turns out
to be 9 for our simulation—with integer and half-integer cutpoints. Different sim-
ulations yield different values (most crucially, different maximum values) and thus
noticeably different binning. If you think it is important to have exactly 10 bins then
define cutp = seq(0, max(x), len=11) and use parameter breaks=cutp in hist.
Extra output at the end of the program shows results for seed 1212 and both of
these additional choices of cutpoints. c

set.seed(1212)
m = 10000; lam = 1
u = runif(m); x = -log(u)/lam

hist(x, breaks=10, ylim=c(0,lam), prob=T, col="wheat")


xx = seq(0, max(x), by = .01)
lines(xx, dexp(xx, lam), col="blue")
hist(x, breaks=10, plot=F)$counts # interval counts
hist(x, breaks=10, plot=F)$breaks # default cutpoints

> hist(x, breaks=10, plot=F)$counts # interval counts


[1] 6280 2332 876 307 134 49 16 5 1
> hist(x, breaks=10, plot=F)$breaks # default cutpoints
[1] 0 1 2 3 4 5 6 7 8 9

# default binning
> hist(x, plot=F)$counts # interval counts
[1] 3906 2374 1423 909 571 305 191 116 83
[10] 51 33 16 13 3 2 3 1
> hist(x, plot=F)$breaks # default cutpoints
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
[13] 6.0 6.5 7.0 7.5 8.0 8.5

# forcing exactly 10 bins


> hist(x, breaks=cutp, plot=F)$counts # interval counts
[1] 5667 2434 1098 431 207 96 42 18 3 4
> hist(x, breaks=cutp, plot=F)$breaks # forced cutpoints
[1] 0.000000 0.841578 1.683156 2.524734 3.366312
[6] 4.207890 5.049468 5.891046 6.732624 7.574202
[11] 8.415780

d) Run the program again with lam = 1/2. Describe and explain the results
of this change. (Notice that EXP(1/2) = CHISQ(2) = GAMMA(1, 1/2).
See Problems 2.12 and 2.13.)
d The smaller rate λ = 1/2 corresponds to the larger mean µ = 2. Moreover, if we
use the same seed, then each of the simulated values is precisely doubled. To see
the effect, compare the first set of results below with those shown in part (a). The
second set of results below is for a different seed. c

set.seed(1212)
m = 10000; lam = 1/2
u = runif(m); x = -log(u)/lam

cut1 = seq(0, 1, by=.1) # for hist of u, not plotted


cut2 = -log(cut1)/lam; cut2[1] = max(x); cut2 = sort(cut2)
hist(x, breaks=cut2, ylim=c(0,lam), prob=T, col="wheat")
xx = seq(0, max(x), by = .01)
lines(xx, lam*exp(-lam*xx), col="blue")
mean(x); 1/lam # simulated and exact mean
median(x); qexp(.5, lam) # simulated and exact median
hist(u, breaks=cut1, plot=F)$counts # interval counts for UNIF
hist(x, breaks=cut2, plot=F)$counts # interval counts for EXP

> mean(x); 1/lam # simulated and exact mean


[1] 2.017537
[1] 2
> median(x); qexp(.5, lam) # simulated and exact median
[1] 1.386774
[1] 1.386294
> hist(u, breaks=cut1, plot=F)$counts # interval counts for UNIF
[1] 1007 1041 984 998 971 1018 990 1004 994 993
> hist(x, breaks=cut2, plot=F)$counts # interval counts for EXP
[1] 993 994 1004 990 1018 971 998 984 1041 1007

# with seed 1213 (slightly abbreviated)


> mean(x) # simulated mean
[1] 1.995580
> median(x) # simulated median
[1] 1.388134
> hist(x, breaks=cut2, plot=F)$counts # interval counts for EXP
[1] 986 1020 977 1051 959 991 1001 1110 915 990

2.12 Distributions in Example 2.6 (Theoretical). The density function for
the gamma family of distributions GAMMA(α, λ) is f_T(t) = [λ^α / Γ(α)] t^{α−1} e^{−λt},
for t > 0. Here α > 0 is the shape parameter and λ > 0 is the rate parameter.
Two important subfamilies are the exponential EXP(λ) with α = 1 and the
chi-squared CHISQ(ν) with ν = 2α (called degrees of freedom) and λ = 1/2.
a) Show that the density function of CHISQ(2) shown in the example is con-
sistent with the information provided above. Recall that Γ (α) = (α − 1)!,
for integer α > 0.
d Substitute α = ν/2 = 1 and λ = 0.5 into the general gamma density function:
    f_T(t) = [λ^α / Γ(α)] t^{α−1} e^{−λt} = [0.5 / Γ(1)] t^{1−1} e^{−0.5t} = (0.5)e^{−0.5t},
for t > 0. Thus, the right-hand side is the density function of CHISQ(2) as given in
Example 2.6. This is also the density function of EXP(λ = 1/2). (The symbol ν is
the small Greek letter nu, pronounced to rhyme with too). c
b) For T ∼ CHISQ(2), show that E(T ) = 2. (Use the density function and
integration by parts, or use the moment generating function in part (c).)
d Integration by parts. To begin, let X ∼ EXP(λ = 1). Then, integrating by parts,
we have

    E(X) = ∫_0^∞ x e^{−x} dx = [−x e^{−x}]_0^∞ + ∫_0^∞ e^{−x} dx = 0 + 1 = 1.

In general, if Y ∼ EXP(λ), then Y = X/λ and E(Y) = 1/λ. In the particular case
T ∼ CHISQ(2) = EXP(1/2), we have E(T) = 2.
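The integrals above can also be verified numerically with R’s integrate function
(a quick check, not a formal proof):

integrate(function(x) x*exp(-x), 0, Inf)           # E(X) for EXP(1): about 1
integrate(function(t) t*0.5*exp(-0.5*t), 0, Inf)   # E(T) for EXP(1/2) = CHISQ(2): about 2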
Moment generating function. The moment generating function (MGF) of a random
variable X (if it exists) is

    m_X(s) = E(e^{sX}) = E(1 + sX + (sX)^2/2! + (sX)^3/3! + (sX)^4/4! + · · ·).
The existence of the MGF requires that the integral (or sum) for E(e^{sX}) exist in a
neighborhood of s = 0. In the second equality we have used the Taylor (Maclaurin)
expansion of e^{sX} in such a way as to assume that interchange of the order of in-
tegration and infinite summation is valid. Then, also assuming differentiation with
respect to s and integration can be interchanged, we have

    m'_X(s) = E(X + sX^2 + s^2 X^3/2! + s^3 X^4/3! + · · ·),

so that m'_X(0) = E(X). In general, taking the kth derivative of m_X(s) abrades the
terms in the series before the kth, and evaluating the kth derivative at s = 0 nullifies
terms after the kth, so m_X^{[k]}(0) = E(X^k).
The MGF of X ∼ EXP(λ), for s < λ, is

    m_X(s) = ∫_0^∞ e^{sx} λe^{−λx} dx = λ ∫_0^∞ e^{(s−λ)x} dx = λ(λ − s)^{−1}.

Then m'_X(s) = λ(λ − s)^{−2}, and m'_X(0) = E(X) = 1/λ. Finally, in particular, if
T ∼ CHISQ(2) = EXP(1/2), then E(T) = 2. c

c) The moment generating function of X ∼ CHISQ(ν), for s < 1/2, is

    m_X(s) = E(e^{sX}) = (1 − 2s)^{−ν/2}.

If Z ∼ NORM(0, 1), with density function ϕ(z) = (1/√(2π)) e^{−z^2/2}, then the
moment generating function of Z^2 is

    m(s) = m_{Z^2}(s) = ∫_{−∞}^{∞} exp(sz^2) ϕ(z) dz = 2 ∫_0^∞ exp(sz^2) ϕ(z) dz.

Show that this simplifies to m(s) = (1 − 2s)^{−1/2}, so that Z^2 ∼ CHISQ(1).
Recall that Γ(1/2) = √π.
d To find the density function of T ∼ CHISQ(1), substitute α = λ = 1/2 into the
general gamma density function: for t > 0,

    f_T(t) = [λ^α / Γ(α)] t^{α−1} e^{−λt} = [(1/2)^{1/2} / Γ(1/2)] t^{1/2−1} e^{−(1/2)t} = (1/√(2πt)) e^{−t/2}.

For s < 1/2, continuing from the statement of the problem, the MGF of Z^2 is

    m_{Z^2}(s) = 2 ∫_0^∞ exp(sz^2) ϕ(z) dz = 2 ∫_0^∞ (1/√(2π)) exp[−(1 − 2s)z^2/2] dz
               = 2 ∫_0^∞ (1/√(2π)) exp(−t/2) dt/[2√((1 − 2s)t)]
               = (1 − 2s)^{−1/2} ∫_0^∞ (1/√(2πt)) e^{−t/2} dt = (1 − 2s)^{−1/2},

where we use the change of variable t = (1 − 2s)z^2 in the second line, so that
dt = 2(1 − 2s)z dz and dz = dt/[2√((1 − 2s)t)]. Also, the final equality holds because
the density function of CHISQ(1) integrates to 1.
This argument uses the uniqueness property of moment generating functions.
No two distributions have the same MGF. There are some distributions that do
not have moment generating functions. However, among those that do, there is a
one-to-one correspondence between distributions and moment generating functions.
In the current demonstration, we know that Z^2 ∼ CHISQ(1) because we have found
the MGF of Z^2 and it matches the MGF of CHISQ(1). c
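Extra: A brief simulation check of this MGF at a single point (the value s = 0.2
below is an arbitrary illustrative choice with s < 1/2):

set.seed(1)
z = rnorm(100000); s = 0.2
mean(exp(s*z^2)); (1 - 2*s)^(-1/2)   # both approximately 1.29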

d) If X and Y are independent random variables with moment generating


functions mX (s) and mY (s), respectively, then mX+Y (s) = mX (s)mY (s).
Use this property of moment generating functions to show that, if Zi are
independently NORM(0, 1), then Z_1^2 + Z_2^2 + · · · + Z_ν^2 ∼ CHISQ(ν).
d For ν = 2, let Q_2 = Z_1^2 + Z_2^2, so that

    m_{Q_2}(s) = m_{Z_1^2}(s) × m_{Z_2^2}(s) = (1 − 2s)^{−1/2} × (1 − 2s)^{−1/2} = (1 − 2s)^{−2/2},

which is the MGF of CHISQ(2). By induction, we can multiply several MGFs to
find the distribution of the sum of several independent random variables. Thus, the
MGF of Z_1^2 + Z_2^2 + · · · + Z_ν^2 is (1 − 2s)^{−ν/2}, which is the MGF of CHISQ(ν). c

2.13 Simulations for chi-squared random variables. The first block of code
in Example 2.6 illustrates that the sum of squares of two standard normal
random variables is distributed as CHISQ(2). (Problem 2.12 provides formal
proof.) Modify the code in the example to do each part below. For simplicity,
when plotting the required density functions, use dens = dchisq(tt, df)
for df suitably defined.
a) If Z ∼ NORM(0, 1), then illustrate by simulation that Z^2 ∼ CHISQ(1)
and that E(Z^2) = 1.
d Our modified code is shown below. The histogram (omitted here) shows a good
fit to the density function of CHISQ(1). As in Problem 2.12, MGFs
can be used to show that CHISQ(ν) has mean ν and variance 2ν.
For additional verification, the last line of this program performs a Kolmogorov-
Smirnov test of goodness-of-fit. For seed 1234, the P-value 0.2454 of this test (output
omitted) is consistent with good fit of the simulated observations to CHISQ(1). The
K-S test is discussed briefly at the end of the answers to Problem 2.15. c

set.seed(1234)
m = 40000; z = rnorm(m); t = z^2
hist(t, breaks=30, prob=T, col="wheat")
tt = seq(0, max(t), length=200); dens = dchisq(tt, 1)
lines(tt, dens, col="blue")
mean(t); var(t)
ks.test(t, pchisq, 1)

> mean(t); var(t)


[1] 0.9986722
[1] 1.973247

b) If Z_1, Z_2, and Z_3 are independently NORM(0, 1), then illustrate by simu-
lation that T = Z_1^2 + Z_2^2 + Z_3^2 ∼ CHISQ(3) and that E(T) = 3.
d Instead of the obvious, straightforward modification of the program of the example,
we show a program below that can be easily modified to provide a simulation for
general CHISQ(ν). We generate mν standard normal random variables and put them
into an m × ν matrix. Each row of the matrix is used to simulate one of the m
observations from CHISQ(ν = 3). (For seed 1235, the P-value of the K-S test, from
ks.test(t, pchisq, nu), is 0.1173.) To ensure that there is room to plot the density
function, we adjust the vertical plotting window. c

set.seed(1235)
m = 100000; nu = 3
z = rnorm(m * nu); DTA = matrix(z, nrow=m); t = rowSums(DTA^2)
tt = seq(0, max(t), length=200); dens = dchisq(tt, nu)
hist(t, breaks=30, ylim=c(0, 1.1*max(dens)), prob=T, col="wheat")
lines(tt, dens, col="blue")
mean(t); nu # simulated and exact mean
var(t); 2*nu # simulated and exact variance

> mean(t); nu # simulated and exact mean


[1] 3.005203
[1] 3
> var(t); 2*nu # simulated and exact variance
[1] 6.110302
[1] 6

2.14 Illustrating the Box-Muller method. Use the program below to imple-
ment the Box-Muller method of simulating a random sample from a standard
normal distribution. Does the histogram of simulated values seem to agree
with the standard normal density curve? What do you conclude from the
chi-squared goodness-of-fit statistic? (This statistic, based on 10 bins with
E(Ni ) = Ei , has the same approximate chi-squared distribution as in Prob-
lem 2.4, but here the expected counts Ei are not the same for all bins.) Before
drawing firm conclusions, run this program several times with different seeds.
set.seed(1236)
m = 2*50000; z = numeric(m)
u1 = runif(m/2); u2 = runif(m/2)
z1 = sqrt(-2*log(u1)) * cos(2*pi*u2) # half of normal variates
z2 = sqrt(-2*log(u1)) * sin(2*pi*u2) # other half
z[seq(1, m, by = 2)] = z1 # interleave
z[seq(2, m, by = 2)] = z2 # two halves
cut = c(min(z)-.5, seq(-2, 2, by=.5), max(z)+.5)
hist(z, breaks=cut, ylim=c(0,.4), prob=T)
zz = seq(min(z), max(z), by=.01)
lines(zz, dnorm(zz), col="blue")
E = m*diff(pnorm(c(-Inf, seq(-2, 2, by=.5), Inf))); E
N = hist(z, breaks=cut, plot=F)$counts; N
Q = sum(((N-E)^2)/E); Q; qchisq(c(.025,.975), 9)

> E = m*diff(pnorm(c(-Inf, seq(-2, 2, by=.5), Inf))); E


[1] 2275.013 4405.707 9184.805 14988.228 19146.246
[6] 19146.246 14988.228 9184.805 4405.707 2275.013
> N = hist(z, breaks=cut, plot=F)$counts; N
[1] 2338 4371 9140 14846 19161 19394 15067 9061
[9] 4400 2222
> Q = sum(((N-E)^2)/E); Q; qchisq(c(.025,.975), 9)
[1] 10.12836
[1] 2.700389 19.022768

d The data from the run shown are consistent with a standard normal distribution.
(Also, the K-S test for this run has P-value 0.7327.) Additional runs (seeds unspec-
ified) yielded Q = 10.32883, 7.167461, 8.3525, 2.360288, and 9.192768. Because Q is
approximately distributed as CHISQ(9) we expect values averaging 9, and falling
between 2.7 and 19.0 for 95% of the simulation runs. c

2.15 Summing uniforms to simulate a standard normal. The Box-Muller


transformation requires the evaluation of logarithms and trigonometric func-
tions. Some years ago, when these transcendental computations were very
time-intensive (compared with addition and subtraction), the following method
of simulating a standard normal random variable Z from 12 independent ran-
dom variables Ui ∼ UNIF(0, 1) was commonly used: Z = U1 +U2 +. . .+U12 −6.
However, with current hardware, transcendental operations are relatively
fast, so this method is now deprecated—partly because it makes 12 calls to
the random number generator for each standard normal variate generated.
(You may discover other reasons as you work this problem.)
a) Using the fact that a standard uniform random variable has mean 1/2 and
variance 1/12, show that Z as defined above has E(Z) = 0 and V(Z) = 1.
(The assumed near normality of such a random variable Z is based on the
Central Limit Theorem, which works reasonably well here—even though
only 12 random variables are summed.)
d Mean. The mean of the sum of random variables is the sum of their means:

    E(Z) = E(U_1 + U_2 + · · · + U_{12} − 6) = Σ_{i=1}^{12} E(U_i) − 6 = 12(1/2) − 6 = 0.

Variance. The variance of the sum of mutually independent random variables is the
sum of their variances:

    V(Z) = V(U_1 + U_2 + · · · + U_{12} − 6) = Σ_{i=1}^{12} V(U_i) = 12(1/12) = 1,

recalling that the variance of a constant is 0. c

b) For the random variable Z of part (a), evaluate P {−6 < Z < 6}. Theoret-
ically, how does this result differ for a random variable Z that is precisely
distributed as standard normal?
d If all 12 of the U_i were 0, then Z = −6; if all of them were 1, then Z = 6. Thus,
P{−6 < Z < 6} = 1, exactly. If Z′ ∼ NORM(0, 1), then P{−6 < Z′ < 6} ≈ 1, but
not exactly. There is some probability in the “far tails” of standard normal, but not
much. We can get the exact probability of the two tails in R as 2*pnorm(-6),
which returns 1.973175 × 10^−9, about 2 chances in a billion. Nevertheless, the main
difficulty simulating NORM(0, 1) with a sum of 12 uniformly distributed random
variables is that the shape of the density function of the latter is not precisely
normal within (−6, 6). c

c) The following program implements the method of part (a) to simulate


100,000 (nearly) standard normal observations by making a 100 000 × 12
matrix DTA, summing its rows, and subtracting 6 from each result.
Does the histogram of simulated values seem to agree with the stan-
dard normal density curve? What do you conclude from the chi-squared

goodness-of-fit test? (Assuming normality, this statistic has the same ap-
proximate chi-squared distribution as in Problem 2.4. Here again, there
are 10 bins, but now the expected counts Ei are not all the same.) Before
drawing firm conclusions, run this program several times with different
seeds. Also, make a few runs with m = 10 000 iterations.
d Below the program we show a handmade table of results from the chi-squared and
the Kolmogorov-Smirnov goodness-of-fit tests—both for m = 100 000 and for m = 10 000.
(The first results for m = 100 000 are from seed 1237.) For truly normal data, we
expect Q values averaging 9, and falling between 2.7 and 19 in 95% of simulation
runs. Also, we expect K-S P-values between 2.5% and 97.5%. Symbols * indicate
values suggesting poor fit, + for a value suggesting suspiciously good fit. This method
is inferior to the Box-Muller method of Problem 2.14 and should not be used. c

set.seed(1237)
m = 100000; n = 12
u = runif(m*n); UNI = matrix(u, nrow=m)
z = rowSums(UNI) - 6

cut = c(min(z)-.5, seq(-2, 2, by=.5), max(z)+.5)


hist(z, breaks=cut, ylim=c(0,.4), prob=T)
zz = seq(min(z), max(z), by=.01)
lines(zz, dnorm(zz), col="blue")

E = m*diff(pnorm(c(-Inf, seq(-2, 2, by=.5), Inf))); E


N = hist(z, breaks=cut, plot=F)$counts; N
Q = sum(((N-E)^2)/E); Q; qchisq(c(.025,.975), 9)

Chisq K-S P-val

m = 100000
-----------------------
25.46162* 0.07272492
31.97759* 0.1402736
37.38722* 0.02045037*
10.00560 0.1683760
18.00815* 0.7507103
37.38722* 0.02045037*

m = 10000
-----------------------
6.72417 0.2507788
12.34241 0.01991226*
1.928343+ 0.6481974
9.201392 0.7471199
4.814753 0.5253633
12.67015 0.651179

2.16 Random triangles (Project). If three points are chosen at random from
a standard bivariate normal distribution (µ1 = µ2 = ρ = 0, σ1 = σ2 = 1),
then the probability that they are vertices of an obtuse triangle is 3/4. Use
simulation to illustrate this result. Perhaps explore higher dimensions. (See
Portnoy (1994) for a proof and for a history of this problem tracing back to
Lewis Carroll.)
d Below is a program to simulate the probability of an obtuse triangle in n = 2
dimensions. The code can easily be generalized to n > 2 dimensions. The number
of random triangles generated and evaluated is m. There are three matrices a, b,
and c: one for each vertex of a triangle. Each matrix contains m rows: one for each
triangle generated; the n entries in each row are the coordinates of one vertex of the
triangle.
The ith components of m-vectors AB, AC, and BC are the squared lengths of the
sides of the ith random triangle. A triangle is obtuse if the squared length of its
longest side exceeds the sum of the squared lengths of the other two sides.

set.seed(1238)
m = 1000000 # number of triangles
n = 2 # dimensions

a = matrix(rnorm(m*n), nrow=m) # coord of vertex A


b = matrix(rnorm(m*n), nrow=m) # coord of vertex B
c = matrix(rnorm(m*n), nrow=m) # coord of vertex C

AB = rowSums((a-b)^2) # squared side lengths


AC = rowSums((a-c)^2)
BC = rowSums((b-c)^2)

pr.obtuse = mean(AB>AC+BC) + mean(AC>AB+BC) + mean(BC>AB+AC)


pr.obtuse

> pr.obtuse
[1] 0.749655

Background. In 1893 Charles Dodgson, writing as Lewis Carroll, published a collec-


tion of “Pillow Problems” that he said he had pondered while trying to fall asleep.
One of these was: “Three points are taken at random on an infinite plane. Find
the chance of their being the vertices of an obtuse-angled triangle.” This is not
a well-posed problem because there exists no uniform distribution over the entire
plane.
Taking a fixed horizontal segment to be the longest side of a triangle and choosing
the third vertex at random within the allowable region, “Carroll” obtained the
answer: 3π/(8π − 6√3) ≈ 0.64. [Carroll, L. (1893). Curiosa Mathematica, Part II:
Pillow Problems. MacMillan, London.]
However, another solution based on taking a fixed horizontal segment to be the
second longest side leads to a very different result: 3π/(2π + 3√3) ≈ 0.82. This
additional formulation of the problem makes one wonder if there may be many
“solutions” depending on how “randomness” is interpreted.
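For reference, the two values quoted above are easily evaluated in R:

3*pi/(8*pi - 6*sqrt(3))   # “Carroll’s” answer, about 0.64
3*pi/(2*pi + 3*sqrt(3))   # second formulation, about 0.82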

Clear-cut interpretations of the problem are to put a uniform distribution on a


square, circle or other shape of finite area. Ruma Falk and Ester Samuel-Cahn pub-
lished results for triangles with random vertices in a circle, a square, an equilateral
triangle, and rectangles of various ratios of width to height. For example, within a
square they used simulation to obtain a 95% CI for the probability: 0.7249 ± 0.0009.
[Falk, R.; Samuel-Cahn, E. (2001). Lewis Carroll’s obtuse problem. Teaching Statis-
tics, 23(3), p72-75.]
In a Summer 2009 class project, Raymond Blake reported the probability
0.72521 ± 0.00003 based on m = 10^9 simulated triangles in a square, using sev-
eral networked computers. Blake also investigated the problem based on uniform
distributions over a circle, on the circumference of a circle, and over hyperspheres
and hypercubes of several dimensions. Blake’s simulations over circles and
(hyper)spheres were based on generating points uniformly in a (hyper)cube and
disregarding unwanted points. In a Spring 2011 project, Rebecca Morganstern sug-
gested adapting the method of Example 7.7 and Problem 7.15 to generate vertices
uniformly, precisely within a convex shape of interest.
Finally, we return to the specific simulation of this problem. Portnoy (1994)
argues that, for a statistician, a natural interpretation of Carroll’s problem is to
consider an uncorrelated multivariate standard normal distribution, which has the
entire plane as its support. At a technical level that should be accessible to many
readers of our book, he provides a proof that the probability of getting an obtuse
triangle is 3/4, if the vertices are randomly chosen from such a multivariate normal
distribution. c

Errors in Chapter 2
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p26 Example 2.1. In the first line below Figure 2.1: ri = 21 should be r1 = 21.
[Thanks to Jeff Glickman.]

3
Monte Carlo Integration

3.1 Computation of simple integrals in Section 3.1.


a) If W ∼ UNIF(0, 30), sketch the density function of W and the area that
represents P {W ≤ 10}. In general, if X ∼ UNIF(α, β), write the formula
for P {c < X < d}, where α < c < d < β.
d The formula is P {c < X < d} = (d − c)/(β − α). Below is R code that plots the
density function of UNIF(0, 30) and shades area for P {W ≤ 10}. c

u = seq(-10, 40, by = .01)


fill = seq(0, 10, by=.01)
plot(u, dunif(u, 0, 30), type="l", lwd = 2, col="blue",
xlim=c(-5, 35), ylim=c(0, .035), ylab="Density")
lines(fill, dunif(fill, 0, 30), type="h", col="lightblue")
abline(h=0, col="darkgreen")
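Numerically, P{W ≤ 10} = (10 − 0)/(30 − 0) = 1/3, which can also be obtained from
the uniform CDF in R:

punif(10, 0, 30)   # 0.3333333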

b) If T ∼ EXP(0.5), sketch the exponential distribution with rate λ = 0.5


and mean µ = 2. Write P {T > 1} as an integral, and use calculus to
evaluate it.
d Calculus: P{T > 1} = ∫_1^∞ λe^{−λt} dt = [−e^{−λt}]_1^∞ = −lim_{M→∞} e^{−M/2} + e^{−1/2}
= e^{−1/2} = 0.6065, where the R code exp(-1/2) provides the numerical value. Also,
FT (1), the cumulative distribution function of T ∼ EXP(0.5) evaluated at 1, is de-
noted in R as pexp(1, 1/2), and 1 - pexp(1, 1/2) returns 0.6065307. The program
below makes the corresponding sketch. c

t = seq(-2, 12, by = .01)


fill = seq(1, 12, by=.01)
plot(t, dexp(t, 1/2), type="l", lwd = 2, col="blue",
xlim=c(-1, 10), ylab="Density")
lines(fill, dexp(fill, 1/2), type="h", col="lightblue")
abline(h=0, col="darkgreen")

3.2 We explore two ways to evaluate e^{−1/2} ≈ 0.61, correct to two deci-
mal places, using only addition, subtraction, multiplication, and division—
the fundamental operations available to the makers of tables 50 years ago. On
most modern computers, the evaluation of e^x is a chip-based function.
a) Consider the Taylor (Maclaurin) expansion e^x = Σ_{k=0}^{∞} x^k/k!. Use the
first few terms of this infinite series to approximate e^{−1/2}. How many
terms are required to get two-place accuracy? Explain.
k = 0:10; nr.terms = k + 1; x = -1/2
taylor.term = x^k/factorial(k); taylor.sums = cumsum(taylor.term)
round(cbind(k, nr.terms, taylor.sums), 10)
min(nr.terms[round(taylor.sums, 2) == round(exp(-1/2), 2)])

> round(cbind(k, nr.terms, taylor.sums), 10)


k nr.terms taylor.sums
[1,] 0 1 1.0000000
[2,] 1 2 0.5000000
[3,] 2 3 0.6250000
[4,] 3 4 0.6041667
[5,] 4 5 0.6067708 # Answer: 5 terms
[6,] 5 6 0.6065104
[7,] 6 7 0.6065321
[8,] 7 8 0.6065306
[9,] 8 9 0.6065307
[10,] 9 10 0.6065307
[11,] 10 11 0.6065307
> min(nr.terms[round(taylor.sums, 2) == round(exp(-1/2), 2)])
[1] 5

b) Use the relationship e^x = lim_{n→∞} (1 + x/n)^n. Notice that this is the limit
of an increasing sequence. What is the smallest value of k such that n = 2^k
gives two-place accuracy for e^{−1/2}?
k = 1:10; n = 2^k; x = -1/2; term = (1 + x/n)^n
cbind(k, n, term)
min(k[round(term, 2) == round(exp(-1/2), 2)])

> cbind(k, n, term)


k n term
[1,] 1 2 0.5625000
[2,] 2 4 0.5861816
[3,] 3 8 0.5967195
[4,] 4 16 0.6017103
[5,] 5 32 0.6041411
[6,] 6 64 0.6053410 # Answer: k = 6
[7,] 7 128 0.6059371
[8,] 8 256 0.6062342
[9,] 9 512 0.6063825
[10,] 10 1024 0.6064566

> min(k[round(term, 2) == round(exp(-1/2), 2)])


[1] 6

c) Run the following R script. For each listed value of x, say whether the
method of part (a) or part (b) provides the better approximation of ex .

d We add several lines to the code to show results in tabular and graphical formats.
In the table, if the column t.best shows 1 (for TRUE), then the sum of the first seven
terms of the Taylor expansion is at least as good as the 1024th term in the sequence.
(The two methods agree exactly for x = 0, and they agree to several decimal places
for values of x very near 0. But for accuracy over a wider interval, we might want
to sum a few more terms of the Taylor series.) c

x = seq(-2, 2, by=.25)
tay.7 = 1 + x + x^2/2 + x^3/6 + x^4/24 + x^5/120 + x^6/720
seq.1024 = (1 + x/1024)^1024; exact = exp(x)
t.err = tay.7 - exact; s.err = seq.1024 - exact
t.best = (abs(t.err) <= abs(s.err))
round(cbind(x, tay.7, seq.1024, exact, t.err, s.err, t.best), 5)
plot(x, s.err, ylim=c(-.035, .035),
ylab="Taylor.7 (solid) and Sequence.1044 Errors")
points(x, t.err, pch=19, col="blue")

> round(cbind(x, tay.7, seq.1024, exact, t.err, s.err, t.best), 5)


x tay.7 seq.1024 exact t.err s.err t.best
[1,] -2.00 0.15556 0.13507 0.13534 0.02022 -0.00026 0
[2,] -1.75 0.18193 0.17351 0.17377 0.00815 -0.00026 0
[3,] -1.50 0.22598 0.22288 0.22313 0.00285 -0.00025 0
[4,] -1.25 0.28732 0.28629 0.28650 0.00082 -0.00022 0
[5,] -1.00 0.36806 0.36770 0.36788 0.00018 -0.00018 1
[6,] -0.75 0.47239 0.47224 0.47237 0.00002 -0.00013 1
[7,] -0.50 0.60653 0.60646 0.60653 0.00000 -0.00007 1
[8,] -0.25 0.77880 0.77878 0.77880 0.00000 -0.00002 1
[9,] 0.00 1.00000 1.00000 1.00000 0.00000 0.00000 1
[10,] 0.25 1.28403 1.28399 1.28403 0.00000 -0.00004 1
[11,] 0.50 1.64872 1.64852 1.64872 0.00000 -0.00020 1
[12,] 0.75 2.11697 2.11642 2.11700 -0.00003 -0.00058 1
[13,] 1.00 2.71806 2.71696 2.71828 -0.00023 -0.00133 1
[14,] 1.25 3.48923 3.48768 3.49034 -0.00112 -0.00266 1
[15,] 1.50 4.47754 4.47677 4.48169 -0.00415 -0.00492 1
[16,] 1.75 5.74194 5.74601 5.75460 -0.01267 -0.00859 0
[17,] 2.00 7.35556 7.37466 7.38906 -0.03350 -0.01440 0

3.3 Change the values of the constants in the program of Example 3.1 as
indicated.
a) For a = 0 and b = 1, try each of the values m = 10, 20, 50, and 500.
Among these values of m, what is the smallest m that gives five-place
accuracy for P {0 < Z < 1}?

d We show the substitution m = 10 and make a table by hand to show the complete
set of answers. Compare with the exact value P {0 < Z < 1} = 0.3413447, computed
using diff(pnorm(c(0,1))). c
m = 10; a = 0; b = 1; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
const = 1/sqrt(2 * pi); h = const * exp(-g^2 / 2)
sum(w * h)

> sum(w * h)
[1] 0.3414456

m Riemann approximation
-------------------------------
10 0.3414456 above
20 0.3413700
50 0.3413488
500 0.3413448 5-place accuracy
5000 0.3413447 Example 3.1

b) For m = 5000, modify this program to find P {1.2 < Z ≤ 2.5}. Compare
your answer with the exact value obtained using the R function pnorm.
m = 5000; a = 1.2; b = 2.5; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
const = 1/sqrt(2 * pi)
h = const * exp(-g^2 / 2)
sum(w * h)
diff(pnorm(c(1.2, 2.5)))

> sum(w * h)
[1] 0.10886 # Riemann approximation
> diff(pnorm(c(1.2, 2.5)))
[1] 0.10886 # exact value from ’pnorm’

3.4 Modify the program of Example 3.1 to find P {X ≤ 1} for X expo-


nentially distributed with mean 2. The density function is f(x) = (1/2)e^{−x/2},
for x > 0. Run the program, and compare the result with the exact value
obtained using calculus and a calculator.
m = 5000; a = 0; b = 1; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
h = .5*exp(-.5*g)
sum(w * h)
1 - exp(-.5)

> sum(w * h)
[1] 0.3934693 # Riemann approximation
> 1 - exp(-.5)
[1] 0.3934693 # exact value from calculus

3.5 Run the program of Example 3.2 several times (omitting set.seed)
to evaluate P {0 < Z ≤ 1}. Do any of your answers have errors that exceed
the claimed margin of error 0.0015? Also, changing constants as necessary,
make several runs of this program to evaluate P {0.5 < Z ≤ 2}. Compare
your results with the exact value; the margin of error is larger here.
d To seven places, the exact value is P {0 < Z ≤ 1} = 0.3413447. Values from several
runs are shown here—all of which happen to fall within the claimed margin of error.
Of course, your values will differ slightly.
In Section 3.4 we show that the last line of code provides an approximate 95%
margin of error for this Monte Carlo integration over the unit interval (shown only
for the first run). Finally, we show two Monte Carlo values for P {0.5 < Z ≤ 2}. c

m = 500000; a = 0; b = 1; w = (b - a)/m
u = a + (b - a) * runif(m); h = dnorm(u)
mc = sum(w * h)
exact = pnorm(1)-.5
mc; abs(mc - exact)
2*(b-a)*sd(h)/sqrt(m)

> mc; abs(mc - exact)


[1] 0.3413576
[1] 1.290339e-05
> 2*(b-a)*sd(h)/sqrt(m)
[1] 0.0001369646
...
> mc; abs(mc - exact)
[1] 0.341341
[1] 3.778366e-06
...
> mc; abs(mc - exact)
[1] 0.3412556
[1] 8.918342e-05
...
> mc; abs(mc - exact)
[1] 0.3413883
[1] 4.356457e-05
...
> mc; abs(mc - exact)
[1] 0.3413206
[1] 2.413555e-05

m = 500000; a = 1/2; b = 2; w = (b - a)/m


u = a + (b - a) * runif(m); h = dnorm(u)
mc = sum(w * h)
exact = diff(pnorm(c(1/2, 2)))
mc; exact; abs(mc - exact)
2*(b-a)*sd(h)/sqrt(m)

> mc; exact; abs(mc - exact)


[1] 0.2860235
[1] 0.2857874
[1] 0.0002360797
> 2*(b-a)*sd(h)/sqrt(m)
[1] 0.0003874444
...
> mc; exact; abs(mc - exact)
[1] 0.2860734
[1] 0.2857874
[1] 0.0002860270

3.6 Use Monte Carlo integration with m = 100 000 to find the area of the
first quadrant of the unit circle, which has area π/4. Thus obtain a simulated
value of π = 3.141593. How many places of accuracy do you get?

set.seed(1234)
m = 100000; a = 0; b = 1; w = (b - a)/m
u = a + (b - a) * runif(m); h = sqrt(1 - u^2)
quad = sum(w * h); quad; 4*quad

> quad = sum(w * h); quad; 4*quad


[1] 0.7849848
[1] 3.139939 # 2-decimal accuracy with the seed shown

3.7 Here we consider two very similar random variables. In each part below
we wish to evaluate P {X ≤ 1/2} and E(X). Notice that part (a) can be done
by straightforward analytic integration but part (b) cannot.
a) Let X be a random variable distributed as BETA(3, 2) with density func-
tion f (x) = 12x2 (1−x), for 0 < x < 1, and 0 elsewhere. Use the numerical
integration method of Example 3.1 to evaluate the specified quantities.
Compare the results with exact values obtained using calculus.
d Exact values: P {X ≤ 1/2} = 12 ∫_0^{1/2} (x² − x³) dx = 12[(1/3)(1/2)³ − (1/4)(1/2)⁴] = 0.3125
and E(X) = 12 ∫_0^1 x(x² − x³) dx = 12(1/4 − 1/5) = 3/5. c

m = 5000; a = 0; b = .5; w = (b - a)/m


g = seq(a + w/2, b - w/2, length=m)
h = 12 * g^2 * (1 - g)
prob = sum(w * h)

m = 5000; a = 0; b = 1; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
h = g * 12 * g^2 * (1 - g)
mean = sum(w * h)

prob; mean
> prob; mean
[1] 0.3125
[1] 0.6
b) Let X be a random variable distributed as BETA(2.9, 2.1) with density
function f(x) = [Γ(5)/(Γ(2.9)Γ(2.1))] x^1.9 (1 − x)^1.1, for 0 < x < 1, and 0 elsewhere.
Use the method of Example 3.1.
m = 5000; a = 0; b = .5; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
h = dbeta(g, 2.9, 2.1) # BETA(2.9, 2.1) density function
prob = sum(w * h)

m = 5000; a = 0; b = 1; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
h = g * dbeta(g, 2.9, 2.1)
mean = sum(w * h)

prob; mean
> prob; mean
[1] 0.3481386
[1] 0.58 # 29/50 = 2.9/(2.9 + 2.1) = 0.58

c) Use the Monte Carlo integration method of Example 3.2 for both of the
previous parts. Compare results.
d We show Monte Carlo integration only for the problem in part (b). c
set.seed(1235)
m = 500000; a = 0; b = 1/2; w = (b - a)/m
u = a + (b - a) * runif(m); h = dbeta(u, 2.9, 2.1)
prob = sum(w * h)

a = 0; b = 1; w = (b - a)/m
u = a + (b - a) * runif(m); h = u * dbeta(u, 2.9, 2.1)
mean = sum(w * h)

prob; mean
> prob; mean
[1] 0.3483516 # approx. P{X < 1/2}
[1] 0.5791294 # approx. E(X)

Hints and answers: (a) From integral calculus, P {X ≤ 1/2} = 5/16 and E(X) = 3/5.
(Show your work.) For the numerical integration, modify the lines of the pro-
gram of Example 1.2 that compute the density function. Also, let a = 0 and let
b = 1/2 for the probability. For the expectation, let b = 1 and use h = 12*g^3*(1-g) or
h = g*dbeta(g, 3, 2). Why? (b) The constant factor of f (x) can be evaluated in R
as gamma(5)/(gamma(2.9)*gamma(2.1)), which returns 12.55032. Accurate answers
are 0.3481386 (from pbeta(.5, 2.9, 2.1)) and 29/50.

3.8 The yield of a batch of protein produced by a biotech company is


X ∼ NORM(100, 10). The dollar value of such a batch is V = 20 − |X − 100|
as long as the yield X is between 80 and 120, but the batch is worthless
otherwise. (Issues of quality and purity arise if the yield of a batch is much
different from 100.) Find the expected monetary value E(V ) of such a batch.
m = 5000; a = 80; b = 120; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
h = (20 - abs(g - 100))*dnorm(g, 100, 10)
sum(w * h)

> sum(w * h)
[1] 12.19097

Hint: In the program of Example 3.1, let a = 80, b = 120. Also, for the line of code
defining h, substitute h = (20 - abs(g - 100))*dnorm(g, 100, 10). Provide the
answer (between 12.0 and 12.5) correct to two decimal places.

3.9 Suppose you do not know the value of √2. You can use simulation to
approximate it as follows. Let X = U², where U ∼ UNIF(0, 1). Then show
that 2P {0 < X ≤ 1/2} = √2, and use the sampling method with large m to
approximate √2.
d The cumulative distribution function of U ∼ UNIF(0, 1) is FU(u) = P {U ≤ u} = u,
for 0 < u ≤ 1. Then, if X = U², we have P {0 < X ≤ 1/2} = P {U² ≤ 1/2} =
FU(1/√2) = 1/√2 = √2/2. c

set.seed(1236)
m = 500000; u = runif(m)
2 * mean(u^2 < 1/2); sqrt(2)

> 2 * mean(u^2 < 1/2); sqrt(2)


[1] 1.41168 # approx. by sampling method with seed shown
[1] 1.414214 # exact

Note: Of the methods in Section 3.1, only sampling is useful. You could find the
density function of X, but it involves √2, which you are pretending not to know.

3.10 A computer processes a particular kind of instruction in two steps.


The time U (in µs) for the first step is uniformly distributed on (0, 10). Inde-
pendently, the additional time V for the second step is normally distributed
with mean 5 µs and standard deviation 1 µs. Represent the total processing
time as T = U + V and evaluate P {T > 15}. Explain each step in the sug-
gested R code below. Interpret the results. Why do you suppose we choose
the method of Example 3.4 here—in preference to those of Examples 3.1–3.3?
(The histogram is shown in Figure 3.3, p57.)
d By the independence of U and V, we have SD(T) = √(100/12 + 1) = 3.055. Also,
E(T) = E(U) + E(V) = 10. A "normal approximation" for the probability would
give P {T > 15} ≈ P {Z > (15 − 10)/3.055 = 1.64} = 0.05, while the actual result
is shown below to be very nearly 0.04. The method of Example 3.4 is the only one
available to us, because we don't know the PDF of T. c
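
Before the simulation program below, the rough normal approximation just mentioned can be
checked with two added lines (the name sd.t is ours):

sd.t = sqrt(100/12 + 1); sd.t     # SD(T) = 3.055
1 - pnorm(15, 10, sd.t)           # normal approximation, about 0.05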

set.seed(1237)
m = 500000 # sample size
u = 10*runif(m) # sample of size m from UNIF(0, 10)
v = 5 + rnorm(m) # sample of size m from NORM(5, 1)
t = u + v # sample of size m from T
hist(t) # similar to Figure 3.3 on p57 of the book
mean(t > 15) # prop. of t > 15; est. of P{T > 15}
mean(t) # sample mean estimates E(T)
sd(t) # sample SD estimates SD(T)

> mean(t > 15) # prop. t > 15; est. of P{T > 15}
[1] 0.039844
> mean(t) # sample mean estimates E(T)
[1] 9.999499
> sd(t) # sample SD estimates SD(T)
[1] 3.056966

Comments: The mean of the 500 000 observations of T is the balance point of the
histogram. How accurately does this mean simulate E(T ) = E(U )+E(V ) = 10? Also,
compare simulated and exact SD(T ). The histogram facilitates a rough guess of the
value P {T > 15}. Of the m = 500 000 sampled values, it seems that approximately
20 000 (or 4%) exceed 15. Compare this guess with the answer from your program.

3.11 The acceptance-rejection method for sampling from a distribution. Ex-


ample 3.3 illustrates how the acceptance-rejection (AR) method can be used
to approximate the probability of an interval. A generalization of this idea is
sometimes useful in sampling from a distribution, especially when the quan-
tile transformation method is infeasible. Suppose the random variable X has
the density function fX with support S (that is, fX (x) > 0 exactly when
x ∈ S). Also, suppose we can find an “envelope” Bb(x) ≥ fX (x), for all x
in S, where B is a known constant and b(x) has a finite integral over S.
Then, to sample a value at random from X, we sample a “candidate”
value y at random from a density g(x) that is proportional to b(x), accepting
the candidate value as a random value of X with probability fX (y)/Bb(y).
(Rejected candidate values are ignored.) Generally speaking, this method
works best when the envelope function is a reasonably good fit to the tar-
get density so that the acceptance rate is relatively high.
a) As a trivial example, suppose we want to sample from X ∼ BETA(3, 1)
without using the R function rbeta. Its density function is fX (x) = 3x2 ,
on S = (0, 1). Here we can choose Bb(x) = 3x ≥ fX (x), for x in (0, 1),
and b(x) is proportional to the density 2x of BETA(2, 1). Explain how

the following R code implements the AR method to simulate X. (See


Problem 2.10 and Figure 2.10, p45.)
The top panel of Figure 3.10 illustrates this method. (We say this is a
trivial example because we could easily use the quantile transformation
method to sample from BETA(3, 1).)
d In all that follows, 0 < x < 1. The random variable X ∼ BETA(3, 1) has density
fX(x) = 3x², E(X) = 3/(3 + 1) = 3/4, V(X) = 3(1)/[(3 + 1 + 1)(3 + 1)²] = 3/80,
and SD(X) = 0.1936.
The random variable Y ∼ BETA(2, 1) has density function fY(x) = 2x and CDF x².
Observations from BETA(2, 1) can therefore be generated by taking the square root of
a standard uniform random variable (the quantile transformation), as in the code below.
The density of X lies beneath the envelope Bb(x) = 3x, where B = 3/2
and b(x) = 2x = fY(x).
The acceptance probability of a candidate value y generated from BETA(2, 1) is
fX(y)/Bb(y) = 3y²/3y = y. Because the acceptance rate is about 2/3, the effective
sample size is only about 13 333. Of course, a larger value of m would give better
approximations. c

set.seed(1238)
m = 20000 # number of candidate values
u = runif(m) # sample from UNIF(0, 1)
y = sqrt(u) # candidate y from BETA(2, 1)
acc = rbinom(m, 1, y) # accepted with probability y
x = y[acc==T] # accepted values x from BETA(3, 1)

# Figure similar to top panel of Figure 3.10 of the book


hist(x, prob=T, ylim=c(0,3), col="wheat")
lines(c(0,1), c(0, 3), lty="dashed", lwd=2, col="darkgreen")
xx = seq(0, 1, len=1000)
lines(xx, dbeta(xx, 3, 1), lwd=2, col="blue")

# Numerical values
mean(x) # sample mean estimates E(X) = .75
sd(x) # sample sd estimates SD(X) = .1936
mean(x < 1/2) # estimates P{X < 1/2} = .125
mean(acc) # acceptance rate

> mean(x) # sample mean estimates E(X) = .75


[1] 0.7487564
> sd(x) # sample sd estimates SD(X) = .1936
[1] 0.1948968
> mean(x < 1/2) # estimates P{X < 1/2} = .125
[1] 0.1284009
> mean(acc) # acceptance rate
[1] 0.67445

b) As a more serious example, consider sampling from X ∼ BETA(1.5, 1.5),


for which the quantile function is not so easily found. Here the density
is fX(x) = (8/π) x^0.5 (1 − x)^0.5 on (0, 1). The mode occurs at x = 1/2

with fX(1/2) = 4/π, so we can use Bb(x) = 4/π. Modify the program
of part (a) to implement the AR method for simulating values of X,
beginning with the following two lines. Annotate and explain your code.
Make a figure similar to the bottom panel of Figure 3.10. For verification,
note that E(X) = 1/2, SD(X) = 1/4, and FX (1/2) = 1/2. What is the
acceptance rate?
m = 40000; y = runif(m)
acc = rbinom(m, 1, dbeta(y, 1.5, 1.5)/(4/pi)); x = y[acc==T]

set.seed(1239)
m = 40000 # nr of candidate values
y = runif(m) # candidate y
acc = rbinom(m, 1, dbeta(y, 1.5, 1.5)/(4/pi)) # acceptance rule
x = y[acc==T] # accepted values

# Figure similar to bottom panel of Figure 3.10 of the book


hist(x, prob=T, ylim=c(0,3), col="wheat")
lines(c(0,1), c(4/pi, 4/pi), lty="dashed", lwd=2, col="darkgreen")
xx = seq(0, 1, len=1000)
lines(xx, dbeta(xx, 1.5, 1.5), lwd=2, col="blue")

# Numerical values
mean(x) # sample mean estimates E(X) = .5
sd(x) # sample sd estimates SD(X) = .25
mean(x < 1/2) # estimates P{X < 1/2} = 1/2
mean(acc) # acceptance rate

> mean(x) # sample mean estimates E(X) = .5


[1] 0.5003874
> sd(x) # sample sd estimates SD(X) = .25
[1] 0.2510418
> mean(x < 1/2) # estimates P{X < 1/2} = 1/2
[1] 0.4991431
> mean(acc) # acceptance rate
[1] 0.78775

c) Repeat part (b) for X ∼ BETA(1.4, 1.6). As necessary, use the R function
gamma to evaluate the necessary values of the Γ -function. The function
rbeta implements very efficient algorithms for sampling from beta dis-
tributions. Compare your results from the AR method in this part with
results from rbeta.
d The general form of a beta density function is
f(x) = [Γ(α+β)/(Γ(α)Γ(β))] x^(α−1) (1 − x)^(β−1), for 0 < x < 1.
If α, β > 1, then it is a simple exercise in differential calculus to show
that f(x) has a unique mode at x = (α − 1)/(α + β − 2). So the maximum of the
density function for BETA(α = 1.4, β = 1.6) occurs at x = 0.4, and the maximum
value is f(0.4) = [Γ(3)/(Γ(1.4)Γ(1.6))] (0.4^0.4)(0.6^0.6) = 1.287034.
The first block of code below

shows several ways in which the mode and maximum can be verified numerically
in R. A program generating random samples from BETA(1.4, 1.6) follows. c

x = seq(0, 1, by = .0001); alpha = 1.4; beta = 1.6


const = gamma(alpha+beta)/(gamma(alpha)*gamma(beta))
pdf = const * x^(alpha-1) * (1-x)^(beta-1)
max1 = max(pdf); mode1 = x[pdf==max1]
mode2 = (alpha - 1)/(alpha + beta - 2)
max2 = dbeta(mode2, alpha, beta)
max1; max2
mode1; mode2

> max1; max2


[1] 1.287034
[1] 1.287034
> mode1; mode2
[1] 0.4
[1] 0.4

set.seed(1240)
m = 100000 # nr of candidate values
y = runif(m) # candidate y
acc = rbinom(m, 1, dbeta(y, 1.4, 1.6)/max2) # acceptance rule
x = y[acc==T] # accepted values

# Figure similar those in Figure 3.10, but for BETA(1.4, 1.6)


hist(x, prob=T, ylim=c(0,1.1*max1), col="wheat")
lines(c(0,1), c(max1, max1), lty="dashed", lwd=2, col="darkgreen")
xx = seq(0, 1, len=1000)
lines(xx, dbeta(xx, 1.4, 1.6), lwd=2, col="blue")

# Numerical values
mean(x) # sample mean estimates E(X)
sd(x) # sample sd estimates SD(X)
mean(x < 1/2) # estimates P{X < 1/2}
mean(acc) # acceptance rate
x1 = rbeta(m, alpha, beta); mean(x1); sd(x1); mean(x1 < 1/2)
alpha/(alpha+beta)
sqrt(alpha*beta/((alpha+beta+1)*(alpha+beta)^2))
pbeta(1/2, alpha, beta)

> mean(x) # sample mean estimates E(X)


[1] 0.4663471
> sd(x) # sample sd estimates SD(X)
[1] 0.2492403
> mean(x < 1/2) # estimates P{X < 1/2}
[1] 0.5545596
> mean(acc) # acceptance rate
[1] 0.77979

> x1 = rbeta(m, alpha, beta); mean(x1); sd(x1); mean(x1 < 1/2)


[1] 0.4660053 # ’rbeta’ values: estimate E(X),
[1] 0.2498521 # SD(X), and
[1] 0.55329 # P{X < 1/2}
> alpha/(alpha+beta)
[1] 0.4666667 # exact E(X)

> sqrt(alpha*beta/((alpha+beta+1)*(alpha+beta)^2))
[1] 0.2494438 # exact SD(X)
> pbeta(1/2, alpha, beta)
[1] 0.5528392 # exact P{X < 1/2}

Answers: (a) Compare with exact values E(X) = 3/4, SD(X) = (3/80)^(1/2) = 0.1936,
and FX(1/2) = P {X ≤ 1/2} = 1/8.

3.12 In Example 3.5, interpret the output for the run shown in the ex-
ample as follows. First, verify using hand computations the values given for
Y1 , Y2 , . . . , Y5 . Then, say exactly how many Heads were obtained in the first
9996 simulated tosses and how many Heads were obtained in all 10 000 tosses.
d In the first 9996 tosses there were 0.49990(9996) = 4997 Heads, and in all 10 000
tosses there were 4999 Heads. Noticing that, among the last four tosses, only those
numbered 9999 and 10 000 resulted in Heads (indicated by 1s), we can also express
the number of Heads in the first 9996 tosses as 4999 − 2 = 4997. c

3.13 Run the program of Example 3.5 several times (omitting set.seed).
Did you get any values of Y10 000 outside the 95% interval (0.49, 0.51) claimed
there? Looking at the traces from your various runs, would you say that the
runs are more alike for the first 1000 values of n or the last 1000 values?
d Of course, there is no way for us to know what you observed. However, we can
make a probability statement. When n = 10 000 and π = 1/2, the distribution of
X ∼ BINOM(n, π) places about 95% of its probability in the interval (0.49, 0.51).
Specifically, diff(pbinom(c(4901, 5099), 10000, 1/2)) returns 0.9522908.
There is much more variability in the appearance of traces near the beginning
than near the end, where most traces have become very close to 1/2. In particular,
Sn ∼ BINOM(n, 1/2), so SD(Yn) = SD(Sn/n) = √(1/(4n)), which decreases with
increasing n. c

3.14 By making minor changes in the program of Example 3.2 (as below), it
is possible to illustrate the convergence of the approximation to J = 0.341345
as the number n of randomly chosen points increases to m = 5000. Explain
what each statement in the code does. Make several runs of the program. How
variable are the results for very small values of n, and how variable are they
for values of n near m = 5000? (Figure 3.11 shows superimposed traces for 20
runs.)

m = 5000 # final number of randomly chosen points


n = 1:m # vector of numbers from 1 through 5000
u = runif(m) # positions of random points in (0, 1)
h = dnorm(u) # heights of std normal density at these points
j = cumsum(h)/n # simulated area based on first n of m points

plot(n, j, type="l", ylim=c(0.32, 0.36)) # plots trace of areas


abline(h=0.3413, col="blue") # horizontal reference line
j[m] # estimate of J based on all m points--
# the height of the trace as it ends

d Generally speaking, the traces become less variable near m = 5000 as they converge
towards J. More specifically, let the random variable H be the height of the normal
curve above a randomly chosen point in (0, 1).
After the program above is run, we can estimate SD(H) by sd(h), which returned
approximately 0.0486 after one run of the program. Then the standard deviation of
the estimated area Jn, based on the first n points, is about 0.0486/√n, which decreases
as n increases. c
Note: The plotting parameter ylim establishes a relatively small vertical range for
the plotting window on each run, making it easier to assess variability within and
among runs.

3.15 Consider random variables X1 ∼ BETA(1, 1), X2 ∼ BETA(2, 1), and
X3 ∼ BETA(3, 1). Then, for appropriate constants Ki, i = 1, 2, 3, the integral
∫_0^1 x² dx = 1/3 can be considered as each of the following: K1 E(X1²), K2 E(X2),
and K3 P {0 < X3 ≤ 1/2}. Evaluate K1, K2, and K3.
d We evaluate three integrals in turn, where fi(x), for i = 1, 2, 3, are the density
functions of BETA(i, 1):

E(X1²) = ∫_0^1 x² f1(x) dx = ∫_0^1 x² dx = 1/3,

E(X2) = ∫_0^1 x f2(x) dx = ∫_0^1 x(2x) dx = 2/3,

and

P {0 < X3 ≤ 1/2} = ∫_0^{1/2} f3(x) dx = ∫_0^{1/2} 3x² dx
                 = ∫_0^1 (3/2)(y/2)² dy = (3/8) ∫_0^1 y² dy = (3/8)(1/3) = 1/8,

where the substitution x = y/2 is used at the third equality.

Thus, K1 = 1, K2 = 1/2, and K3 = 8/3. Here are some alternative evaluations. If
X1 ∼ BETA(1, 1) = UNIF(0, 1), then E(X1²) = V(X1) + [E(X1)]² = 1/12 + 1/4 = 1/3.
If X2 ∼ BETA(2, 1), then E(X2) = 2/(2 + 1) = 2/3. If X3 ∼ BETA(3, 1), then we
can evaluate P {0 < X3 < 1/2} = FX3(1/2) in R by using pbeta(1/2, 3, 1), which
returns 0.125. c
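
A quick numerical check of the three constants (an added sketch, using R's integrate):

integrate(function(x) x^2, 0, 1)$value               # the integral, 1/3
1 * (1/3); (1/2) * (2/3); (8/3) * pbeta(1/2, 3, 1)   # K1*E(X1^2), K2*E(X2), K3*P{0 < X3 <= 1/2}: each is 1/3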

3.16 In Example 3.5, let ε = 1/10 and define Pn = P {|Yn − 1/2| < ε} =
P {1/2 − ε < Yn < 1/2 + ε}. In R, the function pbinom is the cumulative
distribution function of a binomial random variable.
a) In the R Console window, execute
n = 1:100; pbinom(ceiling(n*0.6)-1, n, 0.5) - pbinom(n*0.4, n, 0.5)
Explain how this provides values of Pn , for n = 1, 2, . . . 100. (Notice
that the argument n in the function pbinom is a vector, so 100 results
are generated by the second statement.) Also, report the five values
P20 , P40 , P60 , P80 , and P100 , correct to six decimal places, and compare
results with Figure 3.12.
d With ε = 1/10, we have Pn = P {0.4 < Yn < 0.6} = P {0.4n < Xn < 0.6n},
where Xn ∼ BINOM(n, 1/2). The ceiling function rounds up to the next integer,
and subtracting 1 ensures that the upper end of the interval does not include an
unwanted value.
For example, when n = 10, only Xn = 5 satisfies the strict inequalities. Look
at the output below to see that P10 is precisely the value 0.246094 returned by
round(dbinom(5, 10, 1/2), 6). With less elegant mathematics and simpler R
code, we could have illustrated the LLN using Pn' = P {1/2 − ε < Yn ≤ 1/2 + ε}.
In the output below, we use the vector show to restrict the output to only a few
lines relevant here and in part (b). c

show = c(1:6, 10, 20, 40, 60, 80, 100)


n = 1:100
lwr = n*0.4; upr = ceiling(n*0.6)-1
p = round(pbinom(upr, n, .5) - pbinom(lwr, n, .5), 6)
cbind(n, p)[show, ]

> cbind(n, p)[show, ]


n p
[1,] 1 0.000000 # first six results
[2,] 2 0.500000 # provide answers
[3,] 3 0.000000 # for part (b)
[4,] 4 0.375000 #
[5,] 5 0.000000 #
[6,] 6 0.312500 #
[7,] 10 0.246094
[8,] 20 0.496555
[9,] 40 0.731813
[10,] 60 0.844998
[11,] 80 0.907088
[12,] 100 0.943112

b) By hand, using a calculator, verify the R results for P1 , . . . , P6 .


c) With ε = 1/50, evaluate the fifty values P100, P200, . . ., P5000. (Use the
expression n = seq(100, 5000, by=100), and modify the parameters of
pbinom appropriately.)

n = seq(200, 5000, by=400) # show only a few values requested


eps = 1/50; lwr = n*(1/2-eps); upr = ceiling(n*(1/2+eps))-1
p = round(pbinom(upr, n, .5) - pbinom(lwr, n, .5), 6)
cbind(n, p)

> cbind(n, p)
n p
[1,] 200 0.379271
[2,] 600 0.652246
[3,] 1000 0.782552
[4,] 1400 0.858449
[5,] 1800 0.905796
[6,] 2200 0.936406
[7,] 2600 0.956637
[8,] 3000 0.970209
[9,] 3400 0.979413
[10,] 3800 0.985707
[11,] 4200 0.990038
[12,] 4600 0.993035
[13,] 5000 0.995116

3.17 Modify the program of Example 3.5 so that there are only n = 100
tosses of the coin. This allows you to see more detail in the plot. Compare the
behavior of a fair coin with that of a coin heavily biased in favor of Heads,
P (Heads) = 0.9, using the code h = rbinom(m, 1, 0.9). Make several runs
for each type of coin. Some specific points for discussion are: (i) Why are there
long upslopes and short downslopes in the paths for the biased coin but not for
the fair coin? (ii) Which simulations seem to converge faster—fair or biased?
(iii) Do the autocorrelation plots acf(h) differ between fair and biased coins?
d (i) The biased coin has long runs of Heads (average length 10), which correspond to
upslopes, interspersed with occasional Tails. The fair coin alternates Heads and Tails
(average length of each kind of run is 2). (ii) Biased coins have smaller variance and
so their traces converge faster. For a fair coin V(Yn ) = 1/4n, but a biased coin with
π = 0.9 has V(Yn ) = π(1−π)/n = 9/100n. (iii) The Hi are independent for both fair
and biased coins, so neither autocorrelation function should show significant corre-
lation at any lag (except, of course, for “lag” 0, which always has correlation 1). c

set.seed(1212)
m = 100; n = 1:m
h = rbinom(m, 1, 1/2); y = cumsum(h)/n # fair coin
plot (n, y, type="l", ylim=c(0,1)) # trace (not shown)
cbind(n,h,y)[1:18, ]
acf(h, plot=F) # 2nd parameter produces printed output

> cbind(n,h,y)[1:18, ]
n h y
[1,] 1 0 0.0000000
[2,] 2 0 0.0000000
[3,] 3 1 0.3333333
[4,] 4 0 0.2500000
[5,] 5 1 0.4000000
[6,] 6 0 0.3333333
[7,] 7 0 0.2857143
[8,] 8 0 0.2500000
[9,] 9 1 0.3333333
[10,] 10 0 0.3000000
[11,] 11 1 0.3636364
[12,] 12 1 0.4166667
[13,] 13 0 0.3846154
[14,] 14 1 0.4285714
[15,] 15 1 0.4666667
[16,] 16 0 0.4375000
[17,] 17 0 0.4117647
[18,] 18 0 0.3888889

> acf(h, plot=F) # 2nd parameter produces printed output

Autocorrelations of series ’h’, by lag

0 1 2 3 4 5 6 7 8
1.000 -0.225 0.010 -0.101 -0.010 -0.023 0.050 0.002 -0.006
9 10 11 12 13 14 15 16 17
0.022 0.136 -0.017 0.075 -0.136 -0.004 -0.017 0.178 -0.096
18 19 20
0.099 -0.013 -0.044

set.seed(1213)
m = 100; n = 1:m
h = rbinom(m, 1, .9); y = cumsum(h)/n # biased coin
plot (n, y, type="l", ylim=c(0,1)) # trace (not shown)
cbind(n,h,y)[1:18, ]
acf(h, plot=F) # 2nd parameter produces printed output

> cbind(n,h,y)[1:18, ]
n h y
[1,] 1 1 1.0000000
[2,] 2 1 1.0000000
[3,] 3 1 1.0000000
[4,] 4 1 1.0000000
[5,] 5 1 1.0000000
[6,] 6 1 1.0000000
[7,] 7 1 1.0000000

[8,] 8 0 0.8750000
[9,] 9 1 0.8888889
[10,] 10 1 0.9000000
[11,] 11 1 0.9090909
[12,] 12 1 0.9166667
[13,] 13 1 0.9230769
[14,] 14 0 0.8571429
[15,] 15 1 0.8666667
[16,] 16 1 0.8750000
[17,] 17 1 0.8823529
[18,] 18 1 0.8888889

> acf(h, plot=F) # 2nd parameter produces printed output

Autocorrelations of series ’h’, by lag

0 1 2 3 4 5 6 7 8
1.000 -0.023 -0.024 -0.025 -0.015 -0.016 0.085 -0.019 -0.009
9 10 11 12 13 14 15 16 17
-0.010 -0.113 0.090 0.088 -0.117 -0.107 -0.006 0.004 -0.100
18 19 20
0.001 0.000 -0.001

3.18 A version of the program in Example 3.5 with an explicit loop would
substitute one of the two blocks of code below for the lines of the original
program that make the vectors h and y.
# First block: One operation inside loop
h = numeric(m)
for (i in 1:m) { h[i] = rbinom(1, 1, 1/2) }
y = cumsum(h)/n

# Second block: More operations inside loop


y = numeric(m); h = numeric(m)
for (i in 1:m) {
if (i==1)
{b = rbinom(1, 1, 1/2); h[i] = y[i] = b}
else
{b = rbinom(1, 1, 1/2); h[i] = b;
y[i] = ((i - 1)*y[i - 1] + b)/i} }

Modify the program with one of these blocks, use m = 500 000 iterations,
and compare the running time with that of the original “vectorized” program.
To get the running time of a program accurate to about a second, use as the
first line t1 = Sys.time() and as the last line t2 = Sys.time(); t2 - t1.
t1 = Sys.time()
m = 500000; n = 1:m
h = rbinom(m, 1, 1/2) # Original vectorized version

y = cumsum(h)/n # from Example 3.5


t2 = Sys.time()
elapsed.0 = t2 - t1

t1 = Sys.time()
m = 500000; n = 1:m; h = numeric(m)
for (i in 1:m) { # Version with one
h[i] = rbinom(1, 1, 1/2) } # operation inside loop
y = cumsum(h)/n
t2 = Sys.time()
elapsed.1 = t2 - t1

t1 = Sys.time()
m = 500000; n = 1:m; y = numeric(m); h = numeric(m)
for (i in 1:m) { # Version
if (i==1) # with
{b = rbinom(1, 1, 1/2); h[i] = y[i] = b} # several
else # operations
{b = rbinom(1, 1, 1/2); h[i] = b; # inside
y[i] = ((i - 1)*y[i - 1] + b)/i} } # loop
t2 = Sys.time()
elapsed.2 = t2 - t1

elapsed.0; elapsed.1; elapsed.2

> elapsed.0; elapsed.1; elapsed.2


Time difference of 1 secs
Time difference of 6 secs
Time difference of 12 secs

Note: On computers available as this is being written, the explicit loops in the
substitute blocks take noticeably longer to execute than the original vectorized code.
3.19 The program in Example 3.6 begins with a Sunny day. Eventually,
there will be a Rainy day, and then later another Sunny day. Each return to
Sun (0) after Rain (1), corresponding to a day n with Wn−1 = 1 and Wn = 0,
signals the end of one Sun–Rain “weather cycle” and the beginning of another.
(In the early part of the plot of Yn , you can probably see some “valleys” or
“dips” caused by such cycles.)
If we align the vectors (W1 , . . . , W9999 ) and (W2 , . . . , W10 000 ), looking
to see where 1 in the former matches 0 in the latter, we can count the
complete weather cycles in our simulation. The R code to make this count
is length(w[w[1:(m-1)]==1 & w[2:m]==0]). Type this line in the Console
window after a simulation run—or append it to the program. How many cycles
do you count with set.seed(1237)?
d In the program below, we see 201 complete Sun–Rain cycles in 10 000 simulated
days, for an average cycle length of about 10 000/201 = 49.75 days. However, the
code cy.end = n[w[1:(m-1)]==1 & w[2:m]==0] makes a list of days on which such
cycles end, and the last complete cycle in this run ended on day 9978 (obtained as
max(cy.end)). Thus, a fussier estimate of cycle length would be 9978/201 = 49.64.
Although we have simulated 10 000 days, we have seen only 201 cycles, so we can't
expect this estimate of cycle length to be really close. Based on the means of geometric
random variables, the exact theoretical length is 1/0.03 + 1/0.06 = 50 days. c

# set.seed(1237) # results shown are for seed 1237


m = 10000; n = 1:m; alpha = 0.03; beta = 0.06
w = numeric(m); w[1] = 0
for (i in 2:m) {
if (w[i-1]==0) w[i] = rbinom(1, 1, alpha)
else w[i] = rbinom(1, 1, 1 - beta) }
y = cumsum(w)/n
plot(n, y, type="l") # plot not shown
targ = alpha/(alpha + beta); abline(h = targ)
nr.cyc = length(w[w[1:(m-1)]==1 & w[2:m]==0])
y[m]; nr.cyc

> y[m]; nr.cyc


[1] 0.3099 # still somewhat below ’target’ 1/3
[1] 201 # nr of complete Sun-Rain cycles
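
For the "fussier" estimate described above, append these lines (they use the authors'
cy.end expression quoted in the discussion) after running the program with seed 1237:

cy.end = n[w[1:(m-1)]==1 & w[2:m]==0]   # days on which a Sun-Rain cycle ends
max(cy.end)                             # last completed cycle: day 9978 for this seed
max(cy.end)/nr.cyc                      # fussier average cycle length, about 49.64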

Hint: One can show that the theoretical cycle length is 50 days. Compare this with
the top panel of Figure 3.7 (p63).
3.20 Branching out from Example 3.6, we discuss two additional imaginary
islands. Call the island of the example Island E.
a) The weather on Island A changes more readily than on Island E. Specif-
ically, P {Wn+1 = 0|Wn = 0} = 3/4 and P {Wn+1 = 1|Wn = 1} = 1/2.
Modify the program of Example 3.6 accordingly, and make several runs.
Does Yn appear to converge to 1/3? Does Yn appear to stabilize to its
limit more quickly or less quickly than for Island E?
d The trace in the program below (plot not shown) stabilizes much more quickly
than the one for Island E. Intuitively, it seems likely that it will be sunny much of
the time: A sunny day will be followed by a rainy one only a quarter of the time,
but a rainy day has a 50-50 chance of being followed by a sunny one. One can show
that the proportion of rainy days over the long run is α/(α + β) = 1/3, where α
and β are the respective probabilities of weather change. c

# set.seed(1236) # omit the seed statement for your own runs


m = 10000; n = 1:m; alpha = 1/4; beta = 1/2
w = numeric(m); w[1] = 0
for (i in 2:m) {
if (w[i-1]==0) w[i] = rbinom(1, 1, alpha)
else w[i] = rbinom(1, 1, 1 - beta) }
y = cumsum(w)/n

plot(n, y, type="l") # plot not shown


targ = alpha/(alpha + beta); abline(h = targ)
y[m/2]; y[m]
acf(w, plot=F) # 2nd parameter produces printout

> y[m/2]; y[m] # stabilizes relatively quickly:


[1] 0.3384 # halfway along, already
[1] 0.3371 # stabilized near ’target’ 1/3
> acf(w, plot=F) # for part (c)

Autocorrelations of series ’w’, by lag # lags 1 & 2 not near 0

0 1 2 3 4 5 6 7 8
1.000 0.228 0.052 0.021 -0.011 -0.033 -0.026 -0.028 -0.014
9 10 11 12 13 14 15 16 17
0.001 0.018 -0.002 -0.011 -0.008 -0.001 0.002 0.002 0.019
18 19 20 21 22 23 24 25 26
0.004 0.005 -0.008 -0.007 -0.006 0.005 -0.009 -0.018 0.003
27 28 29 30 31 32 33 34 35
0.002 0.001 0.012 0.012 0.008 -0.001 0.000 0.009 -0.007
36 37 38 39 40
-0.002 0.007 0.017 0.001 0.008

b) On Island B, P {Wn+1 = 0|Wn = 0} = 2/3 and P {Wn+1 = 1|Wn = 1} =


1/3. Modify the program, make several runs, and discuss as in part (a),
but now comparing all three islands. In what fundamental way is Island B
different from Islands E and A?
d Although all three islands (E, A, and B) have Rain on about 1/3 of the days over
the long run, the weather patterns are quite different. On Island E, there are very
long runs of Rain and Sun that keep the trace of the Yn from settling down quickly.
On Islands A and B the weather changes more readily and convergence of the trace
to the target value is faster.
On Island B, the weather on one day is independent of the weather on the next:
Whatever today's weather may be, there is a 1/3 chance of rain tomorrow. This
is the same as tossing a fair die independently each day to "decide" the weather,
with Rain for 1 or 2 spots showing on the die, and Sun otherwise. c

set.seed(1235)
m = 10000; n = 1:m; alpha = 1/3; beta = 2/3
w = numeric(m); w[1] = 0
for (i in 2:m) {
if (w[i-1]==0) w[i] = rbinom(1, 1, alpha)
else w[i] = rbinom(1, 1, 1 - beta) }
y = cumsum(w)/n
plot(n, y, type="l") # plot not shown
targ = alpha/(alpha + beta); abline(h = targ)
y[m/2]; y[m]
acf(w, plot=F) # for part (c)

> y[m/2]; y[m] # stabilizes relatively quickly:


[1] 0.343 # halfway along, already
[1] 0.3399 # stabilized near ’target’ 1/3
> acf(w, plot=F)

Autocorrelations of series ’w’, by lag # all lags near 0

0 1 2 3 4 5 6 7 8
1.000 0.014 -0.005 -0.012 0.004 -0.007 0.017 -0.002 0.011
9 10 11 12 13 14 15 16 17
0.008 -0.014 0.010 -0.006 -0.004 0.005 -0.010 -0.005 -0.011
18 19 20 21 22 23 24 25 26
-0.014 -0.007 -0.010 -0.012 0.008 0.002 0.011 -0.005 -0.005
27 28 29 30 31 32 33 34 35
0.005 -0.003 0.003 -0.005 -0.005 0.002 0.005 -0.014 -0.008
36 37 38 39 40
0.009 0.007 0.002 -0.012 0.013
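
To see the die-tossing description of Island B in action, here is a small added sketch
(the name w.die is ours); Rain corresponds to 1 or 2 spots on the die:

w.die = as.numeric(sample(1:6, m, replace=TRUE) <= 2)  # independent daily draws
mean(w.die)                                            # long-run proportion of Rain, near 1/3
acf(w.die, plot=F)                                     # no significant correlation at any lag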

c) Make acf plots for Islands A and B, and compare them with the corre-
sponding plot in the bottom panel of Figure 3.7 (p63).
d The ACF plot for Island A shows the dependence of each day’s weather on the
previous couple of days. The ACF plot for Island B shows no significant correlation
for any lag, which is consistent with the independence noted in the answer to part (b). c
Note: We know of no real place where weather patterns are as extremely persistent
as on Island E. The two models in this problem are both more realistic.

3.21 Proof of the Weak Law of Large Numbers (Theoretical). Turn the
methods suggested below (or others) into carefully written proofs. Verify the
examples. (Below we assume continuous random variables. Similar arguments,
with sums for integrals, would work for the discrete case.)
a) Markov's Inequality. Let W be a random variable that takes only positive
values and has a finite expected value E(W) = ∫_0^∞ w fW(w) dw. Then, for
any a > 0, P {W ≥ a} ≤ E(W)/a.
Method of proof: Break the integral into two nonnegative parts, over the
intervals (0, a) and (a, ∞). Then E(W) cannot be less than the second
integral, which in turn cannot be less than aP {W ≥ a} = a ∫_a^∞ fW(w) dw.
Example: Let W ∼ UNIF(0, 1). Then, for 0 < a < 1, E(W)/a = 1/(2a) and
P {W ≥ a} = 1 − P {W < a} = 1 − a. Is 1 − a < 1/(2a)?

d Here are details of the proof:

E(W) = ∫_0^∞ w fW(w) dw = ∫_0^a w fW(w) dw + ∫_a^∞ w fW(w) dw
     ≥ ∫_a^∞ w fW(w) dw ≥ a ∫_a^∞ fW(w) dw = aP {W ≥ a},

where the first inequality holds because a nonnegative term is omitted and the
second holds because the value of w inside the integral must be at least as big as a.
The first equality requires that the support of W is contained in the positive half
line.
In the example, for W ∼ UNIF(0, 1) we have E(W ) = 1/2 and P {W ≥ a} = 1−a,
for 0 < a < 1. Markov's Inequality says that E(W)/a = 1/(2a) ≥ P {W ≥ a} = 1 − a.
This amounts to the claim that a(1 − a) ≤ 1/2. But, for 0 < a < 1, the function
g(a) = a(1 − a) is a parabola with maximum value 1/4 at a = 1/2. So, in fact,
a(1 − a) ≤ 1/4 < 1/2.
The inequalities in the proof of Markov’s result may seem to have been obtained
by such extreme strategies (throwing away one integral and severely truncating
another), that you may wonder if equality is ever achieved. Generally not, in practical
applications. But consider a degenerate “random variable” X with P {X = µ} = 1
and µ > 0. Then E(X) = µ, E(X)/µ = 1, and also P {X ≥ µ} = 1. c
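
A quick numerical check of the example above (an added sketch):

a = seq(0.05, 0.95, by=0.05)
all((1 - a) <= 1/(2*a))    # TRUE: P{W >= a} = 1 - a never exceeds E(W)/a = 1/(2a)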

b) Chebyshev's Inequality. Let X be a random variable with E(X) = µ and
V(X) = σ² < ∞. Then, for any k > 0, P {|X − µ| ≥ kσ} ≤ 1/k².
Method of proof: In Markov's Inequality, let W = (X − µ)² ≥ 0 so that
E(W) = V(X), and let a = k²σ².
Example: If Z is standard normal, then P {|Z| ≥ 2} < 1/4. Explain briefly
how this illustrates Chebyshev's Inequality.

d The method of proof is straightforward. If Z ∼ NORM(0, 1), then µ = 0 and σ² = 1.
In Chebyshev's Inequality, take k = 2. c
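
In R, the comparison is immediate (added check):

2 * pnorm(-2)    # exact P{|Z| >= 2} = 0.0455, well below Chebyshev's bound 1/k^2 = 1/4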

c) WLLN. Let Y1, Y2, . . . , Yn be independent, identically distributed random
variables each with mean µ and variance σ² < ∞. Further denote by Ȳn the
sample mean of the Yi. Then, for any ε > 0, lim_{n→∞} P {|Ȳn − µ| < ε} = 1.
Method of proof: In Chebyshev's Inequality, let X = Ȳn, which has
V(Ȳn) = σ²/n, and let k = ε√n/σ. Then use the complement rule and
let n → ∞.
d Using the complement rule, we obtain and use an alternate form of Chebyshev's
Inequality for Ȳn, as suggested:

1 ≥ P {|Ȳn − µ| < kσ/√n = ε} ≥ 1 − 1/k² = 1 − σ²/(nε²) → 1,

so that P {|Ȳn − µ| < ε} → 1, as n becomes infinite. c

Note: What we have referred to in this section as the Law of Large Numbers is
usually called the Weak Law of Large Numbers (WLLN), because a stronger result
can be proved with more advanced mathematical methods than we are using in this
book. The same assumptions imply that P {Ȳn → µ} = 1. This is called the Strong
Law of Large Numbers. The proof is more advanced because one must consider the
joint distribution of all Yn in order to evaluate the probability.

3.22 In Example 3.5, we have Sn ∼ BINOM(n, 1/2). Thus E(Sn ) = n/2 and
V(Sn ) = n/4. Find the mean and variance of Yn . According to the Central
Limit Theorem, Yn is very nearly normal for large n. Assuming Y10 000 to
be normal, find P {|Y10 000 − 1/2| ≥ 0.01}. Also find the margin of error in
estimating P {Heads} using Y10 000 .
d We have E(Yn) = E(Sn/n) = (1/n)E(Sn) = 1/2, V(Yn) = V(Sn/n) = (1/n²)V(Sn) = 1/(4n),
and SD(Yn) = 1/(2√n). Thus SD(Y10 000) = 1/200, and

P {|Y10 000 − 1/2| ≥ 0.01} = P {|Y10 000 − 1/2|/(1/200) ≥ 0.01/(1/200) = 2}
                           ≈ P {|Z| ≥ 2} = 0.0455,

where the numerical value is obtained in R with 1 - diff(pnorm(c(-2, 2))). Alternatively,
for n = 10 000, take Yn ∼ NORM(1/2, 1/(2√n)) and get the same result
with n = 10000; 1 - diff(pnorm(c(.49, .51), 1/2, 1/(2*sqrt(n)))).
We conclude that a 95% margin of error for estimating π = P {Heads} = 1/2
with 10 000 iterations is about ±0.01.
Chebyshev's bound is P {|Y10 000 − 1/2| ≤ 0.01} ≥ 1 − 1/[(40 000)(0.01²)] = 0.75,
and so it is not very useful in this practical situation. c
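
For convenience, the in-line computations above are collected here as a short runnable block:

1 - diff(pnorm(c(-2, 2)))                          # P{|Z| >= 2} = 0.0455
n = 10000
1 - diff(pnorm(c(.49, .51), 1/2, 1/(2*sqrt(n))))   # same result via NORM(1/2, 1/(2 sqrt(n)))
1 - 1/(4*n*0.01^2)                                 # Chebyshev's bound, 0.75
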
3.23 Mean of exponential lifetimes. In Example 3.7, we see that the mean
of 12 observations from a uniform population is nearly normal. In contrast,
the electronic components of Section 3.1 (p51) have exponentially distributed
lifetimes with mean 2 years (rate 1/2 per year). Because the exponential
distribution is strongly skewed, convergence in the Central Limit Theorem
is relatively slow. Suppose you want to know the probability that the average
lifetime T̄ of 12 randomly chosen components of this kind exceeds 3.

a) Show that E(T̄) = 2 and SD(T̄) = 1/√3. Use the normal distribution
with this mean and standard deviation to obtain an (inadequate) estimate
of P {T̄ > 3}. The Central Limit Theorem does not provide accurate
estimates for probabilities involving T̄ when n is as small as 12.
d Let Xi, for i = 1, . . . , 12, be independently distributed EXP(1/2). Then E(Xi) = 2,
V(Xi) = 4, and SD(Xi) = 2. Thus E(T̄) = E((1/12) Σ Xi) = (1/12) Σ E(Xi) = 24/12 = 2.
Also, V(T̄) = V((1/12) Σ Xi) = (1/12²) Σ V(Xi) = 48/144 = 1/3, and SD(T̄) = 1/√3.
Then P {T̄ > 3} = P {(T̄ − 2)/(1/√3) > (3 − 2)√3} ≈ P {Z > √3} = 0.0416,
where the final numerical value is from 1 - pnorm(sqrt(3)). More directly, we can
take T̄ ∼ NORM(2, 1/√3) and use 1 - pnorm(3, 2, 1/sqrt(3)), letting R do the
"standardization" and obtaining the same numerical result. c

b) Modify the program of Example 3.7, using x = rexp(m*n, rate=1/2),


to simulate P {T̄ > 3}. One can show that T̄ ∼ GAMMA(12, 6) precisely.
Compare the results of your simulation with your answer to part (a) and
with the exact result obtained using 1 - pgamma(3, 12, rate=6).
d The mean and variance of the distribution GAMMA(12, 6) are µ = 12/6 = 2 and
σ² = 12/6² = 1/3, respectively. These are the mean and variance we obtained above

for the random variable T̄ . The program below carries out the required simulation
and computation. The density function of GAMMA(12, 6) is illustrated in Figure 3.13
along with numerical results from the programs, as mentioned in part (c). c

set.seed(1215)
m = 500000; n = 12
x = rexp(m*n, rate=1/2); DTA = matrix(x, m)
x.bar = rowMeans(DTA)
mean(x.bar > 3)
1 - pgamma(3, 12, rate=6)

> mean(x.bar > 3)


[1] 0.055242 # approximated by simulation
> 1 - pgamma(3, 12, rate=6)
[1] 0.05488742 # exact value

c) Compare your results from parts (a) and (b) with Figure 3.13 and numer-
ical values given in its caption.

3.24 Here are some modifications of Example 3.8 with consequences that
may not be apparent at first. Consider J = ∫_0^1 x^d dx.
a) For d = 1/2 and m = 100 000, compare the Riemann approximation
with the Monte Carlo approximation. Modify the method in Example 3.8
appropriately, perhaps writing a program that incorporates d = 1/2 and
h = x^d, to facilitate easy changes in the parts that follow. Find V(Y ).
d In addition to the changes shown, use var(y) to estimate V(Y ). c

m = 100000; d = 1/2; a = 0; b = 1; w = (b - a)/m


x = seq(a + w/2, b-w/2, length=m); hx = x^d
set.seed(1214)
u = runif(m, a, b); hu = u^d; y = (b - a)*hu
mean(y) # Monte Carlo
2*sd(y)/sqrt(m) # MC margin of error
sum(w*hx) # Riemann

> mean(y) # Monte Carlo


[1] 0.6678417
> 2*sd(y)/sqrt(m) # MC margin of error
[1] 0.001487177
> sum(w*hx) # Riemann
[1] 0.6666667

b) What assumption of Section 3.4 fails for d = −1/2? What is the value
of J? Of V(Y )? Try running the two approximations. How do you explain
the unexpectedly good behavior of the Monte Carlo simulation?

d Using d = -1/2, we obtain the output shown below. The function f(x) = x^(−1/2)
is not bounded in (0, 1). However, the integral is finite, and the area under f(x)
and above (0, ε), for small ε, is very small (specifically, 2√ε). Thus the Riemann
approximation is very near to ∫_0^1 x^(−1/2) dx = 2. Monte Carlo results are good
because values of u in (0, ε) are extremely rare. To put it another way, all assumptions
are valid for the computation of ∫_ε^1 x^(−1/2) dx ≈ 2 by either method, and the value
of ∫_0^ε x^(−1/2) dx ≈ 0, for ε > 0 sufficiently small. c

> mean(y) # Monte Carlo


[1] 1.987278
> 2*sd(y)/sqrt(m) # MC margin of error
[1] 0.01855919
> sum(w*hx) # Riemann
[1] 1.998087

c) Repeat part (b), but with d = −1. Comment.
d The integral is infinite. For any positive ε, ∫_0^ε x^(−1) dx is infinite. So this is an unstable
computational problem. The first block of answers below results from changing to
d = -1. The second block retains that change and also changes to m = 1 000 000
and seed 1214. In the Monte Carlo result, the smallest random value at which x^(−1)
happens to be evaluated can have a large influence. c

> mean(y) # Monte Carlo


[1] 12.56027
> 2*sd(y)/sqrt(m) # MC margin of error
[1] 3.466156
> sum(w*hx) # Riemann
[1] 13.47644

> mean(y) # Monte Carlo


[1] 39.43215
> 2*sd(y)/sqrt(m) # MC margin of error
[1] 50.55656
> sum(w*hx) # Riemann
[1] 15.77902

3.25 This problem shows how the rapid oscillation of a function can affect
the accuracy of a Riemann approximation.
a) Let h(x) = | sin kπx| and k be a positive integer. Then use calculus to
show that ∫_0^1 h(x) dx = 2/π = 0.6366. Use the code below to plot h on
[0, 1] for k = 4.
k = 4
x = seq(0,1, by = 0.01); h = abs(sin(k*pi*x))
plot(x, h, type="l")

d As the plot (not shown here) illustrates, the function h(x) has k congruent sine-
shaped "humps" with maximums at 1 and minimums at 0. Thus the claim can be
established by finding the area of one of these k humps:

∫_0^{1/k} | sin kπx| dx = (1/k) ∫_0^1 sin πy dy = (1/k) [−(1/π) cos πy]_0^1 = 2/(kπ),

where we have made the substitution y = kx at the first equality. Adding the k
equal areas together, we have ∫_0^1 h(x) dx = k · 2/(kπ) = 2/π = 0.6366. c
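
The claim is easy to verify numerically for a modest k (an added sketch):

integrate(function(x) abs(sin(4*pi*x)), 0, 1)$value   # about 0.63662
2/pi                                                  # 0.6366198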

b) Modify the program of Example 3.8 to integrate h. Use k = 5000 through-


out, and make separate runs with m = 2500, 5000, 10 000, 15 000, and
20 000. Compare the accuracy of the resulting Riemann and Monte Carlo
approximations, and explain the behavior of the Riemann approximation.
d The function | sin 5000πx| oscillates regularly and rapidly. Riemann approximation
does not work well unless the number of rectangles within each cycle of the function
is large. Specifically, m = 2500 puts centers of rectangles precisely at “troughs”, and
m = 5000 puts them only at “peaks,” producing the sorry results shown below. By
contrast, Monte Carlo integration works reasonably well for all values of m.
The program below is structured to make a table of results, and also to make it
easy for you to explore different values of m. We have included one more value than
requested (last row of output). The last two columns of output refer to part (c).
Notice that the increment of i in the loop must be 1, so we cannot use m itself as
the loop variable. c

M = c(2500, 5000, 10000, 15000, 20000, 100000)


r = length(M); RA = MC = EstME = numeric(r)
for (i in 1:r) {
m = M[i]; a = 0; b = 1; k = 5000
w = (b - a)/m; x = seq(a + w/2, b-w/2, length=m)
hx = abs(sin(k*pi*x)); RA[i] = sum(w*hx)
set.seed(1220)
u = runif(m, a, b); hu = abs(sin(k*pi*u))
y = (b - a)*hu; MC[i] = mean(y)
EstME[i]=2*sd(y)/sqrt(m) }
round(cbind(M, RA, MC, EstME), 4)
round(2/pi, 4) # exact

> round(cbind(M, RA, MC, EstME), 4)


M RA MC EstME
[1,] 2500 0.0000 0.6356 0.0122
[2,] 5000 1.0000 0.6348 0.0086
[3,] 10000 0.7071 0.6368 0.0061
[4,] 15000 0.6667 0.6366 0.0050
[5,] 20000 0.6533 0.6379 0.0043
[6,] 100000 0.6373 0.6367 0.0019
> round(2/pi, 4) # exact
[1] 0.6366

c) Use calculus to show that V(Y) = V(h(U)) = 1/2 − 4/π² = 0.0947. How
accurately is this value approximated by simulation? If m = 10 000, find
the margin of error for the Monte Carlo approximation in part (b) based
on SD(Y ) and the Central Limit Theorem. Are your results consistent
with this margin of error? d No solution is provided for this part. c
3.26 The integral J = ∫_0^1 sin²(1/x) dx cannot be evaluated analytically,
but advanced analytic methods yield ∫_0^∞ sin²(1/x) dx = π/2.
a) Assuming this result, show that J = π/2 − ∫_0^1 x^(−2) sin²x dx. Use R to plot
both integrands on (0, 1), obtaining results as in Figure 3.14.
d We manipulate the integral obtained by advanced methods to obtain the alternate
form of J:

π/2 = ∫_0^∞ sin²(1/x) dx = ∫_0^1 sin²(1/x) dx + ∫_1^∞ sin²(1/x) dx
    = ∫_0^1 sin²(1/x) dx + ∫_0^1 y^(−2) sin²y dy,

where the last integral arises from the substitution y = 1/x, dx = −y^(−2) dy. c
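
A numerical check of this identity (added sketch; the guard at x = 0 avoids a 0/0):

f = function(x) ifelse(x == 0, 1, (sin(x)/x)^2)
pi/2 - integrate(f, 0, 1)$value    # J, approximately 0.67346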

b) Use both Riemann and Monte Carlo approximations to evaluate J as


originally defined. Then evaluate J using the equation in part (a). Try
both methods with m = 100, 1000, 1001, and 10 000 iterations. What do
you believe is the best answer? Comment on differences between methods
and between equations.
d The structure of the program below is similar to that of the program in Prob-
lem 3.25. We have also included m = 100 000 iterations here.
Riemann approximation. Riemann approximation works better than Monte Carlo
estimation. For a smooth curve, we expect and get very good results even when m
is relatively small (see column RA2 of the output). With less rounding than in the
table, the correct value seems to be 0.673457. The most rapidly oscillating part of
the first form of the function covers a small portion of the interval of integration.
Thus, for larger m Riemann integration gives satisfactory results for this form also
(shown in column RA1).
Monte Carlo integration. For large enough m, Monte Carlo integration gives satis-
factory results. Of course, for any m, the estimated margin of error is much smaller
for the smooth curve. Different seeds give slightly different results. c

M = c(100, 1000, 1001, 10000, 100000); r = length(M)


RA1 = RA2 = MC1 = EstME1 = MC2 = EstME2 = numeric(r)
for (i in 1:r) {
m = M[i]; a = 0; b = 1
w = (b - a)/m; x = seq(a + w/2, b-w/2, length=m)
hx = (sin(1/x))^2; RA1[i] = sum(w*hx)
hx = (sin(x)/x)^2; RA2[i] = pi/2 - sum(w*hx)

set.seed(1234)
u = runif(m, a, b); hu = (sin(1/u))^2
y = (b - a)*hu; MC1[i] = mean(y)
EstME1[i]=2*sd(y)/sqrt(m)
u = runif(m, a, b); hu = (sin(u)/u)^2
y = (b - a)*hu; MC2[i] = pi/2 - mean(y)
EstME2[i]=2*sd(y)/sqrt(m) }
round(cbind(M, RA1, RA2, MC1, EstME1, MC2, EstME2), 4)

> round(cbind(M, RA1, RA2, MC1, EstME1, MC2, EstME2), 4)


M RA1 RA2 MC1 EstME1 MC2 EstME2
[1,] 100 0.6623 0.6735 0.6194 0.0750 0.6825 0.0189
[2,] 1000 0.6757 0.6735 0.6797 0.0206 0.6683 0.0054
[3,] 1001 0.6760 0.6735 0.6798 0.0206 0.6681 0.0054
[4,] 10000 0.6736 0.6735 0.6727 0.0065 0.6745 0.0018
[5,] 100000 0.6735 0.6735 0.6731 0.0021 0.6734 0.0006

Note: Based on a problem in Liu (2001), Chapter 2.

3.27 Modify the program of Example 3.9 to approximate the volume be-
neath the bivariate standard normal density surface and above two additional
regions of integration as specified below. Use both the Riemann and Monte
Carlo methods in parts (a) and (b), with m = 10 000.
a) Evaluate P {0 < Z1 ≤ 1, 0 < Z2 ≤ 1}. Because Z1 and Z2 are indepen-
dent standard normal random variables, we know that this probability is
0.341345² = 0.116516. For each method, say whether it would have been
better to use m = 10 000 points to find P {0 < Z ≤ 1} and then square
the answer.
d The first program below uses both Riemann and Monte Carlo methods to find
P {0 < Z ≤ 1}². The second approximates P {0 < Z1 ≤ 1, 0 < Z2 ≤ 1} with both
methods. For each Monte Carlo integration, results from several runs are shown.
The exact value, from (pnorm(1) - .5)^2, is 0.1165162.
For Riemann approximation, squaring P {0 < Z ≤ 1} gives the exact answer to
seven places; the result from a 2-d grid is accurate only to five places. For the Monte
Carlo method, integration over the square seems slightly better; both methods can
give an incorrect fourth digit, but the fourth digit seems to vary a little more when
the probability of the interval is squared. Perhaps, even in this simple example for
one and two dimensions, improvement of Monte Carlo integration results in higher
dimensions is barely beginning to show. c

m = 10000; a = 0; b = 1; w = (b - a)/m
x = seq(a + w/2, b-w/2, length=m); hx = dnorm(x)
set.seed(1111)
u = runif(m, a, b); hu = dnorm(u); y = (b - a)*hu

sum(w*hx) # Riemann P(0 < Z < 1)


sum(w*hx)^2 # Riemann P(Unit square)
mean(y) # Monte Carlo P(0 < Z < 1)
mean(y)^2 # Monte Carlo P(Unit square)

> sum(w*hx) # Riemann P(0 < Z < 1)


[1] 0.3413447
> sum(w*hx)^2 # Riemann P(Unit square)
[1] 0.1165162 # exactly correct to all 7 places

> mean(y) # Monte Carlo P(0 < Z < 1)


[1] 0.3409257
> mean(y)^2 # Monte Carlo P(Unit square)
[1] 0.1162303 # other seeds: 0.1161, 0.1163, 0.1164
# 0.1166, 0.1168. Avg: 0.1164

m = 10000
g = round(sqrt(m)) # no. of grid pts on each axis
x1 = rep((1:g - 1/2)/g, times=g) # these two lines give
x2 = rep((1:g - 1/2)/g, each=g) # coordinates of grid points
hx = dnorm(x1)*dnorm(x2)
sum(hx)/g^2 # Riemann P{Unit square}

> sum(hx)/g^2 # Riemann approximation


[1] 0.1165169

set.seed(1120)
u1 = runif(m) # these two lines give a random
u2 = runif(m) # point in the unit square
hu = dnorm(u1)*dnorm(u2)
mean(hu) # Monte Carlo P{Unit square}

> mean(hu) # Monte Carlo P{Unit square}


[1] 0.1163651 # other seeds: 0.1164, 0.1164, 0.1165
# 0.1165, 0.1167. Avg: 0.1165

b) Evaluate P {Z1² + Z2² < 1}. Here the region of integration does not have
area 1, so remember to multiply by an appropriate constant. Because
Z1² + Z2² ∼ CHISQ(2), the exact answer can be found with pchisq(1, 2).
d With both methods, we integrate over the portion of the unit circle in the first
quadrant and then multiply by 4. The Riemann approximation, 0.3936, agrees with the
exact value, 0.3935, to three decimal places. To get about 10 000 points of evaluation for both
methods, we increase the number of candidate points for the Monte Carlo method
to 12 732 (and happen to have 10 028 of them accepted in the run shown). The
multiplier 1/2 in the Monte Carlo integration of Example 3.9 becomes π/4 here. In
the (very lucky) run shown, the Monte Carlo integration has four-place accuracy. c

m = 10000; g = round(sqrt(m))
x1 = rep((1:g-1/2)/g, times=g); x2 = rep((1:g-1/2)/g, each=g)
hx = dnorm(x1)*dnorm(x2)
4 * sum(hx[x1^2 + x2^2 < 1])/g^2 # Riemann approximation
pchisq(1, 2) # exact value

> 4 * sum(hx[x1^2 + x2^2 < 1])/g^2 # Riemann approximation


[1] 0.3935860
> pchisq(1, 2) # exact value
[1] 0.3934693

set.seed(1222)
m = round(10000*4/pi) # to get about 10000 accepted
u1 = runif(m); u2 = runif(m); hu = dnorm(u1)*dnorm(u2)
hu.acc = hu[u1^2 + u2^2 < 1]
m.prime = length(hu.acc); m.prime # number accepted
4*(pi/4) * mean(hu.acc) # Monte Carlo result
2*pi*sd(hu.acc)/sqrt(m.prime) # MC margin of error

> m.prime = length(hu.acc); m.prime # number accepted


[1] 10028
> 4*(pi/4) * mean(hu.acc) # Monte Carlo result
[1] 0.3934832
> 2*pi*sd(hu.acc)/sqrt(m.prime) # MC margin of error
[1] 0.00112921

c) The joint density function of (Z1 , Z2 ) has circular contour lines centered
at the origin, so that probabilities of regions do not change if they are
rotated about the origin. Use this fact to argue that the exact value of
P {Z1 > 0, Z2 > 0, Z1 +Z2 < 1}, which was approximated in Example 3.9,
can be found with (pnorm(1/sqrt(2)) - 0.5)^2.

d Consider the square in all four quadrants, of which our triangle contains a quarter
of the total area. This square has two of its vertices at (0, 1) and (0, −1), and the
length of one of its sides is √2. If we rotate this square by 45 degrees so that its sides
are parallel to the axes, it will still have the same probability under the bivariate
normal curve as before rotation. By symmetry, the desired probability is the same
as for the square within the first quadrant after rotation, with two of its corners at
the origin and (1/√2, 1/√2). Then, by an argument similar to that of part (a), the
R expression provided above computes the exact probability J = 0.06773 mentioned
in Example 3.9. c
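
Checking the claim numerically (added line):

(pnorm(1/sqrt(2)) - 0.5)^2    # 0.06773, the exact value of J from Example 3.9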

3.28 Here we extend the idea of Example 3.9 to three dimensions. Suppose
three items are drawn at random from a population of items with weights (in
grams) distributed as NORM(100, 10).
d We do not provide detailed code; see the partial Answers at the end of the
problem and the brief sketch following them. c

a) Using the R function pnorm (cumulative distribution function of a stan-


dard normal), find the probability that the sum of the three weights is less
than 310 g. Also find the probability that the minimum weight of these
three items exceeds 100 g.
b) Using both Riemann and Monte Carlo methods, approximate the prob-
ability that (simultaneously) the minimum of the weights exceeds 100 g
and their sum is less than 310 g.
The suggested procedure is to (i) express this problem in terms of three
standard normal random variables, and (ii) modify appropriate parts of
the program in Example 3.9 to approximate the required integral over a
triangular cone (of area 1/6) in the unit cube (0, 1)3 using m = g 3 = 253 .
Here is R code to make the three grid vectors needed for the Riemann
approximation:
x1 = rep((1:g - 1/2)/g, each=g^2)
x2 = rep(rep((1:g - 1/2)/g, each=g), times=g)
x3 = rep((1:g - 1/2)/g, times=g^2)
Answers: (a) pnorm(1/sqrt(3)) for the sum. (b) Approximately 0.009. To under-
stand the three lines of code provided, experiment with a five-line program. Use
g = 3 before these three lines and cbind(x1, x2, x3) after.
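
Here is a minimal sketch of the Riemann approximation for part (b), assembled from the grid
code given in the hint (an addition, not the authors' program); it should return a value near
the 0.009 given above:

g = 25
x1 = rep((1:g - 1/2)/g, each=g^2)
x2 = rep(rep((1:g - 1/2)/g, each=g), times=g)
x3 = rep((1:g - 1/2)/g, times=g^2)
hx = dnorm(x1)*dnorm(x2)*dnorm(x3)     # standardized weights Zi = (Xi - 100)/10
sum(hx[x1 + x2 + x3 < 1])/g^3          # P{min weight > 100 and sum < 310}, approx 0.009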
3.29 For d = 2, 3, and 4, use Monte Carlo approximation with m = 100 000
to find the probability that a d-variate (independent) standard normal distri-
bution places in the part of the d-dimensional unit ball with all coordinates
positive. To do this, imitate the program in Example 3.9, and use the fact
that the entire 4-dimensional unit ball has hypervolume π²/2. Compare the
results with appropriate values computed using the chi-squared distribution
with d degrees of freedom; use the R function pchisq.
d In order to avoid structures in R not yet fully developed in this book, we have
not fully “vectorized” the code in the program below. We include the case d = 1,
where the probability sought is P {Z 2 < 1}, for Z ∼ NORM(0, 1), as in part (a) of
Problem 3.27.
The vector ball.vol contains hypervolumes of d-dimensional unit balls (see the
formula in the note). For Monte Carlo integration, the appropriate multiplier of the
average of values hu at random points within the ball is the hypervolume of the ball
divided by 2^d. This ratio expresses the hypervolume of the subset of the ball for
which all coordinates are positive (the right half-line, first quadrant, and so on).
The error of the Monte Carlo simulations decreases in higher dimensions, al-
though the relative error (absolute error divided by exact value, not shown) remains
about the same. In this program, the four Monte Carlo approximations are not in-
dependent because we use information from lower dimensions to simulate higher
dimensions. As a result, there is a tendency for the errors to have the same sign. c

set.seed(1123)
m = 100000; d = 1:4; ball.vol = pi^(d/2)/gamma(d/2 + 1)
u1 = runif(m); u2 = runif(m); u3 = runif(m); u4 = runif(m)
hu = dnorm(u1); sq.dist = u1^2

p1 = ball.vol[1]*mean(hu[sq.dist < 1])/2


hu = dnorm(u1)*dnorm(u2); sq.dist = sq.dist + u2^2
p2 = ball.vol[2]*mean(hu[sq.dist < 1])/4
hu = hu*dnorm(u3); sq.dist = sq.dist + u3^2
p3 = ball.vol[3]*mean(hu[sq.dist < 1])/8
hu = hu*dnorm(u4); sq.dist = sq.dist + u4^2
p4 = ball.vol[4]*mean(hu[sq.dist < 1])/16
MC = c(p1,p2,p3,p4); Exact = pchisq(1, d)/2^d; Err = MC - Exact
round(cbind(d, MC, Exact, Err), 6)

> round(cbind(d, MC, Exact, Err), 6)


d MC Exact Err
[1,] 1 0.341471 0.341345 0.000126
[2,] 2 0.098401 0.098367 0.000034
[3,] 3 0.024856 0.024844 0.000013
[4,] 4 0.005640 0.005638 0.000002

# Brief investigation of some questions posed in the note below:


d = 1:13; ball.vol = pi^(d/2)/gamma(d/2+1); norm.pr = pchisq(1,d)
round(cbind(d, ball.vol, norm.pr, 2^d), 6)

> round(cbind(d, ball.vol, norm.pr, 2^d), 6)


d ball.vol norm.pr
[1,] 1 2.000000 0.682689 2
[2,] 2 3.141593 0.393469 4
[3,] 3 4.188790 0.198748 8
[4,] 4 4.934802 0.090204 16
[5,] 5 5.263789 0.037434 32
[6,] 6 5.167713 0.014388 64
[7,] 7 4.724766 0.005171 128
[8,] 8 4.058712 0.001752 256
[9,] 9 3.298509 0.000562 512
[10,] 10 2.550164 0.000172 1024
[11,] 11 1.884104 0.000050 2048
[12,] 12 1.335263 0.000014 4096
[13,] 13 0.910629 0.000004 8192

Note: In case you want to explore higher dimensions, the general formula for the
hypervolume of the unit ball in d dimensions is π^(d/2)/Γ((d + 2)/2); for a derivation see
Courant and John (1989), p459. Properties of higher dimensional spaces may seem
strange to you. What happens to the hypervolume of the unit ball as d increases?
What happens to the probability assigned to the (entire) unit ball by the d-variate
standard normal distribution? What happens to the hypervolume of the smallest
hypercube that contains it? There is “a lot of room” in higher dimensional space.

Errors in Chapter 3
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p74 Problem 3.7. Hint (a): R code h = 12*g^3*(1-g) should be h = 12*g^2*(1-g).
p76 Problem 3.11(b). Should refer to Figure 3.10 (on the next page), not Figure 3.2.
[Thanks to Leland Burrill.]
p84 Problem 3.27(c). The probability should be P {Z1 > 0, Z2 > 0, Z1 + Z2 < 1}.
That is, the event should be restricted to the first quadrant.

4
Applied Probability Models

4.1 Let Nt ∼ POIS(λt) and X ∼ EXP(λ), where λ = 3. Use R to evaluate


P {N1 = 0}, P {N2 > 0}, P {X > 1}, and P {X ≤ 1/2}. Repeat for λ = 1/3.
d In R, recall that dpois denotes the Poisson probability density function, ppois
denotes the Poisson cumulative distribution function (CDF), and pexp denotes the
exponential CDF. Below, the first block from the R Session window is for λ = 3 and
the second is for λ = 1/3. c

> lambda = 3 # Poisson mean, exponential rate


> dpois(0, 1*lambda); 1 - ppois(0, 2*lambda)
[1] 0.04978707
[1] 0.9975212
> 1 - pexp(1, lambda); pexp(1/2, lambda)
[1] 0.04978707
[1] 0.7768698

> lambda = 1/3 # Poisson mean, exponential rate


> dpois(0, 1*lambda); 1 - ppois(0, 2*lambda)
[1] 0.7165313
[1] 0.4865829
> 1 - pexp(1, lambda); pexp(1/2, lambda)
[1] 0.7165313
[1] 0.1535183

Note: Equation (4.1) says that the first and third probabilities are equal.

4.2 Analytic results for X ∼ EXP(λ). (Similar to problems in Chapter 2.)


a) Find η such that P {X ≤ η} = 1/2. Thus η is the median of X.
d If P {X ≤ η} = ∫_0^η λe^(−λt) dt = 1 − e^(−λη) = 1/2, then e^(−λη) = 1/2. But this implies
that λη = ln 2 = 0.69315 (to five places). So η = 0.69315/λ = 0.69315 E(X). In a
right-skewed distribution such as an exponential, the mean is typically larger than
the median. If λ = 1, then η = 0.69315 is given by the R code qexp(1/2, 1). The

function qexp denotes the quantile function of an exponential distribution, which is


the inverse CDF. Thus also, pexp(0.69315, 1) returns a number very near 1/2. c
b) Show that E(X) = ∫_0^∞ t fX(t) dt = ∫_0^∞ λt e^(−λt) dt = 1/λ.

d For the analytical proof here and to find E(X^2) in part (c), use integration by
parts or the moment generating function m(s) = E(e^(sX)) = λ/(λ − s), for s < λ,
which has m′(0) = E(X) and m″(0) = E(X^2).
To illustrate that a random sample of 100 000 observations from EXP(λ = 3)
has a sample mean very near 1/3, type the following into the R Session window:
mean(rexp(100000, 3)). You can anticipate two-place accuracy. For a similar illus-
tration of the value of SD(X) in part (c), use sd(rexp(100000, 3)). Referring to
part (a), what are the results of qexp(1/2, 3) and log(2)/3, and the approximate
result of quantile(rexp(100000, 3), 1/2)? c
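Collected into a short script (our arrangement of the commands just described; any
seed will do, and the simulated results are approximate):

set.seed(1236)
mean(rexp(100000, 3))               # near E(X) = 1/3
sd(rexp(100000, 3))                 # near SD(X) = 1/3, anticipating part (c)
qexp(1/2, 3); log(2)/3              # both give the exact median 0.2310491
quantile(rexp(100000, 3), 1/2)      # sample median, also near 0.231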

c) Similarly, show that V(X) = E(X 2 ) − [E(X)]2 = 1/λ2 , and SD(X) = 1/λ.

4.3 Explain each step in equation (4.2). For X ∼ EXP(λ) and r, s, t > 0,
why is P {X > r + t|X > r} = P {X > s + t|X > s}?
d Following the Hints, the first equality below uses the definition of conditional
probability, P (A|B) = P (A ∩ B)/P (B). The numerator of the resulting fraction is
P (A ∩ B) = P (A), because A = {X > s + t} is a subset of B = {X > s}.

   P {X > s + t|X > s} = P {X > s + t, X > s}/P {X > s} = P {X > s + t}/P {X > s}
                       = e^(−λ(s+t))/e^(−λs) = e^(−λt) = P {X > t}.

Expressions in the second line follow from the formula for the CDF FX(t) = 1 − e^(−λt),
for t > 0, of X ∼ EXP(λ). Specifically, P {X > s + t} = e^(−λ(s+t)) = e^(−λs) e^(−λt). c
Hints: P (A|B) = P (A ∩ B)/P (B). If A ⊂ B, then what is P (A ∩ B)?

4.4 In the R code below, each line commented with a letter (a)–(h) returns
an approximate result related to the discussion at the beginning of Section 4.1.
For each, say what method of approximation is used, explain why the result
may not be exactly correct, and provide the exact value being approximated.
d Let Y ∼ POIS(λ) and X ∼ EXP(λ). Then with λ = 0.5, we have E(Y ) = 0.5 and
E(X) = 1/λ = 2:
(a) Σ_{k=0}^{100} e^(−λ) λ^k/k! ≈ Σ_{k=0}^{∞} e^(−λ) λ^k/k! = 1 verifies that the Poisson
probability distribution function sums to 1, as it must.
(b) Σ_{k=0}^{100} k e^(−λ) λ^k/k! ≈ E(Y ) = 0.5.
(c) The Riemann approximation of ∫_0^1000 λe^(−λy) dy ≈ ∫_0^∞ λe^(−λy) dy = 1.
(d) The Riemann approximation of E(X) ≈ ∫_0^1000 yλe^(−λy) dy.
(e & f) The sampling method approximates E(X) = SD(X) = 1/λ = 2.
(g & h) The probability P {X > 0.1} = P {X > 0.15 | X > 0.05} = 0.9512 is
approximated (by the sampling method), thus illustrating the no-memory property
for s = 0.05 and t = 0.1. c

> lam = .5; i = 0:100


> sum(dpois(i, lam)) #(a)
[1] 1
> sum(i*dpois(i, lam)) #(b)
[1] 0.5
> g = seq(0,1000,by=.001)-.0005
> sum(dexp(g, lam))/1000 #(c)
[1] 1
> sum(g*dexp(g, lam))/1000 #(d)
[1] 2
> x = rexp(1000000, lam)
> mean(x) #(e)
[1] 1.997514
> sd(x) #(f)
[1] 1.995941
> mean(x > .1) #(g)
[1] 0.951023
> y = x[x > .05]; length(y)
[1] 975221
> mean(y > .15) #(h)
[1] 0.9514869

Hints: Several methods from Chapter 3 are used. Integrals over (0, ∞) are approxi-
mated. For what values of s and t is the no-memory property illustrated?
4.5 Four statements in the R code below yield output. Which ones? Which
two statements give the same results? Why? Explain what two other state-
ments compute. Make the obvious modifications for maximums, try to predict
the results, and verify.
> x1 = c(1, 2, 3, 4, 5, 0, 2) # define x1
> x2 = c(5, 4, 3, 2, 1, 3, 7) # define x2
> min(x1, x2); pmin(x1, x2)
[1] 0 # minimum value among 14
[1] 1 2 3 2 1 0 2 ## ’parallel’ (elementwise) minimum

> MAT = cbind(x1, x2); MAT # makes (and prints) matrix


x1 x2
[1,] 1 5
[2,] 2 4
[3,] 3 3
[4,] 4 2
[5,] 5 1
[6,] 0 3
[7,] 2 7
> apply(MAT, 1, min) # min for each row (dimension 1)
[1] 1 2 3 2 1 0 2  ## same as for ’pmin’ above
> apply(MAT, 2, min) # min for each col (dimension 2)
x1 x2
0 1
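For the modifications to maximums requested at the end of the problem, a brief
sketch (our code) with the results to expect:

x1 = c(1, 2, 3, 4, 5, 0, 2); x2 = c(5, 4, 3, 2, 1, 3, 7)
max(x1, x2)                  # single largest value among all 14 numbers: 7
pmax(x1, x2)                 # elementwise maxima: 5 4 3 4 5 3 7
MAT = cbind(x1, x2)
apply(MAT, 1, max)           # row maxima, same as pmax: 5 4 3 4 5 3 7
apply(MAT, 2, max)           # column maxima: x1 -> 5, x2 -> 7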

4.6 More simulations related to Example 4.1: Bank Scenarios 1 and 2.


a) In Scenario 1, simulate the probability that you will be served by the
female teller. Use and explain the code mean(x2==v). [Exact value is 5/9.]
d We are served by the second (more experienced teller) if the waiting time for her
to be free is the minimum of the two waiting times. It may not be obvious without
proof or simulation that the exact probability of being served by the second teller
is 5/9 = 0.5556, but it should be obvious that the probability is greater than 1/2,
because the faster teller is more likely to finish first. The vector (x2 == v) is a
logical vector, and its mean is the proportion of its TRUE values. c

set.seed(1212)
m = 100000; lam1 = 1/5; lam2 = 1/4
x1 = rexp(m, lam1); x2 = rexp(m, lam2)
v = pmin(x1, x2)
mean(x2 == v) # Min wait is wait for 2nd teller (part a)
mean(v) # Avg. min is avg. wait to start (part b)

> mean(x2 == v) # Min wait is wait for 2nd teller (part a)


[1] 0.55762
> mean(v) # Avg. min is avg. wait to start (part b)
[1] 2.225070

b) In Scenario 1, simulate the expected waiting time using mean(v), and


compare it with the exact value, 20/9.
d The mean waiting time to begin service is simulated above as mean(v). The exact
mean waiting time is shown in Example 4.1 to be 20/9 = 2.2222. c

c) Now suppose there is only one teller with service rate λ = 1/5. You are
next in line to be served. Approximate by simulation the probability it
will take you more than 5 minutes to finish being served. This is the same
as one of the probabilities mentioned under Scenario 2. Which one? What
is the exact value of the probability you approximated? Discuss.
d The time to finish service with this one teller is the sum of the waiting time to
start service and the waiting time to finish service with the teller. From Scenario 2
of the example we know that T = X1 + X2 ∼ GAMMA(2, 1/5). Thus we can find the
exact value of P {T > 5} in R as 1 - pgamma(5, 2, 1/5), which returns 0.7358. c

set.seed(1212)
m = 100000; lam1 = lam2 = 1/5
x1 = rexp(m, lam1); x2 = rexp(m, lam2)
t = x1 + x2
mean(t > 5)

> mean(t > 5)


[1] 0.73618

4.7 Some analytic results for Example 4.1: Bank Scenario 3.


a) Argue that FW (t) = (1 − e−t/5 )(1 − e−t/4 ), for t > 0, is the cumulative
distribution function of W .
d We derive the CDF of W = max(X1 , X2 ):

FW (t) = P {W ≤ t} = P {X1 ≤ t, X2 ≤ t}
= P {X1 ≤ t}P {X2 ≤ t} = (1 − e−t/5 )(1 − e−t/4 ),

recalling that X1 and X2 are independent random variables with X1 ∼ EXP(1/5)


and X2 ∼ EXP(1/4). c

b) Use the result of part (a) to verify the exact value P {W > 5} = 0.5490
given in Scenario 3.
d Then P {W > 5} = 1 − FW (5) = 1 − (1 − e−1 )(1 − e−1.25 ) = 0.5490. The numeri-
cal evaluation is done in R with 1 - (1 - exp(-1))*(1 - exp(-1.25)). Also see
below. c

c) Modify the program of Scenario 1 to approximate E(W ).


d The first result below simulates E(W ) as requested in part (c). As mentioned in
part (d) below, the exact value can be obtained analytically: Expand the result for
FW (t) in part (a), take the derivative of each term to find the density function fW (t),
multiply by t to obtain tfW (t), and integrate each term over (0, ∞). The second
simulated result below approximates P {W > 5} = 0.5490, derived in part (b). c

set.seed(1212)
m = 100000; lam1 = 1/5; lam2 = 1/4
x1 = rexp(m, lam1); x2 = rexp(m, lam2)
w = pmax(x1, x2)
mean(w); mean(w > 5)

> mean(w); mean(w > 5)


[1] 6.784017
[1] 0.55026

d) Use the result of part (a) to find the density function fW (t) of W, and
hence find the exact value of E(W ) = ∫_0^∞ t fW (t) dt. d See method above. c
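As a numerical check (a sketch we add here), differentiate FW (t) from part (a) to get
fW (t) = (1/5)e^(−t/5)(1 − e^(−t/4)) + (1/4)e^(−t/4)(1 − e^(−t/5)), then integrate t fW (t)
numerically with R's integrate function:

fW = function(t) (1/5)*exp(-t/5)*(1 - exp(-t/4)) + (1/4)*exp(-t/4)*(1 - exp(-t/5))
integrate(function(t) t*fW(t), 0, Inf)    # about 6.778, agreeing with the simulation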

4.8 Modify the R code in Example 4.2 to explore a parallel system of four
CPUs, each with failure rate λ = 1/5. The components are more reliable here,
but fewer of them are connected in parallel. Compare the ECDF of this system
with the one in Example 4.2. Is one system clearly better than the other?
(Defend your answer.) In each case, what is the probability of surviving for
more than 12 years?
d The approximate probability of 12-year survival for the system of the example
is 0.22, whereas the corresponding probability for the system of this problem is 0.32.

Over the long run it is hard to beat more reliable components, even if there are
fewer of them connected in parallel. More precisely, 1 − (1 − e−12/4 )5 = 0.2254, while
1 − (1 − e−12/5 )4 = 0.3164. (See Problem 4.9.)
The figure made by the program below illustrates that, for time periods between
about 3 and 30 years, the system of this problem is more reliable (dashed line in the
ECDF is below the solid one). But initially the system of the example is the more
reliable one. For example, reliabilities at 2 years are 1 − (1 − e−2/4 )5 = 0.9906 and
1 − (1 − e−2/5 )4 = 0.9882, respectively. What is computed by the three extra lines
below the program output? c

set.seed(12)
m = 100000; ecdf = (1:m)/m
n = 5; lam = 1/4; x = rexp(m*n, lam)
DTA = matrix(x, nrow=m); w1 = apply(DTA, 1, max)
n = 4; lam = 1/5; y = rexp(m*n, lam)
DTA = matrix(y, nrow=m); w2 = apply(DTA, 1, max)
par(mfrow = c(1,2))
w1.sort = sort(w1); w2.sort = sort(w2)
plot(w1.sort, ecdf, type="l", xlim=c(0,30), xlab="Years")
lines(w2.sort, ecdf, lty="dashed")
abline(v=12, col="green")
plot(w1.sort, ecdf, type="l", xlim=c(0, 3),
ylim=c(0,.06), xlab="Years")
lines(w2.sort, ecdf, lty="dashed")
abline(v=2, col="green")
par(mfrow = c(1,1))
mean(w1 > 12) # aprx prob syst of example surv 12 yrs
mean(w2 > 12) # aprx prob syst of problem surv 12 yrs

> mean(w1 > 12) # aprx prob syst of example surv 12 yrs
[1] 0.22413
> mean(w2 > 12) # aprx prob syst of problem surv 12 yrs
[1] 0.31804

> w = seq(0, 5, by=.001)


> cdf1 = 1-(1-exp(-w/4))^5; cdf2 = 1-(1-exp(-w/5))^4
> max(w[cdf1 > cdf2])
[1] 3.093

4.9 Some analytic solutions for Example 4.2: Parallel Systems.


a) A parallel system has n independent components, each with lifetime dis-
tributed as EXP(λ). Show that the cumulative distribution of the lifetime
of this system is FW (t) = (1 − e−λt )n . For n = 5 and λ = 1/4, use this
result to show that P {W > 5} = 0.8151 as indicated in the example. Also
evaluate P {W ≤ 15.51}.
d The proof is similar to the one in Problem 4.7(a). The evaluations are done as in
Problem 4.8. c
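A quick numerical check of the part (a) evaluations (our two lines, using the CDF
just derived):

lam = 1/4; n = 5
1 - (1 - exp(-lam*5))^n         # P{W > 5} = 0.8151
(1 - exp(-lam*15.51))^n         # P{W <= 15.51}, about 0.90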

b) How accurately does the ECDF in Example 4.2 approximate the cumula-
tive distribution function FW in part (a)? Use the same plot statement
as in the example, but with parameters lwd=3 and col="green", so that
the ECDF is a wide green line. Then overlay the plot of FW with
tt = seq(0, 30, by=.01); cdf = (1-exp(-lam*tt))^5; lines(tt, cdf)
and comment.
d The change in the program is elementary and well specified. The resulting figure
illustrates excellent agreement between the CDF and the ECDF. c

c) Generalize the result for FW in part (a) so that the lifetime of the ith
component is distributed as EXP(λi ), where the λi need not be equal.
d The generalization is FW (t) = ∏_{i=1}^n (1 − e^(−λi t)). c

d) One could find E(W ) by taking the derivative of FW in part (a) to get
the density function fW and then evaluating ∫_0^∞ t fW (t) dt, but this is a
messy task. However, in the case where all components have the same
failure rate λ, we can find E(W ) using the following argument, which is
based on the no-memory property of exponential distributions.
Start with the expected wait for the first component to fail. That is,
the expected value of the minimum of n components. The distribution is
EXP(nλ) with mean 1/(nλ). Then start afresh with the remaining n − 1
components, and conclude that the mean additional time until the second
failure is 1/((n − 1)λ). Continue in this fashion to show that the R code
sum(1/(lam*(n:1))) gives the expected lifetime of the system. For a
five-component system with λ = 1/4, as in Example 4.2, show that this
result gives E(W ) = 9.1333.
> lam = 1/4; n = 5; sum(1/(lam*(n:1)))
[1] 9.133333

Note: The argument in (d) depends on symmetry, so it doesn’t work in the case
where components have different failure rates, as in part (c).

4.10 When a firm receives an invitation to bid on a contract, a bid cannot


be made until it has been reviewed by four divisions: Engineering, Personnel,
Legal, and Accounting. These divisions start work at the same time, but they
work independently and in different ways. Times from receipt of the offer to
completion of review by the four divisions are as follows. Engineering: expo-
nential with mean 3 weeks; Personnel: normal with mean 4 weeks and stan-
dard deviation 1 week; Legal: either 2 or 4 weeks, each with probability 1/2;
Accounting: uniform on the interval 1 to 5 weeks.
a) What is the mean length of time W before all four divisions finish their
reviews? Bids not submitted within 6 weeks are often rejected. What is
the probability that it takes more than 6 weeks for all these reviews to
be finished? Write a program to answer these questions by simulation.

Include a histogram that approximates the distribution of the time before


all divisions finish. Below is suggested R code for making a matrix DTA;
proceed from there, using the code of Example 4.2 as a guide.
Eng = rexp(m, 1/3)
Per = rnorm(m, 4, 1)
Leg = 2*rbinom(m, 1, .5) + 2
Acc = runif(m, 1, 5)
DTA = cbind(Eng, Per, Leg, Acc)

set.seed(1213)
m = 100000
Eng = rexp(m, 1/3); Per = rnorm(m, 4, 1)
Leg = 2*rbinom(m, 1, .5) + 2; Acc = runif(m, 1, 5)
DTA = cbind(Eng, Per, Leg, Acc)
w = apply(DTA, 1, max)

hist(w, prob=T, col="wheat") # figure not shown here


abline(v = 6, lty="dashed", col="red")
mean(w); mean(w > 6); max(w)
mean(w==Eng); mean(w==Per) # to begin part (b)
mean(w==Leg); mean(w==Acc)

> mean(w); mean(w > 6); max(w)


[1] 5.07939
[1] 0.15505 # Fraction of bids not done in 6 wks
[1] 31.74465 # Worst case: bids can take a LONG time
> mean(w==Eng); mean(w==Per) # to begin part (b)
[1] 0.239
[1] 0.45214 # Personnel is most often last
> mean(w==Leg); mean(w==Acc)
[1] 0.13761
[1] 0.17125

b) Which division is most often the last to complete its review? If that divi-
sion could decrease its mean review time by 1 week, by simply subtract-
ing 1 from the values in part (a), what would be the improvement in the
6-week probability value?
set.seed(1214)
m = 100000
Eng = rexp(m, 1/3); Per = rnorm(m, 4, 1) - 1
Leg = 2*rbinom(m, 1, .5) + 2; Acc = runif(m, 1, 5)
DTA = cbind(Eng, Per, Leg, Acc)
w = apply(DTA, 1, max)
mean(w); mean(w > 6)

> mean(w); mean(w > 6)


[1] 4.752623 # decreased from 5.08
[1] 0.13617 # decreased from 0.16

c) How do the answers in part (a) change if the uniformly distributed time for
Accounting starts precisely when Engineering is finished? Use the original
distributions given in part (a).
set.seed(1215)
m = 100000
Eng.Acc = rexp(m, 1/3) + runif(m, 1, 5)
Per = rnorm(m, 4, 1)
Leg = 2*rbinom(m, 1, .5) + 2
DTA = cbind(Eng.Acc, Per, Leg)
w = apply(DTA, 1, max)
mean(w); mean(w > 6)

> mean(w); mean(w > 6)


[1] 6.40436
[1] 0.40877

Hints and answers: (a) Rounded results from one run with m = 10 000 are 5.1 and 0.15;
give more accurate answers. (b) The code mean(w==Eng) gives the proportion of the
time Engineering is last to finish. Greatest proportion is 0.46. (c) Very little. Why?
(Ignore the tiny chance that a normal random variable might be negative.)

4.11 Explain the similarities and differences among the five matrices pro-
duced by the R code below. What determines the dimensions of a matrix
made from a vector with the matrix function? What determines the order
in which elements of the vector are inserted into the matrix? What happens
when the number of elements of the matrix exceeds the number of elements
of the vector? Focus particular attention on MAT3, which illustrates a method
we use in Problem 4.12.
> a1 = 3; a2 = 1:5; a3 = 1:30
> MAT1 = matrix(a1, nrow=6, ncol=5); MAT1
[,1] [,2] [,3] [,4] [,5] # 6 rows, 5 columns, as
[1,] 3 3 3 3 3 # specified by arguments
[2,] 3 3 3 3 3
[3,] 3 3 3 3 3
[4,] 3 3 3 3 3
[5,] 3 3 3 3 3
[6,] 3 3 3 3 3

> MAT2 = matrix(a2, nrow=6, ncol=5); MAT2


[,1] [,2] [,3] [,4] [,5] # by default, matrices
[1,] 1 2 3 4 5 # are filled by columns
[2,] 2 3 4 5 1 # with the vector repeating
[3,] 3 4 5 1 2 # as necessary to fill all
[4,] 4 5 1 2 3 # columns and rows
[5,] 5 1 2 3 4
[6,] 1 2 3 4 5

> MAT3 = matrix(a2, nrow=6, ncol=5, byrow=T); MAT3


[,1] [,2] [,3] [,4] [,5] # argument ’byrow=T’ causes
[1,] 1 2 3 4 5 # matrix to be filled by
[2,] 1 2 3 4 5 # rows
[3,] 1 2 3 4 5
[4,] 1 2 3 4 5
[5,] 1 2 3 4 5
[6,] 1 2 3 4 5

> MAT4 = matrix(a3, 6); MAT4


[,1] [,2] [,3] [,4] [,5] # if only the number of rows
[1,] 1 7 13 19 25 # is specified, the number
[2,] 2 8 14 20 26 # of columns is chosen to use
[3,] 3 9 15 21 27 # all elements of the vector
[4,] 4 10 16 22 28 # here default filling by
[5,] 5 11 17 23 29 # columns is used
[6,] 6 12 18 24 30

> MAT5 = matrix(a3, 6, byrow=T); MAT5


[,1] [,2] [,3] [,4] [,5] # same as above, but filling
[1,] 1 2 3 4 5 # by rows is invoked
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
[5,] 21 22 23 24 25
[6,] 26 27 28 29 30

4.12 In Example 4.2, each of the five component CPUs in the parallel
system has failure rate λ = 1/4 because it is covered by a thickness of lead
foil that cuts deadly radiation by half. That is, without the foil, the failure
rate would be λ = 1/2. Because the foil is heavy, we can’t afford to increase
the total amount of foil used. Here we explore how the lifetime distribution of
the system would be affected if we used the same amount of foil differently.
a) Take the foil from one of the CPUs (the rate goes to 1/2) and use it to
double-shield another CPU (rate goes to 1/8). Thus the failure rates for the
five CPUs are given in a 5-vector lam as shown in the simulation program
below. Compare the mean and median lifetimes, probability of survival
longer than 10 years, and ECDF curve of this heterogeneous system with
similar results for the homogeneous system of Example 4.2. Notice that in
order for each column of the matrix to have the same rate down all rows,
it is necessary to fill the matrix by rows using the argument (byrow=T).
Thus the vector of five rates “recycles” to provide the correct rate for each
element in the matrix. (See Problem 4.11 for an illustrative exercise.)
d Run the program to see the figure. Is the altered configuration (dashed curve)
generally more reliable than the configuration of the example? c

# Curve for original example


m = 100000
n = 5; lam.e = 1/4
x = rexp(m*n, lam.e)
DTA = matrix(x, nrow=m)
w.e = apply(DTA, 1, max)
mean(w.e); quantile(w.e, .5); mean(w.e > 10)
ecdf = (1:m)/m; w.e.sort = sort(w.e)
plot(w.e.sort, ecdf, type="l", xlim=c(0,40), xlab="Years")

# Overlay curve for part (a)


lam = c(1/2, 1/4, 1/4, 1/4, 1/8)
x = rexp(m*n, lam)
DTA = matrix(x, nrow=m, byrow=T)
w.a = apply(DTA, 1, max)
mean(w.a); quantile(w.a, .5); mean(w.a > 10)
w.a.sort = sort(w.a)
lines(w.a.sort, ecdf, lwd=2, col="darkblue", lty="dashed")

b) Denote the pattern of shielding in part (a) as 01112. Experiment with other
patterns with digits summing to 5, such as 00122, 00023, and so on. The
pattern 00023 would have lam = c(1/2, 1/2, 1/2, 1/8, 1/16). Which
of your patterns seems best? Discuss.
d Based on the issues explicitly raised above, a configuration such as the one denoted
by 00005, with all the shielding on one CPU, seems best. But see the Notes below. c
Notes: The ECDF of one very promising reallocation of foil in part (b) is shown in
Figure 4.9. Parallel redundancy is helpful, but “it’s hard to beat” components with
lower failure rates. In addition to the kind of radiation against which the lead foil
protects, other hazards may cause CPUs to fail. Also, because of geometric issues,
the amount of foil actually required for, say, triple shielding may be noticeably more
than three times the amount for single shielding. Because your answer to part (b)
does not take such factors into account, it might not be optimal in practice.

4.13 Repeat Example 4.3 for n = 5. For what value K is Runb = KR an


unbiased estimate of σ. Assuming that σ = 10, what is the average length of
a 95% confidence interval for σ based on the sample range?
set.seed(1234)
m = 100000; n = 5
mu = 100; sg = 10
x = rnorm(m*n, mu, sg); DTA = matrix(x, m)
x.mx = apply(DTA, 1, max)
x.mn = apply(DTA, 1, min)
x.rg = x.mx - x.mn # vector of m sample ranges
mean(x.rg); sd(x.rg)
quantile(x.rg, c(.025,.975))
hist(x.rg, prob=T)

> mean(x.rg); sd(x.rg)


[1] 23.26074
[1] 8.640983
> quantile(x.rg, c(.025,.975))
2.5% 97.5%
8.486097 41.967766

d Because E(R) = 23.26 = 2.326σ we see that Runb = R/2.326 = 0.43R is unbiased
for σ and that K = 0.43. Moreover,

P {8.49 < R < 41.97} = P {0.849 < R/σ < 4.197}


= P {R/4.197 < σ < R/0.849} = 0.95.

Thus, as in Example 4.3, the expected length of an R-based 95% confidence interval
for σ is E(R)(1/0.849 − 1/4.197) = 2.326σ(0.940) = 2.19σ, or about 21.9 when σ = 10. c
4.14 Modify the code of Example 4.3 to try “round-numbered” values of n
such as n = 30, 50, 100, 200, and 500. Roughly speaking, for what sample sizes
are the constants K = 1/4, 1/5, and 1/6 appropriate to make Runb = KR an
unbiased estimator of σ? (Depending on your patience and your computer,
you may want to use only m = 10 000 iterations for larger values of n.)
d We simulated data with σ = 1 so that, for each sample size, the denominator
d = 1/K is simply approximated as the average of the simulated ranges. Because
we seek only approximate values, 10 000 iterations would be enough, but we used
50 000 for all sample sizes below.

set.seed(1235)
m = 50000; n = c(30, 50, 100, 200, 500)
mu = 100; sg = 1
k = length(n); d = numeric(k)
for (i in 1:k) {
x = rnorm(m*n[i], mu, sg); DTA = matrix(x, m)
x.mx = apply(DTA, 1, max); x.mn = apply(DTA, 1, min)
x.rg = x.mx - x.mn
d[i] = mean(x.rg) }
round(cbind(n, d), 3)

> round(cbind(n, d), 3)


n d
[1,] 30 4.085
[2,] 50 4.505
[3,] 100 5.017
[4,] 200 5.491
[5,] 500 6.074

You can see that, for a normal sample of about size n = 30, dividing the sample
range by 4 gives a reasonable estimate of σ. Two additional cases, with easy-to-
remember round numbers, are to divide by 5 for samples of about 100 observations,
and by 6 for samples of about 500. However, as n increases, the sample range R

becomes ever less correlated with S, which is the preferred estimate of σ. (The
sample variance S 2 is unbiased for σ 2 and has the smallest variance among unbiased
estimators. A slight bias, decreasing as n increases, is introduced by taking the
square root, as we see in Problem 4.15.)
Nevertheless, for small values of n, estimates of σ based on R can be useful.
In particular, estimates of σ for industrial process control charts have traditionally
been based on R, and engineering statistics books sometimes provide tables of the
appropriate unbiasing constants for values of n up to about 20.
Elementary statistics texts often suggest estimating σ by dividing the sample
range by a favorite value—typically 4 or 5. This suggestion may be accompanied by
a carefully chosen example in which the results are pretty good. However, we see
here that no one divisor works across all sample sizes.
As n increases, S converges to σ in probability, but R diverges to infinity. Normal
tails have little probability beyond a few standard deviations from the mean, but
the tails do extend to plus and minus infinity. So, if you take enough observations,
you are bound to get “outliers” from far out in the tails of the distribution, which
inflate R. c
4.15 This problem involves exploration of the sample standard deviation S
as an estimate of σ. Use n = 10.
a) Modify the program of Example 4.3 to simulate the distribution of S.
Use x.sd = apply(DTA, 1, sd). Although E(S 2 ) = σ 2 , equality (that
is, unbiasedness) does not survive the nonlinear operation of taking the
square root. What value a makes Sunb = aS an unbiased estimator of σ?
d The simulation below gives a ≈ 1.027, with 1/a = 0.973. This approximation is
based on n = 10 and aE(S) = E(Sunb ) = σ = 10. It is in good agreement with the
exact value E(S) = 0.9727σ, which can be found analytically. (See the Notes). c
set.seed(1238)
m = 100000; n = 10; mu = 100; sg = 10
x = rnorm(m*n, mu, sg); DTA = matrix(x, m)
x.sd = apply(DTA, 1, sd)
a = sg/mean(x.sd); mean(x.sd)/sg; a

a = sg/mean(x.sd); mean(x.sd)/sg; a
[1] 0.973384
[1] 1.027344

b) Verify the value of E(LS ) given in Example 4.3. To find the confidence
limits of a 95% confidence interval for S, use qchisq(c(.025,.975), 9)
and then use E(S) in evaluating E(LS ). Explain each step.
d The quantiles from qchisq are 2.70 and 19.02, so that the confidence interval for σ 2
is derived from
P {2.70 < 9S^2/σ^2 < 19.02} = P {9S^2/19.02 < σ^2 < 9S^2/2.70} = 95%.

On taking square roots, the 95% CI for σ is (3S/4.362, 3S/1.643) or (0.688S, 1.826S),
which has expected length E(LS ) = 1.138E(S). According to the Notes, for n = 10,

the exact value of E(S) = 0.9727σ. Thus E(LS ) = 1.138(0.9727)σ = 1.107σ. We did
some of the computations in R, as shown below. c

> diff(3/sqrt(qchisq(c(.975, .025), 9)))


[1] 1.137775
> sqrt(2/9)*gamma(10/2)/gamma(9/2)
[1] 0.9726593

c) Statistical theory says that V(Sunb ) in part (a) has the smallest possible
variance among unbiased estimators of σ. Use simulation to show that
V(Runb ) ≥ V(Sunb ).
d As above and in Example 4.3, we use sample size n = 10. From the simulation
in the example, we know that Runb = R/K, where K ≈ 30.8/σ = 30.8/10 = 3.08.
From the simulation in part (a), we know that Sunb = aS, where a ≈ 0.973. We use
these values in the further simulation below. The last line of code illustrates that
SD(Runb ) ≈ 2.6 > SD(Sunb ) ≈ 2.4. c

set.seed(1240)
m = 100000; n = 10; mu = 100; sg = 10; K = 3.08; a = 1.027
x = rnorm(m*n, mu, sg); DTA = matrix(x, m)
x.sd = apply(DTA, 1, sd); s.unb = a*x.sd
x.mx = apply(DTA, 1, max); x.mn = apply(DTA, 1, min)
x.rg = x.mx - x.mn; r.unb = x.rg/K
mean(s.unb); mean(r.unb) # validation: both about 10
sd(s.unb); sd(r.unb) # first should be smaller

> mean(s.unb); mean(r.unb) # validation: both about 10


[1] 9.981463
[1] 9.98533
> sd(s.unb); sd(r.unb) # first should be smaller
[1] 2.370765
[1] 2.576286
Notes: E(S10) = 0.9727σ = 9.727. For n ≥ 2, E(Sn) = σ √(2/(n − 1)) Γ(n/2)/Γ((n − 1)/2).
The Γ-function can be evaluated in R with gamma(). As n increases, the bias
of Sn in estimating σ disappears. By Stirling’s approximation of the Γ-function,
lim_{n→∞} E(Sn) = σ.

4.16 For a sample of size 2, show that the sample range is precisely a mul-
tiple of the sample standard deviation. [Hint: In the definition of S 2 , express
X̄ as (X1 + X2 )/2.] Consequently, for n = 2, the unbiased estimators of σ
based on S and R are identical.
d With X̄ expressed as suggested, the sample variance becomes

   S^2 = [(X1 − X2)/2]^2 + [(X2 − X1)/2]^2 = (X1 − X2)^2/2 = R^2/2.

Upon taking square roots, we have S = |X1 − X2|/√2 = R/√2. c
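A quick numerical illustration (our code; the two data values are arbitrary):

x = c(3.7, 9.2)
sd(x); diff(range(x))/sqrt(2)   # both return 3.889087, so S = R/sqrt(2) when n = 2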

4.17 (Intermediate) The shape of a distribution dictates the “best” esti-


mators for its parameters. Suppose we have a random sample of size n from
a population with the uniform distribution UNIF(µ − √3 σ, µ + √3 σ), which
has mean µ and standard deviation σ. Let Runb and Sunb be unbiased multi-
ples of the sample range R and sample standard deviation S, respectively, for
estimating σ. (Here the distribution of S 2 is not related to a chi-squared dis-
tribution.) Use simulation methods in each of the parts below, taking n = 10,
µ = 100, and σ = 10, and using the R code of Example 4.3 as a pattern.
d This question is the analogue of Problems 4.13 and 4.15 for uniformly
distributed data. It makes a nice project or take-home test question. We do not
provide answers. c

a) Find the unbiasing constants necessary to define Runb and Sunb . These
estimators are, of course, not necessarily the same as for normal data.
b) Show that V(Runb ) < V(Sunb ). For data from such a uniform distribution,
one can prove that Runb is the unbiased estimator with minimum variance.
c) Find the quantiles of Runb and Sunb necessary to make 95% confidence
intervals for σ. Specify the endpoints of both intervals in terms of σ. Which
confidence interval, the one based on R or the one based on S, has the
shorter expected length?

4.18 Let Y1 , Y2 , . . . , Y9 be a random sample from NORM(200, 10).


a) Modify the R code of Example 4.4 to make a plot similar to Figure 4.4
based on m = 100 000 and using small dots (plot parameter pch=".").
From the plot, try to estimate E(Ȳ ) (balance point), SD(Ȳ ) (95% of ob-
servations lie within two standard deviations of the mean), E(S), and ρ.
Then write and use code to simulate these values. Compare your estimates
from looking at the plot and your simulated values with the exact values.
d The correlation must be very near 0 because there is no linear trend. c
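A sketch of the requested simulation (our code and seed); compare the simulated
values with E(Ȳ ) = 200, SD(Ȳ ) = 10/3, ρ = 0, and E(S) ≈ 0.9693(10) = 9.69 from
the formula in the Notes of Problem 4.15:

set.seed(1218)
m = 100000; n = 9; mu = 200; sg = 10
DTA = matrix(rnorm(m*n, mu, sg), nrow=m)
y.bar = rowMeans(DTA); y.sd = apply(DTA, 1, sd)
plot(y.bar, y.sd, pch=".")           # plot similar to Figure 4.4
mean(y.bar); sd(y.bar)               # near 200 and 10/sqrt(9) = 3.333
mean(y.sd); cor(y.bar, y.sd)         # near 9.69 and 0, respectively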

b) Based on the simulation in part (a), compare P {Ȳ ≤ a} P {S ≤ b} with


P {Ȳ ≤ a, S ≤ b}. Do this for at least three choices of a and b from
among the values a = 197, 200, 202 and b = 7, 10, 11. Use normal and
chi-squared distributions to find exact values for the probabilities you
simulate. Comment.
d Because Ȳ and S are independent for normal data, the multiplication rule holds. c

c) In a plot similar to Figure 4.5, show the points for which the usual 95%
confidence interval for σ covers the population value σ = 10. How does
this differ from the display of points for which the t confidence interval
for µ covers the true population value?
d The boundaries are horizontal parallel lines. c

4.19 Repeat the simulation of Example 4.5 twice, once with n = 15 random
observations from NORM(200, 10) and again with n = 50. Comment on the
effect of sample size.
4.20 More on Example 4.6 and Figure 4.10.
a) Show that there is an upper linear bound on the points in Figure 4.10. This
boundary is valid for any sample in which negative values are impossible.
Suggested steps: Start with (n − 1)S^2 = Σ_i Yi^2 − n Ȳ^2. For your data, say
why Σ_i Yi^2 ≤ (Σ_i Yi)^2. Conclude that Ȳ ≥ S/√n.
d First, when all Yi ≥ 0, we have Σ_i Yi^2 ≤ (Σ_i Yi)^2 = (nȲ)^2, because the expansion
of the square of the sum on the right-hand side has all n of the terms Yi^2 from the
left-hand side, in addition to some nonnegative products. Then,

   (n − 1)S^2 = Σ_{i=1}^n Yi^2 − nȲ^2 ≤ n^2 Ȳ^2 − nȲ^2 = n(n − 1)Ȳ^2,

so S^2/n ≤ Ȳ^2 and Ȳ ≥ S/√n. For nonnegative data the sample mean is at least as
large as its standard error. c

b) Use plot to make a scatterplot similar to the one in Figure 4.10 but with
m = 100 000 points, and then use lines to superimpose your line from
part (a) on the same graph.
d Just for variety, we have used slightly different code below than that shown for
Examples 4.4, 4.6, and 4.7. The last few lines of the program refer to part (c). c

set.seed(12)
m = 100000; n = 5; lam = 2
DTA = matrix(rexp(m*n, lam), nrow=m)
x.bar = rowMeans(DTA) # alternatively ’apply(DTA, 1, mean)’
x.sd = apply(DTA, 1, sd)
plot(x.bar, x.sd, pch=".")
abline(a=0, b=sqrt(n), lty="dashed", col="blue")
abline(h=1.25, col="red")
abline(v=0.5, col="red")
mean(x.bar < .5) # Estimates of probabilities
mean(x.sd > 1.25) # discussed
mean((x.bar < .5) & (x.sd > 1.25)) # in part (c)

> mean(x.bar < .5) # Estimates of probabilities


[1] 0.56028
> mean(x.sd > 1.25) # discussed
[1] 0.01072
> mean((x.bar < .5) & (x.sd > 1.25)) # in part (c)
[1] 0

c) For Example 4.6, show (by any method) that P {Ȳ ≤ 0.5} and P {S > 1.25}
are both positive but that P {Ȳ ≤ 0.5, S > 1.25} = 0. Comment.

d In the example, X1 , X2 , . . . , X5 are a random sample from EXP(λ = 2). Clearly, it is


possible for all five observations to be smaller than 0.5, in which case Ȳ < 0.5. Also,
it is possible for three observations to be smaller than 0.25, and two observations to
exceed 3, in which case we must have S > 1.5. (The lower bound is S = 1.506237,
achieved for data .25, .25, .25, 3, and 3.) By contrast, if S = 1.25 then S/√5 =
0.559017, so by the result of part (a) we cannot also have Ȳ ≤ 0.5.
Alternatively, see the last few lines of the program in part (b). They clearly show
that P {Ȳ ≤ 0.5} and P {S > 1.25} are both positive. They also show no points in
the intersection of the two events out of m = 100 000: this suggests, but does not
prove, that P {Ȳ ≤ 0.5, S > 1.25} = 0. A proof that points in the intersection are
impossible is best expressed in terms of the inequality of part (a).
The obvious comment is that we have proved the random variables S and Ȳ
cannot be independent. The required multiplication property for independent events
is clearly violated. For independence of two random variables, any event described
in terms of the first must be independent of any event described in terms of the
second. (The positive correlation in Example 4.6 provides additional evidence of
association.) c

4.21 Figure 4.6 (p101) has prominent “horns.” We first noticed such horns
on (Ȳ , S) plots when working with uniformly distributed data, for which the
horns are not so distinct. With code similar to that of Example 4.7 but simu-
lated samples of size n = 5 from UNIF(0, 1) = BETA(1, 1), make several plots
of S against Ȳ with m = 10 000 points. On most plots, you should see a few
“straggling” points running outward near the top of the plot. The question
is whether they are real or just an artifact of simulation. (That is, are they
“signal or noise”?) A clue is that the stragglers are often in the same places
on each plot. Next try m = 20 000, 50 000, and 100 000. For what value of m
does it first become obvious to you that the horns are real?
d Many people say 50 000. For a full answer, describe and discuss. c

4.22 If observations Y1 , Y2 , . . . , Y5 are a random sample from BETA(α, β),


which takes values only in (0, 1), then the data fall inside the 5-dimensional
unit hypercube, which has 2^5 = 32 vertices. Especially if we have parameters
α, β < 1/2, a large proportion of data points will fall near the vertices, edges,
and faces of the hypercube. The “horns” in the plots of Example 4.7 (and
Problem 4.21) are images of these vertices under the transformation from the
5-dimensional data space to the 2-dimensional space of (Ȳ , S).
a) Use the code of Example 4.7 to make a plot similar to Figure 4.6 (p101),
but with m = 100 000 small dots. There are six horns in this plot, four at
the top and two at the bottom. Find their exact (Ȳ , S)-coordinates. (You
should also be able to discern images of some edges of the hypercube.)
d Horns occur when all five observations are either 0 or 1. The crucial issue is the
number k of observations that are 1, where k = 0, 1, . . . , 5. The associated values
of Ȳ are k/5. For 0–1 data, Σ Yi^2 = Σ Yi = k, so the corresponding values of S are
√((k − k^2/5)/4). In R the code k = 0:5; sqrt((k - k^2/5)/4) returns the values
0, 0.4472, 0.5477, 0.5477, 0.4472, and 0 (rounded to four places); the first and last
correspond to the two horns at the bottom. You can use the R code with points to
plot heavy dots at the cusps of the horns. c
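For instance, assuming the part (a) plot is the active graphics device, the cusps
might be marked like this (our code):

k = 0:5
points(k/5, sqrt((k - k^2/5)/4), pch=19, cex=1.2)   # heavy dots at the six horns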

b) The horn at the lower left in the figure of part (a) is the image of one
vertex of the hypercube, (0, 0, 0, 0, 0). The horn at the lower right is the
image of (1, 1, 1, 1, 1). They account for two of the 32 vertices. Each of the
remaining horns is the image of multiple vertices. For each horn, say how
many vertices get mapped onto it, its “multiplicity.”
d The multiplicities are given by the binomial coefficients that describe how many
1s are selected out of 5 possibilities. Multiplicities that correspond to the vertices
in part (a) are 1, 5, 10, 10, 5, and 1, respectively. Multiplicities explain why some
horns are more distinctly defined in the plots. c

c) Now make a plot with n = 10 and m = 100 000. In addition to the two
horns at the bottom, how many do you see along the top? Explain why
the topmost horn has multiplicity C(10, 5) = 252.

d Eight, making sort of a scalloped edge. c


4.23 Begin with the paired differences, Pair.Diff, of Example 4.8.
a) Use mean(Pair.Diff) to compute d̄, sd(Pair.Diff) to compute Sd ,
length(Pair.Diff) to verify n = 33, and qt(.975, 32) to find t∗ . Thus
verify that the 95% t confidence interval is (10.3, 21.6), providing two-place
accuracy. Compare your interval with results from t.test(Pair.Diff).
Risk = c(38, 23, 41, 18, 37, 36, 23, 62, 31, 34, 24,
14, 21, 17, 16, 20, 15, 10, 45, 39, 22, 35,
49, 48, 44, 35, 43, 39, 34, 13, 73, 25, 27)

Ctrl = c(16, 18, 18, 24, 19, 11, 10, 15, 16, 18, 18,
13, 19, 10, 16, 16, 24, 13, 9, 14, 21, 19,
7, 18, 19, 12, 11, 22, 25, 16, 13, 11, 13)

Pair.Diff = Risk - Ctrl


d.mean = mean(Pair.Diff); d.sd = sd(Pair.Diff)
n = length(Pair.Diff)
t.star = qt(c(.025, .975), n-1)
ci = d.mean + t.star*d.sd/sqrt(n)
d.mean; d.sd; n; t.star; ci

> d.mean; d.sd; n; t.star; ci


[1] 15.96970
[1] 15.86365
[1] 33
[1] -2.036933 2.036933
[1] 10.34469 21.59470 # 95% CI: rounded in part (a)

> t.test(Pair.Diff)

One Sample t-test

data: Pair.Diff
t = 5.783, df = 32, p-value = 2.036e-06
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
10.34469 21.59470 # 95% CI: agrees with above
sample estimates:
mean of x
15.96970

b) Modify the R code of the example to make a 99% nonparametric bootstrap


confidence interval for the population mean difference µ. Compare with
the 99% t confidence interval (see the Notes).
# assumes vector Pair.Diff from part (a) still available in R
set.seed(1492)
n = length(Pair.Diff) # number of data pairs
d.bar = mean(Pair.Diff) # observed mean of diff’s
B = 10000 # number of resamples
re.x = sample(Pair.Diff, B*n, repl=T)
RDTA = matrix(re.x, nrow=B) # B x n matrix of resamples
re.mean = rowMeans(RDTA) # vector of B ‘d-bar-star’s
hist(re.mean, prob=T) # hist. of bootstrap dist.
bci = quantile(re.mean, c(.005,.995)) # simple bootstrap 99% CI
alt.bci = 2*d.bar - bci[2:1] # bootstrap percentile 99% CI
bci; alt.bci

> bci; alt.bci


0.5% 99.5%
9.030152 23.242879
99.5% 0.5%
8.696515 22.909242

> t.test(Pair.Diff, conf.level=.99)

One Sample t-test

data: Pair.Diff
t = 5.783, df = 32, p-value = 2.036e-06
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval: # reasonably close to bootstrap
8.407362 23.532031 # percentile 99% CI above
sample estimates:
mean of x
15.96970

c) Use qqnorm(Pair.Diff) to make a normal probability plot of the differ-
ences. Then use the lines function to overlay the line y = (d − d̄)/Sd on
your plot. The result should be similar to Figure 4.11, except that this
figure also has normal probability plots (lighter dots) from 20 samples of
size n = 33 from NORM(d̄, Sd) to give a rough indication of the deviation
of truly normal data from a straight line.
d The default style for qqnorm is to put the data on the vertical axis. So, after a couple of
steps of easy algebra, the desired line has the sample mean as its vertical intercept
and the sample standard deviation as its slope. The function abline overlays this
line onto the quantile plot.
Although the data seem to fall somewhat closer to a curve than to a straight
line, the Shapiro-Wilk test fails to reject normality (see the Note). With only n = 33
observations, substantial departure from normality would ordinarily be required in
order to reject. c
# assumes vector Pair.Diff from part (a) still available in R
# also uses d.mean, d.sd, and n from part (a)

qqnorm(Pair.Diff)
abline(a=d.mean, b=d.sd, lwd=2, col="red")

> shapiro.test(Pair.Diff)

Shapiro-Wilk normality test

data: Pair.Diff
W = 0.9574, p-value = 0.2183
Notes: (b) Use parameter conf.level=.99 in t.test. Approximate t interval:
(8.4, 23.5). (c) Although the normal probability plot of Pair.Diff seems to fit a
curve better than a straight line, evidence against normality is not strong. For ex-
ample, the Shapiro-Wilk test fails to reject normality: shapiro.test(Pair.Diff)
returns a p-value of 0.22.
4.24 Student heights. In a study of the heights of young men, 41 students at
a boarding school were used as subjects. Each student’s height was measured
(in millimeters) in the morning and in the evening, see Majundar and Rao
(1958). Every student was taller in the morning. Other studies have found
a similar decrease in height during the day; a likely explanation is shrinkage
along the spine from compression of the cartilage between vertebrae. The 41
differences between morning and evening heights are displayed in the R code
below.
d The data have been moved to the program in part (a), and they are also used in
part (b). The normal probability plot (not shown here) gives the impression that
the data are very nearly normal. c
a) Make a normal probability plot of these differences with qqnorm(dh) and
comment on whether the data appear to be normal. We wish to have

an interval estimate of the mean shrinkage in height µ in the population


from which these 41 students might be viewed as a random sample. Com-
pare the 95% t confidence interval with the 95% nonparametric bootstrap
confidence interval, obtained as in Example 4.8 (but using dh instead of
Pair.Diff).
dh = c(8.50, 9.75, 9.75, 6.00, 4.00, 10.75, 9.25, 13.25, 10.50,
12.00, 11.25, 14.50, 12.75, 9.25, 11.00, 11.00, 8.75, 5.75,
9.25, 11.50, 11.75, 7.75, 7.25, 10.75, 7.00, 8.00, 13.75,
5.50, 8.25, 8.75, 10.25, 12.50, 4.50, 10.75, 6.75, 13.25,
14.75, 9.00, 6.25, 11.75, 6.25)

set.seed(1776)
n = length(dh) # number of data pairs
d.bar = mean(dh) # observed mean of diff’s
B = 10000 # number of resamples
re.x = sample(dh, B*n, repl=T)
RDTA = matrix(re.x, nrow=B) # B x n matrix of resamples
re.mean = rowMeans(RDTA) # vector of B ‘d-bar-star’s

hist(re.mean, prob=T) # hist. of bootstrap dist.


bci = quantile(re.mean, c(.025,.975)) # simple bootstrap 95% CI
alt.bci = 2*d.bar - bci[2:1] # bootstrap percentile 95% CI
bci; alt.bci

> bci; alt.bci


2.5% 97.5% # Because height differences are
8.768293 10.426829 # very nearly symmetrical,
97.5% 2.5% # the two types of bootstrap
8.768293 10.426829 # CIs happen to be identical.

> mean(dh) + qt(c(.025,.975), n-1)*sd(dh)/sqrt(n) # 95% t CI


[1] 8.734251 10.460871 # good agreement with bootstrap CIs

b) Assuming the data are normal, we illustrate how to find a parametric


bootstrap confidence interval. First estimate the parameters: the pop-
ulation mean µ by d̄ and the population standard deviation σ by Sd . Then
take B resamples of size n = 41 from the distribution NORM(d̄, Sd), and
find the mean of each resample. Finally, find confidence intervals as in the
nonparametric case. Here is the R code.
d The parametric bootstrap (assuming normality) is computed below. A handmade
table summarizes the results from all parts of this problem. Bootstrap results may
differ slightly from one run to the next.
In this case, where it is reasonable to assume normal data, the traditional method
based on the sample standard deviation and using the chi-squared distribution is
satisfactory. Therefore, it is not necessary to make bootstrap confidence intervals.
See the Notes. c

B = 10000; n = length(dh)

# Parameter estimates
dh.bar = mean(dh); sd.dh = sd(dh)

# Resampling
re.x = rnorm(B*n, dh.bar, sd.dh)
RDTA = matrix(re.x, nrow=B)

# Results
re.mean = rowMeans(RDTA)
hist(re.mean)
bci = quantile(re.mean, c(.025, .975)); bci
2*dh.bar - bci[2:1]

> bci = quantile(re.mean, c(.025, .975)); bci


2.5% 97.5%
8.75582 10.45207
> 2*dh.bar - bci[2:1]
97.5% 2.5%
8.743056 10.439301

SUMMARY
Interval Part Method
-----------------------------------------------------
(8.73, 10.46) (a) Traditional CI from CHISQ(40)
(8.77, 10.43) (a) Both nonparametric bootstraps
(8.76, 10.45) (b) Parametric bootstrap (simple)
(8.74, 10.44) (b) Parametric bootstrap (percentile)

Notes: (a) Nearly normal data, so this illustrates how closely the bootstrap procedure
agrees with the t procedure when we know the latter is appropriate. The t interval is
(8.7, 10.5); in your answer, provide two decimal places. (b) This is a “toy” example
because T = n^(1/2) (d̄ − µ)/Sd ∼ T(n − 1) and (n − 1)Sd^2/σ^2 ∼ CHISQ(n − 1) provide
useful confidence intervals for µ and σ without the need to do a parametric bootstrap.
(See Rao (1989) and Trumbo (2002) for traditional analyses and data, and see
Problem 4.27 for another example of the parametric bootstrap.)

4.25 Exponential data. Consider n = 50 observations generated below from


an exponential population with mean µ = 10. (Be sure to use the seed shown.)
set.seed(1); x = round(rexp(50, 1/10), 2); x

> x
[1] 7.55 11.82 1.46 1.40 4.36 28.95 12.30 5.40 9.57 1.47
[11] 13.91 7.62 12.38 44.24 10.55 10.35 18.76 6.55 3.37 5.88
[21] 23.65 6.42 2.94 5.66 1.06 0.59 5.79 39.59 11.73 9.97
[31] 14.35 0.37 3.24 13.20 2.04 10.23 3.02 7.25 7.52 2.35
[41] 10.80 10.28 12.92 12.53 5.55 3.01 12.93 9.95 5.14 20.08

a) For exponential data X1 , . . . , Xn with mean µ (rate λ = 1/µ), it can


be shown that X̄/µ ∼ GAMMA(n, n). Use R to find L and U with
P {L ≤ X̄/µ ≤ U } = 0.95 and hence find an exact formula for a 95%
confidence interval for µ based on data known to come from an exponen-
tial distribution. Compute this interval for the data given above.
d Given the distribution of X̄, we have:

P {L ≤ X̄/µ ≤ U } = P {1/U ≤ µ/X̄ ≤ 1/L} = P {X̄/U ≤ µ ≤ X̄/L},

where U and L, as specified, can be evaluated for a sample of size n = 50 by


qgamma(c(.975, .025), 50, 50), which returns the respective values U = 1.2956
and L = 0.74222, upon rounding to four places.
In the R code below, we begin by evaluating an exact 95% confidence interval
for µ based on this derivation. Then we show a simulation that demonstrates the
claimed distribution of X̄. The last line performs a Kolmogorov-Smirnov goodness-
of-fit test to see how well the vector x.bar/mu fits the distribution GAMMA(50, 50).
The verdict is that the vector is consistent with that distribution (P-value much
larger than 0.05). c

set.seed(1); x = round(rexp(50, 1/10), 2) # re-generate the 50 obs.


mean(x)/qgamma(c(.975, .025), 50, 50) # exact 95% CI

> mean(x)/qgamma(c(.975, .025), 50, 50) # exact 95% CI


[1] 7.595638 13.258885

set.seed(1215); m = 5000; n = 50; mu = 10


DTA = matrix(rexp(m*n, 1/mu), nrow=m)
x.bar = rowMeans(DTA)
ks.test(x.bar/mu, pgamma, 50, 50)

> ks.test(x.bar/mu, pgamma, 50, 50)

One-sample Kolmogorov-Smirnov test

data: x.bar/mu
D = 0.0095, p-value = 0.7596
alternative hypothesis: two.sided

b) As an illustration, even though we know the data are not normal, find the
t confidence interval for µ.
set.seed(1); x = round(rexp(50, 1/10), 2) # re-generate the 50 obs.
mean(x) + qt(c(.025,.975), 49)*sd(x)/sqrt(50) # 95% t CI

> mean(x) + qt(c(.025,.975), 49)*sd(x)/sqrt(50) # 95% t CI


[1] 7.306565 12.375435 # remarkably close to exact 95% CI

c) Set a fresh seed. Then replace Pair.Diff by x in the code of Example 4.8
to find a 95% nonparametric bootstrap confidence interval for µ.

set.seed(1) # seed to re-generate data


x = round(rexp(50, 1/10), 2) # re-generate the 50 obs.

set.seed(1066) # seed for bootstrap


n = length(x) # number of data pairs
d.bar = mean(x) # observed mean of diff’s
B = 10000 # number of resamples
re.x = sample(x, B*n, repl=T)
RDTA = matrix(re.x, nrow=B) # B x n matrix of resamples
re.mean = rowMeans(RDTA) # vector of B ‘d-bar-star’s

hist(re.mean, prob=T) # hist. of bootstrap dist.


bci = quantile(re.mean, c(.025,.975)) # simple bootstrap 95% CI
alt.bci = 2*d.bar - bci[2:1] # bootstrap percentile 95% CI
bci; alt.bci

> bci; alt.bci # Because x is right-skewed, the two


2.5% 97.5% # bootstrap methods differ substantially.
7.51392 12.50383
97.5% 2.5% # The latter is often preferred, but
7.178175 12.168080 # the former is closer to the exact CI.

7.595638 13.258885 # Exact CI from part (a) for comparison.

d) Does a normal probability plot clearly show the data are not normal? The
Shapiro-Wilk test is a popular test of normality. A small p-value indicates
nonnormal data. In R, run shapiro.test(x) and comment on the result.
d The normal probability plot (not shown) clearly indicates these exponential data
are not normal, and the Shapiro-Wilk test decisively rejects normality.

set.seed(1)
x = round(rexp(50, 1/10), 2) # re-generate the 50 obs.
shapiro.test(x)

> shapiro.test(x)

Shapiro-Wilk normality test

data: x
W = 0.789, p-value = 4.937e-07

It seems worth commenting that the t interval does a serviceable job even
for these strongly right-skewed exponential data. Because of the Central Limit
Theorem, the mean of 50 exponential observations is “becoming” normal.
Specifically, X̄ has a gamma distribution with shape parameter 50, which is
still skewed to the right, but more nearly symmetrical than the exponential
distribution. c
Answers: (a) (7.6, 13.3), (b) (7.3, 12.4). (c) On one run: (7.3, 12.3).

4.26 Coverage probability of a nonparametric bootstrap confidence interval.


Suppose we have a sample of size n = 50 from a normal population. We
wonder whether an alleged 95% nonparametric bootstrap confidence interval
really has nearly 95% coverage probability. Without loss of generality, we
consider m = 1000 such random samples from NORM(0, 1), find the bootstrap
confidence interval (based on B = 1000) for each, and determine whether it
covers the true mean 0. (In this case, the interval covers the true mean if its
endpoints have opposite sign, in which case the product of the endpoints is
negative.) The fraction of the m = 1000 nonparametric bootstrap confidence
intervals that cover 0 is an estimate of the coverage probability.
A suitable R program is given below. We chose relatively small values of
m and B and simple bootstrap confidence intervals. The program has a com-
putationally intensive loop and so it runs rather slowly with larger numbers
of iterations. Do not expect a high-precision answer because with m = 1000
the final step alone has a margin of error of about 1.4%. Report results from
three runs. Increase the values of m and B for improved accuracy, if you have
some patience or a fast computer.
d We used 100 000 iterations (over a coffee break) to obtain a better estimate of
the true coverage than is possible with only 1000 iterations. For n = 50 normal
observations, bootstrap confidence intervals targeted at 95% coverage tend to be a
little shorter than t intervals, and thus to have only about 94% coverage. c

set.seed(1789)
m = 100000; cover = numeric(m); B = 1000; n = 50
for (i in 1:m)
{
x = rnorm(n) # simulate a sample
re.x = sample(x, B*n, repl=T) # resample from it
RDTA = matrix(re.x, nrow=B)
re.mean = rowMeans(RDTA)
cover[i] = prod(quantile(re.mean, c(.025,.975)))
# does bootstrap CI cover?
}
mean(cover < 0)

> mean(cover < 0)


[1] 0.93994

4.27 Mark-recapture estimate of population size. To estimate the number τ


of fish of a certain kind that live in a lake, we first take a random sample of c
such fish, mark them with red tags, and return them to the lake. Later, after
the marked fish have had time to disperse randomly throughout the lake but
before there have been births, deaths, immigration, or emigration, we take a
second random sample of n fish and note the number X of them that have
red tags.

a) Argue that a reasonable estimate of τ is τ̂ = ⌊cn/X⌋, where ⌊ ⌋ indicates
the “largest included integer” or “floor” function. If c = 900, n = 1100,
and X = 95, then evaluate τ̂ .
d The Notes suggest one argument: c/τ is the fraction of tagged fish in the lake,
and X/n is the fraction of tagged fish in the recaptured sample. Equating these two
fractions, we get cn/X as an estimate of τ (when rounded or truncated to an integer).
For the data given, τ̂ = ⌊900(1100)/95⌋ = ⌊10 421.05⌋ = 10 421.
This is an example of the Method of Moments for finding an estimator (often called
MME). The Maximum Likelihood Estimate (MLE), discussed in part (c), turns out
to be essentially the same. c
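In R (a one-line check using the data of this problem):

c = 900; n = 1100; x = 95        # data from parts (a) and (c)
tau.hat = floor(c*n/x); tau.hat  # method of moments estimate

> tau.hat
[1] 10421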

b) For known values of τ , c, and n, explain why P {X = x} = C(c, x)C(τ − c, n − x)/C(τ, n),
for x = 0, . . . , n, where C(a, b) denotes the binomial coefficient “a choose b,” defined
as 0 if integer b ≥ 0 exceeds a. We
say that X has a hypergeometric distribution. Suppose τ = 10 000 total
fish, c = 900 tagged fish, τ − c = 9100 untagged fish, and n = 1100 fish
in the second sample. Then, in R, use dhyper(95, 900, 9100, 1100) to
evaluate P {X = 95}.
d Upon recapture we are sampling n fish from among τ without replacement. The
total number of ways to do this is the denominator of the probability. The population
consists of c tagged fish, of which we choose x; and also τ − c untagged fish, of
which we choose n − x. This accounts for the product of two binomial coefficients
in the numerator. In R the function dhyper(95, 900, 9100, 1100) returns 0.0409
(rounded to four places). c

c) Now, with c = 900 and n = 1100, suppose we observe X = 95. For


what value of τ is P {X = 95} maximized? This value is the maximum
likelihood estimate of τ . Explain why the following code evaluates this
estimate. Compare your result with the value of τ̂ in part (a).
tau = 7000:15000; like = dhyper(95, 900, tau-900, 1100)
mle = tau[like==max(like)]; mle
plot(tau, like, type="l"); abline(v=mle, lty="dashed") # not shown

> mle = tau[like==max(like)]; mle


[1] 10421

d) The R code below makes a parametric bootstrap confidence interval


for τ . For c, n, and X as in parts (a) and (c), we have the estimate
τ̂ = 10 421 of the parameter τ . We resample B = 10 000 values of X based
on the known values of c and n and this estimate τ̂ . From each resam-
pled X, we reestimate τ . This gives a bootstrap distribution consisting of
B estimates of τ , from which we obtain a confidence interval.
set.seed(1935)
# Data
c = 900; n = 1100; x = 95

# Estimated population size


tau.hat = floor(c*n/x)

# Resample using estimate


B = 10000
re.tau = floor(c*n/rhyper(B, c, tau.hat-c, n))

# Histogram and bootstrap confidence intervals


hist(re.tau)
bci = quantile(re.tau, c(.025,.975)); bci # simple bootstrap
2*tau.hat - bci[2:1] # percentile method

> bci = quantile(re.tau, c(.025,.975)); bci # simple bootstrap


2.5% 97.5%
8761 12692
> 2*tau.hat - bci[2:1] # percentile method
97.5% 2.5%
8150 12081

d For more on mark-recapture estimation of population size, see Problem 8.11. c


Notes: (a) How does each of the following fractions express the proportion of marked
fish in the lake: c/τ and X/n? (d) Roughly (8150, 12 000) from the bootstrap per-
centile method, which we prefer here because of the skewness; report your seed and
exact result. This is a “toy” example of the parametric bootstrap because stan-
dard methods of finding a confidence interval for τ require less computation and are
often satisfactory. There is much literature on mark-recapture (also called capture-
recapture) methods. For one elementary discussion, see Feller (1957).

Errors in Chapter 4
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p116 Problem 4.26. In the third line inside the loop of the program: The right paren-
thesis should immediately follow repl=T, not the comment. The correct line
reads:

re.x = sample(x, B*n, repl=T) # resample from it



5 Screening Tests

5.1 In a newspaper trivia column, L. M. Boyd (1999) ponders why lie de-
tector results are not admissible in court. His answer is that “lie detector tests
pass 10 percent of the liars and fail 20 percent of the truth-tellers.” If you use
these percentages and take {D = 1} to mean being deceitful and {T = 1}
to mean failing the test, what are the numerical values of the sensitivity and
specificity for such a lie detector test? (Continued in Problem 5.12.)
d “Pass 10 percent of the liars” means P {T = 0|D = 1} = 1 − η = 0.1, so sensitivity
η = 0.9. “Fail 20 percent of the truth-tellers” means P {T = 1|D = 0} = 1 − θ = 0.2,
so specificity θ = 0.8. c

5.2 In a discussion of security issues, Charles C. Mann (2002) considers the


use of face-recognition software to identify terrorists at an airport terminal:
[One of the largest companies marketing face-recognition technology]
contends that...[its] software has a success rate of 99.32 percent—that
is, when the software matches a passenger’s face with a face on a
list of terrorists, it is mistaken only 0.68 percent of the time. Assume
for the moment that this claim is credible; assume, too, that good
pictures of suspected terrorists are readily available. About 25 million
passengers used Boston’s Logan Airport in 2001. Had face-recognition
software been used on 25 million faces, it would have wrongly picked
out just 0.68 percent of them–but that would have been enough...to
flag as many as 170,000 innocent people as terrorists. With almost 500
false alarms a day, the face-recognition system would quickly become
something to ignore.
Interpret the quantities η, θ, and π of Section 5.1 in terms of this situation.
As far as possible, say approximately what numerical values of these quantities
Mann seems to assume.
d To make any sense of this, it seems that we must begin with the nontrivial as-
sumptions that we know exactly what we mean by “terrorist,” that we have a list

of them all, and that the software can access pictures of exactly the people on that
list. We take D = 1 to mean that a scanned passenger is a terrorist. The preva-
lence π = P {D = 1} is the proportion of scanned passengers who are terrorists. We
suppose Mann believes π to be very small.
We take T = 1 to mean that the software flags a passenger as a terrorist. Then
the sensitivity η = P {T = 1|D = 1} is the conditional probability that a scanned
passenger is flagged as a terrorist, given that he or she is on the terrorist list. The
specificity of the face-recognition software θ = P {T = 0|D = 0} is the conditional
probability that someone not on the list will not be flagged as a terrorist.
In the first part of the quote, Mann seems to say P {D = 0|T = 1} = 0.0068,
which implies P {D = 1|T = 1} = 0.9932. But this “reversed” conditional probability
is not the sensitivity η. In Section 5.3, we refer to it as the predictive value of a
positive test: γ = P {D = 1|T = 1}.
In any case, sensitivity is a property of the face-recognition software. Without
the “cooperation” of the terrorists lining up to help test the software, it would seem
difficult for the company to know the sensitivity. By contrast, the company could
easily know the specificity by lining up people who are not terrorists and seeing how
many of them the software incorrectly flags as terrorists. We venture to speculate
that 1 − θ = P {T = 1|D = 0} may be what the company is touting as its low
“mistaken” rate of 0.68%.
Later in the quote, Mann focuses on 170 000 out of a population of 25 000 000
passengers in 2001 or 0.68% of passengers that would be “wrongly picked out” as
being on the list (and 170 000/365 ≈ 466, which we suppose to be his “almost 500
false alarms”). The probability of a false positive (wrong identification or a false
alarm) ought to be P {T = 1|D = 0} = 1 − θ = 0.0068, which matches our speculation
of what Mann really meant in his first statement.
In our experience, people who have not studied probability frequently confuse
the three probabilities P {D = 0|T = 1}, P {T = 1|D = 0}, and P {D = 0, T = 1}.
Respectively, each of these is a fraction of a different population: all passengers
tagged as terrorists, all passengers who are not terrorists, and all passengers. What
we expect of the student in this problem is not to confuse η with γ.
Considering the current state of face-recognition software, we agree with Mann
that using it at an airport to detect terrorists would be a challenge. Retaining all the
dubious assumptions at the beginning of this answer, let’s speculate in addition that
π = 0.00001 (one in 100 000) of airport passengers were terrorists, and that sensitivity
η and specificity θ were both very large, say η = θ = 0.9932. Then using formulas
in Section 5.3:
τ = P {T = 1} = πη + (1 − π)(1 − θ) ≈ 0.0068 = 0.68%
and
γ = πη/[πη + (1 − π)(1 − θ)] ≈ 0.00146 = 0.146%.
Therefore, on the one hand, out of Mann’s hypothetical n = 68 400 passengers
a day, there would indeed be about nτ = 466 alarms, almost all of them false. And
decreasing the prevalence π won’t change the number 466 by much.
On the other hand, we’re pretending that the proportion of terrorists in the entire
population is one in 100 000. But on average, among every group of 1/0.00146 ≈ 685
passengers flagged by the software, about one is a terrorist. Based on these numerical
assumptions, the flagged passengers would be a pretty scary bunch, worth some
individual attention. c
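For a quick check of this arithmetic in R, using the speculative values of π, η, and θ
assumed above:

pp = 0.00001; eta = 0.9932; theta = 0.9932   # speculative values from above
tau = pp*eta + (1 - pp)*(1 - theta)          # P{T = 1}
gamma = pp*eta/tau                           # PVP
n = 68400                                    # roughly 25 million/365 passengers a day
round(c(tau, gamma, n*tau, 1/gamma), 5)      # about 0.0068, 0.0015, 466, 685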

5.3 Consider a bogus test for a virus that always gives positive results,
regardless of whether the virus is present or not. What is its sensitivity?
What is its specificity? In describing the usefulness of a screening test, why
might it be misleading to say how “accurate” it is by stating its sensitivity
but not its specificity?
d Sensitivity η = 1; specificity θ = 0. One could argue that the term “accurate” is too
vague to be useful in discussing screening tests. An ideal test will have high values
of both η and θ. As we see in Section 5.3, in a particular population of interest, it
would also be desirable to have high values of the predictive values γ and δ. c

5.4 Suppose that a medical screening test for a particular disease yields
a continuum of numerical values. On this scale, the usual practice is to take
values less than 50 as a negative indication for having the disease {T = 0}, and
to take values greater than 56 as positive indications {T = 1}. The borderline
values between 50 and 56 are usually also read as positive, and this practice
is reflected in the published sensitivity and specificity values of the test. If
the borderline values were read as negative, would the sensitivity increase or
decrease? Explain your answer briefly.
d Sensitivity η = P {T = 1|D = 1} would decrease because fewer outcomes are now
counted as positive. By contrast, the specificity θ = P {T = 0|D = 0} would increase,
and for the same reason. Whether η or θ has the bigger change depends on how likely
the patients getting scores between 50 and 56 are to have the disease. c

5.5 Many criteria are possible for choosing the “best” (η, θ)-pair from an
ROC plot. In Example 5.1, we mentioned the pair with η = θ. Many references
vaguely suggest picking a pair “close to” the upper-left corner of the plot. Two
ways to quantify this are to pick the pair on the curve that maximizes the
Youden index η + θ or the pair that maximizes η^2 + θ^2.
a) As shown below, modify the line of the program in Example 5.1 that
prints numerical results. Use the expanded output to find the (η, θ)-pair
that satisfies each of these maximization criteria.
cbind(x, eta, theta, eta + theta, eta^2 + theta^2)
x = seq(40,80,1)
eta = 1 - pnorm(x, 70, 15); theta = pnorm(x, 50, 10)
show = (x >= 54) & (x <= 65)
youden = eta + theta; ssq = eta^2 + theta^2
round(cbind(x, eta, theta, youden, ssq)[show,], 4)

> round(cbind(x, eta, theta, youden, ssq)[show,], 4)


x eta theta youden ssq
[1,] 54 0.8569 0.6554 1.5124 1.1639
[2,] 55 0.8413 0.6915 1.5328 1.1860
[3,] 56 0.8247 0.7257 1.5504 1.2068
[4,] 57 0.8069 0.7580 1.5650 1.2258
[5,] 58 0.7881 0.7881 1.5763 1.2423 # equal

[6,] 59 0.7683 0.8159 1.5843 1.2561


[7,] 60 0.7475 0.8413 1.5889 1.2666
[8,] 61 0.7257 0.8643 1.5901 1.2738 # max Youden
[9,] 62 0.7031 0.8849 1.5880 1.2774
[10,] 63 0.6796 0.9032 1.5828 1.2777 # max SSQ
[11,] 64 0.6554 0.9192 1.5747 1.2746
[12,] 65 0.6306 0.9332 1.5638 1.2685

b) Provide a specific geometrical interpretation of each of the three criteria:


η = θ, maximize η + θ, and maximize η^2 + θ^2. (Consider lines, circles, and
tangents.)
d The criterion η = θ is satisfied by taking (the point nearest) the intersection of
the line between (0, 1) and (1, 0) and the ROC curve, as shown in Figure 5.1 (p123)
of the text (and by the green line in the code below). For the usual case where the
ROC curve is convex, the Youden criterion is satisfied by choosing the line of unit
slope that is tangent to the ROC curve and using the point of tangency. Similarly,
the third criterion is satisfied by choosing the circle with center at (1, 0) that is tangent
to the ROC curve and using the point of tangency. Run the code below to see a
plot that illustrates the tangent line and circle corresponding to the ROC plot of
Figure 5.1. (We do not know the equation of the curve, so the line and circle are
constrained to hit a plotting point.) c

x = seq(40,80,1)
eta = 1 - pnorm(x, 70, 15); theta = pnorm(x, 50, 10)

plot(1-theta, eta, xlim=c(0,.5), ylim=c(.5,1), pch=20,


xlab=expression(1 - theta), ylab=expression(eta))

abline(a = 1, b = -1, col="darkgreen")


ye = eta[eta==theta]; xe = (1-theta)[eta==theta]
lines(c(0, xe, xe), c(ye, ye, 0), col="darkgreen", lty="dotted")

youden = eta + theta; mx.y = max(youden)


yy = eta[youden==mx.y]; xx = (1-theta)[youden==mx.y]
abline(a = yy-xx, b = 1, col="red")
lines(c(0, xx, xx), c(yy, yy, 0), col="red", lty="dotted")

ssq = eta^2 + theta^2; mx.s = max(ssq)


ys = eta[ssq==mx.s]; xs = (1-theta)[ssq==mx.s]
x.plot = seq(0, 1, by = .001)
lines(1-x.plot, sqrt(mx.s - x.plot^2), col="blue")
lines(c(0, xs, xs), c(ys, ys, 0), col="blue", lty="dotted")

Notes: When the ROC curve is only roughly estimated from data (as in Problem 5.6),
it may make little practical difference which criterion is used. Also, if false-positive
results are much more (or less) consequential errors than false-negative ones, then
criteria different from any of these may be appropriate.

5.6 Empirical ROC. DeLong et al. (1985) investigate blood levels of creat-
enine (CREAT) in mg% and β2 microglobulin (B2M) in mg/l as indicators
of imminent rejection {D = 1} in kidney transplant patients. Based on data
from 55 patients, of whom 33 suffered episodes of rejection, DeLong and her
colleagues obtained the sensitivity data in Table 5.2 (p133 of the text).
For example, as a screening test for imminent rejection, we might take a
createnine level above 1.7 to be a positive test result. Then we would estimate
its sensitivity as η(1.7) = 24/33 = 0.727 because 24 patients who had a
rejection episode soon after the test had createnine levels above 1.7.
Similarly, consider a large number of instances in which the createnine test
was not soon followed by a rejection episode. Of these, 53.5% had levels at
most 1.7, so θ(1.7) ≈ 0.535. For a test that “sounds the alarm” more often, we
can use a cut-off level smaller than 1.7. Then we will “predict” more rejection
episodes, but we will also have more false alarms.
Use these data to make approximate ROC curves for both CREAT and
B2M. Put both sets of points on the same plot, using different symbols (or
colors) for each, and try to draw a smooth curve through each set of points
(imitating Figure 5.1). Compare your curves to determine whether it is worth-
while to use a test based on the more expensive B2M determinations. Would
you use CREAT or B2M? If false positives and false negatives were equally se-
rious, what cutoff value would you use? What if false negatives are somewhat
more serious? Defend your choices.
cre.sens = c(.939, .939, .909, .818, .758, .727, .636, .636, .545,
.485, .485, .394, .394, .364, .333, .333, .333, .303)
cre.spec = c(.123, .203, .281, .380, .461, .535, .649, .711, .766,
.773, .803, .811, .843, .870, .891, .894, .896, .909)
b2m.sens = c(.909, .909, .909, .909, .879, .879, .879, .879, .818,
.818, .818, .788, .788, .697, .636, .606, .576, .576)
b2m.spec = c(.067, .074, .084, .123, .149, .172, .215, .236, .288,
.359, .400, .429, .474, .512, .539, .596, .639, .676)

plot(1-cre.spec, cre.sens, pch=20, xlim=c(0,1), ylim=c(0,1),


xlab = expression(1 - theta), ylab = expression(eta),
main = "Estimated ROC Curve: CREAT (solid) and B2M")
points(1-b2m.spec, b2m.sens, col = "blue")
abline(a = 1, b = -1, col="darkgreen")

dist = abs(cre.sens - cre.spec)


cre.sens[dist==min(dist)]; cre.spec[dist==min(dist)]

> cre.sens[dist==min(dist)]; cre.spec[dist==min(dist)]


[1] 0.636
[1] 0.649

d The R code above makes a plot similar to Figure 5.5 for the CREAT data (solid
black dots), but also including points for an estimated ROC curve of the B2M data
(open blue circles). The curves are similar, except that the CREAT curve seems a

little closer to the upper-left corner of the plot. Therefore, if only one measurement
is to be used, the less expensive CREAT measurement seems preferable, providing
a screening test for transplant rejection with relatively higher values of η and θ (but
see the Notes).
If false positives and false negatives are equally serious, then we should pick a
point on the smoothed curve where η ≈ θ. For the CREAT ROC curve, this seems to
be somewhere in the vicinity of η ≈ θ ≈ 0.64 (see the output to the program above),
which means a createnine level near 1.8 mg% (see Table 5.2). The probability of a
false negative is P {T = 0|D = 1} = 1 − η (the probability that we do not detect
that a patient is about to reject). Making this probability smaller means making the
sensitivity η larger, which means moving upward on the ROC curve, and toward
a smaller createnine cut-off value for the test (see the sentence about “more false
alarms” in the question). c
Notes: Data can be coded as follows. [The code originally provided here has been
moved to the first four statements in the program above.] Use plot for the first set
of points (as shown in Figure 5.5), then points to overlay the second. In practice,
a combination of the two determinations, including their day-to-day changes, may
provide better predictions than either determination alone. See DeLong et al. for
an exploration of this possibility and also for a general discussion (with further
references) of a number of issues in diagnostic testing. The CREAT data also appear
in Pagano and Gauvreau (2000) along with the corresponding approximate ROC
curve.

5.7 Confidence intervals for prevalence.


a) Compute the 95% confidence interval for τ given in Section 5.2 on p123,
based on t = A/n = 0.049. Show how the corresponding 95% confidence
interval for π is obtained from this confidence interval.
d Below we use R to compute the standard confidence interval (0.0356, 0.0624) for τ
and the corresponding confidence interval (0.0059, 0.0337) for π. (For τ and π, the
program prints the lower confidence limit, followed by the point estimate, and then
the upper confidence limit.) The confidence interval for π is based on equation (5.4)
on p123 of the text. We show more decimal places, and hence have less rounding
error, than on p123. c

# standard CIs centered at point estimate


eta = .99; theta=.97; a = 49; n = 1000
t = a/n; pm = c(-1,0,1)
CI.tau = t + pm*1.96*sqrt(t*(1-t)/n); CI.tau
CI.pi = (CI.tau + theta - 1)/(eta + theta -1); CI.pi

> CI.tau = t + pm*1.96*sqrt(t*(1-t)/n); CI.tau


[1] 0.03562036 0.04900000 0.06237964
> CI.pi = (CI.tau + theta - 1)/(eta + theta -1); CI.pi
[1] 0.005854544 0.019791667 0.033728790

b) The traditional approximate confidence interval for a binomial probability


used in Section 5.2 can seriously overstate the confidence level. Especially
for samples of small or moderate size, the approximate confidence interval
suggested by Agresti and Coull (1998) is more accurate. Their procedure
is to “add two successes and two failures” when estimating the probability
of success. Here, this amounts to using t′ = (A + 2)/n′, where n′ = n + 4,
as the point estimate of τ and then computing the confidence interval
t′ ± 1.96 √(t′(1 − t′)/n′). Find the 95% Agresti-Coull confidence interval
for τ and from it the confidence interval for π given in Section 5.2. (See
Chapter 1 of the text for a discussion of the Agresti-Coull adjustment.)
d Similarly, we show the Agresti-Coull confidence interval for τ and the corresponding
confidence interval for π. c

# Agresti-Coull CIs centered at point est.; data from above


a.ac = a + 2; n.ac = n + 4; t.ac = a.ac/n.ac
CI.tau = t.ac + pm*1.96*sqrt(t.ac*(1-t.ac)/n.ac); CI.tau
CI.pi = (CI.tau + theta - 1)/(eta + theta -1); CI.pi

> CI.tau = t.ac + pm*1.96*sqrt(t.ac*(1-t.ac)/n.ac); CI.tau


[1] 0.03721408 0.05079681 0.06437954
> CI.pi = (CI.tau + theta - 1)/(eta + theta -1); CI.pi
[1] 0.00751467 0.02166335 0.03581202

5.8 Suppose that a screening test for a particular parasite in humans has
sensitivity 80% and specificity 70%.
a) In a sample of 100 from a population, we obtain 45 positive tests. Estimate
the prevalence.
d We use equation (5.4) on p123 of the text to find the estimate p of π:

p = (t + θ − 1)/(η + θ − 1) = (0.45 + 0.7 − 1)/(0.8 + 0.7 − 1) = 0.3,

where the estimate of τ is t = 45/100. c

b) In a sample of 70 from a different population, we obtain 62 positive tests.


Estimate the prevalence. How do you explain this result?
d Similarly, p = (62/70 + 0.7 − 1)/(0.8 + 0.7 − 1) = 1.17. This absurd estimate
lies outside the interval (0, 1) in which we know π must lie. The difficulty arises
because the sensitivity η = 0.8 would lead us to expect only 80% positive tests,
even if all members of the population were infected. But we happen to observe
t = 62/70 = 88.6% positive tests in our relatively small sample. c
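In R, with η = 0.8 and θ = 0.7 as given, both estimates come from equation (5.4):

eta = 0.8; theta = 0.7
(45/100 + theta - 1)/(eta + theta - 1)   # part (a): 0.3
(62/70 + theta - 1)/(eta + theta - 1)    # part (b): about 1.17, outside (0, 1)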

5.9 Consider the ELISA test of Example 5.2, and suppose that the preva-
lence of infection is π = 1% of the units of blood in a certain population.
a) What proportion of units of blood from this population tests positive?

d Recall that η = 0.99 and θ = 0.97. Then by equation (5.2) on p123, we have
τ = πη + (1 − π)(1 − θ) = 0.01(0.99) + 0.99(0.03) = 0.0396. c

b) Suppose that n = 250 units of blood are tested and that A of them yield
positive results. What values of t = A/n and of the integer A yield a
negative estimate of prevalence?
d In formula (5.4) for the estimate p of π, the denominator is η+θ−1 = 0.96 > 0, so p
is negative precisely when the numerator is negative. That is, when t = A/n < 0.03.
(Even if no units are infected, we expect from the specificity θ = 0.97 that 3% of
sampled units test positive.) So we require integer A < 0.03(250) = 7.5; that is, the
estimate p < 0 when A ≤ 7. c

c) Use parts (a) and (b) to find the proportion of random samples of size
250 from this population that yields negative estimates of prevalence.
d From part (a) we know that A ∼ BINOM(n = 250, τ = 0.0396), and from part (b)
we seek P {A ≤ 7}. The R code pbinom(7, 250, 0.0396) returns 0.2239. c
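The three parts can be checked with a few lines of R, using equation (5.2) and the
binomial model above:

pp = 0.01; eta = 0.99; theta = 0.97; n = 250
tau = pp*eta + (1 - pp)*(1 - theta); tau   # part (a): 0.0396
a = 0:n; max(a[(a/n + theta - 1) < 0])     # part (b): largest A giving a negative estimate, 7
pbinom(7, n, tau)                          # part (c): about 0.224
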
5.10 Write a program to make a figure similar to Figure 5.4 (p127).
What are the exact values of PVP γ and PVN δ when π = 0.05?
d The program is shown below, including some optional embellishments that put
numerical values on the plot (not shown here). The requested values of PVP γ and
PVN δ are shown at the end. c

pp = seq(0, 1, by=.001); eta = .99; theta = .97


tau = pp*eta + (1 - pp)*(1 - theta)
gamma = pp*eta/tau; delta = (1 - pp)* theta/(1 - tau)

plot(pp, gamma, type="l", ylim=0:1, xlim=0:1, lwd=3, col="blue",


ylab="Predictive Value", xlab="Prevalence",
main=paste("PVP (solid) and PVN: sensitivity", eta,
", specificity", theta))
prev.show = 0.05
text(.5,.4, paste("At prevalence =", prev.show))
text(.5,.35, paste("PVP =", round(gamma[pp==.05], 3)))
text(.5,.3, paste("PVN =", round(delta[pp==.05], 3)))
lines(pp, delta, lwd=2, lty="dashed")
abline(v=prev.show, lwd=3, col="darkred", lty="dotted")

# For prevalence .05:


gamma[pp==.05]; delta[pp==.05] # PVP and PVN

> gamma[pp==.05]; delta[pp==.05] # PVP and PVN


[1] 0.6346154
[1] 0.9994577

5.11 Suppose that a screening test for a particular disease is to be given


to all available members of a population. The goal is to detect the disease
early enough that a cure is still possible. This test is relatively cheap, conve-
nient, and safe. It has sensitivity 98% and specificity 96%. Suppose that the
prevalence of the disease in this population is 0.5%.
a) What proportion of those who test positive will actually have the disease?
Even though this value may seem quite low, notice that it is much greater
than 0.5%.
d We seek γ = P {D = 1|T = 1} = πη/τ , the predictive power of a positive test. From
the information given, τ = πη + (1 − π)(1 − θ) = 0.005(0.98) + 0.995(0.04) = 0.0447.
Thus γ = 0.005(0.98)/0.0447 = 0.1096. We say the prior probability of disease
(in the entire population) is the prevalence π = 0.5%. By contrast, the posterior
probability of disease (among those with positive tests) is the PVP γ = 10.96%. c

b) All of those who test positive will be subjected to more expensive, less
convenient (possibly even somewhat risky) diagnostic procedures to de-
termine whether or not they actually have the disease. What percentage
of the population will be subjected to these procedures?
d This percentage is τ = 0.0447, computed in part (a). c

c) The entire population can be viewed as split into four sets by the random
variables D and T : either of them may be 0 or 1. What proportion of
the entire population falls into each of these four sets? Suppose you could
change the sensitivity of the test to 99% with a consequent change in speci-
ficity to 94%. What factors of economics, patient risk, and preservation
of life would be involved in deciding whether to make this change?
d In terms of random variables D and T , the four probabilities (to full four-place
accuracy) are as follows, respectively: First,

P {D = 1, T = 1} = πη = 0.0049 and P {D = 0, T = 1} = (1 − π)(1 − θ) = 0.0398,

the two probabilities from above whose sum is τ . Then

P {D = 1, T = 0} = π(1 − η) = 0.0001 and P {D = 0, T = 0} = (1 − π)θ = 0.9552,

which add to 1 − τ = 0.9553. Also notice that the first probabilities in each display
above add to π = 0.005. Of course, all four probabilities add to 1.
We note that these four probabilities could also be expressed in terms of τ , γ,
and δ. (Note: In the first printing, we used “false positive” and similar terms to refer
to the four sets. Many authors reserve this terminology for conditional probabilities,
such as 1 − γ.)
On increasing the sensitivity to η = 99% and decreasing the specificity to θ = 94%:
This would increase the number of subjects testing positive. The disadvantage would
be that more people would undergo the expensive and perhaps risky diagnostic
procedure. Specifically, τ increases from 0.0447 to 0.0647. So the bill for doing the
diagnostic procedures would be 0.0647/0.0447 ≈ 1.45 times as large, a 45% increase.

The advantage would be that a few more people with the disease would be alerted,
possibly in time to be cured. Specifically, the fraction of the population denoted
by {D = 1, T = 1} would increase from 0.00490 to 0.00495. But the PVP would
actually decrease from 0.1096 to about 0.077. c
Note: (b) This is a small fraction of the population. It would have been prohibitively
expensive (and depending on risks, possibly even unethical) to perform the definitive
diagnostic procedures on the entire population. But the screening test permits focus
on a small subpopulation of people who are relatively likely to have the disease and
in which it may be feasible to perform the definitive diagnostic procedures.
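A short R calculation, using the sensitivity and specificity values stated in the problem,
reproduces the quantities of parts (a)-(c) for the current and proposed versions of the test:

pp = 0.005
eta = c(0.98, 0.99); theta = c(0.96, 0.94)   # current test, proposed test
tau = pp*eta + (1 - pp)*(1 - theta)          # proportion testing positive
gamma = pp*eta/tau                           # PVP
both = pp*eta                                # P{D = 1, T = 1}
round(rbind(tau, gamma, both), 4)            # tau rises to about 0.065; PVP falls to about 0.077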
5.12 Recall the lie detector test of Problem 5.1. In the population of in-
terest, suppose 5% of the people are liars.
a) What is the probability that a randomly chosen member of the population
will fail the test?
d From the answer to Problem 5.1, recall that sensitivity η = 0.9 and specificity
θ = 0.8. Here we evaluate

τ = P {T = 1} = πη + (1 − π)(1 − θ) = 0.05(0.9) + 0.95(0.2) = 0.045 + 0.190 = 0.235,

using equation (5.2) on p123. (We provide the notation requested in part (d) as we
go along.) c
b) What proportion of those who fail the test are really liars? What propor-
tion of those who fail the test are really truth-tellers?
d We require

γ = P {D = 1|T = 1} = πη/τ = 0.045/0.235 = 0.1915,

where the computation follows from equation (5.5) on p126. Of those who fail the
test, the proportion 1 − γ = P {D = 0|T = 1} = 1 − 0.1915 = 0.8085 will be falsely
accused of being liars. (By contrast, among all who take the test: the proportion
P {D = 0, T = 1} = (1 − π)(1 − θ) = 0.95(0.2) = 0.19 will be falsely accused, and
the proportion P {D = 1, T = 1} = πη = 0.045 rightly accused. As a check, notice
that τ = 0.19 + 0.045.)
Recall the original complaint, quoted in Problem 5.1, that lie detector tests pass
10% of liars. That deficiency, together with examples, such as the ones in the previous
paragraph, showing that the tests accuse relatively large numbers of truthful people,
makes judges reluctant to allow results of lie detector tests in the courtroom. c

c) What proportion of those who pass the test are really telling the truth?
d This proportion is

δ = P {D = 0|T = 0} = (1 − π)θ/(1 − τ ) = 0.95(0.8)/0.765 = 0.9935,

based on equation (5.6). c


d) Following the notation of this chapter, express the probabilities and pro-
portions in parts (a), (b), and (c) in terms of the appropriate Greek letters.
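A few lines of R tie parts (a)-(d) together, with π = 0.05, η = 0.9, and θ = 0.8:

pp = 0.05; eta = 0.9; theta = 0.8
tau = pp*eta + (1 - pp)*(1 - theta)    # part (a): P{T = 1} = 0.235
gamma = pp*eta/tau                     # part (b): P{D = 1|T = 1}, about 0.19
delta = (1 - pp)*theta/(1 - tau)       # part (c): P{D = 0|T = 0}, about 0.99
round(c(tau, gamma, delta), 4)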

5.13 In Example 5.3, a regulatory agency may be concerned with the values
of η and γ. Interpret these two conditional probabilities in terms of testing a
batch for potency. Extend the program in this example to obtain approximate
numerical values for η and γ.
d In the language of Example 5.3, η = P (F |B) and γ = P (B|F ). The program below
has been extended to evaluate all four conditional probabilities, including θ and δ.
(The spacing in the output has been fudged slightly for easy reading.) c

set.seed(1066)
n = 500000
mu.s = 110; sd.s = 5; cut.s = 100
sd.x = 1; cut.x = 101
s = rnorm(n, mu.s, sd.s)
x = rnorm(n, s, sd.x)

n.g = length(s[s > cut.s]) # number Good


n.p = length(x[x > cut.x]) # number Pass
n.gp = length(x[s > cut.s & x > cut.x]) # number Good & Pass
n.bf = n - n.p - n.g + n.gp # number Bad & Fail

pp = (n - n.g)/n # prev. pi = .022275


tau = (n - n.p)/n # tau = .03878
eta = n.bf/(n - n.g) # sensitivity
theta = n.gp/n.g # specificity
gamma = n.bf/(n - n.p) # PVP
delta = n.gp/n.p # PVN
round(c(pp, tau, eta, theta, gamma, delta), 3)

> round(c(pp, tau, eta, theta, gamma, delta), 3)


[1] 0.023 0.039 0.969 0.983 0.569 0.999

Note: For verification, the method of Problem 5.14 provides values accurate to at
least four decimal places.

5.14 The results of Example 5.3 can be obtained without simulation


through a combination of analytic and computational methods.

a) Express the conditional probabilities η, θ, γ, and δ in terms of π, τ , and


P (G ∩ P ) = P {S > 100, X > 101}.
d Let D and E be any two events. To abbreviate, we write D ∩ E = DE. Also denote
G^c = B and P^c = F , and notice that π = P (B) and τ = P (F ). We express P (BF )
in terms of π, τ , and P (GP ) as follows:

P (BF ) = P (G^c P^c ) = 1 − P (G ∪ P ) = 1 − [P (G) + P (P ) − P (GP )]
        = 1 − [(1 − π) + (1 − τ ) − P (GP )] = π + τ + P (GP ) − 1.

Then η = P (BF )/π, θ = P (GP )/(1 − π), γ = P (BF )/τ , and δ = P (GP )/(1 − τ ). c

b) Denote the density function of NORM(µ, σ) by ϕ( ·, µ, σ) and its CDF by


Φ( ·, µ, σ). Manipulate a double integral to show that
P (G ∩ P ) = ∫_100^∞ ϕ(s, 110, 5)[1 − Φ(101, s, 1)] ds.

d We show the double integral for P {S > 100, X > 101}:


∫_100^∞ ϕ(s, 110, 5) [∫_101^∞ ϕ(x, s, 1) dx] ds = ∫_100^∞ ∫_101^∞ ϕ(s, 110, 5) ϕ(x, s, 1) dx ds,

where we have expressed the conditional probability P {X > 101|S = s} = 1 − Φ(101, s, 1)
as the integral of the conditional density of X given S = s. In part (c) we evaluate this
probability as P (G ∩ P ) = 0.96044. Notice that it is not the same as
P {S > 100}P {E > 1} = (1 − Φ(−2))(1 − Φ(1)) = 0.1550, where E = X − S ∼ NORM(0, 1)
is the assay error of Example 5.3: the event {X > 101} is not the same as {E > 1},
and the events {S > 100} and {X > 101} are not independent. c

c) Write a program in R to evaluate P (G ∩ P ) by Riemann approximation


and to compute the four conditional probabilities of part (a). We suggest
including (and explaining) the following lines of code.
Compare your results for θ and δ with the simulated values shown in
Example 5.3. (If you did Problem 5.13, then also compare your results for
η and γ with the values simulated there.)

d As shown below, we append statements and annotations to the code suggested in


the problem to make the full program. For comparison, answers from Problem 5.13
are repeated at the end. Spacing of the output has been modified slightly for easier
comparison. c

mu.s = 110; sd.s = 5; cut.s = 100 # const: as in Example 5.3


cut.x = 101; sd.x = 1
s = seq(cut.s, mu.s + 5 * sd.s, .001) # tiny NORM area beyond 5 SD
int.len = mu.s + 5 * sd.s - cut.s # finite integration interval
integrand = dnorm(s, mu.s, sd.s) * (1 - pnorm(cut.x, s, sd.x))
pr.gp = int.len * mean(integrand) # see Example 3.1; approx. P(GP)
pp = pnorm(-2) # exact: as in Example 5.3
tau = pnorm(-9/sqrt(26)) # exact: as in Example 5.3

pr.bf = pp + tau + pr.gp - 1 # P(BF) from part (a)


eta = pr.bf/pp; theta = pr.gp/(1 - pp) # from part(a)
gamma = pr.bf/tau; delta = pr.gp/(1 - tau) # from part(a)

# Values from Riemann approximation of integral


round(c(pr.gp, pp, tau, eta, theta, gamma, delta) ,5)
> # Values from Riemann approximation of integral
> round(c(pr.gp, pp, tau, eta, theta, gamma, delta),5)
[1] 0.96044 0.02275 0.03878 0.96561 0.98280 0.56650 0.99919
# 0.023 0.039 0.969 0.983 0.569 0.999
# Last row above: Simulated values from Problem 5.13

5.15 In Example 5.3, change the rule for “passing inspection” as follows.
Each batch is assayed twice; if either of the two assays gives a result above
101, then the batch passes.
d A routine change in the program. No answers provided. c

a) Change the program of the example to simulate the new situation; some
useful R code is suggested below. What is the effect of this change on τ ,
θ, and γ?
x1 = rnorm(n,s,sd.x); x2 = rnorm(n,s,sd.x); x = pmax(x1, x2)

b) If you did Problem 5.13, then also compare the numerical values of η and γ
before and after the change in the inspection protocol.

5.16 In Example 5.4, suppose that Machine D is removed from service and
that Machine C is used to make 20% of the parts (without a change in its
error rate). What is the overall error rate now? If a defective part is selected
at random, what is the probability that it was made by Machine A?
d First, we show R code that can be used to get the results in Example 5.4. Names
of the vectors anticipate some of the terminology in Chapter 8.
Vector prior shows the proportions of all plastic parts made by each of the
four machines (component 1 for A, 2 for B, and so on). That is, if we went into
the warehouse and selected a part at random, the elements of this vector show the
probabilities that the bracket was made by each machine. (Suppose a tiny code
number molded into each bracket allows us to determine the machine that made
that bracket.)
The ith element of the vector like shows the likelihood that a bracket from the
ith machine is defective. The code computes the vector post. Of all of the defective
parts, the ith element of this vector is the fraction that is made by the ith machine.

prior = c(.4, .4, .15, .05) # prior probability distribution


like = c(.01, .01, .005, .03) # likelihoods of defectives

prod = prior * like


d = sum(prod) # prob. random part is defective
post = prod/d # posterior prob. distribution
d; round(rbind(prior, like, prod, post), 3)

> d; round(rbind(prior, like, prod, post), 3)


[1] 0.01025 # overall P{Defective}
[,1] [,2] [,3] [,4]
prior 0.400 0.400 0.150 0.050
like 0.010 0.010 0.005 0.030
prod 0.004 0.004 0.001 0.002
post 0.390 0.390 0.073 0.146

Now we modify the program to work the stated problem. Because Machine D
has been taken out of service, the 4-vectors above have become 3-vectors below.

prior = c(.4, .4, .20) # prior probability distribution


like = c(.01, .01, .005) # likelihoods of defectives
prod = prior * like
d = sum(prod) # prob. random part is defective
post = prod/d # posterior prob. distribution
d; round(rbind(prior, like, prod, post), 3)

> d; round(rbind(prior, like, prod, post), 3)


[1] 0.009
[,1] [,2] [,3]
prior 0.400 0.400 0.200
like 0.010 0.010 0.005
prod 0.004 0.004 0.001
post 0.444 0.444 0.111

We see that 0.9% of all parts are defective. That is, P (E) = 0.009. This is somewhat
fewer defectives than in the Example (using Machine D), where P (E) = 0.01025. Also, Machine A
makes over 44% of the defective brackets now (compared with 39% in the Example).
That is: we now have posterior probability P (A|E) = 0.444. c
5.17 There are three urns, identical in outward appearance. Two of them
each contain 3 red balls and 1 white ball. One of them contains 1 red ball and
3 white balls. One of the three urns is selected at random.
a) Neither you nor John has looked into the urn. On an intuitive “hunch,”
John is willing to make you an even-money bet that the urn selected has
one red ball. (You each put up $1 and then look into the urn. He gets
both dollars if the urn has exactly one red ball, otherwise you do.) Would
you take the bet? Explain briefly.
d Yes, unless you believe John has extraordinary powers or has somehow cheated.
P (Urn 3) = 1/3. There are two chances in three that you would win a dollar, and one
chance in three that you would lose a dollar; expected profit for you: 1(2/3) − 1(1/3)
or a third of a dollar. c

b) Consider the same situation as in (a), except that one ball has been chosen
at random from the urn selected, and that ball is white. The result of this
draw has provided both of you with some additional information. Would
you take the bet in this situation? Explain briefly.
d Denote by W the event that the first ball drawn from the urn selected is white,
and by Ui , for i = 1, 2, 3, the event that the ith urn was selected. We want to know
P (U3 |W ) = P (U3 ∩ W )/P (W ) = P (U3 )P (W |U3 )/P (W ). The total probability of
getting a white ball is
P (W ) = Σ_i P (Ui ∩ W ) = Σ_i P (Ui )P (W |Ui ) = (1/3)(1/4 + 1/4 + 3/4) = 5/12,

so P (U3 |W ) = (3/12)/(5/12) = 3/5. Taking John’s offer in this situation, your


expected “winnings” would be negative: 2/5 − 3/5 = −1/5. That is, you can expect
to lose a fifth of a dollar.

Of course, this is an application of Bayes’ Theorem in which the three urns are the
partitioning events. Some frequentist statisticians insist that the inverse conditional
probabilities from Bayes’ Theorem be called something other than probabilities (such
as proportions or fractions). For example, a conditional outcome such as “disease
given positive test” is either true or not. Possibly pending results of a gold standard
test, we will know which. But this particular subject (or unit of blood) has already
been chosen, and his or her (or its) infection status is difficult to discuss according
to a frequentist or long-run interpretation of probability.
However, in the simple gambling situation of this problem, when you are con-
sidering how to bet, you are entitled to your own personal probability for each urn.
Before getting the information about drawing a white ball, a reasonable person would
take John’s offer. After that information is available, a reasonable person would not.
The Bayesian approach to probabilities, formally introduced in Chapter 8, is often
used to model such personal opinions. c
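The same prior/likelihood/posterior layout used in the programs for Problems 5.16
and 5.18 applies here, with urns in place of machines:

prior = rep(1/3, 3)        # each urn equally likely to be selected
like = c(1/4, 1/4, 3/4)    # P(first draw is White | urn)
prod = prior * like
pw = sum(prod); pw         # P(W) = 5/12
post = prod/pw; post       # P(U_i | W); third element is 3/5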

5.18 According to his or her use of an illegal drug, each employee in a large
company belongs to exactly one of three categories: frequent user, occasional
user, or abstainer (never uses the drug at all). Suppose that the percentages
of employees in these categories are 2%, 8%, and 90%, respectively. Further
suppose that a urine test for this drug is positive 98% of the time for frequent
users, 50% of the time for occasional users, and 5% of the time for abstainers.
d In the program below, the vector prior shows the distribution of heavy users, oc-
casional users, and abstainers, respectively. The vector like shows the probabilities
of the respective groups to get positive test results. Finally, after computation, the
vector post shows the distribution of the three employee types among those getting
positive test results. c

prior = c(.02, .08, .90) # proportions in each employee group


like = c(.98, .5, .05) # likelihoods of positive tests
prod = prior * like
pos = sum(prod) # overall proportion testing positive
post = prod/pos # among positives: proportions of groups
pos; round(rbind(prior, like, prod, post), 3)

> pos; round(rbind(prior, like, prod, post), 3)


[1] 0.1046
[,1] [,2] [,3]
prior 0.020 0.080 0.900
like 0.980 0.500 0.050
prod 0.020 0.040 0.045
post 0.187 0.382 0.430

a) If employees are selected at random from this company and given this
drug test, what percentage of them will test positive?
d As shown by the quantity pos in the output above, P (Positive test) = 0.1046. c

b) Of those employees who test positive, what percentage are abstainers?



d From the vector post in the output of the program, among those who have positive
tests, the proportion of abstainers is 43%. c

c) Suppose that employees are selected at random for testing and that those
who test positive are severely disciplined or dismissed. How might an
employee union or civil rights organization argue against the fairness of
drug testing in these circumstances?
d Possible points: Over 10% of employees will be accused of drug use and disciplined
or dismissed. Of those, 43% are abstainers. c

d) Can you envision different circumstances under which it might be appro-


priate to use such a test in the workplace? Explain.
d Perhaps, if only employees reasonably suspected of drug use were tested, the dam-
age to abstainers might be mitigated. (See the Comment.) Perhaps, if the company
kept test results confidential and paid for those with positive results on the screen-
ing test to get tested by a “gold standard” method. However, if the percentages
given are true, the company is at risk of losing up to 10% of its employees. Maybe
management needs to figure out what kinds of drug use are actually detrimental to
performance and deal with that part of the problem.
The program below looks at employees with negative test results.

prior = c(.02, .08, .90) # proportions in each group


like = c(1-.98, 1-.5, 1-.05) # likelihoods of negative tests
prod = prior * like
neg = sum(prod) # overall proportion testing negative
post = prod/neg # among neg.: proportions of groups
neg; round(rbind(prior, like, prod, post), 5)

> neg; round(rbind(prior, like, prod, post), 5)


[1] 0.8954
[,1] [,2] [,3]
prior 0.02000 0.08000 0.90000
like  0.02000 0.50000 0.95000
prod  0.00040 0.04000 0.85500
post  0.00045 0.04467 0.95488

There are almost no frequent users among the negatives (45 in 100,000), and very
few occasional users (less than 5%). Almost all (more than 95%) of those testing
negative are abstainers.
But this does not help those falsely accused: 4.5% of all employees tested are
abstainers and have positive tests (from the product line in the results for positive
results): P (Abstain ∩ Positive) = 0.045. From the product line for negative tests,
85.5% of all employees abstain and test negative: P (Abstain ∩ Negative) = 0.855.
Together, of course, these last two probabilities add up to the 90% of all employees
who abstain. c
Comment: (d) Consider, as one example, a railroad that tests only train operators
who have just crashed a train into the rear of another train.

5.19 A general form of Bayes’ Theorem.


a) Using f with appropriate subscripts to denote joint, marginal, and con-
ditional density functions, we can state a form of Bayes’ Theorem that
applies to distributions generally, not just to probabilities of events,

fS|X (s|x) = fX,S (x, s)/fX (x) = fX,S (x, s) / ∫ fX,S (x, s) ds = fS (s)fX|S (x|s) / ∫ fS (s)fX|S (x|s) ds,

where the integrals are taken over the real line. Give reasons for each step
in this equation. (Compare this result with equation (5.7).)
d The first equality expresses the definition of a conditional density function fS|X
in terms of the joint density function fX,S and the marginal density function fX . The
second shows the marginal density fX as the integral of the joint density function
fX,S with respect to s. The last equality uses the definition of the conditional
density function fX|S in both numerator and denominator.
Extra. In some applications (not here) it is obvious from the functional form
of the numerator fS (s)fX|S (x|s) on the right, that it must be proportional to a
known density function. In that case, it is not necessary to find the integral in the
denominator, and one writes

fS|X (s|x) ∝ fS (s)fX|S (x|s),

where the symbol ∝ is read “proportional to.” The left-hand side is called the
posterior probability density. The function fX|S (x|s), viewed as a function of s for
the observed data x, is called the likelihood function. And the function fS (s) is called
the prior density function. More on this in Chapter 8. c

b) In Example 5.3, S ∼ NORM(110, 5) is the potency of a randomly chosen


batch of drug, and it is assayed as X|{S = s} ∼ NORM(s, 1). The expres-
sion for the posterior density in part (a) allows us to find the probability
that a batch is good given that it assayed at 100.5, thus barely failing
inspection. Explain why this is not the same as 1 − γ.
d In the terminology and notation of Example 5.3, the predictive value of a negative
test is δ = P (Good|Pass) = P {S > 100|X > 101}. And, according to Problems 5.13
and 5.14, δ ≈ 0.9992, so 1 − δ = P {S ≤ 100|X > 101} is very small. Similarly, the
predictive value of a positive test is

γ = P (Bad|Fail) = P {S ≤ 100|X ≤ 101} ≈ 0.567,

so
1 − γ = P (Good|Fail) = P {S > 100|X ≤ 101} ≈ 0.433.
All of these events are conditioned on events of positive probability.
By contrast, this problem focuses on P {S > 100|X = 100.5}, which can be
evaluated using the conditional density fS|X at the value x = 100.5. The conditional
density is defined by conditioning on the event {|X − 100.5| < ε} as ε → 0. Roughly,
one might say P {S > 100|X = 100.5} ≈ 0.811 is the probability a batch is good
given that it “barely” fails because X = 100.5 (just below the cut-off value 101).

The first printing had a misprint, asking about 1 − δ, when 1 − γ was intended.
Confusion of P {S > 100|X = 100.5} with 1 − δ might result from a reversal of the
roles of X and S. Avoiding confusion of P {S > 100|X = 100.5} with 1 − γ requires
one to make the distinction between (i) the general condition X ≤ 101 for failure of
a batch and (ii) the specific condition X = 100.5 for a particular batch. c

c) We seek P {S > 100|X = 100.5}. Recalling the distribution of X from


Example 5.3 and using the notation of Problem 5.14, show that this prob-
ability can be evaluated as follows:
∫_100^∞ fS|X (s|100.5) ds = [∫_100^∞ ϕ(s, 110, 5)ϕ(100.5, s, 1) ds] / ϕ(100.5, 110, 5.099).

d) The R code below implements Riemann approximation of the probability


in part (c). Run the program and provide a numerical answer (roughly
0.8) to three decimal places. In Chapter 8, we will see how to find the
exact conditional distribution S|{X = 100.5}, which is normal. For now,
carefully explain why the R code below can be used to find the required
probability.

d This is straightforward Riemann approximation of a function of s, as in Exam-
ple 3.1, where wide = 130 − 100 = 30 is the width of the region of integration and
division by the number of grid points occurs in taking the mean. The denominator
of the function being integrated is a constant, which involves √26 = 5.099 (derived
in the second bulleted statement of Example 5.3, p128).
Educated trial and error (based on the tiny probability in normal tails beyond
several standard deviations from the mean) establishes that integrating over (100, 130)
is essentially the same to four places as integrating over (100, ∞), which would be
impossible for Riemann approximation. (Try 35 and 40 for wide, instead of 30, if
you have doubts.) c

wide = 30
s = seq(100, 100+wide, 0.001)
numer = 30 * mean(dnorm(s, 110, 5) * dnorm(100.5, s, 1))
denom = dnorm(100.5, 110, 5.099)
numer/denom

> numer/denom
[1] 0.8113713

Errors in Chapter 5
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p128 Example 5.3. In the second paragraph, change to: ...the value observed is a
conditional random variable X|{S = s} ∼ NORM(s, 1).
p133 Problem 5.6. In the second paragraph, three instances of 2.5 should be 1.7. (For
clarity, in the second printing, the first two paragraphs of the problem are to be
revised as shown in this Manual.)

6 Markov Chains with Two States

6.1 In each part below, consider the three Markov chains specified as fol-
lows: (i) α = 0.3, β = 0.7; (ii) α = 0.15, β = 0.35; and (iii) α = 0.03, β = 0.07.
a) Find P {X2 = 1|X1 = 1} and P {Xn = 0|Xn−1 = 0}, for n ≥ 2.
d Probabilities of not moving. (i) P {X2 = 1|X1 = 1} = p11 = 1 − β = 1 − 0.7 = 0.3,
P {Xn = 0|Xn−1 = 0} = p00 = 1 − α = 0.7. (ii) p11 = 0.65, p00 = 0.85. (iii) p11 =
0.93, p00 = 0.97. c

b) Use the given values of α and β and means of geometric distributions to


find the average cycle length for each chain.
d Cycle length. (i) 1/α + 1/β = 1/.3 + 1/.7 = 4.7619. (ii) 1/.15 + 1/.35 = 9.5238.
(iii) 1/.03 + 1/.07 = 47.619. c
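In R, all three average cycle lengths can be computed at once:

alpha = c(0.3, 0.15, 0.03); beta = c(0.7, 0.35, 0.07)
1/alpha + 1/beta   # average cycle lengths: about 4.76, 9.52, 47.62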

c) For each chain, modify the program of Example 6.2 to find the long-run
fraction of steps in state 1.
In the notation of Sections 6.2 and 6.3, we seek

lim_{r→∞} p01 (r) = lim_{r→∞} p11 (r) = λ1 = α/(α + β).

The exact answer is 0.3 in all three cases. For case (iii), the program and
partial results for P64 are shown below. (You will find that smaller powers of
the transition matrix suffice for the other two cases.)
P = matrix(c(.97, .03,
.07, .93), nrow=2, ncol=2, byrow=T)
P
P2 = P %*% P; P4 = P2 %*% P2; P4
P8 = P4 %*% P4; P16 = P8 %*% P8; P16
P32 = P16 %*% P16; P64 = P32 %*% P32; P64

...

> P64 = P32 %*% P32; P64


[,1] [,2]
[1,] 0.7003537 0.2996463
[2,] 0.6991747 0.3008253

d) For each chain, make and interpret plots similar to Figures 6.3 (where the
number of steps is chosen to illustrate the behavior clearly), 6.4, and 6.5.
d The program changes are trivial, but you may need to experiment to find what m
is required for satisfactory convergence. In case (i), the autocorrelation for positive
lags is 0. In case (iii) autocorrelations are highly significant even for moderately
large lags. c

e) In summary, what do these three chains have in common, and in what


respects are they most remarkably different. Is any one of these chains an
independent process, and how do you know?
d The limiting distribution is the same in all three cases. However, chains in cases (ii)
and, especially, (iii) move very slowly toward their limits. Case (i) is independent,
because both rows of the transition matrix P are the same. c
Answers for one of the chains: (a) 0.85 and 0.65, (b) 9.52, (c) 0.3.

6.2 There is more information in the joint distribution of two random vari-
ables than can be discerned by looking only at their marginal distributions.
Consider two random variables X1 and X2 , each distributed as BINOM(1, π),
where 0 < π < 1.
a) In general, show that 0 ≤ Q11 = P {X1 = 1, X2 = 1} ≤ π. In particular,
evaluate Q11 in three cases, in which: (i) X1 and X2 are independent,
(ii) X2 = X1 , and (iii) X2 = 1 − X1 , respectively.
d In general, 0 ≤ Q11 = P {X1 = 1, X2 = 1} ≤ P {X1 = 1} = π, where we have used
the inequality in the Hints. In particular,
(i): We have 0 ≤ Q11 = P {X1 = 1, X2 = 1} = P {X1 = 1}P {X2 = 1} = π^2 ≤ π.
(ii): If X1 = X2 , then 0 ≤ Q11 = P {X1 = 1, X2 = 1} = P {X1 = 1} = π.
(iii): If X1 + X2 = 1, then 0 ≤ Q11 = P {X1 = 1, X2 = 1} = 0, so Q11 = 0. c

b) For each case in part (a), evaluate Q00 = P {X1 = 0, X2 = 0}.

d Similar to part (a). In general, 0 ≤ Q00 ≤ 1 − π. Under independence, 0 ≤ Q00 =


(1 − π)^2 ≤ 1 − π. If X1 = X2 , then 0 ≤ Q00 = 1 − π, and if X1 + X2 = 1, then
Q00 = 0. c

c) In general, if P {X2 = 1|X1 = 0} = α and P {X2 = 0|X1 = 1} = β, then


express π, Q00 , and Q11 in terms of α and β.
d Find joint probabilities via conditional and marginal ones. For example, 1 − β =
P {X2 = 1|X1 = 1} = Q11 /π, so Q11 = π(1−β). Similarly, with the obvious notation
Q01 = α(1 − π). But Q11 + Q01 = P {X2 = 1} = π. So we have two expressions

for Q11 , which can be equated and solved to get π = α/(α + β). The rest follows
from simple probability rules and algebra. The three cases of parts (a) and (b) can
be expressed in terms of α and β as: (i) α + β = 1, independence (see the Hints);
(ii) α = β = 0, “never-move”; (iii) α = β = 1, “flip-flop.” c

d) In part (c), find the correlation ρ = Cov(X1 , X2 ) / SD(X1 )SD(X2 ), recall-


ing that Cov(X1 , X2 ) = E(X1 X2 ) − E(X1 )E(X2 ).
d Because Cov(X1 , X2 ) = P {X1 X2 = 1} − π^2 = Q11 − π^2 = π(1 − β) − π^2 and
SD(X1 )SD(X2 ) = V(X1 ) = π(1 − π), we have ρ = 1 − α − β, after a little algebra and
using the expression for π in terms of α and β. (A simulation check is sketched after
the Hints below.) c
Hints and partial answers: P (A ∩ B) ≤ P (A). Make two-way tables of joint dis-
tributions, showing marginal totals. π = α/(α + β). E(X1 X2 ) = P {X1 X2 = 1}.
ρ = 1 − α − β. Independence (hence ρ = 0) requires α + β = 1.
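For example, with α = 0.15 and β = 0.35 (chain (ii) of Problem 6.1), the formula gives
ρ = 1 − α − β = 0.5. A short simulation of a stationary chain, along the lines of the
program in Problem 6.4(e), agrees:

set.seed(1234)
m = 100000; alpha = 0.15; beta = 0.35
x = numeric(m)
x[1] = rbinom(1, 1, alpha/(alpha + beta))   # start in the stationary distribution
for (i in 2:m) {
  if (x[i-1]==0) x[i] = rbinom(1, 1, alpha)
  else x[i] = rbinom(1, 1, 1 - beta) }
cor(x[1:(m-1)], x[2:m])                     # near 1 - alpha - beta = 0.5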
6.3 Geometric distributions. Consider a coin with 0 < P (Heads) = π < 1.
A geometric random variable X can be used to model the number of indepen-
dent tosses required until the first Head is seen.
a) Show that the probability function P {X = x} = p(x) = (1 − π)^(x−1) π, for
x = 1, 2, . . . .
d In order to obtain the first Head on toss number x, that first Head must be preceded
by x − 1 Tails. c

b) Show that the geometric series p(x) sums to 1, so that one is sure to see
a Head eventually.
d Let T = Σ_{x=1}^∞ p(x). Then

T = π + (1 − π)π + (1 − π)^2 π + (1 − π)^3 π + (1 − π)^4 π + · · ·
(1 − π)T = (1 − π)π + (1 − π)^2 π + (1 − π)^3 π + (1 − π)^4 π + · · ·

Subtracting, we have T − (1 − π)T = π, which implies that T = 1. This is essentially


the standard derivation of the sum of a geometric series with first term π and
constant ratio (1 − π) between terms.
Some probability textbooks and software define the “geometric” random variable
as Y = X −1, counting only the number y of Tails before the first Head. In particular,
R defines the functions dgeom, rgeom, and so on, according to this convention. Thus,
if π = 1/3 (pp = 1/3 in the code), then the PDF of X for x = 1, 2, 3, . . . , 35, is
expressed in R as x = 1:35; pdf = dgeom(x - 1, 1/3). The following program
illustrates that k = 35 terms are sufficient to give Σ_{x=1}^k p(x) ≈ 1, to five places.

k = 35; pp = 1/3; x = 1:k


pdf = dgeom(x-1, 1/3)
sum(pdf); sum(x*pdf) # approximates 1 and E(X) = 3

> sum(pdf); sum(x*pdf) # approximates 1 and E(X) = 3


[1] 0.9999993
[1] 2.999974

m = 100000; rx = rgeom(m, 1/3) + 1


mean(rx) # simulates E(X) by sampling

> mean(rx)
[1] 3.01637 # simulates E(X) by sampling

Anticipating part (c) in the R code above, we also approximate E(X) = 3, first
by summing 35 terms of the series Σ_{x=1}^∞ x p(x), and then by a simulation based
on 100 000 random realizations of X. c

c) Show that the moment generating function of X is m(t) = E(e^{tX}) =
πe^t /[1 − (1 − π)e^t ], and hence that E(X) = m′(0) = dm(t)/dt |_{t=0} = 1/π. (You
may assume that the limits involved in differentiation and summing an
infinite series can be interchanged as required.)
d To derive the moment generating function, we express and simplify E(e^{tX}):

E(e^{tX}) = Σ_{x=1}^∞ e^{tx} p(x) = Σ_{x=1}^∞ e^{tx} (1 − π)^{x−1} π
          = πe^t Σ_{x=1}^∞ [(1 − π)e^t ]^{x−1} = πe^t / [1 − (1 − π)e^t ],

where the final result is from the standard formula for the sum of a geometric
series. Differentiation to obtain µX = E(X) = 1/π is elementary calculus. The
variance V(X) = (1 − π)/π² can be found using m″(0) = E(X²) and the formula
V(X) = E(X²) − µX². Appended to the run of the program in the answer to part (b),
the instruction var(rx) returned 6.057, approximating V(X) = 6. c
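As a further check (again with π = 1/3), one can differentiate the MGF numerically
in R and recover E(X) = 3 and V(X) = 6 approximately:

pp = 1/3; h = 1e-4
m = function(t) pp*exp(t)/(1 - (1 - pp)*exp(t))  # MGF, valid for t < -log(1 - pp)
EX  = (m(h) - m(-h))/(2*h)                       # central difference for m'(0)
EX2 = (m(h) - 2*m(0) + m(-h))/h^2                # second difference for m''(0) = E(X^2)
EX; EX2 - EX^2                                   # approximately 3 and 6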
6.4 Suppose the weather for a day is either Dry (0) or Rainy (1) according
to a homogeneous 2-state Markov chain with α = 0.1 and β = 0.5. Today is
Monday (n = 1) and the weather is Dry.
a) What is the probability that both tomorrow and Wednesday will be Dry?
d Tuesday and Wednesday will both be Dry days: P {X2 = 0, X3 = 0|X1 = 0} =
P {X2 = 0|X1 = 0}P {X3 = 0|X2 = 0} = p00 p00 = (1 − α)² = 0.9² = 0.81. c

b) What is the probability that it will be Dry on Wednesday?


d Two-stage transition (Wednesday Dry, Tuesday unspecified): P {X3 = 0|X1 = 0} =
p00 p00 + p01 p10 = (1 − α)² + αβ = 0.9² + 0.1(0.5) = 0.81 + 0.05 = 0.86. Alternatively,
p00 (2) = β/(α + β) + (1 − α − β)²α/(α + β) = 0.5/0.6 + 0.4²(0.1)/0.6 = 0.86. c

c) Use equation (6.5) to find the probability that it will be Dry two weeks
from Wednesday (n = 17).
d Upper-left element of P¹⁶: p00 (16) = β/(α + β) + (1 − α − β)¹⁶α/(α + β) =
0.5/0.6 + 0.4¹⁶(0.1)/0.6 = 0.8333334. This is not far from λ0 = β/(α+β) = 0.5/0.6 =
0.8333333. The fact that we began with a Dry day has become essentially irrelevant. c

d) Modify the R code of Example 6.1 to find the probability that it will be
Dry two weeks from Wednesday.
P = matrix(c(.9, .1,
.5, .5), nrow=2, ncol=2, byrow=T)
P; P2 = P %*% P; P4 = P2 %*% P2; P4
P8 = P4 %*% P4; P8
P16 = P8 %*% P8; P16
...
> P16 = P8 %*% P8; P16
[,1] [,2]
[1,] 0.8333334 0.1666666
[2,] 0.8333330 0.1666670

e) Over the long run, what will be the proportion of Rainy days? Modify
the R code of Example 6.1 to simulate the chain and find an approximate
answer.
d The exact value is α/(α + β) = 1/6. The plot generated by the program below
(not shown here) indicates that m = 50 000 iterations is sufficient for the trace to
stabilize near the exact value. c

set.seed(1234)
m = 50000; n = 1:m; x = numeric(m); x[1] = 0
alpha = 0.1; beta = 0.5
for (i in 2:m) {
if (x[i-1]==0) x[i] = rbinom(1, 1, alpha)
else x[i] = rbinom(1, 1, 1 - beta) }
y = cumsum(x)/n; y[m]
a = sum(x[1:(m-1)]==0 & x[2:m]==1); a # No. of cycles
m/a # Average cycle length
plot(y, type="l", ylim=c(0,.3), xlab="Step",
ylab="Proportion of Rainy Days")
abline(h = 1/6, col="green")

> y[m]
[1] 0.1659
> a = sum(x[1:(m-1)]==0 & x[2:m]==1); a # No. of cycles
[1] 4123
> m/a # Average cycle length
[1] 12.12709

f) What is the average length of runs of Rainy days?

d Runs of Rain have average length 1/β = 2. The total cycle length averages
1/α + 1/β = 12. This is approximated by the program in part (e) as 12.1. c

g) How do the answers above change if α = 0.15 and β = 0.75?


d Similar methods. Answers not shown. c

6.5 Several processes X1 , X2 , . . . are described below. For each of them


evaluate (i) P {X3 = 0|X2 = 1}, (ii) P {X13 = 0|X12 = 1, X11 = 0},
(iii) P {X13 = 0|X12 = 1, X11 = 1}, and (iv) P {X13 = 0|X11 = 0}. Also,
(v) say whether the process is a 2-state homogeneous Markov chain. If not,
show how it fails to satisfy the Markov property. If so, give its 1-stage transi-
tion matrix P.
a) Each Xn is determined by an independent toss of a coin, taking the value 0
if the coin shows Tails and 1 if it shows Heads, with 0 < P (Heads) = π < 1.
d (i) P {X3 = 0|X2 = 1} = 1 − π, (ii) P {X13 = 0|X12 = 1, X11 = 0} = 1 − π,
(iii) P {X13 = 0|X12 = 1, X11 = 1} = 1 − π, and (iv) P {X13 = 0|X11 = 0} = 1 − π.
(v) A homogeneous Markov chain with α = 1 − β = π. Independent process. c

b) The value of X1 is determined by whether a toss of a fair coin is Tails (0)


or Heads (1), and X2 is determined similarly by a second independent
toss of the coin. For n > 2, Xn = X1 for odd n, and Xn = X2 for even n.
d (i) P {X3 = 0|X2 = 1} = 1/2, (ii) P {X13 = 0|X12 = 1, X11 = 0} = 1, (iii) but
P {X13 = 0|X12 = 1, X11 = 1} = 0; (iv) P {X13 = 0|X11 = 0} = 1. (v) Not a Markov
chain because answers for (ii) and (iii) are not the same. c

c) The value of X1 is 0 or 1, according to whether the roll of a fair die gives


a Six (1) or some other value (0). For each step n > 1, if a roll of the die
shows Six on the nth roll, then Xn ≠ Xn−1 ; otherwise, Xn = Xn−1 .
d (i) P {X3 = 0|X2 = 1} = p10 = β = 1/6, (ii) P {X13 = 0|X12 = 1, X11 = 0} = 1/6,
(iii) P {X13 = 0|X12 = 1, X11 = 1} = 1/6, and (iv) P {X13 = 0|X11 = 0} =
p00 p00 + p01 p10 = 26/36 = 0.7222. (v) A homogeneous Markov chain with transition
matrix shown in the Hints. c

d) Start with X1 = 0. For n > 1, a fair die is rolled. If the maximum value
shown on the die at any of the steps 2, . . . , n is smaller than 6, then
Xn = 0; otherwise, Xn = 1.
d Answers for (i)–(iii) are all 0. Once a 6 is seen at step i, we have Xi = 1 and the
value of Xn , for n > i, can never be 0. (iv) P {X13 = 0|X11 = 0} = (5/6)² because
rolls of the die at steps 12 and 13 must both show values less than 6. (v) This is
a homogeneous Markov chain with α = 1/6 and β = 0. It is one of the absorbing
chains mentioned in Section 6.1 (p141). c

e) At each step n > 1, a fair coin is tossed, and Un takes the value −1 if the
coin shows Tails and 1 if it shows Heads. Starting with V1 = 0, the value
of Vn for n > 1 is determined by
Vn = Vn−1 + Un (mod 5).
The process Vn is sometimes called a “random walk” on the points
0, 1, 2, 3 and 4, arranged around a circle (with 0 adjacent to 4). Finally,
Xn = 0, if Vn = 0; otherwise Xn = 1.

d The V -process is a Markov chain with five states, S = {0, 1, 2, 3, 4}. (Processes
with more than two states are discussed in more detail in Chapter 7. Perhaps you
can write the 5 × 5 transition matrix.)
(i) Because X1 = V1 = 0, we know that X2 = 1, so P {X3 = 0|X2 = 1} = 1/2. At
step 2 the V -process must be in state 1 or 4; either way, there is a 50-50 chance that
X3 = V3 = 0. (ii) Similarly, P {X13 = 0|X12 = 1, X11 = 0} = 1/2. (iii) However,
P {X13 = 0|X12 = 1, X11 = 1, X10 = 0} = 0, because the V -process must be
in either state 2 or 3 at step 12, with no chance of returning to 0 at step 13.
(If we don’t know the state of the X-process at step 10, it’s more difficult to say
what happens at step 13. But the point in (v) is that its state at step 10 matters.)
(iv) P {X13 = 0|X11 = 0} = 1/2, because the coin tosses at steps 12 and 13 must
cancel (one Head and one Tail, in either order) for the walk to return to 0.
(v) The X-process is not Markov because the probabilities in (ii) and (iii) differ.
The X-process is a function of the V -process; this example shows that a function of
a Markov chain need not be Markov. c
Hints and partial answers: (a) Independence is consistent with the Markov property.
(b) Steps 1 and 2 are independent. Show that the values at steps 1 and 2 determine
the value at step 3 but the value at step 2 alone does not. (c) P = (1/6)[ 5 1; 1 5 ]. (d) Markov
chain. (e) The X-process is not Markov.
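A short simulation (not required, but a useful check on (e)(iv)) conditions on
V11 = 0 and looks two steps ahead:

set.seed(42)
m = 100000
u12 = sample(c(-1, 1), m, replace=TRUE)   # coin at step 12
u13 = sample(c(-1, 1), m, replace=TRUE)   # coin at step 13
v13 = (0 + u12 + u13) %% 5                # V11 = 0 by conditioning
mean(v13 == 0)                            # approximately 1/2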

6.6 To monitor the flow of traffic exiting a busy freeway into an industrial
area, the highway department has a TV camera aimed at traffic on a one-
lane exit ramp. Each vehicle that passes in sequence can be classified as Light
(for example, an automobile, van, or pickup truck) or Heavy (a heavy truck).
Suppose data indicate that a Light vehicle is followed by another Light vehi-
cle 70% of the time and that a Heavy vehicle is followed by a Heavy one 5%
of the time.
a) What assumptions are necessary for the Heavy-Light process to be a
homogeneous 2-state Markov chain? Do these assumptions seem realistic?
(One reason the process may not be independent is a traffic law that for-
bids Heavy trucks from following one another within a certain distance
on the freeway. The resulting tendency towards some sort of “spacing”
between Heavy trucks may carry over to exit ramps.)
d Assume Markovian dependence only on the last step; probably a reasonable ap-
proximation to reality. Assume the proportions of Light and Heavy vehicles remain
constant (homogeneous) over time. This does not seem reasonable if applied day and
night, weekdays and weekends, but it may be reasonable during business hours. c

b) If I see a Heavy vehicle in the monitor now, what is the probability that
the second vehicle after it will also be Heavy? The fourth vehicle after it?
d Denote Heavy as 1 and Light as 0. Then P {X3 = 1|X1 = 1} = p11 p11 + p10 p01 =
0.05² + 0.95(0.30) = 0.2875. For the second probability, we need to keep track of
eight possible sequences of five 0s and 1s, beginning and ending with 1s, so it is easier
to use matrix multiplication. Below we show the first, second and fourth powers of
the transition matrix with α = 0.3 and β = 0.95. Notice that the lower-right element

of P2 is p11 (2) = 0.2875, as above. Similarly, the second required probability is the
lower-right element of P4 , which is p11 (4) = 0.2430, to four places. c

P = matrix(c(.7, .3,
.95, .05), nrow=2, ncol=2, byrow=T)
P
P2 = P %*% P; P2
P4 = P2 %*% P2; P4

> P
[,1] [,2]
[1,] 0.70 0.30
[2,] 0.95 0.05
> P2 = P %*% P; P2
[,1] [,2]
[1,] 0.7750 0.2250
[2,] 0.7125 0.2875
> P4 = P2 %*% P2; P4
[,1] [,2]
[1,] 0.7609375 0.2390625
[2,] 0.7570312 0.2429687

c) If I see a Light vehicle in the monitor now, what is the probability that
the second vehicle after it will also be Light? The fourth vehicle after it?
d From the output for part (b), p00 (2) = 0.7750 and p00 (4) = 0.7609. c

d) In the long run, what proportion of the vehicles on this ramp do you
suppose is Heavy?

d The long run probability λ1 = limr→∞ p11 (r) = limr→∞ p01 (r) = α/(α + β) =
0.30/1.25 = 0.24. c

e) How might an observer of this Markov process readily notice that it differs
from a purely independent process with about 24% Heavy vehicles.
d In a purely independent process, runs of Heavy vehicles would average about
1/0.76 = 1.32 in length, so we would regularly see pairs of Heavy vehicles. Specif-
ically, if one vehicle is Heavy, then the next one is Heavy roughly a quarter of the
time. By contrast, this Markov chain will produce runs of Heavy vehicles that av-
erage about 1/0.95 = 1.05 in length, so we would very rarely see pairs of Heavy
vehicles. c

f) In practice, one would estimate the probability that a Heavy vehicle is


followed by another Heavy one by taking data. If about 1/4 of the vehicles
are Heavy, about how many Heavy vehicles (paired with the vehicles that
follow immediately behind) would you need to observe in order to estimate
this probability accurately enough to distinguish meaningfully between a
purely independent process and a Markov process with dependence?

d The question amounts to asking how many trials n we need to distinguish X ∼
BINOM(n, 0.25) from Y ∼ BINOM(n, 0.05). Then SD(X/n) = √(3/16n) = 0.433/√n
and SD(Y /n) = √(19/400n) = 0.218/√n. Roughly speaking, if our goal is to avoid
overlap of 95% confidence intervals, then we want 1.96(0.433 + 0.218)/√n < 0.20,
which gives n ≈ 41.
Another, possibly more precise, approach would be to say that a reasonable
dividing point (critical value for a test of hypothesis) between 5% and 25% might
be 12%, so we seek n for which both P {X/n < .12} < .05 and P {Y /n > .12} < .05.
A program to do the required grid search is shown below. The result is n = 34.
With some fussing, both methods might be made more precise. However, the
question asks for only a rough estimate. Also, take into account that the binomial
distributions are discrete, so exact error probabilities usually can’t be achieved.
Based on either method, we see that it should be enough to look at about 35 or 40
Heavy vehicles, each paired with the vehicle immediately following. c

n = 1:100
p1 = pbinom(n*.12, n, .25)
p2 = 1 - pbinom(n*.12, n, .05)
N = min(n[pmax(p1, p2) < .05])
N; p1[N]; p2[N]

> N; p1[N]; p2[N]


[1] 34
[1] 0.04909333
[1] 0.02591563

6.7 For a rigorous proof of equation (6.3), follow these steps:


a) Show that p01 (2) is a fraction with numerator

P {X1 = 0, X3 = 1} = P {X1 = 0, X2 = 0, X3 = 1}
+ P {X1 = 0, X2 = 1, X3 = 1}.

b) Use the relationship P (A ∩ B ∩ C) = P (A)P (B|A)P (C|A ∩ B) and the


Markov property to show that the first term in the numerator can be
expressed as p0 p00 p01 , where p0 = P {X1 = 0}.
c) Complete the proof.
d Here are some details in the proof of equation (6.3). We identify event A with step 1,
event B with step 2, and event C with step 3. By the Law of Total Probability we
have P (A ∩ C) = P (A ∩ B ∩ C) + P (A ∩ B^c ∩ C), which justifies the third equal
sign in the equation below.
The relationship in part (b) is proved by expressing conditional probabilities as
ratios, according to the definition of conditional probability, and canceling terms
that appear in both numerator and denominator. For example,

P (B|A)P (C|A ∩ B) = [P (A ∩ B)/P (A)] · [P (A ∩ B ∩ C)/P (A ∩ B)] = P (A ∩ B ∩ C)/P (A).

This equation justifies the fourth equality below. The Markov property accounts for
the simplification involved in the fifth equality.

p01 (2) = P {X3 = 1|X1 = 0} = P {X1 = 0, X3 = 1} / P {X1 = 0}
        = [P {X1 = 0, X2 = 0, X3 = 1} + P {X1 = 0, X2 = 1, X3 = 1}] / P {X1 = 0}
        = [P {X1 = 0}P {X2 = 0|X1 = 0}P {X3 = 1|X2 = 0, X1 = 0}
           + P {X1 = 0}P {X2 = 1|X1 = 0}P {X3 = 1|X2 = 1, X1 = 0}] / P {X1 = 0}
        = [P {X1 = 0}P {X2 = 0|X1 = 0}P {X3 = 1|X2 = 0}
           + P {X1 = 0}P {X2 = 1|X1 = 0}P {X3 = 1|X2 = 1}] / P {X1 = 0}
        = p0 p00 p01 /p0 + p0 p01 p11 /p0 = p00 p01 + p01 p11
        = (1 − α)α + α(1 − β) = α(2 − α − β).

In the next-to-last line above, we use notation p0 of part (c) and Section 6.2. c
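For a quick numerical confirmation of (6.3), one can square the transition matrix
for arbitrary values of α and β (here 0.1 and 0.5) and compare the upper-right
entry of P² with α(2 − α − β):

alpha = 0.1; beta = 0.5
P = matrix(c(1 - alpha, alpha,
             beta, 1 - beta), nrow=2, byrow=TRUE)
P2 = P %*% P
P2[1, 2]; alpha*(2 - alpha - beta)   # both 0.14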

6.8 To verify equation (6.5) do the matrix multiplication and algebra nec-
essary to verify each of the four elements of P2 .
d We show how the upper-left element p00 (2) of P2 arises from matrix multiplication.
The first row of P is the vector (p00 , p01 ) = (1 − α, α), and its first column is the
vector (p00 , p10 ) = (1 − α, β). So

p00 (2) = Σ_{i=0}^{1} p0i pi0 = p00 p00 + p01 p10 = (1 − α)² + αβ.

However, in equation (6.5) this element is represented as

p00 (2) = β/(α + β) + α(1 − α − β)²/(α + β) = [β + α(1 − α − β)²]/(α + β)
        = [(α + β)(1 − α)² + αβ(α + β)]/(α + β) = (1 − α)² + αβ,

where we omit a few steps of routine algebra between the first and second lines. c
6.9 Prove equation (6.6), by mathematical induction as follows:
Initial step: Verify that the equation is correct for r = 1. That is, let r = 1
in (6.6) and verify that the result is P.
Induction step: Do the matrix multiplication P·Pr , where Pr is given by the
right-hand side of (6.6). Then simplify the result to show that the product
Pr+1 agrees with the right-hand side of (6.6) when r is replaced by r + 1.
d We do not show the somewhat tedious and very routine algebra required here. c
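Although the algebra is routine, a numerical check of the closed form is easy. The
sketch below (with arbitrary α = 0.3, β = 0.2, and r = 5) compares P^r computed by
repeated multiplication with the right-hand side of (6.6), written as the sum of a
limiting matrix and a geometrically damped matrix:

alpha = 0.3; beta = 0.2; r = 5
P = matrix(c(1 - alpha, alpha,
             beta, 1 - beta), nrow=2, byrow=TRUE)
Pr = diag(2); for (i in 1:r) Pr = Pr %*% P        # P^r by repeated multiplication
A = matrix(c(beta, alpha, beta, alpha), nrow=2, byrow=TRUE)
B = matrix(c(alpha, -alpha, -beta, beta), nrow=2, byrow=TRUE)
closed = (A + (1 - alpha - beta)^r * B)/(alpha + beta)
max(abs(Pr - closed))                             # essentially 0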

6.10 Consider the 2-state Markov chain with α = β = 0.9999. This is


almost a “flip-flop” chain. Find λ1 , the cycle length, and P132 . Simulate X̄100
and X̄101 several times. Also, look at the autocorrelations of Xn in several
simulation runs. Comment on your findings.
d The limiting probability of state 1 is λ1 = α/(α + β) = 0.9999/[2(0.9999)] = 1/2.
The cycle length is 1/α + 1/β = 2/0.9999 = 2.0002. In the first blocks of code below,
we show P¹²⁸, along with a few lower powers of the transition matrix, to emphasize
the extremely slow convergence. One run of each requested simulation follows.

alpha = beta = 0.9999


P = matrix(c(1 - alpha, alpha,
beta, 1 - beta), nrow=2, ncol=2, byrow=T)
P
P2 = P %*% P
P4 = P2 %*% P2; P8 = P4 %*% P4; P8
P16 = P8 %*% P8; P32 = P16 %*% P16; P32
P64 = P32 %*% P32; P128 = P64 %*% P64; P128

> P
[,1] [,2]
[1,] 0.0001 0.9999
[2,] 0.9999 0.0001

> P8 = P4 %*% P4; P8


[,1] [,2]
[1,] 0.9992005598 0.0007994402
[2,] 0.0007994402 0.9992005598

> P32 = P16 %*% P16; P32


[,1] [,2]
[1,] 0.9968099 0.0031901
[2,] 0.0031901 0.9968099

> P128 = P64 %*% P64; P128


[,1] [,2]
[1,] 0.98736120 0.01263880
[2,] 0.01263880 0.98736120

m = 100; n = 1:m; x = numeric(m); x[1] = 0


alpha = .9999; beta = .9999
for (i in 2:m) {
if (x[i-1]==0) x[i] = rbinom(1, 1, alpha)
else x[i] = rbinom(1, 1, 1 - beta) }
y = cumsum(x)/n; y[m]; x[m]; acf(x, plot=F)

> y = cumsum(x)/n; y[m]; x[m]; acf(x, plot=F)


[1] 0.5
[1] 1

Autocorrelations of series ’x’, by lag

0 1 2 3 4 5 6 7 8 9 10
1.00 -0.99 0.98 -0.97 0.96 -0.95 0.94 -0.93 0.92 -0.91 0.90
11 12 13 14 15 16 17 18 19 20
-0.89 0.88 -0.87 0.86 -0.85 0.84 -0.83 0.82 -0.81 0.80

m = 101; n = 1:m; x = numeric(m); x[1] = 0


alpha = .9999; beta = .9999
for (i in 2:m) {
if (x[i-1]==0) x[i] = rbinom(1, 1, alpha)
else x[i] = rbinom(1, 1, 1 - beta) }
y = cumsum(x)/n; y[m]; x[m]; acf(x, plot=F)

> y = cumsum(x)/n; y[m]; x[m]; acf(x, plot=F)


[1] 0.4950495
[1] 0

Autocorrelations of series ’x’, by lag

0 1 2 3 4 5 6 7 8
1.000 -0.990 0.980 -0.970 0.960 -0.950 0.941 -0.931 0.921
9 10 11 12 13 14 15 16 17
-0.911 0.901 -0.891 0.881 -0.871 0.861 -0.851 0.842 -0.832
18 19 20
0.822 -0.812 0.802

While the powers of the matrix converge very slowly, the almost-deterministic
simulations converge very rapidly to 1/2 (for m = 100 and starting with X1 = 0),
and to 50/101 (for m = 101). There is no simple connection between the speed of
convergence of the powers of the transition matrix to a matrix with all-identical rows
and the speed with which the trace of the simulated chain converges to its limit.
We have omitted the plots as in Figures 6.3, 6.4, and 6.5; you should make them
and look at them. Printouts for the ACF show alternating negative and positive
autocorrelations.
Very occasionally, the usually-strict alternation between 0 and 1 at successive
steps may be broken, giving slightly different results for y[m] (the average X̄m ), and
also giving a rare value of x[m] (single observation Xm ) that is 0 for m = 100—or
that is 1 for m = 101. c
Note: The autocorrelations for small lags have absolute values near 1 and they
alternate in sign; for larger lags, the trend towards 0 is extremely slow.
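For the stationary chain, the theoretical lag-k autocorrelation is (1 − α − β)^k =
(−0.9998)^k, which stays very close to ±1 for all small lags. The slow linear decline
of the printed sample values (−0.99, 0.98, −0.97, . . . ) reflects the acf estimator,
which divides by the full series length m; a nearly perfectly alternating series of
length m = 100 therefore gives sample autocorrelations of about ±(m − k)/m. A
one-line check of the theoretical values:

round((1 - .9999 - .9999)^(1:10), 4)   # very close to +/- 1 at every small lag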

6.11 A single strand of a DNA molecule is a sequence of nucleotides. There


are four possible nucleotides in each position (step), one of which is cytosine
(C). In a particular long strand, it has been observed that C appears in 34.1%
of the positions. Also, in 36.8% of the cases where C appears in one position
along the strand, it also appears in the next position.

a) What is the probability that a randomly chosen pair of adjacent nu-


cleotides is CC (that has cytosine in both locations).
d Let Xn = 1 signify that nucleotide C is at step n. Because p11 = 1 − β = 0.368,
we have β = 1 − 0.368 = 0.632. Because λ1 = α/(α + β) = 0.341, we deduce that
α = βλ1 /(1 − λ1 ) = 0.632(0.341)/0.659 = 0.327. Then the probability a random
pair of adjacent nucleotides is CC can be found as λ1 p11 = 0.341(0.368) = 0.1255. c
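The arithmetic in part (a) is easy to reproduce in R:

beta = 1 - 0.368                       # because p11 = 1 - beta = 0.368
lambda1 = 0.341                        # long-run proportion of C
alpha = beta*lambda1/(1 - lambda1)
alpha                                  # about 0.327
lambda1*(1 - beta)                     # P(CC pair), about 0.1255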

b) If a position along the strand is not C, then what is the probability that
the next position is C?
d The probability that non-C is followed by (C) is p01 = α = 0.327. c

c) If a position n along the strand is C, what is the probability that position


n + 2 is also C? How about position n + 4?
d From the matrix computations below, the answers are 0.3421 and 0.3410, respec-
tively. We see that P4 ≈ limn→∞ Pn . c
alpha = 0.327; beta = 0.632
P = matrix(c(1 - alpha, alpha,
beta, 1 - beta), nrow=2, ncol=2, byrow=T)
P
P2 = P %*% P; P2
P4 = P2 %*% P2; P4

> P
[,1] [,2]
[1,] 0.673 0.327
[2,] 0.632 0.368
> P2 = P %*% P; P2
[,1] [,2]
[1,] 0.659593 0.340407
[2,] 0.657912 0.342088
> P4 = P2 %*% P2; P4
[,1] [,2]
[1,] 0.6590208 0.3409792
[2,] 0.6590180 0.3409820

d) Answer parts (a)–(c) if C appeared independently in any one position with


probability 0.341.
d (a) 0.341² = 0.1163, (b) 0.341, and (c) 0.341 in both instances. The Markov chain
of parts (a)–(c) is nearly independent; the two rows of P are not much different.
However, this distinction, although numerically small, signals the difference between
life and random chemistry. c
Hint: Find the transition matrix of a chain consistent with the information given.
6.12 Consider a 2-state Markov chain with P = [ 1−α α; β 1−β ]. The elements
of the row vector σ = (σ1 , σ2 ) give a steady-state distribution of this chain
if σP = σ and σ1 + σ2 = 1.

a) If λ is the limiting distribution of this chain, show that λ is a steady-state


distribution.
d Suppose σ is a steady-state distribution of a chain for which λ is the limiting
distribution. We show that σ = λ. Multiplying σP = σ on the right by P gives
σP2 = σP = σ. Iterating, we have σPn = σ, for any n. Taking the limit as
n → ∞, we have σΛ = σ, where both rows of Λ are λ = (λ0 , λ1 ). Then, from the
multiplication σΛ, we get the first element σ0 λ0 + σ1 λ0 = λ0 (σ0 + σ1 ) = σ0 . But
σ0 + σ1 = 1, so λ0 = σ0 . Similarly, from the second element, λ1 = σ1 , and we have
shown that σ = λ. This method generalizes to the K-state chains, with K ≥ 2, in
Chapter 7. c

b) If |1 − α − β| < 1, show that the solution of the vector equation λP = λ


is the long-run distribution λ = [β/(α + β), α/(α + β)].
d The condition ensures that α + β > 0. Clearly, λ = [β/(α + β), α/(α + β)] satisfies
the equation λP = λ: By matrix multiplication, the first element of the product is
β(1 − α)/(α + β) + αβ/(α + β) = β/(α + β). Similarly, the second element of the
product is α/(α + β). So the specified λ is a solution to the steady-state equation.
Alternatively, we can multiply λ = (λ0 , λ1 ) by the first column of P to obtain
λ0 (1−α)+λ1 β = λ0 , from which we see that λ0 = λ1 β/α. Together with λ0 +λ1 = 1,
this implies that λ0 = β/(α + β), and thus also λ1 = α/(α + β). So we have solved
the steady-state equation to obtain λ in terms of α and β, as specified. c
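Numerically, the steady-state distribution can also be found as a left eigenvector
of P for eigenvalue 1, as in this sketch with α = 0.1 and β = 0.5:

alpha = 0.1; beta = 0.5
P = matrix(c(1 - alpha, alpha,
             beta, 1 - beta), nrow=2, byrow=TRUE)
e = eigen(t(P))                          # left eigenvectors of P
v = Re(e$vectors[, 1])                   # eigenvector for the largest eigenvalue, 1
v/sum(v); c(beta, alpha)/(alpha + beta)  # both (5/6, 1/6)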

c) It is possible for a chain that does not have a long-run distribution to have
a steady-state distribution. What is the steady-state distribution of the
“flip-flop” chain? What are the steady-state distributions of the “never
move” chain?
d The unique steady-state vector for the flip-flop chain is σ = (1/2, 1/2). The never-
move chain has the two-dimensional identity matrix as its transition matrix: P = I,
so any 2-element vector σ has σP = σI = σ. If the elements σ are nonnegative and
add to unity, it is a steady-state distribution of the never-move chain. c
6.13 Suppose a screening test for a particular disease has sensitivity η = 0.8
and specificity θ = 0.7. Also suppose, for a particular population that is espe-
cially at risk for this disease, PV Positive γ = 0.4 and PV Negative δ = 0.9.
a) Use the analytic method of Example 6.6 to compute π.
eta = 0.8; theta = 0.7; gamma= 0.4; delta = 0.9
Q = matrix(c(theta, 1-theta,
1-eta, eta ), ncol=2, byrow=T)
R = matrix(c(delta, 1-delta,
1-gamma, gamma ), ncol=2, byrow=T)
P = Q %*% R; alpha = P[1,2]; beta = P[2,1]
prevalence = alpha/(alpha + beta); prevalence

> prevalence
[1] 0.2235294

b) As in Example 6.5, use a Gibbs sampler to approximate the preva-


lence π. As seems appropriate, adjust the vertical scale of the plot, the run
length m, and the burn-in period. Report any adjustments you made and
the reasons for your choices. Make several runs of the modified simulation,
and compare your results with the value obtained in part (a). Compare
the first-order autocorrelation with that of the example.
d In the program below, we have chosen m = 100 000, although fewer iterations
might have sufficed. We take m/2 as the burn-in period, because the trace seems
stable enough after that point. After burn-in, the average prevalence π (proportion
infected) is 0.223, which is very close to the value found in part (a).
We use the vertical plotting window (0.10, 0.35) as wide enough to show the
essence of the trace. An interval more narrowly bracketing 0.22 might have been used
and would give greater emphasis on the variability of the trace. The ACF printout
shows very small autocorrelations for all but the smallest lags (below it is truncated
at lag 17). The trace and ACF plots are not shown here, but you should run the
program and look at them. c

set.seed(1066)
m = 100000; d = t = numeric(m); d[1] = 0
eta = .8; theta = .7; gamma = .4; delta = .9
for (n in 2:m) {
if (d[n-1]==1) t[n-1] = rbinom(1, 1, eta)
else t[n-1] = rbinom(1, 1, 1 - theta)
if (t[n-1]==1) d[n] = rbinom(1, 1, gamma)
else d[n] = rbinom(1, 1, 1 - delta) }
runprop = cumsum(d)/1:m
par(mfrow=c(1,2)) # plots not shown here
plot(runprop, type="l", ylim=c(.1,.35),
xlab="Step", ylab="Running Proportion Infected")
abline(v=m/2, lty="dashed")
acf(d, ylim=c(-.1,.4))
par(mfrow=c(1,1))
mean(d[(m/2+1):m])
acf(d, plot=F)

> mean(d[(m/2+1):m])
[1] 0.22268
> acf(d, plot=F)

Autocorrelations of series ’d’, by lag

0 1 2 3 4 5 6 7 8
1.000 0.151 0.027 0.007 -0.004 -0.005 -0.001 0.002 0.001
9 10 11 12 13 14 15 16 17
0.001 0.007 0.004 0.003 0.002 -0.003 0.000 -0.002 -0.001
...

Answer: π ≈ 0.22; your answer to part (a) should show four places.

6.14 Mary and John carry out an iterative process involving two urns and
two dice as follows:
(i) Mary has two urns: Urn 0 contains 2 black balls and 5 red balls; Urn 1
contains 6 black balls and 1 red ball. At step 1, Mary chooses one ball at
random from Urn 1 (thus X1 = 1). She reports its color to John and returns
it to Urn 1.
(ii) John has two fair dice, one red and one black. The red die has three
faces numbered 0 and three faces numbered 1; the black die has one face
numbered 0 and five faces numbered 1. John rolls the die that corresponds to
the color Mary reported to him. In turn, he reports the result X2 to Mary. At
step 2, Mary chooses the urn numbered X2 (0 or 1).
(iii) This process is iterated to give values of X3 , X4 , . . . .
a) Explain why the X-process is a Markov chain, and find its transition
matrix.
d Mary’s choice of an urn at step n, depends on what John tells her, which depends
in turn on Mary’s choice at step n − 1. However, knowing her choice at step n − 1
is enough to compute probabilities of her choices at step n. No information earlier
than step n is relevant in that computation. Thus the X-process is a Markov chain.
Its transition matrix is shown in the Hint. c

b) Use an algebraic method to find the percentage of steps on which Mary


samples from Urn 1.
d We need to find the transition matrix P, and from it λ1 . The program below
illustrates this procedure. Mary chooses Urn 1 about 73.5% of the time.
Thus, intuitively speaking, the proportion of black balls drawn is 0.265(2/7) +
0.735(6/7) = 70.6% and that is also the proportion of the time that John rolls the
black die. Black balls predominate only slightly in Mary’s two urns taken together,
but the choice of a black die heavily favors choosing Urn 1 at the next step, and
Urn 1 has a majority of black balls. c

Q = (1/7)*matrix(c(2, 5,
6, 1), ncol=2, byrow=T)
R = (1/6)*matrix(c(1, 5,
3, 3), ncol=2, byrow=T)
P = Q %*% R; alpha = P[1,2]; beta = P[2,1]
urn.1 = alpha/(alpha + beta); urn.1

> urn.1 = alpha/(alpha + beta); urn.1


[1] 0.7352941

c) Modify the program of Example 6.5 to approximate the result in part (b)
by simulation.
d The run of the program below gives very nearly the exact answer. The seed chosen
for that run is an unusually “lucky” one; with a seed randomly chosen from your
system clock, you should expect about two-place accuracy.

Problem 6.13 could pretty much be done by plugging in parameters from Ex-
amples 6.5 and 6.6. By contrast, this problem gives you the opportunity to think
through the logic of the Gibbs Sampler in order to get the correct binomial para-
meters inside the loop.
After running the program below (in which urn denotes the X-process), you can
use the same code to make plots of the trace and ACF as in part (b) of Problem 6.13,
but change the plotting interval of the vertical scale to something like (0.65, 0.85). c

set.seed(2011)
m = 100000
urn = die = numeric(m); urn[1] = 0
for (n in 2:m)
{
if (urn[n-1]==1) die[n-1] = rbinom(1, 1, 1/7)
else die[n-1] = rbinom(1, 1, 1 - 2/7)
if (die[n-1]==1) urn[n] = rbinom(1, 1, 3/6)
else urn[n] = rbinom(1, 1, 1 - 1/6)
}
runprop = cumsum(urn)/1:m
mean(urn[(m/2+1):m])

> mean(urn[(m/2+1):m])
[1] 0.73522

Hint: Drawing from an urn and rolling a die are each “half” a step; account for both
possible paths to each full-step transition: P = (1/7)[ 2 5; 6 1 ] · (1/6)[ 1 5; 3 3 ].

Errors in Chapter 6
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p148 Example 6.2. In the second line below printout, the transition probability should
be p01 (4) ≈ 0.67, not 0.69. [Thanks to Leland Burrill.]
p153 Example 6.6. In the displayed equation, the lower-right entry in the first matrix
     should be 0.99, not 0.00. [Thanks to Tony Tran.] The correct display is as
     follows:

     P = [ 0.97 0.03; 0.01 0.99 ] [ 0.9998 0.0002; 0.5976 0.4024 ] = [ 0.9877 0.0123; 0.6016 0.3984 ]

p155 Problem 6.5(e). The displayed equation should have ’mod 5’; consequently, the
points should run from 1 through 5, and 0 should be adjacent to 4. The answer
for part (e) should say: “The X-process is not Markov.” The correct statement
of part (e) is as follows:
e) At each step n > 1, a fair coin is tossed, and Un takes the value −1
if the coin shows Tails and 1 if it shows Heads. Starting with V1 = 0,
the value of Vn for n > 1 is determined by

Vn = Vn−1 + Un (mod 5).

The process Vn is sometimes called a “random walk” on the points


0, 1, 2, 3 and 4, arranged around a circle (with 0 adjacent to 4). Finally,
Xn = 0, if Vn = 0; otherwise Xn = 1.

7
Markov Chains with Larger State Spaces

7.1 Ergodic and nonergodic matrices. In the transition matrices of the six
4-state Markov chains below, elements 0 are shown and * indicates a positive
element. Identify the ergodic chains, giving the smallest value N for which
PN has all positive elements. For nonergodic chains, explain briefly what
restriction on the movement among states prevents ergodicity.

     
       [ * * 0 0 ]          [ * * * 0 ]          [ * * 0 0 ]
a) P = [ 0 * * 0 ] , b) P = [ * * * 0 ] , c) P = [ * * 0 0 ] ,
       [ 0 0 * * ]          [ * * * 0 ]          [ * 0 * * ]
       [ * 0 0 * ]          [ 0 0 0 * ]          [ 0 0 * 0 ]

       [ 0 * 0 0 ]          [ 0 * * 0 ]          [ * * 0 0 ]
d) P = [ 0 0 * * ] , e) P = [ 0 0 0 * ] , f) P = [ * * 0 0 ] .
       [ * 0 0 0 ]          [ * 0 0 0 ]          [ 0 * * * ]
       [ * 0 0 0 ]          [ 0 * * * ]          [ 0 0 0 * ]

d For parts (a), (b), (d), and (f), see the Answers below.
Part (c): The classes of intercommunicating states are A = {1, 2} and B = {3, 4}.
Class B can lead to A, but not the reverse. So B is transient and A is persistent.
Not an ergodic chain.
Part (e): The path 2 → 4 → 3 → 1 → 2 shows that all states intercommunicate. An
ergodic chain. c
Answers: In each chain, let the state space be S = {1, 2, 3, 4}. (a) Ergodic, N = 3.
(b) Class {1, 2, 3} does not intercommunicate with {4}. (d) Nonergodic because of
the period 3 cycle {1} → {2} → {3, 4} → {1}; starting in {1} at step 1 allows visits to
{3, 4} only at steps 3, 6, 9, . . . . (f) Starting in {3} leads eventually to absorption in
either {1, 2} or {4}.
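A quick way to confirm the value N = 3 for chain (a) is to put arbitrary positive
numbers in place of the *'s (the zero pattern, not the particular values, is what
matters) and look at powers of the matrix:

P = matrix(c(.5, .5, 0, 0,
             0, .5, .5, 0,
             0, 0, .5, .5,
             .5, 0, 0, .5), nrow=4, byrow=TRUE)
P2 = P %*% P; P3 = P2 %*% P
min(P2)   # still 0, so N > 2
min(P3)   # positive, so N = 3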

7.2 Continuation of Example 7.1, CpG islands. We now look at a Markov


chain that models the part of the genome where mutation of CpGs to TpGs
is not inhibited. In the transition matrix below, note particularly that the
probability p23 is much smaller than in the matrix of Example 7.1 (p161).
 
        [ 0.300  0.205  0.285  0.210 ]
    P = [ 0.322  0.298  0.078  0.302 ]
        [ 0.248  0.246  0.298  0.208 ]
        [ 0.177  0.239  0.292  0.292 ] .

a) Find a sufficiently high power of P to determine the long-run distribu-


tion of this chain. Comment on how your result differs from the long-run
distribution of the chain for CpG islands.
d Below we see that limr→∞ Pr ≈ P8 with at least five place accuracy. Because the
original data are accurate to about three places, we take the limiting distribution
of this “sea” chain to be λ = (0.262, 0.246, 0.239, 0.254). Recall that the limiting
distribution for the CpG island chain is λ = (0.155, 0.341, 0.350, 0.154). Not surpris-
ingly, the frequencies of C = 2 and G = 3 are lower here. (Each C on one side of the
DNA “ladder” is matched with a G on the other.) Moreover, as mentioned in the
note, the proportion of CpGs in the sea chain is only λ2 p23 = 0.246(0.078) = 0.019,
compared with 0.093 in the island chain. c

P = matrix(c(0.300, 0.205, 0.285, 0.210,


0.322, 0.298, 0.078, 0.302,
0.248, 0.246, 0.298, 0.208,
0.177, 0.239, 0.292, 0.292), nrow=4, byrow=T)
P2 = P %*% P; P4 = P2 %*% P2; P8 = P4 %*% P4
P2; P4; P8

> P2
[,1] [,2] [,3] [,4]
[1,] 0.263860 0.242890 0.247740 0.245510
[2,] 0.265354 0.246180 0.226442 0.262024
[3,] 0.264332 0.247168 0.239408 0.249092
[4,] 0.254158 0.249127 0.241367 0.255348
> P4
[,1] [,2] [,3] [,4]
[1,] 0.2619579 0.2462802 0.2389381 0.2528238
[2,] 0.2617925 0.2463029 0.2389403 0.2529643
[3,] 0.2619256 0.2462810 0.2388936 0.2528999
[4,] 0.2618687 0.2463348 0.2387957 0.2530008
> P8
[,1] [,2] [,3] [,4]
[1,] 0.2618869 0.2462998 0.238892 0.2529213
[2,] 0.2618869 0.2462998 0.238892 0.2529214
[3,] 0.2618869 0.2462998 0.238892 0.2529213
[4,] 0.2618869 0.2462998 0.238892 0.2529214

rowSums(P)
> rowSums(P)
[1] 1 1 1 1 # verifying that all rows of this P add to 1

b) Modify the R program of Example 7.1 to simulate this chain, approximat-


ing its long-run distribution and the overall proportion of CpGs. How does
this compare with the product λ2 p23 ? With the product λ2 λ3 ? Comment.
How does it compare with the proportion of CpGs in the CpG-islands
model?
d In the simulation below, λ ≈ (0.262, 0.246, 0.238, 0.255), which agrees within two-
or three-place accuracy with each row of P8 in part (a). The simulation estimates the
proportion of CpGs as 0.019, which agrees with the result λ2 p23 = 0.019 computed
in part (a). If Cs and Gs were independent in the sea, then the proportion of CpGs
would be λ2 λ3 ≈ 0.246(0.238) = 0.059, instead of the Markovian value 0.019.

set.seed(1234)
m = 100000; x = numeric(m); x[1] = 1
for (i in 2:m) {
if (x[i-1] == 1)
x[i] = sample(1:4, 1, prob=c(0.300, 0.205, 0.285, 0.210))
if (x[i-1] == 2)
x[i] = sample(1:4, 1, prob=c(0.322, 0.298, 0.078, 0.302))
if (x[i-1] == 3)
x[i] = sample(1:4, 1, prob=c(0.248, 0.246, 0.298, 0.208))
if (x[i-1] == 4)
x[i] = sample(1:4, 1, prob=c(0.177, 0.239, 0.292, 0.292)) }
summary(as.factor(x))/m # Table of proportions
mean(x[1:(m-1)]==2 & x[2:m]==3) # Est. Proportion of CpG
hist(x, breaks=0:4 + .5, prob=T, xlab="State", ylab="Proportion")

> summary(as.factor(x))/m # Table of proportions


1 2 3 4
0.26236 0.24550 0.23764 0.25450
> mean(x[1:(m-1)]==2 & x[2:m]==3) # Est. Proportion of CpG
[1] 0.01899019

For a graphical impression of the difference between the distribution of the four
nucleotides in the sea and island chains, compare the “histogram” (interpreted as a
bar chart) from the simulation above with the bar chart in Figure 7.1.
As suggested, the simulation program above is similar to the one in Example 7.1.
The program below uses a different style based directly on the transition matrix.
The style is more elegant, but maybe not quite as easy to understand as the one
above. Compare the two programs, and see if you can see how each one works.

P = matrix(c(0.300, 0.205, 0.285, 0.210,


0.322, 0.298, 0.078, 0.302,
0.248, 0.246, 0.298, 0.208,
0.177, 0.239, 0.292, 0.292), nrow=4, byrow=T)
set.seed(1234)
m = 100000; x = numeric(m); x[1] = 1
for (i in 2:m) { x[i] = sample(1:4, 1, prob=P[x[i-1], ]) }
summary(as.factor(x))/m # Table of proportions
mean(x[1:(m-1)]==2 & x[2:m]==3) # Est. Proportion of CpG

> summary(as.factor(x))/m # Table of proportions


1 2 3 4
0.26236 0.24550 0.23764 0.25450
> mean(x[1:(m-1)]==2 & x[2:m]==3) # Est. Proportion of CpG
[1] 0.01899019

We purposely used the same seed in both versions of the program. That the
simulated results are exactly the same in both cases indicates that the two programs
are doing exactly the same simulation. c
Note: The proportion of CpGs among dinucleotides in the island model is approx-
imately 9%; here it is only about 2%. Durbin et al. (1998) discuss how, given the
nucleotide sequence for a short piece of the genome, one might judge whether or
not it comes from a CpG island. Further, with information about the probabilities of
changing between island and “sea,” one might make a Markov chain with 8 states:
A0 , T0 , G0 , C0 for CpG islands and A, T, G, C for the surrounding sea. However, when
observing the nucleotides along a stretch of genome, one cannot tell A from A0 ,
T from T0 , and so on. This is an example of a hidden Markov model.

7.3 Brother-sister mating (continued).


a) In Example 7.2 (p164) verify the entries in the transition matrix P.
d The results of Crosses 1, 2, and 5 are explained in the example. It is clear that
Cross 6 = aa × aa can lead only to offspring of type aa and thus only to Cross 6.
So Cross 6 joins Cross 1 as a second absorbing state. Cross 4 = Aa × aa is similar
to Cross 2 = AA × Aa, but with the roles of A and a reversed, and so Cross 4 must
lead to Crosses 3, 4, and 6 in the ratios 1:2:1, respectively.
Finally, we consider the Cross 3 = Aa × Aa which can lead to any of the six
crosses. Possible offspring are AA, Aa, and aa, with probabilities 1/4, 1/2 and 1/4,
respectively. Thus Cross 1 = AA × AA and Cross 6 = aa × aa each occur with
probability (1/4)2 = 1/16. Also, Cross 5 = AA × aa happens with probability
2/16 = 1/8 (with the probability doubled because, by convention, AA × aa and
aa × AA are both written as AA × aa—we don’t care which genotype came from
which sibling). Then we note that Crosses 2, 3, and 4 each occur with probability
4/16 = 1/4. For example, Cross 2 = AA × Aa, could result in a type AA male and
a type Aa female (probability (1/4)(1/2) = 1/8). But this cross might also have the
genders reversed, so the total probability of Cross 2 is 2/8 = 1/4. c

b) Evaluate the products (1/2, 0, 0, 0, 0, 1/2) · P and (1/3, 0, 0, 0, 0, 2/3) · P


by hand and comment.
P = (1/16)*matrix(c(16, 0, 0, 0, 0, 0,
4, 8, 4, 0, 0, 0,
1, 4, 4, 4, 2, 1,
0, 0, 4, 8, 0, 4,
0, 0, 16, 0, 0, 0,
0, 0, 0, 0, 0, 16), nrow=6, byrow = T)
c(1/2, 0, 0, 0, 0, 1/2) %*% P; c(1/3, 0, 0, 0, 0, 2/3) %*% P

> c(1/2, 0, 0, 0, 0, 1/2) %*% P; c(1/3, 0, 0, 0, 0, 2/3) %*% P


[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.5 0 0 0 0 0.5
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.3333333 0 0 0 0 0.6666667

d Any vector σ = (τ, 0, 0, 0, 0, 1 − τ ), where 0 ≤ τ ≤ 1, is a steady-state vector for


this nonergodic chain. There is no movement out of either state 1 or 6. c

c) Make several simulation runs similar to the one at the end of Example 7.2
and report the number of steps before absorption in each.
d We use the transition matrix, as in the second part of the answer to Problem 7.2.
For brevity, we use a rather crude and “wasteful” program: We bet on absorption
before step 1000 (an extremely good bet, but use 100 000 if you’re really not a
risk taker), and we do not bother to stop the program at absorption (as we do in
Problem 7.4). We start at Cross 3 (state 3) as in the example, and we change the
population parameter of the sample function to 1:6 in order to match the number
of states in the chain.
Finally, we do not specify a seed, and do show the results of several runs. Unlike
the situation with an ergodic chain, the starting state makes a difference here. So
results would be different if we picked a state other than 3 to start. (If we started
in state 1 or 6, we would be “absorbed” at the outset.) We explore absorption times
further in Problem 7.4.

P = (1/16)*matrix(c(16, 0, 0, 0, 0, 0,
4, 8, 4, 0, 0, 0,
1, 4, 4, 4, 2, 1,
0, 0, 4, 8, 0, 4,
0, 0, 16, 0, 0, 0,
0, 0, 0, 0, 0, 16), nrow=6, byrow = T)
m = 1000; x = numeric(m); x[1] = 3
for (i in 2:m) { x[i] = sample(1:6, 1, prob=P[x[i-1], ]) }
sba = length(x[(x > 1) & (x < 6)]); sba

> sba
[1] 14

Additional runs gave 1, 10, 4, 2, 10, and 1, respectively. c



7.4 Distribution of absorption times in brother-sister mating (continuation


of Problem 7.3). The code below simulates 10 000 runs of the brother-sister
mating process starting at state 3. Each run is terminated at absorption, and
the step and state at absorption for that run are recorded. The histogram
from one run is shown in Figure 7.14.
Run the program several times for yourself, each time with a different start-
ing state. Summarize your findings, comparing appropriate results with those
from P128 in Example 7.2 and saying what additional information is gained
by simulation.
d Below the program, we show results of two runs starting at state 3 and two runs
starting at state 2. Starting at state 3, we know from P128 in Example 7.2 that
there is a 50-50 chance of getting absorbed in either state 1 or 6. Starting at state 2,
we know the probabilities of getting in states 1 or 6 are 3/4 and 1/4, respectively.
Simulation results are consistent with this information.
In addition to what is known from P128 in Example 7.2, simulation provides
information about the distribution of absorption times. There are analytic methods
for finding mean times to absorption, which are reasonably well approximated in
the simulation below.
However, with simulation we can get more information about the distribution
of times to absorption: For both of the starting states we tried, absorption occurred
fairly quickly—at or before step 14 or 15 (starting at state 3) or before step 14 (for
state 2). (So we were not running much risk using m = 1000 in the sloppy program
of Problem 7.3(c)!)
Histograms (not shown here) give more complete information about the distrib-
ution of times to absorption. You should look at them as you explore starting states
other than 2 and 3. At the very end of the output below, we show a numerical
summary of the simulated distribution of times to absorption from our second run
starting at step 2. c

m = 10000 # number of runs


step.a = numeric(m) # steps when absorption occurs
state.a = numeric(m) # states where absorbed
for (j in 1:m)
{
x = 3 # initial state; inside the loop the length
# of x increases to record all states visited
a = 0 # changed to a positive value upon absorption
while(a==0)
{
i = length(x) # current step; state found below
if (x[i]==1) x = c(x, 1)
if (x[i]==2)
x = c(x, sample(1:6, 1, prob=c(4,8,4,0,0,0)))
if (x[i]==3)
x = c(x, sample(1:6, 1, prob=c(1,4,4,4,2,1)))
if (x[i]==4)
x = c(x, sample(1:6, 1, prob=c(0,0,4,8,0,4)))
if (x[i]==5) x = c(x, 3)

if (x[i]==6) x = c(x, 6)
# condition below checks for absorption
if (length(x[x==1 | x==6]) > 0) a = i + 1
}
step.a[j] = a -1 # absorption step for jth run
state.a[j] = x[length(x)] # absorption state for jth run
}

hist(step.a) # simulated distribution of absorption times


mean(step.a) # mean time to absorption
quantile(step.a, .95) # 95% of runs absorbed by this step
summary(as.factor(state.a))/m # dist’n of absorption states

# runs starting at state 3

> mean(step.a) # mean time to absorption


[1] 5.6256
> quantile(step.a, .95) # 95% of runs absorbed by this step
95%
15
> summary(as.factor(state.a))/m # dist’n of absorption states
1 6
0.5058 0.4942

> mean(step.a) # mean time to absorption


[1] 5.6575
> quantile(step.a, .95) # 95% of runs absorbed by this step
95%
14
> summary(as.factor(state.a))/m # dist’n of absorption states
1 6
0.4987 0.5013

# runs starting at state 2

> mean(step.a) # mean time to absorption


[1] 4.9042
> quantile(step.a, .95) # 95% of runs absorbed by this step
95%
14

> summary(as.factor(state.a))/m # dist’n of absorption states


1 6
0.7521 0.2479

> mean(step.a) # mean time to absorption


[1] 4.7996

> quantile(step.a, .95) # 95% of runs absorbed by this step


95%
14
> summary(as.factor(state.a))/m # dist’n of absorption states
1 6
0.7531 0.2469

> summary(as.factor(step.a))/m # dist’n of absorption TIMES


1 2 3 4 5 6 7 8 9
0.2506 0.1549 0.1129 0.0949 0.0770 0.0613 0.0481 0.0396 0.0282
10 11 12 13 14 15 16 17 18
0.0248 0.0204 0.0163 0.0148 0.0126 0.0085 0.0079 0.0055 0.0041
19 20 21 22 23 24 25 26 27
0.0029 0.0035 0.0017 0.0021 0.0014 0.0014 0.0015 0.0007 0.0006
28 29 30 31 32 33 34 39 42
0.0003 0.0003 0.0002 0.0003 0.0002 0.0001 0.0001 0.0001 0.0001
45
0.0001
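One standard analytic approach (presumably the one alluded to above) uses the
fundamental matrix: with Q the submatrix of P for the transient states (Crosses 2-5),
the vector of mean absorption times is (I − Q)⁻¹ 1. A short computation of these
exact means, for comparison with the simulated values:

Q = (1/16)*matrix(c(8, 4, 0, 0,     # transitions among transient Crosses 2-5
                    4, 4, 4, 2,
                    0, 4, 8, 0,
                    0, 16, 0, 0), nrow=4, byrow=TRUE)
solve(diag(4) - Q) %*% rep(1, 4)    # mean steps to absorption from Crosses 2, 3, 4, 5
                                    # about 4.83, 5.67, 4.83, 6.67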

7.5 Doubly stochastic matrix. Consider states S = {0, 1, 2, 3, 4} arranged


clockwise around a circle with 0 adjacent to 4. A fair coin is tossed. A Markov
chain moves clockwise by one number if the coin shows Heads, otherwise it
does not move.
a) Write the 1-step transition matrix P for this chain. Is it ergodic?
d See part (c). c

b) What is the average length of time this chain spends in any one state before
moving to the next? What is the average length of time to go around the
circle once? From these results, deduce the long-run distribution of this
chain. (In many chains with more than 2 states, the possible transitions
among states are too complex for this kind of analysis to be tractable.)
d The average length of time staying in any one state is 2. The number of steps W
until a move is distributed geometrically with π = 1/2, so E(W ) = 1/π = 2. So the
average number of steps to go around the circle is 5(2) = 10. On average, one-fifth
of the time is spent in each state, so the long-run distribution is expressed by the
vector λ = (0.2, 0.2, 0.2, 0.2, 0.2). c

c) Show that the vector σ = (1/5, 1/5, 1/5, 1/5, 1/5) satisfies the matrix
equation σP = σ and thus is a steady-state distribution of this chain. Is
σ also the unique long-run distribution?
d Below we show the transition matrix P, illustrate that σP = σ, and compute
a sufficiently high power of P to show that the matrix is ergodic with limiting
distribution λ (as claimed in part (b)), and thus also that σ = λ.

> P = 1/2*matrix(c(1, 1, 0, 0, 0,
+ 0, 1, 1, 0, 0,
+ 0, 0, 1, 1, 0,
+ 0, 0, 0, 1, 1,
+ 1, 0, 0, 0, 1), nrow=5, byrow=T)
> P
[,1] [,2] [,3] [,4] [,5]
[1,] 0.5 0.5 0.0 0.0 0.0
[2,] 0.0 0.5 0.5 0.0 0.0
[3,] 0.0 0.0 0.5 0.5 0.0
[4,] 0.0 0.0 0.0 0.5 0.5
[5,] 0.5 0.0 0.0 0.0 0.5

> ss.vec = rep(1/5, times=5); ss.vec


[1] 0.2 0.2 0.2 0.2 0.2
> ss.vec %*% P # verifies steady state dist’n
[,1] [,2] [,3] [,4] [,5]
[1,] 0.2 0.2 0.2 0.2 0.2

> P2 = P %*% P; P2
[,1] [,2] [,3] [,4] [,5]
[1,] 0.25 0.50 0.25 0.00 0.00
[2,] 0.00 0.25 0.50 0.25 0.00
[3,] 0.00 0.00 0.25 0.50 0.25
[4,] 0.25 0.00 0.00 0.25 0.50
[5,] 0.50 0.25 0.00 0.00 0.25

> P4 = P2 %*% P2
> P8 = P4 %*% P4; P8
[,1] [,2] [,3] [,4] [,5]
[1,] 0.2226563 0.1406250 0.1406250 0.2226563 0.2734375
[2,] 0.2734375 0.2226563 0.1406250 0.1406250 0.2226563
[3,] 0.2226563 0.2734375 0.2226563 0.1406250 0.1406250
[4,] 0.1406250 0.2226563 0.2734375 0.2226563 0.1406250
[5,] 0.1406250 0.1406250 0.2226563 0.2734375 0.2226563

> P16 = P8 %*% P8


> P32 = P16 %*% P16; P32 # illustrates ergodicity, long run dist’n
[,1] [,2] [,3] [,4] [,5]
[1,] 0.2001402 0.2004536 0.2001402 0.1996330 0.1996330
[2,] 0.1996330 0.2001402 0.2004536 0.2001402 0.1996330
[3,] 0.1996330 0.1996330 0.2001402 0.2004536 0.2001402
[4,] 0.2001402 0.1996330 0.1996330 0.2001402 0.2004536
[5,] 0.2004536 0.2001402 0.1996330 0.1996330 0.2001402

In the answer to part (b) of Problem 7.2, we showed how to simulate a Markov
chain using its one-step transition matrix P. The random walk on a circle provides a
good opportunity to show another method of simulation—“programming the story.”

This method is important because later in the chapter we consider Markov chains
that don’t have matrices. (For example, see Problem 7.11.)
set.seed(1239)
m = 100000; x = numeric(m); x[1] = 0
for (i in 2:m)
{
d = rbinom(1, 1, 1/2) # 1 if Heads, 0 if Tails
x[i] = (x[i-1] + d) %% 5 # moves clockwise if Head
}
summary(as.factor(x))/m

> summary(as.factor(x))/m
0 1 2 3 4
0.19929 0.20057 0.19934 0.19986 0.20094
The resulting limiting distribution is in essential agreement with the stationary
distribution given above. c
d) Transition matrices for Markov chains are sometimes called stochastic,
meaning that each row sums to 1. In a doubly stochastic matrix, each
column also sums to 1. Show that the limiting distribution of a K-state
chain with an ergodic, doubly stochastic transition matrix P is uniform
on the K states.
d Let σ be a K-vector with all elements 1/K. Also, as usual, denote the elements of
P as pij , for i, j = 1, . . . , K. Then the jth element of σP is
Σ_{i=1}^{K} (1/K) pij = (1/K) Σ_{i=1}^{K} pij = 1/K,

where the last equality holds because Σ_{i=1}^{K} pij = 1, for j = 1, . . . , K, as required by
the doubly-stochastic nature of P. c
e) Consider a similar process with state space S = {0, 1, 2, 3}, but with 0
adjacent to 3, and with clockwise or counterclockwise movement at each
step determined by the toss of a fair coin. (This process moves at every
step.) Show that the resulting doubly stochastic matrix is not ergodic.
d Suppose we start in an even-numbered state at step 1. Then we must be in an
even-numbered state at any odd-numbered step. For even n, Pn will have pij = 0,
for odd i and even j, and also for even i and odd j. A similar argument can be made
to show that Pn must have 0 elements for odd powers n. Therefore, there can be no
power of P with all positive elements, and Pn cannot approach a limit with all rows
the same. Below we illustrate with a few powers of P.
This is called a periodic chain of period 2. For K = 2 states, the only periodic
chain is the flip-flop chain discussed in Chapter 6. But for larger K, there can be
a variety of kinds of periodic chains. Such a random walk on a circle, with forced
movement to an immediately adjacent state at each step, is periodic with period 2
when the number of states K is even, but aperiodic (not periodic) when the number
of states is odd.

P = (1/2)*matrix(c(0, 1, 0, 1,
1, 0, 1, 0,
0, 1, 0, 1,
1, 0, 1, 0), nrow=4, byrow=T)
P
P2 = P %*% P; P2
P3 = P2 %*% P; P3
P4 = P2 %*% P2; P4

> P
[,1] [,2] [,3] [,4]
[1,] 0.0 0.5 0.0 0.5
[2,] 0.5 0.0 0.5 0.0
[3,] 0.0 0.5 0.0 0.5
[4,] 0.5 0.0 0.5 0.0

> P2 = P %*% P; P2
[,1] [,2] [,3] [,4]
[1,] 0.5 0.0 0.5 0.0
[2,] 0.0 0.5 0.0 0.5
[3,] 0.5 0.0 0.5 0.0
[4,] 0.0 0.5 0.0 0.5

> P3 = P2 %*% P; P3
[,1] [,2] [,3] [,4]
[1,] 0.0 0.5 0.0 0.5
[2,] 0.5 0.0 0.5 0.0
[3,] 0.0 0.5 0.0 0.5
[4,] 0.5 0.0 0.5 0.0

> P4 = P2 %*% P2; P4


[,1] [,2] [,3] [,4]
[1,] 0.5 0.0 0.5 0.0
[2,] 0.0 0.5 0.0 0.5
[3,] 0.5 0.0 0.5 0.0
[4,] 0.0 0.5 0.0 0.5

7.6 An Ehrenfest Urn model. A permeable membrane separates two com-


partments, Boxes A and B. There are seven molecules altogether in the two
boxes. On each step of a process, the probability is 1/2 that no molecules
move. If there is movement, then one of the seven molecules is chosen at
random and it “diffuses” (moves) from the box it is in to the other one.
a) The number of molecules in Box A can be modeled as an 8-state Markov
chain with state space S = {0, 1, . . . , 7}. For example, if the process is
currently in state 5, then the chances are 7 in 14 that it will stay in state
5 at the next step, 5 in 14 that it will go to state 4, and 2 in 14 that it will
go to state 6. The more unequal the apportionment of the molecules, the
stronger the tendency to equalize it. Write the 1-step transition matrix.

d Below, we show the transition matrix and then use it to illustrate ideas in part (b).
(Rounding and multiplying by 2⁷ make the output fit the width of the page.) c

P = (1/14)*matrix(c(7, 7, 0, 0, 0, 0, 0, 0,
1, 7, 6, 0, 0, 0, 0, 0,
0, 2, 7, 5, 0, 0, 0, 0,
0, 0, 3, 7, 4, 0, 0, 0,
0, 0, 0, 4, 7, 3, 0, 0,
0, 0, 0, 0, 5, 7, 2, 0,
0, 0, 0, 0, 0, 6, 7, 1,
0, 0, 0, 0, 0, 0, 7, 7), nrow=8, byrow=T)
ss.vec = dbinom(0:7, 7, 1/2) ## steady state vector
round(P, 5); ss.vec*2^7; ss.vec %*% P*2^7
P2 = P %*% P; P4 = P2 %*% P2; P8 = P4 %*% P4; P16 = P8 %*% P8
P32 = P16 %*% P16; P64 = P32 %*% P32; P128 = P64 %*% P64
P128 * 2^7

> round(P, 5)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.50000 0.50000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[2,] 0.07143 0.50000 0.42857 0.00000 0.00000 0.00000 0.00000 0.00000
[3,] 0.00000 0.14286 0.50000 0.35714 0.00000 0.00000 0.00000 0.00000
[4,] 0.00000 0.00000 0.21429 0.50000 0.28571 0.00000 0.00000 0.00000
[5,] 0.00000 0.00000 0.00000 0.28571 0.50000 0.21429 0.00000 0.00000
[6,] 0.00000 0.00000 0.00000 0.00000 0.35714 0.50000 0.14286 0.00000
[7,] 0.00000 0.00000 0.00000 0.00000 0.00000 0.42857 0.50000 0.07143
[8,] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.50000 0.50000

> ss.vec*2^7
[1] 1 7 21 35 35 21 7 1
> ss.vec %*% P*2^7 # see part (b)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 7 21 35 35 21 7 1

> P128 * 2^7


[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 7 21 35 35 21 7 1
[2,] 1 7 21 35 35 21 7 1
[3,] 1 7 21 35 35 21 7 1
[4,] 1 7 21 35 35 21 7 1
[5,] 1 7 21 35 35 21 7 1
[6,] 1 7 21 35 35 21 7 1
[7,] 1 7 21 35 35 21 7 1
[8,] 1 7 21 35 35 21 7 1

b) Show that the steady-state distribution of this chain is BINOM(7, 1/2). That is,
show that it satisfies λP = λ. This is also the long-run distribution.

d See answers to parts (a) and (c). c



c) More generally, show that if there are M molecules, the long-run distrib-
ution is BINOM(M, 1/2).
d For i = 1, . . . , M − 1, the positive transition probabilities can be expressed as
pi−1,i = (M − i + 1)/(2M ), pii = M/(2M ) = 1/2, and pi+1,i = (i + 1)/(2M ). We have
shown in part (a) that the chain is ergodic, so the long-run (limiting) and steady-
state (stationary) distributions are the same.
Now we show that the vector λ has elements λi = C(M, i)/2^M, where the
C-notation denotes the binomial coefficient and i = 0, 1, . . . , M . In the product λP,
the first term is easily seen to be λ0 = (1/2^M)(M/(2M )) + (M/2^M)(1/(2M )) = 1/2^M,
as required for i = 0. Similarly, it is easy to see that λM = 1/2^M.
For i = 1, . . . , M − 1, the ith element in the product has three terms, which
simplify to λi = C(M, i)/2^M. Because all three terms have some factors in common,
we abbreviate by writing K = M !/(2M · 2^M):

λi = λi−1 pi−1,i + λi pii + λi+1 pi+1,i
   = [C(M, i − 1)(M − i + 1) + C(M, i)M + C(M, i + 1)(i + 1)] / (2M · 2^M)
   = K [ (M − i + 1)/((i − 1)!(M − i + 1)!) + M/(i!(M − i)!) + (i + 1)/((i + 1)!(M − i − 1)!) ]
   = K [ i/(i!(M − i)!) + M/(i!(M − i)!) + (M − i)/(i!(M − i)!) ]
   = K · 2M/(i!(M − i)!) = C(M, i)/2^M,

as required. c
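If desired, the general claim can be checked numerically for a particular M (here
M = 5, chosen arbitrarily) by building the transition matrix and verifying that the
binomial vector is unchanged by multiplication:

M = 5
P = matrix(0, M + 1, M + 1)                    # states 0, 1, ..., M
for (i in 0:M) {
  P[i + 1, i + 1] = 1/2                        # no molecule moves
  if (i < M) P[i + 1, i + 2] = (M - i)/(2*M)   # a molecule moves into Box A
  if (i > 0) P[i + 1, i]     = i/(2*M)         # a molecule moves out of Box A
}
lam = dbinom(0:M, M, 1/2)
max(abs(lam %*% P - lam))                      # essentially 0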

d) If there are 10 000 molecules at steady state, what is the probability that
between 4900 and 5100 are in Box A?
d We interpret this to mean between 4900 and 5100, inclusive. Below is the exact
binomial probability and its normal approximation. For such a large n, the normal
approximation is very good and the continuity correction might be ignored. c

> diff(pbinom(c(4899, 5100), 10000, 1/2))


[1] 0.9555742 # exact
> n = 10000; pp = 1/2; mu = n*pp; sigma = sqrt(n*pp*(1-pp))
> diff(pnorm(c(4899.5, 5100.5), mu, sigma))
[1] 0.9555688 # normal approximation
> diff(pnorm(c((4899.5 - mu)/sigma, (5100.5 - mu)/sigma)))
[1] 0.9555688 # normal approx. using NORM(0, 1)
> diff(pnorm(c((4900 - mu)/sigma, (5100 - mu)/sigma)))
[1] 0.9544997 # without continuity correction

Note: This is a variant of the famous Ehrenfest model, modified to have proba-
bility 1/2 of no movement at any one step and thus to have an ergodic transition
matrix. (See Cox and Miller (1965), Chapter 3, for a more advanced mathematical
treatment.)

7.7 A Gambler’s Ruin problem. As Chris and Kim begin the following gam-
bling game, Chris has $4 and Kim has $3. At each step of the game, both play-
ers toss fair coins. If both coins show Heads, Chris pays Kim $1; if both show
Tails, Kim pays Chris $1; otherwise, no money changes hands. The game con-
tinues until one of the players has $0. Model this as a Markov chain in which
the state is the number of dollars Chris currently has. What is the probability
that Kim wins (that is, Chris goes broke)?
d Matrix multiplication. Because, at each step, there is no exchange of money with
probability 1/2, the process moves relatively slowly. Below we use P256 to show the
exact probabilities of absorption into state 0 (Chris is ruined) from each starting
state. These are in the first column of the matrix. Similarly, probabilities that
Kim is ruined are shown in the last column.
For example, given that Chris has $4 at the start, the probability Chris is ruined
is 0.42857 and the probability Kim is ruined is 0.57143. Notice that these numbers
are from row [5,] of the matrix, which corresponds to state 4. However, this method
does not give us information about how many steps the game lasts until absorption.

P = (1/4)*matrix(c(4, 0, 0, 0, 0, 0, 0, 0,
1, 2, 1, 0, 0, 0, 0, 0,
0, 1, 2, 1, 0, 0, 0, 0,
0, 0, 1, 2, 1, 0, 0, 0,
0, 0, 0, 1, 2, 1, 0, 0,
0, 0 ,0, 0, 1, 2, 1, 0,
0, 0, 0, 0, 0, 1, 2, 1,
0, 0, 0, 0, 0, 0, 0, 4), nrow=8, byrow=T)

P2 = P %*% P
P4 = P2 %*% P2 # intermediate powers not printed
P8 = P4 %*% P4
P16 = P8 %*% P8
P32 = P16 %*% P16
P64 = P32 %*% P32
P128 = P64 %*% P64
P256 = P128 %*% P128
round(P256, 5)

> round(P256, 5)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1.00000 0 0 0 0 0 0 0.00000
[2,] 0.85714 0 0 0 0 0 0 0.14286
[3,] 0.71428 0 0 0 0 0 0 0.28571
[4,] 0.57143 0 0 0 0 0 0 0.42857
[5,] 0.42857 0 0 0 0 0 0 0.57143
[6,] 0.28571 0 0 0 0 0 0 0.71428
[7,] 0.14286 0 0 0 0 0 0 0.85714
[8,] 0.00000 0 0 0 0 0 0 1.00000

Simulation. We use the matrix format introduced in the answer to Problem 7.2(b)
to modify the program of Problem 7.3. Notice that in this program i stands for the
state being vacated rather than the one being entered. Also, the row and column
numbers of the matrix run from 1 through 8, whereas the states run from 0 through 7.
We have used state 4 as the starting state.
After simulating 10 000 games, we obtain 0.4287 as the probability of Chris’s ruin,
which is very close to the exact value 0.42857 obtained just above from P256.
The mean time until absorption is about 25 steps. Although 95% of the games ended by
the 65th step, the histogram (not shown here) for the seed we used indicates
that there was at least one game that lasted for almost 200 steps. (For
the seed shown, we found from max(step.a) that the longest game had 195 steps.
Then, from sum(step.a==195), we discovered that there happened to be a second
game of that same length.) c

set.seed(1235)
m = 10000 # number of runs
step.a = numeric(m) # steps when absorption occurs
state.a = numeric(m) # states where absorbed

for (j in 1:m)
{
x = 4 # initial state
a = 0 # changed upon absorption
while(a==0) {
i = length(x) # current step; state found below
x = c(x, sample(0:7, 1, prob=P[x[i]+1,])) # uses P from above
if (length(x[x==0 | x==7]) > 0) a = i + 1 }
step.a[j] = a # absorption step for jth run
state.a[j] = x[length(x)] # absorption state for jth run
}

hist(step.a) # simulated distribution of absorption times


mean(step.a) # mean time to absorption
quantile(step.a, .95) # 95% of runs absorbed by this step
summary(as.factor(state.a))/m # dist’n of absorption states

> mean(step.a) # mean time to absorption


[1] 25.1886
> quantile(step.a, .95) # 95% of runs absorbed by this step
95%
65
> summary(as.factor(state.a))/m # dist’n of absorption states
0 7
0.4287 0.5713

Note: This is a version of the classic gambler’s ruin problem. Many books on stochas-
tic processes derive general formulas for the probability of the ruin of each player
and the expected time until ruin. Approximations of these results can be obtained
by adapting the simulation program of Problem 7.4.
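The exact values can also be computed directly from the transition matrix by the standard
fundamental-matrix approach. The following sketch is our addition for comparison; it assumes
the gambler’s-ruin matrix P defined above is still in the workspace.

Q = P[2:7, 2:7]                  # transitions among transient states 1, ..., 6
R = P[2:7, c(1, 8)]              # transient states to absorbing states 0 and 7
N = solve(diag(6) - Q)           # fundamental matrix
B = N %*% R                      # absorption probabilities (row 4 is starting state 4)
t.abs = N %*% rep(1, 6)          # expected number of transitions until absorption
round(B, 5); round(t.abs, 3)

Row 4 of B reproduces the probabilities 0.42857 and 0.57143, and row 4 of t.abs gives 24
expected transitions; the simulated step.a counts the starting step as well, so its mean of
about 25.2 is consistent.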

7.8 Suppose weather records for a particular region show that 1/4 of Dry
(0) days are followed by Wet (1) days. Also, 1/3 of the Wet days that are
immediately preceded by a Dry day are followed by a Dry day, but there can
never be three Wet days in a row.
a) Show that this situation cannot be modeled as a 2-state Markov chain.
d Let Xi denote the state (0 or 1) at step i. Then P {X3 = 1|X2 = 1, X1 = 0} = 2/3,
but P {X3 = 1|X2 = 1, X1 = 1} = 0 because of the prohibition of three wet days in
a row. c
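
To see the two different conditional probabilities numerically, here is a small simulation of
the stated weather rules (our illustration, not part of the original answer; the seed and run
length are arbitrary).

set.seed(1)
m = 100000
x = numeric(m); x[1:2] = 0                 # 0 = Dry, 1 = Wet; start with two Dry days
for (i in 3:m) {
  if (x[i-1] == 0)       p.wet = 1/4       # yesterday Dry
  else if (x[i-2] == 0)  p.wet = 2/3       # Dry then Wet
  else                   p.wet = 0         # two Wet days in a row: next must be Dry
  x[i] = rbinom(1, 1, p.wet)
}
mean(x[3:m][x[2:(m-1)]==1 & x[1:(m-2)]==0])  # estimates P{X3=1 | X2=1, X1=0}, near 2/3
mean(x[3:m][x[2:(m-1)]==1 & x[1:(m-2)]==1])  # estimates P{X3=1 | X2=1, X1=1}, exactly 0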

b) However, this situation can be modeled as a 4-state Markov chain by the


device of considering overlapping paired-day states: S = {00, 01, 10, 11}.
For example, 00 can be followed by 00 (three dry days in a row) or 01,
but it would contradict the definition of states for 00 to be followed by
10 or 11; logically, half of the entries in the 1-step transition matrix must
be 0. The prohibition on three Wet days in a row dictates an additional 0
entry. Write the 4 × 4 transition matrix, show that it is ergodic, and find
the long-run distribution.
d Let S be the states of a Y -process. In the one-step transition matrix below, the
first row is for 00, the second for 01, and so on. The 0 in the last row and column
is dictated by the ban on three consecutive wet days. Results for the Y -process are
shown below.
The chain is ergodic because the matrix P4 (actually even P3 , if you compute
it) has all positive elements. The matrix P32 provides the long-run distribution
λ = (0.5294118, 0.1764706, 0.1764706, 0.1176471). At the end of the R printout, we
show that this vector is a steady-state distribution—as it must be.

P = (1/12)*matrix(c(9, 3, 0, 0,
0, 0, 4, 8,
9, 3, 0, 0,
0, 0, 12, 0), nrow=4, byrow=T)
P2 = P %*% P; P4 = P2 %*% P2; P8 = P4 %*% P4
P16 = P8 %*% P8; P32 = P16 %*% P16
P4; P32

> P4; P32


[,1] [,2] [,3] [,4]
[1,] 0.5351563 0.1783854 0.1788194 0.1076389
[2,] 0.5364583 0.1788194 0.1319444 0.1527778
[3,] 0.5351563 0.1783854 0.1788194 0.1076389
[4,] 0.4843750 0.1614583 0.2291667 0.1250000
[,1] [,2] [,3] [,4]
[1,] 0.5294118 0.1764706 0.1764706 0.1176471
[2,] 0.5294118 0.1764706 0.1764706 0.1176471
[3,] 0.5294118 0.1764706 0.1764706 0.1176471
[4,] 0.5294118 0.1764706 0.1764706 0.1176471

ss = c(0.5294118, 0.1764706, 0.1764706, 0.1176471)


ss %*% P

> ss %*% P
[,1] [,2] [,3] [,4]
[1,] 0.5294118 0.1764706 0.1764706 0.1176471

Without the 3-day restriction, the X-process would have been a Markov process
with states 0 and 1. That process would have α = 1/4, β = 1/3, and long-run distri-
bution λ = (4/7, 3/7). Because states 00 and 10 of the Y -process result in a dry day,
the long-run probability of a dry day is 0.5294118 + 0.1764706 = 0.7058824, com-
pared to the probability 4/7 = 0.5714286 of a dry day without the 3-day restriction.
Thus the prohibition against three wet days in a row implies a substantial increase
in the proportion of dry days over the long run. c
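A quick simulation check of this long-run proportion (a sketch we add for illustration; it uses
the 4-state matrix P just defined, and the seed is arbitrary):

set.seed(1)
m = 100000; y = numeric(m); y[1] = 1       # states 1, ..., 4 correspond to 00, 01, 10, 11
for (i in 2:m) { y[i] = sample(1:4, 1, prob=P[y[i-1], ]) }
mean(y %in% c(1, 3))                       # today Dry (states 00 and 10); near 0.706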
Hints and answers: (a) The transition probability p11 would have to take two
different values depending on the weather two days back. State two relevant condi-
tional probabilities with different values. (b) Over the long run, about 29% of the
days are Wet; give a more accurate value.

7.9 Hardy-Weinberg Equilibrium. In a certain large population, a gene has


two alleles a and A, with respective proportions θ and 1 − θ. Assume these
same proportions hold for both males and females. Also assume there is no
migration in or out and no selective advantage for either a or A, so these
proportions of alleles are stable in the population over time. Let the genotypes
aa = 1, Aa = 2, and AA = 3 be the states of a process. At step 1, a female
is of genotype aa, so that X1 = 1. At step 2, she selects a mate at random
and produces one or more daughters, of whom the eldest is of genotype X2 .
At step 3, this daughter selects a mate at random and produces an eldest
daughter of genotype X3 , and so on.
a) The X-process is a Markov chain. Find its transition matrix. For example,
here is the argument that p12 = 1 − θ: A mother of type aa = 1 surely
contributes the allele a to her daughter, and so her mate must contribute
an A-allele in order for the daughter to be of type Aa = 2. Under random
mating, the probability of acquiring an A-allele from the father is 1 − θ.

d See the Hints. The last row of the one-step transition matrix is (0, θ, 1 − θ). c

b) Show that this chain is ergodic. What is the smallest N that gives P^N > 0?
d By simple matrix algebra, the smallest is N = 2. For a numerical result, see the answer
to part (d). c

c) According to the Hardy-Weinberg Law, this Markov chain has the “equi-
librium” (steady-state) distribution σ = [θ^2, 2θ(1 − θ), (1 − θ)^2]. Verify
that this is true.
d Simple matrix algebra. Also, see the answer to part (d) for the case θ = 0.2. c
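A quick numerical check of σP = σ for a general θ (a sketch added for illustration; θ = 0.3 is
an arbitrary choice, and the rows of P are those given in the answer to part (a) and the Hints):

theta = 0.3                                # arbitrary illustrative value
P = matrix(c(theta,   1 - theta,     0,
             theta/2, 1/2,           (1 - theta)/2,
             0,       theta,         1 - theta), nrow=3, byrow=T)
sg = c(theta^2, 2*theta*(1 - theta), (1 - theta)^2)
sg %*% P                                   # reproduces sg, so sigma is a steady state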

d) For θ = 0.2, simulate this chain for m = 50 000 iterations and verify
that the sampling distribution of the simulated states approximates the
Hardy-Weinberg vector.
d We used m = 100 000 iterations below. In simulation, this chain is relatively slow
to stabilize. However, P32 agrees with the limiting value. c

P = matrix(c(.2, .8, 0,
.1, .5, .4,
0, .2, .8), nrow=3, byrow=T)

P2 = P %*% P; P4 = P2 %*% P2
P8 = P4 %*% P4; P16 = P8 %*% P8
P32 = P16 %*% P16; P64 = P32 %*% P32
P2; P32

> P2; P32


[,1] [,2] [,3]
[1,] 0.12 0.56 0.32
[2,] 0.07 0.41 0.52
[3,] 0.02 0.26 0.72

[,1] [,2] [,3]


[1,] 0.04 0.32 0.64
[2,] 0.04 0.32 0.64
[3,] 0.04 0.32 0.64

ss = c(.04, .32, .64)


ss %*% P

> ss %*% P
[,1] [,2] [,3]
[1,] 0.04 0.32 0.64

set.seed(1238)
m = 100000; x = numeric(m); x[1] = 1
for (i in 2:m)
{
x[i] = sample(1:3, 1, prob=P[x[i-1], ])
}
summary(as.factor(x))/m

> summary(as.factor(x))/m
1 2 3 # compare with exact
0.04007 0.31928 0.64065 # (.04, .32, .64)

Hints and partial answers: (a) In deriving p12 , notice that it makes no difference how
the A-alleles in the population may currently be apportioned among males of types

AA and Aa. For example, suppose θ = 20% in a male population with 200 alleles
(100 individuals), so that there are 40 a-alleles and 160 As. If only genotypes AA
and aa exist, then there are 80 AAs to choose from, any of them would contribute an
A-allele upon mating, and the probability of an Aa offspring is 80% = 1 − θ. If there
are only 70 AAs among the males, then there must be 20 Aas. The probability that
an Aa mate contributes an A-allele is 1/2, so that the total probability of an Aa
offspring is again 1(0.70) + (1/2)(0.20) = 80% = 1 − θ. Other apportionments of
genotypes AA and Aa among males yield the same result. The first row of the matrix
P is [θ, 1 − θ, 0]; its second row is [θ/2, 1/2, (1 − θ)/2]. (b) For the given σ, show
that σP = σ. (d) Use a program similar to the one in Example 7.1.
7.10 Algebraic approach. For a K-state ergodic transition matrix P, the
long-run distribution is proportional to the unique row eigenvector λ corre-
sponding to eigenvalue 1. In R, g = eigen(t(P))$vectors[,1]; g/sum(g),
where the transpose function t is needed to obtain a row eigenvector,
$vectors[,1] to isolate the relevant part of the eigenvalue-eigenvector dis-
play, and the division by sum(g) to give a distribution. Use this method to
find the long-run distributions of two of the chains in Problems 7.2, 7.5, 7.6,
and 7.8—your choice, unless your instructor directs otherwise. (See Cox and
Miller (1965) for the theory.)
d In general, eigenvectors can involve complex numbers. But the relevant eigenvec-
tor for an ergodic Markov chain is always real. We use as.real to suppress the
irrelevant 0 imaginary components. (In current versions of R, as.real has been
removed; Re serves the same purpose.)
We also show the results for the Hardy-Weinberg Equilibrium of Problem 7.9,
along with the complete eigenvalue-eigenvector display from which the particular
eigenvector of interest is taken. c
# Problem 7.2: CpG Sea
P = matrix(c(0.300, 0.205, 0.285, 0.210,
0.322, 0.298, 0.078, 0.302,
0.248, 0.246, 0.298, 0.208,
0.177, 0.239, 0.292, 0.292), nrow=4, byrow=T)
g = eigen(t(P))$vectors[,1]; g/sum(g); as.real(g/sum(g))

> g = eigen(t(P))$vectors[,1]; g/sum(g); as.real(g/sum(g))


[1] 0.2618869+0i 0.2462998+0i 0.2388920+0i 0.2529213+0i
[1] 0.2618869 0.2462998 0.2388920 0.2529213

# Problem 7.5: Doubly stochastic


P = 1/2*matrix(c(1, 1, 0, 0, 0,
0, 1, 1, 0, 0,
0, 0, 1, 1, 0,
0, 0, 0, 1, 1,
1, 0, 0, 0, 1), nrow=5, byrow=T)
g = eigen(t(P))$vectors[,1]; as.real(g/sum(g))

> g = eigen(t(P))$vectors[,1]; as.real(g/sum(g))


[1] 0.2 0.2 0.2 0.2 0.2

# Problem 7.6: Ehrenfest Urn


P = (1/14)*matrix(c(7, 7, 0, 0, 0, 0, 0, 0,
1, 7, 6, 0, 0, 0, 0, 0,
0, 2, 7, 5, 0, 0, 0, 0,
0, 0, 3, 7, 4, 0, 0, 0,
0, 0, 0, 4, 7, 3, 0, 0,
0, 0, 0, 0, 5, 7, 2, 0,
0, 0, 0, 0, 0, 6, 7, 1,
0, 0, 0, 0, 0, 0, 7, 7), nrow=8, byrow=T)
g = eigen(t(P))$vectors[,1]; round(g/sum(g), 5)

> g = eigen(t(P))$vectors[,1]; round(g/sum(g), 5) # real


[1] 0.00781 0.05469 0.16406 0.27344 0.27344 0.16406 0.05469 0.00781

# Problem 7.8: 4-state weather


P = (1/12)*matrix(c(9, 3, 0, 0,
0, 0, 4, 8,
9, 3, 0, 0,
0, 0, 12, 0), nrow=4, byrow=T)
g = eigen(t(P))$vectors[,1]; g/sum(g); as.real(g/sum(g))

> g = eigen(t(P))$vectors[,1]; as.real(g/sum(g))


[1] 0.5294118 0.1764706 0.1764706 0.1176471

# Problem 7.9: Hardy-Weinberg Equilibrium


P = matrix(c(.2, .8, 0,
.1, .5, .4,
0, .2, .8), nrow=3, byrow=T)
g = eigen(t(P))$vectors[,1]; as.real(g/sum(g))

> g = eigen(t(P))$vectors[,1]; as.real(g/sum(g))


[1] 0.04 0.32 0.64

> eigen(t(P))
$values
[1] 1.000000e+00 5.000000e-01 -9.313297e-17

$vectors
[,1] [,2] [,3]
[1,] 0.05581456 0.1961161 0.4082483
[2,] 0.44651646 0.5883484 -0.8164966
[3,] 0.89303292 -0.7844645 0.408248

7.11 Reflecting barrier. Consider a random walk on the nonnegative inte-


gers with pi,i−1 = 1/2, pi,i+1 = 1/4, and pii = 1/4, for i = 1, 2, 3, . . . , but
with p00 = 1/4 and p01 = 3/4. There is a negative drift, but negative values
are impossible because the particle gets “reflected” to 1 whenever the usual
leftward displacement would have taken it to −1.
a) Argue that the following R script simulates this process, run the program,
and comment on whether there appears to be a long-run distribution.
d The countably infinite state space S is the nonnegative integers, so the matrix
methods we have discussed so far do not work here. The method of simulation is to
consider the random displacement: one step to the left, no movement, or one step to
the right. The displacement di takes values −1, 0, or 1, with appropriate probabili-
ties. To get from state xi−1 at step i − 1 to state xi at step i, we find xi = xi−1 + di.
The absolute value “reflects” the particle to 1 if a negative displacement would have
taken it to −1.
Below, the displacements are simulated all at once in an m-vector before the
loop. Alternatively, each displacement di could be simulated “just in time” inside
the loop. We do not show the histogram here, but the tally of results shows that
the most common state is 1 with probability 0.3687 and that the distribution has
a tail extending to the right. In the run shown, the strong negative drift keeps the
particle from going to the right farther than state 10. c

set.seed(1237)
m = 10000
d = sample(c(-1,0,1), m, replace=T, c(1/2,1/4,1/4))
x = numeric(m); x[1] = 0
for (i in 2:m) {x[i] = abs(x[i-1] + d[i]) }

summary(as.factor(x))
cutp=0:(max(x)+1) - .5; hist(x, breaks=cutp, prob=T)
k = 1:max(x); points(c(0,k), c(1/4,(3/4)*(1/2)^k)) # see part (b)

> summary(as.factor(x))
0 1 2 3 4 5 6 7 8 9 10
2453 3687 1900 936 498 255 135 69 34 26 7

b) Show that the steady-state distribution of this chain is given by λ0 = 1/4
and λi = (3/4)(1/2)^i, for i = 1, 2, . . ., by verifying that these values of λi satisfy
the equations λj = Σ_{i=0}^∞ λi pij, for j = 0, 1, . . . . For this chain, the steady-
state distribution is unique and is also the long-run distribution. Do these
values agree reasonably well with those simulated in part (a)?
d First, notice that Σ_{i=0}^∞ λi = 1/4 + 3/8 + 3/16 + · · · = 1/4 + (3/4)[1/2 + 1/4 + · · ·] = 1/4 + 3/4 = 1.
Thus, the λi > 0, for all nonnegative integers i, are a probability distribution.
To finish, we distinguish three cases: λ0, λ1, and λj, for j ≥ 2.
If j = 0: λ0 = Σi λi pi0 = λ0 p00 + λ1 p10 = (1/4)(1/4) + (3/8)(1/2) = 1/4.
If j = 1: λ1 = Σi λi pi1 = λ0 p01 + λ1 p11 + λ2 p21 = (1/4)(3/4) + (3/8)(1/4) + (3/16)(1/2) = 3/8.
If j ≥ 2:

  λj = Σ_{i=0}^∞ λi pij = λj−1 pj−1,j + λj pjj + λj+1 pj+1,j
     = (3/4)[(1/2)^(j−1)(1/4) + (1/2)^j(1/4) + (1/2)^(j+1)(1/2)]
     = (3/4)(1/2)^(j−1)[1/4 + 1/8 + 1/8] = (3/4)(1/2)^j.

We use i = 1:6; round(c(1/4,(3/4)*(1/2)^i), 4) to find the first few terms


of this distribution, obtaining (0.2500, 0.3750, 0.1875, 0.0938, 0.0469, 0.0234, 0.0117).
The results of the simulation in part (a) are consistent with these exact values, within
the accuracy that can be expected of a simulation with m = 10 000 iterations. (We
added a line to the program of part (a) to put points on the histogram corresponding
to the exact distribution.) c

7.12 Attraction toward the origin. Consider the random walk simulated by
the R script below. There is a negative drift when Xn−1 is positive and a
positive drift when it is negative, so that there is always drift towards 0. (The
R function sign returns values −1, 0, and 1 depending on the sign of the
argument.)
# set.seed(1212)
m = 10000; x = numeric(m); x[1] = 0
for (i in 2:m)
{
drift = (2/8)*sign(x[i-1]); p = c(3/8+drift, 2/8, 3/8-drift)
x[i] = x[i-1] + sample(c(-1,0,1), 1, replace=T, prob=p)
}
summary(as.factor(x))
par(mfrow=c(2,1)) # prints two graphs on one page
plot(x, type="l")
cutp = seq(min(x), max(x)+1)-.5; hist(x, breaks=cutp, prob=T)
par(mfrow=c(1,1))

> summary(as.factor(x))
-4 -3 -2 -1 0 1 2 3 4 5
13 96 508 2430 4035 2343 466 93 15 1

a) Write the transition probabilities pij of the chain simulated by this pro-
gram. Run the program, followed by acf(x), and comment on the result-
ing graphs. (See Figure 7.15.)
d The transition probabilities are p 0,−1 = p 01 = 3/8, and p 00 = 2/8, for transitions
from state 0. For transitions from a positive state i, the probabilities are pi,i−1 = 5/8,
pii = 2/8, and pi,i+1 = 1/8. Finally, for negative i, we have pi,i−1 = 1/8, pii = 2/8,
and pi,i+1 = 5/8.
A tally of simulated values of the chain, for the seed shown, appears beneath the
program. A “history plot” of simulated values in the order they occurred is shown
in Figure 7.15, along with a relative frequency histogram. The chain seems to move
readily among its states. The bias for transitions toward the origin keeps the chain
from moving very far from 0, with most of the values between ±3. The ACF plot
shows positive, significant correlations for most lags up to about 25. c

b) Use the method of Problem 7.11 to show that the long-run distribution
is given by λ0 = 2/5 and λi = (6/5)(1/5)^|i| for positive and negative integer
values of i. Do these values agree with your results in part (a)?
d First, Σ_{i=−∞}^∞ λi = Σ_{i=−∞}^{−1} λi + λ0 + Σ_{i=1}^∞ λi = (6/5)(1/4) + 2/5 + (6/5)(1/4) = (6 + 8 + 6)/20 = 1,
where the first and last terms are sums of the same geometric series. Thus the λi,
for (negative, zero, and positive) integer i, form a probability distribution.
Then, we verify that the steady-state equations λj = Σ_{i=−∞}^∞ λi pij, for all inte-
gers j, have the solutions claimed. For this verification, we distinguish five cases—
where j < −1, j = −1, j = 0, j = 1, and j > 1, respectively:
For j = 0, the right side of the equation has only three terms because only three of
the relevant pij are positive: λ0 = λ−1 p−1,0 + λ0 p00 + λ1 p10
= (6/25)(5/8) + (2/5)(2/8) + (6/25)(5/8) = (30 + 20 + 30)/200 = 2/5.
For j = 1, the right side again has three positive terms: λ1 = λ0 p01 + λ1 p11 + λ2 p21
= (2/5)(3/8) + (6/5)(1/5)(2/8) + (6/5)(1/5)^2(5/8) = (6/5)(5 + 2 + 1)/40 = (6/5)(1/5).
For j > 1: Similarly, the equation becomes λj = λj−1 pj−1,j + λj pjj + λj+1 pj+1,j
= (6/5)[(1/5)^(j−1)(1/8) + (1/5)^j(2/8) + (1/5)^(j+1)(5/8)]
= (6/5)(1/8)(1/5)^(j−1)[1 + 2/5 + 1/5] = (6/5)(1/5)^j.
For the two negative cases, the verification is very similar to the corresponding
positive ones because of the absolute value in the exponent of 1/5 and because the
drift is in the opposite direction.

k = -4:4; ss = (6/5)*(1/5)^abs(k); ss[k==0] = 2/5


round(rbind(k, ss), 4)

> round(rbind(k, ss), 4)


[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
k -4.0000 -3.0000 -2.000 -1.00 0.0 1.00 2.000 3.0000 4.0000
ss 0.0019 0.0096 0.048 0.24 0.4 0.24 0.048 0.0096 0.0019

The code above prints exact probabilities of the most likely values of the steady-
state distribution. The values simulated in part (a) are in reasonable agreement with
these exact values. Of course, if we used m = 100 000 iterations, the simulated values
would tend to be more accurate. c
7.13 Random walk on a circle. In Example 7.5, the displacements of the
random walk on the circle are UNIF(−0.1, 0.1) and the long-run distribution
is UNIF(0, 1). Modify the program of the example to explore the long-run
behavior of such a random walk when the displacements are NORM(0, 0.1).
Compare the two chains.
d This process with normally-distributed displacements also produces uniformly dis-
tributed outcomes on the unit interval. The plots (not shown here) look similar to
the corresponding ones for the process of Example 7.5 shown in Figures 7.6 and 7.7.
The hist function with parameter plot=F provides text output of information
used to plot and label a histogram. Here we show two of the several lines of the

resulting output—giving the counts in each of the 10 bins and the bin midpoints. For
a uniform limiting distribution, we would expect about m/10 = 5000 observations in
each bin, and the bin counts we obtained from the simulation seem at least roughly
consistent with that. c

set.seed(1214)
m = 500000
d = c(0, rnorm(m-1, 0, .1))
x = cumsum(d) %% 1

par(mfrow=c(2,1)) # not shown


hist(x, breaks=10, prob=T, xlab="State", ylab="Proportion")
plot(x[1:1000], pch=".", xlab="Step", ylab="State")
par(mfrow=c(1,1))
hist(x, breaks=10, plot=F)

> hist(x, breaks=10, plot=F)


...
$counts
[1] 4974 4854 4861 4865 5008 5041 5029 5259 5129 4980
...
$mids
[1] 0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95
...

7.14 Explore the following two variants of Example 7.6 (p171).


d This problem is largely graphical and exploratory, and the required changes in the
R code are simple. No answers are provided. See the text for the questions. c
7.15 Monte Carlo integration. Modify the procedure in Example 7.7 (p174)
to make a Markov chain whose limiting distribution is uniform on the first
quadrant of the unit circle. If Z1 and Z2 are independent and standard normal,
use your modified chain to approximate P {Z1 > 0, Z2 > 0, Z1^2 + Z2^2 < 1}.
set.seed (1234)
m = 5000; x = y = numeric(m); x[1] = y[1] = 0
for (i in 2:m)
{
x[i] = runif(1, 0, sqrt(1-y[i-1]^2))
y[i] = runif(1, 0, sqrt(1-x[i]^2))
}

plot(x, y, pch=".", xlim=c(0,1), ylim=c(0,1))


# plot not shown here: 5000 random points in 1/4 circle
h = dnorm(x)*dnorm(y)
(pi/4)*mean(h) # sim. value; pi/4 is area of integ.
2*(pi/4)*sd(h)/sqrt(m) # approximate margin of error
pchisq(1, 2)/4 # exact value

> (pi/4)*mean(h) # simulated value


[1] 0.09834325
> 2*(pi/4)*sd(h)/sqrt(m) # approximate margin of error
[1] 0.0003974617
> pchisq(1, 2)/4 # exact value
[1] 0.09836734

7.16 Sierpinski Triangle. Consider S, obtained by successive deletions from


a (closed) triangle △ABC of area 1/2 with vertices at A = (0, 0), B = (0, 1),
and C = (1, 0). Successively delete open subtriangles of △ABC as follows.
At stage 1, delete the triangle of area 1/8 with vertices at the center points
of the sides of △ABC, leaving the union of three triangles, each of area 1/8.
At stage 2, delete three more triangles, each of area 1/32 with vertices at the
center points of the sides of triangles remaining after stage 1, leaving the union
of nine triangles, each of area 1/32. Iterate this process forever. The result is
the Sierpinski Triangle.
a) Show that the area of S is 0. That is, the infinite sum of the areas of all
the triangles removed is 1/2.
d At step 1, remove 1 = 3^0 triangle of area 1/8 = (1/2)(1/4)^1; at step 2, remove
3 = 3^1 triangles, each of area 1/32 = (1/2)(1/4)^2; and so on. In general, at step k,
remove 3^(k−1) triangles, each of area (1/2)(1/4)^k. Thus the total area we have removed
is Σ_{k=1}^∞ (1/2)(1/4)^k 3^(k−1) = (1/6) Σ_{k=1}^∞ (3/4)^k = 3/6 = 1/2. This means we have
“nibbled away” the entire area of the original triangle; the figure of zero area that
remains is Sierpinski’s Triangle. c
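
The series is easy to check numerically in R (a small sketch we add; 60 terms is an arbitrary
truncation point):

k = 1:60
sum((1/2)*(1/4)^k * 3^(k-1))   # very close to 1/2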

b) S is the state space of a Markov chain. Starting with (X1 , Y1 ) = (1/2, 1/2),
choose a vertex of the triangle at random (probability 1/3 each) and
let (X2 , Y2 ) be the point halfway to the chosen vertex. At step 3, choose a
vertex, and let (X3 , Y3 ) be halfway between (X2 , Y2 ) and the chosen ver-
tex. Iterate. Suppose the first seven vertices chosen are A, A, C, B, B, A, A.
(These were taken from the run in part (c).) Find the coordinates of
(Xn , Yn ), for n = 2, 3, . . . , 8, and plot them by hand.

d Below is a hand-made table of results. The last three rows are from R code that
captures the first seven components of m-vectors in the program of part (c). c

Step: 1 2 3 4 5 6 7
Vertex: A A C B B A A
k: 1 1 3 2 2 1 1
x: 0.5 0.25 0.125 0.0625 0.53125 0.765625 0.3828125
y: 0.5 0.25 0.125 0.5625 0.28125 0.140625 0.0703125
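
For reference, the values above can be captured in R after running the script of part (c),
assuming the vectors k, x, and y from that run are still available (a sketch; it reproduces the
table only for the run whose first seven vertex choices are listed above):

round(rbind(step = 1:7, k = k[1:7], x = x[1:7], y = y[1:7]), 7)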

c) As shown in Figure 7.12 (p176), the R script below generates enough points
of S to suggest the shape of the state space. (The default distribution of
the sample function assigns equal probabilities to the values sampled, so
the prob parameter is not needed here.)

# set.seed(1212)
m = 5000
e = c(0, 1, 0); f = c(0, 0, 1)
k = sample(1:3, m, replace=T)
x = y = numeric(m); x[1] = 1/2; y[1] = 1/2

for (i in 2:m) {
x[i] = .5*(x[i-1] + e[k[i-1]])
y[i] = .5*(y[i-1] + f[k[i-1]]) }
plot(x,y,pch=20)

Within the limits of your patience and available computing speed, increase
the number m of iterations in this simulation. Why do very large values of
m give less-informative plots? Then try plot parameter pch=".". Also,
make a plot of the first 100 states visited, similar to Figure 7.10. Do you
think such plots would enable you to distinguish between the Sierpinski
chain and the chain of Example 7.7?
d Plotting points have area, which Sierpinski’s Triangle does not. So too many points
can muddy the picture. Points made using pch="." have much less area than those
made with pch=20, so you can use more of them without making a mess—perhaps
to get a plot you like better.
A plot connecting the first 100 simulated points is a relatively poor tool for
distinguishing the Sierpinski chain from the one in Figure 7.10. But the plot for
Sierpinski’s triangle does tend to have longer line segments—skipping across the
large missing central triangle. You might learn to exploit this tendency. c

d) As n increases, the number of possible values of Xn increases. For exam-


ple, 3 points for n = 2 and 9 points for n = 3, considering all possible
paths. Points available at earlier stages become unavailable at later stages.
For example, it is possible to have (X2 , Y2 ) = (1/4, 1/4), but explain why
this point cannot be visited at any higher numbered step. By a similar
argument, no state can be visited more than once. Barnsley (1988, p372)
shows that the limiting distribution can be regarded as a “uniform” dis-
tribution on Sierpinski’s Triangle.
d The point (1/4, 1/4) can occur at the second step. A point at the third step will
have coordinates that are irreducible fractions with denominator 8, and so on. So at
the kth step, coordinates are irreducible fractions with denominator 2^k.
For an easy example, imagine that the vertices at (0, 0) and (1, 0) are repeat-
edly chosen. Alternating between these two “target” vertices, we might get the
x-coordinates 1/2, 1/4, 5/8, 5/16, and so on. For another example, see the results
in the answer to part (b). c
Note: Properties of S related to complex analysis, chaos theory and fractal geometry
have been widely studied. Type sierpinski triangle into your favorite search
engine to list hundreds of web pages on these topics. (Those from educational and
governmental sites may have the highest probability of being correct.)

7.17 Continuation of Problem 7.16: Fractals. Each subtriangle in Fig-


ure 7.12 is a miniature version of the entire Sierpinski set. Similarly, here
each “petal” of the frond in Figure 7.16 is a miniature version of the frond
itself, as is each “lobe” of each petal, and so on to ever finer detail beyond the
resolution of the figure. This sort of self-similarity of subparts to the whole
characterizes one type of fractal.
By selecting at random among more general kinds of movement, one can
obtain a wide range of such fractals. Figure 7.16 resembles a frond of the
black spleenwort fern. This image was made with the R script shown below.
It is remarkable that such a simple algorithm can realistically imitate the
appearance of a complex living thing.
a) In the fern process, the choices at each step have unequal probabilities,
as specified by the vector p. For an attractive image, these “weights” are
chosen to give roughly even apparent densities of points over various parts
of the fern. Run the script once as shown. Then vary p in several ways to
observe the role played by these weights.
m = 30000
a = c(0, .85, .2, -.15); b = c(0, .04, -.26, .28)
c = c(0, -.04, .23, .26); d = c(.16, .85, .22, .24)
e = c(0, 0, 0, 0); f = c(0, 1.6, 1.6, .44)
p = c(.01, .85, .07, .07)
k = sample(1:4, m, repl=T, p)
h = numeric(m); w = numeric(m); h[1] = 0; w[1] = 0
for (i in 2:m)
{
h[i] = a[k[i]]*h[i-1] + b[k[i]]*w[i-1] + e[k[i]]
w[i] = c[k[i]]*h[i-1] + d[k[i]]*w[i-1] + f[k[i]]
}
plot(w, h, pch=20, col="darkgreen")
d Try using m = 100 000 iterations and plotting with pch="." to get a much more
intricately detailed fern than we could show in print. In experimenting with compo-
nents of p, perhaps the most striking finding is that the first component of p controls
the skeletal stems of the fern, which take up little area and so need relatively few
dots to be seen clearly. Setting the first component to 0 erases the stems entirely.
Other components affect the relative densities of various parts of the fronds. c
b) How can the vectors of parameters (a, b, etc.) of this script be changed
to display points of Sierpinski’s Triangle?
d Various changes are possible. Here we show what we suppose may be the simplest
one that leaves the two lines of code within the loop unchanged. First, we note
that, in the program of Problem 7.16, the terms e[k[i-1]] and f[k[i-1]] can be
changed to e[k[i]] and f[k[i]]. Then, in the notation of the fern program, we
need the two lines within the loop to amount to the following:
h[i] = .5*h[i-1] + .5*e[k[i]]
w[i] = .5*w[i-1] + .5*f[k[i]]

This can be accomplished by setting b = c = 0 and a = d = c(1, 1, 1)/2.


Because there are three corners in the triangle, we need 3-vectors here, not 4-vectors.
The vectors e and f carry the h and w coordinates of the three corners (but divided
by 2), so we have e = c(0, 0, 1)/2 and f = c(0, 1, 0)/2. Finally, p should be
uniform (or omitted). A suitably revised program is shown below. c

m = 25000
a = d = c(1, 1, 1)/2; b = c = numeric(3)
e = c(0, 0, 1)/2; f = c(0, 1, 0)/2
k = sample(1:3, m, repl=T)
h = numeric(m); w = numeric(m); h[1] = 0; w[1] = 0
for (i in 2:m) {
h[i] = a[k[i]]*h[i-1] + b[k[i]]*w[i-1] + e[k[i]]
w[i] = c[k[i]]*h[i-1] + d[k[i]]*w[i-1] + f[k[i]] }
plot(w, h, pch=".", col="darkblue")

Note: See Barnsley (1988) for a detailed discussion of fractal objects with many
illustrations, some in color. Our script is adapted from pages 87–89; its numerical
constants can be changed to produce additional fractal objects described there.
7.18 A bivariate normal distribution for (X, Y ) with zero means, unit stan-
dard deviations, and correlation 0.8, as in Section 7.5, can be obtained as a
linear transformation of independent random variables. Specifically, if U and
V are independently distributed as NORM(0, 2/√5), then let X = U + V /2
and Y = U/2 + V .
a) Verify analytically that the means, standard deviations, and correlation
are as expected. Then use the following program to simulate and plot this
bivariate distribution. Compare your results with the results obtained in
Examples 7.8 and 7.9.
d Equations for the verification are as follows:
Expectations: E(X) = E(U) + (1/2)E(V) = 0, and similarly for Y.
Variances: V(X) = V(U + V /2) = 4/5 + (1/4)(4/5) = 1, and similarly for Y .
Correlation:

ρ = Cor(X, Y ) = Cov(X, Y ) = Cov(U + V /2, U/2 + V )


= (1/2)V(U ) + Cov(U, V ) + (1/4)Cov(V, U ) + (1/2)V(V ) = 4/5 = 0.8,

because V(X) = V(Y ) = 1 (first line) and U and V are independent (second line). c

set.seed(234)
m = 10000
u = rnorm(m,0,2/sqrt(5)); v = rnorm(m,0,2/sqrt(5))
x = u + v/2; y = u/2 + v
round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)
best = pmax(x, y)
mean(best >= 1.25)
plot(x, y, pch=".", xlim=c(-4,4), ylim=c(-4,4))

> round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)


[1] 0.0019 0.0098 1.0080 1.0099 0.8041 # Exact: 0, 0, 1, 1, .8
> best = pmax(x, y)
> mean(best >= 1.25)
[1] 0.1522 # Metropolis p178: 0.147
# from m = 1M: 0.151 seems accurate

b) What linear transformation of independent normal random variables could


you use to sample from a bivariate normal distribution with zero means,
unit standard deviations, and correlation ρ = 0.6? Modify the code of
part (a) accordingly, run it, and report your results.

d Let U and V be independently NORM(0, 3/√10) and also X = U + V/3 and
Y = U/3 + V. Then V(X) = 9/10 + 1/10 = 1 and similarly V(Y) = 1. Also,
Cor(X, Y) = Cov(X, Y) = Cov(U + V/3, U/3 + V) = (1/3)V(U) + (1/3)V(V) =
(2/3)(9/10) = 0.6, as required.
Here is how to find a such that X = U + aV and Y = aU + V meet the
requirements. If U and V are independent, each with variance σ^2, then we need
V(X) = V(U + aV) = (1 + a^2)σ^2 = 1, and similarly for V(Y). Then (canceling σ^2
from numerator and denominator) we get ρ = Cor(X, Y) = 2a/(1 + a^2), which we
can solve for a in terms of ρ: a = (1 − √(1 − ρ^2))/ρ, giving a = 1/3 for ρ = 0.6 and
a = 1/2 for ρ = 0.8. c

set.seed(123)
m = 1000000
u = rnorm(m,0,3/sqrt(10)); v = rnorm(m,0,3/sqrt(10))
x = u + v/3; y = u/3 + v
round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)
best = pmax(x, y)
mean(best >= 1.25)
show = 1:30000
plot(x[show], y[show], pch=".", xlim=c(-4,4), ylim=c(-4,4))
lines(c(-5, 1.25, 1.25), c(1.25, 1.25, -5), lwd=2, col="red")

> round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)


[1] -0.0010 -0.0016 0.9997 0.9995 0.5995
> best = pmax(x, y)
> mean(best >= 1.25)
[1] 0.168599 # Intuitively: why the increase for smaller rho?

7.19 Metropolis and Metropolis-Hastings algorithms. Consider the following


modifications of the program of Example 7.8.
a) Explore the consequences of incorrectly using an asymmetric jump distri-
bution in the Metropolis algorithm. Let jl = 1.25; jr = .75.
d Below we show results for the same seed used in Example 7.8, but with the sug-
gested asymmetrical jump distribution. The plot of the first 100 points (not included
here) shows that the process quickly drifts “southwesterly” of its target location, and
the plot of points after burn-in (upper-left in Figure 7.17) shows that this is not an

initial fluke. The estimate (−2.2, −2.2) of the center is grossly far from the origin.
(See the Notes.) c

set.seed(1234)
m = 40000
rho = .8; sgm = sqrt(1 - rho^2)
xc = yc = numeric(m) # vectors of state components
xc[1] = -3; yc[1] = 3 # arbitrary starting values
jl = 1.25; jr = .75 # l and r limits of proposed jumps
...

> x = xc[(m/2+1):m]; y = yc[(m/2+1):m] # states after burn-in

> round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)


[1] -2.1755 -2.1742 1.0046 0.9986 0.8158

> mean(diff(x)==0) # proportion of proposals rejected


[1] 0.5826291

> mean(pmax(x,y) >= 1.25) # prop. of subj. getting certificates


[1] 9e-04

b) The Metropolis-Hastings algorithm permits and adjusts for use of an


asymmetric jump distribution by modifying the acceptance criterion.
Specifically, the ratio of variances is multiplied by a factor that corrects
for the bias due to the asymmetric jump function. In our example, this
“symmetrization” amounts to restricting jumps in X and Y to 0.75 units
in either direction. The program below modifies the one in Example 7.8
to implement the Metropolis-Hastings algorithm; the crucial change is the
use of the adjustment factor adj inside the loop. Interpret the numerical
results, the scatterplot (as in Figure 7.17, upper right), and the histogram.
set.seed(2008)
m = 100000; xc = yc = numeric(m)
xc[1] = 3; yc[1] = -3
rho = .8; sgm = sqrt(1 - rho^2)
jl = 1.25; jr = .75

for (i in 2:m)
{
xc[i] = xc[i-1]; yc[i] = yc[i-1] # if no jump
xp = runif(1, xc[i-1]-jl, xc[i-1]+jr)
yp = runif(1, yc[i-1]-jl, yc[i-1]+jr)
nmtr.r = dnorm(xp)*dnorm(yp, rho*xp, sgm)
dntr.r = dnorm(xc[i-1])*dnorm(yc[i-1], rho*xc[i-1], sgm)
nmtr.adj = dunif(xc[i-1], xp-jl, xp+jr)*
dunif(yc[i-1], yp-jl, yp+jr)
dntr.adj = dunif(xp, xc[i-1]-jl, xc[i-1]+jr)*
dunif(yp, yc[i-1]-jl, yc[i-1]+jr)

r = nmtr.r/dntr.r; adj = nmtr.adj/dntr.adj


acc = (min(r*adj, 1) > runif(1))
if (acc) {xc[i] = xp; yc[i] = yp}
}

x = xc[(m/2+1):m]; y = yc[(m/2+1):m]
round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)) ,4)
mean(diff(xc)==0); mean(pmax(x, y) > 1.25)

par(mfrow=c(1,2), pty="s")
jump = diff(unique(x)); hist(jump, prob=T, col="wheat")
plot(x, y, xlim=c(-4,4), ylim=c(-4,4), pch=".")
par(mfrow=c(1,1), pty="m")

c) After a run of the program in part (b), make and interpret autocorrelation
function plots of x and of x[thinned], where the latter is defined by
thinned = seq(1, m/2, by=100). Repeat for realizations of Y .
Notes: (a) The acceptance criterion still has valid information about the shape of
the target distribution, but the now-asymmetrical jump function is biased towards
jumps downward and to the left. The approximated percentage of subjects awarded
certificates is very far from correct. (c) Not surprisingly for output from a Markov
chain, the successive pairs (X, Y ) sampled by the Metropolis-Hastings algorithm
after burn-in are far from independent. “Thinning” helps. To obtain the desired
degree of accuracy, we need to sample more values than would be necessary in
a simulation with independent realizations as in Problem 7.18. It is important to
distinguish the association between Xi and Yi on the one hand from the association
among the Xi on the other hand. The first is an essential property of the target
distribution, whereas the second is an artifact of the method of simulation.
7.20 We revisit the Gibbs sampler of Example 7.9.
a) Modify this program to sample from a bivariate normal distribution with
zero means, unit standard deviations, and ρ = 0.6. Report your results. If
you worked Problem 7.18, compare with those results.
d The only substantive change is that rho = .8 becomes rho = .6 in the first line of
code. Results below are from seed 1236. As in Example 7.9, we use only m = 20 000
iterations. Of these, we used results from the 10 000 iterations after burn-in for the
summary shown below.
Accuracy is not quite as good as in our answer to Problem 7.18(a) where we
used 10 000 independent simulated points. (See part (b).) The simulated value of
ρ is reasonably near 0.6. But, of course, it is not as accurate as in our answer to
Problem 7.18(b), where we chose to simulate 1 000 000 independent points.

set.seed(1236)
m = 20000
rho = .6; sgm = sqrt(1 - rho^2)
...

> round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)


[1] 0.0176 0.0243 1.0119 1.0107 0.5949
> best = pmax(x,y); mean(best >= 1.25) # prop. getting certif.
[1] 0.1751
> summary(best)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.0580 -0.2463 0.3861 0.3840 1.0130 3.6210

The probability that the better of the two scores exceeds 1.25 is greater here (about 17.5%),
where ρ = 0.6, than it was in Example 7.9 (about 15.3%), where ρ = 0.8. As the correlation
decreases, the opportunity to demonstrate achievement improves.
If the tests were independent (ρ = 0), then the exact probability would be given
by 1 - pnorm(1.25)^2, which returns 0.2001377. When ρ = 1, so that the two scores
are identical, the exact probability is 0.1056498. You can change ρ in the program
to approximate the first of these extreme results. But not the second—why not? c

b) Run the original program (with ρ = 0.8) and make an autocorrelation plot
of X-values from m/2 on, as in part (c) of Problem 7.19. If you worked
that problem, compare the two autocorrelation functions.
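A minimal sketch of the commands for part (b), assuming x holds the post-burn-in X-values
from a run of the Example 7.9 program with ρ = 0.8:

acf(x)                                # successive Gibbs draws are noticeably correlated
thinned = seq(1, length(x), by=100)
acf(x[thinned])                       # thinning by 100 essentially removes the correlation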
c) In the Gibbs sampler of Example 7.9, replace the second statement inside
the loop by yc[i] = rnorm(1, rho*xc[i-1], sgm) and run the result-
ing program. Why is this change a mistake?
d On a practical level, we can see that this change is a bad idea because it
gives obviously incorrect results. The trial run below approximates ρ as −0.057,
whereas the original program in Example 7.9, with the same seed, gave 0.8044—
very close to the known value ρ = 0.8. Moreover, the altered program approximates
P {max(X, Y ) ≥ 1.25} as 0.2078, while the original program gives 0.1527.
This example illustrates the importance of using a newly generated result in
a Gibbs sampler as early as possible. Here, the main purpose is to simulate the
distribution of max(X, Y ), and we might not have realized that the answer for
P {max(X, Y ) ≥ 1.25} is wrong. However, the wrong answer for the known correla-
tion is a clear indication that the program is not working as it should.

set.seed(1235); m = 20000
rho = .8; sgm = sqrt(1 - rho^2)
xc = yc = numeric(m); xc[1] = -3; yc[1] = 3
for (i in 2:m) {
xc[i] = rnorm(1, rho*yc[i-1], sgm)
yc[i] = rnorm(1, rho*xc[i-1], sgm) }
x = xc[(m/2+1):m]; y = yc[(m/2+1):m]
round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)
best = pmax(x,y); mean(best >= 1.25)

> round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)


[1] 0.0085 0.0079 1.0070 1.0107 -0.0057
> best = pmax(x,y); mean(best >= 1.25) # prop. getting certif.
[1] 0.2078

The answer to why this modified program does not work lies on an explanatory
and more theoretical level. The difficulty with the change is that each new simu-
lated value Yi in the chain needs to be paired with the corresponding new Xi just
generated, not with the previous value Xi−1 . Because this is a Markov Chain, there
is some association between one step and the next. It is important not to entangle
that association with the association between Xi and Yi .
It would be OK to reverse the order in which the x-values and y-values are
simulated, as shown below in a correct modification of the program.

set.seed(1235); m = 20000
rho = .8; sgm = sqrt(1 - rho^2)
xc = yc = numeric(m); xc[1] = -3; yc[1] = 3
for (i in 2:m) {
yc[i] = rnorm(1, rho*xc[i-1], sgm)
xc[i] = rnorm(1, rho*yc[i], sgm) }
x = xc[(m/2+1):m]; y = yc[(m/2+1):m]
round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)
best = pmax(x,y); mean(best >= 1.25)

> round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)


[1] 0.0077 0.0083 1.0073 1.0046 0.8044
> best = pmax(x,y); mean(best >= 1.25)
[1] 0.1527

Here, we get a correct view of the dependence between (Xi , Yi ) pairs, because we
have not disrupted the pairing. And again, we have useful estimates of ρ = 0.8 and
P {max(X, Y ) ≥ 1.25} ≈ 0.15. c
Note: (b) In the Metropolis-Hastings chain, a proposed new value is sometimes
rejected so that there is no change in state. The Gibbs sampler never rejects.

Errors in Chapter 7
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p183 Problem 7.4 In the program, the first statement after the inner loop should read
a[j] = a - 1 (not a). The correct code is shown in this Manual. This error
in the program makes a small difference in the histogram of Figure 7.14 (most
notably, the first bar there is a little too short). A corrected figure is scheduled
for the second printing; you will see it if you work the problem.

Eric A. Suess and Bruce E. Trumbo


Introduction to Probability Simulation and Gibbs Sampling with R: Instructor Manual
Chapter 7

Explanations and answers:


© 2011 by Bruce E. Trumbo and Eric A. Suess. All rights reserved.
Statements of problems:
© 2010 by Springer Science+Business Media, LLC. All rights reserved.
8
Introduction to Bayesian Estimation

8.1 In a situation similar to Example 8.1, suppose a political consultant


chooses the prior BETA(380, 220) to reflect his assessment of the proportion
of the electorate favoring Proposition B.
a) In terms of a most likely value for π and a 95% probability interval for π,
describe this consultant’s view of the prospects for Proposition B.
d The density function of BETA(α, β) has kernel p(x) = x^(α−1)(1 − x)^(β−1). Setting the
first derivative of the kernel equal to 0, we obtain x = (α − 1)/(α + β − 2). Provided
that α, β > 1, this is the value of x at which p(x) achieves its absolute maximum;
that is, it is the mode of the distribution BETA(α, β). Thus the most likely value of the
prior BETA(380, 220) is π = 379/598 = 0.6338, to four places. The R code below
finds this result using a grid search.

x = seq(0, 1, by=.0001); alpha = 380; beta = 220


pdf = dbeta(x, alpha, beta)
x[pdf==max(pdf)]

> x[pdf==max(pdf)]
[1] 0.6338

Then the statement qbeta(c(.025, .975), 380, 220) returns a 95% prior
probability interval (0.5944, 0.6714), when rounded to four places. Roughly speak-
ing, the consultant choosing this prior distribution must feel that π is near 63% and
pretty sure to be between 60% and 67%.
Extra: There are infinitely many 95% prior probability intervals on π, and it is
customary to use the one that cuts 2.5% from each tail of the relevant distribution.
This interval is called the probability-symmetric interval. If we were to insist on
the shortest 95% probability interval, then another grid search would be in order,
as shown in the R code below. (See also Problem 8.10.)

lwr.tail = seq(0, .05, by=.00001); upr.tail = .95 + lwr.tail


lwr.end = qbeta(lwr.tail, 380, 220)
upr.end = qbeta(upr.tail, 380, 220)

int.len = upr.end - lwr.end


cond = (int.len == min(int.len))
cbind(lwr.tail, lwr.end, upr.end, upr.tail)[cond,]

> cbind(lwr.tail, lwr.end, upr.end, upr.tail)[cond,]


lwr.tail lwr.end upr.end upr.tail
0.0258800 0.5947001 0.6717143 0.9758800

Thus the shortest 95% probability interval is (0.5947, 0.6717), cutting almost
2.6% from the lower tail and just over 2.4% from the upper. Unless the distribution
is very severely skewed, there is often no practical difference between the probability-
symmetric interval and the shortest one. c

b) If a poll of 100 randomly chosen registered voters shows 62% opposed to


Proposition B, do you think the consultant (a believer in Bayesian infer-
ence) now fears Proposition B will fail? Quantify your answer with specific
information about the posterior distribution. Recall that in Example 8.5
a poll of 1000 subjects showed 62% in favor of Proposition A. Contrast
that situation with the current one.
d The posterior distribution is BETA(α0 + x, β0 + n − x), where the data con-
sists of x Successes in n binomial trials. From part (a), we have α0 = 380 and
β0 = 220. From the poll, we have x = 38 and n = 100. So the posterior distribu-
tion is BETA(418, 282), and the 95% posterior probability interval (computed using
qbeta(c(.025, .975), 418, 282)) is (0.5606, 0.6332). This interval overlaps the
95% prior probability interval, but it does not contain the consultant’s optimistic
“most likely” prior value 0.6338. Nevertheless, he will be happy to see that the pos-
terior interval still lies “safely” above 50%. (Also, see the Hints.) One might expect
a press conference devoted mainly to positive spin.
In Example 8.5, the data from a poll of 1000 subjects has more influence on the
posterior distribution than the relatively disperse prior distribution (see Figure 8.1).
Roughly speaking, one might say that the prior distribution BETA(330, 270) carries
expert opinion equivalent to a poll of 600 people with 330 (55%) in favor of Propo-
sition A. In contrast, in this problem, the data from a poll of only 100 subjects
has less influence on the posterior distribution than does the relatively concentrated
prior distribution (see Figure 8.5). Here, again roughly speaking, one might say that
the prior distribution carries expert opinion (or maybe we should say optimism)
equivalent to a poll of 600 people with 380 (63.3%) in favor. c
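
These statements are easy to quantify (a brief sketch added for illustration):

qbeta(c(.025, .975), 418, 282)    # 95% posterior probability interval
pbeta(c(.50, .55), 418, 282)      # posterior P{pi < 0.50} and P{pi < 0.55}; both small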

c) Modify the R code of Example 8.5 to make a version of Figure 8.5 (p207)
that describes this problem.
x = seq(.50, .73, .001); prior = dbeta(x, 380, 220)
post = dbeta(x, 380 + 38, 220 + 62)
plot(x, post, type="l", ylim=c(0, 25), lwd=2, col="blue",
xlab="Proportion in Favor", ylab="Density")
post.int = qbeta(c(.025,.975), 418, 282)
abline(v=post.int, lty="dashed", col="red")
abline(h=0, col="darkgreen"); lines(x, prior)

d For variety, and to encourage you to make this plot for yourself, we have changed
the plotting window and included vertical red lines to show the 95% posterior prob-
ability interval. c

d) Pollsters sometimes report the margin of sampling error for a poll with
n subjects as being roughly given by the formula 100/√n %. According
to this formula, what is the (frequentist’s) margin of error for the poll in
part (b)? How do you suppose the formula is derived?
d The rough margin of error from the formula is 10%. The R code below shows
results we hope will be self-explanatory. The inverted parabola π(1 − π) has its
maximum at π = 1/2. As above, let x be the number of Successes out of n individuals
sampled and p = x/n. Approximating 1.96 by 2, the traditional margin of error
1.96√(p(1 − p)/n) becomes 1/√n. When p is reasonably near 1/2 the formula still works
reasonably well: we have (0.5)^2 = 0.25, while (0.4)(0.6) = 0.24, and even (0.3)(0.7) = 0.21. c

pm = c(-1,1); x = 38; n = 100; p = x/n


trad.me = 1.96*sqrt(p*(1 - p)/n)
trad.ci = p + pm*trad.me
aprx.me = 1/sqrt(n)
trad.me; trad.ci; aprx.me

> trad.me; trad.ci; aprx.me


[1] 0.09513574
[1] 0.2848643 0.4751357
[1] 0.1

Hints: (a) Use R code qbeta(c(.025,.975), 380, 220) to find one 95% prior
probability interval. (b) One response: P {π < 0.55} < 1%. (d) A standard formula
for an interval with roughly 95% confidence is p ± 1.96√(p(1 − p)/n), where n is
“large” and p is the sample proportion in favor (see Example 1.6). What value of π
maximizes π(1 − π)? What if π = 0.4 or 0.6?
8.2 In Example 8.1, we require a prior distribution with E(π) ≈ 0.55 and
P {0.51 < π < 0.59} ≈ 0.95. Here we explore how one might find suitable
parameters α and β for such a beta-distributed prior.
a) For a beta distribution, the mean is µ = α/(α + β) and the variance is
σ^2 = αβ/[(α + β)^2(α + β + 1)]. Also, a beta distribution with large enough
values of α and β is roughly normal, so that P {µ − 2σ < π < µ + 2σ} ≈
0.95. Use these facts to find values of α and β that approximately satisfy
the requirements. (Theoretically, this normal distribution would need to
be truncated to have support (0, 1).)
d We require α/(α + β) = 0.55, so that β = (0.45/0.55)α = 0.818α, α + β = 1.818α,
and αβ = 0.818α^2. Also, we require 2σ ≈ 0.04 or σ^2 ≈ 0.0004. Then routine algebra
gives α ≈ 340, and thus β ≈ 278. c
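
A quick check of these values (a sketch added for illustration):

alpha = 340; beta = 278
alpha/(alpha + beta)                                  # close to 0.55
pbeta(.59, alpha, beta) - pbeta(.51, alpha, beta)     # close to 0.95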

b) The following R script finds values of α and β that may come close to
satisfying the requirements and then checks to see how well they succeed.
What assumptions about α and β are inherent in the script? Why do we
use β = 0.818α? What values of α and β are returned? For the values of
the parameters considered, how close do we get to the desired values of
E(π) and P {0.51 < π < 0.59}?
alpha = 1:2000 # trial values of alpha
beta = .818*alpha # corresponding values of beta

# Vector of probabilities for interval (.51, .59)


prob = pbeta(.59, alpha, beta) - pbeta(.51, alpha, beta)
prob.err = abs(.95 - prob) # errors for probabilities

# Results: Target parameter values


t.al = alpha[prob.err==min(prob.err)]
t.be = round(.818*t.al)
t.al; t.be

> t.al; t.be


[1] 326
[1] 267

# Checking: Achieved mean and probability


a.mean = t.al/(t.al + t.be)
a.prob = pbeta(.59, t.al, t.be) - pbeta(.51, t.al, t.be)
a.mean; a.prob

> a.mean; a.prob


[1] 0.549747
[1] 0.9500065

c) If the desired mean is 0.56 and the desired probability in the interval
(0.51, 0.59) is 90%, what values of the parameters are returned by a suit-
ably modified script?
alpha = 1:2000 # trial values of alpha
beta = ((1 - .56)/.56)* alpha # corresponding values of beta

# Vector of probabilities for interval (.51, .59)


prob = pbeta(.59, alpha, beta) - pbeta(.51, alpha, beta)
prob.err = abs(.90 - prob) # errors for probabilities

# Results: Target parameter values


t.al = alpha[prob.err==min(prob.err)]
t.be = round(beta[prob.err==min(prob.err)])
t.al; t.be

> t.al; t.be


[1] 280
[1] 220

# Checking: Achieved mean and probability


a.mean = t.al/(t.al + t.be)
a.prob = pbeta(.59, t.al, t.be) - pbeta(.51, t.al, t.be)
a.mean; a.prob

> a.mean; a.prob


[1] 0.56
[1] 0.899825

8.3 In practice, the beta family of distributions offers a rich variety of


shapes for modeling priors to match expert opinion.
a) Beta densities p(π) are defined on the open unit interval. Observe that
parameter α controls the behavior of the density function near 0. In par-
ticular, find the value p(0+) and the slope p′(0+) in each of the following
five cases: α < 1, α = 1, 1 < α < 2, α = 2, and α > 2. Evaluate each
limit as being 0, positive, finite, ∞, or −∞. (As usual, 0+ means to take
the limit as the argument approaches 0 through positive values.)
d We provide the answers for each case:
• α < 1: p(0+) = ∞, p′(0+) = −∞.
• α = 1: p(0+) is finite and positive, p′(0+) is finite.
• 1 < α < 2: p(0+) = 0, p′(0+) = ∞.
• α = 2: p(0+) = 0, p′(0+) is finite and positive.
• α > 2: p(0+) = p′(0+) = 0.
(Near 0 the kernel behaves like x^(α−1), so p′(x) is proportional to (α − 1)x^(α−2); each
case follows from these limits.)

Students should provide brief calculus-based arguments. c

b) By symmetry, β controls behavior of the density function near 1. Thus,


combinations of the parameters yield 25 cases, each with its own “shape”
of density. These different shapes are illustrated in Figure 8.6. In which of
these 25 cases does the density have a unique mode in (0, 1)? The number
of possible inflection points of a beta density curve is 0, 1, or 2. For each
of the 25 cases, give the number of inflection points.
d Unique mode in (0, 1): both α > 1 and β > 1. Two inflection points: both α > 2
and β > 2. One inflection point: (i) α > 2 and 1 < β ≤ 2; (ii) 1 < α < 2 and β < 1;
or swap α and β in either (i) or (ii). No inflection points for the remaining cases. c

c) The R script below plots examples of each of the 25 cases, scaled vertically
(with top) to show the properties in parts (a) and (b) about as well as
can be done and yet show most of each curve.

alpha = c(.5, 1, 1.2, 2, 5); beta = alpha


op = par(no.readonly = TRUE) # records existing parameters
par(mfrow=c(5, 5)) # formats 5 x 5 matrix of plots
par(mar=rep(2, 4), pty="m") # sets margins
x = seq(.001, .999, .001)
for (i in 1:5) {
for (j in 1:5) {
top = .2 + 1.2 * max(dbeta(c(.05, .2, .5, .8, .95),
alpha[i], beta[j])) ## error corrected
plot(x,dbeta(x, alpha[i], beta[j]),
type="l", ylim=c(0, top), xlab="", ylab="",
main=paste("BETA(",alpha[i],",", beta[j],")", ##
sep="")) }
}
par(op) # restores former parameters

Run the code and compare the resulting matrix of plots with your results
above (α-cases are rows, β columns). What symmetries within and among
the 25 plots are lost if we choose beta = c(.7, 1, 1.7, 2, 7)? (See
Figure 8.6.)
d Three cases along the principal diagonal in Figure 8.6 are no longer symmetrical.
(Correction: The first printing had errors in the code, corrected at ## above, affecting
captions in the figure. See the erratum at the end of this chapter of answers.) c
8.4 In Example 8.1, we require a prior distribution with E(π) ≈ 0.55 and
P {0.51 < π < 0.59} ≈ 0.95. If we are willing to use nonbeta priors, how might
we find ones that meet these requirements?
a) If we use a normal distribution, what parameters µ and σ would satisfy
the requirements?
d For a method, see the answer to Problem 8.2. Answers: µ = 0.55 and σ = 0.02. c
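A quick check that these parameters meet the two requirements; with σ = 0.02 the
central probability is about 0.9545, slightly above the target (σ = 0.04/1.96 ≈ 0.0204
would give 0.95 exactly).

mu = 0.55; sg = 0.02                  # candidate prior parameters
diff(pnorm(c(.51, .59), mu, sg))      # about 0.9545, close to the target 0.95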

b) If we use a density function in the shape of an isosceles triangle, show that


it should have vertices at (0.4985, 0), (0.55, 19.43), and (0.6015, 0).
d The area under the triangle above the interval (0.59, 0.6015) is a smaller triangle
with base 0.6015 − 0.59 = 0.0115 and height 19.43(0.0115)/(0.6015 − 0.55) = 4.3387,
so the small triangle has area 0.0115(4.3387)/2 = 0.025, as required. Similarly, for
the small triangle with base (0.4985, 0.51). Also, 19.43(0.6015 − 0.4985)/2 = 1.00. c
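These areas can also be confirmed numerically; the sketch below builds the
triangular density with approxfun from the vertices given above.

v = c(.4985, .55, .6015); h = 19.43
tri = approxfun(v, c(0, h, 0), yleft=0, yright=0)   # triangular density
integrate(tri, v[1], v[3])$value                    # total area, approx 1
integrate(tri, .59, v[3])$value                     # upper tail, approx 0.025
integrate(tri, v[1], .51)$value                     # lower tail, approx 0.025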

c) Plot three priors on the same axes: BETA(330, 270) of Example 8.1 and
the results of parts (a) and (b).
xx = seq(0.48, 0.62, by=0.001)
plot(xx, dbeta(xx, 330, 270), ylim=c(0, 20), type="l",
ylab="Prior Density", xlab=expression(pi))
lines(xx, dnorm(xx, .55, .02), lty="dashed", col="red")
lines(c(.4985, .55, .6015), c(0, 19.43, 0), lwd=2, col="blue")
abline(h = 0, col="darkgreen")

d) Do you think the expert would object to any of these priors as an expres-
sion of her feelings about the distribution of π?
d Superficially, the three densities seem about the same, so probably not—especially
if the expert is not a student of probability. The normal and beta curves are al-
most identical, provided that the normal curve is restricted to the interval (0, 1).
(The area under the normal curve outside the unit interval is negligible for practical
purposes, so one could restrict its support without adjusting the height of the den-
sity curve.) The triangular density places no probability at all outside the interval
(0.4985, 0.6015), so a poll with a large n and p = x/n far outside that interval would
still not yield a posterior distribution with any probability outside the interval. c
Notes: (c) Plot: Your result should be similar to Figure 8.7. Use the method in Exam-
ple 8.5 to put several plots on the same axes. Experiment: If v = c(.51, .55, .59)
and w = c(0, 10, 0), then what does lines(v, w) add to an existing plot? (d) The
triangular prior would be agreeable only if she thinks values of π below 0.4985 or
above 0.6015 are absolutely impossible.

8.5 Computational methods are often necessary if we multiply the kernels


of the prior and likelihood and then can’t recognize the result as the kernel
of a known distribution. This can occur, for example, when we don’t use a
conjugate prior. We illustrate several computational methods using the polling
situation of Examples 8.1 and 8.5 where we seek to estimate the parameter π.
To begin, suppose we are aware of the beta prior p(π) (with α = 330 and
β = 270) and the binomial likelihood p(x|π) (for x = 620 subjects in favor
out of n = 1000 responding). But we have not been clever enough to notice
the convenient beta form of the posterior p(π|x).
We wish to compute the posterior estimate of centrality E(π|x) and the
posterior probability P{π > 0.6|x} of a potential “big win” for the ballot propo-
sition. From the equation in (8.2), we have E(π|x) = ∫_0^1 π p(π)p(x|π) dπ / D and
P(π > 0.6|x) = ∫_{0.6}^1 p(π)p(x|π) dπ / D, where the denominator of the posterior
is D = ∫_0^1 p(π)p(x|π) dπ. You should verify these equations for yourself before
going on.
a) The following R script uses Riemann approximation to obtain the desired
posterior information. Match key quantities in the program with those in
the equations above. Also, interpret the last two lines of code. Run the
program and compare your results with those obtainable directly from
the known beta posterior of Example 8.5. (In R, pi means 3.1416, so we
use pp, population proportion, for the grid points of parameter π.)
x = 620; n = 1000 # data
m = 10000 # nr of grid points
pp = seq(0, 1, length=m) # grid points
igd = dbeta(pp, 330, 270) * dbinom(x, n, pp) # integrand
d = mean(igd); d # denominator

# Results
post.mean = mean(pp*igd)/d; post.mean
post.pr.bigwin = (1/m)*sum(igd[pp > .6])/d; post.pr.bigwin
post.cum = cumsum((igd/d)/m)
min(pp[post.cum > .025]); min(pp[post.cum > .975])

> d = mean(igd); d # denominator


[1] 0.0003619164
> post.mean = mean(pp*igd)/d; post.mean
[1] 0.59375
> post.pr.bigwin = (1/m)*sum(igd[pp > .6])/d; post.pr.bigwin
[1] 0.3059262
> min(pp[post.cum > .025]); min(pp[post.cum > .975])
[1] 0.569557 # On p202 Example 8.5 has posterior
[1] 0.6176618 # interval (0.570, 0.618).

b) Now suppose we choose the prior NORM(0.55, 0.02) to match the expert’s
impression that the prior should be centered at π = 55% and put 95% of its
probability in the interval 51% < π < 59%. The shape of this distribution
is very similar to BETA(330, 270) (see Problem 8.4). However, the normal
prior is not a conjugate prior. Write the kernel of the posterior, and say
why the method of Example 8.5 is intractable. Modify the program above
to use the normal prior (substituting the function dnorm for dbeta). Run
the modified program. Compare the results with those in part (a).
d The prior has kernel p(π) ∝ exp[−(π − µ)²/(2σ²)], where µ and σ are as specified above.
The likelihood is of the form p(x|π) ∝ π^x (1 − π)^(n−x). The product doesn’t have the
form of any familiar density function. c

x = 620; n = 1000 # data


m = 10000 # nr of grid points
pp = seq(0, 1, length=m) # grid points
igd = dnorm(pp, .55, .02) * dbinom(x, n, pp) # integrand
d = mean(igd) # denominator

# Results
post.mean = mean(pp*igd)/d
post.pr.bigwin = (1/m)*sum(igd[pp > .6])/d
post.cum = cumsum((igd/d)/m)
pi.lower = min(pp[post.cum > .025])
pi.upper = min(pp[post.cum > .975])
d; post.mean; post.pr.bigwin; pi.lower; pi.upper

> d; post.mean; post.pr.bigwin; pi.lower; pi.upper


[1] 0.0003475865
[1] 0.5936035 # 0.59375 with beta prior
[1] 0.3022829 # 0.3059262 with beta prior
[1] 0.5693569 # 0.569557 with beta prior
[1] 0.6177618 # 0.6176618 with beta prior

c) The scripts in parts (a) and (b) above are “wasteful” because grid values
of π are generated throughout (0, 1), but both prior densities are very
nearly 0 outside of (0.45, 0.65). Modify the program in part (b) to integrate
over this shorter interval.
Strictly speaking, you need to divide d, post.mean, and so on, by 5 be-
cause you are integrating over a region of length 1/5. (Observe the change
in b if you shorten the interval without dividing by 5.) Nevertheless, show
that this correction factor cancels out in the main results. Compare your
results with those obtained above.
d In computing the probability and the cumulative distribution function based on
the posterior distribution, we are integrating p(π|x) in equation (8.2) on p200 of the
text. The integrals in the numerator and denominator involve the same interval. c

x = 620; n = 1000 # data


m = 10000 # nr of grid points
pp = seq(.45, .65, length=m) # pts in small int
igd = dnorm(pp, .55, .02) * dbinom(x, n, pp) # integrand
d = mean(igd) # denominator

# Results
post.mean = mean(pp*igd)/d
post.pr.bigwin = (1/m)*sum(igd[pp > .6])/d
post.cum = cumsum((igd/d)/m)
pi.lower = min(pp[post.cum > .025])
pi.upper = min(pp[post.cum > .975])
d; post.mean; post.pr.bigwin; pi.lower; pi.upper

> d; post.mean; post.pr.bigwin; pi.lower; pi.upper


[1] 0.001737929
[1] 0.5936034 # 0.5936035 in part (b)
[1] 0.3024236 # 0.3022829 in part (b)
[1] 0.5693719 # 0.5693569 in part (b)
[1] 0.6177168 # 0.6177618 in part (b)

d) Modify the R script of part (c) to do the computation for a normal


prior using Monte Carlo integration. Increase the number of iterations
to m ≥ 100 000, and use pp = sort(runif(m, .45, .65)). Part of the
program depends on having the π-values sorted in order. Which part?
Why? Compare your results with those obtained by Riemann approxi-
mation. (If this were a multidimensional integration, some sort of Monte
Carlo integration would probably be the method of choice.)
set.seed(2345)
x = 620; n = 1000 # data
m = 1000000 # nr of rand points
pp = sort(runif(m, .45, .65)) # random points
igd = dnorm(pp, .55, .02) * dbinom(x, n, pp) # integrand
d = mean(igd) # denominator

# Results
post.mean = mean(pp*igd)/d
post.pr.bigwin = (1/m)*sum(igd[pp > .6])/d
post.cum = cumsum((igd/d)/m) # These three
pi.lower = min(pp[post.cum > .025]) # lines of code
pi.upper = min(pp[post.cum > .975]) # require the sort.
d; post.mean; post.pr.bigwin; pi.lower; pi.upper

> d; post.mean; post.pr.bigwin; pi.lower; pi.upper


[1] 0.001734663
[1] 0.5936017 # 0.5936034 with Riemann
[1] 0.3032985 # 0.3024236 with Riemann
[1] 0.5693388 # 0.5693719 with Riemann
[1] 0.6177445 # 0.6177168 with Riemann

e) (Advanced) Modify part (d) to generate normally distributed values of


pp (with sorted rnorm(m, .55,.02)), removing the dnorm factor from the
integrand. Explain why this works, and compare the results with those
above. This method is efficient because it concentrates values of π in the
“important” part of (0, 1), where computed quantities are largest. So there
would be no point in restricting the range of integration as in parts (c)
and (d). This is an elementary example of importance sampling.
d When equation (8.2) is integrated, the integrals in numerator and denominator
can be viewed as double integrals, one with respect to π and one with respect to x.
We are now using the sampling method to do the integration with respect to π. c

set.seed(2345)
x = 620; n = 1000 # data
m = 1000000 # nr of norm points
pp = sort(rnorm(m, .55, .02)) # pts from norm
igd = dbinom(x, n, pp) # weighted likelihood
d = mean(igd) # denominator

# Results
post.mean = mean(pp*igd)/d
post.pr.bigwin = (1/m)*sum(igd[pp > .6])/d
post.cum = cumsum((igd/d)/m) # requires sort
pi.lower = min(pp[post.cum > .025])
pi.upper = min(pp[post.cum > .975])
d; post.mean; post.pr.bigwin; pi.lower; pi.upper

> d; post.mean; post.pr.bigwin; pi.lower; pi.upper


[1] 0.0003469340
[1] 0.5935785 # 0.5936034 with Riemann
[1] 0.2993382 # 0.3024236 with Riemann
[1] 0.5693718 # 0.5693719 with Riemann
[1] 0.6178535 # 0.6177168 with Riemann

8.6 Metropolis algorithm. In Section 7.5, we illustrated the Metropolis algo-


rithm as a way to sample from a bivariate normal distribution having a known
density function. In Problem 8.5, we considered some methods of computing
posterior probabilities that arise from nonconjugate prior distributions. Here
we use the Metropolis algorithm in a more serious way than before to sample
from posterior distributions arising from the nonconjugate prior distributions
of Problem 8.4.
a) Use the Metropolis algorithm to sample from the posterior distribution
of π arising from the prior NORM(0.55, 0.02) and a binomial sample of
size n = 1000 with x = 620 respondents in favor. Simulate m = 100 000
observations from the posterior to find a 95% Bayesian probability in-
terval for π. Also, if you did Problem 8.5, find the posterior probability
P {π > 0.6|x}. The R code below implements this computation using a
symmetrical uniform jump function, and it compares results with
those from the very similar conjugate prior BETA(330, 270). See the top
panel in Figure 8.8.
set.seed(1234)
m = 100000
piec = numeric(m); piec[1] = 0.7 # states of chain
for (i in 2:m) {
piec[i] = piec[i-1] # if no jump
piep = runif(1, piec[i-1]-.05, piec[i-1]+.05) # proposal
nmtr = dnorm(piep, .55, .02)*dbinom(620, 1000, piep) %%1
dmtr = dnorm(piec[i-1], .55, .02)*dbinom(620, 1000, piec[i-1])
r = nmtr/dmtr; acc = (min(r,1) > runif(1)) # accept prop.?
if(acc) {piec[i] = piep} }
pp = piec[(m/2+1):m] # after burn-in
quantile(pp, c(.025,.975)); mean(pp > .6)
qbeta(c(.025,.975), 950, 650); 1-pbeta(.6, 950, 650)
hist(pp, prob=T, col="wheat", main="")
xx = seq(.5, .7, len=1000)
lines(xx, dbeta(xx, 950, 650), lty="dashed", lwd=2)

> quantile(pp, c(.025,.975)); mean(pp > .6)


2.5% 97.5%
0.5691524 0.6175106
[1] 0.30866
> qbeta(c(.025,.975), 950, 650); 1-pbeta(.6, 950, 650)
[1] 0.5695848 0.6176932
[1] 0.3062133

b) Modify the program of part (a) to find the posterior corresponding to the
“isosceles” prior of Problem 8.4. Make sure your initial value is within the
support of this prior, and use the following lines of code for the numerator
and denominator of the ratio of densities. Notice that, in this ratio, the
constant of integration cancels, so it is not necessary to know the height

of the triangle. In some more advanced applications of the Metropolis


algorithm, the ability to ignore the constant of integration is an impor-
tant advantage. Explain why results here differ considerably from those
in part (a). See the bottom panel in Figure 8.8.
d The suggested lines of code for nmtr and dmtr have been substituted into the
program below. See part (b) of the Notes. The histogram is similar to the lower
panel of Figure 8.8 (p213 of the text). c

set.seed(2345)
m = 100000
piec = numeric(m) # states of chain
piec[1] = 0.5 # starting value in base of triangle
for (i in 2:m) {
piec[i] = piec[i-1] # if no jump
piep = runif(1, piec[i-1]-.05, piec[i-1]+.05) # proposal
nmtr = max(.0515-abs(piep-.55), 0)*dbinom(620, 1000, piep)
dmtr = max(.0515-abs(piec[i-1]-.55), 0)*
dbinom(620, 1000, piec[i-1])
r = nmtr/dmtr; acc = (min(r,1) > runif(1)) # accept prop.?
if(acc) {piec[i] = piep} }
pp = piec[(m/2+1):m] # after burn-in
quantile(pp, c(.025,.975)); mean(pp > .6)
qbeta(c(.025,.975), 950, 650); 1-pbeta(.6, 950, 650)
hist(pp, prob=T, col="wheat", main="")
xx = seq(.5, .7, len=1000)
lines(xx, dbeta(xx, 950, 650), lty="dashed", lwd=2)

Notes: (a) In the program, the code %%1 (mod 1) restricts the value of nmtr to
(0, 1). This might be necessary if you experiment with parameters different from
those in this problem. (b) Even though the isosceles prior may seem superficially
similar to the beta and normal priors, it puts no probability above 0.615, so the
posterior can put no probability there either. In contrast, the data show 620 out of
1000 respondents are in favor.

8.7 A commonly used frequentist principle of estimation provides a point


estimate of a parameter by finding the value of the parameter that maximizes
the likelihood function. The result is called a maximum likelihood esti-
mate (MLE). Here we explore one example of an MLE and its similarity to
a particular Bayesian estimate.
Suppose we observe x = 620 successes in n = 1000 binomial trials and wish
to estimate the probability π of success. The likelihood function is p(x|π) ∝
π^x (1 − π)^(n−x), taken as a function of π.
a) Find the MLE π̂. A common way to maximize p(x|π) in π is to maximize
ℓ(π) = ln p(x|π). Solve dℓ(π)/dπ = 0 for π, and verify that you have found
an absolute maximum. State the general formula for π̂ and then its value
for x = 620 and n = 1000.

d The log-likelihood is ℓ(π) = x ln π + (n − x) ln(1 − π), and setting its derivative


equal to 0 gives x/π = (n − x)/(1 − π), from which we obtain π̂ = x/n as the MLE,
for 0 < x < n. In particular, for x = 620 and n = 1000, the MLE is π̂ = 0.620.
One could establish that p(x|π) has a relative maximum at π = x/n by looking
at the second derivative of ℓ(π). However, for 0 < x < n, we notice that p(x|π) has
one of the nine shapes at the lower-right of Figure 8.6. In all of these cases, the only
horizontal tangent is at the absolute maximum.
If x = 0 or x = n, the MLE cannot be found using calculus, but we extend the
likelihood function in the obvious way and say that the MLE is 0 or 1, respectively.
With this interpretation, the formula π̂ = x/n still holds in these extreme cases. c

b) Plot the likelihood function for n = 1000 and x = 620. Approximate its
maximum value from the graph. Then do a numerical maximization with
the R script below. Compare it with the answer in part (a).
pp = seq(.001, .999, .001) # avoid ’pi’ (3.1416)
like = dbinom(620, 1000, pp); pp[like==max(like)]
plot(like, type="l") # plot not shown

> like = dbinom(620, 1000, pp); pp[like==max(like)]


[1] 0.62
c) Agresti-Coull confidence interval. The interval π̃ ± 1.96√[π̃(1 − π̃)/(n + 4)],
where π̃ = (x + 2)/(n + 4), has approximately 95% confidence for estimat-
ing π. (This interval is based on the normal approximation to the bino-
mial; see Example 1.6 on p13 and Problems 1.16 and 1.17.) Evaluate its
endpoints for 620 successes in 1000 trials.
d The output below shows the Agresti-Coull confidence interval (0.5895, 0.6496) sur-
rounding the value of π̃ = 0.6195, where we have rounded to four decimal places. c

x = 620; n = 1000; pm = -1:1


n.tilde = n + 4; pp.tilde = (x + 2)/n.tilde
pp.tilde + pm*1.96*sqrt(pp.tilde*(1 - pp.tilde)/n.tilde)

> pp.tilde + pm*1.96*sqrt(pp.tilde*(1 - pp.tilde)/n.tilde)


[1] 0.5894900 0.6195219 0.6495538

d) Now we return to Bayesian estimation. A prior distribution that provides


little, if any, definite information about the parameter to be estimated is
called a noninformative prior or flat prior. A commonly used nonin-
formative beta prior has α0 = β0 = 1, which is the same as UNIF(0, 1). For
this prior and data consisting of x successes in n trials, find the posterior
distribution and its mode.
d In Example 8.5, based on x successes in n trials and prior BETA(α0 , β0 ), the pos-
terior distribution is BETA(α0 + x, β0 + n − x). In our answer to Problem 8.1, we
have shown that the mode of BETA(α, β) is (α − 1)/(α + β − 2). So the mode of our
posterior distribution is x/n. c

e) For the particular case with n = 1000 and x = 620, find the posterior
mode and a 95% probability interval.
d In the R program below, we include a grid search for the mode of the posterior
distribution just to confirm that it agrees with the MLE—as the answer to part (d)
says it must. In view of the assertion in the Note (also see Problem 1.20), it is not
a surprise that the 95% posterior probability interval is numerically similar to the
Agresti-Coull 95% confidence interval, but it is a bit of a surprise to find agreement
to four places in this particular problem.

x = 620; n = 1000; alpha.0 = beta.0 = 1; ppp = seq(0, 1, by=.001)


post = dbeta(ppp, alpha.0 + x, beta.0 + n - x)
post.max = ppp[post==max(post)]
post.int = qbeta(c(.025,.975), alpha.0+x, beta.0+n-x)
post.max; post.int

> post.max; post.int


[1] 0.62 # matches MLE
[1] 0.5894984 0.6495697 # (.5895, .6496) is A-C CI from part (c)

Extra: This remarkable agreement in one instance raises the question of how good
the agreement is across the board. A few computations for n = 1000 and then for
n = 100 suggest that very good agreement is not unusual.
Not explored here is behavior for success ratios very near 0 or 1, where interval
estimation can become problematic, but where there is evidence that the Bayesian
intervals may perform better. Also notice that, using R, the Bayesian intervals are
easy to find. c

n = 1000; x = seq(100, 900, by=100)


p.tilde = (x+2)/(n+4)
me = 1.96*sqrt(p.tilde*(1-p.tilde)/(n+4))
ac.lwr = p.tilde - me; ac.upr = p.tilde + me
al.post = 1 + x; be.post = 1 + n - x
pi.lwr = qbeta(.025, al.post, be.post)
pi.upr = qbeta(.975, al.post, be.post)
round(cbind(x, n, ac.lwr, pi.lwr, ac.upr, pi.upr),4)

> round(cbind(x, n, ac.lwr, pi.lwr, ac.upr, pi.upr),4)


x n ac.lwr pi.lwr ac.upr pi.upr
[1,] 100 1000 0.0829 0.0829 0.1203 0.1202
[2,] 200 1000 0.1764 0.1764 0.2260 0.2259
[3,] 300 1000 0.2724 0.2724 0.3292 0.3291
[4,] 400 1000 0.3701 0.3701 0.4307 0.4307
[5,] 500 1000 0.4691 0.4691 0.5309 0.5309
[6,] 600 1000 0.5693 0.5693 0.6299 0.6299
[7,] 700 1000 0.6708 0.6709 0.7276 0.7276
[8,] 800 1000 0.7740 0.7741 0.8236 0.8236
[9,] 900 1000 0.8797 0.8798 0.9171 0.9171

n = 100; x = seq(10, 90, by=10)


...
> round(cbind(x, n, ac.lwr, pi.lwr, ac.upr, pi.upr),4)
x n ac.lwr pi.lwr ac.upr pi.upr
[1,] 10 100 0.0540 0.0556 0.1768 0.1746
[2,] 20 100 0.1330 0.1336 0.2900 0.2891
[3,] 30 100 0.2190 0.2190 0.3964 0.3961
[4,] 40 100 0.3095 0.3093 0.4981 0.4983
[5,] 50 100 0.4039 0.4036 0.5961 0.5964
[6,] 60 100 0.5019 0.5017 0.6905 0.6907
[7,] 70 100 0.6036 0.6039 0.7810 0.7810
[8,] 80 100 0.7100 0.7109 0.8670 0.8664
[9,] 90 100 0.8232 0.8254 0.9460 0.9444

Note: In many estimation problems, the MLE is in close numerical agreement with
the Bayesian point estimate based on a noninformative prior and on the posterior
mode. Also, a confidence interval based on the MLE may be numerically similar
to a Bayesian probability interval from a noninformative prior. But the underlying
philosophies of frequentists and Bayesians differ, and so the ways they interpret
results in practice may also differ.

8.8 Recall that in Example 8.6 researchers counted a total of t = 256 mice
on n = 50 occasions. Based on these data, find the interval estimate for λ
described in each part. Comment on similarities and differences.
a) The prior distribution GAMMA(α0 , κ0 ) has least effect on the posterior
distribution GAMMA(α0 + t, κ0 + n) when α0 and κ0 are both small. So
prior parameters α0 = 1/2 and κ0 = 0 give a Bayesian 95% posterior
probability interval based on little prior information.
d By formula, the posterior distribution is GAMMA(1/2 + 256, 50). Rounded to four
places, the R code qgamma(c(.025,.975), 256.5, 50) returns (4.5214, 5.7765). c

b) Assuming that nλ is large, an approximate 95% frequentist confidence
interval for λ is obtained by dividing t ± 1.96√t by n.
d Here, t ± 1.96√t is 256 ± 1.96√256 ≈ 256 ± 31.4, or (224.6, 287.4). Then dividing
by 50 gives (4.49, 5.75), which is not much different from the Bayesian interval in part (a).
Extra: Recall from Section 1.2 that the Agresti-Coull 95% confidence interval (which
“adds two successes and two failures to the data”) has more accurate coverage
probabilities than does the traditional 95% confidence interval. For similar reasons,
one can argue that a 95% confidence interval for Poisson λ based on t + 2 ± 1.96√(t + 1)
is better than one based on t ± 1.96√t. For the current problem, this adjusted interval
is (4.5316, 5.7884). c

c) A frequentist confidence interval guaranteed to have at least 95% coverage


has lower and upper endpoints computable in R as qgamma(.025, t, n)
and qgamma(.975, t+1, n), respectively.

d Rounded to four places, this interval is (4.5120, 5.7871). For practical purposes,
all of the intervals in this problem are essentially the same. Ideally, in a homework
paper, you would summarize the intervals (and their lengths) in a list for easy
comparison.
In general, discussions about which methods are best center on their coverage
probabilities (as considered for binomial intervals in Section 1.2) and on their average
lengths (see Problem 1.19). Getting useful interval estimates for Poisson λ is more
difficult when x is very small. c
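Following that suggestion, here is one way to collect the three intervals and their
lengths for comparison; the row labels are ours.

t = 256; n = 50
bayes = qgamma(c(.025, .975), .5 + t, n)                  # part (a)
approx.z = (t + c(-1, 1)*1.96*sqrt(t))/n                  # part (b)
guaranteed = c(qgamma(.025, t, n), qgamma(.975, t+1, n))  # part (c)
tab = rbind(bayes, approx.z, guaranteed)
round(cbind(tab, length = tab[,2] - tab[,1]), 4)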
Notes: (a) Actually, using κ0 = 0 gives an improper prior. See the discussion in
Problem 8.12. (b) This style of CI has coverage inaccuracies similar to those of the
traditional CIs for binomial π (see Section 1.2). (c) See Stapleton (2008), Chapter 12.

8.9 In a situation similar to that in Examples 8.2 and 8.6, suppose that we
want to begin with a prior distribution on the parameter λ that has E(λ) ≈ 8
and P {λ < 12} ≈ 0.95. Subsequently, we count a total of t = 158 mice in
n = 12 trappings.
a) To find the parameters of a gamma prior that satisfy the requirements
above, write a program analogous to the one in Problem 8.2. (You can
come very close with α0 an integer, but don’t restrict κ0 to integer values.)
d In the program below, we try values α0 = 0.1, 0.2, . . . , 20.0 and get α0 = 12.8 and
κ0 = 1.6, which give E(λ) = 12.8/1.6 = 8 and P {λ < 12} = 0.95, to three places.
Using integer values of α0 , as suggested in the problem, we would get α0 = 13 and
κ0 = 1.625, which give E(λ) = 8 and P {λ < 12} a little above 95%. c

alpha = seq(.1, 20, by=.1) # trial values of alpha


kappa = alpha/8 # corresp values of kappa for mean 8
# Vector of probabilities for interval (0, 12)
prob = pgamma(12, alpha, kappa)
prob.err = abs(.95 - prob) # errors for probabilities
# Results: Target parameter values
t.al = alpha[prob.err==min(prob.err)]
t.ka = t.al/8 # NOT rounded to integer
t.al; t.ka

> t.al; t.ka


[1] 12.8
[1] 1.6

# Checking: Achieved mean and probability


a.mean = t.al/t.ka
a.prob = pgamma(12, t.al, t.ka)
a.mean; a.prob

> a.mean; a.prob


[1] 8
[1] 0.9500812

b) Find the gamma posterior that results from the prior in part (a) and the
data given above. Find the posterior mean and a 95% posterior probability
interval for λ.
d We use the prior GAMMA(α0 = 12.8, κ0 = 1.6) along with the data t = 158 and
n = 12 to obtain the posterior GAMMA(αn = α0 + t = 170.8, κn = κ0 + n = 13.6),
as on p202 of the text. Then the R code qgamma(c(.025, .975), 170.8, 13.6)
returns the 95% posterior probability interval (10.75, 14.51). Of course, if you used
a slightly different prior, your answer may differ slightly. c
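The posterior mean requested in the problem can be computed along with the
interval as follows (parameters as above).

alpha.n = 12.8 + 158; kappa.n = 1.6 + 12      # posterior parameters
alpha.n/kappa.n                               # posterior mean, about 12.56
qgamma(c(.025, .975), alpha.n, kappa.n)       # 95% posterior interval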
c) As in Figure 8.2(a), plot the prior and the posterior. Why is the posterior
here less concentrated than the one in Figure 8.2(a)?
d The code below graphs the prior (black) and posterior (blue) density curves. The
posterior is less concentrated than the one in Example 8.6 because it is based on
much less data. c
xx = seq(2, 18, by=.01); top = max(dgamma(xx, 170.8, 13.6))
plot(xx, dgamma(xx, 12.8, 1.6), type="l", lwd=2,
ylim=c(0, top), xlab="Mice in Region", ylab="Density",
main="Prior (black) and Posterior Densities")
lines(xx, dgamma(xx, 170.8, 13.6), col="blue")
abline(h=0, col="darkgreen")
d) The ultimate noninformative gamma prior is the improper prior having
α0 = κ0 = 0 (see Problems 8.7 and 8.12 for definitions). Using this prior
and the data above, find the posterior mean and a 95% posterior proba-
bility interval for λ. Compare the interval with the interval in part (b).
d The code qgamma(c(.025, .975), 158, 12) returns the required posterior prob-
ability interval (11.19, 15.30), of length 4.10 (based on unrounded results). Owing
to the influence of the informative prior of part (b), the interval (10.75, 14.51), of
length 3.76, is shorter and shifted a little to the left. But see the comment below. c
Partial answers: In (a) you can use a prior with α0 = 13. Our posterior intervals
in (b) and (d) agree when rounded to integer endpoints: (11, 15), but not when
expressed to one- or two-place accuracy—as you should do.
8.10 In this chapter, we have computed 95% posterior probability intervals
by finding values that cut off 2.5% from each tail. This method is computa-
tionally relatively simple and gives satisfactory intervals for most purposes.
However, for skewed posterior densities, it does not give the shortest interval
with 95% probability.
The following R script finds the shortest interval for a gamma posterior.
(The vectors p.low and p.up show endpoints of enough 95% intervals that
we can come very close to finding the one for which the length, long, is a
minimum.)
d The suggested code, slightly modified, has been moved to part (a). See also the
Extra example in the answer to Problem 8.1(a)—where there is not much difference
between the shortest and the probability-symmetric probability intervals. c

a) Compare the length of the shortest interval with that of the usual
(probability-symmetric) interval. What probability does the shortest in-
terval put in each tail?
alp = 5; kap = 1
p.lo = seq(.001,.05, .00001); p.up = .95 + p.lo
q.lo = qgamma(p.lo, alp, kap); q.up = qgamma(p.up, alp, kap)
long = q.up - q.lo # avoid confusion with function ‘length’
cond = (long==min(long))
PI.short = c(q.lo[cond], q.up[cond]); PI.short # shortest PI
diff(PI.short) # length of shortest PI
pr =c(p.lo[cond], 1-p.up[cond]); pr # probs in each tail
dens.ht = dgamma(PI.short, alp, kap); dens.ht # for part (c)
PI.sym = qgamma(c(.025,.975), alp, kap); PI.sym # prob-sym PI
diff(PI.sym) # length of prob-symmetric PI

> PI.short = c(q.lo[cond], q.up[cond]); PI.short # shortest PI


[1] 1.207021 9.430278
> diff(PI.short) # length of shortest PI
[1] 8.223257
> pr =c(p.lo[cond], 1-p.up[cond]); pr # probs in each tail
[1] 0.00793 0.04207
> dens.ht = dgamma(PI.short, alp, kap); dens.ht # for part (c)
[1] 0.02645121 0.02644655
> PI.sym = qgamma(c(.025,.975), alp, kap); PI.sym # prob-sym PI
[1] 1.623486 10.241589
> diff(PI.sym) # length of prob-symmetric PI
[1] 8.618102

b) Use the same method to find the shortest 95% posterior probability inter-
val in Example 8.6. Compare it with the probability interval given there.
Repeat, using suitably modified code, for 99% intervals.
d In Example 8.6, the posterior probability interval is based on αn = 260 and
κn = 50.33. For large α, gamma distributions become nearly symmetrical. So the
shortest and probability-symmetric intervals are nearly the same for this example. c

alp = 260; kap = 50.33


p.lo = seq(.001,.05, .00001); p.up = .95 + p.lo
q.lo = qgamma(p.lo, alp, kap); q.up = qgamma(p.up, alp, kap)
long = q.up - q.lo # avoid confusion with function ‘length’
cond = (long==min(long))
PI.short = c(q.lo[cond], q.up[cond]); PI.short # shortest PI
diff(PI.short) # length of shortest PI
pr =c(p.lo[cond], 1-p.up[cond]); pr # probs in each tail
dens.ht = dgamma(PI.short, alp, kap); dens.ht # for part (c)
PI.sym = qgamma(c(.025,.975), alp, kap); PI.sym # prob-sym PI
diff(PI.sym) # length of prob-symmetric PI

> PI.short = c(q.lo[cond], q.up[cond]); PI.short # shortest PI


[1] 4.544292 5.798645
> diff(PI.short) # length of shortest PI
[1] 1.254353
> pr =c(p.lo[cond], 1-p.up[cond]); pr # probs in each tail
[1] 0.02258 0.02742
> dens.ht = dgamma(PI.short, alp, kap); dens.ht # for part (c)
[1] 0.1824679 0.1825161
> PI.sym =qgamma(c(.025, .975), alp, kap); PI.sym # prob-sym PI
[1] 4.557005 5.812432
> diff(PI.sym) # length of prob-symmetric PI
[1] 1.255427

# 99 percent probability interval


alp = 260; kap = 50.33
p.lo = seq(.0001, .01, .00001); p.up = .99 + p.lo
...
> PI.short = c(q.lo[cond], q.up[cond]); PI.short # shortest PI
[1] 4.365433 6.014426
> diff(PI.short) # length of shortest PI
[1] 1.648993
> pr =c(p.lo[cond], 1-p.up[cond]); pr # probs in each tail
[1] 0.0044 0.0056
> dens.ht = dgamma(PI.short, alp, kap); dens.ht # for part (c)
[1] 0.04508353 0.04513106
> PI.sym = qgamma(c(.025,.975), alp, kap); PI.sym # prob-sym PI
[1] 4.557005 5.812432
> diff(PI.sym) # length of prob-symmetric PI
[1] 1.255427
> PI.sym =qgamma(c(.005, .995), alp, kap); PI.sym # prob-sym PI
[1] 4.378008 6.028411
> diff(PI.sym)
[1] 1.650404
c) Suppose a posterior density function has a single mode and decreases
monotonically as the distance away from the mode increases (for example,
a gamma density with α > 1). Then the shortest 95% posterior probability
interval is also the 95% probability interval corresponding to the highest
values of the posterior: a highest posterior density interval. Explain
why this is true. For the 95% intervals in parts (a) and (b), verify that the
heights of the posterior density curve are indeed the same at each end of
the interval (as far as allowed by the spacing 0.00001 of the probability
values used in the script).
d The programs for parts (a) and (b) find that the density function is essentially equal
at the endpoints of the shortest interval. If such an interval is adjusted by shifting a
small probability from one tail to the other, the interval must become longer because
a high-density strip on one side is replaced by a longer, lower-density strip on the
other side. c

8.11 Mark-recapture estimation of population size. In order to estimate the


number ν of fish in a lake, investigators capture r of these fish at random,
tag them, and then release them. Later (leaving time for mixing but not for
significant population change), they capture s fish at random from the lake
and observe the number x of tagged fish among them. Suppose r = 900,
s = 1100, and we observe x = 103. (This is similar to the situation described
in Problem 4.27, partially reprised here in parts (a) and (b).)
a) Method of moments estimate (MME). At recapture, an unbiased estimate
of the true proportion r/ν of tagged fish in the lake is x/s. That is,
E(x/s) = r/ν. To find the MME of ν, equate the observed value x/s to its
expectation and solve for ν. (It is customary to truncate to an integer.)
d From r/ν = x/s, we obtain ν = rs/x. Upon truncating fractional values to the
next lower integer (the “floor”), we find the MME ν̃ = ⌊rs/x⌋ of ν. For the data given
in the problem, we have ν̃ = ⌊900(1100)/103⌋ = ⌊9611.65⌋ = 9611. This is often
called the Lincoln-Petersen method of estimating population size. It is also discussed
in Problem 4.27.
Extra. Sometimes beginning students get the idea that MMEs are invariably unbi-
ased, which is not the case when nonlinear operations are required in order to solve
the appropriate moment equation. An example from a previous chapter is that the
sample variance S² of a random sample from NORM(µ, σ) has E(S²) = σ². The
sample variance is the MME for θ = σ². Accordingly, the MME σ̃ for σ is σ̃ = S.
However, as mentioned in the Note to Problem 4.15, this is a biased estimator;
taking the square root is a nonlinear operation.
In the current example, a simple simulation shows that ν̃ is biased. A variant
of the estimator, attributed to Schnabel, is ν̌ = (r + 1)(s + 1)/(x + 1) − 1. It is
said to have less bias than the Lincoln-Petersen estimator. [There is a huge amount
of material on the Internet about mark-recapture (also called capture-recapture)
methods, of which a distressingly large fraction seems to be incorrect or misleading.
At the instant this is written, the Wikipedia article is (admittedly) fragmentary
and incompletely documented, but we found no obvious errors in it. This article
mentions the Schnabel method and shows a formula for its variance.]
The program below assumes ν = 9500, carries out m = 100 000 simulated mark-
recapture experiments with r = 900 and s = 1100, finds both estimates of ν for
each experiment, and approximates the expectations of ν̃ and ν̌. From the output,
it does seem that the Schnabel method has the smaller bias. c

set.seed(2345)
m = 100000; nu = 9500; r = 900; s = 1100
x = rhyper(m, 900, nu - r, 1100)
nu.est.lp = floor(r*s/x)
nu.est.sch = floor((r+1)*(s+1)/(x+1) - 1)
mean(nu.est.lp); mean(nu.est.sch)
sd(nu.est.lp); sd(nu.est.sch)

> mean(nu.est.lp); mean(nu.est.sch)


[1] 9577.072
[1] 9502.757

> sd(nu.est.lp); sd(nu.est.sch)


[1] 855.218
[1] 840.2448
b) Maximum likelihood estimate (MLE). For known r, s, and ν, the hyper-
geometric probability function pr,s(x|ν) = C(r, x)C(ν − r, s − x)/C(ν, s), where
C(n, k) denotes a binomial coefficient, gives the prob-
ability of observing x tagged fish at recapture. With known r and s and
observed data x, the likelihood function of ν is pr,s (x|ν). Find the MLE;
that is, the value of ν that maximizes pr,s (x|ν).
d We wish to maximize the likelihood function pr,s(x|ν) in ν for the given design
values r and s and observed x. Because the parameter ν is integer-valued, we cannot
use differential calculus to find this maximum. However, writing the binomial
coefficients in terms of factorials and doing some algebra, one can show that the ratio
R(ν) = pr,s(x|ν)/pr,s(x|ν − 1) exceeds 1 exactly when ν < rs/x, so the likelihood first
increases to a maximum and then decreases. The MLE ν̂ = ⌊rs/x⌋ occurs where
R(ν) ≈ 1. In some cases, two adjacent values of ν may maximize pr,s(x|ν). (This
method of maximization is discussed in Chapter II of Feller (1968), along with brief
comments on finding a frequentist confidence interval for ν.)
Below, we do a grid search for ν̂ in the particular instance with r = 900, s = 1100,
and x = 103. The result agrees with the formula ν̂ = ⌊rs/x⌋. c
nu = 8000:11000; r = 900; s = 1100; x = 103
like = dhyper(x, 900, nu - r, 1100)
nu[like==max(like)]; floor(r*s/x)

> nu[like==max(like)]; floor(r*s/x)


[1] 9611
[1] 9611
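As a numerical check of the ratio argument above, the sketch below evaluates R(ν)
on a grid chosen (arbitrarily) to bracket the MLE.

r = 900; s = 1100; x = 103; nu = 9000:10000
R = dhyper(x, r, nu - r, s)/dhyper(x, r, nu - 1 - r, s)   # R(nu)
max(nu[R > 1])     # last nu at which the likelihood still increases; 9611, the MLE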
c) Bayesian interval estimate. Suppose we believe ν lies in (6000, 14 000) and
are willing to take the prior distribution of ν as uniform on this interval.
Use the R code below to find the cumulative posterior distribution of ν|x
and thence a 95% Bayesian interval estimate of ν. Explain the code.
d In this problem, the uniform prior density is not conjugate to the likelihood func-
tion, so we cannot ignore constants of summation and must evaluate the denominator
of the general version of Bayes’ Theorem, equation (8.2). c
r = 900; s = 1100; x = 103; nu = 6000:14000; n = length(nu)
prior = rep(1/n, n); like = dhyper(x, r, nu-r, s)
denom = sum(prior*like)
post = prior*like/denom; cumpost = cumsum(post)
c(min(nu[cumpost >= .025]), max(nu[cumpost <= .975]))

> c(min(nu[cumpost >= .025]), max(nu[cumpost <= .975]))


[1] 8228 11663
d) Use the negative binomial prior: prior = dnbinom(nu-150, 150, .014).
Compare the resulting Bayesian interval with that of part (c) and with a
bootstrap confidence interval obtained as in Problem 4.27.

d The modified program is shown below, along with the 95% Bayesian probability
interval it computes. The version of the negative binomial distribution implemented
in R counts only the number of Failures up until the required number (here 150) of
Successes are encountered, and so it has support 0, 1, 2, . . . . Because we use nu - 150
as the first argument of the negative binomial probability function in R, the values
of ν with positive probability are 150, 151, 152, . . . .

r = 900; s = 1100; x = 103; nu = 6000:14000; n = length(nu)


prior = dnbinom(nu-150, 150, .014); like = dhyper(x, r, nu-r, s)
denom = sum(prior*like)
post = prior*like/denom; cumpost = cumsum(post)
c(min(nu[cumpost >= .025]), max(nu[cumpost <= .975]))

> c(min(nu[cumpost >= .025]), max(nu[cumpost <= .975]))


[1] 9043 11512

The mean of the negative binomial prior (counting both Successes and Failures)
is 150/.014 = 10714.29, which is somewhat larger than the data alone would suggest.
So it is not surprising that the resulting probability interval covers somewhat larger
values than the probability interval resulting from the flat prior in part (c).
We leave it to you to make the trivial change in the program of Problem 4.27(d).
With seed 1935, we obtained the simple bootstrap CI (8181, 11 511); our CI from
the percentile method was (7711, 11 041). Maybe it would be worthwhile for you
to explore what bootstrap CIs result if the nearly-unbiased Schnabel estimator of
population size is used throughout the parametric bootstrap procedure. c
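A hedged sketch of that suggestion: a parametric percentile bootstrap that uses the
Schnabel estimator throughout. The seed, the number of replications B, and the use
of the percentile method are our choices for illustration, so your numbers will differ.

set.seed(1935)
r = 900; s = 1100; x = 103; B = 10000
nu.hat = floor((r+1)*(s+1)/(x+1) - 1)          # Schnabel point estimate
x.star = rhyper(B, r, nu.hat - r, s)           # simulated recapture counts
nu.star = floor((r+1)*(s+1)/(x.star+1) - 1)    # re-estimates of nu
quantile(nu.star, c(.025, .975))               # percentile bootstrap CI for nu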
8.12 In Example 8.7, we show formulas for the mean and precision of the
posterior distribution. Suppose five measurements of the weight of the beam,
using a scale known to have precision τ = 1, are: 698.54, 698.45, 696.09,
697.14, 698.62 (x̄ = 697.76).
a) Based on these data and the prior distribution of Example 8.3, what is
the posterior mean of µ? Does it matter whether we choose the mean, the
median, or the mode of the posterior distribution as our point estimate?
(Explain.) Find a 95% posterior probability interval for µ. Also, suppose
we are unwilling to use this beam if it weighs more than 699 pounds; what
are the chances of that?
d The formulas µn = (τ0/τn)µ0 + (nτ/τn)x̄ and τn = τ0 + nτ are used in the R
code below. The parameters and distributions are: prior NORM(µ0, σ0 = 1/√τ0);
likelihood NORM(µ, σ = 1/√τ), in which σ is known and µ is to be estimated using
data x̄; and posterior NORM(µn, σn = 1/√τn).
The last line of the R code below computes the 95% posterior probability interval
for comparison with intervals in parts (c) and (d). c

mu.0 = 700; sg.0 = 10; tau.0 = 1/sg.0^2


x.bar = 697.76; n = 5; tau = 1
tau.n = tau.0 + n*tau
mu.n = (tau.0/tau.n)*mu.0 + (n*tau/tau.n)*x.bar
mu.n; tau.n

1 - pnorm(699, mu.n, 1/sqrt(tau.n))


qnorm(c(.025, .975), mu.n, 1/sqrt(tau.n))

> mu.n; tau.n


[1] 697.7645
[1] 5.01
> 1 - pnorm(699, mu.n, 1/sqrt(tau.n))
[1] 0.002841884
> qnorm(c(.025, .975), mu.n, 1/sqrt(tau.n))
[1] 696.8888 698.6401

b) Modify the R script shown in Example 8.5 to plot the prior and posterior
densities on the same axes. (Your result should be similar to Figure 8.3.)
d The distributions, parameters, and data are as in part (a). The code is shown, but
not the resulting plot, which is indeed similar to Figure 8.3. c

x = seq(680, 715, .001)


mu.0 = 700; sg.0 = 10; tau.0 = 1/sg.0^2
x.bar = 697.76; n = 5; tau = 1
tau.n = tau.0 + n*tau
mu.n = (tau.0/tau.n)*mu.0 + (n*tau/tau.n)*x.bar

prior = dnorm(x, mu.0, 1/sqrt(tau.0))


post = dnorm(x, mu.n, 1/sqrt(tau.n))
plot(x, post, type="l", ylim=c(0, max(prior,post)),
lwd=2, col="blue")
lines(x, prior)

c) Taking a frequentist point of view, use the five observations given above
and the known variance of measurements produced by our scale to give a
95% confidence interval for the true weight of the beam. Compare it with
the results of part (a) and comment.
d For n = 5 observations with mean x̄ = 697.76, chosen at random from a normal
population with unknown mean µ and known σ = 1, the 95% confidence interval
for µ is x̄ ± 1.96σ/√n. This computes as (696.8835, 698.6365). c
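The corresponding computation in R:

x.bar = 697.76; n = 5; sigma = 1
x.bar + c(-1, 1)*1.96*sigma/sqrt(n)    # (696.8835, 698.6365)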

d) The prior distribution in this example is very “flat” compared with the
posterior: its precision is small. A practically noninformative normal prior
is one with precision τ0 that is much smaller than the precision of the data.
As τ0 decreases, the effect of µ0 diminishes. Specifically, limτ0 →0 µn = x̄
and limτ0 →0 τn = nτ. The effect is as if we had used p(µ) ∝ 1 as the
prior.
R ∞ Of course, such a prior distribution is not strictly possible because
−∞
p(µ) dµ would be ∞. But it is convenient to use such an improper
prior as shorthand for understanding what happens to a posterior as
the prior gets less and less informative. What posterior mean and 95%
probability interval result from using an improper prior with our data?
Compare with the results of part (c).

d Numerically, the Bayesian 95% posterior probability interval is the same as the 95%
confidence interval in part (c). The interpretation might differ depending on the
user’s philosophy of inference. Computationally, the Bayesian interval is found in R
as qnorm(c(.025, .975), 697.76, 1/sqrt(5/1)). c

e) Now change the example: Suppose that our vendor supplies us with a
more consistent product so that the prior NORM(701, 5) is realistic and
that our data above come from a scale with known precision τ = 0.4.
Repeat parts (a) and (b) for this situation.
d The code below is the obvious modification of the code in part (a). We leave it to
you to modify the program of part (b) and make the resulting plot. The prior is not
as flat as in parts (a) and (b), but the data now have more precision than before,
and so the data still predominate in determining the posterior. c

mu.0 = 705; sg.0 = 5; tau.0 = 1/sg.0^2


x.bar = 697.76; n = 5; tau = 0.4
tau.n = tau.0 + n*tau
mu.n = (tau.0/tau.n)*mu.0 + (n*tau/tau.n)*x.bar
mu.n; tau.n
1 - pnorm(699, mu.n, 1/sqrt(tau.n))
qnorm(c(.025, .975), mu.n, 1/sqrt(tau.n))

> mu.n; tau.n


[1] 697.902
[1] 2.04
> 1 - pnorm(699, mu.n, 1/sqrt(tau.n))
[1] 0.05840397
> qnorm(c(.025, .975), mu.n, 1/sqrt(tau.n))
[1] 696.5297 699.2742
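The plot left to the reader can be made by reusing the structure of part (b); the
sketch below simply repeats the parameter values from the solution code above.

x = seq(685, 720, .001)
mu.0 = 705; sg.0 = 5; tau.0 = 1/sg.0^2
x.bar = 697.76; n = 5; tau = 0.4
tau.n = tau.0 + n*tau
mu.n = (tau.0/tau.n)*mu.0 + (n*tau/tau.n)*x.bar
prior = dnorm(x, mu.0, 1/sqrt(tau.0))
post = dnorm(x, mu.n, 1/sqrt(tau.n))
plot(x, post, type="l", ylim=c(0, max(prior, post)), lwd=2, col="blue")
lines(x, prior)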

8.13 (Theoretical) The purpose of this problem is to derive the posterior


distribution p(µ|x) resulting from the prior NORM(µ0 , σ0 ) and n independent
observations xi ∼ NORM(µ, σ). (See Example 8.7.)
a) Show that the likelihood is
      f(x|µ) ∝ ∏_{i=1}^n exp[−(xi − µ)²/(2σ²)] ∝ exp[−(1/(2σ²)) Σ_{i=1}^n (x̄ − µ)²].
To obtain the first expression above, recall that the likelihood function
is the joint density function of x = (x1, . . . , xn)|µ. To obtain the second,
write (xi − µ)² = [(xi − x̄) + (x̄ − µ)]², expand the square, and sum over i.
On distributing the sum, you should obtain three terms. One of them
provides the desired result, another is 0, and the third is irrelevant because
it does not contain the variable µ. (A constant term in the exponential is
a constant factor of the likelihood, which is not included in the kernel.)

d Following these suggestions, we have

  f(x|µ) ∝ ∏_{i=1}^n exp[−(xi − µ)²/(2σ²)] = exp{−(1/(2σ²)) Σ_{i=1}^n [(xi − x̄) + (x̄ − µ)]²}

         = exp{−(1/(2σ²)) Σ [(xi − x̄)² + 2(xi − x̄)(x̄ − µ) + (x̄ − µ)²]}

         = exp{−(1/(2σ²)) [Σ (xi − x̄)² + 2(x̄ − µ) Σ (xi − x̄) + Σ (x̄ − µ)²]}

         ∝ exp{−(1/(2σ²)) [2(x̄ − µ)(0) + Σ (x̄ − µ)²]} = exp[−(1/(2σ²)) Σ (x̄ − µ)²]

         = exp[−(n/(2σ²)) (x̄ − µ)²].

Inside the square bracket in the middle line above, the first term is the one that is
irrelevant because it does not contain µ (hence the proportionality symbol ∝ at the
start of the next line), the second is 0 (because Σ (xi − x̄) = 0), and the third gives
the expression we want. The final line is a further simplification used in part (b). c

b) To derive the expression for the kernel of the posterior, multiply the kernels
of the prior and the likelihood, and expand the squares in each. Then put
everything in the exponential over a common denominator, and collect
terms in µ² and µ. Terms in the exponent that do not involve µ are
constant factors of the posterior density that may be adjusted as required
in completing the square to obtain the desired posterior kernel.
d In terms of τn and µn, we express the kernel of the posterior density as

  p(µ|x) ∝ p(µ) p(x|µ) ∝ exp[−(µ − µ0)²/(2σ0²)] × exp[−(n/(2σ²)) (x̄ − µ)²]

         = exp[−(τ0/2)(µ − µ0)²] × exp[−(nτ/2)(x̄ − µ)²]

         = exp[−(τ0/2)(µ² − 2µµ0 + µ0²) − (nτ/2)(x̄² − 2x̄µ + µ²)]

         ∝ exp[−(τ0/2)(µ² − 2µµ0) − (nτ/2)(−2x̄µ + µ²)]

         = exp[−((τ0 + nτ)/2) µ² + (τ0µ0 + nτx̄) µ] = exp[−(τn/2)(µ² − 2µµn)]

         ∝ exp[−(τn/2)(µ² − 2µµn + µn²)] = exp[−(τn/2)(µ − µn)²].

We recognize the last term as the kernel of NORM(µn, 1/√τn) = NORM(µn, σn).
The proportionality symbol on the fourth line indicates that terms not involving µ
have been dropped; the one on the last line indicates that such a term has been added
in order to complete the square. (Because we used precisions instead of variances
after the first line, the “common denominator” of the suggested procedure became,
in effect, a common factor.) c

8.14 For a pending American football game, the “point spread” is estab-
lished by experts as a measure of the difference in ability of the two teams. The
point spread is often of interest to gamblers. Roughly speaking, the favored
team is thought to be just as likely to win by more than the point spread as
to win by less or to lose. So ideally a fair bet that the favored team “beats the
spread” could be made at even odds. Here we are interested in the difference
x = v − w between the point spread v, which might be viewed as the favored
team’s predicted lead, and the actual point difference w (the favored team’s
score minus its opponent’s) when the game is played.
a) Suppose an amateur gambler, perhaps interested in bets that would not
have even odds, is interested in the precision of x and is willing to assume
x ∼ NORM(0, σ). Also, recalling relatively few instances with |x| > 30,
he decides to use a prior distribution on σ that satisfies P {10 < σ < 20} =
P {100 < σ² = 1/τ < 400} = P {1/400 < τ < 1/100} = 0.95. Find parame-
ters α0 and κ0 for a gamma-distributed prior on τ that approximately
satisfy this condition. (Imitate the program in Problem 8.2.)
d We wish to imitate the program of Problem 8.2(b) (for a beta prior) or, perhaps
more directly, the program of Problem 8.9(a) (already adapted for a gamma prior).
Both of these programs are based on a target mean value. Because we want to have
P {10 < σ < 20}, it seems reasonable to suppose the prior mean may lie a little to
the right of the center 15 of this interval, perhaps around 16. On the precision scale,
this would indicate α0/κ0 = 1/16² ≈ 0.004. (The parameters mentioned in the Hints
give mean 0.0044.) Consistent with this guess, we seek parameters α0 and κ0 for a
gamma prior on τ with P {1/400 < τ < 1/100} = 0.95.
The program below yields parameters α0 = 16 and κ0 = 4000, which put
close to 95% of the probability in this interval. (Your parameters may be somewhat
different, depending on the details of your search.)

alpha = 1:50; mu = 0.004; kappa = round(alpha/mu)


prob = pgamma(1/100, alpha, kappa) - pgamma(1/400, alpha, kappa)
prob.err = abs(.95 - prob)
# Trial parameter values
t.al = alpha[prob.err==min(prob.err)]
t.ka = kappa[prob.err==min(prob.err)]
t.al; t.ka
# Check for central and lower-tail probabilities
diff(pgamma(c(1/400, 1/100), t.al, t.ka))
pgamma(1/400, t.al, t.ka)

> t.al; t.ka


[1] 16
[1] 4000
> # Check for central and lower-tail probabilities
> diff(pgamma(c(1/400, 1/100), t.al, t.ka))
[1] 0.9512541
> pgamma(1/400, t.al, t.ka)
[1] 0.0487404

The last line of the program shows that most of the 5% of probability that
this prior distribution puts outside the interval (1/20², 1/10²) lies in the left tail. Before
deciding that this is unacceptably lopsided, the gambler needs to ponder whether an
initial guess such as P {12.7 < σ < 21} = 95% would express his prior view about as
well: diff(pgamma(c(1/21^2, 1/12.7^2), t.al, t.ka)) returns about 95%, and
about 2.5% is in the lower tail. In practice, it is all right not to be too finicky about how well
a member of the desired prior family can match a rough probability guess about
prior information. c

b) Suppose data for point spreads and scores of 146 professional football
games show s = (Σ xi²/n)^(1/2) = 13.3. Under the prior distribution of
part (a), what 95% posterior probability intervals for τ and σ result from
these data?
d From the discussion in Example 8.8, we see that the posterior distribution on τ
is GAMMA(αn = α0 + n/2, κn = κ0 + ns2 /2). For our data, these parameters are
αn = 16 + 146/2 = 89 and κn = 4000 + 146(13.3²)/2 = 16912.97. (They are also
the parameters of the inverse gamma distribution on θ.) The 95% posterior prob-
ability interval for τ can be computed as qgamma(c(.025,.975), 89, 16912.97),
which returns (0.004226, 0.006410). Taking reciprocals and square roots, we have
(12.49, 15.38) as the corresponding probability interval for σ. (Of course, if your
parameters from part (a) are different from ours, your probability intervals will also
differ. But, in any case, you should get something like (12, 15).) c
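In code, with the prior parameters found in part (a):

alpha.0 = 16; kappa.0 = 4000; n = 146; s2 = 13.3^2
alpha.n = alpha.0 + n/2; kappa.n = kappa.0 + n*s2/2
alpha.n; kappa.n                                 # 89 and 16912.97
tau.int = qgamma(c(.025, .975), alpha.n, kappa.n); tau.int
sort(1/sqrt(tau.int))                            # interval for sigma, about (12.5, 15.4)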

c) Use the noninformative improper prior distribution with α0 = κ0 = 0 and


the data of part (b) to find 95% posterior probability intervals for τ and σ.
Also, use these data to find the frequentist 95% confidence interval for σ
based on the distribution CHISQ(146), and compare it with the posterior
probability interval for σ.
d Bayesian interval from an improper prior. The parameters for the posterior gamma
distribution on τ are αn = 0 + 146/2 = 73 and κn = 0 + 146(13.3²)/2 = 12912.97.
Much as in part (b), the 95% posterior probability interval for τ can be computed
as qgamma(c(.025, .975), 73, 12912.97), which returns (0.004431, 0.007022).
The corresponding probability interval (11.934, 15.022) for σ is obtained by tak-
ing reciprocals and square roots. For practical purposes, this interval for σ is similar
to the one in part (b), because the prior distribution in part (b) did not provide a
great deal of information.
Frequentist confidence interval. For known µ = 0, the statistic s given in the state-
ment of the problem has ns²/σ² ∼ CHISQ(n), so a 95% CI for σ² can be computed
as 146*13.3^2/qchisq(c(.975, .025), 146), which returns (142.41, 225.67). Upon
taking square roots, we have (11.934, 15.022) as the corresponding CI for σ. (Notice
that this interval includes s = 13.3, as must be the case for such a CI.)
Numerically, this CI is precisely the same as the Bayesian interval from an im-
proper prior. (Recall the relationship between the gamma and chi-squared distri-
butions discussed in Chapter 2 and reiterated in Problem 8.15(d).) For practical
purposes, the nature of the data (n = 146 integer observations) might ordinarily

prompt a statistician to express the interval for σ to only one decimal place accu-
racy, but we have displayed three places here to emphasize that the two intervals
for σ in this part are essentially equal. c
Notes and hints: (a) Parameters α0 = 11, κ0 = 2500 give probability 0.9455, but
your program should give integers that come closer to 95%. (b) The data x in
part (b), taken from more extensive data available online, Stern (1992), are for 1992
NFL home games; x̄ ≈ 0 and the data pass standard tests for normality. For a
more detailed discussion and analysis of point spreads, see Stern (1991). (c) The
two intervals for σ agree closely, roughly (12, 15). You should report results to one
decimal place.

8.15 We want to know the precision of an analytic device. We believe its


readings are normally distributed and unbiased. We have five standard spec-
imens of known value to use in testing the device, so we can observe the
error xi that the device makes for each specimen. Thus we assume that the xi
are independent NORM(0, σ), and we wish to estimate σ = 1/√τ.
a) We use information from the manufacturer of the device to determine
a gamma-distributed prior for τ . This information is provided in terms
of σ. Specifically, we want the prior to be consistent with a median of
about 0.65 for σ and with P {σ < 1} ≈ 0.95. If a gamma prior distribution
on τ has parameter α0 = 5, then what value of the parameter κ0 comes
close to meeting these requirements?
d Expressed in terms of the precision τ, the conditions are that the median of τ is near
0.65^(−2) ≈ 2.37 and that P {τ < 1} ≈ 0.05. Simple grid searches (or guesses) using
α0 = 5 show we can use the integer value κ0 = 2: pgamma(c(1, 2.37), 5, 2)
returns 0.05265 (roughly 5%) and about 0.51 (roughly 50%). c
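One possible grid search, in the spirit of Problem 8.9(a); the grid and step size
below are arbitrary choices.

alpha.0 = 5; kappa = seq(.5, 5, by=.01)
prob.err = abs(.05 - pgamma(1, alpha.0, kappa))   # want P(tau < 1) near 0.05
kappa[prob.err == min(prob.err)]                  # best kappa on this grid
# Check the rounded integer value kappa.0 = 2
1/sqrt(qgamma(.5, alpha.0, 2))                    # median of sigma, about 0.65
1 - pgamma(1, alpha.0, 2)                         # P(sigma < 1), about 0.95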

b) The following five errors are observed when analyzing test specimens:
−2.65, 0.52, 1.82, −1.41, 1.13. Based on the prior distribution in part (a)
and these data, find the posterior distribution, the posterior median value
of τ , and a 95% posterior probability interval for τ . Use these to give the
posterior median value of σ and a 95% posterior probability interval for σ.
d The mean of these n = 5 observations is x̄ = −0.118, which makes the assumption
that µ = 0 seem realistic. There are too few observations for a worthwhile test of
normality, so we will have to take that assumption on faith.
Moreover, s² = (Σ xi²)/n = 13.8703/5 = 2.774. Then αn = α0 + n/2 = 5 + 2.5 = 7.5
and κn = κ0 + ns²/2 = 2 + 13.8703/2 = 8.93515. The posterior median and 95% prob-
ability interval for τ are found from qgamma(c(.5, .025, .975), 7.5, 8.93515),
which returns 0.802 for the median and (0.3504, 1.5382) for the interval. In terms of
σ = 1/√τ, the median is 1.116 and the interval is (0.8063, 1.6893). c

c) On the same axes, make plots of the prior and posterior distributions of τ .
Comment.

d The R code required to make the plot is similar to that used in Example 8.5 and
Problem 8.12(b). c

d) Taking a frequentist approach, find the maximum likelihood estimate
(MLE) τ̂ of τ based on the data given in part (b). Also, find 95% confidence
intervals for σ², σ, and τ. Use the fact that Σ_{i=1}^n xi²/σ² ∼ CHISQ(n) =
GAMMA(n/2, 1/2). Compare these with the Bayesian results in part (b).

d The MLEs are σ̂² = s² = 2.774, σ̂ = √2.774 = 1.666, and τ̂ = 1/2.774 = 0.3605.
(See the Notes below.) The 95% confidence interval for σ² can be found in R as
13.8703/qchisq(c(.975, .025), 5), which returns (1.081, 16.687). Taking square
roots, we have (1.04, 4.08) as the 95% CI for σ. Notice that the CIs for σ² and σ
include 2.774 and 1.666, respectively.
Extra. A Bayesian 95% posterior interval from a (noninformative) improper prior
for σ is found using the R code 1/sqrt(qgamma(c(.975, .025), 5/2, 13.8703/2)),
which returns (1.04, 4.08). This is numerically identical to the 95% CI for σ found
just above using the chi-squared distribution. (See Problem 8.14(c).)
The substantial difference between the Bayesian intervals based on the improper
prior and the prior of parts (a) and (b) shows that the prior we used in those earlier
parts carries a considerable amount of information. In particular, specifying α0 = 5
and κ0 = 2 is similar to imagining that we know of an earlier experiment with this
analytic device based on n = 2α0 = 10 observations yielding Σ xi² = 2κ0 = 4 or
s² = 0.4.
If we are getting information from the manufacturer, it is certainly reasonable
to suppose engineers there ran many experiments before the device was marketed.
An important question is whether the conditions under which they tested the device
are comparable with the conditions under which we are using it. c
Notes: The invariance principle of MLEs states that τ̂ = 1/σ̂², where “hats”
indicate MLEs of the respective parameters. Also, the median of a random variable
is invariant under any monotone transformation. Thus, for the prior or posterior
distribution of τ (always positive), Med(τ) = 1/Med(σ²) = 1/[Med(σ)]². But, in
general, expectation is invariant only under linear transformations. For example,
E(τ) ≠ 1/E(σ²) and E(σ²) ≠ [E(σ)]².
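A quick numerical illustration of these invariance claims (this simulation is not in the text; it uses the posterior GAMMA(7.5, 8.93515) from part (b)):

tau = rgamma(100000, 7.5, 8.93515); sigma = 1/sqrt(tau)
median(sigma); 1/sqrt(median(tau))   # approximately equal: medians are invariant
mean(sigma); 1/sqrt(mean(tau))       # noticeably different: means are not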

Errors in Chapter 8
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p208 Problem 8.3(c). In two lines of the inner loop of the program code, the loop
indices i and j should be reversed, to have alpha[i] and beta[j]. As a result
of this error, values of alpha and beta inside parentheses are reversed in captions
in Figure 8.6. [A corrected figure is scheduled for 2nd printing.] The correct inner
loop is shown below and in Problem 8.3(c) of this Manual.

for (j in 1:5) {
top = .2 + 1.2 * max(dbeta(c(.05, .2, .5, .8, .95),
alpha[i], beta[j]))
plot(x,dbeta(x, alpha[i], beta[j]),
type="l", ylim=c(0, top), xlab="", ylab="",
main=paste("BETA(",alpha[i],",", beta[j],")", sep="")) }

p214 Problem 8.8(c). The second R statement should be qgamma(.975, t+1, n), not
gamma(.975, t+1, n).

9
Using Gibbs Samplers to Compute Bayesian
Posterior Distributions

Note: If you are using the first printing of the text, please consult the list of
errata at the end of this chapter.
9.1 Estimating prevalence π with an informative prior.
a) According to the prior distribution BETA(1, 10), what is the probability
that π lies in the interval (0, 0.2)?
d The R code pbeta(.2, 1, 10) returns 0.8926258. c

b) If the prior BETA(1, 10) is used with the data of Example 9.1, what is the
(posterior) 95% Bayesian interval estimate of π?
d We use the program of Example 9.1 with an appropriate change in the prior
distribution. We do not show the plots.

set.seed(1234)
m = 50000; PI = numeric(m); PI[1] = .5
alpha = 1; beta = 10 # parameters of beta prior
eta = .99; theta = .97
n = 1000; A = 49; B = n - A
for (i in 2:m) {
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta) }
aft.brn = seq(m/2 + 1,m)
par(mfrow=c(2,1))
plot(aft.brn, PI[aft.brn], type="l")
hist(PI[aft.brn], prob=T)
par(mfrow=c(1,1))
mean(PI[aft.brn])
quantile(PI[aft.brn], c(.025, .975))

> mean(PI[aft.brn])
[1] 0.02033511
> quantile(PI[aft.brn], c(.025, .975))
2.5% 97.5% # Interval from Example 9.1:
0.007222679 0.035078009 # (0.0074, 0.0355)

Compared with the posterior probability interval of Example 9.1, this result is
shifted slightly downward. c

c) What parameter β would you use so that BETA(1, β) puts about 95%
probability in the interval (0, 0.05)?
d The R code in the Hints does a grid search among integers from 1 through 100,
giving β = 59. c

d) If the beta distribution of part (c) is used with the data of Example 9.1,
what is the 95% Bayesian interval estimate of π?
d With seed 1236 and the appropriate change in the prior, the program of part (a)
gives the results shown below. In this run, the posterior mean is found to be 1.78%,
which is consistent with the approximate value given in the Hints. c

> mean(PI[aft.brn])
[1] 0.01781736
> quantile(PI[aft.brn], c(.025, .975))
2.5% 97.5%
0.005396175 0.031816216

Hints: c) Use beta = seq(1:100); x = pbeta(.05, 1, beta); min(beta[x>=.95]).


Explain. d) The mean of the posterior distribution π|X, Y is about 1.8%.

9.2 Run the program of Example 9.1 and use your simulated posterior
distribution of π to find Bayesian point and interval estimates of the predictive
power of a positive test in the population from which the data are sampled.
How many of the 49 subjects observed to test positive do you expect are
actually infected?
d Use the equation γ = πη/(πη+(1−π)(1−θ)) to transform the posterior distribution
of π to a posterior distribution of γ, which provides point and interval estimates
of γ. Multiplying these values by 49 gives an idea how many infected units there are
among those testing positive.

set.seed(1238); m = 50000; PI = numeric(m); PI[1] = .5


alpha = 1; beta = 1; eta = .99; theta = .97
n = 1000; A = 49; B = n - A
for (i in 2:m) {
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta) }

aft.brn = seq(m/2 + 1,m)


PI.ab = PI[aft.brn]
gam = PI.ab*eta/(PI.ab*eta + (1-PI.ab)*(1-theta))
mean(gam); quantile(gam, c(.025, .975))
mean(gam*49); quantile(gam*49, c(.025, .975))

> mean(gam); quantile(gam, c(.025, .975))


[1] 0.3973661 # Point estimate: gamma
2.5% 97.5%
0.1984984 0.5489723 # Interval estimate: gamma
> mean(gam*49); quantile(gam*49, c(.025, .975))
[1] 19.47094 # Point estimate: nr of true positives
2.5% 97.5%
9.72642 26.89964 # Interval estimate: nr of true positives

Another approach to getting Bayesian posterior estimates of γ would be to keep


track of the values of γ as they are generated within the Gibbs sampler—as in the
slightly modified program below. The result must be almost the same as above.

set.seed(1239); m = 50000; PI = GAM = numeric(m); PI[1] = .5


alpha = 1; beta = 1; eta = .99; theta = .97
n = 1000; A = 49; B = n - A
for (i in 2:m) {
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
GAM[i] = num.x/den.x
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta) }
aft.brn = seq(m/2 + 1,m)
GAM.ab = GAM[aft.brn]
mean(GAM.ab); quantile(GAM.ab, c(.025, .975))

> mean(GAM.ab); quantile(GAM.ab, c(.025, .975))


[1] 0.3959689
2.5% 97.5%
0.1930070 0.5466789

By either method, we must remember that the prior and posterior distributions
are for π, a property of a particular population. The sensitivity η and specificity θ
are properties of the screening test. When η and θ are known, γ becomes a function
of π, inheriting its prior and posterior distributions from those of π. c

9.3 In Example 5.2 (p124), the test has η = 99% and θ = 97%, the data
are n = 250 and A = 6, and equation (9.1) on p220 gives an absurd negative
estimate of prevalence, π = −0.62%.
a) In this situation, with a uniform prior, what are the Bayesian point es-
timate and (two-sided) 95% interval estimate of prevalence? Also, find a
one-sided 95% interval estimate that provides an upper bound on π.

d The appropriate line near the beginning of the program of Example 9.1 has been
modified to show the current data; the last line provides the required one-sided
probability interval.

set.seed(1240)
m = 50000; PI = numeric(m); PI[1] = .5
alpha = 1; beta = 1; eta = .99; theta = .97
n = 250; A = 6; B = n - A # Data
for (i in 2:m) {
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta) }
aft.brn = seq(m/2 + 1,m)
par(mfrow=c(2,1))
plot(aft.brn, PI[aft.brn], type="l")
hist(PI[aft.brn], prob=T)
par(mfrow=c(1,1))
mean(PI[aft.brn])
quantile(PI[aft.brn], c(.025, .975))
quantile(PI[aft.brn], .95)

> mean(PI[aft.brn])
[1] 0.008846055
> quantile(PI[aft.brn], c(.025, .975))
2.5% 97.5%
0.0002883394 0.0285951866 # two-sided probability interval
> quantile(PI[aft.brn], .95)
95% # one-sided
0.02400536

The data happen to have fewer positive tests than we would expect from false
positives alone in a disease-free population: A = 6 < n(1 − θ) = 250(.03) = 7.5. The Gibbs sampler
indicates that the prevalence is very likely below 2.4%. The histogram in the lower panel of
Figure 9.10 shows a right-skewed posterior distribution “piled up” against 0. c

b) In part (a), what estimates result from using the prior BETA(1, 30)?
d Only the line of the program for the prior is changed. Results are shown below for
a run with seed 1241. c

> mean(PI[aft.brn])
[1] 0.007416064
> quantile(PI[aft.brn], c(.025, .975))
2.5% 97.5%
0.0002133581 0.0242588017
> quantile(PI[aft.brn], .95)
95%
0.02041878

Comment: a) See Figure 9.10. Two-sided 95% Bayesian interval: (0.03%, 2.9%). Cer-
tainly, this is more useful than a negative estimate, but don’t expect a narrow interval
with only n = 250 observations. Consider that a flat-prior 95% Bayesian interval
estimate of τ based directly on t = 6/250 is roughly (1%, 5%).

9.4 In each part below, use the uniform prior distribution on π and suppose
the test procedure described results in A = 24 positive results out of n = 1000
subjects.
a) Assume the test used is not a screening test but a gold-standard test,
so that η = θ = 1. Follow through the code for the Gibbs sampler in
Example 9.1, and determine what values of X and Y must always occur.
Run the sampler. What Bayesian interval estimate do you get? Explain
why the result is essentially the same as the Bayesian interval estimate you
would get from a uniform prior and data indicating 24 infected subjects
in 1000, using the code qbeta(c(.025,.975), 25, 977).
d Because η = θ = 1, we also have γ = δ = 1, so both “binomial” simulations are degenerate:
every subject testing positive is infected and every subject testing negative is uninfected. Thus X ≡ A = 24 and Y ≡ 0.
This means that the posterior distribution from which each PI[i] is generated is
BETA(αn = 1 + 24 = 25, βn = 1 + 976 = 977), as in the Hints. c

b) Screening tests exist because it is not feasible to administer a gold-standard
test to a large group of subjects. So the situation in part (a)
is not likely to occur in the real world. But it does often happen that
everyone who gets a positive result on the screening test is given a gold-
standard test, and no gold-standard tests are given to subjects with neg-
ative screening test results. Thus, in the end, we have η = 99% and θ = 1.
In this case, what part of the Gibbs sampler becomes deterministic? Run
the Gibbs sampler with these values and report the result.
d Here γ = πη/(πη + (1 − π)(1 − θ)) = πη/πη = 1, so X ≡ A = 24. At the end of
the two-step procedure, everyone with persistently positive results is known to have
the disease; there are no false positives.
Of the subjects in the study with the disease, we expect that about 1% (= 1 − η) will have tested
negative on the screening test and will not have taken the gold-standard test. c
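A sketch of such a run (the seed is arbitrary; only the parameter lines of the program in Problem 9.1(b) change, and the X-step becomes deterministic):

set.seed(1242)
m = 50000; PI = numeric(m); PI[1] = .5
alpha = 1; beta = 1
eta = .99; theta = 1                  # gold-standard follow-up: no false positives
n = 1000; A = 24; B = n - A
for (i in 2:m) {
  num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
  X = rbinom(1, A, num.x/den.x)       # success probability is 1, so X = A = 24 always
  num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
  Y = rbinom(1, B, num.y/den.y)
  PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta) }
aft.brn = seq(m/2 + 1, m)
mean(PI[aft.brn]); quantile(PI[aft.brn], c(.025, .975))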

c) Why are the results from parts (a) and (b) not much different?
d The gold-standard test in part (a) has τ = π = 24/1000 = 2.4%. The only difference
in part (b) is that about 1% of the approximately 2.4% of infected subjects, almost
surely less than 1% of the population, will have different test results. c
Hints: a) The Gibbs sampler simulates a large sample precisely from BETA(25, 977)
and cuts off appropriate tails. Why these parameters? Run the additional code:
set.seed(1237); pp=c(.5, rbeta(m-1, 25, 977)); mean(pp[(m/2):m])
c) Why no false positives among the 24 in either (a) or (b)? Consider false negatives.

9.5 Running averages and burn-in periods. In simulating successive steps


of a Markov chain, we know that it may take a number of steps before the
running averages of the resulting values begin to stabilize to the mean value
of the limiting distribution. In a Gibbs sampler, it is customary to disregard
values of the chain during an initial burn-in period. Throughout this chapter,
we rather arbitrarily choose to use m = 50 000 iterations and take the burn-
in period to extend for the first m/4 or m/2 steps. These choices have to
do with the appearance of stability in the running average plot and how
much simulation error we are willing to tolerate. For example, the running
averages in the right-hand panel of Figure 9.2 (p223) seem to indicate smooth
convergence of the mean of the π-process to the posterior mean after 25 000
iterations. The parts below provide an opportunity to explore the perception
of stability and variations in the length of the burn-in period. Use m = 50 000
iterations throughout.
d This problem is mainly exploratory. Different people will get slightly different an-
swers. In a process that mixes as well as the Gibbs samplers considered here, the
required burn-in periods are relatively short. Once m is chosen, changing the burn-in
period from m/2 to m/4 results in averaging over more simulated outcomes. The
advantage of that may outweigh any disadvantage due to possibly continuing insta-
bility between m/4 and m/2. However, with the speed of today’s computers, one
can afford to be “wasteful” of pseudorandom numbers, making several runs with
large m and a more-than-ample burn-in period.
Changes required in the R code of Example 9.1 are routine, and we do not
provide numerical answers. For parts (a) and (c), it would be a good idea to make a
table showing your seed, your value of m, the point estimate you obtain for π, and
the lower and upper endpoints of your probability interval for π. Then comment on
similarities and differences you see among the results. c

a) Rerun the Gibbs sampler of Example 9.1 three times with different seeds,
which you select and record. How much difference does this make in the
Bayesian point and interval estimates of π? Use one of the same seeds in
parts (b) and (c) below.
b) Redraw the running averages plot of Figure 9.2 so that the vertical plotting
interval is (0, 0.5). (Change the plot parameter ylim.) Does this affect
your perception of when the process “becomes stable”? Repeat, letting
the vertical interval be (0.020, 0.022), and comment.
d By choosing a small enough window on the vertical axis, you can make even a very
stable process appear to be unstable. You need to keep in mind how many decimal
places of accuracy you hope to get in the final result. c
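A sketch of the running-averages plot for part (b), assuming the sampler of Example 9.1 has already been run so that m and PI are in the workspace (adjust ylim as the problem directs):

plot(1:m, cumsum(PI)/(1:m), type="l", ylim=c(0, .5),
     xlab="Step", ylab="Running average of PI")
abline(v=m/2, lty="dashed")   # end of the burn-in period used in the text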

c) Change the code of the Gibbs sampler in the example so that the burn-in
period extends for 15 000 steps. Compared with the results of the example,
what change does this make in the Bayesian point and interval estimates
of π? Repeat for a burn-in of 30 000 steps and comment.

9.6 Thinning. From the ACF plot in Figure 9.2 on p223, we see that the
autocorrelation is near 0 for lags of 25 steps or more. Also, from the right-
hand plot in this figure, it seems that the process of Example 9.1 stabilizes
after about 15 000 iterations. One method suggested to mitigate effects of
autocorrelation, called thinning, is to consider observations after burn-in
located sufficiently far apart that autocorrelation is not an important issue.
a) Use the data and prior of Example 9.1. What Bayesian point estimate
and probability interval do you get by using every 25th step, starting
with step 15 000? Make a histogram of the relevant values of PI. Does
thinning in this way have an important effect on the inferences?
d The program below uses the same seed as Example 9.1, for a direct comparison.
Code for the histogram and the ACF plot [requested in part (b)] is shown, but
not the plots themselves. Some “un-thinned” results obtained in the example are
shown in comments; here, the difference between thinned and un-thinned results lies
beyond the second decimal place. c

set.seed(1237) # same seed as in Example 9.1


m = 50000
PI = numeric(m); PI[1] = .5
alpha = 1; beta = 1
eta = .99; theta = .97
n = 1000; A = 49; B = n - A

for (i in 2:m)
{
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta)
}
thin.aft.brn = seq(.3*m+1, m, by=25) # thinned set of steps
mean(PI[thin.aft.brn])
quantile(PI[thin.aft.brn], c(.025, .975))
acf(PI[thin.aft.brn], plot=F) # for part (b)
par(mfrow=c(2,1))
hist(PI[thin.aft.brn], prob=T)
acf(PI[thin.aft.brn], ylim=c(-.1, .6)) # for part (b)
par(mfrow=c(1,1))

> mean(PI[thin.aft.brn])
[1] 0.02040507 # Un-thinned 0.0206
> quantile(PI[thin.aft.brn], c(.025, .975))
2.5% 97.5%
0.007883358 0.035243238 # Un-thinned (0.0074, 0.0355)
> acf(PI[thin.aft.brn], plot=F) # for part (b)

Autocorrelations of series ’PI[thin.aft.brn]’, by lag

0 1 2 3 4 5 6 7 8
1.000 0.026 0.008 0.017 -0.049 -0.049 -0.010 -0.011 0.013
9 10 11 12 13 14 15 16 17
0.008 0.045 0.049 -0.004 -0.073 0.025 -0.020 -0.035 -0.017
18 19 20 21 22 23 24 25 26
-0.078 -0.018 -0.009 -0.013 0.011 -0.024 0.014 -0.035 0.007
27 28 29 30 31
0.011 -0.015 -0.025 -0.005 -0.022

b) Use the statement acf(PI[seq(15000, m, by=25)]) to make the ACF


plot of these observations. Explain what you see.
d Printed ACF results above for the thinned steps show very small autocorrelations
for positive lags. Almost all of these autocorrelations are judged not significant,
with perhaps 2 in 30 of them barely “significant.” For the thinned steps, some of
the autocorrelations happen to be slightly negative, and we adjusted the plotting
window, with ylim=c(-.1, .6), to show this. c

9.7 Density estimation. A histogram, as in Figure 9.1, is one way to show


the approximate posterior distribution of π. But the smooth curve drawn
through the histogram there reminds us that we are estimating a continuous
posterior distribution. A Gibbs sampler does not give us the functional form of
the posterior density function, but the smooth curve is a good approximation.
After the Gibbs sampler of Example 9.1 is run, the following additional code
superimposes an estimated density curve on the histogram of sampled values.
set.seed(1241)
m = 50000; PI = numeric(m); PI[1] = .5
alpha = 1; beta = 1
eta = .99; theta = .97
n = 1000; A = 49; B = n - A

for (i in 2:m) {
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta) }
aft.brn = seq(m/2 + 1,m)
est.d = density(PI[aft.brn], from=0, to=1); mx = max(est.d$y)
hist(PI[aft.brn], ylim=c(0, mx), prob=T, col="wheat")
lines(est.d, col="darkgreen")
quantile(PI[aft.brn], c(.025, .975)) # posterior probability int.
mean(PI[aft.brn]) # posterior mean
median(PI[aft.brn]) # posterior median
est.d$x[est.d$y==mx] # density est of post. mode

> quantile(PI[aft.brn], c(.025, .975)) # posterior probability int.


2.5% 97.5%
0.007672956 0.035812722
> mean(PI[aft.brn]) # posterior mean
[1] 0.02082176
> median(PI[aft.brn]) # posterior median
[1] 0.02048869
> est.d$x[est.d$y==mx] # density est of post. mode
[1] 0.01956947

a) Run the code to verify that it gives the result claimed. In the R Ses-
sion window, type ?density and browse the information provided on
kernel density estimation. In this instance, what is the reason for the
parameters from=0, to=1? What is the reason for finding mx before the
histogram is made? In this book, we have used the mean of sampled val-
ues after burn-in as the Bayesian point estimate of π. Possible alternative
estimates of π are the median and the mode of the sampled values after
burn-in. Explain how the last statement in the code roughly approximates
the mode.
d Making the figure (not shown): The prior distribution BETA(1, 1) = UNIF(0, 1) has
support (0, 1), so we know that the posterior distribution has this same support, and
we want the density estimate for the posterior to be constrained to this interval also.
In many cases, the value of the density estimate at its mode turns out to be greater
than the height of any of the histogram bars, so we set the vertical axis of the
histogram to accommodate the height of the density estimate.
In the output above: After showing the posterior probability interval, we show
the three possible point estimates: mean (as usual), median, and (density-estimated)
mode. Our simulated posterior distribution is slightly skewed to the right, so the
mean is the largest of these and the mode is the smallest. However, the skewness is
slight, and all three point estimates of π round to 0.02 = 2%. c

b) To verify how well kernel density estimation works in one example, do


the following: Generate 50 000 observations from BETA(2, 3), make a
histogram of these observations, superimpose a kernel density-estimated
curve in one color, and finally superimpose the true density function of
BETA(2, 3) as a dotted curve in a different color. Also, find the estimated
mode and compare it with the exact mode 1/3 of this distribution.
set.seed(1945)
m = 50000; x = rbeta(m, 2, 3); est.d = density(x, from=0, to=1)
mx = max(est.d$y); est.d$x[est.d$y == mx]
hist(x, prob=T, ylim=c(0, mx), col="wheat"); xx = seq(0, 1, by=.01)
lines(xx, dbeta(xx, 2, 3), col="blue", lwd = 2, lty="dotted")
lines(est.d, col="darkgreen")

> est.d$x[est.d$y == mx]


[1] 0.3502935 # other seeds: 0.297, 0.333, 0.337, 0.344

9.8 So far as is known, a very large herd of livestock is entirely free of a


certain disease (π = 0). However, in a recent routine random sample of n = 100
of these animals, two have tested positive on a screening test with sensitivity
95% and specificity 98%. One “expert” argues that the two positive tests
warrant slaughtering all of the animals in the herd. Based on the specificity
of the test, another “expert” argues that seeing two positive tests out of 100
is just what one would expect by chance in a disease-free herd, and so mass
slaughter is not warranted by the evidence.
a) Use a Gibbs sampler with a flat prior to make a one-sided 95% probability
interval that puts an upper bound on the prevalence. Based on this result,
what recommendation might you make?
d In the program below, we have made the appropriate slight changes to the one in
Example 9.1. The four diagnostic plots, included in the program but not reproduced
here, reveal no difficulty with the Gibbs sampler. c

set.seed(1246)
m = 50000; PI = numeric(m); PI[1] = .2
alpha = 1; beta = 1
eta = .95; theta = .98
n = 100; A = 2; B = n - A

for (i in 2:m) {
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta) }

aft.brn = seq(m/2 + 1,m)


est.d = density(PI[aft.brn], from=0, to=1); mx = max(est.d$y)
par(mfrow=c(2,2))
plot(aft.brn, PI[aft.brn], type="l")
hist(PI[aft.brn], ylim=c(0, mx), prob=T, col="wheat")
lines(est.d, col="darkgreen")
acf(PI[aft.brn])
plot(1:m, cumsum(PI)/(1:m), type="l", ylim=c(.01,.03))
abline(v=m/2, col="blue", lty="dashed")
par(mfrow=c(1,1))

quantile(PI[aft.brn], .95) # one-sided posterior probability int.


mean(PI[aft.brn]) # posterior mean

> quantile(PI[aft.brn], .95) # one-sided posterior probability int.


95%
0.04945211
> mean(PI[aft.brn]) # posterior mean
[1] 0.01842621

b) How does the posterior mean compare with the estimate from equa-
tion (9.1) on p220? Use the Agresti-Coull adjustment t0 = (A + 2)/(n + 4).
d The point estimate p of π from equation (9.1), using the Agresti-Coull adjustment
t0 = (A + 2)/(n + 4) = 4/104 = 0.03846 to estimate τ , is

   p = (t0 + θ − 1)/(η + θ − 1) = (4/104 + 0.98 − 1)/(0.95 + 0.98 − 1) = 0.01985 ≈ 2%.

This is close to the value in part (a). Moreover, if we use the Agresti-Coull upper bound
for τ in equation (9.1), we get about 5% as an upper bound for π. (See the R code
below.) If we use the traditional estimate t = A/n = 2/100 = 0.02 of τ , then
equation (9.1) gives p = 0 as the estimate of π. c

eta = .95; theta = .98; A = 2; n = 100


zp = c(0, 1); t1 = (A + 2)/(n + 4)
t.ac.ci = t1 + zp * qnorm(.95)*sqrt(t1*(1-t1)/(n + 4))
p.ac.ci = (t.ac.ci + theta - 1)/(eta + theta - 1)
t.ac.ci; p.ac.ci

> t.ac.ci; p.ac.ci


[1] 0.03846154 0.06947907 # point est & upper bound for tau
[1] 0.01985112 0.05320330 # point est & upper bound for pi
c) Explain what it means to believe the prior BETA(1, 40). Would your rec-
ommendation in part (a) change if you believed this prior?
d In BETA(1, 40), about 98.5% of the probability lies below π = 0.1 = 10%. Substi-
tuting beta = 40 into the program of part (a) and retaining the same seed, we get a
Bayesian point estimate for π of about 1.1% (down from about 2%), with an upper
bound of about 3.3% (down from about 5%). c
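The claim about the prior can be checked with one line of R:

pbeta(.1, 1, 40)   # about 0.985, the prior probability that pi < 10%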

d) What Bayesian estimates would you get with the prior of part (c) if there
are no test-positive animals among 100? In this case, what part of the
Gibbs sampling process becomes deterministic?
d Point estimate: about π = 0.7%; bound: about 2.2%. With A = 0, we have X ≡ 0. c
Comments: In (a) and (b), the Bayesian point estimate and the estimate from equa-
tion (9.1) are about the same. If there are a few thousand animals in the herd,
these results indicate there might indeed be at least one infected animal. Then, if
the disease is one that may be highly contagious beyond the herd or if diseased
animals pose a danger to humans, we could be in for serious trouble. If possible,
first steps might be to quarantine this herd for now, find the two animals that tested
positive, and quickly subject them to a gold-standard diagnostic test for the disease.
That would provide more reliable information than the Gibbs sampler based on the
screening test results. d) Used alone, a screening test with η = 95% and θ = 98%
applied to a relatively small proportion of the herd seems a very blunt instrument
for trying to say whether the herd is free of a disease.

9.9 Write and execute R code to make diagnostic graphs for the Gibbs
sampler of Example 9.2 showing ACFs and traces (similar to the plots in
Figure 9.2). Comment on the results. d Imitate the relevant code in Example 9.1; one possible sketch follows. c
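A sketch of such diagnostics (assuming the Gibbs sampler of Example 9.2 has been run so that m, MU, and THETA are available):

aft.brn = (m/2 + 1):m
par(mfrow=c(2,2))
plot(aft.brn, MU[aft.brn], type="l", main="Trace of MU")
acf(MU[aft.brn], main="ACF of MU")
plot(aft.brn, THETA[aft.brn], type="l", main="Trace of THETA")
acf(THETA[aft.brn], main="ACF of THETA")
par(mfrow=c(1,1))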
9.10 Run the code below. Explain step-by-step what each line (beyond
the first) computes. How do you account for the difference between diff(a)
and diff(b)?
x.bar = 9.60; x.sd = 2.73; n = 41
x.bar + qt(c(.025, .975), n-1)*x.sd/sqrt(n)
a = sqrt((n-1)*x.sd^2 / qchisq(c(.975,.025), n-1)); a; diff(a)
b = sqrt((n-1)*x.sd^2 / qchisq(c(.98,.03), n-1)); b; diff(b)

> x.bar + qt(c(.025, .975), n-1)*x.sd/sqrt(n)


[1] 8.738306 10.461694
> a = sqrt((n-1)*x.sd^2 / qchisq(c(.975,.025), n-1)); a; diff(a)
[1] 2.241365 3.493043
[1] 1.251678
> b = sqrt((n-1)*x.sd^2 / qchisq(c(.98,.03), n-1)); b; diff(b)
[1] 2.220978 3.457103
[1] 1.236124
d The sample size, mean, and standard deviation are from the student heights of
Example 9.2. A 95% frequentist CI for µ, based on T = (x̄ − µ)/(s/√n) ∼ T(40), is computed
to be (8.74, 10.46), rounded to two places. The corresponding Bayesian posterior
probability interval from the Gibbs sampler of that example is (8.75, 10.45).
The second result is a probability-symmetric 95% frequentist confidence interval
for σ, based on (n − 1)s2 /σ 2 ∼ CHISQ(40). The result is (2.24, 3.49), of length 1.25.
Because the chi-squared distribution is not symmetrical, a slightly shorter interval,
of length 1.24, can be found by cutting off 2% of the probability from the lower tail
and 3% from the upper tail. The Bayesian posterior probability interval from the
Gibbs sampler is (2.21, 3.44), which is a little shorter than either of the frequentist
intervals. c
9.11 Suppose we have n = 5 observations from a normal population that
can be summarized as x̄ = 28.31 and s = 5.234.
a) Use traditional methods based on Student’s t and chi-squared distribu-
tions to find 95% confidence intervals for µ and σ.
d Below, we use R code similar to that of Problem 9.10. The frequentist 95% confi-
dence intervals are shown in the output. c
x.bar = 28.31; x.sd = 5.234; n = 5
x.bar + qt(c(.025, .975), n-1)*x.sd/sqrt(n)
sqrt((n-1)*x.sd^2 / qchisq(c(.975,.025), n-1))

> x.bar + qt(c(.025, .975), n-1)*x.sd/sqrt(n)


[1] 21.81113 34.80887 # 95% CI for population mean
> sqrt((n-1)*x.sd^2 / qchisq(c(.975,.025), n-1))
[1] 3.135863 15.040190 # 95% CI for population SD

b) In the notation of Example 9.2, use prior distributions with parameters
µ0 = 25, σ0 = √θ0 = 2, α0 = 30, and κ0 = 1000, and use a Gibbs sampler
to find 95% Bayesian interval estimates for µ and σ. Discuss the priors.
Make diagnostic plots. Compare with the results of part (a) and comment.
d We make the required substitutions at the beginning of the program of Exam-
ple 9.2. The rest of the program is the same, except that we have omitted the code
for the diagnostic plots. You should verify for yourself that the diagnostic plots
reveal no difficulties.
The Bayesian 95% probability intervals are (23.10, 29.35) for µ and (4.90, 6.93)
for σ. These are very different from the frequentist CIs of part (a) because the prior
distributions are informative and the sample size is small. c

set.seed(1947)
m = 50000
MU = numeric(m); THETA = numeric(m)
THETA[1] = 1
n = 5; x.bar = 28.31; x.var = 5.234^2
mu.0 = 25; th.0 = 4
alp.0 = 30; kap.0 = 1000

for (i in 2:m)
{
th.up = 1/(n/THETA[i-1] + 1/th.0)
mu.up = (n*x.bar/THETA[i-1] + mu.0/th.0)*th.up
MU[i] = rnorm(1, mu.up, sqrt(th.up))

alp.up = n/2 + alp.0


kap.up = kap.0 + ((n-1)*x.var + n*(x.bar - MU[i])^2)/2
THETA[i] = 1/rgamma(1, alp.up, kap.up)
}

# Bayesian point and probability interval estimates


aft.brn = (m/2 + 1):m
mean(MU[aft.brn]) # point estimate of mu
bi.MU = quantile(MU[aft.brn], c(.025,.975)); bi.MU

mean(THETA[aft.brn]) # point estimate of theta


bi.THETA = quantile(THETA[aft.brn], c(.025,.975)); bi.THETA
SIGMA = sqrt(THETA)
mean(SIGMA[aft.brn]) # point estimate of sigma
bi.SIGMA = sqrt(bi.THETA); bi.SIGMA

> mean(MU[aft.brn]) # point estimate of mu


[1] 26.25223
> bi.MU = quantile(MU[aft.brn], c(.025,.975)); bi.MU
2.5% 97.5%
23.10376 29.34601

> mean(THETA[aft.brn]) # point estimate of theta


[1] 34.00875
> bi.THETA = quantile(THETA[aft.brn], c(.025,.975)); bi.THETA
2.5% 97.5%
23.99395 48.03736
> SIGMA = sqrt(THETA)
> mean(SIGMA[aft.brn]) # point estimate of sigma
[1] 5.808603
> bi.SIGMA = sqrt(bi.THETA); bi.SIGMA
2.5% 97.5%
4.898362 6.930899

c) Repeat part (b), but with µ0 = 0, σ0 = 1000, α0 = 0.01, and κ0 = 0.01.


Compare with the results of parts (a) and (b) and comment.
d These prior distributions are uninformative and the numerical agreement with
part (a) is much closer. We show only the changes in the first part of the program
and the output; the rest is the same as in part (b). c

set.seed(1948)
m = 50000
MU = numeric(m); THETA = numeric(m)
THETA[1] = 1
n = 5; x.bar = 28.31; x.var = 5.234^2
mu.0 = 0; th.0 = 10^6
alp.0 = .01; kap.0 = .01

...

> mean(MU[aft.brn]) # point estimate of mu


[1] 28.3033
> bi.MU = quantile(MU[aft.brn], c(.025,.975)); bi.MU
2.5% 97.5%
21.83191 34.84150
> mean(THETA[aft.brn]) # point estimate of theta
[1] 53.20529
> bi.THETA = quantile(THETA[aft.brn], c(.025,.975)); bi.THETA
2.5% 97.5%
9.725076 217.702315
> SIGMA = sqrt(THETA)
> mean(SIGMA[aft.brn]) # point estimate of sigma
[1] 6.530094
> bi.SIGMA = sqrt(bi.THETA); bi.SIGMA
2.5% 97.5%
3.118505 14.754739

Hints: In (a)–(c), the sample size is small, so an informative prior is influential. In


(a) and (c): (21.8, 34.8) for µ; (3, 15) for σ. Roughly.

9.12 Before drawing inferences, one should always look at the data to see
whether assumptions are met. The vector x in the code below contains the
n = 41 observations summarized in Example 9.2.
x = c( 8.50, 9.75, 9.75, 6.00, 4.00, 10.75, 9.25, 13.25,
10.50, 12.00, 11.25, 14.50, 12.75, 9.25, 11.00, 11.00,
8.75, 5.75, 9.25, 11.50, 11.75, 7.75, 7.25, 10.75,
7.00, 8.00, 13.75, 5.50, 8.25, 8.75, 10.25, 12.50,
4.50, 10.75, 6.75, 13.25, 14.75, 9.00, 6.25, 11.75, 6.25)
mean(x)
var(x)
shapiro.test(x)
par(mfrow=c(1,2))
boxplot(x, at=.9, notch=T, ylab="x",
xlab = "Boxplot and Stripchart")
stripchart(x, vert=T, method="stack", add=T, offset=.75, at = 1.2)
qqnorm(x)
par(mfrow=c(1,1))

> mean(x)
[1] 9.597561
> var(x)
[1] 7.480869
> shapiro.test(x)

Shapiro-Wilk normality test

data: x
W = 0.9838, p-value = 0.817

a) Describe briefly what each statement in the code does.


d After x is defined, mean(x) and var(x) compute the sample mean and variance, and
shapiro.test(x) performs the Shapiro-Wilk test of the null hypothesis that the data are normal,
finding no evidence that the data are other than normal (P-value = 0.817 > 0.05). The remaining
statements make the boxplot, stripchart, and normal Q-Q plot discussed in part (b). c

b) Comment on the graphical output in Figure 9.11. (The angular sides of the
box in the boxplot, called notches, indicate a nonparametric confidence
interval for the population median.) Also comment on the result of the
test. Give several reasons why it is reasonable to assume these data come
from a normal population.
d There are two graphical displays (shown in Figure 9.11 of the text). The first
is a notched boxplot with a stripchart on the same scale. The notches indicate a
nonparametric CI for the population median, roughly (9, 11), perhaps a little lower,
with a sample median a little below 10 mm. Within the accuracy of reading the plot,
the numerical agreement of this CI with the probability interval from the Gibbs sampler
in Example 9.2 seems good. There is no evidence that the population is skewed. The
second plot is a normal probability plot (Q-Q plot). Points fall roughly in a straight
line, as is anticipated for data from a normal population.

Extra: The R code wilcox.test(x, conf.int=T) produces the nonparametric con-


fidence interval (8.75, 10.50). This procedure is based on the theoretical assumption
that the data are from a continuous distribution and thus do not have ties. Because
there are a few ties in our data, the CI is only approximate. There are various ways
to adjust for ties, not implemented in this R procedure.
One way to get a rough idea of the extent to which ties matter is to “jitter” the
data several times to break the ties, and see how much difference that makes in the
result. For example, to jitter our data we could add a small random displacement
to each observation—perhaps with x.jit = x + runif(41, -.001, .001). For our
data, such jittering makes tiny, inconsequential changes in the CI reported above. c
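A sketch of the jittering check just described (the seed is arbitrary; x is the data vector defined in this problem):

set.seed(1)
x.jit = x + runif(41, -.001, .001)    # break ties with tiny random displacements
wilcox.test(x.jit, conf.int=TRUE)$conf.int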
Note: Data are from Majumdar and Rao (1958); they are also listed and discussed
in Rao (1989) and Trumbo (2002). Each data value in x is the difference between
a morning and an evening height value. Each height value is the average of four
measurements on the same subject.
9.13 Modify the code for the Gibbs sampler of Example 9.2 as follows
to reverse the order of the two key sampling steps at each passage through
the loop. Use the starting value MU[1]= 5. At each step i, first generate
THETA[i] from the data, the prior distribution on θ, and the value MU[i-1].
Then generate MU[i] from the data, the prior on µ, and the value THETA[i].
Finally, compare your results with those in the example, and comment.
d The modified program is shown below. The results are the same as in Example 9.2. c

set.seed(1237)
m = 50000
MU = numeric(m); THETA = numeric(m)
MU[1] = 5 # initial value for MU
n = 41; x.bar = 9.6; x.var = 2.73^2
mu.0 = 0; th.0 = 400
alp.0 = 1/2; kap.0 = 1/5

for (i in 2:m)
{ # use MU[i-1] to sample THETA[i]
alp.up = n/2 + alp.0
kap.up = kap.0 + ((n-1)*x.var + n*(x.bar - MU[i-1])^2)/2
THETA[i] = 1/rgamma(1, alp.up, kap.up)
# use THETA[i] to sample MU[i]
th.up = 1/(n/THETA[i] + 1/th.0)
mu.up = (n*x.bar/THETA[i] + mu.0/th.0)*th.up
MU[i] = rnorm(1, mu.up, sqrt(th.up))
}

# Bayesian point and probability interval estimates (unchanged)


aft.brn = (m/2 + 1):m
mean(MU[aft.brn]) # point estimate of mu
bi.MU = quantile(MU[aft.brn], c(.025,.975)); bi.MU
mean(THETA[aft.brn]) # point estimate of theta
bi.THETA = quantile(THETA[aft.brn], c(.025,.975)); bi.THETA

SIGMA = sqrt(THETA)
mean(SIGMA[aft.brn]) # point estimate of sigma
bi.SIGMA = sqrt(bi.THETA); bi.SIGMA

par(mfrow=c(2,2))
plot(aft.brn, MU[aft.brn], type="l")
plot(aft.brn, SIGMA[aft.brn], type="l")
hist(MU[aft.brn], prob=T); abline(v=bi.MU, col="red")
hist(SIGMA[aft.brn], prob=T); abline(v=bi.SIGMA, col="red")
par(mfrow=c(1,1))

> mean(MU[aft.brn]) # point estimate of mu


[1] 9.594315
> bi.MU = quantile(MU[aft.brn], c(.025,.975)); bi.MU
2.5% 97.5%
8.753027 10.452743
> mean(THETA[aft.brn]) # point estimate of theta
[1] 7.646162
> bi.THETA = quantile(THETA[aft.brn], c(.025,.975)); bi.THETA
2.5% 97.5%
4.886708 11.810233
> SIGMA = sqrt(THETA)
> mean(SIGMA[aft.brn]) # point estimate of sigma
[1] 2.747485
> bi.SIGMA = sqrt(bi.THETA); bi.SIGMA
2.5% 97.5%
2.210590 3.436602

9.14 (Theoretical ) In Example 9.2, the prior distribution of the parameter


θ = σ² is of the form θ ∼ IG(α0, κ0), so that p(θ) ∝ θ^−(α0+1) exp(−κ0/θ).
Also, the data x are normal with xi randomly sampled from NORM(µ, σ), so
that the likelihood function is

   p(x|µ, θ) ∝ θ^(−n/2) exp{ −(1/2θ) Σ_{i=1}^n (xi − µ)² }.

a) By subtracting and adding x̄, show that the exponential in the likelihood
function can be written as exp{ −(1/2θ)[(n − 1)s² + n(x̄ − µ)²] }.
d Looking at the sum in the exponent (over values i from 1 to n), we have

   Σ(xi − µ)² = Σ[(xi − x̄) + (x̄ − µ)]²
              = Σ[(xi − x̄)² + 2(x̄ − µ)(xi − x̄) + (x̄ − µ)²]
              = (n − 1)s² + n(x̄ − µ)².

Upon distributing the sum in the second line: The first term in the second line
becomes the first term in the last line by the definition of s². The last term in the
second line does not involve i and so becomes the last term in the last line. The
second term in the second line is 0 because it is a multiple of Σ(xi − x̄) = 0. c

b) The distribution of θ|x, µ used in the Gibbs sampler is based on the prod-
uct p(θ|x, µ) ∝ p(θ) p(x|µ, θ). Expand and then simplify this product to
verify that θ|x, µ ∼ IG(αn , κn ), where αn and κn are as defined in the
example.
d In the product of the kernels of the prior and likelihood, there are two kinds of
factors: powers of θ and exponentials (powers of e). The product of the powers of θ
is

   θ^−(α0+1) × θ^(−n/2) = θ^−(α0+n/2+1) = θ^−(αn+1),

where αn = α0 + n/2 is as defined in Example 9.2. If we denote by A the quantity
displayed in part (a), then the product of the exponential factors is

   exp(−κ0/θ) × exp(−A/2θ) = exp{ −(κ0 + A/2)/θ } = exp(−κn/θ),

where κn = κ0 + A/2 = κ0 + [(n − 1)s² + n(x̄ − µ)²]/2 is as defined in the example. c
9.15 The R code below was used to generate the data used in Example 9.3.
If you run the code using the same (default) random number generator in R
we used and the seed shown, you will get the same data.
set.seed(1212)
g = 12 # number of batches
r = 10 # replications per batch
mu = 100; sg.a = 15; sg.e = 9 # model parameters
a.dat = matrix(rnorm(g, 0, sg.a), nrow=g, ncol=r)
# ith batch effect across ith row
e.dat = matrix(rnorm(g*r, 0, sg.e), nrow=g, ncol=r)
# g x r random item variations
X = round(mu + a.dat + e.dat) # integer data
X

> X
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 103 113 88 96 89 88 80 92 89 81
[2,] 143 116 126 127 132 121 129 148 129 119
[3,] 107 107 98 103 113 104 99 103 98 109
[4,] 71 72 89 63 85 71 75 76 98 57
[5,] 105 101 113 110 109 101 114 114 113 107
[6,] 88 93 100 91 98 105 103 91 123 110
[7,] 71 52 67 59 67 67 60 68 62 53
[8,] 115 102 93 111 130 114 97 103 112 98
[9,] 58 70 65 78 67 60 74 80 47 68
[10,] 133 119 130 136 133 116 131 118 140 135
[11,] 103 101 97 110 125 107 115 106 110 94
[12,] 83 106 86 91 88 107 92 98 88 95

a) Run the code and verify whether you get the same data. Explain the
results of the statements a.dat, var(a.dat[1,]), var(a.dat[,1]), and
var(as.vector(e.dat)). How do the results of the first and second of
these statements arise? What theoretical values are approximated (not
very well because of the small sample size) by the last two of them.
d The function rnorm(g, 0, sg.a) generates the g values Ai of the model displayed
just above the middle of p229 of the text. By default, matrices are filled by columns.
Thus, when the g × r matrix a.dat is made, the values Ai are recycled r times to
yield a matrix with r identical columns. The statement a.dat[1,] would print r
copies of the single value A1 in the first row of the matrix, and its variance is 0.
The statement a.dat[,1] would print the g values Ai in the first column of the
matrix. Its variance approximates the parameter σA² = θA of the model, the “batch
variance.” In practice, the values Ai are latent (not observable), so they are not
available for estimating θA . (See part (b) for an unbiased estimate of θA based on
observed data.)
In this simulation, we know the values of the Ai , and their sample variance is
about 397.5, which is not very close to θA = 15² = 225. Because this estimate
is based on a sample of only g = 12 simulated observations, the poor result is not
surprising. (Notice that we can make this comparison in a simulation, where we have
specified true parameter values and have access to latent generated variables—none
of which could be known in a practical situation.)
The function rnorm(g*r, 0, sg.e) generates gr observations with mean 0 and
standard deviation σ, the “error components” of the model. These are put into a
g × r matrix, in which all values are different. The statement var(as.vector(e.dat))
finds the variance of these gr observations, which estimates the parameter σ² = θ of
the model. The numerical value is about 82.8, which is not far from θ = 9² = 81. c
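The checks discussed above can be run directly after the data-generating code of this problem:

var(a.dat[1,])          # 0: each row repeats a single batch effect
var(a.dat[,1])          # sample variance of the g batch effects; estimates theta_A
var(as.vector(e.dat))   # sample variance of the gr error terms; estimates theta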

b) Explain why the following additional code computes MS(Batch) and


MS(Error). How would you use these quantities to find the unbiased esti-
mate of θA shown in the example?
d The equation at the bottom of p230 of the text shows the method of moments
estimate θ̂A of θA = σA². It is based on the expectations E[MS(Batch)] = rθA + θ
and E[MS(Error)] = θ, which can be obtained from the information about chi-
squared distributions shown just before the equation. (Recall that a random variable
distributed as CHISQ(ν) has mean ν.) In order to find θ̂A for the data generated
above, we have included two additional lines in the code provided with this problem.

X.bar = apply(X, 1, mean); X.sd = apply(X, 1, sd)


MS.Bat = r*var(X.bar); MS.Err = mean(X.sd^2)
est.theta.a = (MS.Bat - MS.Err)/r # est. of batch variance
MS.Bat; MS.Err; est.theta.a # numerical results

> MS.Bat; MS.Err; est.theta.a


[1] 4582.675
[1] 82.68056
[1] 449.9994

We have already noted (p231 of the text) that the estimate θ̂A is unsatisfactory
because (owing to the subtraction) it can sometimes be negative. But even when it is
positive, we shouldn’t assume it is really close to θA . First, a reliable estimate would
require a large number g of batches. Second, as noted in the answer to part (a), we
cannot observe the Ai directly, and so we must try to estimate θA by disentangling
information about the Ai from information about the eij. Here, the numerical value
of θ̂A is about 450, which is not very near the value θA = σA² = 15² = 225 we
specified for the simulation. c
Hints: a) By default, matrices are filled by columns; shorter vectors recycle. The
variance components of the model are estimated.

9.16 (Continuation of Problem 9.15 ) Computation and derivation of fre-


quentist confidence intervals related to Example 9.3.
a) The code below shows how to find the 95% confidence intervals for µ, θ,
and ρI based on information in Problem 9.15 and Example 9.3.
d The numerical results are in good agreement with the output from the Gibbs
sampler in Example 9.3. c

mean(X.bar) + qt(c(.025,.975), g-1)*sqrt(MS.Bat/(g*r))


df.Err = g*(r-1); df.Err*MS.Err/qchisq(c(.975,.025), df.Err)
R = MS.Bat/MS.Err; q.f = qf(c(.975,.025), g-1, g*r-g)
(R - q.f)/(R + (r-1)*q.f)

> mean(X.bar) + qt(c(.025,.975), g-1)*sqrt(MS.Bat/(g*r))


[1] 84.37352 111.57648 # 95% CI for mu
> df.Err = g*(r-1); df.Err*MS.Err/qchisq(c(.975,.025), df.Err)
[1] 64.40289 110.06013 # 95% CI for error variance
> R = MS.Bat/MS.Err; q.f = qf(c(.975,.025), g-1, g*r-g)
> (R - q.f)/(R + (r-1)*q.f)
[1] 0.7160076 0.9420459 # 95% CI for ICC

b) Intermediate. Derive the confidence intervals in part (a) from the distributions
of the quantities involved.
d Grand mean µ. From p230 of the text, (x̄.. − µ)/√(MS(Batch)/gr) ∼ T(g − 1). To
find a 100(1 − α)% confidence interval for µ, let L and U cut off probability α/2
from the lower and upper tails of T(g − 1), respectively. Then

   1 − α = P{ L ≤ (x̄.. − µ)/√(MS(Batch)/gr) ≤ U }
         = P{ −U ≤ (µ − x̄..)/√(MS(Batch)/gr) ≤ −L }
         = P{ −U √(MS(Batch)/gr) ≤ µ − x̄.. ≤ −L √(MS(Batch)/gr) }
         = P{ x̄.. − U √(MS(Batch)/gr) ≤ µ ≤ x̄.. − L √(MS(Batch)/gr) }.

Because T(g − 1) is symmetric, −L = U > 0, so a 100(1 − α)% confidence
interval for µ is given by x̄.. ± U √(MS(Batch)/gr). Because they view µ as
a fixed, unknown constant, frequentist statisticians object to the use of the word
probability in connection with this interval. The idea is that, over many repeated
experiments, such an interval will be valid with relative frequency 1 − α.
Error variance θ. The point estimate of θ = σ² is θ̂ = MS(Error). Moreover, we
know that g(r − 1)θ̂/θ ∼ CHISQ(g(r − 1)). Let L′ and U′ cut off probability α/2 from
the lower and upper tails of CHISQ(g(r − 1)), respectively. Then

   1 − α = P{ L′ ≤ g(r − 1)θ̂/θ ≤ U′ } = P{ 1/U′ ≤ θ/[g(r − 1)θ̂] ≤ 1/L′ }
         = P{ g(r − 1)θ̂/U′ ≤ θ ≤ g(r − 1)θ̂/L′ }.

Thus a 100(1 − α)% confidence interval for θ is (g(r − 1)θ̂/U′, g(r − 1)θ̂/L′).
Intraclass correlation ρI. No exact distribution theory leads to a confidence interval
for the batch component of variance θA. However, in practice, it is often useful to have
a confidence interval for the ratio ψ = θA/θ of the two variance components. Also of
interest is the intraclass correlation ρI = Cor(xij, xij′) = θA/(θA + θ), where j ≠ j′.
Thus, ρI (read: “rho-sub-I”) is the fraction of the total variance of an individual
observation that is due to the variance among batches.
From p230 of the text, we know that (g − 1)MS(Batch)/(rθA + θ) ∼ CHISQ(g − 1)
and g(r − 1)MS(Error)/θ ∼ CHISQ(g(r − 1)). Dividing each of these chi-squared
random variables by its degrees of freedom, and taking the ratio, we have

   [θ/(rθA + θ)] [MS(Batch)/MS(Error)] = R/(rψ + 1) ∼ F(g − 1, g(r − 1)).

Then, letting L″ and U″ cut off probability α/2 from the lower and upper tails of
F(g − 1, g(r − 1)), respectively, and noticing that 1/ρI = 1/ψ + 1, we have

   1 − α = P{ L″ ≤ R/(rψ + 1) ≤ U″ } = P{ R/U″ ≤ rψ + 1 ≤ R/L″ }
         = P{ (R − U″)/(rU″) ≤ ψ ≤ (R − L″)/(rL″) }
         = P{ rL″/(R − L″) + 1 ≤ 1/ψ + 1 ≤ rU″/(R − U″) + 1 }
         = P{ (R − U″)/[R + (r − 1)U″] ≤ ρI ≤ (R − L″)/[R + (r − 1)L″] }.

From these equations we can make confidence intervals for ψ and for ρI. c
Hint: b) For ρI , start by deriving a confidence interval for ψ = θA /θ. What multiple
of R is distributed as F(g − 1, g(r − 1))?
9.17 Figure 9.8 on p235 shows four diagnostic plots for the simulated pos-
terior distribution of σA in the Gibbs sampler of Example 9.3. Make similar
diagnostic plots for the posterior distributions of µ, σ, and ρI .
d Imitate the code in Example 9.1 used to make Figures 9.1 and 9.2. But, for each
of µ, σ, and ρI, use par(mfrow=c(2,2)) to make one figure with four panels. c

9.18 Small contribution of batches to the overall variance. Suppose the


researchers who did the experiment in Example 9.3 find a way to reduce the
batch component of variance. For the commercial purpose at hand, that would
be important progress. But when they try to analyze a second experiment,
there is a good chance that standard frequentist analysis will run into trouble.
The code below is essentially the same as in Problem 9.15, but with the
parameters and the seed changed. Group means and standard deviations,
sufficient for running the Gibbs sampler of Example 9.3 are shown as output.
d Correction note: In the first printing of the text, the summary data shown below
the program are usable, but not as intended. Seed 1237 produces the sample means
and standard deviations printed below, illustrated in Figure 9.5 of the text, and to
be included with the current problem in the second printing. c

set.seed(1237)
g = 12; r = 10
mu = 100; sg.a = 1; sg.e = 9
a.dat = matrix(rnorm(g, 0, sg.a), nrow=g, ncol=r)
e.dat = matrix(rnorm(g*r, 0, sg.e), nrow=g, ncol=r)
X = round(mu + a.dat + e.dat)
X.bar = apply(X, 1, mean); X.sd = apply(X, 1, sd)
round(rbind(X.bar, X.sd), 3)

> round(rbind(X.bar, X.sd), 3)


[,1] [,2] [,3] [,4] [,5] [,6]
X.bar 99.600 97.80 94.700 103.000 102.400 99.700
X.sd 8.514 9.92 9.322 12.815 8.553 6.897
[,7] [,8] [,9] [,10] [,11] [,12]
X.bar 105.400 102.500 99.700 101.600 99.10 100.600
X.sd 6.867 10.058 7.861 6.484 11.19 8.195

a) Figure 9.6 (p231) shows boxplots for each of the 12 batches simulated
above. Compare it with Figure 9.5 (p230). How can you judge from these
two figures that the batch component of variance is smaller here than in
Example 9.3?
d Small dots within the boxes of the boxplots in Figures 9.5 and 9.6 indicate the batch
means. In Figure 9.6 these batch means are much less variable than in Figure 9.5.
This suggests that θA is smaller for the data in Figure 9.6 than for the data of
Figure 9.5.
The rationale for this rough method of comparing batch variances θA using
boxplots is as follows. The model is xij = µ + Ai + eij, where Ai ∼ NORM(0, σA),
eij ∼ NORM(0, σ) and all Ai and eij are mutually independent, and V(xij ) = θA +θ.
So the variability of the x̄i. is a rough guide to the size of θA . We say rough because
there is no straightforward way to visualize θA alone. The batch mean of each boxplot
reflects both components of variance. c

b) Run the Gibbs sampler of Section 9.3 for these data using the same uninformative
priors as shown in the code there. You should obtain 95% Bayesian interval
estimates for µ, σ, σA = √θA, and ρI that cover the values used to generate
the data X. See Figure 9.12, where one-sided intervals are used for σA and ρI.
d Below we show the modifications required in the program of Section 9.3, along with
commented labels for the parts of the program. Changes in the Data section generate
the data shown in part (a). In the last section we make one-sided confidence intervals
for θA and ρI because the sampled posterior distributions are strongly right-skewed
with no left-hand tail. (Two lines of code with ## correct typographical errors in the
first printing of the text, replacing incorrect n with correct k.) c

# Data and dimensions


set.seed(1237) # seed for generating data shown
g = 12; r = 10
mu = 100; sg.a = 1; sg.e = 9
a.dat = matrix(rnorm(g, 0, sg.a), nrow=g, ncol=r)
e.dat = matrix(rnorm(g*r, 0, sg.e), nrow=g, ncol=r)
X = round(mu + a.dat + e.dat)
X.bar = apply(X, 1, mean); X.sd = apply(X, 1, sd)

# Initialize Gibbs sampler


set.seed(1066) # seed for Gibbs sampler
m = 50000; b = m/4 # iterations; burn-in
MU = VAR.BAT = VAR.ERR = numeric(m)

mu.0 = 0; th.0 = 10^10 # prior parameters for MU


alp.0 = .001; kap.0 = .001 # prior parameters for VAR.BAT
bta.0 = .001; lam.0 = .001 # prior parameters for VAR.ERR
MU[1] = 150; a = X.bar # initial values

# Sampling
for (k in 2:m) {
alp.up = alp.0 + g/2
kap.up = kap.0 + sum((a - MU[k-1])^2)/2
VAR.BAT[k] = 1/rgamma(1, alp.up, kap.up)

bta.up = bta.0 + r*g/2


lam.up = lam.0 + (sum((r-1)*X.sd^2) + r*sum((a - X.bar)^2))/2
VAR.ERR[k] = 1/rgamma(1, bta.up, lam.up)

mu.up = (VAR.BAT[k]*mu.0 + th.0*sum(a))/(VAR.BAT[k] + g*th.0) ##


th.up = th.0*VAR.BAT[k]/(VAR.BAT[k] + g*th.0) ##
MU[k] = rnorm(1, mu.up, sqrt(th.up))

deno = r*VAR.BAT[k] + VAR.ERR[k]


mu.a = (r*VAR.BAT[k]*X.bar + VAR.ERR[k]*MU[k])/deno
th.a = (VAR.BAT[k]*VAR.ERR[k])/deno
a = rnorm(g, mu.a, sqrt(th.a)) }

# Summarize and display results, diagnostics


mean(MU[b:m]); sqrt(mean(VAR.BAT[b:m])); sqrt(mean(VAR.ERR[b:m]))
bi.MU = quantile(MU[b:m], c(.025,.975))
SIGMA.BAT = sqrt(VAR.BAT); SIGMA.ERR = sqrt(VAR.ERR)
bi.SG.B = quantile(SIGMA.BAT[b:m], .95) # One-sided
bi.SG.E = quantile(SIGMA.ERR[b:m], c(.025,.975))
ICC = VAR.BAT/(VAR.BAT+VAR.ERR)
bi.ICC = quantile(ICC[b:m], .95) # One-sided
bi.MU; bi.SG.B; bi.SG.E; bi.ICC

par(mfrow=c(2,2)); hc = "wheat"; lc = "red"


hist(MU[b:m], prob=T, col=hc); abline(v=bi.MU, col=lc)
hist(SIGMA.BAT[b:m], prob=T, col=hc); abline(v=bi.SG.B, col=lc)
hist(SIGMA.ERR[b:m], prob=T, col=hc); abline(v=bi.SG.E, col=lc)
hist(ICC[b:m], prob=T, col=hc); abline(v=bi.ICC, col=lc)
par(mfrow=c(1,1))

> mean(MU[b:m]); sqrt(mean(VAR.BAT[b:m])); sqrt(mean(VAR.ERR[b:m]))


[1] 100.4375
[1] 0.9347261
[1] 9.101563

> bi.MU; bi.SG.B; bi.SG.E; bi.ICC


2.5% 97.5%
98.69659 102.21611 # probability interval for grand mean
95%
2.091580 # One-sided interval for batch variance
2.5% 97.5%
7.99132 10.34513 # probability interval for error variance
95%
0.05145389 # One-sided interval for intraclass correlation

9.19 Continuation of Problem 9.18. Negative estimates of θA and ρI .


a) Refer to results stated in Problems 9.15 and 9.16. Show that the unbiased
estimate of θA is negative. Also, show that the 95% confidence interval
for ρI includes negative values. Finally, find 95% confidence intervals for µ
and σ = √θ and compare them with corresponding results from the Gibbs
sampler in Problem 9.18.
d Below we use code similar to that of Problems 9.15(b) and 9.16(a). First, we find
the inappropriate negative value θ̂A = −0.58. Then we find the frequentist
95% CI (98.75, 102.26) for µ, which is in good numerical agreement with the 95%
Bayesian probability interval (98.70, 102.22) from the Gibbs sampler of the previous
problem, and the 95% CI (64.08, 109.52) for θ, or (8.01, 10.46) for σ, which has
nearly the same endpoints as the 95% interval (7.99, 10.35) for σ from the Gibbs
sampler.
The two-sided 95% CI for ρI computed below extends from 0.148 downward into
negative values, whereas the corresponding Bayesian probability interval (not shown
in the previous problem) is (0.001, 0.078). The code below also finds a 95% one-sided
upper bound 0.11 for ρI; this differs substantially from the corresponding Bayesian
bound 0.05. For our simulated data, we know that ρI = θA/(θ + θA) = 1/(1 + 9²) =
0.0122, so both the one- and two-sided Bayesian intervals provide information that is
more useful than what is provided by traditional CIs. c

MS.Bat = r*var(X.bar); MS.Err = mean(X.sd^2)


est.theta.a = (MS.Bat - MS.Err)/r # est. of batch variance
> MS.Bat; MS.Err; est.theta.a # numerical results
[1] 76.42652
[1] 82.2713
[1] -0.5844781

> mean(X.bar) + qt(c(.025,.975), g-1)*sqrt(MS.Bat/(g*r))


[1] 98.75183 102.26483
> df.Err = g*(r-1); df.Err*MS.Err/qchisq(c(.975,.025), df.Err)
[1] 64.0841 109.5153
> R = MS.Bat/MS.Err; q.f = qf(c(.975,.025), g-1, g*r-g)
> (R - q.f)/(R + (r-1)*q.f)
[1] -0.05939803 0.14829396
> R = MS.Bat/MS.Err; q.f = qf(.05, g-1, g*r-g)
> (R - q.f)/(R + (r-1)*q.f)
[1] 0.1133588
b) Whenever R = MS(Batch)/MS(Error) < 1, the unbiased estimate θ̂A of θA is
negative. When the batch component of variance is relatively small, this has a
good chance of occurring. Evaluate P {R < 1} when σA = 1, σ = 9, g = 12, and
r = 10, as in this problem.
d From Problem 9.16 (and p230 of the text), we have [θ/(rθA + θ)]R ∼ F(g − 1, g(r − 1)),
or (81/91)R = 0.8901R ∼ F(11, 108). Thus we can use pf(.8901, 11, 108) to get
the answer, 0.4479. (If you are using the first printing of the text, see the error note
at the end of this chapter of the Instructor Manual.)
Especially if you don’t feel comfortable with the distribution theory, another
approach is to simulate P {R < 1}, as shown in the program below. c
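
d Extra: the exact calculation described above can also be done directly in R; here is a short sketch using the values given in this problem (σA = 1 and σ = 9, so θA = 1 and θ = 81). c

# Exact P{R < 1} via the F distribution (values as assumed above)
g = 12; r = 10; th.a = 1; th = 81
const = th/(r*th.a + th); const          # 81/91 = 0.8901
pf(const, g-1, g*(r-1))                  # approximately 0.4479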

set.seed(1234)
g = 12; r = 10; mu = 100; sg.a = 1; sg.e = 9
m = 100000; R = numeric(m)
for (i in 1:m) {
a.dat = matrix(rnorm(g, 0, sg.a), nrow=g, ncol=r)
e.dat = matrix(rnorm(g*r, 0, sg.e), nrow=g, ncol=r)
X = round(mu + a.dat + e.dat)
X.bar = apply(X, 1, mean); X.sd = apply(X, 1, sd)
MS.Bat = r*var(X.bar); MS.Err = mean(X.sd^2)
R[i] = MS.Bat/MS.Err }
mean(R < 1)

> mean(R < 1)


[1] 0.44739

c) The null hypothesis H0 : θA = 0 is accepted (against H1 : θA > 0) when


R is smaller than the 95th quantile of the F distribution with g − 1 and
g(r − 1) degrees of freedom. Explain why this null hypothesis is always
accepted when θ̂A < 0.
d From the answer to Problem 9.15, we have θ̂A = [MS(Batch) − MS(Error)]/r,
so θ̂A < 0 implies MS(Batch)/MS(Error) < 1. In order to reject H0 , we must have R
“significantly greater” than 1. But the distribution F(ν1 = g − 1, ν2 = g(r − 1)) has
its mode near 1. Specifically, for ν1 > 2, the mode is at ν2(ν1 − 2)/[ν1(ν2 + 2)]. For this problem,
we reject at the 5% (or 10%) level of significance for R exceeding 1.878 (or 1.631). c
Hints: b) Exceeds 0.4. c) The code qf(.95, 11, 108) gives a result exceeding 1.
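
d Extra: the thresholds and the F-distribution mode cited in part (c) can be checked with a few lines of R (a sketch, with g = 12 and r = 10 as in this problem). c

# Rejection thresholds and mode of F(g-1, g(r-1)) for g = 12, r = 10
g = 12; r = 10; nu1 = g - 1; nu2 = g*(r - 1)
qf(c(.95, .90), nu1, nu2)                # about 1.878 and 1.631
nu2*(nu1 - 2)/(nu1*(nu2 + 2))            # mode of F(11, 108)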
9.20 Calcium concentration in turnip leaves (% dry weight) is assayed for
four samples from each of four leaves. Consider leaves as “batches.” The data
are shown below as R code for the matrix X in the program of Example 9.3;
that is, each row of X corresponds to a batch.
X = matrix(c(3.28, 3.09, 3.03, 3.03,
3.52, 3.48, 3.38, 3.38,
2.88, 2.80, 2.81, 2.76,
3.34, 3.38, 3.23, 3.26), nrow=4, ncol=4, byrow=T)

a) Run the program, using the same noninformative prior distributions as


specified there, to find 95% Bayesian interval estimates for µ, σA , σ, and ρI
from these data.
d The program below is essentially the same as the one in Example 9.3 and Prob-
lem 9.18. We have substituted the real data above for the part of the earlier programs
that simulated data, changing the values of g and r appropriately. We show two-
sided probability intervals for all parameters. To save space we omit the parts of the
program that do the sampling and make the graphics because they are unchanged.
After the Gibbs sampling program, we show method-of-moments estimators and fre-
quentist confidence intervals found as in the answers to Problems 9.15(b) and 9.16(a).
At the end we show a hand-made summary table for easy comparison of results.
Because we have so few batches in this experiment, the probability interval for σA
is relatively long, and its MME may not be reliable. c

# Data and dimensions


g = 4; r = 4
X = matrix(c(3.28, 3.09, 3.03, 3.03,
3.52, 3.48, 3.38, 3.38,
2.88, 2.80, 2.81, 2.76,
3.34, 3.38, 3.23, 3.26), nrow=g, ncol=r, byrow=T)
X.bar = apply(X, 1, mean); X.sd = apply(X, 1, sd)

# Initialize Gibbs sampler


set.seed(1067)
m = 50000; b = m/4 # iterations; burn-in
MU = VAR.BAT = VAR.ERR = numeric(m)

mu.0 = 0; th.0 = 10^10 # prior parameters for MU


alp.0 = .001; kap.0 = .001 # prior parameters for VAR.BAT
bta.0 = .001; lam.0 = .001 # prior parameters for VAR.ERR
MU[1] = 150; a = X.bar # initial values

# Sampling
...

# Summarize and display results, diagnostics


mean(MU[b:m]); sqrt(mean(VAR.BAT[b:m])); sqrt(mean(VAR.ERR[b:m]))
bi.MU = quantile(MU[b:m], c(.025,.975))
SIGMA.BAT = sqrt(VAR.BAT); SIGMA.ERR = sqrt(VAR.ERR)
bi.SG.B = quantile(SIGMA.BAT[b:m], c(.025, .95))
bi.SG.E = quantile(SIGMA.ERR[b:m], c(.025,.975))
ICC = VAR.BAT/(VAR.BAT+VAR.ERR)
bi.ICC = quantile(ICC[b:m], c(.025, .95))
bi.MU; bi.SG.B; bi.SG.E; bi.ICC

... # histograms omitted here

> mean(MU[b:m]); sqrt(mean(VAR.BAT[b:m]));


sqrt(mean(VAR.ERR[b:m])); mean(ICC[b:m])
[1] 3.166042 # Bayes est of grand mean
[1] 0.4711305 # Bayes est of batch SD
[1] 0.09021789 # Bayes est of error SD
[1] 0.90411305 # Bayes est of ICC

> bi.MU; bi.SG.B; bi.SG.E; bi.ICC


2.5% 97.5%
2.730100 3.608211 # prob int for grand mean
2.5% 95%
0.1468979 0.8046348 # prob int for batch SD
2.5% 97.5%
0.05878872 0.13628576 # prob int for error SD
2.5% 95%
0.6818592 0.9899228 # prob int for ICC

# Comparable MMEs and Frequentist 95% CIs


X.bar = apply(X, 1, mean); X.sd = apply(X, 1, sd)
MS.Bat = r*var(X.bar); MS.Err = mean(X.sd^2)
est.theta.a = (MS.Bat - MS.Err)/r
mean(X.bar); sqrt(est.theta.a); sqrt(MS.Err)

> mean(X.bar); sqrt(est.theta.a); sqrt(MS.Err)


[1] 3.165625 # MME of grand mean
[1] 0.2690357 # MME of batch SD
[1] 0.0812532 # MME of error SD

mean(X.bar) + qt(c(.025,.975), g-1)*sqrt(MS.Bat/(g*r))


df.Err = g*(r-1); sqrt(df.Err*MS.Err/qchisq(c(.975,.025), df.Err))
R = MS.Bat/MS.Err; q.f = qf(c(.975,.025), g-1, g*r-g)
(R - q.f)/(R + (r-1)*q.f)

> mean(X.bar) + qt(c(.025,.975), g-1)*sqrt(MS.Bat/(g*r))


[1] 2.732676 3.598574
> df.Err = g*(r-1); sqrt(df.Err*MS.Err/qchisq(c(.975,.025), df.Err))
[1] 0.05826553 0.13412752
> R = MS.Bat/MS.Err; q.f = qf(c(.975,.025), g-1, g*r-g)
> (R - q.f)/(R + (r-1)*q.f)
[1] 0.6928943 0.9938084

SUMMARY TABLE

grand mean batch SD error SD ICC


Bayes 3.17 .471 .090 .904
prob int (2.73, 3.61) (.147, .805) (.059, .136) (0.682, 0.990)
MME 3.17 .269 .081 .916
CI (2.73, 3.60) -- (.058, .134) (0.693, 0.994)

b) Suppose the researchers have previous experience making calcium deter-


minations from such leaves. While calcium content and variability from
leaf to leaf can change from one crop to the next, they have observed that
the standard deviation σ of measurements from the same leaf is usually
between 0.075 and 0.100. So instead of a flat prior for σ, they choose
IG(β0 = 35, λ0 = 0.25). In these circumstances, explain why this is a
reasonable prior.
d The R code 1/sqrt(qgamma(c(.975,.025), 35, 0.25)) returns (0.073, 0.101). c
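
d Extra: a slightly fuller sketch of the same check; recall from the program that putting IG(β0, λ0) on θ = σ² means 1/θ ∼ GAMMA(β0, λ0). c

# Prior on theta = sigma^2 is IG(35, 0.25), so 1/theta ~ GAMMA(35, 0.25)
bta.0 = 35; lam.0 = 0.25
1/sqrt(qgamma(c(.975,.025), bta.0, lam.0))   # central 95% prior interval for sigma: about (0.073, 0.101)
sqrt(lam.0/(bta.0 - 1))                      # prior mean of theta corresponds to sigma near 0.086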

c) With flat priors for µ and θA , but the prior of part (b) for θ, run the
Gibbs sampler to find 95% Bayesian interval estimates for µ, σA , σ, and ρI
from the data given above. Compare these intervals with your answers in
part (a) and comment.
d This is the same program as in part (a) except that we have changed to the infor-
mative prior distribution on θ = σ². As a result, the Bayesian probability interval
for σ is narrower than in part (a), and accordingly the probability interval for the
intraclass correlation ρI = θA/(θA + θ) is a little narrower. Again here, we have omitted
(at the ...s) some parts of the program that have been shown earlier. c

# Data and dimensions


g = 4; r = 4
X = matrix(c(3.28, 3.09, 3.03, 3.03,
3.52, 3.48, 3.38, 3.38,
2.88, 2.80, 2.81, 2.76,
3.34, 3.38, 3.23, 3.26), nrow=g, ncol=r, byrow=T)
X.bar = apply(X, 1, mean); X.sd = apply(X, 1, sd)

# Initialize Gibbs sampler


set.seed(1067)
m = 50000; b = m/4 # iterations; burn-in
MU = VAR.BAT = VAR.ERR = numeric(m)
mu.0 = 0; th.0 = 10^10 # prior parameters for MU
alp.0 = .001; kap.0 = .001 # prior parameters for VAR.BAT
bta.0 = 35; lam.0 = .25 # prior parameters for VAR.ERR--new
MU[1] = 150; a = X.bar # initial values

# Sampling
...

# Summarize and display results, diagnostics


mean(MU[b:m]); sqrt(mean(VAR.BAT[b:m])); sqrt(mean(VAR.ERR[b:m]))
bi.MU = quantile(MU[b:m], c(.025,.975))
SIGMA.BAT = sqrt(VAR.BAT); SIGMA.ERR = sqrt(VAR.ERR)
bi.SG.B = quantile(SIGMA.BAT[b:m], c(.025, .95))
bi.SG.E = quantile(SIGMA.ERR[b:m], c(.025,.975))
ICC = VAR.BAT/(VAR.BAT+VAR.ERR)
bi.ICC = quantile(ICC[b:m], c(.025, .95))
bi.MU; bi.SG.B; bi.SG.E; bi.ICC
...

> mean(MU[b:m]); sqrt(mean(VAR.BAT[b:m])); sqrt(mean(VAR.ERR[b:m]))


[1] 3.166769
[1] 0.4846692
[1] 0.08505532

> bi.MU; bi.SG.B; bi.SG.E; bi.ICC


2.5% 97.5%
2.730944 3.609069
2.5% 95%
0.1480317 0.8144920
2.5% 97.5%
0.07286628 0.09912327
2.5% 95%
0.7457294 0.9895616

Note: Data are from page 239 of Snedecor and Cochran (1980). The unbiased esti-
mate of θA = σA² is positive here. Estimation of σA by any method is problematic
because there are so few batches.

9.21 In order to assess components of variance in the two-stage manufac-


ture of a dye, researchers obtain measurements on five samples from each of
six batches. The data are shown below as R code for the matrix X in the
program of Example 9.3; that is, each row of X corresponds to a batch.
X = matrix(c(1545, 1440, 1440, 1520, 1580,
1540, 1555, 1490, 1560, 1495,
1595, 1550, 1605, 1510, 1560,
1445, 1440, 1595, 1465, 1545,
1595, 1630, 1515, 1635, 1625,
1520, 1455, 1450, 1480, 1445), 6, 5, byrow=T)

d With the obvious changes, this problem is solved in the same way as Problem 9.20.
A Bayesian analysis of these dye data can also be found in Example 10.3. c

a) Use these data to find unbiased point estimates of µ, σA , and σ. Also find
95% confidence intervals for µ, σ, and ρI (see Problem 9.16).
b) Use a Gibbs sampler to find 95% Bayesian interval estimates for µ, σA , σ,
and ρI from these data. Specify noninformative prior distributions as in
Example 9.3. Make diagnostic plots.

Answers: b) Roughly: (1478, 1578) for µ; (15, 115) for σA . See Box and Tiao (1973)
for a discussion of these data, reported in Davies (1957).
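
d Extra: as a sketch of the “obvious changes” mentioned above, the data and dimension block of the Problem 9.20(a) program would be replaced by the lines below (g = 6, r = 5); the rest of the Gibbs sampler is unchanged. c

# Data and dimensions for Problem 9.21
g = 6; r = 5
X = matrix(c(1545, 1440, 1440, 1520, 1580,
             1540, 1555, 1490, 1560, 1495,
             1595, 1550, 1605, 1510, 1560,
             1445, 1440, 1595, 1465, 1545,
             1595, 1630, 1515, 1635, 1625,
             1520, 1455, 1450, 1480, 1445), nrow=g, ncol=r, byrow=T)
X.bar = apply(X, 1, mean); X.sd = apply(X, 1, sd)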

9.22 In order to assess components of variance in the two-stage manufac-


ture of a kind of plastic, researchers obtain measurements on four samples
from each of 22 batches. Computations show that MS(Error) = 23.394. Also,
sums of the four measurements from each of the 22 batches are as follows:
218 182 177 174 208 186
206 192 187 154 208 176
196 179 181 158 158 198
160 178 148 194

a) Compute the batch means, and thus x̄.. and MS(Batch). Use your results
to find the unbiased point estimates of µ, θA , and θ.
d Batch means are x̄i. = (1/r) Σj xij (j = 1, . . . , r), for i = 1, . . . , g, where g = 22 and r = 4.
For example, x̄1. = 218/4 = 54.5. These form the vector X.bar below. Estimates are

   µ̂ = x̄.. = (1/gr) Σi Σj xij = (1/g) Σi x̄i. = 45.66,
   θ̂ = σ̂² = MS(Error) = 23.39,
   θ̂A = σ̂A² = [MS(Batch) − MS(Error)]/r = 17.01,

where

   MS(Batch) = [r/(g − 1)] Σi (x̄i. − x̄..)² = 91.42.

Computations in R are shown below. c



# Point estimates (MMEs)


g = 22; r = 4
MS.Err = 23.394
X.bar = c(218, 182, 177, 174, 208, 186, 206, 192,
187, 154, 208, 176, 196, 179, 181, 158,
158, 198, 160, 178, 148, 194)/r
X.bar; mean(X.bar)
MS.Bat = r*var(X.bar); MS.Bat
est.theta.a = (MS.Bat - MS.Err)/r; est.theta.a

> X.bar; mean(X.bar)


[1] 54.50 45.50 44.25 43.50 52.00 46.50 51.50 48.00 46.75
[10] 38.50 52.00 44.00 49.00 44.75 45.25 39.50 39.50 49.50
[19] 40.00 44.50 37.00 48.50
[1] 45.65909
> MS.Bat = r*var(X.bar); MS.Bat
[1] 91.41775
> est.theta.a = (MS.Bat - MS.Err)/r; est.theta.a
[1] 17.00594
d Extra: Here is R code to verify the 90% confidence intervals shown in the Note.
Results agree within rounding error. See the answer to Problem 9.16 for the formulas.
We have no CI for θA , but the CI for ψ covers 1, so it is believable that the
two components of variance may be about equal. If our purpose is to decrease the
variability of the xij , we could start by working to decrease either the variability
of batches or the variability of observations within batches—whichever might be
easiest or cheapest to accomplish. c
# 90% CIs
mean(X.bar) + qt(c(.05,.95), g-1)*sqrt(MS.Bat/(g*r))
df.Err = g*(r-1); df.Err*MS.Err/qchisq(c(.95,.05), df.Err)
R = MS.Bat/MS.Err; q.f = qf(c(.95,.05), g-1, g*r-g)
(R - q.f)/(r*q.f)
(R - q.f)/(R + (r-1)*q.f)

> mean(X.bar) + qt(c(.05,.95), g-1)*sqrt(MS.Bat/(g*r))


[1] 43.90525 47.41293 # 90% CI for grand mean
> df.Err = g*(r-1); df.Err*MS.Err/qchisq(c(.95,.05), df.Err)
[1] 17.96086 31.96340 # 90% CI for error variance
> R = MS.Bat/MS.Err; q.f = qf(c(.95,.05), g-1, g*r-g)
> (R - q.f)/(r*q.f)
[1] 0.3185997 1.6134812 # 90% CI for ratio psi
> (R - q.f)/(R + (r-1)*q.f)
[1] 0.2416197 0.6173686 # 90% CI for ICC (not shown in Note)
b) Notice that the batch standard deviations si, i = 1, . . . , 22, enter into the pro-
gram of Example 9.3 only as Σi (r − 1)si². Make minor changes in the program
(r − 1)si . Make minor changes in the program
so that you can use the information provided to find 90% Bayesian interval esti-
mates of µ, σA , σ, and ρI based on the same noninformative prior distributions
as in the example.

d We have made a few changes so that the summarized data can be used. In partic-
ular, Σi (r − 1)si² = g(r − 1)MS(Error), where the sum is over i = 1, . . . , g. Because of the changes, we show the full
Gibbs sampler program for this problem.
Posterior means for µ, σA , and σ are in reasonably good agreement with the cor-
responding MMEs; where they differ we prefer the Bayesian results with flat priors.
Moreover, there is very good numerical agreement of the 95% posterior probability
intervals for µ, θ, and ψ with corresponding 95% CIs that were obtained using the
methods illustrated at the end of the answers for part (a).
Finally, the last block of code, which makes plots similar to those in Figures 9.8
and 9.9, is a reminder to look at diagnostic graphics for all Gibbs samplers. Here
they are all well behaved. c

# Data and dimensions


set.seed(1237) # seed for generating data shown
g = 22; r = 4
X.bar = c(218, 182, 177, 174, 208, 186, 206, 192,
187, 154, 208, 176, 196, 179, 181, 158,
158, 198, 160, 178, 148, 194)/4
MS.Err = 23.394

# Initialize Gibbs sampler


set.seed(1068) # seed for Gibbs sampler
m = 50000; b = m/4 # iterations; burn-in
MU = VAR.BAT = VAR.ERR = numeric(m)

mu.0 = 0; th.0 = 10^10 # prior parameters for MU


alp.0 = .001; kap.0 = .001 # prior parameters for VAR.BAT
bta.0 = .001; lam.0 = .001 # prior parameters for VAR.ERR
MU[1] = 50; a = X.bar # initial values

# Sampling
for (k in 2:m) {
alp.up = alp.0 + g/2
kap.up = kap.0 + sum((a - MU[k-1])^2)/2
VAR.BAT[k] = 1/rgamma(1, alp.up, kap.up)

bta.up = bta.0 + r*g/2


lam.up = lam.0 + (g*(r-1)*MS.Err + r*sum((a-X.bar)^2))/2 # new
VAR.ERR[k] = 1/rgamma(1, bta.up, lam.up)

mu.up = (VAR.BAT[k]*mu.0 + th.0*sum(a))/(VAR.BAT[k] + g*th.0)


th.up = th.0*VAR.BAT[k]/(VAR.BAT[k] + g*th.0)
MU[k] = rnorm(1, mu.up, sqrt(th.up))

deno = r*VAR.BAT[k] + VAR.ERR[k]


mu.a = (r*VAR.BAT[k]*X.bar + VAR.ERR[k]*MU[k])/deno
th.a = (VAR.BAT[k]*VAR.ERR[k])/deno
a = rnorm(g, mu.a, sqrt(th.a)) }

# Summarize and display results, diagnostics


bi.MU = quantile(MU[b:m], c(.025,.975))
SIGMA.BAT = sqrt(VAR.BAT); SIGMA.ERR = sqrt(VAR.ERR)
bi.SG.B = quantile(SIGMA.BAT[b:m], c(.025,.975))
bi.SG.E = quantile(SIGMA.ERR[b:m], c(.025,.975))
ICC = VAR.BAT/(VAR.BAT+VAR.ERR)
bi.ICC = quantile(ICC[b:m], c(.025,.975))
PSI = VAR.BAT/VAR.ERR
bi.PSI = quantile(PSI[b:m], c(.025,.975))
mean(MU[b:m]); mean(SIGMA.BAT[b:m]); mean(SIGMA.ERR[b:m])
bi.MU; bi.SG.B; bi.SG.E; bi.SG.E^2; bi.ICC; bi.PSI

par(mfrow=c(2,2)); hc = "wheat"; lc = "red"


hist(MU[b:m], prob=T, col=hc); abline(v=bi.MU, col=lc)
hist(SIGMA.BAT[b:m], prob=T, col=hc); abline(v=bi.SG.B, col=lc)
hist(SIGMA.ERR[b:m], prob=T, col=hc); abline(v=bi.SG.E, col=lc)
hist(ICC[b:m], prob=T, col=hc); abline(v=bi.ICC, col=lc)
par(mfrow=c(1,1))

> mean(MU[b:m]); mean(SIGMA.BAT[b:m]); mean(SIGMA.ERR[b:m])


[1] 45.67262 # MME: 45.66
[1] 4.156379 # MME: 4.12
[1] 4.92968 # MME: 4.84

> bi.MU; bi.SG.B; bi.SG.E; bi.SG.E^2; bi.ICC; bi.PSI


2.5% 97.5%
43.60003 47.74295 # 95% CI for grand mean: (43.45, 47.78)
2.5% 97.5%
2.527066 6.286374
2.5% 97.5%
4.152797 5.898822
2.5% 97.5%
17.24572 34.79610 # 95% CI for error var: (17.09, 33.99)
2.5% 97.5%
0.1831863 0.6414378
2.5% 97.5%
0.2242693 1.7889162 # 95% CI for ratio PSI: (.268, 1.87)

par(mfrow=c(2,2)); hc = "wheat"; lc = "red"


plot(ICC[b:m], type="l")
plot(cumsum(ICC[2:m])/(2:m), type="l") # skip 1
acf(ICC[b:m])
thin.ab = seq(b, m, by = 10)
plot(MU[thin.ab], ICC[thin.ab], pch=20)
abline(v=bi.MU, col="red")
abline(h=bi.ICC, col="red")
par(mfrow=c(1,1))

Note: Data are taken from Brownlee (1956), p325. Along with other inferences
from these data, the following traditional 90% confidence intervals are given there:
(43.9, 47.4) for µ; (17.95, 31.97) for θ; and (0.32, 1.62) for ψ = θA /θ. (See Prob-
lem 9.16.)

9.23 Design considerations. Here we explore briefly how to allocate re-


sources to get a narrower probability interval for σA than we got in Exam-
ple 9.3 with g = 12 batches and r = 10 replications. Suppose access to ad-
ditional randomly chosen batches comes at negligible cost, so that the main
expenditure for the experiment is based on handling gr = 120 items. Then an
experiment with g = 60 and r = 2 would cost about the same as the one in
the example.
Modify the code shown in Problem 9.15 to generate data from such a
60-batch experiment, but still with µ = 100, σA = 15, and σ = 9 as in the ex-
ample. Then run the Gibbs sampler of the example with these data. Compare
the lengths of the probability intervals for µ, σA , σ, and ρI from this 60-batch
experiment with those obtained for the 12-batch experiment of the example.
Comment on the trade-offs.
d This problem makes a nice take-home exam question or (especially, by following
some suggestions in the Note) a nice project. Related programs are covered in pre-
vious examples and problems, and the required modifications are straightforward.
We do not provide answers. c
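
d Extra: a minimal sketch of the data-generation step only, patterned after the simulation code in the answer to Problem 9.19(b); the seed is arbitrary, and the Gibbs runs and interval comparisons are left to the reader. c

# Simulate one 60-batch, 2-replication data set (mu = 100, sigma.A = 15, sigma = 9)
set.seed(1240)                                        # arbitrary seed
g = 60; r = 2; mu = 100; sg.a = 15; sg.e = 9
a.dat = matrix(rnorm(g, 0, sg.a), nrow=g, ncol=r)     # batch effects (constant across each row)
e.dat = matrix(rnorm(g*r, 0, sg.e), nrow=g, ncol=r)   # within-batch errors
X = round(mu + a.dat + e.dat)
X.bar = apply(X, 1, mean); X.sd = apply(X, 1, sd)     # summaries for the Gibbs sampler of Example 9.3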
Note: In several runs with different seeds for generating the data, we got much
shorter intervals for σA based on the larger number of batches, but intervals for σ
were a little longer. What about the lengths of probability intervals for µ and ρI ?
In designing an experiment, one must keep its goal in mind. For the goal of getting
the shortest frequentist confidence interval for µ within a given budget, Snedecor
and Cochran (1980) show an optimization based on relative costs of batches and
items. Additional explorations: (i) For these same parameters, investigate a design
with g = 30 and r = 4. (ii) Investigate the effect of increasing the number of batches
when σA is small, as for the data generated in Problem 9.18.

9.24 Using the correct model. To assess the variability of a process for
making a pharmaceutical drug, measurements of potency were made on one
pill from each of 50 bottles. These results are entered into a spreadsheet as 10
rows of 5 observations each. Row means and standard deviations are shown
below.
Row 1 2 3 4 5 6 7 8 9 10
Mean 124.2 127.8 119.4 123.4 110.6 130.4 128.4 127.6 122.0 124.4
SD 10.57 14.89 11.55 10.14 12.82 9.99 12.97 12.82 16.72 8.53

a) Understanding from a telephone conversation with the researchers that


the rows correspond to different batches of the drug made on different
days, a statistician uses the Gibbs sampler of Example 9.3 to analyze the
data. Perform this analysis for yourself.

d According to the Gibbs sampler, the batch variance is quite small compared with the
error variance. We give one-sided probability intervals for the batch standard deviation and
the intraclass correlation. The block of code at the end of the program shows that
the MME of the batch variance is positive, but the null hypothesis that the batch
variance is 0 is not rejected (P-value = 40%). c

g = 10; r = 5
X.bar = c(124.2, 127.8, 119.4, 123.4, 110.6,
130.4, 128.4, 127.6, 122.0, 124.4)
X.sd = c(10.57, 14.89, 11.55, 10.14, 12.82,
9.99, 12.97, 12.82, 16.72, 8.53)

set.seed(1247)
m = 50000; b = m/4 # iterations; burn-in
MU = VAR.BAT = VAR.ERR = numeric(m)

mu.0 = 0; th.0 = 10^10 # prior parameters for MU


alp.0 = .001; kap.0 = .001 # prior parameters for VAR.BAT
bta.0 = .001; lam.0 = .001 # prior parameters for VAR.ERR
MU[1] = 150; a = X.bar # initial values

for (k in 2:m)
{
alp.up = alp.0 + g/2
kap.up = kap.0 + sum((a - MU[k-1])^2)/2
VAR.BAT[k] = 1/rgamma(1, alp.up, kap.up)

bta.up = bta.0 + r*g/2


lam.up = lam.0 + (sum((r-1)*X.sd^2) + r*sum((a - X.bar)^2))/2
VAR.ERR[k] = 1/rgamma(1, bta.up, lam.up)

mu.up = (VAR.BAT[k]*mu.0 + th.0*sum(a))/(VAR.BAT[k] + g*th.0)


th.up = th.0*VAR.BAT[k]/(VAR.BAT[k] + g*th.0)
MU[k] = rnorm(1, mu.up, sqrt(th.up))

deno = r*VAR.BAT[k] + VAR.ERR[k]


mu.a = (r*VAR.BAT[k]*X.bar + VAR.ERR[k]*MU[k])/deno
th.a = (VAR.BAT[k]*VAR.ERR[k])/deno
a = rnorm(g, mu.a, sqrt(th.a))
}

mean(MU[b:m]); mean(sqrt(VAR.BAT[b:m])); mean(sqrt(VAR.ERR[b:m]))


bi.MU = quantile(MU[b:m], c(.025,.975))
SIGMA.BAT = sqrt(VAR.BAT); SIGMA.ERR = sqrt(VAR.ERR)
bi.SG.B = quantile(SIGMA.BAT[b:m], .95)
bi.SG.E = quantile(SIGMA.ERR[b:m], c(.025,.975))
ICC = VAR.BAT/(VAR.BAT+VAR.ERR)
bi.ICC = quantile(ICC[b:m], .95)
bi.MU; bi.SG.B; bi.SG.E; bi.ICC

par(mfrow=c(2,2))
hist(MU[b:m], prob=T); abline(v=bi.MU)
hist(SIGMA.BAT[b:m], prob=T); abline(v=bi.SG.B)
hist(SIGMA.ERR[b:m], prob=T); abline(v=bi.SG.E)
hist(ICC[b:m], prob=T); abline(v=bi.ICC)
par(mfrow=c(1,1))

> mean(MU[b:m]); mean(sqrt(VAR.BAT[b:m])); mean(sqrt(VAR.ERR[b:m]))


[1] 123.7381
[1] 1.223406
[1] 12.51017

> bi.MU; bi.SG.B; bi.SG.E; bi.ICC


2.5% 97.5%
120.1225 127.2927
95%
4.708581 # one-sided interval for batch SD
2.5% 97.5%
10.25783 15.34659
95%
0.1327639 # one-sided interval for ICC

# Comparable MMEs
MS.Bat = r*var(X.bar); MS.Err = mean(X.sd^2)
est.theta.a = (MS.Bat - MS.Err)/r
R = MS.Bat/MS.Err; P.val = 1 - pf(R, g-1, g*(r-1))
mean(X.bar); est.theta.a; MS.Bat; MS.Err; R; P.val

> mean(X.bar); est.theta.a; MS.Bat; MS.Err; R; P.val


[1] 123.82
[1] 2.145472
[1] 162.5978
[1] 151.8704
[1] 1.070635
[1] 0.4048054

b) The truth is that all 50 observations come from the same batch. Record-
ing the data in the spreadsheet by rows was just someone’s idea of a
convenience. Thus, the data would properly be analyzed without regard
to bogus “batches” according to a Gibbs sampler as in Example 9.3.
(Of course, this requires summarizing the data in a different way. Use
s² = [9MS(Batch) + 40MS(Error)]/49, where s is the standard deviation
of all 50 observations.) Perform this analysis, compare it with the results
of part (a), and comment.
d Now, letting θ = σ², we have xi = µ + ei, where the ei are independently distributed
as NORM(0, σ), for i = 1, . . . , n = 50, so there is no θA in the model. Assuming that

the data were truly collected according to this model, the correct value in part (a)
would have been θA = 0.
Using the values from the end of the code in part (a), we have

s² = [9(162.5978) + 40(151.8704)]/49 = 153.8407.

This equation amounts to adding

SS(Batch) = (g − 1)MS(Batch) and SS(Error) = g(r − 1)MS(Error)

in the standard ANOVA table for the model of part (a), to obtain (n − 1)s² for the
model of this part.
Then a 95% CI for µ is 123.82 ± 2.0096 √(153.8407/50) or (120.30, 127.35),
where 2.0096 is from qt(.975, 49). Also, a 95% CI for θ is (107.35, 238.89), from
49*153.8407/qchisq(c(.975, .025), 49), so that the CI for σ is (10.36, 15.46).
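
d Extra: a short sketch of these frequentist calculations, with the numbers carried over from part (a). c

# Frequentist 95% CIs for the single-sample model (n = 50)
n = 50; x.bar = 123.82
s2 = (9*162.5978 + 40*151.8704)/49; s2                 # 153.8407
x.bar + qt(c(.025,.975), n-1)*sqrt(s2/n)               # CI for mu: about (120.30, 127.35)
(n-1)*s2/qchisq(c(.975,.025), n-1)                     # CI for theta: about (107.35, 238.89)
sqrt((n-1)*s2/qchisq(c(.975,.025), n-1))               # CI for sigma: about (10.36, 15.46)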
Below, the Gibbs sampler of Example 9.2 is suitably modified to provide Bayesian
interval estimates for these parameters, based on noninformative priors. (We leave
it to you to make the diagnostic graphs shown in the code below and to provide
code for graphs similar to Figures 9.2 and 9.9.) c

set.seed(1070)
m = 50000
MU = numeric(m); THETA = numeric(m)
THETA[1] = 10

n = 40; x.bar = 123.82; x.var = 153.8407 # data


mu.0 = 0; th.0 = 10^10 # mu priors
alp.0 = .001; kap.0 = .001 # theta priors

for (i in 2:m)
{
th.up = 1/(n/THETA[i-1] + 1/th.0)
mu.up = (n*x.bar/THETA[i-1] + mu.0/th.0)*th.up
MU[i] = rnorm(1, mu.up, sqrt(th.up))

alp.up = n/2 + alp.0


kap.up = kap.0 + ((n-1)*x.var + n*(x.bar - MU[i])^2)/2
THETA[i] = 1/rgamma(1, alp.up, kap.up)
}

# Bayesian point and probability interval estimates


aft.brn = (m/2 + 1):m
mean(MU[aft.brn]) # point estimate of mu
bi.MU = quantile(MU[aft.brn], c(.025,.975)); bi.MU
mean(THETA[aft.brn]) # point estimate of theta
bi.THETA = quantile(THETA[aft.brn], c(.025,.975)); bi.THETA
SIGMA = sqrt(THETA)
mean(SIGMA[aft.brn]) # point estimate of sigma
bi.SIGMA = sqrt(bi.THETA); bi.SIGMA

par(mfrow=c(2,2))
plot(aft.brn, MU[aft.brn], type="l")
plot(aft.brn, SIGMA[aft.brn], type="l")
hist(MU[aft.brn], prob=T); abline(v=bi.MU, col="red")
hist(SIGMA[aft.brn], prob=T); abline(v=bi.SIGMA, col="red")
par(mfrow=c(1,1))

> mean(MU[aft.brn]) # point estimate of mu


[1] 123.8209
> bi.MU = quantile(MU[aft.brn], c(.025,.975)); bi.MU
2.5% 97.5%
119.8250 127.7961
> mean(THETA[aft.brn]) # point estimate of theta
[1] 162.2791
> bi.THETA = quantile(THETA[aft.brn], c(.025,.975)); bi.THETA
2.5% 97.5%
103.0896 255.8621
> SIGMA = sqrt(THETA)
> mean(SIGMA[aft.brn]) # point estimate of sigma
[1] 12.65193
> bi.SIGMA = sqrt(bi.THETA); bi.SIGMA
2.5% 97.5%
10.15330 15.99569

Note: Essentially a true story, but with data simulated from NORM(125, 12) replac-
ing unavailable original data. The most important “prior” of all is to get the model
right.

Errors in Chapter 9
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p221 Example 9.1. Just below displayed equations: The second factor of the second
term in the denominator of PVP γ is (1 − θ), not (1 − η).
The same equation on the previous page is correct, as is the program on p221.

γ = πη/[πη + (1 − π)(1 − θ)]

p223 Example 9.1. Last paragraph: Couil should be Coull.


p230 The last three of the four displayed distributional relationships near the bottom
of the page are incorrect. Correct statements are:
(x̄.. − µ)/√(MS(Batch)/gr) ∼ T(g − 1),
(g − 1)MS(Batch)/(rθA + θ) ∼ CHISQ(g − 1),
g(r − 1)MS(Error)/θ ∼ CHISQ(g(r − 1)),
θ MS(Batch)/[(rθA + θ) MS(Error)] ∼ F(g − 1, g(r − 1)).
p232 First displayed equation: In two places +Σi Ai should be +gθ0. (In the program
on the next page, the corresponding lines for mu.up and th.up are OK.) The
correct equations follow:

µ′ = (µ0 θA + θ0 Σi Ai)/(θA + gθ0) and θ′ = θ0 θA/(θA + gθ0)

p239 Problem 9.5(b). The vertical interval in the last line should be (0.020, 0.022).
p240 Problem 9.7: In the R code, the ylim argument of the hist function should be
ylim=c(0, mx). The correct line of code is:
hist(PI[aft.burn], ylim=c(0, mx), prob=T, col="wheat")
p240 Problem 9.8(b). Add the following sentence:
Use the Agresti-Coull adjustment t0 = (A + 2)/(n + 4).
p245 Problem 9.16. At the beginning of the second line of code, include the statement:
df.Err = g*(r-1);. [Thanks to Leland Burrill.]

Eric A. Suess and Bruce E. Trumbo


Introduction to Probability Simulation and Gibbs Sampling with R: Instructor Manual
Chapter 9

Explanations and answers:


c 2011 by Bruce E. Trumbo and Eric A. Suess. All rights reserved.
°
Statements of problems:
°c 2010 by Springer Science+Business Media, LLC. All rights reserved.
11
Appendix: Getting Started with R

Because of the introductory nature of this Appendix, we present only partial


answers to selected problems—and in a somewhat different format than for the
other chapters. In particular, the statements of the problems are not provided.
We do not include as much explanation as instructors may expect of their
students. However, for clarification, we sometimes include code not in original
problems. Often the extra code is to print out intermediate steps.

11.1
> exp(1) # the standard way to write the constant e
[1] 2.718282
> exp(1)^2; exp(2) # both give the square of e
[1] 7.389056
[1] 7.389056
> log(exp(2)) # ’log’ is log-base-e, inverse of ’exp’
[1] 2

11.2
> numeric(10); rep(0, 10) # two ways to get a vector of ten 0s
[1] 0 0 0 0 0 0 0 0 0 0
[1] 0 0 0 0 0 0 0 0 0 0
> c(0,0,0,0,0,0,0,0,0,0) # a third (tedious) way
[1] 0 0 0 0 0 0 0 0 0 0

> 0:9.5 # increment by units, stop at or before 2nd argument


[1] 0 1 2 3 4 5 6 7 8 9

> -.5:10; seq(-.5, 9.5) # two ways to write the same vector
[1] -0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
[1] -0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5

> 10/2:22; (10/2):22 # note order of operations


[1] 5.0000000 3.3333333 2.5000000 2.0000000 1.6666667
[6] 1.4285714 1.2500000 1.1111111 1.0000000 0.9090909
[11] 0.8333333 0.7692308 0.7142857 0.6666667 0.6250000
[16] 0.5882353 0.5555556 0.5263158 0.5000000 0.4761905
[21] 0.4545455
[1] 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

d A careful choice of the width of the Session window allows the result of 10/2:22
to fit the width of our page. c
11.4
> (1:4)^(0:3) # two vectors of same length
[1] 1 2 9 64
> (1:4)^2 # 4-vector and constant (1-vector)
[1] 1 4 9 16
> (1:2)*(0:3) # 2-vector recycles
[1] 0 2 2 6

> c(1, 2, 1, 2)*(0:3) # same result as above (extra code)


[1] 0 2 2 6

11.5
> x1 = (1:10)/(1:5); x1 # 5-vector recycles
[1] 1.000000 1.000000 1.000000 1.000000 1.000000 6.000000
[7] 3.500000 2.666667 2.250000 2.000000
> x1[8] # eighth element of x1
[1] 2.666667
> x1[8] = pi # eighth elem of x1 changed
> x1[6:8] # change visible
[1] 6.000000 3.500000 3.141593

11.6
> x2 = c(1, 2, 7, 6, 5); cumsum(x2) # same length
[1] 1 3 10 16 21
> diff(cumsum(x2)) # ’diff’ not inverse of ’cumsum’
[1] 2 7 6 5

> c(x2[1], diff(cumsum(x2))) # this reclaims x2 (extra code)


[1] 1 2 7 6 5

> -5:5; unique(-5:5) # all elements of -5:5 differ


[1] -5 -4 -3 -2 -1 0 1 2 3 4 5
[1] -5 -4 -3 -2 -1 0 1 2 3 4 5
> (-5:5)^2; unique((-5:5)^2) # squaring produces duplicates
[1] 25 16 9 4 1 0 1 4 9 16 25
[1] 25 16 9 4 1 0

11.7
> a1 = exp(2); a1 # exact value of e squared


[1] 7.389056
> n = 0:15; a2 = sum(2^n/factorial(n)); a2 # Taylor aprx.
[1] 7.389056
> a1 - a2 # small error of approximation
[1] 3.546512e-09

11.8
> x4 = seq(-1, 1, by= .1); x4 # 21 unique values
[1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0
[12] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> x5 = round(x4); x5 # three unique values
[1] -1 -1 -1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 1
[19] 1 1 1
> unique(x5) # list unique values in x5
[1] -1 0 1
> length(unique(x5)) # count unique values in x5
[1] 3
> x5 == 0 # T for elem. where x5 matches 0
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[10] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[19] FALSE FALSE FALSE
> x5[x5==0] # list 0s in x5
[1] 0 0 0 0 0 0 0 0 0 0 0

> length(x5[x5==0]); sum(x5==0) # two ways to count 0s in x5


[1] 11 # (both extra code)
[1] 11

> x4 == x5 # T if x4 matches x5 (i.e., if x4 is integer)


[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[10] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[19] FALSE FALSE TRUE
> sum(x4==x5) # count integers in x4
[1] 3

> x5 = 0; x5 # (unlike ’x5 == 0’) changes x5 to be constant 0


[1] 0

d Correction to the first printing: In the statement of the problem, the closed interval
for t should be [0, 1] (not [−1, 1]). The related R code requires no change. c

> t = seq(0, 1, len=201); f = 6*(t - t^2) # vectors of equal length


> mx.f = max(f); mx.f # maximum value of f is 1.5
[1] 1.5
> t[f==mx.f] # value of t where f attains max is 0.5
[1] 0.5

> t = seq(0, 1, len=200); f = 6*(t - t^2)


> mx.f = max(f); mx.f # value of t at which f(t)
[1] 1.499962 # attains max of 1.5---not
> t[f==mx.f] # included in new vector t
[1] 0.4974874 0.5025126 # two values of vector t that
# "share" max value of f

d Note: One reader has reported that his installation of R does not give two values
in the last line. Conceivably so. Exact “matching” of values that are equal to many
decimal places is a delicate operation. c
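
d Extra: a sketch of one way to avoid exact floating-point matching when locating a maximizer. c

# which.max returns the index of the (first) maximum, so no equality test is needed
t = seq(0, 1, len=200); f = 6*(t - t^2)
t[which.max(f)]        # the first t at which f attains its maximum (about 0.497 here)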
11.13
a) d Because Γ(n) = (n − 1)!, we evaluate the constant term of f(t) as 4!/(2! 1!) =
24/2 = 12. (Alternatively, see the second line of the program below. Also, see
the answer to part (b).) Then f(t) = 12t²(1 − t) = 12(t² − t³). Perhaps an
implied task, though not explicitly stated, is to do a routine integration to verify that
this density function integrates to unity over (0, 1). We have 12 ∫₀¹ (t² − t³) dt =
12 [t³/3 − t⁴/4]₀¹ = 12(1/3 − 1/4) = 1.
The required plot could be made by changing the second line of the program
in Example 11.5 to read f = 12*(t^2 - t^3) and the main label of the plot
function to say BETA(3, 2). Instead, we show a program below that will plot
most members of the beta family of distributions by changing only the parame-
ters alpha and beta. The last few lines of the program perform a grid search for
the mode of BETA(3, 2) as requested in part (c). (Also see the last few lines of
code in Problem 11.8.)
The main label is made by paste-ing together character strings, separated by
commas. Some of these character strings are included in quotes, others are
made in R by converting numeric constants to character strings that express
their numerical values. The default separator when the strings are concatenated
is a space; here we specify the null string with the argument sep="". We do not
show the resulting, slightly left skewed, plot. c
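
d Extra: a quick numerical check (a sketch) that the BETA(3, 2) density above integrates to 1 over (0, 1). c

# Numerical check of the integration done by hand above
integrate(function(t) 12*(t^2 - t^3), 0, 1)   # should return 1 (with a small error estimate)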

alpha = 3; beta = 2
k = gamma(alpha + beta)/(gamma(alpha)*gamma(beta)); k
t = seq(0, 1, length=200); f = k*t^(alpha-1)*(1 - t)^(beta-1)
m.lab = paste("Density of BETA(", alpha, ", ", beta, ")", sep="")
plot(t, f, type="l", lwd=2, col="blue", ylab="f(t)", main=m.lab)
abline(h=0, col="darkgreen"); abline(v=0, col="darkgreen")

> k
[1] 12

# grid search for mode; see part (c)


max.f = max(f); mode = t[f==max.f]
max.f; mode

> max.f; mode


[1] 1.777744
[1] 0.6683417

b) d We explore briefly the relationship between R functions factorial and gamma.


Specifically, the code below uses a few values to illustrate the general relationship
Γ (n) = (n − 1)!, for positive integers n. In many texts, the factorial function is
used only for integers. But we add a couple of extra lines to show that R allows
noninteger values for its factorial function. We set the width of the R Session
window to put integer values at the beginning of key lines of output. c

alpha = seq(1, 4, by=.2)


gamma(alpha); factorial(alpha - 1)
factorial(0:3)

> gamma(alpha); factorial(alpha - 1)


[1] 1.0000000 0.9181687 0.8872638 0.8935153 0.9313838
[6] 1.0000000 1.1018025 1.2421693 1.4296246 1.6764908
[11] 2.0000000 2.4239655 2.9812064 3.7170239 4.6941742
[16] 6.0000000
[1] 1.0000000 0.9181687 0.8872638 0.8935153 0.9313838
[6] 1.0000000 1.1018025 1.2421693 1.4296246 1.6764908
[11] 2.0000000 2.4239655 2.9812064 3.7170239 4.6941742
[16] 6.0000000
>
> factorial(0:3)
[1] 1 1 2 6

c) d We find the mode of the distribution BETA(3, 2) using differential calculus:


From the answer to part (a), f(t) = 12(t² − t³), so f′(t) = 12(2t − 3t²). Setting
f′(t) = 0 and solving for t, we have the mode t = 2/3, which is in good agreement
with the grid search at the end of the program in the answer to part (a).
(A finer grid would have come even closer to 2/3.) The second derivative is
f″(t) = 12(2 − 6t), and f″(2/3) = 12(2 − 4) = −24, where the negative value
indicates a relative maximum at t = 2/3. We can see from the graph in part (a)
that this is also an absolute maximum. c
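
d Extra: as an alternative to the grid search, R's optimize function can locate the mode numerically (a sketch). c

# Numerical maximization of f(t) = 12(t^2 - t^3) on (0, 1)
optimize(function(t) 12*(t^2 - t^3), interval=c(0, 1), maximum=TRUE)
# $maximum is close to 2/3; $objective is close to f(2/3) = 16/9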

11.19
d We simulate the distribution of the random variable X, which is the number of
Aces obtained when a five-card poker hand is dealt at random (without replacement)
from a standard deck of cards, containing four Aces. We have slightly modified the
program of Example 11.7 to find the exact distribution of X for comparison (see
Problem 11.20). We also plot points on the histogram (not shown here) correspond-
ing to these exact values. Approximate probabilities obtained from two runs of the
program are in very good agreement with the exact probabilities. (Results from the
second run are shown at the very end of the printout below.) c

m = 100000; aces = numeric(m)


for (i in 1:m)
{
h = sample(1:52, 5); aces[i] = sum(h < 5)
}

cut = (0:5) - .5
summary(as.factor(aces))/m # simulated probabilities

# see Problem 11.20


r = 0:4; pdf = choose(4,r)*choose(48,5-r)/choose(52,5)
round(rbind(r, pdf), 4)

hist(aces, breaks=cut, prob=T, col="Wheat",


main="Aces in Poker Hands")
points(r, pdf, pch=19, col="blue")

> summary(as.factor(aces))/m # simulated probabilities


0 1 2 3 4
0.65830 0.29997 0.03996 0.00174 0.00003

> # see Problem 11.20


> r = 0:4; pdf = choose(4,r)*choose(48,5-r)/choose(52,5)
> round(rbind(r, pdf), 4)
[,1] [,2] [,3] [,4] [,5]
r 0.0000 1.0000 2.0000 3.0000 4
pdf 0.6588 0.2995 0.0399 0.0017 0

# simulated probabilities from a second run


> summary(as.factor(aces))/m # simulated probabilities
0 1 2 3 4
0.65894 0.29959 0.03973 0.00172 0.00002

11.20
d The task is to compute the distribution of the random variable X which is the
number of Aces obtained when a five-card poker hand is dealt at random (without
replacement) from a standard deck of cards, containing four Aces. The distribution is
P{X = r} = choose(4, r)·choose(48, 5 − r)/choose(52, 5), for r = 0, 1, 2, 3, 4. The required computation, using the
choose function in R, is shown at the end of the program in the answer to Problem 11.19.
This distribution is hypergeometric, and R has a function dhyper for computing
the distribution somewhat more simply, as illustrated below. Compare the answers
with the exact answers provided in the program for Problem 11.19.

r = 0:4; pdf = dhyper(r, 4, 48, 5)


round(rbind(r, pdf), 5)

> round(rbind(r, pdf), 5)


[,1] [,2] [,3] [,4] [,5]
r 0.00000 1.00000 2.00000 3.00000 4e+00
pdf 0.65884 0.29947 0.03993 0.00174 2e-05

In terms of our present application, the parameters of dhyper are in turn: the number
of Aces seen, the number of Aces in the deck (4), the number of non-Aces in the
deck (48), and the number of cards drawn (5). c

Multiple Choice Quiz. Students not familiar with R need to start imme-
diately learning to use R as a calculator. With computers available, this quiz
might be used in class or as a very easy initial take-home assignment. Even
without computers, if the few questions with “tricky” answers are eliminated,
it might work as an in-class quiz.
Instructions: Mark the one best answer for each question.
If answer (e) is not specified, it stands for an alternative
answer such as "None of the above," "Cannot be determined
from information provided," "Gives error message," and so on.

1. The expression 2 + 5^2 - 4/2 returns:


(a) 11.25 (b) 25 (c) 47 (d) 10

2. The expression 2/5^2 - 4^2 returns:


(a) -15.92 (b) -15.84 (c) 0.2222222 (d) 15.3664

3. The function tan(pi/4) returns:


(a) 0 (b) 1/2 (c) 1 (d) 1.414214

4. The number of elements in the vector 0:9.5 is:


(a) 1 (b) 2 (c) 9 (d) 10 (e) 11

5. The number of elements in the vector seq(1, 2, by=.1) is:


(a) 2 (b) 3 (c) 9 (d) 10 (e) 11

6. The number of elements in the vector rep(1:3, each=2) is:


(a) 2 (b) 3 (c) 4 (d) 6 (e) 9

In problems 7-10 begin with: a = 3; x1 = -1:2; x2 = rep(1:0, 2)

7. The function sum(a * x1) returns:


(a) 0 (b) -3 (c) 3 (d) 6 (e) 9

8. The function prod(x1^a) returns:


(a) -1 (b) 0 (c) 3 (d) 14

9. The expression (x1/x2)[2] returns:


(a) NaN (b) Inf (c) -1 (d) 1

10. The function prod(x1^x2) returns:


(a) NaN (b) -1 (c) 0 (d) 1

In problems 11-14, begin with: x = 1:10; f = x*(10 - x)

11. The expression f[3] returns:


(a) 0 (b) 1 (c) 9 (d) 16 (e) 21

12. The function length(unique(f)) returns:


(a) 5 (b) 6 (c) 7 (d) 9 (e) 10

13. The expression (f^(1:2))[3] returns:


(a) 9 (b) 21 (c) 25 (d) 256 (e) NA

14. The function prod(f) returns:


(a) 0 (b) 1 (c) 9 (d) 16 (e) 165

15. The function sum(cumsum(1:3)) returns:


(a) 1 (b) 3 (c) 10 (d) 12

16. The expression cumsum(1:3)[3] returns:


(a) 1 (b) 2 (c) 3 (d) 6 (e) 10

17. The function sum(1:3 == 3:1) returns:


(a) 0 (b) 1 (c) 2 (d) 3 (e) TRUE

18. The function sum(1:3 < 3:1) returns:


(a) 0 (b) 1 (c) 2 (d) 3

19. The function mean(1:3) returns:


(a) 0 (b) 1 (c) 2 (d) 2.5 (e) 3

20. The function sd(1:3) returns:


(a) 0 (b) 1 (c) 2 (d) 2.5 (e) 3

21. The function log(exp(10)^2) returns:


(a) 1 (b) 2 (c) 10 (d) 20 (e) 100

22. The function log(exp(10^2)) returns:


(a) 1 (b) 2 (c) 10 (d) 20 (e) 100

In Problems 23-25, begin with: x = numeric(10); x[1:4] = 1:2

23. The function length(unique(x)) returns:


(a) 1 (b) 2 (c) 3 (d) 6 (e) 10

24. The function sum(x) returns:


(a) 1 (b) 2 (c) 3 (d) 6 (e) 10

25. The function sum(x==0) returns:


(a) 1 (b) 2 (c) 3 (d) 6 (e) 10
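
With a computer at hand, answers can be verified by pasting the expressions into an R session; a brief sketch:

# A few examples of checking quiz expressions at the console
2 + 5^2 - 4/2                                    # question 1
length(seq(1, 2, by=.1))                         # question 5
a = 3; x1 = -1:2; x2 = rep(1:0, 2); sum(a * x1)  # question 7
x = 1:10; f = x*(10 - x); f[3]                   # question 11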

Errors in Chapter 11
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p281 Section 11.2.2. In the R code above the problems: Use vector v5 instead of w5
in both instances. [Thanks to Wenqi Zheng.] The correct line is:
> v5; v5[9]
p285 Problem 11.8. The closed interval should be [0, 1], not [−1, 1]. The related R
code is correct. [Thanks to Tony Tran.]

Eric A. Suess and Bruce E. Trumbo


Introduction to Probability Simulation and Gibbs Sampling with R: Instructor Manual
Chapter 11

Explanations and answers:


c 2011 by Bruce E. Trumbo and Eric A. Suess. All rights reserved.
°
Statements of problems:
°c 2010 by Springer Science+Business Media, LLC. All rights reserved.

Introduction to Probability Simulation and


Gibbs Sampling with R
Eric A. Suess and Bruce E. Trumbo
Springer 2010

Compilation of Errors in all Chapters


Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in the Instructor Manual have
been corrected. This list includes corrections compiled through July 13, 2011.
p12 Example 1.5. Just below printout π = 0.80 (not 0.30) in two places. Also, in the
displayed equation for P (Cover), the term P {X = 19}, with value 0.0161, needs
to be added to the beginning of the sum. [Thanks to Tarek Dib.] The correct
display is:
P (Cover) = P {X = 19} + P {X = 20} + P {X = 21} + · · · + P {X = 27}
= 0.0160 + 0.0355 + 0.0676 + · · · + 0.0785 = 0.9463.
p18 Problem 1.11(a). Code pbinom(x, 25, 0.3) should be pbinom(x, 30, 0.4).
p26 Example 2.1. In the first line below Figure 2.1: ri = 21 should be r1 = 21.
[Thanks to Jeff Glickman.]
p74 Problem 3.7. Hint (a): R code h = 12*g^3*(1-g) should be h = 12*g^2*(1-g).
p76 Problem 3.11(b). Should refer to Figure 3.10 (on the next page), not Figure 3.2.
[Thanks to Leland Burrill.]
p84 Problem 3.27(c). The probability should be P {Z1 > 0, Z2 > 0, Z1 + Z2 < 1}.
That is, the event should be restricted to the first quadrant.
p116 Problem 4.26. In the third line inside the loop of the program: The right paren-
thesis should immediately follow repl=T, not the comment. The correct line
reads:
re.x = sample(x, B*n, repl=T) # resample from it
p128 Example 5.3. In the second paragraph, change to: ...the value observed is a
conditional random variable X|{S = s} ∼ NORM(s, 1).
p133 Problem 5.6. In the second paragraph, three instances of 2.5 should be 1.7. (For
clarity, in the second printing, the first two paragraphs of the problem are to be
revised as shown in this Manual.)
p148 Example 6.2. In the second line below printout, the transition probability should
be p01 (4) ≈ 0.67, not 0.69. [Thanks to Leland Burrill.]
p153 Example 6.6. In the displayed equation, the lower-right entry in first matrix
should be 0.99, not 0.00. [Thanks to Tony Tran.] The correct display is as
follows:
P = | 0.97  0.03 | | 0.9998  0.0002 |   | 0.9877  0.0123 |
    | 0.01  0.99 | | 0.5976  0.4024 | = | 0.6016  0.3984 |

p155 Problem 6.5(e). The displayed equation should have ’mod 5’; consequently, the
points should run from 0 through 4, and 0 should be adjacent to 4. The answer
for part (e) should say: “The X-process is not Markov.” The correct statement
of part (e) is as follows:
11 Answers to Problems: Appendix 265

e) At each step n > 1, a fair coin is tossed, and Un takes the value −1
if the coin shows Tails and 1 if it shows Heads. Starting with V1 = 0,
the value of Vn for n > 1 is determined by

Vn = Vn−1 + Un (mod 5).

The process Vn is sometimes called a “random walk” on the points


0, 1, 2, 3 and 4, arranged around a circle (with 0 adjacent to 4). Finally,
Xn = 0, if Vn = 0; otherwise Xn = 1.
p183 Problem 7.4 In the program, the first statement after inner loop should read
a[j] = a - 1 (not a). The correct code is shown in this Manual. This error
in the program makes a small difference in the histogram of Figure 7.14 (most
notably, the first bar there is a little too short). A corrected figure is scheduled
for the second printing; you will see it if you work the problem.
p208 Problem 8.3(c). In two lines of the inner loop of the program code, the loop
indices i and j should be reversed, to have alpha[i] and beta[j]. As a result
of this error, values of alpha and beta inside parentheses are reversed in captions
in Figure 8.6. [A corrected figure is scheduled for 2nd printing.] The correct inner
loop is shown below and in Problem 8.3(c) of this Manual.

for (j in 1:5) {
top = .2 + 1.2 * max(dbeta(c(.05, .2, .5, .8, .95),
alpha[i], beta[j]))
plot(x,dbeta(x, alpha[i], beta[j]),
type="l", ylim=c(0, top), xlab="", ylab="",
main=paste("BETA(",alpha[i],",", beta[j],")", sep="")) }

p214 Problem 8.8(c). The second R statement should be qgamma(.975, t+1, n), not
gamma(.975, t+1, n).
p221 Example 9.1. Just below displayed equations: The second factor of the second
term in the denominator of PVP γ is (1 − θ), not (1 − η).
The same equation on the previous page is correct, as is the program on p221.

γ = πη/[πη + (1 − π)(1 − θ)]

p223 Example 9.1. Last paragraph: Couil should be Coull.


p230 The last three of the four displayed distributional relationships near the bottom
of the page are incorrect. Correct statements are:
(x̄.. − µ)/√(MS(Batch)/gr) ∼ T(g − 1),
(g − 1)MS(Batch)/(rθA + θ) ∼ CHISQ(g − 1),
g(r − 1)MS(Error)/θ ∼ CHISQ(g(r − 1)),
θ MS(Batch)/[(rθA + θ) MS(Error)] ∼ F(g − 1, g(r − 1)).
p232 First displayed equation: In two places +Σi Ai should be +gθ0. (In the program
on the next page, the corresponding lines for mu.up and th.up are OK.) The
correct equations follow:

µ′ = (µ0 θA + θ0 Σi Ai)/(θA + gθ0) and θ′ = θ0 θA/(θA + gθ0)

p239 Problem 9.5(b). The vertical interval in the last line should be (0.020, 0.022).
p240 Problem 9.7: In the R code, the ylim argument of the hist function should be
ylim=c(0, mx). The correct line of code is:
hist(PI[aft.burn], ylim=c(0, mx), prob=T, col="wheat")
p240 Problem 9.8(b). Add the following sentence:
Use the Agresti-Coull adjustment t0 = (A + 2)/(n + 4).
p245 Problem 9.16. At the beginning of the second line of code, include the statement:
df.Err = g*(r-1);. [Thanks to Leland Burrill.]
p245 Problem 9.18. The summary data printed by the program is usable, but does not
correspond to seed 1237. [Figure 9.6 (p231) illustrates the data for seed 1237.]
The correct summary data are shown with the problem in this Manual.
p246 Problem 9.20(b). Notation for the prior on σ should be IG(β0 = 35, λ0 = 0.25)
to match the code in the program of Example 9.3.
p281 Section 11.2.2. In the R code above the problems: Use vector v5 instead of w5
in both instances. [Thanks to Wenqi Zheng.] The correct line is:
> v5; v5[9]
p285 Problem 11.8. The closed interval should be [0, 1], not [−1, 1]. The related R
code is correct. [Thanks to Tony Tran.]
