03 - CT3S Introduction To Probability Simulation and Gibbs Sampling With R Solutions
Trumbo
Introduction to Probability
Simulation and Gibbs Sampling
with R
Instructor Manual
Statements of problems:
© 2010 by Springer Science+Business Media, LLC
All rights reserved.
Springer
Berlin Heidelberg New York
Hong Kong London
Milan Paris Tokyo
Contents
1 Introductory Examples
1.1 Based on Example 1.1, this problem provides some practice using R.
a) Start with set.seed(1). Then execute the function sample(1:100, 5)
five times and report the number of good chips (numbered ≤ 90) in each.
> set.seed(1) # For counts of good chips, see part (b).
> sample(1:100, 5)
[1] 27 37 57 89 20
> sample(1:100, 5)
[1] 90 94 65 62 6
> sample(1:100, 5)
[1] 21 18 68 38 74
> sample(1:100, 5)
[1] 50 72 98 37 75
> sample(1:100, 5)
[1] 94 22 64 13 26
d By setting the same seed as in part (a), we ensure that sample generates the
same sequence of samples of size five as in part (a). In each of the five runs, the
logical vector sample(1:100, 5) <= 90 has five elements—each of them either TRUE
or FALSE. When this vector is summed, TRUEs count as 1s and FALSEs count as 0s.
Thus, each of the five responses here counts the number of sampled values that are
less than or equal to 90. For example, in the second sample 94 exceeds 90, so only
four of the five results meet the criterion. c
c) Which two of the following four samples could not have been produced
using the function sample(1:90, 5)? Why not?
[1] 2 62 84 68 60 # OK
[1] 46 39 84 16 39 # No, two 39s (sampling without replacement)
[1] 43 20 79 32 84 # OK
[1] 68 2 98 20 50 # No, 98 exceeds 90
> pick = c(4, 47, 82, 21, 92) # defines ’pick’ (no output)
> pick <= 90 # logical 5-vector
[1] TRUE TRUE TRUE TRUE FALSE
# 4 <= 90, so 1st element ’TRUE’
> sum(pick <= 90) # four ’TRUE’s on line above
[1] 4
> pick[1:90] # only five elements in ’pick’
[1] 4 47 82 21 92 NA NA NA NA NA NA NA NA NA NA
[16] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[31] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[46] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[61] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# ’NA’s for undefined entries
> pick[pick <= 90] # elements of ’pick’ <= 90
[1] 4 47 82 21
> length(pick[pick <= 90]) # length of vector above is 4
[1] 4
> as.numeric(pick <= 90) # views ’pick <= 90’ as numeric
[1] 1 1 1 1 0 # 1 for ’TRUE’; 0 for ’FALSE’
> y = numeric(5); y # vector of five 0s
[1] 0 0 0 0 0
> y[1] = 10; y # change first element to 10
[1] 10 0 0 0 0
> w = c(1:5, 1:5, 1:10) # defines ’w’ (no output)
> mean(w) # average of twenty numbers
[1] 4.25
> mean(w >= 5) # fraction of them >= 5
[1] 0.4
b) In the program, propose a substitute for the second line of code within
the loop so that good[i] is evaluated in terms of length instead of sum.
Run the resulting program, and report your result.
d For good[i] = sum(pick <= 90), use good[i] = length(pick[pick <= 90]).
With this substitution and still using set.seed(1937), you should get exactly the
same answer as with the original program. c
d) Execute sum((0:5)*dhyper(0:5, 90, 10, 5)). How many terms are being
summed? What numerical result is returned? What is its connection
with part (c)?
> sum((0:5)*dhyper(0:5, 90, 10, 5)) # sums 6 terms to compute E(X)
[1] 4.5
Notes: If n items are drawn at random without replacement from a box with b Bad
items, g Good items, and T = g + b, then E(X) = ng/T and
V(X) = n(g/T)(1 − g/T)(T − n)/(T − 1).
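As a quick numerical check of these formulas (a small addition; the chip-sampling values g = 90, b = 10, n = 5 are taken from the example above):
sum((0:5)*dhyper(0:5, 90, 10, 5))              # E(X) = 4.5, as in part (d)
sum((0:5)^2*dhyper(0:5, 90, 10, 5)) - 4.5^2    # V(X), about 0.4318
5*(90/100)*(1 - 90/100)*(100 - 5)/(100 - 1)    # the formula above gives the same value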
1.4 Based on concepts in the program of Example 1.3, this problem provides
practice using functions in R.
a) Execute the statements shown below in the order given. Explain what
each statement does. State the length of each vector. Which vectors are
numeric and which are logical?
a = c(5, 6, 7, 6, 8, 7); length(a); unique(a)
length(unique(a)); length(a) - length(unique(a))
duplicated(a); length(duplicated(a)); sum(duplicated(a))
1.5 Item matching. There are ten letters and ten envelopes, a proper one
for each letter. A very tired administrative assistant puts the letters into the
envelopes at random. We seek the probability that no letter is put into its
proper envelope and the expected number of letters put into their proper
envelopes. Explain, statement by statement, how the program below approximates
these quantities by simulation. Run the program with n = 10, then
with n = 5, and again with n = 20. Report and compare the results.
m = 100000; n = 10; x = numeric(m)
for (i in 1:m) {perm = sample(1:n, n); x[i] = sum(1:n==perm)}
cutp = (-1:n) + .5; hist(x, breaks=cutp, prob=T)
mean(x == 0); mean(x); sd(x)
d Above we show the result of one run for n = 10, omitting the histogram and providing
the approximate PDF of X instead. We did not set a seed for the simulation,
so your answers will differ slightly. For any n > 1, exact values are E(X) = V(X) = 1,
and values approximated by simulation are close to 1. See the Notes below. c
Notes: Let X be the number correct. For n envelopes, a combinatorial argument gives
P {X = 0} = 1/2! − 1/3! + 1/4! − · · · + (−1)^n/n!. [See Feller (1957) or Ross (1997).]
In R, i = 0:10; sum((-1)^i/factorial(i)). For any n > 1, P {X = n − 1} = 0,
P {X > n} = 0, E(X) = 1, and V (X) = 1. For large n, X is approximately POIS(1).
Even for n as small as 10, this approximation is good to two places; to verify this,
run the program above, followed by points(0:10, dpois(0:10, 1)).
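A quick numerical comparison of the exact value with the Poisson approximation (our addition, using the code suggested in the Notes):
i = 0:10; sum((-1)^i/factorial(i))   # exact P{X = 0} for n = 10
dpois(0, 1)                          # POIS(1) approximation exp(-1); nearly identical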
1.6 A poker hand consists of five cards selected at random from a deck of
52 cards. (There are four Aces in the deck.)
a) Use combinatorial methods to express the probability that a poker hand
has no Aces. Use R to find the numerical answer correct to five places.
> choose(4, 0)*choose(48, 5)/choose(52, 5)
[1] 0.658842
> dhyper(0, 4, 48, 5)
[1] 0.658842
1.7 Martian birthdays. In his science fiction trilogy on the human colonization
of Mars, Kim Stanley Robinson (1996) arranges the 669 Martian days of
the Martian year into 24 months with distinct names. Imagine a time when
the Martian population consists entirely of people born on Mars and that
birthdays in the Martian-born population are uniformly distributed across
the year. Make a plot for Martians similar to Figure 1.2. (You do not need
to change n.) Use your plot to guess how many Martians there must be in a
room in order for the probability of a birthday match just barely to exceed 1/2.
Then find the exact number with min(n[p > 1/2]).
n = 1:60; p = numeric(60)
for (i in n)
{
q = prod(1 - (0:(i-1))/669); p[i] = 1 - q
}
plot(n, p) # plot of p against n (not shown)
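The final step suggested in the problem statement can then be run on the vector p just computed (a small addition; the result assumes birthdays uniform over the 669 Martian days):
min(n[p > 1/2])   # smallest n with match probability above 1/2 (31, under the uniform assumption)
p[30:31]          # the probabilities for 30 and 31 Martians bracket 1/2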
The days of the year are reordered for convenience. For example, February
29 appears last in our list, with a rate that reflects its occurrence only one
year in four. Before using it, R scales this vector so that its entries add to 1.
To simulate the distribution of birthday matches based on these birthrates,
we need to make only two changes in the program of Example 1.3. First,
insert the line above before the loop. Second, replace the first line within
the loop by b = sample(1:366, 25, repl=T, prob=p). Then run the modified
program, and compare your results with those obtained in the example
[P {X = 0} = 0.4313 and E(X) ≈ 0.81].
m = 100000; n = 25; x = numeric(m)
p = c(rep( 96,61), rep( 98,89), rep( 99,62),
rep(100,61), rep(104,62), rep(106,30), 25)
for (i in 1:m)
{
b = sample(1:366, n, repl=T, prob=p)
x[i] = n - length(unique(b))
}
cutp = (0:(max(x)+1)) - .5
hist(x, breaks=cutp, prob=T) # histogram (not shown)
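To make the comparison requested in the problem, the summary lines from Example 1.3 can be appended to the modified program (our addition; the values printed will vary from run to run):
mean(x == 0); mean(x); sd(x)   # compare with P{X = 0} = 0.4313 and E(X) about 0.81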
Answers: (a) sum(p[1:65])/sum(p). Why? (b) Roughly 0.67 and 1.0, respectively.
1.10 Three problems are posed about a die that is rolled repeatedly. In
each case, let X be the number of different faces seen in the specified number
of rolls. Using at least m = 100 000 iterations, approximate P {X = 1},
P {X = 6}, and E(X) by simulation. To do this write a program using the
one in Example 1.3 as a rough guide. In what way might some of your simulated
results be considered unsatisfactory? To verify that your program is
working correctly, you should be able to find exact values for some, but not
all, of the quantities by combinatorial methods.
a) The die is fair and it is rolled 6 times.
m = 100000; x = numeric(m); n = 6
for (i in 1:m) x[i] = length(unique(sample(1:6, n, repl=T)))
mean(x); table(x)/m
c) The die is biased and it is rolled 6 times. The bias of the die is such that
2, 3, 4, and 5 are equally likely but 1 and 6 are each twice as likely as 2.
m = 100000; x = numeric(m); n = 6; p = c(2, 1, 1, 1, 1, 2)
for (i in 1:m) x[i] = length(unique(sample(1:6, n, repl=T, prob=p)))
mean(x); table(x)/m
Answers: P {X = 1} = 1/6⁵ in (a); the approximation has small absolute error, but
perhaps large percentage error. P {X = 6} = 6!/6⁶ = 5/324 in (a), and P {X = 6} =
45/4096 ≈ 0.0110 in (c).
1.11 Suppose 40% of the employees in a very large corporation are women.
If a random sample of 30 employees is chosen from the corporation, let X be
the number of women in the sample.
a) For a specific x, the R function pbinom(x, 30, 0.4) computes P {X ≤ x}.
Use it to evaluate P {X ≤ 17}, P {X ≤ 6}, and hence P {7 ≤ X ≤ 17}.
> p1 = pbinom(17, 30, 0.4); p2 = pbinom( 6, 30, 0.4)
> p1; p2
[1] 0.9787601
[1] 0.01718302
> p1 - p2 # X in closed interval [7, 17]
[1] 0.9615771
> diff(pbinom(c(6, 17), 30, 0.4)) # all in one line
[1] 0.9615771
> sum(dbinom(7:17, 30, 0.4)) # another method:
[1] 0.9615771 # summing 11 probabilities
the code diff(pnorm(c(6.5, 17.5), 12, sqrt(7.2))), which also returns 0.9596.
Here, the second and third arguments of pnorm designate the mean and standard
deviation of NORM(µ, σ).
The use of half-integer endpoints is called the continuity correction, appropriate
because the binomial distribution is discrete (taking only integer values)
whereas the normal distribution is continuous. By using diff(pnorm(c(7, 17),
12, sqrt(7.2))), we would obtain 0.9376, losing roughly half of each of the probability
values P {X = 7} and P {X = 17}. The exact value is 0.9615 (rounded to four
places), from diff(pbinom(c(6,17), 30, .4)). c
1.12 Refer to Example 1.4 and Figure 1.5 on the experiment with a die.
a) Use formula (1.2) to verify the numerical values of the confidence intervals
explicitly mentioned in the example (for students 2, 10, and 12).
n = 30; pp = 1/6; pm = c(-1,1)
> x = 1; p = x/n; round(p + pm*1.96*sqrt(p*(1-p)/n), 3)
[1] -0.031 0.098 # Students 2 & 12
> x =10; p = x/n; round(p + pm*1.96*sqrt(p*(1-p)/n), 3)
[1] 0.165 0.502 # Student 10
c) The most likely number of 6s in 20 rolls of a fair die is three. To verify this,
first use i = 0:20; b = dbinom(i, 20, 1/6), and then i[b==max(b)]
or round(cbind(i, b), 6). How many of the 20 students got five 6s?
d Answer: 4. c
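The verification suggested in the problem statement can be run directly; a minimal version:
i = 0:20; b = dbinom(i, 20, 1/6)
i[b == max(b)]        # returns 3, the most likely number of 6s in 20 rolls of a fair die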
b) Alternatively, after running the program, you could evaluate the margin
of error as 1.96*sqrt(var(good==5)/m). Why is this method essentially
the same as in part (a)? (Ignore the difference between dividing by m and
m − 1. Also, for a logical vector g, notice that sum(g) equals sum(g^2).)
d To begin, recall that the variance of a sample Y1, . . . , Yn is defined as
s² = Σ(Yi − Ȳ)²/(n − 1) = [ΣYi² − (ΣYi)²/n]/(n − 1),
where the sums run over i = 1, . . . , n, and where the expression at the right is often
used in computation.
Denote the logical vector (good==5) as g, which we take to have numerical values
0 and 1. Then var(g) is the same as (sum(g^2) - sum(g)^2/m)/(m-1). For 0–1
data, ΣYi = ΣYi², so this reduces to (sum(g)/(m-1))*(1 - sum(g)/m).
For huge m, this is essentially mean(g)*(1 - mean(g)). But mean(g) is the sample
proportion p of instances where we see five good chips. So the argument of the square
root is essentially p(1 − p)/m. Here is a numerical demonstration with the program
of Example 1.1. c
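A minimal stand-in for that demonstration, patterned after the chip-sampling loop of Problem 1.1 (the loop layout and m = 100 000 iterations are our choices; the seed follows part (b)):
set.seed(1937)
m = 100000; good = numeric(m)
for (i in 1:m) {pick = sample(1:100, 5); good[i] = sum(pick <= 90)}
g = (good == 5)                      # logical: exactly five good chips
var(g); mean(g)*(1 - mean(g))        # nearly identical for large m
1.96*sqrt(var(g)/m)                  # margin of error, essentially 1.96*sqrt(p(1-p)/m)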
1.14 Modify the R program of Example 1.5 to verify that the coverage
probability corresponding to n = 30 and π = 0.79 is 0.8876. Also, for n = 30,
find the coverage probabilities for π = 1/6 = 0.167, 0.700, and 0.699. Then find
coverage probabilities for five additional values of π of your choice. From this
limited evidence, which appears to be more common—coverage probabilities
below 95% or above 95%? In Example 1.4, the probability of getting a 6 is
π = 1/6, and 18 of 20 confidence intervals covered π. Is this better, worse, or
about the same as should be expected?
d The code below permits confidence levels other than 100(1 − α)% = 95%, using κ
which cuts probability α/2 from the upper tail of NORM(0, 1). (You might want to
explore intervals with target confidence 90% or 99%.) Also, the changeable quantities
have been put into the first line of the program.
After the first run with n = 30 and π = 0.79, we show only the first line and
the true coverage probability. Most of the few instances here show true coverage
probabilities below 95%, a preliminary impression validated by Figure 1.6.
The last run, with n = 20 and π = 1/6, answers the final question above:
We obtain coverage probability 0.8583 < 18/20 = 0.9. So the number of intervals
covering π = 1/6 in Example 1.4 is about what one should realistically expect. c
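A sketch of the exact-coverage calculation such a program performs, in the spirit of Example 1.5 (variable names and layout are our choices):
n = 30; pp = 0.79; alpha = 0.05; kappa = qnorm(1 - alpha/2)
x = 0:n; sp = x/n                                 # all possible sample proportions
m.err = kappa*sqrt(sp*(1 - sp)/n)
cover = (pp >= sp - m.err) & (pp <= sp + m.err)   # intervals that cover pp
sum(dbinom(x[cover], n, pp))                      # coverage probability; the problem quotes 0.8876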
1.16 In the R program of Example 1.6, set adj = 2 and leave n = 30. This
adjustment implements the Agresti-Coull type of 95% confidence interval. The
formula is similar to (1.2), except that one begins by “adding two successes
and two failures” to the data. [Example: If we see 20 Successes in 30 trials,
the 95% Agresti-Coull interval is centered at 22/34 = 0.6471 with margin of
error 1.96√[(22)(12)/34³] = 0.1606, and the interval is (0.4864, 0.8077).]
Run the modified program, and compare your plot with Figures 1.6 (p12)
and 1.8 (p21). For what values of π are such intervals too “conservative”—too
long and with coverage probabilities far above 95%? Also make plots for 90%
and 99% and comment. (See Problem 1.17 for more on this type of interval.)
d The explicit formula for Agresti-Coull 95% confidence intervals is provided in Problem
1.17(c) and the related Comment. These intervals are based on the approximation
κ = 2 ≈ 1.96, so they are most accurate at the 95% confidence level. Also,
they tend to be unnecessarily long for some values of π near 0 or 1. The program of
Example 1.6 requires only the minor change to use adj = 2. So we do not repeat the
program here. Figure 1.8 (p21 of the text) shows the resulting coverage probabilities
for a 95% CI when n = 30. c
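As a quick check of the numbers in the bracketed example of the problem statement (20 Successes in 30 trials; our addition):
x = 20; n = 30; p.ac = (x + 2)/(n + 4); p.ac        # 22/34 = 0.6471
me = 1.96*sqrt(p.ac*(1 - p.ac)/(n + 4)); me         # 0.1606
p.ac + c(-1, 1)*me                                  # (0.4864, 0.8077)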
|p − π0| / √(π0(1 − π0)/n) < 1.96,
and the traditional 95% confidence interval p ± 1.96√(p(1 − p)/n). The task in part (b)
is to invert the test for the binomial success probability π to make a dual confidence
interval, called the Wilson interval.
We do not show the (routine, but admittedly somewhat tedious) algebra suggested
in the Hint to establish a general formula for the endpoints of the Wilson
interval. However, we include a demonstration in R for a particular case. We search
for values of π (pp in the code) that satisfy the criterion for accepting the null
hypothesis, and then we show they agree with the values of π in the Wilson interval.
We also show that the Agresti-Coull confidence interval in part (c) is only a little
longer than the Wilson interval for the values in our particular case. Of course, you
can change n, x, and α in this program to obtain analogous results for other cases. c
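A minimal version of the search described above, for the particular case n = 30 and x = 20 (the grid of candidate values and the case itself are our choices):
n = 30; x = 20; p = x/n; kappa = 1.96
pp = seq(0.001, 0.999, by=0.001)                       # candidate values of pi
accept = abs(p - pp)/sqrt(pp*(1 - pp)/n) < kappa       # null hypotheses not rejected
range(pp[accept])                                      # approximates the Wilson interval
p.til = (x + kappa^2/2)/(n + kappa^2)                  # Wilson center
me = kappa*sqrt(n*p*(1 - p) + kappa^2/4)/(n + kappa^2) # Wilson margin of error
p.til + c(-1, 1)*me                                    # endpoints agree with the search
p.ac = (x + 2)/(n + 4)
2*1.96*sqrt(p.ac*(1 - p.ac)/(n + 4)); 2*me             # Agresti-Coull length is slightly longer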
Hints and comments: (b) Square and use the quadratic formula to solve for π. When
1 − α = 95%, one often uses κ = 1.96 ≈ 2 and thus p̃ ≈ (X + 2)/(n + 4). (c) The difference
between E and E* is of little practical importance unless p̃ is near 0 or 1. For a
more extensive discussion, see Brown et al. (2001).
1.18 For a discrete random variable X, the expected value (if it exists)
is defined as µ = E(X) = Σ k·P{X = k}, where the sum is taken over all
possible values of k. Also, if X takes only nonnegative integer values, then one
can show that µ = Σ P{X > k}, summing over k = 0, 1, 2, . . . . In particular,
if X ∼ BINOM(n, π), then one can show that µ = E(X) = nπ.
Also, the mode (if it exists) of a discrete random variable X is defined
as the unique value k such that P{X = k} is greatest. In particular, if X is
binomial, then one can show that its mode is ⌊(n + 1)π⌋; that is, the greatest
integer in (n + 1)π. Except that if (n + 1)π is an integer, then there is a “double
mode”: values k = (n + 1)π and (n + 1)π − 1 have the same probability.
Run the following program for n = 6 and π = 1/5 (as shown); for n = 7
and π = 1/2; and for n = 18 and π = 1/3. Explain the code and interpret the
answers in terms of the facts stated above about binomial random variables. (If
necessary, use ?dbinom to get explanations of dbinom, pbinom, and rbinom.)
d Names of functions for the binomial distribution end in binom. Suppose the random
variable X ∼ BINOM(n, π). (Because pi is a reserved word in R, we use pp for π in
R code.)
• First letter d stands for the probability distribution function (PDF), so that
dbinom(k, n, pp) is P {X = k}.
• First letter p stands for the cumulative distribution function (CDF), so that
pbinom(k, n, pp) is P {X ≤ k}.
• First letter r stands for random sampling, so that rbinom(m, n, pp) generates
m independent observations from the distribution BINOM(n, π).
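The program referred to in the problem statement is reconstructed below in minimal form, shown with n = 7 and π = 1/2 so that it matches the output fragment that follows (the layout is our guess):
n = 7; pp = 1/2                      # also run with n = 6, pp = 1/5 and n = 18, pp = 1/3
k = 0:n; pdf = dbinom(k, n, pp); cdf = pbinom(k, n, pp)
round(cbind(k, pdf, cdf), 4)         # PDF and CDF side by side
mean(rbinom(100000, n, pp))          # simulated mean; close to E(X) = n*pp
k[pdf == max(pdf)]                   # mode(s)
floor((n + 1)*pp)                    # greatest integer in (n+1)*pp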
> k[pdf==max(pdf)]
[1] 3 4
> floor((n+1)*pp)
[1] 4
1.19 Average lengths of confidence intervals. Problem 1.16 shows that, for
most values of π, Agresti-Coull confidence intervals have better coverage probabilities
than do traditional intervals based on formula (1.2). It is only reasonable
to wonder whether this improved coverage comes at the expense of
greater average length. For given n and π, the length of a confidence interval
is a random variable because the margin of error depends on the number of
Successes observed. The program below illustrates the computation and finds
the expected length.
n = 30; pp = .2 # binomial parameters
alpha = .05; kappa = qnorm(1-alpha/2) # level is 1 - alpha
adj = 0 # 0 for traditional; 2 for Agresti-Coull
x = 0:n; sp = (x + adj)/(n + 2*adj)
CI.len = 2*kappa*sqrt(sp*(1 - sp)/(n + 2*adj))
Prob = dbinom(x, n, pp); Prod = CI.len*Prob
round(cbind(x, CI.len, Prob, Prod), 4) # displays computation
sum(Prod) # expected length
a) Explain each statement in this program, and state the length of each
named vector. (Consider a constant as a vector of length 1.)
d Objects in the first three lines each have 1 element; objects in the next three lines
each have n + 1 = 31 elements. The statement cbind binds together four column
vectors of length 31 to make a 31 × 4 matrix, the elements of which are rounded to
four places. c
c) Figure 1.9 was made by looping through about 200 values of π. Use it to
verify your answers in part (b). Compare the lengths of the two kinds of
confidence intervals and explain.
d Figure 1.9 shows that the traditional interval is shorter for values of π near 0 or 1—
roughly speaking, for π outside the interval (0.2, 0.8). See also the caption of the
figure and the answers to part (b). Thus the Agresti-Coull intervals tend to be longer
for the values of π where the traditional intervals tend to have less than the nominal
probability of coverage. (Perhaps they are even needlessly conservative for values
very near 0 and 1; in the answer to Problem 17(c) we noted that this is precisely
where the Wilson intervals would be shorter.) By contrast, the Agresti-Coull inter-
vals are shorter than traditional ones for values of π nearer 1/2. Generally speaking,
longer intervals provide better coverage. The “bottom line” is that the Agresti-Coull
intervals do not attain better coverage just by increasing interval length across all
values of π. c
d) Write a program to make a plot similar to Figure 1.9. Use the program
of Example 1.5 as a rough guide to the structure. You can use plot for
the first curve and lines to overlay the second curve.
n = 30; alpha = .05; kappa = qnorm(1-alpha/2)
# Traditional
adj = 0; x = 0:n; phat = (x + adj)/(n + 2*adj)
m.err = kappa*sqrt(phat*(1 - phat)/(n + 2*adj))
ci.len = 2*m.err # n + 1 possible lengths of CIs
m = 200; pp = seq(0,1, length=m); avg.ci.len = numeric(m)
for (i in 1:m) avg.ci.len[i] = sum(ci.len*dbinom(x, n, pp[i]))
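The remaining steps, plotting the traditional curve and overlaying the Agresti-Coull curve, are sketched below (plot labels and limits are our choices):
plot(pp, avg.ci.len, type="l", ylim=c(0, 0.4),
     xlab="pi", ylab="Average Length of CI")         # traditional intervals
adj = 2; phat = (x + adj)/(n + 2*adj)                # now the Agresti-Coull lengths
ci.len = 2*kappa*sqrt(phat*(1 - phat)/(n + 2*adj))
for (i in 1:m) avg.ci.len[i] = sum(ci.len*dbinom(x, n, pp[i]))
lines(pp, avg.ci.len, lty="dashed")                  # overlaid Agresti-Coull curve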
d The change in the program is trivial, and we do not show the modified program
here. The main difference between the Bayesian intervals and those of Agresti and
Coull is in the coverage probabilities near 0 and 1, where the Bayesian intervals
are more variable with changing π, but less extremely conservative. Because of the
discreteness of the binomial distribution, it is difficult to avoid large changes in
coverage probabilities for small changes in π. Perhaps a reasonable goal is that, if
oscillations in coverage probabilities are averaged over “nearby” values of π, then
the “smoothed” values lie close to the nominal level (here 95%). c
Notes: The mean of this beta distribution is (x + 1)/(n + 2), but this value need not
lie exactly at the center of the resulting interval. If 30 trials result in 20 successes,
then the traditional interval is (0.4980, 0.8354) and the Agresti-Coull interval is
(0.4864, 0.8077). The mean of the beta distribution is 0.65625, and a 95% Bayesian
interval is (0.4863, 0.8077), obtained in R with qbeta(c(.025, .975), 21, 11).
Bayesian intervals for π never extend outside (0, 1). (These Bayesian intervals are
based on a uniform prior distribution. Strictly speaking, the interpretation of such
“Bayesian probability intervals” is somewhat different from that for confidence intervals,
but we ignore this distinction for now, pending a more complete discussion of
Bayesian inference in Chapter 8.)
Errors in Chapter 1
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p12 Example 1.5. Just below printout π = 0.80 (not 0.30) in two places. Also, in the
displayed equation for P (Cover), the term P {X = 19}, with value 0.0161, needs
to be added to the beginning of the sum. [Thanks to Tarek Dib.] The correct
display is:
p18 Problem 1.11(a). Code pbinom(x, 25, 0.3) should be pbinom(x, 30, 0.4).
# Initialize
s = 23 # seed (even number of digits, 2 or more)
m = 10 # nr. of ’random’ numbers generated
r = numeric(m) # vector of ’random’ numbers
r[1] = s
t = character(m) # text showing the process for each iteration
nds = nchar(s) # number of digits in the seed
nds2 = 2*nds
# Generate
for (i in 2:m) {
r[i] = r[i-1]^2
lead.0 = paste(numeric(nds2-nchar(r[i])), sep="", collapse="")
temp1 = paste(lead.0, r[i], sep="", collapse="")
temp2 = substr(temp1, nds2-(3/2)*nds+1, nds2-(1/2)*nds)
r[i] = as.numeric(temp2)
#Loop diagnostics and output
msg = "OK"
if (r[i] == 0) {msg = "Zero"}
if (r[i] %in% r[1:(i-1)]) {msg = "Repeat"}
t[i] = paste(
format(r[i-1], width=nds),
format(paste(lead.0, r[i-1]^2, sep="", coll=""), width=nds2),
temp2,
msg)
}
t
summary(as.factor(r)) # counts > 1 indicate repetition
m - length(unique(r)) # nr. of repeats in m numbers generated
After a few steps (how many?), seed 19 gives the “random” number 2, which
has 0004 as its padded square; then the next step and all that follow are 0s. The
seeds 23 and 19 are by no means the only “bad seeds” for this method. With s = 45,
we get r1 = 45 and ri = 0, for all i > 1. With s = 2321 (and m = 100), we see
that all goes well until we get r82 = 6100, r83 = 2100, r84 = 4100, r85 = 8100, and
then this same sequence of four numbers repeats forever. (Acknowledgment: The
program and specific additional examples are based on a class project by Michael
Bissell, June 2011.) c
n = 500; x = 253
p = x/n; z = (p - .5)/sqrt(.25/n)
p.value = 1 - (pnorm(z) - pnorm(-z))
pm = c(-1, 1)
trad.CI = p + pm*1.96*sqrt(p*(1-p)/n)
p.ac = (x + 2)/(n + 4)
Agresti.CI = p.ac + pm*1.96*sqrt(p.ac*(1-p.ac)/(n + 4))
p; z; p.value
trad.CI; p.ac; Agresti.CI
> p; z; p.value
[1] 0.506
[1] 0.2683282
[1] 0.7884467
> trad.CI; p.ac; Agresti.CI
[1] 0.4621762 0.5498238
[1] 0.5059524
[1] 0.4623028 0.5496020
b) Repeat part (a) letting numbers 0 through 4 represent Heads and numbers
5 through 9 represent Tails.
d The only change from part (a) is that X = 261. The agreement of p = X/n = 0.522
with π = 1/2 is not quite as good here as in part (a), but it is nowhere near bad
enough to declare a statistically significant difference. Output from the code of
part (a), but with x = 261, is shown below. c
> p; z; p.value
[1] 0.522
[1] 0.98387
[1] 0.3251795
> trad.CI; p.ac; Agresti.CI
[1] 0.4782155 0.5657845
[1] 0.5218254
[1] 0.4782143 0.5654365
c) Why do you suppose digits of π are not often used for simulations?
d Rapidly computing or accessing the huge number of digits of π (or e) necessary
to do serious simulations is relatively difficult. Partly as a result of this, some investigators
say that insufficient testing has been done to know whether such digits
really do behave essentially as random. Refereed results from researchers at Berkeley
and Purdue are among the many that can be retrieved with an Internet search for
digits of pi random. c
Hint: (a, b) One possible approach is to find a 95% confidence interval for P (Heads)
and interpret the result.
2.3 Example 2.1 illustrates one congruential generator with b = 0 and
d = 53. The program there shows the first m = 60 numbers generated.
Modify the program, making the changes indicated in each part below, using
length(unique(r)) to find the number of distinct numbers produced, and
using the additional code below to make a 2-dimensional plot. Each part re-
quires two runs of such a modified program. Summarize findings, commenting
on differences within and among parts.
u = (r - 1/2)/(d-1)
u1 = u[0:(m-1)]; u2 = u[2:m]
plot(u1, u2, pch=19)
program, you can see that the same is true for the generator with a = 26. However,
the grid patterns with a = 22 are different from those with a = 26. (In what way?) c
Q = Σ (Nj − E)²/E, where the sum runs over the h bins, j = 1, . . . , h.
If the null hypothesis is true and E is large, as here, then Q is very nearly
distributed as CHISQ(h − 1), the chi-squared distribution with h − 1 degrees
of freedom. Accordingly, E(Q) = h − 1. For our example, h = 10, so values
of Q “near” 9 are consistent with uniform observations. Specifically, if Q falls
outside the interval [2.7, 19], then we suspect the generator is behaving badly.
The values 2.7 and 19 are quantiles 0.025 and 0.975, respectively, of CHISQ(9).
In some applications of the chi-squared test, we would reject the null hypothesis
only if Q is too large, indicating some large values of |Nj − E|. But
when we are validating a generator we are also suspicious if results are “too
perfect” to seem random. (One similarly suspicious situation occurs if a fair
coin is supposedly tossed 8000 times independently and exactly 4000 Heads
are reported. Another is shown in the upper left panel of Figure 2.1.)
a) Run the part of the program of Example 2.2 that initializes variables and
the part that generates corresponding values of ui . Instead of the part that
prints a histogram and 2-dimensional plot, use the code below, in which
the parameter plot=F suppresses plotting and the suffix $counts retrieves
the vector of 10 counts. What is the result, and how do you interpret it?
d The revised program (partly from Example 2.2 and partly from an elaboration of
the code given with this problem) and its output are shown below. The resulting
chi-squared statistic Q = 7.2 falls inside the “acceptance” interval given above, so
we find no fault in the generator here. c
# Initialize
a = 1093; b = 18257; d = 86436; s = 7
m = 1000; r = numeric(m); r[1] = s
# Generate
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d}
u = (r + 1/2)/d # values fit in (0,1)
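The counting and chi-squared steps described in part (a) follow; a sketch with h = 10 bins (the bin count is taken from the discussion above):
h = 10; E = m/h; cut = (0:h)/h
N = hist(u, breaks=cut, plot=F)$counts   # counts in the 10 bins
diff = N - E; comp = diff^2/E
cbind(N, E, diff, comp)
Q = sum(comp); Q                         # about 7.2, inside the interval [2.7, 19]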
d With a larger sample of numbers from this generator, we see that the histogram
bars are too nearly of the same height for the results of this generator to be consistent
with randomness. (The acceptance interval does not change for any large sample
size m. See the answer to Problem 2.7 for comments on tests with different numbers h
of histogram bins.) c
# Initialize
a = 1093; b = 18257; d = 86436; s = 7
m = 50000; r = numeric(m); r[1] = s
...
> cbind(N, E, diff, comp)
N E diff comp
[1,] 5007 5000 7 0.0098
[2,] 4995 5000 -5 0.0050
[3,] 5014 5000 14 0.0392
[4,] 4997 5000 -3 0.0018
c) Repeat part (a) again, but now with m = 1000 and b = 252. In this
case, also make the histogram and the 2-dimensional plot of the results
and comment. Do you suppose the generator with increment b = 252 is
useful? (Problem 2.6 below investigates this generator further.)
d With Q = 0.24 for only m = 1000 values, this generator fails the chi-squared test.
The 2-d plot (not shown here) reveals that the grid is far from optimal: all of the
points fall along about 20 lines. c
# Initialize
a = 1093; b = 252; d = 86436; s = 7
m = 1000; r = numeric(m); r[1] = s
...
> cbind(N, E, diff, comp)
N E diff comp
[1,] 98 100 -2 0.04
[2,] 103 100 3 0.09
[3,] 101 100 1 0.01
[4,] 102 100 2 0.04
[5,] 101 100 1 0.01
[6,] 99 100 -1 0.01
[7,] 99 100 -1 0.01
[8,] 99 100 -1 0.01
[9,] 99 100 -1 0.01
[10,] 99 100 -1 0.01
> Q = sum(comp); Q
[1] 0.24
d) Repeat part (a) with the original values of a, b, d, and s, but change to
m = 5000 and add the step u = u^0.9 before computing the chi-squared
statistic. (We still have 0 < ui < 1.) Also, make and comment on the
histogram.
d As discussed in Section 2.4, if U ∼ UNIF(0, 1), then U 0.9 does not have a uniform
distribution. With m = 5000 values, we get Q = 46.724, which provides very strong
evidence of a bad fit to uniform. (However, with only m = 1000 values, there is not
enough information to detect the departure from uniform). c
d Extra: More about the chi-squared goodness-of-fit test. Under the null hypothesis
that Ui are randomly sampled from UNIF(0, 1), we claim above that the statistic Q,
based on h bins, is approximately distributed as CHISQ(h − 1 = 9). Notice that the
statistic Q is computed from counts, and so it takes discrete values (ordinarily not
integers). By contrast, CHISQ(h − 1) is a continuous distribution. Roughly speaking,
the approximation is reasonably good, provided E is sufficiently large. (The provision
is not really an issue here because E = m/h and m is very large.)
This approximation is not easy to prove analytically, so we show a simulation
below for B = 10 000 batches (that is, values of Q). Each batch has m = 1000 pseudo-
random values U , generated using the high-quality random number generator runif
in R. The program below makes a histogram (not printed here) of the B values of Q
and compares it with the density function of CHISQ(9). (By changing the degrees
of freedom for pdf from h − 1 = 9 to h = 10, you easily see that CHISQ(10) is not
such a good fit to the histogram.)
We stress that this is not at all the same use of chi-squared distributions
as in Problems 2.12 and 2.13. Here a chi-squared distribution is used as an
approximation. Problems 2.12 and 2.13 show random variables that have
exact chi-squared distributions.
The first part of the text output shows that the simulated acceptance region
for Q is in good agreement with the acceptance region from cutting off 2.5% from
each tail of CHISQ(9). In the second part of the text output, a table compares areas
of bars in the histogram of Q with corresponding exact probabilities from CHISQ(9),
each expressed to 3-place accuracy. For judging tests of randomness, the accuracy of
the approximation of tail probabilities is more important than the fit in the middle
part of the distribution.
set.seed(1212)
B = 10000; Q = numeric(B)
m = 1000; h = 10; E = m/h; u.cut = (0:h)/h
for (i in 1:B)
{
u = runif(m)
N = hist(u, breaks=u.cut, plot=F)$counts
Q[i] = sum((N - E)^2/E)
}
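The histogram and density comparison described above can be completed as follows (the unit-width bars via Q.cut anticipate the comment just below; the details are our reconstruction):
Q.cut = seq(floor(min(Q)), ceiling(max(Q)), by=1)   # unit-width histogram bars
hist(Q, breaks=Q.cut, prob=T, col="wheat")
qq = seq(0, max(Q), length=200)
lines(qq, dchisq(qq, h - 1), col="blue")            # density of CHISQ(9) for comparison
quantile(Q, c(.025, .975))                          # simulated acceptance region
qchisq(c(.025, .975), h - 1)                        # exact region: about 2.70 and 19.02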
When the hist parameter prob=T is used, the height of each histogram bar is its density.
If, as here, the width of each bar is set to unity (with Q.cut), then this height is also the
area of each bar.
With seed 1212, it happens that the overall agreement of the histogram bars with
the density function is a little better than for some other seeds. You can experiment
with different seeds—and with alternative values of B, m, and h. Roughly speaking,
it is best to keep both h and E larger than about 5. c
Answers: In (a)–(e), Q ≈ 7, 0.1, 0.2, 47, and 7, respectively. Report additional
decimal places, and provide interpretation.
2.5 When beginning work on Trumbo (1969), the author obtained some
obviously incorrect results from the generator included in Applesoft BASIC
on the Apple II computer. The intended generator would have been mediocre
even if programmed correctly, but it had a disastrous bug in the machine-level
programming that led to periods of only a few dozen for some seeds (Sparks,
1983). A cure (proposed in a magazine for computer enthusiasts, Hare et
al., 1983) was to import the generator ri+1 = 8192ri (mod 67 099 547). This
generator has full period, matched the capabilities of the Apple II, and seemed
to give accurate results for the limited simulation work at hand.
a) Modify the program of Example 2.3 to make plots for this generator analogous
to those in Figure 2.4. Use u = (r - 1/2)/(d - 1).
d The R code is shown below. The plots (omitted here) reveal no undesirable
structure—in either 2-d or 3-d. The first printing of the text suggested using
the code u = (r + 1/2)/d. But this is a multiplicative generator with b = 0 and
0 < ri < d. So, theoretically, u = (r - 1/2)/(d - 1) is the preferred way to spread
the ui over (0, 1). However, in the parts of this problem (with very large d), we have
seen no difference in results between the two formulas. c
a = 8192; b = 0; d = 67099547; s = 11
m = 20000; r = numeric(m); r[1] = s
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d}
u = (r - 1/2)/(d - 1)
a = 8192; b = 0; d = 67099547; s = 25
m = 1000; r = numeric(m); r[1] = s
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d}
u = (r - 1/2)/(d - 1)
a) For a particular n ≤ m, you can use the code sum(u[1:n] < .5)/n to
simulate the proportion of heads in the first n tosses. If the values ui are
uniform in the interval (0, 1), then each of the n comparisons inside the
parentheses has probability one-half of being TRUE, and thus contributing
1 to the sum. Evaluate this for n = 10 000, 20 000, 30 000, 40 000, and
50 000. For each n, the 95% margin of error is about n−1/2 . Show that all
of your values are within this margin of the true value P {Head} = 0.5.
So, you might be tempted to conclude that the generator is working satisfactorily.
But notice that all of these proportions are above 0.5—and by
similar amounts. Is this a random coincidence or a pattern? (See part (c).)
a = 1093; b = 252; d = 86436; s = 6
m = 50000; r = numeric(m); r[1] = s
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d}
u = (r + 1/2)/d
n = c(1:5)*10000; p = numeric(5)
for (i in 1:5) {p[i] = sum(u[1:n[i]] < .5)/n[i]}
ME = 1/sqrt(n); diff = p - 1/2
cbind(n, p, diff, ME)
b) This generator has serious problems. First, how many distinct values do
you get among m? Use length(unique(r)). So, this generator repeats
a few values many times in m = 50 000 iterations. Second, the period
depends heavily on the seed s. Report results for s = 2, 8 and 17.
s = c(2, 8, 17, 20, 111); k = length(s); distinct = numeric(k)
a = 1093; b = 252; d = 86436
m = 50000; r = numeric(m)
for (j in 1:k) {
r[1] = s[j]
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d }
distinct[j] = length(unique(r)) }
distinct
> distinct
[1] 1029 1029 147 1029 343
c) Based on this generator, the code below makes a plot of the proportion
of heads in n tosses for all values n = 1, 2 . . . m. For comparison, it does
the same for values from runif, which are known to simulate UNIF(0, 1)
accurately. Explain the code, run the program (which makes Figure 2.9),
and comment on the results. In particular, what do you suppose would
happen towards the right-hand edge of the graph if there were millions of
iterations m? (You will learn more about such plots in Chapter 3.)
a = 1093; b = 252; d = 86436; s = 6
m = 50000; n = 1:m
r = numeric(m); r[1] = s
for (i in 1:(m-1)) {r[i+1] = (a*r[i] + b) %% d}
u = (r + 1/2)/d; f = cumsum(u < .5)/n
plot(n, f, type="l", ylim=c(.49,.51), col="red") # ’ell’, not 1
abline(h=.5, col="green")
d See Figure 2.9 on p42 of the text for the plot. We used seed 1237 because it
illustrates the convergence of the good generator on a scale that also allows the
“sawtooth” nature of the trace of the bad generator to show clearly in print. For a
large number of iterations with any seed, the trace for the good generator will tend
toward 1/2 and the trace for the bad generator will continue to stay above 1/2. For
example, try several seeds with m = 100 000. c
Note: This generator “never had a chance.” It breaks one of the number-theoretic
rules for linear congruential generators, that b and d not have factors in common.
See Abramowitz and Stegun (1964).
a) Run the program and comment on the results. Approximately how many
points are printed in each graph?
d Both graphs show uniform random behavior within a “square.” The two plots are
similar to those shown in the second row of Figure 2.4. Specifically, the first plot
here is a “magnified” view of the lower-left 100th of the unit square; unlike the
lower-right plot in Figure 2.3, it reveals no grid structure. The second plot here
shows a “veneer” that is the front 100th of the unit cube; unlike the upper-right
plot in Figure 2.4, no concentration of points on parallel planes is evident. Each of
the plots here shows about m/100 = 10⁶/10² = 10 000 points. c
set.seed(2012)
m = 50000; u = runif(m)
h = 10; E = m/h; cut = (0:h)/h
N = hist(u, breaks=cut, plot=F)$counts
diff = N - E; comp = (diff)^2/E
cbind(N, E, diff, comp)
Q = sum(comp); Q
set.seed(2001)
m = 50000; u = runif(m); h = 20; E = m/h; cut = (0:h)/h
N = hist(u, breaks=cut, plot=F)$counts
diff = N - E; comp = (diff)^2/E
cbind(N, E, diff, comp)
Q = sum(comp); Q
data: runif(50000)
D = 0.0047, p-value = 0.2115
alternative hypothesis: two.sided
2.8 The code used to make the two plots in the top row of Figure 2.1 (p26) is
shown below. The function runif is used in the left panel to “jitter” (randomly
displace) two plotted points slightly above and to the right of each of the 100
grid points in the unit square. The same function is used more simply in the
right panel to put 200 points at random into the unit square.
set.seed(121); n = 100
par(mfrow=c(1,2), pty="s") # 2 square panels per graph
# Left Panel
s = rep(0:9, each=10)/10 # grid points
t = rep(0:9, times=10)/10
x = s + runif(n, .01, .09) # jittered grid points
y = t + runif(n, .01, .09)
plot(x, y, pch=19, xaxs="i", yaxs="i", xlim=0:1, ylim=0:1)
#abline(h = seq(.1, .9, by=.1), col="green") # grid lines
#abline(v = seq(.1, .9, by=.1), col="green")
# Right Panel
x=runif(n); y = runif(n) # random points in unit square
plot(x, y, pch=19, xaxs="i", yaxs="i", xlim=0:1, ylim=0:1)
par(mfrow=c(1,1), pty="m") # restore default plotting
a) Run the program (without the grid lines) to make the top row of Figure 2.1
for yourself. Then remove the # symbols at the start of the two abline
statements so that grid lines will print to show the 100 cells of your left
panel. See Figure 2.12 (p47).
b) Repeat part (a) several times without the seed statement (thus getting a
different seed on each run) and without the grid lines to see a variety of
examples of versions of Figure 2.1. Comment on the degree of change in
the appearance of each with the different seeds.
c) What do you get from a single plot with plot(s, t)?
d The lower-left corners of the 100 cells. c
d) If 100 points are placed at random into the unit square, what is the probability
that none of the 100 cells of this problem are left empty? (Give
your answer in exponential notation with four significant digits.)
d This is very similar to the birthday matching problem of Example 1.2. The answer
is 100!/100¹⁰⁰ = 9.333 × 10⁻⁴³. Don’t wait around for this event to happen.
Below we show two methods of computing this quantity in R, the latter of which is
theoretically preferable. Factorials and powers grow so rapidly that some computer
packages would not be able to handle the first method without overflowing. The
second method, sometimes called “zippering,” involves the product of 100 factors,
each of manageable size. It turns out that R is up to the task of computing the first
method correctly, so you could use either. (But if there were 15² = 225 cells, you
wouldn’t have a choice.) c
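A minimal version of the two computations described above:
factorial(100)/100^100      # direct ratio; factorial(100) stays below R's overflow limit
prod((100:1)/100)           # "zippering": a product of 100 manageable factors
# both return about 9.333e-43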
Note: Consider nesting habits of birds in a marsh. From left to right in Figure 2.1, the
first plot shows territorial behavior that tends to avoid close neighbors. The second
shows random nesting in which birds choose nesting sites entirely independently of
other birds. The third shows a strong preference for nesting near the center of the
square. The last shows social behavior with a tendency to build nests in clusters.
2.9 (Theoretical) Let U ∼ UNIF(0, 1). In each part below, modify equation (2.2)
to derive the cumulative distribution function of X, and then take
derivatives to find the density function.
a) Show that X = (b − a)U + a ∼ UNIF(a, b), for real numbers a and b with
a < b. Specify the support of X.
d The support of X is (a, b). For a < x < b,
FX(x) = P{X ≤ x} = P{(b − a)U + a ≤ x} = P{U ≤ (x − a)/(b − a)} = (x − a)/(b − a),
where the last equation follows from FU(u) = u, for 0 < u < 1. Also, FX(x) = 0
for x ≤ a, and FX(x) = 1 for x ≥ b. Then, taking the derivative of this piecewise
differentiable CDF, we have the density function fX(x) = F′X(x) = 1/(b − a), for
a < x < b, and 0 elsewhere. c
FX (x) = P {X ≤ x} = P {1 − U ≤ x} = P {U ≥ 1 − x}
= 1 − P {U ≤ 1 − x} = 1 − (1 − x) = x,
where we have used the 0 probability of a single point in moving from the first line
to the second. Outside the interval of support, FX (x) = 0, for x ≤ 0, and FX (x) = 1,
for x ≥ 1. Here again, the CDF is piecewise differentiable, and so we have the density
function fX(x) = F′X(x) = 1, for 0 < x < 1, and 0 elsewhere. Thus, X ∼ UNIF(0, 1)
also. c
2.10 In Example 2.4, we used the random R function runif to sample from
the distribution BETA(0.5, 1). Here we wish to sample from BETA(2, 1).
a) Write the density function, cumulative distribution function, and quantile
function of BETA(2, 1). According to the quantile transformation method,
explain how to use U ∼ UNIF(0, 1) to sample from BETA(2, 1).
d All distributions in the beta family have support (0, 1). Let 0 < x, y < 1. The
density function of X ∼ BETA(2, 1) is fX(x) = 2x, because Γ(3) = 2! = 2 and
Γ(2) = Γ(1) = 1. Integrating, we have the CDF y = FX(x) = x², so that the
quantile function is x = FX⁻¹(y) = √y. Thus, X = FX⁻¹(U) = √U ∼ BETA(2, 1). c
c) Modify the program of Example 2.4 to illustrate the method of part (a).
Of course, you will need to change the code for x and cut.x and the code
used to plot the density function of BETA(2, 1). Also, change the code to
simulate a sample of 100 000 observations, and use 20 bars in each of the
histograms. Finally, we suggest changing the ylim parameters so that the
vertical axes of the histograms include the interval (0, 2). See Figure 2.10.
d The program below shows the required changes. Lines with changes are marked
with ##. Alternatives: define the cutpoints using cut.u = seq(0, 1, len=21) and
plot the density function of BETA(2, 1) using dbeta(xx, 2, 1). Except for some
embellishments to make a clearer image for publication, the program produces histograms
as in Figure 2.10 on p45 of the text. c
set.seed(3456)
m = 100000
u = runif(m)
x = sqrt(u) ##
xx = seq(0, 1, by=.001)
cut.u = (0:20)/20 ##
cut.x = sqrt(cut.u)
par(mfrow=c(1,2))
hist(u, breaks=cut.u, prob=T, ylim=c(0,2)) ##
lines(xx, dunif(xx), col="blue")
hist(x, breaks=cut.x, prob=T, ylim=c(0,2)) ##
lines(xx, 2*xx, col="blue") ##
par(mfrow=c(1,1))
a) If X ∼ EXP(λ), then E(X) = 1/λ (see Problem 2.12). Find the median of
this distribution by setting FX(x) = 1 − e^(−λx) = 1/2 and solving for x. How
accurately does your simulated sample of size 10 000 estimate the population
mean and median of EXP(1)? [The answer for λ = 1 is qexp(.5).]
d In general, FX(η) = 1 − e^(−λη) = 1/2 implies η = −ln(.5)/λ. So if λ = 2, the median
is η = 0.3465736. In R this could also be obtained as qexp(.5, 2). For λ = 1, the
simulation above gives the desired result 0.693 correct to three places. c
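A one-line check in R (our addition):
qexp(.5, 2); -log(.5)/2     # both give 0.3465736, the median of EXP(2)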
b) The last two lines of the program (counts from unplotted histograms)
provide counts for each interval of the realizations of U and X, respectively.
Report the 10 counts in each case. Explain why their order gets
reversed when transforming from uniform to exponential. What is the
support of X? Which values in the support (0, 1) of U correspond to the
largest values of X? Also, explain how cut2 is computed and why.
d A straightforward application of the quantile method simulates X ∼ EXP(λ = 1)
as X = −ln(1 − U′), where U′ ∼ UNIF(0, 1). However, as shown in Problem 2.9(b),
we also have U = 1 − U′ ∼ UNIF(0, 1). So, as in the program shown here, one often
simulates X directly as X = −ln(U)/λ.
set.seed(1212)
m = 10000; lam = 1
u = runif(m); x = -log(u)/lam
# default binning
> hist(x, plot=F)$counts # interval counts
[1] 3906 2374 1423 909 571 305 191 116 83
[10] 51 33 16 13 3 2 3 1
> hist(x, plot=F)$breaks # default cutpoints
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
[13] 6.0 6.5 7.0 7.5 8.0 8.5
d) Run the program again with lam = 1/2. Describe and explain the results
of this change. (Notice that EXP(1/2) = CHISQ(2) = GAMMA(1, 1/2).
See Problems 2.12 and 2.13.)
d The smaller rate λ = 1/2 corresponds to the larger mean µ = 2. Moreover, if we
use the same seed, then each of the simulated values is precisely doubled. To see
the effect, compare the first set of results below with those shown in part (a). The
second set of results below is for a different seed. c
set.seed(1212)
m = 10000; lam = 1/2
u = runif(m); x = -log(u)/lam
Then m′X(s) = λ(s − λ)^(−2), and m′X(0) = E(X) = 1/λ. Finally, in particular, if
T ∼ CHISQ(2) = EXP(1/2), then E(T) = 2. c
Show that this simplifies to m(s) = (1 − 2s)^(−1/2), so that Z² ∼ CHISQ(1).
Recall that Γ(1/2) = √π.
d To find the density function of T ∼ CHISQ(1), substitute α = λ = 1/2 into the
general gamma density function: for t > 0,
fT(t) = [λ^α/Γ(α)] t^(α−1) e^(−λt) = [(1/2)^(1/2)/Γ(1/2)] t^(−1/2) e^(−t/2) = e^(−t/2)/√(2πt).
For s < 1/2, continuing from the statement of the problem, the MGF of Z² is
mZ²(s) = 2 ∫₀^∞ exp(sz²) φ(z) dz = 2 ∫₀^∞ (1/√(2π)) exp[−(1 − 2s)z²/2] dz
       = 2 ∫₀^∞ (1/√(2π)) e^(−t/2) dt/[2√((1 − 2s)t)]
       = (1 − 2s)^(−1/2) ∫₀^∞ e^(−t/2)/√(2πt) dt = (1 − 2s)^(−1/2),
where we use the change of variable t = (1 − 2s)z² in the second line, so that
dt = 2(1 − 2s)z dz and dz = dt/[2√((1 − 2s)t)]. Also, the final equality holds because
the density function of CHISQ(1) integrates to 1.
This argument uses the uniqueness property of moment generating functions.
No two distributions have the same MGF. There are some distributions that do
not have moment generating functions. However, among those that do, there is a
one-to-one correspondence between distributions and moment generating functions.
In the current demonstration, we know that Z 2 ∼ CHISQ(1) because we have found
the MGF of Z 2 and it matches the MGF of CHISQ(1). c
2.13 Simulations for chi-squared random variables. The first block of code
in Example 2.6 illustrates that the sum of squares of two standard normal
random variables is distributed as CHISQ(2). (Problem 2.12 provides formal
proof.) Modify the code in the example to do each part below. For simplicity,
when plotting the required density functions, use dens = dchisq(tt, df)
for df suitably defined.
a) If Z ∼ NORM(0, 1), then illustrate by simulation that Z 2 ∼ CHISQ(1)
and that E(Z 2 ) = 1.
d Our modified code is shown below. The histogram (omitted here) shows good fit
of the histogram to the density function of CHISQ(1). As in Problem 2.12, MGFs
can be used to show that CHISQ(ν) has mean ν and variance 2ν.
For additional verification, the last line of this program performs a Kolmogorov-
Smirnov test of goodness-of-fit. For seed 1234, the P-value 0.2454 of this test (output
omitted) is consistent with good fit of the simulated observations to CHISQ(1). The
K-S test is discussed briefly at the end of the answers to Problem 2.15. c
set.seed(1234)
m = 40000; z = rnorm(m); t = z^2
hist(t, breaks=30, prob=T, col="wheat")
tt = seq(0, max(t), length=200); dens = dchisq(tt, 1)
lines(tt, dens, col="blue")
mean(t); var(t)
ks.test(t, pchisq, 1)
set.seed(1235)
m = 100000; nu = 3
z = rnorm(m * nu); DTA = matrix(z, nrow=m); t = rowSums(DTA^2)
tt = seq(0, max(t), length=200); dens = dchisq(tt, nu)
hist(t, breaks=30, ylim=c(0, 1.1*max(dens)), prob=T, col="wheat")
lines(tt, dens, col="blue")
mean(t); nu # simulated and exact mean
var(t); 2*nu # simulated and exact variance
2.14 Illustrating the Box-Muller method. Use the program below to implement
the Box-Muller method of simulating a random sample from a standard
normal distribution. Does the histogram of simulated values seem to agree
with the standard normal density curve? What do you conclude from the
chi-squared goodness-of-fit statistic? (This statistic, based on 10 bins with
E(Ni) = Ei, has the same approximate chi-squared distribution as in Problem
2.4, but here the expected counts Ei are not the same for all bins.) Before
drawing firm conclusions, run this program several times with different seeds.
set.seed(1236)
m = 2*50000; z = numeric(m)
u1 = runif(m/2); u2 = runif(m/2)
z1 = sqrt(-2*log(u1)) * cos(2*pi*u2) # half of normal variates
z2 = sqrt(-2*log(u1)) * sin(2*pi*u2) # other half
z[seq(1, m, by = 2)] = z1 # interleave
z[seq(2, m, by = 2)] = z2 # two halves
cut = c(min(z)-.5, seq(-2, 2, by=.5), max(z)+.5)
hist(z, breaks=cut, ylim=c(0,.4), prob=T)
zz = seq(min(z), max(z), by=.01)
lines(zz, dnorm(zz), col="blue")
E = m*diff(pnorm(c(-Inf, seq(-2, 2, by=.5), Inf))); E
N = hist(z, breaks=cut, plot=F)$counts; N
Q = sum(((N-E)^2)/E); Q; qchisq(c(.025,.975), 9)
d The data from the run shown are consistent with a standard normal distribution.
(Also, the K-S test for this run has P-value 0.7327.) Additional runs (seeds unspecified)
yielded Q = 10.32883, 7.167461, 8.3525, 2.360288, and 9.192768. Because Q is
approximately distributed as CHISQ(9) we expect values averaging 9, and falling
between 2.7 and 19.0 for 95% of the simulation runs. c
Variance. The variance of the sum of mutually independent random variables is the
sum of their variances. Thus V(U1 + · · · + U12) = 12 V(U1) = 12(1/12) = 1.
b) For the random variable Z of part (a), evaluate P {−6 < Z < 6}. Theoretically,
how does this result differ for a random variable Z that is precisely
distributed as standard normal?
d If all 12 of the Ui were 0, then Z = −6; if all of them were 1, then Z = 6. Thus,
P {−6 < Z < 6} = 1, exactly. If Z 0 ∼ NORM(0, 1), then P {−6 < Z 0 < 6} ≈ 1, but
not exactly. There is some probability in the “far tails” of standard normal, but not
much. We can get the exact probability of the two tails in R as 2*pnorm(-6),
which returns 1.973175 × 10⁻⁰⁹, about 2 chances in a billion. Nevertheless, the main
difficulty simulating NORM(0, 1) with a sum of 12 uniformly distributed random
variables is that the shape of the density function of the latter is not precisely
normal within (−6, 6). c
goodness-of-fit test? (Assuming normality, this statistic has the same approximate
chi-squared distribution as in Problem 2.4. Here again, there
are 10 bins, but now the expected counts Ei are not all the same.) Before
drawing firm conclusions, run this program several times with different
seeds. Also, make a few runs with m = 10 000 iterations.
d Below the program we show a handmade table of results from the chi-squared and
the Kolmogorov-Smirnov goodness-of-fit tests—both for m = 100 000 and m = 10 000.
(The first row for m = 100 000 is from seed 1237.) For truly normal data, we
expect Q values averaging 9, and falling between 2.7 and 19 in 95% of simulation
runs. Also, we expect K-S P-values between 2.5% and 97.5%. Symbols * indicate
values suggesting poor fit, + for a value suggesting suspiciously good fit. This method
is inferior to the Box-Muller method of Problem 2.14 and should not be used. c
set.seed(1237)
m = 100000; n = 12
u = runif(m*n); UNI = matrix(u, nrow=m)
z = rowSums(UNI) - 6
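The goodness-of-fit computations behind the table below are sketched here, patterned after the code of Problem 2.14 (the bin cutpoints are our assumption):
cut = c(min(z) - .5, seq(-2, 2, by=.5), max(z) + .5)
E = m*diff(pnorm(c(-Inf, seq(-2, 2, by=.5), Inf)))   # expected counts, not all equal
N = hist(z, breaks=cut, plot=F)$counts
Q = sum((N - E)^2/E); Q                              # chi-squared statistic
ks.test(z, pnorm)$p.value                            # Kolmogorov-Smirnov P-value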
m = 100000
Q (chi-squared)   K-S P-value
-----------------------------
25.46162* 0.07272492
31.97759* 0.1402736
37.38722* 0.02045037*
10.00560 0.1683760
18.00815* 0.7507103
37.38722* 0.02045037*
m = 10000
Q (chi-squared)   K-S P-value
-----------------------------
6.72417 0.2507788
12.34241 0.01991226*
1.928343+ 0.6481974
9.201392 0.7471199
4.814753 0.5253633
12.67015 0.651179
2.16 Random triangles (Project). If three points are chosen at random from
a standard bivariate normal distribution (µ1 = µ2 = ρ = 0, σ1 = σ2 = 1),
then the probability that they are vertices of an obtuse triangle is 3/4. Use
simulation to illustrate this result. Perhaps explore higher dimensions. (See
Portnoy (1994) for a proof and for a history of this problem tracing back to
Lewis Carroll.)
d Below is a program to simulate the probability of an obtuse triangle in n = 2
dimensions. The code can easily be generalized to n > 2 dimensions. The number
of random triangles generated and evaluated is m. There are three matrices a, b,
and c: one for each vertex of a triangle. Each matrix contains m rows: one for each
triangle generated; the n entries in each row are the coordinates of one vertex of the
triangle.
The ith components of m-vectors AB, AC, and BC are the squared lengths of the
sides of the ith random triangle. A triangle is obtuse if the squared length of its
longest side exceeds the sum of the squared lengths of the other two sides.
set.seed(1238)
m = 1000000 # number of triangles
n = 2 # dimensions
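# The remaining steps are a reconstruction based on the description above
# (matrix names a, b, c and vectors AB, AC, BC follow that description).
a = matrix(rnorm(m*n), nrow=m)             # first vertex of each triangle
b = matrix(rnorm(m*n), nrow=m)             # second vertex
c = matrix(rnorm(m*n), nrow=m)             # third vertex
AB = rowSums((a - b)^2); AC = rowSums((a - c)^2); BC = rowSums((b - c)^2)
longest = pmax(AB, AC, BC)                 # squared length of the longest side
pr.obtuse = mean(longest > AB + AC + BC - longest)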
> pr.obtuse
[1] 0.749655
Errors in Chapter 2
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p26 Example 2.1. In the first line below Figure 2.1: ri = 21 should be r1 = 21.
[Thanks to Jeff Glickman.]
3.2 We explore two ways to evaluate e^(−1/2) ≈ 0.61, correct to two decimal
places, using only addition, subtraction, multiplication, and division—
the fundamental operations available to the makers of tables 50 years ago. On
most modern computers, the evaluation of e^x is a chip-based function.
a) Consider the Taylor (Maclaurin) expansion e^x = Σ_{k=0}^∞ x^k/k!. Use the
first few terms of this infinite series to approximate e^(−1/2). How many
terms are required to get two-place accuracy? Explain.
k = 0:10; nr.terms = k + 1; x = -1/2
taylor.term = x^k/factorial(k); taylor.sums = cumsum(taylor.term)
round(cbind(k, nr.terms, taylor.sums), 10)
min(nr.terms[round(taylor.sums, 2) == round(exp(-1/2), 2)])
b) Use the relationship e^x = lim_{n→∞} (1 + x/n)^n. Notice that this is the limit
of an increasing sequence. What is the smallest value of k such that n = 2^k
gives two-place accuracy for e^(−1/2)?
k = 1:10; n = 2^k; x = -1/2; term = (1 + x/n)^n
cbind(k, n, term)
min(k[round(term, 2) == round(exp(-1/2), 2)])
c) Run the following R script. For each listed value of x, say whether the
method of part (a) or part (b) provides the better approximation of ex .
d We add several lines to the code to show results in tabular and graphical formats.
In the table, if the column t.best shows 1 (for TRUE), then the sum of the first seven
terms of the Taylor expansion is at least as good as the 1024th term in the sequence.
(The two methods agree exactly for x = 0, and they agree to several decimal places
for values of x very near 0. But for accuracy over a wider interval, we might want
to sum a few more terms of the Taylor series.) c
x = seq(-2, 2, by=.25)
tay.7 = 1 + x + x^2/2 + x^3/6 + x^4/24 + x^5/120 + x^6/720
seq.1024 = (1 + x/1024)^1024; exact = exp(x)
t.err = tay.7 - exact; s.err = seq.1024 - exact
t.best = (abs(t.err) <= abs(s.err))
round(cbind(x, tay.7, seq.1024, exact, t.err, s.err, t.best), 5)
plot(x, s.err, ylim=c(-.035, .035),
  ylab="Taylor.7 (solid) and Sequence.1024 Errors")
points(x, t.err, pch=19, col="blue")
3.3 Change the values of the constants in the program of Example 3.1 as
indicated.
a) For a = 0 and b = 1, try each of the values m = 10, 20, 50, and 500.
Among these values of m, what is the smallest m that gives five-place
accuracy for P {0 < Z < 1}?
d We show the substitution m = 10 and make a table by hand to show the complete
set of answers. Compare with the exact value P {0 < Z < 1} = 0.3413447, computed
using diff(pnorm(c(0,1))). c
m = 10; a = 0; b = 1; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
const = 1/sqrt(2 * pi); h = const * exp(-g^2 / 2)
sum(w * h)
> sum(w * h)
[1] 0.3414456
m Riemann approximation
-------------------------------
10 0.3414456 above
20 0.3413700
50 0.3413488
500 0.3413448 5-place accuracy
5000 0.3413447 Example 3.1
b) For m = 5000, modify this program to find P {1.2 < Z ≤ 2.5}. Compare
your answer with the exact value obtained using the R function pnorm.
m = 5000; a = 1.2; b = 2.5; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
const = 1/sqrt(2 * pi)
h = const * exp(-g^2 / 2)
sum(w * h)
diff(pnorm(c(1.2, 2.5)))
> sum(w * h)
[1] 0.10886 # Riemann approximation
> diff(pnorm(c(1.2, 2.5)))
[1] 0.10886 # exact value from ’pnorm’
> sum(w * h)
[1] 0.3934693 # Riemann approximation
> 1 - exp(-.5)
[1] 0.3934693 # exact value from calculus
3.5 Run the program of Example 1.2 several times (omitting set.seed)
to evaluate P {0 < Z ≤ 1}. Do any of your answers have errors that exceed
the claimed margin of error 0.0015? Also, changing constants as necessary,
make several runs of this program to evaluate P {0.5 < Z ≤ 2}. Compare
your results with the exact value; the margin of error is larger here.
d To seven places, the exact value is P {0 < Z ≤ 1} = 0.3413447. Values from several
runs are shown here—all of which happen to fall within the claimed margin of error.
Of course, your values will differ slightly.
In Section 3.4 we show that the last line of code provides an approximate 95%
margin of error for this Monte Carlo integration over the unit interval (shown only
for the first run). Finally, we show two Monte Carlo values for P {0.5 < Z ≤ 2}. c
m = 500000; a = 0; b = 1; w = (b - a)/m
u = a + (b - a) * runif(m); h = dnorm(u)
mc = sum(w * h)
exact = pnorm(1)-.5
mc; abs(mc - exact)
2*(b-a)*sd(h)/sqrt(m)
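A sketch (ours, not verbatim from the text) of one corresponding run for
P{0.5 < Z ≤ 2}; only the limits and the exact value change from the code above.
m = 500000; a = 0.5; b = 2; w = (b - a)/m
u = a + (b - a) * runif(m); h = dnorm(u)
mc = sum(w * h)
exact = pnorm(2) - pnorm(0.5)
mc; abs(mc - exact)
2*(b-a)*sd(h)/sqrt(m)                 # larger margin of error than for (0, 1)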
3.6 Use Monte Carlo integration with m = 100 000 to find the area of the
first quadrant of the unit circle, which has area π/4. Thus obtain a simulated
value of π = 3.141593. How many places of accuracy do you get?
set.seed(1234)
m = 100000; a = 0; b = 1; w = (b - a)/m
u = a + (b - a) * runif(m); h = sqrt(1 - u^2)
quad = sum(w * h); quad; 4*quad
3.7 Here we consider two very similar random variables. In each part below
we wish to evaluate P {X ≤ 1/2} and E(X). Notice that part (a) can be done
by straightforward analytic integration but part (b) cannot.
a) Let X be a random variable distributed as BETA(3, 2) with density func-
tion f (x) = 12x2 (1−x), for 0 < x < 1, and 0 elsewhere. Use the numerical
integration method of Example 3.1 to evaluate the specified quantities.
Compare the results with exact values obtained using calculus.
d Exact values: P{X ≤ 1/2} = 12 ∫_0^{1/2} (x² − x³) dx = 12[(1/3)(1/2)³ − (1/4)(1/2)⁴] =
0.3125 and E(X) = 12 ∫_0^1 x(x² − x³) dx = 12(1/4 − 1/5) = 3/5. c
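The lines computing prob are not reproduced here; a sketch paralleling the
part (b) code below is:
m = 5000; a = 0; b = 1/2; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
h = 12 * g^2 * (1 - g)                # BETA(3, 2) density function
prob = sum(w * h)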
m = 5000; a = 0; b = 1; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
h = g * 12 * g^2 * (1 - g)
mean = sum(w * h)
prob; mean
> prob; mean
[1] 0.3125
[1] 0.6
b) Let X be a random variable distributed as BETA(2.9, 2.1) with density
function f(x) = [Γ(5)/(Γ(2.9)Γ(2.1))] x^{1.9} (1 − x)^{1.1}, for 0 < x < 1, and 0 elsewhere.
Use the method of Example 3.1.
m = 5000; a = 0; b = .5; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
h = dbeta(g, 2.9, 2.1) # BETA(2.9, 2.1) density function
prob = sum(w * h)
m = 5000; a = 0; b = 1; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
h = g * dbeta(g, 2.9, 2.1)
mean = sum(w * h)
prob; mean
> prob; mean
[1] 0.3481386
[1] 0.58 # 29/50 = 2.9/(2.9 + 2.1) = 0.58
c) Use the Monte Carlo integration method of Example 3.2 for both of the
previous parts. Compare results.
d We show Monte Carlo integration only for the problem in part (b). c
set.seed(1235)
m = 500000; a = 0; b = 1/2; w = (b - a)/m
u = a + (b - a) * runif(m); h = dbeta(u, 2.9, 2.1)
prob = sum(w * h)
a = 0; b = 1; w = (b - a)/m
u = a + (b - a) * runif(m); h = u * dbeta(u, 2.9, 2.1)
mean = sum(w * h)
prob; mean
> prob; mean
[1] 0.3483516 # approx. P{X < 1/2}
[1] 0.5791294 # approx. E(X)
Hints and answers: (a) From integral calculus, P {X ≤ 1/2} = 5/16 and E(X) = 3/5.
(Show your work.) For the numerical integration, modify the lines of the pro-
gram of Example 1.2 that compute the density function. Also, let a = 0 and let
b = 1/2 for the probability. For the expectation, let b = 1 and use h = 12*g^3*(1-g) or
h = g*dbeta(g, 3, 2). Why? (b) The constant factor of f (x) can be evaluated in R
as gamma(5)/(gamma(2.9)*gamma(2.1)), which returns 12.55032. Accurate answers
are 0.3481386 (from pbeta(.5, 2.9, 2.1)) and 29/50.
> sum(w * h)
[1] 12.19097
Hint: In the program of Example 3.1, let a = 80, b = 120. Also, for the line of code
defining h, substitute h = (20 - abs(g - 100))*dnorm(g, 100, 10). Provide the
answer (between 12.0 and 12.5) correct to two decimal places.
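A sketch (ours, not verbatim from the text) of the program the Hint describes,
with m = 5000 as in Example 3.1:
m = 5000; a = 80; b = 120; w = (b - a)/m
g = seq(a + w/2, b - w/2, length=m)
h = (20 - abs(g - 100))*dnorm(g, 100, 10)
sum(w * h)                            # compare with the value 12.19097 above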
3.9 Suppose you do not know the value of √2. You can use simulation to
approximate it as follows. Let X = U², where U ∼ UNIF(0, 1). Then show
that 2P{0 < X ≤ 1/2} = √2, and use the sampling method with large m to
approximate √2.
d The cumulative distribution function of U ∼ UNIF(0, 1) is FU(u) = P{U ≤ u} = u,
for 0 < u ≤ 1. Then, if X = U², we have P{0 < X ≤ 1/2} = P{U² ≤ 1/2} =
FU(1/√2) = 1/√2 = √2/2. c
set.seed(1236)
m = 500000; u = runif(m)
2 * mean(u^2 < 1/2); sqrt(2)
Note: Of the methods in Section 3.1, only sampling is useful. You could find the
density function of X, but it involves √2, which you are pretending not to know.
set.seed(1237)
m = 500000 # sample size
u = 10*runif(m) # sample of size m from UNIF(0, 10)
v = 5 + rnorm(m) # sample of size m from NORM(5, 1)
t = u + v # sample of size m from T
hist(t) # similar to Figure 3.3 on p57 of the book
mean(t > 15) # prop. of t > 15; est. of P{T > 15}
mean(t) # sample mean estimates E(T)
sd(t) # sample SD estimates SD(T)
> mean(t > 15) # prop. t > 15; est. of P{T > 15}
[1] 0.039844
> mean(t) # sample mean estimates E(T)
[1] 9.999499
> sd(t) # sample SD estimates SD(T)
[1] 3.056966
Comments: The mean of the 500 000 observations of T is the balance point of the
histogram. How accurately does this mean simulate E(T ) = E(U )+E(V ) = 10? Also,
compare simulated and exact SD(T ). The histogram facilitates a rough guess of the
value P {T > 15}. Of the m = 500 000 sampled values, it seems that approximately
20 000 (or 4%) exceed 15. Compare this guess with the answer from your program.
set.seed(1238)
m = 20000 # number of candidate values
u = runif(m) # sample from UNIF(0, 1)
y = sqrt(u) # candidate y from BETA(2, 1)
acc = rbinom(m, 1, y) # accepted with probability y
x = y[acc==T] # accepted values x from BETA(3, 1)
# Numerical values
mean(x) # sample mean estimates E(X) = .75
sd(x) # sample sd estimates SD(X) = .1936
mean(x < 1/2) # estimates P{X < 1/2} = .125
mean(acc) # acceptance rate
with fX(1/2) = 4/π, so we can use Bb(x) = 4/π. Modify the program
of part (a) to implement the AR method for simulating values of X,
beginning with the following two lines. Annotate and explain your code.
Make a figure similar to the bottom panel of Figure 3.10. For verification,
note that E(X) = 1/2, SD(X) = 1/4, and FX (1/2) = 1/2. What is the
acceptance rate?
m = 40000; y = runif(m)
acc = rbinom(m, 1, dbeta(y, 1.5, 1.5)/(4/pi)); x = y[acc==T]
set.seed(1239)
m = 40000 # nr of candidate values
y = runif(m) # candidate y
acc = rbinom(m, 1, dbeta(y, 1.5, 1.5)/(4/pi)) # acceptance rule
x = y[acc==T] # accepted values
# Numerical values
mean(x) # sample mean estimates E(X) = .5
sd(x) # sample sd estimates SD(X) = .25
mean(x < 1/2) # estimates P{X < 1/2} = 1/2
mean(acc) # acceptance rate
c) Repeat part (b) for X ∼ BETA(1.4, 1.6). As necessary, use the R function
gamma to evaluate the necessary values of the Γ -function. The function
rbeta implements very efficient algorithms for sampling from beta dis-
tributions. Compare your results from the AR method in this part with
results from rbeta.
d The general form of a beta density function is f(x) = [Γ(α+β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1},
for 0 < x < 1. If α, β > 1, then it is a simple exercise in differential calculus to show
that f(x) has a unique mode at x = (α − 1)/(α + β − 2). So the maximum of the
density function for BETA(α = 1.4, β = 1.6) occurs at x = 0.4, and the maximum
value is f(0.4) = [Γ(3)/(Γ(1.4)Γ(1.6))] 0.4^{0.4} 0.6^{0.6} = 1.287034. The first block of code below
shows several ways in which the mode and maximum can be verified numerically
in R. A program generating random samples from BETA(1.4, 1.6) follows. c
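The first block itself is not reproduced here; the following sketch (ours) verifies
the mode and maximum and defines the bound max2 and the parameters alpha and
beta used in the code below.
alpha = 1.4; beta = 1.6
(alpha - 1)/(alpha + beta - 2)                        # mode: 0.4
gamma(3)/(gamma(1.4)*gamma(1.6)) * 0.4^0.4 * 0.6^0.6  # maximum: 1.287034
xx = seq(0, 1, by=0.0001)
max2 = max(dbeta(xx, alpha, beta)); max2              # numerical maximum
optimize(function(x) dbeta(x, alpha, beta),
         interval=c(0, 1), maximum=TRUE)              # another check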
set.seed(1240)
m = 100000 # nr of candidate values
y = runif(m) # candidate y
acc = rbinom(m, 1, dbeta(y, 1.4, 1.6)/max2) # acceptance rule
x = y[acc==T] # accepted values
# Numerical values
mean(x) # sample mean estimates E(X)
sd(x) # sample sd estimates SD(X)
mean(x < 1/2) # estimates P{X < 1/2}
mean(acc) # acceptance rate
x1 = rbeta(m, alpha, beta); mean(x1); sd(x1); mean(x1 < 1/2)
alpha/(alpha+beta)
sqrt(alpha*beta/((alpha+beta+1)*(alpha+beta)^2))
pbeta(1/2, alpha, beta)
> sqrt(alpha*beta/((alpha+beta+1)*(alpha+beta)^2))
[1] 0.2494438 # exact SD(X)
> pbeta(1/2, alpha, beta)
[1] 0.5528392 # exact P{X < 1/2}
Answers: (a) Compare with exact values E(X) = 3/4, SD(X) = (3/80)^{1/2} = 0.1936,
and FX(1/2) = P{X ≤ 1/2} = 1/8.
3.12 In Example 3.5, interpret the output for the run shown in the ex-
ample as follows. First, verify using hand computations the values given for
Y1 , Y2 , . . . , Y5 . Then, say exactly how many Heads were obtained in the first
9996 simulated tosses and how many Heads were obtained in all 10 000 tosses.
d In the first 9996 tosses there were 0.49990(9996) = 4997 Heads, and in all 10 000
tosses there were 4999 Heads. Noticing that, among the last four tosses, only those
numbered 9999 and 10 000 resulted in Heads (indicated by 1s), we can also express
the number of Heads in the first 9996 tosses as 4999 − 2 = 4997. c
3.13 Run the program of Example 3.5 several times (omitting set.seed).
Did you get any values of Y10 000 outside the 95% interval (0.49, 0.51) claimed
there? Looking at the traces from your various runs, would you say that the
runs are more alike for the first 1000 values of n or the last 1000 values?
d Of course, there is no way for us to know what you observed. However, we can
make a probability statement. When n = 10, 000 and π = 1/2, the distribution of
X ∼ BINOM(n, π) places about 95% of its probability in the interval (0.49, 0.51).
Specifically, diff(pbinom(c(4901, 5099), 10000, 1/2)) returns 0.9522908.
There is much more variability in the appearance of traces near the beginning
than near the end, where most traces have become very close to 1/2. In particular,
Sn ∼ BINOM(n, 1/2), so SD(Yn) = SD(Sn/n) = √(1/(4n)), which decreases with
increasing n. c
3.14 By making minor changes in the program of Example 3.2 (as below), it
is possible to illustrate the convergence of the approximation to J = 0.341345
as the number n of randomly chosen points increases to m = 5000. Explain
what each statement in the code does. Make several runs of the program. How
variable are the results for very small values of n, and how variable are they
for values of n near m = 5000? (Figure 3.11 shows superimposed traces for 20
runs.)
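The modified program itself is not reproduced here; a sketch along the lines
described (names and details are ours) is:
set.seed(1212)                        # seed is an arbitrary choice
m = 5000; a = 0; b = 1
u = a + (b - a)*runif(m)              # random points in (0, 1)
h = dnorm(u)                          # heights of the normal curve
j = (b - a)*cumsum(h)/(1:m)           # running approximations J_n
plot(1:m, j, type="l", ylim=c(0.32, 0.36))
abline(h = 0.341345)                  # limiting value J
sd(h)                                 # roughly 0.0486, as discussed below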
d Generally speaking, the traces become less variable near m = 5000 as they converge
towards J. More specifically, let the random variable H be the height of the normal
curve above a randomly chosen point in (0, 1).
After the program above is run, we can estimate SD(H) by sd(h), which returned
approximately 0.0486 after one run of the program. Then the standard deviation of
the estimated area Jn, based on n points, is about 0.0486/√n, which decreases as n
increases. c
Note: The plotting parameter ylim establishes a relatively small vertical range for
the plotting window on each run, making it easier to assess variability within and
among runs.
3.16 In Example 3.5, let ε = 1/10 and define Pn = P{|Yn − 1/2| < ε} =
P{1/2 − ε < Yn < 1/2 + ε}. In R, the function pbinom is the cumulative
distribution function of a binomial random variable.
a) In the R Console window, execute
n = 1:100; pbinom(ceiling(n*0.6)-1, n, 0.5) - pbinom(n*0.4, n, 0.5)
Explain how this provides values of Pn , for n = 1, 2, . . . 100. (Notice
that the argument n in the function pbinom is a vector, so 100 results
are generated by the second statement.) Also, report the five values
P20 , P40 , P60 , P80 , and P100 , correct to six decimal places, and compare
results with Figure 3.12.
d With ε = 1/10, we have Pn = P{0.4 < Yn < 0.6} = P{0.4n < Xn < 0.6n},
where Xn ∼ BINOM(n, 1/2). The ceiling function rounds up to the next integer,
and subtracting 1 ensures that the upper end of the interval does not include an
unwanted value.
For example, when n = 10, only Xn = 5 satisfies the strict inequalities. Look
at the output below to see that P10 is precisely the value 0.246094 returned by
round(dbinom(5, 10, 1/2), 6). With less elegant mathematics and simpler R
code, we could have illustrated the LLN using Pn′ = P{1/2 − ε < Yn ≤ 1/2 + ε}.
In the output below, we use the vector show to restrict the output to only a few
lines relevant here and in part (b). c
> cbind(n, p)
n p
[1,] 200 0.379271
[2,] 600 0.652246
[3,] 1000 0.782552
[4,] 1400 0.858449
[5,] 1800 0.905796
[6,] 2200 0.936406
[7,] 2600 0.956637
[8,] 3000 0.970209
[9,] 3400 0.979413
[10,] 3800 0.985707
[11,] 4200 0.990038
[12,] 4600 0.993035
[13,] 5000 0.995116
3.17 Modify the program of Example 3.5 so that there are only n = 100
tosses of the coin. This allows you to see more detail in the plot. Compare the
behavior of a fair coin with that of a coin heavily biased in favor of Heads,
P (Heads) = 0.9, using the code h = rbinom(m, 1, 0.9). Make several runs
for each type of coin. Some specific points for discussion are: (i) Why are there
long upslopes and short downslopes in the paths for the biased coin but not for
the fair coin? (ii) Which simulations seem to converge faster—fair or biased?
(iii) Do the autocorrelation plots acf(h) differ between fair and biased coins?
d (i) The biased coin has long runs of Heads (average length 10), which correspond to
upslopes, interspersed with occasional Tails. The fair coin alternates Heads and Tails
(average length of each kind of run is 2). (ii) Biased coins have smaller variance and
so their traces converge faster. For a fair coin V(Yn ) = 1/4n, but a biased coin with
π = 0.9 has V(Yn ) = π(1−π)/n = 9/100n. (iii) The Hi are independent for both fair
and biased coins, so neither autocorrelation function should show significant corre-
lation at any lag (except, of course, for “lag” 0, which always has correlation 1). c
set.seed(1212)
m = 100; n = 1:m
h = rbinom(m, 1, 1/2); y = cumsum(h)/n # fair coin
plot (n, y, type="l", ylim=c(0,1)) # trace (not shown)
cbind(n,h,y)[1:18, ]
acf(h, plot=F) # 2nd parameter produces printed output
> cbind(n,h,y)[1:18, ]
n h y
[1,] 1 0 0.0000000
[2,] 2 0 0.0000000
[3,] 3 1 0.3333333
[4,] 4 0 0.2500000
[5,] 5 1 0.4000000
[6,] 6 0 0.3333333
[7,] 7 0 0.2857143
[8,] 8 0 0.2500000
[9,] 9 1 0.3333333
[10,] 10 0 0.3000000
[11,] 11 1 0.3636364
[12,] 12 1 0.4166667
[13,] 13 0 0.3846154
[14,] 14 1 0.4285714
[15,] 15 1 0.4666667
[16,] 16 0 0.4375000
[17,] 17 0 0.4117647
[18,] 18 0 0.3888889
0 1 2 3 4 5 6 7 8
1.000 -0.225 0.010 -0.101 -0.010 -0.023 0.050 0.002 -0.006
9 10 11 12 13 14 15 16 17
0.022 0.136 -0.017 0.075 -0.136 -0.004 -0.017 0.178 -0.096
18 19 20
0.099 -0.013 -0.044
set.seed(1213)
m = 100; n = 1:m
h = rbinom(m, 1, .9); y = cumsum(h)/n # biased coin
plot (n, y, type="l", ylim=c(0,1)) # trace (not shown)
cbind(n,h,y)[1:18, ]
acf(h, plot=F) # 2nd parameter produces printed output
> cbind(n,h,y)[1:18, ]
n h y
[1,] 1 1 1.0000000
[2,] 2 1 1.0000000
[3,] 3 1 1.0000000
[4,] 4 1 1.0000000
[5,] 5 1 1.0000000
[6,] 6 1 1.0000000
[7,] 7 1 1.0000000
[8,] 8 0 0.8750000
[9,] 9 1 0.8888889
[10,] 10 1 0.9000000
[11,] 11 1 0.9090909
[12,] 12 1 0.9166667
[13,] 13 1 0.9230769
[14,] 14 0 0.8571429
[15,] 15 1 0.8666667
[16,] 16 1 0.8750000
[17,] 17 1 0.8823529
[18,] 18 1 0.8888889
0 1 2 3 4 5 6 7 8
1.000 -0.023 -0.024 -0.025 -0.015 -0.016 0.085 -0.019 -0.009
9 10 11 12 13 14 15 16 17
-0.010 -0.113 0.090 0.088 -0.117 -0.107 -0.006 0.004 -0.100
18 19 20
0.001 0.000 -0.001
3.18 A version of the program in Example 3.5 with an explicit loop would
substitute one of the two blocks of code below for the lines of the original
program that make the vectors h and y.
# First block: One operation inside loop
h = numeric(m)
for (i in 1:m) { h[i] = rbinom(1, 1, 1/2) }
y = cumsum(h)/n
Modify the program with one of these blocks, use m = 500 000 iterations,
and compare the running time with that of the original “vectorized” program.
To get the running time of a program accurate to about a second, use as the
first line t1 = Sys.time() and as the last line t2 = Sys.time(); t2 - t1.
t1 = Sys.time()
m = 500000; n = 1:m
h = rbinom(m, 1, 1/2) # Original vectorized version
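y = cumsum(h)/n                       # closing lines of the vectorized timing
t2 = Sys.time()                       # block, which are not shown in this copy;
elapsed.0 = t2 - t1                   # the name 'elapsed.0' is ours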
t1 = Sys.time()
m = 500000; n = 1:m; h = numeric(m)
for (i in 1:m) { # Version with one
h[i] = rbinom(1, 1, 1/2) } # operation inside loop
y = cumsum(h)/n
t2 = Sys.time()
elapsed.1 = t2 - t1
t1 = Sys.time()
m = 500000; n = 1:m; y = numeric(m); h = numeric(m)
for (i in 1:m) { # Version
if (i==1) # with
{b = rbinom(1, 1, 1/2); h[i] = y[i] = b} # several
else # operations
{b = rbinom(1, 1, 1/2); h[i] = b; # inside
y[i] = ((i - 1)*y[i - 1] + b)/i} } # loop
t2 = Sys.time()
elapsed.2 = t2 - t1
Note: On computers available as this is being written, the explicit loops in the
substitute blocks take noticeably longer to execute than the original vectorized code.
3.19 The program in Example 3.6 begins with a Sunny day. Eventually,
there will be a Rainy day, and then later another Sunny day. Each return to
Sun (0) after Rain (1), corresponding to a day n with Wn−1 = 1 and Wn = 0,
signals the end of one Sun–Rain “weather cycle” and the beginning of another.
(In the early part of the plot of Yn , you can probably see some “valleys” or
“dips” caused by such cycles.)
If we align the vectors (W1 , . . . , W9999 ) and (W2 , . . . , W10 000 ), looking
to see where 1 in the former matches 0 in the latter, we can count the
complete weather cycles in our simulation. The R code to make this count
is length(w[w[1:(m-1)]==1 & w[2:m]==0]). Type this line in the Console
window after a simulation run—or append it to the program. How many cycles
do you count with set.seed(1237)?
d In the program below, we see 201 complete Sun–Rain cycles in 10 000 simulated
days, for an average cycle length of about 10, 000/201 = 49.75 days. However, the
code cy.end = n[w[1:(m-1)]==1 & w[2:m]==0] makes a list of days on which such
cycles end, and the last complete cycle in this run ended on day 9978 (obtained as
max(cy.end)). Thus, a fussier estimate of cycle length would be 9978/201 = 49.64.
Although we have simulated 10 000 days, we have seen only 201 cycles, so we can’t
expect this estimate of cycle length to be really close. Based on the means of geo-
metric random variables, the exact theoretical length is 1/0.03 + 1/0.06 = 50 days. c
Hint: One can show that the theoretical cycle length is 50 days. Compare this with
the top panel of Figure 3.7 (p63).
3.20 Branching out from Example 3.6, we discuss two additional imaginary
islands. Call the island of the example Island E.
a) The weather on Island A changes more readily than on Island E. Specif-
ically, P {Wn+1 = 0|Wn = 0} = 3/4 and P {Wn+1 = 1|Wn = 1} = 1/2.
Modify the program of Example 3.6 accordingly, and make several runs.
Does Yn appear to converge to 1/3? Does Yn appear to stabilize to its
limit more quickly or less quickly than for Island E?
d The trace in the program below (plot not shown) stabilizes much more quickly
than the one for Island E. Intuitively, it seems likely that it will be sunny much of
the time: A sunny day will be followed by a rainy one only a quarter of the time,
but a rainy day has a 50-50 chance of being followed by a sunny one. One can show
that the proportion of rainy days over the long run is α/(α + β) = 1/3, where α
and β are the respective probabilities of weather change. c
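The program for Island A is not reproduced here; a sketch patterned on the
program shown farther below is given next (the acf output that follows is from the
Manual's own run, so it will not be reproduced exactly):
set.seed(1212)                        # seed is an arbitrary choice
m = 10000; n = 1:m
alpha = 1/4; beta = 1/2               # probabilities of a change in weather
w = numeric(m); w[1] = 0              # start with a Sunny day
for (i in 2:m) {
  if (w[i-1]==0) w[i] = rbinom(1, 1, alpha)       # Sun -> Rain w.p. 1/4
  else           w[i] = rbinom(1, 1, 1 - beta) }  # Rain -> Rain w.p. 1/2
y = cumsum(w)/n
plot(n, y, type="l")                  # trace (not shown)
targ = alpha/(alpha + beta); abline(h = targ)
y[m/2]; y[m]
acf(w, plot=F)                        # for part (c)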
0 1 2 3 4 5 6 7 8
1.000 0.228 0.052 0.021 -0.011 -0.033 -0.026 -0.028 -0.014
9 10 11 12 13 14 15 16 17
0.001 0.018 -0.002 -0.011 -0.008 -0.001 0.002 0.002 0.019
18 19 20 21 22 23 24 25 26
0.004 0.005 -0.008 -0.007 -0.006 0.005 -0.009 -0.018 0.003
27 28 29 30 31 32 33 34 35
0.002 0.001 0.012 0.012 0.008 -0.001 0.000 0.009 -0.007
36 37 38 39 40
-0.002 0.007 0.017 0.001 0.008
set.seed(1235)
m = 10000; n = 1:m; alpha = 1/3; beta = 2/3
w = numeric(m); w[1] = 0
for (i in 2:m) {
if (w[i-1]==0) w[i] = rbinom(1, 1, alpha)
else w[i] = rbinom(1, 1, 1 - beta) }
y = cumsum(w)/n
plot(n, y, type="l") # plot not shown
targ = alpha/(alpha + beta); abline(h = targ)
y[m/2]; y[m]
acf(w, plot=F) # for part (c)
0 1 2 3 4 5 6 7 8
1.000 0.014 -0.005 -0.012 0.004 -0.007 0.017 -0.002 0.011
9 10 11 12 13 14 15 16 17
0.008 -0.014 0.010 -0.006 -0.004 0.005 -0.010 -0.005 -0.011
18 19 20 21 22 23 24 25 26
-0.014 -0.007 -0.010 -0.012 0.008 0.002 0.011 -0.005 -0.005
27 28 29 30 31 32 33 34 35
0.005 -0.003 0.003 -0.005 -0.005 0.002 0.005 -0.014 -0.008
36 37 38 39 40
0.009 0.007 0.002 -0.012 0.013
c) Make acf plots for Islands A and B, and compare them with the corre-
sponding plot in the bottom panel of Figure 3.7 (p63).
d The ACF plot for Island A shows the dependence of each day’s weather on the
previous couple of days. The ACF plot for Island B shows no significant correlation
for any lag, which is consistent with the independence noted in the answer to part (b). c
Note: We know of no real place where weather patterns are as extremely persistent
as on Island E. The two models in this problem are both more realistic.
3.21 Proof of the Weak Law of Large Numbers (Theoretical). Turn the
methods suggested below (or others) into carefully written proofs. Verify the
examples. (Below we assume continuous random variables. Similar arguments,
with sums for integrals, would work for the discrete case.)
a) Markov’s Inequality. Let W be a random variable that takes only positive
values and has a finite expected value E(W) = ∫_0^∞ w fW(w) dw. Then, for
any a > 0, P{W ≥ a} ≤ E(W)/a.
Method of proof: Break the integral into two nonnegative parts, over the
intervals (0, a) and (a, ∞). Then E(W) cannot be less than the second
integral, which in turn cannot be less than aP{W ≥ a} = a ∫_a^∞ fW(w) dw.
Example: Let W ∼ UNIF(0, 1). Then, for 0 < a < 1, E(W)/a = 1/(2a) and
P{W ≥ a} = 1 − P{W < a} = 1 − a. Is 1 − a ≤ 1/(2a)?
d Following the suggested method, for a > 0,
E(W) = ∫_0^∞ w fW(w) dw = ∫_0^a w fW(w) dw + ∫_a^∞ w fW(w) dw ≥ ∫_a^∞ w fW(w) dw
≥ a ∫_a^∞ fW(w) dw = a P{W ≥ a},
where the first inequality holds because a nonnegative term is omitted and the
second holds because the value of w inside the integral must be at least as big as a.
The first equality requires that the support of W be contained in the positive half
line.
In the example, for W ∼ UNIF(0, 1) we have E(W ) = 1/2 and P {W ≥ a} = 1−a,
for 0 < a < 1. Markov’s Inequality says that E(W )/a = 1/2a ≥ P {W ≥ a} = 1 − a.
This amounts to the claim that a(1 − a) ≤ 1/2. But, for 0 < a < 1, the function
g(a) = a(1 − a) is a parabola with maximum value 1/4 at a = 1/2. So, in fact,
a(1 − a) ≤ 1/4 < 1/2.
The inequalities in the proof of Markov’s result may seem to have been obtained
by such extreme strategies (throwing away one integral and severely truncating
another), that you may wonder if equality is ever achieved. Generally not, in practical
applications. But consider a degenerate “random variable” X with P {X = µ} = 1
and µ > 0. Then E(X) = µ, E(X)/µ = 1, and also P {X ≥ µ} = 1. c
Note: What we have referred to in this section as the Law of Large Numbers is
usually called the Weak Law of Large Numbers (WLLN), because a stronger result
can be proved with more advanced mathematical methods than we are using in this
book. The same assumptions imply that P {Ȳn → µ} = 1. This is called the Strong
Law of Large Numbers. The proof is more advanced because one must consider the
joint distribution of all Yn in order to evaluate the probability.
3.22 In Example 3.5, we have Sn ∼ BINOM(n, 1/2). Thus E(Sn ) = n/2 and
V(Sn ) = n/4. Find the mean and variance of Yn . According to the Central
Limit Theorem, Yn is very nearly normal for large n. Assuming Y10 000 to
be normal, find P {|Y10 000 − 1/2| ≥ 0.01}. Also find the margin of error in
estimating P {Heads} using Y10 000 .
d We have E(Yn) = E(Sn/n) = (1/n)E(Sn) = 1/2, V(Yn) = V(Sn/n) = (1/n²)V(Sn) = 1/(4n),
and SD(Yn) = 1/(2√n). Thus
P{|Y10 000 − 1/2| ≥ 0.01} = P{|Y10 000 − 1/2|/(1/200) ≥ 0.01/(1/200) = 2}
≈ P{|Z| ≥ 2} = 0.0455.
The corresponding 95% margin of error for estimating P{Heads} by Y10 000 is
about 2 SD(Y10 000) = 0.01. c
The program below carries out the required simulation and computation for the
random variable T̄. The density function of GAMMA(12, 6) is illustrated in Figure 3.13,
along with numerical results from the programs, as mentioned in part (c). c
set.seed(1215)
m = 500000; n = 12
x = rexp(m*n, rate=1/2); DTA = matrix(x, m)
x.bar = rowMeans(DTA)
mean(x.bar > 3)
1 - pgamma(3, 12, rate=6)
c) Compare your results from parts (a) and (b) with Figure 3.13 and numer-
ical values given in its caption.
b) What assumption of Section 3.4 fails for d = −1/2? What is the value
of J? Of V(Y )? Try running the two approximations. How do you explain
the unexpectedly good behavior of the Monte Carlo simulation?
d Using d = -1/2, we obtain the output shown below. The function f(x) = x^{−1/2}
is not bounded in (0, 1). However, the integral is finite, and the area under f(x)
and above (0, ε), for small ε, is very small (specifically, 2√ε). Thus the Riemann
approximation is very near to ∫_0^1 x^{−1/2} dx = 2. Monte Carlo results are good because
values of u in (0, ε) are extremely rare. To put it another way, all assumptions are
valid for the computation of ∫_ε^1 x^{−1/2} dx ≈ 2 by either method, and
∫_0^ε x^{−1/2} dx ≈ 0, for ε > 0 sufficiently small. c
3.25 This problem shows how the rapid oscillation of a function can affect
the accuracy of a Riemann approximation.
a) Let h(x) = |sin kπx| and k be a positive integer. Then use calculus to
show that ∫_0^1 h(x) dx = 2/π = 0.6366. Use the code below to plot h on
[0, 1] for k = 4.
k = 4
x = seq(0,1, by = 0.01); h = abs(sin(k*pi*x))
plot(x, h, type="l")
d As the plot (not shown here) illustrates, the function h(x) has k congruent sine-
shaped “humps” with maximums at 1 and minimums at 0. Thus the claim can be
established by finding the area of one of these k humps:
∫_0^{1/k} |sin kπx| dx = (1/k) ∫_0^1 sin(πy) dy = [−cos(πy)/(kπ)]_0^1 = 2/(kπ),
where we have made the substitution y = kx at the first equality. Adding the k
equal areas together, we have ∫_0^1 h(x) dx = 2/π = 0.6366. c
c) Use calculus to show that V(Y) = V(h(U)) = 1/2 − 4/π² = 0.0947. How
accurately is this value approximated by simulation? If m = 10 000, find
the margin of error for the Monte Carlo approximation in part (b) based
on SD(Y) and the Central Limit Theorem. Are your results consistent
with this margin of error? d No solution is provided for this part. c
3.26 The integral J = ∫_0^1 sin²(1/x) dx cannot be evaluated analytically,
but advanced analytic methods yield ∫_0^∞ sin²(1/x) dx = π/2.
a) Assuming this result, show that J = π/2 − ∫_0^1 x^{−2} sin²x dx. Use R to plot
both integrands on (0, 1), obtaining results as in Figure 3.14.
d We manipulate the integral obtained by advanced methods to obtain the alternate
form of J:
π/2 = ∫_0^∞ sin²(1/x) dx = ∫_0^1 sin²(1/x) dx + ∫_1^∞ sin²(1/x) dx
    = ∫_0^1 sin²(1/x) dx + ∫_0^1 y^{−2} sin²y dy,
where the last integral arises from the substitution y = 1/x, dx = −y^{−2} dy. c
set.seed(1234)
u = runif(m, a, b); hu = (sin(1/u))^2
y = (b - a)*hu; MC1[i] = mean(y)
EstME1[i]=2*sd(y)/sqrt(m)
u = runif(m, a, b); hu = (sin(u)/u)^2
y = (b - a)*hu; MC2[i] = pi/2 - mean(y)
EstME2[i]=2*sd(y)/sqrt(m) }
round(cbind(M, RA1, RA2, MC1, EstME1, MC2, EstME2), 4)
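The block above is only the tail of a longer program; a self-contained sketch
along the same lines (the run sizes in M and the Riemann lines are our assumptions)
is:
set.seed(1234)
a = 0; b = 1
M = c(10^3, 10^4, 10^5, 10^6)                  # numbers of points per run
k = length(M)
RA1 = RA2 = MC1 = MC2 = EstME1 = EstME2 = numeric(k)
for (i in 1:k) {
  m = M[i]; w = (b - a)/m
  g = seq(a + w/2, b - w/2, length=m)
  RA1[i] = sum(w * (sin(1/g))^2)               # Riemann, integrand sin^2(1/x)
  RA2[i] = pi/2 - sum(w * (sin(g)/g)^2)        # Riemann, alternate form
  u = runif(m, a, b); hu = (sin(1/u))^2
  y = (b - a)*hu; MC1[i] = mean(y)             # Monte Carlo, direct
  EstME1[i] = 2*sd(y)/sqrt(m)
  u = runif(m, a, b); hu = (sin(u)/u)^2
  y = (b - a)*hu; MC2[i] = pi/2 - mean(y)      # Monte Carlo, alternate form
  EstME2[i] = 2*sd(y)/sqrt(m)
}
round(cbind(M, RA1, RA2, MC1, EstME1, MC2, EstME2), 4)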
3.27 Modify the program of Example 3.9 to approximate the volume be-
neath the bivariate standard normal density surface and above two additional
regions of integration as specified below. Use both the Riemann and Monte
Carlo methods in parts (a) and (b), with m = 10 000.
a) Evaluate P {0 < Z1 ≤ 1, 0 < Z2 ≤ 1}. Because Z1 and Z2 are indepen-
dent standard normal random variables, we know that this probability is
0.341345² = 0.116516. For each method, say whether it would have been
better to use m = 10 000 points to find P {0 < Z ≤ 1} and then square
the answer.
d The first program below uses both Riemann and Monte Carlo methods to find
P {0 < Z ≤ 1}2 . The second approximates P {0 < Z1 ≤ 1, 0 < Z2 ≤ 1} with both
methods. For each Monte Carlo integration, results from several runs are shown.
The exact value, from (pnorm(1) - .5)^2, is 0.1165162.
For Riemann approximation, squaring P{0 < Z ≤ 1} gives the exact answer to
seven places; the result from a 2-d grid is accurate only to five places. For the Monte
Carlo method, integration over the square seems slightly better; both methods can
give an incorrect fourth digit, but the fourth digit seems to vary a little more when
the probability of the interval is squared. Perhaps, even in this simple example for
one and two dimensions, improvement of Monte Carlo integration results in higher
dimensions is barely beginning to show. c
m = 10000; a = 0; b = 1; w = (b - a)/m
x = seq(a + w/2, b-w/2, length=m); hx = dnorm(x)
set.seed(1111)
u = runif(m, a, b); hu = dnorm(u); y = (b - a)*hu
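The concluding lines of this first program are not shown; presumably they square
the one-dimensional results, for example:
(sum(w * hx))^2                       # Riemann value of P{0 < Z <= 1}, squared
(mean(y))^2                           # Monte Carlo value, squared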
m = 10000
g = round(sqrt(m)) # no. of grid pts on each axis
x1 = rep((1:g - 1/2)/g, times=g) # these two lines give
x2 = rep((1:g - 1/2)/g, each=g) # coordinates of grid points
hx = dnorm(x1)*dnorm(x2)
sum(hx)/g^2 # Riemann P{Unit square}
set.seed(1120)
u1 = runif(m) # these two lines give a random
u2 = runif(m) # point in the unit square
hu = dnorm(u1)*dnorm(u2)
mean(hu) # Monte Carlo P{Unit square}
b) Evaluate P {Z12 + Z22 < 1}. Here the region of integration does not have
area 1, so remember to multiply by an appropriate constant. Because
Z12 + Z22 ∼ CHISQ(2), the exact answer can be found with pchisq(1, 2).
d With both methods, we integrate over the portion of the unit circle in the first
quadrant and then multiply by 4. The Riemann approximation agrees with the
exact value 0.3935 to three places. To get about 10 000 points of evaluation for both
methods, we increase the number of candidate points for the Monte Carlo method
to 12 732 (and happen to have 10 028 of them accepted in the run shown). The
multiplier 1/2 in the Monte Carlo integration of Example 3.9 becomes π/4 here. In
the (very lucky) run shown, the Monte Carlo integration has four-place accuracy. c
m = 10000; g = round(sqrt(m))
x1 = rep((1:g-1/2)/g, times=g); x2 = rep((1:g-1/2)/g, each=g)
hx = dnorm(x1)*dnorm(x2)
4 * sum(hx[x1^2 + x2^2 < 1])/g^2 # Riemann approximation
pchisq(1, 2) # exact value
set.seed(1222)
m = round(10000*4/pi) # to get about 10000 accepted
u1 = runif(m); u2 = runif(m); hu = dnorm(u1)*dnorm(u2)
hu.acc = hu[u1^2 + u2^2 < 1]
m.prime = length(hu.acc); m.prime # number accepted
4*(pi/4) * mean(hu.acc) # Monte Carlo result
2*pi*sd(hu.acc)/sqrt(m.prime) # MC margin of error
c) The joint density function of (Z1 , Z2 ) has circular contour lines centered
at the origin, so that probabilities of regions do not change if they are
rotated about the origin. Use this fact to argue that the exact value of
P {Z1 > 0, Z2 > 0, Z1 +Z2 < 1}, which was approximated in Example 3.9,
can be found with (pnorm(1/sqrt(2)) - 0.5)^2.
d Consider the square spanning all four quadrants; our triangle contains a quarter
of its total area. This square has two of its vertices at (0, 1) and (0, −1), and the
length of one of its sides is √2. If we rotate this square by 45 degrees so that its sides
are parallel to the axes, it will still have the same probability under the bivariate
normal curve as before rotation. By symmetry, the desired probability is the same
as for the square within the first quadrant after rotation, with two of its corners at
the origin and (1/√2, 1/√2). Then, by an argument similar to that of part (a), the
R expression provided above computes the exact probability J = 0.06773 mentioned
in Example 3.9. c
3.28 Here we extend the idea of Example 3.9 to three dimensions. Suppose
three items are drawn at random from a population of items with weights (in
grams) distributed as NORM(100, 10).
d We do not provide a detailed code. See the partial Answers at the end of the
problem. c
set.seed(1123)
m = 100000; d = 1:4; ball.vol = pi^(d/2)/gamma(d/2 + 1)
u1 = runif(m); u2 = runif(m); u3 = runif(m); u4 = runif(m)
hu = dnorm(u1); sq.dist = u1^2
Note: In case you want to explore higher dimensions, the general formula for the
hypervolume of the unit ball in d dimensions is π^{d/2}/Γ((d + 2)/2); for a derivation see
Courant and John (1989), p459. Properties of higher dimensional spaces may seem
strange to you. What happens to the hypervolume of the unit ball as d increases?
What happens to the probability assigned to the (entire) unit ball by the d-variate
standard normal distribution? What happens to the hypervolume of the smallest
hypercube that contains it? There is “a lot of room” in higher dimensional space.
Errors in Chapter 3
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. This list includes corrections compiled through July 13, 2011.
p74 Problem 3.7. Hint (a): R code h = 12*g^3*(1-g) should be h = 12*g^2*(1-g).
p76 Problem 3.11(b). Should refer to Figure 3.10 (on the next page), not Figure 3.2.
[Thanks to Leland Burrill.]
p84 Problem 3.27(c). The probability should be P {Z1 > 0, Z2 > 0, Z1 + Z2 < 1}.
That is, the event should be restricted to the first quadrant.
Note: Equation (4.1) says that the first and third probabilities are equal.
d For the analytical proof here and to find E(X²) in part (c), use integration by
parts or the moment generating function m(s) = E(e^{sX}) = λ/(λ − s), for s < λ,
which has m′(0) = E(X) and m″(0) = E(X²).
To illustrate that a random sample of 100 000 observations from EXP(λ = 3)
has a sample mean very near 1/3, type the following into the R Session window:
mean(rexp(100000, 3)). You can anticipate two-place accuracy. For a similar illus-
tration of the value of SD(X) in part (c), use sd(rexp(100000, 3)). Referring to
part (a), what are the results of qexp(1/2, 3) and log(2)/3, and the approximate
result of quantile(rexp(100000, 3), 1/2)? c
c) Similarly, show that V(X) = E(X²) − [E(X)]² = 1/λ², and SD(X) = 1/λ.
4.3 Explain each step in equation (4.2). For X ∼ EXP(λ) and r, s, t > 0,
why is P {X > r + t|X > r} = P {X > s + t|X > s}?
d Following the Hints, the first equal sign below uses the definition of conditional
probability, P(A|B) = P(A ∩ B)/P(B). The numerator of the resulting fraction is
P(A ∩ B) = P(A), because A = {X > s + t} is a subset of B = {X > s}:
P{X > s + t | X > s} = P{X > s + t}/P{X > s} = e^{−λ(s+t)}/e^{−λs} = e^{−λt},
which does not depend on s; the same computation with r in place of s gives the
same value, so P{X > r + t|X > r} = P{X > s + t|X > s}. c
4.4 In the R code below, each line commented with a letter (a)–(h) returns
an approximate result related to the discussion at the beginning of Section 4.1.
For each, say what method of approximation is used, explain why the result
may not be exactly correct, and provide the exact value being approximated.
d Let Y ∼ POIS(λ) and X ∼ EXP(λ). Then with λ = 0.5, we have E(Y) = 0.5 and
E(X) = 1/λ = 2:
(a) Σ_{k=0}^{100} e^{−λ}λ^k/k! ≈ Σ_{k=0}^{∞} e^{−λ}λ^k/k! = 1 verifies that the Poisson probability
distribution function sums to 1, as it must.
(b) Σ_{k=0}^{100} k e^{−λ}λ^k/k! ≈ E(Y) = 0.5.
(c) The Riemann approximation of ∫_0^{1000} λe^{−λy} dy ≈ ∫_0^∞ λe^{−λy} dy = 1.
(d) The Riemann approximation of E(X) ≈ ∫_0^{100} yλe^{−λy} dy.
(e & f) The sampling method approximates E(X) = SD(X) = 1/λ = 2.
(g & h) The probability P{X > 0.1} = P{X > 0.15 | X > 0.05} = 0.9512 is
approximated (by the sampling method), thus illustrating the no-memory property
for s = 0.1 and t = 0.05. c
Hints: Several methods from Chapter 3 are used. Integrals over (0, ∞) are approxi-
mated. For what values of s and t is the no-memory property illustrated?
4.5 Four statements in the R code below yield output. Which ones? Which
two statements give the same results? Why? Explain what two other state-
ments compute. Make the obvious modifications for maximums, try to predict
the results, and verify.
> x1 = c(1, 2, 3, 4, 5, 0, 2) # define x1
> x2 = c(5, 4, 3, 2, 1, 3, 7) # define x2
> min(x1, x2); pmin(x1, x2)
[1] 0 # minimum value among 14
[1] 1 2 3 2 1 0 2 ## ’parallel’ (elementwise) minimum
set.seed(1212)
m = 100000; lam1 = 1/5; lam2 = 1/4
x1 = rexp(m, lam1); x2 = rexp(m, lam2)
v = pmin(x1, x2)
mean(x2 == v) # Min wait is wait for 2nd teller (part a)
mean(v) # Avg. min is avg. wait to start (part b)
c) Now suppose there is only one teller with service rate λ = 1/5. You are
next in line to be served. Approximate by simulation the probability it
will take you more than 5 minutes to finish being served. This is the same
as one of the probabilities mentioned under Scenario 2. Which one? What
is the exact value of the probability you approximated? Discuss.
d The time to finish service with this one teller is the sum of the waiting time to
start service and the waiting time to finish service with the teller. From Scenario 2
of the example we know that T = X1 + X2 ∼ GAMMA(2, 1/5). Thus we can find the
exact value of P {T > 5} in R as 1 - pgamma(5, 2, 1/5), which returns 0.7358. c
set.seed(1212)
m = 100000; lam1 = lam2 = 1/5
x1 = rexp(m, lam1); x2 = rexp(m, lam2)
t = x1 + x2
mean(t > 5)
FW(t) = P{W ≤ t} = P{X1 ≤ t, X2 ≤ t}
      = P{X1 ≤ t}P{X2 ≤ t} = (1 − e^{−t/5})(1 − e^{−t/4}).
b) Use the result of part (a) to verify the exact value P {W > 5} = 0.5490
given in Scenario 3.
d Then P{W > 5} = 1 − FW(5) = 1 − (1 − e^{−1})(1 − e^{−1.25}) = 0.5490. The numerical
evaluation is done in R with 1 - (1 - exp(-1))*(1 - exp(-1.25)). Also see
below. c
set.seed(1212)
m = 100000; lam1 = 1/5; lam2 = 1/4
x1 = rexp(m, lam1); x2 = rexp(m, lam2)
w = pmax(x1, x2)
mean(w); mean(w > 5)
d) Use the result of part (a) to find the density function fW(t) of W, and
hence find the exact value of E(W) = ∫_0^∞ t fW(t) dt. d See method above. c
4.8 Modify the R code in Example 4.2 to explore a parallel system of four
CPUs, each with failure rate λ = 1/5. The components are more reliable here,
but fewer of them are connected in parallel. Compare the ECDF of this system
with the one in Example 4.2. Is one system clearly better than the other?
(Defend your answer.) In each case, what is the probability of surviving for
more than 12 years?
d The approximate probability of 12-year survival for the system of the example
is 0.22, whereas the corresponding probability for the system of this problem is 0.32.
Over the long run it is hard to beat more reliable components, even if there are
fewer of them connected in parallel. More precisely, 1 − (1 − e^{−12/4})^5 = 0.2254, while
1 − (1 − e^{−12/5})^4 = 0.3164. (See Problem 4.9.)
The figure made by the program below illustrates that, for time periods between
about 3 and 30 years, the system of this problem is more reliable (dashed line in the
ECDF is below the solid one). But initially the system of the example is the more
reliable one. For example, reliabilities at 2 years are 1 − (1 − e^{−2/4})^5 = 0.9906 and
1 − (1 − e^{−2/5})^4 = 0.9882, respectively. What is computed by the three extra lines
below the program output? c
set.seed(12)
m = 100000; ecdf = (1:m)/m
n = 5; lam = 1/4; x = rexp(m*n, lam)
DTA = matrix(x, nrow=m); w1 = apply(DTA, 1, max)
n = 4; lam = 1/5; y = rexp(m*n, lam)
DTA = matrix(y, nrow=m); w2 = apply(DTA, 1, max)
par(mfrow = c(1,2))
w1.sort = sort(w1); w2.sort = sort(w2)
plot(w1.sort, ecdf, type="l", xlim=c(0,30), xlab="Years")
lines(w2.sort, ecdf, lty="dashed")
abline(v=12, col="green")
plot(w1.sort, ecdf, type="l", xlim=c(0, 3),
ylim=c(0,.06), xlab="Years")
lines(w2.sort, ecdf, lty="dashed")
abline(v=2, col="green")
par(mfrow = c(1,1))
mean(w1 > 12) # aprx prob syst of example surv 12 yrs
mean(w2 > 12) # aprx prob syst of problem surv 12 yrs
> mean(w1 > 12) # aprx prob syst of example surv 12 yrs
[1] 0.22413
> mean(w2 > 12) # aprx prob syst of problem surv 12 yrs
[1] 0.31804
b) How accurately does the ECDF in Example 4.2 approximate the cumula-
tive distribution function FW in part (a)? Use the same plot statement
as in the example, but with parameters lwd=3 and col="green", so that
the ECDF is a wide green line. Then overlay the plot of FW with
tt = seq(0, 30, by=.01); cdf = (1-exp(-lam*tt))^5; lines(tt, cdf)
and comment.
d The change in the program is elementary and well specified. The resulting figure
illustrates excellent agreement between the CDF and the ECDF. c
c) Generalize the result for FW in part (a) so that the lifetime of the ith
component is distributed as EXP(λi ), where the λi need not be equal.
d The generalization is FW(t) = ∏_{i=1}^{n} (1 − e^{−λi t}). c
d) One could find E(W) by taking the derivative of FW in part (a) to get
the density function fW and then evaluating ∫_0^∞ t fW(t) dt, but this is a
messy task. However, in the case where all components have the same
failure rate λ, we can find E(W ) using the following argument, which is
based on the no-memory property of exponential distributions.
Start with the expected wait for the first component to fail. That is,
the expected value of the minimum of n components. The distribution is
EXP(nλ) with mean 1/(nλ). Then start afresh with the remaining n − 1
components, and conclude that the mean additional time until the second
failure is 1/((n − 1)λ). Continue in this fashion to show that the R code
sum(1/(lam*(n:1))) gives the expected lifetime of the system. For a
five-component system with λ = 1/4, as in Example 4.2, show that this
result gives E(W ) = 9.1333.
> lam = 1/4; n = 5; sum(1/(lam*(n:1)))
[1] 9.133333
Note: The argument in (d) depends on symmetry, so it doesn’t work in the case
where components have different failure rates, as in part (c).
set.seed(1213)
m = 100000
Eng = rexp(m, 1/3); Per = rnorm(m, 4, 1)
Leg = 2*rbinom(m, 1, .5) + 2; Acc = runif(m, 1, 5)
DTA = cbind(Eng, Per, Leg, Acc)
w = apply(DTA, 1, max)
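The summary line is missing from this copy of the part (a) program; paralleling
parts (b) and (c), it is presumably:
mean(w); mean(w > 6)                  # answers near 5.1 and 0.15 (see Hints)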
b) Which division is most often the last to complete its review? If that divi-
sion could decrease its mean review time by 1 week, by simply subtract-
ing 1 from the values in part (a), what would be the improvement in the
6-week probability value?
set.seed(1214)
m = 100000
Eng = rexp(m, 1/3); Per = rnorm(m, 4, 1) - 1
Leg = 2*rbinom(m, 1, .5) + 2; Acc = runif(m, 1, 5)
DTA = cbind(Eng, Per, Leg, Acc)
w = apply(DTA, 1, max)
mean(w); mean(w > 6)
c) How do the answers in part (a) change if the uniformly distributed time for
Accounting starts precisely when Engineering is finished? Use the original
distributions given in part (a).
set.seed(1215)
m = 100000
Eng.Acc = rexp(m, 1/3) + runif(m, 1, 5)
Per = rnorm(m, 4, 1)
Leg = 2*rbinom(m, 1, .5) + 2
DTA = cbind(Eng.Acc, Per, Leg)
w = apply(DTA, 1, max)
mean(w); mean(w > 6)
Hints and answers: (a) Rounded results from one run with m = 10 000 are 5.1
and 0.15; give more accurate answers. (b) The code mean(w==Eng) gives the proportion
of the time Engineering is last to finish. Greatest proportion is 0.46. (c) Very little. Why?
(Ignore the tiny chance that a normal random variable might be negative.)
4.11 Explain the similarities and differences among the five matrices pro-
duced by the R code below. What determines the dimensions of a matrix
made from a vector with the matrix function? What determines the order
in which elements of the vector are inserted into the matrix? What happens
when the number of elements of the matrix exceeds the number of elements
of the vector? Focus particular attention on MAT3, which illustrates a method
we use in Problem 4.12.
> a1 = 3; a2 = 1:5; a3 = 1:30
> MAT1 = matrix(a1, nrow=6, ncol=5); MAT1
[,1] [,2] [,3] [,4] [,5] # 6 rows, 5 columns, as
[1,] 3 3 3 3 3 # specified by arguments
[2,] 3 3 3 3 3
[3,] 3 3 3 3 3
[4,] 3 3 3 3 3
[5,] 3 3 3 3 3
[6,] 3 3 3 3 3
4.12 In Example 4.2, each of the five component CPUs in the parallel
system has failure rate λ = 1/4 because it is covered by a thickness of lead
foil that cuts deadly radiation by half. That is, without the foil, the failure
rate would be λ = 1/2. Because the foil is heavy, we can’t afford to increase
the total amount of foil used. Here we explore how the lifetime distribution of
the system would be affected if we used the same amount of foil differently.
a) Take the foil from one of the CPUs (the rate goes to 1/2) and use it to
double-shield another CPU (rate goes to 1/8). Thus the failure rates for the
five CPUs are given in a 5-vector lam as shown in the simulation program
below. Compare the mean and median lifetimes, probability of survival
longer than 10 years, and ECDF curve of this heterogeneous system with
similar results for the homogeneous system of Example 4.2. Notice that in
order for each column of the matrix to have the same rate down all rows,
it is necessary to fill the matrix by rows using the argument (byrow=T).
Thus the vector of five rates “recycles” to provide the correct rate for each
element in the matrix. (See Problem 4.11 for an illustrative exercise.)
d Run the program to see the figure. Is the altered configuration (dotted curve)
generally more reliable than the configuration of the example? c
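The simulation program itself is not reproduced here; a sketch of the
heterogeneous system of part (a) (names and seed are ours) is:
set.seed(1212)                        # seed is an arbitrary choice
m = 100000
lam = c(1/2, 1/4, 1/4, 1/4, 1/8)      # pattern 01112: rates for the five CPUs
n = length(lam)
x = rexp(m*n, rate=lam)               # rates recycle across the m*n lifetimes
LIFE = matrix(x, nrow=m, byrow=T)     # byrow=T keeps one rate per column
w = apply(LIFE, 1, max)               # system lifetime (parallel system)
mean(w); median(w); mean(w > 10)      # mean, median, P{survival beyond 10 yr}
plot(sort(w), (1:m)/m, type="l", xlab="Years", ylab="ECDF")
# compare with the homogeneous system of Example 4.2 (all rates 1/4)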
b) Denote the pattern of shielding in part (a) as 01112. Experiment with other
patterns with digits summing to 5, such as 00122, 00023, and so on. The
pattern 00023 would have lam = c(1/2, 1/2, 1/2, 1/8, 1/16). Which
of your patterns seems best? Discuss.
d Based on the issues explicitly raised above, a configuration such as the one denoted
by 00005, with all the shielding on one CPU, seems best. But see the Notes below. c
Notes: The ECDF of one very promising reallocation of foil in part (b) is shown in
Figure 4.9. Parallel redundancy is helpful, but “it’s hard to beat” components with
lower failure rates. In addition to the kind of radiation against which the lead foil
protects, other hazards may cause CPUs to fail. Also, because of geometric issues,
the amount of foil actually required for, say, triple shielding may be noticeably more
than three times the amount for single shielding. Because your answer to part (b)
does not take such factors into account, it might not be optimal in practice.
d Because E(R) = 23.26 = 2.326σ, we see that Runb = R/2.326 = 0.43R is unbiased
for σ and that K = 0.43. Moreover, as in Example 4.3, the expected length of an
R-based 95% confidence interval for σ is E(R)(1/0.849 − 1/4.197) = 2.326σ(0.940) = 2.19σ. c
4.14 Modify the code of Example 4.3 to try “round-numbered” values of n
such as n = 30, 50, 100, 200, and 500. Roughly speaking, for what sample sizes
are the constants K = 1/4, 1/5, and 1/6 appropriate to make Runb = KR an
unbiased estimator of σ? (Depending on your patience and your computer,
you may want to use only m = 10 000 iterations for larger values of n.)
d We simulated data with σ = 1 so that, for each sample size, the denominator
d = 1/K is simply approximated as the average of the simulated ranges. Because
we seek only approximate values, 10 000 iterations would be enough, but we used
50 000 for all sample sizes below.
set.seed(1235)
m = 50000; n = c(30, 50, 100, 200, 500)
mu = 100; sg = 1
k = length(n); d = numeric(k)
for (i in 1:k) {
x = rnorm(m*n[i], mu, sg); DTA = matrix(x, m)
x.mx = apply(DTA, 1, max); x.mn = apply(DTA, 1, min)
x.rg = x.mx - x.mn
d[i] = mean(x.rg) }
round(cbind(n, d), 3)
You can see that, for a normal sample of about size n = 30, dividing the sample
range by 4 gives a reasonable estimate of σ. Two additional cases, with easy-to-
remember round numbers, are to divide by 5 for samples of about 100 observations,
and by 6 for samples of about 500. However, as n increases, the sample range R
becomes ever less correlated with S, which is the preferred estimate of σ. (The
sample variance S² is unbiased for σ² and has the smallest variance among unbiased
estimators. A slight bias, decreasing as n increases, is introduced by taking the
square root, as we see in Problem 4.15.)
Nevertheless, for small values of n, estimates of σ based on R can be useful.
In particular, estimates of σ for industrial process control charts have traditionally
been based on R, and engineering statistics books sometimes provide tables of the
appropriate unbiasing constants for values of n up to about 20.
Elementary statistics texts often suggest estimating σ by dividing the sample
range by a favorite value—typically 4 or 5. This suggestion may be accompanied by
a carefully chosen example in which the results are pretty good. However, we see
here that no one divisor works across all sample sizes.
As n increases, S converges to σ in probability, but R diverges to infinity. Normal
tails have little probability beyond a few standard deviations from the mean, but
the tails do extend to plus and minus infinity. So, if you take enough observations,
you are bound to get “outliers” from far out in the tails of the distribution, which
inflate R. c
4.15 This problem involves exploration of the sample standard deviation S
as an estimate of σ. Use n = 10.
a) Modify the program of Example 4.3 to simulate the distribution of S.
Use x.sd = apply(DTA, 1, sd). Although E(S 2 ) = σ 2 , equality (that
is, unbiasedness) does not survive the nonlinear operation of taking the
square root. What value a makes Sunb = aS an unbiased estimator of σ?
d The simulation below gives a ≈ 1.027, with 1/a = 0.973. This approximation is
based on n = 10 and aE(S) = E(Sunb ) = σ = 10. It is in good agreement with the
exact value E(S) = 0.9727σ, which can be found analytically. (See the Notes). c
set.seed(1238)
m = 100000; n = 10; mu = 100; sg = 10
x = rnorm(m*n, mu, sg); DTA = matrix(x, m)
x.sd = apply(DTA, 1, sd)
a = sg/mean(x.sd); mean(x.sd)/sg; a
[1] 0.973384
[1] 1.027344
b) Verify the value of E(LS ) given in Example 4.3. To find the confidence
limits of a 95% confidence interval for S, use qchisq(c(.025,.975), 9)
and then use E(S) in evaluating E(LS ). Explain each step.
d The quantiles from qchisq are 2.70 and 19.02, so that the confidence interval for σ²
is derived from
P{2.70 < 9S²/σ² < 19.02} = P{9S²/19.02 < σ² < 9S²/2.70} = 95%.
On taking square roots, the 95% CI for σ is (3S/4.362, 3S/1.643) or (0.688S, 1.826S),
which has expected length E(LS ) = 1.138E(S). According to the Notes, for n = 10,
the exact value of E(S) = 0.9727σ. Thus E(LS ) = 1.138(0.9727)σ = 1.107σ. We did
some of the computations in R, as shown below. c
c) Statistical theory says that Sunb in part (a) has the smallest possible
variance among unbiased estimators of σ. Use simulation to show that
V(Runb) ≥ V(Sunb).
d As above and in Example 4.3, we use sample size n = 10. From the simulation
in the example, we know that Runb = R/K, where K ≈ 30.8/σ = 30.8/10 = 3.08.
From the simulation in part (a), we know that Sunb = aS, where a ≈ 0.973. We use
these values in the further simulation below. The last line of code illustrates that
SD(Runb ) ≈ 2.6 > SD(Sunb ) ≈ 2.4. c
set.seed(1240)
m = 100000; n = 10; mu = 100; sg = 10; K = 3.08; a = 1.027
x = rnorm(m*n, mu, sg); DTA = matrix(x, m)
x.sd = apply(DTA, 1, sd); s.unb = a*x.sd
x.mx = apply(DTA, 1, max); x.mn = apply(DTA, 1, min)
x.rg = x.mx - x.mn; r.unb = x.rg/K
mean(s.unb); mean(r.unb) # validation: both about 10
sd(s.unb); sd(r.unb) # first should be smaller
4.16 For a sample of size 2, show that the sample range is precisely a mul-
tiple of the sample standard deviation. [Hint: In the definition of S 2 , express
X̄ as (X1 + X2 )/2.] Consequently, for n = 2, the unbiased estimators of σ
based on S and R are identical.
d With X̄ expressed as suggested, the sample variance becomes
S² = [(X1 − X2)/2]² + [(X2 − X1)/2]² = (X1 − X2)²/2 = R²/2.
Upon taking square roots, we have S = |X1 − X2|/√2 = R/√2. c
a) Find the unbiasing constants necessary to define Runb and Sunb . These
estimators are, of course, not necessarily the same as for normal data.
b) Show that V(Runb ) < V(Sunb ). For data from such a uniform distribution,
one can prove that Runb is the unbiased estimator with minimum variance.
c) Find the quantiles of Runb and Sunb necessary to make 95% confidence
intervals for σ. Specify the endpoints of both intervals in terms of σ. Which
confidence interval, the one based on R or the one based on S, has the
shorter expected length?
c) In a plot similar to Figure 4.5, show the points for which the usual 95%
confidence interval for σ covers the population value σ = 10. How does
this differ from the display of points for which the t confidence interval
for µ covers the true population value?
d The boundaries are horizontal parallel lines. c
4.19 Repeat the simulation of Example 4.5 twice, once with n = 15 random
observations from NORM(200, 10) and again with n = 50. Comment on the
effect of sample size.
4.20 More on Example 4.6 and Figure 4.10.
a) Show that there is an upper linear bound on the points in Figure 4.10. This
boundary is valid for any sample in which negative values are impossible.
Suggested steps: Start with (n − 1)S² = Σ Yi² − nȲ². For your data, say
why Σ Yi² ≤ (Σ Yi)². Conclude that Ȳ ≥ S/√n.
d First, when all Yi ≥ 0, we have Σ Yi² ≤ (Σ Yi)² = (nȲ)², because the expansion
of the square of the sum on the right-hand side has all n of the terms Yi² from the
left-hand side, in addition to some nonnegative products. Then
(n − 1)S² = Σ Yi² − nȲ² ≤ (nȲ)² − nȲ² = n(n − 1)Ȳ²,
so that S² ≤ nȲ² and Ȳ ≥ S/√n. c
b) Use plot to make a scatterplot similar to the one in Figure 4.10 but with
m = 100 000 points, and then use lines to superimpose your line from
part (a) on the same graph.
d Just for variety, we have used slightly different code below than that shown for
Examples 4.4, 4.6, and 4.7. The last few lines of the program refer to part (c). c
set.seed(12)
m = 100000; n = 5; lam = 2
DTA = matrix(rexp(m*n, lam), nrow=m)
x.bar = rowMeans(DTA) # alternatively ’apply(DTA, 1, mean)’
x.sd = apply(DTA, 1, sd)
plot(x.bar, x.sd, pch=".")
abline(a=0, b=sqrt(n), lty="dashed", col="blue")
abline(h=1.25, col="red")
abline(v=0.5, col="red")
mean(x.bar < .5) # Estimates of probabilities
mean(x.sd > 1.25) # discussed
mean((x.bar < .5) & (x.sd > 1.25)) # in part (c)
c) For Example 4.6, show (by any method) that P {Ȳ ≤ 0.5} and P {S > 1.25}
are both positive but that P {Ȳ ≤ 0.5, S > 1.25} = 0. Comment.
4.21 Figure 4.6 (p101) has prominent “horns.” We first noticed such horns
on (Ȳ , S) plots when working with uniformly distributed data, for which the
horns are not so distinct. With code similar to that of Example 4.7 but simu-
lated samples of size n = 5 from UNIF(0, 1) = BETA(1, 1), make several plots
of S against Ȳ with m = 10 000 points. On most plots, you should see a few
“straggling” points running outward near the top of the plot. The question
is whether they are real or just an artifact of simulation. (That is, are they
“signal or noise”?) A clue is that the stragglers are often in the same places
on each plot. Next try m = 20 000, 50 000, and 100 000. For what value of m
does it first become obvious to you that the horns are real?
d Many people say 50 000. For a full answer, describe and discuss. c
0.6325, 0.7746, 0.8944, and 1 (rounded to four places). You can use the R code with
points to plot heavy dots at the cusps of the horns. c
b) The horn at the lower left in the figure of part (a) is the image of one
vertex of the hypercube, (0, 0, 0, 0, 0). The horn at the lower right is the
image of (1, 1, 1, 1, 1). They account for two of the 32 vertices. Each of the
remaining horns is the image of multiple vertices. For each horn, say how
many vertices get mapped onto it, its “multiplicity.”
d The multiplicities are given by the binomial coefficients that describe how many
1s are selected out of 5 possibilities. Multiplicities that correspond to the vertices
in part (a) are 1, 5, 10, 10, 5, and 1, respectively. Multiplicities explain why some
horns are more distinctly defined in the plots. c
c) Now make a plot with n = 10 and m = 100 000. In addition to the two
horns at the bottom, how many do you see along the top? Explain why
the topmost horn has multiplicity (10 choose 5) = 252.
Ctrl = c(16, 18, 18, 24, 19, 11, 10, 15, 16, 18, 18,
13, 19, 10, 16, 16, 24, 13, 9, 14, 21, 19,
7, 18, 19, 12, 11, 22, 25, 16, 13, 11, 13)
> t.test(Pair.Diff)
data: Pair.Diff
t = 5.783, df = 32, p-value = 2.036e-06
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
10.34469 21.59470 # 95% CI: agrees with above
sample estimates:
mean of x
15.96970
data: Pair.Diff
t = 5.783, df = 32, p-value = 2.036e-06
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval: # reasonably close to bootstrap
8.407362 23.532031 # percentile 99% CI above
sample estimates:
mean of x
15.96970
qqnorm(Pair.Diff)
abline(a=d.mean, b=d.sd, lwd=2, col="red")
> shapiro.test(Pair.Diff)
data: Pair.Diff
W = 0.9574, p-value = 0.2183
Notes: (b) Use parameter conf.level=.99 in t.test. Approximate t interval:
(8.4, 23.5). (c) Although the normal probability plot of Pair.Diff seems to fit a
curve better than a straight line, evidence against normality is not strong. For ex-
ample, the Shapiro-Wilk test fails to reject normality: shapiro.test(Pair.Diff)
returns a p-value of 0.22.
4.24 Student heights. In a study of the heights of young men, 41 students at
a boarding school were used as subjects. Each student’s height was measured
(in millimeters) in the morning and in the evening; see Majundar and Rao
(1958). Every student was taller in the morning. Other studies have found
a similar decrease in height during the day; a likely explanation is shrinkage
along the spine from compression of the cartilage between vertebrae. The 41
differences between morning and evening heights are displayed in the R code
below.
d The data have been moved to the program in part (a), and they are also used in
part (b). The normal probability plot (not shown here) gives the impression that
the data are very nearly normal. c
a) Make a normal probability plot of these differences with qqnorm(dh) and
comment on whether the data appear to be normal. We wish to have
set.seed(1776)
n = length(dh) # number of data pairs
d.bar = mean(dh) # observed mean of diff’s
B = 10000 # number of resamples
re.x = sample(dh, B*n, repl=T)
RDTA = matrix(re.x, nrow=B) # B x n matrix of resamples
re.mean = rowMeans(RDTA) # vector of B ‘d-bar-star’s
B = 10000; n = length(dh)
# Parameter estimates
dh.bar = mean(dh); sd.dh = sd(dh)
# Resampling
re.x = rnorm(B*n, dh.bar, sd.dh)
RDTA = matrix(re.x, nrow=B)
# Results
re.mean = rowMeans(RDTA)
hist(re.mean)
bci = quantile(re.mean, c(.025, .975)); bci
2*dh.bar - bci[2:1]
SUMMARY
Interval Part Method
-----------------------------------------------------
(8.73, 10.46) (a) Traditional CI from CHISQ(40)
(8.77, 10.43) (a) Both nonparametric bootstraps
(8.76, 10.45) (b) Parametric bootstrap (simple)
(8.74, 10.44) (b) Parametric bootstrap (percentile)
Notes: (a) Nearly normal data, so this illustrates how closely the bootstrap procedure
agrees with the t procedure when we know the latter is appropriate. The t interval is
(8.7, 10.5); in your answer, provide two decimal places. (b) This is a “toy” example
because T = √n (d̄ − µ)/Sd ∼ T(n − 1) and (n − 1)Sd²/σ² ∼ CHISQ(n − 1) provide
useful confidence intervals for µ and σ without the need to do a parametric bootstrap.
(See Rao (1989) and Trumbo (2002) for traditional analyses and data, and see
Problem 4.27 for another example of the parametric bootstrap.)
> x
[1] 7.55 11.82 1.46 1.40 4.36 28.95 12.30 5.40 9.57 1.47
[11] 13.91 7.62 12.38 44.24 10.55 10.35 18.76 6.55 3.37 5.88
[21] 23.65 6.42 2.94 5.66 1.06 0.59 5.79 39.59 11.73 9.97
[31] 14.35 0.37 3.24 13.20 2.04 10.23 3.02 7.25 7.52 2.35
[41] 10.80 10.28 12.92 12.53 5.55 3.01 12.93 9.95 5.14 20.08
data: x.bar/mu
D = 0.0095, p-value = 0.7596
alternative hypothesis: two.sided
b) As an illustration, even though we know the data are not normal, find the
t confidence interval for µ.
set.seed(1); x = round(rexp(50, 1/10), 2) # re-generate the 50 obs.
mean(x) + qt(c(.025,.975), 49)*sd(x)/sqrt(50) # 95% t CI
c) Set a fresh seed. Then replace Pair.Diff by x in the code of Example 4.8
to find a 95% nonparametric bootstrap confidence interval for µ.
d) Does a normal probability plot clearly show the data are not normal? The
Shapiro-Wilk test is a popular test of normality. A small p-value indicates
nonnormal data. In R, run shapiro.test(x) and comment on the result.
d The normal probability plot (not shown) clearly indicates these exponential data
are not normal, and the Shapiro-Wilk test decisively rejects normality.
set.seed(1)
x = round(rexp(50, 1/10), 2) # re-generate the 50 obs.
shapiro.test(x)
> shapiro.test(x)
data: x
W = 0.789, p-value = 4.937e-07
It seems worth commenting that the t interval does a serviceable job even
for these strongly right-skewed exponential data. Because of the Central Limit
Theorem, the mean of 50 exponential observations is “becoming” normal.
Specifically, X̄ has a gamma distribution with shape parameter 50, which is
still skewed to the right, but more nearly symmetrical than the exponential
distribution. c
Answers: (a) (7.6, 13.3), (b) (7.3, 12.4). (c) On one run: (7.3, 12.3).
set.seed(1789)
m = 100000; cover = numeric(m); B = 1000; n = 50
for (i in 1:m)
{
x = rnorm(n) # simulate a sample
re.x = sample(x, B*n, repl=T) # resample from it
RDTA = matrix(re.x, nrow=B)
re.mean = rowMeans(RDTA)
cover[i] = prod(quantile(re.mean, c(.025,.975)))
# does bootstrap CI cover?
}
mean(cover < 0)
Errors in Chapter 4
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p116 Problem 4.26. In the third line inside the loop of the program: The right paren-
thesis should immediately follow repl=T, not the comment. The correct line
reads:
5.1 In a newspaper trivia column, L. M. Boyd (1999) ponders why lie de-
tector results are not admissible in court. His answer is that “lie detector tests
pass 10 percent of the liars and fail 20 percent of the truth-tellers.” If you use
these percentages and take {D = 1} to mean being deceitful and {T = 1}
to mean failing the test, what are the numerical values of the sensitivity and
specificity for such a lie detector test? (Continued in Problem 5.12.)
d “Pass 10 percent of the liars” means P {T = 0|D = 1} = 1 − η = 0.1, so sensitivity
η = 0.9. “Fail 20 percent of the truth-tellers” means P {T = 1|D = 0} = 1 − θ = 0.2,
so specificity θ = 0.8. c
of them all, and that the software can access pictures of exactly the people on that
list. We take D = 1 to mean that a scanned passenger is a terrorist. The preva-
lence π = P {D = 1} is the proportion of scanned passengers who are terrorists. We
suppose Mann believes π to be very small.
We take T = 1 to mean that the software flags a passenger as a terrorist. Then
the sensitivity η = P {T = 1|D = 1} is the conditional probability that a scanned
passenger is flagged as a terrorist, given that he or she is on the terrorist list. The
specificity of the face-recognition software θ = P {T = 0|D = 0} is the conditional
probability that someone not on the list will not be flagged as a terrorist.
In the first part of the quote, Mann seems to say P {D = 0|T = 1} = 0.0068,
which implies P {D = 1|T = 1} = 0.9932. But this “reversed” conditional probability
is not the sensitivity η. In Section 5.3, we refer to it as the predictive value of a
positive test: γ = P {D = 1|T = 1}.
In any case, sensitivity is a property of the face-recognition software. Without
the “cooperation” of the terrorists lining up to help test the software, it would seem
difficult for the company to know the sensitivity. By contrast, the company could
easily know the specificity by lining up people who are not terrorists and seeing how
many of them the software incorrectly flags as terrorists. We venture to speculate
that 1 − θ = P {T = 1|D = 0} may be what the company is touting as its low
“mistaken” rate of 0.68%.
Later in the quote, Mann focuses on 170 000 out of a population of 25 000 000
passengers in 2001 or 0.68% of passengers that would be “wrongly picked out” as
being on the list (and 170 000/365 ≈ 466, which we suppose to be his “almost 500
false alarms”). The probability of a false positive (wrong identification or a false
alarm) ought to be P {T = 1|D = 0} = 1 − θ = 0.0068, which matches our speculation
of what Mann really meant in his first statement.
In our experience, people who have not studied probability frequently confuse
the three probabilities P {D = 0|T = 1}, P {T = 1|D = 0}, and P {D = 0, T = 1}.
Respectively, each of these is a fraction of a different population: all passengers
tagged as terrorists, all passengers who are not terrorists, and all passengers. What
we expect of the student in this problem is not to confuse η with γ.
Considering the current state of face-recognition software, we agree with Mann
that using it at an airport to detect terrorists would be a challenge. Retaining all the
dubious assumptions at the beginning of this answer, let’s speculate in addition that
π = 0.0001 (one in 10 000) of airport passengers were terrorists, and that sensitivity
η and specificity θ were both very large, say η = θ = 0.9932. Then using formulas
in Section 5.3:
τ = P {T = 1} = πη + (1 − π)(1 − θ) = 0.0068 = 0.68%
and
γ = πη/[πη + (1 − π)(1 − θ)] = 0.00146 = 0.146%.
Therefore, on the one hand, out of Mann’s hypothetical n = 68 400 passengers
a day, there would indeed be about nτ = 466 alarms, almost all of them false. And
decreasing the prevalence π won’t change the number 466 by much.
On the other hand, we’re pretending that the proportion of terrorists in the entire
population is one in 10 000. But on average, among every group of 100/0.146 ≈ 685
passengers flagged by the software, about one is a terrorist. Based on these numerical
assumptions, the flagged passengers would be a pretty scary bunch, worth some
individual attention. c
5.3 Consider a bogus test for a virus that always gives positive results,
regardless of whether the virus is present or not. What is its sensitivity?
What is its specificity? In describing the usefulness of a screening test, why
might it be misleading to say how “accurate” it is by stating its sensitivity
but not its specificity?
d Sensitivity η = 1; specificity θ = 0. One could argue that the term “accurate” is too
vague to be useful in discussing screening tests. An ideal test will have high values
of both η and θ. As we see in Section 5.3, in a particular population of interest, it
would also be desirable to have high values of the predictive values γ and δ. c
5.4 Suppose that a medical screening test for a particular disease yields
a continuum of numerical values. On this scale, the usual practice is to take
values less than 50 as a negative indication for having the disease {T = 0}, and
to take values greater than 56 as positive indications {T = 1}. The borderline
values between 50 and 56 are usually also read as positive, and this practice
is reflected in the published sensitivity and specificity values of the test. If
the borderline values were read as negative, would the sensitivity increase or
decrease? Explain your answer briefly.
d Sensitivity η = P {T = 1|D = 1} would decrease because fewer outcomes are now
counted as positive. By contrast, the specificity θ = P {T = 0|D = 0} would increase,
and for the same reason. Whether η or θ has the bigger change depends on how likely
the patients getting scores between 50 and 56 are to have the disease. c
5.5 Many criteria are possible for choosing the “best” (η, θ)-pair from an
ROC plot. In Example 5.1, we mentioned the pair with η = θ. Many references
vaguely suggest picking a pair “close to” the upper-left corner of the plot. Two
ways to quantify this are to pick the pair on the curve that maximizes the
Youden index η + θ or the pair that maximizes η² + θ².
a) As shown below, modify the line of the program in Example 5.1 that
prints numerical results. Use the expanded output to find the (η, θ)-pair
that satisfies each of these maximization criteria.
cbind(x, eta, theta, eta + theta, eta^2 + theta^2)
x = seq(40,80,1)
eta = 1 - pnorm(x, 70, 15); theta = pnorm(x, 50, 10)
show = (x >= 54) & (x <= 65)
youden = eta + theta; ssq = eta^2 + theta^2
round(cbind(x, eta, theta, youden, ssq)[show,], 4)
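To pick out the maximizing pairs directly, one might append lines such as these (a small addition, not part of the original program):
cbind(x, eta, theta)[which.max(youden), ]    # (eta, theta)-pair maximizing the Youden index
cbind(x, eta, theta)[which.max(ssq), ]       # pair maximizing eta^2 + theta^2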
Notes: When the ROC curve is only roughly estimated from data (as in Problem 5.6),
it may make little practical difference which criterion is used. Also, if false-positive
results are much more (or less) consequential errors than false-negative ones, then
criteria different from any of these may be appropriate.
5.6 Empirical ROC. DeLong et al. (1985) investigate blood levels of creat-
inine (CREAT) in mg% and β2 microglobulin (B2M) in mg/l as indicators
of imminent rejection {D = 1} in kidney transplant patients. Based on data
from 55 patients, of whom 33 suffered episodes of rejection, DeLong and her
colleagues obtained the sensitivity data in Table 5.2 (p133 of the text).
For example, as a screening test for imminent rejection, we might take a
creatinine level above 1.7 to be a positive test result. Then we would estimate
its sensitivity as η(1.7) = 24/33 = 0.727 because 24 patients who had a
rejection episode soon after the test had creatinine levels above 1.7.
Similarly, consider a large number of instances in which the creatinine test
was not soon followed by a rejection episode. Of these, 53.5% had levels at
most 1.7, so θ(1.7) ≈ 0.535. For a test that “sounds the alarm” more often, we
can use a cut-off level smaller than 1.7. Then we will “predict” more rejection
episodes, but we will also have more false alarms.
Use these data to make approximate ROC curves for both CREAT and
B2M. Put both sets of points on the same plot, using different symbols (or
colors) for each, and try to draw a smooth curve through each set of points
(imitating Figure 5.1). Compare your curves to determine whether it is worth-
while to use a test based on the more expensive B2M determinations. Would
you use CREAT or B2M? If false positives and false negatives were equally se-
rious, what cutoff value would you use? What if false negatives are somewhat
more serious? Defend your choices.
cre.sens = c(.939, .939, .909, .818, .758, .727, .636, .636, .545,
.485, .485, .394, .394, .364, .333, .333, .333, .303)
cre.spec = c(.123, .203, .281, .380, .461, .535, .649, .711, .766,
.773, .803, .811, .843, .870, .891, .894, .896, .909)
b2m.sens = c(.909, .909, .909, .909, .879, .879, .879, .879, .818,
.818, .818, .788, .788, .697, .636, .606, .576, .576)
b2m.spec = c(.067, .074, .084, .123, .149, .172, .215, .236, .288,
.359, .400, .429, .474, .512, .539, .596, .639, .676)
d The R code above makes a plot similar to Figure 5.5 for the CREAT data (solid
black dots), but also including points for an estimated ROC curve of the B2M data
(open blue circles). The curves are similar, except that the CREAT curve seems a
little closer to the upper-left corner of the plot. Therefore, if only one measurement
is to be used, the less expensive CREAT measurement seems preferable, providing
a screening test for transplant rejection with relatively higher values of η and θ (but
see the Notes).
If false positives and false negatives are equally serious, then we should pick a
point on the smoothed curve where η ≈ θ. For the CREAT ROC curve, this seems to
be somewhere in the vicinity of η ≈ θ ≈ 0.64 (see the output to the program above),
which means a creatinine level near 1.8 mg% (see Table 5.2). The probability of a
false negative is P {T = 0|D = 1} = 1 − η (the probability that we do not detect
a patient is about to reject). Making this probability smaller means making the
sensitivity η larger, which means moving upward on the ROC curve, and toward
a smaller creatinine cut-off value for the test (see the sentence about “more false
alarms” in the question). c
Notes: Data can be coded as follows. [The code originally provided here has been
moved to the first four statements in the program above.] Use plot for the first set
of points (as shown in Figure 5.5), then points to overlay the second. In practice,
a combination of the two determinations, including their day-to-day changes, may
provide better predictions than either determination alone. See DeLong et al. for
an exploration of this possibility and also for a general discussion (with further
references) of a number of issues in diagnostic testing. The CREAT data also appear
in Pagano and Gauvreau (2000) along with the corresponding approximate ROC
curve.
5.8 Suppose that a screening test for a particular parasite in humans has
sensitivity 80% and specificity 70%.
a) In a sample of 100 from a population, we obtain 45 positive tests. Estimate
the prevalence.
d We use equation (5.4) on p123 of the text to find the estimate p of π:
p = (t + θ − 1)/(η + θ − 1) = (0.45 + 0.70 − 1)/(0.80 + 0.70 − 1) = 0.15/0.50 = 0.30. c
5.9 Consider the ELISA test of Example 5.2, and suppose that the preva-
lence of infection is π = 1% of the units of blood in a certain population.
a) What proportion of units of blood from this population tests positive?
d Recall that η = 0.99 and θ = 0.97. Then by equation (5.2) on p123, we have
τ = πη + (1 − π)(1 − θ) = 0.01(0.99) + 0.99(0.03) = 0.0396. c
b) Suppose that n = 250 units of blood are tested and that A of them yield
positive results. What values of t = A/n and of the integer A yield a
negative estimate of prevalence?
d In formula (5.4) for the estimate p of π, the denominator is η+θ−1 = 0.96 > 0, so p
is negative precisely when the numerator is negative. That is, when t = A/n < 0.03.
(Even if no units are infected, we expect from the specificity θ = 0.97 that 3% of
sampled units test positive.) So we require integer A < 0.03(250) = 7.5; that is, the
estimate p < 0 when A ≤ 7. c
c) Use parts (a) and (b) to find the proportion of random samples of size
250 from this population that yields negative estimates of prevalence.
d From part (a) we know that A ∼ BINOM(n = 250, τ = 0.0396), and from part (b)
we seek P {A ≤ 7}. The R code pbinom(7, 250, 0.0396) returns 0.2239. c
5.10 Write a program to make a figure similar to Figure 5.4 (p127).
What are the exact values of PVP γ and PVN δ when π = 0.05?
d The program is shown below, including some optional embellishments that put
numerical values on the plot (not shown here). The requested values of PVP γ and
PVN δ are shown at the end. c
pp = seq(0, 1, by=.001); eta = .99; theta = .97
tau = pp*eta + (1 - pp)*(1 - theta)
gamma = pp*eta/tau; delta = (1 - pp)* theta/(1 - tau)
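The plotting and evaluation lines are not shown above; a sketch of how the program might continue (labels and layout are guesses, not from the original) is:
plot(pp, gamma, type="l", ylim=c(0,1), xlab="Prevalence", ylab="Predictive Value")
lines(pp, delta, lty="dashed")
i = which.min(abs(pp - 0.05))        # index of prevalence 0.05
gamma[i]; delta[i]                   # PVP and PVN at pi = 0.05
With η = 0.99 and θ = 0.97, these values should be approximately γ = 0.0495/0.078 ≈ 0.635 and δ = 0.9215/0.922 ≈ 0.9995.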
b) All of those who test positive will be subjected to more expensive, less
convenient (possibly even somewhat risky) diagnostic procedures to de-
termine whether or not they actually have the disease. What percentage
of the population will be subjected to these procedures?
d This percentage is τ = 0.0447, computed in part (a). c
c) The entire population can be viewed as split into four sets by the random
variables D and T : either of them may be 0 or 1. What proportion of
the entire population falls into each of these four sets? Suppose you could
change the sensitivity of the test to 99% with a consequent change in speci-
ficity to 94%. What factors of economics, patient risk, and preservation
of life would be involved in deciding whether to make this change?
d In terms of random variables D and T , the four probabilities (to full four-place
accuracy) are as follows, respectively. First,
P {D = 1, T = 1} = πη = 0.0049 and P {D = 0, T = 1} = (1 − π)(1 − θ) = 0.0398,
which add to τ = 0.0447. Second,
P {D = 1, T = 0} = π(1 − η) = 0.0001 and P {D = 0, T = 0} = (1 − π)θ = 0.9552,
which add to 1 − τ = 0.9553. Also notice that the first probabilities in each display
above add to π = 0.005. Of course, all four probabilities add to 1.
We note that these four probabilities could be also expressed in terms of τ , γ,
and δ. (Note: In the first printing, we used “false positive” and similar terms to refer
to the four sets. Many authors reserve this terminology for conditional probabilities,
such as 1 − γ.)
On increasing the sensitivity to η = 99% and decreasing the specificity to θ = 94%:
This would increase the number of subjects testing positive. The disadvantage would
be that more people would undergo the expensive and perhaps risky diagnostic
procedure. Specifically, τ increases from 0.0447 to 0.0547. So the bill for doing the
diagnostic procedures would be 0.0547/0.0447 = 1.22 times as large—a 22% increase.
The advantage would be that a few more people with the disease would be alerted,
possibly in time to be cured. Specifically, the fraction of the population denoted
by {D = 1, T = 1} would increase from 0.00490 to 0.00495. But the PVP would
actually decrease from 0.1096 to 0.0901. c
Note: (b) This is a small fraction of the population. It would have been prohibitively
expensive (and depending on risks, possibly even unethical) to perform the definitive
diagnostic procedures on the entire population. But the screening test permits focus
on a small subpopulation of people who are relatively likely to have the disease and
in which it may be feasible to perform the definitive diagnostic procedures.
5.12 Recall the lie detector test of Problem 5.1. In the population of in-
terest, suppose 5% of the people are liars.
a) What is the probability that a randomly chosen member of the population
will fail the test?
d From the answer to Problem 5.1, recall that sensitivity η = 0.9 and specificity
θ = 0.8. Here we evaluate
τ = P {T = 1} = πη + (1 − π)(1 − θ) = 0.05(0.9) + 0.95(0.2) = 0.045 + 0.19 = 0.235,
using equation (5.2) on p123. (We provide the notation requested in part (d) as we
go along.) c
b) What proportion of those who fail the test are really liars? What propor-
tion of those who fail the test are really truth-tellers?
d We require
γ = P {D = 1|T = 1} = πη/τ = 0.045/0.235 = 0.1915,
where the computation follows from equation (5.5) on p126. Of those who fail the
test, the proportion 1 − γ = P {D = 0|T = 1} = 1 − 0.1915 = 0.8085 will be falsely
accused of being liars. (By contrast, among all who take the test: the proportion
P {D = 0, T = 1} = (1 − π)(1 − θ) = 0.95(0.2) = 0.19 will be falsely accused, and
the proportion P {D = 1, T = 1} = πη = 0.045 rightly accused. As a check, notice
that τ = 0.19 + 0.045.)
Recall the original complaint, quoted in Problem 5.1, that lie detector tests pass
10% of liars. That deficiency, together with examples, such as the ones in the previous
paragraph, showing that the tests accuse relatively large numbers of truthful people,
makes judges reluctant to allow results of lie detector tests in the courtroom. c
c) What proportion of those who pass the test are really telling the truth?
d This proportion is δ = P {D = 0|T = 0} = (1 − π)θ/(1 − τ ) = 0.95(0.8)/(1 − 0.235) = 0.76/0.765 ≈ 0.9935. c
5.13 In Example 5.3, a regulatory agency may be concerned with the values
of η and γ. Interpret these two conditional probabilities in terms of testing a
batch for potency. Extend the program in this example to obtain approximate
numerical values for η and γ.
d In the language of Example 5.3, η = P (F |B) and γ = P (B|F ). The program below
has been extended to evaluate all four conditional probabilities, including θ and δ.
(The spacing in the output has been fudged slightly for easy reading.) c
set.seed(1066)
n = 500000
mu.s = 110; sd.s = 5; cut.s = 100
sd.x = 1; cut.x = 101
s = rnorm(n, mu.s, sd.s)
x = rnorm(n, s, sd.x)
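The lines that evaluate the four conditional probabilities are not reproduced above; one way to extend the program (a sketch, with Bad = {S ≤ 100} and Fail = {X ≤ 101}) is:
bad  = (s <= cut.s)        # Bad batch: true potency at or below 100
fail = (x <= cut.x)        # Fails inspection: assay result at or below 101
mean(fail[bad])            # eta   = P(Fail | Bad)
mean(!fail[!bad])          # theta = P(Pass | Good)
mean(bad[fail])            # gamma = P(Bad | Fail)
mean(!bad[!fail])          # delta = P(Good | Pass)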
Note: For verification, the method of Problem 5.14 provides values accurate to at
least four decimal places.
Then η = P (BF )/π, θ = P (GP )/(1 − π), γ = P (BF )/τ , and δ = P (GP )/(1 − τ ). c
where we have expressed the conditional CDF of X|S = s as the integral of its density
function. In part (c) we evaluate this probability as P (G∩P ) = 0.96044. Notice that
this is not the same as P {S > 100}P {E > 1} = (1 − Φ(−2))(1 − Φ(1)) = 0.1550,
because events S and E are not independent. c
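For instance, the value P (G ∩ P ) = 0.96044 quoted above can be reproduced by a numerical integration of this form (a sketch, not the program of Problem 5.14 itself):
integrand = function(s) dnorm(s, 110, 5) * (1 - pnorm(101, s, 1))  # f_S(s) P{X > 101 | S = s}
integrate(integrand, lower=100, upper=Inf)                         # approximately 0.9604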
5.15 In Example 5.3, change the rule for “passing inspection” as follows.
Each batch is assayed twice; if either of the two assays gives a result above
101, then the batch passes.
d A routine change in the program. No answers provided. c
a) Change the program of the example to simulate the new situation; some
useful R code is suggested below. What is the effect of this change on τ ,
θ, and γ?
x1 = rnorm(n,s,sd.x); x2 = rnorm(n,s,sd.x); x = pmax(x1, x2)
b) If you did Problem 5.13, then also compare the numerical values of η and γ
before and after the change in the inspection protocol.
5.16 In Example 5.4, suppose that Machine D is removed from service and
that Machine C is used to make 20% of the parts (without a change in its
error rate). What is the overall error rate now? If a defective part is selected
at random, what is the probability that it was made by Machine A?
d First, we show R code that can be used to get the results in Example 5.4. Names
of the vectors anticipate some of the terminology in Chapter 8.
Vector prior shows the proportions of all plastic parts made by each of the
four machines (component 1 for A, 2 for B, and so on). That is, if we went into
the warehouse and selected a part at random, the elements of this vector show the
probabilities that the bracket was made by each machine. (Suppose a tiny code
number molded into each bracket allows us to determine the machine that made
that bracket.)
The ith element of the vector like shows the likelihood that a bracket from the
ith machine is defective. The code computes the vector post. Of all of the defective
parts, the ith element of this vector is the fraction that is made by the ith machine.
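Because the shares and error rates of Example 5.4 are not reproduced in this Manual, the numbers in the following sketch are placeholders only; the structure of the computation is the point:
prior = c(0.15, 0.25, 0.40, 0.20)       # hypothetical shares for Machines A, B, C, D
like  = c(0.020, 0.010, 0.005, 0.015)   # hypothetical P(defective) for each machine
joint = prior * like                    # P(made by machine i and defective)
p.E   = sum(joint); p.E                 # overall proportion of defective parts
post  = joint / p.E; post               # P(machine i | defective)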
Now we modify the program to work the stated problem. Because Machine D
has been taken out of service, the 4-vectors above have become 3-vectors below.
We see that 0.55% of all parts are defective. That is, P (E) = 0.0055. This is about
half as many defectives as in the Example (using Machine D). Also, Machine A
now makes over 44% of the defective brackets (compared with 39% in the Example).
That is, we now have posterior probability P (A|E) = 0.444. c
5.17 There are three urns, identical in outward appearance. Two of them
each contain 3 red balls and 1 white ball. One of them contains 1 red ball and
3 white balls. One of the three urns is selected at random.
a) Neither you nor John has looked into the urn. On an intuitive “hunch,”
John is willing to make you an even-money bet that the urn selected has
one red ball. (You each put up $1 and then look into the urn. He gets
both dollars if the urn has exactly one red ball, otherwise you do.) Would
you take the bet? Explain briefly.
d Yes, unless you believe John has extraordinary powers or has somehow cheated.
P (Urn 3) = 1/3. There are two chances in three that you would win a dollar, and one
chance in three that you would lose a dollar; expected profit for you: 1(2/3) − 1(1/3)
or a third of a dollar. c
b) Consider the same situation as in (a), except that one ball has been chosen
at random from the urn selected, and that ball is white. The result of this
draw has provided both of you with some additional information. Would
you take the bet in this situation? Explain briefly.
d Denote by W the event that the first ball drawn from the urn selected is white,
and by Ui , for i = 1, 2, 3, the event that the ith urn was selected. We want to know
P (U3 |W ) = P (U3 ∩ W )/P (W ) = P (U3 )P (W |U3 )/P (W ). The total probability of
getting a white ball is
P (W ) = Σi P (Ui ∩ W ) = Σi P (Ui )P (W |Ui ) = (1/3)(1/4 + 1/4 + 3/4) = 5/12,
so P (U3 |W ) = (1/3)(3/4)/(5/12) = 3/5. There are now three chances in five that the
urn has exactly one red ball, so a reasonable person would not take John's even-money bet.
Of course, this is an application of Bayes’ Theorem in which the three urns are the
partitioning events. Some frequentist statisticians insist that the inverse conditional
probabilities from Bayes’ Theorem be called something other than probabilities (such
as proportions or fractions). For example, a conditional outcome such as “disease
given positive test” is either true or not. Possibly pending results of a gold standard
test, we will know which. But this particular subject (or unit of blood) has already
been chosen, and his or her (or its) infection status is difficult to discuss according
to a frequentist or long-run interpretation of probability.
However, in the simple gambling situation of this problem, when you are con-
sidering how to bet, you are entitled to your own personal probability for each urn.
Before getting the information about drawing a white ball, a reasonable person would
take John’s offer. After that information is available, a reasonable person would not.
The Bayesian approach to probabilities, formally introduced in Chapter 8, is often
used to model such personal opinions. c
5.18 According to his or her use of an illegal drug, each employee in a large
company belongs to exactly one of three categories: frequent user, occasional
user, or abstainer (never uses the drug at all). Suppose that the percentages
of employees in these categories are 2%, 8%, and 90%, respectively. Further
suppose that a urine test for this drug is positive 98% of the time for frequent
users, 50% of the time for occasional users, and 5% of the time for abstainers.
d In the program below, the vector prior shows the distribution of frequent users, oc-
casional users, and abstainers, respectively. The vector like shows the probabilities
of the respective groups to get positive test results. Finally, after computation, the
vector post shows the distribution of the three employee types among those getting
positive test results. c
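A minimal version of that program (reconstructed here because only its results are quoted below; running it gives pos ≈ 0.1046 and a posterior proportion of abstainers near 0.43) might be:
prior = c(0.02, 0.08, 0.90)       # frequent users, occasional users, abstainers
like  = c(0.98, 0.50, 0.05)       # P(positive test) for each group
pos   = sum(prior * like); pos    # overall P(positive test)
post  = prior * like / pos; post  # distribution of the three groups among positives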
a) If employees are selected at random from this company and given this
drug test, what percentage of them will test positive?
d As shown by the quantity pos in the output above, P (Positive test) = 0.1046. c
d From the vector post in the output of the program, among those who have positive
tests, the proportion of abstainers is 43%. c
c) Suppose that employees are selected at random for testing and that those
who test positive are severely disciplined or dismissed. How might an
employee union or civil rights organization argue against the fairness of
drug testing in these circumstances?
d Possible points: Over 10% of employees will be accused of drug use and disciplined
or dismissed. Of those, 43% are abstainers. c
There are almost no frequent users among the negatives (45 in 100,000), and very
few occasional users (less than 5%). Almost all (more than 95%) of those testing
negative are abstainers.
But this does not help those falsely accused: 4.5% of all employees tested are
abstainers and have positive tests (from the product line in the results for positive
results): P (Abstain ∩ Positive) = 0.045. From the product line for negative tests,
85.5% of all employees abstain and test negative: P (Abstain ∩ Negative) = 0.855.
Together, of course, these last two probabilities add up to the 90% of all employees
who abstain. c
Comment: (d) Consider, as one example, a railroad that tests only train operators
who have just crashed a train into the rear of another train.
where the integrals are taken over the real line. Give reasons for each step
in this equation. (Compare this result with equation (5.7).)
d The first equality expresses the definition of a conditional density function fS|X
in terms of the joint density function fX,S and the marginal density function fX . The
second shows the marginal density fX as the integral of the joint density function
fX,S with respect to s. The last equality uses the definition of the conditional
density function fX|S in both numerator and denominator.
Extra. In some applications (not here) it is obvious from the functional form
of the numerator fS (s)fX|S (x|s) on the right, that it must be proportional to a
known density function. In that case, it is not necessary to find the integral in the
denominator, and one writes
fS|X (s|x) ∝ fS (s)fX|S (x|s),
where the symbol ∝ is read “proportional to.” The left-hand side is called the
posterior probability density. The function fX|S (x|s), viewed as a function
of s for given data x, is called the likelihood function. And the function fS (s) is called
the prior density function. More on this in Chapter 8. c
so
1 − γ = P (Good|Fail) = P {S > 100|X ≤ 101} ≈ 0.433.
All of these events are conditioned on events of positive probability.
By contrast, this problem focuses on P {S > 100|X = 100.5}, which can be eval-
uated using the conditional density fS|X at the value x = 100.5. The conditional
density is defined by conditioning on the event {|X − 100.5| < ε} as ε → 0. Roughly,
one might say P {S > 100|X = 100.5} ≈ 0.81 is the probability a batch is good
given that it “barely” fails because X = 100.5 (just below the cut-off value 101).
The first printing had a misprint, asking about 1 − δ, when 1 − γ was intended.
Confusion of P {S > 100|X = 100.5} with 1 − δ might result from a reversal of the
roles of X and S. Avoiding confusion of P {S > 100|X = 100.5} with 1 − γ requires
one to make the distinction between (i) the general condition X < 101 for failure of
a batch and (ii) the specific condition X = 100.5 for a particular batch. c
wide = 30                                      # integrate over s in (100, 130); mass above 130 is negligible
s = seq(100, 100+wide, 0.001)                  # fine grid of s-values
numer = wide * mean(dnorm(s, 110, 5) * dnorm(100.5, s, 1))  # approx. integral of f_S(s) f_X|S(100.5|s) over s > 100
denom = dnorm(100.5, 110, 5.099)               # marginal density of X at 100.5; X ~ NORM(110, sqrt(26))
numer/denom                                    # P{S > 100 | X = 100.5}
> numer/denom
[1] 0.8113713
Errors in Chapter 5
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p128 Example 5.3. In the second paragraph, change to: ...the value observed is a
conditional random variable X|{S = s} ∼ NORM(s, 1).
p133 Problem 5.6. In the second paragraph, three instances of 2.5 should be 1.7. (For
clarity, in the second printing, the first two paragraphs of the problem are to be
revised as shown in this Manual.)
6.1 In each part below, consider the three Markov chains specified as fol-
lows: (i) α = 0.3, β = 0.7; (ii) α = 0.15, β = 0.35; and (iii) α = 0.03, β = 0.07.
a) Find P {X2 = 1|X1 = 1} and P {Xn = 0|Xn−1 = 0}, for n ≥ 2.
d Probabilities of not moving. (i) P {X2 = 1|X1 = 1} = p11 = 1 − β = 1 − 0.7 = 0.3,
P {Xn = 0|Xn−1 = 0} = p00 = 1 − α = 0.7. (ii) p11 = 0.65, p00 = 0.85. (iii) p11 =
0.93, p00 = 0.97. c
c) For each chain, modify the program of Example 6.2 to find the long-run
fraction of steps in state 1.
In the notation of Sections 6.2 and 6.3, we seek λ1 = α/(α + β).
The exact answer is 0.3 in all three cases. For case (iii), the program and
partial results for P64 are shown below. (You will find that smaller powers of
the transition matrix suffice for the other two cases.)
P = matrix(c(.97, .03,
.07, .93), nrow=2, ncol=2, byrow=T)
P
P2 = P %*% P; P4 = P2 %*% P2; P4
P8 = P4 %*% P4; P16 = P8 %*% P8; P16
P32 = P16 %*% P16; P64 = P32 %*% P32; P64
...
d) For each chain, make and interpret plots similar to Figures 6.3 (where the
number of steps is chosen to illustrate the behavior clearly), 6.4, and 6.5.
d The program changes are trivial, but you may need to experiment to find what m
is required for satisfactory convergence. In case (i), the autocorrelation for positive
lags is 0. In case (iii) autocorrelations are highly significant even for moderately
large lags. c
6.2 There is more information in the joint distribution of two random vari-
ables than can be discerned by looking only at their marginal distributions.
Consider two random variables X1 and X2 , each distributed as BINOM(1, π),
where 0 < π < 1.
a) In general, show that 0 ≤ Q11 = P {X1 = 1, X2 = 1} ≤ π. In particular,
evaluate Q11 in three cases, in which: (i) X1 and X2 are independent,
(ii) X2 = X1 , and (iii) X2 = 1 − X1 , respectively.
d In general, 0 ≤ Q11 = P {X1 = 1, X2 = 1} ≤ P {X1 = 1} = π, where we have used
the inequality in the Hints. In particular,
(i): We have 0 ≤ Q11 = P {X1 = 1, X2 = 1} = P {X1 = 1}P {X2 = 1} = π 2 ≤ π.
(ii): If X1 = X2 , then 0 ≤ Q11 = P {X1 = 1, X2 = 1} = P {X1 = 1} = π.
(iii): If X1 + X2 = 1, then 0 ≤ Q11 = P {X1 = 1, X2 = 1} = 0, so Q11 = 0. c
for Q11 , which can be equated and solved to get π = α/(α + β). The rest follows
from simple probability rules and algebra. The three cases of parts (a) and (b) can
be expressed in terms of α and β as: (i) α + β = 1, independence (see the Hints);
(ii) α = β = 0, “never-move”; (iii) α = β = 1, “flip-flop.” c
b) Show that the geometric series p(x) sums to 1, so that one is sure to see
a Head eventually.
d Let T = Σx≥1 p(x) = Σx≥1 π(1 − π)^{x−1}. Then T = π/[1 − (1 − π)] = 1 by the
standard formula for the sum of a geometric series, so one is sure to see a Head eventually.
> mean(rx)
[1] 3.01637 # simulates E(X) by sampling
Anticipating part (c) in the R code above, we also approximate E(X) = 3, first
by summing 35 terms of the series Σx≥1 x p(x), and then by a simulation based
on 100 000 random realizations of X. c
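The program itself is not reproduced above; a minimal sketch along the same lines, assuming π = 1/3 (consistent with E(X) = 3 and V(X) = 6 quoted in this answer), is:
p = 1/3                              # assumed success probability, so E(X) = 1/p = 3
sum((1:35) * p * (1 - p)^(0:34))     # partial sum of 35 terms of the series for E(X)
rx = rgeom(100000, p) + 1            # rgeom counts failures, so add 1 to get the trial number X
mean(rx); var(rx)                    # near E(X) = 3 and V(X) = (1 - p)/p^2 = 6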
m(t) = E(e^{tX}) = Σx≥1 e^{tx} π(1 − π)^{x−1} = πe^t Σx≥1 [(1 − π)e^t]^{x−1} = πe^t/[1 − (1 − π)e^t],
where the final result is from the standard formula for the sum of a geometric
series. Differentiation to obtain µX = E(X) = 1/π is elementary calculus. The
variance V(X) = (1 − π)/π² can be found using m″(0) = E(X²) and the formula
V(X) = E(X²) − (µX)². Appended to the run of the program in the answer to part (b),
the instruction var(rx) returned 6.057, approximating V(X) = 6. c
6.4 Suppose the weather for a day is either Dry (0) or Rainy (1) according
to a homogeneous 2-state Markov chain with α = 0.1 and β = 0.5. Today is
Monday (n = 1) and the weather is Dry.
a) What is the probability that both tomorrow and Wednesday will be Dry?
d Tuesday and Wednesday will both be Dry days: P {X2 = 0, X3 = 0|X1 = 0} =
P {X2 = 0|X1 = 0}P {X3 = 0|X2 = 0} = p00 p00 = (1 − α)2 = 0.92 = 0.81. c
c) Use equation (6.5) to find the probability that it will be Dry two weeks
from Wednesday (n = 17).
d Upper-left element of P16 : p00 (16) = β/(α + β) + (1 − α − β)^16 α/(α + β) =
0.5/0.6 + 0.4^16 (0.1)/0.6 = 0.8333334. This is not far from λ0 = β/(α+β) = 0.5/0.6 =
0.8333333. The fact that we began with a Dry day has become essentially irrelevant. c
d) Modify the R code of Example 6.1 to find the probability that it will be
Dry two weeks from Wednesday.
P = matrix(c(.9, .1,
.5, .5), nrow=2, ncol=2, byrow=T)
P; P2 = P %*% P; P4 = P2 %*% P2; P4
P8 = P4 %*% P4; P8
P16 = P8 %*% P8; P16
...
> P16 = P8 %*% P8; P16
[,1] [,2]
[1,] 0.8333334 0.1666666
[2,] 0.8333330 0.1666670
e) Over the long run, what will be the proportion of Rainy days? Modify
the R code of Example 6.1 to simulate the chain and find an approximate
answer.
d The exact value is α/(α + β) = 1/6. The plot generated by the program below
(not shown here) indicates that m = 50 000 iterations is sufficient for the trace to
stabilize near the exact value. c
set.seed(1234)
m = 50000; n = 1:m; x = numeric(m); x[1] = 0
alpha = 0.1; beta = 0.5
for (i in 2:m) {
if (x[i-1]==0) x[i] = rbinom(1, 1, alpha)
else x[i] = rbinom(1, 1, 1 - beta) }
y = cumsum(x)/n; y[m]
a = sum(x[1:(m-1)]==0 & x[2:m]==1); a # No. of cycles
m/a # Average cycle length
plot(y, type="l", ylim=c(0,.3), xlab="Step",
ylab="Proportion of Rainy Days")
abline(h = 1/6, col="green")
> y[m]
[1] 0.1659
> a = sum(x[1:(m-1)]==0 & x[2:m]==1); a # No. of cycles
[1] 4123
> m/a # Average cycle length
[1] 12.12709
d Runs of Rain have average length 1/β = 2. The total cycle length averages
1/α + 1/β = 12. This is approximated by the program in part (e) as 12.1. c
d) Start with X1 = 0. For n > 1, a fair die is rolled. If the maximum value
shown on the die at any of the steps 2, . . . , n is smaller than 6, then
Xn = 0; otherwise, Xn = 1.
d Answers for (i)–(iii) are all 0. Once a 6 is seen at step i, we have Xi = 1 and the
value of Xn , for n > i, can never be 0. (iv) P {X13 = 0|X11 = 0} = (5/6)² because
rolls of the die at steps 12 and 13 must both show values less than 6. (v) This is
a homogeneous Markov chain with α = 1/6 and β = 0. It is one of the absorbing
chains mentioned in Section 6.1 (p141). c
e) At each step n > 1, a fair coin is tossed, and Un takes the value −1 if the
coin shows Tails and 1 if it shows Heads. Starting with V1 = 0, the value
of Vn for n > 1 is determined by
Vn = Vn−1 + Un (mod 5).
The process Vn is sometimes called a “random walk” on the points
0, 1, 2, 3 and 4, arranged around a circle (with 0 adjacent to 4). Finally,
Xn = 0, if Vn = 0; otherwise Xn = 1.
d The V -process is a Markov chain with five states, S = {0, 1, 2, 3, 4}. (Processes
with more than two states are discussed in more detail in Chapter 7. Perhaps you
can write the 5 × 5 transition matrix.)
(i) Because X1 = V1 = 0, we know that X2 = 1, so P {X3 = 0|X2 = 1} = 1/2. At
step 2 the V -process must be in state 1 or 4; either way, there is a 50-50 chance that
X3 = V3 = 0. (ii) Similarly, P {X13 = 0|X12 = 1, X11 = 0} = 1/2. (iii) However,
P {X13 = 0|X12 = 1, X11 = 1, X10 = 0} = 0, because the V -process must be
in either state 2 or 3 at step 12, with no chance of returning to 0 at step 13.
(If we don’t know the state of the X-process at step 10, it’s more difficult to say
what happens at step 13. But the point in (v) is that its state at step 10 matters.)
(iv) P {X13 = 0|X11 = 0} = 1/2, because X12 = 1 automatically and then, as in (ii), V13 returns to 0 with probability 1/2.
(v) The X-process is not Markov because the probabilities in (ii) and (iii) differ.
The X-process is a function of the V -process; this example shows that a function of
a Markov chain need not be Markov. c
Hints and partial answers: (a) Independence is consistent with the Markov property.
(b) Steps 1 and 2 are independent. Show that the values at steps 1 and 2 determine
the value at step 3 but the value at step 2 alone does not. (c) P = (1/6)[5 1; 1 5],
that is, rows (5/6, 1/6) and (1/6, 5/6). (d) Markov
chain. (e) The X-process is not Markov.
6.6 To monitor the flow of traffic exiting a busy freeway into an industrial
area, the highway department has a TV camera aimed at traffic on a one-
lane exit ramp. Each vehicle that passes in sequence can be classified as Light
(for example, an automobile, van, or pickup truck) or Heavy (a heavy truck).
Suppose data indicate that a Light vehicle is followed by another Light vehi-
cle 70% of the time and that a Heavy vehicle is followed by a Heavy one 5%
of the time.
a) What assumptions are necessary for the Heavy-Light process to be a ho-
mogenous 2-state Markov chain? Do these assumptions seem realistic?
(One reason the process may not be independent is a traffic law that for-
bids Heavy trucks from following one another within a certain distance
on the freeway. The resulting tendency towards some sort of “spacing”
between Heavy trucks may carry over to exit ramps.)
d Assume Markovian dependence only on the last step; probably a reasonable ap-
proximation to reality. Assume the proportions of Light and Heavy vehicles remain
constant (homogeneous) over time. This does not seem reasonable if applied day and
night, weekdays and weekends, but it may be reasonable during business hours. c
b) If I see a Heavy vehicle in the monitor now, what is the probability that
the second vehicle after it will also be Heavy? The fourth vehicle after it?
d Denote Heavy as 1 and Light as 0. Then P {X3 = 1|X1 = 1} = p11 p11 + p10 p01 =
0.05² + 0.95(0.30) = 0.2875. For the second probability, we need to keep track of
eight possible sequences of five 0s and 1s, beginning and ending with 1s, so it is easier
to use matrix multiplication. Below we show the first, second and fourth powers of
the transition matrix with α = 0.3 and β = 0.95. Notice that the lower-right element
of P2 is p11 (2) = 0.2875, as above. Similarly, the second required probability is the
lower-right element of P4 , which is p11 (4) = 0.2430, to four places. c
P = matrix(c(.7, .3,
.95, .05), nrow=2, ncol=2, byrow=T)
P
P2 = P %*% P; P2
P4 = P2 %*% P2; P4
> P
[,1] [,2]
[1,] 0.70 0.30
[2,] 0.95 0.05
> P2 = P %*% P; P2
[,1] [,2]
[1,] 0.7750 0.2250
[2,] 0.7125 0.2875
> P4 = P2 %*% P2; P4
[,1] [,2]
[1,] 0.7609375 0.2390625
[2,] 0.7570312 0.2429687
c) If I see a Light vehicle in the monitor now, what is the probability that
the second vehicle after it will also be Light? The fourth vehicle after it?
d From the output for part (b), p00 (2) = 0.7750 and p00 (4) = 0.7609. c
d) In the long run, what proportion of the vehicles on this ramp do you
suppose is Heavy?
d The long-run probability is λ1 = lim p11 (r) = lim p01 (r) as r → ∞, which equals
α/(α + β) = 0.30/1.25 = 0.24. c
e) How might an observer of this Markov process readily notice that it differs
from a purely independent process with about 24% Heavy vehicles.
d In a purely independent process, runs of Heavy vehicles would average about
1/0.76 = 1.32 in length, so we would regularly see pairs of Heavy vehicles. Specif-
ically, if one vehicle is Heavy, then the next one is Heavy roughly a quarter of the
time. By contrast, this Markov chain will produce runs of Heavy vehicles that av-
erage about 1/0.95 = 1.05 in length, so we would very rarely see pairs of Heavy
vehicles. c
n = 1:100
p1 = pbinom(n*.12, n, .25)
p2 = 1 - pbinom(n*.12, n, .05)
N = min(n[pmax(p1, p2) < .05])
N; p1[N]; p2[N]
P {X1 = 0, X3 = 1} = P {X1 = 0, X2 = 0, X3 = 1}
+ P {X1 = 0, X2 = 1, X3 = 1}.
P (B|A)P (C|A ∩ B) = [P (A ∩ B)/P (A)] · [P (A ∩ B ∩ C)/P (A ∩ B)] = P (A ∩ B ∩ C)/P (A).
This equation justifies the fourth equality below. The Markov property accounts for
the simplification involved in the fifth equality.
p01 (2) = P {X3 = 1|X1 = 0} = P {X1 = 0, X3 = 1}/P {X1 = 0}
= [P {X1 = 0, X2 = 0, X3 = 1} + P {X1 = 0, X2 = 1, X3 = 1}]/P {X1 = 0}
= [P {X1 = 0}P {X2 = 0|X1 = 0}P {X3 = 1|X2 = 0, X1 = 0}
+ P {X1 = 0}P {X2 = 1|X1 = 0}P {X3 = 1|X2 = 1, X1 = 0}]/P {X1 = 0}
= [P {X1 = 0}P {X2 = 0|X1 = 0}P {X3 = 1|X2 = 0}
+ P {X1 = 0}P {X2 = 1|X1 = 0}P {X3 = 1|X2 = 1}]/P {X1 = 0}
= (p0 p00 p01 + p0 p01 p11 )/p0 = p00 p01 + p01 p11
= (1 − α)α + α(1 − β) = α(2 − α − β).
In the next-to-last line above, we use notation p0 of part (c) and Section 6.2. c
6.8 To verify equation (6.5) do the matrix multiplication and algebra nec-
essary to verify each of the four elements of P2 .
d We show how the upper-left element p00 (2) of P2 arises from matrix multiplication.
The first row of P is the vector (p00 , p01 ) = (1 − α, α), and its first column is the
vector (p00 , p10 ) = (1 − α, β). So
p00 (2) = Σi=0,1 p0i pi0 = p00 p00 + p01 p10 = (1 − α)² + αβ
= β/(α + β) + (1 − α − β)² α/(α + β),
where we omit a few steps of routine algebra between the first and second lines. c
6.9 Prove equation (6.6), by mathematical induction as follows:
Initial step: Verify that the equation is correct for r = 1. That is, let r = 1
in (6.6) and verify that the result is P.
Induction step: Do the matrix multiplication P·Pr , where Pr is given by the
right-hand side of (6.6). Then simplify the result to show that the product
Pr+1 agrees with the right-hand side of (6.6) when r is replaced by r + 1.
d We do not show the somewhat tedious and very routine algebra required here. c
> P
[,1] [,2]
[1,] 0.0001 0.9999
[2,] 0.9999 0.0001
0 1 2 3 4 5 6 7 8 9 10
1.00 -0.99 0.98 -0.97 0.96 -0.95 0.94 -0.93 0.92 -0.91 0.90
11 12 13 14 15 16 17 18 19 20
-0.89 0.88 -0.87 0.86 -0.85 0.84 -0.83 0.82 -0.81 0.80
0 1 2 3 4 5 6 7 8
1.000 -0.990 0.980 -0.970 0.960 -0.950 0.941 -0.931 0.921
9 10 11 12 13 14 15 16 17
-0.911 0.901 -0.891 0.881 -0.871 0.861 -0.851 0.842 -0.832
18 19 20
0.822 -0.812 0.802
While the powers of the matrix converge very slowly, the almost-deterministic
simulations converge very rapidly to 1/2 (for m = 100 and starting with X1 = 0),
and to 50/101 (for m = 101). There is no simple connection between the speed of
convergence of the powers of the transition matrix to a matrix with all-identical rows
and the speed with which the trace of the simulated chain converges to its limit.
We have omitted the plots as in Figures 6.3, 6.4, and 6.5; you should make them
and look at them. Printouts for the ACF, show alternating negative and positive
autocorrelations.
Very occasionally, the usually-strict alternation between 0 and 1 at successive
steps may be broken, giving slightly different results for y[m] (the average X̄m ), and
also giving a rare value of x[m] (single observation Xm ) that is 0 for m = 100—or
that is 1 for m = 101. c
Note: The autocorrelations for small lags have absolute values near 1 and they
alternate in sign; for larger lags, the trend towards 0 is extremely slow.
b) If a position along the strand is not C, then what is the probability that
the next position is C?
d The probability that a non-C position is followed by C is p01 = α = 0.327. c
> P
[,1] [,2]
[1,] 0.673 0.327
[2,] 0.632 0.368
> P2 = P %*% P; P2
[,1] [,2]
[1,] 0.659593 0.340407
[2,] 0.657912 0.342088
> P4 = P2 %*% P2; P4
[,1] [,2]
[1,] 0.6590208 0.3409792
[2,] 0.6590180 0.3409820
c) It is possible for a chain that does not have a long-run distribution to have
a steady-state distribution. What is the steady-state distribution of the
“flip-flop” chain? What are the steady-state distributions of the “never
move” chain?
d The unique steady-state vector for the flip-flop chain is σ = (1/2, 1/2). The never-
move chain has the two-dimensional identity matrix as its transition matrix: P = I,
so any 2-element vector σ has σP = σI = σ. If the elements of σ are nonnegative and
add to unity, it is a steady-state distribution of the never-move chain. c
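As a quick check in R (a sketch; the flip-flop chain has α = β = 1):
P = matrix(c(0, 1,
             1, 0), nrow=2, byrow=T)   # flip-flop transition matrix
sigma = c(1/2, 1/2)
sigma %*% P                            # returns (0.5, 0.5) = sigma, so sigma is steady-state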
6.13 Suppose a screening test for a particular disease has sensitivity η = 0.8
and specificity θ = 0.7. Also suppose, for a particular population that is espe-
cially at risk for this disease, PV Positive γ = 0.4 and PV Negative δ = 0.9.
a) Use the analytic method of Example 6.6 to compute π.
eta = 0.8; theta = 0.7; gamma= 0.4; delta = 0.9
Q = matrix(c(theta, 1-theta,
1-eta, eta ), ncol=2, byrow=T)
R = matrix(c(delta, 1-delta,
1-gamma, gamma ), ncol=2, byrow=T)
P = Q %*% R; alpha = P[1,2]; beta = P[2,1]
prevalence = alpha/(alpha + beta); prevalence
> prevalence
[1] 0.2235294
set.seed(1066)
m = 100000; d = t = numeric(m); d[1] = 0
eta = .8; theta = .7; gamma = .4; delta = .9
for (n in 2:m) {
if (d[n-1]==1) t[n-1] = rbinom(1, 1, eta)
else t[n-1] = rbinom(1, 1, 1 - theta)
if (t[n-1]==1) d[n] = rbinom(1, 1, gamma)
else d[n] = rbinom(1, 1, 1 - delta) }
runprop = cumsum(d)/1:m
par(mfrow=c(1,2)) # plots not shown here
plot(runprop, type="l", ylim=c(.1,.35),
xlab="Step", ylab="Running Proportion Infected")
abline(v=m/2, lty="dashed")
acf(d, ylim=c(-.1,.4))
par(mfrow=c(1,1))
mean(d[(m/2+1):m])
acf(d, plot=F)
> mean(d[(m/2+1):m])
[1] 0.22268
> acf(d, plot=F)
0 1 2 3 4 5 6 7 8
1.000 0.151 0.027 0.007 -0.004 -0.005 -0.001 0.002 0.001
9 10 11 12 13 14 15 16 17
0.001 0.007 0.004 0.003 0.002 -0.003 0.000 -0.002 -0.001
...
Answer: π ≈ 0.22; your answer to part (a) should show four places.
6.14 Mary and John carry out an iterative process involving two urns and
two dice as follows:
(i) Mary has two urns: Urn 0 contains 2 black balls and 5 red balls; Urn 1
contains 6 black balls and 1 red ball. At step 1, Mary chooses one ball at
random from Urn 1 (thus X1 = 1). She reports its color to John and returns
it to Urn 1.
(ii) John has two fair dice, one red and one black. The red die has three
faces numbered 0 and three faces numbered 1; the black die has one face
numbered 0 and five faces numbered 1. John rolls the die that corresponds to
the color Mary reported to him. In turn, he reports the result X2 to Mary. At
step 2, Mary chooses the urn numbered X2 (0 or 1).
(iii) This process is iterated to give values of X3 , X4 , . . . .
a) Explain why the X-process is a Markov chain, and find its transition
matrix.
d Mary’s choice of an urn at step n, depends on what John tells her, which depends
in turn on Mary’s choice at step n − 1. However, knowing her choice at step n − 1
is enough to compute probabilities of her choices at step n. No information earlier
than step n is relevant in that computation. Thus the X-process is a Markov chain.
Its transition matrix is shown in the Hint. c
Q = (1/7)*matrix(c(2, 5,
6, 1), ncol=2, byrow=T)
R = (1/6)*matrix(c(1, 5,
3, 3), ncol=2, byrow=T)
P = Q %*% R; alpha = P[1,2]; beta = P[2,1]
urn.1 = alpha/(alpha + beta); urn.1
c) Modify the program of Example 6.5 to approximate the result in part (b)
by simulation.
d The run of the program below gives very nearly the exact answer. The seed chosen
for that run is an unusually “lucky” one; with a seed randomly chosen from your
system clock, you should expect about two-place accuracy.
Problem 6.13 could pretty much be done by plugging in parameters from Ex-
amples 6.5 and 6.6. By contrast, this problem gives you the opportunity to think
through the logic of the Gibbs Sampler in order to get the correct binomial para-
meters inside the loop.
After running the program below (in which urn denotes the X-process), you can
use the same code to make plots of the trace and ACF as in part (c) of Problem 6.13,
but change the plotting interval of the vertical scale to something like (0.65, 0.85). c
set.seed(2011)
m = 100000
urn = die = numeric(m); urn[1] = 0
for (n in 2:m)
{
if (urn[n-1]==1) die[n-1] = rbinom(1, 1, 1/7)
else die[n-1] = rbinom(1, 1, 1 - 2/7)
if (die[n-1]==1) urn[n] = rbinom(1, 1, 3/6)
else urn[n] = rbinom(1, 1, 1 - 1/6)
}
runprop = cumsum(urn)/1:m
mean(urn[(m/2+1):m])
> mean(urn[(m/2+1):m])
[1] 0.73522
Hint: Drawing from an urn and rolling a die are each “half” a step; account for both
possible paths to each full-step transition: P = (1/7)[2 5; 6 1] · (1/6)[1 5; 3 3], writing each matrix by rows.
Errors in Chapter 6
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p148 Example 6.2. In the second line below printout, the transition probability should
be p01 (4) ≈ 0.67, not 0.69. [Thanks to Leland Burrill.]
p153 Example 6.6. In the displayed equation, the lower-right entry in first matrix
should be 0.99, not 0.00. [Thanks to Tony Tran.] The correct display is as
follows:
    [ 0.97  0.03 ] [ 0.9998  0.0002 ]   [ 0.9877  0.0123 ]
P = [ 0.01  0.99 ] [ 0.5976  0.4024 ] = [ 0.6016  0.3984 ]
p155 Problem 6.5(e). The displayed equation should have ’mod 5’; consequently, the
points should run from 0 through 4, and 0 should be adjacent to 4. The answer
for part (e) should say: “The X-process is not Markov.” The correct statement
of part (e) is as follows:
e) At each step n > 1, a fair coin is tossed, and Un takes the value −1
if the coin shows Tails and 1 if it shows Heads. Starting with V1 = 0,
the value of Vn for n > 1 is determined by Vn = Vn−1 + Un (mod 5).
7.1 Ergodic and nonergodic matrices. In the transition matrices of the six
4-state Markov chains below, elements 0 are shown and * indicates a positive
element. Identify the ergodic chains, giving the smallest value N for which
PN has all positive elements. For nonergodic chains, explain briefly what
restriction on the movement among states prevents ergodicity.
[Display: the six 4 × 4 transition matrices a)–f) of the problem statement, with * marking positive elements and 0 marking zero elements; see the text.]
d For parts (a), (b), (d), and (f), see the Answers below.
Part (c): The classes of intercommunicating states are A = {1, 2} and B = {3, 4}.
Class B can lead to A, but not the reverse. So B is transient and A is persistent.
Not an ergodic chain.
Part (e): The path 2 → 4 → 3 → 1 → 2 shows that all states intercommunicate. An
ergodic chain. c
Answers: In each chain, let the state space be S = {1, 2, 3, 4}. (a) Ergodic, N = 3.
(b) Class {1, 2, 3} does not intercommunicate with {4}. (d) Nonergodic because of
the period 3 cycle {1} → {2} → {3, 4} → {1}; starting in {1} at step 1 allows visits to
{3, 4} only at steps 3, 6, 9, . . . . (f) Starting in {3} leads eventually to absorption in
either {1, 2} or {4}.
> P2
[,1] [,2] [,3] [,4]
[1,] 0.263860 0.242890 0.247740 0.245510
[2,] 0.265354 0.246180 0.226442 0.262024
[3,] 0.264332 0.247168 0.239408 0.249092
[4,] 0.254158 0.249127 0.241367 0.255348
> P4
[,1] [,2] [,3] [,4]
[1,] 0.2619579 0.2462802 0.2389381 0.2528238
[2,] 0.2617925 0.2463029 0.2389403 0.2529643
[3,] 0.2619256 0.2462810 0.2388936 0.2528999
[4,] 0.2618687 0.2463348 0.2387957 0.2530008
> P8
[,1] [,2] [,3] [,4]
[1,] 0.2618869 0.2462998 0.238892 0.2529213
[2,] 0.2618869 0.2462998 0.238892 0.2529214
[3,] 0.2618869 0.2462998 0.238892 0.2529213
[4,] 0.2618869 0.2462998 0.238892 0.2529214
rowSums(P)
> rowSums(P)
[1] 1 1 1 1 # verifying that all rows of this P add to 1
set.seed(1234)
m = 100000; x = numeric(m); x[1] = 1
for (i in 2:m) {
if (x[i-1] == 1)
x[i] = sample(1:4, 1, prob=c(0.300, 0.205, 0.285, 0.210))
if (x[i-1] == 2)
x[i] = sample(1:4, 1, prob=c(0.322, 0.298, 0.078, 0.302))
if (x[i-1] == 3)
x[i] = sample(1:4, 1, prob=c(0.248, 0.246, 0.298, 0.208))
if (x[i-1] == 4)
x[i] = sample(1:4, 1, prob=c(0.177, 0.239, 0.292, 0.292)) }
summary(as.factor(x))/m # Table of proportions
mean(x[1:(m-1)]==2 & x[2:m]==3) # Est. Proportion of CpG
hist(x, breaks=0:4 + .5, prob=T, xlab="State", ylab="Proportion")
For a graphical impression of the difference between the distribution of the four
nucleotides in the sea and island chains, compare the “histogram” (interpreted as a
bar chart) from the simulation above with the bar chart in Figure 7.1.
As suggested, the simulation program above is similar to the one in Example 7.1.
The program below uses a different style based directly on the transition matrix.
The style is more elegant, but maybe not quite as easy to understand as the one
above. Compare the two programs, and see if you can see how each one works.
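That matrix-style program is not reproduced in this copy; a sketch of the style, using the
same seed and the same transition probabilities (now collected in a matrix P), is:
set.seed(1234)
P = matrix(c(0.300, 0.205, 0.285, 0.210,
             0.322, 0.298, 0.078, 0.302,
             0.248, 0.246, 0.298, 0.208,
             0.177, 0.239, 0.292, 0.292), nrow=4, byrow=T)
m = 100000; x = numeric(m); x[1] = 1
for (i in 2:m) { x[i] = sample(1:4, 1, prob=P[x[i-1], ]) }
summary(as.factor(x))/m                  # table of proportions
mean(x[1:(m-1)]==2 & x[2:m]==3)          # est. proportion of CpG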
We purposely used the same seed in both versions of the program. That the
simulated results are exactly the same in both cases indicates that the two programs
are doing exactly the same simulation. c
Note: The proportion of CpGs among dinucleotides in the island model is approx-
imately 9%; here it is only about 2%. Durbin et al. (1998) discuss how, given the
nucleotide sequence for a short piece of the genome, one might judge whether or
not it comes from a CpG island. Further, with information about the probabilities of
changing between island and “sea,” one might make a Markov chain with 8 states:
A′, T′, G′, C′ for CpG islands and A, T, G, C for the surrounding sea. However, when
observing the nucleotides along a stretch of genome, one cannot tell A from A′,
T from T′, and so on. This is an example of a hidden Markov model.
c) Make several simulation runs similar to the one at the end of Example 7.2
and report the number of steps before absorption in each.
d We use the transition matrix, as in the second part of the answer to Problem 7.2.
For brevity, we use a rather crude and “wasteful” program: We bet on absorption
before step 1000 (an extremely good bet, but use 100 000 if you’re really not a
risk taker), and we do not bother to stop the program at absorption (as we do in
Problem 7.4). We start at Cross 3 (state 3) as in the example, and we change the
population parameter of the sample function to 1:6 in order to match the number
of states in the chain.
Finally, we do not specify a seed, and do show the results of several runs. Unlike
the situation with an ergodic chain, the starting state makes a difference here. So
results would be different if we picked a state other than 3 to start. (If we started
in state 1 or 6, we would be “absorbed” at the outset.) We explore absorption times
further in Problem 7.4. c
P = (1/16)*matrix(c(16, 0, 0, 0, 0, 0,
4, 8, 4, 0, 0, 0,
1, 4, 4, 4, 2, 1,
0, 0, 4, 8, 0, 4,
0, 0, 16, 0, 0, 0,
0, 0, 0, 0, 0, 16), nrow=6, byrow = T)
m = 1000; x = numeric(m); x[1] = 3
for (i in 2:m) { x[i] = sample(1:6, 1, prob=P[x[i-1], ]) }
sba = length(x[(x > 1) & (x < 6)]); sba
> sba
[1] 14
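Below is the program for Problem 7.4 (stopping each run at absorption and recording
the step and state of absorption). Its opening lines are not reproduced in this copy;
the lines above the absorption check are a sketch patterned after the Problem 7.7
program later in this chapter, and the seed and number of runs are assumptions.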
set.seed(1213); m = 10000               # assumed seed and number of runs; lines
step.a = state.a = numeric(m)           #   down to the absorption check below
for (j in 1:m)                          #   are reconstructed (see note above)
{
x = 3; a = 0                            # start at Cross 3; a is set upon absorption
while (a==0) {
i = length(x)                           # current step; next state found below
if (x[i]==1) x = c(x, 1)
if (x[i]==6) x = c(x, 6)
if (x[i] > 1 & x[i] < 6) x = c(x, sample(1:6, 1, prob=P[x[i], ]))
# condition below checks for absorption
if (length(x[x==1 | x==6]) > 0) a = i + 1 }
step.a[j] = a - 1                       # absorption step for jth run
state.a[j] = x[length(x)] }             # absorption state for jth run
b) What is the average length of time this chain spends in any one state before
moving to the next? What is the average length of time to go around the
circle once? From these results, deduce the long-run distribution of this
chain. (In many chains with more than 2 states, the possible transitions
among states are too complex for this kind of analysis to be tractable.)
d The average length of time staying in any one state is 2. The number of steps W
until a move is distributed geometrically with π = 1/2, so E(W) = 1/π = 2. So the
average number of steps to go around the circle is 5(2) = 10. On average, one-fifth
of the time is spent in each state, so the long-run distribution is expressed by the
vector λ = (0.2, 0.2, 0.2, 0.2, 0.2). c
c) Show that the vector σ = (1/5, 1/5, 1/5, 1/5, 1/5) satisfies the matrix
equation σP = σ and thus is a steady-state distribution of this chain. Is
σ also the unique long-run distribution?
d Below we show the transition matrix P, illustrate that σP = σ, and compute
a sufficiently high power of P to show that the matrix is ergodic with limiting
distribution λ (as claimed in part (b)), and thus also that σ = λ.
> P = 1/2*matrix(c(1, 1, 0, 0, 0,
+ 0, 1, 1, 0, 0,
+ 0, 0, 1, 1, 0,
+ 0, 0, 0, 1, 1,
+ 1, 0, 0, 0, 1), nrow=5, byrow=T)
> P
[,1] [,2] [,3] [,4] [,5]
[1,] 0.5 0.5 0.0 0.0 0.0
[2,] 0.0 0.5 0.5 0.0 0.0
[3,] 0.0 0.0 0.5 0.5 0.0
[4,] 0.0 0.0 0.0 0.5 0.5
[5,] 0.5 0.0 0.0 0.0 0.5
> P2 = P %*% P; P2
[,1] [,2] [,3] [,4] [,5]
[1,] 0.25 0.50 0.25 0.00 0.00
[2,] 0.00 0.25 0.50 0.25 0.00
[3,] 0.00 0.00 0.25 0.50 0.25
[4,] 0.25 0.00 0.00 0.25 0.50
[5,] 0.50 0.25 0.00 0.00 0.25
> P4 = P2 %*% P2
> P8 = P4 %*% P4; P8
[,1] [,2] [,3] [,4] [,5]
[1,] 0.2226563 0.1406250 0.1406250 0.2226563 0.2734375
[2,] 0.2734375 0.2226563 0.1406250 0.1406250 0.2226563
[3,] 0.2226563 0.2734375 0.2226563 0.1406250 0.1406250
[4,] 0.1406250 0.2226563 0.2734375 0.2226563 0.1406250
[5,] 0.1406250 0.1406250 0.2226563 0.2734375 0.2226563
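The σP = σ check and a sufficiently high power of P, promised above, are not reproduced
in this copy; a short sketch that supplies them is:
sig = rep(1/5, 5)                       # the vector sigma of part (c)
sig %*% P                               # returns sigma again, so sigma P = sigma
P16 = P8 %*% P8; P32 = P16 %*% P16
round(P32, 3)                           # every row is now very nearly (.2, .2, .2, .2, .2)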
In the answer to part (b) of Problem 7.2, we showed how to simulate a Markov
chain using its one-step transition matrix P. The random walk on a circle provides a
good opportunity to show another method of simulation—“programming the story.”
This method is important because later in the chapter we consider Markov chains
that don’t have matrices. (For example, see Problem 7.11.)
set.seed(1239)
m = 100000; x = numeric(m); x[1] = 0
for (i in 2:m)
{
d = rbinom(1, 1, 1/2) # 1 if Heads, 0 if Tails
x[i] = (x[i-1] + d) %% 5 # moves clockwise if Head
}
summary(as.factor(x))/m
> summary(as.factor(x))/m
0 1 2 3 4
0.19929 0.20057 0.19934 0.19986 0.20094
The resulting limiting distribution is in essential agreement with the stationary
distribution given above. c
d) Transition matrices for Markov chains are sometimes called stochastic,
meaning that each row sums to 1. In a doubly stochastic matrix, each
column also sums to 1. Show that the limiting distribution of a K-state
chain with an ergodic, doubly stochastic transition matrix P is uniform
on the K states.
d Let σ be a K-vector with all elements 1/K. Also, as usual, denote the elements of
P as pij , for i, j = 1, . . . , K. Then the jth element of σP is
Σ_{i=1}^{K} (1/K) pij = (1/K) Σ_{i=1}^{K} pij = 1/K,
where the last equality holds because Σ_{i=1}^{K} pij = 1, for j = 1, . . . , K, as required by
the doubly stochastic nature of P. c
e) Consider a similar process with state space S = {0, 1, 2, 3}, but with 0
adjacent to 3, and with clockwise or counterclockwise movement at each
step determined by the toss of a fair coin. (This process moves at every
step.) Show that the resulting doubly stochastic matrix is not ergodic.
d Suppose we start in an even-numbered state at step 1. Then we must be in an
even-numbered state at any odd-numbered step. For even n, Pn will have pij = 0,
for odd i and even j, and also for even i and odd j. A similar argument can be made
to show that Pn must have 0 elements for odd powers n. Therefore, there can be no
power of P with all positive elements, and Pn cannot approach a limit with all rows
the same. Below we illustrate with a few powers of P.
This is called a periodic chain of period 2. For K = 2 states, the only periodic
chain is the flip-flop chain discussed in Chapter 6. But for larger K, there can be
a variety of kinds of periodic chains. Such a random walk on a circle, with forced
movement to an immediately adjacent state at each step, is periodic with period 2
when the number of states K is even, but aperiodic (not periodic) when the number
of states is odd.
P = (1/2)*matrix(c(0, 1, 0, 1,
1, 0, 1, 0,
0, 1, 0, 1,
1, 0, 1, 0), nrow=4, byrow=T)
P
P2 = P %*% P; P2
P3 = P2 %*% P; P3
P4 = P2 %*% P2; P4
> P
[,1] [,2] [,3] [,4]
[1,] 0.0 0.5 0.0 0.5
[2,] 0.5 0.0 0.5 0.0
[3,] 0.0 0.5 0.0 0.5
[4,] 0.5 0.0 0.5 0.0
> P2 = P %*% P; P2
[,1] [,2] [,3] [,4]
[1,] 0.5 0.0 0.5 0.0
[2,] 0.0 0.5 0.0 0.5
[3,] 0.5 0.0 0.5 0.0
[4,] 0.0 0.5 0.0 0.5
> P3 = P2 %*% P; P3
[,1] [,2] [,3] [,4]
[1,] 0.0 0.5 0.0 0.5
[2,] 0.5 0.0 0.5 0.0
[3,] 0.0 0.5 0.0 0.5
[4,] 0.5 0.0 0.5 0.0
d Below, we show the transition matrix and then use it to illustrate ideas in part (b).
(Rounding and multiplying by 2^7 make the output fit the width of the page.) c
P = (1/14)*matrix(c(7, 7, 0, 0, 0, 0, 0, 0,
1, 7, 6, 0, 0, 0, 0, 0,
0, 2, 7, 5, 0, 0, 0, 0,
0, 0, 3, 7, 4, 0, 0, 0,
0, 0, 0, 4, 7, 3, 0, 0,
0, 0, 0, 0, 5, 7, 2, 0,
0, 0, 0, 0, 0, 6, 7, 1,
0, 0, 0, 0, 0, 0, 7, 7), nrow=8, byrow=T)
ss.vec = dbinom(0:7, 7, 1/2) ## steady state vector
round(P, 5); ss.vec*2^7; ss.vec %*% P*2^7
P2 = P %*% P; P4 = P2 %*% P2; P8 = P4 %*% P4; P16 = P8 %*% P8
P32 = P16 %*% P16; P64 = P32 %*% P32; P128 = P64 %*% P64
P128 * 2^7
> round(P, 5)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.50000 0.50000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
[2,] 0.07143 0.50000 0.42857 0.00000 0.00000 0.00000 0.00000 0.00000
[3,] 0.00000 0.14286 0.50000 0.35714 0.00000 0.00000 0.00000 0.00000
[4,] 0.00000 0.00000 0.21429 0.50000 0.28571 0.00000 0.00000 0.00000
[5,] 0.00000 0.00000 0.00000 0.28571 0.50000 0.21429 0.00000 0.00000
[6,] 0.00000 0.00000 0.00000 0.00000 0.35714 0.50000 0.14286 0.00000
[7,] 0.00000 0.00000 0.00000 0.00000 0.00000 0.42857 0.50000 0.07143
[8,] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.50000 0.50000
> ss.vec*2^7
[1] 1 7 21 35 35 21 7 1
> ss.vec %*% P*2^7 # see part (b)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1 7 21 35 35 21 7 1
b) Show that the steady-state distribution of this chain is BINOM(7, 1/2). That is,
show that it satisfies λP = λ. This is also the long-run distribution.
c) More generally, show that if there are M molecules, the long-run distribution
is BINOM(M, 1/2).
d For i = 1, . . . , M − 1, the positive transition probabilities can be expressed as
pi−1,i = (M − i + 1)/(2M), pii = M/(2M) = 1/2, and pi+1,i = (i + 1)/(2M). We have
shown in part (a) that the chain is ergodic, so the long-run (limiting) and steady-
state (stationary) distributions are the same.
Now we show that the vector λ has elements λi = C(M, i)/2^M, where the
C-notation denotes the binomial coefficient and i = 0, 1, . . . , M. In the product λP,
the first term is easily seen to be λ0 = (1/2^M)(M/(2M)) + (M/2^M)(1/(2M)) = 1/2^M,
as required for i = 0. Similarly, it is easy to see that λM = 1/2^M.
For i = 1, . . . , M − 1, the ith element in the product has three terms, which
simplify to λi = C(M, i)/2^M. Because all three terms have some factors in common,
we abbreviate by writing K = M!/(2^M · 2M):
λi−1 pi−1,i + λi pii + λi+1 pi+1,i = K[1/((i−1)!(M−i)!) + M/(i!(M−i)!) + 1/(i!(M−i−1)!)]
= K[i + M + (M − i)]/[i!(M − i)!] = 2MK/[i!(M − i)!] = C(M, i)/2^M = λi,
as required. c
d) If there are 10 000 molecules at steady state, what is the probability that
between 4900 and 5100 are in Box A?
d We interpret this to mean between 4900 and 5100, inclusive. Below is the exact
binomial probability and its normal approximation. For such a large n, the normal
approximation is very good and the continuity correction might be ignored. c
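The computation referred to just above is not reproduced in this copy; a sketch, using
the BINOM(10 000, 1/2) steady-state distribution (mean 5000, standard deviation 50), is:
sum(dbinom(4900:5100, 10000, 1/2))                  # exact binomial probability
pnorm(5100.5, 5000, 50) - pnorm(4899.5, 5000, 50)   # normal approx., continuity correction
pnorm(5100, 5000, 50) - pnorm(4900, 5000, 50)       # normal approx., no correction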
Note: This is a variant of the famous Ehrenfest model, modified to have proba-
bility 1/2 of no movement at any one step and thus to have an ergodic transition
matrix. (See Cox and Miller (1965), Chapter 3, for a more advanced mathematical
treatment.)
7.7 A Gambler’s Ruin problem. As Chris and Kim begin the following gam-
bling game, Chris has $4 and Kim has $3. At each step of the game, both play-
ers toss fair coins. If both coins show Heads, Chris pays Kim $1; if both show
Tails, Kim pays Chris $1; otherwise, no money changes hands. The game con-
tinues until one of the players has $0. Model this as a Markov chain in which
the state is the number of dollars Chris currently has. What is the probability
that Kim wins (that is, Chris goes broke)?
d Matrix multiplication. Because, at each step, there is no exchange of money with
probability 1/2, the process moves relatively slowly. Below we use P256 to show the
exact probabilities of absorption into state 0 (Chris is ruined) from each starting
state. These are in the first column of the matrix. Similarly, probabilities that
Kim is ruined are shown in the last column.
For example, given that Chris has $4 at the start, the probability Chris is ruined
is 0.42857 and the probability Kim is ruined is 0.57143. Notice that these numbers
are from row [5,] of the matrix, which corresponds to state 4. However, this method
does not give us information about how many steps the game lasts until absorption.
P = (1/4)*matrix(c(4, 0, 0, 0, 0, 0, 0, 0,
1, 2, 1, 0, 0, 0, 0, 0,
0, 1, 2, 1, 0, 0, 0, 0,
0, 0, 1, 2, 1, 0, 0, 0,
0, 0, 0, 1, 2, 1, 0, 0,
0, 0 ,0, 0, 1, 2, 1, 0,
0, 0, 0, 0, 0, 1, 2, 1,
0, 0, 0, 0, 0, 0, 0, 4), nrow=8, byrow=T)
P2 = P %*% P
P4 = P2 %*% P2 # intermediate powers not printed
P8 = P4 %*% P4
P16 = P8 %*% P8
P32 = P16 %*% P16
P64 = P32 %*% P32
P128 = P64 %*% P64
P256 = P128 %*% P128
round(P256, 5)
> round(P256, 5)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 1.00000 0 0 0 0 0 0 0.00000
[2,] 0.85714 0 0 0 0 0 0 0.14286
[3,] 0.71428 0 0 0 0 0 0 0.28571
[4,] 0.57143 0 0 0 0 0 0 0.42857
[5,] 0.42857 0 0 0 0 0 0 0.57143
[6,] 0.28571 0 0 0 0 0 0 0.71428
[7,] 0.14286 0 0 0 0 0 0 0.85714
[8,] 0.00000 0 0 0 0 0 0 1.00000
Simulation We use the matrix format introduced in the answer to Problem 7.2(b)
to modify the program of Problem 7.3. Notice that in this program i stands for the
state being vacated rather than the one being entered. Also, the row and column
numbers of the matrix run from 1 through 8, whereas the states run from 0 through 7.
We have used state 4 as the starting state.
After simulating 10 000 games, we obtain the probability of Chris’s ruin as about
0.4287, which is very close to the exact value 0.42857 obtained just above from P256.
The mean time until absorption is about 25 steps. Although 95% of the games ended
by the 65th step, the histogram (not shown here) for the seed we used indicates
that there was at least one game that lasted for almost 200 steps. (For
the seed shown, we found from max(step.a) that the longest game had 195 steps.
Then, from sum(step.a==195), we discovered that there happened to be a second
game of that same length.) c
set.seed(1235)
m = 10000 # number of runs
step.a = numeric(m) # steps when absorption occurs
state.a = numeric(m) # states where absorbed
for (j in 1:m)
{
x = 4 # initial state
a = 0 # changed upon absorption
while(a==0) {
i = length(x) # current step; state found below
x = c(x, sample(0:7, 1, prob=P[x[i]+1,])) # uses P from above
if (length(x[x==0 | x==7]) > 0) a = i + 1 }
step.a[j] = a # absorption step for jth run
state.a[j] = x[length(x)] # absorption state for jth run
}
Note: This is a version of the classic gambler’s ruin problem. Many books on stochas-
tic processes derive general formulas for the probability of the ruin of each player
and the expected time until ruin. Approximations of these results can be obtained
by adapting the simulation program of Problem 7.4.
7.8 Suppose weather records for a particular region show that 1/4 of Dry
(0) days are followed by Wet (1) days. Also, 1/3 of the Wet days that are
immediately preceded by a Dry day are followed by a Dry day, but there can
never be three Wet days in a row.
a) Show that this situation cannot be modeled as a 2-state Markov chain.
d Let Xi denote the state (0 or 1) at step i. Then P {X3 = 1|X2 = 1, X1 = 0} = 2/3,
but P {X3 = 1|X2 = 1, X1 = 1} = 0 because of the prohibition of three wet days in
a row. c
P = (1/12)*matrix(c(9, 3, 0, 0,
0, 0, 4, 8,
9, 3, 0, 0,
0, 0, 12, 0), nrow=4, byrow=T)
P2 = P %*% P; P4 = P2 %*% P2; P8 = P4 %*% P4
P16 = P8 %*% P8; P32 = P16 %*% P16
P4; P32
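The vector ss used in the display below is not defined in this copy; from the output (and
because a steady-state vector satisfies ssP = ss), it is evidently ss = c(9, 3, 3, 2)/17,
which can be entered and checked as follows:
ss = c(9, 3, 3, 2)/17                   # steady-state vector of the Y-process (reconstructed)
ss %*% P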
> ss %*% P
[,1] [,2] [,3] [,4]
[1,] 0.5294118 0.1764706 0.1764706 0.1176471
Without the 3-day restriction, the X-process would have been a Markov process
with states 0 and 1. That process would have α = 1/4, β = 1/3, and long-run distri-
bution λ = (4/7, 3/7). Because states 00 and 10 of the Y -process result in a dry day,
the long-run probability of a dry day is 0.5294118 + 0.1764706 = 0.7058824, com-
pared to the probability 4/7 = 0.5714286 of a dry day without the 3-day restriction.
Thus the prohibition against three wet days in a row implies a substantial increase
in the proportion of dry days over the long run. c
Hints and answers: (a) The transition probability p11 would have to take two
different values depending on the weather two days back. State two relevant condi-
tional probabilities with different values. (b) Over the long run, about 29% of the
days are Wet; give a more accurate value.
d See the Hints. The last row of the one-step transition matrix is (0, θ, 1 − θ). c
b) Show that this chain is ergodic. What is the smallest N that gives PN > 0?
d By simple matrix algebra, smallest is N = 2. For a numerical result, see the answer
to part (d). c
c) According to the Hardy-Weinberg Law, this Markov chain has the “equilibrium”
(steady-state) distribution σ = [θ^2, 2θ(1 − θ), (1 − θ)^2]. Verify
that this is true.
d Simple matrix algebra. Also see the answer to part (d) for the case θ = 0.2. c
d) For θ = 0.2, simulate this chain for m = 50 000 iterations and verify
that the sampling distribution of the simulated states approximates the
Hardy-Weinberg vector.
d We used m = 100 000 iterations below. In simulation, this chain is relatively slow
to stabilize. However, P32 agrees with the limiting value. c
P = matrix(c(.2, .8, 0,
.1, .5, .4,
0, .2, .8), nrow=3, byrow=T)
P2 = P %*% P; P4 = P2 %*% P2
P8 = P4 %*% P4; P16 = P8 %*% P8
P32 = P16 %*% P16; P64 = P32 %*% P32
P2; P32
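Here again the vector ss is not defined in this copy; it is evidently the Hardy-Weinberg
vector for θ = 0.2, which can be entered as:
ss = c(.2^2, 2*.2*.8, .8^2)             # (.04, .32, .64), reconstructed
ss %*% P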
> ss %*% P
[,1] [,2] [,3]
[1,] 0.04 0.32 0.64
set.seed(1238)
m = 100000; x = numeric(m); x[1] = 1
for (i in 2:m)
{
x[i] = sample(1:3, 1, prob=P[x[i-1], ])
}
summary(as.factor(x))/m
> summary(as.factor(x))/m
1 2 3 # compare with exact
0.04007 0.31928 0.64065 # (.04, .32, .64)
Hints and partial answers: (a) In deriving p12 , notice that it makes no difference how
the A-alleles in the population may currently be apportioned among males of types
AA and Aa. For example, suppose θ = 20% in a male population with 200 alleles
(100 individuals), so that there are 40 a-alleles and 160 As. If only genotypes AA
and aa exist, then there are 80 AAs to choose from, any of them would contribute an
A-allele upon mating, and the probability of an Aa offspring is 80% = 1 − θ. If there
are only 70 AAs among the males, then there must be 20 Aas. The probability that
an Aa mate contributes an A-allele is 1/2, so that the total probability of an Aa
offspring is again 1(0.70) + (1/2)(0.20) = 80% = 1 − θ. Other apportionments of
genotypes AA and Aa among males yield the same result. The first row of the matrix
P is [θ, 1 − θ, 0]; its second row is [θ/2, 1/2, (1 − θ)/2]. (b) For the given σ, show
that σP = σ. (d) Use a program similar to the one in Example 7.1.
7.10 Algebraic approach. For a K-state ergodic transition matrix P, the
long-run distribution is proportional to the unique row eigenvector λ corre-
sponding to eigenvalue 1. In R, g = eigen(t(P))$vectors[,1]; g/sum(g),
where the transpose function t is needed to obtain a row eigenvector,
$vectors[,1] to isolate the relevant part of the eigenvalue-eigenvector dis-
play, and the division by sum(g) to give a distribution. Use this method to
find the long-run distributions of two of the chains in Problems 7.2, 7.5, 7.6,
and 7.8—your choice, unless your instructor directs otherwise. (See Cox and
Miller (1965) for the theory.)
d In general, eigenvectors can involve complex numbers. But the relevant eigenvec-
tor for an ergodic Markov chain is always real. We use as.real to suppress the
irrelevant 0 imaginary components.
We also show the results for the Hardy-Weinberg Equilibrium of Problem 7.9,
along with the complete eigenvalue-eigenvector display from which the particular
eigenvector of interest is taken. c
# Problem 7.2: CpG Sea
P = matrix(c(0.300, 0.205, 0.285, 0.210,
0.322, 0.298, 0.078, 0.302,
0.248, 0.246, 0.298, 0.208,
0.177, 0.239, 0.292, 0.292), nrow=4, byrow=T)
g = eigen(t(P))$vectors[,1]; g/sum(g); as.real(g/sum(g))
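The display for the 4-state CpG matrix is not reproduced in this copy. The eigenvalue-
eigenvector output shown below is the one for the 3-state Hardy-Weinberg chain of
Problem 7.9; code along the following lines, with P taken from that problem, produces it.
(Note that as.real has been removed from current versions of R; Re or as.numeric serves
the same purpose.)
# Problem 7.9: Hardy-Weinberg chain with theta = 0.2
P = matrix(c(.2, .8, 0,
             .1, .5, .4,
              0, .2, .8), nrow=3, byrow=T)
eigen(t(P))                             # full display shown below
g = eigen(t(P))$vectors[,1]; as.real(g/sum(g))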
> eigen(t(P))
$values
[1] 1.000000e+00 5.000000e-01 -9.313297e-17
$vectors
[,1] [,2] [,3]
[1,] 0.05581456 0.1961161 0.4082483
[2,] 0.44651646 0.5883484 -0.8164966
[3,] 0.89303292 -0.7844645 0.4082483
set.seed(1237)
m = 10000
d = sample(c(-1,0,1), m, replace=T, c(1/2,1/4,1/4))
x = numeric(m); x[1] = 0
for (i in 2:m) {x[i] = abs(x[i-1] + d[i]) }
summary(as.factor(x))
cutp=0:(max(x)+1) - .5; hist(x, breaks=cutp, prob=T)
k = 1:max(x); points(c(0,k), c(1/4,(3/4)*(1/2)^k)) # see part (b)
> summary(as.factor(x))
0 1 2 3 4 5 6 7 8 9 10
2453 3687 1900 936 498 255 135 69 34 26 7
If j ≥ 2: in the sum λj = Σ_{i=0}^{∞} λi pij only three terms are positive, and
λj−1 pj−1,j + λj pjj + λj+1 pj+1,j = (3/4)(1/2)^{j−1}(1/4) + (3/4)(1/2)^j(1/4) + (3/4)(1/2)^{j+1}(1/2)
= (3/4)(1/2)^j [1/2 + 1/4 + 1/4] = (3/4)(1/2)^j, as claimed.
7.12 Attraction toward the origin. Consider the random walk simulated by
the R script below. There is a negative drift when Xn−1 is positive and a
positive drift when it is negative, so that there is always drift towards 0. (The
R function sign returns values −1, 0, and 1 depending on the sign of the
argument.)
# set.seed(1212)
m = 10000; x = numeric(m); x[1] = 0
for (i in 2:m)
{
drift = (2/8)*sign(x[i-1]); p = c(3/8+drift, 2/8, 3/8-drift)
x[i] = x[i-1] + sample(c(-1,0,1), 1, replace=T, prob=p)
}
summary(as.factor(x))
par(mfrow=c(2,1)) # prints two graphs on one page
plot(x, type="l")
cutp = seq(min(x), max(x)+1)-.5; hist(x, breaks=cutp, prob=T)
par(mfrow=c(1,1))
> summary(as.factor(x))
-4 -3 -2 -1 0 1 2 3 4 5
13 96 508 2430 4035 2343 466 93 15 1
a) Write the transition probabilities pij of the chain simulated by this pro-
gram. Run the program, followed by acf(x), and comment on the result-
ing graphs. (See Figure 7.15.)
d The transition probabilities are p0,−1 = p01 = 3/8, and p00 = 2/8, for transitions
from state 0. For transitions from a positive state i, the probabilities are pi,i−1 = 5/8,
pii = 2/8, and pi,i+1 = 1/8. Finally, for negative i, we have pi,i−1 = 1/8, pii = 2/8,
and pi,i+1 = 5/8.
A tally of simulated values of the chain, for the seed shown, appears beneath the
program. A “history plot” of simulated values in the order they occurred is shown
in Figure 7.15, along with a relative frequency histogram. The chain seems to move
readily among its states. The bias for transitions toward the origin keeps the chain
from moving very far from 0, with most of the values between ±3. The ACF plot
shows positive, significant correlations for most lags up to about 25. c
b) Use the method of Problem 7.11 to show that the long-run distribution
is given by λ0 = 2/5 and λi = (6/5)(1/5)^|i| for positive and negative integer
values of i. Do these values agree with your results in part (a)?
d First, Σ_{i=−∞}^{∞} λi = Σ_{i=−∞}^{−1} λi + λ0 + Σ_{i=1}^{∞} λi
= (6/5)(2/8) + 2/5 + (6/5)(2/8) = (6 + 8 + 6)/20 = 1,
where the first and last terms are sums of the same geometric series. Thus the λi,
for (negative, zero, and positive) integer i, form a probability distribution.
Then, we verify that the steady-state equations λj = Σ_{i=−∞}^{∞} λi pij , for all inte-
gers j, have the solutions claimed. For this verification, we distinguish five cases—
where j < −1, j = −1, j = 0, j = 1, and j > 1, respectively:
For j = 0, the right side of the equation has only three terms because only three of
the relevant pij are positive: λ0 = λ−1 p−1,0 + λ0 p00 + λ1 p10
= (6/25)(5/8) + (2/5)(2/8) + (6/25)(5/8) = (30 + 20 + 30)/200 = 2/5.
For j = 1, the right side again has three positive terms: λ1 = λ0 p01 + λ1 p11 + λ2 p21
= (2/5)(3/8) + (6/5)(1/5)(2/8) + (6/5)(1/5)^2(5/8) = (6/5)(5 + 2 + 1)/40 = (6/5)(1/5).
For j > 1: Similarly, the equation becomes λj = λj−1 pj−1,j + λj pjj + λj+1 pj+1,j
= (6/5)[(1/5)^{j−1}(1/8) + (1/5)^j(2/8) + (1/5)^{j+1}(5/8)]
= (6/5)(1/8)(1/5)^{j−1}[1 + 2/5 + 1/5] = (6/5)(1/5)^j.
For the two negative cases, the verification is very similar to the corresponding
positive ones because of the absolute value in the exponent of 15 and because the
drift is in the opposite direction.
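A few lines such as the following (a sketch; the manual’s own lines are not reproduced in
this copy) print the exact probabilities referred to in the next paragraph:
i = -4:5                                     # the states tallied in part (a)
lam = ifelse(i==0, 2/5, (6/5)*(1/5)^abs(i))  # exact steady-state probabilities
round(rbind(i, lam), 5)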
The code above prints exact probabilities of the most likely values of the steady-
state distribution. The values simulated in part (a) are in reasonable agreement with
these exact values. Of course, if we used m = 100 000 iterations, the simulated values
would tend to be more accurate. c
7.13 Random walk on a circle. In Example 7.5, the displacements of the
random walk on the circle are UNIF(−0.1, 0.1) and the long-run distribution
is UNIF(0, 1). Modify the program of the example to explore the long-run
behavior of such a random walk when the displacements are NORM(0, 0.1).
Compare the two chains.
d This process with normally-distributed displacements also produces uniformly dis-
tributed outcomes on the unit interval. The plots (not shown here) look similar to
the corresponding ones for the process of Example 7.5 shown in Figures 7.6 and 7.7.
The hist function with parameter plot=F provides text output of information
used to plot and label a histogram. Here we show two of the several lines of the
resulting output—giving the counts in each of the 10 bins and the bin midpoints. For
a uniform limiting distribution, we would expect about m/10 = 50 000 observations in
each bin, and the bin counts we obtained from the simulation seem at least roughly
consistent with that. c
set.seed(1214)
m = 500000
d = c(0, rnorm(m-1, 0, .1))
x = cumsum(d) %% 1
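A call along the following lines (an assumption; the exact call is not shown in this copy)
produces that text output:
hist(x, breaks=seq(0, 1, by=.1), plot=F)     # printed list includes $counts and $mids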
b) S is the state space of a Markov chain. Starting with (X1 , Y1 ) = (1/2, 1/2),
choose a vertex of the triangle at random (probability 1/3 each) and
let (X2 , Y2 ) be the point halfway to the chosen vertex. At step 3, choose a
vertex, and let (X3 , Y3 ) be halfway between (X2 , Y2 ) and the chosen ver-
tex. Iterate. Suppose the first seven vertices chosen are A, A, C, B, B, A, A.
(These were taken from the run in part (c).) Find the coordinates of
(Xn , Yn ), for n = 2, 3, . . . , 8, and plot them by hand.
d Below is a hand-made table of results. The last three rows are from R code that
captures the first seven components of m-vectors in the program of part (c). c
Step: 1 2 3 4 5 6 7
Vertex: A A C B B A A
k: 1 1 3 2 2 1 1
x: 0.5 0.25 0.125 0.0625 0.53125 0.765625 0.3828125
y: 0.5 0.25 0.125 0.5625 0.28125 0.140625 0.0703125
c) As shown in Figure 7.12 (p176), the R script below generates enough points
of S to suggest the shape of the state space. (The default distribution of
the sample function assigns equal probabilities to the values sampled, so
the prob parameter is not needed here.)
# set.seed(1212)
m = 5000
e = c(0, 1, 0); f = c(0, 0, 1)
k = sample(1:3, m, replace=T)
x = y = numeric(m); x[1] = 1/2; y[1] = 1/2
for (i in 2:m) {
x[i] = .5*(x[i-1] + e[k[i-1]])
y[i] = .5*(y[i-1] + f[k[i-1]]) }
plot(x,y,pch=20)
Within the limits of your patience and available computing speed, increase
the number m of iterations in this simulation. Why do very large values of
m give less-informative plots? Then try plot parameter pch=".". Also,
make a plot of the first 100 states visited, similar to Figure 7.10. Do you
think such plots would enable you to distinguish between the Sierpinski
chain and the chain of Example 7.7?
d Plotting points have area, which Sierpinski’s Triangle does not. So too many points
can muddy the picture. Points made using pch="." have much less area than those
made with pch=20, so you can use more of them without making a mess—perhaps
to get a plot you like better.
A plot connecting the first 100 simulated points is a relatively poor tool for
distinguishing the Sierpinski chain from the one in Figure 7.10. But the plot for
Sierpinski’s triangle does tend to have longer line segments—skipping across the
large missing central triangle. You might learn to exploit this tendency. c
m = 25000
a = d = c(1, 1, 1)/2; b = c = numeric(3)
e = c(0, 0, 1)/2; f = c(0, 1, 0)/2
k = sample(1:3, m, repl=T)
h = numeric(m); w = numeric(m); h[1] = 0; w[1] = 0
for (i in 2:m) {
h[i] = a[k[i]]*h[i-1] + b[k[i]]*w[i-1] + e[k[i]]
w[i] = c[k[i]]*h[i-1] + d[k[i]]*w[i-1] + f[k[i]] }
plot(w, h, pch=".", col="darkblue")
Note: See Barnsley (1988) for a detailed discussion of fractal objects with many
illustrations, some in color. Our script is adapted from pages 87–89; its numerical
constants can be changed to produce additional fractal objects described there.
7.18 A bivariate normal distribution for (X, Y ) with zero means, unit stan-
dard deviations, and correlation 0.8, as in Section 7.5, can be obtained as a
linear transformation of independent random variables.
Specifically, if U and
V are independently distributed as NORM(0, 2/√5), then let X = U + V/2
and Y = U/2 + V.
a) Verify analytically that the means, standard deviations, and correlation
are as expected. Then use the following program to simulate and plot this
bivariate distribution. Compare your results with the results obtained in
Examples 7.8 and 7.9.
d Equations for the verification are as follows:
Expectations: E(X) = E(U) + (1/2)E(V) = 0, and similarly for Y.
Variances: V(X) = V(U + V/2) = 4/5 + (1/4)(4/5) = 1, and similarly for Y.
Correlation: Cor(X, Y) = Cov(X, Y) = Cov(U + V/2, U/2 + V)
= (1/2)V(U) + (5/4)Cov(U, V) + (1/2)V(V) = 2/5 + 0 + 2/5 = 0.8,
because V(X) = V(Y) = 1 (first line) and U and V are independent (second line). c
set.seed(234)
m = 10000
u = rnorm(m,0,2/sqrt(5)); v = rnorm(m,0,2/sqrt(5))
x = u + v/2; y = u/2 + v
round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)
best = pmax(x, y)
mean(best >= 1.25)
plot(x, y, pch=".", xlim=c(-4,4), ylim=c(-4,4))
set.seed(123)
m = 1000000
u = rnorm(m,0,3/sqrt(10)); v = rnorm(m,0,3/sqrt(10))
x = u + v/3; y = u/3 + v
round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)
best = pmax(x, y)
mean(best >= 1.25)
show = 1:30000
plot(x[show], y[show], pch=".", xlim=c(-4,4), ylim=c(-4,4))
lines(c(-5, 1.25, 1.25), c(1.25, 1.25, -5), lwd=2, col="red")
initial fluke. The estimate (−2.2, −2.2) of the center is grossly far from the origin.
(See the Notes.) c
set.seed(1234)
m = 40000
rho = .8; sgm = sqrt(1 - rho^2)
xc = yc = numeric(m) # vectors of state components
xc[1] = -3; yc[1] = 3 # arbitrary starting values
jl = 1.25; jr = .75 # l and r limits of proposed jumps
...
for (i in 2:m)
{
xc[i] = xc[i-1]; yc[i] = yc[i-1] # if no jump
xp = runif(1, xc[i-1]-jl, xc[i-1]+jr)
yp = runif(1, yc[i-1]-jl, yc[i-1]+jr)
nmtr.r = dnorm(xp)*dnorm(yp, rho*xp, sgm)
dntr.r = dnorm(xc[i-1])*dnorm(yc[i-1], rho*xc[i-1], sgm)
nmtr.adj = dunif(xc[i-1], xp-jl, xp+jr)*
dunif(yc[i-1], yp-jl, yp+jr)
dntr.adj = dunif(xp, xc[i-1]-jl, xc[i-1]+jr)*
dunif(yp, yc[i-1]-jl, yc[i-1]+jr)
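# The remaining lines of the loop are not reproduced in this copy; the usual
# Metropolis-Hastings acceptance step, using the quantities computed above,
# would be (a sketch, not the manual's own listing):
r = (nmtr.r/dntr.r)*(nmtr.adj/dntr.adj)      # Hastings ratio
acc = (min(r, 1) > runif(1))                 # accept the proposal?
if (acc) {xc[i] = xp; yc[i] = yp}
}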
x = xc[(m/2+1):m]; y = yc[(m/2+1):m]
round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)) ,4)
mean(diff(xc)==0); mean(pmax(x, y) > 1.25)
par(mfrow=c(1,2), pty="s")
jump = diff(unique(x)); hist(jump, prob=T, col="wheat")
plot(x, y, xlim=c(-4,4), ylim=c(-4,4), pch=".")
par(mfrow=c(1,1), pty="m")
c) After a run of the program in part (b), make and interpret autocorrelation
function plots of x and of x[thinned], where the latter is defined by
thinned = seq(1, m/2, by=100). Repeat for realizations of Y .
Notes: (a) The acceptance criterion still has valid information about the shape of
the target distribution, but the now-asymmetrical jump function is biased towards
jumps downward and to the left. The approximated percentage of subjects awarded
certificates is very far from correct. (c) Not surprisingly for output from a Markov
chain, the successive pairs (X, Y ) sampled by the Metropolis-Hastings algorithm
after burn-in are far from independent. “Thinning” helps. To obtain the desired
degree of accuracy, we need to sample more values than would be necessary in
a simulation with independent realizations as in Problem 7.18. It is important to
distinguish the association between Xi and Yi on the one hand from the association
among the Xi on the other hand. The first is an essential property of the target
distribution, whereas the second is an artifact of the method of simulation.
7.20 We revisit the Gibbs sampler of Example 7.9.
a) Modify this program to sample from a bivariate normal distribution with
zero means, unit standard deviations, and ρ = 0.6. Report your results. If
you worked Problem 7.18, compare with those results.
d The only substantive change is that rho = .8 becomes rho = .6 in the first line of
code. Results below are from seed 1236. As in Example 7.9, we use only m = 20 000
iterations. Of these, we used results from the 10 000 iterations after burn-in for the
summary shown below.
Accuracy is not quite as good as in our answer to Problem 7.18(a) where we
used 10 000 independent simulated points. (See part (b).) The simulated value of
ρ is reasonably near 0.6. But, of course, it is not as accurate as in our answer to
Problem 7.18(b), where we chose to simulate 1 000 000 independent points.
set.seed(1236)
m = 20000
rho = .6; sgm = sqrt(1 - rho^2)
...
The probability that the better of the two scores is at least 1.25 is greater here
(about 17.5%), with ρ = 0.6, than it was in Example 7.9 (about 15.3%), where ρ = 0.8.
As the correlation decreases, the opportunity to demonstrate achievement improves.
If the tests were independent (ρ = 0), then the exact probability would be given
by 1 - pnorm(1.25)^2, which returns 0.2001377. When ρ = 1, so that the two scores
are identical, the exact probability is 0.1056498. You can change ρ in the program
to approximate the first of these extreme results. But not the second—why not? c
b) Run the original program (with ρ = 0.8) and make an autocorrelation plot
of X-values from m/2 on, as in part (c) of Problem 7.19. If you worked
that problem, compare the two autocorrelation functions.
c) In the Gibbs sampler of Example 7.9, replace the second statement inside
the loop by yc[i] = rnorm(1, rho*xc[i-1], sgm) and run the result-
ing program. Why is this change a mistake?
d On a practical level, we can see that this change is a bad idea because it
gives obviously incorrect results. The trial run below approximates ρ as −0.057,
whereas the original program in Example 7.9, with the same seed, gave 0.8044—
very close to the known value ρ = 0.8. Moreover, the altered program approximates
P {max(X, Y ) ≥ 1.25} as 0.2078, while the original program gives 0.1527.
This example illustrates the importance of using a newly generated result in
a Gibbs sampler as early as possible. Here, the main purpose is to simulate the
distribution of max(X, Y ), and we might not have realized that the answer for
P {max(X, Y ) ≥ 1.25} is wrong. However, the wrong answer for the known correla-
tion is a clear indication that the program is not working as it should.
set.seed(1235); m = 20000
rho = .8; sgm = sqrt(1 - rho^2)
xc = yc = numeric(m); xc[1] = -3; yc[1] = 3
for (i in 2:m) {
xc[i] = rnorm(1, rho*yc[i-1], sgm)
yc[i] = rnorm(1, rho*xc[i-1], sgm) }
x = xc[(m/2+1):m]; y = yc[(m/2+1):m]
round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)
best = pmax(x,y); mean(best >= 1.25)
The answer to why this modified program does not work lies on an explanatory
and more theoretical level. The difficulty with the change is that each new simu-
lated value Yi in the chain needs to be paired with the corresponding new Xi just
generated, not with the previous value Xi−1 . Because this is a Markov Chain, there
is some association between one step and the next. It is important not to entangle
that association with the association between Xi and Yi .
It would be OK to reverse the order in which the x-values and y-values are
simulated, as shown below in a correct modification of the program.
set.seed(1235); m = 20000
rho = .8; sgm = sqrt(1 - rho^2)
xc = yc = numeric(m); xc[1] = -3; yc[1] = 3
for (i in 2:m) {
yc[i] = rnorm(1, rho*xc[i-1], sgm)
xc[i] = rnorm(1, rho*yc[i], sgm) }
x = xc[(m/2+1):m]; y = yc[(m/2+1):m]
round(c(mean(x), mean(y), sd(x), sd(y), cor(x,y)), 4)
best = pmax(x,y); mean(best >= 1.25)
Here, we get a correct view of the dependence between (Xi , Yi ) pairs, because we
have not disrupted the pairing. And again, we have useful estimates of ρ = 0.8 and
P {max(X, Y ) ≥ 1.25} ≈ 0.15. c
Note: (b) In the Metropolis-Hastings chain, a proposed new value is sometimes
rejected so that there is no change in state. The Gibbs sampler never rejects.
Errors in Chapter 7
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p183 Problem 7.4. In the program, the first statement after the inner loop should read
a[j] = a - 1 (not a). The correct code is shown in this Manual. This error
in the program makes a small difference in the histogram of Figure 7.14 (most
notably, the first bar there is a little too short). A corrected figure is scheduled
for the second printing; you will see it if you work the problem.
> x[pdf==max(pdf)]
[1] 0.6338
Then the statement qbeta(c(.025, .975), 380, 220) returns a 95% prior
probability interval (0.5944, 0.6714), when rounded to four places. Roughly speak-
ing, the consultant choosing this prior distribution must feel that π is near 63% and
pretty sure to be between 60% and 67%.
Extra: There are infinitely many 95% prior probability intervals on π, and it is
customary to use the one that cuts 2.5% from each tail of the relevant distribution.
This interval is called the probability-symmetric interval. If we were to insist on
the shortest 95% probability interval, then another grid search would be in order,
as shown in the R code below. (See also Problem 8.10.)
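A grid search in the style of Problem 8.10 (the manual’s own code for this step is not
reproduced here) might look like this:
alp = 380; bet = 220                         # prior parameters from part (a)
p.lo = seq(.001, .05, .00001); p.up = .95 + p.lo
q.lo = qbeta(p.lo, alp, bet); q.up = qbeta(p.up, alp, bet)
long = q.up - q.lo
cond = (long==min(long))
c(q.lo[cond], q.up[cond])                    # shortest 95% prior probability interval
c(p.lo[cond], 1 - p.up[cond])                # probability cut from each tail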
Thus the shortest 95% probability interval is (0.5947, 0.6717), cutting almost
2.6% from the lower tail and just over 2.4% from the upper. Unless the distribution
is very severely skewed, there is often no practical difference between the probability-
symmetric interval and the shortest one. c
c) Modify the R code of Example 8.5 to make a version of Figure 8.5 (p207)
that describes this problem.
x = seq(.50, .73, .001); prior = dbeta(x, 380, 220)
post = dbeta(x, 380 + 38, 220 + 62)
plot(x, post, type="l", ylim=c(0, 25), lwd=2, col="blue",
xlab="Proportion in Favor", ylab="Density")
post.int = qbeta(c(.025,.975), 418, 282)
abline(v=post.int, lty="dashed", col="red")
abline(h=0, col="darkgreen"); lines(x, prior)
d For variety, and to encourage you to make this plot for yourself, we have changed
the plotting window and included vertical red lines to show the 95% posterior prob-
ability interval. c
d) Pollsters sometimes report the margin of sampling error for a poll with
n subjects as being roughly given by the formula 100/√n %. According
to this formula, what is the (frequentist’s) margin of error for the poll in
part (b)? How do you suppose the formula is derived?
d The rough margin of error from the formula is 10%. The R code below shows
results we hope will be self-explanatory. The inverted parabola π(1 − π) has its
maximum at π = 1/2. As above, let x be the number of Successes out of n individuals
sampled. Approximating 1.96 by 2, the traditional margin of error 1.96 √(p(1 − p)/n)
becomes 1/√n. When p is reasonably near 1/2 the formula still works reasonably
well: we have (0.5)^2 = 0.25, while (0.4)(0.6) = 0.24, and even (0.3)(0.7) = 0.21. c
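A computation along the following lines gives the comparison (a sketch; the poll size
n = 100 with x = 38 in favor is inferred from the posterior parameters used in part (c),
so treat it as an assumption):
n = 100; p = 38/n                            # assumed poll data
100/sqrt(n)                                  # rough formula, in percent: 10
100*1.96*sqrt(p*(1 - p)/n)                   # traditional margin of error, in percent
100*1.96*sqrt(.25/n)                         # maximized at p = 1/2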
Hints: (a) Use R code qbeta(c(.025,.975), 380, 220) to find one 95% prior
probability interval. (b) One response: P {π < 0.55} < 1%.
(d) A standard formula
for an interval with roughly 95% confidence is p ± 1.96 √(p(1 − p)/n), where n is
“large” and p is the sample proportion in favor (see Example 1.6). What value of π
maximizes π(1 − π)? What if π = 0.4 or 0.6?
8.2 In Example 8.1, we require a prior distribution with E(π) ≈ 0.55 and
P {0.51 < π < 0.59} ≈ 0.95. Here we explore how one might find suitable
parameters α and β for such a beta-distributed prior.
a) For a beta distribution, the mean is µ = α/(α + β) and the variance is
σ^2 = αβ/[(α + β)^2(α + β + 1)]. Also, a beta distribution with large enough
values of α and β is roughly normal, so that P {µ − 2σ < π < µ + 2σ} ≈
0.95. Use these facts to find values of α and β that approximately satisfy
the requirements. (Theoretically, this normal distribution would need to
be truncated to have support (0, 1).)
d We require α/(α + β) = 0.55, so that β = (0.45/0.55)α = 0.818α, α + β = 1.818α,
and αβ = 0.818α^2. Also, we require 2σ ≈ 0.04 or σ^2 ≈ 0.0004. Then routine algebra
gives α ≈ 340, and thus β ≈ 278. c
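A quick numerical check of these values (not part of the original answer):
340/(340 + 278)                              # approximately 0.55
pbeta(.59, 340, 278) - pbeta(.51, 340, 278)  # approximately 0.95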
b) The following R script finds values of α and β that may come close to
satisfying the requirements and then checks to see how well they succeed.
What assumptions about α and β are inherent in the script? Why do we
use β = 0.818α? What values of α and β are returned? For the values of
the parameters considered, how close do we get to the desired values of
E(π) and P {0.51 < π < 0.59}?
alpha = 1:2000 # trial values of alpha
beta = .818*alpha # corresponding values of beta
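The rest of the script is not reproduced in this copy; it presumably continues along
these lines:
prob = pbeta(.59, alpha, beta) - pbeta(.51, alpha, beta)   # P{.51 < pi < .59}
cond = (abs(prob - .95)==min(abs(prob - .95)))             # probability closest to 95%
alpha[cond]; beta[cond]                                    # parameters returned
alpha[cond]/(alpha[cond] + beta[cond]); prob[cond]         # check the two requirements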
c) If the desired mean is 0.56 and the desired probability in the interval
(0.51, 0.59) is 90%, what values of the parameters are returned by a suit-
ably modified script?
alpha = 1:2000 # trial values of alpha
beta = ((1 - .56)/.56)* alpha # corresponding values of beta
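The rest of the script presumably proceeds as in part (b), with 0.90 in place of 0.95 as
the target probability for the interval (0.51, 0.59).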
c) The R script below plots examples of each of the 25 cases, scaled vertically
(with top) to show the properties in parts (a) and (b) about as well as
can be done and yet show most of each curve.
Run the code and compare the resulting matrix of plots with your results
above (α-cases are rows, β columns). What symmetries within and among
the 25 plots are lost if we choose beta = c(.7, 1, 1.7, 2, 7)? (See
Figure 8.6.)
d Three cases along the principal diagonal in Figure 8.6 are no longer symmetrical.
(Correction: The first printing had errors in the code, corrected at ## above, affecting
captions in the figure. See the erratum at the end of this chapter of answers.) c
8.4 In Example 8.1, we require a prior distribution with E(π) ≈ 0.55 and
P {0.51 < π < 0.59} ≈ 0.95. If we are willing to use nonbeta priors, how might
we find ones that meet these requirements?
a) If we use a normal distribution, what parameters µ and σ would satisfy
the requirements?
d For a method, see the answer to Problem 8.2. Answers: µ = 0.55 and σ = 0.02. c
c) Plot three priors on the same axes: BETA(330, 270) of Example 8.1 and
the results of parts (a) and (b).
xx = seq(0.48, 0.62, by=0.001)
plot(xx, dbeta(xx, 330, 270), ylim=c(0, 20), type="l",
ylab="Prior Density", xlab=expression(pi))
lines(xx, dnorm(xx, .55, .02), lty="dashed", col="red")
lines(c(.4985, .55, .6015), c(0, 19.43, 0), lwd=2, col="blue")
abline(h = 0, col="darkgreen")
d) Do you think the expert would object to any of these priors as an expres-
sion of her feelings about the distribution of π?
d Superficially, the three densities seem about the same, so probably not—especially
if the expert is not a student of probability. The normal and beta curves are al-
most identical, provided that the normal curve is restricted to the interval (0, 1).
(The area under the normal curve outside the unit interval is negligible for practical
purposes, so one could restrict its support without adjusting the height of the den-
sity curve.) The triangular density places no probability at all outside the interval
(0.4985, 0.6015), so a poll with a large n and p = x/n far outside that interval would
still not yield a posterior distribution with any probability outside the interval. c
Notes: (c) Plot: Your result should be similar to Figure 8.7. Use the method in Exam-
ple 8.5 to put several plots on the same axes. Experiment: If v = c(.51, .55, .59)
and w = c(0, 10, 0), then what does lines(v, w) add to an existing plot? (d) The
triangular prior would be agreeable only if she thinks values of π below 0.4985 or
above 0.6015 are absolutely impossible.
# Results
post.mean = mean(pp*igd)/d; post.mean
post.pr.bigwin = (1/m)*sum(igd[pp > .6])/d; post.pr.bigwin
post.cum = cumsum((igd/d)/m)
min(pp[post.cum > .025]); min(pp[post.cum > .975])
b) Now suppose we choose the prior NORM(0.55, 0.02) to match the expert’s
impression that the prior should be centered at π = 55% and put 95% of its
probability in the interval 51% < π < 59%. The shape of this distribution
is very similar to BETA(330, 270) (see Problem 8.4). However, the normal
prior is not a conjugate prior. Write the kernel of the posterior, and say
why the method of Example 8.5 is intractable. Modify the program above
to use the normal prior (substituting the function dnorm for dbeta). Run
the modified program. Compare the results with those in part (a).
d The prior has kernel p(π) ∝ exp[−(π − µ)^2/(2σ^2)], where µ and σ are as specified above.
The likelihood is of the form p(x|π) ∝ π^x (1 − π)^(n−x). The product doesn’t have the
form of any familiar density function. c
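The setup that precedes the Results block below is not reproduced in this copy; it
presumably resembles the following sketch, with dnorm in place of dbeta (the data
x = 620, n = 1000 and the grid size m are taken from the related code later in this
problem, so treat them as assumptions):
x = 620; n = 1000                            # data
m = 100000                                   # number of grid points (assumed)
pp = seq(0, 1, length=m)                     # grid of values of pi
igd = dnorm(pp, .55, .02)*dbinom(x, n, pp)   # prior density times likelihood
d = mean(igd)                                # denominator of posterior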
# Results
post.mean = mean(pp*igd)/d
post.pr.bigwin = (1/m)*sum(igd[pp > .6])/d
post.cum = cumsum((igd/d)/m)
pi.lower = min(pp[post.cum > .025])
pi.upper = min(pp[post.cum > .975])
d; post.mean; post.pr.bigwin; pi.lower; pi.upper
c) The scripts in parts (a) and (b) above are “wasteful” because grid values
of π are generated throughout (0, 1), but both prior densities are very
nearly 0 outside of (0.45, 0.65). Modify the program in part (b) to integrate
over this shorter interval.
Strictly speaking, you need to divide d, post.mean, and so on, by 5 be-
cause you are integrating over a region of length 1/5. (Observe the change
in d if you shorten the interval without dividing by 5.) Nevertheless, show
that this correction factor cancels out in the main results. Compare your
results with those obtained above.
d In computing the probability and the cumulative distribution function based on
the posterior distribution, we are integrating p(π|x) in equation (8.2) on p200 of the
text. The integrals in the numerator and denominator involve the same interval. c
# Results
post.mean = mean(pp*igd)/d
post.pr.bigwin = (1/m)*sum(igd[pp > .6])/d
post.cum = cumsum((igd/d)/m)
pi.lower = min(pp[post.cum > .025])
pi.upper = min(pp[post.cum > .975])
d; post.mean; post.pr.bigwin; pi.lower; pi.upper
# Results
post.mean = mean(pp*igd)/d
post.pr.bigwin = (1/m)*sum(igd[pp > .6])/d
post.cum = cumsum((igd/d)/m) # These three
pi.lower = min(pp[post.cum > .025]) # lines of code
pi.upper = min(pp[post.cum > .975]) # require the sort.
d; post.mean; post.pr.bigwin; pi.lower; pi.upper
set.seed(2345)
x = 620; n = 1000 # data
m = 1000000 # nr of norm points
pp = sort(rnorm(m, .55, .02)) # pts from norm
igd = dbinom(x, n, pp) # weighted likelihood
d = mean(igd) # denominator
# Results
post.mean = mean(pp*igd)/d
post.pr.bigwin = (1/m)*sum(igd[pp > .6])/d
post.cum = cumsum((igd/d)/m) # requires sort
pi.lower = min(pp[post.cum > .025])
pi.upper = min(pp[post.cum > .975])
d; post.mean; post.pr.bigwin; pi.lower; pi.upper
b) Modify the program of part (a) to find the posterior corresponding to the
“isosceles” prior of Problem 8.4. Make sure your initial value is within the
support of this prior, and use the following lines of code for the numerator
and denominator of the ratio of densities. Notice that, in this ratio, the
constant of integration cancels, so it is not necessary to know the height
set.seed(2345)
m = 100000
piec = numeric(m) # states of chain
piec[1] = 0.5 # starting value in base of triangle
for (i in 2:m) {
piec[i] = piec[i-1] # if no jump
piep = runif(1, piec[i-1]-.05, piec[i-1]+.05) # proposal
nmtr = max(.0515-abs(piep-.55), 0)*dbinom(620, 1000, piep)
dmtr = max(.0515-abs(piec[i-1]-.55), 0)*
dbinom(620, 1000, piec[i-1])
r = nmtr/dmtr; acc = (min(r,1) > runif(1)) # accept prop.?
if(acc) {piec[i] = piep} }
pp = piec[(m/2+1):m] # after burn-in
quantile(pp, c(.025,.975)); mean(pp > .6)
qbeta(c(.025,.975), 950, 650); 1-pbeta(.6, 950, 650)
hist(pp, prob=T, col="wheat", main="")
xx = seq(.5, .7, len=1000)
lines(xx, dbeta(xx, 950, 650), lty="dashed", lwd=2)
Notes: (a) In the program, the code %%1 (mod 1) restricts the value of nmtr to
(0, 1). This might be necessary if you experiment with parameters different from
those in this problem. (b) Even though the isosceles prior may seem superficially
similar to the beta and normal priors, it puts no probability above 0.615, so the
posterior can put no probability there either. In contrast, the data show 620 out of
1000 respondents are in favor.
b) Plot the likelihood function for n = 1000 and x = 620. Approximate its
maximum value from the graph. Then do a numerical maximization with
the R script below. Compare it with the answer in part (a).
pp = seq(.001, .999, .001) # avoid ’pi’ (3.1416)
like = dbinom(620, 1000, pp); pp[like==max(like)]
plot(like, type="l") # plot not shown
e) For the particular case with n = 1000 and x = 620, find the posterior
mode and a 95% probability interval.
d In the R program below, we include a grid search for the mode of the posterior
distribution just to confirm that it agrees with the MLE—as the answer to part (d)
says it must. In view of the assertion in the Note (also see Problem 1.20), it is not
a surprise that the 95% posterior probability interval is numerically similar to the
Agresti-Coull 95% confidence interval, but it is a bit of a surprise to find agreement
to four-places in this particular problem.
Extra: This remarkable agreement in one instance raises the question how good
the agreement is across the board. A few computations for n = 1000 and then for
n = 100, suggest that very good agreement is not unusual.
Not explored here is behavior for success ratios very near 0 or 1, where interval
estimation can become problematic, but where there is evidence that the Bayesian
intervals may perform better. Also notice that, using R, the Bayesian intervals are
easy to find. c
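The program is not reproduced in this copy; a sketch of the comparison it makes, assuming
the flat prior BETA(1, 1) (so that the posterior is BETA(621, 381)) and the usual
Agresti-Coull adjustment, is:
pp = seq(.001, .999, .001)
post = dbeta(pp, 621, 381)                   # posterior with a flat prior (assumed)
pp[post==max(post)]                          # grid search for the posterior mode
qbeta(c(.025, .975), 621, 381)               # 95% posterior probability interval
x = 620; n = 1000                            # Agresti-Coull: add about 2 successes, 2 failures
p.ac = (x + 2)/(n + 4)
p.ac + c(-1.96, 1.96)*sqrt(p.ac*(1 - p.ac)/(n + 4))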
Note: In many estimation problems, the MLE is in close numerical agreement with
the Bayesian point estimate based on a noninformative prior and on the posterior
mode. Also, a confidence interval based on the MLE may be numerically similar
to a Bayesian probability interval from a noninformative prior. But the underlying
philosophies of frequentists and Bayesians differ, and so the ways they interpret
results in practice may also differ.
8.8 Recall that in Example 8.6 researchers counted a total of t = 256 mice
on n = 50 occasions. Based on these data, find the interval estimate for λ
described in each part. Comment on similarities and differences.
a) The prior distribution GAMMA(α0 , κ0 ) has least effect on the posterior
distribution GAMMA(α0 + t, κ0 + n) when α0 and κ0 are both small. So
prior parameters α0 = 1/2 and κ0 = 0 give a Bayesian 95% posterior
probability interval based on little prior information.
d By formula, the posterior distribution is GAMMA(1/2 + 256, 50). Rounded to four
places, the R code qgamma(c(.025,.975), 256.5, 50) returns (4.5214, 5.7765). c
d Rounded to four places, this interval is (4.5120, 5.7871). For practical purposes,
all of the intervals in this problem are essentially the same. Ideally, in a homework
paper, you would summarize the intervals (and their lengths) in a list for easy
comparison.
In general, discussions about which methods are best center on their coverage
probabilities (as considered for binomial intervals in Section 1.2) and on their average
lengths (see Problem 1.19). Getting useful interval estimates for Poisson λ is more
difficult when x is very small. c
Notes: (a) Actually, using κ0 = 0 gives an improper prior. See the discussion in
Problem 8.12. (b) This style of CI has coverage inaccuracies similar to those of the
traditional CIs for binomial π (see Section 1.2). (c) See Stapleton (2008), Chapter 12.
8.9 In a situation similar to that in Examples 8.2 and 8.6, suppose that we
want to begin with a prior distribution on the parameter λ that has E(λ) ≈ 8
and P {λ < 12} ≈ 0.95. Subsequently, we count a total of t = 158 mice in
n = 12 trappings.
a) To find the parameters of a gamma prior that satisfy the requirements
above, write a program analogous to the one in Problem 8.2. (You can
come very close with α0 an integer, but don’t restrict κ0 to integer values.)
d In the program below, we try values α0 = 0.1, 0.2, . . . , 20.0 and get α0 = 12.8 and
κ0 = 1.6, which give E(λ) = 12.8/1.6 = 8 and P {λ < 12} = 0.95, to three places.
Using integer values of α0 , as suggested in the problem, we would get α0 = 13 and
κ0 = 1.625, which give E(λ) = 8 and P {λ < 12} a little above 95%. c
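A search along these lines (a sketch; the manual’s own program is not reproduced in this
copy) might look like this:
alpha0 = seq(.1, 20, by=.1)                  # trial values of alpha_0
kappa0 = alpha0/8                            # so that E(lambda) = 8
prob = pgamma(12, alpha0, kappa0)            # P{lambda < 12}
cond = (abs(prob - .95)==min(abs(prob - .95)))
alpha0[cond]; kappa0[cond]; prob[cond]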
b) Find the gamma posterior that results from the prior in part (a) and the
data given above. Find the posterior mean and a 95% posterior probability
interval for λ.
d We use the prior GAMMA(α0 = 12.8, κ0 = 1.6) along with the data t = 158 and
n = 12 to obtain the posterior GAMMA(αn = α0 + t = 170.8, κn = κ0 + n = 13.6),
as on p202 of the text. Then the R code qgamma(c(.025, .975), 170.8, 13.6)
returns the 95% posterior probability interval (10.75, 14.51). Of course, if you used
a slightly different prior, your answer may differ slightly. c
c) As in Figure 8.2(a), plot the prior and the posterior. Why is the posterior
here less concentrated than the one in Figure 8.2(a)?
d The code below graphs the prior (black) and posterior (blue) density curves. The
posterior is less concentrated than the one in Example 8.6 because it is based on
much less data. c
xx = seq(2, 18, by=.01); top = max(dgamma(xx, 170.8, 13.6))
plot(xx, dgamma(xx, 12.8, 1.6), type="l", lwd=2,
ylim=c(0, top), xlab="Mice in Region", ylab="Density",
main="Prior (black) and Posterior Densities")
lines(xx, dgamma(xx, 170.8, 13.6), col="blue")
abline(h=0, col="darkgreen")
d) The ultimate noninformative gamma prior is the improper prior having
α0 = κ0 = 0 (see Problems 8.7 and 8.12 for definitions). Using this prior
and the data above, find the posterior mean and a 95% posterior proba-
bility interval for λ. Compare the interval with the interval in part (b).
d The code qgamma(c(.025, .975), 158, 12) returns the required posterior prob-
ability interval (11.19, 15.30), of length 4.10 (based on unrounded results). Owing
to the influence of the informative prior of part (b), the interval (10.75, 14.51), of
length 3.76, is shorter and shifted a little to the left. But see the comment below. c
Partial answers: In (a) you can use a prior with α0 = 13. Our posterior intervals
in (b) and (d) agree when rounded to integer endpoints: (11, 15), but not when
expressed to one- or two-place accuracy—as you should do.
8.10 In this chapter, we have computed 95% posterior probability intervals
by finding values that cut off 2.5% from each tail. This method is computa-
tionally relatively simple and gives satisfactory intervals for most purposes.
However, for skewed posterior densities, it does not give the shortest interval
with 95% probability.
The following R script finds the shortest interval for a gamma posterior.
(The vectors p.low and p.up show endpoints of enough 95% intervals that
we can come very close to finding the one for which the length, long, is a
minimum.)
d The suggested code, slightly modified, has been moved to part (a). See also the
Extra example in the answer to Problem 8.1(a)—where there is not much difference
between the shortest and the probability-symmetric probability intervals. c
a) Compare the length of the shortest interval with that of the usual
(probability-symmetric) interval. What probability does the shortest in-
terval put in each tail?
alp = 5; kap = 1
p.lo = seq(.001,.05, .00001); p.up = .95 + p.lo
q.lo = qgamma(p.lo, alp, kap); q.up = qgamma(p.up, alp, kap)
long = q.up - q.lo # avoid confusion with function ‘length’
cond = (long==min(long))
PI.short = c(q.lo[cond], q.up[cond]); PI.short # shortest PI
diff(PI.short) # length of shortest PI
pr =c(p.lo[cond], 1-p.up[cond]); pr # probs in each tail
dens.ht = dgamma(PI.short, alp, kap); dens.ht # for part (c)
PI.sym = qgamma(c(.025,.975), alp, kap); PI.sym # prob-sym PI
diff(PI.sym) # length of prob-symmetric PI
b) Use the same method to find the shortest 95% posterior probability inter-
val in Example 8.6. Compare it with the probability interval given there.
Repeat, using suitably modified code, for 99% intervals.
d In Example 8.6, the posterior probability interval is based on αn = 260 and
κn = 50.33. For large α, gamma distributions become nearly symmetrical. So the
shortest and probability-symmetric intervals are nearly the same for this example. c
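A sketch of the rerun, changing only the parameter values in the part (a) script (for 99% intervals, replace .95 by .99 and shrink the p.lo grid accordingly):
alp = 260; kap = 50.33                           # posterior parameters from Example 8.6
p.lo = seq(.001, .05, .00001); p.up = .95 + p.lo
q.lo = qgamma(p.lo, alp, kap); q.up = qgamma(p.up, alp, kap)
long = q.up - q.lo; cond = (long == min(long))
c(q.lo[cond], q.up[cond])                        # shortest 95% PI
qgamma(c(.025, .975), alp, kap)                  # probability-symmetric 95% PI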
set.seed(2345)
m = 100000; nu = 9500; r = 900; s = 1100
x = rhyper(m, 900, nu - r, 1100)
nu.est.lp = floor(r*s/x)
nu.est.sch = floor((r+1)*(s+1)/(x+1) - 1)
mean(nu.est.lp); mean(nu.est.sch)
sd(nu.est.lp); sd(nu.est.sch)
d The modified program is shown below, along with the 95% Bayesian probability
interval it computes. The version of the negative binomial distribution implemented
in R counts only the number of Failures up until the required number (here 150) of
Successes is encountered, and so it has support 0, 1, 2, . . . . Because we use nu - 150
as the first argument of the negative binomial probability function in R, the values
of ν with positive probability are 150, 151, 152, . . . .
The mean of the negative binomial prior (counting both Successes and Failures)
is 150/.014 = 10714.29, which is somewhat larger than the data alone would suggest.
So it is not surprising that the resulting probability interval covers somewhat larger
values than the probability interval resulting from the flat prior in part (c).
We leave it to you to make the trivial change in the program of Problem 4.27(d).
With seed 1935, we obtained the simple bootstrap CI (8181, 11 511); our CI from
the percentile method was (7711, 11 041). Maybe it would be worthwhile for you
to explore what bootstrap CIs result if the nearly-unbiased Schnabel estimator of
population size is used throughout the parametric bootstrap procedure. c
8.12 In Example 8.7, we show formulas for the mean and precision of the
posterior distribution. Suppose five measurements of the weight of the beam,
using a scale known to have precision τ = 1, are: 698.54, 698.45, 696.09,
697.14, 698.62 (x̄ = 697.76).
a) Based on these data and the prior distribution of Example 8.3, what is
the posterior mean of µ? Does it matter whether we choose the mean, the
median, or the mode of the posterior distribution as our point estimate?
(Explain.) Find a 95% posterior probability interval for µ. Also, suppose
we are unwilling to use this beam if it weighs more than 699 pounds; what
are the chances of that?
d The formulas µn = (τ0 /τn )µ0 + (nτ /τn )x̄ and τn = τ0 + nτ are used in the R
code below. The parameters and distributions are: prior NORM(µ0 , σ0 = 1/√τ0 );
likelihood NORM(µ, σ = 1/√τ ), in which σ is known and µ is to be estimated using
data x̄; and posterior NORM(µn , σn = 1/√τn ).
The last line of the R code below computes the 95% posterior probability interval
for comparison with intervals in parts (c) and (d). c
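A sketch of such code (the prior parameters in the first line are placeholders — substitute the values given for Example 8.3 in the text). Because the posterior is normal, its mean, median, and mode coincide, so the choice of point estimate does not matter.
mu.0 = 700; tau.0 = 1/100       # placeholder prior NORM(700, 10); replace with Example 8.3 values
tau = 1; n = 5; x.bar = 697.76  # known precision of the scale, and the data
tau.n = tau.0 + n*tau
mu.n = (tau.0/tau.n)*mu.0 + (n*tau/tau.n)*x.bar
mu.n; 1/sqrt(tau.n)                                # posterior mean and SD
qnorm(c(.025, .975), mu.n, 1/sqrt(tau.n))          # 95% posterior probability interval
pnorm(699, mu.n, 1/sqrt(tau.n), lower.tail=FALSE)  # chance the beam weighs more than 699 lb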
b) Modify the R script shown in Example 8.5 to plot the prior and posterior
densities on the same axes. (Your result should be similar to Figure 8.3.)
d The distributions, parameters, and data are as in part (a). The code is shown, but
not the resulting plot, which is indeed similar to Figure 8.3. c
c) Taking a frequentist point of view, use the five observations given above
and the known variance of measurements produced by our scale to give a
95% confidence interval for the true weight of the beam. Compare it with
the results of part (a) and comment.
d For n = 5 observations with mean x̄ = 697.76, chosen at random from a normal
population with unknown mean µ and known σ = 1, the 95% confidence interval
for µ is x̄ ± 1.96σ/√n. This computes as (696.8835, 698.6365). c
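The same computation in R (using qnorm(.975) rather than the rounded value 1.96, so the endpoints may differ slightly in the last decimal place):
697.76 + c(-1, 1)*qnorm(.975)*1/sqrt(5)   # frequentist 95% CI for mu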
d) The prior distribution in this example is very “flat” compared with the
posterior: its precision is small. A practically noninformative normal prior
is one with precision τ0 that is much smaller than the precision of the data.
As τ0 decreases, the effect of µ0 diminishes. Specifically, limτ0 →0 µn = x̄
and limτ0 →0 τn = nτ. The effect is as if we had used p(µ) ∝ 1 as the
prior. Of course, such a prior distribution is not strictly possible, because the
integral ∫ p(µ) dµ over (−∞, ∞) would be ∞. But it is convenient to use such an improper
prior as shorthand for understanding what happens to a posterior as
the prior gets less and less informative. What posterior mean and 95%
probability interval result from using an improper prior with our data?
Compare with the results of part (c).
d Numerically, the Bayesian 95% posterior probability interval is the same as the 95%
confidence interval in part (c). The interpretation might differ depending on the
user’s philosophy of inference. Computationally, the Bayesian interval is found in R
as qnorm(c(.025, .975), 697.76, 1/sqrt(5/1)). c
e) Now change the example: Suppose that our vendor supplies us with a
more consistent product so that the prior NORM(701, 5) is realistic and
that our data above come from a scale with known precision τ = 0.4.
Repeat parts (a) and (b) for this situation.
d The code below is the obvious modification of the code in part (a). We leave it to
you to modify the program of part (b) and make the resulting plot. The prior is not
as flat as in parts (a) and (b), but the precision of the data (nτ = 5(0.4) = 2) is still
much greater than the prior precision (1/5² = 0.04), and so the data still predominate
in determining the posterior. c
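A sketch of the modification (the same formulas as in part (a), with the new prior and precision):
mu.0 = 701; tau.0 = 1/5^2          # prior NORM(701, 5), so prior precision 0.04
tau = .4; n = 5; x.bar = 697.76
tau.n = tau.0 + n*tau              # 2.04
mu.n = (tau.0/tau.n)*mu.0 + (n*tau/tau.n)*x.bar   # about 697.82
qnorm(c(.025, .975), mu.n, 1/sqrt(tau.n))         # 95% posterior probability interval
pnorm(699, mu.n, 1/sqrt(tau.n), lower.tail=FALSE) # chance the beam weighs more than 699 lb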
To obtain the first expression above, recall that the likelihood function
is the joint density function of x = (x1 , . . . , xn )|µ. To obtain the second,
write (xi − µ)2 = [(xi − x̄) + (x̄ − µ)]2 , expand the square, and sum over i.
On distributing the sum, you should obtain three terms. One of them
provides the desired result, another is 0, and the third is irrelevant because
it does not contain the variable µ. (A constant term in the exponential is
a constant factor of the likelihood, which is not included in the kernel.)
b) To derive the expression for the kernel of the posterior, multiply the kernels
of the prior and the likelihood, and expand the squares in each. Then put
everything in the exponential over a common denominator, and collect
terms in µ2 and µ. Terms in the exponent that do not involve µ are
constant factors of the posterior density that may be adjusted as required
in completing the square to obtain the desired posterior kernel.
d In terms of τn and µn , we express the kernel of the posterior density as
p(µ|x) ∝ p(µ) p(x|µ) ∝ exp[−(1/(2σ0²))(µ − µ0)²] × exp[−(n/(2σ²))(x̄ − µ)²]
       = exp[−(τ0/2)(µ − µ0)²] × exp[−(nτ/2)(x̄ − µ)²]
       = exp[−(τ0/2)(µ² − 2µµ0 + µ0²) − (nτ/2)(x̄² − 2x̄µ + µ²)]
       ∝ exp[−(τ0/2)(µ² − 2µµ0) − (nτ/2)(−2x̄µ + µ²)]
       = exp[−((τ0 + nτ)/2)µ² + (τ0µ0 + nτ x̄)µ] = exp[−(τn/2)(µ² − 2µµn)]
       ∝ exp[−(τn/2)(µ² − 2µµn + µn²)] = exp[−(τn/2)(µ − µn)²].
We recognize the last expression as the kernel of NORM(µn , 1/√τn ) = NORM(µn , σn ).
The proportionality symbol on the fourth line indicates that terms not involving µ
have been dropped; the one on the last line indicates that such a term has been added
in order to complete the square. (Because we used precisions instead of variances
after the first line, the “common denominator” of the suggested procedure became,
in effect, a common factor.) c
8.14 For a pending American football game, the “point spread” is estab-
lished by experts as a measure of the difference in ability of the two teams. The
point spread is often of interest to gamblers. Roughly speaking, the favored
team is thought to be just as likely to win by more than the point spread as
to win by less or to lose. So ideally a fair bet that the favored team “beats the
spread” could be made at even odds. Here we are interested in the difference
x = v − w between the point spread v, which might be viewed as the favored
team’s predicted lead, and the actual point difference w (the favored team’s
score minus its opponent’s) when the game is played.
a) Suppose an amateur gambler, perhaps interested in bets that would not
have even odds, is interested in the precision of x and is willing to assume
x ∼ NORM(0, σ). Also, recalling relatively few instances with |x| > 30,
he decides to use a prior distribution on σ that satisfies P {10 < σ < 20} =
P {100 < σ 2 = 1/τ < 400} = P {1/400 < τ < 1/100} = 0.95. Find parame-
ters α0 and κ0 for a gamma-distributed prior on τ that approximately
satisfy this condition. (Imitate the program in Problem 8.2.)
d We wish to imitate the program of Problem 8.2(b) (for a beta prior) or, perhaps
more directly, the program of Problem 8.9(a) (already adapted for a gamma prior).
Both of these programs are based on a target mean value. Because we want to have
P {10 < σ < 20}, it seems reasonable to suppose the prior mean may lie a little to
the right of the center 15 of this interval, perhaps around 16. On the precision scale,
this would indicate α0 /κ0 = 1/16² ≈ 0.004. (The parameters mentioned in the Hints
give mean 0.0044.) Consistent with this guess, we seek parameters α0 and κ0 for a
gamma prior on τ with P {1/400 < τ < 1/100} = 0.95.
The program below yields parameters α0 = 16 and κ0 = 4000, which put
close to 95% of the probability in this interval. (Your parameters may be somewhat
different, depending on the details of your search.)
The last line of the program shows that most of the 5% of probability that
this prior distribution puts outside the interval (1/20², 1/10²) = (1/400, 1/100) lies in the left tail. Before
deciding that this is unacceptably lopsided, the gambler needs to ponder whether an
initial guess such as P {12.7 < σ < 21} = 95% would express his prior view about as
well: diff(pgamma(c(1/21^2, 1/12.7^2), t.al, t.ka)) returns about 95%, and
about 2.5% is in the lower tail. In practice, it is all right not to be too finicky about how well
a member of the desired prior family can match a rough probability guess about
prior information. c
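One version of such a search (a sketch — the integer grid and the target mean 0.004 ≈ 1/16² are ours; t.al and t.ka are the parameters referred to above):
al = 1:40                                            # candidate integer values of alpha.0
ka = al/0.004                                        # keeps the prior mean of tau at 0.004
pr = pgamma(1/100, al, ka) - pgamma(1/400, al, ka)   # P{1/400 < tau < 1/100}
t.al = al[which.min(abs(pr - .95))]; t.ka = t.al/0.004
t.al; t.ka                                           # should be 16 and 4000, as above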
prompt a statistician to express the interval for σ to only one decimal place accu-
racy, but we have displayed three places here to emphasize that the two intervals
for σ in this part are essentially equal. c
Notes and hints: (a) Parameters α0 = 11, κ0 = 2500 give probability 0.9455, but
your program should give integers that come closer to 95%. (b) The data x in
part (b), taken from more extensive data available online, Stern(1992), are for 1992
NFL home games; x̄ ≈ 0 and the data pass standard tests for normality. For a
more detailed discussion and analysis of point spreads, see Stern (1991). (c) The
two intervals for σ agree closely, roughly (12, 15). You should report results to one
decimal place.
b) The following five errors are observed when analyzing test specimens:
−2.65, 0.52, 1.82, −1.41, 1.13. Based on the prior distribution in part (a)
and these data, find the posterior distribution, the posterior median value
of τ , and a 95% posterior probability interval for τ . Use these to give the
posterior median value of σ and a 95% posterior probability interval for σ.
d The mean of these n = 5 observations is x̄ = −0.118, which makes the assumption
that µ = 0 seem realistic. There are too few observations for a worthwhile test of
normality, so we will have to take that assumption on faith.
Moreover, s² = (Σ xi²)/n = 13.8703/5 = 2.774. Then αn = α0 + n/2 = 5 + 5/2 = 7.5
and κn = κ0 + ns²/2 = 2 + 13.8703/2 = 8.93515. The posterior median and 95% probability
interval for τ are found from qgamma(c(.5, .025, .975), 7.5, 8.93515),
which returns 0.802 for the median and (0.3504, 1.5382) for the interval. In terms of
σ = 1/√τ , the median is 1.116 and the interval is (0.8063, 1.6893). c
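In R (a sketch, with α0 = 5 and κ0 = 2 as implied by the updating arithmetic above):
x = c(-2.65, 0.52, 1.82, -1.41, 1.13)
s2 = sum(x^2)/length(x)                          # 2.774 (mu is taken to be 0)
alp.n = 5 + length(x)/2; kap.n = 2 + length(x)*s2/2
qgamma(c(.5, .025, .975), alp.n, kap.n)          # posterior median and 95% interval for tau
1/sqrt(qgamma(c(.5, .975, .025), alp.n, kap.n))  # corresponding median and interval for sigma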
c) On the same axes, make plots of the prior and posterior distributions of τ .
Comment.
d The R code required to make the plot is similar to that used in Example 8.5 and
Problem 8.12(b). c
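A sketch in the style of the Problem 8.9(c) code above, using the prior GAMMA(5, 2) and posterior GAMMA(7.5, 8.93515) found in part (b):
tt = seq(0, 6, by=.01); top = max(dgamma(tt, 7.5, 8.93515))
plot(tt, dgamma(tt, 5, 2), type="l", lwd=2, ylim=c(0, top),
  xlab="Precision tau", ylab="Density",
  main="Prior (black) and Posterior (blue) Densities")
lines(tt, dgamma(tt, 7.5, 8.93515), col="blue")
abline(h=0, col="darkgreen")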
Errors in Chapter 8
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p208 Problem 8.3(c). In two lines of the inner loop of the program code, the loop
indices i and j should be reversed, to have alpha[i] and beta[j]. As a result
of this error, values of alpha and beta inside parentheses are reversed in captions
in Figure 8.6. [A corrected figure is scheduled for 2nd printing.] The correct inner
loop is shown below and in Problem 8.3(c) of this Manual.
for (j in 1:5) {
top = .2 + 1.2 * max(dbeta(c(.05, .2, .5, .8, .95),
alpha[i], beta[j]))
plot(x,dbeta(x, alpha[i], beta[j]),
type="l", ylim=c(0, top), xlab="", ylab="",
main=paste("BETA(",alpha[i],",", beta[j],")", sep="")) }
p214 Problem 8.8(c). The second R statement should be qgamma(.975, t+1, n), not
gamma(.975, t+1, n).
Note: If you are using the first printing of the text, please consult the list of
errata at the end of this chapter.
9.1 Estimating prevalence π with an informative prior.
a) According to the prior distribution BETA(1, 10), what is the probability
that π lies in the interval (0, 0.2)?
d The R code pbeta(.2, 1, 10) returns 0.8926258. c
b) If the prior BETA(1, 10) is used with the data of Example 9.1, what is the
(posterior) 95% Bayesian interval estimate of π?
d We use the program of Example 9.1 with appropriate change in the prior distrib-
ution. We do not show the plots.
set.seed(1234)
m = 50000; PI = numeric(m); PI[1] = .5
alpha = 1; beta = 10 # parameters of beta prior
eta = .99; theta = .97
n = 1000; A = 49; B = n - A
for (i in 2:m) {
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta) }
aft.brn = seq(m/2 + 1,m)
par(mfrow=c(2,1))
plot(aft.brn, PI[aft.brn], type="l")
hist(PI[aft.brn], prob=T)
par(mfrow=c(1,1))
mean(PI[aft.brn])
quantile(PI[aft.brn], c(.025, .975))
> mean(PI[aft.brn])
[1] 0.02033511
> quantile(PI[aft.brn], c(.025, .975))
2.5% 97.5% # Interval from Example 9.1:
0.007222679 0.035078009 # (0.0074, 0.0355)
Compared with the posterior probability interval of Example 9.1, this result is
shifted slightly downward. c
c) What parameter β would you use so that BETA(1, β) puts about 95%
probability in the interval (0, 0.05)?
d The R code in the Hints does a grid search among integers from 1 through 100,
giving β = 59. c
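One way to do the search (the code in the Hints may differ):
bet = 1:100
min(bet[pbeta(.05, 1, bet) >= .95])   # smallest integer beta with at least 95% below 0.05; gives 59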
d) If the beta distribution of part (c) is used with the data of Example 9.1,
what is the 95% Bayesian interval estimate of π?
d With seed 1236 and the appropriate change in the prior, the program of part (b)
gives the results shown below. In this run, the posterior mean is found to be 1.78%,
which is consistent with the approximate value given in the Hints. c
> mean(PI[aft.brn])
[1] 0.01781736
> quantile(PI[aft.brn], c(.025, .975))
2.5% 97.5%
0.005396175 0.031816216
9.2 Run the program of Example 9.1 and use your simulated posterior
distribution of π to find Bayesian point and interval estimates of the predictive
power of a positive test in the population from which the data are sampled.
How many of the 49 subjects observed to test positive do you expect are
actually infected?
d Use the equation γ = πη/(πη+(1−π)(1−θ)) to transform the posterior distribution
of π to a posterior distribution of γ, which provides point and interval estimates
of γ. Multiplying these values by 49 gives an idea how many infected units there are
among those testing positive.
By either method, we must remember that the prior and posterior distributions
are for π, a property of a particular population. The sensitivity η and specificity θ
are properties of the screening test. When η and θ are known, γ becomes a function
of π, inheriting its prior and posterior distributions from those of π. c
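A sketch of the first method, assuming the vector PI and the index aft.brn from a run of the Example 9.1 program are still in memory:
eta = .99; theta = .97
GAM = PI[aft.brn]*eta / (PI[aft.brn]*eta + (1 - PI[aft.brn])*(1 - theta))
mean(GAM); quantile(GAM, c(.025, .975))   # point and interval estimates of gamma
49*mean(GAM)                              # expected number truly infected among the 49 positives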
9.3 In Example 5.2 (p124), the test has η = 99% and θ = 97%, the data
are n = 250 and A = 6, and equation (9.1) on p220 gives an absurd negative
estimate of prevalence, π = −0.62%.
a) In this situation, with a uniform prior, what are the Bayesian point es-
timate and (two-sided) 95% interval estimate of prevalence? Also, find a
one-sided 95% interval estimate that provides an upper bound on π.
d The appropriate line near the beginning of the program of Example 9.1 has been
modified to show the current data; the last line provides the required one-sided
probability interval.
set.seed(1240)
m = 50000; PI = numeric(m); PI[1] = .5
alpha = 1; beta = 1; eta = .99; theta = .97
n = 250; A = 6; B = n - A # Data
for (i in 2:m) {
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta) }
aft.brn = seq(m/2 + 1,m)
par(mfrow=c(2,1))
plot(aft.brn, PI[aft.brn], type="l")
hist(PI[aft.brn], prob=T)
par(mfrow=c(1,1))
mean(PI[aft.brn])
quantile(PI[aft.brn], c(.025, .975))
quantile(PI[aft.brn], .95)
> mean(PI[aft.brn])
[1] 0.008846055
> quantile(PI[aft.brn], c(.025, .975))
2.5% 97.5%
0.0002883394 0.0285951866 # two-sided probability interval
> quantile(PI[aft.brn], .95)
95% # one-sided
0.02400536
The data happen to have fewer positive tests than we would expect from false
positives alone: A = 6 < n(1 − θ) = 250(.03) = 7.5. The Gibbs sampler indicates
that the prevalence is very likely below 2.4%. The histogram in the lower panel of
Figure 10.6 shows a right-skewed posterior distribution “piled up” against 0. c
b) In part (a), what estimates result from using the prior BETA(1, 30)?
d Only the line of the program for the prior is changed. Results are shown below for
a run with seed 1241. c
> mean(PI[aft.brn])
[1] 0.007416064
> quantile(PI[aft.brn], c(.025, .975))
2.5% 97.5%
0.0002133581 0.0242588017
> quantile(PI[aft.brn], .95)
95%
0.02041878
Comment: a) See Figure 9.10. Two-sided 95% Bayesian interval: (0.03%, 2.9%). Cer-
tainly, this is more useful than a negative estimate, but don’t expect a narrow interval
with only n = 250 observations. Consider that a flat-prior 95% Bayesian interval
estimate of τ based directly on t = 6/250 is roughly (1%, 5%).
9.4 In each part below, use the uniform prior distribution on π and suppose
the test procedure described results in A = 24 positive results out of n = 1000
subjects.
a) Assume the test used is not a screening test but a gold-standard test,
so that η = θ = 1. Follow through the code for the Gibbs sampler in
Example 9.1, and determine what values of X and Y must always occur.
Run the sampler. What Bayesian interval estimate do you get? Explain
why the result is essentially the same as the Bayesian interval estimate you
would get from a uniform prior and data indicating 24 infected subjects
in 1000, using the code qbeta(c(.025,.975), 25, 977).
d Because η = θ = 1, we also have γ = δ = 1, so the first “binomial” simulation (for X)
has success probability γ = 1 and the second (for Y) has success probability 1 − δ = 0.
Thus the degenerate random variables are X ≡ A = 24 and Y ≡ 0, so that X + Y = 24
and n − X − Y = B = 976. This means that the posterior distribution from which each
PI[i] is generated is BETA(αn = 1 + 24 = 25, βn = 1 + 976 = 977), as in the Hints. c
c) Why are the results from parts (a) and (b) not much different?
d The gold-standard test in part (a) has τ = π = 24/1000 = 2.4%. The only difference
in part (b) is that about 3% of the approximately 2.4% of infected subjects, almost
surely less than 1% of the population, will have different test results. c
Hints: a) The Gibbs sampler simulates a large sample precisely from BETA(25, 977)
and cuts off appropriate tails. Why these parameters? Run the additional code:
set.seed(1237); pp=c(.5, rbeta(m-1, 25, 977)); mean(pp[(m/2):m])
c) Why no false positives among the 24 in either (a) or (b)? Consider false negatives.
a) Rerun the Gibbs sampler of Example 9.1 three times with different seeds,
which you select and record. How much difference does this make in the
Bayesian point and interval estimates of π? Use one of the same seeds in
parts (b) and (c) below.
b) Redraw the running averages plot of Figure 9.2 so that the vertical plotting
interval is (0, 0.5). (Change the plot parameter ylim.) Does this affect
your perception of when the process “becomes stable”? Repeat, letting
the vertical interval be (0.020, 0.022), and comment.
d By choosing a small enough window on the vertical axis, you can make even a very
stable process appear to be unstable. You need to keep in mind how many decimal
places of accuracy you hope to get in the final result. c
c) Change the code of the Gibbs sampler in the example so that the burn-in
period extends for 15 000 steps. Compared with the results of the example,
what change does this make in the Bayesian point and interval estimates
of π? Repeat for a burn-in of 30 000 steps and comment.
9.6 Thinning. From the ACF plot in Figure 9.2 on p223, we see that the
autocorrelation is near 0 for lags of 25 steps or more. Also, from the right-
hand plot in this figure, it seems that the process of Example 9.1 stabilizes
after about 15 000 iterations. One method suggested to mitigate effects of
autocorrelation, called thinning, is to consider observations after burn-in
located sufficiently far apart that autocorrelation is not an important issue.
a) Use the data and prior of Example 9.1. What Bayesian point estimate
and probability interval do you get by using every 25th step, starting
with step 15 000? Make a histogram of the relevant values of PI. Does
thinning in this way have an important effect on the inferences?
d The program below uses the same seed as Example 9.1, for a direct comparison.
Code for the histogram and the ACF plot [requested in part (b)] is shown, but
not the plots themselves. Some “un-thinned” results obtained in the example are
shown in comments; here, the difference between thinned and un-thinned results lies
beyond the second decimal place. c
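# Setup (set.seed, m, PI, alpha = beta = 1, eta = .99, theta = .97, n = 1000, A = 49, B = n - A)
# as in Example 9.1; only the lines below differ.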
for (i in 2:m)
{
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta)
}
thin.aft.brn = seq(.3*m+1, m, by=25) # thinned set of steps
mean(PI[thin.aft.brn])
quantile(PI[thin.aft.brn], c(.025, .975))
acf(PI[thin.aft.brn], plot=F) # for part (b)
par(mfrow=c(2,1))
hist(PI[thin.aft.brn], prob=T)
acf(PI[thin.aft.brn], ylim=c(-.1, .6)) # for part (b)
par(mfrow=c(1,1))
> mean(PI[thin.aft.brn])
[1] 0.02040507 # Un-thinned 0.0206
> quantile(PI[thin.aft.brn], c(.025, .975))
2.5% 97.5%
0.007883358 0.035243238 # Un-thinned (0.0074, 0.0355)
> acf(PI[thin.aft.brn], plot=F) # for part (b)
0 1 2 3 4 5 6 7 8
1.000 0.026 0.008 0.017 -0.049 -0.049 -0.010 -0.011 0.013
9 10 11 12 13 14 15 16 17
0.008 0.045 0.049 -0.004 -0.073 0.025 -0.020 -0.035 -0.017
18 19 20 21 22 23 24 25 26
-0.078 -0.018 -0.009 -0.013 0.011 -0.024 0.014 -0.035 0.007
27 28 29 30 31
0.011 -0.015 -0.025 -0.005 -0.022
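# Setup as in Example 9.1 (uniform prior, alpha = beta = 1); only the lines below are new or changed.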
for (i in 2:m) {
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta) }
aft.brn = seq(m/2 + 1,m)
est.d = density(PI[aft.brn], from=0, to=1); mx = max(est.d$y)
hist(PI[aft.brn], ylim=c(0, mx), prob=T, col="wheat")
lines(est.d, col="darkgreen")
quantile(PI[aft.brn], c(.025, .975)) # posterior probability int.
mean(PI[aft.brn]) # posterior mean
median(PI[aft.brn]) # posterior median
est.d$x[est.d$y==mx] # density est of post. mode
a) Run the code to verify that it gives the result claimed. In the R Ses-
sion window, type ?density and browse the information provided on
kernel density estimation. In this instance, what is the reason for the
parameters from=0, to=1? What is the reason for finding mx before the
histogram is made? In this book, we have used the mean of sampled val-
ues after burn-in as the Bayesian point estimate of π. Possible alternative
estimates of π are the median and the mode of the sampled values after
burn-in. Explain how the last statement in the code roughly approximates
the mode.
d Making the figure (not shown): The prior distribution BETA(1, 1) = UNIF(0, 1) has
support (0, 1), so we know that the posterior distribution has this same support, and
we want the density estimate for the posterior to be constrained to this interval also.
In many cases, the value of the density estimate at its mode turns out to be greater
than the height of any of the histogram bars, so we set the vertical axis of the
histogram to accommodate the height of the density estimate.
In the output above: After showing the posterior probability interval, we show
the three possible point estimates: mean (as usual), median, and (density-estimated)
mode. Our simulated posterior distribution is slightly skewed to the right, so the
mean is the largest of these and the mode is the smallest. However, the skewness is
slight, and all three point estimates of π round to 0.02 = 2%. c
set.seed(1246)
m = 50000; PI = numeric(m); PI[1] = .2
alpha = 1; beta = 1
eta = .95; theta = .98
n = 100; A = 2; B = n - A
for (i in 2:m) {
num.x = PI[i-1]*eta; den.x = num.x + (1-PI[i-1])*(1 - theta)
X = rbinom(1, A, num.x/den.x)
num.y = PI[i-1]*(1 - eta); den.y = num.y + (1-PI[i-1])*theta
Y = rbinom(1, B, num.y/den.y)
PI[i] = rbeta(1, X + Y + alpha, n - X - Y + beta) }
b) How does the posterior mean compare with the estimate from equa-
tion (9.1) on p220? Use the Agresti-Coull adjustment t0 = (A + 2)/(n + 4).
d The point estimate p of π from equation (9.1), using the Agresti-Coull adjustment
t0 = (A + 2)/(n + 4) = 4/104 = 0.03846 to estimate τ , is
p = (t0 + θ − 1)/(η + θ − 1) = (4/104 + 0.98 − 1)/(0.95 + 0.98 − 1) = 0.01985 ≈ 2%.
This is close to the value in part (a). Moreover, if we use the Agresti-Coull upper bound
for τ in equation (9.1), we get about 5% as an upper bound for π. (See the R code
below.) If we use the traditional estimate t = A/n = 2/100 = 0.02 of τ , then
equation (9.1) gives p = 0 as the estimate of π. c
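The R code referred to is not reproduced here; the following sketch (our construction, using a one-sided 95% Agresti-Coull bound for τ) is consistent with the values quoted above:
A = 2; n = 100; eta = .95; theta = .98
t.ac = (A + 2)/(n + 4)                                  # 0.03846
(t.ac + theta - 1)/(eta + theta - 1)                    # point estimate of pi, about 2%
t.up = t.ac + qnorm(.95)*sqrt(t.ac*(1 - t.ac)/(n + 4))  # one-sided AC upper bound for tau
(t.up + theta - 1)/(eta + theta - 1)                    # rough upper bound for pi, about 5%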
d) What Bayesian estimates would you get with the prior of part (c) if there
are no test-positive animals among 100? In this case, what part of the
Gibbs sampling process becomes deterministic?
d Point estimate: about π = 0.7%; bound: about 2.2%. With A = 0, we have X ≡ 0. c
Comments: In (a) and (b), the Bayesian point estimate and the estimate from equa-
tion (9.1) are about the same. If there are a few thousand animals in the herd,
these results indicate there might indeed be at least one infected animal. Then, if
the disease is one that may be highly contagious beyond the herd or if diseased
animals pose a danger to humans, we could be in for serious trouble. If possible,
first steps might be to quarantine this herd for now, find the two animals that tested
positive, and quickly subject them to a gold-standard diagnostic test for the disease.
That would provide more reliable information than the Gibbs sampler based on the
screening test results. d) Used alone, a screening test with η = 95% and θ = 98%
applied to a relatively small proportion of the herd seems a very blunt instrument
for trying to say whether the herd is free of a disease.
9.9 Write and execute R code to make diagnostic graphs for the Gibbs
sampler of Example 9.2 showing ACFs and traces (similar to the plots in
Figure 9.2). Comment on the results. d Imitate the relevant code in Example 9.1. c
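A sketch (assuming the chains MU and THETA and the index aft.brn from your run of the Example 9.2 sampler are in memory):
par(mfrow=c(2,2))
plot(aft.brn, MU[aft.brn], type="l");     acf(MU[aft.brn])
plot(aft.brn, THETA[aft.brn], type="l");  acf(THETA[aft.brn])
par(mfrow=c(1,1))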
9.10 Run the code below. Explain step-by-step what each line (beyond
the first) computes. How do you account for the difference between diff(a)
and diff(b)?
x.bar = 9.60; x.sd = 2.73; n = 41
x.bar + qt(c(.025, .975), n-1)*x.sd/sqrt(n)
a = sqrt((n-1)*x.sd^2 / qchisq(c(.975,.025), n-1)); a; diff(a)
b = sqrt((n-1)*x.sd^2 / qchisq(c(.98,.03), n-1)); b; diff(b)
set.seed(1947)
m = 50000
MU = numeric(m); THETA = numeric(m)
THETA[1] = 1
n = 5; x.bar = 28.31; x.var = 5.234^2
mu.0 = 25; th.0 = 4
alp.0 = 30; kap.0 = 1000
for (i in 2:m)
{
th.up = 1/(n/THETA[i-1] + 1/th.0)
mu.up = (n*x.bar/THETA[i-1] + mu.0/th.0)*th.up
MU[i] = rnorm(1, mu.up, sqrt(th.up))
set.seed(1948)
m = 50000
MU = numeric(m); THETA = numeric(m)
THETA[1] = 1
n = 5; x.bar = 28.31; x.var = 5.234^2
mu.0 = 0; th.0 = 10^6
alp.0 = .01; kap.0 = .01
...
9.12 Before drawing inferences, one should always look at the data to see
whether assumptions are met. The vector x in the code below contains the
n = 41 observations summarized in Example 9.2.
x = c( 8.50, 9.75, 9.75, 6.00, 4.00, 10.75, 9.25, 13.25,
10.50, 12.00, 11.25, 14.50, 12.75, 9.25, 11.00, 11.00,
8.75, 5.75, 9.25, 11.50, 11.75, 7.75, 7.25, 10.75,
7.00, 8.00, 13.75, 5.50, 8.25, 8.75, 10.25, 12.50,
4.50, 10.75, 6.75, 13.25, 14.75, 9.00, 6.25, 11.75, 6.25)
mean(x)
var(x)
shapiro.test(x)
par(mfrow=c(1,2))
boxplot(x, at=.9, notch=T, ylab="x",
xlab = "Boxplot and Stripchart")
stripchart(x, vert=T, method="stack", add=T, offset=.75, at = 1.2)
qqnorm(x)
par(mfrow=c(1,1))
> mean(x)
[1] 9.597561
> var(x)
[1] 7.480869
> shapiro.test(x)
data: x
W = 0.9838, p-value = 0.817
b) Comment on the graphical output in Figure 9.11. (The angular sides of the
box in the boxplot, called notches, indicate a nonparametric confidence
interval for the population median.) Also comment on the result of the
test. Give several reasons why it is reasonable to assume these data come
from a normal population.
d There are two graphical displays (shown in Figure 9.11 of the text). The first
is a notched boxplot with a stripchart on the same scale. The notches indicate a
nonparametric CI for the population median, roughly (9, 11), perhaps a little lower,
with a sample median a little below 10 mm. Within the accuracy of reading the plot,
the numerical agreement of this CI with the probability interval from the Gibbs sampler
in Example 9.2 seems good. There is no evidence that the population is skewed. The
second plot is a normal probability plot (Q-Q plot). Points fall roughly in a straight
line, as is anticipated for data from a normal population.
set.seed(1237)
m = 50000
MU = numeric(m); THETA = numeric(m)
MU[1] = 5 # initial value for MU
n = 41; x.bar = 9.6; x.var = 2.73^2
mu.0 = 0; th.0 = 400
alp.0 = 1/2; kap.0 = 1/5
for (i in 2:m)
{ # use MU[i-1] to sample THETA[i]
alp.up = n/2 + alp.0
kap.up = kap.0 + ((n-1)*x.var + n*(x.bar - MU[i-1])^2)/2
THETA[i] = 1/rgamma(1, alp.up, kap.up)
# use THETA[i] to sample MU[i]
th.up = 1/(n/THETA[i] + 1/th.0)
mu.up = (n*x.bar/THETA[i] + mu.0/th.0)*th.up
MU[i] = rnorm(1, mu.up, sqrt(th.up))
}
SIGMA = sqrt(THETA)
mean(SIGMA[aft.brn]) # point estimate of sigma
bi.SIGMA = sqrt(bi.THETA); bi.SIGMA
par(mfrow=c(2,2))
plot(aft.brn, MU[aft.brn], type="l")
plot(aft.brn, SIGMA[aft.brn], type="l")
hist(MU[aft.brn], prob=T); abline(v=bi.MU, col="red")
hist(SIGMA[aft.brn], prob=T); abline(v=bi.SIGMA, col="red")
par(mfrow=c(1,1))
a) By subtracting and adding x̄, show that the exponential in the likelihood
function can be written as exp{−[(n − 1)s² + n(x̄ − µ)²]/(2θ)}.
d Looking at the sum in the exponent (over values i from 1 to n), we have
Σ(xi − µ)² = Σ[(xi − x̄) + (x̄ − µ)]²
           = Σ[(xi − x̄)² + 2(x̄ − µ)(xi − x̄) + (x̄ − µ)²]
           = (n − 1)s² + n(x̄ − µ)².
Upon distributing the sum in the second line: The first term in the second line
becomes the first term in the last line by the definition of s2 . The last term in the
second line does not involve i and so becomes the last term in the last line. The
second term in the second line is 0 because it is a multiple of Σ(xi − x̄) = 0. c
b) The distribution of θ|x, µ used in the Gibbs sampler is based on the prod-
uct p(θ|x, µ) ∝ p(θ) p(x|µ, θ). Expand and then simplify this product to
verify that θ|x, µ ∼ IG(αn , κn ), where αn and κn are as defined in the
example.
d In the product of the kernels of the prior and likelihood, there are two kinds of
factors: powers of θ and exponentials (powers of e). The product of the powers of θ is
θ^(−(α0+1)) × θ^(−n/2) = θ^(−(α0+n/2)−1) = θ^(−(αn+1)),
where αn = α0 + n/2 is defined in Example 9.2. If we denote by A the quantity
displayed in part (a), then the product of the exponential factors is
exp(−κ0/θ) × exp(−A/(2θ)) = exp(−(κ0 + A/2)/θ) = exp(−κn/θ),
where κn = κ0 + A/2 = κ0 + [(n − 1)s² + n(x̄ − µ)²]/2 is defined in the example. c
9.15 The R code below was used to generate the data used in Example 9.3.
If you run the code using the same (default) random number generator in R
we used and the seed shown, you will get the same data.
set.seed(1212)
g = 12 # number of batches
r = 10 # replications per batch
mu = 100; sg.a = 15; sg.e = 9 # model parameters
a.dat = matrix(rnorm(g, 0, sg.a), nrow=g, ncol=r)
# ith batch effect across ith row
e.dat = matrix(rnorm(g*r, 0, sg.e), nrow=g, ncol=r)
# g x r random item variations
X = round(mu + a.dat + e.dat) # integer data
X
> X
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 103 113 88 96 89 88 80 92 89 81
[2,] 143 116 126 127 132 121 129 148 129 119
[3,] 107 107 98 103 113 104 99 103 98 109
[4,] 71 72 89 63 85 71 75 76 98 57
[5,] 105 101 113 110 109 101 114 114 113 107
[6,] 88 93 100 91 98 105 103 91 123 110
[7,] 71 52 67 59 67 67 60 68 62 53
[8,] 115 102 93 111 130 114 97 103 112 98
[9,] 58 70 65 78 67 60 74 80 47 68
[10,] 133 119 130 136 133 116 131 118 140 135
[11,] 103 101 97 110 125 107 115 106 110 94
[12,] 83 106 86 91 88 107 92 98 88 95
a) Run the code and verify whether you get the same data. Explain the
results of the statements a.dat, var(a.dat[1,]), var(a.dat[,1]), and
var(as.vector(e.dat)). How do the results of the first and second of
these statements arise? What theoretical values are approximated (not
very well because of the small sample size) by the last two of them?
d The function rnorm(g, 0, sg.a) generates the g values Ai of the model displayed
just above the middle of p229 of the text. By default, matrices are filled by columns.
Thus, when the g × r matrix a.dat is made, the values Ai are recycled r times to
yield a matrix with r identical columns. The statement a.dat[1,] would print r
copies of the single value A1 in the first row of the matrix, and its variance is 0.
The statement a.dat[,1] would print the g values Ai in the first column of the
matrix. Its variance approximates the parameter σA² = θA of the model, the “batch
variance.” In practice, the values Ai are latent (not observable), so they are not
available for estimating θA . (See part (b) for an unbiased estimate of θA based on
observed data.)
In this simulation, we know the values of the Ai , and their sample variance is
about 397.5, which is not very close to θA = 15² = 225. Because this estimate
is based on a sample of only g = 12 simulated observations, the poor result is not
surprising. (Notice that we can make this comparison in a simulation, where we have
specified true parameter values and have access to latent generated variables—none
of which could be known in a practical situation.)
The function rnorm(g*r, 0, sg.e) generates gr observations with mean 0 and
standard deviation σ, the “error components” of the model. These are put into a g × r matrix,
in which all values are different. The statement var(as.vector(e.dat)) finds the
variance of these gr observations, which estimates the parameter σ² = θ of the
model. The numerical value is about 82.8, which is not far from θ = 9² = 81. c
We have already noted (p231 of the text) that the estimate θ̂A is unsatisfactory
because (owing to the subtraction) it can sometimes be negative. But even when it is
positive, we shouldn’t assume it is really close to θA . First, a reliable estimate would
require a large number g of batches. Second, as noted in the answer to part (a), we
cannot observe the Ai directly, and so we must try to estimate θA by disentangling
information about the Ai from information about the eij . Here, the numerical value
of θ̂A is about 500, which is not very near the value θA = σA² = 15² = 225 we
specified for the simulation. c
Hints: a) By default, matrices are filled by columns; shorter vectors recycle. The
variance components of the model are estimated.
b) Intermediate. Derive the confidence intervals in part (a) from the distrib-
utions of the quantities involved.
d Grand mean µ. From p230 of the text, (x̄.. − µ)/√(MS(Batch)/gr) ∼ T(g − 1). To
find a 100(1 − α)% confidence interval for µ, let L and U cut off probability α/2
from the lower and upper tails of T(g − 1), respectively. Then
1 − α = P{L ≤ (x̄.. − µ)/√(MS(Batch)/gr) ≤ U} = P{−U ≤ (µ − x̄..)/√(MS(Batch)/gr) ≤ −L}
      = P{−U √(MS(Batch)/gr) ≤ µ − x̄.. ≤ −L √(MS(Batch)/gr)}
      = P{x̄.. − U √(MS(Batch)/gr) ≤ µ ≤ x̄.. − L √(MS(Batch)/gr)}.
We have −L = U > 0 for the symmetrical distribution T(g − 1), so a 100(1 − α)%
confidence interval for µ is given by x̄.. ± U √(MS(Batch)/gr). Because they view µ as
a fixed, unknown constant, frequentist statisticians object to the use of the word
probability in connection with this interval. The idea is that, over many repeated
experiments, such an interval will cover the true value of µ with relative frequency 1 − α.
Error variance θ. The point estimate of θ = σ² is θ̂ = MS(Error), which has g(r − 1)
degrees of freedom, so that g(r − 1)θ̂/θ ∼ CHISQ(g(r − 1)). Let L′ and U′ cut off probability α/2
from the lower and upper tails of CHISQ(g(r − 1)), respectively. Then
1 − α = P{L′ ≤ g(r − 1)θ̂/θ ≤ U′} = P{1/U′ ≤ θ/[g(r − 1)θ̂] ≤ 1/L′}
      = P{g(r − 1)θ̂/U′ ≤ θ ≤ g(r − 1)θ̂/L′}.
Thus a 100(1 − α)% confidence interval for θ is (g(r − 1)θ̂/U′, g(r − 1)θ̂/L′).
Intraclass correlation ρI . No exact distribution theory leads to a confidence interval
for the batch component of variance θA . However, in practice, it is often useful to have
a confidence interval for the ratio ψ = θA /θ of the two variance components. Also of
interest is the intraclass correlation ρI = Cor(xij , xij′ ) = θA /(θA + θ), where j ≠ j′.
Thus, ρI (read: “rho-sub-I”) is the fraction of the total variance of an individual
observation that is due to the variance among batches.
From p230 of the text, we know that (g − 1)MS(Batch)/(rθA + θ) ∼ CHISQ(g − 1)
and g(r − 1)MS(Error)/θ ∼ CHISQ(g(r − 1)). Dividing each of these chi-squared
random variables by its degrees of freedom, and taking the ratio, we have
[θ/(rθA + θ)] [MS(Batch)/MS(Error)] = R/(rψ + 1) ∼ F(g − 1, g(r − 1)),
where R = MS(Batch)/MS(Error). Then, letting L″ and U″ cut off probability α/2
from the lower and upper tails of F(g − 1, g(r − 1)), respectively, and noticing that
1/ρI = 1/ψ + 1, we have
1 − α = P{L″ ≤ R/(rψ + 1) ≤ U″} = P{R/U″ ≤ rψ + 1 ≤ R/L″}
      = P{(R − U″)/(rU″) ≤ ψ ≤ (R − L″)/(rL″)}
      = P{rL″/(R − L″) + 1 ≤ 1/ψ + 1 ≤ rU″/(R − U″) + 1}
      = P{(R − U″)/[R + (r − 1)U″] ≤ ρI ≤ (R − L″)/[R + (r − 1)L″]}.
From these equations we can make confidence intervals for ψ and for ρI . c
Hint: b) For ρI , start by deriving a confidence interval for ψ = θA /θ. What multiple
of R is distributed as F(g − 1, g(r − 1))?
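A sketch of these interval formulas in R (our variable names; the summary values in the second line are placeholders to be replaced by values computed from the data, e.g., MS.Bat = r*var(X.bar) and MS.Err = mean(X.sd^2)):
g = 12; r = 10                               # numbers of batches and of items per batch
x.bar.. = 100; MS.Bat = 500; MS.Err = 80     # placeholder summary statistics
U = qt(.975, g - 1)
x.bar.. + c(-1, 1)*U*sqrt(MS.Bat/(g*r))              # 95% CI for mu
g*(r - 1)*MS.Err/qchisq(c(.975, .025), g*(r - 1))    # 95% CI for theta
R = MS.Bat/MS.Err
F.cut = qf(c(.975, .025), g - 1, g*(r - 1))          # U'' then L''
(R - F.cut)/(R + (r - 1)*F.cut)                      # 95% CI for rho_I (lower, upper)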
9.17 Figure 9.8 on p235 shows four diagnostic plots for the simulated pos-
terior distribution of σA in the Gibbs sampler of Example 9.3. Make similar
diagnostic plots for the posterior distributions of µ, σ, and ρI .
d Imitate the code in Example 9.1 used to make Figures 9.1 and 9.2. But, for each
of µ, σ, and ρI , use par(mfrow=c(2,2)) to make one figure with four panels. c
set.seed(1237)
g = 12; r = 10
mu = 100; sg.a = 1; sg.e = 9
a.dat = matrix(rnorm(g, 0, sg.a), nrow=g, ncol=r)
e.dat = matrix(rnorm(g*r, 0, sg.e), nrow=g, ncol=r)
X = round(mu + a.dat + e.dat)
X.bar = apply(X, 1, mean); X.sd = apply(X, 1, sd)
round(rbind(X.bar, X.sd), 3)
a) Figure 9.6 (p231) shows boxplots for each of the 12 batches simulated
above. Compare it with Figure 9.5 (p230). How can you judge from these
two figures that the batch component of variance is smaller here than in
Example 9.3?
d Small dots within the boxes of the boxplots in Figures 9.5 and 9.6 indicate the batch
means. In Figure 9.6 these batch means are much less variable than in Figure 9.5.
This suggests that θA is smaller for the data in Figure 9.6 than for the data of
Figure 9.5.
The rationale for this rough method of comparing batch variances θA using
boxplots is as follows. The model is xij = µ + Ai + eij , where Ai ∼ NORM(0, σA ),
eij ∼ NORM(0, σ) and all Ai and eij are mutually independent, and V(xij ) = θA +θ.
So the variability of the x̄i. is a rough guide to the size of θA . We say rough because
there is no straightforward way to visualize θA alone. The batch mean of each boxplot
reflects both components of variance. c
b) Run the Gibbs sampler of Section 9.3 for these data using the same un-
informative priors as shown in the code there. You should obtain 95%
Bayesian interval estimates for µ, σ, σA = √θA , and ρI that cover the
values used to generate the data X. See Figure 9.12, where one-sided in-
tervals are used for σA and ρI .
d Below we show the modifications required in the program of Section 9.3, along with
commented labels for the parts of the program. Changes in the Data section generate
the data shown in part (a). In the last section we make one-sided confidence intervals
for θA and ρI because the sampled posterior distributions are strongly right-skewed
with no left-hand tail. (Two lines of code with ## correct typographical errors in the
first printing of the text, replacing incorrect n with correct k.) c
# Sampling
for (k in 2:m) {
alp.up = alp.0 + g/2
kap.up = kap.0 + sum((a - MU[k-1])^2)/2
VAR.BAT[k] = 1/rgamma(1, alp.up, kap.up)
in the previous problem) is (0.001, 0.078). The code below also finds a 95% one-sided
upper bound 0.11 for ρI ; this differs substantially from the corresponding Bayesian
bound 0.05. For our simulated data, we know that ρI = θA /(θ + θA ) = 1/(1 + 9²) =
0.0122, so both the one- and two-sided Bayesian intervals provide information that is
more useful than what is provided by traditional CIs. c
set.seed(1234)
g = 12; r = 10; mu = 100; sg.a = 1; sg.e = 9
m = 100000; R = numeric(m)
for (i in 1:m) {
a.dat = matrix(rnorm(g, 0, sg.a), nrow=g, ncol=r)
e.dat = matrix(rnorm(g*r, 0, sg.e), nrow=g, ncol=r)
X = round(mu + a.dat + e.dat)
X.bar = apply(X, 1, mean); X.sd = apply(X, 1, sd)
MS.Bat = r*var(X.bar); MS.Err = mean(X.sd^2)
R[i] = MS.Bat/MS.Err }
mean(R < 1)
# Sampling
...
SUMMARY TABLE
c) With flat priors for µ and θA , but the prior of part (b) for θ, run the
Gibbs sampler to find 95% Bayesian interval estimates for µ, σA , σ, and ρI
from the data given above. Compare these intervals with your answers in
part (a) and comment.
d This is the same program as in part (a) except that we have changed to the infor-
mative prior distribution on θ = σ². As a result, the Bayesian probability interval
for σ is narrower than in part (a), and accordingly the probability interval for the
intraclass correlation ρI = θA /(θA + θ) is a little narrower. Again here, we have omitted
(at the ...s) some parts of the program that have been shown earlier. c
# Sampling
...
Note: Data are from page 239 of Snedecor and Cochran (1980). The unbiased
estimate of θA = σA² is positive here. Estimation of σA by any method is problematic
because there are so few batches.
d With the obvious changes, this problem is solved in the same way as Problem 9.20.
A Bayesian analysis of these dye data can also be found in Example 10.3. c
a) Use these data to find unbiased point estimates of µ, σA , and σ. Also find
95% confidence intervals for µ, σ, and ρI (see Problem 9.16).
b) Use a Gibbs sampler to find 95% Bayesian interval estimates for µ, σA , σ,
and ρI from these data. Specify noninformative prior distributions as in
Example 9.3. Make diagnostic plots.
Answers: b) Roughly: (1478, 1578) for µ; (15, 115) for σA . See Box and Tiao (1973)
for a discussion of these data, reported in Davies (1957).
a) Compute the batch means, and thus x̄.. and MS(Batch). Use your results
to find the unbiased point estimates of µ, θA , and θ.
d Batch means are x̄i. = (1/r) Σj xij , for i = 1, . . . , g, where g = 22 and r = 4.
For example, x̄1. = 218/4 = 54.5. These form the vector X.bar below. Estimates are
µ̂ = x̄.. = (1/gr) Σi Σj xij = (1/g) Σi x̄i. , θ̂ = MS(Error), and θ̂A = [MS(Batch) − MS(Error)]/r.
d We have made a few changes so that the summarized data can be used. In particular,
Σi (r − 1)si² = g(r − 1)MS(Error). Because of the changes, we show the full
Gibbs sampler program for this problem.
Posterior means for µ, σA , and σ are in reasonably good agreement with the
corresponding MMEs; where they differ, we prefer the Bayesian results with flat priors.
Moreover, there is very good numerical agreement of the 95% posterior probability
intervals for µ, θ, and ψ with the corresponding 95% CIs that were obtained using the
methods illustrated at the end of the answers for part (a).
Finally, the last block of code, which makes plots similar to those in Figures 9.8
and 9.9, is a reminder to look at diagnostic graphics for all Gibbs samplers. Here
they are all well behaved. c
# Sampling
for (k in 2:m) {
alp.up = alp.0 + g/2
kap.up = kap.0 + sum((a - MU[k-1])^2)/2
VAR.BAT[k] = 1/rgamma(1, alp.up, kap.up)
Note: Data are taken from Brownlee (1956), p325. Along with other inferences
from these data, the following traditional 90% confidence intervals are given there:
(43.9, 47.4) for µ; (17.95, 31.97) for θ; and (0.32, 1.62) for ψ = θA /θ. (See Prob-
lem 9.16.)
9.24 Using the correct model. To assess the variability of a process for
making a pharmaceutical drug, measurements of potency were made on one
pill from each of 50 bottles. These results are entered into a spreadsheet as 10
rows of 5 observations each. Row means and standard deviations are shown
below.
Row 1 2 3 4 5 6 7 8 9 10
Mean 124.2 127.8 119.4 123.4 110.6 130.4 128.4 127.6 122.0 124.4
SD 10.57 14.89 11.55 10.14 12.82 9.99 12.97 12.82 16.72 8.53
d According to the Gibbs sampler, the batch variance is quite small compared with the
error variance. We give one-sided probability intervals for the batch variance and
the intraclass correlation. The block of code at the end of the program shows that
the MME of the batch variance is positive, but the null hypothesis that the batch
variance is 0 is not rejected (P-value = 40%). c
g = 10; r = 5
X.bar = c(124.2, 127.8, 119.4, 123.4, 110.6,
130.4, 128.4, 127.6, 122.0, 124.4)
X.sd = c(10.57, 14.89, 11.55, 10.14, 12.82,
9.99, 12.97, 12.82, 16.72, 8.53)
set.seed(1247)
m = 50000; b = m/4 # iterations; burn-in
MU = VAR.BAT = VAR.ERR = numeric(m)
for (k in 2:m)
{
alp.up = alp.0 + g/2
kap.up = kap.0 + sum((a - MU[k-1])^2)/2
VAR.BAT[k] = 1/rgamma(1, alp.up, kap.up)
par(mfrow=c(2,2))
hist(MU[b:m], prob=T); abline(v=bi.MU)
hist(SIGMA.BAT[b:m], prob=T); abline(v=bi.SG.B)
hist(SIGMA.ERR[b:m], prob=T); abline(v=bi.SG.E)
hist(ICC[b:m], prob=T); abline(v=bi.ICC)
par(mfrow=c(1,1))
# Comparable MMEs
MS.Bat = r*var(X.bar); MS.Err = mean(X.sd^2)
est.theta.a = (MS.Bat - MS.Err)/r
R = MS.Bat/MS.Err; P.val = 1 - pf(R, g-1, g*(r-1))
mean(X.bar); est.theta.a; MS.Bat; MS.Err; R; P.val
b) The truth is that all 50 observations come from the same batch. Record-
ing the data in the spreadsheet by rows was just someone’s idea of a
convenience. Thus, the data would properly be analyzed without regard
to bogus “batches” according to a Gibbs sampler as in Example 9.3.
(Of course, this requires summarizing the data in a different way. Use
s2 = [9MS(Batch) + 40MS(Error)]/49, where s is the standard deviation
of all 50 observations.) Perform this analysis, compare it with the results
of part (a), and comment.
d Now, letting θ = σ 2 , we have xi = µ+ei , where the ei are independently distributed
as NORM(0, σ), for i = 1, . . . , n = 50, so there is no θA in the model. Assuming that
the data were truly collected according to this model, the correct value in part (a)
would have been θA = 0.
Using the values from the end of the code in part (a), we add SS(Batch) = 9 MS(Batch) = 1463.4
and SS(Error) = 40 MS(Error) = 6074.8 in the standard ANOVA table for the model of
part (a), to obtain (n − 1)s² = 7538.2, and hence s² = 153.8407, for the model of this part.
Then a 95% CI for µ is 123.82 ± 2.0096 √(153.8407/50), or (120.30, 127.35),
where 2.0096 is from qt(.975, 49). Also, a 95% CI for θ is (107.35, 238.89), from
49*153.8407/qchisq(c(.975, .025), 49), so that the CI for σ is (10.36, 15.46).
Below, the Gibbs sampler of Example 9.2 is suitably modified to provide Bayesian
interval estimates for these parameters, based on noninformative priors. (We leave
it to you to make the diagnostic graphs shown in the code below and to provide
code for graphs similar to Figures 9.2 and 9.9.) c
set.seed(1070)
m = 50000
MU = numeric(m); THETA = numeric(m)
THETA[1] = 10
for (i in 2:m)
{
th.up = 1/(n/THETA[i-1] + 1/th.0)
mu.up = (n*x.bar/THETA[i-1] + mu.0/th.0)*th.up
MU[i] = rnorm(1, mu.up, sqrt(th.up))
par(mfrow=c(2,2))
plot(aft.brn, MU[aft.brn], type="l")
plot(aft.brn, SIGMA[aft.brn], type="l")
hist(MU[aft.brn], prob=T); abline(v=bi.MU, col="red")
hist(SIGMA[aft.brn], prob=T); abline(v=bi.SIGMA, col="red")
par(mfrow=c(1,1))
Note: Essentially a true story, but with data simulated from NORM(125, 12) replac-
ing unavailable original data. The most important “prior” of all is to get the model
right.
Errors in Chapter 9
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. Includes corrections compiled through July 13, 2011.
p221 Example 9.1. Just below the displayed equations: The second factor of the second
term in the denominator of PVP γ is (1 − θ), not (1 − η).
The same equation on the previous page is correct, as is the program on p221.
p239 Problem 9.5(b). The vertical interval in the last line should be (0.020, 0.022).
p240 Problem 9.7: In the R code, the ylim argument of the hist function should be
ylim=c(0, mx). The correct line of code is:
hist(PI[aft.burn], ylim=c(0, mx), prob=T, col="wheat")
p240 Problem 9.8(b). Add the following sentence:
Use the Agresti-Coull adjustment t0 = (A + 2)/(n + 4).
p245 Problem 9.16. At the beginning of the second line of code, include the statement:
df.Err = g*(r-1);. [Thanks to Leland Burrill.]
11.1
> exp(1) # the standard way to write the constant e
[1] 2.718282
> exp(1)^2; exp(2) # both give the square of e
[1] 7.389056
[1] 7.389056
> log(exp(2)) # ’log’ is log-base-e, inverse of ’exp’
[1] 2
11.2
> numeric(10); rep(0, 10) # two ways to get a vector of ten 0s
[1] 0 0 0 0 0 0 0 0 0 0
[1] 0 0 0 0 0 0 0 0 0 0
> c(0,0,0,0,0,0,0,0,0,0) # a third (tedious) way
[1] 0 0 0 0 0 0 0 0 0 0
> -.5:10; seq(-.5, 9.5) # two ways to write the same vector
[1] -0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
[1] -0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
d A careful choice of the width of the Session window allows the result of 10/2:22
to fit the width of our page. c
11.4
> (1:4)^(0:3) # two vectors of same length
[1] 1 2 9 64
> (1:4)^2 # 4-vector and constant (1-vector)
[1] 1 4 9 16
> (1:2)*(0:3) # 2-vector recycles
[1] 0 2 2 6
11.5
> x1 = (1:10)/(1:5); x1 # 5-vector recycles
[1] 1.000000 1.000000 1.000000 1.000000 1.000000 6.000000
[7] 3.500000 2.666667 2.250000 2.000000
> x1[8] # eighth element of x1
[1] 2.666667
> x1[8] = pi # eighth elem of x1 changed
> x1[6:8] # change visible
[1] 6.000000 3.500000 3.141593
11.6
> x2 = c(1, 2, 7, 6, 5); cumsum(x2) # same length
[1] 1 3 10 16 21
> diff(cumsum(x2)) # ’diff’ not inverse of ’cumsum’
[1] 2 7 6 5
11.8
> x4 = seq(-1, 1, by= .1); x4 # 21 unique values
[1] -1.0 -0.9 -0.8 -0.7 -0.6 -0.5 -0.4 -0.3 -0.2 -0.1 0.0
[12] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> x5 = round(x4); x5 # three unique values
[1] -1 -1 -1 -1 -1 0 0 0 0 0 0 0 0 0 0 0 1 1
[19] 1 1 1
> unique(x5) # list unique values in x5
[1] -1 0 1
> length(unique(x5)) # count unique values in x5
[1] 3
> x5 == 0 # T for elem. where x5 matches 0
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
[10] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
[19] FALSE FALSE FALSE
> x5[x5==0] # list 0s in x5
[1] 0 0 0 0 0 0 0 0 0 0 0
d Correction to the first printing: In the statement of the problem, the closed interval
for t should be [0, 1] (not [−1, 1]). The related R code requires no change. c
d Note: One reader has reported that his installation of R does not give two values
in the last line. Conceivably so. Exact “matching” of values that are equal to many
decimal places is a delicate operation. c
11.13
a) d Because Γ(n) = (n − 1)!, we evaluate the constant term of f(t) as 4!/(2! 1!) =
24/2 = 12. (Alternatively, see the second line of the program below. Also, see
the answer to part (b).) Then f(t) = 12t²(1 − t) = 12(t² − t³). Perhaps an
implied task, though not explicitly stated, is a routine integration to verify that
this density function integrates to unity over (0, 1). We have 12 ∫₀¹ (t² − t³) dt =
12[t³/3 − t⁴/4]₀¹ = 12(1/3 − 1/4) = 1.
The required plot could be made by changing the second line of the program
in Example 11.5 to read f = 12*(t^2 - t^3) and the main label of the plot
function to say BETA(3, 2). Instead, we show a program below that will plot
most members of the beta family of distributions by changing only the parameters
alpha and beta. The last few lines of the program perform a grid search for
the mode of BETA(33, 2), as requested in part (c). (Also see the last few lines of
code in Problem 11.8.)
The main label is made by paste-ing together character strings, separated by
commas. Some of these character strings are included in quotes, others are
made in R by converting numeric constants to character strings that express
their numerical values. The default separator when the strings are concatenated
is a space; here we specify the null string with the argument sep="". We do not
show the resulting, slightly left skewed, plot. c
alpha = 3; beta = 2
k = gamma(alpha + beta)/(gamma(alpha)*gamma(beta)); k
t = seq(0, 1, length=200); f = k*t^(alpha-1)*(1 - t)^(beta-1)
m.lab = paste("Density of BETA(", alpha, ", ", beta, ")", sep="")
plot(t, f, type="l", lwd=2, col="blue", ylab="f(t)", main=m.lab)
abline(h=0, col="darkgreen"); abline(v=0, col="darkgreen")
> k
[1] 12
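The grid-search lines referred to above are not reproduced here; a minimal version, reusing the t and f already computed, would be:
t[f == max(f)]                    # grid approximation to the mode
(alpha - 1)/(alpha + beta - 2)    # exact mode of BETA(alpha, beta), for comparison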
11.19
d We simulate the distribution of the random variable X, which is the number of
Aces obtained when a five-card poker hand is dealt at random (without replacement)
from a standard deck of cards, containing four Aces. We have slightly modified the
program of Example 11.7 to find the exact distribution of X for comparison (see
Problem 11.20). We also plot points on the histogram (not shown here) correspond-
ing to these exact values. Approximate probabilities obtained from two runs of the
program are in very good agreement with the exact probabilities. (Results from the
second run are shown at the very end of the printout below.) c
cut = (0:5) - .5
summary(as.factor(aces))/m # simulated probabilities
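Only the two lines above survive from the modified program in this printout. A
minimal sketch of such a simulation, assuming cards 1 through 4 of a deck numbered
1:52 represent the Aces, might be:

m = 100000;  aces = numeric(m)
for (i in 1:m) {
  hand = sample(1:52, 5)                      # deal 5 cards without replacement
  aces[i] = sum(hand <= 4) }                  # count Aces (cards 1 through 4)
cut = (0:5) - .5
hist(aces, breaks=cut, prob=T)                # simulated distribution
points(0:4, dhyper(0:4, 4, 48, 5), pch=19)    # exact values for comparison
summary(as.factor(aces))/m                    # simulated probabilities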
11.20
d The task is to compute the distribution of the random variable X, the number of
Aces obtained when a five-card poker hand is dealt at random (without replacement)
from a standard deck of cards containing four Aces. The distribution is

P{X = r} = C(4, r) C(48, 5 − r) / C(52, 5),   for r = 0, 1, 2, 3, 4,

where C(n, k) denotes the binomial coefficient. The required computation, using the
choose function in R, is shown at the end of the program in the answer to part (a).
This distribution is hypergeometric, and R has a function dhyper for computing
the distribution somewhat more simply, as illustrated below. Compare the answers
with the exact answers provided in the program for part (a).
In terms of our present application, the parameters of dhyper are in turn: the number
of Aces seen, the number of Aces in the deck (4), the number of non-Aces in the
deck (48), and the number of cards drawn (5). c
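The illustration referred to above does not survive in this printout. A minimal
sketch of the dhyper computation (values rounded) might be:

> dhyper(0:4, 4, 48, 5)                       # P{X = r}, for r = 0, ..., 4
# approx. 0.65884 0.29947 0.03993 0.00174 0.00002
> choose(4, 0:4)*choose(48, 5 - (0:4))/choose(52, 5)   # same values via 'choose'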
Multiple Choice Quiz. Students not familiar with R need to start imme-
diately learning to use R as a calculator. With computers available, this quiz
might be used in class or as a very easy initial take-home assignment. Even
without computers, if the few questions with “tricky” answers are eliminated,
it might work as an in-class quiz.
Instructions: Mark the one best answer for each question.
If answer (e) is not specified, it stands for an alternative
answer such as "None of the above," "Cannot be determined
from information provided," "Gives error message," and so on.
Errors in Chapter 11
Page numbers refer to the text. All corrections are scheduled to appear in
the second printing. Statements of problems in this Instructor Manual have
been corrected. This list includes corrections compiled through July 13, 2011.
p281 Section 11.2.2. In the R code above the problems: Use vector v5 instead of w5
in both instances. [Thanks to Wenqi Zheng.] The correct line is:
> v5; v5[9]
p285 Problem 11.8. The closed interval should be [0, 1], not [−1, 1]. The related R
code is correct. [Thanks to Tony Tran.]
p155 Problem 6.5(e). The displayed equation should have ’mod 5’; consequently, the
points should run from 1 through 5, and 0 should be adjacent to 4. The answer
for part (e) should say: “The X-process is not Markov.” The correct statement
of part (e) is as follows:
e) At each step n > 1, a fair coin is tossed, and Un takes the value −1
if the coin shows Tails and 1 if it shows Heads. Starting with V1 = 0,
the value of Vn for n > 1 is determined by Vn = (Vn−1 + Un) mod 5.
for (j in 1:5) {
top = .2 + 1.2 * max(dbeta(c(.05, .2, .5, .8, .95),
alpha[i], beta[j]))
plot(x,dbeta(x, alpha[i], beta[j]),
type="l", ylim=c(0, top), xlab="", ylab="",
main=paste("BETA(",alpha[i],",", beta[j],")", sep="")) }
p214 Problem 8.8(c). The second R statement should be qgamma(.975, t+1, n), not
gamma(.975, t+1, n).
p221 Example 9.1. Just below displayed equations: The second factor of the second
term in the denominator of PVP γ is (1 − θ), not (1 − η). The same equation on
the previous page is correct, as is the program on p221.
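Assuming the notation of Example 9.1 (prevalence π, sensitivity η, specificity θ),
the corrected formula reads γ = πη / [πη + (1 − π)(1 − θ)].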
p239 Problem 9.5(b). The vertical interval in the last line should be (0.020, 0.022).
p240 Problem 9.7: In the R code, the ylim argument of the hist function should be
ylim=c(0, mx). The correct line of code is:
hist(PI[aft.burn], ylim=c(0, mx), prob=T, col="wheat")
p240 Problem 9.8(b). Add the following sentence:
Use the Agresti-Coull adjustment t0 = (A + 2)/(n + 4).
p245 Problem 9.16. At the beginning of the second line of code, include the statement:
df.Err = g*(r-1);. [Thanks to Leland Burrill.]
p245 Problem 9.18. The summary data printed by the program is usable, but does not
correspond to seed 1237. [Figure 9.6 (p231) illustrates the data for seed 1237.]
The correct summary data are shown with the problem in this Manual.
p246 Problem 9.20(b). Notation for the prior on σ should be IG(β0 = 35, λ0 = 0.25)
to match the code in the program of Example 9.3.