
Chapter 3

Week 3

L3: Bayesian Theory—Priors, Part 2

3.1 Specifying hyperparameters

Hyperparameters are sometimes calculated on the basis of subjective specifications of summary measures for
the prior distribution. One common procedure is to equate a priori guesses of expected values, variances,
or coefficients of variation to the algebraic formulas for those quantities under the prior distribution1. Another is
to specify the probability that θ lies in some range [θL, θU]; e.g., Pr(3 ≤ θ ≤ 9) = 0.8.

Beta prior for Binomial. For example, suppose a Beta(α, β) prior will be used for a binomial data distribution,
Binomial(n, θ), and the desired expected value of θ is 0.8 with a coefficient of variation (√(V[θ])/E[θ]) of 0.25.
This implies a desired variance of (0.25 × 0.8)² = 0.04. One can calculate α and β by solving the following
system of two equations in two unknowns2:

E[θ] = α/(α + β)

V[θ] = αβ/((α + β)²(α + β + 1))
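For concreteness, here is a small R sketch (not from the notes; the rearrangement α + β = E[θ](1 − E[θ])/V[θ] − 1 follows from the two equations above) that carries out this moment matching and reproduces the values in footnote 2:

#-- Moment matching for a Beta(alpha, beta) prior (sketch)
beta.match <- function(E, CV) {
  V <- (CV*E)^2                  # implied prior variance
  s <- E*(1 - E)/V - 1           # s = alpha + beta
  list(alpha = E*s, beta = (1 - E)*s)
}
beta.match(E = 0.8, CV = 0.25)   # alpha = 2.4, beta = 0.6 (footnote 2)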

Gamma prior for Poisson. A similar exercise can be carried out when specifying a Gamma(α, β) prior
for a Poisson data distribution, Poisson(θ). The desired characteristics of the prior distribution are that the
expected value of θ is 12 with a coefficient of variation (√(V[θ])/E[θ]) of 0.30. One can calculate α and β
using the following3:

E[θ] = α/β

V[θ] = α/β²
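Again as a sketch (not from the notes; it uses the fact that these two equations imply CV[θ] = 1/√α), the Gamma hyperparameters can be computed in R and checked against footnote 3:

#-- Moment matching for a Gamma(alpha, beta) prior (sketch)
gamma.match <- function(E, CV) {
  alpha <- 1/CV^2                # from CV = 1/sqrt(alpha)
  beta  <- alpha/E               # from E = alpha/beta
  list(alpha = alpha, beta = beta)
}
gamma.match(E = 12, CV = 0.3)    # alpha = 11.11111, beta = 0.9259259 (footnote 3)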
1 This is referred to as moment matching in Bayesian Models: A Statistical Primer for Ecologists, by Hobbs and Hooten (2015).
I really like this book and note that it is written for ecologists rather than statisticians. It is available online through the
University Library webpage.
2 Try this out: given E[θ] = 0.8 and CV = 0.25, show that α = 2.4 and β = 0.6.
3 Given E[θ] = 12 and CV = 0.3, show that α = 11.11111 and β = 0.9259259.

Normal prior for Normal µ. Instead of specifying moment-related measures of parameters, one might
specify desired quantiles. For example, suppose the data distribution is Normal(θ, σ²), where σ² is known, and a
normal distribution will be used as the prior for θ, i.e., θ ∼ Normal(µ0, σ0²). Specifying a 95% credible
interval, the 2.5th and 97.5th percentiles for the prior should be 15 and 30. One can calculate µ0 and σ0²
using the following equations4:

15 = µ0 − 1.960 σ0
30 = µ0 + 1.960 σ0

In general, letting zp denote the standard normal, Normal(0, 1), quantile for probability p, and yp denote the
quantile for probability p of Normal(µ0, σ0²):

yp1 = µ0 + zp1 σ0
yp2 = µ0 + zp2 σ0

where p1 < p2 (and yp1 < yp2).

4 With 2.5th and 97.5th percentiles equal to 15 and 30, µ0 = 22.5 and σ0 = 3.826531.
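A small R sketch of this quantile matching (not from the notes; qnorm supplies the standard normal quantiles), which reproduces footnote 4:

#-- Quantile matching for a Normal(mu0, sigma0^2) prior (sketch)
normal.match <- function(y, p) {
  z <- qnorm(p)                          # standard normal quantiles z_p1, z_p2
  sigma0 <- (y[2] - y[1])/(z[2] - z[1])
  mu0    <- y[1] - z[1]*sigma0
  list(mu0 = mu0, sigma0 = sigma0)
}
normal.match(y = c(15, 30), p = c(0.025, 0.975))   # mu0 = 22.5, sigma0 approx 3.83 (footnote 4)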

3.2 Normal Distribution Priors

Given the ubiquity and utility of the normal distribution, it is useful to thoroughly explore the posterior for
the parameters of that distribution for various priors. We begin with conjugate priors for just one parameter
of a univariate normal distribution, here denoted Normal(µ, σ 2 ), either µ or σ 2 . Later we’ll examine the
case of specifying joint prior distributions for µ and σ 2 .
To derive the conjugate posteriors, there is a fair amount of algebraic manipulation involved. It is worth
understanding and being able to reproduce the following for yourself as such techniques are useful in other
circumstances.
In all cases considered, the data distribution is Normal(µ, σ 2 ) from which there are n independent and
identically distributed (iid) observations, y = (y1 , y2 , . . ., yn ). The joint probability density function is
f(y | µ, σ²) = ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(yᵢ − µ)²/(2σ²)) = (2πσ²)^(−n/2) exp(−Σ_{i=1}^{n} (yᵢ − µ)²/(2σ²))    (3.1)

Remarks.

• Re-expressing the Normal pdf. A useful re-expression of the normal pdf is the following.
f(y | µ, σ²) = (2πσ²)^(−n/2) exp(−Σ_{i=1}^{n} (yᵢ − ȳ)²/(2σ²)) exp(−n(ȳ − µ)²/(2σ²))    (3.2)

Check the validity of this re-expression. (Hint: rewrite (yᵢ − µ)² as ([yᵢ − ȳ] + [ȳ − µ])².)

• Sufficient statistics. Suppose that σ² is known. Note that the only term in Eq'n 3.2 containing µ also involves
ȳ; thus ȳ is the relevant data for inference about µ. We say that ȳ is a sufficient statistic for µ, as this
statistic contains all the information in the data that is useful for estimating the unknown parameter
(here, µ).

Now suppose that σ² is also unknown. Letting s² = Σ_{i=1}^{n} (yᵢ − ȳ)²/(n − 1), then

f(y | µ, σ²) = (2πσ²)^(−n/2) exp(−(n − 1)s²/(2σ²)) exp(−n(ȳ − µ)²/(2σ²))    (3.3)

and (ȳ, s²) is the joint sufficient statistic for (µ, σ²).


• Precision. Instead of writing the normal pdf in terms of µ and σ², it is sometimes written in terms of
µ and τ, where τ is called the precision and is the inverse of the variance, τ = 1/σ². Symbolically,
y ∼ Normal(µ, 1/τ), and mathematically

f(y | µ, τ) = √(τ/(2π)) exp(−τ(y − µ)²/2)    (3.4)

Some software packages for Bayesian inference, e.g., JAGS and WinBUGS, specify the normal distribution
in terms of the precision τ. The term precision has an intuitive interpretation: if a random variable is more
precise, it is less variable, i.e., as τ increases, σ² decreases.
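As a quick check of the two parameterizations (a sketch, not from the notes), the density in Eq'n (3.4) agrees with R's dnorm when sd = 1/√τ; in JAGS, dnorm(mu, tau) takes the precision directly:

#-- Precision vs. variance parameterization (sketch; values arbitrary)
mu <- 2; tau <- 4; y <- 1.3
sqrt(tau/(2*pi)) * exp(-tau*(y - mu)^2/2)   # Eq'n (3.4)
dnorm(y, mean = mu, sd = 1/sqrt(tau))       # same value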

3.2.1 Normal distribution: unknown µ and known σ 2

While assuming that σ 2 is known is usually not realistic, the methods used for inference for µ are helpful
building blocks for more realistic situations.
To begin we write the likelihood for µ, given that σ² is known, and examine the portion of the pdf in Eq'n
(3.2) that involves µ alone:

f(y | µ, σ²) ≡ L(µ | y, σ²) ∝ exp(−n(ȳ − µ)²/(2σ²)) = exp(−(µ − ȳ)²/(2σ²/n))    (3.5)

Note that the last expression in (3.5) can be seen as the kernel of a normal distribution. Thus the conjugate
prior distribution for µ (when σ², or τ, is known) is the normal distribution:

Prior for µ: µ ∼ Normal(µ0, σ0²)

where µ0 and σ0² are the hyperparameters of the prior. An alternative formulation for the prior (see Reich
and Ghosh, p. 47), which leads to a tidier posterior distribution, is to write σ0² = σ²/m, where m is some
positive number. Of course, if one specifies σ0² first, then m is σ²/σ0².

Prior for µ: µ ∼ Normal(µ0, σ²/m)    (3.6)

The posterior distribution for µ given a normal prior is then:

Posterior: p(µ | y, σ²) ∝ π(µ) p(y | µ)

= (2πσ²/m)^(−1/2) exp(−(µ − µ0)²/(2σ²/m)) × (2πσ²)^(−n/2) exp(−Σ_{i=1}^{n} (yᵢ − µ)²/(2σ²))

∝ exp(−m(µ − µ0)²/(2σ²)) exp(−n(ȳ − µ)²/(2σ²))

∝ exp(−[mµ² − 2mµ0µ + nµ² − 2nȳµ]/(2σ²))

= exp(−[(m + n)µ² − 2(mµ0 + nȳ)µ]/(2σ²))

∝ exp(−(µ − (mµ0 + nȳ)/(m + n))²/(2σ²/(m + n)))    (3.7)

where Eq'n 3.7 is the kernel of a normal distribution; in other words:

Posterior for µ | y, σ²: Normal((mµ0 + nȳ)/(m + n), σ²/(m + n))    (3.8)

The posterior mean for µ is thus a weighted combination of the prior mean, µ0, and the sample mean, ȳ:

E[µ | y, σ²] = (mµ0 + nȳ)/(m + n) = (m/(m + n)) µ0 + (n/(m + n)) ȳ = (1 − w)µ0 + w ȳ

where w = n/(m + n).
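The update in (3.8) is simple enough to wrap in a small helper; the following sketch (the function name is mine, not from the notes) returns the posterior mean, variance, and the weight w, and reproduces the food-expenditure example later in this section:

#-- Posterior for mu with known sigma^2 and Normal(mu0, sigma^2/m) prior (sketch)
post.mu <- function(mu0, m, ybar, n, sigma) {
  w <- n/(m + n)
  list(mean = (1 - w)*mu0 + w*ybar, var = sigma^2/(m + n), w = w)
}
post.mu(mu0 = 200, m = 0.4549, ybar = 165, n = 20, sigma = 50)   # mean 165.778, var 122.2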

Comments.

• As n increases, the posterior mean is dominated by the sample mean:

lim_{n→∞} E[µ | y, σ²] = lim_{n→∞} [(m/(m + n)) µ0 + (n/(m + n)) ȳ] = ȳ

and the posterior becomes concentrated at ȳ:

lim_{n→∞} V[µ | y, σ²] = lim_{n→∞} σ²/(m + n) = 0

• Conversely, as m increases (equivalently, as σ0² = σ²/m decreases), the posterior is dominated by the prior:

lim_{m→∞} E[µ | y, σ²] = lim_{m→∞} (mµ0 + nȳ)/(m + n) = µ0

lim_{m→∞} V[µ | y, σ²] = lim_{m→∞} σ²/(m + n) = 0

Thus the posterior approaches a point mass at µ0 as σ0² goes to 0.

• On the other hand, as m decreases, σ²/m increases (a "vaguer" prior) and the influence of the prior decreases:

lim_{m→0} E[µ | y, σ²] = lim_{m→0} (mµ0 + nȳ)/(m + n) = ȳ

lim_{m→0} V[µ | y, σ²] = lim_{m→0} σ²/(m + n) = σ²/n

Unknown µ and known τ

If we express the sampling distribution for y in terms of the precision, τ = 1/σ², then the prior for µ (given known
τ) is:

µ ∼ Normal(µ0, 1/(τm))    (3.9)

and the posterior for µ is:

Posterior for µ | y, τ: Normal((mµ0 + nȳ)/(m + n), 1/(τ(m + n)))    (3.10)

Posterior predictive distribution

As discussed in Lecture 1, the posterior predictive distribution is the marginal probability distribution for a
new scalar value, y_new, given the past data, y_old:

p(y_new | y_old) = ∫ p(y_new | θ, y_old) p(θ | y_old) dθ = ∫ p(y_new | θ) p(θ | y_old) dθ    (3.11)

The second equality results from y_new being conditionally independent of y_old given θ.

In this normal distribution case with known σ², letting µ1 and σ1² denote the posterior mean and variance
for µ:

p(y_new | y_old) = ∫ p(y_new | µ) p(µ | y_old) dµ
= ∫ (1/√(2πσ²)) exp(−(y_new − µ)²/(2σ²)) (1/√(2πσ1²)) exp(−(µ − µ1)²/(2σ1²)) dµ
= (1/√(2π(σ1² + σ²))) exp(−(y_new − µ1)²/(2(σ1² + σ²)))    (3.12)

Thus, y_new | y_old ∼ Normal(µ1, σ1² + σ²)5. Note that the variance of y_new is the sum of the variance for the
distribution of y if µ were known, namely σ², and the variance due to the uncertainty in µ.
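The variance decomposition can be checked by simulation; the sketch below (not from the notes) uses the posterior values from the food-expenditure example that follows:

#-- Monte Carlo check of y_new | y_old ~ Normal(mu1, sigma1^2 + sigma^2) (sketch)
set.seed(1)
mu1 <- 165.778; sigma1 <- 11.05; sigma <- 50
mu.draw <- rnorm(1e5, mu1, sigma1)      # draw mu from its posterior
y.new   <- rnorm(1e5, mu.draw, sigma)   # then draw y_new given each mu
c(mean(y.new), sd(y.new))               # approx (165.8, 51.2 = sqrt(11.05^2 + 50^2))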

Example: inference for µ

Assume that the amount of money spent during September on food by individual students, y, is normally
distributed with an unknown mean µ and known standard deviation, σ, of £50, or a precision, τ, of 1/50²
= 0.0004. You would like to estimate µ and will take a simple random sample of n=20 students.
Before doing so, you specify a Normal(µ0, 50²/m) prior for µ. You think that the average is around £200
and set µ0 = 200. Further, you guess that the 25th and 75th percentiles are 150 and 250. Given that
the corresponding standard normal percentiles are −0.6744898 and 0.6744898, you can solve for m using the
following equation:

−0.6744898 = (150 − 200)/(50/√m)

Thus m = 0.6744898² = 0.4549. Then the prior for µ is:

µ ∼ Normal(200, 50²/0.4549)

The simple random sample of n=20 was then taken and the average was £165.
The mean and variance for the posterior distribution for µ are based on (3.8):

E[µ | ȳ] = (0.4549 × 200 + 20 × 165)/(0.4549 + 20) = 165.778

V[µ | ȳ] = 50²/(0.4549 + 20) = 122.2199 = 11.05²

and the posterior for µ is

µ | ȳ ∼ Normal(165.778, 11.05²)

A weight of 0.4549/20.4549, about 2%, was given to the prior mean (and 98% to the sample mean), and the
posterior standard deviation for µ went from 74 in the prior (50/√0.4549) to 11.05 in the posterior.
Figure 3.1 shows the prior and posterior distributions for µ.

Posterior predictive distributions. If a student was randomly sampled, the prior and posterior predictive
distributions for that student's food expenditures are:

Prior: y_new ∼ Normal(200, 50²/0.4549 + 50²) = Normal(200, 89.4²)

Posterior: y_new | ȳ_old = 165 ∼ Normal(165.778, 11.05² + 50² = 51.22²)

5 For details on the derivation see Lecture 3A: Supplement Posterior Predictive For Normal Dist'n on Learn.

Figure 3.1: Prior and posterior distributions for µ, the average amount spent on food during September, assuming
a Normal(µ, 50²) distribution.

Comparing the posterior predictive distribution to the prior predictive distribution, the mean has decreased by
around £34 and the predictive standard deviation has dropped from about 89.4 to 51.22. The posterior predictive
standard deviation of 51.22 remains slightly larger than σ = 50, with the additional variance due to the remaining
uncertainty in the value of µ.

3.2.2 Normal distribution: unknown τ or σ 2 and known µ

Again this is usually not a realistic situation, but the results are useful for more complex and realistic models.

Conjugate prior for τ

Before presenting results for σ 2 , we begin with the precision, τ = 1/σ 2 .


We examine the likelihood for τ, keeping only terms that involve τ (see Eq'n 3.1):

f(y | µ, τ) ≡ L(τ | y, µ) ∝ τ^(n/2) exp(−τ Σ_{i=1}^{n} (yᵢ − µ)²/2) = τ^(n/2) exp(−τz²/2)    (3.13)

where, to reduce notation,

z² = Σ_{i=1}^{n} (yᵢ − µ)²

The likelihood (3.13) is the kernel of a Gamma distribution. Thus the conjugate prior for τ when µ is known
is

τ ∼ Gamma(α, β)    (3.14)

and the posterior for τ is:

Posterior: p(τ | y, µ) ∝ τ^(α−1) exp(−βτ) × τ^(n/2) exp(−z²τ/2)
= τ^(α + n/2 − 1) exp(−(β + z²/2)τ)    (3.15)

so that

τ | y, µ ∼ Gamma(α + n/2, β + z²/2)    (3.17)

Comments.

• Recall that if θ ∼ Gamma(α, β), then E[θ] = α/β and V[θ] = α/β².

• Examining the posterior mean:

E[τ | y] = (α + n/2)/(β + z²/2) = α/(β + z²/2) + (n/2)/(β + z²/2) = 2α/(2β + z²) + n/(2β + z²)    (3.18)

As n increases, z² increases, so the first term, 2α/(2β + z²), goes to zero. The second term, n/(2β + z²),
approaches n/Σ_{i=1}^{n} (yᵢ − µ)² = 1/σ̂², where σ̂² is the maximum likelihood estimate (mle) for σ².
Thus as n increases, E[τ | y] approaches 1/σ̂² = τ̂, the mle for τ.

• One approach for selecting the hyperparameters for the Gamma dist'n prior is to specify approximate
values for E[τ] and V[τ] and then solve for α and β. (Admittedly, specifying a value for V[τ] might
be a little involved.)
• An Aside: Re-expression as a χ² distribution. The χ² distribution with ν degrees of freedom has the
following pdf (see Appendix A of King and Ross's notes):

p(θ) = (2^(−ν/2)/Γ(ν/2)) θ^(ν/2 − 1) exp(−θ/2)

Note that this is the same pdf as for Gamma(ν/2, 1/2). Focusing on the kernel of the posterior for τ in
(3.15):

p(τ | y, µ) ∝ τ^(α + n/2 − 1) exp(−(β + z²/2)τ)
∝ (2β + z²)^((2α+n)/2 − 1) τ^((2α+n)/2 − 1) exp(−τ(2β + z²)/2)    (3.19)
= (τ(2β + z²))^((2α+n)/2 − 1) exp(−τ(2β + z²)/2)    (3.20)

where the multiplier (2β + z²)^((2α+n)/2 − 1) in (3.19) is a constant that does not affect the kernel. Then it
can be seen that (3.20) is the kernel of a χ² distribution for τ(2β + z²) with 2α + n degrees of freedom:

τ(2β + z²) ∼ χ²_{2α+n}    (3.21)

This can also be written as what is called a scaled χ² distribution6 (verified numerically in the sketch just after these comments):

τ ∼ (1/(2β + z²)) χ²_{2α+n}    (3.22)
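The equivalence in (3.22) is easy to verify numerically; a sketch (the values of α, β, n, and z² below are arbitrary, chosen only for illustration):

#-- Numerical check of Eq'n (3.22) (sketch; values arbitrary)
alpha <- 2; beta <- 1; n <- 10; z2 <- 3.7
p <- c(0.025, 0.5, 0.975)
qgamma(p, shape = alpha + n/2, rate = beta + z2/2)   # Gamma posterior quantiles for tau
qchisq(p, df = 2*alpha + n)/(2*beta + z2)            # scaled chi-squared quantiles (equal)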

Conjugate prior for σ²

Again examine the likelihood for σ², keeping only terms that involve σ²:

f(y | µ, σ²) ≡ L(σ² | y, µ) ∝ (σ²)^(−n/2) exp(−Σ_{i=1}^{n} (yᵢ − µ)²/(2σ²)) = (σ²)^(−n/2) exp(−z²/(2σ²))    (3.23)

6 If a random variable θ multiplied by a constant c follows a distribution D, i.e., cθ ∼ D, then θ follows a scaled distribution (1/c)D.

The term on the right-hand side of Eq'n (3.23) is the kernel of an Inverse Gamma distribution. The pdf for
an Inverse Gamma with parameters α and β is (see Appendix A of King and Ross's notes):

p(θ) = (β^α/Γ(α)) θ^(−(α+1)) exp(−β/θ)    (3.24)

Thus the Inverse Gamma(α, β) is the conjugate prior for σ² when µ is known.


Then the posterior for σ²:

Posterior: p(σ² | y, µ) ∝ (σ²)^(−(α+1)) exp(−β/σ²) (σ²)^(−n/2) exp(−z²/(2σ²))
= (σ²)^(−(α + n/2 + 1)) exp(−(β + z²/2)/σ²)    (3.25)

where Eq'n 3.25 is the kernel of the Inverse Gamma:

Posterior for σ² | y, µ: Inverse Gamma(α + n/2, β + z²/2)    (3.26)

Comments.

• If θ ∼ Inverse Gamma(α, β), then E[θ] = β/(α − 1) if α > 1, and V[θ] = β²/((α − 1)²(α − 2)) if α > 2.
Thus, values for the hyperparameters α and β can be deduced given prior notions about the mean
and the variance of σ² (a small sketch of this calculation follows these comments).

• Relatively small values for α and β, e.g., 0.01 or 0.001, are often used in practice. Such choices are not
universally considered a good idea; examining the sensitivity of the posterior to the choices of α
and β is good practice.

• As for τ, the posterior for σ² can be written as a scaled inverse χ² distribution (see Appendix A of
King and Ross).

• If x ∼ Gamma(α, β), then y = 1/x ∼ Inverse Gamma(α, β)7.
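A sketch of the first comment's calculation (the closed-form inversion α = 2 + E²/V, β = E(α − 1) follows from the stated moment formulas; the same calculation appears as inv.gamma.param.calc() in the R code section), using the prior mean and variance from the timber example below:

#-- Inverse Gamma hyperparameters from a prior mean E and variance V (sketch)
ig.match <- function(E, V) {
  alpha <- 2 + E^2/V
  beta  <- E*(alpha - 1)
  list(alpha = alpha, beta = beta)
}
ig.match(E = 0.05, V = 0.02)   # alpha = 2.125, beta = 0.05625 (used in the example below)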

Example: inference for σ 2

The construction timber called 2x4 has cross-piece dimensions of 1.5 inches by 3.5 inches on average. Let
Y be the longer dimension and assume that Y ∼ Normal(µ, σ²). The higher the quality control, the closer
the value of Y should be to 3.5 inches; in other words, σ² should be relatively small. A building company is
considering purchasing timber from a new supplier but before doing so would like to see just how precisely
the 2 by 4’s are cut. They will take a random sample of n=10 2x4s and measure the length of the longer
side. They are comfortable assuming that the average length µ is 3.5, thus the data distribution is Y ∼
Normal(3.5, σ 2 ).
Before taking the sample, they decide to use an Inverse Gamma(α, β) prior distribution for σ² and need to
specify the hyperparameters. They would like to be cautious and will assume a priori that the average value
of σ² is 0.05 with a variance of 0.02. This results in an Inverse Gamma(2.125, 0.05625) prior (check this). A simple
random sample of n=10 2x4s yielded the following measurements:

3.527 3.387 3.466 3.382 3.612 3.680 3.471 3.475 3.603 3.680
7 This can be shown using the so-called change of variable theorem.

where Σ_{i=1}^{10} (yᵢ − 3.5)² = 0.117997. The posterior distribution for σ² | y is

σ² | y ∼ Inverse Gamma(2.125 + 10/2, 0.05624 + 0.117997/2) = Inverse Gamma(7.125, 0.1152385)

Thus the posterior mean for σ² is 0.1152385/(7.125 − 1) = 0.0188 and the posterior variance is
0.1152385²/((7.125 − 1)²(7.125 − 2)) = 6.908e-05.
Figure 3.2 plots the prior and posterior distributions for σ 2 .

Figure 3.2: Prior and posterior distributions for σ², the variance of the longer side of 2x4s, assuming the sides are
Normal(3.5, σ²).

R does not have built-in Inverse Gamma distribution functions that could be used to calculate the
quantiles of the posterior distribution for σ². However, we can use the quantile function for the Gamma
distribution in R, namely qgamma, which will yield quantiles for τ = 1/σ², and invert the results. The
posterior distribution for τ is Gamma(7.125, 0.1152385), and the 2.5th and 97.5th percentiles can be found with
qgamma(c(0.025,0.975), shape=7.125, rate=0.1152385) = (25.10391, 114.81646). Thus

0.95 = Pr(25.104 ≤ τ ≤ 114.8) = Pr(25.104 ≤ 1/σ² ≤ 114.8) = Pr(1/114.8 ≤ σ² ≤ 1/25.104) = Pr(0.00871 ≤ σ² ≤ 0.03983)

Thus a 95% credible interval for σ² is (0.00871, 0.03983).

3.3 Bayes Theorem for multiple parameters

Often there will be q > 1 parameters:


Θ = {θ1 , θ2 , . . . , θq }
Given data y, Bayes theorem has the same form:

Pr(Θ | y) = Pr(y | Θ) Pr(Θ) / Pr(y) ∝ Pr(y | Θ) Pr(Θ)

3.3.1 Comments
• The posterior distribution for multiple parameters can be high dimensional, e.g., q parameters = q
dimensions, and summarising a high dimensional space can be complicated.
• Often one-dimensional summaries, namely marginal posterior distributions, are examined instead:

p(θi | y) = ∫ p(Θ | y) dθ1 dθ2 … dθi−1 dθi+1 … dθq

This is the posterior distribution for θi found by "averaging" over all the other parameters, thus
"projecting" (collapsing) the q-dimensional posterior distribution onto a single dimension. One can then
examine a single posterior density plot and, for example, calculate the posterior mean, variance, and credible
interval for θi.

• However, one-dimensional summaries can fail to capture important features of the joint posterior.

• Two-dimensional graphical summaries are useful: contour plots or perspective plots, or, in the case of
samples from the posterior distribution, pairwise scatterplots.

• Two-dimensional numerical summaries include correlations or covariances.

• With more than two dimensions, however, detecting patterns, if they exist, can be more difficult.

3.3.2 Normal Dist’n: Unknown µ and σ 2

Without deriving any results, we discuss two approaches to the situation where both µ and σ 2 are unknown.

Joint prior constructed with independent marginal priors

One approach to arriving at a joint prior for both µ and σ 2 is to specify independent informative marginal
priors for µ and σ 2 and multiply the two to yield a joint prior.
The joint posterior distribution is:

p(µ, σ² | y) = π(µ) π(σ²) (2πσ²)^(−n/2) exp(−Σ_{i=1}^{n} (yᵢ − µ)²/(2σ²)) / ∫∫ π(µ) π(σ²) (2πσ²)^(−n/2) exp(−Σ_{i=1}^{n} (yᵢ − µ)²/(2σ²)) dµ dσ²    (3.27)

In general, (3.27) will not be something that can be calculated analytically, depending on the choices of π(µ)
and π(σ²), because of the integral in the denominator. Numerical or simulation-based integration methods,
which we will discuss later, can be used to yield approximate results.
Consider the 2x4 timber example, but now assume that both µ and σ² are unknown. Suppose one specified
that the prior for µ is Lognormal(µ0, σ0²) and the prior for σ² is Gamma(α, β). The denominator of (3.27)
in this case is:

∫₀^∞ ∫₀^∞ (1/(µ√(2πσ0²))) exp(−(ln(µ) − µ0)²/(2σ0²)) × (β^α/Γ(α)) (σ²)^(α−1) exp(−βσ²) × (2πσ²)^(−n/2) exp(−Σ_{i=1}^{n} (yᵢ − µ)²/(2σ²)) dµ dσ²

which is "probably" not analytically tractable (no closed-form solution; I'm guessing, as I've not tried to
solve it).
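One simple numerical route is a grid approximation of the unnormalised posterior; the following is only a sketch (the hyperparameter values and grid ranges are mine, chosen for illustration, not values from the notes):

#-- Grid approximation of the joint posterior (3.27) for the 2x4 example (sketch)
y <- c(3.527, 3.387, 3.466, 3.382, 3.612, 3.680, 3.471, 3.475, 3.603, 3.680)
mu0 <- log(3.5); sigma0 <- 0.05     # Lognormal prior for mu (illustrative values)
a <- 2; b <- 20                     # Gamma prior for sigma^2 (illustrative values)
mu.grid <- seq(3.3, 3.7, length = 200)
s2.grid <- seq(0.002, 0.2, length = 200)
log.post <- matrix(NA, length(mu.grid), length(s2.grid))
for (i in seq_along(mu.grid)) {
  for (j in seq_along(s2.grid)) {
    log.post[i, j] <- dlnorm(mu.grid[i], mu0, sigma0, log = TRUE) +
      dgamma(s2.grid[j], shape = a, rate = b, log = TRUE) +
      sum(dnorm(y, mu.grid[i], sqrt(s2.grid[j]), log = TRUE))
  }
}
post <- exp(log.post - max(log.post))
post <- post/sum(post)              # normalise over the grid (the denominator of 3.27)
marg.mu <- rowSums(post)            # approximate marginal posterior for mu on mu.grid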

Comment. While the joint prior distribution for µ and σ 2 was constructed by multiplying two independent
marginal pdfs, the joint posterior distribution is not the product of two independent marginal pdfs. This
is common: joint priors constructed as products of independent distributions do not usually yield a joint
posterior that can be written as products of independent distributions.

Conjugate prior for µ and σ²

There is a joint conjugate prior density for µ and σ². It is defined in terms of a marginal prior density for σ²,
which is an Inverse Gamma, and a conditional prior density for µ given σ², which is a Normal whose
variance hyperparameter is a function of σ². Namely,

µ, σ² ∼ Inverse Gamma(α, β) × Normal(µ0, σ²/κ)    (3.28)

There are four hyperparameters: α, β, µ0, and κ. As can be seen, conditional on the value of σ², the prior
variance of µ around the prior mean µ0 increases as σ² increases and as κ decreases. Thus
decreasing the values of α and β and the value of κ leads to a more dispersed prior for µ.

The resulting joint posterior density is then the product of an Inverse Gamma density (for σ²) and a Normal density
(for µ, conditional on σ²):

µ, σ² | y ∼ Inverse Gamma(α + n/2, β + (n − 1)s²/2 + κn(ȳ − µ0)²/(2(κ + n))) × Normal((κµ0 + nȳ)/(κ + n), σ²/(κ + n))    (3.29)

The marginal posterior density for σ² is Inverse Gamma, and the conditional posterior density for µ given σ² is Normal;
integrating over σ², the marginal posterior density for µ is a Student's t distribution (see "Applied Bayesian Statistics", 2013, Cowles, M.K.).
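Because (3.29) factors into an Inverse Gamma for σ² and a Normal for µ given σ², one can sample from the joint posterior directly; a sketch (the hyperparameter values are illustrative, borrowing the timber data, and are not choices made in the notes):

#-- Direct sampling from the joint posterior (3.29) (sketch; hyperparameters illustrative)
set.seed(1)
y <- c(3.527, 3.387, 3.466, 3.382, 3.612, 3.680, 3.471, 3.475, 3.603, 3.680)
n <- length(y); ybar <- mean(y); s2 <- var(y)
alpha <- 2.125; beta <- 0.05625; mu0 <- 3.5; kappa <- 1
a.post <- alpha + n/2
b.post <- beta + (n - 1)*s2/2 + kappa*n*(ybar - mu0)^2/(2*(kappa + n))
sigma2.draw <- 1/rgamma(1e4, shape = a.post, rate = b.post)   # sigma^2 | y ~ Inverse Gamma
mu.draw <- rnorm(1e4, (kappa*mu0 + n*ybar)/(kappa + n),
                 sqrt(sigma2.draw/(kappa + n)))               # mu | sigma^2, y ~ Normal
quantile(mu.draw, c(0.025, 0.975))       # 95% credible interval for mu
quantile(sigma2.draw, c(0.025, 0.975))   # 95% credible interval for sigma^2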

3.3.3 Example: Multiple Parameter Inference

As a demonstration of multiple parameter inference, we fit a Bayesian linear regression of the lengths of
dugongs8 (sea cows, a type of marine mammal; see Figure 3.3) against their age in years. There were n=27
dugongs measured and the sampling model was the following:

Lengthᵢ ∼ Normal(β0 + β1 ln(ageᵢ), σ²), independently for i = 1, …, 27.

The relationship is shown in Figure 3.4.

Figure 3.3: Dugong (image from Wikipedia).

Figure 3.4: Dugong lengths plotted against log(age).

8 Example motivated by Bayesian Methods for Data Analysis, 2009, Carlin and Louis.


There are three parameters, Θ = {β0, β1, σ²}. The following independent marginal prior distributions were
chosen to construct a joint prior distribution:

β0, β1 ∼ Uniform(−50, 50),  σ² ∼ Uniform(0.01, 20)

The resulting (estimated) posterior distribution was found using JAGS (code in the Appendix). The posterior
means and standard deviations are shown below, along with the maximum likelihood estimates (except that σ̂² is
the bias-corrected version of the mle).

           Bayesian              Frequentist
           Mean      SD          mle      std error
  β0       1.76098   0.053089    1.762    0.0424
  β1       0.27757   0.023531    0.277    0.0188
  σ²       0.01277   0.002719    0.0081   —

Figure 3.5 shows the marginal posterior distributions for the three parameters as well as a scatterplot of the
samples of β0 and β1 . The scatterplot shows that there is a negative relationship between β0 and β1 .

Figure 3.5: Marginal posterior distributions for β0 , β1 , σ 2 , and the joint distribution of β0 and β1

3.4 R Code

3.4.1 Example: Posterior µ with known σ 2


#-- Posterior mu given known sigma
sigma <- 50; m <- 0.6744898^2; n <- 20; ybar <- 165; mu.0<-200
w <- n/(m+n)
post.E <- w*ybar+ (1-w)*mu.0;
post.V <- sigma^2/(m+n)
cat("w=",w,"1-w=",1-w,"post.E=",post.E, "post.V=",post.V,"post.sd=",sqrt(post.V),"\n")

#--plot prior and posterior


x.seq <- 10:450
y.prior <- dnorm(x.seq,mean=mu.0,sd=sigma/sqrt(m))
y.post <- dnorm(x.seq,mean=post.E,sd=sqrt(post.V))
my.ylim <- range(c(y.prior,y.post))
plot(x.seq,y.prior,xlab=expression(mu),ylab="",ylim=my.ylim,type="l")
lines(x.seq,y.post,lty=2,col=2)
legend("topright",legend=c("Prior","Posterior"),col=1:2,lty=1:2)

3.4.2 Example: Posterior σ 2 with known µ


library(MCMCpack) #this has dinvgamma() function

inv.gamma.param.calc <- function(mu,V) {


alpha <- 2+mu^2/V
beta <- mu*(alpha-1)
out <- list(alpha=alpha,beta=beta)
return(out)
}
set.seed(742)
n <- 10
y <- rnorm(n=n,mean=3.5,sd=sqrt(0.01))
y <- round(y,3)
sse <- sum((y-3.5)^2)
cat("sse=",sse,"\n")

temp <- inv.gamma.param.calc(0.05,0.02)


prior.alpha <- temp$alpha
prior.beta <- temp$beta

post.alpha <- prior.alpha+n/2


post.beta <- prior.beta + sse/2

post.mean <- post.beta/(post.alpha-1)


post.var <- post.beta^2/((post.alpha-1)^2*(post.alpha-2))

cat("Prior alpha and beta=",prior.alpha,prior.beta,"\n")


cat("Post alpha and beta=", post.alpha,post.beta,"\n")
cat("Post mean=",post.mean,"var=",post.var,"\n")

theta.seq <- seq(0.01,0.4,length=100)


prior.density <- dinvgamma(x=theta.seq, shape=prior.alpha, scale = prior.beta)
post.density <- dinvgamma(x=theta.seq, shape=post.alpha, scale = post.beta)
my.ylim <- range(c(prior.density,post.density))
plot(theta.seq,prior.density,type="l",xlab=expression(sigma^2),ylab="",ylim=my.ylim)
lines(theta.seq,post.density,col=2,lty=2)
legend("topright",legend=c("Prior","Posterior"),lty=1:2,col=1:2)

# 95% credible interval


x <- qgamma(c(0.025,0.975),shape=7.125, rate=0.1152385)

print(x)
print(1/x)

3.4.3 Example: Linear regression with Dugong data
The R code for fitting the Dugong data and the call to JAGS are shown below.

library(rjags)

dugong.data <-
list(age = c( 1.0, 1.5, 1.5, 1.5, 2.5, 4.0, 5.0, 5.0, 7.0,
8.0, 8.5, 9.0, 9.5, 9.5, 10.0, 12.0, 12.0, 13.0,
13.0, 14.5, 15.5, 15.5, 16.5, 17.0, 22.5, 29.0, 31.5),
length = c(1.80, 1.85, 1.87, 1.77, 2.02, 2.27, 2.15, 2.26, 2.47,
2.19, 2.26, 2.40, 2.39, 2.41, 2.50, 2.32, 2.32, 2.43,
2.47, 2.56, 2.65, 2.47, 2.64, 2.56, 2.70, 2.72, 2.57), n = 27)

log.age <- log(dugong.data$age)


plot(dugong.data$length ~ log.age,xlab="Log(Age)",ylab="",main="Dugong length vs log(age)")

# Initial values for running 3 MCMC chains in JAGS


num.chains <- 3
beta0.set <- c(-1,0,1)
beta1.set <- c(-1,0,1)
sigma2.set <- c(3,10,15)
dugong.inits <- list()
for(i in 1:num.chains) {
dugong.inits[[i]] <- list(beta0=beta0.set[i],beta1=beta1.set[i],
sigma2=sigma2.set[i])
}

dugong.model <- "model {


# data that will be read in are age and length and n
# derived quantity: precision corresponding to the variance
tau <- 1/sigma2

#priors
beta0 ~ dunif(-50,50)
beta1 ~ dunif(-50,50)
sigma2 ~ dunif(0.01,20)

#Likelihood
for(i in 1:n) {
logage[i] <- log(age[i])
mu[i] <- beta0 + beta1 * logage[i]
length[i] ~ dnorm(mu[i], tau)
}
}"

set.seed(742)
burnin <- 2000
inference.length <- 10000
dugong.results.initial <- jags.model(file=textConnection(dugong.model),
data=dugong.data, inits=dugong.inits,
n.chains=num.chains)
update(dugong.results.initial, n.iter=burnin)
dugong.results.final <- coda.samples(model=dugong.results.initial,
variable.names=c("beta0","beta1","sigma2"),
n.iter=inference.length,thin=10)
summary(dugong.results.final)

#--- for looking at the entire combined results convert the mcmc.list object to a data frame
dugong.results.df <- as.data.frame(as.matrix(dugong.results.final))
head(dugong.results.df)
par(mfrow=c(2,2),oma=c(0,0,3,0))

plot(density(dugong.results.df$beta0),xlab=expression(beta[0]),ylab="",
main=expression(beta[0]))
plot(density(dugong.results.df$beta1),xlab=expression(beta[1]),ylab="",
main=expression(beta[1]))
plot(density(dugong.results.df$sigma2),xlab=expression(sigma^2),ylab="",
main=expression(sigma^2))
plot( dugong.results.df$beta0,dugong.results.df$beta1,xlab=expression(beta[0]),ylab="",
main= expression(paste("Joint dist ",beta[0]," ",beta[1])))
par(mfrow=c(1,1))
#if(plot.it) dev.copy2pdf(file=paste0(output,"L5_F_dugong_posterior_plots.pdf"))

#Frequentist results
dugong.freq.lm <- lm(length ~ log.age,data=dugong.data)
summary(dugong.freq.lm)
