Bayes Lectures English

- Bayesian econometrics involves updating subjective beliefs about unknown parameters θ based on data, using Bayes' theorem. The posterior distribution P(θ|y) is proportional to the prior P(θ) times the likelihood P(y|θ).
- The likelihood P(y|θ) represents the probability of observing the data y given the parameters θ. For a normal distribution, the likelihood can be written as a function of the sample variance s² and the difference between the parameter θ and the sample mean Ȳ.
- As more data is observed sequentially, the posterior distributions become more concentrated, tending toward a normal shape as the number of observations N increases. The previous posterior serves as the new prior in Bayesian updating.

RS – Lecture 17

Lecture 17
Bayesian Econometrics

Bayesian Econometrics: Introduction


• Idea: We are not estimating a parameter value, θ, but rather updating
and sharpening our subjective beliefs about θ.

• The centerpiece of the Bayesian methodology is Bayes’s theorem:


P(A|B) = P(A∩B)/P(B) = P(B|A) P(A)/P(B).

• Think of B as “something known” –say, the data– and A as
“something unknown” –e.g., the coefficients of a model.

• Our interest: Value of the parameters (θ), given the data (y).

• Reversing Bayes’s theorem, we write the joint probability of θ and y:

P(θ ∩ y) = P(y|θ) P(θ)


Bayesian Econometrics: Introduction


• Then, we write: P(θ|y) = P(y|θ) P(θ)/P(y) (Bayesian learning)

• For estimation, we can ignore the term P(y), since it does not depend
on the parameters. Then, we can write:
P(θ|y) ∝ P(y|θ) x P(θ)

• Terminology:
- P(y|θ): Density of the data, y, given the parameters, θ. Called the
likelihood function. (Given a value for θ, how likely is the observed y.)
- P(θ): Prior density of the parameters. Prior belief of the researcher.
- P(θ|y): Posterior density of the parameters, given the data. (A mixture
of the prior and the “current information” from the data.)

Note: Posterior is proportional to likelihood times prior.

Bayesian Econometrics: Introduction


• The typical problem in Bayesian statistics involves obtaining the
posterior distribution:
P(θ|y) ∝ P(y|θ) x P(θ)

To get P(θ|y), we need:

- The likelihood, P(y|θ), which will be assumed known. The likelihood
carries all the current information about the parameters and the data.
- The prior, P(θ), which will also be known. Q: Where does it come from?

Note: The posterior distribution embodies all that is “believed” about
the model:
Posterior = f(Model|Data)
= Likelihood(θ, Data) x Prior(θ) / P(Data)


Bayesian Econometrics: Introduction


• We want to get P(θ|y) ∝ P(y|θ) x P(θ). There are two ways to
proceed to estimate P(θ|y):
(1) Pick P(θ) and P(y|θ) in such a manner that P(θ|y) can be
analytically derived. This is the “old” way.

(2) Numerical approach. Randomly draw from P(θ) and, then, from
P(y|θ) to produce an empirical distribution for θ. This is the modern way.

• Note: Nothing controversial about Bayes’s theorem. For RVs with
known pdfs, it is a fact of probability theory. But, the controversy starts
when we model unknown pdfs and “update” them based on data.

Good Intro Reference (with references): “Introduction to Bayesian
Econometrics and Decision Theory” by Karsten T. Hansen (2002).

Bayes’ Theorem: Summary of Terminology


• Recall Bayes’ Theorem:

P(θ|y) = P(y|θ) P(θ) / P(y)

- P(θ): Prior probability about parameter θ.

- P(y|θ): Probability of observing the data, y, conditioning on θ. This
conditional probability is called the likelihood –i.e., the probability that
y will be the outcome of the experiment depends on θ.
- P(θ|y): Posterior probability –i.e., probability assigned to θ, after y is
observed.
- P(y): Marginal probability of y. This is the prior probability of
witnessing the data y under all possible scenarios for θ, and it depends
on the prior probabilities given to each θ.


Bayes’ Theorem: Example


Example: Player’s skills evaluation in sports.
S: Event that the player has good skills (& be recruited by the team).
T: Formal tryout performance (say, good or bad).

After seeing videos and scouting reports and using her previous
experience, the coach forms a personal belief about the player’s skills.
This initial belief is the prior, P(S).

After the formal tryout performance, the coach (event T) updates her
prior beliefs. This update is the posterior:

P(S|T) = P(T|S) P(S) / P(T)

Bayes’ Theorem: Example


Example: Player’s skills evaluation in sports.
- P(S): Coach’s personal estimate of the probability that the player
has enough skills to be drafted –i.e., a good player- , based on
evidence other than the tryout. (Say, .40.)
- P(T=good|S): Probability of seeing a good tryout performance if
the player is actually good. (Say, .80.)
- T is related to S:
P(T=good|S (good player)) = .80
P(T=good|SC (bad player)) = .20
- After the tryout, the coach updates her beliefs: P(S|T=good)
becomes our new prior. That is:

P(S|T=good) = P(T=good|S) P(S) / P(T=good)
= (.80 x .40) / (.80 x .40 + .20 x .60) = .7273


Bayesian Econometrics: Sequential Learning


• Consider the following data from N=50 Bernoulli trials:
00100100000101110000101000100000000000011000010100
If θ is the probability of a “1” at any one trial, then the likelihood of
any sequence of s trials containing y ones is
p(y|θ) = θ^y (1−θ)^(s−y)
Let the prior be a uniform: p(θ) = 1. Then, after 5 trials the posterior is:
p(θ|y) ∝ θ (1−θ)^4 x 1 = θ (1−θ)^4
and after 10 trials the posterior is
p(θ|y) ∝ θ (1−θ)^4 x θ (1−θ)^4 = θ^2 (1−θ)^8
and after 40 trials the posterior is
p(θ|y) ∝ θ^8 (1−θ)^22 x θ^2 (1−θ)^8 = θ^10 (1−θ)^30
and after 50 trials the posterior is
p(θ|y) ∝ θ^4 (1−θ)^6 x θ^10 (1−θ)^30 = θ^14 (1−θ)^36

Bayesian Econometrics: Sequential Learning

• Notes:
- The previous posterior becomes the new prior.
- Beliefs tend to become more concentrated as N increases.
- Posteriors seem to look more normal as N increases.
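The running exponents can be computed directly from the data string. A small sketch (note: the exact counts depend on the string as transcribed here, which may have lost a character in extraction, so they can differ by one from the slide):

```python
# Sequential updating for the Bernoulli data shown above.
# With a uniform prior p(theta) = 1, the posterior after observing
# y ones in s trials is proportional to theta^y (1-theta)^(s-y),
# i.e. a Beta(y+1, s-y+1); the previous posterior is the new prior.
data = "00100100000101110000101000100000000000011000010100"

counts = {s: data[:s].count("1") for s in (5, 10, 40, 50)}
for s, y in counts.items():
    print(f"after {s:2d} trials: p(theta|y) ~ theta^{y} (1-theta)^{s - y}")
```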


Likelihood
• It represents the probability of observing the data, y, conditioning
on θ. For example, Yi ~ N(θ, σ²).

There is a useful factorization, when Yi ~ N(θ, σ²), which uses:

Σi (Yi − θ)² = Σi [(Yi − Ȳ) + (Ȳ − θ)]² = Σi (Yi − Ȳ)² + T(θ − Ȳ)²
= (T−1)s² + T(θ − Ȳ)²

where s² = sample variance. Then, the likelihood can be written as:

L(y|θ, σ²) ∝ (1/2πσ²)^(T/2) exp{−[(T−1)s² + T(θ − Ȳ)²]/(2σ²)}

Note: Bayesians work with h = 1/σ², which is called “precision.” A
gamma prior is usually assumed for h. Then,
L(y|θ, σ²) ∝ h^(T/2) exp{−(h/2)[(T−1)s² + T(θ − Ȳ)²]}
= h^(T/2) exp{−(h/2)(T−1)s²} x exp{−(Th/2)(θ − Ȳ)²}
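The algebraic identity behind this factorization is easy to verify numerically. A quick check with arbitrary simulated values (T, θ, and the data are illustrative choices, not from the slides):

```python
# Numerical check of the factorization used above:
#   sum_i (Y_i - theta)^2 = (T-1) s^2 + T (theta - Ybar)^2
import random

random.seed(42)
T, theta = 8, 1.7                      # arbitrary sample size and parameter value
Y = [random.gauss(0.0, 1.0) for _ in range(T)]

Ybar = sum(Y) / T
s2 = sum((y - Ybar) ** 2 for y in Y) / (T - 1)   # sample variance

lhs = sum((y - theta) ** 2 for y in Y)
rhs = (T - 1) * s2 + T * (theta - Ybar) ** 2
print(abs(lhs - rhs) < 1e-10)  # True
```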

Priors: Improper and Proper


• A prior represents the (prior) belief of the researcher about θ, before
seeing the data (X, y). These prior subjective probability beliefs about
the value of θ are summarized with the prior distribution, P(θ).

We can have Improper and Proper priors.


Prob(θi|y) = Prob(y|θi) Prob(θi) / Σj Prob(y|θj) Prob(θj)

If we multiply P(θi) and P(θj) by a constant, the posterior probabilities
will still add up to 1 and be a proper distribution. But, now the priors
do not add up to 1. They are no longer proper.

• When this happens, the prior is called an improper prior. However, the
posterior distribution need not be a proper distribution if the prior is
improper.


Priors: Informative and Non-informative


• In a previous example, we assumed the prior P(S) –i.e., a coach’s
prior belief about a player’s skills, before tryouts.

• This is the Achilles heel of Bayesian statistics. Where do priors come
from?

• Priors can have many forms. We usually divide them into non-
informative and informative priors for estimation of parameters:
–Non-informative (or diffuse) priors: There is a total lack of prior
belief in the Bayesian estimator. The estimator becomes a
function of the likelihood only.
–Informative prior: Some prior information enters the estimator.
The estimator mixes the information in the likelihood with the
prior information.

Priors: Informative and Non-informative


Example: Suppose we have i.i.d. Normal data, Yi ~ N(θ, σ²).
Assume σ² is known. We want to learn about θ; that is, we want to get
P(θ|y). We need a prior for θ.

We assume a normal prior for θ: P(θ) ~ N(θ0, σ0²).

- θ0 is our best guess for θ, before seeing y.
- σ0² states the confidence in our prior. Small σ0² shows big
confidence. It is common to relate σ0² to σ², say σ0² = σ² M.

This prior gives us some flexibility. Depending on σ0², this prior can
be informative or diffuse. A small σ0² represents the case of an
informative prior. As σ0² increases, the prior becomes more diffuse.


Prior Distributions: Diffuse Prior - Example


E.g., the binomial example: L(θ; N, s) = C(N, s) θ^s (1−θ)^(N−s),
where C(N, s) is the binomial coefficient.

Uninformative Prior (?): Uniform (flat) P(θ) = 1, 0 ≤ θ ≤ 1

P(θ|N, s) = θ^s (1−θ)^(N−s) / ∫0¹ θ^s (1−θ)^(N−s) dθ
= [Γ(N+2) / (Γ(N−s+1) Γ(s+1))] θ^s (1−θ)^(N−s) – a Beta distribution

Posterior mean = (s+1) / [(N−s+1) + (s+1)] = (s+1) / (N+2)

For the example, N=20, s=7. MLE = 7/20 = .35.
Posterior Mean = 8/22 = .3636 > MLE. Why? The prior was informative.
(Prior mean = .5)
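The two point estimates on this slide are a one-liner each; a minimal check with the slide's numbers:

```python
# Posterior mean under the flat prior for N = 20, s = 7 (numbers from the slide).
N, s = 20, 7
mle = s / N                       # 7/20 = 0.35
post_mean = (s + 1) / (N + 2)     # mean of Beta(s+1, N-s+1) = 8/22
print(round(mle, 4), round(post_mean, 4))  # 0.35 0.3636
```

The posterior mean is pulled toward the prior mean of .5, which is why it exceeds the MLE.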

Priors: Conjugate Priors


• When the posterior distribution P(θ|y) is in the same family as the
prior probability distribution, P(θ), the prior and posterior are then
called conjugate distributions. The prior is called a conjugate prior for the
likelihood.

• For example, the normal family is conjugate to itself (or self-conjugate)
with respect to a normal likelihood function: if the likelihood function
is normal, choosing a normal prior over the mean will ensure that the
posterior distribution over the mean is also normal.

• The beta distribution is the conjugate prior for the binomial
likelihood.

• Choosing conjugate priors helps to produce tractable posteriors.


Priors: Conjugate Priors - Example


A mathematical device to produce a tractable posterior.
This is a typical application:

L(θ; N, s) = C(N, s) θ^s (1−θ)^(N−s) = [Γ(N+1) / (Γ(s+1) Γ(N−s+1))] θ^s (1−θ)^(N−s)

Use a conjugate beta prior: p(θ) = [Γ(a+b) / (Γ(a) Γ(b))] θ^(a−1) (1−θ)^(b−1)

Posterior ∝ θ^s (1−θ)^(N−s) x θ^(a−1) (1−θ)^(b−1)
= θ^(s+a−1) (1−θ)^(N−s+b−1) / ∫0¹ θ^(s+a−1) (1−θ)^(N−s+b−1) dθ – a Beta distribution.

Posterior mean = (s+a) / (N+a+b) (we used a = b = 1 before)
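The conjugacy claim can be checked on a grid: the normalized product of likelihood and Beta(a, b) prior should coincide with the Beta(s+a, N−s+b) density. A sketch with assumed values N=20, s=7, a=2, b=3:

```python
# Grid check that the Beta(a, b) prior is conjugate to the binomial
# likelihood: likelihood x prior, normalized, matches Beta(s+a, N-s+b),
# whose mean is (s+a)/(N+a+b).
from math import gamma

def beta_pdf(x, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

N, s, a, b = 20, 7, 2.0, 3.0
grid = [i / 1000 for i in range(1, 1000)]

# unnormalized posterior: theta^s (1-theta)^(N-s) * prior
kern = [t**s * (1 - t)**(N - s) * beta_pdf(t, a, b) for t in grid]
norm = sum(kern) / 1000   # crude Riemann-sum normalization
post = [k / norm for k in kern]

target = [beta_pdf(t, s + a, N - s + b) for t in grid]
max_err = max(abs(p - q) for p, q in zip(post, target))
print(max_err < 1e-2, (s + a) / (N + a + b))  # True 0.36
```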

Priors: Hierarchical Models


• Bayesian methods can be effective in dealing with problems with a
large number of parameters. In these cases, it is convenient to think
about the prior for a vector of parameters in stages.

• Suppose that θ = (θ1, θ2, ..., θK) and λ is another parameter vector,
of lower dimension than θ. λ may be a parameter of a prior or a
random quantity. The prior p(θ) can be derived in stages:
p(θ, λ) = p(θ|λ) p(λ).
Then,
p(θ) = ∫ p(θ|λ) p(λ) dλ

We can write the joint as:

p(θ, λ, y) = p(y|θ, λ) p(θ|λ) p(λ).


Priors: Hierarchical Models


• We can think of the joint p(θ, λ, y) as the result of a Hierarchical (or
“Multilevel”) Model:
p(θ, λ, y) = p(y|θ, λ) p(θ, λ) = p(y|θ, λ) p(θ|λ) p(λ).

The prior p(θ, λ) is decomposed using a prior for the prior, p(λ), a
hyperprior. Under this interpretation, we call λ a hyperparameter.

• Hierarchical models can be very useful, since it is often easier to
work with conditional models than full joint models.

Example: In many stochastic volatility models, we estimate the time-
varying variance (Ht) along with other parameters (θ). We write the
joint as:
f(Ht, θ|Yt) ∝ f(Yt|Ht) f(Ht|θ) f(θ)

Priors: Hierarchical Models - Example


• Suppose we have i.i.d. Normal data, Yi ~ N(θ, σ²). We want to
learn about (θ, σ²) or, using h = 1/σ², φ = (θ, h). That is, we want to
get P(φ|y). We need a joint prior for φ.

It can be easier to work with P(φ) = P(θ|σ⁻²) P(σ⁻²).

For θ|σ⁻², we assume P(θ|σ⁻²) ~ N(θ0, σ0²), where σ0² = σ² M.

For σ², we assume an inverse gamma. Then, for h = σ⁻², we have a
gamma distribution, which is a function of (α, λ):

f(x; α, λ) = [λ^α / Γ(α)] x^(α−1) e^(−λx)  =>  f(h) = [(Φ/2)^(T/2) / Γ(T/2)] h^(T/2 − 1) e^(−(Φ/2) h)

where α = T/2 and λ = 1/(2η²) = Φ/2 are usual priors (η² is related to
the variance of the T N(0, η²) variables we are implicitly adding).


Priors: Hierarchical Models - Example


Then, the joint prior, P(φ), can be written as:

f(θ, σ⁻²) = (2πσ²M)^(-1/2) exp{−(θ − θ0)²/(2σ²M)} x [(Φ/2)^(T/2)/Γ(T/2)] (σ⁻²)^(T/2 − 1) e^(−(Φ/2)σ⁻²)

Aside: The Inverse Gamma for f()


• The usual prior for σ2 is the inverse-gamma. Recall that if X has a
Γ( , ) distribution, then 1/X has an inverse-gamma distribution with
parameters (shape) and -1 (scale). That is:

f ( x;  ,  )  x   1 e  (  / x ) x  0.
 ( )

• Then, h=1/σ2 is distributed as Γ( , ):



f ( x   2 ; ,  )  x 1 e  x x  0.
 ( )

• Q: Why do we choose the inverse-gamma prior for σ²?
(1) p(σ²) = 0 for σ² < 0.
(2) Flexible shapes for different values of α, λ –recall, when α = ν/2
and λ = ½, the gamma distribution becomes the χ²ν.
(3) Conjugate prior => the posterior of σ⁻²|X will also be Γ(α*, λ*).


Prior Information: In general, we need f()



• Inverse gamma’s pdf: f ( x;  ,  )  x   1 e  (  / x ) x  0.
 ( )
α =1, λ=1
α =2, λ=1
α =3, λ=1
α =3, λ = 0.5

• Mean[x] = λ/(α−1)  (α > 1).
Var[x] = λ²/[(α−1)² (α−2)]  (α > 2).

• A multivariate generalization of the inverse-gamma distribution is
the inverse-Wishart distribution.
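The inverse-gamma moments (Mean = λ/(α−1), Var = λ²/[(α−1)²(α−2)]) can be checked by simulation, using the fact that if h ~ Γ(shape α, rate λ) then 1/h is inverse-gamma. A sketch with assumed values α=6, λ=1:

```python
# Monte Carlo check of the inverse-gamma moments:
# draw h ~ Gamma(shape=alpha, rate=lam), then x = 1/h is inverse-gamma with
# Mean = lam/(alpha-1) and Var = lam^2/((alpha-1)^2 (alpha-2)).
import random

random.seed(0)
alpha, lam = 6.0, 1.0
n = 200_000
# random.gammavariate takes (shape, scale); a rate of lam means scale 1/lam
xs = [1.0 / random.gammavariate(alpha, 1.0 / lam) for _ in range(n)]

mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n
print(mean, var)  # close to 1/5 = 0.2 and 1/(25*4) = 0.01
```

α > 4 is used here so that the sample variance itself has finite sampling error.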

Prior Information: Intuition


• (Taken from Jim Hamilton.) Assume k=1. A student says: “there is
a 95% probability that β is between b ± 1.96 sqrt{σ² (X’X)⁻¹}.”

A classical statistician says: “No! β is a population parameter. It either
equals 1.5 or it doesn’t. There is no probability statement about β.”
“What is true is that if we use this procedure to construct an interval
in thousands of different samples, in 95% of those samples, our
interval will contain the true β.”

• OK. Then, we ask the classical statistician:
- “Do you know the true β?” “No.”
- “Choose between these options. Option A: I give you $5 now.
Option B: I give you $10 if the true β is in the interval between 2.5
and 3.5.” “I’ll take the $5, thank you.”


Prior Information: Intuition


• OK. Then, we ask the classical statistician, again:
- “Good. But, how about these? Option A: I give you $5 now.
Option B: I give you $10 if the true β is between -8.0 and +13.9.”
“OK, I’ll take option B.”

• Finally, we complicate the options a bit:
- “Option A: I generate a uniform number between 0 and 1. If the
number is less than π, I give you $5.
Option B: I give you $5 if the true β is in the interval (2.0, 4.0). The
value of π is 0.2.”
“Option B.”
- “How about if π = 0.8?”
“Option A.”

Prior Information: Intuition


• Under certain axioms of rational choice, there will exist a unique π*,
such that he chooses Option A if π > π*, and Option B otherwise.
Consider π* as the statistician’s subjective probability.

• We can think of π* as the statistician’s subjective probability that β
is in the interval (2.0, 4.0).


Posterior
• The goal is to say something about our subjective beliefs about θ;
say, the mean θ, after seeing the data (y). We characterize this with the
posterior distribution:
P(θ|y) = P(y|θ) P(θ)/P(y)

• The posterior is the basis of Bayesian estimation. It takes into
account the data (say, y & X) and our prior distribution (say, θ0).

• P(θ|y) is a pdf. It is common to describe it with the usual classical
measures. For example: the mean, median, variance, etc. Since they
are functions of the data, they are Bayesian estimators.

• Under a quadratic loss function, it can be shown that the posterior
mean, E[θ|y], is the optimal Bayesian estimator of θ.

Posterior: Example - Normal-Normal


• We have i.i.d. normal data: Yi ~ N(θ, σ²). Then, the likelihood:

L(θ|y, σ²) ∝ h^(T/2) exp{−(h/2) Σi (Yi − θ)²}

• Now, we need to specify priors. We need a joint prior: f(θ, σ²). In the
Normal-Normal model, we assume σ² known (usually, we work with
h = 1/σ²). Thus, we only specify a normal prior for θ: f(θ) ~ N(θ0, σ0²).

• σ0² states the confidence in our prior. Small σ0² shows confidence.

• In realistic applications, we add a prior for f(σ²). Usually, an inverse
gamma.

• Then, we determine the posterior: Likelihood x Prior:

f(θ|y, σ²) ∝ h^(T/2) exp{−(h/2) Σi (Yi − θ)²} x exp{−(θ − θ0)²/(2σ0²)}


Posterior: Example - Normal-Normal


• Or using the likelihood factorization:

f(θ|y, σ²) ∝ h^(T/2) exp{−(h/2)[(T−1)s² + T(θ − Ȳ)²]} x exp{−(θ − θ0)²/(2σ0²)}
= h^(T/2) exp{−(h/2)(T−1)s²} x exp{−(Th/2)(θ − Ȳ)²} x exp{−(θ − θ0)²/(2σ0²)}
= (1/σ²)^(T/2) exp{−(T−1)s²/(2σ²)} x exp{−(1/2)[(T/σ²)(θ − Ȳ)² + (θ − θ0)²/σ0²]}

• A little bit of algebra, using:

a(x − b)² + c(x − d)² = (a + c)(x − (ab + cd)/(a + c))² + [ac/(a + c)](b − d)²

we get for the 2nd expression inside the exponential:

(T/σ²)(θ − Ȳ)² + (θ − θ0)²/σ0² = [T/σ² + 1/σ0²](θ − θ̄)² + (Ȳ − θ0)²/(σ0² + σ²/T)

Posterior: Normal-Normal

(T/σ²)(θ − Ȳ)² + (θ − θ0)²/σ0² = (1/σ̄²)(θ − θ̄)² + (Ȳ − θ0)²/(σ0² + σ²/T)

where θ̄ = [(T/σ²) Ȳ + (1/σ0²) θ0] / [T/σ² + 1/σ0²]  and  σ̄² = 1/[T/σ² + 1/σ0²]

• Since we only need to include the terms in θ, then:

f(θ|y, σ²) ∝ (1/σ²)^(T/2) exp{−(T−1)s²/(2σ²)} x exp{−(θ − θ̄)²/(2σ̄²) − (Ȳ − θ0)²/[2(σ0² + σ²/T)]}
∝ exp{−(θ − θ̄)²/(2σ̄²)}

That is, the posterior is: N(θ̄, σ̄²)

• The posterior mean, θ̄, is the Bayesian estimator. It takes into
account the data (y) and our prior distribution. It is a weighted average
of our prior θ0 and Ȳ.


Posterior: Bayesian Learning


• Update formula for θ̄:

θ̄ = [(T/σ²) Ȳ + (1/σ0²) θ0] / [T/σ² + 1/σ0²] = ω Ȳ + (1 − ω) θ0

where ω = (T/σ²) / [T/σ² + 1/σ0²] = σ0² / (σ0² + σ²/T)

• The posterior mean is a weighted average of the usual estimator and
the prior mean, θ0.

Results:
- As T → ∞, the posterior mean θ̄ converges to Ȳ.
- As σ0² → ∞, our prior information is worthless.
- As σ0² → 0, complete certainty about our prior information.

This result can be interpreted as Bayesian learning, where we combine
our prior with the observed data. Our prior gets updated! The extent
of the update will depend on our prior distribution.
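The update formula and its three limiting cases can be sketched in a few lines (the numerical values of Ȳ, σ², and θ0 below are illustrative assumptions, not from the slides):

```python
# Normal-Normal update: thetabar = w*Ybar + (1-w)*theta0,
# with w = sigma0^2 / (sigma0^2 + sigma^2/T).
def posterior_mean(Ybar, T, sigma2, theta0, sigma2_0):
    w = sigma2_0 / (sigma2_0 + sigma2 / T)
    return w * Ybar + (1 - w) * theta0

Ybar, sigma2, theta0 = 2.0, 4.0, 7.0
for T, s2_0 in [(10, 1.0), (10_000, 1.0), (10, 1e6), (10, 1e-6)]:
    # large T or large sigma0^2 -> Ybar; small sigma0^2 -> theta0
    print(T, s2_0, round(posterior_mean(Ybar, T, sigma2, theta0, s2_0), 4))
```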

Posterior: Bayesian Learning


• As more information is known or released, the prior keeps changing.

Example: In R.
bayesian_updating <- function(data,mu_0,sigma2_0,plot=FALSE) {
require("ggplot2")
T = length(data) # length of data
xbar = mean(data) # mean of data
sigma2 = sd(data)^2 # variance of data

# Likelihood (Normal)
xx <- seq(xbar-2*sqrt(sigma2), xbar+2*sqrt(sigma2),sqrt(sigma2)/40)
yy <- 1/(sqrt(2*pi*sigma2/T))*exp(-1/(2 *sigma2/T)*(xx - xbar)^2 )
# yy <- 1/(xbar+4*sqrt(sigma2)-xbar+4*sqrt(sigma2))
df_likelihood <- data.frame(xx,yy,1) # store data
type <- 1
df1 <- data.frame(xx,yy,type)

# Prior (Normal)
xx <- seq(mu_0-4*sqrt(sigma2_0), mu_0+4*sqrt(sigma2_0),(sqrt(sigma2_0)/40))
yy = 1/(sqrt(2*pi*sigma2_0))*exp(-1/(2 *sigma2_0)*(xx - mu_0)^2)
type <- 2
df2 <- rbind(df1,data.frame(xx,yy,type))


Posterior: Bayesian Learning


Example (continuation):
# Posterior
omega <- sigma2_0/(sigma2_0 + sigma2/T)
pom = omega * xbar + (1-omega)*mu_0 # posterior mean
pov = 1/(T/sigma2 + 1/sigma2_0) # posterior variance
xx = seq(pom -4*sqrt(pov), pom + 4*sqrt(pov),(sqrt(pov)/40))
yy = 1/(sqrt(2 * pi * pov))*exp(-1./(2 *pov)* (xx - pom)^2 )
type <- 3
df3 <- rbind(df2,data.frame(xx,yy,type))
df3$type <- factor(df3$type,levels=c(1,2,3),
labels = c("Likelihood", "Prior", "Posterior"))

if(plot==TRUE){
return(ggplot(data=df3, aes(x=xx, y=yy, group=type, colour=type))
+ ylab("Density")
+ xlab("x")
+ ggtitle("Bayesian updating")
+ geom_line()+theme(legend.title=element_blank()))
} else {
Nor <- matrix(c(pom,pov), nrow=1, ncol=2, byrow = TRUE)
return(Nor)
}
}

Posterior: Bayesian Learning


Example (continuation):
dat <- 5*rnorm(20,0,sqrt(2)) # generate normal data T= 20, mean=0, var=50
# xbar = -2.117, σ2 = 54.27

# Scenario 1 – Precise prior (θ0=7, σ02 =2)


df <- bayesian_updating(dat,7,2,plot=TRUE) # priors mu_0=7, sigma2_0=2
df # ω = .4243, pom = 3.1314, pov = 1.1514

# Scenario 2 – Diffuse prior (θ0=7, σ02 =40)


df <- bayesian_updating(dat,7,40,plot=TRUE) # priors mu_0=7, sigma2_0=40
df # ω = .9365, pom = -1.5382, pov = 2.5411


Posterior: James-Stein Estimator


• Let xt ~ N(θt, σ²) for t = 1, 2, ..., T. Then, let the MLE (also OLS) be θ̂t = xt.
• Let m1, m2, ..., mT be any numbers.
• Define:
S = Σt (xt − mt)²
ω = 1 − [(T−2)σ²/S]
mt* = ω θ̂t + (1 − ω) mt

• Theorem: Under the previous assumptions,

E[Σt (θt − mt*)²] < E[Σt (θt − θ̂t)²]

Remark: Some kind of shrinkage can always reduce the MSE relative
to OLS/MLE.

Note: The Bayes estimator is the posterior mean of θ. This is a
shrinkage estimator.
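A quick simulation illustrates the theorem; the true θt values, the targets mt, and the sample sizes below are arbitrary choices for the sketch, not from the slides:

```python
# Simulated comparison: James-Stein-style shrinkage toward arbitrary
# targets m_t vs. the raw MLE x_t, in total squared error.
import random

random.seed(1)
T, sigma2, reps = 20, 1.0, 2000
theta = [random.uniform(-1, 1) for _ in range(T)]   # true (unknown) means
m = [0.0] * T                                       # arbitrary shrinkage targets

mse_mle = mse_js = 0.0
for _ in range(reps):
    x = [random.gauss(t, sigma2 ** 0.5) for t in theta]   # x_t is the MLE of theta_t
    S = sum((xt - mt) ** 2 for xt, mt in zip(x, m))
    w = 1 - (T - 2) * sigma2 / S
    m_star = [w * xt + (1 - w) * mt for xt, mt in zip(x, m)]
    mse_mle += sum((t - xt) ** 2 for t, xt in zip(theta, x))
    mse_js += sum((t - ms) ** 2 for t, ms in zip(theta, m_star))

print(mse_js < mse_mle)  # True: shrinkage beats the MLE in total MSE
```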

Predictive Posterior
• The posterior distribution of θ is obtained, after the data y is
observed, by Bayes’ Theorem:
P(θ|y) ∝ P(y|θ) x P(θ)

Suppose we have a new set of observations, z, independent of y
given θ. That is,
P(z, y|θ) = P(z|θ) x P(y|θ)
Then,
P(z|y) = ∫ P(z, θ|y) dθ = ∫ P(z|θ, y) P(θ|y) dθ
= ∫ P(z|θ) P(θ|y) dθ = E_{θ|y}[P(z|θ)]

P(z|y) is the predictive posterior distribution, the distribution of new
(unobserved) observations. It is equal to the expected value of the
distribution of the new data, z, taken over the posterior of θ|y.


Predictive Posterior: Example


Example: Player’s skills evaluation in sports.
Suppose the player is drafted. Before the debut, the coach observes
his performance in practices. Let Z be the performance in practices
(again, good or bad). Suppose Z depends on S as given below:
P(Z=good|S) = .95
P(Z=good|SC) = .10
(We have previously determined: P(S|T=g) = 0.72727.)

Using this information, the coach can compute predictive posterior


of Z, given T. For example, the coach can calculate the probability
of observing Z=bad, given T=good:
P(Z=b|T=g) = P(Z=b|T=g, SC) P(SC|T=g) + P(Z=b|T=g,S) P(S|T=g)
= P(Z=b|SC) P(SC|T=g) + P(Z=b|S) P(S|T=g)
= .90 x 0.27273 + .05 x 0.72727 = .28182
Note: Z and T are conditionally independent.
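The calculation above can be sketched directly, using only the conditional probabilities stated on the slide:

```python
# Predictive posterior for the drafted player (numbers from the slide).
p_S_T = 0.72727          # P(S | T = good), from the earlier update
p_Zb_S = 1 - 0.95        # P(Z = bad | S)   = .05
p_Zb_Sc = 1 - 0.10       # P(Z = bad | S^c) = .90

# Z and T are conditionally independent given S, so
# P(Z=b | T=g) = P(Z=b|S^c) P(S^c|T=g) + P(Z=b|S) P(S|T=g)
p_Zb_T = p_Zb_Sc * (1 - p_S_T) + p_Zb_S * p_S_T
print(round(p_Zb_T, 5))  # 0.28182
```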

Bayesian vs. Classical: Introduction


• The goal of a classical statistician is getting a point estimate for the
unknown fixed population parameter θ, say using OLS.

These point estimates will be used to test hypotheses about a model,
make predictions and/or to make decisions –say, consumer choices,
monetary policy, portfolio allocation, etc.

• In the Bayesian world, θ is unknown, but it is not fixed. A Bayesian
statistician is interested in a distribution, the posterior distribution,
P(θ|y); not a point estimate.

“Estimation:” Examination of the characteristics of P(θ|y):
- Moments (mean, variance, and other moments)
- Intervals containing specified probabilities


Bayesian vs. Classical: Introduction


• The posterior distribution will be incorporated in tests of hypothesis
and/or decisions.

In general, a Bayesian statistician does not separate the problem of
how to estimate parameters from how to use the estimates.

• In practice, classical and Bayesian inferences are often very similar.

• There are theoretical results under which both worlds produce the
same results. For example, in large samples, under a uniform prior, the
posterior mean will be approximately equal to the MLE.

Bayesian vs. Classical: Interpretation


• In practice, classical and Bayesian inferences and concepts are often
similar. But, they have different interpretations.

• Likelihood function
- In classical statistics, the likelihood is the density of the observed
data conditioned on the parameters.
- Inference based on the likelihood is usually “maximum
likelihood.”
- In Bayesian statistics, the likelihood is a function of the parameters
and the data that forms the basis for inference – not really a
probability distribution.
- The likelihood embodies the current information about the
parameters and the data.


Bayesian vs. Classical: Interpretation


• Confidence Intervals (C.I.)
- In a regular parametric model, the classical C.I. around MLEs –for
example, b ± 1.96 sqrt{s² (X’X)⁻¹}– has the property that whatever
the true value of the parameter is, with probability 0.95 the confidence
interval covers the true value, β.

- This classical C.I. can also be interpreted as an approximate
Bayesian probability interval. That is, conditional on the data and
given a range of prior distributions, the posterior probability that the
parameter lies in the confidence interval is approximately 0.95.

• The formal statement of this remarkable result is known as the
Bernstein-Von Mises theorem.

Bayesian vs. Classical: Bernstein-Von Mises Theorem
• Bernstein-Von Mises theorem:
- The posterior distribution converges to normal with covariance
matrix equal to 1/T times the information matrix --same as classical
MLE.

Note: The distribution that is converging is the posterior, not the
sampling distribution of the estimator of the posterior mean.

- The posterior mean (empirical) converges to the mode of the
likelihood function --same as the MLE. A proper prior disappears
asymptotically.
- Asymptotic sampling distribution of the posterior mean is the same
as that of the MLE.


Bayesian vs. Classical: Bernstein-Von Mises Theorem
• That is, in large samples, the choice of a prior distribution is not
important in the sense that the information in the prior distribution
gets dominated by the sample information.

That is, unless your prior beliefs are so strong that they cannot be
overturned by evidence, at some point the evidence in the data
outweighs any prior beliefs you might have started out with.

• There are important cases where this result does not hold, typically
when convergence to the limit distribution is not uniform, such as unit
roots. In these cases, there are differences between both methods.
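The "prior gets dominated" claim can be illustrated with the Normal-Normal update from the earlier slides: two sharply different priors produce nearly identical posterior means once T is large. The numerical values below are illustrative assumptions:

```python
# Two very different normal priors (theta0 = -5 vs +5) lead to almost
# the same posterior mean as T grows (normal-normal update formula).
def post_mean(Ybar, T, sigma2, theta0, sigma2_0):
    w = sigma2_0 / (sigma2_0 + sigma2 / T)
    return w * Ybar + (1 - w) * theta0

Ybar, sigma2 = 1.0, 4.0
for T in (10, 100, 10_000):
    m1 = post_mean(Ybar, T, sigma2, theta0=-5.0, sigma2_0=1.0)
    m2 = post_mean(Ybar, T, sigma2, theta0=+5.0, sigma2_0=1.0)
    print(T, round(m1, 3), round(m2, 3))  # the gap shrinks with T
```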

Linear Model: Classical Setup


• Let’s consider the simple linear model:
yt = Xt′β + εt,  εt|Xt ~ N(0, σ²)
To simplify derivations, assume X is fixed. We want to estimate β.

• Classical OLS (MLE = MM) estimation:
b = (X′X)⁻¹X′y and b|X ~ N(β, σ²(X′X)⁻¹)
- The estimate of σ² is s² = (y − Xb)′(y − Xb)/(T−k)
- The uncertainty about b is summarized by the regression coefficients’
standard errors –i.e., the diagonal of the matrix: Var(b|X) = s²(X′X)⁻¹.

• Testing: If Vkk is the k-th diagonal element of Var(b|X), then
(bk − βk⁰)/Vkk^(1/2) ~ tT−k --the basis for hypothesis tests.
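The classical quantities on this slide can be sketched without any linear-algebra library for the k = 2 case (intercept plus one regressor); the true coefficients and error variance below are simulation assumptions:

```python
# Classical OLS: b = (X'X)^{-1} X'y, s^2 = e'e/(T-k), Var(b|X) = s^2 (X'X)^{-1}.
import random

random.seed(3)
T, k = 200, 2
beta = [1.0, 0.5]                                   # true coefficients (assumed)
X = [[1.0, random.gauss(0, 1)] for _ in range(T)]   # intercept + one regressor
y = [beta[0] * x[0] + beta[1] * x[1] + random.gauss(0, 1) for x in X]

# X'X and X'y, assembled by hand for the 2x2 case
Sxx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
Sxy = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]

# b = (X'X)^{-1} X'y via the closed-form 2x2 inverse
det = Sxx[0][0] * Sxx[1][1] - Sxx[0][1] * Sxx[1][0]
inv = [[Sxx[1][1] / det, -Sxx[0][1] / det],
       [-Sxx[1][0] / det, Sxx[0][0] / det]]
b = [sum(inv[i][j] * Sxy[j] for j in range(k)) for i in range(k)]

resid = [yi - (b[0] * x[0] + b[1] * x[1]) for x, yi in zip(X, y)]
s2 = sum(e * e for e in resid) / (T - k)
var_b = [[s2 * inv[i][j] for j in range(k)] for i in range(k)]

print([round(bi, 2) for bi in b], round(s2, 2))
```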


Linear Model: Bayesian Setup


• For the normal linear model, we assume f(yt|μt, σ²):
yt ~ N(μt, σ²) for t = 1, ..., T
where μt = β0 + β1X1t + … + βkXkt = Xt′β

Bayesian goal: Get the posterior distribution of the parameters (β, σ²).

• By Bayes’ Rule, we know that this is simply:
f(β, σ²|y, X) ∝ Πt f(yt|μt, σ²) x f(β, σ²)
=> we need to choose a prior distribution for f(β, σ²).

• To simplify derivations, we assume X is fixed (k=1) and σ² is known.
That is, we only want to estimate β.

Linear Model: Prior Distribution


• Since we assume σ² is known, we only care about f(β) –or,
technically, f(β|σ²). We assume f(β) ~ N(m, ψ²):

f(β) = (2πψ²)^(-1/2) exp{−(β − m)²/(2ψ²)}

- m is our best guess for β, before seeing y and X.
- ψ measures the confidence in our guess. It is common to relate ψ to
σ², say ψ = sqrt{σ²M}.

• This assumption for f(β) gives us some flexibility: Depending on ψ,
this prior can be informative (small ψ) or diffuse (big ψ). In addition, it
is a conjugate prior!

• But, we could have assumed a different prior distribution, say a
uniform. Remember, priors are the Achilles heel of Bayesian statistics.


Linear Model: In general, we need f()


• We assumed σ2 is known. But, in realistic applications, we need a
prior for σ2. The usual prior for σ2 is the inverse-gamma. Then, h=1/σ2
is distributed as Γ( , ):

f ( x   2 ; ,  )  x 1 e  x x  0.
( )

• Usual values for ( , ): =T/2 and =1/(2 2)=Φ/2, where 2 is


related to the variance of the T N(0, 2) variables we are implicitly
adding. You may recognize this parameterization of the gamma as a
non-central T2 distribution. Then,
T
2 ( / 2) T / 2 ( 1) 2
f ( )  (  2 ) 2 e  (  / 2 ) 
(T / 2)

Linear Model: In general, we need f()


Note: Under the scenario “σ2 unknown”, we have =(β,2). We need
the joint prior P( ) along with the likelihood, P(y| ), to obtain the
posterior P( |y).

In this case, we can write P( ) = P(β|-2) P(-2):


1/ 2 ( β m)2
( / 2)
T /2
 1   2
f ( β, 2
)  e 2 2 M
x (  2 ) (T / 2 1) e  (  / 2 )
 2 M   (T / 2 )
2

Then, we write the posterior as usual: P( |y) ∝ P(y| ) P( ).


Linear Model: Assumptions


• So far, we have made the following assumptions:
- Data is i.i.d. Normal: yt ~ N(μt, σ²) for t = 1, ..., T
- DGP for μt is known: μt = β0 + β1X1t + … + βkXkt = Xt′β
- X is fixed (k=1).
- σ² is known.
- The prior distribution is given by f(β) ~ N(m, ψ²).

Linear Model: Likelihood


• The likelihood function represents the probability of the data, given
fixed values for the parameters θ. That is, P(y|θ). It’s assumed known.

• In our linear model yt = Xt′β + εt, with εt ~ N(0, σ²).

• We assume the X’s are fixed (and σ² is still known). Then,

f(y|X, β, σ²) = (2πσ²)^(-T/2) exp{−Σt (yt − Xt′β)²/(2σ²)}

• Recall that we can write: y − Xβ = (y − Xb) − X(β − b)
Then,
(y − Xβ)′(y − Xβ) = (y − Xb)′(y − Xb) + (β − b)′X′X(β − b) − 2(β − b)′X′(y − Xb)
= υs² + (β − b)′X′X(β − b)

where s² = RSS/(T−k) = (y − Xb)′(y − Xb)/(T−k); υ = (T−k); and the
cross-product term vanishes because X′(y − Xb) = 0.


Linear Model: Likelihood


• Using (y − Xβ)′(y − Xβ) = υs² + (β − b)′X′X(β − b),

the likelihood can be factorized as:

f(y|X, β, σ²) = (2π)^(-T/2) (1/σ²)^(T/2) exp{−(β − b)′X′X(β − b)/(2σ²)} x exp{−υs²/(2σ²)}
∝ h^(T/2) exp{−(h/2)(β − b)′X′X(β − b)} x exp{−(h/2)υs²}

• The likelihood can be written as a product of a normal density (in β)
and a kernel of the form f(σ²) ∝ (σ²)^(-υ/2) exp{−υs²/(2σ²)}. The latter
is an inverted gamma distribution.

Linear Model: In general, we need f(X|).


• There is a subtle point regarding this Bayesian regression setup.

A full Bayesian model includes a distribution for the independent


variable X, f(X|). Thus, we have a joint likelihood f(y,X|,,)
and joint prior f(,,).

• The fundamental assumption of the normal linear model is that


f(y|X, , ) and f(X|) are independent in their prior distributions,
such that the posterior distribution factors into:

f(, , |y,X) = f(, |y,X) f(|y,X)

=> f(, |y,X)  f(, ) f(y| , ,X)


Linear Model: Joint Distribution


• We can also write the joint probability distribution of y and β. This
joint pdf characterizes our joint uncertainty about β and the data (y):

f(y, β|X, σ²) = f(y|X, β, σ²) f(β|σ²)

• Then, we need the likelihood and the prior:

f(y|X, β, σ²) = (2π)^(-T/2) h^(k/2) exp{-(h/2) (β – b)'X'X(β – b)} x h^(υ/2) exp{-h υ s²/2}

f(β) = (2πψ²)^(-1/2) exp{-(β – m)²/(2ψ²)}

• Let A = ψ²/σ². Then, the joint distribution is:

f(y, β|X, σ²) = exp{-(h/(2A)) (β – m)²} exp{-(h/2) (β – b)'X'X(β – b)} exp{-h υ s²/2} / [(2π)^((T+1)/2) σ^T ψ]

Linear Model: Posterior


• The goal is to say something about our subjective beliefs about θ (in
our linear example, β) after seeing the data (y). We characterize this
with the posterior distribution:

P(θ|y) = P(y ∩ θ)/P(y) = P(y|θ) P(θ)/P(y)

• In our linear model, we determine the posterior with:

f(β|y, X, σ²) = f(y|X, β, σ²) f(β|σ²) / f(y|X, σ²)
             = f(y|X, β, σ²) f(β|σ²) / ∫ f(y|X, β, σ²) f(β|σ²) dβ

• One way to find this posterior distribution is by brute force
(integrating and dividing). Alternatively, we can factor the joint
density into a component that depends on θ and a component that
does not. The denominator does not depend on θ.
Then, we just concentrate on factoring the joint pdf.


Linear Model: Posterior


• We can ignore the denominator. Then,

f(y, β|X, σ²) = exp{-(h/(2A)) (β – m)²} exp{-(h/2) (β – b)'X'X(β – b)} exp{-h υ s²/2} / [(2π)^((T+1)/2) σ^T ψ]

where A = ψ²/σ². The joint pdf factorization, excluding constants:

f(β|y, X, σ²) ∝ h^(T/2) exp{-(h/2) (β – b)'X'X(β – b)} x h^(υ/2) exp{-h υ s²/2} x
              x ψ^(-1) exp{-(h/2) (β – m)'A^(-1)(β – m)}
            ∝ h^(1/2) |A^(-1) + X'X|^(1/2) exp{-(h/2) (β – m*)'(X'X + A^(-1))(β – m*)}

where m* = (A^(-1) + X'X)^(-1)(A^(-1) m + X'X b). (For derivations, see
Hamilton (1994), Ch. 12.)

Linear Model: Posterior


f(β|y, σ², X) ∝ h^(1/2) |A^(-1) + X'X|^(1/2) exp{-(h/2) (β – m*)'(X'X + A^(-1))(β – m*)}

where m* = (A^(-1) + X'X)^(-1)(A^(-1) m + (X'X) b).

In other words, the pdf of β, conditioning on the data, is normal with
mean m* and variance matrix σ² (X'X + A^(-1))^(-1).

• To get to this result, we have assumed an i.i.d. normal distribution for
(y|X, σ²), and for β we chose a normal prior distribution.

• Now, the posterior belongs to the same family (normal) as the prior.
The normal (conjugate) prior was a very convenient choice.

• The mean m* takes into account the data (X and y) and our prior
distribution. It is a weighted average of our prior m and b (OLS).


Linear Model: Bayesian Learning


• m* = (A^(-1) + X'X)^(-1)(A^(-1) m + (X'X) b).

• We have Bayesian learning: We combine our prior with the data. As


more information is known or released, the prior keeps changing.

• Recall that A = ψ²/σ². Then, if our prior distribution has a large
variance ψ² (a diffuse prior), our prior mean, m, will have a lower weight.

As ψ → ∞, m* → b    --our prior information is worthless.

As ψ → 0, m* → m    --complete certainty about our prior info.

• Note that with a diffuse prior, we can now say:

“Having seen the data, there is a 95% probability that β is in the interval b ±
1.96 sqrt{σ² (X'X)^(-1)}.”
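
As a quick numerical illustration of this weighting (a sketch in R with made-up values for b, m, and the prior variance, not lecture data):

```r
# m* = (A^-1 + X'X)^-1 (A^-1 m + X'X b), with A = psi^2/sigma^2.
set.seed(42)
X <- cbind(1, rnorm(50))        # fixed regressors (k = 2)
XtX <- crossprod(X)             # X'X
b <- c(0.8, 1.9)                # OLS estimate (assumed)
m <- c(0, 0)                    # prior mean (assumed)
m.star <- function(A) solve(solve(A) + XtX, solve(A) %*% m + XtX %*% b)
m.star(diag(2) * 1e6)           # diffuse prior (large psi^2): close to b
m.star(diag(2) * 1e-6)          # dogmatic prior (small psi^2): close to m
```

The two calls show the two limits above: as the prior variance grows, m* collapses to the OLS estimate b; as it shrinks, m* collapses to the prior mean m.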

Linear Model: Remarks


• We can do similar calculations when we impose another prior on σ.
But, the results would change.

• We get exact results because we made clever distributional
assumptions on the data. If exact results are not possible, numerical
solutions will be used.

• For example, if we assume a gamma for h, with the usual
assumptions for its parameters (T, λ), the joint prior becomes complicated:

f(θ = (β, σ²)) = |2πσ²M|^(-1/2) exp{-(1/(2σ²)) (β – m)'A^(-1)(β – m)} x
                 x (λ/2)^(T/2)/Γ(T/2) (σ^-2)^(T/2 - 1) exp{-(λ/2) σ^-2}


Linear Model: Remarks


• When we set up our probability model, we are implicitly
conditioning on a model, call it H, which represents our beliefs about
the data-generating process. Thus,

f(β, σ²|y, X, H) ∝ f(β, σ²|H) f(y|β, σ², X, H)

It is important to keep in mind that our inferences are dependent on H.

• This is also true for the classical perspective, where results can be
dependent on the choice of likelihood function, covariates, etc.

Linear Model: Interpretation of Priors


• Suppose we had an earlier sample, {y′, X′}, of T′ observations,
which are independent of the current sample, {y, X}.

• The OLS estimate based on all the information available is:

b* = [Σt=1..T xt xt' + Σs=1..T′ x′s x′s']^(-1) [Σt=1..T xt yt + Σs=1..T′ x′s y′s]

and the variance is

Var[b*] = σ² [Σt=1..T xt xt' + Σs=1..T′ x′s x′s']^(-1)

• Let m be the OLS estimate based on the prior sample {y′, X′}:

m = [Σs=1..T′ x′s x′s']^(-1) Σs=1..T′ x′s y′s    and    Var[m] = σ² [Σs=1..T′ x′s x′s']^(-1) = σ² A


Linear Model: Interpretation of Priors


• Then,

b* = [Σt=1..T xt xt' + Σs=1..T′ x′s x′s']^(-1) [Σt=1..T xt yt + Σs=1..T′ x′s y′s]
   = [Σt=1..T xt xt' + A^(-1)]^(-1) [Σt=1..T xt yt + A^(-1) m]
• This is the same formula as for the posterior mean m*.

• Thus, the question is: what priors should we use?

• There are a lot of publications using the same data. To form priors,
we cannot use the results of previous research if they are based on the
same (or a correlated) sample!

The Linear Regression Model – Example 1


• Let’s go over the multivariate linear model. Now, we impose a diffuse
uniform prior for θ = (β, h). Say, f(β, h) ∝ h^(-1).

Now, f(θ|y, X) ∝ h^(T/2) exp{-(h/2) [υs² + (β – b)'X'X(β – b)]} x h^(-1)

• If we are interested in β, we can integrate out the nuisance parameter
h to get the marginal posterior f(β|y, X):

f(β|y, X) ∝ ∫ h^(T/2 - 1) exp{-(h/2) [υs² + (β – b)'X'X(β – b)]} dh
          ∝ [1 + (β – b)'X'X(β – b)/(υs²)]^(-T/2)

where we use the following integral result (Γ(s, x): the incomplete Γ):

∫ x^a exp{-bx} dx = b^(-(a+1)) [Γ(a+1) – Γ(a+1, bx)]


The Linear Regression Model – Example 1


• The marginal posterior

f(β|y, X) ∝ [1 + (β – b)'X'X(β – b)/(υs²)]^(-T/2)

is the kernel of a multivariate t distribution. That is,

f(β|y, X) = t_υ(β|b, s²(X'X)^(-1))

Note: This is the equivalent of the repeated-sample distribution of b.

• Similarly, we can get f(h|y, X) by integrating out β:

f(h|y, X) ∝ ∫ h^(T/2 - 1) exp{-(h/2) [υs² + (β – b)'X'X(β – b)]} dβ
          = h^(T/2 - 1) exp{-(h/2) υs²} ∫ exp{-(h/2) (β – b)'X'X(β – b)} dβ
          ∝ h^(υ/2 - 1) exp{-(h/2) υs²}

which is the kernel of a Γ(α, λ) distribution, with α = υ/2 and λ = υs²/2.

The Linear Regression Model – Example 1


• The mean of a Γ(α, λ) distribution is α/λ. Then,
E[h|y, X] = [υ/2]/[υs²/2] = 1/s².

• Now, we interpret the prior f(β, h) ∝ h^(-1) as non-informative: The
marginal posterior distributions have properties closely resembling the
corresponding repeated-sample distributions.
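
As a sanity check on E[h|y, X] = 1/s², we can draw from the Γ(υ/2, υs²/2) posterior in R (the values of υ and s² below are assumed):

```r
# Marginal posterior of h is Gamma(shape = v/2, rate = v*s2/2),
# so its mean is (v/2)/(v*s2/2) = 1/s2.
set.seed(7)
v <- 40; s2 <- 2.5                            # assumed values
h.draws <- rgamma(100000, shape = v/2, rate = v * s2 / 2)
mean(h.draws)                                 # approx 1/s2 = 0.4
```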


The Linear Regression Model – Example 2


• Let’s go over the multivariate linear model. Now, we impose a diffuse
uniform prior for β and an inverse gamma for σ².

Likelihood:

L(β, σ²|y, X) = [2πσ²]^(-n/2) e^(-(1/(2σ²)) (y – Xβ)'(y – Xβ))

Transformation using d = (N – K) and s² = (1/d)(y – Xb)'(y – Xb):

(1/σ²)(y – Xβ)'(y – Xβ) = (1/σ²) ds² + (β – b)'[σ²(X'X)^(-1)]^(-1)(β – b)

Diffuse uniform prior for β, conjugate (inverted) gamma prior for σ².

Joint posterior:

f(β, σ²|y, X) ∝ (1/σ²)^((d+2)/2) e^(-ds² (1/(2σ²))) x
              x [2π]^(-K/2) |σ²(X'X)^(-1)|^(-1/2) exp{-(1/2)(β – b)'[σ²(X'X)^(-1)]^(-1)(β – b)}

The Linear Regression Model – Example 2


• From the joint posterior, we can get the marginal posterior for β,
after integrating σ² out of the joint posterior:

f(β|y, X) ∝ [ds² + (β – b)'X'X(β – b)]^(-(d+K)/2)

This is a multivariate t with mean b and variance matrix
[(n – K)/(n – K – 2)] s²(X'X)^(-1).

The Bayesian 'estimator' equals the MLE. Of course: the prior was
noninformative. The only information available is in the likelihood.


Linear Model: Complicated Posteriors


• So far, we got exact results because we made clever distributional
assumptions on the data. If exact results are not possible, numerical
methods will be used to say something about the posterior.

• For example, if we assume a normal for β and a gamma for h, with
the usual assumptions for its parameters (T, λ), the joint prior becomes:

f(θ = (β, σ²)) = |2πσ²M|^(-1/2) exp{-(1/(2σ²)) (β – m)'A^(-1)(β – m)} x
                 x (λ/2)^(T/2)/Γ(T/2) (σ^-2)^(T/2 - 1) exp{-(λ/2) σ^-2}

Linear Model: Complicated Posteriors


• Now, the posterior:

f(θ|y, X) ∝ h^(K/2) exp{-(h/2) (β – b)'X'X(β – b)} x h^(υ/2) exp{-h υ s²/2} x
           x (h/(2πM))^(1/2) exp{-(h/2) (β – m)'A^(-1)(β – m)} x
           x (λ/2)^(T/2)/Γ(T/2) h^(T/2 - 1) e^(-(λ/2) h)

         ∝ h^(T-1) exp{-(h/2) [υs² + (β – b)'(X'X)(β – b)]} x
           x exp{-(h/2) (β – m)'A^(-1)(β – m)} e^(-(λ/2) h)

• This joint posterior is complicated and it does not lead to convenient
expressions for the marginals of β and h. We will use numerical
methods to say something about the joint. But, we can derive
analytical expressions for the conditional posteriors f(β|y, X, h) and f(h|y, X, β).


Linear Model: Complicated Posteriors


• The conditional posteriors play a very important role in some
numerical methods (simulations) to get joint (full) posteriors.

• For example, let’s derive f(β|y, X, h):

f(β|y, X, h) ∝ exp{-(h/2) [(β – b)'(X'X)(β – b) + (β – m)'A^(-1)(β – m)]}

• After some algebra (including results for quadratic forms), we get:

f(β|y, X, h) ∝ exp{-(1/2) (β – m*)'Σ*^(-1)(β – m*)}

where m* = Σ*(hX'y + hA^(-1)m) and Σ* = (hX'X + hA^(-1))^(-1). That is, we get a

• Similar work for f(h|y,X,β) delivers a gamma distribution.

Presentation of Results
• P(θ|y) is a pdf. For the simple case of one parameter, θ, it can be
graphed. But, if θ is a vector of many parameters, the multivariate pdf
cannot be presented in a single graph.

• It is common to present measures analogous to classical point
estimates and confidence intervals. For example:
(1) E(θ|y) = ∫ θ p(θ|y) dθ                      -- posterior mean
(2) Var(θ|y) = E(θ²|y) – {E(θ|y)}²              -- posterior variance
(3) P(k1 < θ < k2|y) = ∫_{k1<θ<k2} p(θ|y) dθ    -- confidence interval

• In general, it is not possible to evaluate these integrals analytically,
and we must rely on numerical methods.


Presentation of Results: MC Integration



Presentation of Results: IS – Example


Example: We want to use importance sampling (IS) to evaluate the
integral of x^(-1/2) over the range [0,1]:
# Without importance sampling
set.seed(90)
T.sim <- 1000
lambda <- 3
X <- runif(T.sim, 0.001, 1)
h <- X^(-0.5) # h(x)
c( mean(h), var(h) )

# Importance sampling Monte Carlo

w <- function(x) dunif(x, 0.001, 1)/dexp(x, rate=lambda) * pexp(1, rate=lambda) # [pi(x)/q(x)]
h_f <- function(x) x^(-0.5) # h(x)
X <- rexp(T.sim, rate=lambda)
X <- X[X <= 1]
Y.h <- w(X)*h_f(X)
c( mean(Y.h), var(Y.h) )

Note: Make sure that q(x) is a well defined pdf –i.e., it integrates to 1.
This is why above we use q(x)= dexp(x,lambda)/ pexp(1,lambda).

Simulation Methods
• Q: Do we need to restrict our choices of prior distributions to these
conjugate families? No. The posterior distributions are well defined
irrespective of conjugacy. Conjugacy only simplifies computations.

• If you are outside the conjugate families, you typically have to resort
to numerical methods for calculating posterior moments.

• If the model is no longer linear, it may be very difficult to get


analytical posteriors.

• What do we do in these situations? We simulate the behavior of
the model.


Simulation Methods: Nonlinear Models


• It is possible to do Bayesian estimation and inference over
parameters in a nonlinear model.

• Steps:
1. Parameterize the model
2. Propose the likelihood conditioned on the parameters
3. Propose the priors – joint prior for all model parameters
4. As usual, the posterior is proportional to likelihood times prior.
(Usually requires conjugate priors to be tractable.)
5. Sample –i.e., draw observations- from the posterior to study its
characteristics.

Numerical Methods
• Sampling from the joint posterior P(θ|y) may be difficult or
impossible. For example, in the linear model, assuming a normal prior
for β and an inverse-gamma prior for σ², we get a complicated joint
posterior distribution for (β, σ²).

• To do simulation-based estimation, we need joint draws on (β, σ²).

But, if P(θ|y) is complicated ⇒ we cannot easily draw from it.

• For these situations, many methods have been developed that make
the process easier, including Gibbs sampling, Data Augmentation, and the
Metropolis-Hastings algorithm.

• All three are examples of Markov Chain-Monte Carlo (MCMC)


methods.


Numerical Methods: MCMC Preliminaries


• Monte Carlo (first MC): A simulation. We take quantities of interest
of a distribution from simulated draws from the distribution.

Example: Monte Carlo integration

Suppose we have a distribution p(θ) (say, a posterior) that we want to
take quantities of interest from. We can evaluate the integral
analytically, I:

I = ∫ g(θ) p(θ) dθ

where g(θ) is some function of θ (g(θ) = θ for the mean, and
g(θ) = [θ − E(θ)]² for the variance).

We can approximate the integrals via Monte Carlo integration by
simulating M values from p(θ) and calculating:

IM = (1/M) Σi=1..M g(θi)

Numerical Methods: MCMC Preliminaries


For example, we can compute the expected value of the Beta(3,3)
distribution analytically (= 3/(3+3) = 0.5), or via Monte Carlo methods (R code):


> M <- 10000
> beta.sims <- rbeta(M, 3, 3)
> sum(beta.sims)/M
[1] 0.4981763

• From the LLN, the Monte Carlo approximation IM is a consistent
(simulation) estimator of the true value I. That is, IM → I, as M → ∞.

Q: But, to apply the LLN we need independence. What happens if we


cannot generate independent draws?


Numerical Methods: MCMC Preliminaries


• Q: But, to apply the LLN we need independence. What happens if
we cannot generate independent draws?

Suppose we want to draw from our posterior distribution p(θ|y), but
we cannot sample independent draws from it.

However, we may be able to sample draws from p(θ|y) that are
“slightly” dependent.

If we can sample slightly dependent draws using a Markov chain, then
we can still find quantities of interest from those draws.

Note: We will use this method of integration when we cannot do it in


a simpler/easier way.

Numerical Methods: MCMC Preliminaries


• Monte Carlo (first MC): A simulation.

• Markov Chain (the second MC): A stochastic process in which


future states are independent of past states given the present state.

• Recall that a stochastic process is a consecutive set of random


quantities defined on some known state space, Θ.
- Θ: our parameter space
- Consecutive implies a time component, indexed by t.

• A draw θt describes the state at time (iteration) t. The next draw θt+1
is dependent only on θt. This is because of the Markov property:
p(θt+1|θt) = p(θt+1|θt, θt-1, θt-2, ..., θ1)


Numerical Methods: MCMC Preliminaries


• The state of a Markov chain (MC) is a random variable indexed by t,
say, θt. The state distribution is the distribution of θt, pt(θ).

A stationary distribution of the chain is a distribution π such that, if
pt(θ) = π  =>  pt+s(θ) = π for all s.

• Under certain conditions a chain will have the following properties:
- A unique stationary distribution.
- Convergence to that stationary distribution π as t → ∞.
- Ergodicity. That is, averages of successive realizations of θ will
converge to their expectations with respect to π.

A lot of research has been devoted to establishing these conditions.

MCMC – Ergodicity (P. Lam)


• Usual certain conditions for ergodicity
The Markov chain is aperiodic, irreducible (it is possible to go from any
state to any other state), and positive recurrent (eventually, we expect to
return to a state in a finite amount of time).

Ergodic Theorem
• Let θ1, θ2, θ3, ..., θM be M values from a Markov chain that is aperiodic,
irreducible, and positive recurrent –i.e., the chain is ergodic–, and E[g(θ)] < ∞.
Then, with probability 1:
(1/M) Σ g(θi) → ∫Θ g(θ) p(θ) dθ

This is the Markov chain analog to the SLLN, and it allows us to


ignore the dependence between draws of the Markov chain when we
calculate quantities of interest from the draws.


MCMC - Ergodicity (P. Lam)


• Aperiodicity
A Markov chain is aperiodic if the only length of time for which the
chain repeats some cycle of values is the trivial case with cycle length
equal to one.

Let A, B, and C denote the states (analogous to the possible values of
θ) in a 3-state Markov chain. The following chain is periodic with
period 3, where the period is the number of steps that it takes to
return to a certain state.

As long as the chain is not repeating an identical cycle, then the


chain is aperiodic.

MCMC: Markov Chain - Example


• A chain is characterized by its transition kernel, whose elements
provide the conditional probabilities of θt+1 given the values of θt.
• The kernel is denoted by P(x, y). (The rows add up to 1.)

Example: Employees at t = 0 are distributed over two plants A & B:

π0' = [A0  B0] = [100  100]

The employees stay and move between A & B according to P:

P = | PAA  PAB | = | .7  .3 |
    | PBA  PBB |   | .4  .6 |

At t = 1, the number of employees at A & B is given by:

π1' = [A1  B1] = π0' P = [100  100] | .7  .3 | = [.7*100 + .4*100,  .3*100 + .6*100] = [110  90]
                                    | .4  .6 |


MCMC: Markov Chain - Example


At t = 2:

π2' = [A2  B2] = π1' P = π0' P² = [110  90] | .7  .3 | = [113  87]
                                            | .4  .6 |

After t = k years: πk' = [Ak  Bk] = π0' P^k

Note: Under certain conditions, as t → ∞, pt converges to the
stationary distribution. That is, π = π P.
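
The two-plant example is easy to check in R; iterating the chain also shows the convergence to the stationary distribution π = πP:

```r
# Transition matrix of the employee example
P <- matrix(c(.7, .3,
              .4, .6), nrow = 2, byrow = TRUE)
pi0 <- c(100, 100)
pi0 %*% P                    # t = 1: [110 90]
pi0 %*% P %*% P              # t = 2: [113 87]
pik <- pi0
for (k in 1:200) pik <- pik %*% P
pik / sum(pik)               # stationary shares: [4/7, 3/7] = approx [.571, .429]
```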

MCMC: General Idea


• We construct a chain, or sequence of values, θ0, θ1, ..., such that
for large k, θk can be viewed as a draw from the posterior distribution
of θ, p(θ|X), given the data X = {X1, ..., XN}.

• This is implemented through an algorithm that, given a current
value of the parameter vector θk, and given X, draws a new value θk+1
from a distribution f(.) indexed by θk and the data:

θk+1 ∼ f(θ|θk, X)

• We do this in a way that if the original θk came from the posterior
distribution, then so does θk+1. That is,

θk|X ∼ p(θ|X)  =>  θk+1|X ∼ p(θ|X).


MCMC: General Idea


• In many cases, irrespective of where we start, that is, irrespective of
θ0, as k → ∞, it will be the case that the distribution of the parameter,
conditional only on the data, X, converges to the posterior
distribution:

θk|X →d p(θ|X)

• Then, just pick a θ0 and approximate the mean and standard
deviation of the posterior distribution as:

E[θ|X] = 1/(K − K0 + 1) Σk=K0..K θk
Var(θ|X) = 1/(K − K0 + 1) Σk=K0..K {θk − E(θ|X)}²

• Usually, the first K0 − 1 iterations are discarded to let the algorithm
converge to the stationary distribution without the influence of the
starting value, θ0 (burn-in).

MCMC: Burn-in (P. Lam)


• As a matter of practice, most people throw out a certain number of
the first draws, known as the burn-in. This is to make our draws
closer to the stationary distribution and less dependent on the starting
point.

• Think of it as a method to pick initial values.

• However, it is unclear how much we should burn-in since our draws


are all slightly dependent and we don’t know exactly when
convergence occurs.

• Not a lot of theory about it.


MCMC: Thinning the Chain (P. Lam)


• In order to break the dependence between draws in the Markov
chain, some have suggested only keeping every dth draw of the chain.
That is, we keep the M draws {θd, θ2d, θ3d, ..., θMd}.

• This is known as thinning.


- Pros:
- We may get a little closer to i.i.d. draws.
- It saves memory since you only store a fraction of the draws.
- Cons:
- It is unnecessary with ergodic theorem.
- Shown to increase the variance of your MC estimates.

MCMC - Remarks
• In classical stats, we usually focus on finding the stationary
distribution, given a Markov chain.

• MCMC methods turn the theory around: The invariant density is


known (maybe up to a constant multiple) --it is the target density, π(.),
from which samples are desired–, but the transition kernel is
unknown.

• To generate samples from π(.), the methods find and utilize a


transition kernel P(x, dy) whose Kth iteration converges to π(.) for
large K.


MCMC - Remarks
• The process is started at an arbitrary x and iterated a large number
of times. After this, the distribution of the observations generated
from the simulation is approximately the target distribution.

• Then, the problem is to find an appropriate P(x, dy) that works!

• Once we have a Markov chain that has converged to the stationary


distribution, then the draws in our chain appear to be like draws from
p( |y), so it seems like we should be able to use Monte Carlo
Integration methods to find quantities of interest.

• Our draws are not independent, which we required for MC


Integration to work (remember LLN). For dependent draws, we rely
on the Ergodic Theorem.

MCMC: Gibbs Sampling


• When we can sample directly from the conditional posterior
distributions, the algorithm is known as Gibbs Sampling.

• The Gibbs sampler partitions the vector of parameters θ into two
(or more) blocks or parts, say θ = (θ1, θ2, θ3). Instead of sampling θk+1
directly from the (known) joint conditional distribution
f(θ|θk; X),
it may be easier to sample from the (known) full conditional
distributions, p(θj|θ-j,k; X) (θ-j = {θl, l ≠ j}):
- first sample θ1,k+1 from p(θ1|θ2,k, θ3,k; X),
- then sample θ2,k+1 from p(θ2|θ1,k+1, θ3,k; X),
- then sample θ3,k+1 from p(θ3|θ1,k+1, θ2,k+1; X).

• It is clear that if θk is from the posterior distribution, then so is θk+1.


MCMC: Gibbs Sampling (P. Lam)


• Q: How can we know the joint distribution simply by knowing the
full conditional distributions?
A: The Hammersley-Clifford Theorem shows that we can write the
joint density, p(x, y), in terms of the conditionals p(x|y) and p(y|x).

• Then, how do we figure out the full conditionals?

Suppose we have a posterior p(θ|y). To calculate the full conditionals
for each θ, do the following:
1. Write out the full posterior ignoring constants of proportionality.
2. Pick a block of parameters (say, θ1) and drop everything that
does not depend on θ1.
3. Figure out the normalizing constant (and, thus, the full conditional
distribution p(θ1|θ-1, y)).
4. Repeat steps 2 and 3 for all parameters.

MCMC: Gibbs Sampling - Steps


Example: Suppose we want to draw θ = (β, h) using a Gibbs sampler:
Step 1: Start with an arbitrary starting value θ0 = (β0, h0).
Step 2: Generate a sequence of θ’s, following:
- Sample βk+1 from p(β|hk; X)
- Sample hk+1 from p(h|βk+1; X)
Step 3: Repeat Step 2 for k = 1, 2, ..., K.

Note: The sequence {θk}k=1,...,K is a Markov chain with transition
kernel

π(θk+1|θk) = p(θ2,k+1|θ1,k+1; X) x p(θ1,k+1|θ2,k; X)

This transition kernel is a conditional distribution function that
represents the probability of moving from θk to θk+1.


MCMC: Gibbs Sampling - Details


• Under general conditions, the realizations from a Markov chain as
K → ∞ converge to draws from the ergodic distribution of the chain,
π(θ), satisfying

π(θk+1) = ∫ π(θk+1|θk) π(θk) dθk

• Theorem: The ergodic distribution of this chain corresponds to the
posterior distribution. That is,

π(θ) = p(θ|X)

Proof:
∫ π(θk+1|θk) π(θk) dθk = ∫ [p(θ2,k+1|θ1,k+1; X) p(θ1,k+1|θ2,k; X)] p(θk|X) dθk
                       = ∫ p(θk+1, θk|X) dθk = p(θk+1|X)

Implication: If we discard the first K0 draws (large K0), then {θK0+1, ...,
θK+1} represent draws from the posterior distribution p(θ|X).

MCMC: Gibbs Sampling – Diagnostics


• Like in all numerical procedures, it is always important to check the
robustness of the results before using the output:
- Use different θ0 (check traceplots for different sequences & GR)
- Use different K0, K (maybe use the “effective sample size,” ESS)
- Plot θj as a function of j (check the auto-/cross-correlations in
the sequence and across parameters).

• Run diagnostic tests. There are many:
- Geweke (1992): A Z-test, comparing the means of the first 10% of the
sequence and the last 50% of the sequence.
- Gelman and Rubin (GR, 1992): A test based on comparing different
sequences, say N. The statistic, called the shrink factor, is based on the
ratio of the variance of the N sequences’ posterior means to the
average of the posterior s² of the N sequences.


MCMC: Gibbs Sampling – Limitations


Three usual concerns:
• Even if we have the full posterior joint pdf, it may not be possible
or practical to derive the conditional distributions for each of the RVs
in the model.

• Second, even if we have the posterior conditionals for each variable,


it might be that they are not of a known form, and, thus, there is not
a straightforward way to draw samples from them.

• Finally, there are cases where Gibbs sampling is very inefficient.
That is, the “mixing” of the Gibbs sampling chain might be very slow
–i.e., the algorithm spends a long time exploring a local region with
high density, and takes a very long time to explore all regions with
significant probability mass.

Gibbs Sampler – Example 1: Bivariate Normal


 0   1   
Draw a random sample from bivariate normal   ,  
 0    1  
v  u  u 
(1) Direct approach:  1     1  where  1  are two
v
 2 r u
 2 r  u2 
1 0
independent standard normal draws (easy) and =  
 1 2 
1 
such that '=   . 1  , 2  1   .
2

 1
(2) Gibbs sampler: v1 | v 2 ~ N v 2 , 1  2 
 
v 2 | v1 ~ N   v1 , 1   2 
 


Gibbs Sampler – Example 1: Bivariate Normal


• R Code
# initialize constants and parameters

N <- 5000 # length of chain

burn <- 1000 # burn-in length
X <- matrix(0, N, 2) # the chain, a bivariate sample

rho <- -.75 # correlation


mu1 <- 0
mu2 <- 0
sigma1 <- 1
sigma2 <- 1
s1 <- sqrt(1-rho^2)*sigma1
s2 <- sqrt(1-rho^2)*sigma2

Gibbs Sampler – Example 1: Bivariate Normal


# generate the chain
X[1, ] <- c(mu1, mu2) #initialize
for (i in 2:N) {
x2 <- X[i-1, 2]
m1 <- mu1 + rho * (x2 - mu2) * sigma1/sigma2
X[i, 1] <- rnorm(1, m1, s1)
x1 <- X[i, 1]
m2 <- mu2 + rho * (x1 - mu1) * sigma2/sigma1
X[i, 2] <- rnorm(1, m2, s2)
}

b <- burn + 1
x <- X[b:N, ]

# compare sample statistics to parameters


colMeans(x)
cov(x)
cor(x)
plot(x, main="", cex=.5, xlab=bquote(X[1]),
ylab=bquote(X[2]), ylim=range(x[,2]))


Gibbs Sampler – Example 1: Bivariate Normal


> colMeans(x)
[1] 0.03269641 -0.03395135
> cov(x)
[,1] [,2]
[1,] 1.0570041 -0.8098575
[2,] -0.8098575 1.0662894
> cor(x)
[,1] [,2]
[1,] 1.0000000 -0.7628387
[2,] -0.7628387 1.0000000

Gibbs Sampler – Example 2: CLM


Now, in the CLM, we assume a normal prior for β and a gamma for
h, with the usual assumptions for the prior values (α0, λ0). The joint
posterior is complicated. We use the conditional posteriors to get the
joint posterior:

f(β|y, X, h) ∝ exp{-(1/2) (β – m*)'Σ*^(-1)(β – m*)}

f(h|β, y, X) ∝ h^(T/2 + α0 - 1) exp{-(h/2) (y – Xβ)'(y – Xβ) – h λ0}

where m* = Σ*(hX'y + hA^(-1)m) and Σ* = (hX'X + hA^(-1))^(-1). That is, we get a
multivariate normal for β, with the usual mix of prior and sample info,
and a gamma for h, with parameters (T/2 + α0, (y – Xβ)'(y – Xβ)/2 + λ0).

• The Gibbs sampler samples back and forth between the two
conditional posteriors.
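
A minimal R sketch of this back-and-forth (the data are simulated, and the prior values m, A, α0, λ0 are illustrative assumptions, not lecture values):

```r
set.seed(123)
Tn <- 100
X <- cbind(1, rnorm(Tn))
y <- X %*% c(1, 2) + rnorm(Tn)            # true beta = (1, 2), h = 1
m <- c(0, 0); Ainv <- diag(2) * 0.01      # prior mean m and A^-1 (assumed)
a0 <- 2; l0 <- 1                          # gamma prior values (assumed)
K <- 5000; draws <- matrix(0, K, 3)
h <- 1                                    # starting value
for (k in 1:K) {
  # beta | h: multivariate normal, the usual mix of prior and sample info
  S.star <- solve(h * crossprod(X) + h * Ainv)
  m.star <- S.star %*% (h * crossprod(X, y) + h * Ainv %*% m)
  beta <- m.star + t(chol(S.star)) %*% rnorm(2)
  # h | beta: gamma with parameters (T/2 + a0, (y - X beta)'(y - X beta)/2 + l0)
  e <- y - X %*% beta
  h <- rgamma(1, shape = Tn/2 + a0, rate = sum(e^2)/2 + l0)
  draws[k, ] <- c(beta, h)
}
colMeans(draws[1001:K, ])                 # posterior means after burn-in
```

With a diffuse prior, the posterior means of β and h should land near the values used to simulate the data.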


Gibbs Sampler – Example 3: Logistic Regression


• A standard Bayesian logistic regression model (e.g., modelling the
probability of a merger) can be written as follows:

yi ~ Binomial(ni, pi)
logit(pi) = β0 + β1 xi
β0 ~ N(0, m0),  β1 ~ N(0, m1)

• Can we write out the conditional posterior distributions and use
Gibbs Sampling?

p(β0|y, β1) ∝ p(y|β0, β1) p(β0)
            ∝ Πi [exp(β0 + β1 xi)/(1 + exp(β0 + β1 xi))]^(yi) [1/(1 + exp(β0 + β1 xi))]^(ni - yi)
              x (1/sqrt(2π m0)) exp{-β0²/(2 m0)}

p(β0|y, β1) ~ ?

Gibbs Sampler – Example 3: Logistic Regression


• This distribution is not a standard distribution. It cannot be simply
simulated from a standard random number generator. We can simulate
it using MCMC.

• The R package MCMCpack can estimate this model. Also, Bayesian


software OpenBUGS, JAGS, WinBUGS, and Stan, which link to R,
can fit this model using MCMC.

See R link: https://cran.r-project.org/web/views/Bayesian.html


MCMC: Data Augmentation


• Situation: It is difficult or impossible to sample directly from the
posterior, but there exists an unobservable/latent variable Y such that
it is possible to conditionally sample P(θ|Y) and P(Y|θ).

• Data augmentation (DA): Methods for constructing iterative


optimization or sampling algorithms through the introduction of
unobserved data or latent variables.

• DA was popularized by Dempster, Laird, and Rubin (1977), in their


article on the EM algorithm, and by Tanner and Wong (1987).

• A DA algorithm starts with the construction of the so-called


augmented data, Yaug, which are linked to the observed data, Yobs, via a
many-to-one mapping M: Yaug ⟶ Yobs.

MCMC: Data Augmentation


• Now, we have “complete data.” To work with it, we require that the
marginal distribution of Yobs implied by P(Yaug|θ) must be the
original model P(Yobs|θ). That is, we relate the “observed data”
posterior distribution to the “complete data”:

f(θ|Yobs, M) = ∫ f(θ, Yaug|Yobs, M) dYaug = ∫ f(θ|Yaug, Yobs, M) f(Yaug|Yobs, M) dYaug

• We introduce the RV Yaug because it helps. We have a situation
where a Gibbs sampler can be used to simulate P(θ|Yobs, M). Two
steps:
- Draw Yaug from its posterior, P(Yaug|Yobs, M)
- Draw θ from its completed-data posterior: P(θ|Yobs, Yaug, M)

Q: Under which conditions are inference from completed data and
inference from observed data the same?


MCMC: Data Augmentation – Example 1


Suppose we are interested in estimating the parameters of a censored
regression. There is a latent variable:
Yi*= Xiβ +εi, εi|Xi ∼ iid N(0, 1) i=1,2,...,M,....,N

• We observe Yi =max(0, Yi*), and the regressors Xi. Suppose we


observe M zeroes.

• Suppose the prior distribution for β is N(μ, Ω). But, the posterior
distribution for β does not have a closed-form expression.

Remark: We view both the vector Y* = (Y1*, ..., YN*) and β as
unknown RVs. With an appropriate choice of P(Y*|data, β) and
P(β|Y*), we can use a Gibbs sampler to get the full posterior
P(β, Y*|data).

MCMC: Data Augmentation – Example 1


• The Gibbs sampler consists of two steps:
Step 1 (Imputation): Draw all the missing elements of Y* given the
current value of the parameter β, say βk:
Yi*|βk, data ∼ TN(Xiβk, 1; 0)   (an N×1 vector!)
if observation i is censored, where TN(μ, σ²; c) denotes a truncated
normal distribution with mean μ, variance σ², and truncation point c
(truncated from above).

Step 2 (Posterior): Draw a new value for the parameter, βk+1, given
the data and given the (partly drawn) Y*:
β|data, Y* ∼ N((X'X + Ω^(-1))^(-1)(X'Y* + Ω^(-1)μ), (X'X + Ω^(-1))^(-1))
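
A compact R sketch of these two steps for the censored model (simulated data; the prior values μ and Ω below are assumptions, and σ² = 1 as in the setup):

```r
set.seed(11)
N <- 200
X <- cbind(1, rnorm(N))
ystar <- as.vector(X %*% c(0.5, 1) + rnorm(N))
y <- pmax(0, ystar)                        # we observe max(0, y*)
cens <- (y == 0)
mu <- c(0, 0); Oinv <- diag(2) * 0.01      # prior N(mu, Omega); Omega^-1 (assumed)
Vb <- solve(crossprod(X) + Oinv)           # posterior variance of beta
beta <- c(0, 0); K <- 3000; out <- matrix(0, K, 2)
for (k in 1:K) {
  # Step 1 (imputation): draw censored y* from N(X beta, 1) truncated above at 0
  mc <- X[cens, , drop = FALSE] %*% beta
  u <- runif(sum(cens)) * pnorm(0, mc, 1)  # inverse-cdf draw on (-inf, 0]
  yaug <- y; yaug[cens] <- qnorm(u, mc, 1)
  # Step 2 (posterior): draw beta given the completed data
  mb <- Vb %*% (crossprod(X, yaug) + Oinv %*% mu)
  beta <- mb + t(chol(Vb)) %*% rnorm(2)
  out[k, ] <- beta
}
colMeans(out[501:K, ])                     # posterior mean of beta
```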


MCMC: Data Augmentation – Example 2


Example: Incomplete univariate data
Suppose that Y1, ..., YN ~ Binomial(1, θ)
Prior for θ ~ Beta(α, β)
Then, the posterior of θ is also Beta:
p(θ|Y) ~ Beta(α + Σi=1 to N Yi, β + N − Σi=1 to N Yi)

Suppose N − M observations are missing. That is, Yobs = {Y1, ..., YM}.
Then, p(θ|Yobs) ~ Beta(α + Σi=1 to M Yi, β + M − Σi=1 to M Yi)

Step 1: Draw all the missing elements of Y* given the current value of
the parameter θ, say θk.
Step 2: Draw a new value for the parameter, θk+1, given the data and
the (partly drawn) Y*.
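A minimal Python sketch of this two-step scheme (my own illustration, not from the notes). Because the observed-data posterior Beta(α + ΣYi, β + M − ΣYi) is available in closed form here, it provides an exact benchmark for the chain:

```python
import numpy as np

def gibbs_missing_bernoulli(y_obs, n_missing, a, b, n_iter=4000, seed=0):
    """Gibbs with data augmentation for Yi ~ Binomial(1, theta),
    theta ~ Beta(a, b), with n_missing unobserved Y's."""
    rng = np.random.default_rng(seed)
    s_obs, M = int(y_obs.sum()), len(y_obs)
    N = M + n_missing
    theta, draws = 0.5, np.empty(n_iter)
    for k in range(n_iter):
        # Step 1: impute the missing Y's given the current theta
        s = s_obs + rng.binomial(1, theta, size=n_missing).sum()
        # Step 2: draw theta from the completed-data posterior Beta(a+S, b+N-S)
        theta = rng.beta(a + s, b + N - s)
        draws[k] = theta
    return draws
```

With y_obs containing 7 ones out of M = 10 and a flat Beta(1, 1) prior, the draws should average near the marginal posterior mean (1 + 7)/(2 + 10) ≈ 0.667, regardless of n_missing.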

MCMC: Metropolis-Hastings (MH)


• MH is an alternative, and more general, way to construct an MCMC
sampler (to draw from the posterior).

• It provides a form of generalized rejection sampling, where values
–i.e., the θ’s– are drawn from approximate distributions and
“corrected” so that, asymptotically, they behave as random
observations from the target distribution –for us, the posterior.

• MH sampling algorithms sequentially draw candidate observations
from a ‘proposal’ distribution, conditional on the current observation,
thus inducing a Markov chain.

• We deal with Markov chains: The distribution of the next sample
value, say y = θk+1, depends on the current sample value, say x = θk.


MCMC: Metropolis-Hastings (MH)


• In principle, the algorithm can be used to sample from any integrable
function. But its most popular application is sampling from a
posterior distribution.

• The MH algorithm jumps around the parameter space, but in a way
such that the probability of being at a point is proportional to the
function we sample from –i.e., the target function.

• Named after Metropolis et al. (1953), who first proposed it, and
Hastings (1970), who generalized it. Rediscovered and popularized by
Tanner and Wong (1987) and Gelfand and Smith (1990).

MCMC: MH – Proposal Distribution


• We want to find a function p(x, y) from which we can sample that
satisfies the (time) reversibility condition (equation of balance):
π(x) p(x, y) = π(y) p(y, x)

• The proposal (or candidate-generating) density is denoted q(x, y), where
∫ q(x, y) dy = 1.
Interpretation: When the process is at the point x (= θk), the density
generates a value y (= θk+1) from q(x, y). It tells us how to move from
the current x to a new y.

• Idea: Suppose π is the true density. We simulate a point using
q(x, y). We ‘accept’ it with a probability that depends on how ‘likely’
it is under the target. If it happens that q(x, y) itself satisfies the
reversibility condition for all (x, y), we are done.


MCMC: MH – Proposal Distribution


• But, for example, we might find that for some (x, y):
π(x) q(x, y) > π(y) q(y, x) (*)
In this case, speaking somewhat loosely, the process moves from x to
y too often and from y to x too rarely.

• We want balance. We correct this situation by reducing the number
of moves from x to y through the introduction of a probability
a(x, y) < 1 that the move is made:
a(x, y) = probability of move from x to y.
If the move is not made, the process again returns x as a value from
the target distribution.

• Then, transitions from x to y are made according to
pMH(x, y) = q(x, y) a(x, y), y ≠ x

MCMC: MH – Algorithm Rejection Step


Example: We focus on a single parameter θ and its posterior
distribution π(θ|Y). We draw a sequence {θ1, θ2, θ3, ...} from a MC.
- At iteration k, let θ = θk. Then, propose a move: θ*. That is, generate
a new value θ* from a proposal distribution q(θk, θ*).
- Rejection rule:
Accept θ* (& let θk+1 = θ*) with (acceptance) probability a(θk, θ*)
Reject θ* with probability 1 − a (& set θk+1 = θk).

We have defined an acceptance function!

Note: It turns out that the acceptance probability, a(x, y), is a function
of π(y)/π(x) –the importance ratio. This ratio helps the sampler to visit
higher-probability areas under the full posterior.


MCMC: MH – Probability of Move


• We need to define a(x, y), the probability of move.

• In our example (*), to get enough movements from y to x, we define
a(y, x) to be as large as possible (with upper limit 1!). Then, the
probability of move a(x, y) is determined by requiring that pMH(x, y)
satisfies the reversibility condition:
π(x) q(x, y) a(x, y) = π(y) q(y, x) a(y, x) = π(y) q(y, x)
=> a(x, y) = [π(y) q(y, x)]/[π(x) q(x, y)].
Note: If the inequality in (*) is reversed, we set a(x, y) = 1 and a(y, x)
as above.

• Then, in order for pMH(x, y) to be reversible, a(x, y) must be
a(x, y) = min{[π(y) q(y, x)]/[π(x) q(x, y)], 1} if π(x) q(x, y) > 0,
= 1 otherwise.

MCMC: MH – Probability of Move


• If q(.) is symmetric, then q(x, y) = q(y, x), and the probability of
move a(x, y) reduces to π(y)/π(x) –the importance ratio. Thus, the
acceptance function:
- If π(y) ≥ π(x), the chain moves to y.
- Otherwise, it moves with probability π(y)/π(x).

Note: This case, with q(.) symmetric, is called Metropolis Sampling.

• The acceptance function plays two roles:
1) It helps the sampler to visit higher-probability areas under the full
posterior –through the ratio π(y)/π(x).
2) It lets the sampler explore the space and avoid getting stuck at one
site –i.e., it can reverse its previous move. This role is played by the
ratio q(x|y)/q(y|x).


MCMC: MH – At Work

• We consider moves from x (note that q(x, y) is symmetric):
- A move to a candidate y1 with π(y1) > π(x) is made with certainty.
=> We always say yes to an “up-hill” jump!
- A move to a candidate y2 with π(y2) < π(x) is made with probability
π(y2)/π(x).
Note: The q(x, y) distribution is also called the jumping distribution.

MCMC: MH – Transition Kernel


• In order to complete the definition of the transition kernel for the
MH chain, we consider the possibly non-zero probability that the
process remains at x:
r(x) = 1 − ∫R q(x, y) a(x, y) dy.

• Then, the transition kernel of the MH chain, denoted pMH(x, dy), is
given by:
pMH(x, dy) = q(x, y) a(x, y) dy + [1 − ∫R q(x, y) a(x, y) dy] δx(dy),
where δx(dy) = 1 if x ∈ dy and 0 otherwise.


MCMC: MH Algorithm
• MH Algorithm
We know π(.), a complicated posterior. Say, from the CLM with Yi
iid normal, a normal prior for β and a gamma prior for h => θ = (β, h).

We assume q(θk, θk+1) is, say, a (symmetric) normal.

(1) Initialize with an (arbitrary) value θ0 (k = 0).
(2) Generate θ* from q(θk, .) and draw u from U(0, 1).
- If u ≤ a(θk, θ*) = π(θ*)/π(θk) => set θk+1 = θ*.
Else => set θk+1 = θk.
(3) Repeat for k = 1, 2, ..., K.

• Return the values {θ1, θ2, ..., θk, θk+1, ..., θK}.
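The steps above can be sketched as a generic routine. This is an illustrative Python sketch with my own function names; it works with log π(.) for numerical stability and assumes a symmetric normal proposal, so a = π(θ*)/π(θk):

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_iter, prop_sd, seed=0):
    """Generic MH with a symmetric normal proposal, so the acceptance
    probability reduces to a = pi(theta*)/pi(theta_k).
    log_target: log of the (unnormalized) posterior pi(.)."""
    rng = np.random.default_rng(seed)
    theta, logp = theta0, log_target(theta0)
    draws = np.empty(n_iter)
    n_accept = 0
    for k in range(n_iter):
        theta_star = theta + prop_sd * rng.normal()    # candidate from q(theta_k, .)
        logp_star = log_target(theta_star)
        if np.log(rng.uniform()) <= logp_star - logp:  # u <= pi(theta*)/pi(theta_k)
            theta, logp = theta_star, logp_star
            n_accept += 1
        draws[k] = theta
    return draws, n_accept / n_iter
```

For example, passing `log_target = lambda t: -0.5 * t * t` samples a standard normal target.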

MCMC: MH – Acceptance Rates


• It is important to monitor the acceptance rate (the fraction of
candidate draws that are accepted) of the MH algorithm.

• If the acceptance rate is too high, the chain is probably not mixing
well –i.e., not moving around the parameter space enough. If it is too
low, the algorithm is too inefficient (rejecting too many draws).

• In general, the acceptance rate falls as the dimension of P(θ|y)
increases (especially for highly dependent parameters), resulting in
slow-moving chains and long simulations.

• Simulation times can be improved by using the single-component
MH algorithm. Instead of updating the whole θ together, θ is divided
into components –say, (β, h)–, with each component updated separately.


MCMC: MH – Acceptance Rates


• What is high or low depends on the specific algorithm. One way of
finding a ‘good’ proposal distribution is to choose a distribution that
gives a particular acceptance rate. When we adjust the parameters of a
pdf, say σ, to obtain a “good” proposal, we say we are tuning the MH.

• It has been suggested that a ‘good’ acceptance rate is often around
50% for RW chains and close to 100% for independence chains –see
Muller (1993).

• Roberts, Gelman and Gilks (1994) suggest:
- 45% for unidimensional problems.
- 25% for 6 dimensions.
- 23% in the limit, as the dimension grows.

MCMC: MH – Acceptance Rates - Example


(From P. Lam.) We want to use a random walk Metropolis algorithm
to sample from a Gamma(1.7, 4.4) distribution, with a Normal
proposal distribution with SD = 2:
mh.gamma <- function(T.sims, start, burnin, cand.sd, shape, rate) {
  theta.cur <- start
  draws <- c()
  theta.update <- function(theta.cur, shape, rate) {
    # candidate theta_t+1 from the RW proposal
    theta.can <- rnorm(1, mean = theta.cur, sd = cand.sd)
    # acceptance probability a(): symmetric proposal => ratio of target densities
    accept.prob <- dgamma(theta.can, shape, rate) / dgamma(theta.cur, shape, rate)
    if (runif(1) <= accept.prob) theta.can else theta.cur   # accept or stay
  }
  for (i in 1:T.sims) {
    draws[i] <- theta.cur <- theta.update(theta.cur, shape, rate)
  }
  return(draws[(burnin + 1):T.sims])
}
mh.draws <- mh.gamma(10000, start = 1, burnin = 1000, cand.sd = 2, shape = 1.7, rate = 4.4)


MCMC: MH Algorithm - Example


• (From Florian Hartig) Suppose we have the CLM with
Data: Yi = α + βXi + εi, εi ~ iid N(0, σ2).
Priors: β ~ U(0, 10); α ~ N(m = 0, σ02 = 9); & σ2 ~ U(0.001, 30)
=> θ = (α, β, σ2).

• We simulate the data, with α = 1, β = 2, σ = 5, & T = 50.

• Proposal densities: 3 Normals with θ0 = (2, 0, 7) &
SD = (0.1, 0.5, 0.3).

• Iterations = 10,000 & Burn-in = 5,000.

• OLS: a = 1.188 (0.68), b = 1.984 (.047).


MCMC: MH – Special Cases


• We pick proposal distributions, q(x, y), that are easy to sample from.
Remarkably, the proposal distribution q(x, y) can have almost any
form.

• There are some (silly) exceptions; but assuming that the proposal
allows the chain to explore the whole posterior (irreducibility) and
does not produce a periodic chain, we are OK.

• We tend to work with symmetric q(x, y), but the problem at hand
may require asymmetric proposal distributions; for instance, to
accommodate particular constraints in the model. For example, to
estimate the posterior distribution of a variance parameter, we
require that our proposal does not generate values smaller than 0.

MCMC: MH – Special Cases


• Three special cases of the MH algorithm are:
1. Random walk Metropolis sampling. (That is, y = x + z, where z is
a RV drawn from q(z).)
2. Independence sampling. (That is, q(x, y) = q(y).)
3. Gibbs sampling. (We never reject the proposals –i.e., the
conditional posteriors!)

• Critical question: Selecting the spread and location of q(x, y).


MCMC: MH - Random Walk Metropolis


• This is pure Metropolis sampling –see Metropolis et al. (1953).
- Let q(x, y) = q1(|y – x|), where q1 is a multivariate density.
- y = x + z, where the random increment z ~ q1. (It is called a
random walk chain!)

• Typical examples of random walk proposal densities are Normal
distributions centered around the current value of the parameter –i.e.,
q(x, y) ~ N(x, s2), where s2 is the (fixed) proposal variance that can be
tuned to give particular acceptance rates. Multivariate-t distributions
are also used.

• The RW MH is a good alternative, the usual default for the algorithm.

MCMC – MH - Independence Sampler


• The independence sampler is so called because each proposal is
independent of the current parameter value. That is,
q(x, y) = q2(y) (an independence chain –see Tierney (1994).)

That is, all our candidate draws y are drawn from the same
distribution, regardless of where the previous draw was.

This leads to the acceptance probability
a(x, y) = min{1, [π(y|Z) q2(x)]/[π(x|Z) q2(y)]}.
A value y with a higher importance weight π(y)/q2(y) than the
current π(x)/q2(x) is automatically accepted.

• Distributions used for q2: a Normal based around the ML estimate
with inflated variance; a Multivariate-t.


MCMC – MH - Independence Sampler


• The independence sampler can sometimes work very well, but equally
can work very badly!

• The efficiency depends on how close the jumping distribution is to
the posterior.

• Generally speaking, the chain will behave well only if the q(.)
distribution has heavier tails than the posterior.
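An illustrative Python sketch of the independence sampler (my own names, not from the notes); the acceptance ratio compares importance weights π/q2, so normalizing constants cancel:

```python
import numpy as np

def independence_sampler(log_target, log_prop, draw_prop, n_iter, seed=0):
    """MH independence sampler: candidates y ~ q2(.) regardless of the
    current x; a(x, y) = min{1, [pi(y) q2(x)] / [pi(x) q2(y)]}."""
    rng = np.random.default_rng(seed)
    x = draw_prop(rng)
    logw = log_target(x) - log_prop(x)        # log importance weight of x
    draws = np.empty(n_iter)
    for k in range(n_iter):
        y = draw_prop(rng)
        logw_y = log_target(y) - log_prop(y)  # log importance weight of y
        if np.log(rng.uniform()) <= logw_y - logw:
            x, logw = y, logw_y               # accept: keep y and its weight
        draws[k] = x
    return draws

# Target: N(0, 1). Proposal: N(0, 2^2) -- deliberately heavier-tailed.
draws = independence_sampler(
    log_target=lambda t: -0.5 * t * t,
    log_prop=lambda t: -0.5 * (t / 2.0) ** 2,
    draw_prop=lambda rng: 2.0 * rng.normal(),
    n_iter=20000)
```

The heavier-tailed proposal keeps the importance weights bounded, which is exactly the condition for the chain to behave well.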

MCMC: MH - Adaptive Method (ad hoc)


• Adaptive Method
- Before the burn-in, we have an adaptation period, where the sampler
improves the proposal distribution. The adaptive method requires a
desired acceptance rate, for example 30%, and a tolerance, for example
10%, resulting in an acceptable range of (20%, 40%).

- If we are outside the acceptable range –say, we reject too much– we
adjust the proposal distribution, for example by reducing the spread
(say, σ).

• MLwiN uses an adaptive method to construct univariate Normal
proposals with an acceptance rate of approximately 50%.


MCMC: MH - Adaptive Method Algorithm


• Run the MH sampler for consecutive batches of 100 iterations.
Compare the number accepted, N, with the desired acceptance rate,
R. Adjust the proposal SD, σ, accordingly:
If N ≤ R, σnew = σold / (2 − N/R)
If N > R, σnew = σold × (2 − (100 − N)/(100 − R))

• Repeat this procedure until 3 consecutive values of N lie within the
acceptable range and then mark (fix) this parameter. Check the other
parameters.

• When all the parameters are marked, the adaptation period is over.

Note: Proposal SDs are still modified after being marked, until the
adaptation period is over.

MCMC: MH- Example: Bivariate Normal


• We want to simulate values from
f(x) ∝ exp{−(1/2) x'Σ−1x}; Σ = [1 .9; .9 1]

• Proposal distribution: RW chain: y = x + z, z ~ bivariate Uniform
on (−δi, δi), for i = 1, 2. (δi controls the spread.)
To avoid excessive moves, let δ1 = .75 and δ2 = 1.

• The probability of move (for a symmetric proposal) is:
a(x, y) = min{ exp{−(1/2)(y − μ)'Σ−1(y − μ)} / exp{−(1/2)(x − μ)'Σ−1(x − μ)}, 1 }


Application 1: The Probit Model (Greene)


• The Probit Model:
(a) yi* = xi'β + εi, εi ~ N[0, 1]
(b) yi = 1 if yi* > 0, 0 otherwise
Consider estimation of β and yi* (data augmentation):
(1) If y* were observed, this would be a linear regression
(yi would not be useful, since it is just sgn(yi*).)
We saw in the linear model before p(β|yi*, yi).
(2) If (only) β were observed, yi* would be a draw from the normal
distribution with mean xi'β and variance 1. But yi gives the sign of
yi*. yi*|β, yi is a draw from the truncated normal (truncated from
above if yi = 0, from below if yi = 1).


Application 1: The Probit Model (Greene)


• The Gibbs sampler for the probit model:

(1) Choose an initial value for β (maybe the MLE).
(2) Generate yi* by sampling the N observations from the truncated
normal with mean xi'β and variance 1, truncated from above at 0 if
yi = 0, from below if yi = 1.
(3) Generate β by drawing a random normal vector with mean
vector (X'X)−1X'y* and variance matrix (X'X)−1.
(4) Return to (2) 10,000 times, retaining the last 5,000 draws –the
first 5,000 are the ‘burn-in.’
(5) Estimate the posterior mean of β by averaging the last 5,000
draws.
(This corresponds to a uniform prior over β.)
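A Python sketch of steps (1)–(5) (my own illustration; it uses the inverse-CDF truncated-normal draws shown on the next slide, and far fewer iterations than the 10,000 in the notes, purely to keep the example fast):

```python
import numpy as np
from statistics import NormalDist

def gibbs_probit(y, X, n_iter=600, seed=0):
    """Gibbs sampler for the probit model under a uniform prior on beta:
    beta | y* ~ N((X'X)^-1 X'y*, (X'X)^-1); y* | beta, y is truncated normal."""
    rng = np.random.default_rng(seed)
    nd = NormalDist()
    N, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)              # crude starting value (LPM fit)
    ystar = np.empty(N)
    draws = np.empty((n_iter, k))
    for it in range(n_iter):
        mu = X @ beta
        u = rng.uniform(size=N)
        for i in range(N):
            if y[i] == 1:                   # y* truncated from below at 0
                p = 1 - (1 - u[i]) * nd.cdf(mu[i])
            else:                           # y* truncated from above at 0
                p = u[i] * nd.cdf(-mu[i])
            p = min(max(p, 1e-12), 1 - 1e-12)   # guard inv_cdf's open (0,1) domain
            ystar[i] = mu[i] + nd.inv_cdf(p)
        beta = rng.multivariate_normal(XtX_inv @ (X.T @ ystar), XtX_inv)
        draws[it] = beta
    return draws
```

Averaging the post-burn-in draws should land close to the probit MLE, as in the LIMDEP comparison below.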

Generating Random Draws from f(X)


The inverse probability method of sampling random draws:
If F(x) is the CDF of random variable x, then a random draw on x
may be obtained as F−1(u), where u is a draw from the standard
uniform (0, 1).
Examples:
Exponential: f(x) = λ exp(−λx); F(x) = 1 − exp(−λx);
x = −(1/λ) log(1 − u)
Normal: F(x) = Φ(x); x = Φ−1(u)
Truncated Normal: x = μi + Φ−1[1 − (1 − u) Φ(μi)] for y = 1;
x = μi + Φ−1[u Φ(−μi)] for y = 0.


Example: Simulated Probit


? Generate raw data
Sample ; 1 - 1000 $
Create ; x1=rnn(0,1) ; x2 = rnn(0,1) $
Create ; ys = .2 + .5*x1 - .5*x2 + rnn(0,1) ; y = ys > 0 $
Namelist; x=one,x1,x2$
Matrix ; xx=x'x ; xxi = <xx> $
Calc ; Rep = 200 ; Ri = 1/Rep$
Probit ; lhs=y;rhs=x$
? Gibbs sampler
Matrix ; beta=[0/0/0] ; bbar=init(3,1,0) ; bv=init(3,3,0)$
Proc = gibbs$
Do for ; simulate ; r =1,Rep $
Create ; mui = x'beta ; f = rnu(0,1)
; if(y=1) ysg = mui + inp(1-(1-f)*phi( mui));
(else) ysg = mui + inp( f *phi(-mui)) $
Matrix ; mb = xxi*x'ysg ; beta = rndm(mb,xxi)
; bbar=bbar+beta ; bv=bv+beta*beta'$
Enddo ; simulate $
Endproc $
Execute ; Proc = Gibbs $ (Note, did not discard burn-in)
Matrix ; bbar=ri*bbar ; bv=ri*bv-bbar*bbar' $
Matrix ; Stat(bbar,bv); Stat(b,varb) $

Example: Probit MLE vs. Gibbs

--> Matrix ; Stat(bbar,bv); Stat(b,varb) $


+---------------------------------------------------+
|Number of observations in current sample = 1000 |
|Number of parameters computed here = 3 |
|Number of degrees of freedom = 997 |
+---------------------------------------------------+
+---------+--------------+----------------+--------+---------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] |
+---------+--------------+----------------+--------+---------+
BBAR_1 .21483281 .05076663 4.232 .0000
BBAR_2 .40815611 .04779292 8.540 .0000
BBAR_3 -.49692480 .04508507 -11.022 .0000
+---------+--------------+----------------+--------+---------+
|Variable | Coefficient | Standard Error |b/St.Er.|P[|Z|>z] |
+---------+--------------+----------------+--------+---------+
B_1 .22696546 .04276520 5.307 .0000
B_2 .40038880 .04671773 8.570 .0000
B_3 -.50012787 .04705345 -10.629 .0000


Application 2: Stochastic Volatility (SV)


In the SV model, we have
ht = ω + β ht−1 + ηt; ηt ~ N(0, ση2)
Or, in logs,
log ht = ω + β log ht−1 + ηt

• We have 3 SV parameters to estimate, φ = (ω, β, ση2), and the latent ht.

• The difference with ARCH models: The shocks that govern the
volatility are not functions of the past εt’s; there is a separate
volatility shock, ηt.

• SVOL estimation is based on the idea of a hierarchical structure:
- f(y|ht) (distribution of the data given the volatilities)
- f(ht|φ) (distribution of the volatilities given the parameters)
- f(φ) (distribution of the parameters)

Application 2: Stochastic Volatility (SV)


Bayesian Goal: To get the posterior f(ht, φ|y).

Priors (Beliefs):
Normal-Gamma for f(φ). (Standard Bayesian regression model.)
Inverse-Gamma for f(ση2), i.e., a Gamma prior on the precision:
f(ση−2) ∝ (ση−2)^(T/2 − 1) exp{−(λ/2) ση−2}
Normals for ω, β.
Impose (assume) stationarity of ht. (Truncate β as necessary.)

Algorithm: MCMC (JPR (1994)).
Augment the parameter space to include ht.
Using a proper prior for f(ht, φ), the MCMC provides inference about
the joint posterior f(ht, φ|y). The normalized Gamma prior on the
precision is:
f(ση−2) = [(λ/2)^(T/2) / Γ(T/2)] (ση−2)^(T/2 − 1) e^{−(λ/2) ση−2}

• Classic reference: Andersen (1994), Mathematical Finance.
• Application to interest rates: Kalimipalli and Susmel (2004, JEF).


Application 2: Stochastic Volatility (SV)


• Gibbs Algorithm for Estimating the SV Model –from K&S (2004).
rt = (â0 + â1 rt−1) + RESt
RESt = ht1/2 rt−1γ εt, γ = 0.5
ln(ht) = ω + β1 ln(ht−1) + ση ηt
- In the SV model, we estimate the parameter vector and 1 latent
variable: θ = {ω, β1, ση} and Ht = {h1, ..., ht}.
- The parameter set therefore consists of Θ = {Ht, θ} for all t.

• Using Bayes' theorem, we decompose the joint posterior density as
follows:
f(Hn, θ|Yn) ∝ f(Yn|Hn) f(Hn|θ) f(θ)

Application 2: Stochastic Volatility (SV)


f(Hn, θ|Yn) ∝ f(Yn|Hn) f(Hn|θ) f(θ)
• Next, draw the marginals f(Ht|Yt, θ) and f(θ|Yt, Ht) using a Gibbs
sampling algorithm:
Step 1: Specify initial values θ(0) = {ω(0), β1(0), ση(0)}. Set i = 1.

Step 2:
Draw the underlying volatility using the multi-move simulation
sampler –see De Jong and Shephard (1995)–, based on parameter
values from step 1.

- The multi-move simulation sampler draws Ht for all the data points
as a single block. Recall that we can write:
ln(RESt2) = ln(ht) + 2γ ln(rt−1) + ln(εt2) (A-1)


Application 2: Stochastic Volatility (SV)


ln(RESt2 )  ln(ht )  ln(rt 1 )  ln(t2 ) (A -1)
where ln(t2) can be approximated by a mixture of seven normal
variates -Chib, Shephard, and Kim (1998).
ln (εt2 )  zt

 
7
f(zt )   f N zi mi  1.2704,vi2 i  { 1,2,....7 } (A - 2)
i 1

- Now, (A-1) can be written as


ln(RESt2 )  ln(ht )  ln(rt 1 )  zt kt  i  (A - 3)
where kt is one of the seven underlying densities that generates zt.

- Once the underlying densities kt, for all t, are known, (A-3) becomes
a deterministic linear equation and along with the SV model can be
represented in a linear state space model.

Application 2: Stochastic Volatility (SV)


- If interested in estimating γ as a free parameter, rewrite (A-1) as
ln(RESt2) = ln(ht) + 2γ ln(rt−1) + ln(εt2) (A-1)
Then, estimate γ, approximating ln(εt2) by a lognormal distribution.
Once γ is known, follow (A-3) and extract the latent volatility.

Step 3:
Based on the output from steps 1 and 2, the underlying kt in (A-3) is
sampled from the normal mixture as follows:
f(zt = i | ln(yt2), ln(ht)) ∝ qi fN(ln(yt2) | ln(ht) + mi − 1.2704, vi2) (A-4)

For every observation t, we evaluate the normal density under each of
the seven components {kt = 1, 2, ..., 7}. Then, we select a “k” based
on draws from a uniform distribution.


Application 2: Stochastic Volatility (SV)


Step 4:
Cycle through the conditionals of the parameter vector θ = {ω, β1, ση}
for the volatility equation using Chib (1993), using output from steps
1-3 and assuming that f(θ|.) can be decomposed as:

f(θ|Yn, Hn) = f(ω|Yn, Hn, θ−ω) f(ση2|Yn, Hn, θ−σ2) f(β1|Yn, Hn, θ−β1) (A-5)

where θ−j refers to the θ parameters excluding the jth parameter.

- The conditional distributions (normal for ω and β1, inverse gamma
for ση2) are described in Chib (1993). You need to specify the prior
means and standard deviations.

Step 5: Go to step 2. (Now, set i = 2.)

Conclusions (Greene)
• Bayesian vs. Classical Estimation
– In principle, different philosophical views and differences in
interpretation
– As practiced, just two different algorithms
– The religious debate is a red herring –i.e., misleading.

• Gibbs Sampler. A major technological advance


– Useful tool for both classical and Bayesian
– New Bayesian applications appear daily


Standard Criticisms (Greene)


• Of the Classical Approach
– Computationally difficult (ML vs. MCMC)
– It is difficult to pay attention to heterogeneity, especially in
panels when N is large.
– Responses: None are true. See, e.g., Train (2003, Ch. 10)

• Of Classical Inference in this Setting


– Asymptotics are “only approximate” and rely on “imaginary
samples.” Bayesian procedures are “exact.”
– Response: The inexactness results from acknowledging that
we try to extend these results outside the sample. The
Bayesian results are “exact” but have no generality and are
useless except for this sample, these data and this prior. (Or
are they? Trying to extend them outside the sample is a
distinctly classical exercise.)

Standard Criticisms (Greene)


• Of the Bayesian Approach
– Computationally difficult.
– Response: Not really, with MCMC and Metropolis-Hastings
– The prior (conjugate or not) is a hoax. It has nothing to do
with “prior knowledge” or the uncertainty of the investigator.
– Response: In fact, the prior usually has little influence on the
results. (Bernstein and von Mises Theorem)

• Of Bayesian ‘Inference’
– It is not statistical inference.
– How do we discern any uncertainty in the results? This is
precisely the underpinning of the Bayesian method. There is
no uncertainty. It is ‘exact.’
