Lecture 17
Bayesian Econometrics
• For estimation, we can ignore the term P(y), since it does not depend
on the parameters. Then, we can write:
P(θ|y) ∝ P(y|θ) × P(θ)
• Terminology:
- P(y|θ): Density of the data, y, given the parameters, θ. Called the
likelihood function. (Given a value for θ, how likely are we to see the data y?)
- P(θ): Prior density of the parameters. Prior belief of the researcher.
- P(θ|y): Posterior density of the parameters, given the data. (A mixture
of the prior and the “current information” from the data.)
P(θ|y) = P(y|θ) P(θ) / P(y)
After seeing videos and scouting reports and using her previous
experience, the coach forms a personal belief about the player’s skills.
This initial belief is the prior, P(S).
After the formal tryout performance (event T), the coach updates her
prior beliefs. This update is the posterior:
P(S|T) = P(T|S) P(S) / P(T)
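Example: A quick numerical illustration in R, treating “the player is skilled” as a binary event S and a good tryout as the event T (the probabilities below are made up):

p_S      <- 0.30   # prior P(S): coach's initial belief the player is skilled
p_T_S    <- 0.80   # P(T|S): probability of a good tryout if skilled
p_T_notS <- 0.20   # P(T|not S): probability of a good tryout if not skilled
p_T   <- p_T_S * p_S + p_T_notS * (1 - p_S)   # total probability of a good tryout
p_S_T <- p_T_S * p_S / p_T                    # posterior P(S|T) by Bayes' rule
p_S_T                                         # 0.24 / 0.38 = 0.632 (approx.)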
• Notes:
- The previous posterior becomes the new prior.
- Beliefs tend to become more concentrated as N increases.
- Posteriors seem to look more normal as N increases.
Likelihood
• It represents the probability of observing the data, y, conditional
on θ. For example, Yi ~ N(μ, σ²).
Σi (Yi − μ)² = Σi [(Yi − Ȳ) + (Ȳ − μ)]² = Σi (Yi − Ȳ)² + T(Ȳ − μ)²
            = (T − 1)s² + T(Ȳ − μ)²
• When the prior does not integrate to one, it is called an improper prior. Note
that the posterior distribution need not be a proper distribution if the prior is
improper.
This prior gives us some flexibility. Depending on σ0², this prior can
be informative or diffuse. A small σ0² represents the case of an
informative prior. As σ0² increases, the prior becomes more diffuse.
Posterior
P(θ|y) = [Γ(N+2)/(Γ(s+1)Γ(N−s+1))] θ^s (1−θ)^(N−s) × [Γ(a+b)/(Γ(a)Γ(b))] θ^(a−1) (1−θ)^(b−1)
         / ∫₀¹ θ^s (1−θ)^(N−s) [Γ(a+b)/(Γ(a)Γ(b))] θ^(a−1) (1−θ)^(b−1) dθ
The prior p(θ, λ) is decomposed as p(θ|λ) p(λ), using a prior for the prior, p(λ), a
hyperprior. Under this interpretation, we call λ a hyperparameter.
f(μ, σ²) = (2πσM²)^(−1/2) exp{ −(μ − μ0)²/(2σM²) } × [(λ/2)^(T/2) / Γ(T/2)] (σ⁻²)^(T/2 − 1) e^(−(λ/2)σ⁻²)
Posterior
• The goal is to say something about our subjective beliefs about θ;
say, the mean μ, after seeing the data (y). We characterize this with the
posterior distribution:
P(θ|y) = P(y|θ) P(θ)/P(y)
• σ0² states the confidence in our prior: a small σ0² indicates a confident (informative) prior.
Posterior: Normal-Normal
(T/σ²)(μ − Ȳ)² + (1/σ0²)(μ − μ0)² = (1/σ̄²)(μ − μ̄)² + [1/(σ0² + σ²/T)](Ȳ − μ0)²
where μ̄ = [(T/σ²) Ȳ + (1/σ0²) μ0] / [T/σ² + 1/σ0²]  &  σ̄² = 1 / [T/σ² + 1/σ0²]
Results:
- As T→∞, the posterior mean converges to Ȳ.
- As σ0² →∞, our prior information is worthless (the prior becomes completely diffuse).
Example: In R.
bayesian_updating <- function(data, mu_0, sigma2_0, plot = FALSE) {
  require("ggplot2")
  T <- length(data)      # sample size
  xbar <- mean(data)     # sample mean
  sigma2 <- sd(data)^2   # sample variance
  # Likelihood (Normal), as a function of mu: N(xbar, sigma2/T)
  xx <- seq(xbar - 2*sqrt(sigma2), xbar + 2*sqrt(sigma2), sqrt(sigma2)/40)
  yy <- 1/sqrt(2*pi*sigma2/T) * exp(-1/(2*sigma2/T) * (xx - xbar)^2)
  type <- 1
  df1 <- data.frame(xx, yy, type)
  # Prior (Normal): N(mu_0, sigma2_0)
  xx <- seq(mu_0 - 4*sqrt(sigma2_0), mu_0 + 4*sqrt(sigma2_0), sqrt(sigma2_0)/40)
  yy <- 1/sqrt(2*pi*sigma2_0) * exp(-1/(2*sigma2_0) * (xx - mu_0)^2)
  type <- 2
  df2 <- rbind(df1, data.frame(xx, yy, type))
  # Posterior (Normal): precision-weighted combination of prior and likelihood
  pov <- 1/(T/sigma2 + 1/sigma2_0)                     # posterior variance
  pom <- (T/sigma2 * xbar + 1/sigma2_0 * mu_0) * pov   # posterior mean
  xx <- seq(pom - 4*sqrt(pov), pom + 4*sqrt(pov), sqrt(pov)/40)
  yy <- 1/sqrt(2*pi*pov) * exp(-1/(2*pov) * (xx - pom)^2)
  type <- 3
  df3 <- rbind(df2, data.frame(xx, yy, type))
  if (plot == TRUE) {
    return(ggplot(data = df3, aes(x = xx, y = yy, group = type, colour = factor(type)))
           + geom_line()
           + ylab("Density")
           + xlab("x")
           + ggtitle("Bayesian updating")
           + theme(legend.title = element_blank()))
  } else {
    Nor <- matrix(c(pom, pov), nrow = 1, ncol = 2, byrow = TRUE)  # posterior mean & variance
    return(Nor)
  }
}
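A minimal usage sketch (the simulated data and prior values below are my own choices, not from the example above):

set.seed(42)
y <- rnorm(100, mean = 2, sd = 1)                             # simulated data
bayesian_updating(y, mu_0 = 0, sigma2_0 = 10, plot = TRUE)    # prior, likelihood & posterior curves
bayesian_updating(y, mu_0 = 0, sigma2_0 = 10)                 # returns posterior mean and variance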
Remark: Some kind of shrinkage can always reduce the MSE relative
to OLS/MLE.
Predictive Posterior
• The posterior distribution of θ is obtained, after the data y are
observed, by Bayes' Theorem:
P(θ|y) ∝ P(y|θ) × P(θ)
• There are theoretical results under which both worlds produce the
same results. For example, in large samples, under a uniform prior, the
posterior mean will be approximately equal to the MLE.
• Likelihood function
- In classical statistics, the likelihood is the density of the observed
data conditioned on the parameters.
- Inference based on the likelihood is usually “maximum
likelihood.”
- In Bayesian statistics, the likelihood is a function of the parameters
and the data that forms the basis for inference – not really a
probability distribution.
- The likelihood embodies the current information about the
parameters and the data.
That is, unless your prior beliefs are so strong that they cannot be
overturned by evidence, at some point the evidence in the data
outweighs any prior beliefs you might have started out with.
• There are important cases where this result does not hold, typically
when convergence to the limit distribution is not uniform, such as unit
roots. In these cases, there are differences between both methods.
Bayesian goal: Get the posterior distribution of the parameters (β, σ²).
f(y | X, β, σ²) = (2πσ²)^(−T/2) exp{ − Σt=1,…,T (yt − Xt′β)² / (2σ²) }
• Recall that we can write: y – Xβ = (y – Xb) – X(β – b)
Then,
(y – Xβ)’(y – Xβ) = (y – Xb)’(y – Xb) + (β – b)’X’X(β – b) – 2(β – b)’X’(y – Xb)
                = υ s² + (β – b)’X’X(β – b),
since X’(y – Xb) = 0, where υ = T – k and s² is the OLS estimator of σ².
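A quick numerical check of this decomposition in R (with simulated X and y of my own choosing):

set.seed(123)
T <- 100; k <- 3
X <- cbind(1, matrix(rnorm(T*(k-1)), T, k-1))    # design matrix with intercept
y <- X %*% c(1, 0.5, -0.3) + rnorm(T)
b  <- solve(t(X) %*% X, t(X) %*% y)              # OLS estimate
e  <- y - X %*% b
s2 <- sum(e^2) / (T - k)                         # s^2 with nu = T - k degrees of freedom
beta <- c(0.8, 0.7, 0)                           # an arbitrary beta to evaluate
lhs <- sum((y - X %*% beta)^2)
rhs <- (T - k)*s2 + t(beta - b) %*% t(X) %*% X %*% (beta - b)
c(lhs, rhs)                                      # the two sides agree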
f(β | y, X, σ²) = f(y | X, β, σ²) f(β | σ²) / f(y | X, σ²)
              = f(y | X, β, σ²) f(β | σ²) / ∫ f(y | X, β, σ²) f(β | σ²) dβ
• Now, the posterior belongs to the same family (normal) as the prior.
The normal (conjugate) prior was a very convenient choice.
• The mean m* takes into account the data (X and y) and our prior
distribution. It is a (matrix-)weighted average of our prior mean m and the OLS estimate b.
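A minimal sketch of this weighted average, assuming the standard conjugate prior β|σ² ~ N(m, σ²M) (the exact prior parameterization used on the slide may differ):

# Posterior mean under the conjugate normal prior beta | sigma2 ~ N(m, sigma2 * M):
#   m* = (X'X + M^-1)^-1 (X'X b + M^-1 m),  a matrix-weighted average of b (OLS) and m
posterior_mean <- function(X, y, m, M) {
  XtX  <- t(X) %*% X
  b    <- solve(XtX, t(X) %*% y)     # OLS estimate
  Minv <- solve(M)
  solve(XtX + Minv, XtX %*% b + Minv %*% m)
}
# e.g., posterior_mean(X, y, m = rep(0, ncol(X)), M = diag(10, ncol(X)))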
f(σ⁻²) = [(λ/2)^(T/2) / Γ(T/2)] (σ⁻²)^(T/2 − 1) e^(−(λ/2)σ⁻²)
• This is also true for the classical perspective, where results can be
dependent on the choice of likelihood function, covariates, etc.
• There are a lot of publications using the same data. To form priors,
we cannot use the results of previous research if our sample is
correlated with the sample used in that research!
Now, f(β, h | y, X) ∝ h^(T/2) exp{ −(h/2) [υs² + (β – b)’X’X(β – b)] }, with h ≡ 1/σ².
A useful integral: ∫₀¹ x^a exp{−bx} dx = b^(−(a+1)) [Γ(a+1) − Γ(a+1, b)]
Likelihood
L(β, σ² | y, X) = [2πσ²]^(−n/2) e^(−[1/(2σ²)](y − Xβ)′(y − Xβ))
             ∝ [σ²]^(−n/2) exp{−υs²/(2σ²)} exp{−(1/2)(β − b)′[σ²(X′X)⁻¹]⁻¹(β − b)}
Presentation of Results
• P(θ|y) is a pdf. For the simple case of one parameter, θ, it can be
graphed. But if θ is a vector of many parameters, the multivariate pdf
cannot be presented in a single graph.
Note: Make sure that q(x) is a well defined pdf –i.e., it integrates to 1.
This is why above we use q(x)= dexp(x,lambda)/ pexp(1,lambda).
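A quick check of this in R, with an arbitrary value of lambda:

lambda <- 2                                          # arbitrary rate, just for the check
q <- function(x) dexp(x, lambda) / pexp(1, lambda)   # exponential truncated to (0, 1)
integrate(q, lower = 0, upper = 1)$value             # = 1, so q is a proper pdf on (0, 1)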
Simulation Methods
• Q: Do we need to restrict our choices of prior distributions to these
conjugate families? No. The posterior distributions are well defined
irrespective of conjugacy. Conjugacy only simplifies computations.
• If you are outside the conjugate families, you typically have to resort
to numerical methods for calculating posterior moments.
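For instance, a minimal sketch (a toy setup of my own, not from the slides): a normal likelihood combined with a Cauchy prior is not conjugate, but the posterior mean can still be computed by one-dimensional numerical integration:

set.seed(1)
y <- rnorm(20, mean = 1.5, sd = 1)        # toy data, sigma known = 1
# Unnormalized posterior: normal likelihood x Cauchy(0, 1) prior (non-conjugate)
post_kernel <- function(theta)
  sapply(theta, function(t) prod(dnorm(y, mean = t, sd = 1)) * dcauchy(t))
norm_const <- integrate(post_kernel, -Inf, Inf)$value
post_mean  <- integrate(function(t) t * post_kernel(t) / norm_const, -Inf, Inf)$value
post_mean   # posterior mean of theta, obtained numerically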
• Steps:
1. Parameterize the model
2. Propose the likelihood conditioned on the parameters
3. Propose the priors – joint prior for all model parameters
4. As usual, the posterior is proportional to likelihood times prior.
(Usually requires conjugate priors to be tractable.)
5. Sample –i.e., draw observations- from the posterior to study its
characteristics.
Numerical Methods
• Sampling from the joint posterior P(θ|y) may be difficult or
impossible. For example, in the linear model, assuming a normal prior
for β and an inverse-gamma prior for σ², we get a complicated joint
posterior distribution for (β, σ²).
• For these situations, many methods have been developed that make
the process easier, including Gibbs sampling, Data Augmentation, and the
Metropolis-Hastings algorithm.
• A draw θt describes the state at time (iteration) t. The next draw θt+1
is dependent only on θt. This is because of the Markov property:
p(θt+1 | θt) = p(θt+1 | θt, θt−1, θt−2, ..., θ1)
Ergodic Theorem
• Let θ1, θ2, θ3, ..., θM be M values from a Markov chain that is aperiodic,
irreducible, and positive recurrent –i.e., chain is ergodic-, and E[g(θ)] < ∞.
Then, with probability 1:
Σi g(θi)/M → ∫Θ g(θ) p(θ) dθ
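A small illustration with a toy chain (my own example): a Gaussian AR(1) process is an ergodic Markov chain with stationary distribution N(0, 1/(1 − ρ²)), so the ergodic average of g(θ) = θ² approaches the stationary second moment:

set.seed(7)
M   <- 50000
rho <- 0.8
theta <- numeric(M)
for (t in 2:M)                    # Markov chain: theta_t depends only on theta_{t-1}
  theta[t] <- rho * theta[t - 1] + rnorm(1)
mean(theta^2)                     # ergodic average of g(theta) = theta^2
1 / (1 - rho^2)                   # stationary E[theta^2] = 2.78, approached as M grows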
• Usually, the first K0−1 iterations are discarded to let the algorithm
converge to the stationary distribution without the influence of the
starting values, θ0 (burn-in).
MCMC - Remarks
• In classical stats, we usually focus on finding the stationary
distribution, given a Markov chain.
• The process is started at an arbitrary x and iterated a large number
of times. After this, the distribution of the observations generated
from the simulation is approximately the target distribution.
(2) Gibbs sampler:  v1 | v2 ~ N(ρ v2, 1 − ρ²)
                    v2 | v1 ~ N(ρ v1, 1 − ρ²)
b <- burn + 1      # first draw to keep
x <- X[b:N, ]      # discard the burn-in draws, keep the rest
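A sketch of the full sampler behind this fragment (my reconstruction, with N, ρ and the burn-in length chosen arbitrarily): draw back and forth from the two conditionals, then discard the burn-in draws as in the two lines above.

gibbs_bvn <- function(N = 10000, rho = 0.9, burn = 1000) {
  X <- matrix(0, nrow = N, ncol = 2)          # column 1: v1 draws, column 2: v2 draws
  v1 <- 0; v2 <- 0                            # arbitrary starting values
  s  <- sqrt(1 - rho^2)                       # conditional standard deviation
  for (i in 1:N) {
    v1 <- rnorm(1, mean = rho * v2, sd = s)   # draw v1 | v2
    v2 <- rnorm(1, mean = rho * v1, sd = s)   # draw v2 | v1
    X[i, ] <- c(v1, v2)
  }
  b <- burn + 1
  x <- X[b:N, ]                               # keep post-burn-in draws (as on the slide)
  x
}
draws <- gibbs_bvn()
colMeans(draws)    # close to (0, 0)
cor(draws)[1, 2]   # close to rho = 0.9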
• The Gibbs sampler samples back and forth between the two
conditional posteriors.
yi ~ Binomial(ni, pi)
logit(pi) = Xi′β, with β = (β0, β1)
β0 ~ N(0, m0), β1 ~ N(0, m1)
Step 2 (Posterior): Draw a new value for the parameter, βk+1, given
the data and given the (partly drawn) Y*:
p(β | data, Y*) ~ N((X′X + Ω⁻¹)⁻¹ (X′Y* + Ω⁻¹β0), (X′X + Ω⁻¹)⁻¹)
Suppose N−M observations are missing. That is, Yobs = {Y1, ..., YM}.
Then, p(θ|Yobs) ~ Beta(α + Σi=1,…,M Yi , β + M − Σi=1,…,M Yi)
Step 1: Draw all the missing elements of Y* given the current value of
the parameter θ, say θk.
Step 2: Draw a new value for the parameter, θk+1, given the data and
given the (partly drawn) Y*.
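A minimal sketch of these two steps in R (simulated Bernoulli data with the last N − M observations set to missing; the Beta(a, b) prior with a = b = 1 is my own choice):

set.seed(11)
N <- 100; M <- 70
Y <- rbinom(N, 1, 0.6)
Y[(M + 1):N] <- NA                  # last N - M observations are missing
a <- 1; b <- 1                      # Beta(a, b) prior
S <- 5000
theta_draws <- numeric(S)
theta <- 0.5                        # starting value
Ystar <- Y
for (k in 1:S) {
  # Step 1: draw the missing elements of Y* given the current theta_k
  Ystar[is.na(Y)] <- rbinom(N - M, 1, theta)
  # Step 2: draw theta_{k+1} given the completed data Y*
  theta <- rbeta(1, a + sum(Ystar), b + N - sum(Ystar))
  theta_draws[k] <- theta
}
mean(theta_draws)   # close to the Beta(a + sum(Yobs), b + M - sum(Yobs)) mean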
• Named after the Metropolis et al. (1953) paper, which first proposed
it, and Hastings (1970), who generalized it. Rediscovered and
popularized by Tanner and Wong (1987) and Gelfand and Smith
(1990).
MCMC: MH – At Work
MCMC: MH Algorithm
• MH Algorithm
We know π(.), a complicated posterior. Say, from the CLM with Yi
i.i.d. normal, a normal prior for β and a gamma prior for h => θ = (β, h).
• If the acceptance rate is too high, the chain is probably not mixing
well -i.e., not moving around the parameter space enough. If it is too
low, the algorithm is too inefficient (rejecting too many draws).
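A random-walk MH sketch for a toy target (my own example, not the CLM posterior above): the posterior of a normal mean under a N(0, 10) prior; the proposal SD controls the acceptance rate just discussed.

set.seed(3)
y <- rnorm(50, mean = 1, sd = 1)                     # toy data, sigma known = 1
log_post <- function(theta)                          # log posterior kernel
  sum(dnorm(y, theta, 1, log = TRUE)) + dnorm(theta, 0, sqrt(10), log = TRUE)
rw_mh <- function(S = 10000, prop_sd = 0.5) {
  theta <- numeric(S); theta[1] <- 0; acc <- 0
  for (s in 2:S) {
    cand <- theta[s - 1] + rnorm(1, 0, prop_sd)      # random-walk candidate
    # accept with probability min(1, pi(cand)/pi(current)) -- symmetric proposal
    if (log(runif(1)) < log_post(cand) - log_post(theta[s - 1])) {
      theta[s] <- cand; acc <- acc + 1
    } else theta[s] <- theta[s - 1]
  }
  list(draws = theta, accept_rate = acc / (S - 1))
}
out <- rw_mh(prop_sd = 0.5)
out$accept_rate                      # too high => increase prop_sd; too low => decrease it
mean(out$draws[-(1:1000)])           # posterior mean after burn-in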
• There are some (silly) exceptions; but assuming that the proposal
allows the chain to explore the whole posterior and does not produce
a periodic chain, we are OK.
That is, all our candidate draws y are drawn from the same
distribution, regardless of where the previous draw was.
This leads to acceptance probability a(x, y) = min{1, [π(y|Z) q(x)] / [π(x|Z) q(y)]}.
A candidate y with a higher ratio π(y)/q(y) than the current draw x is automatically accepted.
• Generally speaking, the chain will behave well only if the q(.) distribution
has heavier tails than the posterior.
• When all the parameters are marked, the adaptation period is over.
Note: Proposal SDs are still modified after being marked, until the
adaptation period is over.
f(x) ∝ exp{−(1/2) x′Σ⁻¹x},  with Σ = [ 1  .9 ; .9  1 ]
• The difference with ARCH models: the shocks that govern the
volatility are not necessarily the mean-equation shocks, the εt’s. There is a separate volatility shock.
Priors (Beliefs):
Normal-Gamma for f(φ). (Standard Bayesian regression model)
Inverse-Gamma for f(σ²), i.e., f(σ⁻²) ∝ (σ⁻²)^(T/2−1) exp{−(λ/2) σ⁻²}
Normals for ω, β1.
Impose (assume) stationarity of ht. (Truncate as necessary)
RESt = ht^0.5 rt−1^γ εt
ln(ht) = ω + β1 ln(ht−1) + ση ηt
- In the SV model, we estimate the parameter vector and the latent
volatilities: Φ = {ω, γ, β1, ση} and Ht = {h1, ..., ht}.
- The parameter set therefore consists of Θ = {Ht, Φ} for all t.
Step 2:
Draw the underlying volatility using the multi-move simulation
sampler –see De Jong and Shephard (1995)–, based on parameter
values from step 1.
- The multi-move simulation sampler draws Ht for all the data points
as a single block. Recall we can write:
ln(RESt²) = ln(ht) + 2γ ln(rt−1) + ln(εt²)    (A - 1)
f(zt) ≈ Σi=1,…,7 qi fN(zt | mi − 1.2704, vi²),  i ∈ {1, 2, ..., 7}    (A - 2)
- Once the underlying mixture indicators kt, for all t, are known, (A-3) becomes
a deterministic linear equation and, along with the SV model, can be
represented as a linear state-space model.
Step 3:
Based on the output from steps 1 and 2, the underlying kt in (A-3) is
sampled as follows:
f(kt = i | ln(yt²), ln(ht)) ∝ qi fN(ln(yt²) | ln(ht) + mi − 1.2704, vi²)    (A - 4)
For every observation t, we evaluate the normal density under each of the
seven mixture components {kt = 1, 2, ..., 7}. Then, we select a kt based
on a draw from a uniform distribution.
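A sketch of this draw for a single observation t, assuming q, m and v2 hold the seven mixture weights, means and variances from (A-2) (their numerical values are not reproduced here):

# One draw of the mixture indicator k_t, per (A-4). Assumes q, m, v2 are the
# seven mixture weights, means, and variances from (A-2) (values not shown here).
draw_k <- function(log_y2_t, log_h_t, q, m, v2) {
  dens  <- q * dnorm(log_y2_t, mean = log_h_t + m - 1.2704, sd = sqrt(v2))
  probs <- dens / sum(dens)              # posterior probability of each component
  sample(1:7, size = 1, prob = probs)    # selecting k_t via a uniform draw
}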
Conclusions (Greene)
• Bayesian vs. Classical Estimation
– In principle, different philosophical views and differences in
interpretation
– As practiced, just two different algorithms
– The religious debate is a red herring –i.e., misleading.
• Of Bayesian ‘Inference’
– It is not statistical inference
– How do we discern any uncertainty in the results? This is
precisely the underpinning of the Bayesian method. There is
no uncertainty. It is ‘exact.’