Macroeconometrics I
Books
◦ Gourieroux, C. and A. Monfort (1997), Time Series and Dynamic Models. Cambridge:
Cambridge University Press. Chapters 1, 5, 6, 9.
• Hamilton, J.D. (1994). Time Series Analysis. Princeton: Princeton University Press.
Chapters 1 - 5, 7, 8, 11.
◦ Harvey, A.C. (1981a). The Econometric Analysis of Time Series. 2nd edition. Hemel
Hempstead: Philip Allan.
◦ Harvey, A.C. (1981b). Time Series Models. London: Philip Allan.
• Hendry, D.F. (1995). Dynamic Econometrics. Oxford: Oxford University Press.
Chapter 2.
• Johnston, J. and J. DiNardo (1997). Econometric Methods. 4th edition, New York:
McGraw-Hill. Chapter 7.
◦ Lütkepohl, H. (1991). Introduction to Multiple Time Series Analysis. New York:
Springer. Chapters 2, 3, 5.
◦ Spanos, A. (1986). Statistical Foundations of Econometric Modelling. Cambridge:
Cambridge University Press.
Advanced
◦ Box, G.E.P. and Jenkins, G.M. (1976). Time Series Analysis, Forecasting and Control.
San Francisco: Holden-Day.
◦ Davidson, J.E.H. (1994). Stochastic Limit Theory. Oxford: Oxford University Press.
◦ Hall, P. and Heyde, C.C. (1980). Martingale Limit Theory and its Applications. London:
Academic Press.
◦ White, H. (1984). Asymptotic Theory for Econometricians. London: Academic Press.
Course Structure
Concepts of statistical time series (Week 1): Stochastic processes, stationarity, ergodicity,
integration, autocorrelation, white noise, innovations, martingales etc.
Types of linear time series models (Week 2): AutoRegressive and Moving Average processes
Estimation and statistical inference (Week 3): Impact of autocorrelation on regression results, Maximum Likelihood estimation of Gaussian ARMA models, statistical testing, model specification procedures
Introduction
Time Series

[Figure: examples of observed economic time series]

Aims of time series analysis:
(1) Forecasting;
(2) Relation between time series: Causality and time lags;
(3) Distinction between short and long run;
(4) Study of agent’s expectations;
(5) Trend removal;
(6) Seasonal adjustment;
(7) Detection of structural breaks;
(8) Control of the process.
The data generating process can be characterized by the conditional density f(y_t | Y_{t−1}). This suggests the decomposition

y_t = E[y_t | Y_{t−1}] + ε_t,

where (i) E[y_t | Y_{t−1}] is the component of y_t that can be predicted once the history Y_{t−1} of the process is known and (ii) ε_t denotes the unpredictable news.
Let (Ω, M, Pr) be a probability space, where Ω is the sample space (set of all elementary
events), M is a sigma-algebra of events or subsets of Ω, and Pr is a probability measure
defined on M.
Definition 1. A random variable is a real valued function y : Ω → R such that for each real
number c, Ac = {ω ∈ Ω|y(ω) ≤ c} ∈ M.
In other words, Ac is an event for which the probability is defined in terms of Pr . The function
F : R → [0, 1] defined by F (c) = Pr(Ac ) is the distribution function of y.
Suppose T is some index set with at most countably many elements like, for instance, the set
of all integers or all positive integers.
Definition 2. A stochastic process is a real-valued function

y : T × Ω → R

such that for each fixed t ∈ T , y(t, ·) is a random variable. Its realizations can be arranged in a table with one row per date t_i ∈ T and one column per elementary event ω_j ∈ Ω:

        ω_0            · · ·   ω_j            · · ·   ω_m
t_0     y_{t_0}(ω_0)           y_{t_0}(ω_j)           y_{t_0}(ω_m)
 ⋮
t_i     y_{t_i}(ω_0)   · · ·   y_{t_i}(ω_j)   · · ·   y_{t_i}(ω_m)
 ⋮
t_n     y_{t_n}(ω_0)           y_{t_n}(ω_j)           y_{t_n}(ω_m)

Each column (ω fixed, t varying) is a realization or sample path of the process; each row (t fixed, ω varying) is a random variable.
Definition 3. A time series {yt }Tt=1 is (the finite part of) a particular realization {yt }t∈T of a
stochastic process.
The autocovariances γ_t(h) = Cov(y_t, y_{t−h}) are standardized to give the autocorrelation function

ρ_t(h) = γ_t(h) / √( γ_t(0) γ_{t−h}(0) ).
1.3 Stationarity
Definition 5. The process {y_t} is said to be strictly stationary if, for any values of h_1, h_2, . . . , h_n, the joint distribution of (y_t, y_{t+h_1}, . . . , y_{t+h_n}) depends only on the intervals h_1, h_2, . . . , h_n but not on the date t itself.

Definition 6. The process {y_t} is said to be weakly (covariance) stationary if its first and second moments are time-invariant: E[y_t] = µ, E[(y_t − µ)²] = γ(0) < ∞ and E[(y_t − µ)(y_{t−h} − µ)] = γ(h) for all t and h.
1.4 Ergodicity
The statistical ergodicity theorem concerns the question of what information about the ensemble average at each point in time can be recovered from a single average over time. Note that the WLLN does not apply directly, as the observed time series represents just one realization of the stochastic process.
Definition 7. Let {y_t(ω), ω ∈ Ω, t ∈ T } be a weakly stationary process, such that E[y_t(ω)] = µ < ∞ and E[(y_t(ω) − µ)²] = σ_y² < ∞ for all t. Let ȳ_T = T^{−1} Σ_{t=1}^{T} y_t be the time average. If ȳ_T converges in probability to µ as T → ∞, {y_t} is said to be ergodic for the mean.
Note that ergodicity focuses on asymptotic independence, while stationarity focuses on the time-invariance of the process. For the type of stochastic processes considered in this lecture, one will imply the other. However, as illustrated by the following example, they can differ:
Example 1. Consider the stochastic process {y_t} defined by

y_t = u_0 for t = 0,   y_t = y_{t−1} for t > 0,   where u_0 ∼ N(0, σ²).

The process is weakly stationary:

E[y_t] = E[u_0] = 0,
E[y_t²] = E[u_0²] = σ²,
E[y_t y_{t−h}] = E[u_0²] = σ².

If {y_t} were ergodic for the mean, the time average would satisfy

ȳ_T = T^{−1} Σ_{t=0}^{T−1} y_t →_p 0.

But one can easily see that ȳ_T = T^{−1} Σ_{t=0}^{T−1} y_t = u_0, where u_0 is the realization of a normally distributed random variable.
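A minimal simulation sketch of Example 1 (Python, numpy only; all names are illustrative) makes the failure of ergodicity concrete: every time average equals the single draw u_0 rather than the ensemble mean 0.

```python
# Sketch: each replication draws u0 once and keeps y_t = u0 for all t, so the
# time average equals u0 and does not converge to the ensemble mean 0.
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
T = 10_000
n_rep = 5

for r in range(n_rep):
    u0 = rng.normal(0.0, sigma)          # single draw governing the whole path
    y = np.full(T, u0)                   # y_t = y_{t-1} = ... = u0
    print(f"replication {r}: time average = {y.mean():+.3f} (equals u0 = {u0:+.3f})")

# For contrast, an ergodic white-noise process: the time average does go to 0.
u = rng.normal(0.0, sigma, size=T)
print(f"white noise: time average = {u.mean():+.4f}")
```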
To be ergodic, the memory of a stochastic process should fade in the sense that the covariance between increasingly distant observations converges to zero sufficiently rapidly. For stationary processes it can be shown that absolutely summable autocovariances, Σ_{h=0}^{∞} |γ(h)| < ∞, are sufficient to ensure ergodicity.
Similarly, we can define ergodicity for the second moments:

γ̂(h) = (T − h)^{−1} Σ_{t=h+1}^{T} (y_t − µ)(y_{t−h} − µ) →_p γ(h).
In practice, the autocovariances are estimated by the sample autocovariances

γ̂(h) = T^{−1} Σ_{t=h+1}^{T} (y_t − ȳ)(y_{t−h} − ȳ).

Note that for an AR(p) process the partial autocorrelations satisfy α_h^{(h)} = 0 for h > p, such that

√T α̂_h^{(h)} →_d N(0, 1).
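A short sketch of the sample autocorrelations together with the ±1.96/√T band that is conventionally used to judge the significance of individual (partial) autocorrelations, as implied by the asymptotic normality above (Python, numpy only; function names are illustrative):

```python
# Sketch: sample autocovariances/autocorrelations and the +/- 1.96/sqrt(T) band.
import numpy as np

def sample_acf(y, max_lag):
    y = np.asarray(y, dtype=float)
    T = y.size
    ybar = y.mean()
    gamma0 = np.sum((y - ybar) ** 2) / T
    acf = np.empty(max_lag + 1)
    for h in range(max_lag + 1):
        gamma_h = np.sum((y[h:] - ybar) * (y[:T - h] - ybar)) / T
        acf[h] = gamma_h / gamma0
    return acf

rng = np.random.default_rng(1)
T = 200
y = np.zeros(T)
for t in range(1, T):                      # AR(1) with alpha_1 = 0.5 as illustration
    y[t] = 0.5 * y[t - 1] + rng.normal()

acf = sample_acf(y, max_lag=10)
band = 1.96 / np.sqrt(T)
for h, r in enumerate(acf):
    flag = "*" if (h > 0 and abs(r) > band) else " "
    print(f"lag {h:2d}: rho_hat = {r:+.3f} {flag}")
print(f"approximate 95% band: +/- {band:.3f}")
```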
[Figure: first differences ∆i_t of the series i_t]
Random walk: consider the process

y_t = y_{t−1} + u_t,

where u_t is independent, identically distributed with zero mean and variance σ² < ∞ for all t > 0. The random walk is non-stationary and, hence, non-ergodic. By recursive substitution,

y_t = y_0 + Σ_{s=1}^{t} u_s for all t > 0.
But the second moments are diverging. The variance is given by:
γ_t(0) = E[y_t²] = E[(y_0 + Σ_{s=1}^{t} u_s)²] = E[(Σ_{s=1}^{t} u_s)²]   (setting y_0 = 0)
       = E[ Σ_{s=1}^{t} Σ_{k=1}^{t} u_s u_k ] = E[ Σ_{s=1}^{t} u_s² + Σ_{s=1}^{t} Σ_{k≠s} u_s u_k ]
       = Σ_{s=1}^{t} E[u_s²] + Σ_{s=1}^{t} Σ_{k≠s} E[u_s u_k]
       = Σ_{s=1}^{t} σ² = tσ².
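A quick simulation check of the result Var(y_t) = tσ² (Python, numpy only; a sketch with illustrative names, starting the walks at y_0 = 0):

```python
# Sketch: the cross-section variance of simulated random walks grows roughly
# linearly in t, matching Var(y_t) = t * sigma^2 derived above.
import numpy as np

rng = np.random.default_rng(2)
sigma = 1.0
T, n_rep = 200, 20_000

u = rng.normal(0.0, sigma, size=(n_rep, T))
y = np.cumsum(u, axis=1)                  # y_t = u_1 + ... + u_t (y_0 = 0)

for t in (10, 50, 100, 200):
    print(f"t = {t:3d}: sample Var(y_t) = {y[:, t - 1].var():7.2f}   t*sigma^2 = {t * sigma**2:6.1f}")
```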
Definition 8. A white-noise process is a weakly stationary process which has a zero mean and is uncorrelated over time:

u_t ∼ WN(0, σ²).
Thus {u_t} is a WN process if for all t ∈ T : E[u_t] = 0, E[u_t²] = σ² < ∞ and E[u_t u_{t−h}] = 0 for h ≠ 0 and t − h ∈ T . If the assumption of a constant variance is relaxed to E[u_t²] < ∞, {u_t} is sometimes called a weak WN process.
Definition 9. A Gaussian white-noise process is a serially independent, normally distributed process with zero mean and constant variance:

u_t ∼ NID(0, σ²).

Note that the assumption of normality implies strict stationarity and serial independence (unpredictability). A generalization of the NID is the IID process with constant, but unspecified, higher moments:
Definition 10. A process {u_t} with independent, identically distributed variates is denoted IID:

u_t ∼ IID(0, σ²).
Definition 11. The stochastic process {x_t} is said to be a martingale with respect to an information set J_{t−1} of data realized by time t − 1 if E[|x_t|] < ∞ and the conditional expectation E[x_t | J_{t−1}] = x_{t−1}. The process {u_t = x_t − x_{t−1}} with E[|u_t|] < ∞ and E[u_t | J_{t−1}] = 0 for all t is called a martingale difference sequence (MDS).
The assumption of a MDS is often made in limit theorems. For empirical modelling, the
following is crucial:
Definition 12. An innovation {u_t} against an information set J_{t−1} is a process whose density f(u_t | J_{t−1}) does not depend on J_{t−1}; also {u_t} is a mean innovation against an information set J_{t−1} if E[u_t | J_{t−1}] = 0.
Thus an innovation {ut } must be WN(0, σ 2 ) if Jt−1 contains the history Ut−1 of ut , but not
conversely. Consequently, an innovation must be a MDS.
Lag operator

L y_t = y_{t−1},   L^j y_t = y_{t−j} for all j ∈ N.

Differencing operator If {y_t} is a stochastic process, then the following processes also exist:

∆y_t = (1 − L) y_t = y_t − y_{t−1} (first differences),
∆² y_t = ∆(∆y_t) = y_t − 2 y_{t−1} + y_{t−2},
∆_s y_t = (1 − L^s) y_t = y_t − y_{t−s} (seasonal differences).
Linear Filter Transformation of an input series {x_t} into an output series {y_t} by applying the lag polynomial A(L):

y_t = A(L) x_t = Σ_{j=−n}^{m} a_j L^j x_t = Σ_{j=−n}^{m} a_j x_{t−j} = a_{−n} x_{t+n} + · · · + a_0 x_t + · · · + a_m x_{t−m}.
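A minimal sketch of applying a one-sided lag polynomial A(L) = a_0 + a_1 L + · · · + a_m L^m to an input series (Python, numpy only; the function `linear_filter` and the coefficients are illustrative; a two-sided filter only changes the alignment of the output):

```python
# Sketch: y_t = sum_j a_j x_{t-j} applied with numpy.
import numpy as np

def linear_filter(x, a):
    """Apply A(L) = a_0 + a_1 L + ... + a_m L^m to the input series x."""
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    m = a.size - 1
    y = np.full(x.size, np.nan)
    for t in range(m, x.size):
        y[t] = np.dot(a, x[t - np.arange(m + 1)])   # sum_j a_j * x_{t-j}
    return y

x = np.arange(10, dtype=float)
a = np.array([0.25, 0.5, 0.25])      # a simple smoothing polynomial
print(linear_filter(x, a))            # first m values are undefined (nan)
```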
Wold decomposition: every weakly stationary process {y_t} can be decomposed as

y_t = κ_t + Σ_{j=0}^{∞} ψ_j ε_{t−j},   ε_t ≡ y_t − Ê(y_t | Y_{t−1}),

where Ê(· | Y_{t−1}) denotes the best linear prediction given Y_{t−1}. The value of κ_t is uncorrelated with ε_{t−j} for any j, though κ_t can be predicted arbitrarily well from a linear function of Y_{t−1}:

κ_t = Ê(κ_t | Y_{t−1}).
Box–Jenkins approach to modelling time series: Approximate the infinite lag polynomial
with the ratio of two finite-order polynomials α(L) and β(L):
Ψ(L) = Σ_{j=0}^{∞} ψ_j L^j ≅ β(L)/α(L) = (1 + β_1 L + · · · + β_q L^q) / (1 − α_1 L − · · · − α_p L^p).
p      q      Model                   Type
p > 0  q = 0  α(L) y_t = ε_t          (pure) autoregressive process of order p, AR(p)
p = 0  q > 0  y_t = β(L) ε_t          (pure) moving-average process of order q, MA(q)
p > 0  q > 0  α(L) y_t = β(L) ε_t     (mixed) autoregressive moving-average process, ARMA(p, q)
A p-th order autoregression, denoted AR(p) process, satisfies the difference equation:
y_t = ν + Σ_{j=1}^{p} α_j y_{t−j} + ε_t,   where ε_t ∼ WN(0, σ²).
If the process is stable, it can be inverted into the MA(∞) representation

y_t = α(1)^{−1} ν + α(L)^{−1} ε_t

y_t = µ + Σ_{j=0}^{∞} ψ_j ε_{t−j},   where µ = ν/α(1) and Ψ(L) = α(L)^{−1} with Σ_{j=0}^{∞} |ψ_j| < ∞.
Stability analysis based on the p-th order linear inhomogeneous difference equation:
y_t − µ = Σ_{j=1}^{p} α_j (y_{t−j} − µ).
Autocovariance function
Yule-Walker equations
ρ_1 = α_1 + α_2 ρ_1 + · · · + α_p ρ_{p−1}
ρ_2 = α_1 ρ_1 + α_2 + · · · + α_p ρ_{p−2}
 ⋮                                            ⇒ ρ_1, · · · , ρ_p
ρ_p = α_1 ρ_{p−1} + α_2 ρ_{p−2} + · · · + α_p

ρ_k = α_1 ρ_{k−1} + α_2 ρ_{k−2} + · · · + α_p ρ_{k−p}   for k > p.
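The Yule–Walker system can be solved numerically; the following sketch (Python, numpy only; function and variable names are illustrative) computes ρ_1, . . . , ρ_p and then extends the autocorrelations with the recursion for k > p:

```python
# Sketch: solve the first p Yule-Walker equations for (rho_1, ..., rho_p) of a
# stable AR(p), then extend with rho_k = a_1 rho_{k-1} + ... + a_p rho_{k-p}.
import numpy as np

def ar_autocorrelations(alpha, max_lag):
    alpha = np.asarray(alpha, dtype=float)
    p = alpha.size
    A = np.eye(p)
    b = np.zeros(p)
    for k in range(1, p + 1):               # equation for rho_k
        for j in range(1, p + 1):
            lag = abs(k - j)
            if lag == 0:
                b[k - 1] += alpha[j - 1]            # alpha_j * rho_0, rho_0 = 1
            else:
                A[k - 1, lag - 1] -= alpha[j - 1]   # move alpha_j * rho_lag to the LHS
    rho = np.ones(max_lag + 1)
    rho[1:p + 1] = np.linalg.solve(A, b)
    for k in range(p + 1, max_lag + 1):             # recursion for k > p
        rho[k] = np.dot(alpha, rho[k - p:k][::-1])
    return rho

print(np.round(ar_autocorrelations([0.6, 0.3], 6), 4))   # AR(2) with (0.6, 0.3)
```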
[Figures: simulated time series with sample ACF and PACF for a NID(0,1) process; AR(1) processes with α_1 = −0.5, 0.9, 1.0; and AR(2) processes with (α_1, α_2) = (0.6, 0.3), (0.3, 0.6), (1.6, −0.8)]
A q-th order moving-average process, denoted MA(q), satisfies

y_t = µ + β(L) ε_t = µ + ε_t + Σ_{i=1}^{q} β_i ε_{t−i}.
If the MA(q) process is invertible, i.e. all roots of β(z) = 0 lie outside the unit circle, it has the AR(∞) representation

β(L)^{−1} y_t = β(1)^{−1} µ + ε_t

y_t = µ + Σ_{j=1}^{∞} φ_j (y_{t−j} − µ) + ε_t

where φ(L) = 1 − Σ_{j=1}^{∞} φ_j L^j = 1 − φ_1 L − φ_2 L² − · · · = β(L)^{−1}.
Autocovariance function: Consider z_t = y_t − µ = Σ_{i=0}^{q} β_i ε_{t−i}, β_0 = 1:

γ_0 = ( Σ_{i=0}^{q} β_i² ) σ²

γ_k = ( Σ_{i=0}^{q−k} β_i β_{i+k} ) σ²   for k = 1, 2, . . . , q

γ_k = 0   for k > q.
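A small sketch computing the theoretical MA(q) autocovariances from these formulas (Python, numpy only; function name and example coefficients are illustrative):

```python
# Sketch: gamma_k = sigma^2 * sum_i beta_i beta_{i+k}, with beta_0 = 1,
# and gamma_k = 0 for k > q, as derived above.
import numpy as np

def ma_autocovariances(beta, sigma2=1.0, max_lag=None):
    b = np.concatenate(([1.0], np.asarray(beta, dtype=float)))   # (beta_0, ..., beta_q)
    q = b.size - 1
    if max_lag is None:
        max_lag = q + 2
    gamma = np.zeros(max_lag + 1)
    for k in range(max_lag + 1):
        if k <= q:
            gamma[k] = sigma2 * np.sum(b[: q - k + 1] * b[k:])
    return gamma

gamma = ma_autocovariances([-0.6, 0.3, -0.5, 0.5])    # the MA(4) of the figures
print(np.round(gamma, 3))             # cuts off to zero after lag q = 4
print(np.round(gamma / gamma[0], 3))  # implied autocorrelations
```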
[Figures: simulated time series with sample ACF and PACF for MA(1) processes, including β_1 = −1.0, and an MA(4) process with β = (−0.6, 0.3, −0.5, 0.5)]
An ARMA(p, q) process combines both components:

α(L) y_t = ν + β(L) ε_t

y_t = ν + Σ_{j=1}^{p} α_j y_{t−j} + ε_t + Σ_{i=1}^{q} β_i ε_{t−i}.
(i) Stability, α(z) = 0 ⇒ |z| > 1, guarantees stationarity and the MA(∞) representation

y_t = α(1)^{−1} ν + α(L)^{−1} β(L) ε_t = µ + Σ_{j=0}^{∞} ψ_j ε_{t−j}.

(ii) Invertibility, β(z) = 0 ⇒ |z| > 1, guarantees the corresponding AR(∞) representation.
Example: ARMA(1, 1)
γ(0) = E[y_t²] = E[(α_1 y_{t−1})²] + 2 E[(α_1 y_{t−1})(ε_t + β_1 ε_{t−1})] + E[(ε_t + β_1 ε_{t−1})²]
     = α_1² E[y_{t−1}²] + 2 α_1 β_1 E[y_{t−1} ε_{t−1}] + E[ε_t² + β_1² ε_{t−1}²]
     = α_1² γ(0) + (1 + 2 α_1 β_1 + β_1²) σ_ε²

γ(0) = (1 − α_1²)^{−1} (1 + 2 α_1 β_1 + β_1²) σ_ε².
[Figures: simulated time series with sample ACF and PACF for ARMA(1,1) processes with α_1 = 0.5 and β_1 = 0.3, 0.9, −0.9]
The ψ-weights of the MA(∞) representation follow from comparing coefficients in α(L) Ψ(L) = β(L):

(1 − α_1 L − · · · − α_p L^p)(ψ_0 + ψ_1 L + ψ_2 L² + · · ·) = 1 + β_1 L + · · · + β_q L^q

L⁰:  ψ_0 = 1
L¹:  ψ_1 − α_1 ψ_0 = β_1                    ⇒ ψ_1 = β_1 + α_1 ψ_0
L²:  ψ_2 − α_1 ψ_1 − α_2 ψ_0 = β_2          ⇒ ψ_2 = β_2 + α_1 ψ_1 + α_2 ψ_0
L^h: ψ_h − Σ_{i=1}^{h} α_i ψ_{h−i} = β_h    ⇒ ψ_h = β_h + Σ_{i=1}^{h} α_i ψ_{h−i}

(with α_i = 0 for i > p and β_h = 0 for h > q).
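The recursion is straightforward to implement; a sketch (Python, numpy only; the function name is illustrative):

```python
# Sketch: psi-weights of the MA(infinity) representation from the recursion above,
# with alpha_i = 0 for i > p and beta_h = 0 for h > q.
import numpy as np

def arma_psi_weights(alpha, beta, n):
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    psi = np.zeros(n + 1)
    psi[0] = 1.0
    for h in range(1, n + 1):
        b_h = beta[h - 1] if h <= beta.size else 0.0
        s = sum(alpha[i - 1] * psi[h - i] for i in range(1, min(h, alpha.size) + 1))
        psi[h] = b_h + s
    return psi

# ARMA(1,1) with alpha_1 = 0.5, beta_1 = 0.3: psi_j = (alpha_1 + beta_1) alpha_1^{j-1}
print(np.round(arma_psi_weights([0.5], [0.3], 8), 4))
```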
Under a mean-squared-error criterion, the optimal predictor of y_{t+h} is given by the conditional expectation for the given information set Ω_t:

ŷ_{t+h|t} = E[y_{t+h} | Ω_t],

where in the following the available information is the past of the stochastic process up to time t, Ω_t = Y_t.
For stationary ARMA processes, unlike many non-linear DGPs, the conditional mean can be
easily derived analytically: Using the AR(∞) representation,
y_{t+h} = µ + Σ_{j=1}^{∞} φ_j (y_{t+h−j} − µ) + ε_{t+h},

and applying the expectation operator, the optimal predictor results as follows:

ŷ_{t+h|t} = E[y_{t+h} | Y_t] = µ + Σ_{j=1}^{h−1} φ_j (E[y_{t+h−j} | Y_t] − µ) + Σ_{j=0}^{∞} φ_{h+j} (E[y_{t−j} | Y_t] − µ).
Using that E[ys |Yt ] = ys for s ≤ t, the optimal predictor is given by:
ŷ_{t+h|t} = µ + Σ_{j=1}^{h−1} φ_j (ŷ_{t+h−j|t} − µ) + Σ_{j=0}^{∞} φ_{h+j} (y_{t−j} − µ).
Thus the predictor ŷt+h|t of an ARMA(p, q) process is a linear function of the past realizations
of the process.
Alternatively, using the MA(∞) representation y_{t+h} = µ + Σ_{j=0}^{∞} ψ_j ε_{t+h−j} and that E[ε_s | Y_t] = 0 for s > t, the optimal predictor can be written as

ŷ_{t+h|t} = µ + Σ_{j=h}^{∞} ψ_j ε_{t+h−j}.

The prediction error associated with the optimal predictor ŷ_{t+h|t} is given by

ê_{t+h|t} = y_{t+h} − ŷ_{t+h|t} = Σ_{j=0}^{h−1} ψ_j ε_{t+h−j}.
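A sketch of the h-step forecast recursion for an AR(p), in which unknown future values are replaced by their own forecasts (Python, numpy only; function and parameter names are illustrative):

```python
# Sketch: iterate the conditional-expectation recursion, setting future epsilons to zero.
import numpy as np

def ar_forecast(y, nu, alpha, horizon):
    """Forecast y_{T+1}, ..., y_{T+horizon} given the sample y and AR coefficients."""
    alpha = np.asarray(alpha, dtype=float)
    p = alpha.size
    history = list(np.asarray(y, dtype=float)[-p:])    # last p observations
    forecasts = []
    for _ in range(horizon):
        y_hat = nu + np.dot(alpha, history[::-1])       # nu + sum_j alpha_j * y_{t+h-j}
        forecasts.append(y_hat)
        history = history[1:] + [y_hat]                 # known values or earlier forecasts
    return np.array(forecasts)

rng = np.random.default_rng(3)
nu, alpha = 1.0, [0.5]
y = np.zeros(100)
for t in range(1, 100):
    y[t] = nu + alpha[0] * y[t - 1] + rng.normal()

print(np.round(ar_forecast(y, nu, alpha, horizon=8), 3))
print("unconditional mean:", nu / (1 - alpha[0]))       # forecasts converge to mu = 2
```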
Gaussian AR(1):

y_t = ν + α_1 y_{t−1} + ε_t,   ε_t ∼ NID(0, σ²).

Denoting Y_T ∼ N(µ, σ_y² Ω) with Ω_ij = α_1^{|i−j|} and collecting the parameters of the model in λ = (ν : α : σ²)′, the likelihood function is given by

L(λ) := f(Y_T ; λ) = (2π σ_y²)^{−T/2} |Ω|^{−1/2} exp{ −(1/(2σ_y²)) (Y_T − µ)′ Ω^{−1} (Y_T − µ) }.
The prediction-error decomposition uses that the ε_t are independent, identically distributed:

f(ε_2, · · · , ε_T) = Π_{t=2}^{T} f_ε(ε_t).

Since ε_t = y_t − (ν + α_1 y_{t−1}), we have that f(y_t | y_{t−1}) = f_ε( y_t − (ν + α_1 y_{t−1}) ) for t = 2, · · · , T. Hence:

f(y_1, · · · , y_T) = [ Π_{t=2}^{T} f(y_t | y_{t−1}) ] f(y_1).
ℓ(λ) = log L(λ) = Σ_{t=2}^{T} log f(y_t | Y_{t−1}; λ) + log f(y_1 | λ)

     = −(T/2) log(2π) − { ((T−1)/2) log(σ_ε²) + (1/(2σ_ε²)) Σ_{t=2}^{T} ε_t² } − { (1/2) log σ_y² + (1/(2σ_y²)) (y_1 − µ)² },

where y_1 ∼ N(µ, σ_y²) with µ = ν/(1 − α_1) and σ_y² = σ_ε²/(1 − α_1²).
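A sketch of this exact Gaussian AR(1) log-likelihood via the prediction-error decomposition (Python, numpy only; function and variable names are illustrative), maximized here over a crude grid for α_1 only, to keep the sketch free of an optimizer:

```python
# Sketch: exact AR(1) log-likelihood as written above.
import numpy as np

def ar1_exact_loglik(y, nu, alpha1, sigma2_eps):
    y = np.asarray(y, dtype=float)
    T = y.size
    mu = nu / (1.0 - alpha1)
    sigma2_y = sigma2_eps / (1.0 - alpha1 ** 2)          # requires |alpha1| < 1
    eps = y[1:] - nu - alpha1 * y[:-1]                   # prediction errors, t = 2..T
    ll = -0.5 * T * np.log(2 * np.pi)
    ll += -0.5 * (T - 1) * np.log(sigma2_eps) - np.sum(eps ** 2) / (2 * sigma2_eps)
    ll += -0.5 * np.log(sigma2_y) - (y[0] - mu) ** 2 / (2 * sigma2_y)
    return ll

rng = np.random.default_rng(4)
T, nu, alpha1, sigma2 = 500, 1.0, 0.6, 1.0
y = np.empty(T)
y[0] = rng.normal(nu / (1 - alpha1), np.sqrt(sigma2 / (1 - alpha1 ** 2)))
for t in range(1, T):
    y[t] = nu + alpha1 * y[t - 1] + np.sqrt(sigma2) * rng.normal()

grid = np.linspace(0.3, 0.9, 61)
ll = [ar1_exact_loglik(y, nu, a, sigma2) for a in grid]
print("grid MLE of alpha1 (nu, sigma2 held at truth):", grid[int(np.argmax(ll))])
```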
Gaussian AR(p):

Exact MLE

The exact MLE follows from the prediction-error decomposition as:

λ̃ = arg max_λ f(Y_T | λ) = arg max_λ [ Π_{t=p+1}^{T} f(y_t | Y_{t−1}; λ) ] f(y_1, · · · , y_p | λ).

Conditional MLE

The conditional MLE treats the first p observations as fixed:

λ̂ = arg max_λ f(Y_T | Y_p; λ) = arg max_λ Π_{t=p+1}^{T} f(y_t | Y_{t−1}; λ).
Conditioning on Y_p = (y_1, · · · , y_p)′:

ℓ(λ) = log f(Y_T | Y_p; λ) = Σ_{t=p+1}^{T} log f(y_t | Y_{t−1}; λ)

     = −((T−p)/2) log(2π) − ((T−p)/2) log(σ_ε²) − (1/(2σ_ε²)) Σ_{t=p+1}^{T} ε_t²(ν, α),

so that conditional ML is equivalent to least-squares estimation:

arg max_{(ν,α)} ℓ(ν, α) = arg min_{(ν,α)} Σ_{t=p+1}^{T} ε_t²(ν, α).

Asymptotically equivalent alternatives use the mean-adjusted model

y_t − µ = Σ_{j=1}^{p} α_j (y_{t−j} − µ) + ε_t,

where µ = (1 − α_1 − · · · − α_p)^{−1} ν:
• OLS of α in the mean-adjusted model above, where µ is estimated by the sample mean ȳ = T^{−1} Σ_{t=1}^{T} y_t;
• Yule-Walker estimation of α (method of moments):

  (α_1, · · · , α_p)′ = [ γ̂_{|i−j|} ]_{i,j=1,...,p}^{−1} (γ̂_1, · · · , γ̂_p)′,

  where γ̂_h = (T − h)^{−1} Σ_{t=h+1}^{T} (y_t − ȳ)(y_{t−h} − ȳ) and µ̂ = ȳ = T^{−1} Σ_{t=1}^{T} y_t.
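A sketch comparing OLS and Yule–Walker estimates of an AR(2) on simulated data (Python, numpy only; function and variable names are illustrative):

```python
# Sketch: estimate an AR(2) by OLS and by the Yule-Walker (method-of-moments)
# equations, using the sample autocovariances defined above.
import numpy as np

def sample_autocov(y, h):
    y = np.asarray(y, dtype=float)
    T, ybar = y.size, y.mean()
    return np.sum((y[h:] - ybar) * (y[:T - h] - ybar)) / (T - h)

rng = np.random.default_rng(5)
T, alpha = 2000, np.array([0.6, 0.3])
y = np.zeros(T)
for t in range(2, T):
    y[t] = alpha[0] * y[t - 1] + alpha[1] * y[t - 2] + rng.normal()

p = 2
# OLS: regress y_t on a constant and (y_{t-1}, ..., y_{t-p})
Z = np.column_stack([np.ones(T - p)] + [y[p - j:T - j] for j in range(1, p + 1)])
theta_ols = np.linalg.lstsq(Z, y[p:], rcond=None)[0]

# Yule-Walker: solve Gamma_hat * alpha = gamma_hat
gam = np.array([sample_autocov(y, h) for h in range(p + 1)])
Gamma = np.array([[gam[abs(i - j)] for j in range(p)] for i in range(p)])
alpha_yw = np.linalg.solve(Gamma, gam[1:])

print("OLS        :", np.round(theta_ols[1:], 3))
print("Yule-Walker:", np.round(alpha_yw, 3))
```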
Gaussian MA(q):

Conditional on the initial error ε_0, the log-likelihood is

ℓ(λ) = log f(Y_T | ε_0; λ) = Σ_{t=1}^{T} log f(ε_t | ε_{t−1}; λ)

     = −(T/2) log(2π) − (T/2) log(σ_ε²) − (1/(2σ_ε²)) Σ_{t=1}^{T} ε_t²(µ, β).

For an MA(1) process (with µ = 0 for simplicity), the errors can be computed recursively from the observations:

ε_t = y_t − β_1 ε_{t−1} = (−β_1)^t ε_0 + Σ_{j=0}^{t−1} (−β_1)^j y_{t−j}.
Gaussian ARMA(p, q):

ℓ(λ) = −((T−p)/2) log(2π) − ((T−p)/2) log(σ_ε²) − (1/(2σ_ε²)) Σ_{t=p+1}^{T} ε_t².

• The initial values of the unobservable error terms are set to zero: ε_p := E[ε_p] = 0.
• Assumes stability and invertibility.
• Exact ML and backcasting of ε_p are possible.
• Check for common roots in α(L) and β(L).
Grid-Search

• Consider a grid of potential values of β = (β_1, · · · , β_q)′: {β^{(1)}, · · · , β^{(M)}}.
• Maximize the log-likelihood ℓ(ν, α, σ² | β^{(m)}) conditional on β^{(m)} = (β_1^{(m)}, · · · , β_q^{(m)})′ by GLS:

  (ν, α, σ²)^{(m)} = arg max_{ν,α,σ²} ℓ(ν, α, σ² | β^{(m)}),

  where the step length s is selected by evaluating the log-likelihood function at several points (grid search).
• General problems: convergence; local maxima.
3.5 The asymptotic distribution of the Ordinary Least Squares (OLS) estimator for time
series models
y_t = z_t′ θ + u_t
OLS estimation
θ̂ = (Z′Z)^{−1} Z′y
   = (Z′Z)^{−1} Z′(Zθ + u)
   = θ + (Z′Z)^{−1} Z′u
   = θ + ( T^{−1} Z′Z )^{−1} ( T^{−1} Z′u )

⇒ √T (θ̂ − θ) = ( T^{−1} Z′Z )^{−1} ( T^{−1/2} Z′u ).
If yt is a stable AR(p) process and ut is a standard white noise process, then the following
results hold (see Mann & Wald, 1943):
T^{−1} Z′Z →_p Γ

T^{−1/2} Z′u →_d N(0, σ_u² Γ)

where Γ = plim T^{−1} Z′Z.
A necessary condition for the consistency of OLS with stochastic (but stationary) regressors is that z_t is asymptotically uncorrelated with u_t, i.e. plim T^{−1} Z′u = 0:

plim (θ̂ − θ) = plim ( T^{−1} Z′Z )^{−1} plim ( T^{−1} Z′u ) = Γ^{−1} plim ( T^{−1} Z′u ).
Example: consider an AR(1) model with autocorrelated errors,

y_t = α y_{t−1} + u_t
u_t = ρ u_{t−1} + ε_t,   ε_t ∼ WN(0, σ_ε²),

such that Z′ = [y_0, . . . , y_{T−1}]. Then

E[ T^{−1} Z′u ] = T^{−1} Σ_{t=1}^{T} E[y_{t−1} u_t] = T^{−1} Σ_{t=1}^{T} E[ y_{t−1} (ρ u_{t−1} + ε_t) ]

               = T^{−1} Σ_{t=1}^{T} E[ y_{t−1} ( ρ (y_{t−1} − α y_{t−2}) + ε_t ) ]

               = ρ T^{−1} Σ_{t=1}^{T} E[y_{t−1}²] − αρ T^{−1} Σ_{t=1}^{T} E[y_{t−1} y_{t−2}] + T^{−1} Σ_{t=1}^{T} E[y_{t−1} ε_t]

               = ρ [ γ_y(0) − α γ_y(1) ],

where γ_y(h) is the autocovariance function of {y_t}, which can be represented as an AR(2) process. Since this expression is nonzero for ρ ≠ 0, OLS is inconsistent when a lagged dependent variable is combined with autocorrelated errors in this way. The same conclusion holds for the Monte Carlo sketch following this process.
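A Monte Carlo sketch of this inconsistency (Python, numpy only; all values illustrative): with α = ρ = 0.5 the OLS estimates concentrate around the first autocorrelation of the implied AR(2) (about 0.8), not around α.

```python
# Sketch: OLS on y_t = alpha*y_{t-1} + u_t is inconsistent when u_t is itself AR(1).
import numpy as np

rng = np.random.default_rng(6)
alpha, rho, T, n_rep = 0.5, 0.5, 1000, 200
alpha_hat = np.empty(n_rep)

for r in range(n_rep):
    eps = rng.normal(size=T)
    u = np.zeros(T)
    y = np.zeros(T)
    for t in range(1, T):
        u[t] = rho * u[t - 1] + eps[t]
        y[t] = alpha * y[t - 1] + u[t]
    alpha_hat[r] = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)   # OLS without intercept

print(f"true alpha = {alpha}, mean OLS estimate = {alpha_hat.mean():.3f}")
# the estimates settle away from alpha because plim T^-1 Z'u = rho*(gamma_y(0) - alpha*gamma_y(1)) != 0
```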
Model selection: consider ARMA models for the (possibly differenced) series,

α(L) ∆^d y_t = ν + β(L) u_t.

The model m can be chosen by minimizing an information criterion of the general form

IC(m) = −(2/T) log L̃(m) + (c_T/T) N(m),

where N(m) is the number of freely estimated parameters in model m and L̃(m) is the maximized likelihood. Very often the information criterion is formulated in terms of the estimated residual variance instead of the log-likelihood. The model selection then consists in choosing the model which minimizes

IC(m) = log σ̃²(m) + (c_T/T) N(m).
Note that the concentrated likelihood function is given by

log L = −T/2 − (T/2) log(2π) − (T/2) log(σ̃²),

such that the two criteria are equivalent.
The most popular information criteria are the following:

Hannan–Quinn criterion, c_T = 2 log(log(T)):

HQ(m) = log σ̃²(m) + (2 log(log(T))/T) N(m);

Schwarz or Bayesian criterion, c_T = log(T):

SC(m) = log σ̃²(m) + (log(T)/T) N(m).
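A sketch of order selection by HQ and SC for autoregressions fitted by OLS (Python, numpy only; function names and the simulated DGP are illustrative):

```python
# Sketch: compute HQ and SC from the estimated residual variance for p = 1, ..., 6.
import numpy as np

def fit_ar_ols(y, p):
    T = y.size
    Z = np.column_stack([np.ones(T - p)] + [y[p - j:T - j] for j in range(1, p + 1)])
    theta, *_ = np.linalg.lstsq(Z, y[p:], rcond=None)
    resid = y[p:] - Z @ theta
    return np.mean(resid ** 2)                       # sigma_tilde^2(m)

rng = np.random.default_rng(7)
T = 400
y = np.zeros(T)
for t in range(2, T):                                # true DGP: AR(2)
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.normal()

for p in range(1, 7):
    s2 = fit_ar_ols(y, p)
    n_par = p + 1                                    # intercept + p AR coefficients
    hq = np.log(s2) + 2 * np.log(np.log(T)) / T * n_par
    sc = np.log(s2) + np.log(T) / T * n_par
    print(f"p = {p}: HQ = {hq:.4f}   SC = {sc:.4f}")
```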
General-to-specific modelling

‘LSE’ approach: starting from a general dynamic statistical model, which captures the essential characteristics of the underlying data set, standard testing procedures are used to reduce its complexity by eliminating statistically insignificant variables, checking the validity of the reductions at every stage to ensure the congruence of the selected model.
A.1 The asymptotic distribution of the Maximum Likelihood (ML) estimator for time
series models
We characterize the time series model for y_t by its density conditional on x_t and on past observations of y_t and x_t, denoted by Y_{t−1} and X_{t−1},

f(y_t | x_t, Y_{t−1}, X_{t−1}, λ),

where λ is a vector of unknown parameters and u_t is the prediction error. Let λ̄ be the true value of the parameter vector λ and ℓ_T(λ) the (log-)likelihood function for a sample y_1, . . . , y_T:
ℓ_T(λ) = Σ_{t=1}^{T} l_t(λ | u_t) = Σ_{t=1}^{T} log f(y_t | x_t, Y_{t−1}, X_{t−1}, λ).
The ML estimator λ̃ is the solution to the system of equations emerging from the first-order conditions for the maximum of the likelihood function:

s(λ̃) = ∂ℓ_T(λ)/∂λ |_{λ=λ̃} = 0.
Consistency

If the model is globally identified, i.e. the following conditions are satisfied:

(i) plim ℓ_T(λ)/T = F(λ̄, λ) (existence of the plim of the (log-)likelihood function),
(ii) F(λ̄, λ̄) > F(λ̄, λ) for all λ ≠ λ̄ in the parameter space (unique global maximum of F(·) at λ̄),
(iii) the plim converges uniformly,

then plim λ̃ = λ̄.
Assume that the likelihood function has continuous 2nd derivatives. Then expanding around
λ̄ by using the Mean Value Theorem gives
∂ℓ_T/∂λ |_{λ̃} = 0 = ∂ℓ_T/∂λ |_{λ̄} + ∂²ℓ_T/∂λ∂λ′ |_{λ*} (λ̃ − λ̄),

where λ* lies between λ̃ and λ̄. Rearranging,

√T (λ̃ − λ̄) = −[ T^{−1} ∂²ℓ_T/∂λ∂λ′ |_{λ*} ]^{−1} T^{−1/2} ∂ℓ_T/∂λ |_{λ̄}.   (1)
A central limit theorem can be derived by using the following property of the score vector:
Under regularity conditions the score vector has the Martingale property
E[ ∂ℓ_t/∂λ | x_t, Y_{t−1}, X_{t−1} ] = ∂ℓ_{t−1}/∂λ.   (2)

This important theorem follows from the fact that

E[ ∂l_t/∂λ | x_t, Y_{t−1}, X_{t−1} ] = 0.

This can be seen by differentiating the identity

∫ f(y_t | x_t, Y_{t−1}, X_{t−1}, λ) dy_t = 1

with regard to λ:

∫ ∂f(y_t | x_t, Y_{t−1}, X_{t−1}, λ)/∂λ dy_t = ∫ ∂f_t/∂λ dy_t = 0,

where ∂f_t/∂λ = (∂l_t/∂λ) f_t since l_t = log f_t. Thus one has:

∫ (∂l_t/∂λ) f_t dy_t = E[ ∂l_t/∂λ | x_t, Y_{t−1}, X_{t−1} ] = 0.
Assume that

plim T^{−1} ∂²ℓ_T/∂λ∂λ′ |_{λ̄} = lim_{T→∞} E[ T^{−1} ∂²ℓ_T/∂λ∂λ′ ] = H.

A (martingale) central limit theorem applied to the score then gives T^{−1/2} ∂ℓ_T/∂λ |_{λ̄} →_d N(0, ℑ_a), where the asymptotic information matrix ℑ_a satisfies

ℑ_a = −H.
Finally, by applying the Cramer Linear Transformation Theorem to (1) we obtain the asymptotic distribution of the ML estimator
√T (λ̃ − λ̄) →_d N(0, H^{−1} ℑ_a H^{−1}),

such that the covariance matrix of the asymptotic distribution of the ML estimator is the limit of T times the inverse of the information matrix:

√T (λ̃ − λ̄) →_d N(0, ℑ_a^{−1}).   (MLE)
Hence, under standard conditions, the MLE λ̃ is consistent, asymptotically efficient and normal.
Hannan theorem
Note that in the case of ARMA models, one need not invoke a Martingale CLT in order to
establish the asymptotic distribution of the score vector and therefore of the ML estimator.
Instead the Hannan theorem can be applied:
Suppose that y_t is generated by a regular linear process. Let

c_ij(h) = T^{−1} Σ_{t=1}^{T} y_{it} y_{j,t+h}

be the sample second moments for the h-th lag of y_t, and let c_T be a vector of some c_ij(h_s). Then

√T (c_T − E[c_T]) →_d N(0, Σ_c)   (4)

where

Σ_c = lim_{T→∞} T E[ (c_T − E[c_T]) (c_T − E[c_T])′ ].
We consider likelihood ratio (LR), Lagrange multiplier (LM ) and Wald (W ) tests of the
hypothesis
H0 : φ(λ) = 0 vs. H1 : φ(λ) 6= 0,
where φ : R^n → R^r is a continuously differentiable function with

r = rank( ∂φ(λ)/∂λ′ ) ≤ dim λ.

Linear restrictions on a subvector λ_2 of the parameters take, for example, the form φ(λ) = Φ λ_2 − ϕ = 0.
Under the null, LR has an asymptotic χ2 -distribution with r degrees of freedom. Under
standard conditions, the LR test statistic has the same asymptotic distribution as the Lagrange
multiplier statistic LM and the Wald statistic W .
While the scores of an unrestricted model have sample mean zero by construction, the scores of the restricted model can be used to implement a Lagrange multiplier (LM) test.
• The LM test is especially suitable for model checking. As long as the null hypothesis is
not altered, testing different model specifications against a maintained model does not
require a new estimation.
• For all unrestricted parameters, say λ_1, the score is zero:

  s_1(λ̂) = ∂ℓ(λ)/∂λ_1 |_{λ=λ̂} = 0.

  By partitioning the parameter vector as λ = (λ_1′, λ_2′)′ such that s(λ̂)′ = [0′ : s_2(λ̂)′], the LM test statistic can be written as:

  LM = [0′ : s_2(λ̂)′] [ ℑ_11  ℑ_12 ; ℑ_21  ℑ_22 ]^{−1} [0′ : s_2(λ̂)′]′
     = s_2(λ̂)′ [ ℑ(λ̂)^{−1} ]_{22} s_2(λ̂)
     = s_2(λ̂)′ ( ℑ_22 − ℑ_21 ℑ_11^{−1} ℑ_12 )^{−1} s_2(λ̂).
• Interpretation: The scores of the last r elements reflect the increase of the likelihood function if the constraints are relaxed:

  s(λ̂) = ( ∂φ(λ)/∂λ′ )′ |_{λ̂} π̂,

  where π̂ denotes the vector of Lagrange multipliers of the restricted maximization.
Wald test
Thus, if H_0 : φ(λ) = 0 is true and the variance–covariance matrix is invertible, the Wald test is given by:

W = φ(λ̃)′ [ T^{−1} ( ∂φ(λ)/∂λ′ |_{λ̃} ) Σ_{λ̃} ( ∂φ(λ)′/∂λ |_{λ̃} ) ]^{−1} φ(λ̃) →_d χ²(r).   (W)
A.3 An example

y_t = β x_t + u_t
u_t = ρ u_{t−1} + ε_t,   u_0 ∼ N(0, σ²/(1 − ρ²)),

where ε_t ∼ NID(0, σ²) and uncorrelated with x_{t−j} for all t, j ≥ 0, and |ρ| < 1.

First note that the model can be written in the form (Koyck transformation):

y_t = β x_t + (1 − ρL)^{−1} ε_t
⇒ (1 − ρL) y_t = (1 − ρL) β x_t + ε_t,

such that OLS might be used iteratively to estimate β and ρ (Cochrane–Orcutt procedure, which corresponds to the Gauss–Newton algorithm).
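A sketch of the iterative Cochrane–Orcutt procedure on simulated data (Python, numpy only; variable names and the DGP are illustrative):

```python
# Sketch: alternate OLS for beta on quasi-differenced data and OLS of the
# residuals on their own lag for rho, until the estimate of rho settles.
import numpy as np

rng = np.random.default_rng(8)
T, beta, rho, sigma = 500, 2.0, 0.6, 1.0
x = rng.normal(size=T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = rho * u[t - 1] + sigma * rng.normal()
y = beta * x + u

rho_hat = 0.0
for it in range(20):
    ys, xs = y[1:] - rho_hat * y[:-1], x[1:] - rho_hat * x[:-1]   # quasi-differences
    beta_hat = np.sum(xs * ys) / np.sum(xs ** 2)                  # OLS, no intercept
    res = y - beta_hat * x
    rho_new = np.sum(res[1:] * res[:-1]) / np.sum(res[:-1] ** 2)  # OLS of u_t on u_{t-1}
    if abs(rho_new - rho_hat) < 1e-8:
        break
    rho_hat = rho_new

print(f"beta_hat = {beta_hat:.3f} (true {beta}), rho_hat = {rho_hat:.3f} (true {rho})")
```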
Prediction-error decomposition gives the approximate likelihood function of λ = (β, ρ, σ 2 )0
for the sample y1 , . . .,yT conditioned on y0 :
ℓ(λ) = −(T/2) log(2π) − (T/2) log σ² − (1/(2σ²)) Σ_{t=1}^{T} ε_t²(β, ρ),
where εt (β, ρ) = (yt − ρyt−1 ) − β(xt − ρxt−1 ) = (yt − βxt ) − ρ(yt−1 − βxt−1 ).
Under standard conditions, the MLE λ̃ is consistent, asymptotically efficient and normal:
√T (λ̃ − λ̄) →_d N(0, ℑ_a^{−1}).
The information matrix can be estimated by the outer product of the gradients (OPG):

ℑ_OP = T^{−1} Σ_{t=1}^{T} h_t(λ) h_t(λ)′   with   h_t(λ) = ∂ log f(y_t | X_t, Y_{t−1}; λ) / ∂λ.
• Score vector:

  s(λ) = ∂ℓ(λ)/∂λ, with elements

  ∂ℓ(λ)/∂β  = −(1/(2σ²)) Σ_{t=1}^{T} 2 ε_t (∂ε_t/∂β) = (1/σ²) Σ_{t=1}^{T} ε_t (x_t − ρ x_{t−1}),

  ∂ℓ(λ)/∂ρ  = −(1/(2σ²)) Σ_{t=1}^{T} 2 ε_t (∂ε_t/∂ρ) = (1/σ²) Σ_{t=1}^{T} ε_t (y_{t−1} − β x_{t−1}),

  ∂ℓ(λ)/∂σ² = −T/(2σ²) + (1/(2σ⁴)) Σ_{t=1}^{T} ε_t².
• Information matrix:

  ℑ = −E[H(λ)], where H(λ) = ∂²ℓ(λ)/∂λ∂λ′. Only the lower triangle is shown (the matrix is symmetric):

  ℑ_ββ   = (1/σ²) Σ_{t=1}^{T} ( E[x_t²] − 2ρ E[x_t x_{t−1}] + ρ² E[x_{t−1}²] )
  ℑ_ρβ   = (1/σ²) Σ_{t=1}^{T} { E[u_{t−1}(x_t − ρ x_{t−1})] − E[ε_t x_{t−1}] }
  ℑ_ρρ   = (1/σ²) Σ_{t=1}^{T} E[u_{t−1}²]
  ℑ_σ²β  = (1/σ⁴) Σ_{t=1}^{T} E[ε_t (x_t − ρ x_{t−1})]
  ℑ_σ²ρ  = (1/σ⁴) Σ_{t=1}^{T} E[ε_t u_{t−1}]
  ℑ_σ²σ² = −T/(2σ⁴) + (1/σ⁶) Σ_{t=1}^{T} E[ε_t²].

  Evaluating the expectations for a stationary regressor x_t with mean µ_x, variance σ_x² and first autocorrelation ρ_x(1), and using that ε_t is uncorrelated with the regressors and with u_{t−1}, the off-diagonal elements vanish:

  ℑ = diag( (T/σ²){ (1 + ρ²)(σ_x² + µ_x²) − 2ρ( σ_x² ρ_x(1) + µ_x² ) },  T/(1 − ρ²),  −T/(2σ⁴) + T/σ⁴ ).

• Asymptotic information matrix:

  ℑ_a = lim_{T→∞} T^{−1} ℑ = diag( m_xx/σ²,  1/(1 − ρ²),  1/(2σ⁴) )
  ⇒ ℑ_a^{−1} = diag( σ² m_xx^{−1},  1 − ρ²,  2σ⁴ ),

  where m_xx = lim_{T→∞} T^{−1} Σ_{t=1}^{T} E[(x_t − ρ x_{t−1})²].
Since only ρ is restricted under H_0 : ρ = 0, the score vector evaluated at the restricted estimate λ̂ = (β̂, ρ̂, σ̂_ε²)′ has the form

s(λ̂) = ( 0,  ∂ℓ(λ)/∂ρ |_{λ̂},  0 )′.
The Wald test is based on the unconstrained estimator λ̃ = (β̃, ρ̃, σ̃²)′ and, with the selection matrix Φ = (0, 1, 0), is given by:

W = λ̃′ Φ′ [ T^{−1} Φ Σ̃_{λ̃} Φ′ ]^{−1} Φ λ̃
  = ρ̃ [ T^{−1} σ̃_ρ̃² ]^{−1} ρ̃
  = ρ̃² / ( (1 − ρ̃²)/T ) →_d χ²(1).

Thus the Wald test statistic W has the form of a squared t-statistic:

W = ( ρ̃ / √( T^{−1} σ̃_ρ̃² ) )² = t_{ρ=0}².
Vector autoregressive processes are just the generalization of the univariate autoregressive
process discussed in section 2. The basic model considered in the following is a stable vector
autoregressive model possibly including an intercept term: the K-dimensional time series
vector yt = (y1t , . . . , yKt )0 is generated by a vector autoregressive process of order p, denoted
VAR(p) model,
y_t = ν + A_1 y_{t−1} + · · · + A_p y_{t−p} + ε_t   (5)

where t = 1, . . . , T , ν is a vector of intercepts and the A_i are (K × K) coefficient matrices.
The error process ε_t = (ε_1t, . . . , ε_Kt)′ is an unobservable zero-mean vector white noise process,

ε_t ∼ WN(0, Σ),

that is, E[ε_t] = 0, E[ε_t ε_t′] = Σ, and E[ε_t ε_s′] = 0 for s ≠ t, where the variance-covariance matrix Σ is time-invariant, positive-definite and non-singular.

The stronger assumptions of independence,

ε_t ∼ IID(0, Σ),

or Gaussianity,

ε_t ∼ NID(0, Σ),

imply that the innovations are the first-step prediction errors
εt = yt − E[yt |Yt−1 ].
Thus the expectation of yt conditional on the information set Yt−1 = (yt−1 , yt−2 , . . . , y1−p ) is
given by
E[y_t | Y_{t−1}] = ν + Σ_{j=1}^{p} A_j y_{t−j}.
The autocovariance function of the process is Γ(h) = E[(y_t − µ)(y_{t−h} − µ)′], where Γ(h) = Γ(−h)′, and E[ε_t y_{t−h}′] = 0 for h > 0 while E[ε_t y_t′] = Σ. The autocorrelation function R(h) = [ρ_ij(h)] can be computed as:

ρ_ij(h) = γ_ij(h) / √( γ_ii(0) γ_jj(0) ).
A structural form of the VAR is given by

B y_t = γ + Σ_{j=1}^{p} Γ_j y_{t−j} + η_t.
Since VAR models represent the correlations among a set of variables they are often used to
analyze the relationships between variables. In the following we will consider the partition of
yt into (zt , xt )0 :
" # " # " #" # " #
zt ν1 Xp
A11,i A12,i zt−i ε1t
= + +
xt ν2 i=1
A 21,i A 22,i x t−i ε2t
Granger causality
Granger (1969) introduced a concept of causality which is based on the idea that a cause can
not come after the effect. Thus, if a variable x affects a variable z, the former should help to predict the latter:

MSPE(z_{t+h} | Z_t) < MSPE(z_{t+h} | Z_t \ X_t)   for some h > 0,

where MSPE(z_{t+h} | Ω) is the mean square prediction error associated with the optimal predictor conditional on a given information set Ω, and Z_t \ X_t denotes the information set with the history of x removed.
Condition for stable VAR(p) processes: x_t does not Granger-cause z_t if and only if A_{12,i} = 0 for i = 1, . . . , p.
Instantaneous causality

Sometimes the term ‘instantaneous causality’ is used in economic analyses. There is said to be instantaneous causality between z_t and x_t if adding x_{t+1} to the information set improves the one-period forecast of z_{t+1} (and vice versa).
Condition for stable VAR(p) processes: there is no instantaneous causality between z_t and x_t if and only if

E[ε_{1t} ε_{2t}′] = Σ_{12} = Σ_{21}′ = 0.
Weak exogeneity

Instantaneous non-causality implies that the (sets of) equations of the VAR are unrelated to each other. Thus the probability density function of y_t conditional on its past Y_{t−1} factorizes as

f(y_t | Y_{t−1}; λ) = f(z_t | Y_{t−1}; λ_z) f(x_t | Y_{t−1}; λ_x),

where the parameter vectors λ_x and λ_z of the system can be varied freely. Consequently, all possible reductions of the system can be efficiently estimated by OLS, and model-selection procedures can be applied equation-by-equation without a loss in efficiency.
Strong exogeneity

If, in addition, z_t does not Granger-cause x_t, then x_t is said to be strongly exogenous for the parameters λ_z.
The maximum likelihood estimation of the Gaussian VAR model (5) for an observed multiple time series Y_T = (y_T, . . . , y_{1−p}), when the initial values Y_0 = (y_0, . . . , y_{1−p}) are treated as fixed, results in a seemingly unrelated regression (SUR) problem discussed in Professor Pagan's lecture.

In the case of the unrestricted VAR, OLS is asymptotically efficient. An asymptotically efficient estimation method under linear restrictions is (feasible) GLS. Many other estimation problems in the VAR have been solved by Sims (1980) (see also section 3 in Lütkepohl, 1991).
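A sketch of equation-by-equation OLS estimation of an unrestricted VAR(1) (Python, numpy only; the coefficient matrix and names are illustrative); the nonzero upper-right element of A_1 means that the second variable Granger-causes the first, while the zero lower-left element means the reverse does not hold:

```python
# Sketch: OLS per equation for a bivariate VAR(1), which is asymptotically
# efficient for the unrestricted VAR as noted above.
import numpy as np

rng = np.random.default_rng(9)
K, T = 2, 1000
A1 = np.array([[0.5, 0.2],
               [0.0, 0.4]])                 # A_12 != 0, A_21 = 0
nu = np.array([1.0, 0.5])
y = np.zeros((T, K))
for t in range(1, T):
    y[t] = nu + A1 @ y[t - 1] + rng.normal(size=K)

Z = np.column_stack([np.ones(T - 1), y[:-1]])      # regressors: (1, y_{t-1}')
B_hat, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)  # one OLS per equation (columns of y)
nu_hat, A1_hat = B_hat[0], B_hat[1:].T

print("nu_hat :", np.round(nu_hat, 3))
print("A1_hat :\n", np.round(A1_hat, 3))
```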
References
Granger, C. W. J. (1969). Investigating causal relations by econometric models and cross-
spectral methods. Econometrica, 37, 424–438.
Lütkepohl, H. (1991). Introduction to Multiple Time Series Analysis. Berlin: Springer.
Sims, C. A. (1980). Macroeconomics and reality. Econometrica, 48, 1–48. Reprinted in
Granger, C. W. J. (ed.) (1990), Modelling Economic Series. Oxford: Clarendon Press.