
Hans–Martin Krolzig Hilary Term 2002

Macroeconometrics I

Introduction to Time–Series Analysis


MPhil in Economics

Books

◦ Gourieroux, C. and A. Monfort (1997), Time Series and Dynamic Models. Cambridge:
Cambridge University Press. Chapters 1, 5, 6, 9.
• Hamilton, J.D. (1994). Time Series Analysis. Princeton: Princeton University Press.
Chapters 1 - 5, 7, 8, 11.
◦ Harvey, A.C. (1981a). The Econometric Analysis of Time Series. 2nd edition. Hemel
Hempstead: Philip Allan.
◦ Harvey, A.C. (1981b). Time Series Models. London: Philip Allan.
• Hendry, D.F. (1995). Dynamic Econometrics. Oxford: Oxford University Press.
Chapter 2.
• Johnston, J. and J. DiNardo (1997). Econometric Methods. 4th edition, New York:
McGraw-Hill. Chapter 7.
◦ Lütkepohl, H. (1991). Introduction to Multiple Time Series Analysis. New York:
Springer. Chapters 2, 3, 5.
◦ Spanos, A. (1986). Statistical Foundations of Econometric Modelling. Cambridge:
Cambridge University Press.

Advanced

◦ Box, G.E.P. and Jenkins, G.M. (1976). Time Series Analysis, Forecasting and Control.
San Francisco: Holden-Day.
◦ Davidson, J.E.H. (1994). Stochastic Limit Theory. Oxford: Oxford University Press.
◦ Hall, P. and Heyde, C.C. (1980). Martingale Limit Theory and its Applications. London:
Academic Press.
◦ White, H. (1984). Asymptotic Theory for Econometricians. London: Academic Press.


Course Structure

Concepts of statistical time series (Week 1): Stochastic processes, stationarity, ergodicity,
integration, autocorrelation, white noise, innovations, martingales etc.

• Hamilton (1994): Chapter 3 (Sections 3.1 - 3.3).


• Hendry (1995): Chapter 2 (Sections 2.1 - 2.3, 2.5 - 2.11, 2.23, 2.25 - 2.27)
◦ Gourieroux & Monfort (1997): Chapter 5 (Sections 5.1 - 5.2).
◦ Hamilton (1994): Section 7.2.
◦ Hendry (1995): Appendix A4 (Sections 1 - 5, 8, 11).
◦ Spanos (1986): Chapters 8 - 9.
◦ Harvey (1981a): Chapter 1.
◦ Harvey (1981b): Chapter 1, Chapter 2 (Sections 2.1 - 2.4).

Types of linear time series models (Week 2): AutoRegressive and Moving Average processes

• Hamilton (1994): Chapter 3 (Sections 3.3 - 3.7).


◦ Gourieroux & Monfort (1997): Sections 5.3 - 5.4, 6.3 - 6.4.
◦ Hamilton (1994): Chapters 1 - 2.

Estimation and statistical inference (Week 3): Impact of autocorrelation on regression results, Maximum Likelihood estimation of Gaussian ARMA models, statistical testing, model specification procedures

• Hamilton (1994): Chapters 5, 8, Section 4.8.


• Hendry (1995): Section 4.5, Appendix A4 (Section 12).
• Spanos (1986): Sections 23.3 - 23.6.
◦ Gourieroux & Monfort (1997): Sections 6.2, 9.2.
◦ Spanos (1986): Chapter 23 (Sections 23.1 - 23.2).
◦ Harvey (1981a): Chapters 3 - 6.
◦ Harvey (1981b): Chapter 5.

Multiple time series models (Week 4): Vector AutoRegressive processes

• Hamilton (1994): Sections 11.1 - 11.4.


◦ Lütkepohl (1991): Sections 2.1 - 2.3, 3.1 - 3.4, 3.6, 5.2.

Introduction

Time Series

The sequence of observations (yt, t ∈ T) of a variable y at different dates t is called a time series. Usually the observation dates are equally spaced, so that t = 1, . . . , T.

An example: UK inflation and the short-term interest rate

[Figure: UK inflation and interest rates, 1975-2000 — annual rate of inflation (π t) and short-term interest rate (i t); plot omitted.]

Problems in time series

(1) Forecasting;
(2) Relation between time series: Causality and time lags;
(3) Distinction between short and long run;
(4) Study of agents’ expectations;
(5) Trend removal;
(6) Seasonal adjustment;
(7) Detection of structural breaks;
(8) Control of the process.

Time series analysis

• Goal: Formulation of a statistical model which is a congruent representation of the (unknown) stochastic process that generated the (observed) time series.
• Approach: Modelling the time-invariant conditional density of {yt},

    f(yt | Yt−1),

  conditioning on the history of the process, Yt−1 = (yt−1, yt−2, . . . , y0).
• Result: Prediction-error decomposition

    yt = E[yt | Yt−1] + εt,

  where (i) E[yt | Yt−1] is the component of yt that can be predicted once the history Yt−1 of the process is known, and (ii) εt denotes the unpredictable news.

Time series models: Examples

• Univariate autoregressive process of order one, AR(1):

    yt = a yt−1 + εt,    εt ∼ WN(0, σ²).

• Multiple time series: vector autoregressive process, VAR(1):

    yt = A yt−1 + εt,    εt ∼ WN(0, Σ),

  for example in the bivariate case

    [ y1t ]   [ a11  a12 ] [ y1,t−1 ]   [ ε1t ]
    [ y2t ] = [ a21  a22 ] [ y2,t−1 ] + [ ε2t ],

    [ ε1t ]        ( [ 0 ]   [ σ1²  σ12 ] )
    [ ε2t ]  ∼ WN ( [ 0 ] ,  [ σ21  σ2² ] ).

• Autoregressive distributed lag model (ADL):

    y1t = a11 y1,t−1 + a12 y2,t−1 + ε1t,    ε1t ∼ WN(0, σ1²).
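As a concrete illustration of these model classes, here is a minimal simulation sketch in Python/NumPy; the coefficient values are illustrative assumptions, not taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200

# Univariate AR(1): y_t = a*y_{t-1} + eps_t, eps_t ~ WN(0, sigma^2)
a, sigma = 0.8, 1.0                        # illustrative parameter values
y = np.zeros(T)
eps = sigma * rng.standard_normal(T)
for t in range(1, T):
    y[t] = a * y[t - 1] + eps[t]

# Bivariate VAR(1): y_t = A*y_{t-1} + eps_t, eps_t ~ WN(0, Sigma)
A = np.array([[0.5, 0.1],
              [0.2, 0.4]])                 # illustrative coefficient matrix
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])             # illustrative error covariance
Y = np.zeros((T, 2))
E = rng.multivariate_normal(np.zeros(2), Sigma, size=T)
for t in range(1, T):
    Y[t] = A @ Y[t - 1] + E[t]
```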



1 Concepts of Statistical Time Series

1.1 Random variables

Let (Ω, M, Pr) be a probability space, where Ω is the sample space (set of all elementary
events), M is a sigma-algebra of events or subsets of Ω, and Pr is a probability measure
defined on M.

Definition 1. A random variable is a real valued function y : Ω → R such that for each real
number c, Ac = {ω ∈ Ω|y(ω) ≤ c} ∈ M.
In other words, Ac is an event for which the probability is defined in terms of Pr . The function
F : R → [0, 1] defined by F (c) = Pr(Ac ) is the distribution function of y.

1.2 Stochastic processes

Suppose T is some index set with at most countably many elements like, for instance, the set
of all integers or all positive integers.

Definition 2. A (discrete) stochastic process is a real valued function

y :T ×Ω→R

such that for each fixed t ∈ T , yt (ω) is a random variable.


In other words, a stochastic process is an ordered sequence of random variables {yt(ω), ω ∈ Ω, t ∈ T}, such that for each t ∈ T, yt(ω) is a random variable on the sample space Ω, and for each ω ∈ Ω, yt(ω) is a realization of the stochastic process on the index set T (that is, an ordered set of values, each corresponding to a value of the index set).

            ω0          ···   ωj          ···   ωm
    t0      yt0(ω0)           yt0(ωj)           yt0(ωm)
    ⋮
    ti      yti(ω0)     ···   yti(ωj)     ···   yti(ωm)
    ⋮
    tn      ytn(ω0)           ytn(ωj)           ytn(ωm)

Definition 3. A time series {yt }Tt=1 is (the finite part of) a particular realization {yt }t∈T of a
stochastic process.

A realization of a stochastic process is a function T → R, t ↦ yt(ω). The underlying stochastic process is said to have generated the time series. The time series y1(ω), . . . , yT(ω) is usually denoted by y1, . . . , yT or simply by yt.
A stochastic process may be described by the joint distribution functions of all finite subcollections of yt's, t ∈ S ⊂ T. In practice the complete system of distributions will often be unknown. Therefore we will often be concerned with the first and second moments.
The joint distribution of (yt, yt−1, . . . , yt−h) is usually characterized by the autocovariance function:

    γt(h) = Cov(yt, yt−h)
          = E[(yt − µt)(yt−h − µt−h)]
          = ∫···∫ (yt − µt)(yt−h − µt−h) f(yt, . . . , yt−h) dyt ··· dyt−h,

where µt = E[yt] = ∫ yt f(yt) dyt is the unconditional mean of yt.
The autocorrelation function is given by:

    ρt(h) = γt(h) / √( γt(0) γt−h(0) ).

1.3 Stationarity

Definition 4. The process {yt} is said to be weakly stationary or covariance stationary if the first and second moments of the process exist and are time-invariant:

    E[yt] = µ < ∞                       for all t ∈ T,
    E[(yt − µ)(yt−h − µ)] = γ(h) < ∞    for all t and h.

Stationarity implies γt(h) = γt(−h) = γ(h).

Definition 5. The process {yt} is said to be strictly stationary if for any values of h1, h2, . . . , hn the joint distribution of (yt, yt+h1, . . . , yt+hn) depends only on the intervals h1, h2, . . . , hn but not on the date t itself:

    f(yt, yt+h1, . . . , yt+hn) = f(yτ, yτ+h1, . . . , yτ+hn)    for all t and τ.

Strict stationarity implies that all existing moments are time-invariant.

Definition 6. The process {yt} is said to be Gaussian if the joint density f(yt, yt+h1, . . . , yt+hn) is Gaussian for any h1, h2, . . . , hn.

1.4 Ergodicity

The statistical ergodicity theorem concerns what information can be derived from an average over time about the common average at each point in time. Note that the WLLN does not apply, as the observed time series represents just one realization of the stochastic process.

Definition 7. Let {yt(ω), ω ∈ Ω, t ∈ T} be a weakly stationary process, such that E[yt(ω)] = µ < ∞ and E[(yt(ω) − µ)²] = σy² < ∞ for all t. Let ȳT = T⁻¹ Σ_{t=1}^{T} yt be the time average. If ȳT converges in probability to µ as T → ∞, {yt} is said to be ergodic for the mean.

Note that ergodicity focuses on asymptotic independence, while stationarity focuses on the time-invariance of the process. For the type of stochastic processes considered in this lecture one will imply the other. However, as illustrated by the following example, they can differ:

Example 1. Consider the stochastic process {yt} defined by

    yt = u0      for t = 0, with u0 ∼ N(0, σ²),
    yt = yt−1    for t > 0.

Then {yt} is strictly stationary but not ergodic.


Proof. Obviously we have that yt = u0 for all t ≥ 0.
Stationarity follows from:

    E[yt] = E[u0] = 0,
    E[yt²] = E[u0²] = σ²,
    E[yt yt−h] = E[u0²] = σ².

Thus µ = 0, γ(h) = σ² and ρ(h) = 1 are time-invariant.
Ergodicity for the mean requires:

    ȳT = T⁻¹ Σ_{t=0}^{T−1} yt  →p  0.

But one can easily see that ȳT = T⁻¹ Σ_{t=0}^{T−1} yt = u0, where u0 is the realization of a normally distributed random variable.

To be ergodic, the memory of a stochastic process should fade in the sense that the covariance between increasingly distant observations converges to zero sufficiently rapidly. For stationary processes it can be shown that absolutely summable autocovariances, Σ_{h=0}^{∞} |γ(h)| < ∞, are sufficient to ensure ergodicity.
Similarly, we can define ergodicity for the second moments:

    γ̂(h) = (T − h)⁻¹ Σ_{t=h+1}^{T} (yt − µ)(yt−h − µ)  →p  γ(h).
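A small simulation sketch (Python/NumPy, illustrative values) makes the failure of ergodicity in Example 1 visible: the time average of the process yt = u0 stays at the realized value of u0 in each realization, whereas the time average of an IID comparison process settles at µ = 0.

```python
import numpy as np

rng = np.random.default_rng(1)
T, R = 1000, 5                      # sample length and number of realizations

for r in range(R):
    u0 = rng.standard_normal()      # one draw of u0 ~ N(0, 1)
    y_const = np.full(T, u0)        # y_t = u0 for all t: stationary, not ergodic
    y_iid = rng.standard_normal(T)  # an IID (hence ergodic) comparison process
    print(f"realization {r}: mean of y_t = u0 -> {y_const.mean():+.3f}, "
          f"mean of IID process -> {y_iid.mean():+.3f}")
```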

1.5 Characterizing economic time series

Ergodicity allows the characterization of stochastic processes by their sample moments:

Mean: sample mean

    ȳ = T⁻¹ Σ_{t=1}^{T} yt.

Note that for yt ∼ IID(µ, σ²)

    √T (ȳ − µ)  →d  N(0, σ²).

ACF: sample autocorrelations ρ̂h = γ̂h / γ̂0, where

    γ̂h = T⁻¹ Σ_{t=h+1}^{T} (yt − ȳ)(yt−h − ȳ).

Note that for yt ∼ IID(µ, σ²)

    √T ρ̂h  →d  N(0, 1).

PACF: partial autocorrelations αh^(h) = Corr(yt, yt−h | yt−1, . . . , yt−h+1): the last coefficient in a linear regression of yt on a constant and its last h values,

    [ α1^(h) ]     [ γ0    ···  γh−1 ]⁻¹ [ γ1 ]
    [   ⋮    ]  =  [  ⋮     ⋱    ⋮   ]   [ ⋮  ]
    [ αh^(h) ]     [ γh−1  ···  γ0   ]   [ γh ].

Note that for an AR(p) process αh^(h) = 0 for h > p, such that

    √T α̂h^(h)  →d  N(0, 1).
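The sample moments above translate directly into code; the following sketch (Python/NumPy, written from the formulas in this subsection rather than any particular library routine) computes γ̂h, ρ̂h and the Yule-Walker-based PACF:

```python
import numpy as np

def sample_acf(y, max_lag):
    """Sample autocovariances (T^-1 normalization) and autocorrelations."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    ybar = y.mean()
    gamma = np.array([np.sum((y[h:] - ybar) * (y[:T - h] - ybar)) / T
                      for h in range(max_lag + 1)])
    return gamma, gamma / gamma[0]

def sample_pacf(y, max_lag):
    """Partial autocorrelations: last coefficient of the order-h Yule-Walker system."""
    gamma, _ = sample_acf(y, max_lag)
    alpha = np.zeros(max_lag + 1)
    for h in range(1, max_lag + 1):
        G = np.array([[gamma[abs(i - j)] for j in range(h)] for i in range(h)])
        a = np.linalg.solve(G, gamma[1:h + 1])
        alpha[h] = a[-1]                 # PACF at lag h
    return alpha[1:]

# Check on a simulated AR(1) with coefficient 0.7 (illustrative)
rng = np.random.default_rng(2)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.7 * y[t - 1] + rng.standard_normal()
print(sample_acf(y, 5)[1])   # should decay roughly like 0.7**h
print(sample_pacf(y, 5))     # should be near zero after lag 1
```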

Figure 1 UK inflation and interest rate: first differences. [Panels: ∆i t and ∆π t, 1975-2000; plots omitted.]

Figure 2 UK inflation and interest rate: ACF and PACF. [Panels: ACF and PACF of i, π, ∆i and ∆π up to lag 10; plots omitted.]



1.6 Integrated processes

An important class of nonstationary processes is that of integrated processes. An integrated process is one that can be made stationary by differencing.

Example 2. The stochastic processes {yt } is said to be a random walk if

yt = yt−1 + ut for t > 0 and y0 = 0.

where ut is independent, identically distributed with zero mean and variance σ 2 < ∞ for all
t > 0. The random walk is non-stationary and, hence, non-ergodic.

Proof. By solving backward in time by repeated substitution we have:

    yt = y0 + Σ_{s=1}^{t} us    for all t > 0.

The mean is time-invariant:

    µ = E[yt] = E[ y0 + Σ_{s=1}^{t} us ] = y0 + Σ_{s=1}^{t} E[us] = 0.

But the second moments are diverging. The variance is given by:

    γt(0) = E[yt²] = E[ (y0 + Σ_{s=1}^{t} us)² ] = E[ (Σ_{s=1}^{t} us)² ]
          = E[ Σ_{s=1}^{t} Σ_{k=1}^{t} us uk ]
          = Σ_{s=1}^{t} E[us²] + Σ_{s=1}^{t} Σ_{k≠s} E[us uk]
          = Σ_{s=1}^{t} σ² = t σ².

The autocovariances are:

    γt(h) = E[yt yt−h] = E[ (y0 + Σ_{s=1}^{t} us)(y0 + Σ_{k=1}^{t−h} uk) ]
          = E[ Σ_{s=1}^{t} Σ_{k=1}^{t−h} us uk ] = Σ_{k=1}^{t−h} E[uk²]
          = (t − h) σ²    for all h > 0.

Finally, the autocorrelation function ρt(h) for h > 0 is given by:

    ρt²(h) = γt(h)² / ( γt(0) γt−h(0) ) = [(t − h)σ²]² / ( [tσ²][(t − h)σ²] ) = 1 − h/t    for all h > 0.

If a stochastic process has to be differenced d times to reach stationarity, it is said to be integrated of order d, or I(d). A random walk is integrated of order 1, or I(1); stationary processes are I(0).
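A short simulation sketch (Python/NumPy, illustrative values) confirms that the cross-sectional variance of a random walk grows linearly with t, as derived above.

```python
import numpy as np

rng = np.random.default_rng(3)
T, R = 200, 5000                       # horizon and number of Monte Carlo replications
sigma = 1.0

# R independent random walks y_t = y_{t-1} + u_t, y_0 = 0
u = sigma * rng.standard_normal((R, T))
y = np.cumsum(u, axis=1)

# Cross-sectional variance at dates t = 50, 100, 200 should be close to t*sigma^2
for t in (50, 100, 200):
    print(t, y[:, t - 1].var())
```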

1.7 Some basic processes

Definition 8. A white-noise process is a weakly stationary process which has a zero mean
and is uncorrelated over time:
ut ∼ WN(0, σ 2 ).

Thus {ut} is a WN process if for all t ∈ T: E[ut] = 0, E[ut²] = σ² < ∞ and E[ut ut−h] = 0 for h ≠ 0 and t − h ∈ T. If the assumption of a constant variance is relaxed to E[ut²] < ∞, {ut} is sometimes called a weak WN process.

Definition 9. If the white-noise process {ut } is normally distributed it is called a Gaussian


white-noise process:
ut ∼ NID(0, σ 2 ).

Note that the assumption of normality implies strict stationarity and serial independence (unpredictability). A generalization of the NID process is the IID process with constant, but unspecified, higher moments:

Definition 10. A process {ut } with independent, identically distributed variates is denoted
IID:
ut ∼ IID(0, σ 2 ).

In contrast to a white noise process, an IID process is unpredictable.


Two other concepts are also useful:

Definition 11. The stochastic process {xt} is said to be a martingale with respect to an information set, Jt−1, of data realized by time t − 1 if E[|xt|] < ∞ and the conditional expectation satisfies E[xt | Jt−1] = xt−1. The process {ut = xt − xt−1} with E[|ut|] < ∞ and E[ut | Jt−1] = 0 for all t is called a martingale difference sequence, MDS.

The assumption of a MDS is often made in limit theorems. For empirical modelling, the
following is crucial:

Definition 12. An innovation {ut} against an information set Jt−1 is a process whose density f(ut | Jt−1) does not depend on Jt−1; {ut} is a mean innovation against an information set Jt−1 if E[ut | Jt−1] = 0.

Thus an innovation {ut } must be WN(0, σ 2 ) if Jt−1 contains the history Ut−1 of ut , but not
conversely. Consequently, an innovation must be a MDS.

2 Types of linear time series models

2.1 Linear processes

Lag operator. Suppose {yt} is a stochastic process. Then we define

    L yt = yt−1,
    L^j yt = yt−j    for all j ∈ N.

Differencing operator. If {yt} is a stochastic process, then the following processes also exist:

    ∆yt = (1 − L) yt = yt − yt−1,
    ∆^j yt = (1 − L)^j yt    for all j ∈ N+,
    ∆s yt = (1 − L^s) yt     (“seasonal differencing”).

Linear filter. Transformation of an input series {xt} to an output series {yt} by applying the lag polynomial A(L):

    yt = A(L) xt = ( Σ_{j=−n}^{m} aj L^j ) xt = Σ_{j=−n}^{m} aj xt−j = a−n xt+n + ··· + a0 xt + ··· + am xt−m.

Linear process. {yt} has the representation

    yt = A(L) εt = ( Σ_{j=−∞}^{∞} aj L^j ) εt = Σ_{j=−∞}^{∞} aj εt−j,    where εt ∼ WN(0, σ²).
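A minimal sketch of a one-sided linear filter yt = A(L)xt (Python/NumPy; the function name apply_lag_polynomial and the coefficient values are illustrative assumptions):

```python
import numpy as np

def apply_lag_polynomial(a, x):
    """y_t = a_0*x_t + a_1*x_{t-1} + ... + a_m*x_{t-m} (one-sided filter);
    the first m values of y are undefined and returned as NaN."""
    a = np.asarray(a, dtype=float)
    x = np.asarray(x, dtype=float)
    m = len(a) - 1
    y = np.full(len(x), np.nan)
    for t in range(m, len(x)):
        y[t] = a @ x[t - m:t + 1][::-1]   # reverse so a[j] multiplies x_{t-j}
    return y

rng = np.random.default_rng(4)
eps = rng.standard_normal(300)
# A truncated linear process: y_t = eps_t + 0.5*eps_{t-1} + 0.25*eps_{t-2}
y = apply_lag_polynomial([1.0, 0.5, 0.25], eps)
```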


2.2 Wold’s decomposition

Proposition 1 (Wold decomposition). Any zero-mean covariance-stationary process {yt} can be represented in the form

    yt = Σ_{j=0}^{∞} ψj εt−j + κt,

where ψ0 = 1 and Σ_{j=0}^{∞} ψj² < ∞. The term εt is white noise and represents the error made in forecasting yt on the basis of a linear function of its past Yt−1 = {yt−j}_{j=1}^{∞}:

    εt ≡ yt − Ê(yt | Yt−1).

The value of κt is uncorrelated with εt−j for any j, though κt can be predicted arbitrarily well from a linear function of Yt−1:

    κt = Ê(κt | Yt−1).

Box-Jenkins approach to modelling time series: approximate the infinite lag polynomial by the ratio of two finite-order polynomials α(L) and β(L):

    Ψ(L) = Σ_{j=0}^{∞} ψj L^j ≅ β(L)/α(L) = (1 + β1 L + . . . + βq L^q) / (1 − α1 L − . . . − αp L^p).

Types of linear time series models

    p > 0, q = 0:  α(L) yt = εt         (pure) autoregressive process of order p          AR(p)
    p = 0, q > 0:  yt = β(L) εt         (pure) moving-average process of order q          MA(q)
    p > 0, q > 0:  α(L) yt = β(L) εt    (mixed) autoregressive moving-average process     ARMA(p, q)

2.3 Autoregressive processes

A p-th order autoregression, denoted AR(p) process, satisfies the difference equation:

    yt = ν + Σ_{j=1}^{p} αj yt−j + εt,    where εt ∼ WN(0, σ²).

Using the lag operator L:

    α(L) yt = ν + εt,    where α(L) = 1 − α1 L − . . . − αp L^p, αp ≠ 0.

Stability, α(z) = 0 ⇒ |z| > 1, guarantees stationarity and an MA(∞) representation:

    yt = α(1)⁻¹ ν + α(L)⁻¹ εt
    yt = µ + Σ_{j=0}^{∞} ψj εt−j,    where µ = ν/α(1) and Ψ(L) = α(L)⁻¹ with Σ_{j=0}^{∞} |ψj| < ∞.

Stability analysis is based on the p-th order linear inhomogeneous difference equation:

    yt − µ = Σ_{j=1}^{p} αj (yt−j − µ).

    AR(1):  |α1| < 1 : stability          AR(2):  −1 < α2 < 1 − |α1| : stability
            α1 = 1   : unit root                   α1² + 4α2 < 0       : complex roots
            |α1| > 1 : instability

Autocovariance function

    γ(h) = E[yt yt−h] = E[(α1 yt−1 + ··· + αp yt−p + εt) yt−h]
         = α1 E[yt−1 yt−h] + ··· + αp E[yt−p yt−h] + E[εt yt−h]
         = α1 γ(h − 1) + ··· + αp γ(h − p)    for h > 0.

Yule-Walker equations

    ρ1 = α1 + α2 ρ1 + ··· + αp ρp−1
    ρ2 = α1 ρ1 + α2 + ··· + αp ρp−2
    ⋮
    ρp = α1 ρp−1 + α2 ρp−2 + ··· + αp

which can be solved for ρ1, . . . , ρp; for k > p,

    ρk = α1 ρk−1 + α2 ρk−2 + ··· + αp ρk−p.
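The Yule-Walker equations can be solved numerically; the sketch below (Python/NumPy, illustrative AR(2) coefficients) computes ρ1, . . . , ρp from the first p equations and then extends the ACF recursively for k > p.

```python
import numpy as np

def ar_acf(alpha, max_lag):
    """Autocorrelations rho_1..rho_max_lag of a stable AR(p) from the
    Yule-Walker equations rho_k = alpha_1*rho_{k-1} + ... + alpha_p*rho_{k-p}."""
    alpha = np.asarray(alpha, dtype=float)
    p = len(alpha)
    # Solve the first p equations for (rho_1, ..., rho_p), using rho_0 = 1
    M = np.eye(p)
    b = alpha.copy()
    for k in range(1, p + 1):
        for j in range(1, p + 1):
            lag = abs(k - j)
            if lag > 0:
                M[k - 1, lag - 1] -= alpha[j - 1]
    rho = [1.0] + list(np.linalg.solve(M, b))
    # Extend recursively for k > p
    for k in range(p + 1, max_lag + 1):
        rho.append(sum(alpha[j - 1] * rho[k - j] for j in range(1, p + 1)))
    return np.array(rho[1:max_lag + 1])

print(ar_acf([0.6, 0.3], 10))   # illustrative AR(2) with alpha1 = 0.6, alpha2 = 0.3
```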

Figure 3 Stationary AR(1) processes. [Panels: simulated NID(0,1) series and AR(1) processes with a1 = 0.50 and a1 = −0.50, each with its ACF and PACF; plots omitted.]

Figure 4 Persistent and non-stationary AR(1) processes. [Panels: AR(1) processes with a1 = 0.90, a1 = 1.00 and a1 = 1.02, each with its ACF and PACF; plots omitted.]



Figure 5 AR(2) processes with real roots. [Panels: (a1, a2) = (0.60, 0.30), (0.30, 0.60) and (−0.30, 0.60), each with its ACF and PACF; plots omitted.]

Figure 6 AR(2) processes with complex roots. [Panels: (a1, a2) = (0.60, −0.80), (0.60, −0.30) and (1.60, −0.80), each with its ACF and PACF; plots omitted.]



2.4 Moving-average processes

A q-th order moving-average process, denoted MA(q), is characterized by

    yt = µ + β(L) εt = µ + εt + Σ_{i=1}^{q} βi εt−i,

where εt ∼ WN(0, σ²) and β(L) = 1 + β1 L + . . . + βq L^q, βq ≠ 0.

Invertibility, β(z) = 0 ⇒ |z| > 1, guarantees uniqueness and an AR(∞) representation:

    β(L)⁻¹ yt = β(1)⁻¹ µ + εt
    yt = µ + Σ_{j=1}^{∞} φj (yt−j − µ) + εt,

where φ(L) = 1 − Σ_{j=1}^{∞} φj L^j = 1 − φ1 L − φ2 L² − · · · = β(L)⁻¹.

Autocovariance function: consider zt = yt − µ = Σ_{i=0}^{q} βi εt−i with β0 = 1. Then

    γ0 = ( Σ_{i=0}^{q} βi² ) σ²,
    γk = ( Σ_{i=0}^{q−k} βi βi+k ) σ²    for k = 1, 2, . . . , q,
    γk = 0    for k > q.
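The MA(q) autocovariance formulas translate directly into code; a minimal sketch (Python/NumPy, illustrative MA(1) coefficient):

```python
import numpy as np

def ma_autocovariances(beta, sigma2, max_lag):
    """gamma_k of an MA(q): gamma_k = sigma2 * sum_i beta_i * beta_{i+k}, beta_0 = 1."""
    b = np.r_[1.0, np.asarray(beta, dtype=float)]     # (beta_0, ..., beta_q)
    q = len(b) - 1
    gamma = np.zeros(max_lag + 1)
    for k in range(min(q, max_lag) + 1):
        gamma[k] = sigma2 * np.sum(b[:q - k + 1] * b[k:])
    return gamma                                      # gamma_k = 0 for k > q

gamma = ma_autocovariances([0.8], sigma2=1.0, max_lag=5)   # MA(1) with beta_1 = 0.8
print(gamma, gamma / gamma[0])                             # rho_1 = 0.8/1.64 ~ 0.49
```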

Figure 7 MA(1) processes. [Panels: b1 = 0.80, b1 = −0.80 and b1 = −1.00, each with its ACF and PACF; plots omitted.]

Figure 8 MA(q) processes. [Panels: MA(2) with b = (−0.5, 0.5), MA(3) with b = (0.3, −0.5, 0.5) and MA(4) with b = (−0.6, 0.3, −0.5, 0.5), each with its ACF and PACF; plots omitted.]



2.5 Mixed autoregressive moving-average processes

An ARMA(p, q) process includes both autoregressive and moving-average terms:

    α(L) yt = ν + β(L) εt
    yt = ν + Σ_{j=1}^{p} αj yt−j + εt + Σ_{i=1}^{q} βi εt−i,

where εt ∼ WN(0, σ²), α(L) = 1 − α1 L − . . . − αp L^p with αp ≠ 0, and β(L) = 1 + β1 L + . . . + βq L^q with βq ≠ 0.

(i) Stability, α(z) = 0 ⇒ |z| > 1, guarantees stationarity and an MA(∞) representation:

    yt = α(1)⁻¹ ν + α(L)⁻¹ β(L) εt
    yt = µ + Σ_{j=0}^{∞} ψj εt−j.

(ii) Invertibility, β(z) = 0 ⇒ |z| > 1, allows an AR(∞) representation:

    β(L)⁻¹ α(L)(yt − µ) = εt
    yt = µ + Σ_{j=1}^{∞} φj (yt−j − µ) + εt.

(iii) No common roots in α(L) and β(L):

    α(L) = Π_{j=1}^{p} (1 − λj L),  β(L) = Π_{i=1}^{q} (1 − µi L)  =⇒  λj ≠ µi for all i, j.

Autocovariance function: a useful reparametrization is

    α(L) yt = ut,    where ut = β(L) εt.

Autocovariances:

    γ(h) = E[yt yt−h] = E[(α1 yt−1 + ··· + αp yt−p + ut) yt−h]
         = α1 E[yt−1 yt−h] + ··· + αp E[yt−p yt−h] + E[ut yt−h]
         = α1 γ(h − 1) + ··· + αp γ(h − p) + E[(εt + β1 εt−1 + ··· + βq εt−q) yt−h].

Variance:

    γ(0) = E[yt²] = E[(α1 yt−1 + ··· + αp yt−p)²] + 2 E[(α1 yt−1 + ··· + αp yt−p) ut] + E[ut²].

Example: ARMA(1, 1)

    γ(0) = E[yt²] = E[(α1 yt−1)²] + 2 E[(α1 yt−1)(εt + β1 εt−1)] + E[(εt + β1 εt−1)²]
         = α1² E[yt−1²] + 2 α1 β1 E[yt−1 εt−1] + E[εt² + β1² εt−1²]
         = α1² γ(0) + (1 + 2 α1 β1 + β1²) σε²    (using E[yt−1 εt−1] = σε²),

and hence γ(0) = (1 − α1²)⁻¹ (1 + 2 α1 β1 + β1²) σε².

Figure 9 ARMA(1, 1) processes. [Panels: (a1, b1) = (0.5, 0.3), (0.5, 0.9) and (0.5, −0.9), each with its ACF and PACF; plots omitted.]

Method of undetermined coefficients

Example: coefficients of the MA(∞) representation of an ARMA(p, q):

    Ψ(L) = α(L)⁻¹ β(L)  ⇔  α(L) Ψ(L) = β(L).

Approach: write Ψ(L) := Σ_{i=0}^{∞} ψi L^i, so that

    (1 − α1 L − . . . − αp L^p)(ψ0 + ψ1 L + ψ2 L² + . . .) = 1 + β1 L + . . . + βq L^q.

Comparing coefficients at each lag:

    L⁰:  ψ0 = 1
    L¹:  ψ1 − α1 ψ0 = β1             =⇒  ψ1 = β1 + α1 ψ0
    L²:  ψ2 − α1 ψ1 − α2 ψ0 = β2     =⇒  ψ2 = β2 + α1 ψ1 + α2 ψ0
    L^h: ψh − Σ_{i=1}^{h} αi ψh−i = βh   =⇒  ψh = βh + Σ_{i=1}^{h} αi ψh−i,

where αh = 0 for h > p and βh = 0 for h > q.
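The ψ-recursion is easy to implement; a sketch (Python/NumPy; the ARMA(1,1) values used in the check are illustrative assumptions):

```python
import numpy as np

def arma_to_ma(alpha, beta, n):
    """First n+1 coefficients psi_0..psi_n of the MA(infinity) representation of an
    ARMA(p,q): psi_h = beta_h + sum_{i=1}^{h} alpha_i * psi_{h-i}, with psi_0 = 1."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    psi = np.zeros(n + 1)
    psi[0] = 1.0
    for h in range(1, n + 1):
        b_h = beta[h - 1] if h <= len(beta) else 0.0          # beta_h = 0 for h > q
        a_part = sum(alpha[i - 1] * psi[h - i]
                     for i in range(1, min(h, len(alpha)) + 1))  # alpha_i = 0 for i > p
        psi[h] = b_h + a_part
    return psi

# ARMA(1,1) with alpha1 = 0.5, beta1 = 0.3: psi_h = 0.8 * 0.5**(h-1) for h >= 1
print(arma_to_ma([0.5], [0.3], 6))
```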



2.6 Predicting ARMA(p, q) processes

For the mean square prediction error (MSPE) criterion,

    min_ŷ E[ (yt+h − ŷ)² | Ωt ],

the optimal predictor of yt+h is given by the conditional expectation for the given information set Ωt:

    ŷt+h|t = E[yt+h | Ωt],

where in the following the available information is the past of the stochastic process up to time t, Ωt = Yt.
For stationary ARMA processes, unlike many non-linear DGPs, the conditional mean can easily be derived analytically. Using the AR(∞) representation,

    yt+h = µ + Σ_{j=1}^{∞} φj (yt+h−j − µ) + εt+h,

and applying the expectation operator, the optimal predictor results as follows:

    ŷt+h|t = E[yt+h | Yt] = µ + Σ_{j=1}^{h−1} φj (E[yt+h−j | Yt] − µ) + Σ_{j=0}^{∞} φh+j (E[yt−j | Yt] − µ).

Using that E[ys | Yt] = ys for s ≤ t, the optimal predictor is given by:

    ŷt+h|t = µ + Σ_{j=1}^{h−1} φj (ŷt+h−j|t − µ) + Σ_{j=0}^{∞} φh+j (yt−j − µ),

and can be calculated recursively starting with the one-step predictor

    ŷt+1|t = µ + Σ_{j=1}^{∞} φj (yt+1−j − µ).

Thus the predictor ŷt+h|t of an ARMA(p, q) process is a linear function of the past realizations of the process.
The prediction error associated with the optimal predictor ŷt+h|t is given by

    êt+h|t = yt+h − E[yt+h | Yt].

Using the MA(∞) representation,

    yt+h = µ + Σ_{j=0}^{∞} ψj εt+h−j,

and the fact that E[εs | Yt] = 0 for s > t, the optimal predictor can be written as

    ŷt+h|t = µ + Σ_{j=h}^{∞} ψj εt+h−j.

Hence the prediction error is given by

    êt+h|t = yt+h − ŷt+h|t = Σ_{j=0}^{h−1} ψj εt+h−j.

The prediction error variance results as

    Var(êt+h|t) = ( Σ_{j=0}^{h−1} ψj² ) σε²,

such that under normality, εt ∼ N(0, σε²),

    yt+h | Yt ∼ N( ŷt+h|t, ( Σ_{j=0}^{h−1} ψj² ) σε² ),

which allows the construction of forecast intervals.


Note: The optimal predictor is a property of the data-generating process. A forecasting rule is
any systematic operational procedure for making statements about future events.
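For the pure AR(p) case the recursive predictor and the prediction-error variance can be sketched as follows (Python/NumPy; parameters are treated as known and the numerical values are illustrative assumptions):

```python
import numpy as np

def ar_forecast(y, nu, alpha, H, sigma2):
    """h-step forecasts of an AR(p) with known parameters, plus the prediction-error
    variance sum_{j<h} psi_j^2 * sigma2 from the MA(infinity) weights."""
    alpha = np.asarray(alpha, dtype=float)
    p = len(alpha)
    hist = list(y)                      # realized values, then appended forecasts
    fcst = []
    for h in range(1, H + 1):
        yhat = nu + sum(alpha[j] * hist[-(j + 1)] for j in range(p))
        hist.append(yhat)
        fcst.append(yhat)
    # MA(infinity) weights for the pure AR case: psi_h = sum_i alpha_i * psi_{h-i}
    psi = np.zeros(H)
    psi[0] = 1.0
    for h in range(1, H):
        psi[h] = sum(alpha[i - 1] * psi[h - i] for i in range(1, min(h, p) + 1))
    mspe = sigma2 * np.cumsum(psi ** 2)
    return np.array(fcst), mspe         # 95% interval: fcst +/- 1.96*sqrt(mspe)

rng = np.random.default_rng(5)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.2 + 0.7 * y[t - 1] + rng.standard_normal()
print(ar_forecast(y, nu=0.2, alpha=[0.7], H=5, sigma2=1.0))
```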

3 Estimation of dynamic models

3.1 The likelihood function of a Gaussian AR(1) process

Multivariate normal density

For the Gaussian AR(1) process,

    yt = ν + α1 yt−1 + εt,    εt ∼ NID(0, σ²),

the joint distribution of YT = (y1, y2, . . . , yT)' is Gaussian, YT ∼ N(µT, Σ), where µT = (µ, . . . , µ)' and Σ = [γ|i−j|]_{i,j=1,...,T} with

    γh = α1^h σy²,    σy² = (1 − α1²)⁻¹ σ²,    µ = (1 − α1)⁻¹ ν.

Thus the probability density function of the sample YT = (y1, y2, . . . , yT)' is given by the multivariate normal density

    f(YT) = (2π)^{−T/2} |Σ|^{−1/2} exp{ −(1/2)(YT − µT)' Σ⁻¹ (YT − µT) }.

Writing YT ∼ N(µT, σy² Ω) with Ωij = α1^{|i−j|} and collecting the parameters of the model in λ = (ν, α1, σ²)', the likelihood function is given by:

    L(λ) := f(YT; λ) = (2π σy²)^{−T/2} |Ω|^{−1/2} exp{ −(1/(2σy²))(YT − µT)' Ω⁻¹ (YT − µT) }.

Prediction-error decomposition (sequential factorization)

The prediction-error decomposition uses the fact that the εt are independent, identically distributed:

    f(ε2, . . . , εT) = Π_{t=2}^{T} fε(εt).

Since εt = yt − (ν + α1 yt−1), we have f(yt | yt−1) = fε( yt − (ν + α1 yt−1) ) for t = 2, . . . , T. Hence:

    f(y1, . . . , yT) = [ Π_{t=2}^{T} f(yt | yt−1) ] f(y1).

For εt ∼ NID(0, σε²), the log likelihood is given by:

    ℓ(λ) = log L(λ) = Σ_{t=2}^{T} log f(yt | Yt−1; λ) + log f(y1 | λ)
         = −(T/2) log(2π) − { ((T−1)/2) log(σε²) + (1/(2σε²)) Σ_{t=2}^{T} εt² }
                          − { (1/2) log σy² + (1/(2σy²)) (y1 − µ)² },

where y1 ∼ N(µ, σy²) with µ = ν/(1 − α1) and σy² = σε²/(1 − α1²).
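A minimal sketch of the exact Gaussian AR(1) log likelihood built from this prediction-error decomposition (Python/NumPy, illustrative parameter values):

```python
import numpy as np

def ar1_exact_loglik(y, nu, alpha, sigma2):
    """Exact Gaussian AR(1) log likelihood: stationary density of y_1 plus
    the conditional densities of y_2, ..., y_T."""
    y = np.asarray(y, dtype=float)
    mu = nu / (1.0 - alpha)                 # unconditional mean
    sigma2_y = sigma2 / (1.0 - alpha ** 2)  # unconditional variance (|alpha| < 1)
    ll = -0.5 * (np.log(2 * np.pi * sigma2_y) + (y[0] - mu) ** 2 / sigma2_y)
    eps = y[1:] - nu - alpha * y[:-1]       # one-step prediction errors
    T = len(y)
    ll += -0.5 * ((T - 1) * np.log(2 * np.pi * sigma2) + np.sum(eps ** 2) / sigma2)
    return ll

rng = np.random.default_rng(6)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 + 0.8 * y[t - 1] + rng.standard_normal()
print(ar1_exact_loglik(y, nu=0.5, alpha=0.8, sigma2=1.0))
```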

3.2 ML estimation of Gaussian autoregressive models

Gaussian AR(p):

    yt = ν + α1 yt−1 + ··· + αp yt−p + εt,    εt ∼ NID(0, σε²).

Exact MLE

The exact MLE follows from the prediction error decomposition as:

    λ̃ = arg max_λ f(YT | λ) = arg max_λ { [ Π_{t=p+1}^{T} f(yt | Yt−1; λ) ] f(y1, . . . , yp | λ) }.

Conditional MLE = OLS

Take Yp = (y1, . . . , yp)' as fixed pre-sample values:

    λ̃ = arg max_λ f(YT | Yp; λ) = arg max_λ Π_{t=p+1}^{T} f(yt | Yt−1; λ).

Conditioning on Yp = (y1, . . . , yp)':

    ℓ(λ) = log f(YT | Yp; λ) = Σ_{t=p+1}^{T} log f(yt | Yt−1; λ)
         = −((T−p)/2) log(2π) − ((T−p)/2) log(σε²) − (1/(2σε²)) Σ_{t=p+1}^{T} εt²(ν, α),

where εt(ν, α) = yt − ν − α1 yt−1 − ··· − αp yt−p.

Thus the MLE of (ν, α) results from minimizing the sum of squared residuals:

    arg max_{(ν,α)} ℓ(ν, α) = arg min_{(ν,α)} Σ_{t=p+1}^{T} εt²(ν, α).

• Hence the MLE γ̃ = (ν̃, α̃)' is equivalent to the OLS estimates ν̂ and α̂ (see the sketch at the end of this subsection).
• ν̃ and α̃ are consistent estimators if {yt} is stationary, and √T(γ̃ − γ) is asymptotically normally distributed.
• The asymptotic properties of the MLE also follow from the general ML theory (see Appendix A.1).
• MLE based on the exact likelihood and on the conditional likelihood are asymptotically equivalent.

Also asymptotically equivalent:

• MLE of the mean-adjusted model

    yt − µ = α1 (yt−1 − µ) + ··· + αp (yt−p − µ) + εt,    εt ∼ NID(0, σε²),

  where µ = (1 − α1 − ··· − αp)⁻¹ ν.
• OLS estimation of α in the mean-adjusted model above, where µ is estimated by the sample mean ȳ = T⁻¹ Σ_{t=1}^{T} yt.
• Yule-Walker estimation of α (method of moments):

    (α1, . . . , αp)' = [γ̂|i−j|]_{i,j=1,...,p}⁻¹ (γ̂1, . . . , γ̂p)',

  where γ̂h = (T − h)⁻¹ Σ_{t=h+1}^{T} (yt − ȳ)(yt−h − ȳ) and µ̂ = ȳ = T⁻¹ Σ_{t=1}^{T} yt.
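A sketch of the conditional-MLE-as-OLS regression for an AR(p) (Python/NumPy; the simulated AR(2) used for the check is an illustrative assumption):

```python
import numpy as np

def ar_ols(y, p):
    """Conditional (on the first p observations) ML = OLS estimation of an AR(p)."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    Z = np.column_stack([np.ones(T - p)] + [y[p - j:T - j] for j in range(1, p + 1)])
    yy = y[p:]
    theta = np.linalg.lstsq(Z, yy, rcond=None)[0]     # (nu, alpha_1, ..., alpha_p)
    resid = yy - Z @ theta
    sigma2 = resid @ resid / (T - p)                  # ML estimate of the error variance
    return theta, sigma2

rng = np.random.default_rng(7)
y = np.zeros(1000)
for t in range(2, 1000):
    y[t] = 0.1 + 0.6 * y[t - 1] + 0.3 * y[t - 2] + rng.standard_normal()
print(ar_ols(y, 2))
```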

3.3 ML estimation of moving-average models

Gaussian MA(q):

    yt = µ + εt + β1 εt−1 + ··· + βq εt−q,    εt ∼ NID(0, σε²).

Conditional MLE = NLLS

Conditioning on ε0 = (ε0, ε−1, . . . , ε1−q)':

    ℓ(λ) = log f(YT | ε0; λ) = Σ_{t=1}^{T} log f(εt | εt−1; λ)
         = −(T/2) log(2π) − (T/2) log(σε²) − (1/(2σε²)) Σ_{t=1}^{T} εt²(µ, β),

where εt(µ, β) = yt − µ − β1 εt−1 − ··· − βq εt−q.

• The unobservable initial values are set to zero, ε0 = E[ε0] = 0.
• The MLE of (µ, β) results again from minimizing the sum of squared errors.
• Analytical expressions for the MLE are usually not available due to highly non-linear first-order conditions.
• MLE requires the application of numerical optimization techniques to ℓ(λ) (see §3.4).
• Conditioning requires invertibility, e.g. |β1| < 1 for MA(1) processes, since (setting µ = 0 for simplicity) the recursion

    εt = yt − β1 εt−1 = (−β1)^t ε0 + Σ_{j=0}^{t−1} (−β1)^j yt−j

  otherwise amplifies the effect of the unobserved initial value.
• Method of moments: Wilson algorithm.


3.4 ML estimation of Gaussian ARMA models

Gaussian ARMA(p, q):

    yt = ν + α1 yt−1 + ··· + αp yt−p + εt + β1 εt−1 + ··· + βq εt−q,    εt ∼ NID(0, σε²).

Conditional MLE = NLLS

Conditioning on Yp and εp = (εp, εp−1, . . . , εp−q)':

    ℓ(λ) = log f(YT | Yp, εp; λ) = Σ_{t=p+1}^{T} log f(εt | Yt−1, εt−1; λ)
         = −((T−p)/2) log(2π) − ((T−p)/2) log(σε²) − (1/(2σε²)) Σ_{t=p+1}^{T} εt²,

where εt = yt − [ν + α1 yt−1 + ··· + αp yt−p + β1 εt−1 + ··· + βq εt−q].

• The initial values of the unobservable error terms are set to zero, εp := E[εp] = 0.
• This assumes stability and invertibility.
• Exact ML and backcasting of εp are possible.
• Check for common roots in α(L) and β(L):

    α(L)(yt − µ) = β(L) εt  ⇔  (1 − πL) α(L)(yt − µ) = (1 − πL) β(L) εt,

  so that a common factor (1 − πL) is not identified.
• A test of ARMA(p, q) vs. ARMA(p + 1, q + 1) involves nuisance parameters.

Simple consistent method: Durbin (1960)

• Approximate the AR(∞) representation of the ARMA(p, q) process by an AR(h) with h > p + q.
• OLS delivers residuals

    ût^(h) = yt − ν̂^(h) − α̂1^(h) yt−1 − ··· − α̂h^(h) yt−h.

• Then estimate the ARMA parameters by regressing

    yt = ν + α1 yt−1 + ··· + αp yt−p + β1 ût−1^(h) + ··· + βq ût−q^(h) + ut,

  as sketched in the code below.
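A sketch of Durbin's two-step procedure (Python/NumPy; the function name durbin_arma, the choice h = max(10, p+q+1) and the simulated ARMA(1,1) are illustrative assumptions, not part of the notes):

```python
import numpy as np

def durbin_arma(y, p, q, h=None):
    """Durbin (1960) two-step ARMA(p,q) estimator: fit a long AR(h) by OLS, take its
    residuals as estimates of eps_t, then regress y_t on its own lags and lagged residuals."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    h = h or max(10, p + q + 1)                     # long autoregression order, h > p + q

    def lags(x, k, start):
        return np.column_stack([x[start - j:len(x) - j] for j in range(1, k + 1)])

    # Step 1: AR(h) by OLS, residuals u_hat
    Z1 = np.column_stack([np.ones(T - h), lags(y, h, h)])
    b1 = np.linalg.lstsq(Z1, y[h:], rcond=None)[0]
    u = np.full(T, np.nan)
    u[h:] = y[h:] - Z1 @ b1

    # Step 2: regress y_t on (1, y_{t-1..t-p}, u_hat_{t-1..t-q}) over the usable sample
    start = h + q
    Z2 = np.column_stack([np.ones(T - start), lags(y, p, start), lags(u, q, start)])
    b2 = np.linalg.lstsq(Z2, y[start:], rcond=None)[0]
    return b2                                       # (nu, alpha_1..alpha_p, beta_1..beta_q)

rng = np.random.default_rng(8)
eps = rng.standard_normal(2001)
y = np.zeros(2001)
for t in range(1, 2001):
    y[t] = 0.5 * y[t - 1] + eps[t] + 0.3 * eps[t - 1]
print(durbin_arma(y[1:], p=1, q=1))   # should be close to (0, 0.5, 0.3)
```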

Grid search

• Consider a grid of potential values of β = (β1, . . . , βq)': {β^(1), . . . , β^(M)}.
• Maximize the log likelihood ℓ(ν, α, σ² | β^(m)) conditional on β^(m) = (β1^(m), . . . , βq^(m))' by GLS:

    (ν^(m), α^(m), σ^{2,(m)}) = arg max_{ν,α,σ²} ℓ(ν, α, σ² | β^(m)).

• Finally, the MLE is given by:

    β̃ = arg max_{β^(m) ∈ {β^(1),...,β^(M)}} ℓ(ν^(m), α^(m), σ^{2,(m)} | β^(m)).

Numerical optimization: Newton-Raphson

• Idea: second-order Taylor approximation of ℓ(λ) around λ0:

    ℓ(λ) ≅ ℓ(λ0) + (∂ℓ(λ0)/∂λ)'(λ − λ0) + (1/2)(λ − λ0)'(∂²ℓ(λ0)/∂λ∂λ')(λ − λ0).

  Maximization gives the first-order condition

    ∂ℓ(λ0)/∂λ + (∂²ℓ(λ0)/∂λ∂λ')(λ − λ0) = 0
    =⇒  λ = λ0 − [∂²ℓ(λ0)/∂λ∂λ']⁻¹ ∂ℓ(λ0)/∂λ.

• Iterate until convergence, λ^(m+1) ≈ λ^(m):

    λ^(m+1) = λ^(m) − [∂²ℓ(λ^(m))/∂λ∂λ']⁻¹ ∂ℓ(λ^(m))/∂λ.

• Modification (due to approximation and numerical errors):

    λ^(m+1) = λ^(m) − s [∂²ℓ(λ^(m))/∂λ∂λ']⁻¹ ∂ℓ(λ^(m))/∂λ,

  where the step length s is selected by evaluating the log-likelihood function at several points (grid search).
• General problems: convergence; local maxima. A numerical sketch of the iteration follows below.
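A minimal Newton-Raphson sketch with numerical score and Hessian (Python/NumPy; the finite-difference step sizes and the AR(1) example likelihood are illustrative assumptions, and convergence from a given starting value is not guaranteed, as noted above):

```python
import numpy as np

def newton_raphson(loglik, lam0, tol=1e-8, max_iter=100, step=1.0):
    """Maximize loglik(lambda) by Newton-Raphson with numerical score and Hessian."""
    def num_grad(f, x, h=1e-5):
        g = np.zeros(len(x))
        for i in range(len(x)):
            e = np.zeros(len(x))
            e[i] = h
            g[i] = (f(x + e) - f(x - e)) / (2 * h)
        return g

    def num_hess(f, x, h=1e-4):
        n = len(x)
        H = np.zeros((n, n))
        for i in range(n):
            e = np.zeros(n)
            e[i] = h
            H[:, i] = (num_grad(f, x + e) - num_grad(f, x - e)) / (2 * h)
        return H

    lam = np.asarray(lam0, dtype=float)
    for _ in range(max_iter):
        g = num_grad(loglik, lam)
        H = num_hess(loglik, lam)
        new = lam - step * np.linalg.solve(H, g)   # lambda_{m+1} = lambda_m - s*H^{-1}*score
        if np.max(np.abs(new - lam)) < tol:
            return new
        lam = new
    return lam

# Example: conditional AR(1) log likelihood in (nu, alpha, log sigma2)
rng = np.random.default_rng(9)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.3 + 0.6 * y[t - 1] + rng.standard_normal()

def ll(lam):
    nu, a, ls2 = lam
    eps = y[1:] - nu - a * y[:-1]
    return -0.5 * (len(eps) * (np.log(2 * np.pi) + ls2) + np.exp(-ls2) * eps @ eps)

print(newton_raphson(ll, [0.0, 0.0, 0.0]))
```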

3.5 The asymptotic distribution of the Ordinary Least Squares (OLS) estimator for time series models

Regressions with lagged endogenous variables

We consider the AR(p) model

    yt = ν + α1 yt−1 + ··· + αp yt−p + ut,    ut ∼ WN(0, σu²),

where y0, . . . , y1−p are given. Notation as a regression model:

    yt = zt' θ + ut

with θ = (ν, α1, . . . , αp)' and zt = (1, yt−1, . . . , yt−p)', or in matrix form

    [ y1 ]   [ 1  y0     ···  y1−p ]       [ u1 ]
    [  ⋮ ] = [ ⋮   ⋮           ⋮   ] θ  +  [  ⋮ ]
    [ yT ]   [ 1  yT−1   ···  yT−p ]       [ uT ]
      y                Z                     u

OLS estimation

    θ̂ = (Z'Z)⁻¹ Z'y
       = (Z'Z)⁻¹ Z'(Zθ + u)
       = θ + (Z'Z)⁻¹ Z'u
       = θ + (T⁻¹ Z'Z)⁻¹ (T⁻¹ Z'u)
    =⇒  √T(θ̂ − θ) = (T⁻¹ Z'Z)⁻¹ (T^{−1/2} Z'u).

• OLS is no longer linear in y.
• Hence OLS cannot be BLUE. In general OLS is no longer unbiased.
• Small-sample properties are analytically difficult to derive.

Asymptotic properties of OLS

If yt is a stable AR(p) process and ut is a standard white noise process, then the following results hold (see Mann & Wald, 1943):

    T⁻¹ Z'Z  →p  Γ,
    T^{−1/2} Z'u  →d  N(0, σu² Γ).

Then consistency and asymptotic normality follow from Cramér's theorem:

    √T(θ̂ − θ)  →d  N(0, σu² Γ⁻¹),

where Γ = plim T⁻¹ Z'Z.

Impact of error autocorrelation on regression results

A necessary condition for the consistency of OLS with stochastic (but stationary) regressors is that zt is asymptotically uncorrelated with ut, i.e. plim T⁻¹ Z'u = 0:

    plim (θ̂ − θ) = plim (T⁻¹ Z'Z)⁻¹ plim (T⁻¹ Z'u) = Γ⁻¹ plim (T⁻¹ Z'u).

OLS is no longer consistent under autocorrelation of the regression error, as then

    plim (T⁻¹ Z'u) ≠ 0.

Example: Consider an AR(1) model with first-order autocorrelation of its errors,

    yt = α yt−1 + ut,
    ut = ρ ut−1 + εt,    εt ∼ WN(0, σε²),

such that Z' = (y0, . . . , yT−1). Then

    E[T⁻¹ Z'u] = E[ T⁻¹ Σ_{t=1}^{T} yt−1 ut ] = E[ T⁻¹ Σ_{t=1}^{T} yt−1 (ρ ut−1 + εt) ]
               = E[ T⁻¹ Σ_{t=1}^{T} yt−1 ( ρ(yt−1 − α yt−2) + εt ) ]
               = ρ [ T⁻¹ Σ_{t=1}^{T} E[yt−1²] ] − αρ [ T⁻¹ Σ_{t=1}^{T} E[yt−1 yt−2] ] + [ T⁻¹ Σ_{t=1}^{T} E[yt−1 εt] ]
               = ρ [ γy(0) − α γy(1) ],

where γy(h) is the autocovariance function of {yt}, which can be represented as an AR(2) process. The sketch after this example illustrates the resulting inconsistency by simulation.
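A small Monte Carlo sketch (Python/NumPy, illustrative values α = ρ = 0.5) shows the inconsistency: the OLS estimate of α converges to the first autocorrelation of yt (here 0.8) rather than to α = 0.5.

```python
import numpy as np

rng = np.random.default_rng(10)
alpha, rho, T, R = 0.5, 0.5, 2000, 200     # illustrative parameter values
ahat = np.empty(R)

for r in range(R):
    eps = rng.standard_normal(T)
    u = np.zeros(T)
    y = np.zeros(T)
    for t in range(1, T):
        u[t] = rho * u[t - 1] + eps[t]     # AR(1) errors
        y[t] = alpha * y[t - 1] + u[t]     # AR(1) model with autocorrelated errors
    ahat[r] = (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])   # OLS of y_t on y_{t-1}

print(ahat.mean())   # stays near 0.8, away from alpha = 0.5, even for large T
```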

3.6 Empirical modelling of time series

Box-Jenkins modelling philosophy

(1) Ensure stationarity by data transformation:
    • yt is an autoregressive integrated moving-average ARIMA(p, d, q) process iff

        α(L) ∆^d yt = ν + β(L) ut.

    • Example: a random walk is an ARIMA(0, 1, 0) process.
(2) Specify a parsimonious ARMA(p, q) model.
(3) Estimate the parameters in α(L) and β(L).
(4) Perform diagnostic analysis.

Information-criteria based model selection

Choose the model m* ∈ {1, . . . , M} which maximizes a penalized likelihood function

    IC(m) = 2 log L(m) − cT N(m),

where N(m) is the number of freely estimated parameters in model m. Very often the information criterion is formulated in terms of the estimated residual variance instead of the log-likelihood. The model selection then consists in choosing the model which minimizes:

    IC(m) = log σ̃²(m) + (cT/T) N(m).

Note that the concentrated likelihood function is given by

    log L = −(T/2) − (T/2) log(2π) − (T/2) log(σ̃²),

such that the two criteria are equivalent.
The most popular information criteria are the following:

Akaike information criterion, cT = 2:

    AIC(m) = log σ̃²(m) + (2/T) N(m);

Hannan-Quinn criterion, cT = 2 log(log(T)):

    HQ(m) = log σ̃²(m) + (2 log(log(T))/T) N(m);

Schwarz or Bayesian criterion, cT = log(T):

    SC(m) = log σ̃²(m) + (log(T)/T) N(m).
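A sketch of information-criteria based selection of the AR order (Python/NumPy; note that for a strict comparison the estimation sample should be held fixed across orders, which this simple sketch does not do):

```python
import numpy as np

def ar_sigma2(y, p):
    """Residual variance of an AR(p) fitted by OLS on the sample t = p+1, ..., T."""
    T = len(y)
    Z = np.column_stack([np.ones(T - p)] + [y[p - j:T - j] for j in range(1, p + 1)])
    resid = y[p:] - Z @ np.linalg.lstsq(Z, y[p:], rcond=None)[0]
    return resid @ resid / (T - p)

def select_ar_order(y, p_max):
    T = len(y)
    out = {}
    for p in range(1, p_max + 1):
        s2 = ar_sigma2(y, p)
        n = p + 1                                              # freely estimated parameters
        out[p] = (np.log(s2) + 2 * n / T,                      # AIC
                  np.log(s2) + 2 * np.log(np.log(T)) * n / T,  # HQ
                  np.log(s2) + np.log(T) * n / T)              # SC
    return out

rng = np.random.default_rng(11)
y = np.zeros(400)
for t in range(2, 400):
    y[t] = 0.6 * y[t - 1] + 0.3 * y[t - 2] + rng.standard_normal()
print(select_ar_order(y, 6))   # SC penalizes most heavily, AIC least
```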

General-to-specific modelling

‘LSE’ approach: starting from a general dynamic statistical model, which captures the essential characteristics of the underlying data set, standard testing procedures are used to reduce its complexity by eliminating statistically insignificant variables, checking the validity of the reductions at every stage to ensure the congruence of the selected model.

A Appendix: Statistical inference in dynamic models

A.1 The asymptotic distribution of the Maximum Likelihood (ML) estimator for time
series models

The likelihood function

We characterize the time series model for yt by its density conditional on xt and on past observations of yt and xt, denoted by Yt−1 and Xt−1:

    f(yt | xt, Yt−1, Xt−1, λ) = f(ut | λ),

where λ is a vector of unknown parameters and ut is the prediction error. Let λ̄ be the true value of the parameter vector λ and ℓT(λ) the (log) likelihood function for a sample y1, . . . , yT:

    ℓT(λ) = Σ_{t=1}^{T} lt(λ | ut) = Σ_{t=1}^{T} log f(yt | xt, Yt−1, Xt−1, λ).

The ML estimator λ̃ is the solution to the system of equations emerging from the first-order conditions for the maximum of the likelihood function,

    s(λ̃) = [∂ℓT(λ)/∂λ]_{λ̃} = 0,

where s(λ) is the score vector.

Consistency

If the model is globally identified, i.e. the following conditions are satisfied:

(i) plim ℓT(λ)/T = F(λ̄, λ) (existence of the plim of the (log) likelihood function),
(ii) F(λ̄, λ̄) > F(λ̄, λ) for all λ ≠ λ̄ in the parameter space (unique global maximum of F(·) at λ̄),
(iii) the plim converges uniformly,

then the ML estimator of λ has the consistency property

    plim λ̃ = λ̄.

Expansion of the first-order condition

Assume that the likelihood function has continuous second derivatives. Then expanding around λ̄ by using the Mean Value Theorem gives

    [∂ℓT/∂λ]_{λ̃} = 0 = [∂ℓT/∂λ]_{λ̄} + [∂²ℓT/∂λ∂λ']_{λ*} (λ̃ − λ̄),

where λ* is some point between λ̃ and λ̄.

After rescaling (dividing the first-order condition by √T), we obtain

    √T(λ̃ − λ̄) = − [ T⁻¹ (∂²ℓT/∂λ∂λ')_{λ*} ]⁻¹ [ T^{−1/2} (∂ℓT/∂λ)_{λ̄} ].    (1)

If the model is globally identified, the plim of the rescaled likelihood function is strictly concave in the neighbourhood of λ̄, and so

    plim [ T⁻¹ (∂²ℓT/∂λ∂λ')_{λ*} ] = plim [ T⁻¹ (∂²ℓT/∂λ∂λ')_{λ̄} ] = H

is a finite, symmetric, negative definite matrix. Thus, to determine the asymptotic distribution of the ML estimator it remains to determine the asymptotic distribution of the score vector

    T^{−1/2} (∂ℓT/∂λ)_{λ̄}.

The asymptotic distribution of the score vector

A central limit theorem can be derived by using the following property of the score vector: under regularity conditions the score vector has the martingale property

    E[ ∂ℓt/∂λ | xt, Yt−1, Xt−1 ] = ∂ℓt−1/∂λ.    (2)

This important result follows from the fact that

    E[ ∂lt/∂λ | xt, Yt−1, Xt−1 ] = 0,

which can be seen by differentiating the identity

    ∫ f(yt | xt, Yt−1, Xt−1, λ) dyt = 1

with respect to λ:

    ∫ ∂f(yt | xt, Yt−1, Xt−1, λ)/∂λ dyt = ∫ ∂ft/∂λ dyt = 0,

where ∂ft/∂λ = (∂lt/∂λ) ft since lt = log ft. Thus one has:

    ∫ (∂lt/∂λ) ft dyt = E[ ∂lt/∂λ | xt, Yt−1, Xt−1 ] = 0.

The martingale property (2) allows the use of a martingale CLT:

    T^{−1/2} (∂ℓT/∂λ)_{λ̄}  →d  N(0, ℑa),    (3)

where ℑa is the asymptotic information matrix

    ℑa = lim_{T→∞} T⁻¹ ℑ = lim_{T→∞} T⁻¹ E[ (∂ℓT/∂λ)(∂ℓT/∂λ)' ].

Assume that

    plim [ T⁻¹ (∂²ℓT/∂λ∂λ')_{λ̄} ] = lim_{T→∞} E[ T⁻¹ (∂²ℓT/∂λ∂λ') ] = H.

Then the Slutzky theorem implies that

    ℑa = −H.

Finally, by applying the Cramér linear transformation theorem to (1) we obtain the asymptotic distribution of the ML estimator,

    √T(λ̃ − λ̄)  →d  N(0, H⁻¹ ℑa H⁻¹),

such that the covariance matrix of the asymptotic distribution of the ML estimator is the limit of T times the inverse of the information matrix:

    √T(λ̃ − λ̄)  →d  N(0, ℑa⁻¹).    (MLE)

Hence, under standard conditions, the MLE λ̃ is consistent, asymptotically efficient and asymptotically normal.

Hannan theorem

Note that in the case of ARMA models, one need not invoke a martingale CLT in order to establish the asymptotic distribution of the score vector and therefore of the ML estimator. Instead the Hannan theorem can be applied: suppose that yt is generated by a regular linear process, let

    cij(h) = T⁻¹ Σ_{t=1}^{T} yit yj,t+h

be the sample second moments at the h-th lag of yt, and let cT be a vector of some cij(hs). Then

    √T (cT − E[cT])  →d  N(0, Σc),    (4)

where

    Σc = lim_{T→∞} T E[ (cT − E[cT])(cT − E[cT])' ].

A.2 Testing principles

We consider likelihood ratio (LR), Lagrange multiplier (LM) and Wald (W) tests of the hypothesis

    H0: φ(λ) = 0  vs.  H1: φ(λ) ≠ 0,

where φ: R^n → R^r is a continuously differentiable function with

    r = rank( ∂φ(λ)/∂λ' ) ≤ dim λ.

Let λ̃ denote the unconstrained MLE,

    λ̃ = arg max ℓ(λ),

and λ̂ the restricted MLE under the null,

    λ̂ = arg max { ℓ(λ) − π'φ(λ) },

where π is a vector of Lagrange multipliers.

Often the interest centres on linear restrictions. By partitioning the parameter vector as λ = (λ1, λ2), we have r restrictions on the parameter vector λ2,

    φ(λ) = Φλ2 − ϕ = 0,

while no constraint is placed on the parameter vector λ1.

Likelihood ratio test

The likelihood ratio (LR) test can be based on the statistic

    LR = 2 [ ℓ(λ̃) − ℓ(λ̂) ]  →d  χ²(r).    (LR)

Under the null, LR has an asymptotic χ²-distribution with r degrees of freedom. Under standard conditions, the LR test statistic has the same asymptotic distribution as the Lagrange multiplier statistic LM and the Wald statistic W.

Lagrange multiplier (LM) test

While the scores of an unrestricted model have sample mean zero by construction, the scores of the restricted model can be used to implement a Lagrange multiplier (LM) test,

    LM = s(λ̂)' ℑ⁻¹(λ̂) s(λ̂)  →d  χ²(r).    (LM)

Using an estimate of the asymptotic information matrix ℑ̂a(λ̂), we also have that

    LM = T⁻¹ s(λ̂)' ℑ̂a⁻¹(λ̂) s(λ̂)  →d  χ²(r).

The LM test operates under the condition that the model is estimated under the null.

• The LM test is especially suitable for model checking. As long as the null hypothesis is not altered, testing different model specifications against a maintained model does not require a new estimation.
• For all unrestricted parameters, say λ1, the score is zero:

    s1(λ̂) = [∂ℓ(λ)/∂λ1]_{λ̂} = 0.

  By partitioning the parameter vector as λ = (λ1, λ2), such that s(λ̂)' = (0' : s2(λ̂)'), the LM test statistic can be written as:

    LM = (0' : s2(λ̂)') [ ℑ11  ℑ12 ; ℑ21  ℑ22 ]⁻¹ (0' : s2(λ̂)')'
       = s2(λ̂)' [ℑ(λ̂)⁻¹]₂₂ s2(λ̂)
       = s2(λ̂)' ( ℑ22 − ℑ21 ℑ11⁻¹ ℑ12 )⁻¹ s2(λ̂).

• Note that under block-diagonality of the information matrix,

    ℑ = [ ℑ11  0 ; 0  ℑ22 ]  ⇒  ℑ⁻¹ = [ ℑ11⁻¹  0 ; 0  ℑ22⁻¹ ],

  the LM test statistic simplifies further to

    LM = s2(λ̂)' [ℑ22(λ̂)]⁻¹ s2(λ̂).

• Interpretation: the scores of the last r elements reflect the increase of the likelihood function if the constraints are relaxed:

    s(λ̂) = [∂φ(λ)/∂λ']'_{λ̂} π̂,

  where π̂ is the vector of Lagrange multipliers; since φ(λ) does not depend on λ1, the first block of the score is zero.

Wald test

The Wald test statistic W is based on an unconstrained estimator λ̃ which is asymptotically normal,

    √T(λ̃ − λ)  →d  N(0, Σλ̃),

where Σλ̃ = ℑa⁻¹ in the case of MLE. It follows that φ(λ̃) is also normal in large samples:

    √T [φ(λ̃) − φ(λ)]  →d  N( 0, (∂φ(λ)/∂λ')_{λ̃} Σλ̃ (∂φ(λ)/∂λ')'_{λ̃} ).

Thus, if H0: φ(λ) = 0 is true and the variance-covariance matrix is invertible, the Wald test is given by:

    W = φ(λ̃)' [ T⁻¹ (∂φ(λ)/∂λ')_{λ̃} Σ̃λ̃ (∂φ(λ)/∂λ')'_{λ̃} ]⁻¹ φ(λ̃)  →d  χ²(r),    (W)

where Σ̃λ̃ is a consistent estimator of Σλ̃.

For linear restrictions of the form φ(λ) = Φλ − ϕ, the relevant Wald statistic can be expressed as

    W = (Φλ̃ − ϕ)' [ T⁻¹ Φ Σ̃λ̃ Φ' ]⁻¹ (Φλ̃ − ϕ).

A.3 An example

Consider the simplest possible dynamic econometric model,

    yt = β xt + ut,
    ut = ρ ut−1 + εt,    u0 ∼ N( 0, σ²/(1 − ρ²) ),

where εt ∼ NID(0, σ²) and uncorrelated with xt−j for all t, j ≥ 0, and |ρ| < 1.

First note that the model can be written in the form (Koyck transformation):

    yt = β xt + (1 − ρL)⁻¹ εt
    ⇒  (1 − ρL) yt = (1 − ρL) β xt + εt.

Hence, conditional on β or ρ, we have a linear regression model,

    yt − ρ yt−1 = β (xt − ρ xt−1) + εt,
    yt − β xt = ρ (yt−1 − β xt−1) + εt,

such that OLS might be used iteratively to estimate β and ρ (Cochrane-Orcutt procedure = Gauss-Newton algorithm).
The prediction-error decomposition gives the approximate likelihood function of λ = (β, ρ, σ²)' for the sample y1, . . . , yT conditioned on y0:

    ℓ(λ) = log L(λ) = log f(yT, . . . , y1 | xT, . . . , x1, x0; λ)
         = Σ_{t=1}^{T} log f(yt | xt, xt−1, yt−1; λ)
         = −(T/2) log(2π) − (T/2) log σ² − (1/(2σ²)) Σ_{t=1}^{T} εt²(β, ρ),

where εt(β, ρ) = (yt − ρ yt−1) − β(xt − ρ xt−1) = (yt − β xt) − ρ(yt−1 − β xt−1).
Under standard conditions, the MLE λ̃ is consistent, asymptotically efficient and normal:

    √T(λ̃ − λ̄)  →d  N(0, ℑa⁻¹).

Estimates of the asymptotic information matrix ℑa can be based on second derivatives,

    ℑ2D = − T⁻¹ ∂²ℓ(λ)/∂λ∂λ',

or on outer products which use the conditional score vector ht(λ),

    ℑOP = T⁻¹ Σ_{t=1}^{T} ht(λ) ht(λ)',    with ht(λ) = ∂ log f(yt | Xt, Yt−1; λ)/∂λ,

where for this example we have the following results:

• Score vector:

    s(λ) = ( ∂ℓ(λ)/∂β, ∂ℓ(λ)/∂ρ, ∂ℓ(λ)/∂σ² )'
         = ( (1/σ²) Σ_{t=1}^{T} εt (xt − ρ xt−1),
             (1/σ²) Σ_{t=1}^{T} εt (yt−1 − β xt−1),
             −T/(2σ²) + (1/(2σ⁴)) Σ_{t=1}^{T} εt² )'.

• (Minus) second derivatives, H(λ) = −∂²ℓ(λ)/∂λ∂λ' (lower triangle shown; the matrix is symmetric):

    −∂²ℓ/∂β²     = (1/σ²) Σ_{t=1}^{T} (xt − ρ xt−1)²,
    −∂²ℓ/∂ρ∂β    = (1/σ²) Σ_{t=1}^{T} { ut−1 (xt − ρ xt−1) + εt xt−1 },
    −∂²ℓ/∂ρ²     = (1/σ²) Σ_{t=1}^{T} ut−1²,
    −∂²ℓ/∂σ²∂β   = (1/σ⁴) Σ_{t=1}^{T} εt (xt − ρ xt−1),
    −∂²ℓ/∂σ²∂ρ   = (1/σ⁴) Σ_{t=1}^{T} εt ut−1,
    −∂²ℓ/∂(σ²)²  = −T/(2σ⁴) + (1/σ⁶) Σ_{t=1}^{T} εt².

• Information matrix, ℑ = E[H(λ)]: taking expectations, the off-diagonal terms vanish and

    ℑ = diag( (T/σ²)[ (1 + ρ²)(σx² + µx²) − 2ρ(σx² ρx(1) + µx²) ],  T/(1 − ρ²),  T/(2σ⁴) ).

• Asymptotic information matrix:

    ℑa = lim_{T→∞} T⁻¹ ℑ = diag( mxx/σ², 1/(1 − ρ²), 1/(2σ⁴) )
    ⇒  ℑa⁻¹ = diag( σ² mxx⁻¹, 1 − ρ², 2σ⁴ ),

  where mxx = (1 + ρ²)(σx² + µx²) − 2ρ(σx² ρx(1) + µx²).

LM test of the null ρ = 0 versus ρ ≠ 0

Since only ρ is restricted, the score vector evaluated under the null has the form

    s(λ̂) = s( (β̂, ρ̂, σ̂ε²)' ) = ( 0, [∂ℓ(λ)/∂ρ]_{λ̂}, 0 )'.

Using the block-diagonality of the covariance matrix, the LM test is given by

    LM = s(λ̂)' [ℑ(λ̂)]⁻¹ s(λ̂) = s2(λ̂)' [ℑ22(λ̂)]⁻¹ s2(λ̂)  →d  χ²(1),

where the model is estimated under H0: ρ = 0, such that:

    ℑ22(λ̂) = T/(1 − ρ̂²) = T,
    s2(λ̂) = (1/σ̂²) Σ_{t=1}^{T} ε̂t ût−1 = (1/σ̂²) Σ_{t=1}^{T} ût ût−1,    with σ̂² = T⁻¹ Σ_{t=1}^{T} ût².

Thus a test for first-order residual autocorrelation results:

    LM = T⁻¹ [ Σ_{t=1}^{T} ût ût−1 / ( T⁻¹ Σ_{t=1}^{T} ût² ) ]² = T [ Σ_{t=1}^{T} ût ût−1 / Σ_{t=1}^{T} ût² ]² ≅ T ρ̂u²(1).
Wald test of the null ρ = 0 versus ρ ≠ 0

Write the restriction in the form

    φ(λ) = Φλ = 0,    where λ = (β, ρ, σ²)' and Φ = (0, 1, 0).

The Wald test is based on the unconstrained estimator λ̃ = (β̃, ρ̃, σ̃²)' and is given by:

    W = λ̃'Φ' [ T⁻¹ Φ Σ̃λ̃ Φ' ]⁻¹ Φλ̃
      = ρ̃ [ T⁻¹ σ̃ρ̃² ]⁻¹ ρ̃
      = ρ̃² / [ (1 − ρ̃²)/T ]  →d  χ²(1).

Thus the Wald test statistic W has the form of a squared t-test statistic:

    W = [ ρ̃ / √( σ̃ρ̃²/T ) ]² = t²_{ρ=0}.
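A sketch of the resulting LM test for first-order residual autocorrelation, LM = T ρ̂u²(1) (Python/NumPy; the data-generating values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(12)
T, beta, rho = 500, 1.0, 0.3          # illustrative values; rho != 0 under the alternative
x = rng.standard_normal(T)
eps = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = rho * u[t - 1] + eps[t]
y = beta * x + u

# Estimate under H0 (rho = 0): OLS of y on x, residuals u_hat
bhat = (x @ y) / (x @ x)
uhat = y - bhat * x

# LM statistic: T times the squared first-order residual autocorrelation
rho1 = (uhat[1:] @ uhat[:-1]) / (uhat @ uhat)
LM = T * rho1 ** 2
print(LM)     # compare with the chi-squared(1) 5% critical value 3.84
```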

4 Vector Autoregressive Processes

Since Sims's (1980) critique of traditional macroeconometric modelling, vector autoregressive (VAR) models have been widely used in macroeconometrics. Their popularity is due to the flexibility of the VAR framework and the ease of producing macroeconomic models with useful descriptive characteristics, within which statistical tests of economically meaningful hypotheses can be executed. Over the last two decades VARs have been applied to numerous macroeconomic data sets, providing an adequate fit of the data and fruitful insights into the interrelations between economic variables.

4.1 Stable vector autoregressive processes

Vector autoregressive processes are the multivariate generalization of the univariate autoregressive process discussed in section 2. The basic model considered in the following is a stable vector autoregressive model, possibly including an intercept term: the K-dimensional time series vector yt = (y1t, . . . , yKt)' is generated by a vector autoregressive process of order p, denoted a VAR(p) model,

    yt = ν + A1 yt−1 + ··· + Ap yt−p + εt,    (5)

where t = 1, . . . , T, ν is a vector of intercepts and the Ai are coefficient matrices.
The error process εt = (ε1t, . . . , εKt)' is an unobservable zero-mean vector white noise process,

    εt ∼ WN(0, Σ),

that is, E[εt] = 0, E[εt εt'] = Σ, and E[εt εs'] = 0 for s ≠ t, where the variance-covariance matrix Σ is time-invariant, positive definite and non-singular.
The stronger assumptions of independence,

    εt ∼ IID(0, Σ),

or Gaussianity,

    εt ∼ NID(0, Σ),

imply that the innovations are the one-step prediction errors,

    εt = yt − E[yt | Yt−1].

Thus the expectation of yt conditional on the information set Yt−1 = (yt−1, yt−2, . . . , y1−p) is given by

    E[yt | Yt−1] = ν + Σ_{j=1}^{p} Aj yt−j.

Stability implies stationarity:

    det(IK − A1 z − ··· − Ap z^p) ≠ 0    for |z| ≤ 1.

(A sketch of a numerical stability check follows at the end of this subsection.) The mean is given by

    µ = E[yt] = (IK − A1 − ··· − Ap)⁻¹ ν,

and the autocovariance function results from the Yule-Walker equations,

    Γ(h) = E[yt y't−h] = A1 Γ(h − 1) + ··· + Ap Γ(h − p) + E[εt y't−h],

where Γ(h) = Γ(−h)', and E[εt y't−h] = 0 for h > 0 and = Σ for h = 0. The autocorrelation function R(h) = [ρij(h)] can be computed as:

    ρij(h) = γij(h) / √( γii(0) γjj(0) ).
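A numerical check of the stability condition via the companion matrix (Python/NumPy; the coefficient matrices are illustrative assumptions): the VAR(p) is stable iff all eigenvalues of the companion matrix lie strictly inside the unit circle, which is equivalent to the determinant condition above.

```python
import numpy as np

def var_is_stable(A_list):
    """Stability of a VAR(p) via the (Kp x Kp) companion matrix: stable iff all
    eigenvalues have modulus strictly smaller than one."""
    K = A_list[0].shape[0]
    p = len(A_list)
    companion = np.zeros((K * p, K * p))
    companion[:K, :] = np.hstack(A_list)
    companion[K:, :-K] = np.eye(K * (p - 1))
    eig = np.linalg.eigvals(companion)
    return np.max(np.abs(eig)) < 1.0, eig

A1 = np.array([[0.5, 0.1],
               [0.2, 0.4]])
A2 = np.array([[0.1, 0.0],
               [0.0, 0.1]])
print(var_is_stable([A1, A2]))
```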

4.2 Impulse response analysis

Stable VAR(p) processes possess an MA(∞) representation,

    yt = µ + Σ_{j=0}^{∞} Ψj εt−j,

where Ψ0 = IK. For VAR(1) processes, Ψj = A1^j.
The (k, l)-th element ψkl,j of the MA matrix Ψj can be interpreted as the reaction of variable k in response to a unit shock in variable l, j periods ago.
A problematic assumption in this type of impulse response analysis is that the shock occurs in only one of the variables. Such an assumption might be reasonable if the errors of the different variables are independent. If the variance-covariance matrix is non-diagonal, orthogonalized impulses are often considered instead. Define ηt = P⁻¹ εt as the ‘structural’ impulse, where P is the lower triangular matrix resulting from the Choleski decomposition Σ = PP'. Depending on the choice of P a particular orthogonalized MA(∞) representation emerges,

    yt = µ + Σ_{j=0}^{∞} Ψj P P⁻¹ εt−j = µ + Σ_{j=0}^{∞} Θj ηt−j,

where Θ0 = P and Θj = Ψj P. Note that by selecting P, a certain causal ordering of the variables is implied. Therefore the resulting representation

    B yt = γ + Σ_{j=1}^{p} Γj yt−j + ηt,

with ηt ∼ WN(0, IK), B = P⁻¹, γ = P⁻¹ ν and Γj = P⁻¹ Aj, is said to be a structural VAR(p). Note that the restriction E[ηkt²] = 1 can be replaced by the normalization Bkk = 1 with E[ηkt²] = ωk².
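A sketch of the Ψj recursion and the orthogonalized responses Θj = Ψj P (Python/NumPy; coefficient and covariance matrices are illustrative assumptions):

```python
import numpy as np

def var_ma_weights(A_list, n):
    """MA(infinity) matrices Psi_0..Psi_n of a stable VAR(p):
    Psi_0 = I, Psi_j = sum_{i=1}^{min(j,p)} Psi_{j-i} A_i."""
    K = A_list[0].shape[0]
    p = len(A_list)
    Psi = [np.eye(K)]
    for j in range(1, n + 1):
        Psi.append(sum(Psi[j - i] @ A_list[i - 1] for i in range(1, min(j, p) + 1)))
    return Psi

def orthogonalized_irf(A_list, Sigma, n):
    """Orthogonalized impulse responses Theta_j = Psi_j P, with P the lower-triangular
    Choleski factor of Sigma (the implied causal ordering follows the variable ordering)."""
    P = np.linalg.cholesky(Sigma)
    return [Psi @ P for Psi in var_ma_weights(A_list, n)]

A1 = np.array([[0.5, 0.1],
               [0.2, 0.4]])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
Theta = orthogonalized_irf([A1], Sigma, 8)
print(Theta[0])     # = P: impact responses to the orthogonalized shocks
print(Theta[4])     # responses 4 periods after the shocks
```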

4.3 Structural analysis

Since VAR models represent the correlations among a set of variables, they are often used to analyze the relationships between variables. In the following we consider the partition of yt into (zt, xt)':

    zt = ν1 + Σ_{i=1}^{p} ( A11,i zt−i + A12,i xt−i ) + ε1t,
    xt = ν2 + Σ_{i=1}^{p} ( A21,i zt−i + A22,i xt−i ) + ε2t,

where the innovation process εt is given by

    (ε1t', ε2t')' ∼ IID( 0, [ Σ11  Σ12 ; Σ21  Σ22 ] ).

Granger causality

Granger (1969) introduced a concept of causality which is based on the idea that a cause can
not come after the effect. Thus, if a variable x affects a variable z, the former should help to
predict the latter:

    MSPE(zt+h | Zt ∪ Xt) < MSPE(zt+h | Zt)

for some h > 0, where MSPE(zt+h | Ω) is the mean square prediction error associated with the optimal predictor conditional on a given information set Ω, Zt denotes the history of z up to time t and Xt that of x.
Condition for stable VAR(p) processes: xt does not Granger-cause zt if and only if

A12,i = 0 for all i = 1, . . . , p.

Alternatively, zt does not Granger-cause xt if and only if

A21,i = 0 for all i = 1, . . . , p.



Instantaneous causality

Sometimes the term ‘instantaneous causality’ is used in economic analyses. There is said to
be instantaneous causality between zt and xt if

    MSPE(zt | Yt−1 ∪ xt) < MSPE(zt | Yt−1).

Condition for stable VAR(p) processes: there is no instantaneous causality between zt and xt if and only if

    E[ε1t ε2t'] = Σ12 = Σ21' = 0.

Weak exogeneity

Instantaneous non-causality implies that the (set of) equations of the VAR are unrelated to
each other. Thus the probability density function of yt conditional on its past Yt−1 is given by

f (yt |Yt−1 ; λy ) = f (zt |Yt−1 ; λz ) · f (xt |Yt−1 ; λx )

where the parameter vectors λx and λz of the system can be varied freely. Consequently, all
possible reductions of the system can be efficiently estimated by OLS, and model-selection
procedures can be applied equation-by-equation without a loss in efficiency.

Strong exogeneity

If xt is not Granger-causal for zt and there is no instantaneous causality between zt and xt ,


then zt is said to be strongly exogenous for xt. Thus the stochastic process zt can be analyzed without modelling xt, and the probability density function of YT = (y1, . . . , yT) can be
factorized as:
f (YT |λy ) = f (XT |ZT ; λx ) · f (ZT ; λz ).

4.4 Estimation of VAR models

The maximum likelihood estimation of the Gaussian VAR model (5) for an observed multiple
time series YT = (yT , . . . , y1−p ) when the initial values of Y0 = (y0 , . . . , y1−p ) are fixed,
results in a seemingly unrelated regression (SUR) problem discussed in Professor Pagan’s
lecture.
In the case of the unrestricted VAR, OLS is asymptotically efficient. An asymptotically efficient
estimation method under linear restrictions is (feasible) GLS. Many other estimation problems
in the VAR have been solved by Sims (1980) (see also section 3 in Lütkepohl, 1991).

References
Granger, C. W. J. (1969). Investigating causal relations by econometric models and cross-
spectral methods. Econometrica, 37, 424–438.
Lütkepohl, H. (1991). Introduction to Multiple Time Series Analysis. Berlin: Springer.
Sims, C. A. (1980). Macroeconomics and reality. Econometrica, 48, 1–48. Reprinted in
Granger, C. W. J. (ed.) (1990), Modelling Economic Series. Oxford: Clarendon Press.
