Dynamic Econometric Models: Time Series Econometrics for Microeconometricians, 2011
Walter Beckert
Department of Economics
Birkbeck College, University of London
Institute for Fiscal Studies
1 Introduction
1.1 Overview
The course surveys linear and nonlinear econometric models and estimation tech-
niques, presenting them in a method of moments framework. While the models are
applicable under general assumptions on the data generating process, the emphasis
will be on applications in time series analysis.
The first part of the course treats single equation models, while the second part
is devoted to systems of equations. Starting from a review of the linear regression
model (OLS, GLS, FGLS), the course revisits basic properties of stochastic processes
and their implications for time-series regressions, cast in the form of general
autoregressive distributed lag (ARDL) and error correction model (ECM) representations.
These notes are intended as a reference guide to the material covered in the course.
The lectures will follow the notes closely, but will focus on the main principles
and results, omitting much of the intervening algebra. The presentation of the
course material rests on the kind of mathematical and statistical tools and the styles
of argument that microeconometricians are typically familiar with. The primary
objective is to provide an approach to econometric concepts in time series analysis
that appeals to the intuitive understanding of microeconometricians, not a fully
rigorous delineation of results.
2 Generalized Method of Moments Estimation
This section provides a basic review of Method of Moments estimation in the familiar
context of the linear regression model. It sets up the general framework and notation
in which the remainder of the course and these notes will proceed.
(i) EY|X [yt |xt ] = g(xt ; θ0 ) a.s. for all t, where θ0 ∈ Θ ⊂ Rk is an unknown
parameter vector, and the function g is possibly nonlinear in θ0 ; in the
special case of linearity, EY|X [yt |xt ] = x′t θ0 a.s., the linear regression
model; in the latter case, this is equivalent to EY|X [yt − x′t θ0 |xt ] = 0 a.s.
for all t;
(ii) continuing with the linear model, E_{Y|X}[(y_t − x_t′θ0)² | x_t] = σ0² > 0 a.s. for
all t, which is referred to as conditional homoskedasticity.
Note: (i) by itself does not identify θ0, unless k = 1; (ii) identifies σ0². Based on (i),
unconditional moment conditions can be derived by iterated expectations:
(i′) E_{YX}[x_t (y_t − x_t′θ0)] = 0 for all t,
i.e. k unconditional moments, which can identify θ0. Note also that (ii) holds
unconditionally as well: E_{YX}[(y_t − x_t′θ0)²] = σ0² for all t.
The idea behind Method of Moments (MOM) estimation of θ0 and σ02 is to replace
population moments by sample analogues (empirical moments, sample averages):
For any θ ∈ Θ,
moments in (i′): E_T[x_t (y_t − x_t′θ)] = (1/T) ∑_{t=1}^T x_t (y_t − x_t′θ) =: m_T(y, X; θ),
moments in (ii): E_T[(y_t − x_t′θ)²] = (1/T) ∑_{t=1}^T (y_t − x_t′θ)².
The MOM estimators θ̂_T and σ̂_T² solve the empirical analogues to (i′) and (ii):
(iii) m_T(y, X; θ̂_T) = 0,
(iv) σ̂_T² = E_T[(y_t − x_t′θ̂_T)²].
In this linear model, the MOM estimator for θ0 is equivalent to the familiar OLS
estimator: (iii) implies
(1/T) ∑_{t=1}^T x_t (y_t − x_t′θ̂_T) = 0
((1/T) ∑_t x_t x_t′) θ̂_T = (1/T) ∑_t x_t y_t
E_T[x_t x_t′] θ̂_T = E_T[x_t y_t]
θ̂_T = (E_T[x_t x_t′])^{−1} E_T[x_t y_t] = (∑_t x_t x_t′)^{−1} ∑_t x_t y_t = (X′X)^{−1} X′y = θ̂_OLS,
provided rk(X′X) = k. Hence, θ̂_T is conditionally unbiased: E[θ̂_T | X] = θ0. Its con-
ditional variance is var(θ̂T |X) = (X′ X)−1 X′ var(y|X)X(X′ X)−1 ; provided that the
yt are conditionally independent across t, i.e. that var(y|X) = σ02 IT , the conditional
variance of the MOM estimator reduces to var(θ̂T |X) = σ02 (X′ X)−1 . In this case,
the MOM estimator enjoys all the properties of the OLS estimator, a direct conse-
quence of the Gauss-Markov Theorem, which rests entirely on conditional moment
assumptions: Suppose E_{Y|X}[y|X] = Xθ0 = [x_t′θ0]_{t=1,··· ,T}, and var(y|X) = σ0² I_T,
σ02 > 0; then θ̂T is the best linear unbiased estimator (BLUE), i.e. it is efficient (in
the sense of having minimum variance among all linear, unbiased estimators of θ0 ).
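As a concrete illustration, the following is a minimal sketch in Python/NumPy (the language used for all code sketches in these notes) of the MOM/OLS computation on simulated data; the design and all names are illustrative, not part of the course material.

    import numpy as np

    rng = np.random.default_rng(0)
    T, k = 200, 3
    X = rng.normal(size=(T, k))            # regressors x_t stacked into X
    theta0 = np.array([1.0, -0.5, 2.0])    # true parameter vector
    y = X @ theta0 + rng.normal(size=T)    # conditional homoskedasticity, sigma_0^2 = 1

    # MOM/OLS: solve the k empirical moment conditions E_T[x_t (y_t - x_t' theta)] = 0
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

    resid = y - X @ theta_hat
    s2 = resid @ resid / (T - k)             # unbiased estimator of sigma_0^2
    var_theta = s2 * np.linalg.inv(X.T @ X)  # estimated conditional variance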
The MOM estimator of the variance satisfies σ̂_T² = ((T − k)/T) s_T², where
s_T² = (y − Xθ̂_T)′(y − Xθ̂_T)/(T − k) is the (unbiased) OLS estimator of σ0². This
implies that the MOM estimator σ̂_T² is biased in small samples (finite T).
Suppose that the previous population orthogonality conditions between x_t and the resid-
uals y_t − x_t′θ0 do not hold, but that, for some vector of instruments z_t with dim(z_t) = m ≥ k,
E_{YX}[z_t (y_t − x_t′θ0)] = 0 for all t,   (1)
where Z = (z_1, · · · , z_T)′. If m > k, the matrix Z′X is
not square. Let P_Z = Z(Z′Z)^{−1}Z′, the orthogonal projector onto the column space of
Z, col(Z); recall that orthogonal projectors are idempotent and symmetric. Then,
orthogonality of y − Xθ0 and Z according to (1) implies orthogonality of y − Xθ0
and P_ZX, so that
θ̂_2SLS = (X′P_ZX)^{−1} X′P_Zy = (X̂′X̂)^{−1} X̂′y,
where X̂ = P_ZX are the fitted values from the regression of the columns of X
onto Z, i.e. only those components of X which are orthogonal to y − Xθ0 accord-
ing to (1).
ing to (1). The conditional variance of the 2SLS estimator is var(θ̂2SLS |X, Z) =
σ0² (X′P_ZX)^{−1}. Note that, if dim(z_t) = k and rk(Z′X) = k, then (X′P_ZX)^{−1} =
(X′Z(Z′Z)^{−1}Z′X)^{−1} = (Z′X)^{−1} Z′Z (X′Z)^{−1} = (Z′X)^{−1} Z′Z ((Z′X)^{−1})′, i.e. the con-
ditional variance collapses to that of the IV estimator. Note also, for future
reference, that the conditional variance of the IV moment functions is
var(Z′(y − Xθ0) | X, Z) = σ0² Z′Z.
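A corresponding sketch of the 2SLS computation, under an assumed (purely illustrative) endogeneity structure in simulated data:

    import numpy as np

    rng = np.random.default_rng(1)
    T, m = 500, 2
    Z = rng.normal(size=(T, m))              # instruments, here m > k = 1
    v = rng.normal(size=T)
    x = Z @ np.array([1.0, 0.5]) + v         # regressor correlated with the error via v
    u = 0.8 * v + rng.normal(size=T)
    y = 2.0 * x + u
    X = x.reshape(-1, 1)

    PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)   # orthogonal projector onto col(Z)
    Xhat = PZ @ X                            # first-stage fitted values
    theta_2sls = np.linalg.solve(Xhat.T @ X, Xhat.T @ y)   # (X'PZ X)^{-1} X'PZ y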
Suppose, as at the outset, that E_{YX}[x_t (y_t − x_t′θ0)] = 0 for all t, but var(y|X) = Ω,
a positive definite, symmetric T × T matrix. This change in the second moment as-
sumptions can be expected to affect the second moment properties of the OLS/MOM
estimator θ̂T , i.e. its conditional variance-covariance matrix and, thereby, its effi-
ciency.
As before, the moment conditions involving the first moments yield the OLS/MOM
estimator for θ0, θ̂_T = (X′X)^{−1}X′y. The second moment assumptions, however, now
imply
var(θ̂_T | X) = (X′X)^{−1} X′ΩX (X′X)^{−1}.
The Gauss-Markov Theorem implies that, while θ̂_T is still conditionally unbiased, it
is no longer efficient. Note also: the conditional variance of the moment functions
is now var(X′(y − Xθ0) | X) = X′ΩX.
Weighting the moment conditions by the inverse of Ω yields the GLS estimator,
θ̂_GLS = (X′Ω^{−1}X)^{−1} X′Ω^{−1}y, with var(θ̂_GLS | X) = (X′Ω^{−1}X)^{−1}.
The GLS estimator above is only feasible if Ω is known. If it is not known, it
needs to be estimated, based on first-stage residuals obtained from consistent, but
inefficient, OLS estimation of θ0. Once a consistent estimator Ω̂_T is obtained, θ0 can
be re-estimated in a second step, using Ω̂_T in lieu of Ω:
θ̂_FGLS = (X′Ω̂_T^{−1}X)^{−1} X′Ω̂_T^{−1}y.
This line of reasoning suggests that it is generally beneficial (in the sense of efficiency)
to weight moment functions by the inverses of their conditional variances. The Generalized
Method of Moments (GMM) proceeds in this fashion.[1]
To illustrate this, re-consider the instrumental variable set-up above, with dim(z_t) =
m ≥ k and var(y|X, Z) = σ0² I_T. In this case, as shown above, the moment functions
z_t (y_t − x_t′θ0) have conditional variance var(Z′(y − Xθ0) | X, Z) = σ0² (Z′Z). For a
positive definite, symmetric weighting matrix Σ, define
θ̂_GMM = arg min_{θ∈Θ} E_T[Z′(y − Xθ)]′ Σ E_T[Z′(y − Xθ)].
The first-order conditions of the minimization problem define the GMM estimator
θ̂_GMM; in this case,
(y − Xθ̂_GMM)′ Z Σ Z′X = 0 ⇒ θ̂_GMM = (X′ZΣZ′X)^{−1} X′ZΣZ′y,
[1] Hansen, L.P. (1982): “Large Sample Properties of Generalized Method of Moments Estimators”, Econometrica, 50(4), 1029-1054; and Hansen, L.P. and K.J. Singleton (1982): “Generalized Instrumental Variables Estimation of Nonlinear Rational Expectations Models”, Econometrica, 50(5), 1269-1286.
with conditional variance
var(θ̂_GMM | X, Z; Σ) = (X′ZΣZ′X)^{−1} X′ZΣZ′ (σ0² I_T) ZΣZ′X (X′ZΣZ′X)^{−1}
= σ0² (X′ZΣZ′X)^{−1} X′ZΣZ′ZΣZ′X (X′ZΣZ′X)^{−1}.
Setting Σ = (Z′Z)^{−1}, the inverse of the conditional variance of the moment functions
(up to the factor σ0²), this variance reduces to σ0² (X′P_ZX)^{−1}, the 2SLS variance.
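A sketch of the GMM computation for a given weighting matrix Σ; with Σ = (Z′Z)^{−1} it coincides with 2SLS, consistent with the variance result above. The simulated design is again purely illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    T, m = 400, 3
    Z = rng.normal(size=(T, m))
    x = Z @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=T)
    y = 2.0 * x + rng.normal(size=T)
    X = x.reshape(-1, 1)

    # theta_GMM = (X'Z Sigma Z'X)^{-1} X'Z Sigma Z'y
    Sigma = np.linalg.inv(Z.T @ Z)   # optimal weight under homoskedasticity (up to sigma_0^2)
    A = X.T @ Z @ Sigma @ Z.T
    theta_gmm = np.linalg.solve(A @ X, A @ y)   # identical to 2SLS for this Sigma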
The Hausman test examines the consistency of MOM estimators in the face of
possible failures of moment conditions.[2]
Suppose θ̃_T and θ̂_T are two estimators of θ0, obtained on the basis of different
assumptions about valid moment restrictions; e.g. θ̃_T uses moments beyond those
used by θ̂_T. The null hypothesis H0 is that both θ̂_T and θ̃_T are √T-consistent; i.e.,
in the example, that the additional moments are valid, so that θ̃_T is relatively more
efficient than θ̂_T. Under H0,
√T(θ̂_T − θ̃_T) →d N(0, V_D),
for some asymptotic variance-covariance matrix V_D, which may be singular. The
alternative hypothesis HA implies that lim_{T→∞} Pr(|θ̂_T − θ̃_T| > ϵ) > 0 for some ϵ > 0.
The Hausman test statistic takes the usual quadratic form
H_T = T (θ̃_T − θ̂_T)′ V̂_D^− (θ̃_T − θ̂_T),
where V̂_D^− is a consistent estimator of the (generalized) inverse of V_D. Under the
null hypothesis, its asymptotic distribution is χ² with degrees of freedom equal to
the number of restrictions imposed by the null hypothesis; with k testable moment
conditions, the null hypothesis of their validity is rejected at the α-level if
H_T > χ²_k(1 − α). Note that it follows from the orthogonality of relatively
efficient estimators that, under the null hypothesis,
cov(β̂_OLS, β̂_OLS − β̂_IV/2SLS) = 0
⇒ var(β̂_OLS) = cov(β̂_OLS, β̂_IV/2SLS).
Hence,
V_D = var(β̂_OLS − β̂_IV/2SLS) = var(β̂_IV/2SLS) − var(β̂_OLS) = σ0² [(X′P_ZX)^{−1} − (X′X)^{−1}].
[2] Hausman, J.A. (1978): “Specification Tests in Econometrics”, Econometrica, 46(6), 1251-1271.
A convenient fact often facilitates the computation of the Hausman test statistic
H_T. A consequence of the (conditional) orthogonality between a relatively efficient
estimator and its difference from other consistent, but inefficient, estimators is that
the (conditional) covariance between such estimators equals the variance of the
efficient estimator. Hence, if θ̃_T is efficient relative to θ̂_T, then
V_D = avar(√T(θ̂_T − θ̃_T)) = avar(√T(θ̂_T − θ0)) − avar(√T(θ̃_T − θ0)).
As an example, consider the linear, homoskedastic model and let X = [X1, X2],
where X1 consists of exogenous covariates, while X2 is suspected of lack of exo-
geneity. In other words, the validity of the set of unconditional moment conditions
E[X2′(y − X1θ1 − X2θ2)] = 0 is in doubt. Let W be an array of instruments for
X2 in case X2 is endogenous, and let Z = [X1, W] denote the array of all in-
struments (i.e. the columns of X1 act as instruments for themselves). Then, the
null hypothesis H0 is that X2 is exogenous, while the alternative hypothesis HA
is that X2 is not exogenous. Under H0, the Gauss-Markov Theorem implies that
the OLS estimator for θ0 = (θ1′, θ2′)′, θ̂_OLS, is efficient; its asymptotic distribution,
conditional on X, is √T(θ̂_OLS − θ0) →d N(0, σ0² (X′X)^{−1}). Under HA, θ̂_OLS is in-
consistent, but the 2SLS estimator θ̂_2SLS is consistent; its asymptotic distribution,
conditional on X and Z, is √T(θ̂_2SLS − θ0) →d N(0, σ0² (X′P_ZX)^{−1}). Since the OLS
and 2SLS estimators are (conditionally) orthogonal under the null hypothesis, their
conditional covariance matrix is zero under H0. Hence, conditional on X and Z,
√T(θ̂_OLS − θ̂_2SLS) →d N(0, σ0² ((X′P_ZX)^{−1} − (X′X)^{−1})),
and the Hausman-Wu test statistic takes the form
H_T = (θ̂_OLS − θ̂_2SLS)′ [σ̂_T² ((X′P_ZX)^{−1} − (X′X)^{−1})]^− (θ̂_OLS − θ̂_2SLS),
where σ̂_T² is an estimator of σ0² based on either the OLS or 2SLS regression residuals.
This test is referred to as the Hausman-Wu Exogeneity Test.
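A sketch of the Hausman-Wu computation, following the formulas above; the pseudo-inverse accommodates the possible singularity of the variance of the contrast.

    import numpy as np

    def hausman_wu(y, X, Z):
        th_ols = np.linalg.solve(X.T @ X, X.T @ y)
        PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
        th_2sls = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)
        u = y - X @ th_ols        # sigma_0^2 from OLS residuals (2SLS residuals work too)
        s2 = u @ u / len(y)
        V = s2 * (np.linalg.inv(X.T @ PZ @ X) - np.linalg.inv(X.T @ X))
        d = th_2sls - th_ols
        return d @ np.linalg.pinv(V) @ d   # compare to chi^2 critical values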
An equivalent regression-based implementation augments the structural equation
with the reduced-form residuals,
y = X1θ1 + X2θ2 + ûγ + e, H0: γ = 0,
where û is the set of the vectors of fitted residuals from the reduced form regressions
of the hypothesized endogenous RHS variables onto all exogenous variables. This
hypothesis can be tested using a t-test if X2 ∈ R (i.e. dim(col(X2)) = 1), and an
F-test otherwise.
Another test of the validity of moment conditions can be based on the GMM criterion
function. When the parameter vector of interest θ0 is exactly identified under the
alternative hypothesis and over-identified under the null hypothesis, GMM
moment tests are called tests of over-identifying restrictions. Let E_{YX}[m(y_t, x_t; θ0)] =
0 denote the r population moment conditions under the null hypothesis, where
dim(θ0) = k and r > k, i.e. there are r − k over-identifying restrictions. The
empirical analogues to the population moment functions are E_T[m(y_t, x_t; θ0)]. Let
Σ̂⋆_T be (a consistent estimator of) the (optimal) GMM weighting matrix Σ⋆, and let
θ̂_GMM be the GMM estimator of θ0. The minimized, second round GMM criterion
function
J_T = T E_T[m(y_t, x_t; θ̂_GMM)]′ Σ̂⋆_T E_T[m(y_t, x_t; θ̂_GMM)]
then serves as a test statistic for the validity of the over-identifying moment con-
ditions. This particular test statistic is referred to as the Sargan-Hansen (1982)
J-test.[4] Its asymptotic distribution, as T → ∞, is χ²_{r−k}, and the test rejects the null
hypothesis when the statistic exceeds the critical value of a χ²_{r−k} random variable
for the appropriate test size. This does not permit any inference about which of the
moment conditions is invalid, however.
In the case of the example in the preceding subsection, the Sargan-Hansen J test
statistic of the null hypothesis that Z is a valid array of instruments is
J_T = (y − Xθ̂_2SLS)′ Z [Z′(I − P_X)Z]^{−1} Z′(y − Xθ̂_2SLS) / σ̂_T²,
and its asymptotic distribution under the null hypothesis is also χ2r−k ; see Appendix
for details. Note that, in general, the Hausman-Wu test requires estimation under
both the null and the alternative hypotheses, while the Sargan-Hansen J test only
requires estimation under the null hypothesis.
[4] Hansen, L.P. (1982): “Large Sample Properties of Generalized Method of Moments Estimators”, Econometrica, 50(4), 1029-1054.
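For the linear IV model, a standard and numerically simple form of the J statistic is J_T = û′P_Zû/σ̂_T², computed from the 2SLS residuals; the sketch below uses this form, which serves the same purpose as the expression above.

    import numpy as np

    def sargan_j(y, X, Z):
        PZ = Z @ np.linalg.solve(Z.T @ Z, Z.T)
        theta = np.linalg.solve(X.T @ PZ @ X, X.T @ PZ @ y)
        u = y - X @ theta             # 2SLS residuals
        s2 = u @ u / len(y)
        return (u @ PZ @ u) / s2      # asymptotically chi^2 with r - k d.o.f. under H0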
Broadly speaking, the case of weak instruments refers to a situation in which the
correlation between the endogenous variable and its instrument(s) is low. The treat-
ment of situations with weak instruments is an area of active current research.[5] In
the case of a single endogenous variable x2, a test for the weakness of instruments,
due to Bound et al. (1995),[6] is a partial R², denoted by R_p², that isolates the impact
of the instruments on the endogenous variable, after eliminating the effect of the
other exogenous variables on the latter. The statistic Rp2 is given by the R2 of the
regression
x2 − x̂2 = (z − ẑ)′ δ + ν,
where x̂2 = PX1 x2 and ẑ = PX1 z. When Rp2 is low, then z is considered an array of
weak instruments for x2 .
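A sketch of the partial R² computation; X1 collects the included exogenous variables (it should contain the constant) and Z_excl the excluded instruments. All names are illustrative.

    import numpy as np

    def partial_r2(x2, Z_excl, X1):
        P1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)
        x_res = x2 - P1 @ x2           # x2 - x2_hat: purge the other exogenous variables
        Z_res = Z_excl - P1 @ Z_excl   # z - z_hat
        delta = np.linalg.lstsq(Z_res, x_res, rcond=None)[0]
        e = x_res - Z_res @ delta
        return 1 - (e @ e) / (x_res @ x_res)   # low values indicate weak instruments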
[5] For a recent survey, see Stock and Yogo (2002), NBER Technical Working Paper 284.
[6] Bound, J., Jaeger, D.A., and R.M. Baker (1995): “Problems with Instrumental Variables Estimation When the Correlation Between the Instruments and the Endogenous Explanatory Variable Is Weak”, Journal of the American Statistical Association, 90(430), 443-450.
Various tests for model selection have been proposed in the literature, but none
is entirely satisfactory. In regression models, the regression R² = 1 − û′û/y′y is
often considered, where û is the vector of fitted residuals. Superior models exhibit
larger R² (equivalently, smaller residual sums of squares). This measure does not
require distributional assumptions and, hence, is embedded in the method of
moments framework. Alternatively, under distri-
butional assumptions, measures based on the log-likelihood are available and have
some information theoretic interpretation. The Akaike information criterion (AIC)
adjusts the sample log-likelihood at the MLE θ̂ for model j, lT (θ̂(j) ), for the number
of estimated parameters, k_j = dim(θ^{(j)}), so that AIC_j = −2l_T(θ̂^{(j)}) + 2k_j. Under
normality assumptions, the AIC reduces to AIC_j = 2k_j + T ln(û_j′û_j/T), where û_j
is the vector of fitted residuals of model j. The Schwarz Bayesian information or
posterior odds criterion (SBC), in addition, accounts for sample size T and is de-
fined as SBC_j = −2l_T(θ̂^{(j)}) + k_j ln(T). The SBC is a closely related variant of the
Bayes Information Criterion (BIC), defined here as BIC = SBC/T. Under nor-
mality assumptions, the BIC reduces to BIC_j = ln(û_j′û_j/T) + k_j ln(T)/T. Models
with lower information criteria are deemed superior. In comparison to AIC, the
SBC/BIC criterion tends to choose more parsimonious models. Many practitioners
also test the goodness-of-fit in terms of the accuracy of out-of-sample prediction.
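A sketch of the normality-based information criteria defined above, computed from the fitted residuals of a candidate model j:

    import numpy as np

    def info_criteria(resid, k, T):
        s = np.log(resid @ resid / T)   # ln(u'u/T)
        aic = 2 * k + T * s             # AIC_j
        sbc = k * np.log(T) + T * s     # SBC_j
        return aic, sbc, sbc / T        # BIC = SBC/T; lower values are preferred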
1. Structural Stability: Tests for structural stability examine whether the pa-
rameters to be estimated are constant over the sampling period (the null hy-
pothesis). Considering a simple linear regression model, under the alternative
hypothesis,
y_t = x_t′θ1 + ϵ_t, t = 1, · · · , T1,
y_t = x_t′θ2 + ϵ_t, t = T1 + 1, · · · , T.
Denoting by ϵ̂ the residuals of the pooled regression and by ϵ̂1, ϵ̂2 those of the
two subperiod regressions, the Chow breakpoint test statistic is
C_T = [(ϵ̂′ϵ̂ − (ϵ̂1′ϵ̂1 + ϵ̂2′ϵ̂2))/k] / [(ϵ̂1′ϵ̂1 + ϵ̂2′ϵ̂2)/(T − 2k)] ∼ F_{k,T−2k}.
This test requires that the variances of the residuals ϵ_t be the same in both
subperiods. This can be tested using the Goldfeld-Quandt test
GQ_T = s1²/s2² = [ϵ̂1′ϵ̂1/(T1 − k)] / [ϵ̂2′ϵ̂2/(T2 − k)] ∼ F_{T1−k,T2−k},
where the larger variance estimate should form the numerator so that the
statistic is greater than unity. Chow also suggested a test for predictive failure
for the case when T2 < k,
C̃_T = [(ϵ̂′ϵ̂ − ϵ̂1′ϵ̂1)/T2] / [ϵ̂1′ϵ̂1/(T1 − k)] ∼ F_{T2,T1−k}.
2. Functional Form: The Ramsey RESET test amounts to running a second-stage
regression of ϵ̂_t on x_t and the squared predicted dependent variable ŷ_t², and
testing whether the coefficient on ŷ_t² is zero, using a t-test. A numerically
equivalent implementation of the test uses y_t in lieu of ϵ̂_t in the second-stage
regression. Higher powers of ŷ_t can be included to test for further degrees of
curvature, using F-tests.
4. Serial Correlation: Consider the model
y_t = x_t′θ0 + ϵ_t,
ϵ_t = ρϵ_{t−1} + ν_t,
where ν_t is white noise, i.e. serially uncorrelated with mean zero and
constant variance. If θ0 were estimated by OLS, then the estimated residuals
would be
ϵ̂_t = y_t − x_t′θ̂ = x_t′(θ0 − θ̂) + ρϵ_{t−1} + ν_t.
The hypothesis H0: ρ = 0 can then be examined by regressing ϵ̂_t on ϵ̂_{t−1}
(and x_t) and testing the significance of the coefficient on ϵ̂_{t−1},
using a t-test. Testing against the alternative hypothesis of higher-order serial
correlation in the process for ϵ_t can be done analogously by including further
lags of ϵ̂_t and testing that their coefficients are jointly equal to zero, using an
F-test.
It should be noted that cases (3.) and (4.) do not impede the usual first-
moment properties of the OLS estimator for θ0 (unbiasedness, consistency),
because they pertain to second-moment assumptions. But the conditional
variance-covariance matrix of θ̂ is no longer σ0²(X′X)^{−1}; it is estimated by the
sandwich formula
var̂(θ̂_OLS | X) = (X′X)^{−1} X′Ω̂X (X′X)^{−1},
where Ω̂ is an estimator of var(y|X) that accommodates the heteroskedasticity
and/or serial correlation.
5. Influential Observations: An influential observation is a data point that is
crucial to inferences drawn from the data. While the various approaches
described here provide quantitative measures of the statistical influence of an
observation, it is important to keep in mind, however, that only knowledge of
the subject matter and the data itself can determine whether this influence is
substantively informative or merely due to data reporting error.
Consider the linear regression model in which the k × k matrix X′X has full
rank. Define the orthogonal projection matrix H = X(X′X)^{−1}X′. Then,
Ŷ_t = H_tt Y_t + ∑_{s≠t} H_ts Y_s,
so that the leverage H_tt measures the weight of observation t in its own fitted
value; inspecting the leverages can be highly effective for picking up single
outliers or influential observations. Another jackknife[7] measure of the influence
of an observation on the joint inference regarding θ0 is given by Cook's distances
CD_t = (θ̂_{−t} − θ̂)′ (X′X) (θ̂_{−t} − θ̂) / (k s²), t = 1, · · · , T,
which can be compared to an F-distribution to estimate the percentage influ-
ence of Y_t on θ̂.
[7] The idea of the jackknife is due to Tukey. Based on the “leave one out” estimates θ̂_{−t}, t = 1, · · · , T, the random variables T θ̂ − (T − 1)θ̂_{−t} may be treated as i.i.d. estimates of θ0. They provide an effective way to obtain a sampling distribution of θ̂ without recourse to asymptotic arguments and as an alternative to the bootstrap.
Best econometric practice usually derives an estimable statistical model from an
underlying economic model or theory that rationalizes the data generating process.
It is important to recognize that, while the various goodness-of-fit measures and
diagnostic tests may be generally useful statistical tools for specification testing
and model selection, when they fail to support the estimated model they do not
provide any guidance as to how to adjust the model because they are not linked
to the economic model. Failures of these tests, therefore, may be indicative of
a misspecified economic model and suggest a re-examination at that level of the
econometric analysis.
Let E[y_t − x_t′θ0 | x_t] = E[ϵ_t | x_t] = 0 for all t, but suppose that
ϵ_t = y_t − x_t′θ0 = u_t + αu_{t−1},
where
u_{t−s} | x_t ∼ i.i.d., E[u_{t−s} | x_t] = 0, E[u_{t−s}² | x_t] = σ0², s = 0, 1, · · · ,
so that
E[ϵ_t ϵ_s | X] = σ0² ((1 + α²) 1{t = s} + α 1{|t − s| = 1})
for any t, where 1{A} is an indicator function taking value 1 if the event A occurs,
and zero otherwise. Hence, the conditional second moment matrix of the residuals
is tridiagonal,
var(y − Xθ0 | X) = σ0² ·
[ 1 + α²   α        0        · · ·
  α        1 + α²   α        · · ·
  0        α        1 + α²   ⋱
  ⋮        · · ·    ⋱        ⋱ ] =: Ω.
3.1.2 Estimation
Let X = y_− = (y0, · · · , y_{T−1})′. Note that y − Xθ0 = y − y_−ρ0 = [y_t −
ρ0 y_{t−1}]_{t=1,··· ,T}. While y − Xθ0 | X involved T random variables with non-degenerate
distributions, its analogue in the AR(1) model is the vector y − y_−ρ0 | y_−; but this
involves T − 1 constants (since it is conditioned on y_−), and only y_T − ρ0 y_{T−1} | y_− =^d
y_T − ρ0 y_{T−1} | y_{T−1} has a non-degenerate distribution. Therefore, in the case of au-
toregressive processes, the joint distribution of the vector y conditional on initial
conditions (i.e. on y0 in the case of an AR(1); on (y0, · · · , y_{−p+1}) in the case of an
AR(p), for integer p) needs to be determined.
By recursive substitution,
y_t = ρ0 y_{t−1} + ϵ_t = ρ0 (ρ0 y_{t−2} + ϵ_{t−1}) + ϵ_t = · · · = ρ0^t y0 + ∑_{s=0}^{t−1} ρ0^s ϵ_{t−s}.
From this representation, E[y_t | y0] = ρ0^t y0, and
cov(y_t, y_s | y0) = σ0² ρ0^{|t−s|} ∑_{τ=0}^{min{t,s}−1} ρ0^{2τ} = σ0² ρ0^{|t−s|} (1 − ρ0^{2 min{t,s}}) / (1 − ρ0²).
Note that both first and second conditional moments depend on t. Without
further restrictions, this would imply that any MOM estimator of ρ0 (OLS, FGLS)
would depend on t as well, which is inconsistent with the notion of ρ0 being a
time-invariant population parameter. This problem could only be overcome if the
unconditional moments did not depend on t. Regarding the first unconditional
moments, by iterated expectations,
E[y_t] = ρ0^t E[y0],
which is independent of t if, and only if, E[y0] = E[y_t] = 0 for all t. Regarding the
second unconditional moments,
var(y_t) = ρ0^{2t} var(y0) + σ0² (1 − ρ0^{2t})/(1 − ρ0²),
which is independent of t if, and only if, var(y0) = σ0²/(1 − ρ0²), requiring |ρ0| < 1.
Assuming (covariance) stationarity, i.e. |ρ0| < 1, the above results on the mo-
ments of the stationary distribution can now be obtained more easily: discarding
the trivial case ρ0 = 0, for the first moments, for any t, E[y_t] = ρ0 E[y_{t−1}] = ρ0 E[y_t],
so that E[y_t] = 0; for the second moments, var(y_t) = ρ0² var(y_t) + σ0², so that
var(y_t) = σ0²/(1 − ρ0²).
3.2.2 Estimation
Denote the characteristic polynomial in the lag operator L of the AR(1) process by
Φ(L) = 1 − ρ0 L, so that Φ(L)y_t = ϵ_t.[8] It is necessary and sufficient for the AR(1) to
be stationary that the root z of the characteristic equation Φ(z) = 0 lie outside
the unit circle, i.e. that |z| = 1/|ρ0| > 1, which is equivalent to the previous condition
for covariance stationarity. In the case ρ0 = 1, the process has a unit root,
y_t = y_{t−1} + ϵ_t;
if ϵ_t is also i.i.d., then this is a random walk.
Notice that its first difference, y_t − y_{t−1} = ϵ_t, is stationary. Hence, in the case of
ρ0 = 1, the process {y_t, t ≥ 0} is said to be difference-stationary, or integrated of
order 1, denoted by I(1). In this notation, the covariance stationary case is denoted
by I(0).
[8] The lag operator L is defined by Ly_t = y_{t−1}.
W.l.o.g. let y0 = 0 a.s. for the remainder of this section. Then,
ρ̂_T − 1 = (∑_{t=1}^T y_{t−1}²)^{−1} ∑_{t=1}^T y_{t−1} ϵ_t.
Under the unit root null,
E[y_{t−1}² | y0 = 0] = ∑_{s=1}^{t−1} E[ϵ_s²] = (t − 1)σ0² a.s.,
so that a.s.[9]
E[∑_{t=1}^T y_{t−1}² | y0 = 0] = ∑_{t=1}^T (t − 1)σ0² ≈ σ0² ∫_1^T (t − 1) dt ∝ σ0² T²,
i.e. ∑_{t=1}^T y_{t−1}² = Op(T²).
Similarly, E[y_{t−1}ϵ_t] = E[y_{t−1} E[ϵ_t | y_{t−1}]] = 0, and, since y_{t−1}ϵ_t = ½(y_t² − y_{t−1}² − ϵ_t²),
E[∑_{t=1}^T y_{t−1}ϵ_t | y0 = 0] = E[½(y_T² − y0²) − ½ ∑_{t=1}^T ϵ_t² | y0 = 0]
= E[½((∑_{t=1}^T ϵ_t)² − ∑_{t=1}^T ϵ_t²) | y0 = 0]
= E[∑_{s<t} ϵ_s ϵ_t | y0 = 0] = 0 a.s.,
so that a.s.
E[(∑_{t=1}^T y_{t−1}ϵ_t)² | y0 = 0] = E[(∑_{s<t} ϵ_s ϵ_t)² | y0 = 0] = ½ T(T − 1)σ0⁴ = Op(T²).
[9] This section uses the Mann-Wald notation: a random variable w_T = Op(T^α) if, for any δ > 0, there exists M > 0 such that Pr(|T^{−α}w_T| > M) < δ for all T; w_T = op(T^α) if Pr(|T^{−α}w_T| > δ) → 0 for every δ > 0, as T → ∞.
This suggests that
ρ̂_T − 1 = Op(T²)^{−1} Op(T) = Op(T^{−1}),
i.e. that, in the unit root case, T(ρ̂_T − 1) = Op(1). This is to be compared to
the stationary case, in which √T(ρ̂_T − ρ0) = Op(1), with asymptotic distribution
N(0, 1 − ρ0²). The preceding argument makes clear why the asymptotic variance
of this distribution collapses in the unit root case as ρ0 → 1: in the unit root
case, ρ̂_T converges to ρ0 = 1 at rate T^{−1}, i.e. faster than T^{−1/2}, the reason being
that ∑_t y_{t−1}² = Op(T²) in the non-stationary case, while ∑_t y_{t−1}² = Op(T) in the
stationary case. In the stationary case, |ρ0| < 1,
ρ̂_T − ρ0 = ((1/T) ∑_{t=1}^T y_{t−1}²)^{−1} (1/T) ∑_{t=1}^T y_{t−1}ϵ_t;
it follows from a LLN that (1/T) ∑_{t=1}^T y_{t−1}² → E[y_t²] = σ0²/(1 − ρ0²), while
var((1/T) ∑_{t=1}^T y_{t−1}ϵ_t) = (1/T) σ0⁴/(1 − ρ0²). Therefore, by a CLT,
(1/√T) ∑_{t=1}^T y_{t−1}ϵ_t →d N(0, σ0⁴/(1 − ρ0²)),
so that
√T(ρ̂_T − ρ0) →d N(0, 1 − ρ0²),
i.e. √T(ρ̂_T − ρ0) = Op(1).
The OLS estimator of β0 = ρ0 − 1 in the re-parameterized regression
∆y_t = β0 y_{t−1} + ϵ_t
is
β̂_T = β0 + (∑_t y_{t−1}²)^{−1} ∑_t y_{t−1}ϵ_t.
The preceding discussion shows that this estimator converges to β0 < 0 at rate √T
if the process {y_t, t ≥ 0} is I(0), and it converges to β0 = 0 at rate T if {y_t, t ≥ 0}
is I(1). This is the basis for the Dickey-Fuller unit root test, which tests the null
hypothesis of a unit root (equivalent to β0 = 0) against the alternative hypothesis
of stationarity (equivalent to β0 < 0). The Dickey-Fuller test statistic[10] is
DF_T = β̂_T / se(β̂_T).
The Dickey-Fuller test statistic has a non-standard (Dickey-Fuller) distribution un-
der the null hypothesis. This distribution depends both on the estimated model and
the true data generating process; e.g. the critical value of this test in this model
with 5 percent probability of rejecting a true null hypothesis is approximately -2.9,
while it would be around -2 for a standard one-sided t-test.
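A sketch of the Dickey-Fuller computation for the no-constant, no-trend specification above, on a simulated random walk so that the null is true. In applied work, DF_T must be compared to tabulated Dickey-Fuller critical values for the chosen specification, e.g. as reported by statsmodels' adfuller.

    import numpy as np

    rng = np.random.default_rng(3)
    T = 250
    y = np.cumsum(rng.normal(size=T))   # random walk: beta_0 = 0

    dy, ylag = np.diff(y), y[:-1]
    beta = (ylag @ dy) / (ylag @ ylag)  # OLS in  dy_t = beta_0 y_{t-1} + eps_t
    e = dy - beta * ylag
    se = np.sqrt((e @ e / (len(dy) - 1)) / (ylag @ ylag))
    DF = beta / se      # compare to Dickey-Fuller, not standard normal, critical values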
3.2.4 Extensions
Consider the AR(1) model with deterministic trend,
y_t = α0 + ρ0 y_{t−1} + γ0 t + ϵ_t,
where ϵ_t is white noise, i.e. i.i.d. across t with mean zero and constant variance.
The model can be re-parameterized as before, for β0 = ρ0 − 1,
∆y_t = α0 + β0 (y_{t−1} − δ0 t) + ϵ_t,
where δ0 = γ0/(1 − ρ0). In practice, the unrestricted regression
∆y_t = α + βy_{t−1} + γt + ϵ_t
is run, and the Dickey-Fuller test statistic is, as before, DF_T = β̂_T/se(β̂_T); but the
distribution of this test statistic differs from the one above, because a deterministic
trend is included in the regression.
[10] Dickey, D.A. and W.A. Fuller (1979): “Distribution of the Estimators for Autoregressive Time Series with a Unit Root”, Journal of the American Statistical Association, 74, 427-431.
Consider next an AR(2) with deterministic trend, y_t = α0 + ρ01 y_{t−1} + ρ02 y_{t−2} +
γ0 t + ϵ_t. In this model, the characteristic polynomial in the lag operator is Φ(L) = 1 − ρ01 L −
ρ02 L², and stationarity requires that the roots of Φ(z) = 1 − ρ01 z − ρ02 z² = 0 lie
outside the unit circle. Conversely, the process has a unit root if the characteristic
equation permits z = 1 as a solution, i.e. if 1 − ρ01 − ρ02 = 0. In this case, a
re-parametrization suitable for testing the hypothesis of a unit root is
∆y_t = α0 + β0 (y_{t−1} − δ0 t) − ρ02 ∆y_{t−1} + ϵ_t,
where β0 = ρ01 + ρ02 − 1 is zero under the null hypothesis, and δ0 = γ0/(1 − ρ01 − ρ02).
Running this regression and testing H0: β0 = 0 yields an Augmented Dickey-Fuller
(ADF) test. Again, the Dickey-Fuller test statistic has a different distribution under the
null hypothesis, because of the presence of the lagged difference ∆y_{t−1}. Notice that,
if the AR(2) process is the true data generating process, but ∆y_{t−1} were omitted
in the Dickey-Fuller regression, then this omission would induce serial correlation
in the estimated residuals: the regression residuals in the mis-specified regression
estimate −ρ02 ∆y_{t−1} + ϵ_t, and these terms are serially correlated, because the y_t s are correlated.
All of this generalizes to AR(p) processes, with and without deterministic trend,
where p is a positive integer. The relevant re-parametrization of an AR(p), without
deterministic trend, becomes
∆y_t = α0 + β0 y_{t−1} + ∑_{s=1}^{p−1} δ0s ∆y_{t−s} + ϵ_t,
where
β0 = ρ01 + · · · + ρ0p − 1,
δ0s = −(ρ0,s+1 + · · · + ρ0p), for s = 1, · · · , p − 1.
To see this, define ρ(L) = ρ01 L + · · · + ρ0p L^p − L and δ(L) = δ01 L + · · · + δ0,p−1 L^{p−1},
and notice that
∆y_t = α0 + ρ(L)y_t + ϵ_t = α0 + (β0 L + δ(L)(1 − L)) y_t + ϵ_t,
because matching the coefficients on L, L², · · · , L^p in ρ(L) = β0 L + δ(L)(1 − L)
yields exactly the expressions for β0 and the δ0s above.
The issues discussed above remain essentially the same when contemporaneous and
lagged xt s are re-introduced. Such models are called autoregressive distributed lag
(ARDL) models. The easiest version is an ARDL(1,1), in which xt is a scalar
covariate which appears next to lagged yt (the AR(1) part) contemporaneously and
with one lag (the DL(1) part),
yt = α0 + α1 yt−1 + β0 xt + β1 xt−1 + ϵt .
The implicit assumption in this model is that the process {xt , t ≥ 0} is weakly
exogenous, i.e. the parameters of its marginal distribution are not linked with the
parameters of the conditional distribution of yt , given xt and the past.
Re-writing the model as ∆y_t = α0 + (α1 − 1)y_{t−1} + β0 x_t + β1 x_{t−1} + ϵ_t,
this balances LHS and RHS in terms of order of integration if x_t is I(0) and α1 = 1.
If x_t itself also is I(1), this is no longer sufficient,
and in order to balance LHS and RHS in terms of order of integration, either
(i) β0 + β1 = 0 and α1 = 1, or
(ii) |α1| < 1, with
y_t = α0/(1 − α1) + ((β0 + β1)/(1 − α1)) x_t + ν_t,
y⋆_t = E[y_t | x_t] = α0/(1 − α1) + ((β0 + β1)/(1 − α1)) x_t,
where ν_t is I(0). Case (i) yields a model in first differences, ∆y_t = α0 + β0 ∆x_t + ϵ_t.
In Case (ii), with both y_t and x_t being I(1) processes
(so that ∆y_t and ∆x_t are I(0)), but a particular linear combination of y_t and x_t,
y_t − α0/(1 − α1) − ((β0 + β1)/(1 − α1)) x_t,
being I(0), the two stochastic processes are said to be co-
integrated. Note that this co-integration relationship has the interpretation of a
stable long-run equilibrium relationship between y_t and x_t, i.e. it is implied by the
original model if y_t = y_{t−1} and x_t = x_{t−1}. This permits the model to be re-cast in
its error correction model (ECM) representation
∆y_t = β0 ∆x_t + (α1 − 1)(y_{t−1} − y⋆_{t−1}) + ϵ_t.
Since α1 − 1 < 0, this says that y_t adjusts downwards (upwards) if y_{t−1} is above (below)
its long-run equilibrium level y⋆_{t−1}, and that it adjusts upwards (downwards) if the
long-run equilibrium level increases (decreases).
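A sketch of the ECM estimated in one step from simulated co-integrated data; the coefficient on y_{t−1} estimates α1 − 1, and the implied long-run coefficient is recovered as shown. The design is illustrative.

    import numpy as np

    rng = np.random.default_rng(4)
    T = 500
    x = np.cumsum(rng.normal(size=T))        # x_t is I(1)
    y = 0.5 + 2.0 * x + rng.normal(size=T)   # y_t co-integrated with x_t

    # dy_t on (const, dx_t, y_{t-1}, x_{t-1}): unrestricted ECM form of the ARDL(1,1)
    dy, dx = np.diff(y), np.diff(x)
    W = np.column_stack([np.ones(T - 1), dx, y[:-1], x[:-1]])
    b = np.linalg.lstsq(W, dy, rcond=None)[0]
    speed = b[2]            # estimate of alpha_1 - 1 (negative if y error-corrects)
    longrun = -b[3] / b[2]  # implied (beta_0 + beta_1)/(1 - alpha_1)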
Special cases are of interest. A static regression with AR(1) errors,
y_t = α + βx_t + ν_t,
ν_t = ρν_{t−1} + ϵ_t,
corresponds to the ARDL(1,1) model with α1 = ρ and the common factor restriction
β1 = −α1β0 imposed. A model with unit long-run coefficient would impose the
restriction (β0 + β1)/(1 − α1) = 1. A random walk with drift requires α1 = 1
and β0 = β1 = 0.
3.4 Spurious Regression
Granger and Newbold (1974)[12] and Phillips (1986)[13] were the first to identify the
issue of spurious regressions. An example common in applied work, and used here to
illustrate the issues involved, might consider the monthly price of a good or service
provided by a firm (y_t) as a function of monthly trading volume or sales (x_t).[14] The
question of interest is whether a change in industry structure, such as for example
the merger of the firm with another firm in the same industry at time T0, translated
into latent synergies that were passed on to consumers in the form of lower prices.
Let δ_t = 1{t ≥ T0} denote a binary variable that takes on value 1 after the merger was
completed. The proposed model is
y_t = α0 δ_t + β0 x_t + u_t.
Suppose that, in fact, y_t is a random walk, independent of x_t. The last property,
independence of y_t and x_t, implies that β0 = 0; in this case, if the merger also has
no effect on prices (α0 = 0), then u_t = y_t = y_{t−1} + ϵ_t, where ϵ_t is white noise.
The OLS estimator of α0 then satisfies, by partitioned regression,
α̂_T = α0 + ((T − T0) − (∑_{t≥T0} x_t)²/∑_t x_t²)^{−1} (∑_{t≥T0} u_t − (∑_{t≥T0} x_t/∑_t x_t²) ∑_t x_t u_t).
[12] Granger, C.W.J. and P. Newbold (1974): “Spurious Regressions in Econometrics”, Journal of Econometrics, 2, 111-120.
[13] Phillips, P.C.B. (1986): “Understanding Spurious Regressions in Econometrics”, Journal of Econometrics, 33, 311-340.
[14] The additional issue of endogeneity of x_t is ignored in the discussion of this section.
The individual components of this expression can be expected to have the following
asymptotic properties: with probability one,
α̂_T = α0 + (Op(T) − Op(T²)/Op(T²))^{−1} (Op(T) − Op(T²)/Op(T²)) = α0 + Op(1),
i.e. lim_{T→∞} Pr(|α̂_T − α0| > ϵ) > 0 for any ϵ > 0. In other words, if α0 = 0, then a
conventional t-test will erroneously reject this hypothesis with positive probability.
There are two features to note about this. First, non-stationarity of a regressor
(x_t) can spill over, in the sense of having an impact on statistical properties of
coefficient estimates of other regressors, not just on its own coefficient. Second,
if Case (ii) in the preceding section were true, i.e. y_t and x_t were co-integrated,
then √T-consistency would be preserved; in this case, a linear combination of I(1)
variables is stationary (I(0)), and this renders the regression residuals I(0). This also
suggests one (single equation based) test for co-integration: first, the individual
variables are tested for unit roots; second, if unit roots are not rejected, a linear
regression model of one variable onto the others is estimated, and the estimated
regression residuals are tested for a unit root, using an ADF test (again with different
critical values). This is the original Engle-Granger procedure. It suffers from
inherent problems, however: the assignment of the variables to LHS and RHS
is arbitrary, and it implicitly assumes weak exogeneity of the RHS variables. The
conclusion from this is that all variables should be treated equally and symmetrically,
in some sense, i.e. in a system based, multivariate, rather than a single equation based,
univariate approach.
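A sketch of the Engle-Granger two-step procedure; the second-step statistic must be compared with Engle-Granger (not Dickey-Fuller) critical values, which the sketch does not provide.

    import numpy as np

    def engle_granger_stat(y, x):
        # Step 1: levels regression of y on (1, x)
        W = np.column_stack([np.ones(len(x)), x])
        b = np.linalg.lstsq(W, y, rcond=None)[0]
        u = y - W @ b
        # Step 2: DF-type regression on the residuals, du_t = beta u_{t-1} + e_t
        du, ulag = np.diff(u), u[:-1]
        beta = (ulag @ du) / (ulag @ ulag)
        e = du - beta * ulag
        se = np.sqrt((e @ e / (len(du) - 1)) / (ulag @ ulag))
        return beta / se    # very negative values reject a unit root in u_t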
Consider the m-variate VAR(p) model
y_t = A0 + A1 y_{t−1} + · · · + A_p y_{t−p} + ϵ_t,
with lag polynomial
A(L) = A1 L + · · · + A_p L^p.
Each equation of the VAR can be estimated by OLS, and the innovation variance-
covariance matrix Σ can be estimated from the regression residuals {ϵ̂_t, t = p + 1, · · · , T}
in the usual way, i.e. the (i, j) element is
Σ̂_ij = (1/(T − p)) ∑_{t=p+1}^T ϵ̂_it ϵ̂_jt, for i, j = 1, · · · , m.
Provided the VAR is stationary, it has the MA(∞) representation
y_t = (I − A(L))^{−1} (A0 + ϵ_t) = (I − A(1))^{−1} A0 + ∑_{i=0}^∞ ψ_i ϵ_{t−i},
where the convention is adopted that ψ0 = I. The leading constant follows from
E[y_t] = A0 + ∑_{i=1}^p A_i E[y_t] = A0 + A(1)E[y_t], so that E[y_t] = (I − A(1))^{−1} A0.
The MA coefficient matrices follow from
(I − A1 L − · · · − A_p L^p)^{−1} = I + ψ1 L + ψ2 L² + · · · ,
which is equivalent to
(I − A1 L − · · · − A_p L^p)(I + ψ1 L + ψ2 L² + · · · ) = I.
[16] Granger, C.W.J. (1969): “Investigating Causal Relations by Econometric Models and Cross-Spectral Methods”, Econometrica, 37(3), 424-438; also, Sims, C.A. (1972): “Money, Income and Causality”, American Economic Review, 62(4), 540-552.
[17] Hamilton, J.D. (1994): Time Series Analysis, Princeton: Princeton University Press.
Hence, matching coefficients on L, L², · · · ,
−A1 + ψ1 = 0 ⇒ ψ1 = A1,
−A2 + ψ2 − A1 ψ1 = 0 ⇒ ψ2 = A1 ψ1 + A2 = A1² + A2,
and, in general, ψ_s = A1 ψ_{s−1} + A2 ψ_{s−2} + · · · + A_p ψ_{s−p}, s = 1, 2, · · · ,
with ψ0 = I and ψ_s = 0 for s < 0.
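A sketch of this recursion for the MA coefficient matrices:

    import numpy as np

    def ma_coefficients(A_list, s_max):
        # A_list = [A_1, ..., A_p]; psi_0 = I, psi_s = 0 for s < 0
        m, p = A_list[0].shape[0], len(A_list)
        psi = [np.eye(m)]
        for s in range(1, s_max + 1):
            acc = np.zeros((m, m))
            for i, Ai in enumerate(A_list, start=1):
                if s - i >= 0:
                    acc += Ai @ psi[s - i]   # psi_s = A_1 psi_{s-1} + ... + A_p psi_{s-p}
            psi.append(acc)
        return psi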
In the context of modelling multivariate series and estimation of such models, essen-
tially the same issues arise as in the univariate setting, as discussed above. Hence,
in a multivariate context, error correction representations of VAR(p)s, called Vector
ECMs (VECMs), are useful for the same reasons given before.
The VAR(p) can be re-written as
y_t = A0 + Φ y_{t−1} + ∑_{i=1}^{p−1} Γ_i ∆y_{t−i} + ϵ_t,
which is equivalent to
[(I − ΦL) − (∑_{i=1}^{p−1} Γ_i L^i)(I − L)] y_t = A0 + ϵ_t,
where
Φ = A(1) = A1 + · · · + A_p,
Γ_i = −[A_{i+1} + · · · + A_p], i = 1, 2, · · · , p − 1.
To see this, note that
(I − ΦL) − (∑_{i=1}^{p−1} Γ_i L^i)(I − L)
= I − ΦL − Γ1 L + Γ1 L² − Γ2 L² + Γ2 L³ − · · · − Γ_{p−1} L^{p−1} + Γ_{p−1} L^p
= I − (Φ + Γ1) L − (Γ2 − Γ1) L² − · · · − (Γ_{p−1} − Γ_{p−2}) L^{p−1} + Γ_{p−1} L^p
= I − A1 L − · · · − A_p L^p.
This is referred to as the Sims, Stock and Watson (1990) canonical representa-
tion, originally due to Fuller (1976).[18] Notice that this is yet again simply a re-
parametrization, and there exists a one-to-one mapping between the coefficient ma-
trices of the VAR and the VECM, sketched below. The VECM can be estimated by OLS, and the VAR
coefficients can be determined via the above formulae.
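A sketch of the one-to-one mapping between the VAR and VECM coefficient matrices:

    import numpy as np

    def var_to_vecm(A_list):
        # Phi = A_1 + ... + A_p;  Gamma_i = -(A_{i+1} + ... + A_p)
        p = len(A_list)
        Phi = sum(A_list)
        Gammas = [-sum(A_list[i:]) for i in range(1, p)]
        return Phi, Gammas

    def vecm_to_var(Phi, Gammas):
        # A_1 = Phi + Gamma_1;  A_i = Gamma_i - Gamma_{i-1};  A_p = -Gamma_{p-1}
        if not Gammas:
            return [Phi]
        A = [Phi + Gammas[0]]
        for i in range(1, len(Gammas)):
            A.append(Gammas[i] - Gammas[i - 1])
        A.append(-Gammas[-1])
        return A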
To see why the rank of Π = A(1) − I is informative, consider the VECM for p = 1
and suppose that rk(Π) < m, so that there exists a vector α ≠ 0 with Π′α = 0. Then
∆y_t = A0 + Πy_{t−1} + ϵ_t
⇒ α′∆y_t = α′A0 + α′Πy_{t−1} + α′ϵ_t
⇒ α′∆y_t = α′A0 + α′ϵ_t,
i.e. α′y_t is I(1), a contradiction to the hypothesis that y_t is I(0). Hence, full rank
of Π is equivalent to all components of y_t being covariance stationary.
[18] Sims, C.A., Stock, J.H. and M.W. Watson (1990): “Inference in Linear Time Series Models with Some Unit Roots”, Econometrica, 58(1), 113-144; Fuller, W.A. (1976): Introduction to Statistical Time Series, New York: Wiley.
Noting that each equation in a VECM looks just like a univariate ARDL model
in which xt represents another component of the vector yt , one might expect the
matrix Π to be informative about co-integrating relationships as well, because Πyt−1
is just a collection of m linear combinations of the elements of yt−1 . In order to then
balance the order of integration of the LHS and RHS, it must be the case that Π,
in a sense that will be made precise below, contains all coefficients of co-integrating
relationships among the elements of yt , i.e. all co-integrating vectors that induce
linear combinations of the elements of yt which are I(0). It follows from the preceding
two paragraphs that the case of co-integration among the component series of yt
corresponds to 0 < rk(Π) = r < m. In this case, it is said that there exist r distinct
co-integrating relationships between the m elements of yt , each corresponding to a
co-integrating vector βj so that βj′ yt is I(0), j = 1, · · · , r. In terms of the solutions to
the determinantal equation, the case of r co-integration relationships between the m
elements of yt is equivalent to m − r solutions (out of mp solutions of |I − A(z)| = 0)
that lie on the unit circle, with a real part equal to unity, while all other solutions
lie outside the unit circle and correspond to the co-integrating relationships and
higher-order dynamics.
Granger Representation Theorem (in part). Suppose the m components of the VAR(p)
y_t = A(L)y_t + ϵ_t = ∑_{i=1}^p A_i y_{t−i} + ϵ_t
are I(1), with r distinct co-integrating relationships collected in the m × r matrix β
of co-integrating vectors, so that z_t = β′y_t is I(0). Then, among other results:
(3) Π = A(1) − I has rk(Π) = r, and there exists an m × r matrix α, such that
Π = αβ′;
(4) there exists a VECM: ∆y_t = αz_{t−1} + ∑_{i=1}^{p−1} γ_i ∆y_{t−i} + ϵ_t.
The last assertion of part (2) is not critical for the understanding of the further
development; its proof is given in an appendix.
If some of the series in the VAR are subject to a deterministic time trend - which,
if present, in the case of economic series is typically linear - then it can be included
in the co-integrating relationship, in analogy to Section 3.2.4 above.[19] Formally, in
terms of the formalism of the preceding Theorem, if the original VAR(p) is of the
form
y_t = A(L)y_t + γt + ϵ_t = ∑_{i=1}^p A_i y_{t−i} + γt + ϵ_t,
then a trend coefficient, δ say, enters the co-integrating relationship analogously.
It is important to note that α and β are not uniquely determined, since for any
non-singular r × r matrix Q, Π = αβ ′ = αQQ−1 β ′ = α̃β̃ ′ , where α̃ = αQ and
β̃ = β(Q−1 )′ . The same argument applies to δ. The appropriate choice of Q is
usually guided by economic theory and equivalent to imposing r2 restrictions on the
elements of Q.
[19] If it were included without being restricted to be part of the co-integrating relationship, then this might imply a quadratic trend in the respective original series.
4.3 Johansen Co-integration Tests
The Johansen procedure tests hypotheses about the rank of Π. The leading cases
are: Case (I), H0: rk(Π) = 0 against HA: rk(Π) = m; Case (II), H0: rk(Π) = 0
against HA: rk(Π) = 1; Case (III), H0: rk(Π) = r against HA: rk(Π) > r; and
Case (IV), H0: rk(Π) = r against HA: rk(Π) = r + 1. Cases (I) and (II) are
considered here in turn. As in the case of testing for unit roots in univariate
stochastic processes, there are further test variants when deterministic trends are
included in the model.
Consider the model ∆y_t = Πy_{t−1} + v_t; here, the intercept vector and the lagged
differences ∆y_{t−s}, s = 1, · · · , p − 1, are omitted, as they are irrelevant to the under-
standing of the underlying principles of the test procedure. Stack up the T systems
∆y_t′ = y_{t−1}′Π′ + v_t′ to form
∆Y = Y_{−1}Π′ + v,
where ∆Y and Y_{−1} are T × m matrices. If the v_t are serially independent and
jointly normally distributed with
contemporaneous variance-covariance matrix Ω, then the
joint probability density of this model, or the likelihood function of the parameters
Π and Ω, given the data, is
∏_{t=1}^T f(v_t; Π, Ω) ∝ |Ω|^{−T/2} exp(−(1/2) ∑_t v_t′Ω^{−1}v_t)
= |Ω|^{−T/2} exp(−(1/2) tr(∑_t v_t′Ω^{−1}v_t))
= |Ω|^{−T/2} exp(−(1/2) ∑_t tr(v_t′Ω^{−1}v_t))
= |Ω|^{−T/2} exp(−(1/2) ∑_t tr(Ω^{−1}v_tv_t′))
= |Ω|^{−T/2} exp(−(1/2) tr(Ω^{−1} ∑_t v_tv_t′))
= |Ω|^{−T/2} exp(−(1/2) tr(Ω^{−1}V′V)),
where V = ∆Y − Y_{−1}Π′.[22] Given Π, Ω can be concentrated out in the usual way,
i.e. by choosing Ω = (1/T)V′V.[23] Then, the concentrated likelihood function is
∏_{t=1}^T f(v_t; Π) ∝ |(1/T)V′V|^{−T/2} exp(−(1/2) tr(((1/T)V′V)^{−1}V′V))
∝ |(1/T)V′V|^{−T/2}   (since the trace term equals the constant Tm)
= |(1/T)(∆Y − Y_{−1}Π′)′(∆Y − Y_{−1}Π′)|^{−T/2} → max_Π
⇔ min_Π (T/2) ln |(∆Y − Y_{−1}Π′)′(∆Y − Y_{−1}Π′)|.
Imposing the null hypothesis of Case (I), i.e. the m² restrictions Π = 0, yields
(T/2) ln(|∆Y′∆Y|), which is proportional to the log-likelihood function under the null
hypothesis.
[22] Strictly speaking, the preceding expression is the conditional density of Y, given y0.
[23] Appendix B.3 is a brief review of concentrating out parameters from a likelihood function.
Under the alternative hypothesis, the unrestricted estimator of Π is the OLS es-
timator (on each equation), and the log-likelihood function, evaluated at the estima-
tor, is proportional to the logarithm of the residual sum of squares, i.e. proportional
to (T/2) ln(|∆Y′M_{Y−1}∆Y|), where the T × T matrix M_{Y−1} = I − Y_{−1}(Y_{−1}′Y_{−1})^{−1}Y_{−1}′
is the orthogonal projector onto the space orthogonal to the column space of Y_{−1}.
The likelihood ratio test statistic for Case (I) is then, as usual, twice the difference
between the log-likelihoods of the unrestricted and restricted models, i.e.
LR_T = −T ln(|∆Y′M_{Y−1}∆Y| / |∆Y′∆Y|) ∼ χ²_{m²},
and the null hypothesis is rejected when this statistic exceeds the critical value of a
χ²_{m²} distribution for the appropriate size of the test.
Equivalently, using linear algebra results provided in Appendix B.1.1 and B.1.2,
LR_T = −T ln |Q| = −T ∑_{i=1}^m ln(µ_i),
where {µ_i, i = 1, · · · , m} are the characteristic roots (eigen-
values) of the matrix
Q = I − (∆Y′∆Y)^{−1/2} ∆Y′Y_{−1}(Y_{−1}′Y_{−1})^{−1}Y_{−1}′∆Y (∆Y′∆Y)^{−1/2}.
In Case (II), since the null hypothesis is the same as in Case (I), the denominator
of the test statistic (the likelihood under the null hypothesis) remains
the same as before. The numerator is proportional to the logarithm of the sum of
squared residuals when the restriction Π′ = αβ′ is imposed, where α, β ∈ R^m, and
the model is
∆Y = Y_{−1}Π′ + v = Y_{−1}αβ′ + v.
Let z = Y_{−1}α, which is stationary under the alternative hypothesis, with co-
efficient vector β; i.e. α is the single co-integrating vector under the alternative
hypothesis. Cast in this form, the model under the alternative hypothesis amounts
to m LHS variables collected in ∆y_t and a single RHS variable z_t, which enters each
equation with an individual coefficient β_i, i = 1, · · · , m:
∆y_{i,t} = β_i z_t + v_{i,t}, i = 1, · · · , m; t = 1, · · · , T.
Concentrating the β_i out of the likelihood leaves a criterion which is a ratio of
quadratic forms in α, i.e. of the form (T/2) ln(α′Aα/α′Bα), where A = Y_{−1}′M_{∆Y}Y_{−1}
and B = Y_{−1}′Y_{−1}, which is p.d.s. The FOCs of this minimization problem yield
0 = (α̂′Bα̂)^{−2} (α̂′Bα̂ · 2Aα̂ − α̂′Aα̂ · 2Bα̂)
⇒ 0 = (A − (α̂′Aα̂/α̂′Bα̂) B) α̂ = (A − r̂B) α̂
⇔ 0 = (B^{−1/2}AB^{−1/2} − r̂I) γ̂, with γ̂ = B^{1/2}α̂.
The solutions are m pairs of eigenvalues r̂ and associated eigenvectors γ̂ (equivalently,
α̂). Minimization with respect to α̂ leads to choosing the smallest eigenvalue, r̂_min.
Hence, the log-likelihood under the alternative hypothesis is proportional to
(T/2) ln(|∆Y′∆Y| r̂_min), so that the Johansen likelihood ratio test statistic for Case (II) is
LR_T = −T ln(r̂_min).
The eigenvalues r̂_i are related to the matrix
(∆Y′∆Y)^{−1/2} ∆Y′M_{Y−1}∆Y (∆Y′∆Y)^{−1/2},
and satisfy r̂_i = 1 − µ_i, i = 1, · · · , m (see Appendix B.1.4), where the µ_i are the
eigenvalues of the matrix Q encountered in Case (I). Hence, the Johansen likelihood
ratio test statistic can also be expressed as
LR_T = −T ln(1 − µ_max).
Using the same principles as in the preceding two subsections, the Johansen likeli-
hood ratio test statistics for the remaining two test cases can be deduced. For Case
(III), H0: rk(Π) = r against HA: rk(Π) > r, the test statistic is
LR_T = −T ∑_{i=r+1}^m ln(µ_(i)),
where µ_(1) < · · · < µ_(m) are the ordered eigenvalues of the matrix Q obtained in
Case (I). Similarly, for Case (IV), H0: rk(Π) = r against HA: rk(Π) = r + 1,
LR_T = −T ln(1 − µ_(m−r)) = −T ln(r̂_(r+1)),
where r̂_(1) < · · · < r̂_(m) are the ordered eigenvalues of I − Q. The critical values
depend on m and r and are provided in tables or by statistical software.
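A sketch of the eigenvalue computation behind the Johansen statistics, for the stripped-down model ∆y_t = Πy_{t−1} + v_t used above (no constant, no lagged differences); critical values must come from tables or software (e.g. statsmodels' coint_johansen).

    import numpy as np

    def johansen_mu(Y):
        # Y: T x m array of levels
        dY, Ylag = np.diff(Y, axis=0), Y[:-1]
        S00, S01, S11 = dY.T @ dY, dY.T @ Ylag, Ylag.T @ Ylag
        # eigenvalues of S00^{-1} S01 S11^{-1} S01' equal those of I - Q; mu_i = 1 - eig
        M = np.linalg.solve(S00, S01 @ np.linalg.solve(S11, S01.T))
        mu = np.sort(1 - np.real(np.linalg.eigvals(M)))
        return mu     # e.g. Case (I): LR_T = -T * np.sum(np.log(mu))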
5 Supplement: Time Series Models of Heteroskedasticity
Up to this point, it was assumed that the stochastic processes being modelled are
propelled by innovations that have constant variances and covariances over time.
This assumption impedes the analysis of potential volatility in the series, i.e. chang-
ing or heteroskedastic variances (and covariances) over time. Time series models
of heteroskedasticity have important applications as a useful tool to capture the
volatility of a stochastic process, notably in empirical finance. Recent experience in
financial markets shows that - beyond the theory of efficient financial markets which
predicts no autocorrelation in asset returns - squared returns vary widely and, to
some extent, predictably depend on the past. This suggests that conditional vari-
ances may follow a time series process as well, and sometimes this process may be
characterized by a distribution with thick tails.
For the purpose of illustration, consider the univariate stationary AR(p) process
y_t = c + ∑_{i=1}^p φ_i y_{t−i} + u_t, where u_t is assumed to be white noise, i.e. u_t is i.i.d. with
E[u_t] = 0 and E[u_tu_s] = σ² 1{t=s}, σ² > 0. The white noise assumption implies that
the process' unconditional variance is constant. This does not preclude that the
conditional variance may vary over time. One way to model this is as a stationary
AR(m) for {u_t², t = 1, · · · }:
u_t² = ξ + ∑_{j=1}^m α_j u_{t−j}² + ω_t,
where ω_t is white noise, i.e. ω_t is i.i.d. with E[ω_t] = 0 and E[ω_tω_s] = λ² 1{t=s},
λ² > 0, for all t. Since E[u_t | u_{t−s}, s = 1, 2, · · · ] = 0, this implies for the conditional
variance of u_t, given the past,
E[u_t² | u_{t−s}², s = 1, 2, · · · ] = ξ + ∑_{j=1}^m α_j u_{t−j}².
This is the autoregressive conditionally heteroskedastic (ARCH) model of order m
(Engle (1982)[24]). For u_t² to be stationary, it is required that ξ > 0, α_j ≥ 0, and
that the roots of 1 − α(z) = 0, with α(L) = ∑_{j=1}^m α_j L^j, lie outside
the unit circle. Provided these conditions hold, the unconditional variance of u_t can
be expressed in terms of the ARCH model parameters as
σ² = ξ/(1 − α(1)).
Further restrictions are required if the model is designed to limit thick tails,
i.e. to control higher-order moments. To see this, consider the alternative represen-
tation of the innovations u_t = √h_t v_t, h_t = ξ + ∑_{j=1}^m α_j u_{t−j}², so that v_t = u_t/√h_t has
the interpretation of the standardized innovation of the primary process y_t, satisfying
E[v_t] = 0 and var(v_t) = 1.
The thickness of the tails of the distribution of v_t is governed by its fourth moment,
E[(v_t² − 1)²]. Since u_t² = h_tv_t² = h_t + ω_t, it follows that ω_t = h_t(v_t² − 1), so that
E[ω_t²] = λ² = E[h_t²(v_t² − 1)²] = E[h_t²]E[(v_t² − 1)²], because v_t is independent of h_t.
Consider, for simplicity, the case of an ARCH(1) model, for which h_t = ξ + α1 u_{t−1}². Then, the
for simplicity, the case of an ARCH(1) model, for which ht = ξ + α1 u2t−1 . Then, the
24
Engle, R.F. (1982): “Autoregressive Conditional Heteroscedasticity with Estimates of the
Variance of United Kingdom Inflation”, Econometrica, 50(4), 987-1009.
44
unconditional expectation of h_t² is
E[h_t²] = E[(ξ + α1 u_{t−1}²)²]
= ξ² + α1² E[u_{t−1}⁴] + 2α1ξ E[u_{t−1}²]
= ξ² + α1² (var(u_{t−1}²) + (E[u_{t−1}²])²) + 2α1ξ · ξ/(1 − α1)
= ξ² + α1² (λ²/(1 − α1²) + ξ²/(1 − α1)²) + 2α1ξ²/(1 − α1)
= ξ²/(1 − α1)² + α1²λ²/(1 − α1²).
Therefore,
E[(v_t² − 1)²] = λ² / (ξ²/(1 − α1)² + α1²λ²/(1 − α1²)).
If v_t is standard normal, then E[(v_t² − 1)²] = 2, and solving for λ² yields
λ²(1 − 3α1²)/(1 − α1²) = 2ξ²/(1 − α1)².
The right-hand side is positive. Therefore, for the left-hand side to be positive, it is
required that α1 < 1/√3.
Empirically, for financial time series, such restrictions on the tails of their distri-
butions are typically rejected. Researchers, therefore, often maintain distributional
assumptions that allow for thicker tails, e.g. t-distribution instead of normality.
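A simulation sketch of an ARCH(1) process with standard normal standardized innovations, illustrating the unconditional variance formula and the excess kurtosis discussed above; parameter values are illustrative.

    import numpy as np

    rng = np.random.default_rng(5)
    T, xi, a1 = 100_000, 0.5, 0.4       # a1 < 1/sqrt(3): finite fourth moment
    u, h = np.zeros(T), np.zeros(T)
    h[0] = xi / (1 - a1)                # start at the unconditional variance
    for t in range(1, T):
        h[t] = xi + a1 * u[t - 1] ** 2  # ARCH(1) conditional variance
        u[t] = np.sqrt(h[t]) * rng.normal()

    print(u.var(), xi / (1 - a1))       # sample vs. theoretical unconditional variance
    print(np.mean(u**4) / np.mean(u**2) ** 2)   # kurtosis > 3: thicker tails than normal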
Let Y_t = (y_t, y_{t−1}, · · · ). Then, the conditional density of y_t, given the past, is
f(y_t | Y_{t−1}; θ) = (2πh_t)^{−1/2} exp(−((1 − φ(L))y_t − c)² / (2h_t)).
5.3 Extensions
Consider the model h_t = ξ + π(L)u_t², where π(L) = ∑_{j=1}^∞ π_j L^j is an infinite polyno-
mial in the lag operator L and the u_t are white noise, as above. Parameterize π(L)
as the ratio of two finite order polynomials in L:
π(L) = α(L)/(1 − δ(L)),
where
α(L) = ∑_{j=1}^m α_j L^j,
δ(L) = ∑_{k=1}^r δ_k L^k,
and where it is assumed that 1 − δ(z) = 0 has all roots outside the unit circle.
This yields
h_t = ξ + (α(L)/(1 − δ(L))) u_t²,
from which it follows that
(1 − δ(L)) h_t = (1 − δ(1)) ξ + α(L) u_t²,
which is equivalent to (absorbing the constant (1 − δ(1))ξ into ξ)
h_t = ξ + ∑_{i=1}^r δ_i h_{t−i} + ∑_{j=1}^m α_j u_{t−j}²,
a generalized ARCH, GARCH(r, m), model.
Then,
h_t + u_t² = ξ − δ1(u_{t−1}² − h_{t−1}) − · · · − δ_r(u_{t−r}² − h_{t−r}) + ∑_{i=1}^r δ_i u_{t−i}² + ∑_{j=1}^m α_j u_{t−j}² + u_t².
Defining the martingale difference sequence ω_t = u_t² − h_t, which satisfies E[ω_t | past] =
0, and p = max{r, m} (with δ_s = 0 for s > r and α_s = 0 for s > m), this model is
equivalent to
u_t² = ξ + ∑_{s=1}^p (δ_s + α_s) u_{t−s}² + ω_t − ∑_{k=1}^r δ_k ω_{t−k},
i.e. u_t² follows an ARMA(p, r) process.
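A simulation sketch of a GARCH(1,1) process, illustrating the implied autocorrelation in the squared innovations (their ARMA structure); parameter values are illustrative.

    import numpy as np

    rng = np.random.default_rng(6)
    T, xi, a1, d1 = 100_000, 0.1, 0.1, 0.85   # a1 + d1 < 1
    u, h = np.zeros(T), np.zeros(T)
    h[0] = xi / (1 - a1 - d1)
    u[0] = np.sqrt(h[0]) * rng.normal()
    for t in range(1, T):
        h[t] = xi + d1 * h[t - 1] + a1 * u[t - 1] ** 2   # GARCH(1,1) recursion
        u[t] = np.sqrt(h[t]) * rng.normal()

    u2 = u ** 2   # u2_t = xi + (d1 + a1) u2_{t-1} + w_t - d1 w_{t-1}
    print(np.corrcoef(u2[1:], u2[:-1])[0, 1])   # positive first-order autocorrelation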
A Granger Representation Theorem, part (2)

B.1 Linear Algebra Results
The following results are useful for the development of the Johansen tests for
the number of co-integrating vectors.
1.1 The eigenvalues λ_i of an n × n matrix A and the associated eigenvectors a_i,
i = 1, . . . , n, solve
det(A − λ_i I_n) = 0,
(A − λ_i I_n) a_i = 0.
1.2 The collection of eigenvalues of A, λ(A) = {λ_i, i = 1, . . . , n}, is called
the spectrum of A and satisfies
|A| = ∏_{i=1}^n λ_i.
1.3 Let Ã and B̃ be n × n matrices, and consider (1) the eigenvalues λ_i of Ã′B̃B̃′Ã
and (2) the eigenvalues µ_i of B̃′ÃÃ′B̃. (1) is equivalent
to |λ_i I_n − Ã′B̃B̃′Ã| = 0, while (2) is equivalent to |µ_i I_n − B̃′ÃÃ′B̃| = 0.
Letting C = Ã′B̃, (1) is equivalent to |λ_i I_n − CC′| = 0, while (2) is
|µ_i I_n − C′C| = 0. Denoting the corresponding characteristic vectors by
x_i and z_i,
C′C x_i = µ_i x_i,
CC′ z_i = λ_i z_i,
implying CC′ (C x_i) = µ_i (C x_i), i.e. C x_i is a characteristic vector of CC′,
so that µ_i = λ_i.
1.4 Let λ_i, i = 1, . . . , n, be the eigenvalues of A. Then, γ_i = 1 − λ_i, i =
1, . . . , n, are the eigenvalues of I_n − A.
Proof: This follows immediately from the definition of λ_i, since
det((I_n − A) − γ_i I_n) = det((1 − γ_i) I_n − A) = 0 exactly when 1 − γ_i is an
eigenvalue of A.
2. Let W = [U, V], where the matrices U and V have dimensions n × a and
n × b, respectively. Let M_U = I_n − U(U′U)^{−1}U′, and analogously for M_V.
Then,
W′W = [ U′U   U′V
        V′U   V′V ].
For the case a = b = 1, U and V are column vectors, hence their inner
products are scalars, and so it can readily be verified that
|W′W| = (U′U)(V′M_U V) = (V′V)(U′M_V U).
3. Concentrating out parameters. Consider the Gaussian linear regression model
with log-likelihood (up to constants)
L(β, σ²; y, X) = −(N/2) ln(σ²) − (1/(2σ²)) ∑_{n=1}^N (y_n − x_n′β)².
Note that the order of maximization is immaterial. For any value of β, max-
imization with respect to σ² yields the solution σ²(β) = (1/N) ∑_{n=1}^N (y_n − x_n′β)².
Hence,
max_{β,σ²} L(β, σ²; y, X) ⇔ max_β L(β, σ²(β); y, X)
⇔ max_β −(N/2) ln(σ²(β)) − N/2
⇔ max_β −(N/2) ln(u(β)′u(β)), where u(β) = y − Xβ.
It is straightforward to check that this results in the well-known MLE for β0 ,
β̂, equivalent to the OLS estimator, and in the MLE for σ02 , σ̂ 2 = σ 2 (β̂).
Finally,
var(Z′(y − Xβ̂_2SLS) | X, Z) = σ² Z′(I − X(X′P_ZX)^{−1}X′)Z
= σ² (Z′Z − Z′X(X′P_ZX)^{−1}X′Z).