Economics 508, Lecture 10
Introduction to Simultaneous Equation Econometric Models
where ûi , i = 1, . . . , m are the n-vectors of residuals from any initial (consistent) estimate
of the model, typically from an OLS fit to the individual equations.
An important observation is that there is no efficiency gain from the reweighting by $(\Omega \otimes I)^{-1}$ if $X = (I \otimes X_0)$. That is, if $X_i = X_0$ for all $i$, as would be the case in some demand system contexts, we gain nothing from doing the system estimate over what is accomplished by equation-by-equation OLS. To see this, write
$$(\Omega^{-1} \otimes I)(I \otimes X_0) = \Omega^{-1} \otimes X_0.$$
In the weighted case we are solving the equations
$$X^\top (\Omega \otimes I)^{-1} \hat u = 0,$$
but if $X = (I \otimes X_0)$ this is equivalent to
$$(\Omega^{-1} \otimes X_0^\top)\hat u = 0,$$
which is satisfied by assuring that
$$X_0^\top \hat u_i = 0, \qquad i = 1, \dots, m,$$
and these are just the normal equations for the separate OLS regressions.
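As a check on this algebra, here is a minimal R sketch comparing equation-by-equation OLS with the system GLS estimator when every equation shares the design matrix $X_0$; the dimensions, covariance matrix, and variable names are illustrative assumptions, not part of the notes.

```r
## SUR/GLS with a common design matrix X0 reproduces equation-by-equation OLS.
## Simulated illustration; all values are arbitrary.
set.seed(1)
n <- 100; m <- 3; k <- 2
X0 <- cbind(1, rnorm(n))                          # common n x k design matrix
Omega <- crossprod(matrix(rnorm(m * m), m, m))    # m x m error covariance
U <- matrix(rnorm(n * m), n, m) %*% chol(Omega)   # errors with covariance Omega
B <- matrix(rnorm(k * m), k, m)                   # k coefficients per equation
Y <- X0 %*% B + U

# Equation-by-equation OLS
b_ols <- solve(crossprod(X0), crossprod(X0, Y))

# System GLS: X = I_m (x) X0, weight (Omega (x) I_n)^{-1} = Omega^{-1} (x) I_n
X <- diag(m) %x% X0
W <- solve(Omega) %x% diag(n)
y <- c(Y)                                         # stack y equation by equation
b_gls <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)

max(abs(c(b_ols) - b_gls))                        # effectively zero: the estimates coincide
```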
A useful introduction to maximum likelihood estimation of systems of equations may
be provided by the SUR model. For this purpose it is convenient to stack the observations
in “the opposite way,” that is, to write
$$y_j = X_j \beta + u_j, \qquad j = 1, \dots, n,$$
where
$$X_j = \begin{pmatrix} x_{j1} & 0 & \cdots & 0 \\ 0 & x_{j2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & x_{jm} \end{pmatrix}$$
and $x_{ji}$ is a $p_i$-dimensional row vector. Now, stacking the model, we have
$$y = X\beta + u,$$
and now $u \sim N(0, I \otimes \Omega)$. Note that, with this formulation,
$$\hat\beta = (X^\top (I \otimes \Omega^{-1})X)^{-1} X^\top (I \otimes \Omega^{-1})y = \Big(\sum_{j=1}^n X_j^\top \Omega^{-1} X_j\Big)^{-1} \sum_{j=1}^n X_j^\top \Omega^{-1} y_j,$$
where implicitly we recognize that the $u_j$'s are functions of the $\beta$ vector. As usual it is more convenient to work with the log likelihood,
$$\ell(\beta, \Omega) = K - \frac{n}{2}\log|\Omega| - \frac{1}{2}\sum_{j=1}^n u_j^\top \Omega^{-1} u_j.$$
We have already seen how to estimate β in this model. We now consider two variants on
estimation of Ω.
Case 1. Suppose that Ω is known up to a scalar, i.e., Ω = ωΩ0 with the matrix Ω0
known.
Recall that $|\omega\Omega_0| = \omega^m|\Omega_0|$, so
$$\ell(\beta, \omega) = K - \frac{n}{2}\,(m\log\omega + \log|\Omega_0|) - \frac{1}{2\omega}\sum_{j=1}^n u_j^\top \Omega_0^{-1} u_j,$$
so
$$\frac{\partial \ell}{\partial \omega} = -\frac{nm}{2\omega} + \frac{1}{2\omega^2}\sum_{j=1}^n u_j^\top \Omega_0^{-1} u_j = 0$$
implies
$$\hat\omega = (mn)^{-1}\sum_{j=1}^n u_j^\top \Omega_0^{-1} u_j.$$
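As a quick sanity check, the closed form for $\hat\omega$ can be compared with a direct numerical maximization of the log likelihood in $\omega$. This is only a sketch: the dimensions and the choice of $\Omega_0$ below are made up for illustration.

```r
## Check omega.hat = (mn)^{-1} sum_j u_j' Omega0^{-1} u_j against numerical
## maximization of the log-likelihood in omega, with Omega0 treated as known.
set.seed(2)
n <- 200; m <- 3
Omega0 <- diag(m) + 0.5                     # arbitrary known matrix (up to scale)
omega_true <- 2
U <- matrix(rnorm(n * m), n, m) %*% chol(omega_true * Omega0)

Q <- sum((U %*% solve(Omega0)) * U)         # sum_j u_j' Omega0^{-1} u_j
omega_hat <- Q / (m * n)

loglik <- function(w) -n/2 * (m * log(w) + log(det(Omega0))) - Q / (2 * w)
c(omega_hat, optimize(loglik, c(0.01, 10), maximum = TRUE)$maximum)  # agree
```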
There has been considerable discussion about various schemes to orthogonalize the errors, but
these “solutions” introduce new problems having to do with the nonuniqueness of the
orthogonalization.
There are two common procedures for testing for cointegration, one introduced by Engle
and Granger (1987), the other by Johansen (1988). We will describe both very succinctly.
Consider the problem posed in PS3 of testing for the cointegration of two series $\{x_t, y_t\}$. If we knew the coefficients of the cointegrating relationship, i.e., if we hypothesized, for example, that
$$z_t = y_t - \alpha - \beta x_t$$
was stationary, then the situation would be relatively simple: we would simply apply the Dickey-Fuller or the augmented Dickey-Fuller test to the new variable $z_t$. If we reject the null hypothesis that the series $z_t$ has a unit root, having already concluded that the unit root hypotheses for $x_t$ and $y_t$ themselves cannot be rejected, then we may conclude that there is evidence for the cointegration of $\{x_t, y_t\}$.
It may seem very implausible that we might “know” $\alpha$ and $\beta$; however, in some examples this isn't so strange. Often theory might suggest that $(\alpha, \beta) = (0, 1)$ is reasonable. But what should we do if we don't have any a priori notion about $(\alpha, \beta)$?
Fortunately, the answer to this question, at least from a pragmatic point of view, is quite simple. As a first step we estimate the parameters by the usual least squares procedure, and then proceed as before using $\hat z_t = y_t - \hat\alpha - \hat\beta x_t$. The only difference is that some adjustment of the original DF critical values is necessary. For the problem set, these new critical values are provided in Table B.9 of Hamilton (1994). There are additional complications due to trends, but I will defer these to the complete treatment of these matters in 574, our time-series course.
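Here is a minimal R sketch of this two-step (Engle-Granger) procedure; the simulated series, the lag order, and the use of the tseries package are my own illustrative assumptions. Note that the p-values printed by adf.test use the ordinary DF tables, so in practice one compares the statistic with the adjusted critical values (e.g., Hamilton, Table B.9).

```r
## Engle-Granger two-step test for cointegration of two series x and y.
library(tseries)                       # for adf.test (assumed installed)

set.seed(3)                            # simulated illustration only
x <- cumsum(rnorm(200))                # a random walk
y <- 1 + 0.5 * x + rnorm(200)          # cointegrated with x by construction

fit  <- lm(y ~ x)                      # step 1: estimate alpha, beta by OLS
zhat <- residuals(fit)                 # zhat_t = y_t - alpha.hat - beta.hat x_t

adf.test(zhat, k = 1)                  # step 2: (A)DF test on the residuals;
                                       # compare with the adjusted critical values
```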
In the case that we have more than two series the situation is a bit more complicated.
An elegant general approach is provided by the canonical correlation methods of Johansen
(1991). Johansen’s approach employs two sets of auxiliary regressions. Returning to the
prior matrix notation, write
where $\hat\lambda_i$ denote the ordered eigenvalues $\hat\lambda_1 > \hat\lambda_2 > \cdots > \hat\lambda_m$ of the matrix $\hat\Sigma_{vv}^{-1}\hat\Sigma_{vu}\hat\Sigma_{uu}^{-1}\hat\Sigma_{uv}$,
and r, as above, denotes the rank of the cointegrating relationship, that is the rank of the
matrix Π. An LR test based on this approach and intended to test rank r against rank
For another example, in the simple bivariate setting of PS3, our test would correspond to testing the null hypothesis of $r = 0$ cointegrating vectors against the alternative of $r = 1$ cointegrating vectors. Here $m = 2$, so we have the test statistic
$$2(\ell_1 - \ell_0) = -n\log(1 - \hat\lambda_1).$$
From Table B.10 of Hamilton (1994) we find that the 5% critical value of this test is 3.84.
A reasonably complete derivation of these expressions is provided by the semi-historical
discussion in the concluding sections.
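For completeness, here is a sketch of how this Johansen test is typically run in R, using the urca package (an assumption on my part; the notes do not prescribe software). The deterministic specification (ecdet) and lag order K below are illustrative and should be matched to the relevant case in Hamilton's tables.

```r
## Johansen maximal-eigenvalue test for the bivariate case, r = 0 vs r = 1.
library(urca)                               # assumed installed
# x and y as in the Engle-Granger sketch above
jo <- ca.jo(cbind(y, x), type = "eigen", ecdet = "const", K = 2)
summary(jo)   # reports -n log(1 - lambda.hat_i) with tabulated critical values
```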
4. Canonical Correlation
Regression generalizes the notion of simple bivariate correlation by finding the linear combination of a vector of covariates that is most highly correlated with the response variable; that is, it finds the linear combination of the $x$'s that maximizes the correlation of $y$ and $\hat y$:
$$R^2 = 1 - \frac{\hat\sigma^2_{y|x}}{\hat\sigma^2_y} = \rho^2(y, \hat y).$$
This is just the squared cosine of the angle between y and ŷ.
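A one-line check of the identity $R^2 = \rho^2(y, \hat y)$ in R, using a built-in data set purely for illustration:

```r
## R^2 from a regression equals the squared correlation of y and yhat.
fit <- lm(dist ~ speed, data = cars)
c(summary(fit)$r.squared, cor(cars$dist, fitted(fit))^2)   # identical
```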
Hotelling (1935) generalized this notion further to handle a vector of response variables. Suppose we have $m$ $y$'s and $p$ $x$'s. Consider two arbitrary linear combinations $\alpha^\top y$ and $\beta^\top x$, and suppose for convenience that $\alpha$ and $\beta$ are chosen so that they yield unit variance. We would like to choose $\alpha$ and $\beta$ so that they maximize the correlation between $\alpha^\top y$ and $\beta^\top x$. How? Let's write $\mathrm{Cov}(\alpha^\top y, \beta^\top x) = \alpha^\top \Sigma_{xy}\beta$, $V(\alpha^\top y) = \alpha^\top\Sigma_{yy}\alpha$, and $V(\beta^\top x) = \beta^\top\Sigma_{xx}\beta$. We want to maximize the Lagrangean expression
$$\alpha^\top\Sigma_{xy}\beta - \frac{\lambda_1}{2}\,\alpha^\top\Sigma_{yy}\alpha - \frac{\lambda_2}{2}\,\beta^\top\Sigma_{xx}\beta,$$
so we have the first order conditions
$$\Sigma_{xy}\beta = \lambda_1\Sigma_{yy}\alpha$$
$$\Sigma_{xy}^\top\alpha = \lambda_2\Sigma_{xx}\beta.$$
Multiplying through by $\alpha^\top$ and $\beta^\top$ respectively gives $\lambda_1 = \alpha^\top\Sigma_{xy}\beta$ and $\lambda_2 = \alpha^\top\Sigma_{xy}\beta$, so $\lambda_1 = \lambda_2 = \rho$. Now multiply the second equation by $\Sigma_{xy}\Sigma_{xx}^{-1}$,
$$\Sigma_{xy}\Sigma_{xx}^{-1}\Sigma_{xy}^\top\alpha = \rho\,\Sigma_{xy}\beta,$$
and multiply the first by $\rho$,
$$\rho^2\Sigma_{yy}\alpha = \rho\,\Sigma_{xy}\beta,$$
and subtract to obtain the eigenvalue problem
$$\Sigma_{xy}\Sigma_{xx}^{-1}\Sigma_{xy}^\top\alpha - \rho^2\Sigma_{yy}\alpha = 0.$$
The usual eigenvalue problem $(A - \lambda I)x = 0$ implies $Ax = \lambda x$, so the vector $x$ is unaltered in direction by the transformation $A$, only in length. Our problem can be reformulated in this way by writing
$$(3)\qquad (\Sigma_{yy}^{-1}\Sigma_{xy}\Sigma_{xx}^{-1}\Sigma_{xy}^\top - \rho^2 I)\alpha = 0.$$
Now of course there is more than one eigenvalue – there are m of them, where m is the
dimension of the response y. They can be ordered and for this problem they are called
the canonical correlations and the corresponding eigenvectors are called the canonical
variables. The latter are constructed in the following way: given a pair $(\rho^2, \alpha)$ satisfying (3), we can find an associated $\beta$ by simply regressing $\alpha^\top y$ on $x$, thereby constructing all of the triples $(\rho^2, \alpha, \beta)$.
To illustrate this we can revisit an example introduced by Waugh (1942). Yes, that
Waugh. He writes: “Professor Hotelling’s paper, should be widely known and his method
used by practical statisticians. Yet, few practical statisticians seem to know of the paper,
and perhaps those few are inclined to regard it as a mathematical curiosity rather than
an important and useful method of analyzing concrete problems.” The second of Waugh’s
examples concerns the quality of wheat and how it influences the quality of flour it pro-
duces. He uses data from 136 export shipments of hard spring wheat. The correlation
matrix of the data is shown in the following table. There are five variables for wheat qual-
ity, and four for the quality of the flour. In 1942 it involved a non-trivial amount of effort to compute the canonical correlations, and Waugh adopted some clever iterative tricks to get an approximation, but it is trivial to do so now in R. Applying the machinery introduced above, we get: ρ = c(0.910, 0.650, 0.269, 0.169).
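Waugh's correlation table is not reproduced here, so the following R sketch applies the same machinery to simulated data instead; the dimensions and variable names are illustrative. It computes the canonical correlations from the eigenvalue problem (3) and checks them against the built-in cancor routine.

```r
## Canonical correlations via the eigenvalue formulation, checked with cancor().
set.seed(4)
n <- 200
X <- matrix(rnorm(n * 4), n, 4)                                # p = 4 x's
Y <- X %*% matrix(rnorm(8), 4, 2) + matrix(rnorm(n * 2), n, 2) # m = 2 y's

Syy <- cov(Y); Sxx <- cov(X); Sxy <- cov(Y, X)    # Sxy = Cov(y, x), m x p

M   <- solve(Syy) %*% Sxy %*% solve(Sxx) %*% t(Sxy)
rho <- sqrt(Re(eigen(M)$values))                  # eigenvalues are rho^2

rbind(rho, cancor(X, Y)$cor)                      # the two rows agree
```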
where $\Sigma_{\tilde Y\tilde Y} = \tilde Y^\top\tilde Y/n$, etc. As usual, the likelihood after substitution looks like this:
$$\ell(\hat A, \hat B, \hat\Omega) = K - \frac{n}{2}\log|\hat\Omega|,$$
where K is independent of the data. So we are simply trying to minimize the generalized
variance represented by the determinant of Ω.
Now we need some trickery involving determinants of partitioned matrices. We start,
following Johansen, with the identity
$$\begin{vmatrix}\Sigma_{00} & \Sigma_{01}\\ \Sigma_{10} & \Sigma_{11}\end{vmatrix} = |\Sigma_{00}|\,|\Sigma_{11} - \Sigma_{10}\Sigma_{00}^{-1}\Sigma_{01}| = |\Sigma_{11}|\,|\Sigma_{00} - \Sigma_{01}\Sigma_{11}^{-1}\Sigma_{10}|.$$
Thus,
$$\begin{vmatrix}\Sigma_{00} & \Sigma_{01}B\\ B^\top\Sigma_{10} & B^\top\Sigma_{11}B\end{vmatrix} = |\Sigma_{00}|\,|B^\top(\Sigma_{11} - \Sigma_{10}\Sigma_{00}^{-1}\Sigma_{01})B|,$$
and therefore,
$$|\Sigma_{00} - \Sigma_{01}B(B^\top\Sigma_{11}B)^{-1}B^\top\Sigma_{10}| = |\Sigma_{00}|\,|B^\top(\Sigma_{11} - \Sigma_{10}\Sigma_{00}^{-1}\Sigma_{01})B|\,/\,|B^\top\Sigma_{11}B|.$$
Translating this expression back into our likelihood notation we have
$$(\ast)\qquad |\hat\Omega| = |\Sigma_{\tilde Y\tilde Y}|\,|B^\top(\Sigma_{\tilde X\tilde X} - \Sigma_{\tilde X\tilde Y}\Sigma_{\tilde Y\tilde Y}^{-1}\Sigma_{\tilde Y\tilde X})B|\,/\,|B^\top\Sigma_{\tilde X\tilde X}B|.$$
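Since the partitioned determinant identity does the real work here, a small numerical check in R may be reassuring; the random blocks below are purely illustrative.

```r
## Numerical check of the partitioned determinant identity.
set.seed(5)
S <- crossprod(matrix(rnorm(25), 5, 5)) + diag(5)   # 5 x 5 positive definite
i0 <- 1:2; i1 <- 3:5
S00 <- S[i0, i0]; S01 <- S[i0, i1]; S10 <- S[i1, i0]; S11 <- S[i1, i1]
c(det(S),
  det(S00) * det(S11 - S10 %*% solve(S00) %*% S01),
  det(S11) * det(S00 - S01 %*% solve(S11) %*% S10))  # all three agree
```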
Recall that we are still trying to maximize $-\log|\hat\Omega|$ with respect to $B$. How to do this?
Consider a simpler version of a similar problem. Suppose A is a symmetric positive
semi-definite matrix and we want to solve:
$$\max_x \frac{x^\top A x}{x^\top x}.$$
We can write $A = P\Lambda P^\top = \sum_i \lambda_i p_i p_i^\top$, where $\lambda_1 > \lambda_2 > \cdots > \lambda_n$ are the eigenvalues of $A$ and $P$ is the orthogonal matrix whose columns are the corresponding eigenvectors. The columns of $P$ constitute a basis for the space $\mathbb{R}^n$, so any $x$ can be written as
$$x = P\alpha,$$
so our problem becomes
$$\max_\alpha\; \sum_i \lambda_i\alpha_i^2 \Big/ \sum_i \alpha_i^2,$$
which is accomplished by letting $\alpha = (1, 0, \cdots, 0)^\top$, giving $\lambda_1$ as the maximum.
Generalizing slightly, consider, for $A$ symmetric positive semi-definite and $B$ symmetric positive definite,
$$\max_x\; x^\top A x / x^\top B x.$$
Write $B = C^\top C$ as the Cholesky decomposition (matrix square root) of $B$ and set $y = Cx$ to get
$$\max_y\; y^\top C^{-\top} A C^{-1} y / y^\top y,$$
so the solution is found by choosing the largest eigenvalue of the matrix $C^{-\top}AC^{-1}$. This generalized eigenvalue problem can be posed as finding roots of
$$|\lambda A - B| = 0.$$
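A brief R check of this argument (random A and B, illustrative only): the maximum of the ratio equals the largest eigenvalue of $C^{-\top}AC^{-1}$, attained at $x = C^{-1}y$ for the leading eigenvector $y$.

```r
## max_x x'Ax / x'Bx via the Cholesky transformation B = C'C, y = Cx.
set.seed(6)
p <- 4
A <- crossprod(matrix(rnorm(p * p), p, p))             # symmetric psd
B <- crossprod(matrix(rnorm(p * p), p, p)) + diag(p)   # symmetric pd

C  <- chol(B)                      # upper triangular, B = t(C) %*% C
Ci <- solve(C)
M  <- t(Ci) %*% A %*% Ci           # C^{-T} A C^{-1}, symmetric

e    <- eigen(M, symmetric = TRUE)
xmax <- Ci %*% e$vectors[, 1]      # maximizer in the original coordinates

c(e$values[1],
  drop(t(xmax) %*% A %*% xmax) / drop(t(xmax) %*% B %*% xmax))  # agree
```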
This all generalizes to finding multiple roots and multiple eigenvectors. Let $X$ be an $n \times p$ matrix and consider
$$\max_X\; |X^\top A X| / |X^\top B X|.$$
Similar arguments lead to choosing $X$ to be the matrix of eigenvectors corresponding to the $p$ largest eigenvalues of $C^{-\top}AC^{-1}$, where again $B = C^\top C$, by solving $|\lambda A - B| = 0$.
Now, finally, we are ready to get back to the problem of maximizing $-\log|\hat\Omega|$. This requires solving the eigenvalue problem
$$|\rho\,\Sigma_{\tilde X\tilde X} - (\Sigma_{\tilde X\tilde X} - \Sigma_{\tilde X\tilde Y}\Sigma_{\tilde Y\tilde Y}^{-1}\Sigma_{\tilde Y\tilde X})| = 0,$$
or, setting $\rho = (1 - \lambda)$,
$$|\lambda\,\Sigma_{\tilde X\tilde X} - \Sigma_{\tilde X\tilde Y}\Sigma_{\tilde Y\tilde Y}^{-1}\Sigma_{\tilde Y\tilde X}| = 0.$$
where $a_{ij}$ is the $ij$th element of $A$ and $(-1)^{i+j}A_{ij}$ is the $ij$th cofactor of $A$, with $A_{ij}$ the determinant of the matrix $A$ with the $i$th row and $j$th column deleted. Thus, the derivative of $|A|$ with respect to $a_{ij}$ is just the cofactor $(-1)^{i+j}A_{ij}$, which is $|A|$ times the $ji$th element of $A^{-1}$. Thus $\partial|A|/\partial A = |A|(A^{-1})^\top$, and thus by the chain rule we have (i). Note $(A^\top)^{-1} = (A^{-1})^\top$, and the transpose is usually irrelevant since $A$ is symmetric in (most) applications.
(ii): Write $x^\top A x = \sum_{ij} a_{ij}x_i x_j$, so
$$\frac{\partial x^\top A x}{\partial A} = \left(\frac{\partial x^\top A x}{\partial a_{ij}}\right) = (x_i x_j).$$
(iii): To see this write
$$\frac{\partial x^\top A^{-1} x}{\partial A} = x^\top\,\frac{\partial A^{-1}}{\partial A}\,x$$
and differentiate the identity $AA^{-1} = I$ to obtain
$$0 = \frac{\partial A}{\partial a_{ij}}\,A^{-1} + A\,\frac{\partial A^{-1}}{\partial a_{ij}},$$
so
$$\frac{\partial A^{-1}}{\partial a_{ij}} = -A^{-1}\,\frac{\partial A}{\partial a_{ij}}\,A^{-1},$$
where $\partial A/\partial a_{ij} = e_i e_j^\top$ is a matrix with $ij$th element 1 and the rest zeros. Thus
$$\frac{\partial x^\top A^{-1} x}{\partial a_{ij}} = -x^\top A^{-1}e_i e_j^\top A^{-1}x = -e_i^\top A^{-1}xx^\top A^{-1}e_j,$$
and (iii) follows by arranging the elements in matrix form.
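A finite-difference check of (iii) in R (arbitrary A and x, illustrative only):

```r
## Check d(x'A^{-1}x)/dA = -A^{-1}xx'A^{-1} element by element.
set.seed(7)
p <- 3
A <- crossprod(matrix(rnorm(p * p), p, p)) + diag(p)
x <- rnorm(p)
analytic <- -solve(A) %*% x %*% t(x) %*% solve(A)

eps <- 1e-6
numeric <- matrix(NA, p, p)
f <- function(M) drop(t(x) %*% solve(M) %*% x)
for (i in 1:p) for (j in 1:p) {
  Aij <- A
  Aij[i, j] <- Aij[i, j] + eps     # perturb a_{ij} only (no symmetry imposed)
  numeric[i, j] <- (f(Aij) - f(A)) / eps
}
max(abs(numeric - analytic))       # small (finite-difference error)
```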