Elementary Regression Theory
Richard E. Quandt
Princeton University
Theorem 1. If the $k \times 1$ vector $Y$ is distributed as $N(\mu, V)$, and if $B$ is an $m \times k$ matrix ($m \le k$) of rank $m$, then $Z = BY$ is distributed as $N(B\mu, BVB')$.
Proof. The moment generating function for the multivariate normal distribution of $Y$ is
$$E(e^{\theta'Y}) = e^{\theta'\mu + \theta'V\theta/2}.$$
Hence, the moment generating function for $Z$ is $E(e^{\theta'BY}) = e^{\theta'B\mu + \theta'BVB'\theta/2}$, which is the moment generating function for a multivariate normal distribution with mean vector $B\mu$ and covariance matrix $BVB'$.
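Theorem 1 can be illustrated numerically. The following is a minimal Monte Carlo sketch (not part of the original notes; the particular $\mu$, $V$, and $B$ are arbitrary choices): sampled draws of $Z = BY$ have mean close to $B\mu$ and covariance close to $BVB'$.

```python
import numpy as np

# Illustrative check of Theorem 1: if Y ~ N(mu, V) and Z = B Y,
# then Z ~ N(B mu, B V B'). Values of mu, V, B are arbitrary examples.
rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0, 0.5])                   # k = 3
V = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])                   # positive definite
B = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, -1.0]])                  # m = 2, rank 2

Y = rng.multivariate_normal(mu, V, size=200_000)  # draws of Y (one per row)
Z = Y @ B.T                                       # corresponding draws of Z = B Y

assert np.allclose(Z.mean(axis=0), B @ mu, atol=0.03)     # mean ~ B mu
assert np.allclose(np.cov(Z.T), B @ V @ B.T, atol=0.1)    # cov ~ B V B'
```

The tolerances reflect sampling noise at this sample size, not the exactness of the theorem.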
Theorem 2. If $B$ is a $q \times n$ matrix, $A$ an $n \times n$ symmetric matrix with $BA = 0$, and $Y$ is distributed as $N(\mu, \sigma^2 I)$, then the linear form $BY$ and the quadratic form $Y'AY$ are independently distributed.

Proof. Since $A$ is symmetric, there exists an orthogonal matrix $P$ such that
$$P'AP = \begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix},$$
where $D_1$ and $D_2$ are diagonal; writing the matrix in partitioned form is needed in the proof below. Let $Z = P'Y$. Then $Z$ is distributed as $N(P'\mu, \sigma^2 I)$, since $E(P'Y) = P'\mu$ and $E[(Z - P'\mu)(Z - P'\mu)'] = P'E[(Y - \mu)(Y - \mu)']P = \sigma^2 I$ by the orthogonality of $P$.
Now let
$$BP = C. \qquad (1)$$
By the hypothesis of the theorem,
$$0 = BAP = BPP'AP = CD, \qquad (2)$$
and partitioning $C$ and $D$ conformably, $CD$ can be written as
$$CD = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix}\begin{pmatrix} D_1 & 0 \\ 0 & D_2 \end{pmatrix},$$
where either $D_1$ or $D_2$ must be zero. If neither $D_1$ nor $D_2$ is zero, Eq.(2) would imply that $C_{11} = C_{12} = C_{21} = C_{22} = 0$; hence $C = 0$, and from Eq.(1), $B = 0$, which is a trivial case. So assume that $D_2$ is zero. But if $D_2 = 0$, Eq.(2) implies only $C_{11} = C_{21} = 0$. Then
$$BY = BPP'Y = CZ = \begin{pmatrix} 0 & C_2 \end{pmatrix}\begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} = C_2Z_2,$$
where $C_2$ denotes the block column $(C_{12}' \;\; C_{22}')'$, and
$$Y'AY = Y'PP'APP'Y = Z'DZ = Z_1'D_1Z_1.$$
Since the elements of $Z$ are independent and $BY$ and $Y'AY$ share no element of $Z$ in common, they are independently distributed.

Corollary 1. If $x' = (x_1, \ldots, x_n)$ are independent drawings from a normal distribution, the sample mean $\bar x$ and ($n$ times) the sample variance, $\sum_i (x_i - \bar x)^2$, are independently distributed.
This follows because, defining $\iota$ as the $n \times 1$ vector of ones, $\bar x = \iota'x/n$ and
$$\sum_i (x_i - \bar x)^2 = x'\left(I - \frac{\iota\iota'}{n}\right)\left(I - \frac{\iota\iota'}{n}\right)x = x'\left(I - \frac{\iota\iota'}{n}\right)x,$$
and we can verify that the matrices
$$\left(I - \frac{\iota\iota'}{n}\right) \quad\text{and}\quad \left(\frac{\iota'}{n}\right)$$
have a zero product.
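The two matrix facts used in Corollary 1 can be verified directly; a small sketch (the data vector is an arbitrary example):

```python
import numpy as np

# Check of the matrices in Corollary 1: M = I - ii'/n is idempotent and has a
# zero product with i'/n, so mean and variance are built from orthogonal pieces.
n = 6
i = np.ones((n, 1))                   # column vector of ones
M = np.eye(n) - (i @ i.T) / n         # centering matrix

assert np.allclose(M @ M, M)          # idempotent: MM = M
assert np.allclose(M @ (i / n), 0)    # zero product with the mean operator

x = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
assert np.isclose(x @ M @ x, ((x - x.mean())**2).sum())   # x'Mx = sum of squared deviations
```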
Theorem 3. If $A$ and $B$ are both symmetric and of the same order, it is necessary and sufficient for the existence of an orthogonal transformation $P$ that simultaneously diagonalizes $A$ and $B$ that $AB = BA$.

Proof. (1) Necessity. Assume there exists an orthogonal $P$ such that $P'AP = D_1$ and $P'BP = D_2$, where $D_1$ and $D_2$ are diagonal matrices. Then $P'APP'BP = D_1D_2 = D$ and $P'BPP'AP = D_2D_1 = D$. But then it follows that $AB = BA$.
(2) Sufficiency. Assume that $AB = BA$ and let $\lambda_1, \ldots, \lambda_r$ denote the distinct eigenvalues of $A$, with multiplicities $m_1, \ldots, m_r$. There exists an orthogonal $Q_1$ such that
$$Q_1'AQ_1 = D_1 = \begin{pmatrix} \lambda_1 I_1 & 0 & \ldots & 0 \\ 0 & \lambda_2 I_2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \lambda_r I_r \end{pmatrix},$$
where $I_j$ denotes an identity matrix of order $m_j$. Then $D_1$ commutes with $Q_1'BQ_1$, since
$$D_1Q_1'BQ_1 = Q_1'AQ_1Q_1'BQ_1 = Q_1'ABQ_1 = Q_1'BAQ_1 = Q_1'BQ_1Q_1'AQ_1 = Q_1'BQ_1D_1.$$
It follows that the matrix $Q_1'BQ_1$ is a block-diagonal matrix with symmetric submatrices $(Q_1'BQ_1)_i$ of dimension $m_i$ $(i = 1, \ldots, r)$ along the diagonal. Then we can find orthogonal matrices $P_i$ such that $P_i'(Q_1'BQ_1)_iP_i$ are diagonal. Now form the matrix
$$Q_2 = \begin{pmatrix} P_1 & 0 & \ldots & 0 \\ 0 & P_2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & P_r \end{pmatrix}.$$
$Q_2$ is obviously orthogonal. Then the matrix $Q$ defined as $Q_1Q_2$ is orthogonal and diagonalizes both $A$ and $B$, since
$$Q'AQ = Q_2'Q_1'AQ_1Q_2 = Q_2'D_1Q_2 = Q_2'Q_2D_1 = D_1$$
(because the block-diagonality of $Q_2$ causes it to commute with $D_1$), and
$$Q'BQ = Q_2'Q_1'BQ_1Q_2$$
is a diagonal matrix by construction.
Theorem 4. If $Y$ is distributed as $N(\mu, I)$, the positive semidefinite forms $Y'AY$ and $Y'BY$ are independently distributed if and only if $AB = 0$.
Proof. (1) Sufficiency. Let $AB = 0$. Then $B'A' = BA = 0$ as well, since $A$ and $B$ are symmetric; hence $AB = BA$, and by Theorem 3 there exists an orthogonal $P$ such that $P'AP = D_1$ and $P'BP = D_2$. It follows that $D_1D_2 = 0$, since
$$D_1D_2 = P'APP'BP = P'ABP$$
and $AB = 0$ by hypothesis. Then $D_1$ and $D_2$ must be of the forms
$$D_1 = \begin{pmatrix} \Lambda_1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \qquad (3)$$
and
$$D_2 = \begin{pmatrix} 0 & 0 & 0 \\ 0 & \Lambda_2 & 0 \\ 0 & 0 & 0 \end{pmatrix}, \qquad (4)$$
where $\Lambda_1$ and $\Lambda_2$ are diagonal submatrices and where $D_1$ and $D_2$ are partitioned conformably.
Now let $Z = P'Y$; $Z$ is distributed as $N(P'\mu, I)$, and
$$Y'AY = Z'P'APZ = Z_1'\Lambda_1Z_1$$
and
$$Y'BY = Z'P'BPZ = Z_2'\Lambda_2Z_2.$$
Since the elements of $Z$ are independent and the two quadratic forms share no element of $Z$ in common, they are independent.
(2) Necessity. If $Y'AY$ and $Y'BY$ are independently distributed, let $P$ be orthogonal with $P'AP = D_1$ and $P'BP = D_2$, and $Z = P'Y$; then $Z'D_1Z$ and $Z'D_2Z$ are independent. Since the $z_i$ are independent, the two positive semidefinite forms can share no element of $Z$, which forces $D_1D_2 = 0$; hence $AB = PD_1P'PD_2P' = PD_1D_2P' = 0$.

Theorem 5. A symmetric matrix $A$ is idempotent if and only if its eigenvalues are all either 0 or 1.

Proof. (1) Necessity. Let $A$ be idempotent of rank $r$, and let $\lambda$ be an eigenvalue with eigenvector $x$. Then $\lambda x = Ax = A^2x = \lambda^2x$, so $\lambda^2 = \lambda$ and every eigenvalue is 0 or 1. Diagonalizing $A$ and ordering the eigenvalues suitably,
$$P'AP = \begin{pmatrix} I_r & 0 \\ 0 & 0 \end{pmatrix}, \qquad (5)$$
where $I_r$ is an identity matrix of order $r$.

(2) Sufficiency. Let the eigenvalues of $A$ all be 0 or 1. Then there exists orthogonal $P$ such that $P'AP = E_r$, where $E_r$ is the matrix on the right hand side of Eq.(5). Then $A = PE_rP'$, and $A^2 = PE_rP'PE_rP' = PE_rP' = A$, and $A$ is idempotent.
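A quick numerical sketch of Theorem 5 (illustrative only; the orthogonal $P$ is generated at random): a matrix built as $PE_rP'$ is idempotent with eigenvalues 0 and 1, and its trace equals its rank, anticipating Theorem 11.

```python
import numpy as np

# Build A = P E_r P' with random orthogonal P (via QR); check idempotency,
# the 0/1 eigenvalues of Theorem 5, and trace = rank (Theorem 11).
rng = np.random.default_rng(1)
n, r = 5, 2
P, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix
E = np.diag([1.0] * r + [0.0] * (n - r))           # E_r from Eq.(5)
A = P @ E @ P.T

assert np.allclose(A @ A, A)                        # idempotent
assert np.allclose(np.sort(np.linalg.eigvalsh(A)), [0, 0, 0, 1, 1])
assert np.isclose(np.trace(A), r)                   # trace equals rank
```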
Theorem 6. If the $n$-vector $Y$ is distributed as $N(0, I)$, then $Y'AY$ is distributed as $\chi^2(k)$ if and only if $A$ is idempotent of rank $k$.

Proof. (1) Sufficiency. Let $P$ be the orthogonal matrix that diagonalizes $A$, and define $Z = P'Y$. Then
$$Y'AY = Z'P'APZ = \sum_{i=1}^k z_i^2$$
by Theorem 5. But the right hand side is the sum of squares of $k$ normally and independently distributed variables with mean zero and variance 1; hence it is distributed as $\chi^2(k)$.
(2) Necessity. Assume that $Y'AY$ is distributed as $\chi^2(k)$. Since $A$ is symmetric, there exists orthogonal $P$ such that $P'AP = D$ is diagonal. Defining $Z = P'Y$,
$$Y'AY = Z'P'APZ = Z'DZ = \sum_{i=1}^n d_iz_i^2,$$
and we define the right hand side as $\Delta$. Since the $z_i$ are $N(0,1)$ and independent, the moment generating function of $\Delta$ is
$$\prod_{i=1}^n (1 - 2\theta d_i)^{-1/2}.$$
The moment generating function for $Y'AY$ (since it is distributed as $\chi^2(k)$) is $(1 - 2\theta)^{-k/2}$. These two moment generating functions must obviously equal one another, which is possible only if $k$ of the $d_i$ are equal to 1 and the rest are equal to 0; but this implies that $A$ is idempotent of rank $k$.
In the next theorem we introduce matrices $A_r$ and $A_u$; the subscripts that identify the matrices refer to the context of hypothesis testing in the regression model and indicate the model restricted by the hypothesis or the unrestricted model.

Theorem 7. Let $u$ be an $n$-vector distributed as $N(0, \sigma^2 I)$, and let $A_r$ and $A_u$ be two idempotent matrices, with $A_r \ne A_u$ and $A_rA_u = A_u$. Then, letting $u_r = A_ru$ and $u_u = A_uu$, the quantity
$$F = \frac{(u_r'u_r - u_u'u_u)/(\mathrm{tr}(A_r) - \mathrm{tr}(A_u))}{u_u'u_u/\mathrm{tr}(A_u)}$$
has the $F$ distribution with $\mathrm{tr}(A_r) - \mathrm{tr}(A_u)$ and $\mathrm{tr}(A_u)$ degrees of freedom.
Proof. Dividing both numerator and denominator by $\sigma^2$, we find from Theorem 6 that the denominator has the $\chi^2(\mathrm{tr}(A_u))$ distribution. For the numerator we have $u_r'u_r - u_u'u_u = u'A_ru - u'A_uu = u'(A_r - A_u)u$. But $(A_r - A_u)$ is idempotent, because
$$(A_r - A_u)^2 = A_r^2 - A_rA_u - A_uA_r + A_u^2 = A_r - A_u - A_u + A_u = A_r - A_u.$$
Hence, the numerator divided by $\sigma^2$ has the $\chi^2(\mathrm{tr}(A_r) - \mathrm{tr}(A_u))$ distribution. But the numerator and the denominator are independent, because the matrices of the respective quadratic forms, $A_r - A_u$ and $A_u$, have a zero product.
Theorem 8. If $Y$ is distributed as $N(\mu, \sigma^2 I)$, then $Y'AY/\sigma^2$ is distributed as noncentral $\chi^2(k, \lambda)$, where $\lambda = \mu'A\mu/2\sigma^2$, if and only if $A$ is idempotent of rank $k$.

The proof is omitted.
Theorem 9. The trace of a product of square matrices is invariant under cyclic permutations of the matrices.

Proof. Consider the product $BC$, where $B$ and $C$ are two matrices of order $n$. Then $\mathrm{tr}(BC) = \sum_i\sum_j b_{ij}c_{ji}$ and $\mathrm{tr}(CB) = \sum_i\sum_j c_{ij}b_{ji}$. But these two double sums are obviously equal to one another.
Theorem 10. All symmetric, idempotent matrices not of full rank are positive semidefinite.

Proof. For symmetric and idempotent matrices, we have $A = AA = A'A$. Defining $y = Ax$ for an arbitrary vector $x$, we have $x'Ax = x'A'Ax = y'y$, which is $\ge 0$.
Theorem 11. If $A$ is idempotent of rank $r$, then $\mathrm{tr}(A) = r$.

Proof. It follows from Eq.(5) in Theorem 5 that $P'AP = E_r$. Hence, from Theorem 9 it follows that $\mathrm{tr}(A) = \mathrm{tr}(P'AP) = \mathrm{tr}(APP') = \mathrm{tr}(E_r) = r$.
Theorem 12. If $Y$ is distributed with mean vector $\mu = 0$ and covariance matrix $\sigma^2 I$, then $E(Y'AY) = \sigma^2\,\mathrm{tr}(A)$.

Proof.
$$E(Y'AY) = E\Big\{\sum_i\sum_j a_{ij}y_iy_j\Big\} = E\Big\{\sum_i a_{ii}y_i^2\Big\} + E\Big\{\sum_i\sum_{j \ne i} a_{ij}y_iy_j\Big\}.$$
But when $i \ne j$, $E(y_iy_j) = E(y_i)E(y_j) = 0$; hence $E(Y'AY) = \sigma^2\,\mathrm{tr}(A)$.
We now specify the regression model as
$$Y = X\beta + u \qquad (6)$$
where $Y$ is an $n \times 1$ vector of observations on the dependent variable, $X$ is an $n \times k$ matrix of observations on the independent variables, and $u$ is an $n \times 1$ vector of unobservable error terms. We make the following assumptions:

Assumption 1. The elements of $X$ are nonstochastic (and, hence, may be taken to be identical in repeated samples).

Assumption 2. The elements of the vector $u$ are independently distributed as $N(0, \sigma^2)$.

It follows from Assumption 2 that the joint density of the elements of $u$ is
$$L = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-u'u/2\sigma^2}.$$
The loglikelihood is
$$\log L = -\left(\frac{n}{2}\right)\log(2\pi) - \left(\frac{n}{2}\right)\log\sigma^2 - \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta),$$
and its partial derivatives are
$$\frac{\partial \log L}{\partial \beta} = \frac{1}{\sigma^2}X'(Y - X\beta)$$
$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(Y - X\beta)'(Y - X\beta).$$
Setting these equal to zero yields the maximum likelihood estimates, which are
$$\hat\beta = (X'X)^{-1}X'Y \qquad (7)$$
and
$$\hat\sigma^2 = (Y - X\hat\beta)'(Y - X\hat\beta)/n. \qquad (8)$$
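Equations (7) and (8) can be computed directly; a small sketch (with arbitrary simulated data), cross-checked against numpy's least-squares solver:

```python
import numpy as np

# Eq.(7): beta_hat = (X'X)^{-1} X'Y, and Eq.(8): sigma2_hat = u_hat'u_hat / n.
rng = np.random.default_rng(2)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # Eq.(7)
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / n                 # Eq.(8)

assert np.allclose(beta_hat, np.linalg.lstsq(X, Y, rcond=None)[0])
assert np.allclose(X.T @ resid, 0, atol=1e-8)  # normal equations: X'(Y - X beta_hat) = 0
```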
Remark. It is obvious from the form of the likelihood function that the estimate $\hat\beta$ is also the least squares estimate of $\beta$, i.e., the estimate that minimizes $(Y - X\beta)'(Y - X\beta)$.

Theorem 13. $\hat\beta$ is unbiased, i.e., $E(\hat\beta) = \beta$.

Proof. Substituting from Eq.(6), $\hat\beta = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\beta + u) = \beta + (X'X)^{-1}X'u$, and taking expectations yields $E(\hat\beta) = \beta$, since $E(u) = 0$.

Theorem 14. $E(\hat\sigma^2) = (n-k)\sigma^2/n$.

Proof. Substituting $(X'X)^{-1}X'Y$ for $\hat\beta$ in Eq.(8),
$$E(\hat\sigma^2) = \frac{1}{n}E\left[(Y - X(X'X)^{-1}X'Y)'(Y - X(X'X)^{-1}X'Y)\right]$$
$$= \frac{1}{n}E\left[Y'(I - X(X'X)^{-1}X')(I - X(X'X)^{-1}X')Y\right] = \frac{1}{n}E\left[Y'(I - X(X'X)^{-1}X')Y\right], \qquad (9)$$
since $I - X(X'X)^{-1}X'$ is symmetric and idempotent. Substituting $X\beta + u$ for $Y$ and noting that $(I - X(X'X)^{-1}X')X = 0$, Eq.(9) equals $\frac{1}{n}E\left[u'(I - X(X'X)^{-1}X')u\right]$, and applying Theorem 12,
$$E(\hat\sigma^2) = \frac{1}{n}\sigma^2\,\mathrm{tr}\left(I - X(X'X)^{-1}X'\right) = \frac{1}{n}\sigma^2\left[\mathrm{tr}(I) - \mathrm{tr}\left(X(X'X)^{-1}X'\right)\right]$$
$$= \frac{1}{n}\sigma^2\left[n - \mathrm{tr}\left((X'X)(X'X)^{-1}\right)\right] = \frac{n-k}{n}\sigma^2.$$
It follows from Theorem 14 that an unbiased estimator of $\sigma^2$ can be defined as $\hat\sigma^2_{ub} = (n/(n-k))\hat\sigma^2$.
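The trace bookkeeping behind Theorem 14 is easy to verify numerically; a sketch with an arbitrary design matrix:

```python
import numpy as np

# The residual-maker matrix M = I - X(X'X)^{-1}X' is idempotent with trace n - k,
# which is exactly why rescaling sigma2_hat by n/(n-k) makes it unbiased.
rng = np.random.default_rng(3)
n, k = 40, 4
X = rng.standard_normal((n, k))
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

assert np.allclose(M @ M, M)              # idempotent
assert np.isclose(np.trace(M), n - k)     # tr(M) = n - k
assert np.allclose(M @ X, 0, atol=1e-9)   # M annihilates the columns of X
```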
Theorem 15. $\hat\beta$ has the $k$-variate normal distribution with mean vector $\beta$ and covariance matrix $\sigma^2(X'X)^{-1}$.

Proof. Normality of the distribution follows from Theorem 1. The fact that the mean of this distribution is $\beta$ follows from Theorem 13. The covariance matrix of $\hat\beta$ is
$$E\left[(\hat\beta - \beta)(\hat\beta - \beta)'\right] = E(\hat\beta\hat\beta') - \beta\beta' = E\{(X'X)^{-1}X'(X\beta + u)(X\beta + u)'X(X'X)^{-1}\} - \beta\beta'$$
$$= E\left[(X'X)^{-1}X'\sigma^2IX(X'X)^{-1}\right] = \sigma^2(X'X)^{-1}.$$
Theorem 16. $(n-k)\hat\sigma^2_{ub}/\sigma^2$ is distributed as $\chi^2(n-k)$.

Proof. $(n-k)\hat\sigma^2_{ub}$ can be written as $u'\left[I - X(X'X)^{-1}X'\right]u$. Since $I - X(X'X)^{-1}X'$ is idempotent of rank $n-k$, Theorem 6 applies.
Theorem 17. $\hat\beta$ and $\hat\sigma^2_{ub}$ are independently distributed.

Proof. Multiplying together the matrix of the linear form, $(X'X)^{-1}X'$, and the matrix of the quadratic form, $I - X(X'X)^{-1}X'$, we obtain
$$(X'X)^{-1}X'\left[I - X(X'X)^{-1}X'\right] = 0,$$
which proves the theorem.
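The zero product in Theorem 17's proof can be checked directly; an illustrative sketch:

```python
import numpy as np

# The matrix of the linear form, (X'X)^{-1}X', has a zero product with the
# matrix of the quadratic form, I - X(X'X)^{-1}X', so Theorem 2 applies.
rng = np.random.default_rng(4)
n, k = 30, 3
X = rng.standard_normal((n, k))
G = np.linalg.solve(X.T @ X, X.T)   # (X'X)^{-1} X'
M = np.eye(n) - X @ G               # I - X(X'X)^{-1}X'

assert np.allclose(G @ M, 0, atol=1e-9)
```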
Theorem 18. (Markov) Given $Y = X\beta + u$, $E(u) = 0$, $E(uu') = \sigma^2I$, and $E(u'X) = 0$, the best, linear, unbiased estimate of $\beta$ is given by $\hat\beta = (X'X)^{-1}X'Y$.
Proof. First note that Assumption 2 is not used in this theorem. Let $C$ be any constant matrix of order $k \times n$ and define the linear estimator $\tilde\beta = CY$. Without loss of generality, let $C = (X'X)^{-1}X' + B$, so that
$$\tilde\beta = \left[(X'X)^{-1}X' + B\right]Y. \qquad (10)$$
Then unbiasedness requires that
$$E(\tilde\beta) = E\left\{\left[(X'X)^{-1}X' + B\right]Y\right\} = \beta + BX\beta = \beta.$$
For the last equality to hold, we require that $BX = 0$.

The covariance matrix of $\tilde\beta$ is
$$E(\tilde\beta - \beta)(\tilde\beta - \beta)' = E\left\{\left[(X'X)^{-1}X' + B\right](X\beta + u)(X\beta + u)'\left[X(X'X)^{-1} + B'\right]\right\} - \beta\beta',$$
where we have substituted for $\tilde\beta$ from Eq.(10) and for $Y$ we have substituted $X\beta + u$. Noting that any expression in the product with $BX$ in it is zero due to the requirement of unbiasedness, and any expression with a single $u$ becomes zero when we take expectations, the covariance matrix simplifies to $\sigma^2\left[(X'X)^{-1} + BB'\right]$. But $BB'$ is positive semidefinite; hence the covariance matrix of $\tilde\beta$ exceeds that of $\hat\beta$ by a positive semidefinite matrix. Hence the least squares estimator is best.
Theorem 19. Denoting the $i$th diagonal element of $(X'X)^{-1}$ by $(X'X)^{-1}_{ii}$, the quantity
$$\left[(\hat\beta_i - \beta_i)\Big/\left((X'X)^{-1}_{ii}\sigma^2\right)^{1/2}\right] \Big/ \left[\hat\sigma^2_{ub}/\sigma^2\right]^{1/2}$$
has the $t$ distribution with $n-k$ degrees of freedom.
Proof. $(\hat\beta_i - \beta_i)/\left[(X'X)^{-1}_{ii}\sigma^2\right]^{1/2}$ is distributed as $N(0,1)$ by Theorem 15. $\hat\sigma^2_{ub}(n-k)/\sigma^2$ is distributed as $\chi^2(n-k)$ by Theorem 16. The two are independently distributed by Theorem 17. But the ratio of an $N(0,1)$ variable to the square root of an independent $\chi^2$ variate, which has been first divided by its degrees of freedom, has the $t(n-k)$ distribution.
We next consider a test on a subset of the regression parameters. For this purpose we partition the regression model as
$$Y = X_1\beta_1 + X_2\beta_2 + u, \qquad (11)$$
where $X_1$ is $n \times (k-q)$ and $X_2$ is $n \times q$, and where $\beta_1$ and $\beta_2$ are $(k-q)$- and $q$-vectors respectively. We shall test a hypothesis about the vector $\beta_2$, leaving $\beta_1$ unrestricted by the hypothesis.
Define $S(\hat\beta_1, \hat\beta_2)$ as the sum of the squares of the regression deviations when the least-squares estimates for both $\beta_1$ and $\beta_2$ are used, and denote the sum of the squares of deviations when $\beta_2$ is assigned a fixed value $\tilde\beta_2$, and a least-squares estimate is obtained for $\beta_1$ as a function of the chosen $\tilde\beta_2$, by $S(\hat\beta_1(\tilde\beta_2), \tilde\beta_2)$.
Theorem 20. The quantity
$$\left[S(\hat\beta_1(\tilde\beta_2), \tilde\beta_2) - S(\hat\beta_1, \hat\beta_2)\right]/q \Big/ \left[S(\hat\beta_1, \hat\beta_2)/(n-k)\right]$$
is distributed as $F(q, n-k)$.
Proof. The least squares estimate for $\beta_1$, as a function of $\tilde\beta_2$, is
$$\hat\beta_1(\tilde\beta_2) = (X_1'X_1)^{-1}X_1'(Y - X_2\tilde\beta_2)$$
and the restricted sum of squares is
$$S(\hat\beta_1(\tilde\beta_2), \tilde\beta_2) = \left[Y - X_2\tilde\beta_2 - X_1(X_1'X_1)^{-1}X_1'(Y - X_2\tilde\beta_2)\right]'\left[Y - X_2\tilde\beta_2 - X_1(X_1'X_1)^{-1}X_1'(Y - X_2\tilde\beta_2)\right]$$
$$= (Y' - \tilde\beta_2'X_2')A_r(Y - X_2\tilde\beta_2) = (\beta_1'X_1' + u')A_r(X_1\beta_1 + u) = u'A_ru,$$
where $A_r = I - X_1(X_1'X_1)^{-1}X_1'$, and where we have used Eq.(11), with $\tilde\beta_2$ taken to be the true value of $\beta_2$, to replace $Y$. Since $S(\hat\beta_1, \hat\beta_2)$ is $u'A_uu$, where $A_u = I - X(X'X)^{-1}X'$, the test statistic can be written as
$$\frac{u'(A_r - A_u)u/q}{u'A_uu/(n-k)}. \qquad (12)$$
$A_r - A_u$ is idempotent, for
$$(A_r - A_u)^2 = A_r - A_rA_u - A_uA_r + A_u,$$
but
$$A_uA_r = \left[I - X(X'X)^{-1}X'\right]\left[I - X_1(X_1'X_1)^{-1}X_1'\right]$$
$$= I - X(X'X)^{-1}X' - X_1(X_1'X_1)^{-1}X_1' + X(X'X)^{-1}X'X_1(X_1'X_1)^{-1}X_1',$$
where the last matrix on the right can be written as
$$X(X'X)^{-1}X'X_1(X_1'X_1)^{-1}X_1' = X\begin{pmatrix}I\\0\end{pmatrix}(X_1'X_1)^{-1}X_1' = X_1(X_1'X_1)^{-1}X_1',$$
since $X_1 = X\begin{pmatrix}I\\0\end{pmatrix}$ and $X(X'X)^{-1}X'X = X$. Hence, $A_uA_r = A_u$ and, taking transposes, $A_rA_u = A_u$, so that $(A_r - A_u)^2 = A_r - A_u - A_u + A_u = A_r - A_u$, and $A_r - A_u$ is idempotent. It further follows immediately that $A_u(A_r - A_u) = 0$; hence the numerator and denominator are independently distributed. But the ratio of two independent $\chi^2$ variates, when divided by their respective degrees of freedom, has the indicated $F$-distribution.
Corollary 2. If $\beta_1$ is the scalar parameter representing the constant term, and $\beta_2$ the vector of the remaining $k-1$ parameters, the $F$-statistic for testing the null hypothesis $H_0: \beta_2 = 0$ is $\left[R^2/(k-1)\right]\big/\left[(1-R^2)/(n-k)\right]$, where $R^2$ is the coefficient of determination.
Proof. Define $y = Y - \bar Y$ and $x = X_2 - \bar X_2$, where $\bar Y$ is the vector containing for each element the sample mean of the elements of $Y$ and $\bar X_2$ is the matrix the $j$th column of which contains for each element the sample mean of the corresponding column of $X_2$. Then write the regression model in deviations-from-sample-means form as
$$y = x\beta_2 + v,$$
where $v$ represents the deviations of the error terms from their sample mean. The numerator of the test statistic in Theorem 20 is proportional to $S(\tilde\beta_2) - S(\hat\beta_2)$, where, for a fixed $\tilde\beta_2$,
$$S(\tilde\beta_2) = y'y + \tilde\beta_2'x'x\tilde\beta_2 - 2\tilde\beta_2'x'y \qquad (14)$$
and
$$S(\hat\beta_2) = (y - x\hat\beta_2)'(y - x\hat\beta_2) = y'y - \hat\beta_2'x'y - (y'x - \hat\beta_2'x'x)\hat\beta_2. \qquad (15)$$
Note that the parenthesized expression in the last equation is zero by the definition of $\hat\beta_2$ as the least squares estimate. Then
$$S(\tilde\beta_2) - S(\hat\beta_2) = \tilde\beta_2'x'x\tilde\beta_2 - 2\tilde\beta_2'x'y + \hat\beta_2'x'y$$
$$= \tilde\beta_2'x'x\tilde\beta_2 - 2\tilde\beta_2'x'y + \hat\beta_2'x'y + \hat\beta_2'x'x\hat\beta_2 - \hat\beta_2'x'x\hat\beta_2 \quad \text{(by adding and subtracting $\hat\beta_2'x'x\hat\beta_2$)}$$
$$= \tilde\beta_2'x'x\tilde\beta_2 - 2\tilde\beta_2'x'y + \hat\beta_2'x'x\hat\beta_2 \quad \text{(by noting that the third and fifth terms cancel)}$$
$$= \tilde\beta_2'x'x\tilde\beta_2 + \hat\beta_2'x'x\hat\beta_2 - 2\tilde\beta_2'x'x\hat\beta_2 \quad \text{(by replacing $x'y$ by $x'x\hat\beta_2$)}$$
$$= (\tilde\beta_2 - \hat\beta_2)'x'x(\tilde\beta_2 - \hat\beta_2). \qquad (16)$$
We also define
$$R^2 = 1 - \frac{S(\hat\beta_2)}{y'y}. \qquad (17)$$
Under $H_0$, $\tilde\beta_2 = 0$, and $S(\tilde\beta_2) - S(\hat\beta_2) = \hat\beta_2'x'y = y'yR^2$ from the first line of Eq.(16) by the definition of $R^2$ in Eq.(17). Combining the definition of $S(\hat\beta_2)$ with that of $R^2$ yields for the denominator $y'y - R^2y'y$.
Substituting these expressions in the definition of the $F$-statistic and cancelling out $y'y$ yields the result.

More generally, whenever the restricted model's regressor matrix can be written as $X_r = XC$ for some matrix $C$, the conditions of Theorem 7 are satisfied. For with $A_r = I - X_r(X_r'X_r)^{-1}X_r'$ and $A_u = I - X(X'X)^{-1}X'$,
$$A_rA_u = I - X_r(X_r'X_r)^{-1}X_r' - X(X'X)^{-1}X' + X_r(X_r'X_r)^{-1}X_r'X(X'X)^{-1}X'$$
$$= I - X_r(X_r'X_r)^{-1}X_r' - X(X'X)^{-1}X' + X_r(X_r'X_r)^{-1}C'X'$$
$$= I - X(X'X)^{-1}X' = A_u,$$
where we have replaced $X_r'$ in the first line by $C'X'$, noted that $C'X'X(X'X)^{-1}X' = C'X'$, and finally replaced $C'X'$ by $X_r'$, so that the second and fourth terms cancel.
The precise form of the test statistics in performing tests on subsets of regression coefficients depends on whether there are enough observations (degrees of freedom) to obtain least squares regression coefficients under the alternative hypothesis. We first consider the case of sufficient degrees of freedom.

Sufficient Degrees of Freedom.

Case 1: Test on a Subset of Coefficients in a Regression. Write the model as
$$Y = X_1\beta_1 + X_2\beta_2 + u = X\beta + u, \qquad (18)$$
where $X_1$ is $n \times k_1$ and $X_2$ is $n \times k_2$, and where we wish to test $H_0: \beta_2 = 0$. Under $H_0$, the model is
$$Y = X_1\beta_1 + u, \qquad (19)$$
and we can write
$$X_1 = X\begin{pmatrix}I\\0\end{pmatrix},$$
where the matrix on the right in parentheses is of order $(k_1 + k_2) \times k_1$. Hence the conditions of Theorem 7 are satisfied. Denote the residuals from Eqs.(18) and (19) respectively by $\hat u_u$ and $\hat u_r$. Then the $F$-test of Theorem 20 can also be written as
$$\frac{(\hat u_r'\hat u_r - \hat u_u'\hat u_u)/k_2}{\hat u_u'\hat u_u/(n - k_1 - k_2)}, \qquad (20)$$
which is distributed as $F(k_2, n - k_1 - k_2)$.
Since $\hat u_u = (I - X(X'X)^{-1}X')u$ and $\hat u_r = (I - X_1(X_1'X_1)^{-1}X_1')u$, the numerator is $u'(A_r - A_u)u/\mathrm{tr}(A_r - A_u)$, where the trace in question is $n - k_1 - (n - k_1 - k_2) = k_2$. By the same token, the denominator is $u'A_uu/(n - k_1 - k_2)$. But this is the same as the statistic (12).
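The equivalence of the residual form (20) and the quadratic-form version (12) can be confirmed numerically; an illustrative sketch with simulated data satisfying $H_0$:

```python
import numpy as np

# Under H0: beta2 = 0, the residual-based F statistic of Eq.(20) coincides with
# the quadratic-form statistic of Eq.(12), u'(Ar - Au)u / ... , exactly.
rng = np.random.default_rng(5)
n, k1, k2 = 40, 2, 3
X1 = np.column_stack([np.ones(n), rng.standard_normal((n, k1 - 1))])
X2 = rng.standard_normal((n, k2))
X = np.column_stack([X1, X2])
u = rng.standard_normal(n)
Y = X1 @ np.array([1.0, -2.0]) + u              # H0 true: Y = X1 beta1 + u

H = lambda Z: Z @ np.linalg.solve(Z.T @ Z, Z.T)  # projection ("hat") matrix
Au, Ar = np.eye(n) - H(X), np.eye(n) - H(X1)

uu = Au @ Y                                      # unrestricted residuals
ur = Ar @ Y                                      # restricted residuals
F_resid = ((ur @ ur - uu @ uu) / k2) / (uu @ uu / (n - k1 - k2))
F_quad = (u @ (Ar - Au) @ u / k2) / (u @ Au @ u / (n - k1 - k2))

assert np.isclose(F_resid, F_quad)
assert np.isclose(np.trace(Ar - Au), k2)
```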
Case 2: Equality of Regression Coefficients in Two Regressions. Let the model be given by
$$Y_i = X_i\beta_i + u_i, \qquad i = 1, 2, \qquad (21)$$
where $X_1$ is $n_1 \times k$ and $X_2$ is $n_2 \times k$, and where $k < \min(n_1, n_2)$. We test $H_0: \beta_1 = \beta_2$. The unrestricted model is then written as
$$Y = \begin{pmatrix}Y_1\\Y_2\end{pmatrix} = \begin{pmatrix}X_1 & 0\\0 & X_2\end{pmatrix}\begin{pmatrix}\beta_1\\\beta_2\end{pmatrix} + \begin{pmatrix}u_1\\u_2\end{pmatrix}, \qquad (22)$$
and the model restricted by the hypothesis can be written as
$$Y = \begin{pmatrix}X_1\\X_2\end{pmatrix}\beta_1 + u. \qquad (23)$$
We obviously have
$$X_r = X\begin{pmatrix}I\\I\end{pmatrix},$$
where the matrix in brackets on the right is of order $2k \times k$, and the conditions of Theorem 7 are satisfied. The traces of the restricted and unrestricted $A$-matrices are
$$\mathrm{tr}(A_r) = \mathrm{tr}\left[I - X_r(X_r'X_r)^{-1}X_r'\right] = \mathrm{tr}\left[I_{n_1+n_2} - \begin{pmatrix}X_1\\X_2\end{pmatrix}(X_1'X_1 + X_2'X_2)^{-1}\begin{pmatrix}X_1' & X_2'\end{pmatrix}\right] = n_1 + n_2 - k$$
tr(A
u
) = tr
_
I
_
X
1
0
0 X
2
_ _
X
1
X
1
0
0 X
2
X
2
_
1
_
X
1
0
0 X
2
_
_
= n
1
+n
2
2k.
Letting $\hat u_u$ and $\hat u_r$ denote, as before, the unrestricted and restricted residuals respectively, the test statistic becomes
$$\frac{(\hat u_r'\hat u_r - \hat u_u'\hat u_u)/k}{\hat u_u'\hat u_u/(n_1 + n_2 - 2k)},$$
which is distributed as $F(k, n_1 + n_2 - 2k)$.
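The trace bookkeeping for this Chow-type test can be verified numerically; a sketch with arbitrary simulated regressors:

```python
import numpy as np

# Case 2: tr(Ar) = n1 + n2 - k and tr(Au) = n1 + n2 - 2k, and Ar Au = Au,
# so the conditions of Theorem 7 hold.
rng = np.random.default_rng(6)
n1, n2, k = 20, 15, 3
X1 = rng.standard_normal((n1, k))
X2 = rng.standard_normal((n2, k))
Xr = np.vstack([X1, X2])                         # restricted regressors
Xu = np.block([[X1, np.zeros((n1, k))],
               [np.zeros((n2, k)), X2]])         # unrestricted regressors

H = lambda Z: Z @ np.linalg.solve(Z.T @ Z, Z.T)
Ar = np.eye(n1 + n2) - H(Xr)
Au = np.eye(n1 + n2) - H(Xu)

assert np.isclose(np.trace(Ar), n1 + n2 - k)
assert np.isclose(np.trace(Au), n1 + n2 - 2 * k)
assert np.allclose(Ar @ Au, Au, atol=1e-9)
```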
Case 3: Equality of Subsets of Coefficients in Two Regressions. Now write the model as
$$Y_i = X_i\beta_i + Z_i\gamma_i + u_i, \qquad i = 1, 2,$$
where $X_i$ is of order $n_i \times k_1$ and $Z_i$ is of order $n_i \times k_2$, with $k_1 + k_2 < \min(n_1, n_2)$. The hypothesis to be tested is $H_0: \beta_1 = \beta_2$.
The unrestricted and restricted models are, respectively,
$$Y = \begin{pmatrix}Y_1\\Y_2\end{pmatrix} = \begin{pmatrix}X_1 & 0 & Z_1 & 0\\0 & X_2 & 0 & Z_2\end{pmatrix}\begin{pmatrix}\beta_1\\\beta_2\\\gamma_1\\\gamma_2\end{pmatrix} + \begin{pmatrix}u_1\\u_2\end{pmatrix} = X\beta + u \qquad (24)$$
and
$$Y = \begin{pmatrix}X_1 & Z_1 & 0\\X_2 & 0 & Z_2\end{pmatrix}\begin{pmatrix}\beta_1\\\gamma_1\\\gamma_2\end{pmatrix} + u = X_r\beta_r + u. \qquad (25)$$
Clearly,
$$X_r = X\begin{pmatrix}I & 0 & 0\\I & 0 & 0\\0 & I & 0\\0 & 0 & I\end{pmatrix},$$
and hence the conditions of Theorem 7 are satisfied. Hence we can form the $F$-ratio as in Eq.(12) or (20), where the numerator degrees of freedom are $\mathrm{tr}(A_r) - \mathrm{tr}(A_u) = n_1 + n_2 - k_1 - 2k_2 - (n_1 + n_2 - 2k_1 - 2k_2) = k_1$ and the denominator degrees of freedom are $\mathrm{tr}(A_u) = n_1 + n_2 - 2k_1 - 2k_2$.
Insufficient Degrees of Freedom.

Case 4: Equality of Regression Coefficients in Two Regressions. This case is the same as Case 2, except we now assume that $n_2 \le k$. Denote by $\hat u_r$ the residuals from the restricted regression using the full set of $n_1 + n_2$ observations and let $\hat u_u$ denote the residuals from the regression using only the first $n_1$ observations. Then $\hat u_r = A_ru$, $\hat u_u = A_1u_1$, where $A_1 = I - X_1(X_1'X_1)^{-1}X_1'$. We can then also write $\hat u_u = \begin{pmatrix}A_1 & 0\end{pmatrix}u$, and $\hat u_u'\hat u_u = u'A_uu$, where
$$A_u = \begin{pmatrix}A_1 & 0\\0 & 0\end{pmatrix} \qquad (26)$$
is an $(n_1 + n_2) \times (n_1 + n_2)$ matrix. Since $X_1'A_1 = 0$, we have
$$X_r'A_u = \begin{pmatrix}X_1' & X_2'\end{pmatrix}\begin{pmatrix}A_1 & 0\\0 & 0\end{pmatrix} = 0.$$
Hence $A_rA_u = A_u$, and the conditions of Theorem 7 are satisfied. The relevant traces are $\mathrm{tr}(A_r) = n_1 + n_2 - k$ and $\mathrm{tr}(A_u) = n_1 - k$.
Case 5: Equality of Subsets of Coefficients in Two Regressions. This is the same case as Case 3, except that we now assume that $k_2 < n_2 \le k_1 + k_2$; hence there are not enough observations in the second part of the dataset to estimate a separate regression equation. As before, let $\hat u_r$ be the residuals from the restricted model, and $\hat u_u$ be the residuals from the regression on the first $n_1$ observations. Denote by $W_1$ the matrix $(X_1 \;\; Z_1)$. Again as before, $\hat u_r = A_ru$, $\hat u_u = A_1u_1 = \begin{pmatrix}A_1 & 0\end{pmatrix}u$, where $A_1 = I - W_1(W_1'W_1)^{-1}W_1'$. Defining $A_u$ as in Eq.(26), we again obtain $A_rA_u = A_u$ and the conditions of Theorem 7 are satisfied. The relevant traces then are $\mathrm{tr}(A_r) = n_1 + n_2 - k_1 - 2k_2$ and $\mathrm{tr}(A_u) = n_1 - k_1 - k_2$. Notice that the requirement that $\mathrm{tr}(A_r) - \mathrm{tr}(A_u)$ be positive is fulfilled if and only if $n_2 > k_2$, as we assumed.
Irrelevant Variables Included.

Consider the case in which
$$Y = X_1\beta_1 + u$$
is the true model, but in which the investigator mistakenly estimates the model
$$Y = X_1\beta_1 + X_2\beta_2 + u.$$
The estimated coefficient vector $\hat\beta' = (\hat\beta_1' \;\; \hat\beta_2')$ becomes
$$\hat\beta = \begin{pmatrix}(X_1'X_1) & (X_1'X_2)\\(X_2'X_1) & (X_2'X_2)\end{pmatrix}^{-1}\begin{pmatrix}X_1'\\X_2'\end{pmatrix}\left[X_1\beta_1 + u\right] = \begin{pmatrix}I\\0\end{pmatrix}\beta_1 + \begin{pmatrix}(X_1'X_1) & (X_1'X_2)\\(X_2'X_1) & (X_2'X_2)\end{pmatrix}^{-1}\begin{pmatrix}X_1'\\X_2'\end{pmatrix}u,$$
from which it follows that
$$E\begin{pmatrix}\hat\beta_1\\\hat\beta_2\end{pmatrix} = \begin{pmatrix}\beta_1\\0\end{pmatrix},$$
and hence the presence of irrelevant variables does not affect the unbiasedness of the regression parameter estimates. We next prove two lemmas needed for Theorem 22.
Lemma 1. If $A$ is $n \times k$, $n \ge k$, with rank $k$, then $A'A$ is nonsingular.

Proof. We can write
$$A'A = \begin{pmatrix}A_1' & A_2'\end{pmatrix}\begin{pmatrix}A_1\\A_2\end{pmatrix} = A_1'A_1 + A_2'A_2,$$
where the rows of $A$ are ordered so that $A_1$ is $k \times k$ and nonsingular, so that $A_1'A_1$ is of order and rank $k$. We can then write $C'A_1 = A_2$, where $C'$ is an $(n-k) \times k$ matrix. Then $A'A = A_1'\left[I + CC'\right]A_1$. The matrix $CC'$ is positive semidefinite; hence each eigenvalue of $I + CC'$ exceeds the corresponding eigenvalue of $CC'$ by unity and is positive. But then $A'A$, the product of nonsingular matrices, is nonsingular.

Lemma 2. Let $X = (X_1 \;\; X_2)$ be of full column rank, and define $M = I - X_1(X_1'X_1)^{-1}X_1'$. Then the matrix $(X_2'X_2) - (X_2'X_1)(X_1'X_1)^{-1}(X_1'X_2)$ is nonsingular.
Proof. Write $(X_2'X_2) - (X_2'X_1)(X_1'X_1)^{-1}(X_1'X_2)$ as $X_2'MX_2$. $M$ obviously has rank $\rho(M) = n - k_1$. Since $MX_1 = 0$, the columns of $X_1$ span the null-space of $M$. It follows that $\rho(MX_2) = k_2$, for if the rank of $MX_2$ were smaller than $k_2$, there would exist a vector $c_2 \ne 0$ such that $MX_2c_2 = 0$, and the vector $X_2c_2$ would lie in the null-space of $M$, and would therefore be spanned by the columns of $X_1$. But then we could write $X_2c_2 + X_1c_1 = 0$ for a vector $c_1 \ne 0$, which contradicts the assumption that the columns of $X$ are linearly independent.

But then it follows that $X_2'MX_2 = X_2'M'MX_2$ has rank $k_2$ by Lemma 1.
Theorem 22. The covariance matrix for $\hat\beta_1$ with irrelevant variables included exceeds the covariance matrix for the correctly specified model by a positive semidefinite matrix.

Proof. The covariance matrix for $\hat\beta_1$ in the incorrectly specified model is obviously the upper left-hand block of
$$\sigma^2\begin{pmatrix}(X_1'X_1) & (X_1'X_2)\\(X_2'X_1) & (X_2'X_2)\end{pmatrix}^{-1},$$
whereas the covariance matrix in the correctly specified model is $\sigma^2(X_1'X_1)^{-1}$.
Since the inverse of a partitioned matrix can be written as
$$\begin{pmatrix}A & B\\C & D\end{pmatrix}^{-1} = \begin{pmatrix}A^{-1}\left[I + B(D - CA^{-1}B)^{-1}CA^{-1}\right] & -A^{-1}B(D - CA^{-1}B)^{-1}\\-(D - CA^{-1}B)^{-1}CA^{-1} & (D - CA^{-1}B)^{-1}\end{pmatrix},$$
the required upper left-hand block of the covariance matrix in the misspecified model is
$$\sigma^2(X_1'X_1)^{-1}\left[I + (X_1'X_2)\left[(X_2'X_2) - (X_2'X_1)(X_1'X_1)^{-1}(X_1'X_2)\right]^{-1}(X_2'X_1)(X_1'X_1)^{-1}\right]$$
$$= \sigma^2\left\{(X_1'X_1)^{-1} + (X_1'X_1)^{-1}(X_1'X_2)\left[X_2'(I - X_1(X_1'X_1)^{-1}X_1')X_2\right]^{-1}(X_2'X_1)(X_1'X_1)^{-1}\right\}.$$
Subtracting $\sigma^2(X_1'X_1)^{-1}$, we obtain the difference between the two covariance matrices as
$$\sigma^2\left\{(X_1'X_1)^{-1}(X_1'X_2)\left[X_2'(I - X_1(X_1'X_1)^{-1}X_1')X_2\right]^{-1}(X_2'X_1)(X_1'X_1)^{-1}\right\}.$$
Since the matrix in square brackets is positive definite, its inverse exists and the matrix in braces is positive semidefinite.
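Theorem 22's conclusion can be observed numerically; an illustrative sketch (with $\sigma^2 = 1$ and arbitrary regressors):

```python
import numpy as np

# With an irrelevant X2 included, the covariance of beta1_hat (the upper-left
# block of (X'X)^{-1}) exceeds (X1'X1)^{-1} by a positive semidefinite matrix.
rng = np.random.default_rng(7)
n, k1, k2 = 60, 3, 2
X1 = rng.standard_normal((n, k1))
X2 = rng.standard_normal((n, k2))
X = np.column_stack([X1, X2])

V_wrong = np.linalg.inv(X.T @ X)[:k1, :k1]    # misspecified model, sigma^2 = 1
V_right = np.linalg.inv(X1.T @ X1)            # correctly specified model
diff = V_wrong - V_right

assert np.allclose(diff, diff.T)
assert np.all(np.linalg.eigvalsh(diff) > -1e-10)   # positive semidefinite
```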
Relevant Variables Omitted.

Consider the case in which the true relation is
$$Y = X_1\beta_1 + X_2\beta_2 + u, \qquad (27)$$
but in which the relation
$$Y = X_1\beta_1 + u \qquad (28)$$
is estimated. Then we have

Theorem 23. For the least squares estimator $\hat\beta_1$ we have $E(\hat\beta_1) = \beta_1 + (X_1'X_1)^{-1}(X_1'X_2)\beta_2$.
Proof. We have $\hat\beta_1 = (X_1'X_1)^{-1}X_1'Y = (X_1'X_1)^{-1}X_1'\left[X_1\beta_1 + X_2\beta_2 + u\right]$, and taking expectations leads to the result.
Estimation with Linear Restrictions.

Now consider the model
$$Y = X\beta + u, \qquad (29)$$
$$r = R\beta, \qquad (30)$$
where $r$ is $p \times 1$, $R$ is $p \times k$, $p < k$, and where the rank of $R$ is $\rho(R) = p$. We assume that the elements of $r$ and $R$ are known numbers; if the rank of $R$ were less than $p$, then some restrictions could be expressed as linear combinations of other restrictions and may be omitted. Minimizing the sum of the squares of the residuals subject to (30) requires forming the Lagrangian
$$V = (Y - X\beta)'(Y - X\beta) - \lambda'(R\beta - r), \qquad (31)$$
where $\lambda$ is a vector of Lagrange multipliers. Now denote by $\beta^*$ and $\lambda^*$ the estimates obtained by setting the partial derivatives of Eq.(31) equal to zero, and let $\hat\beta$ denote the least squares estimates without imposing the restrictions (30). Then we have
Theorem 24. $\beta^* = \hat\beta + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta)$ and $\lambda^* = 2\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta)$.
Proof. Setting the partial derivatives of Eq.(31) equal to zero yields
$$\frac{\partial V}{\partial \beta} = -2X'Y + 2(X'X)\beta^* - R'\lambda^* = 0 \qquad (32)$$
$$\frac{\partial V}{\partial \lambda} = R\beta^* - r = 0. \qquad (33)$$
Multiplying Eq.(32) on the left by $R(X'X)^{-1}$ yields
$$-2R(X'X)^{-1}X'Y + 2R\beta^* - R(X'X)^{-1}R'\lambda^* = 0$$
or
$$-2R\hat\beta + 2R\beta^* - R(X'X)^{-1}R'\lambda^* = 0. \qquad (34)$$
Since $\beta^*$ satisfies the constraints by definition, Eq.(34) yields
$$\lambda^* = 2\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta).$$
Substituting this result in Eq.(32) and solving for $\beta^*$ yields
$$\beta^* = \hat\beta + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta). \qquad (34a)$$
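The closed form (34a) can be checked numerically against the first-order conditions; a sketch with arbitrary data:

```python
import numpy as np

# Restricted least squares, Eq.(34a): the estimate satisfies R beta* = r
# exactly and makes the stationarity condition (32) hold.
rng = np.random.default_rng(8)
n, k, p = 30, 4, 2
X = rng.standard_normal((n, k))
Y = rng.standard_normal(n)
R = rng.standard_normal((p, k))
r = rng.standard_normal(p)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
S = R @ XtX_inv @ R.T
beta_star = beta_hat + XtX_inv @ R.T @ np.linalg.solve(S, r - R @ beta_hat)
lam = 2 * np.linalg.solve(S, r - R @ beta_hat)          # lambda* of Theorem 24

assert np.allclose(R @ beta_star, r)                    # constraint holds
grad = -2 * X.T @ Y + 2 * X.T @ X @ beta_star - R.T @ lam
assert np.allclose(grad, 0, atol=1e-7)                  # Eq.(32) satisfied
```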
Corollary 3. If $r - R\beta = 0$, then $E(\beta^*) = \beta$ and $E(\lambda^*) = 0$.

Proof. Substituting $X\beta + u$ for $Y$ and $(X'X)^{-1}X'Y$ for $\hat\beta$ in the expressions for $\beta^*$ and $\lambda^*$ in Eq.(34a) and taking expectations yields the result.
Define
$$A = I - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R. \qquad (35)$$
We then have
Theorem 25. The covariance matrix of $\beta^*$ is $\sigma^2A(X'X)^{-1}$.
Proof. Substituting $(X'X)^{-1}X'Y$ for $\hat\beta$ and $X\beta + u$ for $Y$ in $\beta^*$ (Eq.(34a)), we can write
$$\beta^* - E(\beta^*) = (X'X)^{-1}\left[X'u - R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'u\right]$$
$$= \left[I - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R\right](X'X)^{-1}X'u = A(X'X)^{-1}X'u. \qquad (36)$$
Multiplying Eq.(36) by its transpose and taking expectations yields
$$\mathrm{Cov}(\beta^*) = \sigma^2A(X'X)^{-1}A'$$
$$= \sigma^2\left[I - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R\right](X'X)^{-1}\left[I - R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}\right]$$
$$= \sigma^2\Big[(X'X)^{-1} - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1} - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}$$
$$\qquad + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}\Big]$$
$$= \sigma^2\left[(X'X)^{-1} - (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}\right] = \sigma^2A(X'X)^{-1}.$$
We now consider the test of the null hypothesis $H_0: R\beta = r$. For this purpose we construct an $F$-statistic as in Theorem 20 (see also Eq.(12)).
The minimum sum of squares subject to the restriction can be written as
$$S_r = \left[Y - X\left(\hat\beta + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta)\right)\right]'\left[Y - X\left(\hat\beta + (X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta)\right)\right]$$
$$= (Y - X\hat\beta)'(Y - X\hat\beta) - (Y - X\hat\beta)'X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta)$$
$$\quad - (r - R\hat\beta)'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'(Y - X\hat\beta)$$
$$\quad + (r - R\hat\beta)'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}(X'X)(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta)$$
$$= S_u + (r - R\hat\beta)'\left[R(X'X)^{-1}R'\right]^{-1}(r - R\hat\beta), \qquad (37)$$
where $S_u$ denotes the unrestricted minimal sum of squares, and where the disappearance of the second and third terms in the expansion is due to the fact that $X'(Y - X\hat\beta) = 0$ by the definition of $\hat\beta$. Substituting the least squares estimate for $\hat\beta$ in (37), we obtain
$$S_r - S_u = \left[r - R(X'X)^{-1}X'Y\right]'\left[R(X'X)^{-1}R'\right]^{-1}\left[r - R(X'X)^{-1}X'Y\right]$$
$$= \left[r - R\beta - R(X'X)^{-1}X'u\right]'\left[R(X'X)^{-1}R'\right]^{-1}\left[r - R\beta - R(X'X)^{-1}X'u\right]$$
$$= u'X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'u = u'B_1u, \qquad (38)$$
since under $H_0$, $r - R\beta = 0$. The matrix $B_1$ is idempotent and of rank $p$ because
$$X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}(X'X)(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'$$
$$= X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'$$
and
$$\mathrm{tr}(B_1) = \mathrm{tr}\left(X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\right)$$
$$= \mathrm{tr}\left(R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}(X'X)(X'X)^{-1}\right)$$
$$= \mathrm{tr}\left(\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}R'\right) = \mathrm{tr}(I_p) = p.$$
The matrix of the quadratic form $S_u$ is clearly $B_2 = I - X(X'X)^{-1}X'$, and
$$B_1B_2 = X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\left(I - X(X'X)^{-1}X'\right) = 0.$$
Hence
$$\frac{(S_r - S_u)/p}{S_u/(n-k)}$$
is distributed as $F(p, n-k)$.
We now turn to the case in which the covariance matrix of $u$ is $\Sigma$ and we wish to test the hypothesis $H_0: R\beta = r$. We first assume that $\Sigma$ is known. We then have
Theorem 26. If $u$ is distributed as $N(0, \Sigma)$, and if $\Sigma$ is known, then the Lagrange Multiplier, Wald, and likelihood ratio test statistics are identical.

Proof. The loglikelihood function is
$$\log L(\beta) = -\frac{n}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma^{-1}| - \frac{1}{2}(Y - X\beta)'\Sigma^{-1}(Y - X\beta),$$
where $|\Sigma^{-1}|$ denotes the determinant of $\Sigma^{-1}$, and the score vector is
$$\frac{\partial \log L}{\partial \beta} = X'\Sigma^{-1}(Y - X\beta).$$
By further differentiation, the Fisher information matrix is
$$I(\beta) = X'\Sigma^{-1}X.$$
The unrestricted maximum likelihood estimator for $\beta$ is obtained by setting the score vector equal to zero and solving, which yields
$$\hat\beta = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}Y.$$
Letting $\hat u$ denote the residuals $Y - X\hat\beta$, the loglikelihood can be written as
$$\log L(\hat\beta) = -\frac{n}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma^{-1}| - \frac{1}{2}\hat u'\Sigma^{-1}\hat u.$$
To obtain the estimates restricted by the linear relations $R\beta = r$, we form the Lagrangian
$$\mathcal{L}(\beta, \lambda) = \log L(\beta) + \lambda'(R\beta - r)$$
and set its partial derivatives equal to zero, which yields
$$\frac{\partial \mathcal{L}}{\partial \beta} = X'\Sigma^{-1}(Y - X\beta) + R'\lambda = 0$$
$$\frac{\partial \mathcal{L}}{\partial \lambda} = R\beta - r = 0. \qquad (39)$$
Multiply the first equation in (39) by $(X'\Sigma^{-1}X)^{-1}$, which yields
$$\beta^* = \hat\beta + (X'\Sigma^{-1}X)^{-1}R'\lambda^*.$$
Multiplying this further by $R$, and noting that $R\beta^* = r$, we obtain
$$\lambda^* = -\left[R(X'\Sigma^{-1}X)^{-1}R'\right]^{-1}(R\hat\beta - r) \qquad (40)$$
$$\beta^* = \hat\beta - (X'\Sigma^{-1}X)^{-1}R'\left[R(X'\Sigma^{-1}X)^{-1}R'\right]^{-1}(R\hat\beta - r). \qquad (41)$$
The loglikelihood, evaluated at $\beta^*$, is
$$\log L(\beta^*) = -\frac{n}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma^{-1}| - \frac{1}{2}u^{*\prime}\Sigma^{-1}u^*,$$
where $u^* = Y - X\beta^*$.
We now construct the test statistics. The Lagrange multiplier statistic is
$$LM = \left[\frac{\partial \log L}{\partial \beta}\bigg|_{\beta^*}\right]'I(\beta)^{-1}\left[\frac{\partial \log L}{\partial \beta}\bigg|_{\beta^*}\right] = u^{*\prime}\Sigma^{-1}X(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}u^* = \lambda^{*\prime}R(X'\Sigma^{-1}X)^{-1}R'\lambda^*. \qquad (42)$$
The Wald statistic is
$$W = (R\hat\beta - r)'\left[R(X'\Sigma^{-1}X)^{-1}R'\right]^{-1}(R\hat\beta - r), \qquad (43)$$
and since the covariance matrix of $(R\hat\beta - r)$ is $R(X'\Sigma^{-1}X)^{-1}R'$, $W$ can be written as
$$W = (R\hat\beta - r)'\left[R(X'\Sigma^{-1}X)^{-1}R'\right]^{-1}\left[R(X'\Sigma^{-1}X)^{-1}R'\right]\left[R(X'\Sigma^{-1}X)^{-1}R'\right]^{-1}(R\hat\beta - r)$$
$$= \lambda^{*\prime}R(X'\Sigma^{-1}X)^{-1}R'\lambda^* = LM,$$
where we have used the definition of $\lambda^*$ in (40). The likelihood ratio test statistic is
$$LR = 2\left[\log L(\hat\beta) - \log L(\beta^*)\right] = u^{*\prime}\Sigma^{-1}u^* - \hat u'\Sigma^{-1}\hat u. \qquad (44)$$
Since $\Sigma^{-1/2}u^* = \Sigma^{-1/2}(Y - X\beta^*)$, substituting in this for $\beta^*$ from its definition in (41), we obtain
$$\Sigma^{-1/2}u^* = \Sigma^{-1/2}\left[Y - X\hat\beta - X(X'\Sigma^{-1}X)^{-1}R'\lambda^*\right]. \qquad (45)$$
We multiply Eq.(45) by its transpose and note that terms with $(Y - X\hat\beta)'\Sigma^{-1}X$ vanish; hence we obtain
$$u^{*\prime}\Sigma^{-1}u^* = \hat u'\Sigma^{-1}\hat u + \lambda^{*\prime}R(X'\Sigma^{-1}X)^{-1}R'\lambda^*.$$
But the last term is the Lagrange multiplier test statistic from (42); hence comparing this with (44) yields $LR = LM$.
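Theorem 26 is easy to confirm numerically from the closed forms; an illustrative sketch with an arbitrary known $\Sigma$:

```python
import numpy as np

# With Sigma known, LM (42), W (43), and LR (44) coincide exactly.
rng = np.random.default_rng(9)
n, k, p = 25, 3, 1
X = rng.standard_normal((n, k))
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)        # known positive definite covariance
Si = np.linalg.inv(Sigma)
Y = rng.standard_normal(n)
R = rng.standard_normal((p, k))
r = np.zeros(p)

XSX = X.T @ Si @ X
beta_hat = np.linalg.solve(XSX, X.T @ Si @ Y)            # unrestricted GLS/ML
S = R @ np.linalg.inv(XSX) @ R.T
lam = -np.linalg.solve(S, R @ beta_hat - r)              # Eq.(40)
beta_star = beta_hat + np.linalg.inv(XSX) @ R.T @ lam    # Eq.(41)
u_star, u_hat = Y - X @ beta_star, Y - X @ beta_hat

LM = u_star @ Si @ X @ np.linalg.solve(XSX, X.T @ Si @ u_star)   # Eq.(42)
W = (R @ beta_hat - r) @ np.linalg.solve(S, R @ beta_hat - r)    # Eq.(43)
LR = u_star @ Si @ u_star - u_hat @ Si @ u_hat                   # Eq.(44)

assert np.isclose(LM, W) and np.isclose(W, LR)
```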
We now consider the case when $\Sigma$ is unknown, but is a smooth function of a $p$-element vector $\theta$, and denoted by $\Sigma(\theta)$. We then have

Theorem 27. If $u$ is normally distributed as $N(0, \Sigma(\theta))$, then $W \ge LR \ge LM$.
Proof. The loglikelihood is
$$\log L(\beta, \theta) = -\frac{n}{2}\log(2\pi) + \frac{1}{2}\log|\Sigma^{-1}(\theta)| - \frac{1}{2}(Y - X\beta)'\Sigma^{-1}(\theta)(Y - X\beta).$$
Denoting the unrestricted estimates by $(\hat\beta, \hat\theta)$ and the restricted estimates by $(\beta^*, \theta^*)$, as before, and in particular, denoting by $\hat\Sigma$ the matrix $\Sigma(\hat\theta)$ and by $\Sigma^*$ the matrix $\Sigma(\theta^*)$, the three test statistics can be written, in analogy with Eqs.(42) to (44), as
$$LM = u^{*\prime}\Sigma^{*-1}X(X'\Sigma^{*-1}X)^{-1}X'\Sigma^{*-1}u^*$$
$$W = (R\hat\beta - r)'\left[R(X'\hat\Sigma^{-1}X)^{-1}R'\right]^{-1}(R\hat\beta - r)$$
$$LR = 2\left[\log L(\hat\beta, \hat\theta) - \log L(\beta^*, \theta^*)\right].$$
Now define
$$LR(\theta^*) = 2\left[\log L(\hat\beta_u, \theta^*) - \log L(\beta^*, \theta^*)\right], \qquad (46)$$
where $\hat\beta_u$ is the unrestricted maximizer of $\log L(\beta, \theta^*)$, and
$$LR(\hat\theta) = 2\left[\log L(\hat\beta, \hat\theta) - \log L(\hat\beta_r, \hat\theta)\right], \qquad (47)$$
where $\hat\beta_r$ is the maximizer of $\log L(\beta, \hat\theta)$ subject to the restriction $R\beta - r = 0$. $LR(\theta^*)$ employs the same $\Sigma$ matrix as the $LM$ statistic; hence by the argument in Theorem 26,
$$LR(\theta^*) = LM.$$
It follows that
$$LR - LM = LR - LR(\theta^*) = 2\left[\log L(\hat\beta, \hat\theta) - \log L(\hat\beta_u, \theta^*)\right] \ge 0,$$
since the $\hat\beta$ and $\hat\theta$ estimates are unrestricted. We also note that $W$ and $LR(\hat\theta)$ use the same $\Sigma$; hence they are equal by Theorem 26. Then
$$W - LR = LR(\hat\theta) - LR = 2\left[\log L(\beta^*, \theta^*) - \log L(\hat\beta_r, \hat\theta)\right] \ge 0, \qquad (48)$$
since $(\hat\beta_r, \hat\theta)$ is a restricted estimate and the highest value of the likelihood that can be achieved with the restriction is $\log L(\beta^*, \theta^*)$. Hence $W \ge LR \ge LM$.
We now prove a matrix theorem that will be needed subsequently.

Theorem 28. If $\Sigma$ is symmetric and positive definite of order $p$, and if $H$ is of order $p \times q$, with $q \le p$, and if the rank of $H$ is $q$, then
$$\begin{pmatrix}\Sigma & H\\H' & 0\end{pmatrix}$$
is nonsingular.
Proof. First find a matrix, conformable with the first,
$$\begin{pmatrix}P & Q\\Q' & R\end{pmatrix}$$
such that
$$\begin{pmatrix}\Sigma & H\\H' & 0\end{pmatrix}\begin{pmatrix}P & Q\\Q' & R\end{pmatrix} = \begin{pmatrix}I & 0\\0 & I\end{pmatrix}.$$
Performing the multiplication and equating the two sides, we obtain
$$\Sigma P + HQ' = I \qquad (49)$$
$$\Sigma Q + HR = 0 \qquad (50)$$
$$H'P = 0 \qquad (51)$$
$$H'Q = I \qquad (52)$$
From (49) we have
$$P + \Sigma^{-1}HQ' = \Sigma^{-1}. \qquad (53)$$
Multiplying Eq.(53) on the left by $H'$ and using $H'P = 0$, we have
$$H'\Sigma^{-1}HQ' = H'\Sigma^{-1}. \qquad (54)$$
Since $H$ is of full rank, $H'\Sigma^{-1}H$ is nonsingular by a straightforward extension of Lemma 1. Then
$$Q' = (H'\Sigma^{-1}H)^{-1}H'\Sigma^{-1}, \qquad (55)$$
which gives us the value of $Q$. Substituting (55) in Eq.(53) gives
$$P = \Sigma^{-1} - \Sigma^{-1}H(H'\Sigma^{-1}H)^{-1}H'\Sigma^{-1}. \qquad (56)$$
From Eq.(50) we have
$$\Sigma^{-1}HR = -Q,$$
and multiplying this by $H'$ yields $H'\Sigma^{-1}HR = -H'Q = -I$, and
$$R = -(H'\Sigma^{-1}H)^{-1}, \qquad (57)$$
which determines the value of $R$. Since the matrix
$$\begin{pmatrix}P & Q\\Q' & R\end{pmatrix}$$
is obviously the inverse of the matrix in the theorem, the proof is complete.
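The block formulas (55)-(57) can be verified numerically; an illustrative sketch with a randomly generated $\Sigma$ and $H$:

```python
import numpy as np

# The bordered matrix [[Sigma, H], [H', 0]] of Theorem 28 is nonsingular, with
# inverse blocks P (56), Q (55), and R (57).
rng = np.random.default_rng(10)
p, q = 4, 2
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)        # symmetric positive definite
H = rng.standard_normal((p, q))        # full column rank (almost surely)

M = np.block([[Sigma, H], [H.T, np.zeros((q, q))]])
Si = np.linalg.inv(Sigma)
HSH_inv = np.linalg.inv(H.T @ Si @ H)
Q = Si @ H @ HSH_inv                   # from Eq.(55)
P = Si - Si @ H @ HSH_inv @ H.T @ Si   # Eq.(56)
R = -HSH_inv                           # Eq.(57)
Minv = np.block([[P, Q], [Q.T, R]])

assert np.allclose(M @ Minv, np.eye(p + q), atol=1e-9)
```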
We now consider the regression model $Y = X\beta + u$, where $u$ is distributed as $N(0, \sigma^2I)$, subject to the restrictions $R\beta = 0$; hence this is the same model as considered before with $r = 0$. Minimize the sum of squares subject to the restrictions by forming the Lagrangian
$$L = (Y - X\beta)'(Y - X\beta) + \lambda'R\beta. \qquad (58)$$
The first order conditions can be written as
$$\begin{pmatrix}X'X & R'\\R & 0\end{pmatrix}\begin{pmatrix}\beta^*\\\lambda^*\end{pmatrix} = \begin{pmatrix}X'Y\\0\end{pmatrix}, \qquad (59)$$
where the factor 2 arising from differentiation has been absorbed into $\lambda^*$.
Denote the matrix on the left hand side of (59) by A, and write its inverse as
A
1
=
_
P Q
Q
S
_
. (60)
We can then write the estimates as
$$\begin{pmatrix} \tilde\beta \\ \tilde\lambda \end{pmatrix} = \begin{pmatrix} PX'Y \\ Q'X'Y \end{pmatrix}, \tag{61}$$
and taking expectations, we have
$$E\begin{pmatrix} \tilde\beta \\ \tilde\lambda \end{pmatrix} = \begin{pmatrix} PX'X\beta \\ Q'X'X\beta \end{pmatrix}. \tag{62}$$
From multiplying out $A^{-1}A$ we obtain
$$PX'X + QR = I \tag{63}$$
$$Q'X'X + SR = 0 \tag{64}$$
$$PR' = 0 \tag{65}$$
$$Q'R' = I \tag{66}$$
Hence we can rewrite Eq.(62) as
$$E\begin{pmatrix} \tilde\beta \\ \tilde\lambda \end{pmatrix} = \begin{pmatrix} (I - QR)\beta \\ -SR\beta \end{pmatrix} = \begin{pmatrix} \beta \\ 0 \end{pmatrix}, \tag{67}$$
since $R\beta = 0$ by definition. This, so far, reproduces Corollary 3.
Theorem 29. Given the definition in Eq.(61), the covariance matrix of $(\tilde\beta', \tilde\lambda')'$ is
$$\sigma^2\begin{pmatrix} P & 0 \\ 0 & -S \end{pmatrix}.$$
Proof. It is straightforward to note that
$$\operatorname{cov}\begin{pmatrix} \tilde\beta \\ \tilde\lambda \end{pmatrix} = E\left[\begin{pmatrix} \tilde\beta - \beta \\ \tilde\lambda \end{pmatrix}\begin{pmatrix} \tilde\beta - \beta \\ \tilde\lambda \end{pmatrix}'\right] = \sigma^2\begin{pmatrix} PX'XP & PX'XQ \\ Q'X'XP & Q'X'XQ \end{pmatrix}. \tag{68}$$
From (65) and (66), multiplying the second row of $A$ into the first column of $A^{-1}$ gives
$$RP = 0,$$
and multiplying it into the second column gives
$$RQ = I.$$
Hence, multiplying Eq.(63) on the right by $P$ gives
$$PX'XP + QRP = P$$
or, since $RP = 0$,
$$PX'XP = P.$$
Multiplying Eq.(63) by $Q$ on the right gives
$$PX'XQ + QRQ = Q,$$
or, since $RQ = I$,
$$PX'XQ = 0.$$
Finally, multiplying (64) by $Q$ on the right gives
$$Q'X'XQ + SRQ = 0,$$
which implies that
$$Q'X'XQ = -S.$$
Substituting these results in Eq.(68) yields the covariance matrix asserted in the theorem.
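The bordered system (59) and the unbiasedness result (67) can be illustrated numerically. In the sketch below (NumPy; the design matrix, restrictions, and data are invented for illustration), solving (59) yields a $\tilde\beta$ that satisfies the restriction $R\tilde\beta = 0$ exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 60, 4
X = rng.standard_normal((n, k))
beta = np.array([1.0, -1.0, 0.5, -0.5])   # satisfies the restriction below
R = np.array([[1.0, 1.0, 0.0, 0.0],       # beta_1 + beta_2 = 0
              [0.0, 0.0, 1.0, 1.0]])      # beta_3 + beta_4 = 0
Y = X @ beta + 0.1 * rng.standard_normal(n)

q = R.shape[0]
# Bordered system (59): [[X'X, R'], [R, 0]] [beta_tilde; lam] = [X'Y; 0]
A = np.block([[X.T @ X, R.T], [R, np.zeros((q, q))]])
rhs = np.concatenate([X.T @ Y, np.zeros(q)])
sol = np.linalg.solve(A, rhs)
beta_tilde, lam = sol[:k], sol[k:]

assert np.allclose(R @ beta_tilde, 0)     # restriction holds exactly
```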
We now turn to large-sample estimation for the general unconstrained and constrained cases. We wish to estimate the parameters of the density function $f(x, \theta)$, where $x$ is a random variable and $\theta$ is a parameter vector with $k$ elements. In what follows, we denote the true value of $\theta$ by $\theta_0$. The loglikelihood is
$$\log L(x, \theta) = \sum_{i=1}^n \log f(x_i, \theta). \tag{69}$$
Let $\hat\theta$ be the maximum likelihood estimate, let $D$ denote differentiation with respect to $\theta$, and define the information matrix per observation, $I_1(\theta)$, by $\operatorname{cov}(D\log L(x, \theta)) = nI_1(\theta)$. Expanding in Taylor Series about $\theta_0$, we have
$$0 = D\log L(x, \hat\theta) = D\log L(x, \theta_0) + D^2\log L(x, \theta_0)(\hat\theta - \theta_0) + R(x, \theta_0, \hat\theta) \tag{70}$$
Theorem 30. If $\hat\theta$ is a consistent estimator and the third derivative of the loglikelihood function is bounded, then $\sqrt{n}(\hat\theta - \theta_0)$ is asymptotically distributed as $N(0, I_1(\theta_0)^{-1})$.

Proof. From Eq.(70) we have
$$\sqrt{n}(\hat\theta - \theta_0) = \frac{n^{-1/2}D\log L(x, \theta_0) + n^{-1/2}R(x, \theta_0, \hat\theta)}{-n^{-1}D^2\log L(x, \theta_0)} \tag{71}$$
where $R$ is a remainder term involving $(\hat\theta - \theta_0)^2$ and the third derivative $D^3\log L$. $D\log L(x, \theta_0)$ is a sum of $n$ terms, each of which has expectation 0 and variance $I_1(\theta_0)$; hence by the Central Limit Theorem, $n^{-1/2}D\log L(x, \theta_0)$ is asymptotically normally distributed with mean zero and variance equal to $(1/n)nI_1(\theta_0) = I_1(\theta_0)$. The remainder term converges in probability to zero. The denominator is $1/n$ times the sum of $n$ terms (the negatives of the second derivatives of the individual loglikelihood contributions), each of which has expectation equal to $I_1(\theta_0)$; hence the entire denominator has the same expectation and by the Weak Law of Large Numbers the denominator converges to this expectation. Hence $\sqrt{n}(\hat\theta - \theta_0)$ converges in distribution to a random variable which is $I_1(\theta_0)^{-1}$ times an $N(0, I_1(\theta_0))$ variable and hence is asymptotically distributed as $N(0, I_1(\theta_0)^{-1})$.
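Theorem 30 can be illustrated by simulation. For the exponential density $f(x, \theta) = \theta e^{-\theta x}$, the information per observation is $I_1(\theta) = 1/\theta^2$ and the MLE is $\hat\theta = 1/\bar x$, so $\sqrt{n}(\hat\theta - \theta_0)$ should be approximately $N(0, \theta_0^2)$. A small Monte Carlo sketch (NumPy; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, n, reps = 2.0, 400, 2000

# Exponential(theta): f(x) = theta * exp(-theta x); the MLE is 1 / xbar
samples = rng.exponential(scale=1.0 / theta0, size=(reps, n))
theta_hat = 1.0 / samples.mean(axis=1)

z = np.sqrt(n) * (theta_hat - theta0)
# The sample variance of z should be near I_1(theta0)^{-1} = theta0^2 = 4
print(z.mean(), z.var())
```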
We now consider the case when there are $p$ restrictions given by $h(\theta)' = (h_1(\theta), \ldots, h_p(\theta)) = 0$. Estimation subject to the restrictions requires forming the Lagrangian
$$G = \log L(x, \theta) - \lambda'h(\theta)$$
and setting its first partial derivatives equal to zero:
$$\begin{aligned} D\log L(x, \tilde\theta) - H\tilde\lambda &= 0 \\ h(\tilde\theta) &= 0 \end{aligned} \tag{72}$$
where $H$ is the $k \times p$ matrix of the derivatives of $h(\theta)$ with respect to $\theta$. Expanding in Taylor Series and neglecting the remainder term yields, asymptotically,
$$\begin{aligned} D\log L(x, \theta_0) + D^2\log L(x, \theta_0)(\tilde\theta - \theta_0) - H\tilde\lambda &= 0 \\ H'(\tilde\theta - \theta_0) &= 0 \end{aligned} \tag{73}$$
The matrix $H$ should be evaluated at $\tilde\theta$; however, writing $H(\tilde\theta) = H(\theta_0) + $ terms of order $(\tilde\theta - \theta_0)$, and noting that if the restrictions hold, $\tilde\theta$ will be near $\theta_0$ and $\tilde\lambda$ will be small, we may take $H$ to be evaluated at $\theta_0$.
Theorem 31. The vector
$$\begin{pmatrix} \sqrt{n}(\tilde\theta - \theta_0) \\ \tilde\lambda/\sqrt{n} \end{pmatrix}$$
is asymptotically normally distributed with mean zero and covariance matrix
$$\begin{pmatrix} P & 0 \\ 0 & -S \end{pmatrix}$$
where
$$\begin{pmatrix} I_1(\theta_0) & H \\ H' & 0 \end{pmatrix}^{-1} = \begin{pmatrix} P & Q \\ Q' & S \end{pmatrix}.$$
Proof. Dividing the first line of (73) by $\sqrt{n}$, we can write
$$\begin{pmatrix} -\frac{1}{n}D^2\log L(x, \theta_0) & H \\ H' & 0 \end{pmatrix}\begin{pmatrix} \sqrt{n}(\tilde\theta - \theta_0) \\ \frac{1}{\sqrt{n}}\tilde\lambda \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{n}}D\log L(x, \theta_0) \\ 0 \end{pmatrix}. \tag{74}$$
The upper left-hand element in the left-hand matrix converges in probability to $I_1(\theta_0)$ and the top element on the right hand side converges in distribution to $N(0, I_1(\theta_0))$. Thus, (74) can be written as
$$\begin{pmatrix} I_1(\theta_0) & H \\ H' & 0 \end{pmatrix}\begin{pmatrix} \sqrt{n}(\tilde\theta - \theta_0) \\ \frac{1}{\sqrt{n}}\tilde\lambda \end{pmatrix} = \begin{pmatrix} \frac{1}{\sqrt{n}}D\log L(x, \theta_0) \\ 0 \end{pmatrix}. \tag{75}$$
Eq.(75) is formally the same as Eq.(59); hence by Theorem 29,
$$\begin{pmatrix} \sqrt{n}(\tilde\theta - \theta_0) \\ \frac{1}{\sqrt{n}}\tilde\lambda \end{pmatrix}$$
is asymptotically normally distributed with mean zero and covariance matrix
$$\begin{pmatrix} P & 0 \\ 0 & -S \end{pmatrix}$$
where
$$\begin{pmatrix} I_1(\theta_0) & H \\ H' & 0 \end{pmatrix}^{-1} = \begin{pmatrix} P & Q \\ Q' & S \end{pmatrix}. \tag{76}$$
We now turn to the derivation of the asymptotic distribution of the likelihood ratio test statistic. As before, $\hat\theta$ denotes the unrestricted, and $\tilde\theta$ the restricted estimator.

Theorem 32. Under the assumptions that guarantee that $\sqrt{n}(\hat\theta - \theta_0)$ and $\sqrt{n}(\tilde\theta - \theta_0)$ are asymptotically normally distributed with mean zero and covariance matrices $I_1(\theta_0)^{-1}$ and $P$ respectively, and if the null hypothesis $H_0: h(\theta) = 0$ is true, the likelihood ratio test statistic, $-2\log\lambda = 2(\log L(x, \hat\theta) - \log L(x, \tilde\theta))$, is asymptotically distributed as $\chi^2(p)$.
Proof. Expand $\log L(x, \tilde\theta)$ in Taylor Series about $\hat\theta$, which yields, to an approximation,
$$\log L(x, \tilde\theta) = \log L(x, \hat\theta) + D\log L(x, \hat\theta)'(\tilde\theta - \hat\theta) + \frac{1}{2}(\tilde\theta - \hat\theta)'D^2\log L(x, \hat\theta)(\tilde\theta - \hat\theta). \tag{77}$$
Since the second term on the right hand side is zero by definition, the likelihood ratio test statistic becomes
$$-2\log\lambda = -(\tilde\theta - \hat\theta)'D^2\log L(x, \hat\theta)(\tilde\theta - \hat\theta). \tag{78}$$
Let $v$ be a $k$-vector distributed as $N(0, I_1(\theta_0))$. Then we can write
$$\begin{aligned} \sqrt{n}(\hat\theta - \theta_0) &= I_1(\theta_0)^{-1}v \\ \sqrt{n}(\tilde\theta - \theta_0) &= Pv \end{aligned} \tag{79}$$
where $P$ is the same $P$ as in Eq.(76). Then, since $\sqrt{n}(\hat\theta - \tilde\theta) = (I_1(\theta_0)^{-1} - P)v$ and $-\frac{1}{n}D^2\log L(x, \hat\theta)$ converges in probability to $I_1(\theta_0)$, Eq.(78) gives, to an approximation,
$$-2\log\lambda = v'(I_1(\theta_0)^{-1} - P)'I_1(\theta_0)(I_1(\theta_0)^{-1} - P)v = v'\left(I_1(\theta_0)^{-1} - P - P + PI_1(\theta_0)P\right)v. \tag{80}$$
We next show that $P = PI_1(\theta_0)P$. From Eq.(56) we can write
$$P = I_1(\theta_0)^{-1} - I_1(\theta_0)^{-1}H(H'I_1(\theta_0)^{-1}H)^{-1}H'I_1(\theta_0)^{-1}. \tag{81}$$
Multiplying this on the left by $I_1(\theta_0)$ yields
$$I_1(\theta_0)P = I - H(H'I_1(\theta_0)^{-1}H)^{-1}H'I_1(\theta_0)^{-1},$$
and multiplying this on the left by $P$ (using the right-hand side of (81)) yields
$$\begin{aligned} PI_1(\theta_0)P = {}& I_1(\theta_0)^{-1} - I_1(\theta_0)^{-1}H\left[H'I_1(\theta_0)^{-1}H\right]^{-1}H'I_1(\theta_0)^{-1} \\ & - I_1(\theta_0)^{-1}H\left[H'I_1(\theta_0)^{-1}H\right]^{-1}H'I_1(\theta_0)^{-1} \\ & + I_1(\theta_0)^{-1}H\left[H'I_1(\theta_0)^{-1}H\right]^{-1}H'I_1(\theta_0)^{-1}H\left[H'I_1(\theta_0)^{-1}H\right]^{-1}H'I_1(\theta_0)^{-1} \\ = {}& P. \end{aligned} \tag{82}$$
Hence,
$$-2\log\lambda = v'\left(I_1(\theta_0)^{-1} - P\right)v. \tag{83}$$
Since $I_1(\theta_0)$ is symmetric and nonsingular, it can always be written as $I_1(\theta_0) = AA'$, where $A$ is a nonsingular matrix. Then, if $z$ is a $k$-vector distributed as $N(0, I)$, we can write
$$v = Az,$$
and $E(v) = 0$ and $\operatorname{cov}(v) = AA' = I_1(\theta_0)$ as required. Then
$$-2\log\lambda = z'A'\left(I_1(\theta_0)^{-1} - P\right)Az = z'A'I_1(\theta_0)^{-1}Az - z'A'PAz = z'A'(A')^{-1}A^{-1}Az - z'A'PAz = z'z - z'A'PAz = z'(I - A'PA)z. \tag{84}$$
Now $(A'PA)^2 = A'PAA'PA = A'PI_1(\theta_0)PA$, but from Eq.(82), $P = PI_1(\theta_0)P$; hence $A'PA$ is idempotent, and its rank is clearly the rank of $P$. But since the $k$ restricted estimates must satisfy $p$ independent restrictions, the rank of $P$ is $k - p$. Hence the rank of $I - A'PA$ is $k - (k - p) = p$, and since $I - A'PA$ is also idempotent, $-2\log\lambda$ is asymptotically distributed as $\chi^2(p)$.
We next turn to the Wald statistic. Expanding $h(\hat\theta) = h(\theta_0) + H'(\hat\theta - \theta_0)$, we have under the null hypothesis
$$h(\hat\theta) = H'(\hat\theta - \theta_0). \tag{85}$$
Since $\sqrt{n}(\hat\theta - \theta_0)$ is asymptotically distributed as $N(0, I_1(\theta_0)^{-1})$, $\sqrt{n}\,h(\hat\theta)$, which is asymptotically the same as $H'\sqrt{n}(\hat\theta - \theta_0)$, is asymptotically distributed as $N(0, H'I_1(\theta_0)^{-1}H)$. Hence the Wald statistic $n\,h(\hat\theta)'[\operatorname{cov}(h(\hat\theta))]^{-1}h(\hat\theta)$ becomes
$$W = n\,h(\hat\theta)'\left[H'I_1(\theta_0)^{-1}H\right]^{-1}h(\hat\theta). \tag{86}$$
Theorem 33. Under $H_0: h(\theta) = 0$, $W$ is asymptotically distributed as $\chi^2(p)$.

Proof. Write $I_1(\theta_0)^{-1} = AA'$ with $A$ nonsingular, and let $z$ be a $k$-vector distributed as $N(0, I)$, so that $Az$ has the asymptotic distribution of $\sqrt{n}(\hat\theta - \theta_0)$. Thus, when $h(\theta) = 0$,
$$\sqrt{n}\,h(\hat\theta) = H'\sqrt{n}(\hat\theta - \theta_0)$$
is asymptotically distributed as $H'Az$, and
$$W = z'A'H\left[H'I_1(\theta_0)^{-1}H\right]^{-1}H'Az, \tag{87}$$
which we obtain by substituting in Eq.(86) the asymptotic equivalent of $\sqrt{n}\,h(\hat\theta)$. The matrix $A'H[H'I_1(\theta_0)^{-1}H]^{-1}H'A$ is idempotent, since
$$A'H\left[H'I_1(\theta_0)^{-1}H\right]^{-1}H'AA'H\left[H'I_1(\theta_0)^{-1}H\right]^{-1}H'A = A'H\left[H'I_1(\theta_0)^{-1}H\right]^{-1}H'A,$$
where we have substituted $I_1(\theta_0)^{-1}$ for $AA'$. Since $I_1(\theta_0)^{-1}$ is of rank $k$ and $H$ is of rank $p$, the idempotent matrix has rank $p$; hence $W$ is asymptotically distributed as $\chi^2(p)$.

We finally consider the Lagrange Multiplier statistic
$$LM = \left[D\log L(x, \tilde\theta)\right]'\left[\operatorname{cov}(D\log L(x, \tilde\theta))\right]^{-1}\left[D\log L(x, \tilde\theta)\right]. \tag{88}$$
Theorem 34. Under the null hypothesis, $LM$ is asymptotically distributed as $\chi^2(p)$.

Proof. Expanding $D\log L(x, \tilde\theta)$ in Taylor Series, we have asymptotically
$$D\log L(x, \tilde\theta) = D\log L(x, \hat\theta) + D^2\log L(x, \hat\theta)(\tilde\theta - \hat\theta). \tag{89}$$
$D^2\log L(x, \hat\theta)$ converges in probability to $-nI_1(\theta_0)$, $D\log L(x, \hat\theta)$ is zero by the definition of $\hat\theta$, and hence $D\log L(x, \tilde\theta)$ converges in probability to $-nI_1(\theta_0)(\tilde\theta - \hat\theta)$. But asymptotically $\tilde\theta = \theta_0$ under the null; hence $n^{-1/2}D\log L(x, \tilde\theta)$ is asymptotically distributed as $N(0, I_1(\theta_0))$. Hence the appropriate test is
$$LM = n^{-1}\left[D\log L(x, \tilde\theta)\right]'I_1(\theta_0)^{-1}\left[D\log L(x, \tilde\theta)\right],$$
which, substituting from (89), is asymptotically
$$LM = n^{-1}\,n^2(\tilde\theta - \hat\theta)'I_1(\theta_0)I_1(\theta_0)^{-1}I_1(\theta_0)(\tilde\theta - \hat\theta) = n(\tilde\theta - \hat\theta)'I_1(\theta_0)(\tilde\theta - \hat\theta). \tag{90}$$
But this is the same as Eq.(78), the likelihood ratio statistic, since the term $-D^2\log L(x, \hat\theta)$ in Eq.(78) is asymptotically $nI_1(\theta_0)$. Since the likelihood ratio statistic has an asymptotic $\chi^2(p)$ distribution, so does the LM statistic.
We now illustrate the relationship among $W$, $LM$, and $LR$, and provide arguments for their asymptotic distributions in a slightly different way than before, with a regression model $Y = X\beta + u$, with $u$ distributed as $N(0, \sigma^2 I)$, and the restrictions $R\beta = r$. In that case the three basic statistics are
$$\begin{aligned} W &= (R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)/\hat\sigma^2 \\ LM &= (R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)/\tilde\sigma^2 \\ LR &= n\left(\log\tilde\sigma^2 - \log\hat\sigma^2\right) \end{aligned} \tag{91}$$
where $W$ is immediate from Eq.(45) when $\Omega$ is set equal to $\hat\sigma^2 I$, $LM$ follows by substituting (40) into (42) and setting $\tilde\Omega = \tilde\sigma^2 I$, and $LR = -2\log\lambda$ follows by substituting $\hat\sigma^2$ and $\tilde\sigma^2$, respectively, in the likelihood function and computing $-2\log\lambda$. The likelihood ratio itself can be written as
$$\lambda = \left[\frac{\hat u'\hat u/n}{\tilde u'\tilde u/n}\right]^{n/2} = \left[\frac{1}{1 + \frac{1}{n\hat\sigma^2}(R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)}\right]^{n/2} \tag{92}$$
where we have utilized Eq.(37) by dividing both sides by $S_u$ and taking the reciprocal. We can also rewrite the $F$-statistic
$$\frac{(S_r - S_u)/p}{S_u/(n-k)}$$
as
$$F = \frac{(R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)/p}{S_u/(n-k)}. \tag{93}$$
Comparing (92) and (93) yields immediately
$$\lambda = \left[\frac{1}{1 + \frac{p}{n-k}F}\right]^{n/2} \tag{94}$$
and comparing $W$ in (91) with (92) yields
$$\lambda = \left[\frac{1}{1 + W/n}\right]^{n/2}. \tag{95}$$
Equating (94) and (95) yields
$$\frac{W}{n} = \frac{p}{n-k}F$$
or
$$W = p\left(1 + \frac{k}{n-k}\right)F. \tag{96}$$
Although the left-hand side is asymptotically distributed as $\chi^2(p)$ and $F$ has the distribution of $F(p, n-k)$, the right hand side also has asymptotic distribution $\chi^2(p)$, since the quantity $pF(p, n-k)$ converges in distribution to that $\chi^2$ distribution.

Comparing the definitions of $LM$ and $W$ in (91) yields
$$LM = \left(\frac{\hat\sigma^2}{\tilde\sigma^2}\right)W \tag{97}$$
and from Eq.(37) we have
$$\tilde\sigma^2 = \hat\sigma^2(1 + W/n). \tag{98}$$
Hence, from (97) and (98) we deduce
$$LM = \frac{W}{1 + W/n}, \tag{99}$$
and using (96) we obtain
$$LM = \frac{p\left(\frac{n}{n-k}\right)F}{1 + \frac{p}{n}\cdot\frac{n}{n-k}F} = \frac{npF}{n - k + pF}, \tag{100}$$
which converges in distribution as $n \to \infty$ to $\chi^2(p)$. From (95) we obtain
$$-2\log\lambda = LR = n\log\left(1 + \frac{W}{n}\right). \tag{101}$$
Since for positive $z$, $e^z > 1 + z$, it follows that
$$\frac{LR}{n} = \log\left(1 + \frac{W}{n}\right) < \frac{W}{n}$$
and hence $W > LR$.

We next note that for $z \geq 0$, $\log(1 + z) \geq z/(1 + z)$, since (a) at the origin the left and right hand sides are equal, and (b) at all other values of $z$ the derivative of the left-hand side, $1/(1+z)$, is greater than the slope of the right-hand side, $1/(1+z)^2$. It follows that
$$\log\left(1 + \frac{W}{n}\right) \geq \frac{W/n}{1 + W/n}.$$
Using (99) and (101), this shows that $LR \geq LM$.
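The chain $W \geq LR \geq LM$ can be verified numerically: by (91) and (98) the three statistics are computable from the restricted and unrestricted sums of squares as $W = n(S_r - S_u)/S_u$, $LR = n\log(S_r/S_u)$, and $LM = n(S_r - S_u)/S_r$. A sketch (NumPy; invented data and an invented zero restriction):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 3
X = rng.standard_normal((n, k))
Y = X @ np.array([1.0, 0.3, 0.0]) + rng.standard_normal(n)

def ssr(X, Y):
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    e = Y - X @ b
    return e @ e

S_u = ssr(X, Y)              # unrestricted
S_r = ssr(X[:, :1], Y)       # restricted: beta_2 = beta_3 = 0

W  = n * (S_r - S_u) / S_u
LR = n * np.log(S_r / S_u)
LM = n * (S_r - S_u) / S_r

assert W >= LR >= LM
```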
Recursive Residuals.
Since least squares residuals are correlated, even when the true errors u are not, it is inappropriate
to use the least squares residuals for tests of the hypothesis that the true errors are uncorrelated. It
may therefore be useful to be able to construct residuals that are uncorrelated when the true errors
are. In order to develop the theory of uncorrelated residuals, we rst prove a matrix theorem.
Theorem 35 (Bartlett's). If $A$ is a nonsingular $n \times n$ matrix, if $u$ and $v$ are $n$-vectors, and if $B = A + uv'$, then
$$B^{-1} = A^{-1} - \frac{A^{-1}uv'A^{-1}}{1 + v'A^{-1}u}.$$
Proof. To show this, we verify that pre- or postmultiplying the above by $B$ yields an identity matrix. Thus, postmultiplying yields
$$\begin{aligned} B^{-1}B = I &= \left[A^{-1} - \frac{A^{-1}uv'A^{-1}}{1 + v'A^{-1}u}\right](A + uv') \\ &= I - \frac{A^{-1}uv'}{1 + v'A^{-1}u} + A^{-1}uv' - \frac{A^{-1}uv'A^{-1}uv'}{1 + v'A^{-1}u} \\ &= I + \frac{-A^{-1}uv' + A^{-1}uv' + A^{-1}uv'(v'A^{-1}u) - A^{-1}u(v'A^{-1}u)v'}{1 + v'A^{-1}u} \\ &= I. \end{aligned} \tag{102}$$
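Theorem 35 is immediate to verify numerically; the sketch below (NumPy, with random $A$, $u$, $v$; not part of the original notes) compares the identity against a direct inverse of the rank-one update:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.standard_normal((n, n)) + n * np.eye(n)   # comfortably nonsingular
u = rng.standard_normal(n)
v = rng.standard_normal(n)

Ainv = np.linalg.inv(A)
# Theorem 35 (Bartlett's identity): inverse of the rank-one update A + uv'
Binv = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1.0 + v @ Ainv @ u)

assert np.allclose(Binv, np.linalg.inv(A + np.outer(u, v)))
```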
We consider the standard regression model $Y = X\beta + u$, where $u$ is distributed as $N(0, \sigma^2 I)$ and where $X$ is $n \times k$ of rank $k$. Define $X_j$ to represent the first $j$ rows of the $X$-matrix, $Y_j$ the first $j$ rows of the $Y$-vector, $x_j'$ the $j$th row of $X$, and $y_j$ the $j$th element of $Y$. It follows from the definitions, for example, that
$$X_j = \begin{pmatrix} X_{j-1} \\ x_j' \end{pmatrix} \quad\text{and}\quad Y_j = \begin{pmatrix} Y_{j-1} \\ y_j \end{pmatrix}.$$
Define the regression coefficient estimate based on the first $j$ observations as
$$\beta_j = (X_j'X_j)^{-1}X_j'Y_j. \tag{103}$$
We then have the following

Theorem 36.
$$\beta_j = \beta_{j-1} + \frac{(X_{j-1}'X_{j-1})^{-1}x_j(y_j - x_j'\beta_{j-1})}{1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j}.$$
Proof. By Theorem 35,
$$(X_j'X_j)^{-1} = (X_{j-1}'X_{j-1})^{-1} - \frac{(X_{j-1}'X_{j-1})^{-1}x_jx_j'(X_{j-1}'X_{j-1})^{-1}}{1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j}.$$
We also have by definition that
$$X_j'Y_j = X_{j-1}'Y_{j-1} + x_jy_j.$$
Substituting this in Eq.(103) gives
$$\begin{aligned} \beta_j &= \left[(X_{j-1}'X_{j-1})^{-1} - \frac{(X_{j-1}'X_{j-1})^{-1}x_jx_j'(X_{j-1}'X_{j-1})^{-1}}{1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j}\right](X_{j-1}'Y_{j-1} + x_jy_j) \\ &= \beta_{j-1} + (X_{j-1}'X_{j-1})^{-1}x_jy_j - \frac{(X_{j-1}'X_{j-1})^{-1}x_jx_j'\left[(X_{j-1}'X_{j-1})^{-1}X_{j-1}'Y_{j-1}\right] + (X_{j-1}'X_{j-1})^{-1}x_jx_j'(X_{j-1}'X_{j-1})^{-1}x_jy_j}{1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j} \\ &= \beta_{j-1} + \frac{(X_{j-1}'X_{j-1})^{-1}x_j(y_j - x_j'\beta_{j-1})}{1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j}, \end{aligned}$$
where, in the second line, we bring the second and third terms over a common denominator and also note that the bracketed expression in the numerator is $\beta_{j-1}$ by definition.
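Theorem 36 gives an updating formula that avoids refitting the regression as each observation arrives. A numerical sketch (NumPy; invented data), starting from the first $k$ observations and updating one row at a time, reproduces full-sample OLS:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 30, 3
X = rng.standard_normal((n, k))
Y = X @ np.array([2.0, -1.0, 0.5]) + 0.2 * rng.standard_normal(n)

def ols(Xj, Yj):
    return np.linalg.solve(Xj.T @ Xj, Xj.T @ Yj)

# Start from the first k observations, then update one row at a time (Theorem 36)
b = ols(X[:k], Y[:k])
for j in range(k, n):
    Cinv_x = np.linalg.solve(X[:j].T @ X[:j], X[j])   # (X'_{j-1}X_{j-1})^{-1} x_j
    b = b + Cinv_x * (Y[j] - X[j] @ b) / (1.0 + X[j] @ Cinv_x)

assert np.allclose(b, ols(X, Y))
```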
First define
$$d_j = \left[1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j\right]^{1/2} \tag{104}$$
and also define the recursive residuals $\tilde u_j$ as
$$\tilde u_j = \frac{y_j - x_j'\beta_{j-1}}{d_j}, \qquad j = k+1, \ldots, n. \tag{105}$$
Hence, recursive residuals are defined only when the $\beta$'s can be estimated from at least $k$ observations, since for $j$ less than $k+1$, $X_{j-1}'X_{j-1}$ would be singular. Hence the vector $\tilde u$ can be written as
$$\tilde u = CY, \tag{106}$$
where
$$C = \begin{pmatrix} -\dfrac{x_{k+1}'(X_k'X_k)^{-1}X_k'}{d_{k+1}} & \dfrac{1}{d_{k+1}} & 0 & \cdots & 0 \\ -\dfrac{x_{k+2}'(X_{k+1}'X_{k+1})^{-1}X_{k+1}'}{d_{k+2}} & 0 & \dfrac{1}{d_{k+2}} & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ -\dfrac{x_n'(X_{n-1}'X_{n-1})^{-1}X_{n-1}'}{d_n} & 0 & 0 & \cdots & \dfrac{1}{d_n} \end{pmatrix}. \tag{107}$$
Since the matrix $X_j'$ has $j$ columns, the fractions that appear in the first column of $C$ are rows with increasingly more columns; hence the term denoted generally by $1/d_j$ occurs in columns of the $C$ matrix further and further to the right. Thus, the element $1/d_{k+1}$ is in column $k+1$, $1/d_{k+2}$ in column $k+2$, and so on. It is also clear that $C$ is an $(n-k) \times n$ matrix. We then have
Theorem 37. (1) $\tilde u$ is linear in $Y$; (2) $E(\tilde u) = 0$; (3) the covariance matrix of $\tilde u$ is scalar, i.e., $CC' = I_{n-k}$; (4) for all linear, unbiased residuals with a scalar covariance matrix, $\sum_{i=k+1}^n \tilde u_i^2 = \sum_{i=1}^n \hat u_i^2$, where $\hat u$ is the vector of ordinary least squares residuals.
Proof. (1) The linearity of $\tilde u$ in $Y$ is obvious from Eq.(106).

(2) It is easy to show that $CX = 0$ by multiplying Eq.(107) by $X$ on the right. Multiplying, for example, the $(p-k)$th row of $C$ $(p = k+1, \ldots, n)$ into $X$, we obtain
$$-\frac{x_p'(X_{p-1}'X_{p-1})^{-1}X_{p-1}'}{d_p}X_{p-1} + \frac{1}{d_p}x_p' = 0.$$
It then follows that $E(\tilde u) = E(CY) = E(C(X\beta + u)) = E(Cu) = 0$.
(3) Multiplying the $(p-k)$th row of $C$ into the $(p-k)$th column of $C'$, we obtain
$$\frac{1}{d_p^2} + \frac{x_p'(X_{p-1}'X_{p-1})^{-1}X_{p-1}'X_{p-1}(X_{p-1}'X_{p-1})^{-1}x_p}{d_p^2} = \frac{1 + x_p'(X_{p-1}'X_{p-1})^{-1}x_p}{d_p^2} = 1$$
by definition. Multiplying the $(p-k)$th row of $C$ into the $(s-k)$th column of $C'$, with $p < s$, we obtain
$$\left(-\frac{x_p'(X_{p-1}'X_{p-1})^{-1}X_{p-1}'}{d_p}\ \ \frac{1}{d_p}\ \ 0\ \cdots\ 0\right)\begin{pmatrix} X_{p-1} \\ x_p' \\ x_{p+1}' \\ \vdots \end{pmatrix}\left(-\frac{(X_{s-1}'X_{s-1})^{-1}x_s}{d_s}\right) = \left[\frac{x_p'}{d_pd_s} - \frac{x_p'}{d_pd_s}\right](X_{s-1}'X_{s-1})^{-1}x_s = 0.$$
(4) We first prove that $C'C = I - X(X'X)^{-1}X'$. Define $M = I - X(X'X)^{-1}X'$; then, since $X'C' = (CX)' = 0$,
$$MC' = (I - X(X'X)^{-1}X')C' = C'. \tag{108}$$
Hence,
$$(M - I)C' = 0. \tag{109}$$
But for any square matrix $A$ and any eigenvalue $\mu$ of $A$, if $(A - \mu I)w = 0$, then $w$ is an eigenvector of $A$. Since $M$ is idempotent, and by Theorem 5 the eigenvalues of $M$ are all zero or 1, the columns of $C'$ are eigenvectors of $M$ corresponding to the unit roots (which are $n - k$ in number, because the trace of $M$ is $n - k$).

Now let $G'$ denote the matrix whose columns are the eigenvectors corresponding to the zero roots, normalized so that the matrix of eigenvectors $[\,C' \ G'\,]$ is orthogonal:
$$[\,C' \ G'\,]\begin{pmatrix} C \\ G \end{pmatrix} = I.$$
Let $\Lambda$ denote the diagonal matrix of eigenvalues for some matrix $A$ and let $W$ be the matrix of its eigenvectors. Then $AW = W\Lambda$; applying this to the present case yields
$$M[\,C' \ G'\,] = [\,C' \ G'\,]\begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix} = [\,C' \ 0\,].$$
Hence
$$M = MI = M[\,C' \ G'\,]\begin{pmatrix} C \\ G \end{pmatrix} = [\,C' \ 0\,]\begin{pmatrix} C \\ G \end{pmatrix} = C'C.$$
But
$$\sum_{i=k+1}^n \tilde u_i^2 = Y'C'CY = Y'\left[I - X(X'X)^{-1}X'\right]Y = \sum_{i=1}^n \hat u_i^2.$$
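Part (4) of Theorem 37 can be checked directly: computing the recursive residuals from (104)–(105) and the ordinary least squares residuals on the same (invented) data, the two sums of squares agree. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 40, 3
X = rng.standard_normal((n, k))
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(n)

# Recursive residuals (105): u_tilde_j = (y_j - x_j' b_{j-1}) / d_j, j = k+1..n
u_tilde = []
for j in range(k, n):
    Xp, Yp = X[:j], Y[:j]
    b_prev = np.linalg.solve(Xp.T @ Xp, Xp.T @ Yp)
    d2 = 1.0 + X[j] @ np.linalg.solve(Xp.T @ Xp, X[j])
    u_tilde.append((Y[j] - X[j] @ b_prev) / np.sqrt(d2))
u_tilde = np.array(u_tilde)

# OLS residuals on the full sample
b = np.linalg.solve(X.T @ X, X.T @ Y)
u_hat = Y - X @ b

# Theorem 37(4): the two sums of squares agree
assert np.isclose(u_tilde @ u_tilde, u_hat @ u_hat)
```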
Now define $S_j$ by $S_j = (Y_j - X_j\beta_j)'(Y_j - X_j\beta_j)$; thus $S_j$ is the sum of the squares of the least squares residuals based on the first $j$ observations. We then have

Theorem 38. $S_j = S_{j-1} + \tilde u_j^2$.
Proof. We can write
$$\begin{aligned} S_j &= (Y_j - X_j\beta_j)'(Y_j - X_j\beta_j) = Y_j'\left(I - X_j(X_j'X_j)^{-1}X_j'\right)Y_j \\ &= Y_j'Y_j - Y_j'X_j(X_j'X_j)^{-1}X_j'X_j(X_j'X_j)^{-1}X_j'Y_j \\ &\qquad\text{(where we have multiplied by } X_j'X_j(X_j'X_j)^{-1} = I\text{)} \\ &= Y_j'Y_j - \beta_j'X_j'X_j\beta_j + 2\beta_{j-1}'\left(-X_j'Y_j + X_j'X_j\beta_j\right) \\ &\qquad\text{(where we replaced } (X_j'X_j)^{-1}X_j'Y_j \text{ by } \beta_j \text{ and where the third term has value equal to zero)} \\ &= Y_j'Y_j - \beta_j'X_j'X_j\beta_j - 2\beta_{j-1}'X_j'Y_j + 2\beta_{j-1}'X_j'X_j\beta_j + \beta_{j-1}'X_j'X_j\beta_{j-1} - \beta_{j-1}'X_j'X_j\beta_{j-1} \\ &\qquad\text{(where we have added and subtracted the last term)} \\ &= (Y_j - X_j\beta_{j-1})'(Y_j - X_j\beta_{j-1}) - (\beta_j - \beta_{j-1})'X_j'X_j(\beta_j - \beta_{j-1}). \end{aligned} \tag{110}$$
Using the definition of $X_j$ and $Y_j$ and the definition of regression coefficient estimates, we can also write
$$X_j'X_j\beta_j = X_j'Y_j = X_{j-1}'Y_{j-1} + x_jy_j = X_{j-1}'X_{j-1}\beta_{j-1} + x_jy_j = (X_j'X_j - x_jx_j')\beta_{j-1} + x_jy_j = X_j'X_j\beta_{j-1} + x_j(y_j - x_j'\beta_{j-1}),$$
and multiplying through by $(X_j'X_j)^{-1}$,
$$\beta_j = \beta_{j-1} + (X_j'X_j)^{-1}x_j(y_j - x_j'\beta_{j-1}). \tag{111}$$
Substituting from Eq.(111) for $\beta_j - \beta_{j-1}$ in Eq.(110), we obtain
$$S_j = S_{j-1} + (y_j - x_j'\beta_{j-1})^2 - x_j'(X_j'X_j)^{-1}x_j\,(y_j - x_j'\beta_{j-1})^2. \tag{112}$$
Finally, we substitute for $(X_j'X_j)^{-1}$ in Eq.(112) from Bartlett's Identity (Theorem 35), yielding
$$S_j = S_{j-1} + (y_j - x_j'\beta_{j-1})^2\,\frac{1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j - x_j'(X_{j-1}'X_{j-1})^{-1}x_j - \left(x_j'(X_{j-1}'X_{j-1})^{-1}x_j\right)^2 + \left(x_j'(X_{j-1}'X_{j-1})^{-1}x_j\right)^2}{1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j} = S_{j-1} + \frac{(y_j - x_j'\beta_{j-1})^2}{1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j},$$
from which the Theorem follows immediately, since $\tilde u_j$ is defined as
$$(y_j - x_j'\beta_{j-1})/\left[1 + x_j'(X_{j-1}'X_{j-1})^{-1}x_j\right]^{1/2}.$$
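Theorem 38 can likewise be verified step by step: each squared recursive residual is exactly the increment in the running sum of squared OLS residuals. A NumPy sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 20, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
Y = 1.0 + 0.5 * X[:, 1] + rng.standard_normal(n)

def S(j):
    # Sum of squared OLS residuals on the first j observations
    b = np.linalg.solve(X[:j].T @ X[:j], X[:j].T @ Y[:j])
    e = Y[:j] - X[:j] @ b
    return e @ e

total = 0.0
for j in range(k + 1, n + 1):           # 1-based observation index
    Xp = X[:j - 1]                      # X_{j-1}
    b_prev = np.linalg.solve(Xp.T @ Xp, Xp.T @ Y[:j - 1])
    d2 = 1.0 + X[j - 1] @ np.linalg.solve(Xp.T @ Xp, X[j - 1])
    u2 = (Y[j - 1] - X[j - 1] @ b_prev) ** 2 / d2   # squared recursive residual
    assert np.isclose(S(j), S(j - 1) + u2)          # Theorem 38
    total += u2
```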
We now briefly return to the case of testing the equality of regression coefficients in two regressions in the case of insufficient degrees of freedom (i.e., the Chow test). As in Case 4, on p. 13, the number of observations in the two data sets is $n_1$ and $n_2$ respectively. Denoting the sum of squares from the regression on the first $n_1$ observations by $\hat u_u'\hat u_u$ and the sum of squares using all $n_1 + n_2$ observations by $\hat u_r'\hat u_r$, where the $\hat u$'s are the ordinary (not recursive) least squares residuals, the test statistic can be written as
$$\frac{(\hat u_r'\hat u_r - \hat u_u'\hat u_u)/n_2}{\hat u_u'\hat u_u/(n_1 - k)}.$$
By Theorem 37, this can be written as
$$\frac{\left(\sum_{i=k+1}^{n_1+n_2}\tilde u_i^2 - \sum_{i=k+1}^{n_1}\tilde u_i^2\right)/n_2}{\sum_{i=k+1}^{n_1}\tilde u_i^2/(n_1 - k)} = \frac{\sum_{i=n_1+1}^{n_1+n_2}\tilde u_i^2/n_2}{\sum_{i=k+1}^{n_1}\tilde u_i^2/(n_1 - k)}.$$
It may be noted that the numerator and denominator share no value of $\tilde u_i$; since the $\tilde u$'s are independent, the numerator and denominator are independently distributed. Moreover, each $\tilde u_i$ has zero mean, is normally distributed, is independent of every other $\tilde u_j$, and has variance $\sigma^2$, since
$$E(\tilde u_i^2) = E\left[\frac{\left(x_i'(\beta - \beta_{i-1}) + u_i\right)^2}{1 + x_i'(X_{i-1}'X_{i-1})^{-1}x_i}\right] = \frac{x_i'E\left[(\beta - \beta_{i-1})(\beta - \beta_{i-1})'\right]x_i + E(u_i^2)}{1 + x_i'(X_{i-1}'X_{i-1})^{-1}x_i} = \sigma^2.$$
Hence, the ratio has an $F$ distribution, as argued earlier.
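The identity used above — that the numerator of the Chow statistic is the sum of the last $n_2$ squared recursive residuals — can be confirmed numerically (NumPy; invented data, with $n_2 < k$ so that a separate regression on the second sample is infeasible):

```python
import numpy as np

rng = np.random.default_rng(8)
n1, n2, k = 30, 3, 5                      # second sample too small for its own regression
n = n1 + n2
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
Y = X @ np.arange(1.0, k + 1.0) + rng.standard_normal(n)

def ssr(Xj, Yj):
    b = np.linalg.solve(Xj.T @ Xj, Xj.T @ Yj)
    e = Yj - Xj @ b
    return e @ e

S_u = ssr(X[:n1], Y[:n1])                 # first n1 observations
S_r = ssr(X, Y)                           # all n1 + n2 observations
F = ((S_r - S_u) / n2) / (S_u / (n1 - k))

# The numerator equals the sum of the last n2 squared recursive residuals
num = 0.0
for j in range(n1, n):
    Xp, Yp = X[:j], Y[:j]
    b_prev = np.linalg.solve(Xp.T @ Xp, Xp.T @ Yp)
    d2 = 1.0 + X[j] @ np.linalg.solve(Xp.T @ Xp, X[j])
    num += (Y[j] - X[j] @ b_prev) ** 2 / d2
assert np.isclose(S_r - S_u, num)
```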
Cusum of Squares Test. We consider a test of the hypothesis that a change in the true values of the regression coefficients occurred at some observation in a series of observations. For this purpose we define
$$Q_i = \frac{\sum_{j=k+1}^i \tilde u_j^2}{\sum_{j=k+1}^n \tilde u_j^2}, \tag{113}$$
where the $\tilde u_j$ represent the recursive residuals. We now have

Theorem 39. On the hypothesis that the values of the regression coefficients do not change, the random variable $1 - Q_i$ has a Beta distribution, and $E(Q_i) = (i - k)/(n - k)$.
Proof. From Eq.(113), we can write
$$Q_i^{-1} - 1 = \frac{\sum_{j=i+1}^n \tilde u_j^2}{\sum_{j=k+1}^i \tilde u_j^2}. \tag{114}$$
Since the numerator and denominator of Eq.(114) are sums of squares of iid normal variables with zero mean and constant variance, and since the numerator and denominator share no common $\tilde u_j$, the quantity
$$z = (Q_i^{-1} - 1)\frac{i - k}{n - i}$$
is distributed as $F(n - i, i - k)$. Consider the distribution of the random variable $w$ defined by
$$\frac{n - i}{i - k}\,z = \frac{w}{1 - w}.$$
Then the density of $w$ is the Beta density
$$\frac{\Gamma(\alpha + \beta + 2)}{\Gamma(\alpha + 1)\Gamma(\beta + 1)}w^{\alpha}(1 - w)^{\beta},$$
with $\alpha = -1 + (n - i)/2$ and $\beta = -1 + (i - k)/2$; note that $w = 1 - Q_i$. It follows that
$$E(1 - Q_i) = \frac{\alpha + 1}{\alpha + \beta + 2} = \frac{n - i}{n - k},$$
and
$$E(Q_i) = \frac{i - k}{n - k}. \tag{115}$$
Durbin (Biometrika, 1969, pp. 1–15) provides tables for constructing confidence bands for $Q_i$ of the form $E(Q_i) \pm c_0$.
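Theorem 39's mean, $E(Q_i) = (i - k)/(n - k)$, can be checked by Monte Carlo under stable coefficients. A NumPy sketch (all design choices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(9)
n, k, reps = 25, 2, 500
i = 10                                    # arbitrary interior index, k < i < n

X = np.column_stack([np.ones(n), rng.standard_normal(n)])

def rec_residuals(Y):
    out = []
    for j in range(k, n):
        Xp = X[:j]
        b = np.linalg.solve(Xp.T @ Xp, Xp.T @ Y[:j])
        d2 = 1.0 + X[j] @ np.linalg.solve(Xp.T @ Xp, X[j])
        out.append((Y[j] - X[j] @ b) / np.sqrt(d2))
    return np.array(out)

Q = []
for _ in range(reps):
    Y = X @ np.array([1.0, 0.5]) + rng.standard_normal(n)
    u = rec_residuals(Y)
    Q.append(np.sum(u[:i - k] ** 2) / np.sum(u ** 2))
Q = np.array(Q)

# Theorem 39: E(Q_i) = (i - k)/(n - k)
print(Q.mean(), (i - k) / (n - k))
```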