Statistics 3 Notes
x y x y x y x y x y x y
4.4 78 3.9 74 4.0 68 4.0 76 3.5 80 4.1 84
2.3 50 4.7 93 1.7 55 4.9 76 1.7 58 4.6 74
3.4 75 4.3 80 1.7 56 3.9 80 3.7 69 3.1 57
4.0 90 1.8 42 4.1 91 1.8 51 3.2 79 1.9 53
4.6 82 2.0 51 4.5 76 3.9 82 4.3 84 2.3 53
3.8 86 1.9 51 4.6 85 1.8 45 4.7 88 1.8 51
4.6 80 1.9 49 3.5 82 4.0 75 3.7 73 3.7 67
4.3 68 3.6 86 3.8 72 3.8 75 3.8 75 2.5 66
4.5 84 4.1 70 3.7 79 3.8 60 3.4 86
(Figure: scatter plot of y against x for the data above.)
(1) is a linear model for E(y|x), so ǫ denotes the spread or dispersion around this line, i.e., y = E(y|x) + ǫ. If we let g(x) = E(y|x), assuming g to be smooth, we could consider the approximation g(x) ≈ g(x0) + g′(x0)(x − x0) = β0 + β1x near a point x0, which again has the linear form.
MULTIPLE LINEAR REGRESSION MODEL
The response y is often influenced by more than one predictor variable. For
example, the yield of a crop may depend on the amount of nitrogen, potash,
and phosphate fertilizers used. These variables are controlled by the exper-
imenter, but the yield may also depend on uncontrollable variables such as
those associated with weather. A linear model relating the response y to
several predictors has the form y = β0 + β1x1 + β2x2 + · · · + βp−1xp−1 + ǫ.
Variable Selection or Screening. The emphasis is on determining the importance of each predictor variable in modeling the variation in y. The predictors that are associated with an important amount of variation in y are retained; those that contribute little are deleted.
Vector-matrix form of linear model.
Data is of the form: (yi , xi ), i = 1, 2, . . . , n, xi = (xi0 = 1, xi1 , . . . , xi(p−1) )′ .
The linear model is: yi = x′iβ + ǫi = β0 + β1xi1 + · · · + βp−1xi(p−1) + ǫi, i = 1, 2, . . . , n. Equivalently,

(y1, y2, . . . , yn)′ = [1 x11 · · · x1(p−1); 1 x21 · · · x2(p−1); . . . ; 1 xn1 · · · xn(p−1)] (β0, β1, . . . , βp−1)′ + (ǫ1, ǫ2, . . . , ǫn)′,

or y = Xβ + ǫ.
Multivariate Distributions
Example. (X1, X2)′ ∼ N((µ1, µ2)′, [σ1² ρσ1σ2; ρσ1σ2 σ2²]), −1 < ρ < 1, if

f(x1, x2) = 1/(2πσ1σ2√(1 − ρ²)) exp{ −1/(2(1 − ρ²)) [ ((x1 − µ1)/σ1)² − 2ρ((x1 − µ1)/σ1)((x2 − µ2)/σ2) + ((x2 − µ2)/σ2)² ] }.
Suppose Zp×p is p.s.d. w.p. 1. Then its spectral decomposition gives
Z = ΓDλ Γ′ , where Γ is orthogonal and Dλ is diagonal. Let λi (Z) = ith
diagonal element of Dλ , λ1 (Z) ≥ λ2 (Z) ≥ . . . ≥ λp (Z) ≥ 0 wp 1. What
about E(Z)? Is λi (E(Z)) = E(λi (Z))? No. However, E(Z) is p.s.d., so
λi (E(Z)) ≥ 0.
Suppose Xp×1 has mean µ and also E[(Xi − µi)(Xj − µj)] = Cov(Xi, Xj) = σij exists for all i, j, i.e., σii < ∞ for all i. Then the covariance matrix (or the variance-covariance matrix or the dispersion matrix) of X is defined as Cov(X) = Σ = E[(X − µ)(X − µ)′] = ((σij)).
Consider Xp×1 = (X1′, X2′)′ and Yq×1 = (Y1′, Y2′)′. Then

Cov(X, Y) = [Cov(X1, Y1) Cov(X1, Y2); Cov(X2, Y1) Cov(X2, Y2)] ≠ Cov(Y, X) = [Cov(Y1, X1) Cov(Y1, X2); Cov(Y2, X1) Cov(Y2, X2)].

Cov(X + Y) = Cov(X + Y, X + Y)
= Cov(X, X) + Cov(X, Y) + Cov(Y, X) + Cov(Y, Y)
= Cov(X) + Cov(Y) + Cov(X, Y) + Cov(X, Y)′
≠ Cov(X) + Cov(Y) + 2Cov(X, Y) in general, since Cov(X, Y) need not be symmetric.
The moment generating function (mgf) of X at α is defined as φX(α) = E(exp(α′X)). This uniquely determines the probability distribution of X. Note that φX((t1, 0)′) = E(exp(t1X1)) = φX1(t1). If X and Y are independent,
φX+Y (t) = E(exp(t′ (X + Y ))) = E(exp(t′ X) exp(t′ Y ))
= E(exp(t′ X))E(exp(t′ Y )) = φX (t)φY (t).
Theorem (Cramer-Wold device). If X is a random vector, its proba-
bility distribution is completely determined by the distribution of all linear
functions, α′ X, α ∈ Rp .
Proof. The mgf of α′X, for any α ∈ Rp, is φα′X(t) = E(exp(tα′X)). Suppose this is known for all α ∈ Rp. Now, for any α, note φX(α) = E(exp(α′X)) = φα′X(1), which is then known.
Remark. To define the joint multivariate distribution of a random vector,
it is enough to specify the distribution of all its linear functions.
Multivariate Normal Distribution
Definition. Xp×1 is p-variate normal if for every α ∈ Rp , the distribution
of α′ X is univariate normal.
Result. If X has the p-variate normal distribution, then both µ = E(X)
and Σ = Cov(X) exist and the distribution of X is determined by µ and Σ.
Proof. Let X = (X1 , . . . , Xp )′ . Then for each i, Xi = αi′ X where αi =
(0, . . . , 0, 1, 0, . . . , 0)′. Therefore, Xi = αi′X ∼ N(·, ·). Hence, E(Xi) = µi and Var(Xi) = σii exist. Also, since |σij| = |Cov(Xi, Xj)| ≤ √(σiiσjj), σij exists. Set µ = (µ1, . . . , µp)′ and Σ = ((σij)). Further, E(α′X) = α′µ and
V ar(α′ X) = α′ Σα, so
α′ X ∼ N (α′ µ, α′ Σα), for all α ∈ Rp .
Since {α′ X, α ∈ Rp } determine the distribution of X, µ and Σ suffice.
Notation: X ∼ Np (µ, Σ).
Result. If X ∼ Np (µ, Σ), then for any Ak×p , bk×1 ,
Y = AX + b ∼ Nk (Aµ + b, AΣA′ ).
Proof. Consider linear functions, α′ Y = α′ AX + α′ b = β ′ X + c, which
are univariate normal. Therefore Y is k-variate normal. E(Y ) = Aµ + b,
Cov(Y ) = Cov(AX) = AΣA′ .
Theorem. Xp×1 ∼ Np (µ, Σ) iff Xp×1 = Cp×r Zr×1 +µ where Z = (Z1 , . . . , Zr )′ ,
Zi i.i.d N (0, 1), Σ = CC ′ , r = rank(Σ) = rank(C).
Proof. if part: If X = CZ +µ and Z ∼ Nr (0, Ir ), then X ∼ Np (µ, CC ′ = Σ).
Z is multivariate normal since linear functions of Z are linear combinations of the Zi's, which are univariate normal (as can be shown using the change of variable (Jacobian) formula for joint densities, or using the mgf of the normal).
Only if: If X ∼ Np(µ, Σ) and rank(Σ) = r ≤ p, then consider the spectral decomposition Σ = H∆H′, where H is orthogonal and ∆ = [∆1 0; 0 0] with ∆1 = diag(δ1, . . . , δr), δi > 0. Now, X − µ ∼ N(0, Σ), and H′(X − µ) ∼ N(0, ∆). Write H′(X − µ) = (Y′, T′)′, with Y of length r and T of length p − r. Then

(Y′, T′)′ ∼ N(0, [∆1 0; 0 0]).

Therefore, T = 0 w.p. 1. Let Z = ∆1^{−1/2}Y. Then Z ∼ Nr(0, Ir), and, w.p. 1, H′(X − µ) = (∆1^{1/2}Z′, 0′)′. Further, w.p. 1,

X − µ = H(∆1^{1/2}Z′, 0′)′ = (H1 | H2)(∆1^{1/2}Z′, 0′)′ = H1∆1^{1/2}Z = CZ.

Also, CC′ = H1∆1^{1/2}∆1^{1/2}H1′ = H1∆1H1′ and

Σ = H∆H′ = (H1 | H2)[∆1 0; 0 0](H1 | H2)′ = H1∆1H1′.
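A small computational sketch (not part of the notes) of this construction: simulate X ∼ Np(µ, Σ) as X = CZ + µ with C = H1∆1^{1/2} obtained from the spectral decomposition of Σ; the values of µ and Σ below are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])

eigval, H = np.linalg.eigh(Sigma)          # Sigma = H diag(eigval) H'
keep = eigval > 1e-12                      # r = number of positive eigenvalues
C = H[:, keep] * np.sqrt(eigval[keep])     # C = H_1 Delta_1^{1/2}, so C C' = Sigma

n = 100000
Z = rng.standard_normal((C.shape[1], n))   # Z ~ N_r(0, I_r)
X = C @ Z + mu[:, None]                    # each column of X is a draw from N_p(mu, Sigma)

print(np.round(X.mean(axis=1), 2))         # close to mu
print(np.round(np.cov(X), 2))              # close to Sigma

The sample mean and sample covariance of the simulated columns should be close to µ and Σ, illustrating Σ = CC′.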
Recall that if Z1 ∼ N (0, 1), its mgf is φZ1 (t) = E(exp(tZ1 )) = exp(t2 /2).
Therefore, if Z ∼ Nr (0, Ir ) then
φZ(u) = E(exp(u′Z)) = E(exp(Σ_{j=1}^r ujZj)) = exp(Σ_{j=1}^r uj²/2) = exp(u′u/2).

Therefore, for X = CZ + µ with CC′ = Σ,

φX(t) = exp(t′µ + t′Σt/2),

since E(exp(t′X)) = E(exp(t′(CZ + µ))) = exp(t′µ)E(exp(t′CZ)) = exp(t′µ) exp(t′CC′t/2) = exp(t′µ + t′Σt/2).
Marginal and Conditional Distributions
Theorem. If X ∼ Np(µ, Σ), then the marginal distribution of any subset of k components of X is k-variate normal.
Proof. Partition as follows:

X = (X(1)′, X(2)′)′, µ = (µ(1)′, µ(2)′)′, Σ = [Σ11 Σ12; Σ12′ Σ22],

where X(1) and µ(1) are of length k. Note that X(1) = (Ik | 0)X ∼ Nk(µ(1), Σ11). Since marginals (without independence) do not determine the joint distribution, the converse is not true.
Example. Z ∼ N (0, 1) independent of U which takes values 1 and -1 with
equal probability. Then Y = UZ ∼ N(0, 1) since

P(Y ≤ y) = P(UZ ≤ y) = (1/2)P(Z ≤ y | U = 1) + (1/2)P(−Z ≤ y | U = −1) = (1/2)Φ(y) + (1/2)Φ(y) = Φ(y).

Therefore, (Z, Y) has a joint distribution under which the marginals are normal. However, it is not bivariate normal. Consider Z + Y = Z + UZ, which equals 2Z with probability 1/2 and 0 with probability 1/2. Since P(Z + Y = 0) = 1/2 (i.e., there is a point mass at 0), while on the other event Z + Y = 2Z ∼ N(0, 4), it cannot be normally distributed.
Result. Let Xp×1 = (X(1)′, X(2)′)′ ∼ Np((µ(1)′, µ(2)′)′, [Σ11 Σ12; Σ12′ Σ22]), where X(1) and µ(1) are of length k. Then X(1) and X(2) are independent iff Σ12 = 0.
Proof. Only if: Independence implies that Cov(X (1) , X (2) ) = Σ12 = 0.
If part: Suppose that Σ12 = 0. Then note that the mgf factors:

φX(t) = exp(t′µ + t′Σt/2) = exp(t1′µ(1) + t1′Σ11t1/2) exp(t2′µ(2) + t2′Σ22t2/2) = φX(1)(t1)φX(2)(t2), t = (t1′, t2′)′,

so X(1) and X(2) are independent.
Ex. Check for p = 2 to see if the above results agree with those of the
bivariate normal.
Theorem. Let X ∼ Np(µ, Σ), Σ > 0 (i.e., p.d.), and let

X = (X1′, X2′)′, µ = (µ1′, µ2′)′, Σ = [Σ11 Σ12; Σ21 Σ22],

where X1 and µ1 are of length k. Also, let Σ11.2 = Σ11 − Σ12Σ22^{−1}Σ21. Then Σ11.2 > 0 and,
(i) X1 − Σ12Σ22^{−1}X2 ∼ Nk(µ1 − Σ12Σ22^{−1}µ2, Σ11.2) and is independent of X2;
(ii) the conditional distribution of X1 given X2 is Nk(µ1 + Σ12Σ22^{−1}(X2 − µ2), Σ11.2).
Proof. (i) Let C = [Ik −Σ12Σ22^{−1}; 0 Ip−k]. Then

CX = ((X1 − Σ12Σ22^{−1}X2)′, X2′)′ ∼ Np(((µ1 − Σ12Σ22^{−1}µ2)′, µ2′)′, CΣC′),

CΣC′ = [Ik −Σ12Σ22^{−1}; 0 Ip−k][Σ11 Σ12; Σ21 Σ22][Ik 0; −Σ22^{−1}Σ21 Ip−k]
= [Σ11 − Σ12Σ22^{−1}Σ21 0; Σ21 Σ22][Ik 0; −Σ22^{−1}Σ21 Ip−k] = [Σ11.2 0; 0 Σ22].

Now, independence of X1 − Σ12Σ22^{−1}X2 and X2 follows from the fact that Cov(X1 − Σ12Σ22^{−1}X2, X2) = 0.
(ii) Note that X1 = (X1 − Σ12Σ22^{−1}X2) + Σ12Σ22^{−1}X2. Therefore, from the independence of these two parts, X1 | (X2 = x2) = Σ12Σ22^{−1}x2 + (X1 − Σ12Σ22^{−1}X2) ∼ N(Σ12Σ22^{−1}x2 + µ1 − Σ12Σ22^{−1}µ2, Σ11.2) = N(µ1 + Σ12Σ22^{−1}(x2 − µ2), Σ11.2).
Quadratic Forms.
Recall that, Y ′ AY is called a quadratic form of Y when Y is a random vector.
Result. If X ∼ Np (µ, Σ), Σ > 0, then (X − µ)′ Σ−1 (X − µ) ∼ χ2p .
Proof. Z = Σ^{−1/2}(X − µ) ∼ Np(0, Ip), i.e., Z1, Z2, . . . , Zp are i.i.d. N(0, 1). Therefore Z′Z = Σ_{i=1}^p Zi² ∼ χ²_p. Note that (X − µ)′Σ^{−1}(X − µ) = Z′Z.
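A quick Monte Carlo sketch of this result (illustration only; the choice of Σ and µ is arbitrary): the simulated quadratic forms should have mean close to p and variance close to 2p, as a χ²_p variable does.

import numpy as np

rng = np.random.default_rng(1)
p = 4
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)            # an arbitrary positive definite covariance
mu = np.arange(p, dtype=float)

n = 200000
X = rng.multivariate_normal(mu, Sigma, size=n)   # rows are draws from N_p(mu, Sigma)
D = X - mu
Q = np.einsum('ij,jk,ik->i', D, np.linalg.inv(Sigma), D)   # (X-mu)' Sigma^{-1} (X-mu)

print(Q.mean(), Q.var())   # close to p and 2p, the chi^2_p mean and variance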
Result. If X1, X2, . . . , Xn is a random sample from N(µ, σ²), then X̄ and S² = Σ_{i=1}^n (Xi − X̄)² are independent, X̄ ∼ N(µ, σ²/n) and S²/σ² ∼ χ²_{n−1}.
Now suppose X ∼ Np(0, Ip) and A is symmetric with spectral decomposition A = ΓDΓ′, Γ orthogonal, D = diag(d1, . . . , dp). Writing Y = Γ′X ∼ Np(0, Ip), X′AX = Y′DY = Σ_{i=1}^p diYi², where the di are eigenvalues of A and the Yi are i.i.d. N(0, 1). Therefore X′AX has a χ² distribution if each di = 1 or 0. Equivalently, X′AX ∼ χ² if A² = A, i.e., A is symmetric idempotent, i.e., A is an orthogonal projection matrix. The equivalence may be seen as follows. If d1 ≥ d2 ≥ . . . ≥ dp ≥ 0 are such that d1 = d2 = . . . = dr = 1 and dr+1 = . . . = dp = 0, then

A = Γ[Ir 0; 0 0]Γ′,   A² = Γ[Ir 0; 0 0]Γ′Γ[Ir 0; 0 0]Γ′ = A.
Conversely, if X′AX ∼ χ²_r, its mgf is

M_{X′AX}(t) = ∫_0^∞ exp(tu) exp(−u/2)u^{r/2−1}/(2^{r/2}Γ(r/2)) du = ∫_0^∞ exp(−(u/2)(1 − 2t))u^{r/2−1}/(2^{r/2}Γ(r/2)) du = (1 − 2t)^{−r/2}, for 1 − 2t > 0.

On the other hand, from the representation above,

M_{X′AX}(t) = E[exp(t Σ_{i=1}^p diYi²)] = E[Π_{i=1}^p exp(tdiYi²)] = Π_{i=1}^p E exp(tdiYi²) = Π_{i=1}^p (1 − 2tdi)^{−1/2}, for 1 − 2tdi > 0.

Now note that X′AX ∼ χ²_r implies X′AX > 0 w.p. 1, i.e., Σ_{i=1}^p diYi² > 0 w.p. 1, which in turn implies that di ≥ 0 for all i. (This is because, if dl < 0, since Yl² ∼ χ²_1 independently of Yi, i ≠ l, we would have Σ_{i=1}^p diYi² < 0 with positive probability.) Therefore, for t < min_i 1/(2di), equating the two mgf's, we have (1 − 2t)^{−r/2} = Π_{i=1}^p (1 − 2tdi)^{−1/2}, or (1 − 2t)^{r/2} = Π_{i=1}^p (1 − 2tdi)^{1/2}, or (1 − 2t)^r = Π_{i=1}^p (1 − 2tdi). Equality of two polynomials means that their roots must be the same. Check that r of the di's must be 1 and the rest 0. Thus the following result follows.
Result. X ′ AX ∼ χ2r iff A is a symmetric idempotent matrix or an orthogonal
projection matrix of rank r.
Result. Suppose Y ∼ Np (0, Ip ) and let Y ′ Y = Y ′ AY +Y ′ BY . If Y ′ AY ∼ χ2r ,
then Y ′ BY ∼ χ2p−r independent of Y ′ AY .
Proof. Note that Y ′ Y ∼ χ2p . Since Y ′ AY ∼ χ2r , A is symmetric idempotent
of rank r. Therefore, B = I − A is symmetric and B 2 = (I − A)2 =
I − 2A + A2 = I − A = B, so that B is idempotent also. Further, Rank(B)
= trace(B) = trace(I − A) = p − r. Therefore, Y ′ BY ∼ χ2p−r . Independence
is shown later.
Result. Let Y ∼ Np (0, Ip ) and let Q1 = Y ′ P1 Y , Q2 = Y ′ P2 Y , Q1 ∼ χ2r , and
Q2 ∼ χ2s . Then Q1 and Q2 are independent iff P1 P2 = 0.
Corollary. In the result before the above one, A(I − A) = 0, so Y ′ AY and
Y ′ (I − A)Y are independent.
Proof. P1 and P2 are symmetric idempotent. If P1 P2 = 0 then
Cov(P1 Y, P2 Y ) = 0 so that Q1 = (P1 Y )′ (P1 Y ) = Y ′ P12 Y = Y ′ P1 Y is in-
dependent of Q2 = (P2 Y )′ (P2 Y ) = Y ′ P2 Y . Conversely, if Q1 and Q2 are
independent χ2r and χ2s , then Q1 + Q2 ∼ χ2r+s . Since Q1 + Q2 = Y ′ (P1 + P2 )Y ,
P1 + P2 is symmetric idempotent. Hence, P1 + P2 = (P1 + P2 )2 = P12 + P22 +
P1 P2 + P2 P1 , implying P1 P2 + P2 P1 = 0. Multiplying by P1 on the left, we
get, P12 P2 +P1 P2 P1 = P1 P2 +P1 P2 P1 = 0 (∗). Similarly, multiplying by P1 on
the right yields, P1 P2 P1 + P2 P1 = 0. Subtracting, we get, P1 P2 − P2 P1 = 0.
Combining this with (∗) above, we get P1 P2 = 0.
Result. Let Q1 = Y ′ P1 Y , Q2 = Y ′ P2 Y , Y ∼ Np (0, Ip ). If Q1 ∼ χ2r ,
Q2 ∼ χ2s and Q1 − Q2 ≥ 0, then Q1 − Q2 and Q2 are independent, r ≥ s and
Q1 − Q2 ∼ χ2r−s .
Proof. P12 = P1 and P22 = P2 are symmetric idempotent. Q1 − Q2 ≥ 0
means that Y ′ (P1 −P2 )Y ≥ 0, hence P1 −P2 is p.s.d. Therefore, from Lemma
shown below, P1 − P2 is a projection matrix and also P1 P2 = P2 P1 = P2 .
Thus (P1 − P2 )P2 = 0. Also, Rank(P1 − P2 ) = tr(P1 − P2 ) = tr(P1 ) - tr(P2 )
= Rank(P1 ) - Rank(P2 ) = r − s. Hence, Q1 − Q2 = Y ′ (P1 − P2 )Y ∼ χ2r−s ,
and is independent of Q2 = Y ′ P2 Y ∼ χ2s .
Lemma. If P1 and P2 are projection matrices such that P1 − P2 is p.s.d.,
then (a) P1 P2 = P2 P1 = P2 and (b) P1 − P2 is also a projection matrix.
Proof. (a) If P1 x = 0, then 0 ≤ x′ (P1 − P2 )x = −x′ P2 x ≤ 0, implying
0 = x′ P2 x = x′ P22 x = (P2 x)′ P2 x, so P2 x = 0. Therefore, for any y,
P2 (I − P1 )y = 0 since P1 (I − P1 )y = 0 (Take x = (I − P1 )y.) Thus, for any
y, P2 P1 y = P2 y or P2 P1 = P2 , and so P2 = P2′ = (P2 P1 )′ = P1 P2 .
(b) (P1 − P2 )2 = P12 + P22 − P1 P2 − P2 P1 = P1 + P2 − P2 − P2 = P1 − P2 .
Result. Any orthogonal projection matrix (i.e., symmetric idempotent) is
p.s.d.
Proof. If P is a projection matrix, x′ P x = x′ P 2 x = (P x)′ P x ≥ 0.
Result. Let C be a symmetric p.s.d. matrix. If X ∼ Np (0, Ip ), then AX
and X ′ CX are independent iff AC = 0.
Proof. (i) If part: Since C is symmetric p.s.d., C = T T ′ . If AC = 0, then
AT T ′ = 0, so AT T ′ A′ = (AT )(AT )′ = 0 and hence AT = 0. Thus AX and
T ′ X are independent, so AX and (T ′ X)(T ′ X)′ = X ′ CX are independent.
(ii) Only if: If AX and X ′ CX are independent, then X ′ A′ AX and X ′ CX
are independent. But the mgf of X ′ BX for any B is E(exp(tX ′ BX)) =
|I − 2tB|−1/2 for an interval of values of t. Therefore, the joint mgf of X ′ CX
and X ′ A′ AX is |I − 2(t1 C + t2 A′ A)|−1/2 , but because of independence this
is given to be equal to |I − 2t1C|^{−1/2} |I − 2t2A′A|^{−1/2}.
Show that, for this to hold on an open set of (t1, t2), we must have CA′A = 0, implying CA′AC′ = (AC′)′(AC′) = 0, and thus AC′ = 0. But C′ = C, so AC = 0.
Lemma. If X ∼ Np (µ, Σ), then Cov(AX, X ′ CX) = 2AΣCµ.
Proof. Note that (X − µ)′C(X − µ) = X′CX + µ′Cµ − 2X′Cµ = X′CX − 2(X − µ)′Cµ − µ′Cµ, and E(X′CX) = tr(CΣ) + µ′Cµ. Therefore X′CX − E(X′CX) = X′CX − µ′Cµ − tr(CΣ) = (X − µ)′C(X − µ) + 2(X − µ)′Cµ − tr(CΣ). Hence,
Cov(AX, X ′ CX)
= E [(AX − Aµ)(X ′ CX − E(X ′ CX))]
= AE {(X − µ) [(X − µ)′ C(X − µ) + 2(X − µ)′ Cµ − tr(CΣ)]}
= 2AE {(X − µ)(X − µ)′ Cµ} − tr(CΣ)AE(X − µ)
+AE {(X − µ)(X − µ)′ C(X − µ)}
= 2AΣCµ,
since E(X − µ) = 0 and E{(X − µ)(X − µ)′C(X − µ)} = E{(X − µ) Σ_i Σ_j Cij(Xi − µi)(Xj − µj)} = 0. To prove this last equality, it is enough to show that E{(Xl − µl)(Xi − µi)(Xj − µj)} = 0 for all i, j, l. For this note:
(i) if i = j = l, E(Xi − µi)³ = 0.
(ii) if i = j ≠ l, E{(Xi − µi)²(Xl − µl)} = 0 since Xl − µl = (σil/σii)(Xi − µi) + ǫ, where ǫ ∼ N(0, ·) is independent of Xi, so this case reduces to (i).
(iii) if i, j and l are all different, the case reduces to (i) and (ii). Alternatively, consider Y = (Y1, Y2, Y3)′ ∼ N3(0, Σ). Then Y = Σ^{1/2}(Z1, Z2, Z3)′, where the Zi are i.i.d. N(0, 1). Then to show that E(Y1Y2Y3) = 0, simply note that Y1Y2Y3 is a linear combination of terms Zi³, Zi²Zj and Z1Z2Z3, all of which have expectation 0.
Loynes’ Lemma. If B is symmetric idempotent, Q is symmetric p.s.d. and
I − B − Q is p.s.d., then BQ = QB = 0.
Proof. Let x be any vector and y = Bx. Then y ′ By = y ′ B 2 x = y ′ Bx = y ′ y,
so y ′ (I −B −Q)y = −y ′ Qy ≤ 0. But I −B −Q is p.s.d., so y ′ (I −B −Q)y ≥ 0,
implying −y ′ Qy ≥ 0. Since Q is also p.s.d., we must have y ′ Qy = 0. (Note,
y is not arbitrary, but Bx for some x.) In addition, since Q is symmetric
p.s.d., Q = L′ L for some L, and hence y ′ Qy = y ′ L′ Ly = 0, implying Ly = 0.
Thus L′ Ly = Qy = QBx = 0 for all x. Therefore, QB = 0 and hence
(QB)′ = B ′ Q′ = BQ = 0.
Theorem. Suppose Xi are n × n symmetric matrices with rank ki, i = 1, 2, . . . , p. Let X = Σ_{i=1}^p Xi have rank k. (It is symmetric.) Then, of the conditions
(a) Xi idempotent for all i,
(b) XiXj = 0, i ≠ j,
(c) X idempotent,
(d) Σ_{i=1}^p ki = k,
it is true that
I. any two of (a), (b), and (c) imply all of (a), (b), (c) and (d)
II. (c) and (d) imply (a) and (b)
III. (c) and {X1 , . . . , Xp−1 idempotent, Xp p.s.d.} imply that Xp idempotent
and hence (a), and therefore (b) and (d).
Proof. I (i): Show (a) and (c) imply (b) and (d). For this, note, given (c), I − X is idempotent and hence p.s.d. Now, given (a), X − Xi − Xj = Σ_{r≠i,j} Xr is p.s.d., being the sum of p.s.d. matrices. Therefore, (I − X) + (X − Xi − Xj) = I − Xi − Xj is p.s.d., hence XiXj = 0 from Loynes' Lemma, i.e., (b). Also, given (c), Rank(X) = tr(X) = tr(ΣXi) = Σtr(Xi) = Σki, if (a) is also given, i.e., (d).
(ii): Show (b) and (c) imply (a) and (d). Let λ be an eigenvalue of X1 and u the corresponding eigenvector. Then X1u = λu. Either λ = 0, or, if λ ≠ 0, u = (1/λ)X1u. Therefore, for i ≠ 1, Xiu = (1/λ)XiX1u = 0 given (b). Therefore, given (b), Xu = X1u = λu, and so λ is an eigenvalue of X. But given (c), X is idempotent, and hence λ = 0 or 1. Therefore the eigenvalues of X1 are 0 or 1, i.e., X1 is idempotent. Similarly for the other Xi's, i.e., (a).
(iii): (a) and (b) together imply (c). (Note that then they imply (d) also, since (a) and (c) give (d).) Given (b) and (a), X² = (ΣXi)² = ΣXi² = ΣXi = X, which is (c).
II. Show (c) and (d) imply (a) and (b). Given (c), I − X is idempotent and
hence has rank n − k. Therefore rank of X − I is also n − k. i.e., X − I has
n − k linearly independent rows. i.e.,
(X − I)x = 0 has n − k linearly independent equations. Further,
X2x = 0 has k2 linearly independent equations,
. . .
Xpx = 0 has kp linearly independent equations.
Therefore the maximum number of linearly independent equations in the stacked system

[(X − I)′, X2′, . . . , Xp′]′ x = 0

is n − k + k2 + . . . + kp = n − k1.
Linear Models – Estimation
Note that

Σ_{i=1}^n (yi − µ)² = Σ_{i=1}^n (yi − ȳ)² + n(ȳ − µ)² ≥ Σ_{i=1}^n (yi − ȳ)²,

so the least squares estimate of µ in the model yi = µ + ǫi is µ̂LS = ȳ = 1′Y/(1′1). Then

Var(µ̂LS) = Cov(1′Y/(1′1)) = (1′/(1′1))Cov(Y)(1/(1′1)) = σ² 1′In1/(1′1)² = σ²/n.
Note, that µ̂LS is a linear unbiased estimate of µ. Suppose a′ Y is any linear
unbiased estimate of µ. Then E(a′ Y ) = µa′ 1 = µ for all µ implies that
a′ 1 = 1. What is the best linear unbiased estimator of µ (i.e., least MSE)?
Note,
V ar(a′ Y ) = Cov(a′ Y ) = a′ Cov(Y )a = σ 2 a′ a.
To minimize this we just need to find a such that a′1 = 1 and a′a is minimum. Simply note that a′a = Σ_{i=1}^n ai² and

(1/n)Σ_{i=1}^n ai² − (Σ_{i=1}^n ai/n)² ≥ 0 for all a, since Σ_{i=1}^n (ai − ā)² ≥ 0,

i.e., (1/n)Σ_{i=1}^n ai² − (1/n)² ≥ 0, or Σ_{i=1}^n ai² ≥ 1/n,

with equality iff ai = 1/n for all i. Therefore, µ̂LS is BLUE (Best Linear Unbiased Estimate) irrespective of the distribution of ǫ.
Linear models: Estimation
Data: (xi , yi ), i = 1, 2, . . . , n with multiple predictors or covariates of y.
so we take p < n. Skip bold face for vectors and matrices unless there is
ambiguity.
First task is to estimate β. Most common approach is to use least squares
(again, in the absence of distributional assumptions on ǫ). We want
min_{β∈R^p} Σ_{i=1}^n (yi − xi′β)² = min_{β∈R^p} ||ǫ||² = min_{β∈R^p} ||Y − Xβ||² = min_{θ∈MC(X)} ||Y − θ||²,

which is achieved by the orthogonal projection θ̂ of Y onto MC(X), characterized by
X ′ (Y − θ̂) = 0, or X ′ θ̂ = X ′ Y.
(Figure: y and its orthogonal projection ŷ = θ̂ onto MC(X) = {θ = Xb}.)
Full rank case. Rank(X) = p. Since the columns of X are linearly inde-
pendent, there exists a unique vector β̂ such that θ̂ = X β̂. (If the columns
of X are not linearly independent β̂ is not unique.) Therefore,
X ′ X β̂ = X ′ Y.
Since X has full column rank, X ′ X is nonsingular. Therefore,
β̂LS = (X ′ X)−1 X ′ Y
is unique. One could also use calculus for this derivation:
||Y − Xβ||2 = (Y − Xβ)′ (Y − Xβ) = Y ′ Y − 2β ′ X ′ Y + β ′ X ′ Xβ,
so differentiating it w.r.t. β:
−2X ′ Y + 2X ′ Xβ = 0, or X ′ X β̂ = X ′ Y.
Note that
θ̂ = X β̂ = X(X ′ X)−1 X ′ Y = P Y = Ŷ ,
where P is the projection matrix onto MC (X).
ǫ̂ = Y − Ŷ = Y − X β̂ = (I − P )Y = residuals.
ǫ̂′ǫ̂ = (Y − Xβ̂)′(Y − Xβ̂) = Y′Y − β̂′X′Y + β̂′(X′Xβ̂ − X′Y)
= Y′Y − β̂′X′Y = Y′Y − β̂′X′Xβ̂ = Y′(I − P)Y
= sum of squares of residuals (RSS) = Σ_{i=1}^n (yi − xi′β̂)².
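A minimal numpy sketch of these full-rank formulas (synthetic data, not from the notes): it computes β̂ from the normal equations and checks the two RSS expressions above.

import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])   # first column = intercept
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solves X'X b = X'Y
Y_hat = X @ beta_hat                           # = P Y
resid = Y - Y_hat
RSS = resid @ resid

print(beta_hat)
print(RSS, Y @ Y - beta_hat @ (X.T @ Y))       # the two RSS expressions agree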
For example, with X having rows (1, 1), (1, −1), (1, 2), so that X′X = [3 2; 2 6] and X′Y = (y1 + y2 + y3, y1 − y2 + 2y3)′, we obtain

β̂ = (θ̂1, θ̂2)′ = (X′X)^{−1}X′Y = (1/14)[6 −2; −2 3](y1 + y2 + y3, y1 − y2 + 2y3)′
= (1/14)(6(y1 + y2 + y3) − 2(y1 − y2 + 2y3), −2(y1 + y2 + y3) + 3(y1 − y2 + 2y3))′
= (1/14)(4y1 + 8y2 + 2y3, y1 − 5y2 + 4y3)′ = ((2/7)y1 + (4/7)y2 + (1/7)y3, (1/14)y1 − (5/14)y2 + (2/7)y3)′,

ǫ̂′ǫ̂ = Y′Y − β̂′X′Y = (y1² + y2² + y3²) − (1/14)(4y1 + 8y2 + 2y3)(y1 + y2 + y3) − (1/14)(y1 − 5y2 + 4y3)(y1 − y2 + 2y3).
Note that β̂ indeed minimizes ||Y − Xβ||²: writing Y − Xβ = (Y − Xβ̂) + (Xβ̂ − Xβ),

(Y − Xβ)′(Y − Xβ) = (Y − Xβ̂)′(Y − Xβ̂) + (Xβ̂ − Xβ)′(Xβ̂ − Xβ) + 2(Xβ̂ − Xβ)′(Y − Xβ̂),

and the cross term vanishes since

(Xβ̂ − Xβ)′(Y − Xβ̂) = (β̂ − β)′(X′Y − X′Xβ̂) = 0.

Therefore,

(Y − Xβ)′(Y − Xβ) ≥ (Y − Xβ̂)′(Y − Xβ̂),

with equality iff β̂ − β = 0, since X′X is p.d.
Properties of least squares estimates
If Y = Xβ + ǫ, with E(ǫ) = 0 and Cov(ǫ) = σ 2 In . then E(β̂) = β since
1
Result. In the model Y = Xβ + ǫ, with ǫ ∼ Nn(0, σ²In) and X of full column rank p, we have that RSS = (Y − Xβ̂)′(Y − Xβ̂) = Y′(I − P)Y ∼ σ²χ²_{n−p}.
Proof. (a) Since (I − P)X = 0 and I − P is symmetric idempotent of rank n − p,

Y′(I − P)Y = (Y − Xβ)′(I − P)(Y − Xβ) ∼ σ²χ²_{n−p}.
(b) Alternatively, note that Q = (Y − Xβ)′ (Y − Xβ) ∼ σ 2 χ2n . Now
Q = (Y − Xβ)′ (Y − Xβ)
= (Y − X β̂ + X β̂ − Xβ)′ (Y − X β̂ + X β̂ − Xβ)
= (Y − X β̂)′ (Y − X β̂) + (β̂ − β)′ X ′ X(β̂ − β)
= Q1 + Q2 ,
Design matrix X with less than full column rank
Consider the model,
yij = µ + αi + τj + ǫij , i = 1, 2, . . . , I; j = 1, 2, . . . , J,
for the response from the ith treatment in the jth block, say. This can be
put in the usual linear model form: Y = Xβ + ǫ as follows:
In matrix form, Y = Xβ + ǫ with

Y = (y11, y12, . . . , y1J, y21, . . . , y2J, . . . , yI1, . . . , yIJ)′, β = (µ, α1, α2, . . . , αI, τ1, τ2, . . . , τJ)′, ǫ = (ǫ11, . . . , ǫIJ)′,

and X the IJ × (1 + I + J) matrix whose row corresponding to yij is

(1, 0, . . . , 0, 1, 0, . . . , 0 | 0, . . . , 0, 1, 0, . . . , 0),

with a 1 in the column for αi and a 1 in the column for τj (so, e.g., the row for y11 is (1, 1, 0, . . . , 0, 1, 0, . . . , 0)).
Here, X does not have full column rank. For instance, the first column is
proportional to the sum of the rest. Thus X ′ X is singular, so the previous
discussion does not apply. β itself is not estimable, but what parametric
functions of β are estimable?
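A small sketch (not from the notes) that builds this two-way design matrix and verifies the rank deficiency just described; I and J below are arbitrary illustrative values.

import numpy as np

I, J = 3, 4
rows = []
for i in range(I):
    for j in range(J):
        alpha = np.zeros(I); alpha[i] = 1.0   # indicator of treatment i
        tau = np.zeros(J); tau[j] = 1.0       # indicator of block j
        rows.append(np.concatenate(([1.0], alpha, tau)))
X = np.array(rows)                            # shape (I*J, 1 + I + J)

print(X.shape, np.linalg.matrix_rank(X))      # rank is I + J - 1 < 1 + I + J

Since the treatment columns and the block columns each sum to the intercept column, X′X is singular, as stated above.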
Result. For any matrix A, the row space of A satisfies MC (A′ ) = MC (A′ A).
Proof. Ax = 0 implies A′ Ax = 0. Also, A′ Ax = 0 implies x′ A′ Ax = 0,
so Ax = 0. Therefore the null space of A and A′ A are the same. Consider
the orthogonal space and note Rank(A′ A) = Rank(A) = Rank(A′ ). Further,
since A′ Aa = A′ b where b = Aa, MC (A′ A) ⊂ MC (A′ ). Since the ranks (or
dimensions) are the same, the spaces must be the same.
Theorem. Let Y = θ + ǫ where θ = Xβ and Xn×p has rank r < p. Then
(i) minθ∈MC (X) ||Y − θ||2 is achieved (i.e., least squares is attained) when
θ̂ = X β̂ where β̂ is any solution of X ′ Xβ = X ′ Y ;
(ii) Y′Y − β̂′X′Y is unique, i.e., it is the same for every solution β̂ of X′Xβ = X′Y.
Proof. (i) X ′ Xβ = X ′ Y always has some solution (for β) since MC (X ′ X) =
MC (X ′ ). However, the solution is not unique since Rank(X ′ X) = r < p.
Let β̂ be any solution, and let θ̂ = X β̂. Then X ′ (Y − θ̂) = 0. However, given
Y ∈ Rn , the decomposition, Y = θ̂ ⊕ (Y − θ̂) where Y − θ̂ is orthogonal
to MC (X) is unique, and for such a θ̂, ||Y − θ||2 is minimized. We know
from previous discussion that minθ∈MC (X) ||Y − θ||2 is achieved with θ̂ = P Y
which is unique.
(ii) Note that Y′Y − β̂′X′Y = Y′Y − θ̂′Y = (Y − θ̂)′(Y − θ̂), since θ̂′(Y − θ̂) = 0. Also, (Y − θ̂)′(Y − θ̂) = ||Y − θ̂||² is the unique minimum.
Question. Earlier we could find β̂ directly. How do we find θ̂ now?
Projection matrices
From the theory of orthogonal projections, given Xn×p (i.e., p many n-
vectors), there exists Pn×n satisfying
(i) P x = x for all x ∈ MC (X), and
(ii) if ξ ∈ M⊥C (X), then P ξ = 0.
What are the properties of such a P ?
1. P is unique: Suppose P1 and P2 satisfy (i) and (ii). Let w ∈ Rn. Then w = Xa + b, b ∈ M⊥C(X). Then P1w = P1Xa + P1b = Xa = P2Xa + P2b = P2w, so P1 = P2.
2. P is idempotent: P²x = P(Px) = Px = x for all x ∈ MC(X), and P²ξ = P(Pξ) = P0 = 0 for all ξ ∈ M⊥C(X), so P² = P.
In − PΩ represents the orthogonal projection onto Ω⊥, i.e., Rn = Ω ⊕ Ω⊥, and for any y ∈ Rn we have y = PΩy ⊕ (I − PΩ)y.
If Pn×n is any symmetric idempotent matrix, it represents a projection onto MC(P): if y ∈ Rn, then y = Py + (I − P)y = u + v. Note

u′v = (Py)′(I − P)y = y′P(I − P)y = y′(P − P²)y = 0,

so that we get y = u ⊕ v, u ∈ MC(P), v ∈ M⊥C(P).
Question. Given X, how to find P such that MC (X) = MC (P )?
Result. If Ω = MC(X), then PΩ = X(X′X)−X′, where (X′X)− is any generalized inverse of X′X.
Definition. If Bm×n is any matrix, a generalized inverse of B is any n × m
matrix B − satisfying BB − B = B.
Existence. From the singular value decomposition of B, there exist orthogonal matrices Pm×m and Qn×n such that

Pm×m Bm×n Qn×n = ∆m×n = [D_{r×r} 0_{r×(n−r)}; 0_{(m−r)×r} 0_{(m−r)×(n−r)}],

where r = Rank(B). Define ∆−_{n×m} = [Dr^{−1} 0; 0 0] and let B− = Q∆−P. First,

∆∆−∆ = [Dr 0; 0 0][Dr^{−1} 0; 0 0][Dr 0; 0 0] = [Dr 0; 0 0] = ∆,

and hence BB−B = (P′∆Q′)(Q∆−P)(P′∆Q′) = P′∆∆−∆Q′ = P′∆Q′ = B.
For Bp×m with rank r < p and B = [B11 B12; B21 B22], where B11 (which is r × r of rank r) is nonsingular, if we take B− = [B11^{−1} 0; 0 0], then note that

BB−B = [B11 B12; B21 B22][B11^{−1} 0; 0 0][B11 B12; B21 B22] = [Ir 0; B21B11^{−1} 0][B11 B12; B21 B22] = [B11 B12; B21 B21B11^{−1}B12].

Now note that (B21 | B22) is a linear function of (B11 | B12), i.e., (B21 | B22) = K(B11 | B12) = (KB11 | KB12) for some matrix K. Therefore, KB11 = B21, or K = B21B11^{−1}, so B22 = KB12 = B21B11^{−1}B12, and hence BB−B = B.
Example. Let B = [1 2 5 2; 3 7 12 4; 0 1 −3 −2]. Then the rank of B is 2, since (2nd row) − 3 × (1st row) = (3rd row). Partition B as above with B11 = [1 2; 3 7]. Take

B− = [B11^{−1} 0; 0 0] = [7 −2 0; −3 1 0; 0 0 0; 0 0 0].
Then check that (X′X)(X′X)−(X′X) = X′X. We have then

Xβ̂ = θ̂ = X(X′X)−X′Y = [1 1; 1 1; 1 1][1/3 0; 0 0][1 1 1; 1 1 1](y1, y2, y3)′
= [1/3 0; 1/3 0; 1/3 0](y1 + y2 + y3, y1 + y2 + y3)′
= ((y1 + y2 + y3)/3, (y1 + y2 + y3)/3, (y1 + y2 + y3)/3)′ = (ȳ, ȳ, ȳ)′,

i.e., each fitted value is the estimate of β1 + β2, namely ȳ. Only β1 + β2 can be estimated: note β1 + β2 = (1 1)(β1, β2)′ and (1, 1)′ ∈ MC(X′). More on this later.
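A small sketch of the example above (not from the notes): with X having two identical columns, β is not estimable, but X(X′X)−X′Y is the same for any choice of generalized inverse; here the notes' (X′X)− is compared with the Moore–Penrose inverse.

import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [1.0, 1.0]])
Y = np.array([2.0, 5.0, 8.0])            # arbitrary illustrative responses

G1 = np.array([[1/3, 0.0],               # the (X'X)^- used in the notes
               [0.0, 0.0]])
G2 = np.linalg.pinv(X.T @ X)             # Moore-Penrose inverse, another valid choice

theta1 = X @ G1 @ X.T @ Y
theta2 = X @ G2 @ X.T @ Y
print(theta1, theta2, Y.mean())          # both fits equal (ybar, ybar, ybar)'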
Theorem. If Y ∼ Nn (Xβ, σ 2 In ), where Xn×p has rank r and β̂ = (X ′ X)− X ′ Y
is a least squares solution of β,
(i) X β̂ ∼ Nn (Xβ, σ 2 P ),
(ii) (β̂ − β)′ X ′ X(β̂ − β) ∼ σ 2 χ2r
(iii) Xβ̂ is independent of RSS = (Y − Xβ̂)′(Y − Xβ̂); and
(iv) RSS/σ² ∼ χ²_{n−r} (independent of Xβ̂).
Proof. (i) Xβ̂ = X(X′X)−X′Y = PY, so

Xβ̂ ∼ Nn(PXβ, σ²P²) = Nn(Xβ, σ²P).
unique. i.e., if β̃ is any other LS solution, then also b′ X β̃ = b′ X β̂ = a′ β̂.
(ii) If d′ Y is any other linear unbiased estimate of a′ β, then
E(d′ Y ) = d′ Xβ = d′ θ = a′ β = b′ Xβ = b′ θ for all β ∈ Rp .
i.e., d′ θ = b′ θ for all θ ∈ MC (X).
i.e., (d − b)′ θ = 0 for all θ ∈ MC (X), or (d − b)⊥MC (X). Consider P =
PMC (X) = X(X ′ X)− X ′ . Then P (d − b) = 0 or P d = P b. Therefore,
µ+αi +τj is estimable for all i and j. Therefore, (µ+αi +τ1 )−(µ+αi +τ2 ) =
τ1 − τ2 is estimable.
(µ + αi + τ1 ) − (µ + αj + τ1 ) = αi − αj is estimable.
What else is estimable, apart from linear combinations of these?
Result. If a′β is estimable and Y ∼ Nn(Xβ, σ²In), a 100(1 − α)% confidence interval for a′β is given by

a′β̂ ± tn−r(1 − α/2) √(a′(X′X)−a) √(RSS/(n − r)).
Further, since RSS/σ 2 ∼ χ2n−r independent of X β̂, and hence of c′ X β̂ =
c′ θ̂ = a′ β̂,
(a′β̂ − a′β) / {√(σ²a′(X′X)−a) √(RSS/(σ²(n − r)))} ∼ tn−r.

Hence,

P(|a′β̂ − a′β| ≤ tn−r(1 − α/2) √(a′(X′X)−a) √(RSS/(n − r))) = 1 − α.
Maximum likelihood estimation
Does LS estimate have other optimality properties?
Since we have assumed that Y ∼ Nn (Xβ, σ 2 In ) to derive distributional prop-
erties of β̂, let us derive the maximum likelihood estimates of β and σ 2 under
this assumption. β̂mle and σ̂ 2 are values of β and σ 2 which maximize the
likelihood,
(2π)^{−n/2}(σ²)^{−n/2} exp{−(1/(2σ²))(Y − Xβ)′(Y − Xβ)}.

Equivalently, we may maximize the loglikelihood,

−(n/2) log(σ²) − (1/(2σ²))(Y − Xβ)′(Y − Xβ).

Fix σ² and maximize over β, then maximize over σ². Now note that maximizing over β (for any fixed σ²) is equivalent to minimizing (Y − Xβ)′(Y − Xβ) = ||Y − Xβ||², which yields the same estimate as least squares, i.e., β̂mle = β̂ls. However, σ̂²mle = RSS/n, which is not unbiased.
Estimation under linear restrictions or constraints
Consider the following examples.
(i) yij = µ + αi + τj + ǫij . Test H0 : τ1 = τ2 . i.e., test whether there is any
difference between treatments 1 and 2. Under H0 , τ1 − τ2 = 0, or Aβ = c
where A = a′ = (0, 0, . . . , 0, 1, −1, 0, . . . , 0), β = (µ, α1 , . . . , αI , τ1 , τ2 , . . .)′ .
(ii) yi = β0 + β1 xi1 + . . . + βp−1 xi(p−1) + ǫi . Test H0 : X1 , . . . , Xp−1 are not
useful.
Recall that, to derive the GLRT, we need to estimate the parameters of the
model, both with and without restrictions. While testing linear hypotheses
in a linear model, we need to estimate β under the linear constraint Aβ = c.
Consider Y = Xβ + ǫ, Xn×p of rank p, first. We will consider the deficient
rank case later. Let us see how we can find the least squares estimate of β
subject to H : Aβ = c, where Aq×p of rank q and c is given. We can use the
Lagrange multiplier method of calculus for this as follows.
min_β {||Y − Xβ||² + λ′(Aβ − c)} = min_β {Y′Y − 2β′X′Y + β′X′Xβ + λ′Aβ − λ′c},   (1)
Differentiating (1) w.r.t. β and setting the derivative to zero gives −2X′Y + 2X′Xβ̂H + A′λH = 0. Therefore,

β̂H = (X′X)^{−1}(X′Y − (1/2)A′λH) = β̂ − (1/2)(X′X)^{−1}A′λH.   (∗)

Differentiating (1) w.r.t. λ, we get Aβ − c = 0. Since

c = Aβ̂H = Aβ̂ − (1/2)A(X′X)^{−1}A′λH,

c − Aβ̂ = −(1/2)A(X′X)^{−1}A′λH, and hence

−(1/2)λH = (A(X′X)^{−1}A′)^{−1}(c − Aβ̂), and therefore

β̂H = β̂ + (X′X)^{−1}A′(A(X′X)^{−1}A′)^{−1}(c − Aβ̂).
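A small numerical sketch of this restricted least squares formula (synthetic full-rank X and an illustrative constraint, not from the notes): it checks that β̂H satisfies the constraint exactly and that its RSS cannot be smaller than that of β̂.

import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 4
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
Y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.standard_normal(n)

A = np.array([[0.0, 1.0, -1.0, 0.0]])    # illustrative constraint: beta_1 - beta_2 = c
c = np.array([0.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
M = A @ XtX_inv @ A.T
beta_H = beta_hat + XtX_inv @ A.T @ np.linalg.solve(M, c - A @ beta_hat)

rss = lambda b: np.sum((Y - X @ b) ** 2)
print(A @ beta_H - c)                    # ~ 0: the constraint holds exactly
print(rss(beta_hat) <= rss(beta_H))      # True: the unrestricted fit has smaller RSS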
To establish minimization subject to Aβ = c, note that

||X(β̂ − β)||² = (β̂ − β)′X′X(β̂ − β)
= (β̂ − β̂H + β̂H − β)′X′X(β̂ − β̂H + β̂H − β)
= (β̂ − β̂H)′X′X(β̂ − β̂H) + (β̂H − β)′X′X(β̂H − β) + 2(β̂ − β̂H)′X′X(β̂H − β)
= ||X(β̂ − β̂H)||² + ||X(β̂H − β)||²,

since, from (∗) above, and subject to Aβ = c,

(β̂ − β̂H)′X′X(β̂H − β) = (1/2)λ′H A(X′X)^{−1}X′X(β̂H − β) = (1/2)λ′H A(β̂H − β) = (1/2)λ′H(Aβ̂H − Aβ) = 0.
Therefore,
||Y − Xβ||2 = ||Y − X β̂||2 + ||X(β̂ − β)||2
= ||Y − X β̂||2 + ||X(β̂ − β̂H )||2 + ||X(β̂H − β)||2
≥ ||Y − X β̂||2 + ||X(β̂ − β̂H )||2 ,
and is a minimum when β = β̂H . (Note, X(β̂H − β) = 0 implies X ′ X(β̂H −
β) = 0, so β̂H − β = 0 since columns of X are linearly independent.) Also,
from above, we get,
||Y − X β̂H ||2 = ||Y − X β̂||2 + ||X(β̂ − β̂H )||2 .
If we let Ŷ = X β̂ and ŶH = X β̂H , then
||Y − ŶH ||2 = ||Y − Ŷ ||2 + ||Ŷ − ŶH ||2 .
Note that this can also be established using projection matrices, and not
just for the full column rank case. Let us first establish it for the case
Rank(Xn×p ) = p again, and next extend it.
Let β0 be a solution of Aβ = c. Then Y −Xβ0 = X(β −β0 )+ǫ or Ỹ = Xγ +ǫ
with Aγ = A(β − β0 ) = 0. i.e.,
Ỹ = θ + ǫ, θ ∈ MC (X) = Ω, and
A(X ′ X)−1 X ′ θ = A(X ′ X)−1 X ′ X(β − β0 ) = A(β − β0 ) = Aγ = 0.
Set A1 = A(X ′ X)−1 X ′ and ω = N (A1 ) ∩ Ω. Then A1 θ = Aγ = 0 and we
want the projection of Ỹ onto ω since we want:
min ||Ỹ − θ||2 subject to A1 θ = 0.
θ∈MC (X)
Result C. If ω ⊂ Ω, then PΩ Pω = Pω PΩ = Pω .
Proof. Show that PΩ Pω and Pω PΩ both satisfy the defining properties of
Pω : If x ∈ ω ⊂ Ω, then PΩ Pω x = PΩ x = x; if ξ ∈ ω ⊥ , PΩ Pω ξ = PΩ 0 = 0.
Similar is the other case.
Result D. If ω ⊂ Ω, then PΩ − Pω = Pω⊥ ∩Ω .
Proof. Ω = MC(PΩ), so each x ∈ Ω can be written x = PΩy. Consider the decomposition PΩy = Pωy + (PΩ − Pω)y. Now Pωy ∈ ω ⊂ Ω, and already PΩy ∈ Ω, so (PΩ − Pω)y = PΩy − Pωy ∈ Ω. Further, Pω(PΩ − Pω) = PωPΩ − Pω = Pω − Pω = 0, so that (Pωy)′(PΩ − Pω)y = y′Pω(PΩ − Pω)y = 0. Therefore, PΩy = Pωy ⊕ (PΩ − Pω)y is the orthogonal decomposition of Ω into ω ⊕ (ω⊥ ∩ Ω).
Result E. If A1 is any matrix such that ω = N (A1 ) ∩ Ω, then ω ⊥ ∩ Ω =
MC (PΩ A′1 ).
Proof. Note that

ω⊥ ∩ Ω = (Ω ∩ N(A1))⊥ ∩ Ω = (Ω⊥ ⊕ N⊥(A1)) ∩ Ω = (Ω⊥ ⊕ MC(A′1)) ∩ Ω.

Now, let x ∈ ω⊥ ∩ Ω = (Ω⊥ ⊕ MC(A′1)) ∩ Ω. Then x ∈ Ω, so x = PΩx. Also, x ∈ Ω⊥ ⊕ MC(A′1), so x = (I − PΩ)α + A′1β. Therefore,

x = PΩx = PΩ{(I − PΩ)α + A′1β} = PΩA′1β ∈ MC(PΩA′1).

Conversely, if x ∈ MC(PΩA′1), then x = PΩA′1β = PΩ(A′1β) ∈ MC(PΩ) = Ω. For any ξ ∈ ω (⊂ Ω), we have x′ξ = β′A1PΩξ = β′A1ξ = 0 since ω = N(A1) ∩ Ω. Therefore, x ∈ ω⊥.
Result F. If A1 is a q × n matrix of rank q, then Rank(PΩ A′1 ) = q iff
MC (A′1 ) ∩ Ω⊥ = {0}.
Proof. Rank(PΩA′1) ≤ Rank(A′1) = Rank(A1) = q. Suppose Rank(PΩA′1) < q. Let the rows of A1 be a′1, . . . , a′q (so the columns of A′1 are a1, . . . , aq). Then the columns of PΩA′1 are linearly dependent, so Σ_{i=1}^q ciPΩai = PΩ(Σ_{i=1}^q ciai) = 0 for some c ≠ 0. Then there exists a vector Σ_{i=1}^q ciai ∈ MC(A′1) (≠ 0, since the rank of A1 is q) such that Σ_{i=1}^q ciai ⊥ Ω, i.e., MC(A′1) ∩ Ω⊥ ≠ {0}. If Rank(PΩA′1) = q = Rank(A′1), then MC(A′1) = MC(PΩA′1) = ω⊥ ∩ Ω ⊂ Ω.
Now let us return to the problem of finding the projection of Ỹ onto ω =
N (A1 ) ∩ Ω which achieves:
min_{θ∈MC(X)} ||Ỹ − θ||² subject to A1θ = 0.

The minimizer is the projection of Ỹ onto ω:

Xγ̂H = PωỸ = PΩỸ − Pω⊥∩ΩỸ = PΩ(Y − Xβ0) − X(X′X)^{−1}A′(A(X′X)^{−1}A′)^{−1}(Aβ̂ − c).

Therefore,

Xβ̂H = Xβ0 + Xγ̂H = Xβ̂ − X(X′X)^{−1}A′(A(X′X)^{−1}A′)^{−1}(Aβ̂ − c).
This yields the minimum since ||Y − X β̂H ||2 = ||Ỹ − X γ̂H ||2 .
Case of X having less than full column rank
Rank(Xn×p) = r < p. Since only estimable linear functions a′β can be estimated, assume a′iβ, i = 1, 2, . . . , q, are estimable and let Aq×p have rows a′1, . . . , a′q. However, since a′i = m′iX for some m′i, we have A = Mq×nXn×p. Since A has rank q, M also has rank q (≤ r). Proceeding as before, let β0 be any solution of Aβ = c. Then consider Ỹ = Y − Xβ0 = X(β − β0) + ǫ, or Ỹ = Xγ + ǫ, or

Ỹ = θ + ǫ, θ ∈ MC(X) = Ω,

with the restriction Aγ = 0. Proceeding exactly as in the full column rank case, with a generalized inverse (X′X)− in place of (X′X)^{−1}, one arrives at the restricted normal equations.
Now recall, a solution of Bu = d is û = B−d. Therefore, from above, since

X′X(β̂H − β̂) = −A′(A(X′X)−A′)^{−1}{Aβ̂ − c},

we have that

β̂H = β̂ − (X′X)−A′(A(X′X)−A′)^{−1}{Aβ̂ − c}.
Linear Regression
SST = Y′Y = (Y − Ŷ)′(Y − Ŷ) + Ŷ′Ŷ
= Y′(I − P)Y + Y′PY
= Y′(I − P)Y + β̂′X′Xβ̂
= (Y′Y − β̂′X′Y) + β̂′X′Y
= RSS + SSR,
where RSS is the residual sum of squares and SSR is the sum of squares due to
regression. If Xn×p has rank r ≤ p, then n = (n − r) + r is the corresponding
decomposition of the degrees of freedom. Thus, analysis of variance is simply
the decomposition of total sum of squares into components which can be
attributed to different factors. Then this simple minded ANOVA for Y =
Xβ + ǫ will look as follows.
source of variation     d.f.           sum of squares          mean square         F-ratio
model: Y = Xβ + ǫ       r = Rank(X)    SSR = β̂′X′Y             MSR = SSR/r         F = MSR/MSE
residual error          n − r          SSE = Y′Y − β̂′X′Y       MSE = SSE/(n − r)
Total                   n              SST = Y′Y
If Y ∼ Nn (Xβ, σ 2 In ),
(i) X β̂ is independent of SSE = RSS = (Y − X β̂)′ (Y − X β̂) = Y ′ Y − β̂ ′ X ′ Y ,
and
(ii) SSE = RSS ∼ σ 2 χ2n−r ;
(iii) if indeed the linear model is not useful, then β = 0 so that β̂ ′ X ′ X β̂ =
(β̂ − β)′ X ′ X(β̂ − β) ∼ σ 2 χ2r .
Therefore, to check usefulness of the linear model, use
F = MSR/MSE ∼ Fr,n−r (if β = 0).
If β 6= 0, then β̂ ′ X ′ X β̂ ∼ non-central χ2 and E(β̂ ′ X ′ X β̂) = rσ 2 +β ′ X ′ Xβ >
rσ 2 , so large values of F-ratio indicate evidence for β 6= 0.
However, this ANOVA is not particularly useful since (usually) the first
column of X is 1 indicating that the model includes an intercept or cen-
tre. This constant term is generally useful, and we only want to check
H0 : β1 = β2 = · · · = βp−1 = 0 to check the usefulness of the actual regressors,
X1 , . . . , Xp−1 (not X0 = 1). Before discussing this, let us recall a result in
probability on decomposing the variance:
If X and Y are jointly distributed (with finite second moments), then

Var(Y) = E[Var(Y | X)] + Var(E[Y | X]).
The F-test (to check the goodness of linear models)
We have the model, Y = Xβ+ǫ, Xn×p of rank r ≤ p and with ǫ ∼ Nn (0, σ 2 In ).
Suppose we want to test H0 : Aβ = c, Aq×p of rank q ≤ r, and c is given.
Then
RSS = SSE = (Y − X β̂)′ (Y − X β̂) = Y ′ (I − P )Y
RSSH0 = (Y − X β̂H0 )′ (Y − X β̂H0 ), where
β̂H0 = β̂ + (X′X)−A′(A(X′X)−A′)^{−1}{c − Aβ̂}.
which is large if Aβ is far from c.
(iv) Note that
RSSH0 − RSS = (Aβ̂ − c)′(A(X′X)−A′)^{−1}(Aβ̂ − c) ∼ σ²χ²_q,

and is independent of RSS, so that F = {(RSSH0 − RSS)/q}/{RSS/(n − r)} ∼ Fq,n−r under H0.
Now we use the above result for checking the goodness of the linear fit.
ANOVA for checking the goodness of Y = Xβ + ǫ, or yi = β0 + β1 xi1 +
· · · + βp−1xi(p−1) + ǫi, or equivalently for testing H0 : β1 = · · · = βp−1 = 0, is what is needed. Intuitively, if X1, . . . , Xp−1 provide no useful information, then the appropriate model is yi = β0 + ǫi, so ȳ is the only quantity that can help in predicting y. Then RSSH0 = Σ_{i=1}^n (yi − ȳ)² is the sum of squares unexplained, and it has n − 1 d.f. If X1, . . . , Xp−1 are also used in the model,
then (Y − X β̂)′ (Y − X β̂) = RSS is the unexplained part with n − r d.f.
How much better is RSS compared to RSSH0 ? Let SSreg denote the sum of
squares due to X1 , . . . , Xp−1 and without an intercept. Then,
RSSH0 = RSS + SSreg

Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n (yi − ŷi)² + SSreg.

In other words,

Y′Y − (1/n)Y′11′Y = Y′(I − P)Y + SSreg, or
Y′Y = Y′(I − P)Y + SSreg + (1/n)Y′11′Y, or
SSR = β̂′X′Xβ̂ = β̂′X′Y = SSreg + (1/n)Y′11′Y,
ANOVA for regression (corrected for mean)
In other words,
R² = 1 − RSS/SST(corrected) = 1 − Y′(I − P)Y / {Y′(I − (1/n)11′)Y}
= {Σ_{i=1}^n (yi − ȳ)² − Y′(I − P)Y} / Σ_{i=1}^n (yi − ȳ)² = {Σ_{i=1}^n yi² − nȳ² − Y′(I − P)Y} / Σ_{i=1}^n (yi − ȳ)²
= {Y′Y − nȳ² − Y′Y + Y′PY} / Σ_{i=1}^n (yi − ȳ)² = {Y′PY − nȳ²} / Σ_{i=1}^n (yi − ȳ)²
= {SSR − nȳ²} / Σ_{i=1}^n (yi − ȳ)² = SSreg / SST(corrected)
= proportion of variability explained by the regressors.
Also,

R² = SSreg/SST(corrected) = SSreg/(RSS + SSreg) = (SSreg/RSS)/(1 + SSreg/RSS) = {((r − 1)/(n − r))Freg}/{1 + ((r − 1)/(n − r))Freg}

is an increasing function of the F-ratio.
Note that to interpret the F-ratio, normality of ǫi is needed. R2 , however, is
a percentage with a straightforward interpretation.
Example 1 (socio-economic study). The demand for a consumer prod-
uct is affected by many factors. In one study, measurements on the relative
urbanization (X1 ), educational level (X2 ), and relative income (X3 ) of 9 ran-
domly chosen geographic regions were obtained in an attempt to determine
their effect on the product usage (Y ). The data were:
X1 X2 X3 Y
42.2 11.2 31.9 167.1
48.6 10.6 13.2 174.4
42.6 10.6 28.7 160.8
39.0 10.4 26.1 162.0
34.7 9.3 30.1 140.8
44.5 10.8 8.5 174.6
39.1 10.7 24.3 163.7
40.1 10.0 18.6 174.5
45.9 12.0 20.4 185.7
We fit the model: Y = Xβ+ǫ, with E(ǫ) = 0 and Cov(ǫ) = σ 2 In . In this case,
n = 9, p = 4, ȳ = 167.07 and the model is yi = β0 + β1xi1 + β2xi2 + β3xi3 + ǫi. We get

β̂ = (X′X)^{−1}X′Y = (60.0, 0.24, 10.72, −0.75)′.

The detailed ANOVA (with mean) is
source               d.f.   SS                      MS                F-ratio
mean                 1      SSM = nȳ² = 251201.44   MSM = 251201.44
regression           3      SSreg = 1081.35         MSreg = 360.45    Freg = 360.45/39.57 = 9.11
(X1, X2, X3)
residual error       5      SSE = RSS = 197.85      MSE = 39.57
Total (corrected)    8      1279.20
Total                9      252480.64
From this note that s2 = RSS/(n − r) = MSE = 39.57, so s = 6.29 = σ̂, and
R2 = 1081.35/1279.20 = 84.5%. Abridged ANOVA is
source               d.f.   SS                      MS                F-ratio
regression           3      SSreg = 1081.35         MSreg = 360.45    Freg = 360.45/39.57 = 9.11
(X1, X2, X3)
residual error       5      SSE = RSS = 197.85      MSE = 39.57
Total (corrected)    8      1279.20
R2 = 84.5% is substantial. What about F = 9.11? F3,5 (.95) = 5.41 and
F3,5 (.99) = 12.06, so there is some evidence against the null and justifying
the linear fit.
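A minimal sketch (not part of the notes) that refits Example 1 from the data listed above; the printed quantities should match, up to rounding, the ANOVA table (SSreg = 1081.35, RSS = 197.85, R² = 84.5%, Freg = 9.11).

import numpy as np

data = np.array([
    [42.2, 11.2, 31.9, 167.1], [48.6, 10.6, 13.2, 174.4], [42.6, 10.6, 28.7, 160.8],
    [39.0, 10.4, 26.1, 162.0], [34.7,  9.3, 30.1, 140.8], [44.5, 10.8,  8.5, 174.6],
    [39.1, 10.7, 24.3, 163.7], [40.1, 10.0, 18.6, 174.5], [45.9, 12.0, 20.4, 185.7]])
X = np.column_stack([np.ones(len(data)), data[:, :3]])
Y = data[:, 3]

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
RSS = np.sum((Y - X @ beta_hat) ** 2)
SST_c = np.sum((Y - Y.mean()) ** 2)      # total SS corrected for the mean
SSreg = SST_c - RSS
n, r = X.shape
F = (SSreg / (r - 1)) / (RSS / (n - r))
print(beta_hat, RSS, SSreg, SSreg / SST_c, F)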
Example 2. X = height (cm) and Y = weight (kg) for a sample of n = 10
eighteen-year-old American girls:
X Y
169.6 71.2
166.8 58.2
157.1 56.0
181.1 64.5
158.4 53.0
165.6 52.4
166.7 56.8
156.5 49.2
168.1 55.6
165.3 77.8
Upon fitting the simple linear regression model, yi = β0 + β1xi + ǫi, we get (β̂0, β̂1)′ = (−36.9, 0.582)′, s² = MSE = 71.50, s = 8.456, R² = 21.9%, ȳ = 59.47. The ANOVA is
source d.f. SS MS F R2
X 1 159.95 159.95 2.24 21.9%
error 8 512.01 71.50
Total (C) 9 731.96
Note the following. (i) X is expected to be a useful predictor of Y , but the
relationship may not be simple. (ii) F1,8(.90) = 3.46 = (1.86)² = t8²(.95), so is there a connection between the ANOVA F-test and a t-test?
Consider simple linear regression again: yi = β0 + β1 xi + ǫi , i = 1, . . . , n, ǫi
i.i.d. N (0, σ 2 ). Then the F-ratio is the F statistic for testing the goodness of
fit of the linear model, or for testing H0 : β1 = 0. Writing the linear model
in the standard form, we have

Xn×2 = [1 x1; 1 x2; . . . ; 1 xn], X′X = [n Σ_{i=1}^n xi; Σ_{i=1}^n xi Σ_{i=1}^n xi²], and

(X′X)^{−1} = {1/(nΣ_{i=1}^n (xi − x̄)²)} [Σ_{i=1}^n xi² −nx̄; −nx̄ n].

Therefore

(β̂0, β̂1)′ = (X′X)^{−1}X′Y = {1/(nΣ_{i=1}^n (xi − x̄)²)} [Σ_{i=1}^n xi² −nx̄; −nx̄ n] (Σ_{i=1}^n yi, Σ_{i=1}^n xiyi)′.
Letting SXX = Σ_{i=1}^n (xi − x̄)² and SXY = Σ_{i=1}^n (xi − x̄)(yi − ȳ), and extracting the least squares equations, we get

β̂1 = (1/SXX){−nx̄ȳ + Σ_{i=1}^n xiyi} = SXY/SXX,

β̂0 = (1/SXX){ȳΣ_{i=1}^n xi² − x̄Σ_{i=1}^n xiyi} = (1/SXX){ȳSXX + nȳx̄² − x̄Σ_{i=1}^n xiyi}
= (1/SXX){ȳSXX − x̄(Σ_{i=1}^n xiyi − nx̄ȳ)} = ȳ − x̄β̂1.
Now, β̂1 ∼ N(β1, σ²/SXX), so that, to test H0 : β1 = 0, use the test statistic

√SXX β̂1 / √(RSS/(n − 2)) ∼ tn−2, or β̂1²SXX/MSE ∼ F1,n−2,

under H0.
However,

RSS = Σ_{i=1}^n (yi − β̂0 − β̂1xi)² = Σ_{i=1}^n {yi − ȳ − β̂1(xi − x̄)}²
= Σ_{i=1}^n (yi − ȳ)² + β̂1²Σ_{i=1}^n (xi − x̄)² − 2β̂1Σ_{i=1}^n (xi − x̄)(yi − ȳ)
= Σ_{i=1}^n (yi − ȳ)² + β̂1²Σ_{i=1}^n (xi − x̄)² − 2β̂1 · β̂1Σ_{i=1}^n (xi − x̄)²
= Σ_{i=1}^n (yi − ȳ)² − β̂1²Σ_{i=1}^n (xi − x̄)².

Therefore, SSreg = β̂1²Σ_{i=1}^n (xi − x̄)², so that

t² = β̂1²SXX / {RSS/(n − 2)} = F-ratio of the ANOVA.
Recall that, in general,

F = {(RSSH0 − RSS)/q} / {RSS/(n − r)} ∼ Fq,n−r under H0.
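A short sketch of the simple linear regression computation above (synthetic data, not from the notes), checking numerically that t² equals the ANOVA F-ratio.

import numpy as np

rng = np.random.default_rng(4)
n = 25
x = rng.uniform(150, 185, n)
y = -30 + 0.5 * x + rng.normal(0, 8, n)

Sxx = np.sum((x - x.mean()) ** 2)
Sxy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = Sxy / Sxx
b0 = y.mean() - b1 * x.mean()

RSS = np.sum((y - b0 - b1 * x) ** 2)     # = S_yy - b1^2 S_xx
MSE = RSS / (n - 2)
t = b1 * np.sqrt(Sxx) / np.sqrt(MSE)

SSreg = b1 ** 2 * Sxx
F = SSreg / MSE
print(t ** 2, F)                          # identical up to rounding error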
Multiple Correlation
R² = SSreg/SST(corrected) = 1 − RSS/SST(corrected) = 1 − Y′(I − P)Y / {Y′(I − (1/n)11′)Y}.

For simple linear regression,

RSS = Σ_{i=1}^n (yi − ȳ)² − β̂1²Σ_{i=1}^n (xi − x̄)²,

so that

SSreg = β̂1²Σ_{i=1}^n (xi − x̄)² = {Σ_{i=1}^n (xi − x̄)(yi − ȳ)}² / Σ_{i=1}^n (xi − x̄)².

Therefore, R² = r²XY, where

rXY = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / {√(Σ_{i=1}^n (xi − x̄)²) √(Σ_{i=1}^n (yi − ȳ)²)}

is the sample correlation coefficient between X and Y.
If Cov((Y, X′)′) = [σYY σ′XY; σXY ΣX], then

Corr²(Y, a′X) = Cov²(Y, a′X)/{Var(Y)Var(a′X)} = {a′Cov(X, Y)}²/{Var(Y)Var(a′X)} = {a′σXY}²/{σYY a′ΣXa}.

Further, taking u′ = a′ΣX^{1/2} and v = ΣX^{−1/2}σXY,

a′σXY/(σYY a′ΣXa)^{1/2} = a′ΣX^{1/2}ΣX^{−1/2}σXY/(σYY a′ΣXa)^{1/2} = u′v/(σYY a′ΣXa)^{1/2}
≤ (u′u)^{1/2}(v′v)^{1/2}/(σYY a′ΣXa)^{1/2} = (a′ΣXa)^{1/2}(σ′XYΣX^{−1}σXY)^{1/2}/(σYY a′ΣXa)^{1/2}
= (σ′XYΣX^{−1}σXY/σYY)^{1/2},

with equality if we take u ∝ v, or a = ΣX^{−1}σXY. Since R∗ = √(σ′XYΣX^{−1}σXY/σYY), we have 0 ≤ R∗ ≤ 1, unlike the ordinary correlation coefficient. Now let us see why:

Corr(Y, E(Y|X)) = Cov(Y, σ′XYΣX^{−1}X)/√(σYY σ′XYΣX^{−1}ΣXΣX^{−1}σXY) = σ′XYΣX^{−1}σXY/{√σYY √(σ′XYΣX^{−1}σXY)} = R∗,

i.e., R∗ = the correlation coefficient between Y and the conditional expectation of Y|X (or the regression of Y on X, when the conditional expectation is linear). Further, Var(Y) − E(Var(Y|X)) = σYY − (σYY − σ′XYΣX^{−1}σXY) = σ′XYΣX^{−1}σXY, so that the proportion of variation in Y explained by the regression on X is equal to

R² = {Var(Y) − E(Var(Y|X))}/Var(Y) = σ′XYΣX^{−1}σXY/σYY = (R∗)².
Partial Correlation Coefficients
Recall the notation, ρ for the population and r for a sample. From the
expression for Σ11.2 note that σij.l = σij − σil σjl /σll . Thus,
ρij.l = σij.l/{√σii.l √σjj.l} = (σij − σilσjl/σll)/√{(σii − σil²/σll)(σjj − σjl²/σll)}
= {σij/√(σiiσjj) − σilσjl/(σll√(σiiσjj))}/√{(1 − σil²/(σiiσll))(1 − σjl²/(σjjσll))}
= (ρij − ρilρjl)/√{(1 − ρil²)(1 − ρjl²)}.
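A tiny sketch of this formula (illustration only; the correlation matrix below is an arbitrary example): it computes ρij.l from a correlation matrix.

import numpy as np

def partial_corr(R, i, j, l):
    # correlation of variables i and j given variable l, from the formula above
    return (R[i, j] - R[i, l] * R[j, l]) / np.sqrt((1 - R[i, l] ** 2) * (1 - R[j, l] ** 2))

R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.7],
              [0.5, 0.7, 1.0]])
print(partial_corr(R, 0, 1, 2))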
Simultaneous confidence sets
Since (β̂ − β)′X′X(β̂ − β) ∼ σ²χ²_r independently of RSS = Y′(I − P)Y ∼ σ²χ²_{n−r}, we have

P{(β̂ − β)′X′X(β̂ − β) ≤ (r/(n − r))Y′(I − P)Y Fr,n−r(1 − α)} = 1 − α.

Therefore,

C = {β : (β − β̂)′X′X(β − β̂) ≤ (r/(n − r))RSS Fr,n−r(1 − α)}

is a 100(1 − α)% confidence set (an ellipsoid) for β. Earlier we derived a confidence interval for a′β. Let us see if we can extend this when we are interested in deriving a simultaneous confidence set of confidence coefficient 1 − α for a′1β, a′2β, . . . , a′kβ.
Scheffé's method.
Let A′p×d = (a1, a2, . . . , ad), where a1, a2, . . . , ad are linearly independent and ad+1, . . . , ak are linearly dependent on them. Then d ≤ min{k, r}. Let φ = Aβ and φ̂ = Aβ̂. Then

F(β) = {(φ̂ − φ)′(A(X′X)−A′)^{−1}(φ̂ − φ)/d} / {RSS/(n − r)} ∼ Fd,n−r.

Therefore,

1 − α = P[F(β) ≤ Fd,n−r(1 − α)] = P[(φ̂ − φ)′(A(X′X)−A′)^{−1}(φ̂ − φ) ≤ d(RSS/(n − r))Fd,n−r(1 − α)].
This gives an ellipsoid as before, but consider the following result.
Result. If L is positive definite,

b′L^{−1}b = sup_{h≠0} (h′b)²/(h′Lh).
Therefore,

1 − α = P[ sup_{h≠0} {h′(φ − φ̂)}²/{h′(A(X′X)−A′)h} ≤ d(RSS/(n − r))Fd,n−r(1 − α) ]
= P[ {h′(φ − φ̂)}²/{h′(A(X′X)−A′)h} ≤ d(RSS/(n − r))Fd,n−r(1 − α) for all h ≠ 0 ]
= P[ |h′(φ − φ̂)| ≤ {dFd,n−r(1 − α)}^{1/2} √(RSS/(n − r)) √(h′(A(X′X)−A′)h) for all h ≠ 0 ]
= P[ |h′(φ − φ̂)| ≤ {dFd,n−r(1 − α)}^{1/2} s.e.(h′φ̂) for all h ≠ 0 ],

where s.e.(h′φ̂) = √(RSS/(n − r)) √(h′(A(X′X)−A′)h). Therefore,

a′iβ̂ ± {dFd,n−r(1 − α)}^{1/2} √(RSS/(n − r)) √(a′i(X′X)−ai), i = 1, 2, . . . , k,
is a simultaneous 100(1 − α)% confidence set for a′1β, a′2β, . . . , a′kβ, by noting that

P(a′iβ ∈ a′iβ̂ ± {dFd,n−r(1 − α)}^{1/2} s.e.(a′iβ̂), i = 1, 2, . . . , k) ≥ P(|h′(φ − φ̂)| ≤ {dFd,n−r(1 − α)}^{1/2} s.e.(h′φ̂) for all h ≠ 0) = 1 − α.
Many other methods are also available.
Regression diagnostics
(Figure: plot of the residuals y − ŷ.)
With the model: Y = Xβ + ǫ, with E(ǫ) = 0 and Cov(ǫ) = σ 2 In , normality
of ǫ is essential for hypothesis testing and confidence statements. How does
one check this?
Normal probability plot or Q-Q plot.
This is a graphical technique to check for normality. Suppose we have a
random sample T1 , T2 , . . . , Tn from some population, and we want to check
whether the population has the normal distribution with some mean µ and
some variance σ 2 . The method described here depends on examining the
order statistics, T(1) , . . . , T(n) . Let us recall a few facts about order statistics
from a continuous distribution. Since

fT1,...,Tn(t1, . . . , tn) = Π_{i=1}^n f(ti), (t1, . . . , tn) ∈ Rn,

we have

fT(1),...,T(n)(t(1), . . . , t(n)) = n! Π_{i=1}^n f(t(i)), t(1) < t(2) < · · · < t(n),

and

fT(i)(t(i)) = {n!/((i − 1)!(n − i)!)} [F(t(i))]^{i−1} [1 − F(t(i))]^{n−i} f(t(i)).

If U(1) < U(2) < · · · < U(n) are the order statistics from U(0, 1), then

E(U(k)) = ∫_0^1 u fU(k)(u) du = {n!/((k − 1)!(n − k)!)} ∫_0^1 u^{k+1−1}(1 − u)^{n−k+1−1} du
= {n!/((k − 1)!(n − k)!)} Γ(k + 1)Γ(n − k + 1)/Γ(n + 2) = k/(n + 1).
Therefore, since Φ((T(i) − µ)/σ) is the ith order statistic of a U(0, 1) sample,

E[Φ((T(i) − µ)/σ)] = i/(n + 1) ≈ (i − 0.5)/n, i = 1, 2, . . . , n.

Therefore, the plot of Φ((T(i) − µ)/σ) versus (i − 0.5)/n is (approximately) on the line y = x. Equivalently, the plot of (T(i) − µ)/σ versus Φ^{−1}((i − 0.5)/n) is on the line y = x. In other words, the plot of T(i) versus Φ^{−1}((i − 0.5)/n) is linear. To check this, µ and σ² are not needed. Since T(i) is the quantile of order i/n and Φ^{−1}((i − 0.5)/n) is the standard normal quantile of order (i − 0.5)/n, this plot is called the Quantile-Quantile plot. One looks for nonlinearity in the plot to check for non-normality.
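A small sketch of the normal Q-Q plot described above (illustration only; the sample below is synthetic, and scipy/matplotlib are assumed to be available).

import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
T = rng.normal(10, 2, size=100)          # try rng.exponential(1, 100) to see curvature

n = len(T)
probs = (np.arange(1, n + 1) - 0.5) / n
theo_q = norm.ppf(probs)                 # standard normal quantiles Phi^{-1}((i - 0.5)/n)
plt.scatter(theo_q, np.sort(T))          # roughly a straight line if T is normal
plt.xlabel("standard normal quantiles")
plt.ylabel("ordered sample")
plt.show()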
How is this plot to be used in regression? We want to check the normality of the ǫi, but they are not observable. Instead, the yi are observable, but they have different means. We therefore consider the residuals: ǫ̂ = Y − Ŷ = (I − P)Y ∼ Nn(0, σ²(I − P)) if normality holds, i.e., ǫ̂i ∼ N(0, σ²(1 − Pii)) if Y ∼ N(Xβ, σ²In). For a fixed number of regressors (p − 1), as n increases, Pii → 0 (Weisberg), so the residuals can be used in the Q-Q plot.
X1 X2 X3 X4 Y
45 39.2 38 3 16
65 47.0 36 12 15
40 24.3 14 18 10
.. .. .. .. ..
. . . . .
Correlation matrix:
Y X1 X2 X3
X1 0.158
X2 0.022 0.069
X3 0.836 -0.017 0.066
X4 -0.908∗ -0.205 0.212 -0.815
Choose X4 first, since r4y = -0.908 is the highest in magnitude. Then R2 =
(−0.908)² = 82.4%, and F = 168.79 >> F1,36(.99). Now compute the partial correlations:

riy.4 = −0.07 (i = 1), 0.518 (i = 2), 0.398 (i = 3).
If we pick X3 now, R2 = 87.9%, not very different from the previous regres-
sion. Also, X3 is not particularly useful in regression.
Basics of Design of Experiments and ANOVA
ǫi i.i.d. N(0, σ²). Write it in the vector/matrix form, Y = Xβ + ǫ:

(y1, . . . , yn1, yn1+1, . . . , yn1+n2)′ = [1n1 0; 0 1n2](µm, µc)′ + ǫ,

where 1n1 and 1n2 denote vectors of ones of lengths n1 and n2.
Therefore,

(µ̂m, µ̂c)′ = (ȳ1, ȳ2)′ ∼ N2((µm, µc)′, σ²[1/n1 0; 0 1/n2]),

independently of

RSS = Σ_{i=1}^{n1}(yi − ȳ1)² + Σ_{i=n1+1}^{n1+n2}(yi − ȳ2)² ∼ σ²χ²_{n1+n2−2}.
For this reason, the design is called a completely randomized design. We can
generalize this procedure if we want to compare k means, as will be done
later.
Paired differences - example of a block design
Sometimes independent samples, such as the ones in a completely random-
ized design, from two (or k > 2) populations is not an efficient way for
comparisons. Consider the following example.
Example. It is of interest to compare an enriched formula with a stan-
dard formula for baby food. Weights of infants vary significantly and this
influences weight gain more than the difference in food quality. Therefore,
independent samples (with infants having very different weights) for the two
formulas will not be very efficient in detecting the difference. Instead, pair
babies of similar weight and feed one of them the standard formula, and the
other the enriched formula. Then observe the gain in weight:
pair 1 2 3 ... n
enriched e1 e2 e3 ... en
standard s1 s2 s3 ... sn
However, the samples may not be treated as independent but correlated. The
n pairs of observations, (e1 , s1 ), . . . , (en , sn ) may still be treated to be uncor-
related (or even independent). These n pairs are like n independent blocks,
inside each of which we can compare enriched with standard. This is the
idea of blocking and block designs. Blocks are supposed to be homogeneous
inside, so comparison of treatments within blocks becomes efficient.
We assume that

(ei, si)′ ∼ N2((µ1, µ2)′, [σ1² ρσ1σ2; ρσ1σ2 σ2²]).

In the above example, we want to test H0 : µD ≡ µ1 − µ2 = 0, so consider yi = ei − si. Then, yi = µD + ǫi, E(ǫi) = 0, Var(ǫi) = σ²D = σ1² + σ2² − 2ρσ1σ2 = Var(yi). If normality is assumed, then we have y1, . . . , yn i.i.d. N(µD, σ²D) and we want to test H0 : µD = 0. Consider the test statistic

√n ȳ / √{Σ_{i=1}^n (yi − ȳ)²/(n − 1)} ∼ tn−1,

if H0 is true, or equivalently,

nȳ² / {Σ_{i=1}^n (yi − ȳ)²/(n − 1)} ∼ F1,n−1.
Note that

nȳ² / {Σ_{i=1}^n (yi − ȳ)²/(n − 1)}
= n(ē − s̄)² / {(1/(n − 1))Σ_{i=1}^n [(ei − ē) − (si − s̄)]²}
= n(ē − s̄)² / {(1/(n − 1))[Σ(ei − ē)² + Σ(si − s̄)² − 2Σ(ei − ē)(si − s̄)]}
= {(ē − s̄)²/(1/n + 1/n)} / {(1/(2(n − 1)))[Σ(ei − ē)² + Σ(si − s̄)² − 2Σ(ei − ē)(si − s̄)]}.
Compare this test statistic with the one used for independent samples. Cov(e, s) is expected to be positive (due to blocking), so the variance estimate in the denominator above is typically less than (1/(2(n − 1)))[Σ(ei − ē)² + Σ(si − s̄)²], which appears there. This is the positive effect due to blocking.
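A small sketch of the paired-difference t statistic derived above (the weight-gain numbers below are hypothetical, for illustration only).

import numpy as np

e = np.array([7.1, 6.8, 7.4, 6.9, 7.7, 7.0])   # hypothetical weight gains, enriched formula
s = np.array([6.5, 6.9, 6.8, 6.4, 7.2, 6.6])   # hypothetical weight gains, standard formula

y = e - s                                       # paired differences
n = len(y)
t = np.sqrt(n) * y.mean() / np.sqrt(np.sum((y - y.mean()) ** 2) / (n - 1))
print(t, t ** 2)                                # t ~ t_{n-1}, t^2 ~ F_{1,n-1} under H0: mu_D = 0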
Confounding of effects.
Example. Consider two groups of similar students and two teachers. It is
of interest to compare two different training methods. Consider the design
where teacher A teaches one group using method I, whereas teacher B teaches
the other group using method II. Later the results are analyzed. The problem
with this design is that, if one group performs better it may be due to teacher
effect or due to method effect, but it is not possible to separate the effects.
We say then that the two effects are confounded. Sometimes we may not be
interested in certain effects, in which case we may actually look for designs
that will confound their effects. This will reduce the number of parameters
to be estimated.
The model for the completely randomized design comparing k groups is as follows.
Let yij = response of the jth individual in the ith group (ith treatment),
j = 1, 2, . . . , ni ; i = 1, 2, . . . , k. Then,
yij = µi + ǫij , j = 1, 2, . . . , ni ; i = 1, 2, . . . , k. E(ǫij ) = 0, V ar(ǫij ) = σ 2 , un-
correlated errors; ǫij ∼ N (0, σ 2 ) i.i.d. for testing and confidence statements.
In the usual linear model formulation:
(y11, . . . , y1n1, y21, . . . , y2n2, . . . , yk1, . . . , yknk)′ = [1n1 0 · · · 0; 0 1n2 · · · 0; . . . ; 0 0 · · · 1nk](µ1, µ2, . . . , µk)′ + ǫ.
Since (X′X)^{−1} = diag(1/n1, 1/n2, . . . , 1/nk) and X′Y = (Σ_{j=1}^{n1} y1j, . . . , Σ_{j=1}^{nk} ykj)′, we get

(µ̂1, . . . , µ̂k)′ = (ȳ1, . . . , ȳk)′ and RSS = Σ_{i=1}^k Σ_{j=1}^{ni}(yij − ȳi)² = ΣΣ ǫ̂²ij = ΣΣ(yij − µ̂i)².
Questions.
(i) Are the group means µi equal? i.e., test H0 : µ1 = µ2 = · · · = µk .
(ii) If not, how are they different?
(Figure: plot of the group sample means with error bands.)
But sample means do not tell the whole story, especially for small samples.
One must look at variation within samples and between samples. In the
plot above, the conclusions would be different according to whether the error
bands are green or red.
(Figure: tensile strength versus % cotton for the five samples s1–s5.)
sample variation is large. Note that, if within sample variation is large com-
pared to between sample variation (like the red error bands in the plot),
then the different samples can be considered to be from a single population.
However, if within sample variation is small compared to between sample
variation (like the green error bands in the plot, i.e., |ȳi − ȳj | are large com-
pared to the error) then there is reason to believe that the groups differ.
To formalize this, we return to linear models:
yij = µi + ǫij , j = 1, 2, . . . , ni ; i = 1, 2, . . . , k, ǫij ∼ N (0, σ 2 ) i.i.d. Are the
group means different?
(µ̂1, . . . , µ̂k)′ = (ȳ1, . . . , ȳk)′, so that RSS = Σ_{i=1}^k Σ_{j=1}^{ni}(yij − ȳi)².

To test H0 : µ1 = µ2 = · · · = µk, consider

A(k−1)×k = [1 0 0 · · · 0 −1; 0 1 0 · · · 0 −1; . . . ; 0 0 0 · · · 1 −1].

Then we test H0 : Aµ = 0, where A has rank k − 1. To test H0, we obtain µ̂H0, RSSH0 and consider

F = {(RSSH0 − RSS)/(k − 1)} / {RSS/(Σ_{i=1}^k ni − k)}, which ∼ F_{k−1, Σni−k} under H0.

Under H0,

min_{µ1=µ2=···=µk} Σ_{i=1}^k Σ_{j=1}^{ni}(yij − µi)² = min_µ Σ_{i=1}^k Σ_{j=1}^{ni}(yij − µ)².

Therefore,

µ̂H0 = (1/Σ_{i=1}^k ni) Σ_{i=1}^k Σ_{j=1}^{ni} yij ≡ ȳ.., and hence RSSH0 = Σ_{i=1}^k Σ_{j=1}^{ni}(yij − ȳ..)².
Introduce further notation: ȳi. = ȳi = (1/ni)Σ_{j=1}^{ni} yij, i = 1, 2, . . . , k. Note, further, that
RSSH0 = Σ_{i=1}^k Σ_{j=1}^{ni}(yij − ȳ..)² = Σ_{i=1}^k Σ_{j=1}^{ni}(yij − ȳi. + ȳi. − ȳ..)²
= Σ_{i=1}^k Σ_{j=1}^{ni}(yij − ȳi.)² + Σ_{i=1}^k ni(ȳi. − ȳ..)² + 2Σ_{i=1}^k (ȳi. − ȳ..){Σ_{j=1}^{ni}(yij − ȳi.)}
= RSS + Σ_{i=1}^k ni(ȳi. − ȳ..)²,

since Σ_{j=1}^{ni}(yij − ȳi.) = 0 for all i. Therefore,
RSSH0 − RSS = Σ_{i=1}^k ni(ȳi. − ȳ..)²

and therefore

F = {Σ_{i=1}^k ni(ȳi. − ȳ..)²/(k − 1)} / {Σ_{i=1}^k Σ_{j=1}^{ni}(yij − ȳi.)²/(Σ_{i=1}^k ni − k)} ∼ F_{k−1, Σni−k} under H0.
It is instructive to consider these sums of squares.
RSS = Σ_{i=1}^k Σ_{j=1}^{ni}(yij − ȳi.)² = the sum total of all the sums of squares of deviations from the sample means = within groups or within treatments sum of squares, SSW.
RSSH0 = Σ_{i=1}^k Σ_{j=1}^{ni}(yij − ȳ..)² = total sum of squares of deviations assuming no treatment effect = total (corrected) variability in the k samples, SST.
Therefore, Σ_{i=1}^k ni(ȳi. − ȳ..)² = SST − SSW = between groups or between treatments sum of squares = SSB. Thus,
SST = SSW + SSB is the decomposition of the sum of squares, along with Σ_{i=1}^k ni − 1 = (Σ_{i=1}^k ni − k) + (k − 1), the decomposition of d.f.
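A small sketch of this one-way ANOVA decomposition (hypothetical group data, not from the notes; group sizes need not be equal).

import numpy as np

groups = [np.array([7., 7., 15., 11., 9.]),
          np.array([12., 17., 12., 18., 18.]),
          np.array([14., 18., 18., 19., 19.])]

all_y = np.concatenate(groups)
grand = all_y.mean()
k = len(groups)
N = len(all_y)

SSW = sum(np.sum((g - g.mean()) ** 2) for g in groups)       # within groups (= RSS)
SSB = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # between groups
SST = np.sum((all_y - grand) ** 2)

F = (SSB / (k - 1)) / (SSW / (N - k))
print(np.isclose(SST, SSW + SSB), F)                          # SST = SSW + SSB, and the F-ratio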
Example. Tensile strength data. k = 5, ni = 5. ANOVA is as follows.
source d.f. SS MS F
Factor levels (% cotton) 4 475.76 118.94 14.76 >> 4.43 = F4,20 (.99)
Error 20 161.20 8.06
Total(corrected) 24 636.96
R² = 475.76/636.96 ≈ 75%.
Now that the ANOVA H0 has been rejected, we should look at the group
means (estimates) closely. Suppose we want to compare µr and µs either
with H0 : µr = µs or using a confidence interval for µr − µs .
\hat\mu_r - \hat\mu_s = \bar y_{r.} - \bar y_{s.} \sim N\Big(\mu_r - \mu_s,\ \sigma^2\Big(\frac{1}{n_r} + \frac{1}{n_s}\Big)\Big),

independently of

\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i.})^2 \sim \sigma^2 \chi^2_{\sum_{i=1}^{k} n_i - k}.

Therefore,

\frac{(\bar y_{r.} - \bar y_{s.}) - (\mu_r - \mu_s)}{\sqrt{\Big(\frac{1}{n_r} + \frac{1}{n_s}\Big) \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i.})^2 \Big/ \big(\sum_{i=1}^{k} n_i - k\big)}} \sim t_{\sum_{i=1}^{k} n_i - k}.

In particular, the statistic obtained by setting µr − µs = 0 in the numerator has this t distribution if H0 : µr = µs is true, and inverting the pivot gives a confidence interval for µr − µs .
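A minimal sketch of this pairwise comparison (the helper below is hypothetical and the data are made up): it pools the residual sum of squares over all k groups and returns a t-based confidence interval for µr − µs.

```python
# Sketch: t-based confidence interval for mu_r - mu_s in the one-way model,
# using the pooled RSS from all k groups (hypothetical helper and data).
import numpy as np
from scipy import stats

def pairwise_ci(groups, r, s, alpha=0.05):
    n = np.array([len(g) for g in groups])
    N, k = n.sum(), len(groups)
    rss = sum(((g - g.mean()) ** 2).sum() for g in groups)  # RSS on N - k d.f.
    sigma2_hat = rss / (N - k)
    diff = groups[r].mean() - groups[s].mean()              # ybar_r. - ybar_s.
    se = np.sqrt(sigma2_hat * (1 / n[r] + 1 / n[s]))
    tq = stats.t.ppf(1 - alpha / 2, N - k)
    return diff - tq * se, diff + tq * se

groups = [np.array([12.1, 11.4, 13.0, 12.7]),
          np.array([14.2, 15.1, 13.8]),
          np.array([10.9, 11.7, 11.2, 10.5, 11.0])]
print(pairwise_ci(groups, 0, 1))   # 95% CI for mu_1 - mu_2
```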
Multiple comparison of group means
yij = µi + ǫij , j = 1, 2, . . . , ni ; i = 1, 2, . . . , k, ǫij ∼ N (0, σ 2 ) i.i.d.
The classic ANOVA test is the test of H0 : µ1 = µ2 = · · · = µk , which by itself is of limited interest, since the hypothesis is usually not true. What an experimenter usually wants to find out is which treatments are better, so rejection of H0 is usually not the end of the analysis. Once it is rejected, further work is needed to find out why it was rejected.
Definition. A linear parametric function \sum_{i=1}^{k} a_i \mu_i = a'\mu with known constants a1 , . . . , ak satisfying \sum_{i=1}^{k} a_i = a'1 = 0 is called a contrast (linear contrast).
Example. If a = (1, −1, 0, . . . , 0)′ , then a′ µ = µ1 − µ2 .
Result. µ1 = µ2 = · · · = µk if and only if a′ µ = 0 for all a ∈ A = \{a = (a_1 , . . . , a_k)' : \sum_{i=1}^{k} a_i = 0\}.
Thus, if H0 fails, at least one of the hypotheses Ha : a′ µ = 0 must fail for some a ∈ A, i.e., a′ µ ≠ 0. The experimenter may be interested in this contrast, and its inference. Consider inference for any linear parametric function a′ µ = \sum_{i=1}^{k} a_i \mu_i . We have the model
yij ∼ N (µi , σ 2 ), j = 1, 2, . . . , ni ; i = 1, 2, . . . , k, independent. Then ȳi. ∼ N (µi , σ 2 /ni ), i = 1, 2, . . . , k, independent, and

E\Big(\sum_{i=1}^{k} a_i \bar y_{i.}\Big) = \sum_{i=1}^{k} a_i \mu_i = a'\mu, \qquad \mathrm{Var}\Big(\sum_{i=1}^{k} a_i \bar y_{i.}\Big) = \sigma^2 \sum_{i=1}^{k} \frac{a_i^2}{n_i},

so that

\frac{\sum_{i=1}^{k} a_i \bar y_{i.} - \sum_{i=1}^{k} a_i \mu_i}{\sqrt{\sigma^2 \sum_{i=1}^{k} a_i^2 / n_i}} \sim N(0, 1).
That the numerator above is independent of S_p^2 is just a repeat of our old result that RSS = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar y_{i.})^2 = S_p^2 is independent of β̂ = µ̂. Thus, as discussed previously,

\frac{a'\bar y - a'\mu}{\sqrt{S_p^2 \Big(\sum_{i=1}^{k} \frac{a_i^2}{n_i}\Big) \Big/ \big(\sum_{i=1}^{k} n_i - k\big)}} \sim t_{\sum_{i=1}^{k} n_i - k},
so that

a'\bar y \pm t_{\sum_{i=1}^{k} n_i - k}(1 - \alpha/2)\, \sqrt{S_p^2 \Big(\sum_{i=1}^{k} \frac{a_i^2}{n_i}\Big) \Big/ \Big(\sum_{i=1}^{k} n_i - k\Big)}

is a 100(1 − α)% confidence interval for a′ µ, and Ha : a′ µ = 0 is rejected at level α if

\frac{|a'\bar y|}{\sqrt{S_p^2 \Big(\sum_{i=1}^{k} \frac{a_i^2}{n_i}\Big) \Big/ \Big(\sum_{i=1}^{k} n_i - k\Big)}} > t_{\sum_{i=1}^{k} n_i - k}(1 - \alpha/2).
Now consider simultaneous inference for several contrasts. For events A1 , . . . , An ,

1 - P\big(\cap_{i=1}^{n} A_i\big) = P\big(\cup_{i=1}^{n} A_i^c\big) \le \sum_{i=1}^{n} (1 - P(A_i)) = n - \sum_{i=1}^{n} P(A_i), \quad\text{or}\quad P\big(\cap_{i=1}^{n} A_i\big) \ge \sum_{i=1}^{n} P(A_i) - (n - 1).
This is known as the Bonferroni inequality. Apply this to the above problem: if we want a simultaneous confidence set for a^{(1)\prime}\mu, . . . , a^{(d)\prime}\mu, consider

C = \Big\{ a^{(j)\prime}\bar y \pm t_{\sum_{i} n_i - k}\big(1 - \tfrac{\alpha}{2d}\big) \sqrt{S_p^2 \Big(\sum_{i=1}^{k} \frac{(a_i^{(j)})^2}{n_i}\Big) \Big/ \Big(\sum_{i=1}^{k} n_i - k\Big)},\ j = 1, 2, . . . , d \Big\}.

Then, writing Al for the event that the lth interval covers a^{(l)\prime}\mu,

P(C) = P\big(\cap_{l=1}^{d} A_l\big) \ge \sum_{l=1}^{d} P(A_l) - (d - 1) = d\big(1 - \tfrac{\alpha}{d}\big) - (d - 1) = d - \alpha - d + 1 = 1 - \alpha.
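A sketch of these Bonferroni-adjusted intervals (hypothetical helper; contrasts and data chosen only for illustration); the only change from a single interval is that the t quantile is taken at 1 − α/(2d).

```python
# Sketch: simultaneous (Bonferroni) confidence intervals for d contrasts a'mu
# in the one-way model (hypothetical helper, contrasts and data).
import numpy as np
from scipy import stats

def bonferroni_contrast_cis(groups, contrasts, alpha=0.05):
    n = np.array([len(g) for g in groups])
    N, k, d = n.sum(), len(groups), len(contrasts)
    ybar = np.array([g.mean() for g in groups])
    rss = sum(((g - g.mean()) ** 2).sum() for g in groups)   # pooled RSS, N - k d.f.
    sigma2_hat = rss / (N - k)
    tq = stats.t.ppf(1 - alpha / (2 * d), N - k)             # quantile at 1 - alpha/(2d)
    cis = []
    for a in contrasts:
        a = np.asarray(a, dtype=float)
        assert np.isclose(a.sum(), 0.0)                      # each a must be a contrast
        est = a @ ybar                                       # a' ybar
        se = np.sqrt(sigma2_hat * np.sum(a ** 2 / n))
        cis.append((est - tq * se, est + tq * se))
    return cis

groups = [np.array([12.1, 11.4, 13.0, 12.7]),
          np.array([14.2, 15.1, 13.8]),
          np.array([10.9, 11.7, 11.2, 10.5, 11.0])]
contrasts = [(1, -1, 0), (1, 0, -1), (0.5, 0.5, -1)]         # d = 3 contrasts
print(bonferroni_contrast_cis(groups, contrasts))
```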
Reparametrization of the one-way model.
Suppose the ni are all equal, and equal to J. Also, let the number of groups be k = I. Then \sum_{i=1}^{k} n_i = IJ, \bar y_{i.} = \sum_{j=1}^{J} y_{ij}/J for i = 1, . . . , I, and \bar y_{..} = \sum_{i=1}^{I}\sum_{j=1}^{J} y_{ij}/(IJ). Further,
SSW = \sum_{i=1}^{I}\sum_{j=1}^{J} (y_{ij} - \bar y_{i.})^2 has d.f. IJ − I;
SSB = \sum_{i=1}^{I} n_i (\bar y_{i.} - \bar y_{..})^2 = J \sum_{i=1}^{I} (\bar y_{i.} - \bar y_{..})^2 has d.f. I − 1.
We can rewrite the model yij = µi + ǫij , ǫij ∼ N (0, σ 2 ) i.i.d., as follows.
µi = µ̄. + (µi − µ̄. ) = µ + αi , where µ̄. = \sum_{i=1}^{I} \mu_i / I and αi = µi − µ̄. . Then \sum_{i=1}^{I} \alpha_i = \alpha_. = \sum_{i=1}^{I} (\mu_i - \bar\mu_.) = 0. Further, H0 : µ1 = µ2 = · · · = µI is the same as H0 : α1 = α2 = · · · = αI−1 = 0 (α. = 0 implies that αI = −\sum_{i=1}^{I-1} \alpha_i = 0 also).
Similarly, write ǭi. = ǭ.. + (ǭi. − ǭ.. ), so that
ǫij = ǭ.. + (ǭi. − ǭ.. ) + (ǫij − ǭi. ). Therefore,

\sum_{i=1}^{I}\sum_{j=1}^{J} \epsilon_{ij}^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \bar\epsilon_{..}^2 + \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar\epsilon_{i.} - \bar\epsilon_{..})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J} (\epsilon_{ij} - \bar\epsilon_{i.})^2,

since \bar\epsilon_{..} \sum_{i=1}^{I} (\bar\epsilon_{i.} - \bar\epsilon_{..}) = 0, \bar\epsilon_{..} \sum_{i=1}^{I}\sum_{j=1}^{J} (\epsilon_{ij} - \bar\epsilon_{i.}) = 0 and
\sum_{i=1}^{I}\sum_{j=1}^{J} (\bar\epsilon_{i.} - \bar\epsilon_{..})(\epsilon_{ij} - \bar\epsilon_{i.}) = \sum_{i=1}^{I} (\bar\epsilon_{i.} - \bar\epsilon_{..}) \sum_{j=1}^{J} (\epsilon_{ij} - \bar\epsilon_{i.}) = 0.
Now, since ǫij = yij − µ − αi , we get ǭi. = ȳi. − µ − αi , ǭ.. = ȳ.. − µ, and further, from above,

\sum_{i=1}^{I}\sum_{j=1}^{J} (y_{ij} - \mu - \alpha_i)^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar y_{..} - \mu)^2 + \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar y_{i.} - \bar y_{..} - \alpha_i)^2 + \sum_{i=1}^{I}\sum_{j=1}^{J} (y_{ij} - \bar y_{i.})^2.

Least squares estimates subject to \sum_{i=1}^{I} \alpha_i = 0 may be obtained simply by examination of the above, and they are µ̂ = ȳ.. and α̂i = ȳi. − ȳ.. , with RSS = \sum_{i=1}^{I}\sum_{j=1}^{J} (y_{ij} - \bar y_{i.})^2 as before.
Under H0 : α1 = α2 = · · · = αI−1 = 0, we have

\sum_{i=1}^{I}\sum_{j=1}^{J} (y_{ij} - \mu - \alpha_i)^2 \equiv \sum_{i=1}^{I}\sum_{j=1}^{J} (y_{ij} - \mu)^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar y_{..} - \mu)^2 + \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar y_{i.} - \bar y_{..})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J} (y_{ij} - \bar y_{i.})^2,

so that, then, \sum_{i=1}^{I}\sum_{j=1}^{J} (y_{ij} - \mu)^2 is minimized when µ̂ = ȳ.. (with αi = 0). We then get

\mathrm{RSS}_{H_0} = \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar y_{i.} - \bar y_{..})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J} (y_{ij} - \bar y_{i.})^2 = J \sum_{i=1}^{I} (\bar y_{i.} - \bar y_{..})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J} (y_{ij} - \bar y_{i.})^2.

Therefore,

\mathrm{RSS}_{H_0} - \mathrm{RSS} = J \sum_{i=1}^{I} (\bar y_{i.} - \bar y_{..})^2.
Note that all of this can be obtained just by inspection, even though we have derived these results previously using other methods. The simplicity of this approach is very useful for higher-way classification models.
One-way ANOVA with equal number of observations per group:

source        d.f.      SS                            MS              F
Treatments    I − 1     SSB = J Σi (ȳi. − ȳ.. )²      SSB /(I − 1)    [SSB /(I − 1)] / [SSW /(IJ − I)]
Error         IJ − I    SSW = Σi Σj (yij − ȳi. )²     SSW /(IJ − I)
Total (c)     IJ − 1    Σi Σj (yij − ȳ.. )²
This approach of reparametrization and decomposition generalizes to higher-
way classification where there are substantial simplifications.
Example (battery design). An engineer must choose among three plate materials for the battery of a device. Once the device goes to the field, the engineer has no control over the temperature extremes that the device will encounter, and he knows from past experience that temperature may impact the effective battery life. However, temperature can be
controlled in the product development laboratory for the purposes of testing.
The engineer decides to test all three plate materials at three different tem-
perature levels, 15◦ F, 70◦ F and 125◦ F (-10, 21 and 51 degree C), as these tem-
perature levels are consistent with the product end-use environment. Four
batteries are tested at each combination of plate material and temperature,
and the 36 tests are run in random order.
Question 1. What effects do material type and temperature have on the life
of the battery?
Question 2. Is there a choice of material that would give uniformly long life
regardless of temperature? (Robust product design?)
Life (in hrs) data for the battery design experiment (four replicate batteries per material × temperature combination):

material                 temperature (◦ F)
type          15                   70                    125
1        130 155 74 180       34 40 80 75           20 70 82 58
2        150 188 159 126      126 122 106 115       25 70 58 45
3        138 110 168 160      174 120 150 139       96 104 82 60
Both factors, material type and temperature, are important, and there may also be interaction between the two. Let us denote the row factor as factor
A and column factor as factor B (in general). Then the model for the data
may be developed as follows.
Let yijk be the observed response when factor A is at the ith level (i =
1, 2, . . . , I) and factor B is at the jth level (j = 1, 2, . . . , J) for the kth
replicate (k = 1, 2, . . . , K). In the example, I = 3, J = 3, K = 4. This design
is like having IJ different cells each of which has K observations, and one
wants to see if the IJ cell means are different or not (in various ways).
Write µij = E(yijk ) for the mean of cell (i, j), and reparametrize the cell means as µij = µ + αi + βj + (αβ)ij , where µ = µ̄.. , αi = µ̄i. − µ̄.. , βj = µ̄.j − µ̄.. and (αβ)ij = µij − µ̄i. − µ̄.j + µ̄.. . Then note that Σi αi = 0, Σj βj = 0, Σi (αβ)ij = 0 for all j and Σj (αβ)ij = 0 for all i.
To see why (αβ)ij measures interaction, suppose the difference between two levels i1 , i2 of factor A is the same at every level of factor B: µi1 j − µi2 j = φ(i1 , i2 ) for all j. Then

\mu_{i_1 j} - \mu_{i_2 j} = \phi(i_1, i_2) = \frac{1}{J} \sum_{j'=1}^{J} \phi(i_1, i_2) = \frac{1}{J} \sum_{j'=1}^{J} (\mu_{i_1 j'} - \mu_{i_2 j'}) = \bar\mu_{i_1 .} - \bar\mu_{i_2 .},

for all i1 , i2 . Or, equivalently, µi1 j − µ̄i1 . = µi2 j − µ̄i2 . for all i1 , i2 , i.e., µij − µ̄i. does not depend on i; averaging over i, it must equal µ̄.j − µ̄.. , i.e., µij − µ̄i. − µ̄.j + µ̄.. = 0 for all i, j. Because of symmetry, we could have begun with µij1 − µij2 depending on j1 , j2 but not on i. Thus, we see that (αβ)ij = µij − µ̄i. − µ̄.j + µ̄.. measures the interaction of i and j. Therefore, to investigate the existence of interaction, we should test
HAB : (αβ)ij = 0 (i = 1, 2, . . . , I; j = 1, 2, . . . , J), the restricted model without interaction. Estimation of (αβ)ij can also be considered. Now, consider the main effects of factors A and B.
To test for lack of difference in levels of factor A, use HA : αi = 0 for all i.
To test for lack of difference in levels of factor B, use HB : βj = 0 for all j.
If HAB : (αβ)ij = 0 has been rejected, there is evidence of significant interaction, so the factor effects cannot be non-existent.
Now write the observations as yijk = µ + αi + βj + (αβ)ij + ǫijk , and decompose the errors as
ǫijk = ǭ... + (ǭi.. − ǭ... ) + (ǭ.j. − ǭ... ) + (ǭij. − ǭi.. − ǭ.j. + ǭ... ) + (ǫijk − ǭij. ).
Therefore, as in one-way classification,

\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} \epsilon_{ijk}^2 = IJK\,\bar\epsilon_{...}^2 + JK \sum_{i=1}^{I} (\bar\epsilon_{i..} - \bar\epsilon_{...})^2 + IK \sum_{j=1}^{J} (\bar\epsilon_{.j.} - \bar\epsilon_{...})^2 + K \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar\epsilon_{ij.} - \bar\epsilon_{i..} - \bar\epsilon_{.j.} + \bar\epsilon_{...})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} (\epsilon_{ijk} - \bar\epsilon_{ij.})^2,

since the cross products vanish. Noting that ǫijk = yijk − µ − αi − βj − (αβ)ij , with Σi αi = 0, Σj βj = 0, Σi (αβ)ij = 0 for all j and Σj (αβ)ij = 0 for all i, we get ǭ... = ȳ... − µ, ǭi.. = ȳi.. − µ − αi , ǭ.j. = ȳ.j. − µ − βj , ǭij. = ȳij. − µ − αi − βj − (αβ)ij . Hence,

\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} (y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij})^2 = IJK (\bar y_{...} - \mu)^2 + JK \sum_{i=1}^{I} (\bar y_{i..} - \bar y_{...} - \alpha_i)^2 + IK \sum_{j=1}^{J} (\bar y_{.j.} - \bar y_{...} - \beta_j)^2 + K \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar y_{ij.} - \bar y_{i..} - \bar y_{.j.} + \bar y_{...} - (\alpha\beta)_{ij})^2 + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} (y_{ijk} - \bar y_{ij.})^2.
By inspection, the least squares estimates are µ̂ = ȳ... , α̂i = ȳi.. − ȳ... , β̂j = ȳ.j. − ȳ... and (αβ)ˆij = ȳij. − ȳi.. − ȳ.j. + ȳ... , with RSS = Σi Σj Σk (yijk − ȳij. )² on IJ(K − 1) d.f. Under HAB (i.e., with (αβ)ij = 0 for all i, j), the estimates of the remaining parameters are unchanged, so

\mathrm{RSS}_{H_{AB}} = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} (y_{ijk} - \bar y_{ij.})^2 + K \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar y_{ij.} - \bar y_{i..} - \bar y_{.j.} + \bar y_{...})^2,

\mathrm{RSS}_{H_{AB}} - \mathrm{RSS} = K \sum_{i=1}^{I}\sum_{j=1}^{J} (\bar y_{ij.} - \bar y_{i..} - \bar y_{.j.} + \bar y_{...})^2 = K \sum_{i=1}^{I}\sum_{j=1}^{J} \widehat{(\alpha\beta)}_{ij}^2,

which has d.f. (I − 1)(J − 1). To test HAB , use

F_{AB} = \frac{(\mathrm{RSS}_{H_{AB}} - \mathrm{RSS})/\{(I-1)(J-1)\}}{\mathrm{RSS}/\{IJ(K-1)\}} \sim F_{(I-1)(J-1),\ IJ(K-1)}
under HAB . Now consider HA : αi = 0 for all i. There are I − 1 linearly
independent equations here, so the rank of A matrix is I − 1. Again, by
inspection, note that estimates of the remaining parameters, µ, βj and (αβ)ij
remain unchanged, so
\mathrm{RSS}_{H_A} = \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} (y_{ijk} - \bar y_{ij.})^2 + JK \sum_{i=1}^{I} (\bar y_{i..} - \bar y_{...})^2, \quad\text{so}\quad \mathrm{RSS}_{H_A} - \mathrm{RSS} = JK \sum_{i=1}^{I} (\bar y_{i..} - \bar y_{...})^2 = JK \sum_{i=1}^{I} \hat\alpha_i^2,

and similarly

\mathrm{RSS}_{H_B} - \mathrm{RSS} = IK \sum_{j=1}^{J} (\bar y_{.j.} - \bar y_{...})^2 = IK \sum_{j=1}^{J} \hat\beta_j^2,

on I − 1 and J − 1 d.f. respectively, each to be compared with RSS/{IJ(K − 1)} through the corresponding F ratio. Putting everything together, the (uncorrected) total sum of squares decomposes as

\sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} y_{ijk}^2 = IJK\,\bar y_{...}^2 + JK \sum_{i=1}^{I} \hat\alpha_i^2 + IK \sum_{j=1}^{J} \hat\beta_j^2 + K \sum_{i=1}^{I}\sum_{j=1}^{J} \widehat{(\alpha\beta)}_{ij}^2 + \sum_{i=1}^{I}\sum_{j=1}^{J}\sum_{k=1}^{K} (y_{ijk} - \bar y_{ij.})^2,

along with the corresponding decomposition of degrees of freedom,

IJK = 1 + (I − 1) + (J − 1) + (IJ − I − J + 1) + (IJK − IJ).
ANOVA table for the 2-factor analysis:

source            d.f.              SS                                MS                           F
A main effects    I − 1             SSA = JK Σi α̂i²                   MSA = SSA /(I − 1)           FA = MSA /MSE
B main effects    J − 1             SSB = IK Σj β̂j²                   MSB = SSB /(J − 1)           FB = MSB /MSE
AB interactions   (I − 1)(J − 1)    SSAB = K Σi Σj (αβ)ˆij²            MSAB = SSAB /{(I−1)(J−1)}    FAB = MSAB /MSE
Error             IJ(K − 1)         RSS = Σi Σj Σk (yijk − ȳij. )²    MSE = RSS/{IJ(K − 1)}
Total (c)         IJK − 1           Σi Σj Σk (yijk − ȳ... )²
Mean              1                 IJK ȳ...²
Total             IJK               Σi Σj Σk y²ijk
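To connect this table to the battery-life example, here is a numpy sketch (not part of the original notes) that applies the formulas above to the data given earlier; the array below is transcribed from that table.

```python
# Two-way ANOVA with interaction for the battery-life data:
# I = 3 materials, J = 3 temperature levels (15, 70, 125 F), K = 4 replicates.
import numpy as np
from scipy import stats

# y[i, j, :] holds the K replicates for material i+1 at temperature level j+1.
y = np.array([
    [[130, 155,  74, 180], [ 34,  40,  80,  75], [ 20,  70,  82,  58]],
    [[150, 188, 159, 126], [126, 122, 106, 115], [ 25,  70,  58,  45]],
    [[138, 110, 168, 160], [174, 120, 150, 139], [ 96, 104,  82,  60]],
], dtype=float)
I, J, K = y.shape

ybar_cell = y.mean(axis=2)        # \bar y_{ij.}
ybar_A = y.mean(axis=(1, 2))      # \bar y_{i..}
ybar_B = y.mean(axis=(0, 2))      # \bar y_{.j.}
ybar = y.mean()                   # \bar y_{...}

SSA = J * K * ((ybar_A - ybar) ** 2).sum()
SSB = I * K * ((ybar_B - ybar) ** 2).sum()
SSAB = K * ((ybar_cell - ybar_A[:, None] - ybar_B[None, :] + ybar) ** 2).sum()
SSE = ((y - ybar_cell[:, :, None]) ** 2).sum()          # RSS
SST = ((y - ybar) ** 2).sum()                           # total (corrected) SS
assert np.isclose(SST, SSA + SSB + SSAB + SSE)

dfE = I * J * (K - 1)
MSE = SSE / dfE
for name, ss, df in [("A (material)", SSA, I - 1),
                     ("B (temperature)", SSB, J - 1),
                     ("AB interaction", SSAB, (I - 1) * (J - 1))]:
    F = (ss / df) / MSE
    print(name, round(ss, 2), df, round(F, 2), round(stats.f.sf(F, df, dfE), 4))
print("Error", round(SSE, 2), dfE)
```

The resulting F ratios bear on Question 1 (whether material type and temperature affect battery life), while inspecting the cell means ȳij. (ybar_cell above) bears on Question 2.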