
Statistics 3: Linear Models and Linear Regression

Prerequisites: Statistics I and II; Probability I and II; Linear/Matrix Algebra; Proficiency in using R Statistical Software

References

1. C. R. Rao: Linear Statistical Inference and its Applications, Wiley (1973)
2. S. R. Searle: Linear Models, Wiley (1971)
3. R. Christensen: Linear Models, Marcel-Dekker (1983)
4. R. Rao and P. Bhimasankaram: Linear Algebra, 2nd edition, Hindustan Book Agency (2000)
5. R. Bapat: Linear Algebra and Linear Models, Hindustan Book Agency (1999)

Grading (Tentative): 20 marks for assignments; 20 marks each for two class tests; 40 marks for the final exam
Lectures: Online lectures will be held on Zoom; Time: Monday, Wednesday, Friday, 2-3 pm.
Lecture Notes: Lecture notes will be posted on Moodle. An attempt will be made to post recordings of the lectures on Moodle as well.
Assignments: Assignments will also be posted on Moodle. Answers must be submitted by uploading them to Moodle.
For any contingency, you may contact me at
e-mail: [email protected]
mobile/whatsapp: 9880127065
Linear Models
It is of interest to see whether a useful relationship exists between two random variables, X and Y. The eventual objective may be either prediction of a future value or use of the relationship to understand the underlying structure.
Ex. X = height, Y = weight of individuals. One may ask: is there an optimal weight for a given height?
Data: (xi , yi ), observations from n randomly chosen individuals, i = 1, 2, . . . , n.
Ex. X = temperature, Y = pressure of a certain volume of gas.
Data: (xi , yi ), i = 1, 2, . . . , n from a controlled experiment where a certain
volume of gas is subjected to different temperatures and the resulting pressure
is measured.
Ex. In a biological assay, Y = response corresponding to a dosage level of
X = x. Again, (xi , yi ), i = 1, 2, . . . , n from n laboratory subjects.
Ex. In an agricultural experiment, y is the yield of a crop. A piece of land
is divided into I plots according to soil fertility; J different fertilizer levels
are also used. Then, if yij is the yield from the ith plot receiving jth level of
fertilizer, we might like to try the model:
yij = µ + αi + τj + ǫij. Why do we need ǫij? It is a random error (measurement error, noise, or uncontrolled variability) that accounts for the variation not explained by the model; such a term is needed in each of the other examples as well.
In general,
yi = α + βxi + ǫi,    (1)
where y is the response variable, x is the predictor variable, and α and β are unknown coefficients, is called a linear model. Here 'linear' refers to the linear space structure: the model is linear, or additive, in the coefficients, and not necessarily linear in x, as will be seen later. Equation (1) expresses the linear or additive relationship between E(Y|X = x) and the influencing factors.
Observe the following data and the scatter plot of y versus x, where x =
duration and y = interval (both in minutes) for eruptions of Old Faithful
Geyser.

x y x y x y x y x y x y
4.4 78 3.9 74 4.0 68 4.0 76 3.5 80 4.1 84
2.3 50 4.7 93 1.7 55 4.9 76 1.7 58 4.6 74
3.4 75 4.3 80 1.7 56 3.9 80 3.7 69 3.1 57
4.0 90 1.8 42 4.1 91 1.8 51 3.2 79 1.9 53
4.6 82 2.0 51 4.5 76 3.9 82 4.3 84 2.3 53
3.8 86 1.9 51 4.6 85 1.8 45 4.7 88 1.8 51
4.6 80 1.9 49 3.5 82 4.0 75 3.7 73 3.7 67
4.3 68 3.6 86 3.8 72 3.8 75 3.8 75 2.5 66
4.5 84 4.1 70 3.7 79 3.8 60 3.4 86

Table 1: Eruptions of Old Faithful Geyser, August 1 – 4, 1978

[Figure: scatter plot of y (interval) versus x (duration) for the Old Faithful data.]
(1) is a linear model for E(y|x), so ǫ denotes the spread or dispersion around this line, i.e., y = E(y|x) + ǫ. If we let g(x) = E(y|x), assuming g to be smooth, we could consider the approximation
g(x) ≈ g(0) + g'(0)x + (g''(0)/2!)x^2 + ... + (g^(k)(0)/k!)x^k = β0 + β1 x + β2 x^2 + ... + βk x^k.
This is linear in the coefficients β0, β1, ... but not in x. Also, recall the Weierstrass theorem on uniformly approximating any continuous function on a closed interval by polynomials. Thus, on a reasonable range of x values, such a 'linear' approximation may be quite acceptable. More importantly, special tools and techniques from linear spaces and linear algebra are available for studying linear models.
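As a quick illustration of fitting (1), here is a minimal R sketch; it uses only the first five (x, y) pairs of Table 1 above, typed in by hand, so the numbers are for illustration only.

# Fit the straight-line model (1) to a few Old Faithful observations from Table 1.
duration <- c(4.4, 2.3, 3.4, 4.0, 4.6)    # x = duration (first few rows of Table 1 only)
interval <- c(78, 50, 75, 90, 82)         # y = interval
fit <- lm(interval ~ duration)            # least squares fit of y = alpha + beta*x + eps
summary(fit)                              # estimated coefficients and standard errors
plot(duration, interval); abline(fit)     # scatter plot with the fitted line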

MULTIPLE LINEAR REGRESSION MODEL
The response y is often influenced by more than one predictor variable. For
example, the yield of a crop may depend on the amount of nitrogen, potash,
and phosphate fertilizers used. These variables are controlled by the exper-
imenter, but the yield may also depend on uncontrollable variables such as
those associated with weather. A linear model relating the response y to
several predictors has the form

y = β0 + β1 x1 + β2 x2 + . . . + βp−1 xp−1 + ǫ. (2)

The parameters β0 , β1 , . . . , βp−1 are called regression coefficients. The pres-


ence of ǫ provides for random variation in y not explained by the x variables.
This random variation may be due partly to other variables that affect y
but are not known or not observed. The model in (2) is linear in the β
parameters; it is not necessarily linear in the x variables. Thus models such
as
y = β0 + β1 x1 + β2 x2^2 + β3 x3 + β4 sin(x2) + ǫ
are included in the designation linear model. A model provides a theoretical
framework for better understanding of a phenomenon of interest. Thus a
model is a mathematical construct that we believe may represent the mech-
anism that generated the observations at hand. The postulated model may
be an idealized oversimplification of the complex real-world situation, but
in many such cases, empirical models provide useful approximations of the
relationships among variables. These relationships may be either associative
or causative.
Regression models such as (2) are used for various purposes, including the
following:

Prediction. Estimates of the individual parameters β0, β1, ... are of less importance for prediction than the overall influence of the x variables on y. However, good estimates are needed to achieve good prediction performance.

Data Description or Explanation. The scientist or engineer uses the estimated model to summarize or describe the observed data.

Parameter Estimation. The values of the estimated parameters may have theoretical implications for a postulated model.

Variable Selection or Screening. The emphasis is on determining the importance of each predictor variable in modeling the variation in y. The predictors that are associated with an important amount of variation in y are retained; those that contribute little are deleted.

Control of Output. A cause-and-effect relationship between y and the x variables is assumed. The estimated model might then be used to control the output of a process by varying the inputs. By systematic experimentation, it may be possible to achieve the optimal output.

There is a fundamental difference between the first purpose (prediction) and the last (control of output). For prediction, we need only assume that the same correlations that prevailed when the data were collected continue to hold when the predictions are made. Showing that there is a significant relationship between y and the x variables in (2) does not necessarily prove that the relationship is causal. To establish causality in order to control output, the researcher must choose the values of the x variables in the model and use randomization to avoid the effects of other possible variables unaccounted for. In other words, to ascertain the effect of the x variables on y, it is necessary to actually change the x variables and observe the effect on y.

Vector-matrix form of linear model.
Data is of the form: (yi , xi ), i = 1, 2, . . . , n, xi = (xi0 = 1, xi1 , . . . , xi(p−1) )′ .
The linear model is:

yi = β0 + β1 xi1 + ... + β_{p-1} xi(p-1) + ǫi = Σ_{j=0}^{p-1} βj xij + ǫi,   i = 1, 2, ..., n;  xi0 = 1.
Equivalently,
(y1, y2, ..., yn)' = X (β0, β1, ..., β_{p-1})' + (ǫ1, ǫ2, ..., ǫn)', where X is the n × p matrix whose ith row is (1, xi1, ..., xi(p-1)); in short,
y = Xβ + ǫ.

yn×1 is the response vector, Xn×p is the matrix of predictors or covariates,


βp×1 is the vector of regression coefficients, and ǫ is random noise. y is
random since ǫ is random. X is treated as a fixed matrix and β is a fixed
but unknown vector of parameters. Note that the model involves random
vectors and matrices, so some preliminaries on these are needed before we
can proceed further.
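As an aside, a minimal R sketch of this vector-matrix form; the predictors and coefficients below are made up for illustration.

# Build Y = X beta + eps for simulated data with p = 3 (intercept plus two predictors).
set.seed(1)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)            # assumed illustrative predictors
X  <- cbind(1, x1, x2)                    # n x p matrix whose first column is identically 1
beta <- c(2, 1, -0.5)
Y  <- X %*% beta + rnorm(n)               # response vector y = X beta + eps
dim(X); length(Y)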

Multivariate Distributions

A random vector T is a vector whose elements have a joint distribution. i.e.,


if (Ω, A, P) is a probability space, T_{p×1} : Ω → R^p is such that T^{-1}(B) ∈ A for every Borel set B ⊂ R^p, and hence Pr(T ∈ B) = P(T^{-1}(B)).
Thus, X = (X1, ..., Xp)' is a random vector if the Xi's are random variables with a joint distribution. If the joint density exists, we have f(x) ≥ 0 for all x ∈ R^p such that
∫_{R^p} f(x) dx = 1   and   P(X ∈ A) = ∫_A f(x) dx,   A ⊂ R^p.

Example. (X1, X2)' ∼ N((µ1, µ2)', [σ1^2, ρσ1σ2; ρσ1σ2, σ2^2]), −1 < ρ < 1, if
f(x1, x2) = (2π σ1 σ2 √(1 − ρ^2))^{-1} exp{ −(1/(2(1 − ρ^2))) [ ((x1 − µ1)/σ1)^2 − 2ρ((x1 − µ1)/σ1)((x2 − µ2)/σ2) + ((x2 − µ2)/σ2)^2 ] }.
Check that E(Xi) = µi, Var(Xi) = σi^2, i = 1, 2, and Cov(X1, X2) = ρσ1σ2.

Example. (X1, X2, X3)' ∼ Uniform on the unit ball if
f(x1, x2, x3) = 3/(4π) if x1^2 + x2^2 + x3^2 ≤ 1, and 0 otherwise.

Let X = (X1, ..., Xp)' be a random vector and assume µi = E(Xi) exists for all i. Then define E(X) = (µ1, ..., µp)' as the mean vector of X. A random matrix Z_{p×q} = ((z_ij)) is a matrix whose elements are jointly distributed random variables. If G(Z) is a matrix-valued function of Z, then E(G(Z)) = ((E(G_ij(Z)))).
If G(Z) = AZB, where A and B are constant matrices, E(G(Z)) = AE(Z)B.
If (Z, T ) has a joint distribution, and A, B, C, D are constant matrices,
E(AZB + CT D) = AE(Z)B + CE(T )D.
If Z is symmetric and positive semi-definite (i.e., nonnegative definite) with probability 1, then E(Z) is also symmetric and positive semi-definite: we show a'E(Z)a ≥ 0 for all a. Note that a'E(Z)a = E(a'Za) ≥ 0, since a'Za ≥ 0 wp 1 for every a.

Suppose Z_{p×p} is p.s.d. wp 1. Then its spectral decomposition gives Z = ΓDλΓ', where Γ is orthogonal and Dλ is diagonal. Let λi(Z) be the ith diagonal element of Dλ, with λ1(Z) ≥ λ2(Z) ≥ ... ≥ λp(Z) ≥ 0 wp 1. What about E(Z)? Is λi(E(Z)) = E(λi(Z))? No. However, E(Z) is p.s.d., so λi(E(Z)) ≥ 0.
Suppose Xp×1 has mean µ and also E[(Xi −µi )(Xj −µj )] = Cov(Xi , Xj ) = σij
exists for all i, j. i.e., σii < ∞ for all i. Then the covariance matrix (or the
variance-covariance matrix or the dispersion matrix) of X is defined as

Cov(X) = Σ = E [(X − µ)(X − µ)′ ] = ((E[(Xi − µi )(Xj − µj )])) = ((σij )).

Σ is symmetric, σii = V ar(Xi ) ≥ 0 and Σ is p.s.d.


Theorem. Σp×p is a covariance matrix (of some X) iff Σ is symmetric p.s.d.
Proof. (i) If Σ = Cov(X) for some X and E(X) = µ, then for any α ∈ Rp ,

α'Σα = α'Cov(X)α = α'E[(X − µ)(X − µ)']α = E[α'(X − µ)(X − µ)'α] = E[{α'(X − µ)}^2] = E[(α'X − α'µ)^2] = Var(α'X) ≥ 0,
so Σ is p.s.d. It is actually p.d. unless there exists α ≠ 0 such that Var(α'X) = 0 (i.e., α'X = c w.p. 1).
(ii) Now suppose Σ is any symmetric p.s.d matrix of rank r ≤ p. Then
Σ = CC ′ , Cp×r of rank r. Let Y1 , . . . , Yr be i.i.d with E(Yi ) = 0, V ar(Yi ) = 1.
Let Y = (Y1 , . . . , Yr )′ . Then E(Y ) = 0, Cov(Y ) = Ir . Let X = CY . Then
E(X) = 0 and
Cov(X) = E(XX ′ ) = E(CY Y ′ C ′ ) = CE(Y Y ′ )C ′ = CC ′ = Σ.
For a ≠ 0, a'Cov(X)a = 0 iff Cov(X)a = 0, i.e., iff Cov(X) has a zero eigenvalue.
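Part (ii) of the proof is constructive; a minimal R sketch of the same construction (the matrix Sigma below is an assumed example, and the Cholesky factor supplies C since this Sigma is p.d.).

# Construct X = C Y with Cov(X) = Sigma, as in part (ii) of the proof.
Sigma <- matrix(c(4, 2, 2, 3), 2, 2)                # an assumed symmetric p.d. matrix
C <- t(chol(Sigma))                                 # lower triangular C with C %*% t(C) = Sigma
Xs <- replicate(10000, as.numeric(C %*% rnorm(2)))  # many draws of X = C Y, Y i.i.d. mean 0, var 1
cov(t(Xs))                                          # approximately Sigma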
If Xp×1 and Yq×1 are jointly distributed with finite second moments for their
elements, and with E(X) = µ, E(Y ) = ν, then
Cov(Xp×1 , Yq×1 ) = ((Cov(Xi , Yj )))p×q = ((E(Xi −µi )(Yj −νj ))) = ((E(Xi Yj )−
µi νj )) = E(XY ′ ) − µν ′ = E[(X − E(X))(Y − E(Y ))′ ].
Cov(X) = Cov(X, X) = E[(X−E(X))(X−E(X))′ ] = E(XX ′ )−E(X)(E(X))′ .
Cov(AX, BY ) = A Cov(X, Y )B ′ ,
Cov(AX) = Cov(AX, AX) = A Cov(X, X)A′ = A Cov(X)A′ .

Consider X_{p×1} = (X1', X2')' and Y_{q×1} = (Y1', Y2')'. Then
Cov(X, Y) = [Cov(X1, Y1), Cov(X1, Y2); Cov(X2, Y1), Cov(X2, Y2)] ≠ Cov(Y, X) = [Cov(Y1, X1), Cov(Y1, X2); Cov(Y2, X1), Cov(Y2, X2)]
in general. Further, note,
Cov(X + Y) = Cov(X + Y, X + Y) = Cov(X, X) + Cov(X, Y) + Cov(Y, X) + Cov(Y, Y) = Cov(X) + Cov(Y) + Cov(X, Y) + Cov(X, Y)' ≠ Cov(X) + Cov(Y) + 2Cov(X, Y),
in general. If X and Y are independent, we do have Cov(X, Y) = ((Cov(Xi, Yj))) = 0, since Cov(Xi, Yj) = 0 for all i and j.
Quadratic Forms.
X'AX is called a quadratic form of X. Note that
E(X'AX) = E[tr(X'AX)] = E[tr(AXX')] = tr[E(AXX')] = tr[A E(XX')] = tr[A(Σ + µµ')] = tr(AΣ) + tr(Aµµ') = tr(AΣ) + µ'Aµ,
since Cov(X) = Σ = E[(X − µ)(X − µ)'] = E(XX' − Xµ' − µX' + µµ') = E(XX') − µµ'.
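A quick Monte Carlo check of E(X'AX) = tr(AΣ) + µ'Aµ in R; µ, Σ and A below are arbitrary choices.

# Verify E(X'AX) = tr(A Sigma) + mu' A mu by simulation.
set.seed(1)
mu    <- c(1, -1)
Sigma <- matrix(c(2, 1, 1, 2), 2, 2)
A     <- matrix(c(1, 0, 0, 3), 2, 2)
C     <- t(chol(Sigma))
qf    <- replicate(1e5, { x <- mu + C %*% rnorm(2); as.numeric(t(x) %*% A %*% x) })
mean(qf)                                      # Monte Carlo estimate of E(X'AX)
sum(diag(A %*% Sigma)) + t(mu) %*% A %*% mu   # tr(A Sigma) + mu'A mu (here 8 + 4 = 12)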

The moment generating function (mgf) of X at α is defined as φ_X(α) = E(exp(α'X)). This uniquely determines the probability distribution of X. Note that φ_X((t1, 0)') = E(exp(t1 X1)) = φ_{X1}(t1). If X and Y are independent,
φ_{X+Y}(t) = E(exp(t'(X + Y))) = E(exp(t'X) exp(t'Y)) = E(exp(t'X)) E(exp(t'Y)) = φ_X(t) φ_Y(t).
Theorem (Cramer-Wold device). If X is a random vector, its probability distribution is completely determined by the distributions of all linear functions α'X, α ∈ R^p.
Proof. The mgf of α'X, for any α ∈ R^p, is φ_{α'X}(t) = E(exp(tα'X)). Suppose this is known for all α ∈ R^p. Now, for any α, note that φ_X(α) = E(exp(α'X)) = φ_{α'X}(1), which is then known.
Remark. To define the joint multivariate distribution of a random vector,
it is enough to specify the distribution of all its linear functions.
Multivariate Normal Distribution
Definition. Xp×1 is p-variate normal if for every α ∈ Rp , the distribution
of α′ X is univariate normal.
Result. If X has the p-variate normal distribution, then both µ = E(X)
and Σ = Cov(X) exist and the distribution of X is determined by µ and Σ.
Proof. Let X = (X1, ..., Xp)'. Then for each i, Xi = αi'X where αi = (0, ..., 0, 1, 0, ..., 0)'. Therefore Xi = αi'X ∼ N(·, ·). Hence E(Xi) = µi and Var(Xi) = σii exist. Also, since |σij| = |Cov(Xi, Xj)| ≤ √(σii σjj), σij exists. Set µ = (µ1, ..., µp)' and Σ = ((σij)). Further, E(α'X) = α'µ and Var(α'X) = α'Σα, so
α'X ∼ N(α'µ, α'Σα) for all α ∈ R^p.
Since {α'X, α ∈ R^p} determines the distribution of X, µ and Σ suffice.
Notation: X ∼ Np (µ, Σ).
Result. If X ∼ Np (µ, Σ), then for any Ak×p , bk×1 ,
Y = AX + b ∼ Nk (Aµ + b, AΣA′ ).
Proof. Consider linear functions, α′ Y = α′ AX + α′ b = β ′ X + c, which
are univariate normal. Therefore Y is k-variate normal. E(Y ) = Aµ + b,
Cov(Y ) = Cov(AX) = AΣA′ .
Theorem. Xp×1 ∼ Np (µ, Σ) iff Xp×1 = Cp×r Zr×1 +µ where Z = (Z1 , . . . , Zr )′ ,
Zi i.i.d N (0, 1), Σ = CC ′ , r = rank(Σ) = rank(C).
Proof. if part: If X = CZ +µ and Z ∼ Nr (0, Ir ), then X ∼ Np (µ, CC ′ = Σ).

Z is multivariate normal since linear functions of Z are linear combinations of the Zi's, which are univariate normal (as can be shown using the change-of-variable (Jacobian) formula for joint densities, or using the mgf of the normal).
Only if: If X ∼ Np(µ, Σ) and rank(Σ) = r ≤ p, consider the spectral decomposition Σ = H∆H', with H orthogonal and ∆ = [∆1, 0; 0, 0], ∆1 = diag(δ1, ..., δr), δi > 0. Now X − µ ∼ N(0, Σ), so H'(X − µ) ∼ N(0, ∆). Write H'(X − µ) = (Y', T')', with Y of length r and T of length p − r. Then
(Y', T')' ∼ N(0, [∆1, 0; 0, 0]).
Therefore T = 0 w.p. 1. Let Z = ∆1^{-1/2} Y. Then Z ∼ Nr(0, Ir). Therefore, w.p. 1, H'(X − µ) equals ∆1^{1/2}Z stacked above 0, and hence, w.p. 1,
X − µ = H [∆1^{1/2}Z; 0] = (H1 | H2)[∆1^{1/2}Z; 0] = H1 ∆1^{1/2} Z = CZ.
Also, CC' = H1 ∆1^{1/2} ∆1^{1/2} H1' = H1 ∆1 H1', and
Σ = H∆H' = (H1 | H2)[∆1, 0; 0, 0](H1 | H2)' = H1 ∆1 H1'.
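The 'only if' construction is also how one can simulate from Np(µ, Σ) even when Σ is singular; a hedged R sketch (the function name rmvn_spectral and the rank tolerance are my own choices).

# Simulate X ~ N_p(mu, Sigma) via the spectral decomposition Sigma = H Delta H'.
rmvn_spectral <- function(n, mu, Sigma) {
  ed <- eigen(Sigma, symmetric = TRUE)
  r  <- sum(ed$values > 1e-10)                   # rank of Sigma
  C  <- ed$vectors[, 1:r, drop = FALSE] %*% diag(sqrt(ed$values[1:r]), r)  # C = H1 Delta1^{1/2}
  Z  <- matrix(rnorm(n * r), r, n)               # Z ~ N_r(0, I_r), one column per draw
  t(mu + C %*% Z)                                # each row is one draw of X = C Z + mu
}
Sig <- matrix(1, 2, 2)                           # a singular Sigma of rank 1
X   <- rmvn_spectral(1000, mu = c(0, 0), Sigma = Sig)
cov(X)                                           # approximately Sig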

Recall that if Z1 ∼ N(0, 1), its mgf is φ_{Z1}(t) = E(exp(tZ1)) = exp(t^2/2). Therefore, if Z ∼ Nr(0, Ir), then
φ_Z(u) = E(exp(u'Z)) = E(exp(Σ_{j=1}^r uj Zj)) = exp(Σ_{j=1}^r uj^2/2) = exp((1/2)u'u).
Then, if X ∼ Np(µ, Σ), its mgf is
φ_X(t) = exp(t'µ + (1/2)t'Σt),
since E(exp(t'X)) = E(exp(t'(CZ + µ))) = exp(t'µ)E(exp(t'CZ)) = exp(t'µ)exp(t'CC't/2) = exp(t'µ + t'Σt/2).

Marginal and Conditional Distributions
Theorem. If X ∼ Np(µ, Σ), then the marginal distribution of any subset of k components of X is k-variate normal.
Proof. Partition as follows:
X = (X^(1)', X^(2)')', µ = (µ^(1)', µ^(2)')', Σ = [Σ11, Σ12; Σ12', Σ22],
where X^(1) and µ^(1) have length k and X^(2) and µ^(2) have length p − k. Note that X^(1) = (Ik | 0) X ∼ N(µ^(1), Σ11). Since marginals (without independence) do not determine the joint distribution, the converse is not true.
Example. Let Z ∼ N(0, 1) be independent of U, which takes the values 1 and −1 with equal probability. Then Y = UZ ∼ N(0, 1), since
P(Y ≤ y) = P(UZ ≤ y) = (1/2)P(Z ≤ y | U = 1) + (1/2)P(−Z ≤ y | U = −1) = (1/2)Φ(y) + (1/2)Φ(y) = Φ(y).
Therefore (Z, Y) has a joint distribution under which the marginals are normal. However, it is not bivariate normal. Consider Z + Y = Z + UZ, which equals 2Z with probability 1/2 and 0 with probability 1/2. Since P(Z + Y = 0) = 1/2 (i.e., there is a point mass at 0), while Z + Y = 2Z ∼ N(0, 4) with probability 1/2, Z + Y cannot be normally distributed.
Result. Let X_{p×1} = (X^(1)', X^(2)')' ∼ Np((µ^(1)', µ^(2)')', [Σ11, Σ12; Σ12', Σ22]), with X^(1) of length k and X^(2) of length p − k. Then X^(1) and X^(2) are independent iff Σ12 = 0.
Proof. Only if: Independence implies that Cov(X^(1), X^(2)) = Σ12 = 0.
If part: Suppose that Σ12 = 0. Then note that
M_{(X^(1),X^(2))}(s1, s2) = E(exp(s1'X^(1) + s2'X^(2))) = exp( s1'µ^(1) + s2'µ^(2) + (1/2)s1'Σ11 s1 + (1/2)s2'Σ22 s2 + s1'Σ12 s2 )
= exp( s1'µ^(1) + (1/2)s1'Σ11 s1 ) exp( s2'µ^(2) + (1/2)s2'Σ22 s2 ) = M_{X^(1)}(s1) M_{X^(2)}(s2)
for all s1 and s2 iff Σ12 = 0.


Result. Suppose X ∼ Np (µ, Σ) and let U = AX, V = BX. Then U and V
are independent iff Cov(U, V ) = AΣB ′ = 0.
   
Proof. Same as above, since (U', V')' = [A; B] X ∼ N(·, ·), where [A; B] denotes A stacked above B.

Theorem. If X ∼ Np(µ, Σ) and Σ is p.d., then
f_X(x) = (2π)^{-p/2} |Σ|^{-1/2} exp( −(1/2)(x − µ)'Σ^{-1}(x − µ) ),  x ∈ R^p.

Proof. Let Σ = CC' where C = Σ^{1/2} is nonsingular. Then X = CZ + µ with Z ∼ N(0, Ip). Since the Zi are i.i.d. N(0, 1),
f_Z(z) = (2π)^{-p/2} exp( −(1/2) Σ_{i=1}^p zi^2 ) = (2π)^{-p/2} exp( −(1/2) z'z ).
Since X = CZ + µ, Z = C^{-1}(X − µ), and the Jacobian of the transformation gives dz = |C|^{-1} dx = |Σ|^{-1/2} dx. Therefore,
f_X(x) = (2π)^{-p/2} |Σ|^{-1/2} exp( −(1/2)(x − µ)'(C')^{-1}C^{-1}(x − µ) ) = (2π)^{-p/2} |Σ|^{-1/2} exp( −(1/2)(x − µ)'Σ^{-1}(x − µ) ).

Note. f_X(x) is constant on the ellipsoid {x : (x − µ)'Σ^{-1}(x − µ) = r^2}.

Ex. Check for p = 2 to see if the above results agree with those of the
bivariate normal.
Theorem. Let X ∼ Np(µ, Σ), Σ > 0 (i.e., p.d.), and partition
X = (X1', X2')', µ = (µ1', µ2')', Σ = [Σ11, Σ12; Σ21, Σ22],
where X1 and µ1 are of length k. Also, let Σ11.2 = Σ11 − Σ12 Σ22^{-1} Σ21. Then Σ11.2 > 0 and
(i) X1 − Σ12 Σ22^{-1} X2 ∼ Nk(µ1 − Σ12 Σ22^{-1} µ2, Σ11.2) and is independent of X2;
(ii) the conditional distribution of X1 given X2 is Nk(µ1 + Σ12 Σ22^{-1}(X2 − µ2), Σ11.2).
Proof. (i) Let C = [Ik, −Σ12 Σ22^{-1}; 0, I_{p-k}]. Then
CX = ((X1 − Σ12 Σ22^{-1} X2)', X2')' ∼ Np(((µ1 − Σ12 Σ22^{-1} µ2)', µ2')', CΣC'),
and
CΣC' = [Ik, −Σ12 Σ22^{-1}; 0, I_{p-k}][Σ11, Σ12; Σ21, Σ22][Ik, 0; −Σ22^{-1} Σ21, I_{p-k}] = [Σ11 − Σ12 Σ22^{-1} Σ21, 0; Σ21, Σ22][Ik, 0; −Σ22^{-1} Σ21, I_{p-k}] = [Σ11.2, 0; 0, Σ22].
Now, independence of X1 − Σ12 Σ22^{-1} X2 and X2 follows from the fact that Cov(X1 − Σ12 Σ22^{-1} X2, X2) = 0.
(ii) Note that X1 = (X1 − Σ12 Σ22^{-1} X2) + Σ12 Σ22^{-1} X2. Therefore, from the independence of these two parts, X1 | (X2 = x2) = Σ12 Σ22^{-1} x2 + (X1 − Σ12 Σ22^{-1} X2) ∼ N(µ1 + Σ12 Σ22^{-1}(x2 − µ2), Σ11.2).

Remark. Under multivariate normality, the best regression is linear. If


we want to predict X1 based on X2 , the best predictor is E(X1 |X2 ), which
is equal to µ1 − Σ12 Σ−1 −1
22 µ2 + Σ12 Σ22 x2 . The prediction error, however, is
independent of X2 .
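A small R helper illustrating part (ii), i.e., computing the conditional mean and covariance of X1 given X2 = x2; the function and the bivariate example are my own additions, not part of the notes.

# Conditional distribution of X1 | X2 = x2 for X ~ N_p(mu, Sigma), X1 = first k components.
cond_normal <- function(mu, Sigma, k, x2) {
  i1 <- 1:k; i2 <- (k + 1):length(mu)
  S12_22inv <- Sigma[i1, i2, drop = FALSE] %*% solve(Sigma[i2, i2])
  list(mean = mu[i1] + S12_22inv %*% (x2 - mu[i2]),         # mu1 + Sigma12 Sigma22^{-1} (x2 - mu2)
       var  = Sigma[i1, i1] - S12_22inv %*% Sigma[i2, i1])  # Sigma_{11.2}
}
rho <- 0.6
cond_normal(mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2, 2), k = 1, x2 = 1.5)
# conditional mean rho * 1.5, conditional variance 1 - rho^2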

Quadratic Forms.
Recall that, Y ′ AY is called a quadratic form of Y when Y is a random vector.
Result. If X ∼ Np (µ, Σ), Σ > 0, then (X − µ)′ Σ−1 (X − µ) ∼ χ2p .
Proof. Z = Σ^{-1/2}(X − µ) ∼ Np(0, Ip), i.e., Z1, Z2, ..., Zp are i.i.d. N(0, 1). Therefore Z'Z = Σ_{i=1}^p Zi^2 ∼ χ^2_p. Note that (X − µ)'Σ^{-1}(X − µ) = Z'Z.
Result. If X1, X2, ..., Xn is a random sample from N(µ, σ^2), then X̄ and S^2 = Σ_{i=1}^n (Xi − X̄)^2 are independent, X̄ ∼ N(µ, σ^2/n) and S^2/σ^2 ∼ χ^2_{n-1}.
Proof. First note that X = (X1, X2, ..., Xn)' ∼ Nn(µ1, σ^2 In). Now consider an orthogonal matrix A_{n×n} = ((aij)) with first row a1' = (1/√n, 1/√n, ..., 1/√n) = (1/√n)1'. (Simply consider a basis for R^n with a1 as the first vector and orthogonalize the rest.) Now let Y = AX, i.e., Yi = ai'X, i = 1, 2, ..., n. Since X ∼ Nn(µ1, σ^2 In), we have Y ∼ Nn(µA1, σ^2 AA') = Nn(µA1, σ^2 In). Therefore the Yi are independent normal with variance σ^2. Further, E(Yi) = E(ai'X) = µ ai'1. Thus E(Y1) = µ a1'1 = µ(1/√n)1'1 = √n µ. For i > 1, E(Yi) = µ ai'1 = µ√n ai'a1 = 0, i.e., Y2, ..., Yn are i.i.d. N(0, σ^2). Therefore Σ_{i=2}^n Yi^2 / σ^2 ∼ χ^2_{n-1}. Further, Y1 = a1'X = (1/√n)Σ_{i=1}^n Xi = √n X̄ ∼ N(√n µ, σ^2) and is independent of (Y2, ..., Yn). Also, S^2 = Σ_{i=1}^n (Xi − X̄)^2 = Σ_{i=1}^n Xi^2 − nX̄^2 = X'X − Y1^2 = Y'Y − Y1^2 = Σ_{i=2}^n Yi^2, so S^2/σ^2 ∼ χ^2_{n-1} and S^2 is independent of Y1, and therefore of X̄.
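A short simulation in R consistent with this result; the sample size and parameters are arbitrary.

# Xbar and S^2 from a normal sample: empirically uncorrelated, with S^2/sigma^2 ~ chi^2_{n-1}.
set.seed(2)
n <- 10; mu <- 5; sigma <- 2
sims <- replicate(20000, { x <- rnorm(n, mu, sigma); c(mean(x), sum((x - mean(x))^2)) })
cor(sims[1, ], sims[2, ])              # close to 0
mean(sims[2, ] / sigma^2); n - 1       # mean of chi^2_{n-1} is n - 1
var(sims[2, ] / sigma^2); 2 * (n - 1)  # variance of chi^2_{n-1} is 2(n - 1)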


If X ∼ Np(0, I), then X'X = Σ_{i=1}^p Xi^2 ∼ χ^2_p, i.e., X'IX ∼ χ^2_p. Also, note that X'((1/p)11')X = pX̄^2 ∼ χ^2_1 and X'(I − (1/p)11')X ∼ χ^2_{p-1}.

What is the distribution of X'AX for an arbitrary A which is p.s.d.? Without loss of generality we can assume that A is symmetric, since
X'AX = X'((1/2)(A + A'))X = X'BX, where B = (1/2)(A + A') is always symmetric.
Since A is symmetric p.s.d., A = ΓDλΓ', so X'AX = X'ΓDλΓ'X = Y'DλY, where Y = Γ'X ∼ Np(0, Γ'Γ = I). Therefore X'AX = Σ_{i=1}^p di Yi^2, where the di are the eigenvalues of A and the Yi are i.i.d. N(0, 1). Therefore X'AX has a χ^2 distribution if each di is 1 or 0. Equivalently, X'AX ∼ χ^2 if A^2 = A, i.e., A is symmetric idempotent, i.e., A is an orthogonal projection matrix. The equivalence may be seen as follows. If d1 ≥ d2 ≥ ... ≥ dp ≥ 0 are such that d1 = d2 = ... = dr = 1 and d_{r+1} = ... = dp = 0, then
A = Γ [Ir, 0; 0, 0] Γ',  and  A^2 = Γ [Ir, 0; 0, 0] Γ'Γ [Ir, 0; 0, 0] Γ' = A.
If A^2 = A, then ΓDλΓ'ΓDλΓ' = ΓDλ^2Γ' = ΓDλΓ' implies that Dλ^2 = Dλ, i.e., di^2 = di, i.e., di = 0 or 1.
We will show the converse now. Suppose X'AX ∼ χ^2_r and A is symmetric p.s.d. Then the mgf of X'AX is
M_{X'AX}(t) = ∫_0^∞ exp(tu) exp(−u/2) u^{r/2 − 1} / (2^{r/2} Γ(r/2)) du = ∫_0^∞ exp(−(u/2)(1 − 2t)) u^{r/2 − 1} / (2^{r/2} Γ(r/2)) du = (1 − 2t)^{-r/2},  for 1 − 2t > 0.

But in distribution X'AX = Σ_{i=1}^p di Yi^2, with the Yi i.i.d. N(0, 1), so
M_{X'AX}(t) = E[exp(t Σ_{i=1}^p di Yi^2)] = E[Π_{i=1}^p exp(t di Yi^2)] = Π_{i=1}^p E[exp(t di Yi^2)] = Π_{i=1}^p (1 − 2t di)^{-1/2},  for 1 − 2t di > 0.
Now note that X'AX ∼ χ^2_r implies X'AX > 0 wp 1, i.e., Σ_{i=1}^p di Yi^2 > 0 wp 1, which in turn implies that di ≥ 0 for all i. (This is because, if dl < 0, then since Yl^2 ∼ χ^2_1 independently of the Yi, i ≠ l, we would have Σ_{i=1}^p di Yi^2 < 0 with positive probability.) Therefore, for t < min_i 1/(2di), equating the two mgf's, we have (1 − 2t)^{-r/2} = Π_{i=1}^p (1 − 2t di)^{-1/2}, or (1 − 2t)^{r/2} = Π_{i=1}^p (1 − 2t di)^{1/2}, or (1 − 2t)^r = Π_{i=1}^p (1 − 2t di). Equality of two polynomials means that their roots must be the same. Check that r of the di's must be 1 and the rest 0. Thus the following result follows.
Result. X ′ AX ∼ χ2r iff A is a symmetric idempotent matrix or an orthogonal
projection matrix of rank r.

Result. Suppose Y ∼ Np (0, Ip ) and let Y ′ Y = Y ′ AY +Y ′ BY . If Y ′ AY ∼ χ2r ,
then Y ′ BY ∼ χ2p−r independent of Y ′ AY .
Proof. Note that Y ′ Y ∼ χ2p . Since Y ′ AY ∼ χ2r , A is symmetric idempotent
of rank r. Therefore, B = I − A is symmetric and B 2 = (I − A)2 =
I − 2A + A2 = I − A = B, so that B is idempotent also. Further, Rank(B)
= trace(B) = trace(I − A) = p − r. Therefore, Y ′ BY ∼ χ2p−r . Independence
is shown later.
Result. Let Y ∼ Np (0, Ip ) and let Q1 = Y ′ P1 Y , Q2 = Y ′ P2 Y , Q1 ∼ χ2r , and
Q2 ∼ χ2s . Then Q1 and Q2 are independent iff P1 P2 = 0.
Corollary. In the result before the above one, A(I − A) = 0, so Y ′ AY and
Y ′ (I − A)Y are independent.
Proof. P1 and P2 are symmetric idempotent. If P1P2 = 0 then Cov(P1Y, P2Y) = 0, so that Q1 = (P1Y)'(P1Y) = Y'P1^2Y = Y'P1Y is independent of Q2 = (P2Y)'(P2Y) = Y'P2Y. Conversely, if Q1 and Q2 are independent χ^2_r and χ^2_s, then Q1 + Q2 ∼ χ^2_{r+s}. Since Q1 + Q2 = Y'(P1 + P2)Y, P1 + P2 is symmetric idempotent. Hence P1 + P2 = (P1 + P2)^2 = P1^2 + P2^2 + P1P2 + P2P1, implying P1P2 + P2P1 = 0. Multiplying by P1 on the left, we get P1^2P2 + P1P2P1 = P1P2 + P1P2P1 = 0 (∗). Similarly, multiplying by P1 on the right yields P1P2P1 + P2P1 = 0. Subtracting, we get P1P2 − P2P1 = 0. Combining this with P1P2 + P2P1 = 0, we get P1P2 = 0.
Result. Let Q1 = Y'P1Y, Q2 = Y'P2Y, Y ∼ Np(0, Ip). If Q1 ∼ χ^2_r, Q2 ∼ χ^2_s and Q1 − Q2 ≥ 0, then Q1 − Q2 and Q2 are independent, r ≥ s and Q1 − Q2 ∼ χ^2_{r-s}.
Proof. P1 and P2 are symmetric idempotent (P1^2 = P1, P2^2 = P2). Q1 − Q2 ≥ 0 means that Y'(P1 − P2)Y ≥ 0, hence P1 − P2 is p.s.d. Therefore, from the Lemma shown below, P1 − P2 is a projection matrix and also P1P2 = P2P1 = P2. Thus (P1 − P2)P2 = 0. Also, Rank(P1 − P2) = tr(P1 − P2) = tr(P1) − tr(P2) = Rank(P1) − Rank(P2) = r − s. Hence Q1 − Q2 = Y'(P1 − P2)Y ∼ χ^2_{r-s}, and it is independent of Q2 = Y'P2Y ∼ χ^2_s.
Lemma. If P1 and P2 are projection matrices such that P1 − P2 is p.s.d., then (a) P1P2 = P2P1 = P2 and (b) P1 − P2 is also a projection matrix.
Proof. (a) If P1x = 0, then 0 ≤ x'(P1 − P2)x = −x'P2x ≤ 0, implying 0 = x'P2x = x'P2^2x = (P2x)'P2x, so P2x = 0. Therefore, for any y, P2(I − P1)y = 0, since P1(I − P1)y = 0 (take x = (I − P1)y). Thus, for any y, P2P1y = P2y, i.e., P2P1 = P2, and so P2 = P2' = (P2P1)' = P1P2.
(b) (P1 − P2)^2 = P1^2 + P2^2 − P1P2 − P2P1 = P1 + P2 − P2 − P2 = P1 − P2.

Result. Any orthogonal projection matrix (i.e., symmetric idempotent) is
p.s.d.
Proof. If P is a projection matrix, x'Px = x'P^2x = (Px)'Px ≥ 0.
Result. Let C be a symmetric p.s.d. matrix. If X ∼ Np (0, Ip ), then AX
and X ′ CX are independent iff AC = 0.
Proof. (i) If part: Since C is symmetric p.s.d., C = T T ′ . If AC = 0, then
AT T ′ = 0, so AT T ′ A′ = (AT )(AT )′ = 0 and hence AT = 0. Thus AX and
T ′ X are independent, so AX and (T ′ X)(T ′ X)′ = X ′ CX are independent.
(ii) Only if: If AX and X ′ CX are independent, then X ′ A′ AX and X ′ CX
are independent. But the mgf of X ′ BX for any B is E(exp(tX ′ BX)) =
|I − 2tB|−1/2 for an interval of values of t. Therefore, the joint mgf of X ′ CX
and X ′ A′ AX is |I − 2(t1 C + t2 A′ A)|−1/2 , but because of independence this
is given to be equal to

|I − 2t1 C|−1/2 |I − 2t2 A′ A|−1/2 = |I − 2t1 C − 2t2 A′ A + 4t1 t2 CA′ A|−1/2 .

Show that, for this to hold on an open set, we must have CA′ A = 0, implying
CA′ AC ′ = 0, and thus AC ′ = 0. But C ′ = C.
Lemma. If X ∼ Np(µ, Σ), then Cov(AX, X'CX) = 2AΣCµ.
Proof. Note that (X − µ)'C(X − µ) = X'CX − 2(X − µ)'Cµ − µ'Cµ and E(X'CX) = tr(CΣ) + µ'Cµ. Therefore X'CX − E(X'CX) = X'CX − µ'Cµ − tr(CΣ) = (X − µ)'C(X − µ) + 2(X − µ)'Cµ − tr(CΣ). Hence,
Cov(AX, X'CX) = E[(AX − Aµ)(X'CX − E(X'CX))]
= A E{ (X − µ)[(X − µ)'C(X − µ) + 2(X − µ)'Cµ − tr(CΣ)] }
= 2A E{ (X − µ)(X − µ)' } Cµ − tr(CΣ) A E(X − µ) + A E{ (X − µ)(X − µ)'C(X − µ) }
= 2AΣCµ,
since E(X − µ) = 0 and E{ (X − µ)(X − µ)'C(X − µ) } = E{ (X − µ) Σ_i Σ_j C_ij (Xi − µi)(Xj − µj) } = 0. To prove this last equality, it is enough to show that E{ (Xl − µl)(Xi − µi)(Xj − µj) } = 0 for all i, j, l. For this note:
(i) if i = j = l, E(Xi − µi)^3 = 0.
(ii) if i = j ≠ l, E{ (Xi − µi)^2 (Xl − µl) } = 0, since Xl − µl = (σil/σii)(Xi − µi) + ǫ, where ǫ ∼ N(0, ·) is independent of Xi, so this case reduces to (i).
(iii) if i, j and l are all different, the case reduces to (i) and (ii). Alternatively, consider Y = (Y1, Y2, Y3)' ∼ N3(0, Σ). Then Y = Σ^{1/2}(Z1, Z2, Z3)', where the Zi are i.i.d. N(0, 1). Then, to show that E(Y1Y2Y3) = 0, simply note that Y1Y2Y3 is a linear combination of terms of the form Zi^3, Zi^2 Zj and Z1Z2Z3, all of which have expectation 0.
Loynes’ Lemma. If B is symmetric idempotent, Q is symmetric p.s.d. and
I − B − Q is p.s.d., then BQ = QB = 0.
Proof. Let x be any vector and y = Bx. Then y'By = y'B^2x = y'Bx = y'y,
so y ′ (I −B −Q)y = −y ′ Qy ≤ 0. But I −B −Q is p.s.d., so y ′ (I −B −Q)y ≥ 0,
implying −y ′ Qy ≥ 0. Since Q is also p.s.d., we must have y ′ Qy = 0. (Note,
y is not arbitrary, but Bx for some x.) In addition, since Q is symmetric
p.s.d., Q = L′ L for some L, and hence y ′ Qy = y ′ L′ Ly = 0, implying Ly = 0.
Thus L′ Ly = Qy = QBx = 0 for all x. Therefore, QB = 0 and hence
(QB)′ = B ′ Q′ = BQ = 0.

Theorem. Suppose Xi, i = 1, 2, ..., p, are n × n symmetric matrices with rank ki. Let X = Σ_{i=1}^p Xi have rank k. (It is symmetric.) Then, of the conditions
(a) Xi idempotent for all i,
(b) Xi Xj = 0, i ≠ j,
(c) X idempotent,
(d) Σ_{i=1}^p ki = k,
it is true that
I. any two of (a), (b), and (c) imply all of (a), (b), (c) and (d);
II. (c) and (d) imply (a) and (b);
III. (c) and {X1, ..., X_{p-1} idempotent, Xp p.s.d.} imply that Xp is idempotent and hence (a), and therefore (b) and (d).
Proof. I (i): Show (a) and (c) imply (b) and (d). For this, note that, given (c), I − X is idempotent and hence p.s.d. Now, given (a), X − Xi − Xj = Σ_{r≠i,j} Xr is p.s.d., being a sum of p.s.d. matrices. Therefore (I − X) + (X − Xi − Xj) = I − Xi − Xj is p.s.d., hence Xi Xj = 0 from Loynes' Lemma, i.e., (b). Also, given (c), Rank(X) = tr(X) = tr(Σ Xi) = Σ tr(Xi) = Σ ki if (a) is also given, i.e., (d).
(ii): Show (b) and (c) imply (a) and (d). Let λ be an eigenvalue of X1 and u the corresponding eigenvector. Then X1 u = λu. Either λ = 0 or, if λ ≠ 0, u = (1/λ)X1 u. Therefore, for i ≠ 1, Xi u = (1/λ)Xi X1 u = 0 given (b). Therefore, given (b), Xu = X1 u = λu, and so λ is an eigenvalue of X. But given (c), X is idempotent, and hence λ = 0 or 1. Therefore the eigenvalues of X1 are 0 or 1, i.e., X1 is idempotent. The same argument applies to the other Xi's, i.e., (a).
(iii): (a) and (b) together imply (c). (Note that they then imply (d) also, since (a) and (c) give (d).) Given (b) and (a), X^2 = (Σ Xi)^2 = Σ Xi^2 = Σ Xi = X, which is (c).
II. Show (c) and (d) imply (a) and (b). Given (c), I − X is idempotent and hence has rank n − k. Therefore the rank of X − I is also n − k, i.e., X − I has n − k linearly independent rows, i.e., (X − I)x = 0 has n − k linearly independent equations. Further,
X2 x = 0 has k2 linearly independent equations,
...
Xp x = 0 has kp linearly independent equations.
Therefore the maximum number of linearly independent equations in the stacked system
(X − I)x = 0, X2 x = 0, ..., Xp x = 0
is n − k + k2 + ... + kp = n − k1, i.e., the dimension of the solution space is at least n − (n − k1) = k1. However, this solution space is exactly {x : X1 x = x}, because the above equations reduce to that. Thus X1 x = 1·x has at least k1 linearly independent solutions, i.e., 1 is an eigenvalue of X1 with multiplicity at least k1. But since the rank of X1 is k1, the multiplicity must be exactly k1. Also, the other eigenvalues must be 0. Therefore X1 is idempotent. A similar argument works for the other Xi's. So, (a). Now combine it with (c) to get (b).
III. Given (c), X is idempotent, hence p.s.d. Therefore I − X is idempotent and hence p.s.d. If X1, ..., X_{p-1} are idempotent, hence p.s.d., and Xp is also p.s.d., then Σ_{r≠i,j} Xr = X − Xi − Xj is p.s.d., so (I − X) + (X − Xi − Xj) = I − Xi − Xj is p.s.d. Then Xi Xj = 0 from Loynes' Lemma, giving (b). Now (b) and (c) give (a) and (d).
The above theorem in linear algebra translates into a powerful result, the Fisher-Cochran theorem, on the question: when are quadratic forms independent χ^2?
Theorem. Suppose Y ∼ Nn(0, In), Ai, i = 1, ..., p, are symmetric n × n matrices of rank ki, and A = Σ_{i=1}^p Ai is symmetric with rank k. Then
(i) Y'Ai Y ∼ χ^2_{ki}, (ii) the Y'Ai Y are pairwise independent, and (iii) Y'AY ∼ χ^2_k
iff
I. any two of (a) Ai idempotent for all i, (b) Ai Aj = 0, i ≠ j, (c) A idempotent, are true, or
II. (c) is true and (d) k = Σ_i ki, or
III. (c) is true and (e) A1, ..., A_{p-1} are idempotent and Ap is p.s.d.
Proof. Follows from the previous theorem.
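In the simplest case, A1 = (1/n)11' and A2 = I − (1/n)11' satisfy (a)-(d), and the theorem reproduces the X̄, S^2 result above; a brief R check (n is arbitrary).

# Fisher-Cochran with A1 = (1/n) 1 1' (rank 1) and A2 = I - A1 (rank n-1).
n  <- 8
A1 <- matrix(1 / n, n, n)
A2 <- diag(n) - A1
all.equal(A1 %*% A1, A1); all.equal(A2 %*% A2, A2); max(abs(A1 %*% A2))  # idempotent, product 0
Q  <- replicate(20000, { y <- rnorm(n); c(t(y) %*% A1 %*% y, t(y) %*% A2 %*% y) })
mean(Q[1, ]); mean(Q[2, ])             # approximately 1 and n - 1, the chi-square degrees of freedom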

Linear Models – Estimation

Consider yi uncorrelated, with E(yi) = µ, Var(yi) = σ^2, i = 1, 2, ..., n. Estimate µ. In the absence of distributional assumptions, an appealing approach is least squares. What is the estimate and what are its properties? Write the model as
yi = µ + ǫi,  E(ǫi) = 0,  Var(ǫi) = σ^2,  Cov(ǫi, ǫj) = 0, i ≠ j,
and find
min_µ Σ_{i=1}^n (yi − µ)^2.

Note that
Σ_{i=1}^n (yi − µ)^2 = Σ_{i=1}^n (yi − ȳ)^2 + n(ȳ − µ)^2 ≥ Σ_{i=1}^n (yi − ȳ)^2,
with equality iff µ = ȳ. Therefore the LSE of µ is µ̂_LS = ȳ. In the vector-matrix formulation,
Y = (y1, ..., yn)' = µ1 + ǫ, and ||Y − µ1||^2 = (Y − µ1)'(Y − µ1) = Σ_{i=1}^n (yi − µ)^2 = ||ǫ||^2.

Therefore, least squares is equivalent to finding the multiple of 1 which minimizes ||ǫ||. This is achieved when we take the perpendicular, or orthogonal, projection of Y onto the space spanned by 1, i.e., we write
Y = (Y'1/1'1) 1 + (Y − (Y'1/1'1) 1),
so that
µ̂_LS = 1'Y/1'1 = ȳ.
Since Cov(Y) = σ^2 In and E(Y) = µ1,
E(µ̂_LS) = (1/(1'1)) 1'E(Y) = 1'µ1/(1'1) = µ, and
Var(µ̂_LS) = Cov(1'Y/1'1) = (1/(1'1)) 1'Cov(Y) 1 (1/(1'1)) = σ^2 1'In 1/(1'1)^2 = σ^2/n.

Note that µ̂_LS is a linear unbiased estimate of µ. Suppose a'Y is any linear unbiased estimate of µ. Then E(a'Y) = µ a'1 = µ for all µ implies that a'1 = 1. What is the best linear unbiased estimator of µ (i.e., the one with least MSE)? Note that
Var(a'Y) = a'Cov(Y)a = σ^2 a'a.
To minimize this we just need to find a such that a'1 = 1 and a'a is minimum. Simply note that a'a = Σ_{i=1}^n ai^2 and
(1/n) Σ_{i=1}^n ai^2 − (Σ_{i=1}^n ai / n)^2 ≥ 0 for all a, since Σ_{i=1}^n (ai − ā)^2 ≥ 0.
With a'1 = 1 this gives
(1/n) Σ_{i=1}^n ai^2 − (1/n)^2 ≥ 0, or Σ_{i=1}^n ai^2 ≥ 1/n,
with equality iff ai = 1/n for all i. Therefore µ̂_LS is BLUE (Best Linear Unbiased Estimate) irrespective of the distribution of ǫ.
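A one-line numerical check in R that ȳ minimizes Σ(yi − µ)^2; the data vector is an arbitrary toy sample.

# The least squares criterion for the one-mean model is minimized at ybar.
y <- c(3.1, 4.7, 2.9, 5.2, 4.0)                         # assumed toy sample
optimize(function(m) sum((y - m)^2), range(y))$minimum  # numerically close to mean(y)
mean(y)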
Linear models: Estimation
Data: (xi , yi ), i = 1, 2, . . . , n with multiple predictors or covariates of y.

yi = β0 + β1 xi1 + β2 xi2 + ... + β_{p-1} xi(p-1) + ǫi = xi'β + ǫi,  i = 1, ..., n,
is a model for y|x. Let Y_{n×1} = (y1, ..., yn)', β_{p×1} = (β0, β1, ..., β_{p-1})', and let X_{n×p} be the matrix whose ith row is (xi0, xi1, ..., xi(p-1)); here xi0 ≡ 1, but it can be general as well.
β is called the vector of regression coefficients and X is called the regression
matrix or the design matrix (especially if xij = 0 or 1). Quite often y is
called the dependent variable and x the set of independent variables. It is
more standard to call y the response and x, the regressor or predictor. Recall
from previous discussion that
yi = β0 + β1 xi + β2 xi^2 + ǫi is a linear model, but
yi = β0 + β1 xi + xi^{β2} + ǫi is nonlinear; i.e., a linear model means linear in the βj's.
A general Xn×p is fine, X0 = 1 is not essential. Thus we have the linear
model:
Yn×1 = Xn×p βp×1 + ǫ.
Since we have only n observations, it does not make sense to consider p ≥ n, so we take p < n. We do not use boldface for vectors and matrices unless there is ambiguity.
The first task is to estimate β. The most common approach is to use least squares (again, in the absence of distributional assumptions on ǫ). We want
min_{β∈R^p} Σ_{i=1}^n (yi − xi'β)^2 = min_{β∈R^p} ||ǫ||^2 = min_{β∈R^p} ||Y − Xβ||^2 = min_{θ∈MC(X)} ||Y − θ||^2,

where MC (X) = {a : a = Xb for some b ∈ Rp }. Note that Xb = b1 X1 +


b2 X2 + . . . + bp Xp where Xi are the column vectors of X. Now, to minimize
||Y −θ||2 when θ ∈ MC (X), we should take θ̂ to be the orthogonal projection
of Y onto MC (X). i.e., Y − θ̂ should be orthogonal to MC (X). i.e.,

X ′ (Y − θ̂) = 0, or X ′ θ̂ = X ′ Y.

θ̂ is uniquely determined, being the unique orthogonal projection of Y onto


MC (X). We consider the two cases, Rank(X) = p and Rank(X) < p,
separately.
[Figure: the observation vector y, the plane {θ = Xb : b ∈ R^p} spanned by the columns of X, and the fitted vector ŷ = θ̂, the orthogonal projection of y onto that plane.]
Full rank case. Rank(X) = p. Since the columns of X are linearly inde-
pendent, there exists a unique vector β̂ such that θ̂ = X β̂. (If the columns
of X are not linearly independent β̂ is not unique.) Therefore,
X ′ X β̂ = X ′ Y.
Since X has full column rank, X ′ X is nonsingular. Therefore,
β̂LS = (X ′ X)−1 X ′ Y
is unique. One could also use calculus for this derivation:
||Y − Xβ||2 = (Y − Xβ)′ (Y − Xβ) = Y ′ Y − 2β ′ X ′ Y + β ′ X ′ Xβ,
so differentiating it w.r.t. β:
−2X ′ Y + 2X ′ Xβ = 0, or X ′ X β̂ = X ′ Y.
Note that
θ̂ = X β̂ = X(X ′ X)−1 X ′ Y = P Y = Ŷ ,
where P is the projection matrix onto MC (X).
ǫ̂ = Y − Ŷ = Y − X β̂ = (I − P )Y = residuals.
ǫ̂'ǫ̂ = (Y − Xβ̂)'(Y − Xβ̂) = Y'Y − β̂'X'Y + β̂'(X'Xβ̂ − X'Y) = Y'Y − β̂'X'Y = Y'Y − β̂'X'Xβ̂ = Y'(I − P)Y
= residual sum of squares (RSS) = Σ_{i=1}^n (yi − xi'β̂)^2.
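In R the full-rank computations above can be carried out directly from the normal equations, or with lm(); a minimal sketch on simulated data (the design and coefficients are made up).

# Least squares in the full-rank case: betahat, hat matrix, fitted values, RSS.
set.seed(4)
n <- 30; X <- cbind(1, rnorm(n), runif(n))
Y <- drop(X %*% c(1, 2, -1) + rnorm(n))
betahat <- solve(t(X) %*% X, t(X) %*% Y)     # solves X'X beta = X'Y
P    <- X %*% solve(t(X) %*% X) %*% t(X)     # projection (hat) matrix onto M_C(X)
Yhat <- P %*% Y                              # fitted values, equal to X %*% betahat
RSS  <- sum((Y - Yhat)^2)                    # equals Y'(I - P)Y
cbind(betahat, coef(lm(Y ~ X - 1)))          # the two computations agree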

Example. Find the least squares estimates of θ1 and θ2 in the following:
y1 = θ1 + θ2 + ǫ1
y2 = θ1 − θ2 + ǫ2
y3 = θ1 + 2θ2 + ǫ3
Obtain X and β by writing this in the vector-matrix formulation Y = Xβ + ǫ, with Y = (y1, y2, y3)', β = (θ1, θ2)' and X the 3 × 2 matrix with rows (1, 1), (1, −1), (1, 2). Then, noting that
X'X = [3, 2; 2, 6],   (X'X)^{-1} = (1/14)[6, −2; −2, 3],
we obtain
β̂ = (θ̂1, θ̂2)' = (X'X)^{-1}X'Y = (1/14)[6, −2; −2, 3](y1 + y2 + y3, y1 − y2 + 2y3)'
= (1/14)(6(y1 + y2 + y3) − 2(y1 − y2 + 2y3), −2(y1 + y2 + y3) + 3(y1 − y2 + 2y3))'
= (1/14)(4y1 + 8y2 + 2y3, y1 − 5y2 + 4y3)'
= ((2/7)y1 + (4/7)y2 + (1/7)y3, (1/14)y1 − (5/14)y2 + (2/7)y3)',
and
ǫ̂'ǫ̂ = Y'Y − β̂'X'Y = (y1^2 + y2^2 + y3^2) − (1/14)(4y1 + 8y2 + 2y3)(y1 + y2 + y3) − (1/14)(y1 − 5y2 + 4y3)(y1 − y2 + 2y3).
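The small example above is easy to verify numerically in R; the test vector y is arbitrary.

# Check the closed-form solution for the 3-observation example.
X <- matrix(c(1, 1, 1, 1, -1, 2), nrow = 3)    # columns (1,1,1) and (1,-1,2)
y <- c(2.0, 1.0, 3.5)                          # assumed test data
solve(t(X) %*% X) %*% t(X) %*% y               # least squares estimates
c(4*y[1] + 8*y[2] + 2*y[3], y[1] - 5*y[2] + 4*y[3]) / 14   # the closed-form answer above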

Theorem. P = X(X'X)^{-1}X' is symmetric idempotent, being the projection matrix onto MC(X). Rank(P) = Rank(X) = p. I − P is the orthogonal projection matrix onto MC(X)^⊥, Rank(I − P) = n − p, and (I − P)X = 0.
The case of Rank(X) = r < p will be discussed later.
An alternative derivation of β̂:

(Y − Xβ)′ (Y − Xβ) = (Y − X β̂ + X β̂ − Xβ)′ (Y − X β̂ + X β̂ − Xβ)


= (Y − X β̂)′ (Y − X β̂) + (X β̂ − Xβ)′ (X β̂ − Xβ)
+2(X β̂ − Xβ)′ (Y − X β̂)
= (Y − X β̂)′ (Y − X β̂) + (β̂ − β)′ X ′ X(β̂ − β),

since
(X β̂ − Xβ)′ (Y − X β̂) = (β̂ − β)′ (X ′ Y − X ′ X β̂) = 0.
Therefore,
(Y − Xβ)′ (Y − Xβ) ≥ (Y − X β̂)′ (Y − X β̂)
with equality iff β̂ − β = 0 since X ′ X is p.d.

Properties of least squares estimates
If Y = Xβ + ǫ, with E(ǫ) = 0 and Cov(ǫ) = σ^2 In, then E(β̂) = β since

E(β̂) = E((X ′ X)−1 X ′ Y ) = (X ′ X)−1 X ′ E(Y )


= (X ′ X)−1 X ′ Xβ = β, and
Cov(β̂) = Cov((X ′ X)−1 X ′ Y ) = (X ′ X)−1 X ′ Cov(Y )X(X ′ X)−1
= σ 2 (X ′ X)−1 X ′ X(X ′ X)−1 = σ 2 (X ′ X)−1 .

Theorem (Gauss-Markov). Consider the Gauss-Markov model, Y =


Xβ + ǫ, with E(ǫ) = 0 and Cov(ǫ) = σ 2 In . Let θ̂ be the least squares
estimate of θ = Xβ. Fix c ∈ Rp and consider estimating c′ θ. Then, in the
class of all linear unbiased estimates of c′ θ, c′ θ̂ is the unique estimate with
minimum variance. (Thus c′ θ̂ is BLUE of c′ θ.)
Proof. θ̂ = X β̂ = P Y , where P is the projection matrix onto MC (X). In
particular, P X = X. Therefore,

E(c′ θ̂) = c′ E(P Y ) = c′ P E(Y ) = c′ P Xβ = c′ Xβ = c′ θ,

so that c′ θ̂ = P Y is a linear unbiased estimate of c′ θ. Let d′ Y be any other


linear unbiased estimate of c′ θ. Then c′ θ = E(d′ Y ) = d′ θ, or (c − d)′ θ = 0 for
all θ ∈ MC (X). i.e., (c−d) is orthogonal to MC (X). Therefore P (c−d) = 0,
and so P c = P d. Now,

V ar(d′ Y ) − V ar(c′ θ̂) = V ar(d′ Y ) − V ar(c′ P Y )


= V ar(d′ Y ) − V ar(d′ P Y )
= σ 2 (d′ d − d′ P 2 d) = σ 2 (d′ d − d′ P d)
= σ 2 d′ (I − P )d = σ 2 d′ (I − P )(I − P )d
≥ 0

with equality iff (I − P )d = 0 or d = P d = P c. i.e., d′ Y = c′ P Y = c′ θ̂.


Remark. Since we have assumed that X has full column rank,
P = X(X ′ X)−1 X ′ and so, if θ = Xβ, then X ′ θ = X ′ Xβ or β = (X ′ X)−1 X ′ θ.
Therefore, for every a ∈ Rp , a′ β = a′ (X ′ X)−1 X ′ θ = c′ θ, where
c = X(X ′ X)−1 a. i.e., every linear function of β is a linear function of θ.
Therefore, for every a ∈ Rp , we have that a′ β̂ = a′ (X ′ X)−1 X ′ θ̂ = c′ θ̂ is
BLUE of a′ β. Thus, when X has full column rank, all linear functions of β
have BLUE, all components of β are estimable (BLUE exists). This will not
be the case when X has less than full column rank.

Result. In the model, Y = Xβ + ǫ, E(ǫ) = 0, Cov(ǫ) = σ 2 In and X has full
column rank (p), we have that

E(RSS) = E((Y − X β̂)′ (Y − X β̂)) = (n − p)σ 2 ,

so that RSS/(n − p) is an unbiased estimate of σ 2 .


Proof. Note that Y − X β̂ = Y − P Y = (I − P )Y . Therefore,

RSS = (Y − X β̂)′ (Y − X β̂) = Y ′ (I − P )2 Y = Y ′ (I − P )Y,

where I − P is symmetric idempotent with rank n − p.

E(RSS) = E(Y ′ (I − P )Y ) = tr(σ 2 (I − P )) + (Xβ)′ (I − P )(Xβ)


= σ 2 (n − p) + β ′ X ′ (I − P )Xβ
= (n − p)σ 2 .

For confidence statements and testing we need distribution theory.


Distribution Theory
Suppose ǫi are i.i.d. N (0, σ 2 ). Then ǫn×1 ∼ Nn (0, σ 2 In ) and so,
Y ∼ Nn (Xβ, σ 2 In ).
Theorem. If Y ∼ Nn (Xβ, σ 2 In ) and X has rank p, then
(i) β̂ ∼ Np (β, σ 2 (X ′ X)−1 ),
(ii) (β̂ − β)′ X ′ X(β̂ − β)/σ 2 ∼ χ2p ,
(iii) β̂ is independent of RSS = (Y − X β̂)′ (Y − X β̂),
(iv) RSS/σ 2 ∼ χ2n−p .
Proof. Y ∼ Nn (Xβ, σ 2 In ), so (i)

β̂ = (X ′ X)−1 X ′ Y ∼ Np ((X ′ X)−1 X ′ Xβ, σ 2 (X ′ X)−1 X ′ X(X ′ X)−1 )


= N (β, σ 2 (X ′ X)−1 ).

(ii) Since β̂ ∼ Np (β, σ 2 (X ′ X)−1 ), note (X ′ X)1/2 (β̂ − β) ∼ Np (0, σ 2 Ip ), and


hence
(β̂ − β)′ X ′ X(β̂ − β)/σ 2 ∼ χ2p .
(iii) β̂ = (X ′ X)−1 X ′ Y = AY and RSS = Y ′ (I−P )Y . Since Y ∼ Nn (Xβ, σ 2 In ),
independence of β̂ and Y ′ (I − P )Y holds iff A(I − P ) = 0. But (I − P )A′ =
(I−P )X(X ′ X)−1 = 0. Alternatively, β̂ = (X ′ X)−1 X ′ Y = (X ′ X)−1 X ′ P ′ Y =
(X ′ X)−1 X ′ (P Y ), so that it is independent of (I − P )Y .
(iv) (a) RSS = Y ′ (I − P )Y = (Y − Xβ)′ (I − P )(Y − Xβ) since (I − P )X = 0.
Note that since Y −Xβ ∼ Nn (0, σ 2 In ), and I −P is idempotent of rank n−p,

2
(Y − Xβ)′ (I − P )(Y − Xβ) ∼ χ2n−p .
(b) Alternatively, note that Q = (Y − Xβ)′ (Y − Xβ) ∼ σ 2 χ2n . Now

Q = (Y − Xβ)′ (Y − Xβ)
= (Y − X β̂ + X β̂ − Xβ)′ (Y − X β̂ + X β̂ − Xβ)
= (Y − X β̂)′ (Y − X β̂) + (β̂ − β)′ X ′ X(β̂ − β)
= Q1 + Q2 ,

where Q2 ∼ σ 2 χ2p and Q1 ≥ 0. Therefore, from a previous result, Q1 ∼


σ 2 χ2n−p independent of Q2 .
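These distributional facts are easy to confirm by simulation; an R sketch with an assumed small design.

# Check RSS/sigma^2 ~ chi^2_{n-p} for a simulated Gauss-Markov model with normal errors.
set.seed(3)
n <- 20; X <- cbind(1, rnorm(n)); p <- ncol(X)
beta <- c(1, 2); sigma <- 1.5
P <- X %*% solve(t(X) %*% X) %*% t(X)
rss <- replicate(10000, { Y <- X %*% beta + rnorm(n, 0, sigma)
                          sum(((diag(n) - P) %*% Y)^2) })
mean(rss / sigma^2); n - p                 # mean of chi^2_{n-p} is n - p
var(rss / sigma^2); 2 * (n - p)            # variance of chi^2_{n-p} is 2(n - p)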

Design matrix X with less than full column rank
Consider the model,

yij = µ + αi + τj + ǫij , i = 1, 2, . . . , I; j = 1, 2, . . . , J,

for the response from the ith treatment in the jth block, say. This can be
put in the usual linear model form Y = Xβ + ǫ as follows: with
Y = (y11, y12, ..., y1J, y21, y22, ..., y2J, ..., yI1, yI2, ..., yIJ)', β = (µ, α1, α2, ..., αI, τ1, τ2, ..., τJ)', ǫ = (ǫ11, ǫ12, ..., ǫIJ)',
the row of X corresponding to yij is
(1, 0, ..., 0, 1, 0, ..., 0, 0, ..., 0, 1, 0, ..., 0),
i.e., a leading 1 (for µ), a 1 in the position of αi among the next I columns, and a 1 in the position of τj among the last J columns. For example, the rows for y11 and y12 are (1, 1, 0, ..., 0, 1, 0, ..., 0) and (1, 1, 0, ..., 0, 0, 1, 0, ..., 0).

Here, X does not have full column rank. For instance, the first column is
proportional to the sum of the rest. Thus X ′ X is singular, so the previous
discussion does not apply. β itself is not estimable, but what parametric
functions of β are estimable?
Result. For any matrix A, the row space of A satisfies MC (A′ ) = MC (A′ A).
Proof. Ax = 0 implies A′ Ax = 0. Also, A′ Ax = 0 implies x′ A′ Ax = 0,
so Ax = 0. Therefore the null space of A and A′ A are the same. Consider
the orthogonal space and note Rank(A′ A) = Rank(A) = Rank(A′ ). Further,
since A′ Aa = A′ b where b = Aa, MC (A′ A) ⊂ MC (A′ ). Since the ranks (or
dimensions) are the same, the spaces must be the same.
Theorem. Let Y = θ + ǫ where θ = Xβ and Xn×p has rank r < p. Then
(i) minθ∈MC (X) ||Y − θ||2 is achieved (i.e., least squares is attained) when
θ̂ = X β̂ where β̂ is any solution of X ′ Xβ = X ′ Y ;
(ii) Y ′ Y − β̂ ′ X ′ Y is unique for all nonzero Y .
Proof. (i) X ′ Xβ = X ′ Y always has some solution (for β) since MC (X ′ X) =
MC (X ′ ). However, the solution is not unique since Rank(X ′ X) = r < p.

Let β̂ be any solution, and let θ̂ = X β̂. Then X ′ (Y − θ̂) = 0. However, given
Y ∈ Rn , the decomposition, Y = θ̂ ⊕ (Y − θ̂) where Y − θ̂ is orthogonal
to MC (X) is unique, and for such a θ̂, ||Y − θ||2 is minimized. We know
from previous discussion that minθ∈MC (X) ||Y − θ||2 is achieved with θ̂ = P Y
which is unique.
(ii) Note that

Y ′ Y − β̂ ′ X ′ Y = Y ′ Y − θ̂′ Y = (Y − θ̂)′ (Y − θ̂),

since θ̂′ (Y − θ̂) = 0. Also, (Y − θ̂)′ (Y − θ̂) = ||Y − θ̂||2 is the unique minimum.
Question. Earlier we could find β̂ directly. How do we find θ̂ now?
Projection matrices
From the theory of orthogonal projections, given Xn×p (i.e., p many n-
vectors), there exists Pn×n satisfying
(i) P x = x for all x ∈ MC (X), and
(ii) if ξ ∈ M⊥C (X), then P ξ = 0.
What are the properties of such a P ?
1. P is unique: Suppose P1 and P2 satisfy (i) and (ii). Let w ∈ Rn . Then
w = Xa + b, b ∈ M⊥ C (X). Then,

(P1 − P2 )w = (P1 − P2 )Xa + (P1 − P2 )b = (Xa − Xa) + (P1 b − P2 b) = 0.

Since this is true for all w ∈ Rn , we must have P1 − P2 = 0.


2. P is idempotent and symmetric:

P 2 x = P (P x) = P x = x for all x ∈ MC (X);

P 2 ξ = P (P ξ) = P 0 = 0 for all ξ ∈ M⊥
C (X).

Therefore P 2 satisfies (i) and (ii), and since P is unique, P 2 = P . Further,


P y⊥(I − P )x for all x, y, so that y ′ P ′ (I − P )x = 0. i.e., P ′ = P ′ P , so
P = (P ′ )′ = (P ′ P )′ = P ′ P = P ′ .
Result. Let Ω be a subspace of the vector space Rn , and let PΩ be its
projection matrix. Then MC (PΩ ) = Ω.
Proof. Note that MC (PΩ ) ⊂ Ω. For this, take y ∈ MC (PΩ ). Then y is a
linear combination of columns of PΩ , or y = PΩ u for some u. Since u = w ⊕v,
w ∈ Ω, v ∈ Ω⊥ , we have, y = PΩ u = PΩ (w ⊕ v) = PΩ w = w ∈ Ω. Conversely,
if x ∈ Ω, then x = PΩ x ∈ MC (PΩ ).

In − PΩ represents the orthogonal projection. i.e., Rn = Ω ⊕ Ω⊥ . Thus for
any y ∈ Rn , we have y = PΩ y ⊕ (I − PΩ )y.
If Pn×n is any symmetric idempotent matrix, it represents a projection onto
MC (P ): if y ∈ Rn , then y = P y + (I − P )y = u + v. Note

u′ v = (P y)′ (I − P )y = y ′ P (I − P )y = y ′ (P − P 2 )y = 0,

so that we get y = u ⊕ v, u ∈ MC (P ), v ∈ M⊥
C (P ).

Question. Given X, how to find P such that MC (X) = MC (P )?
Result. If Ω = MC(X), then PΩ = X(X'X)^−X', where (X'X)^− is any generalized inverse of X'X.
Definition. If Bm×n is any matrix, a generalized inverse of B is any n × m
matrix B − satisfying BB − B = B.
Existence. From the singular value decomposition of B, there exist orthogonal matrices P_{m×m} and Q_{n×n} such that
P B Q = ∆_{m×n} = [D_{r×r}, 0_{r×(n-r)}; 0_{(m-r)×r}, 0_{(m-r)×(n-r)}],
where r = Rank(B). Define ∆^−_{n×m} = [D_r^{-1}, 0; 0, 0] and let B^− = Q∆^−P. First,
∆∆^−∆ = [D_r, 0; 0, 0][D_r^{-1}, 0; 0, 0][D_r, 0; 0, 0] = [D_r, 0; 0, 0] = ∆.

Further, B = P ′ ∆Q′ , so that

BB − B = P ′ ∆Q′ Q∆− P P ′ ∆Q′ = P ′ ∆∆− ∆Q′ = P ′ ∆Q′ = B.

Proof of Result. Let B = X ′ X. Find B − such that BB − B = B. For any


Y ∈ Rn , let c = X ′ Y , and let β̃ be any solution of X ′ Xβ = X ′ Y , or that of
Bβ = c. Then
B(B − c) = BB − B β̃ = B β̃ = c,
so that β̂ = B − c is a particular solution of Bβ = c. Let θ̂ = X β̂ = XB − c.
Then, Y = θ̂ + (Y − θ̂), where

θ̂′ (Y − θ̂) = β̂ ′ X ′ (Y − X β̂) = β̂ ′ (X ′ Y − X ′ X β̂) = 0.

Therefore we have an orthogonal decomposition of Y such that θ̂ ∈ MC (X)


and (Y − θ̂)⊥MC (X). Now note that θ̂ = X β̂ = X(X ′ X)− X ′ Y . i.e.,
for Y , its projection onto MC (X) is given by X(X ′ X)− X ′ Y . Therefore,
PΩ = X(X ′ X)− X ′ since PΩ is unique.
Techniques for finding B − are needed: if B = X ′ X, then PΩ = X(X ′ X)− X ′ ;
if we want to solve X ′ Xβ = X ′ Y , or Bβ = c, then β̂ = B − c.
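One convenient generalized inverse in R is the Moore-Penrose inverse provided by the MASS package; a sketch with a deliberately rank-deficient design (the data are made up).

# Least squares via a generalized inverse when X'X is singular.
library(MASS)                                # ginv(): Moore-Penrose generalized inverse
X <- cbind(1, c(1, 1, 0, 0), c(0, 0, 1, 1))  # column 1 = column 2 + column 3, so rank(X) = 2
Y <- c(3.1, 2.9, 5.0, 5.2)
B       <- t(X) %*% X
Bminus  <- ginv(B)                           # one particular choice of B^-
betahat <- Bminus %*% t(X) %*% Y             # a (non-unique) solution of X'X beta = X'Y
P_Omega <- X %*% Bminus %*% t(X)             # projection onto M_C(X); unique
P_Omega %*% Y                                # unique fitted values theta-hat = X betahat
max(abs(B %*% Bminus %*% B - B))             # check B B^- B = B (essentially 0)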

 
For B_{p×m} with rank r < p and B = [B11, B12; B21, B22], where B11 (which is r × r of rank r) is nonsingular, if we take B^− = [B11^{-1}, 0; 0, 0], then note that
BB^−B = [B11, B12; B21, B22][B11^{-1}, 0; 0, 0][B11, B12; B21, B22] = [Ir, 0; B21B11^{-1}, 0][B11, B12; B21, B22] = [B11, B12; B21, B21B11^{-1}B12].
Now note that (B21 | B22) is a linear function of (B11 | B12), i.e., (B21 | B22) = K(B11 | B12) = (KB11 | KB12) for some matrix K. Therefore KB11 = B21, i.e., K = B21B11^{-1}, so B22 = KB12 = B21B11^{-1}B12, and hence BB^−B = B.
Example. Let
B = [1, 2, 5, 2; 3, 7, 12, 4; 0, 1, −3, −2].
Then the rank of B is 2, since (2nd row) − 3 × (1st row) = (3rd row). Partition B as B = [B11, B12; B21, B22] with B11 = [1, 2; 3, 7]. Take
B^− = [B11^{-1}, 0; 0, 0] = [7, −2, 0; −3, 1, 0; 0, 0, 0; 0, 0, 0].

Example. Consider the model:
y1 = β1 + β2 + ǫ1
y2 = β1 + β2 + ǫ2
y3 = β1 + β2 + ǫ3
This is equivalent to Y = Xβ + ǫ with β = (β1, β2)' and X the 3 × 2 matrix all of whose entries are 1. X has rank 1; X'X = [3, 3; 3, 3], so choose (X'X)^− = [1/3, 0; 0, 0]. Then check that (X'X)(X'X)^−(X'X) = X'X. We then have
Xβ̂ = θ̂ = X(X'X)^−X'Y = [1/3, 0; 1/3, 0; 1/3, 0](y1 + y2 + y3, y1 + y2 + y3)' = (ȳ, ȳ, ȳ)',
i.e., each fitted value is the estimate of β1 + β2. So only β1 + β2 can be estimated? Note that β1 + β2 = (1 1)(β1, β2)' and MC((1, 1)') = MC(X'). More on this later.

Theorem. If Y ∼ Nn (Xβ, σ 2 In ), where Xn×p has rank r and β̂ = (X ′ X)− X ′ Y
is a least squares solution of β,
(i) X β̂ ∼ Nn (Xβ, σ 2 P ),
(ii) (β̂ − β)′ X ′ X(β̂ − β) ∼ σ 2 χ2r
(iii) Xβ̂ is independent of RSS = (Y − Xβ̂)'(Y − Xβ̂), and
(iv) RSS/σ^2 ∼ χ^2_{n-r} (independent of Xβ̂).

Proof. (i) Since X β̂ = X(X ′ X)− X ′ Y = P Y , we have

X β̂ ∼ Nn (P Xβ, σ 2 P 2 ) = Nn (Xβ, σ 2 P ).

(ii) Since X β̂ = P Y and Xβ = P Xβ,

(β̂ − β)′ X ′ X(β̂ − β) = (X β̂ − Xβ)′ (X β̂ − Xβ)


= (Y − Xβ)′ P (Y − Xβ) ∼ σ 2 χ2r ,

P being symmetric idempotent of rank r.


(iii) We have X β̂ = P Y , RSS = (Y − X β̂)′ (Y − X β̂) = Y ′ (I − P )Y and
P (I − P ) = 0. Therefore independence of X β̂ and RSS follows.
(iv) Note again that

RSS = Y ′ (I − P )Y = (Y − Xβ)′ (I − P )(Y − Xβ) ∼ σ 2 χ2n−r ,

I − P being a projection matrix of rank n − r.


Estimability
Consider the Gauss-Markov model again: Y = Xβ + ǫ, with E(ǫ) = 0 and
Cov(ǫ) = σ 2 In . Now suppose rank of X is r < p.
Definition. A linear parametric function a′ β is said to be estimable if it has
a linear unbiased estimate b′ Y .
Theorem. a′ β is estimable iff a ∈ MC (X ′ ) = MC (X ′ X).
Proof. a′ β is estimable iff there exists b such that E(b′ Y ) = a′ β for all
β ∈ Rp . i.e., b′ Xβ = a′ β for all β ∈ Rp . i.e., b′ X = a′ or a = X ′ b for some
b ∈ Rn .
Theorem (Gauss-Markov). If a′ β is estimable, and β̂ is any least squares
solution (i.e., solution of X ′ Xβ = X ′ Y ),
(i) a′ β̂ is unique,
(ii) a′ β̂ is the BLUE of a′ β.
Proof. (i) If a′ β is estimable, a′ β = b′ Xβ = b′ θ for some b ∈ Rn . Since
θ̂ is the unique projection of Y onto MC (X), we note b′ θ̂ = b′ X β̂ = a′ β̂ is

unique. i.e., if β̃ is any other LS solution, then also b′ X β̃ = b′ X β̂ = a′ β̂.
(ii) If d′ Y is any other linear unbiased estimate of a′ β, then
E(d′ Y ) = d′ Xβ = d′ θ = a′ β = b′ Xβ = b′ θ for all β ∈ Rp .
i.e., d′ θ = b′ θ for all θ ∈ MC (X).
i.e., (d − b)′ θ = 0 for all θ ∈ MC (X), or (d − b)⊥MC (X). Consider P =
PMC (X) = X(X ′ X)− X ′ . Then P (d − b) = 0 or P d = P b. Therefore,

V ar(d′ Y ) − V ar(a′ β̂) = V ar(d′ Y ) − V ar(b′ θ̂)


= V ar(d′ Y ) − V ar(b′ P Y ) = V ar(d′ Y ) − V ar(d′ P Y )
= σ 2 (d′ d − d′ P d) = σ 2 d′ (I − P )d ≥ 0,
with equality iff (I − P )d = 0 or d = P d = P b. i.e., d′ Y = b′ P Y = b′ θ̂ = a′ β̂.
Remark. Parametric functions a′ β are estimable when a ∈ MC (X ′ ) = Row
space of X.
Example. Consider again the model:
yij = µ + αi + τj + ǫij,  i = 1, 2, 3, 4;  j = 1, 2.
Suppose comparing τ1 and τ2 is of interest. Here Y = (y11, y12, y21, y22, ..., y41, y42)' = Xβ + ǫ with β = (µ, α1, α2, α3, α4, τ1, τ2)', where the row of X for yij has a 1 in the columns for µ, αi and τj and 0 elsewhere; e.g. the rows for y11 and y12 are (1, 1, 0, 0, 0, 1, 0) and (1, 1, 0, 0, 0, 0, 1). Since each such row lies in MC(X'), µ + αi + τj is estimable for all i and j. Therefore (µ + αi + τ1) − (µ + αi + τ2) = τ1 − τ2 is estimable, and (µ + αi + τ1) − (µ + αj + τ1) = αi − αj is estimable.
What else is estimable, apart from linear combinations of these?
Result. If a'β is estimable and Y ∼ Nn(Xβ, σ^2 In), a 100(1 − α)% confidence interval for a'β is given by
a'β̂ ± t_{n-r}(1 − α/2) √(a'(X'X)^−a) √(RSS/(n − r)).
Proof. Note that a'β = c'Xβ = c'θ for some c. Therefore a'β̂ = c'θ̂ = c'PY ∼ N(a'β, σ^2 c'Pc). Now c'Pc = c'X(X'X)^−X'c = a'(X'X)^−a. Therefore,
(a'β̂ − a'β) / √(σ^2 a'(X'X)^−a) ∼ N(0, 1).
Further, since RSS/σ^2 ∼ χ^2_{n-r} independently of Xβ̂, and hence of c'Xβ̂ = c'θ̂ = a'β̂,
(a'β̂ − a'β) / [ √(σ^2 a'(X'X)^−a) √(RSS/(σ^2(n − r))) ] ∼ t_{n-r}.
Hence,
P( |a'β̂ − a'β| ≤ t_{n-r}(1 − α/2) √(a'(X'X)^−a) √(RSS/(n − r)) ) = 1 − α.
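A hedged R transcription of this interval for an estimable a'β (the helper function is my own; MASS::ginv supplies one choice of (X'X)^−).

# 100(1 - alpha)% confidence interval for an estimable function a'beta.
library(MASS)
ci_estimable <- function(X, Y, a, alpha = 0.05) {
  XtXminus <- ginv(t(X) %*% X)
  betahat  <- XtXminus %*% t(X) %*% Y
  r   <- qr(X)$rank
  n   <- nrow(X)
  RSS <- sum((Y - X %*% betahat)^2)
  est <- sum(a * betahat)
  se  <- sqrt(as.numeric(t(a) %*% XtXminus %*% a) * RSS / (n - r))
  est + c(-1, 1) * qt(1 - alpha / 2, df = n - r) * se
}
# Example: a small two-group design; the contrast beta2 - beta3 is estimable.
X <- cbind(1, c(1, 1, 0, 0), c(0, 0, 1, 1)); Y <- c(3.1, 2.9, 5.0, 5.2)
ci_estimable(X, Y, a = c(0, 1, -1))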

Maximum likelihood estimation
Does LS estimate have other optimality properties?
Since we have assumed that Y ∼ Nn (Xβ, σ 2 In ) to derive distributional prop-
erties of β̂, let us derive the maximum likelihood estimates of β and σ 2 under
this assumption. β̂mle and σ̂ 2 are values of β and σ 2 which maximize the
likelihood,
 
(2π)^{-n/2}(σ^2)^{-n/2} exp( −(1/(2σ^2))(Y − Xβ)'(Y − Xβ) ).
Equivalently, we may maximize the log-likelihood,
−(n/2) log(σ^2) − (1/(2σ^2))(Y − Xβ)'(Y − Xβ).
Fix σ^2 and maximize over β, then maximize over σ^2. Now note that maximizing over β (for any fixed σ^2) is equivalent to minimizing (Y − Xβ)'(Y − Xβ) = ||Y − Xβ||^2, which yields the same estimate as least squares, i.e., β̂_mle = β̂_ls. However, σ̂^2 = RSS/n, which is not unbiased.
Estimation under linear restrictions or constraints
Consider the following examples.
(i) yij = µ + αi + τj + ǫij . Test H0 : τ1 = τ2 . i.e., test whether there is any
difference between treatments 1 and 2. Under H0 , τ1 − τ2 = 0, or Aβ = c
where A = a′ = (0, 0, . . . , 0, 1, −1, 0, . . . , 0), β = (µ, α1 , . . . , αI , τ1 , τ2 , . . .)′ .
(ii) yi = β0 + β1 xi1 + . . . + βp−1 xi(p−1) + ǫi . Test H0 : X1 , . . . , Xp−1 are not
useful.
Recall that, to derive the GLRT, we need to estimate the parameters of the
model, both with and without restrictions. While testing linear hypotheses
in a linear model, we need to estimate β under the linear constraint Aβ = c.
Consider Y = Xβ + ǫ, Xn×p of rank p, first. We will consider the deficient
rank case later. Let us see how we can find the least squares estimate of β
subject to H : Aβ = c, where Aq×p of rank q and c is given. We can use the
Lagrange multiplier method of calculus for this as follows.
    min_β { ||Y − Xβ||² + λ′(Aβ − c) }
    = min_β { Y′Y − 2β′X′Y + β′X′Xβ + λ′Aβ − λ′c },        (1)

differentiating which (w.r.t. β) and setting equal to 0, we get

    −2X′Y + 2X′Xβ + A′λ = 0,  or  X′Xβ = X′Y − (1/2)A′λ_H.

Therefore,

    β̂_H = (X′X)⁻¹ { X′Y − (1/2)A′λ_H } = β̂ − (1/2)(X′X)⁻¹A′λ_H.        (*)

Differentiating (1) w.r.t. λ, we get Aβ − c = 0. Since

    c = Aβ̂_H = Aβ̂ − (1/2)A(X′X)⁻¹A′λ_H,

    c − Aβ̂ = −(1/2)A(X′X)⁻¹A′λ_H,  and hence

    −(1/2)λ_H = [A(X′X)⁻¹A′]⁻¹ (c − Aβ̂),  and therefore

    β̂_H = β̂ + (X′X)⁻¹A′ [A(X′X)⁻¹A′]⁻¹ (c − Aβ̂).
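As an aside, here is a minimal R sketch (not part of the original notes; the function name and the simulated data are invented for illustration) of the restricted estimate just derived, assuming X has full column rank and A has full row rank.

    ## Restricted least squares: beta_H = beta_hat + (X'X)^{-1} A' [A(X'X)^{-1}A']^{-1} (c - A beta_hat)
    restricted_ls <- function(X, y, A, c0) {
      XtX_inv  <- solve(crossprod(X))          # (X'X)^{-1}
      beta_hat <- XtX_inv %*% crossprod(X, y)  # ordinary least squares estimate
      V        <- A %*% XtX_inv %*% t(A)       # A (X'X)^{-1} A'
      beta_H   <- beta_hat + XtX_inv %*% t(A) %*% solve(V, c0 - A %*% beta_hat)
      list(beta_hat = drop(beta_hat), beta_H = drop(beta_H))
    }

    ## Illustration: restrict the two slope coefficients to be equal (beta1 - beta2 = 0)
    set.seed(1)
    X <- cbind(1, rnorm(20), rnorm(20))
    y <- X %*% c(1, 2, 2) + rnorm(20)
    restricted_ls(X, y, A = matrix(c(0, 1, -1), nrow = 1), c0 = 0)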
To establish minimization subject to Aβ = c, note that
||X(β̂ − β)||2 = (β̂ − β)′ X ′ X(β̂ − β)
= (β̂ − β̂H + β̂H − β)′ X ′ X(β̂ − β̂H + β̂H − β)
= (β̂ − β̂H )′ X ′ X(β̂ − β̂H ) + (β̂H − β)′ X ′ X(β̂H − β)
+2(β̂ − β̂H )′ X ′ X(β̂H − β)
= ||X(β̂ − β̂H )||2 + ||X(β̂H − β)||2 ,
since, from (*) above, and subject to Aβ = c,

    (β̂ − β̂_H)′X′X(β̂_H − β) = (1/2)λ_H′ A(X′X)⁻¹X′X(β̂_H − β)
    = (1/2)λ_H′ A(β̂_H − β) = (1/2)λ_H′ (Aβ̂_H − Aβ) = 0.
Therefore,
||Y − Xβ||2 = ||Y − X β̂||2 + ||X(β̂ − β)||2
= ||Y − X β̂||2 + ||X(β̂ − β̂H )||2 + ||X(β̂H − β)||2
≥ ||Y − X β̂||2 + ||X(β̂ − β̂H )||2 ,
and is a minimum when β = β̂H . (Note, X(β̂H − β) = 0 implies X ′ X(β̂H −
β) = 0, so β̂H − β = 0 since columns of X are linearly independent.) Also,
from above, we get,
||Y − X β̂H ||2 = ||Y − X β̂||2 + ||X(β̂ − β̂H )||2 .
If we let Ŷ = X β̂ and ŶH = X β̂H , then
||Y − ŶH ||2 = ||Y − Ŷ ||2 + ||Ŷ − ŶH ||2 .
Note that this can also be established using projection matrices, and not
just for the full column rank case. Let us first establish it for the case
Rank(Xn×p ) = p again, and next extend it.

2
Let β0 be a solution of Aβ = c. Then Y −Xβ0 = X(β −β0 )+ǫ or Ỹ = Xγ +ǫ
with Aγ = A(β − β0 ) = 0. i.e.,
Ỹ = θ + ǫ, θ ∈ MC (X) = Ω, and
A(X ′ X)−1 X ′ θ = A(X ′ X)−1 X ′ X(β − β0 ) = A(β − β0 ) = Aγ = 0.
Set A1 = A(X ′ X)−1 X ′ and ω = N (A1 ) ∩ Ω. Then A1 θ = Aγ = 0 and we
want the projection of Ỹ onto ω since we want:
min ||Ỹ − θ||2 subject to A1 θ = 0.
θ∈MC (X)

We need the following series of results to solve this.


Result A. If N (C) is the null space of C, then N (C) = M⊥ (C ′ ).
Proof. If x ∈ N (C), then Cx = 0 so that x is orthogonal to each row of C.
i.e., x⊥M(C ′ ). Conversely, if x⊥M(C ′ ), then x′ C ′ = (Cx)′ = 0, or Cx = 0,
hence x ∈ N (C).
Result B. (Ω1 ∩ Ω2)⊥ = Ω1⊥ + Ω2⊥.
Proof. Let Ωi = N(Ci), i = 1, 2. Then, with C denoting C1 stacked over C2,

    (Ω1 ∩ Ω2)⊥ = [N(C)]⊥ = M(C′) = M(C1′ | C2′) = M(C1′) + M(C2′) = Ω1⊥ + Ω2⊥.

Result C. If ω ⊂ Ω, then PΩ Pω = Pω PΩ = Pω .
Proof. Show that PΩ Pω and Pω PΩ both satisfy the defining properties of
Pω : If x ∈ ω ⊂ Ω, then PΩ Pω x = PΩ x = x; if ξ ∈ ω ⊥ , PΩ Pω ξ = PΩ 0 = 0.
Similar is the other case.
Result D. If ω ⊂ Ω, then PΩ − Pω = Pω⊥ ∩Ω .
Proof. Ω = MC(PΩ), so each x ∈ Ω can be written x = PΩy. Consider the decomposition PΩy = Pωy + (PΩ − Pω)y. Now Pωy ∈ ω ⊂ Ω, and already PΩy ∈ Ω, so (PΩ − Pω)y = PΩy − Pωy ∈ Ω. Further, Pω(PΩ − Pω) = PωPΩ − Pω = Pω − Pω = 0, so that (Pωy)′(PΩ − Pω)y = y′Pω(PΩ − Pω)y = 0. Therefore, PΩy = Pωy ⊕ (PΩ − Pω)y is the orthogonal decomposition of Ω into ω ⊕ (ω⊥ ∩ Ω).
Result E. If A1 is any matrix such that ω = N (A1 ) ∩ Ω, then ω ⊥ ∩ Ω =
MC (PΩ A′1 ).
Proof. Note that

    ω⊥ ∩ Ω = (Ω ∩ N(A1))⊥ ∩ Ω = (Ω⊥ ⊕ N⊥(A1)) ∩ Ω = (Ω⊥ ⊕ MC(A1′)) ∩ Ω.

Now, let x ∈ ω⊥ ∩ Ω = (Ω⊥ ⊕ MC(A1′)) ∩ Ω. Then x ∈ Ω, so x = PΩx. Also, x ∈ Ω⊥ ⊕ MC(A1′), so x = (I − PΩ)α + A1′β. Therefore,

    x = PΩx = PΩ{(I − PΩ)α + A1′β} = PΩA1′β ∈ MC(PΩA1′).

Conversely, if x ∈ MC(PΩA1′), then x = PΩA1′β = PΩ(A1′β) ∈ MC(PΩ) = Ω. For any ξ ∈ ω(⊂ Ω), we have x′ξ = β′A1PΩξ = β′A1ξ = 0 since ω = N(A1) ∩ Ω. Therefore, x ∈ ω⊥.
Result F. If A1 is a q × n matrix of rank q, then Rank(PΩ A′1 ) = q iff
MC (A′1 ) ∩ Ω⊥ = {0}.
Proof. Rank(PΩA1′) ≤ Rank(A1′) = Rank(A1) = q. Suppose Rank(PΩA1′) < q. Let the rows of A1 (i.e., the columns of A1′) be a1′, . . . , aq′. The columns of PΩA1′ are then linearly dependent, so Σ_{i=1}^q ci PΩ ai = PΩ(Σ_{i=1}^q ci ai) = 0 for some c ≠ 0. Then there exists a vector Σ_{i=1}^q ci ai ∈ MC(A1′) (≠ 0 since the rank of A1 is q) such that Σ_{i=1}^q ci ai ⊥ Ω, i.e., MC(A1′) ∩ Ω⊥ ≠ {0}. If Rank(PΩA1′) = q = Rank(A1′), then MC(A1′) = MC(PΩA1′) = ω⊥ ∩ Ω ⊂ Ω.
Now let us return to the problem of finding the projection of Ỹ onto ω =
N (A1 ) ∩ Ω which achieves:
min ||Ỹ − θ||2 subject to A1 θ = 0.
θ∈MC (X)

From Results A and B, ω ⊥ ∩ Ω = (N (A1 ) ∩ Ω)⊥ ∩ Ω = (MC (A′1 ) + Ω⊥ ) ∩ Ω


and from Result E, ω ⊥ ∩ Ω = MC (PΩ A′1 ). Now note that
PΩ A′1 = (X(X ′ X)−1 X ′ )X(X ′ X)−1 A′ = X(X ′ X)−1 A′ = A′1 .
Therefore, Rank(PΩ A′1 ) = Rank(A′1 ) ≤ q. However, since Rank(PΩ A′1 ) =
Rank(X(X ′ X)−1 A′ ) ≥ Rank(X ′ X(X ′ X)−1 A′ ) = Rank(A′ ) = q, we must
have Rank(PΩ A′1 ) = q. Therefore, from Result D,
    PΩ − Pω = P_{ω⊥∩Ω} = P_{MC(PΩA1′)}
            = PΩA1′ (A1 PΩ² A1′)⁻¹ (PΩA1′)′
            = X(X′X)⁻¹A′ [A(X′X)⁻¹X′X(X′X)⁻¹A′]⁻¹ A(X′X)⁻¹X′
            = X(X′X)⁻¹A′ [A(X′X)⁻¹A′]⁻¹ A(X′X)⁻¹X′.

Therefore,

    Xβ̂_H − Xβ0 = Xγ̂_H = Pω Ỹ = PΩ Ỹ − P_{ω⊥∩Ω} Ỹ
    = PΩY − Xβ0 − X(X′X)⁻¹A′ [A(X′X)⁻¹A′]⁻¹ A(X′X)⁻¹X′ (Y − Xβ0)
    = PΩY − Xβ0 − X(X′X)⁻¹A′ [A(X′X)⁻¹A′]⁻¹ A{ (X′X)⁻¹X′Y − β0 }
    = PΩY − Xβ0 − X(X′X)⁻¹A′ [A(X′X)⁻¹A′]⁻¹ { Aβ̂ − c }.
Therefore,

    Xβ̂_H = Xβ̂ − X(X′X)⁻¹A′ [A(X′X)⁻¹A′]⁻¹ (Aβ̂ − c).

Multiplying by (X′X)⁻¹X′ on the left, we get

    β̂_H = β̂ − (X′X)⁻¹A′ [A(X′X)⁻¹A′]⁻¹ (Aβ̂ − c).
This yields the minimum since ||Y − X β̂H ||2 = ||Ỹ − X γ̂H ||2 .

Case of X having less than full column rank
Rank(X_{n×p}) = r < p. Since only estimable linear functions a′β can be estimated, assume that a_i′β, i = 1, 2, . . . , q, are estimable, and let A_{q×p} have rows a1′, . . . , aq′. Since a_i′ = m_i′X for some m_i′, we have A = M_{q×n} X_{n×p}. Since A has rank q, M also has rank q (≤ r). Proceeding as before, let β0 be any solution of Aβ = c. Then consider Ỹ = Y − Xβ0 = X(β − β0) + ǫ, or Ỹ = Xγ + ǫ, or

Ỹ = θ + ǫ, θ ∈ MC(X) = Ω, and Mθ = MXγ = Aγ = 0.

We want to find β̂_H, the least squares solution subject to H: Aβ = c. If ω = Ω ∩ N(M), then ω⊥ ∩ Ω = MC(PΩM′), and PΩM′ = X(X′X)⁻X′M′ = X(X′X)⁻A′. Further, MPΩM′ = MX(X′X)⁻X′M′ = A(X′X)⁻A′ is nonsingular. This is because (since X′PΩ = X′)

    q = Rank(M′) ≥ Rank(PΩM′) ≥ Rank(X′PΩM′) = Rank(X′M′) = Rank(A′) = q.

Therefore

    PΩ − Pω = P_{ω⊥∩Ω} = P_{MC(PΩM′)}
            = PΩM′ (MPΩM′)⁻¹ MPΩ
            = X(X′X)⁻A′ [A(X′X)⁻A′]⁻¹ A(X′X)⁻X′.

Hence,

    Xβ̂_H − Xβ0 = Xγ̂_H = Pω Ỹ = PΩ Ỹ − P_{ω⊥∩Ω} Ỹ
    = PΩY − Xβ0 − PΩM′ (MPΩM′)⁻¹ MPΩ (Y − Xβ0),  so that

    X′Xβ̂_H − X′Xβ0 = X′PΩY − X′Xβ0 − X′PΩM′ (MPΩM′)⁻¹ MPΩ (Y − Xβ0).

Thus,

    X′Xβ̂_H = X′Y − X′M′ (MPΩM′)⁻¹ { MPΩY − MPΩXβ0 }
            = X′Y − X′M′ (MPΩM′)⁻¹ { MX(X′X)⁻X′Y − MXβ0 }
            = X′Y − X′M′ (MPΩM′)⁻¹ { A(X′X)⁻X′Y − Aβ0 }
            = X′Y − A′ [A(X′X)⁻A′]⁻¹ { Aβ̂ − c }
            = X′Xβ̂ − A′ [A(X′X)⁻A′]⁻¹ { Aβ̂ − c }.

1
Now recall that a solution of Bu = d is û = B⁻d. Therefore, from the above, since

    X′X(β̂_H − β̂) = −A′ [A(X′X)⁻A′]⁻¹ { Aβ̂ − c },

we have that

    β̂_H = β̂ − (X′X)⁻A′ [A(X′X)⁻A′]⁻¹ { Aβ̂ − c }.

Also, these two together yield

    (β̂_H − β̂)′ X′X (β̂_H − β̂)
    = (Aβ̂ − c)′ [A(X′X)⁻A′]⁻¹ A(X′X)⁻A′ [A(X′X)⁻A′]⁻¹ (Aβ̂ − c)
    = (Aβ̂ − c)′ [A(X′X)⁻A′]⁻¹ (Aβ̂ − c).

Linear Regression

Consider the model:


Y = Xβ + ǫ, with E(ǫ) = 0 and Cov(ǫ) = σ 2 In . Then β̂ = (X ′ X)− X ′ Y is a
least squares solution. If Xn×p has rank p, it is the least squares estimate of
β. It is an optimal estimate in the sense that for all a ∈ Rp , a′ β̂ is the BLUE
of a′ β. Also, E(β̂) = β and Cov(β̂) = σ 2 (X ′ X)−1 . If X has rank r < p, β̂
is still optimal in the sense that for all estimable a′ β (i.e., a = X ′ b), we still
have that a′ β̂ is the BLUE of a′ β.
If Y ∼ Nn(Xβ, σ²In), then a′β̂ ∼ N(a′β, σ²a′(X′X)⁻a), and hence

    a′β̂ ± t_{n−r}(1 − α/2) √(RSS/(n − r)) √(a′(X′X)⁻a)

is a 100(1 − α)% confidence interval for a′β for any estimable a′β.
Now we want to explore the question: how good is the model Y = Xβ + ǫ
for the given data?
Analysis of Variance (ANOVA) for Regression
Given Y_{n×1}, we look at Y′Y = Σ_{i=1}^n y_i² as its variation around 0, in the absence of any other assumptions. It has n degrees of freedom. If a centre (or intercept) is considered useful (i.e., yi = β0 + ǫi), then we can decompose it as Σ_{i=1}^n y_i² = nȳ² + Σ_{i=1}^n (yi − ȳ)² and check how much the variation is reduced. If we think that the predictor set X is relevant (i.e., Y = Xβ + ǫ), the sum of squares SST = Y′Y can be decomposed as follows:

SST = Y ′ Y = (Y − Ŷ )′ (Y − Ŷ ) + Ŷ ′ Ŷ
= Y ′ (I − P )Y + Y ′ P Y
= Y ′ (I − P )Y + β̂ ′ X ′ X β̂
= Y ′ Y − β̂ ′ X ′ Y + β̂ ′ X ′ Y
= RSS + SSR,

where RSS is the residual sum of squares and SSR is the sum of squares due to
regression. If Xn×p has rank r ≤ p, then n = (n − r) + r is the corresponding
decomposition of the degrees of freedom. Thus, analysis of variance is simply
the decomposition of total sum of squares into components which can be
attributed to different factors. Then this simple minded ANOVA for Y =
Xβ + ǫ will look as follows.

source of variation    sum of squares        d.f.          mean squares        F-ratio
model: Y = Xβ + ǫ      SSR = β̂′X′Y           r = Rank(X)   MSR = SSR/r         F = MSR/MSE
residual error         SSE = Y′Y − β̂′X′Y     n − r         MSE = SSE/(n − r)
Total                  SST = Y′Y             n
If Y ∼ Nn (Xβ, σ 2 In ),
(i) X β̂ is independent of SSE = RSS = (Y − X β̂)′ (Y − X β̂) = Y ′ Y − β̂ ′ X ′ Y ,
and
(ii) SSE = RSS ∼ σ 2 χ2n−r ;
(iii) if indeed the linear model is not useful, then β = 0 so that β̂ ′ X ′ X β̂ =
(β̂ − β)′ X ′ X(β̂ − β) ∼ σ 2 χ2r .
Therefore, to check usefulness of the linear model, use
F = MSR/MSE ∼ Fr,n−r (if β = 0).
If β 6= 0, then β̂ ′ X ′ X β̂ ∼ non-central χ2 and E(β̂ ′ X ′ X β̂) = rσ 2 +β ′ X ′ Xβ >
rσ 2 , so large values of F-ratio indicate evidence for β 6= 0.
However, this ANOVA is not particularly useful since (usually) the first
column of X is 1 indicating that the model includes an intercept or cen-
tre. This constant term is generally useful, and we only want to test H0 : β1 = β2 = · · · = βp−1 = 0 to check the usefulness of the actual regressors, X1, . . . , Xp−1 (not X0 = 1). Before discussing this, let us recall a result in
probability on decomposing the variance:
If X and Y are jointly distributed (with finite second moments), then

V ar(Y ) = E [V ar(Y |X)] + V ar [E(Y |X)] .

The first term on RHS is the ‘within variation’: if Y is partitioned according


to values of X, how much is left to be explained in Y for given X. The second
term is the variation between Ŷ (X) values, and is the ‘between variation’. In
a study, V ar(Y) may be large, but if V ar(Y|X) is small, it makes sense to use X to predict Y. This result is known as the Analysis of Variance
formula, and the ANOVA for regression is based on it. Some more results
are needed to derive it.

The F-test (to check the goodness of linear models)
We have the model, Y = Xβ+ǫ, Xn×p of rank r ≤ p and with ǫ ∼ Nn (0, σ 2 In ).
Suppose we want to test H0 : Aβ = c, Aq×p of rank q ≤ r, and c is given.
Then
    RSS = SSE = (Y − Xβ̂)′(Y − Xβ̂) = Y′(I − P)Y,
    RSS_{H0} = (Y − Xβ̂_{H0})′(Y − Xβ̂_{H0}),  where
    β̂_{H0} = β̂ + (X′X)⁻A′ [A(X′X)⁻A′]⁻¹ (c − Aβ̂).

Theorem. Under the above mentioned assumptions, we have:
(i) RSS ∼ σ²χ²_{n−r};
(ii) RSS_{H0} − RSS = (Aβ̂ − c)′ [A(X′X)⁻A′]⁻¹ (Aβ̂ − c);
(iii) E(RSS_{H0} − RSS) = qσ² + (Aβ − c)′ [A(X′X)⁻A′]⁻¹ (Aβ − c);
(iv) under H0: Aβ = c,

    F = { (RSS_{H0} − RSS)/q } / { RSS/(n − r) } ∼ F_{q,n−r};

(v) when c = 0,

    F = ((n − r)/q) · { Y′(P − P_{H0})Y / Y′(In − P)Y },

where P_{H0} is symmetric idempotent and P_{H0}P = PP_{H0} = P_{H0}.
Proof. (i) Already known.
(ii) Note that

    RSS_{H0} = (Y − Xβ̂_{H0})′(Y − Xβ̂_{H0})
             = (Y − Xβ̂ + Xβ̂ − Xβ̂_{H0})′(Y − Xβ̂ + Xβ̂ − Xβ̂_{H0})
             = (Y − Xβ̂)′(Y − Xβ̂) + (Xβ̂ − Xβ̂_{H0})′(Xβ̂ − Xβ̂_{H0}) + 2(Xβ̂ − Xβ̂_{H0})′(Y − Xβ̂)
             = RSS + (β̂ − β̂_{H0})′X′X(β̂ − β̂_{H0}),

since (Xβ̂ − Xβ̂_{H0})′(Y − Xβ̂) = (β̂ − β̂_{H0})′(X′Y − X′Xβ̂) = 0. Now from an earlier result, (β̂ − β̂_{H0})′X′X(β̂ − β̂_{H0}) = (Aβ̂ − c)′ [A(X′X)⁻A′]⁻¹ (Aβ̂ − c).
(iii) Aβ̂ = MXβ̂ = MPY ∼ Nq(Aβ, σ²A(X′X)⁻A′), so that E(Aβ̂ − c) = Aβ − c and Cov(Aβ̂) = σ²A(X′X)⁻A′. Therefore,

    E(RSS_{H0} − RSS) = E{ (Aβ̂ − c)′ [A(X′X)⁻A′]⁻¹ (Aβ̂ − c) }
    = (Aβ − c)′ [A(X′X)⁻A′]⁻¹ (Aβ − c) + tr{ σ² [A(X′X)⁻A′]⁻¹ A(X′X)⁻A′ }
    = qσ² + (Aβ − c)′ [A(X′X)⁻A′]⁻¹ (Aβ − c),
which is large if Aβ is far from c.
(iv) Note that

    RSS_{H0} − RSS = (Aβ̂ − c)′ [A(X′X)⁻A′]⁻¹ (Aβ̂ − c) ∼ σ²χ²_q

under H0, since Aβ̂ − c ∼ Nq(Aβ − c, σ²A(X′X)⁻A′) = Nq(0, σ²A(X′X)⁻A′). Also, RSS ∼ σ²χ²_{n−r} from (i). Further, RSS is independent of Xβ̂ = PY. Since Aβ is estimable, A = MX, so that Aβ̂ = MXβ̂ = MPY, which is independent of RSS.
(v) If c = 0, we have

    Xβ̂_{H0} = X{ β̂ − (X′X)⁻A′ [A(X′X)⁻A′]⁻¹ Aβ̂ }
            = X{ (X′X)⁻X′Y − (X′X)⁻A′ [A(X′X)⁻A′]⁻¹ A(X′X)⁻X′Y }
            = { X(X′X)⁻X′ − X(X′X)⁻A′ [A(X′X)⁻A′]⁻¹ A(X′X)⁻X′ } Y
            = (P − P1)Y = P_{H0}Y.

Clearly, P_{H0} is symmetric. Further, P1 is symmetric, and

    P1² = X(X′X)⁻A′ [A(X′X)⁻A′]⁻¹ { A(X′X)⁻X′X(X′X)⁻A′ } [A(X′X)⁻A′]⁻¹ A(X′X)⁻X′
        = X(X′X)⁻A′ [A(X′X)⁻A′]⁻¹ { A(X′X)⁻A′ } [A(X′X)⁻A′]⁻¹ A(X′X)⁻X′
        = X(X′X)⁻A′ [A(X′X)⁻A′]⁻¹ A(X′X)⁻X′ = P1,

since the term in the middle of the expression is A(X′X)⁻X′X(X′X)⁻A′ = MX(X′X)⁻X′X(X′X)⁻X′M′ = MP²M′ = MPM′ = A(X′X)⁻A′. Also,

    P1P = X(X′X)⁻A′ [A(X′X)⁻A′]⁻¹ A(X′X)⁻X′X(X′X)⁻X′
        = X(X′X)⁻A′ [A(X′X)⁻A′]⁻¹ A(X′X)⁻X′ = P1,

since X′X(X′X)⁻X′ = X′P = X′P′ = (PX)′ = X′. Note, P1 = P1′ = (P1P)′ = PP1. Therefore,

    P_{H0}² = (P − P1)² = P² − PP1 − P1P + P1² = P − 2P1 + P1 = P − P1 = P_{H0},

and P_{H0}P = (P − P1)P = P − P1 = P_{H0} = PP_{H0}. Therefore,

    RSS_{H0} = ||Y − Xβ̂_{H0}||² = (Y − Xβ̂_{H0})′(Y − Xβ̂_{H0}) = (Y − P_{H0}Y)′(Y − P_{H0}Y) = Y′(I − P_{H0})Y

and

    RSS_{H0} − RSS = Y′(I − P_{H0})Y − Y′(I − P)Y = Y′(P − P_{H0})Y.
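A hedged R sketch of the F statistic of the theorem, written for the full-rank case so that (X′X)⁻ can be taken as the ordinary inverse (for the deficient-rank case one could substitute a generalized inverse such as MASS::ginv); the function name is invented for illustration.

    ## F test of H0: A beta = c0 in Y = X beta + eps, X of full column rank
    glh_test <- function(X, y, A, c0) {
      n <- nrow(X); r <- qr(X)$rank; q <- nrow(A)
      XtX_inv  <- solve(crossprod(X))
      beta_hat <- XtX_inv %*% crossprod(X, y)
      RSS      <- sum((y - X %*% beta_hat)^2)
      d        <- A %*% beta_hat - c0
      num      <- t(d) %*% solve(A %*% XtX_inv %*% t(A), d)   # = RSS_H0 - RSS
      Fstat    <- (num / q) / (RSS / (n - r))
      c(F = drop(Fstat), p.value = pf(drop(Fstat), q, n - r, lower.tail = FALSE))
    }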
Now we use the above result for checking the goodness of the linear fit. ANOVA for checking the goodness of Y = Xβ + ǫ, or yi = β0 + β1 xi1 + · · · + βp−1 xi(p−1) + ǫi, or equivalently for testing H0: β1 = · · · = βp−1 = 0, is what is needed. Intuitively, if X1, . . . , Xp−1 provide no useful information, then the appropriate model is yi = β0 + ǫi, so ȳ is the only quantity that can help in predicting y. Then RSS_{H0} = Σ_{i=1}^n (yi − ȳ)² is the sum of squares unexplained, and it has n − 1 d.f. If X1, . . . , Xp−1 are also used in the model, then (Y − Xβ̂)′(Y − Xβ̂) = RSS is the unexplained part with n − r d.f. How much better is RSS compared to RSS_{H0}? Let SSreg denote the sum of squares due to X1, . . . , Xp−1 and without an intercept. Then,

    RSS_{H0} = RSS + SSreg,
    Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n (yi − ŷi)² + SSreg.

In other words,

    Y′Y − (1/n)Y′11′Y = Y′(I − P)Y + SSreg,  or
    Y′Y = Y′(I − P)Y + { SSreg + (1/n)Y′11′Y },  or
    SSR = β̂′X′Xβ̂ = β̂′X′Y = SSreg + (1/n)Y′11′Y,

since Y′Y = Y′(I − P)Y + Y′PY = Y′(I − P)Y + β̂′X′Xβ̂. Now, 1(1′1)⁻¹1′ = (1/n)11′ = P_{M(1)} = P_{M(X0)}, so that SSR = nȳ² + SSreg is the orthogonal decomposition of SSR into components attributed to M(1) and M(X1, . . . , Xp−1). Therefore SSreg with r − 1 d.f. is the quantity to measure the merit of the regressors, X1, . . . , Xp−1.

ANOVA with mean

source of variation           d.f.    sum of squares             mean squares             F-ratio
mean                          1       SSM = nȳ²                  MSM = SSM/1              Fmean = MSM/MSE
regression on X1,...,Xp−1     r − 1   SSreg = β̂′X′Y − nȳ²        MSreg = SSreg/(r − 1)    Freg = MSreg/MSE
residual error                n − r   SSE = RSS = Y′Y − β̂′X′Y    MSE = SSE/(n − r)
Total                         n       SST = Y′Y
ANOVA for regression (corrected for mean)

source of variation           d.f.    sum of squares                  mean squares             F-ratio
regression (corrected)        r − 1   SSreg = β̂′X′Y − nȳ²             MSreg = SSreg/(r − 1)    Freg = MSreg/MSE
residual error                n − r   SSE = RSS = Y′Y − β̂′X′Y         MSE = SSE/(n − r)
Total (corrected)             n − 1   SST(corrected) = Σ(yi − ȳ)²
How good is the linear fit? There are two things to consider here.
(i) The ANOVA F-test: Under H0 : β1 = · · · = βp−1 = 0, the F-ratio,
Freg ∼ Fr−1,n−r and large values of the statistic provide evidence against H0 ,
or equivalently indicate that the regressors are useful.
(ii) The proportion of variability in y not explained by the actual regressors
is: RSS/SST (corrected), so the proportion of variability in y around its
mean, explained by the actual regressors is
    1 − RSS/SST(corrected) ≡ R² = coefficient of determination.

In other words,

    R² = 1 − RSS/SST(corrected) = 1 − Y′(I − P)Y / Y′(I − (1/n)11′)Y
       = { Σ_{i=1}^n (yi − ȳ)² − Y′(I − P)Y } / Σ_{i=1}^n (yi − ȳ)²
       = { Σ_{i=1}^n y_i² − nȳ² − Y′(I − P)Y } / Σ_{i=1}^n (yi − ȳ)²
       = { Y′Y − nȳ² − Y′Y + Y′PY } / Σ_{i=1}^n (yi − ȳ)²
       = { Y′PY − nȳ² } / Σ_{i=1}^n (yi − ȳ)² = { SSR − nȳ² } / Σ_{i=1}^n (yi − ȳ)²
       = SSreg / SST(corrected)
       = proportion of variability explained by the regressors.

Also,

    R² = SSreg / SST(corrected) = SSreg / (RSS + SSreg)
       = (SSreg/RSS) / (1 + SSreg/RSS)
       = { ((r − 1)/(n − r)) Freg } / { 1 + ((r − 1)/(n − r)) Freg }
is an increasing function of the F-ratio.
Note that to interpret the F-ratio, normality of ǫi is needed. R2 , however, is
a percentage with a straightforward interpretation.

Example 1 (socio-economic study). The demand for a consumer prod-
uct is affected by many factors. In one study, measurements on the relative
urbanization (X1 ), educational level (X2 ), and relative income (X3 ) of 9 ran-
domly chosen geographic regions were obtained in an attempt to determine
their effect on the product usage (Y ). The data were:
X1 X2 X3 Y
42.2 11.2 31.9 167.1
48.6 10.6 13.2 174.4
42.6 10.6 28.7 160.8
39.0 10.4 26.1 162.0
34.7 9.3 30.1 140.8
44.5 10.8 8.5 174.6
39.1 10.7 24.3 163.7
40.1 10.0 18.6 174.5
45.9 12.0 20.4 185.7
We fit the model Y = Xβ + ǫ, with E(ǫ) = 0 and Cov(ǫ) = σ²In. In this case, n = 9, p = 4, ȳ = 167.07, and the model is yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + ǫi. We get

    β̂ = (X′X)⁻¹X′Y = (60.0, 0.24, 10.72, −0.75)′.

The detailed ANOVA (with mean) is
source                   d.f.   SS                      MS          F-ratio
mean                     1      SSM = nȳ² = 251201.44   251201.44
regression (X1, X2, X3)  3      SSreg = 1081.35         360.45      Freg = 360.45/39.57 = 9.11
residual error           5      SSE = RSS = 197.85      39.57
Total (corrected)        8      1279.20
Total                    9      252480.64
From this note that s2 = RSS/(n − r) = MSE = 39.57, so s = 6.29 = σ̂, and
R2 = 1081.35/1279.20 = 84.5%. Abridged ANOVA is

source                   d.f.   SS                   MS        F-ratio
regression (X1, X2, X3)  3      SSreg = 1081.35      360.45    Freg = 360.45/39.57 = 9.11
residual error           5      SSE = RSS = 197.85   39.57
Total (corrected)        8      1279.20
R2 = 84.5% is substantial. What about F = 9.11? F3,5 (.95) = 5.41 and
F3,5 (.99) = 12.06, so there is some evidence against the null and justifying
the linear fit.
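The fit in Example 1 can be reproduced in R (a sketch; the numbers should agree with those above up to rounding).

    ## Example 1 data, typed from the table above
    x1 <- c(42.2, 48.6, 42.6, 39.0, 34.7, 44.5, 39.1, 40.1, 45.9)
    x2 <- c(11.2, 10.6, 10.6, 10.4,  9.3, 10.8, 10.7, 10.0, 12.0)
    x3 <- c(31.9, 13.2, 28.7, 26.1, 30.1,  8.5, 24.3, 18.6, 20.4)
    y  <- c(167.1, 174.4, 160.8, 162.0, 140.8, 174.6, 163.7, 174.5, 185.7)
    fit <- lm(y ~ x1 + x2 + x3)
    coef(fit)                 # should essentially match the beta-hat reported above
    summary(fit)$r.squared    # about 0.845
    anova(fit)                # sequential SS; these add up to SSreg = 1081.35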
Example 2. X = height (cm) and Y = weight (kg) for a sample of n = 10
eighteen-year-old American girls:
X Y
169.6 71.2
166.8 58.2
157.1 56.0
181.1 64.5
158.4 53.0
165.6 52.4
166.7 56.8
156.5 49.2
168.1 55.6
165.3 77.8
Upon fitting the simple linear regression model, yi = β0 + β1 xi + ǫi, we get (β̂0, β̂1)′ = (−36.9, 0.582)′, s² = MSE = 71.50, s = 8.456, R² = 21.9%, ȳ = 59.47. ANOVA is
source d.f. SS MS F R2
X 1 159.95 159.95 2.24 21.9%
error 8 512.01 71.50
Total (C) 9 731.96
Note the following. (i) X is expected to be a useful predictor of Y , but the
relationship may not be simple. (ii) F1,8 (.90) = 3.46 = (1.86)2 = t28 (.95), so
is there a connection between the ANOVA F-test and a t-test?
Consider simple linear regression again: yi = β0 + β1 xi + ǫi , i = 1, . . . , n, ǫi
i.i.d. N (0, σ 2 ). Then the F-ratio is the F statistic for testing the goodness of
fit of the linear model, or for testing H0 : β1 = 0. Writing the linear model

in the standard form, we have

    X_{n×2} = [ 1, x1 ; 1, x2 ; . . . ; 1, xn ],   X′X = [ n, Σ_{i=1}^n xi ; Σ_{i=1}^n xi, Σ_{i=1}^n x_i² ],  and

    (X′X)⁻¹ = { 1 / (n Σ_{i=1}^n (xi − x̄)²) } [ Σ_{i=1}^n x_i², −nx̄ ; −nx̄, n ].

Therefore

    (β̂0, β̂1)′ = (X′X)⁻¹X′Y = { 1 / (n Σ_{i=1}^n (xi − x̄)²) } [ Σ_{i=1}^n x_i², −nx̄ ; −nx̄, n ] ( Σ_{i=1}^n yi, Σ_{i=1}^n xi yi )′.

Letting SXX = Σ_{i=1}^n (xi − x̄)² and SXY = Σ_{i=1}^n (xi − x̄)(yi − ȳ), and extracting the least squares equations, we get

    β̂1 = (1/SXX) { −nx̄ȳ + Σ_{i=1}^n xi yi } = SXY / SXX,

    β̂0 = (1/SXX) { ȳ Σ_{i=1}^n x_i² − x̄ Σ_{i=1}^n xi yi }
       = (1/SXX) { ȳ SXX + nȳx̄² − x̄ Σ_{i=1}^n xi yi }
       = (1/SXX) { ȳ SXX − x̄ ( Σ_{i=1}^n xi yi − nx̄ȳ ) } = ȳ − x̄ β̂1.
SXX i=1

Now, β̂1 ∼ N(β1, σ²/SXX), so that, to test H0: β1 = 0, use the test statistic

    √SXX β̂1 / √(RSS/(n − 2)) ∼ t_{n−2},   or equivalently   β̂1² SXX / MSE ∼ F_{1,n−2},

if H0 is true. The ANOVA table shows that

    Σ_{i=1}^n y_i² = nȳ² + Σ_{i=1}^n (yi − ȳ)² = nȳ² + RSS + SSreg,  so

    SSreg = Σ_{i=1}^n (yi − ȳ)² − RSS.
i=1

However,

    RSS = Σ_{i=1}^n (yi − β̂0 − β̂1 xi)² = Σ_{i=1}^n { yi − ȳ − β̂1(xi − x̄) }²
        = Σ_{i=1}^n (yi − ȳ)² + β̂1² Σ_{i=1}^n (xi − x̄)² − 2β̂1 Σ_{i=1}^n (xi − x̄)(yi − ȳ)
        = Σ_{i=1}^n (yi − ȳ)² + β̂1² Σ_{i=1}^n (xi − x̄)² − 2β̂1 · β̂1 Σ_{i=1}^n (xi − x̄)²
        = Σ_{i=1}^n (yi − ȳ)² − β̂1² Σ_{i=1}^n (xi − x̄)².

Therefore, SSreg = β̂1² Σ_{i=1}^n (xi − x̄)², so that

    t² = β̂1² SXX / { RSS/(n − 2) } = F-ratio of the ANOVA.

In Example 1, the F-ratio tests H0: β1 = β2 = β3 = 0. What if we want to test only β1 = β3 = 0? Then we have H0: Aβ = 0, where

    A = [ 0 1 0 0 ; 0 0 0 1 ]  (of order 2 × 4)

is of rank 2. Then apply the theorem: RSS_{H0} = (Y − Xβ̂_{H0})′(Y − Xβ̂_{H0}), where β̂_{H0} = β̂ + (X′X)⁻A′[A(X′X)⁻A′]⁻¹(c − Aβ̂), and the test statistic is

    F = { (RSS_{H0} − RSS)/q } / { RSS/(n − r) } ∼ F_{q,n−r}  under H0.
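A convenient way to carry out such a subset test in R is to compare the restricted and full models directly, since anova() on nested fits computes exactly this F statistic (a sketch, reusing the Example 1 vectors from the earlier sketch).

    fit_full    <- lm(y ~ x1 + x2 + x3)   # full Example 1 model
    fit_reduced <- lm(y ~ x2)             # model under H0: beta1 = beta3 = 0
    anova(fit_reduced, fit_full)          # F on (q, n - r) = (2, 5) d.f.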

Multiple Correlation

As seen earlier, the proportion of variation explained by the linear regression


of Y on the regressors X1 , . . . , Xp−1 is given by

2 SSreg RSS Y ′ (I − P )Y
R = =1− =1− ′ .
SST (corrected) SST (corrected) Y (I − n1 11′ )Y

Consider simple linear regression: Then p = 2 and yi = β0 + β1 xi + ǫi .


    β̂1 = SXY/SXX = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²,

    RSS = Σ_{i=1}^n (yi − ȳ)² − β̂1² Σ_{i=1}^n (xi − x̄)²,

so that

    SSreg = β̂1² Σ_{i=1}^n (xi − x̄)² = { Σ_{i=1}^n (xi − x̄)(yi − ȳ) }² / Σ_{i=1}^n (xi − x̄)².

Therefore,

    R² = SSreg / Σ_{i=1}^n (yi − ȳ)²
       = { Σ_{i=1}^n (xi − x̄)(yi − ȳ) }² / [ { Σ_{i=1}^n (xi − x̄)² } { Σ_{i=1}^n (yi − ȳ)² } ]
       = r²_{XY},

where

    r_{XY} = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / { √(Σ_{i=1}^n (xi − x̄)²) √(Σ_{i=1}^n (yi − ȳ)²) }
           = sample correlation coefficient between X and Y.

This connection between R2 and r2 is intuitively meaningful since a good


linear fit is related to a good linear association between X and Y . What
happens when there are multiple regressors, X1 , X2 , . . . , Xp−1 ?
We define the multiple correlation coefficient between Y and X1 , . . . , Xp−1
as the maximum correlation coefficient between Y and any linear function of
X1 , . . . , Xp−1 = maxa Corr(Y, a0 + a1 X1 + · · · + ap−1 Xp−1 ) = R∗ (say).

If

    Cov( (Y, X′)′ ) = [ σ_YY, σ_XY′ ; σ_XY, Σ_X ],

then

    Corr²(Y, a′X) = Cov²(Y, a′X) / { Var(Y) Var(a′X) } = { a′σ_XY }² / { σ_YY a′Σ_X a }.

Further, taking u′ = a′Σ_X^{1/2} and v = Σ_X^{−1/2} σ_XY,

    a′σ_XY / (σ_YY a′Σ_X a)^{1/2} = a′Σ_X^{1/2} Σ_X^{−1/2} σ_XY / (σ_YY a′Σ_X a)^{1/2} = u′v / (σ_YY a′Σ_X a)^{1/2}
    ≤ (u′u)^{1/2}(v′v)^{1/2} / (σ_YY a′Σ_X a)^{1/2} = (a′Σ_X a)^{1/2} (σ_XY′ Σ_X^{−1} σ_XY)^{1/2} / (σ_YY a′Σ_X a)^{1/2}
    = ( σ_XY′ Σ_X^{−1} σ_XY / σ_YY )^{1/2},

with equality if we take u ∝ v, i.e., a = Σ_X^{−1} σ_XY (up to a scalar multiple). Since R∗ = √( σ_XY′ Σ_X^{−1} σ_XY / σ_YY ), 0 ≤ R∗ ≤ 1, unlike the ordinary correlation coefficient. Now let us see why

(R∗ )2 (square of multiple correlation coefficient) is the same as the coefficient


of determination, R2 (proportion of variability explained by the regressors).
Suppose

    (Y, X′)′ ∼ N( (µ_Y, µ_X′)′, [ σ_YY, σ_XY′ ; σ_XY, Σ_X ] ).

Then,

    Y|X ∼ N( µ_Y + σ_XY′ Σ_X^{−1} (X − µ_X),  σ_YY − σ_XY′ Σ_X^{−1} σ_XY ).

Thus, E(Y|X) = µ_Y − σ_XY′ Σ_X^{−1} µ_X + σ_XY′ Σ_X^{−1} X and Var(Y|X) = σ_YY − σ_XY′ Σ_X^{−1} σ_XY. Therefore,

    Corr(Y, E(Y|X)) = Cov(Y, σ_XY′ Σ_X^{−1} X) / √( σ_YY · σ_XY′ Σ_X^{−1} Σ_X Σ_X^{−1} σ_XY )
                    = σ_XY′ Σ_X^{−1} σ_XY / { √σ_YY √(σ_XY′ Σ_X^{−1} σ_XY) } = R∗,

i.e., R∗ is the correlation coefficient between Y and the conditional expectation of Y given X (or the regression of Y on X, when the conditional expectation is linear). Further, Var(Y) − E(Var(Y|X)) = σ_YY − (σ_YY − σ_XY′ Σ_X^{−1} σ_XY) = σ_XY′ Σ_X^{−1} σ_XY, so that the proportion of variation in Y explained by the regression on X is equal to

    R² = { Var(Y) − E(Var(Y|X)) } / Var(Y) = σ_XY′ Σ_X^{−1} σ_XY / σ_YY = (R∗)².
Partial Correlation Coefficients

Example. In a study, X1 = weekly amount of coffee/tea sold by a refresh-


ment stand at a summer resort, and X2 = weekly number of visitors to the
resort. If X2 is large, so should X1 be, right? Actually no! With a certain
resort, r12 = −0.3. Why? Consider X3 = average weekly temperature at the
resort. Both X1 and X2 are related to X3 . If temperature is high, there will
be more visitors, but they will prefer cold drinks to coffee/tea. If tempera-
ture is low, there will be fewer visitors, but they will prefer coffee/tea. Say,
r13 = −0.7, r23 = .8. It is then more meaningful to investigate the relation-
ship between X1 and X2 conditional on X3 (i.e., when X3 is kept fixed) to
eliminate the effect of X3 .
Partial correlation coefficient between X1 and X2 when X3 is fixed is
r12 − r13 r23
r12.3 = Corr(X1 |X3 , X2 |X3 ) = p 2 2
.
(1 − r13 )(1 − r23 )

Suppose X ∼ Nm(µ, Σ) and partition X, µ and Σ as

    X = (X1′, X2′)′,   µ = (µ1′, µ2′)′,   Σ = [ Σ11, Σ12 ; Σ12′, Σ22 ],

where X1 is k-dimensional. Then X1|X2 ∼ Nk( µ1 + Σ12 Σ22^{−1} (X2 − µ2), Σ11.2 ), where Σ11.2 = Σ11 − Σ12 Σ22^{−1} Σ12′ = ((σ_{ij.k+1,...,m})). Note that σ_{ij.k+1,...,m} is the partial covariance between Xi and Xj conditional on X2 = (X_{k+1}, . . . , Xm)′. Therefore the partial correlation coefficient between Xi and Xj given X2 is

    ρ_{ij.k+1,...,m} = σ_{ij.k+1,...,m} / ( √σ_{ii.k+1,...,m} √σ_{jj.k+1,...,m} ).

Recall the notation: ρ for the population and r for a sample. From the expression for Σ11.2 note that σ_{ij.l} = σij − σil σjl / σll. Thus,

    ρ_{ij.l} = σ_{ij.l} / ( √σ_{ii.l} √σ_{jj.l} )
             = ( σij − σil σjl/σll ) / √{ (σii − σil²/σll)(σjj − σjl²/σll) }
             = { σij/√(σii σjj) − (σil/√(σii σll))(σjl/√(σjj σll)) } / √{ (1 − σil²/(σii σll))(1 − σjl²/(σjj σll)) }
             = ( ρij − ρil ρjl ) / √{ (1 − ρil²)(1 − ρjl²) }.
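A small R sketch of the partial correlation formula, using the illustrative coffee/tea correlations quoted above (the function name is invented for illustration).

    partial_cor <- function(r12, r13, r23) {
      (r12 - r13 * r23) / sqrt((1 - r13^2) * (1 - r23^2))
    }
    partial_cor(r12 = -0.3, r13 = -0.7, r23 = 0.8)   # r_{12.3}, about 0.61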
Simultaneous confidence sets

When we have a scalar parameter, such as the mean µ of X, we can construct


a confidence interval for it using a sample of observations:
X̄ ± √sn tn−1 (1−α/2). What about the vector β of regression coefficients? We
know that if Y = Xβ + ǫ, where ǫ ∼ Nn (0, σ 2 In ), then (β̂ − β)′ X ′ X(β̂ − β) ∼
σ 2 χ2r independent of RSS = Y ′ (I − P )Y ∼ σ 2 χ2n−r , so that

(β̂ − β)′ X ′ X(β̂ − β)/r


∼ Fr,n−r ,
Y ′ (I − P )Y /(n − r)

and hence
 
r
P (β̂ − β)′ X ′ X(β̂ − β) ≤ Y ′ (I − P )Y Fr,n−r (1 − α) = 1 − α.
n−r

Therefore,
 
r
C = β : (β − β̂)′ X ′ X(β − β̂) ≤ RSS Fr,n−r (1 − α)
n−r

is a 100(1 − α)% confidence set for β. This is an ellipsoid, and if p is not


small (1 or 2), a set which is difficult to appreciate.
Suppose we are only interested in a′β for some fixed a. Then a′β̂ ± t_{n−r}(1 − α/2) √(RSS/(n − r)) √(a′(X′X)⁻a) is a 100(1 − α)% confidence interval for a′β. Let us see if we can extend this when we are interested in deriving a simultaneous confidence set of coefficient 1 − α for a1′β, a2′β, . . . , ak′β.
deriving a simutaneous confidence set of coefficient 1−α for a′1 β, a′2 β, . . . , a′k β.

Scheffé's method.
Let A′_{p×d} = (a1, a2, . . . , ad), where a1, a2, . . . , ad are linearly independent and a_{d+1}, . . . , ak are linearly dependent on them. Then d ≤ min{k, r}. Let φ = Aβ and φ̂ = Aβ̂. Then

    F(β) = { (φ̂ − φ)′ [A(X′X)⁻A′]⁻¹ (φ̂ − φ)/d } / { RSS/(n − r) } ∼ F_{d,n−r}.

Therefore,

    1 − α = P[ F(β) ≤ F_{d,n−r}(1 − α) ]
          = P[ (φ̂ − φ)′ [A(X′X)⁻A′]⁻¹ (φ̂ − φ) ≤ d (RSS/(n − r)) F_{d,n−r}(1 − α) ].
This gives an ellipsoid as before, but consider the following result.
Result. If L is positive definite,

    b′L⁻¹b = sup_{h≠0} (h′b)² / (h′Lh).

Proof. Note that

    (h′b)² / (h′Lh) = (h′L^{1/2} L^{−1/2} b)² / (h′Lh) ≤ (h′Lh)(b′L⁻¹b) / (h′Lh) = b′L⁻¹b.

Therefore,

    1 − α = P[ sup_{h≠0} { h′(φ − φ̂) }² / { h′ [A(X′X)⁻A′] h } ≤ d (RSS/(n − r)) F_{d,n−r}(1 − α) ]
          = P[ { h′(φ − φ̂) }² / { h′ [A(X′X)⁻A′] h } ≤ d (RSS/(n − r)) F_{d,n−r}(1 − α)  for all h ≠ 0 ]
          = P[ |h′(φ − φ̂)| ≤ {d F_{d,n−r}(1 − α)}^{1/2} √(RSS/(n − r)) √( h′ [A(X′X)⁻A′] h )  for all h ≠ 0 ]
          = P[ |h′(φ − φ̂)| ≤ {d F_{d,n−r}(1 − α)}^{1/2} s.e.(h′φ̂)  for all h ≠ 0 ],

where s.e.(h′φ̂) = √(RSS/(n − r)) √( h′ [A(X′X)⁻A′] h ). Therefore,

    a_i′β̂ ± {d F_{d,n−r}(1 − α)}^{1/2} √(RSS/(n − r)) √( a_i′ (X′X)⁻ a_i ),   i = 1, 2, . . . , k,

is a simultaneous 100(1 − α)% confidence set for a1′β, a2′β, . . . , ak′β, by noting that

    P( a_i′β ∈ a_i′β̂ ± {d F_{d,n−r}(1 − α)}^{1/2} s.e.(a_i′β̂), i = 1, 2, . . . , k )
    ≥ P( |h′(φ − φ̂)| ≤ {d F_{d,n−r}(1 − α)}^{1/2} s.e.(h′φ̂)  for all h ≠ 0 ) = 1 − α.
Many other methods are also available.
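A minimal R sketch of Scheffé simultaneous intervals in the full-rank case (the function and argument names are invented for illustration; Amat holds the a_i′ as its rows).

    scheffe_ci <- function(X, y, Amat, level = 0.95) {
      n <- nrow(X); r <- qr(X)$rank; d <- qr(Amat)$rank
      XtX_inv  <- solve(crossprod(X))
      beta_hat <- XtX_inv %*% crossprod(X, y)
      RSS      <- sum((y - X %*% beta_hat)^2)
      crit     <- sqrt(d * qf(level, d, n - r))        # {d F_{d,n-r}(1 - alpha)}^{1/2}
      est      <- Amat %*% beta_hat
      se       <- sqrt(diag(Amat %*% XtX_inv %*% t(Amat)) * RSS / (n - r))
      cbind(estimate = drop(est), lower = drop(est) - crit * se,
            upper = drop(est) + crit * se)
    }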

Regression diagnostics

Lack of fit. Suppose the true model is Y = f (X) + ǫ, ǫ ∼ Nn (0, σ 2 In ),


whereas we fit Y = Xβ + ǫ. We do get β̂ = (X ′ X)−1 X ′ Y and σ̂ 2 = RSS/(n −
r). σ 2 is supposed to account for only the statistical errors (ǫi ), and not model
misspecification. Therefore, if f (X) 6= Xβ, we have statistical errors, ǫi , as
well as the bias, f (X)−Xβ. Then, σ̂ 2 = RSS/(n−r) will estimate a quantity
which includes σ 2 as well as (bias)2 . If σ 2 is known, then comparing σ̂ 2 with
σ 2 can act as a check for lack of fit. In other words,
RSS/σ 2 ∼ χ2n−r if the model, Y = Xβ + ǫ, ǫ ∼ Nn (0, σ 2 In ) is true. Therefore
to test
H0 : Y = Xβ + ǫ, ǫ ∼ Nn (0, σ 2 In ) versus H1 : Y has some other model, use
RSS/σ 2 as the test statistic. If the observed value is too large compared to
χ2n−r , there is evidence against H0 .
Consider a simulation study where data are generated from yi = β0 + β1 xi +
β2 x2i + ǫi , ǫi ∼ N (0, σ 2 ), with β0 = 5, β1 = β2 = 2 and σ 2 = 22 :
x .5 1 1.5 2 2.5 3 3.5 4 4.5 5
y 8.68 12.85 10.71 18.54 21.67 27.3 37.56 44.64 54.09 63.83

Regress Y on X. i.e., fit yi = β0 + β1 xi + ǫi . Then we get β̂0 = −3.925,


β̂1 = 12.33 and the ANOVA table:
source d.f SS MS F R2
Regression 1 3134.2 3134.2 130.76 94.2%
Error 8 191.7 24.0
Total 9 3325.9
These are very good results, but RSS/σ 2 = 191.7/4 = 47.925 >> χ28 (.99) =
20.08. R2 = 94.2% is high, and F-ratio of 130.76 at (1, 8) d.f. is very high,
indicating that X is a very useful predictor of Y . However this does not
mean that the fitted model is the correct one. Check the residual plot:

[Residual plot of y − ŷ against x.]

Now regress Y on X and X 2 .


source d.f SS MS F R2
Regression 2 3305.7 1652.8 572.28 99.4%
Error 7 20.2 2.9
Total 9 3325.9
RSS/σ² = 20.2/4 = 5.05 << χ²₇(.90) = 12.02.
σ 2 is usually unknown, so this test is difficult, but what this indicates is that
residual plots are useful for checking lack of fit (see plot above). Another
possibility is to check for any pattern between fitted values and residuals.
Yet another reason to explore this is the following.
ǫ̂ = Y − Ŷ = Y − X β̂ = (I − P )Y and Ŷ = X β̂ = P Y are uncorrelated
(since (I − P )P = 0) if Cov(Y ) = σ 2 In . If one sees significant correlation
and some trend, then the model is suspect. What if V ar(yi ) = σi2 , not a
constant? This is called heteroscedasticity (as against homoscedastcity), a
problem discussed in Sanford Weisberg: Applied Linear Regression in the
context of regression diagnostics.
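The simulated example above can be reproduced in R; the residual plots make the lack of fit of the straight line visible (a sketch; numbers should match the tables above up to rounding).

    x <- seq(0.5, 5, by = 0.5)
    y <- c(8.68, 12.85, 10.71, 18.54, 21.67, 27.3, 37.56, 44.64, 54.09, 63.83)
    fit1 <- lm(y ~ x)              # straight-line fit: high R^2, but ...
    fit2 <- lm(y ~ x + I(x^2))     # ... the quadratic removes the pattern
    plot(fitted(fit1), resid(fit1))   # curved pattern: lack of fit
    plot(fitted(fit2), resid(fit2))   # roughly patternless
    sum(resid(fit1)^2) / 4            # RSS/sigma^2 with sigma^2 = 4; compare with chi-square(8)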

With the model: Y = Xβ + ǫ, with E(ǫ) = 0 and Cov(ǫ) = σ 2 In , normality
of ǫ is essential for hypothesis testing and confidence statements. How does
one check this?
Normal probability plot or Q-Q plot.
This is a graphical technique to check for normality. Suppose we have a
random sample T1 , T2 , . . . , Tn from some population, and we want to check
whether the population has the normal distribution with some mean µ and
some variance σ 2 . The method described here depends on examining the
order statistics, T(1) , . . . , T(n) . Let us recall a few facts about order statistics
from a continuous distribution. Since
    f_{T1,...,Tn}(t1, . . . , tn) = ∏_{i=1}^n f(ti),   (t1, . . . , tn) ∈ Rⁿ,

    f_{T(1),...,T(n)}(t(1), . . . , t(n)) = n! ∏_{i=1}^n f(t(i)),   t(1) < t(2) < · · · < t(n),

    f_{T(i)}(t(i)) = { n! / ((i − 1)!(n − i)!) } {1 − F(t(i))}^{n−i} {F(t(i))}^{i−1} f(t(i)).

If U(1) < U(2) < · · · < U(n) are the order statistics from U(0, 1), then

    E(U(k)) = ∫₀¹ u f_{U(k)}(u) du = { n! / ((k − 1)!(n − k)!) } ∫₀¹ u^{k+1−1} (1 − u)^{n−k+1−1} du
            = { n! / ((k − 1)!(n − k)!) } { Γ(k + 1)Γ(n − k + 1) / Γ(n + 2) } = k/(n + 1).

An additional result needed is the following. If X is a random variable which


is continuous on an interval I with c.d.f. F strictly increasing on I, then
V = F (X) ∼ U (0, 1). For this, note that 0 ≤ V ≤ 1 and for 0 ≤ v ≤ 1,
P (V ≤ v) = P (F (X) ≤ v) = P (X ≤ F −1 (v)) = F (F −1 (v)) = v.
Now argue as follows. If T1, T2, . . . , Tn are i.i.d. from N(µ, σ²), then

    E[ Φ( (T(i) − µ)/σ ) ] ≈ (i − 0.5)/n,   i = 1, 2, . . . , n.

Therefore, the plot of Φ((T(i) − µ)/σ) versus (i − 0.5)/n is on the line y = x. Equivalently, the plot of (T(i) − µ)/σ versus Φ⁻¹((i − 0.5)/n) is on the line y = x. In other words, the plot of T(i) versus Φ⁻¹((i − 0.5)/n) is linear. To check this, µ and σ² are not needed. Since T(i) is the quantile of order i/n and Φ⁻¹((i − 0.5)/n) is the standard normal quantile of order (i − 0.5)/n, this plot is called the Quantile-Quantile plot. One
looks for nonlinearity in the plot to check for non-normality.
How is this plot to be used in regression? We want to check the normality of
ǫi , but they are not observable. Instead yi are observable, but they have differ-
ent means. We consider the residuals: ǫ̂ = Y − Ŷ = (I − P)Y ∼ Nn(0, σ²(I − P)) if normality holds, i.e., ǫ̂i ∼ N(0, σ²(1 − Pii)) if Y ∼ N(Xβ, σ²In). For a
fixed number of regressors (p − 1), as n increases, Pii → 0 (Weisberg), so the
residuals can be used in the Q-Q plot.
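In R, qqnorm() produces essentially this plot (its plotting positions are close to (i − 0.5)/n), and qqline() adds a reference line; a sketch applied to residuals, e.g. of the straight-line fit from the simulated data above:

    fit <- lm(y ~ x)          # any fitted linear model
    qqnorm(resid(fit))
    qqline(resid(fit))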

Stepwise regression (forward selection)

Consider a situation where there are a large number of predictors. A model


including all of them is not desirable since it will be unwieldy and there
may be difficulties involving multicollinearity and computational complex-
ities. There are many such situations in weather forecasting, economics,
finance, agriculture and medicine.
Consider the approach where one variable is added at a time until a good
model is available, or equivalently, a stopping rule is met. Possible rules are
(i) r many predictors are chosen (r is pre-determined)
(ii) R2 is large enough.
Procedure. (i) Calculate the correlation coefficient between Y and Xi for
all i, say riy . Select as the first variable to enter the regression model the one
most highly correlated with Y .
(ii) Regress Y on the chosen predictor, say Xl , and compute R2 = rly 2
. This
2
is the maximum possible R with one predictor.
(iii) Calculate the partial correlation coefficients given Xl of all the predictors
not yet in the regression model, with the response Y . Choose as the next
predictor to enter the model, the one with the highest (in magnitude) partial
correlation coefficient riy.l : the idea is to add a factor which is most useful
given that Xl is already in.
(iv) Regress Y on Xl as well as the one chosen next, say Xm , and find if Xm
should be added or not. Compute R2 .
(v) Calculate riy.lm and proceed similarly.
Example. Data on breeding success of the common Puffin in different habi-
tats at Great Island, Newfoundland:
y = nesting frequency (burrows/9m2 )
x1 = grass cover (%), x2 = mean soil depth (cm)
x3 = angle of slope (degrees), x4 = distance from cliff edge (m)

X1 X2 X3 X4 Y
45 39.2 38 3 16
65 47.0 36 12 15
40 24.3 14 18 10
.. .. .. .. ..
. . . . .
Correlation matrix:
Y X1 X2 X3
X1 0.158
X2 0.022 0.069
X3 0.836 -0.017 0.066
X4 -0.908∗ -0.205 0.212 -0.815
Choose X4 first, since r4y = -0.908 is the highest in magnitude. Then R2 =
(−0.908)² = 82.4%. F = 168.79 >> F_{1,36}(.99). Now compute

    r_{iy.4} = −0.07 (i = 1),  0.518 (i = 2),  0.398 (i = 3).

Choose X2 next and note R² = 87.2%. Also, X2 is a useful predictor. Compute

    r_{iy.42} = −0.152 (i = 1),  0.233 (i = 3).

The formula for this is

    r_{iy.42} = ( r_{iy.4} − r_{i2.4} r_{y2.4} ) / √{ (1 − r²_{i2.4})(1 − r²_{y2.4}) }.

If we pick X3 now, R2 = 87.9%, not very different from the previous regres-
sion. Also, X3 is not particularly useful in regression.

Basics of Design of Experiments and ANOVA

So far we concentrated on analysis of a given experiment or data. Structure


of the experiment is now explored. Design of experiments is a study of
construction and analysis of experiments where purposeful changes are made
to input variables of a process or system so as to observe and identify the
reasons for changes in the output response. A cause-effect mechanism is of
interest here.
Example. Different fertilizers, or different amounts of a fertilizer, versus the yield of a crop.
Experimental designs are used mostly for comparative experiments:
Comparing treatments in a clinical trial
Comparing factors (fertilizers, crop patterns etc.) in agricultural experiments
Randomization, replication, blocking and confounding of effects are some im-
portant concepts in this context. Randomization means that each subject
has the same chance of being placed in any given experimental group. Then
factors which cannot be controlled need not be considered since their effects
are averaged out.
Replication means having multiple subjects in all experimental groups, en-
suring that ‘within group’ variation can be estimated.
Blocking and confounding of effects will be considered later.
Consider the following example of a completely randomized design.
Example. Monosodium glutamate (MSG), a common ingredient of pre-
served food is known to cause brain damage in various mammals. In a study
of the other effects, weight of ovaries (mg), both for a sample of rats treated
with MSG and for an independent control sample of similar but untreated
rats were obtained:
sample size (ni ) sample mean sample s.d.
MSG 10 29.35 4.55
Control 12 21.86 10.09
Consider the linear model,

    yi = µm + ǫi,   i = 1, 2, . . . , n1;
    yi = µc + ǫi,   i = n1 + 1, n1 + 2, . . . , n1 + n2,

ǫi i.i.d. N(0, σ²). Write it in vector/matrix form, Y = Xβ + ǫ:

    Y = (y1, . . . , y_{n1}, y_{n1+1}, . . . , y_{n1+n2})′,   β = (µm, µc)′,   ǫ ∼ N_{n1+n2}(0, σ²I),

and X of order (n1 + n2) × 2 has its first n1 rows equal to (1, 0) and its last n2 rows equal to (0, 1). Then

    X′X = [ n1, 0 ; 0, n2 ],  so  (X′X)⁻¹ = [ 1/n1, 0 ; 0, 1/n2 ].

Therefore,

    (µ̂m, µ̂c)′ = (ȳ1, ȳ2)′ ∼ N2( (µm, µc)′, σ² [ 1/n1, 0 ; 0, 1/n2 ] ),

independently of

    RSS = Σ_{i=1}^{n1} (yi − ȳ1)² + Σ_{i=n1+1}^{n1+n2} (yi − ȳ2)² ∼ σ²χ²_{n1+n2−2}.

We want to compare µm with µc. H0: µm = µc = 0 is meaningless; H0: µm = µc is of interest, i.e., H0: (1, −1)(µm, µc)′ = 0. Note,

    { ȳ1 − ȳ2 − (µm − µc) } / { √(1/n1 + 1/n2) √(RSS/(n1 + n2 − 2)) } ∼ t_{n1+n2−2}.

If H0 is true, then

    (ȳ1 − ȳ2) / { √(1/n1 + 1/n2) √(RSS/(n1 + n2 − 2)) } ∼ t_{n1+n2−2},

or equivalently,

    { (ȳ1 − ȳ2)² / (1/n1 + 1/n2) } / { RSS/(n1 + n2 − 2) } ∼ F_{1,n1+n2−2}.
The design in this experiment has complete randomization. The observations
inside the groups are independent, and also the two samples are independent.

For this reason, the design is called a completely randomized design. We can
generalize this procedure if we want to compare k means, as will be done
later.
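A sketch of the pooled two-sample t statistic for the MSG example, computed in R from the summary statistics quoted above (assuming the reported sample s.d.'s use the n − 1 divisor).

    n1 <- 10; n2 <- 12
    m1 <- 29.35; m2 <- 21.86
    s1 <- 4.55;  s2 <- 10.09
    RSS   <- (n1 - 1) * s1^2 + (n2 - 1) * s2^2      # pooled residual sum of squares
    tstat <- (m1 - m2) / sqrt((1/n1 + 1/n2) * RSS / (n1 + n2 - 2))
    c(t = tstat, p.value = 2 * pt(-abs(tstat), df = n1 + n2 - 2))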
Paired differences - example of a block design
Sometimes independent samples, such as the ones in a completely random-
ized design, from two (or k > 2) populations is not an efficient way for
comparisons. Consider the following example.
Example. It is of interest to compare an enriched formula with a stan-
dard formula for baby food. Weights of infants vary significantly and this
influences weight gain more than the difference in food quality. Therefore,
independent samples (with infants having very different weights) for the two
formulas will not be very efficient in detecting the difference. Instead, pair
babies of similar weight and feed one of them the standard formula, and the
other the enriched formula. Then observe the gain in weight:
pair 1 2 3 ... n
enriched e1 e2 e3 ... en
standard s1 s2 s3 ... sn
However, the samples may not be treated as independent but correlated.
The n pairs of observations, (e1 , s1 ), . . . , (en , sn ) may still be treated to be
uncorrelated (or even independent). These n pairs are like n independent
blocks, inside each of which we can compare enriched with standard. This
is the idea of blocking and block designs. Blocks are supposed to be homo-
geneous inside, so comparison of treatments within blocks becomes efficient.
We assume that

    (ei, si)′ ∼ N2( (µ1, µ2)′, [ σ1², ρσ1σ2 ; ρσ1σ2, σ2² ] ).

In the above example, we want to test H0: µD ≡ µ1 − µ2 = 0, so consider yi = ei − si. Then yi = µD + ǫi, E(ǫi) = 0, Var(ǫi) = σD² = σ1² + σ2² − 2ρσ1σ2 = Var(yi). If normality is assumed, then we have y1, . . . , yn i.i.d. N(µD, σD²), and we want to test H0: µD = 0. Consider the test statistic

    √n ȳ / √( Σ_{i=1}^n (yi − ȳ)²/(n − 1) ) ∼ t_{n−1},

if H0 is true, or equivalently,

    nȳ² / { Σ_{i=1}^n (yi − ȳ)²/(n − 1) } ∼ F_{1,n−1}.

Note that

    nȳ² / { Σ_{i=1}^n (yi − ȳ)²/(n − 1) }
    = n(ē − s̄)² / { (1/(n − 1)) Σ_{i=1}^n [ (ei − ē) − (si − s̄) ]² }
    = n(ē − s̄)² / { (1/(n − 1)) [ Σ_{i=1}^n (ei − ē)² + Σ_{i=1}^n (si − s̄)² − 2 Σ_{i=1}^n (ei − ē)(si − s̄) ] }
    = { (ē − s̄)² / (1/n + 1/n) } / { (1/(2(n − 1))) [ Σ_{i=1}^n (ei − ē)² + Σ_{i=1}^n (si − s̄)² − 2 Σ_{i=1}^n (ei − ē)(si − s̄) ] }.

Compare this test statistic with the one used for independent samples. Cov(e, s) is expected to be positive (due to blocking), so the variance estimate in the denominator above is typically less than (1/(2(n − 1))) [ Σ_{i=1}^n (ei − ē)² + Σ_{i=1}^n (si − s̄)² ], which is what would appear there for independent samples. This is the positive effect due to blocking.
Confounding of effects.
Example. Consider two groups of similar students and two teachers. It is
of interest to compare two different training methods. Consider the design
where teacher A teaches one group using method I, whereas teacher B teaches
the other group using method II. Later the results are analyzed. The problem
with this design is that, if one group performs better it may be due to teacher
effect or due to method effect, but it is not possible to separate the effects.
We say then that the two effects are confounded. Sometimes we may not be
interested in certain effects, in which case we may actually look for designs
that will confound their effects. This will reduce the number of parameters
to be estimated.

Experiments with a single factor – One-way ANOVA

We want to compare k > 2 treatments. Treatment i produces a population of


y values with mean µi , i = 1, 2, . . . , k. Or, if treatment i is applied, then the
response Y ∼ N (µi , σ 2 ), i = 1, 2, . . . , k. Are these k populations different?
Design. ni observations are made independently from population i, so the k
samples are independent. Equivalently, we may look at this experiment as a
design where N subjects are available to study the k treatments. n1 of these
are randomly selected and assigned to a group which will get treatment 1,
n2 of the remaining for treatment 2, and so on. Such a design is called a
completely randomized design (as mentioned previously). Model for such a

design is as follows.
Let yij = response of the jth individual in the ith group (ith treatment),
j = 1, 2, . . . , ni ; i = 1, 2, . . . , k. Then,
yij = µi + ǫij , j = 1, 2, . . . , ni ; i = 1, 2, . . . , k. E(ǫij ) = 0, V ar(ǫij ) = σ 2 , un-
correlated errors; ǫij ∼ N (0, σ 2 ) i.i.d. for testing and confidence statements.
In the usual linear model formulation:
   
y11 1 0 ... 0
 ..   .. .. . 
 .   . . . . . .. 
   
 y1n1   1 0 . . . 0 
   
 y21   0 1 . . . 0  µ1

 .   . . . 
 .   . .
 .   . . . . . ..   µ2 
 
=   ..  + ǫ.
 y2n2   0 1 . . . 0   . 

 .   . .
 ..   .. .. . . . ...  µk

   
 y   0 0 ... 1 
 k1   
 ..   .. .. .
. 
 .   . . ... . 
yknk 0 0 ... 1
 1 
n1
0 ··· 0  Pn1 
 0 1 ··· 0  j=1 y1j
Since (X ′ X)−1 =  ..
n2 ..
 and X ′ Y =  , we get
   
.. .
 . . ··· 0  Pn1
j=1 ykj
0 0 · · · n1k
   
µ̂1 ȳ1
 ..   .. 
 .  =  .  and
µ̂k ȳk
Pk Pni
RSS = i=1 j=1 (yij − ȳi )2 =
PP 2
(yij − µ̂i )2 .
PP
ǫ̂ij =
Questions.
(i) Are the group means µi equal? i.e., test H0 : µ1 = µ2 = · · · = µk .
(ii) If not, how are they different?

Example. It is believed that the tensile (breaking) strength of synthetic
fibre is affected by the %age of cotton in fibre:
% cotton tensile strength (lb/inch2 ) sample mean
15 7, 7, 15, 11, 9 ȳ1 = 9.8
20 12, 17, 12, 18, 18 ȳ2 = 15.4
25 14, 18, 18, 19, 19 ȳ3 = 17.6
30 19, 25, 22, 19, 23 ȳ4 = 21.6
35 7, 10, 11, 15, 11 ȳ5 = 10.8
Are there substantial differences in the mean breaking strength?
(i) Plot the sample means:

[Plot of the five sample means against % cotton.]

But sample means do not tell the whole story, especially for small samples.
One must look at variation within samples and between samples. In the
plot above, the conclusions would be different according to whether the error
bands are green or red.
[Box-plots of tensile strength for the five % cotton groups s1, . . . , s5.]

It is easier to do this investigation of variations using box-plots, as shown


above. Variation within samples is not too large or different, but between

2
sample variation is large. Note that, if within sample variation is large com-
pared to between sample variation (like the red error bands in the plot),
then the different samples can be considered to be from a single population.
However, if within sample variation is small compared to between sample
variation (like the green error bands in the plot, i.e., |ȳi − ȳj | are large com-
pared to the error) then there is reason to believe that the groups differ.
To formalize this, we return to linear models:
yij = µi + ǫij , j = 1, 2, . . . , ni ; i = 1, 2, . . . , k, ǫij ∼ N (0, σ 2 ) i.i.d. Are the
group means different?
   
    (µ̂1, . . . , µ̂k)′ = (ȳ1, . . . , ȳk)′,  so that  RSS = Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳi)².

To test H0: µ1 = µ2 = · · · = µk, consider

    A_{(k−1)×k} = [ 1 0 0 · · · 0 −1 ; 0 1 0 · · · 0 −1 ; . . . ; 0 0 0 · · · 1 −1 ].

Then we test H0: Aµ = 0, where A has rank k − 1. To test H0, we obtain µ̂_{H0} and RSS_{H0} and consider

    F = { (RSS_{H0} − RSS)/(k − 1) } / { RSS/(Σ_{i=1}^k ni − k) },  which ∼ F_{k−1, Σni−k} under H0.

To find µ̂_{H0} and RSS_{H0}, note that, under H0: µ1 = µ2 = · · · = µk, these means are equal, and so it is enough to find

    min_{µ1=µ2=···=µk} Σ_{i=1}^k Σ_{j=1}^{ni} (yij − µi)² = min_µ Σ_{i=1}^k Σ_{j=1}^{ni} (yij − µ)².

Therefore,

    µ̂_{H0} = ( 1/Σ_{i=1}^k ni ) Σ_{i=1}^k Σ_{j=1}^{ni} yij ≡ ȳ.. ,  and hence  RSS_{H0} = Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳ..)².

Introduce further notation: ȳi. = ȳi = (1/ni) Σ_{j=1}^{ni} yij, i = 1, 2, . . . , k. Note, further, that

    RSS_{H0} = Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳ..)² = Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳi. + ȳi. − ȳ..)²
             = Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳi.)² + Σ_{i=1}^k ni (ȳi. − ȳ..)² + 2 Σ_{i=1}^k (ȳi. − ȳ..) { Σ_{j=1}^{ni} (yij − ȳi.) }
             = RSS + Σ_{i=1}^k ni (ȳi. − ȳ..)²,

since Σ_{j=1}^{ni} (yij − ȳi.) = 0 for all i. Therefore,

    RSS_{H0} − RSS = Σ_{i=1}^k ni (ȳi. − ȳ..)²,

and therefore,

    F = { Σ_{i=1}^k ni (ȳi. − ȳ..)²/(k − 1) } / { Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳi.)²/(Σ_{i=1}^k ni − k) } ∼ F_{k−1, Σni−k} under H0.

It is instructive to consider these sums of squares.
RSS = Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳi.)²
    = the sum total of all the sums of squares of deviations from the sample means
    = within groups or within treatments sum of squares, SSW.
RSS_{H0} = Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳ..)²
    = total sum of squares of deviations assuming no treatment effect
    = total variability (corrected) in the k samples, SST.
Therefore, Σ_{i=1}^k ni (ȳi. − ȳ..)² = SST − SSW = between groups or between treatments sum of squares = SSB. Thus, SST = SSW + SSB is the decomposition of the sum of squares, along with Σ_{i=1}^k ni − 1 = (Σ_{i=1}^k ni − k) + (k − 1), the corresponding decomposition of d.f.

ANOVA for One-way classification

source                d.f.      SS                               MS                      F
Treatments (groups)   k − 1     SSB = Σ_{i=1}^k ni(ȳi. − ȳ..)²   MSB = SSB/(k − 1)       MSB/MSE ∼ F_{k−1, Σni−k} (under H0)
Error                 Σni − k   SSW = ΣΣ(yij − ȳi.)²             MSE = SSW/(Σni − k)
Total (corrected)     Σni − 1   SST = ΣΣ(yij − ȳ..)²
Mean                  1         (Σ_{i=1}^k ni) ȳ..²
Total                 Σni       ΣΣ y_ij²
Example. Tensile strength data. k = 5, ni = 5. ANOVA is as follows.

source                     d.f.   SS       MS       F
Factor levels (% cotton)   4      475.76   118.94   14.76 >> 4.43 = F_{4,20}(.99)
Error                      20     161.20   8.06
Total (corrected)          24     636.96

R² = 475.76/636.96 ≈ 75%.
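This ANOVA can be reproduced in R (a sketch; the output should agree with the table above up to rounding).

    strength <- c( 7,  7, 15, 11,  9,
                  12, 17, 12, 18, 18,
                  14, 18, 18, 19, 19,
                  19, 25, 22, 19, 23,
                   7, 10, 11, 15, 11)
    cotton <- factor(rep(c(15, 20, 25, 30, 35), each = 5))
    fit <- aov(strength ~ cotton)
    summary(fit)                 # SSB = 475.76, SSW = 161.20, F = 14.76 on (4, 20) d.f.
    model.tables(fit, "means")   # the five group means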
Now that the ANOVA H0 has been rejected, we should look at the group
means (estimates) closely. Suppose we want to compare µr and µs either
with H0 : µr = µs or using a confidence interval for µr − µs .
  
    µ̂r − µ̂s = ȳr. − ȳs. ∼ N( µr − µs, σ²(1/nr + 1/ns) ),

independently of

    Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳi.)² ∼ σ²χ²_{Σni−k}.

Therefore,

    { (ȳr. − ȳs.) − (µr − µs) } / { √(1/nr + 1/ns) √( Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳi.)² / (Σ_{i=1}^k ni − k) ) } ∼ t_{Σni−k}.

A 100(1 − α)% confidence interval for µr − µs is

    ȳr. − ȳs. ± t_{Σni−k}(1 − α/2) √{ Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳi.)² / (Σ_{i=1}^k ni − k) } √(1/nr + 1/ns).

Further, the test statistic for testing H0: µr = µs is

    T = (ȳr. − ȳs.) / { √(1/nr + 1/ns) √( Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳi.)² / (Σ_{i=1}^k ni − k) ) } ∼ t_{Σni−k},

if H0 is true.

Multiple comparison of group means
yij = µi + ǫij , j = 1, 2, . . . , ni ; i = 1, 2, . . . , k, ǫij ∼ N (0, σ 2 ) i.i.d.
The classic ANOVA test is the test of H0 : µ1 = µ2 = · · · = µk , which is
uninteresting and the hypothesis is usually not true. What an experimenter
usually wants to find out is which treatments are better, so rejection of H0
is usually not the end of the analysis. Once it is rejected, further work is
needed to find out why it was rejected.
Definition. A linear parametric function Σ_{i=1}^k ai µi = a′µ with known constants a1, . . . , ak satisfying Σ_{i=1}^k ai = a′1 = 0 is called a contrast (linear contrast).
Example. If a = (1, −1, 0, . . . , 0)′, then a′µ = µ1 − µ2.
Result. µ1 = µ2 = · · · = µk if and only if a′µ = 0 for all a ∈ A = { a = (a1, . . . , ak)′ : Σ_{i=1}^k ai = 0 }.
Remark. H0: µ1 = µ2 = · · · = µk is true iff Ha: a′µ = 0 holds for all a ∈ A, i.e., all linear contrasts are zero.
Proof. µ1 = µ2 = · · · = µk iff µ = α1 for some α, or µ ∈ MC(1). Note, A = MC(1)⊥.

Thus, if H0 fails, at least one of the Ha must fail for some a ∈ A, i.e., a′µ ≠ 0. The experimenter may be interested in this contrast and its inference. Consider inference for any linear parametric function a′µ = Σ_{i=1}^k ai µi. We have the model yij ∼ N(µi, σ²), j = 1, 2, . . . , ni; i = 1, 2, . . . , k, independent. Then ȳi. ∼ N(µi, σ²/ni), i = 1, 2, . . . , k, independent, and

    E( Σ_{i=1}^k ai ȳi. ) = Σ_{i=1}^k ai µi = a′µ,   Var( Σ_{i=1}^k ai ȳi. ) = σ² Σ_{i=1}^k a_i²/ni,

so that

    { Σ_{i=1}^k ai ȳi. − Σ_{i=1}^k ai µi } / √( σ² Σ_{i=1}^k a_i²/ni ) ∼ N(0, 1).

Let S_i² = Σ_{j=1}^{ni} (yij − ȳi.)², i = 1, 2, . . . , k. Then S_i² ∼ σ²χ²_{ni−1} independently of ȳi., i = 1, 2, . . . , k. Also, (S1², . . . , Sk²) is independent of ȳ = (ȳ1., . . . , ȳk.). Let Sp² = Σ_{i=1}^k S_i². Then Sp² ∼ σ²χ²_{Σni−k} independently of ȳ. Note that this is just a repeat of our old result that RSS = Σ_{i=1}^k Σ_{j=1}^{ni} (yij − ȳi.)² = Sp² is
independent of β̂ = µ̂. Thus, as discussed previously,

    (a′ȳ − a′µ) / √{ Sp² (Σ_{i=1}^k a_i²/ni) / (Σ_{i=1}^k ni − k) } ∼ t_{Σni−k},

so that

    a′ȳ ± t_{Σni−k}(1 − α/2) √{ Sp² (Σ_{i=1}^k a_i²/ni) / (Σ_{i=1}^k ni − k) }

is a 100(1 − α)% confidence interval for a′µ. Also, reject H_{a,0}: a′µ = 0 in favour of H_{a,1}: a′µ ≠ 0 if

    |a′ȳ| / √{ Sp² (Σ_{i=1}^k a_i²/ni) / (Σ_{i=1}^k ni − k) } > t_{Σni−k}(1 − α/2).

What if we want to investigate a set of contrasts simultaneously? From Boole's Inequality, P(∪_i Ai) ≤ Σ_i P(Ai), so P(∪_i A_iᶜ) ≤ Σ_i P(A_iᶜ). Since ∪_i A_iᶜ = (∩_i Ai)ᶜ,

    1 − P(∩_{i=1}^n Ai) ≤ Σ_{i=1}^n (1 − P(Ai)) = n − Σ_{i=1}^n P(Ai),  or

    P(∩_{i=1}^n Ai) ≥ Σ_{i=1}^n P(Ai) − (n − 1).

This is known as the Bonferroni Inequality. Apply this to the above problem. If we want a simultaneous confidence set for a(1)′µ, . . . , a(d)′µ, consider

    C = { a(j)′ȳ ± t_{Σni−k}(1 − α/(2d)) √{ Sp² (Σ_{i=1}^k (a(j)_i)²/ni) / (Σ_{i=1}^k ni − k) },  j = 1, 2, . . . , d }.

Then

    P(C) = P(∩_{l=1}^d Al) ≥ Σ_{l=1}^d P(Al) − (d − 1) = d(1 − α/d) − (d − 1) = d − α − d + 1 = 1 − α.

This procedure is useful when d is not too large.
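A minimal R sketch of such Bonferroni-adjusted intervals (the function and argument names are invented; ybar, n_i and Sp2 stand for the group means, group sizes and pooled sum of squares defined above, and the rows of A are the contrast vectors a(j)′).

    bonferroni_ci <- function(ybar, n_i, Sp2, A, alpha = 0.05) {
      d     <- nrow(A)
      df    <- sum(n_i) - length(n_i)
      tcrit <- qt(1 - alpha / (2 * d), df)
      est   <- A %*% ybar
      se    <- sqrt(Sp2 / df * rowSums(sweep(A^2, 2, n_i, "/")))   # sqrt{Sp2 * sum(a_i^2/n_i) / df}
      cbind(estimate = drop(est), lower = drop(est) - tcrit * se,
            upper = drop(est) + tcrit * se)
    }
    ## e.g. the rows of A could be all pairwise differences (each row summing to zero)
    ## for the tensile-strength example.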

Reparametrization of the one-way model.
Suppose the ni are all equal, say equal to J. Also, let the number of groups be k = I. Then Σ_{i=1}^k ni = IJ, ȳi. = Σ_{j=1}^J yij / J for i = 1, . . . , I, and ȳ.. = Σ_i Σ_j yij / (IJ). Further,
SSW = Σ_i Σ_j (yij − ȳi.)² has d.f. IJ − I;
SSB = Σ_i ni (ȳi. − ȳ..)² = J Σ_i (ȳi. − ȳ..)² has d.f. I − 1.
We can rewrite the model yij = µi + ǫij, ǫij ∼ N(0, σ²) i.i.d., as follows. µi = µ̄. + (µi − µ̄.) = µ + αi, where µ̄. = Σ_{i=1}^I µi / I and αi = µi − µ̄.. Then Σ_{i=1}^I αi = α. = Σ_{i=1}^I (µi − µ̄.) = 0. Further, H0: µ1 = µ2 = · · · = µI is the same as H0: α1 = α2 = · · · = α_{I−1} = 0 (α. = 0 implies that αI = −Σ_{i=1}^{I−1} αi = 0 also).
Similarly write ǭi. = ǭ.. + (ǭi. − ǭ..), so that ǫij = ǭ.. + (ǭi. − ǭ..) + (ǫij − ǭi.). Therefore

    Σ_i Σ_j ǫij² = Σ_i Σ_j ǭ..² + Σ_i Σ_j (ǭi. − ǭ..)² + Σ_i Σ_j (ǫij − ǭi.)²,

since ǭ.. Σ_i (ǭi. − ǭ..) = 0, ǭ.. Σ_i Σ_j (ǫij − ǭi.) = 0 and Σ_i Σ_j (ǭi. − ǭ..)(ǫij − ǭi.) = Σ_i (ǭi. − ǭ..) Σ_j (ǫij − ǭi.) = 0.
Now, since ǫij = yij − µ − αi, we get ǭi. = ȳi. − µ − αi, ǭ.. = ȳ.. − µ, and further, from above,

    Σ_i Σ_j (yij − µ − αi)²
    = Σ_i Σ_j (ȳ.. − µ)² + Σ_i Σ_j (ȳi. − ȳ.. − αi)² + Σ_i Σ_j (yij − ȳi.)².

Least squares estimates subject to Σ_{i=1}^I αi = 0 may be obtained simply by examination of the above, and they are

    µ̂ = ȳ.. ,   α̂i = ȳi. − ȳ.. ,

and hence RSS = Σ_i Σ_j (yij − ȳi.)².

Under H0: α1 = α2 = · · · = α_{I−1} = 0, we have

    Σ_i Σ_j (yij − µ − αi)² ≡ Σ_i Σ_j (yij − µ)²
    = Σ_i Σ_j (ȳ.. − µ)² + Σ_i Σ_j (ȳi. − ȳ..)² + Σ_i Σ_j (yij − ȳi.)²,

so that, then, Σ_i Σ_j (yij − µ)² is minimized when µ̂ = ȳ.. (with αi = 0). We then get

    RSS_{H0} = Σ_i Σ_j (ȳi. − ȳ..)² + Σ_i Σ_j (yij − ȳi.)²
             = J Σ_i (ȳi. − ȳ..)² + Σ_i Σ_j (yij − ȳi.)².

Therefore,

    RSS_{H0} − RSS = J Σ_i (ȳi. − ȳ..)².

Note that all these can be done by just inspection, even though we have de-
rived these previously using other methods. The simplicity of this approach,
however, is very useful for higher-way classification models.
One-way ANOVA with equal number of observations per group.

source       d.f.      SS                      MS              F
Treatments   I − 1     J Σ_i (ȳi. − ȳ..)²      SSB/(I − 1)     { J Σ_i (ȳi. − ȳ..)²/(I − 1) } / { Σ_i Σ_j (yij − ȳi.)²/(IJ − I) }
Error        IJ − I    Σ_i Σ_j (yij − ȳi.)²    SSW/(IJ − I)
Total (C)    IJ − 1    Σ_i Σ_j (yij − ȳ..)²
This approach of reparametrization and decomposition generalizes to higher-
way classification where there are substantial simplifications.

2-factor Analysis or 2-way ANOVA

Example. An engineer is designing a battery for use in a device that will be


subjected to some extreme temperature variations. The only design param-
eter that he can select at this time is the plate material for the battery, and
he has three possible choices. When the device is manufactured and shipped
to the field, the engineer has no control over the temperature extremes that
the device will encounter, and he knows from past experience that temper-
ature may impact the effective battery life. However, temperature can be
controlled in the product development laboratory for the purposes of testing.
The engineer decides to test all three plate materials at three different tem-
perature levels, 15◦ F, 70◦ F and 125◦ F (-10, 21 and 51 degree C), as these tem-
perature levels are consistent with the product end-use environment. Four
batteries are tested at each combination of plate material and temperature,
and the 36 tests are run in random order.
Question 1. What effects do material type and temperature have on the life
of the battery?
Question 2. Is there a choice of material that would give uniformly long life
regardless of temperature? (Robust product design?)

Life (in hrs) data for the battery design experiment:

material                   temperature (°F)
type              15                70               125
  1          130   155         34    40          20    70
              74   180         80    75          82    58
  2          150   188        126   122          25    70
             159   126        106   115          58    45
  3          138   110        174   120          96   104
             168   160        150   139          82    60
Both factors, material type and temperature, are important, and there may
also be interaction between the two. Let us denote the row factor as factor
A and the column factor as factor B (in general). Then the model for the
data may be developed as follows.
Let yijk be the observed response when factor A is at the ith level (i =
1, 2, . . . , I) and factor B is at the jth level (j = 1, 2, . . . , J) for the kth
replicate (k = 1, 2, . . . , K). In the example, I = 3, J = 3, K = 4. This design
is like having IJ different cells each of which has K observations, and one
wants to see if the IJ cell means are different or not (in various ways).

yijk = µij + ǫijk , i = 1, 2, . . . , I; j = 1, 2, . . . , J; k = 1, 2, . . . , K.

Therefore it is also called a completely randomized 2-factor design. We assume
the $\epsilon_{ijk}$ are i.i.d. $N(0, \sigma^2)$. As before, this is a linear model, and hence
various linear hypotheses can be tested. Let

$$\bar{y}_{ij.} = \frac{1}{K} \sum_{k=1}^K y_{ijk}, \quad i = 1, 2, \ldots, I; \; j = 1, 2, \ldots, J,$$
$$\bar{y}_{i..} = \frac{1}{JK} \sum_{j=1}^J \sum_{k=1}^K y_{ijk} = \frac{1}{J} \sum_{j=1}^J \bar{y}_{ij.}, \quad i = 1, 2, \ldots, I,$$
$$\bar{y}_{.j.} = \frac{1}{IK} \sum_{i=1}^I \sum_{k=1}^K y_{ijk} = \frac{1}{I} \sum_{i=1}^I \bar{y}_{ij.}, \quad j = 1, 2, \ldots, J,$$
$$\bar{y}_{...} = \frac{1}{IJK} \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K y_{ijk} = \frac{1}{IJ} \sum_{i=1}^I \sum_{j=1}^J \bar{y}_{ij.} = \frac{1}{I} \sum_{i=1}^I \bar{y}_{i..} = \frac{1}{J} \sum_{j=1}^J \bar{y}_{.j.}.$$

Now, $\hat{\mu}_{ij} = \bar{y}_{ij.}$ under no constraints, and hence
$$RSS = \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (y_{ijk} - \bar{y}_{ij.})^2$$
has $IJ(K - 1)$ d.f. To consider interesting questions, it is best to adopt the
reparametrization
$$\mu_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij},$$
where $\mu = \bar{\mu}_{..} = \frac{1}{IJ} \sum_{i=1}^I \sum_{j=1}^J \mu_{ij}$, $\alpha_i = \bar{\mu}_{i.} - \bar{\mu}_{..}$, $\beta_j = \bar{\mu}_{.j} - \bar{\mu}_{..}$ and
$(\alpha\beta)_{ij} = \mu_{ij} - \bar{\mu}_{i.} - \bar{\mu}_{.j} + \bar{\mu}_{..}$.

Then note that $\sum_{i=1}^I \alpha_i = 0$, $\sum_{j=1}^J \beta_j = 0$, $\sum_{i=1}^I (\alpha\beta)_{ij} = 0$ for all $j$ and
$\sum_{j=1}^J (\alpha\beta)_{ij} = 0$ for all $i$.
(Note, $\sum_{i=1}^I (\alpha\beta)_{ij} = \sum_{i=1}^I \mu_{ij} - \sum_{i=1}^I \bar{\mu}_{i.} - I\bar{\mu}_{.j} + I\bar{\mu}_{..} = \sum_{i=1}^I (\mu_{ij} - \bar{\mu}_{.j}) = 0$,
since $\sum_{i=1}^I \bar{\mu}_{i.} = I\bar{\mu}_{..}$.)
These are the conditions required for identifiability of the parameters under the
reparametrization.
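These identities are easy to verify numerically. Here is a minimal R sketch; the 3 × 3
matrix of cell means mu below is an arbitrary illustrative choice, not data from the notes.

    # Arbitrary 3 x 3 table of cell means mu_ij (illustrative values only)
    mu <- matrix(c(130,  60,  55,
                   155, 120,  50,
                   145, 145,  85), nrow = 3, byrow = TRUE)

    mu_bar <- mean(mu)                      # overall effect  mu = mu-bar_..
    alpha  <- rowMeans(mu) - mu_bar         # main effects of factor A
    beta   <- colMeans(mu) - mu_bar         # main effects of factor B
    ab     <- mu - outer(rowMeans(mu), colMeans(mu), "+") + mu_bar
    # (alpha beta)_ij = mu_ij - mu-bar_i. - mu-bar_.j + mu-bar_..

    # Identifiability conditions: all of these are (numerically) zero
    sum(alpha); sum(beta); colSums(ab); rowSums(ab)

    # The reparametrization reconstructs mu_ij exactly
    max(abs(mu - (mu_bar + outer(alpha, beta, "+") + ab)))    # ~ 0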
Now consider the interpretation of these parameters. $\mu = \bar{\mu}_{..}$ is the overall
effect. $\alpha_i = \bar{\mu}_{i.} - \bar{\mu}_{..}$ is the main effect of factor A at level $i$: averaging over
the levels of factor B eliminates their effect, leaving the departure of level $i$ of
factor A from the overall effect. Similarly, $\beta_j = \bar{\mu}_{.j} - \bar{\mu}_{..}$ is the main effect of
factor B at level $j$. What does $(\alpha\beta)_{ij} = \mu_{ij} - \bar{\mu}_{i.} - \bar{\mu}_{.j} + \bar{\mu}_{..}$ measure?
Suppose we want to see if the effect of factor A at level $i$ depends on the level
of factor B. If there were no such interaction, we would expect the difference
in means $\mu_{i_1 j} - \mu_{i_2 j}$ to depend on $i_1$ and $i_2$ and not on $j$, i.e.,

$$\mu_{i_1 j} - \mu_{i_2 j} = \phi(i_1, i_2) = \frac{1}{J} \sum_{j'=1}^J \phi(i_1, i_2) = \frac{1}{J} \sum_{j'=1}^J (\mu_{i_1 j'} - \mu_{i_2 j'}) = \bar{\mu}_{i_1 .} - \bar{\mu}_{i_2 .},$$

for all $i_1, i_2$. Or, equivalently, $\mu_{i_1 j} - \bar{\mu}_{i_1 .} = \mu_{i_2 j} - \bar{\mu}_{i_2 .}$ for all $i_1, i_2$, i.e.,

$$\mu_{ij} - \bar{\mu}_{i.} = \Phi(j) \;\text{(independent of $i$)} = \frac{1}{I} \sum_{i'=1}^I \Phi(j) = \frac{1}{I} \sum_{i'=1}^I (\mu_{i' j} - \bar{\mu}_{i' .}) = \bar{\mu}_{.j} - \bar{\mu}_{..}, \quad \text{for all } i, j,$$

i.e., $\mu_{ij} - \bar{\mu}_{i.} - \bar{\mu}_{.j} + \bar{\mu}_{..} = 0$ for all $i, j$. Because of symmetry, we could have
begun with $\mu_{i j_1} - \mu_{i j_2}$ depending on $j_1, j_2$, but not on $i$. Thus, we see that
$(\alpha\beta)_{ij} = \mu_{ij} - \bar{\mu}_{i.} - \bar{\mu}_{.j} + \bar{\mu}_{..}$ measures the interaction of $i$ and $j$. Therefore,
to investigate the existence of interaction, we should test,
HAB : (αβ)ij = 0 (i = 1, 2, . . . , I; j = 1, 2, . . . , J) as the restricted model
without interaction. Estimation of (αβ)ij can also be considered. Now,
consider the main effects of factors A and B.
To test for lack of difference among the levels of factor A, use $H_A: \alpha_i = 0$ for all $i$;
to test for lack of difference among the levels of factor B, use $H_B: \beta_j = 0$ for all $j$.
If $H_{AB}: (\alpha\beta)_{ij} = 0$ has been rejected, there is evidence of significant interaction,
and then the main effects cannot be regarded as non-existent.
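As a small hypothetical illustration of what $(\alpha\beta)_{ij}$ measures, take $I = J = 2$ with
cell means $\mu_{11} = 10$, $\mu_{12} = 14$, $\mu_{21} = 16$, $\mu_{22} = 20$. The two rows are parallel
($\mu_{2j} - \mu_{1j} = 6$ for both $j$), and indeed
$$(\alpha\beta)_{11} = \mu_{11} - \bar{\mu}_{1.} - \bar{\mu}_{.1} + \bar{\mu}_{..} = 10 - 12 - 13 + 15 = 0,$$
and similarly all $(\alpha\beta)_{ij} = 0$. If instead $\mu_{22} = 26$, the row difference is $6$ at $j = 1$
but $12$ at $j = 2$; now $\bar{\mu}_{..} = 16.5$, $\bar{\mu}_{1.} = 12$, $\bar{\mu}_{.1} = 13$, so
$(\alpha\beta)_{11} = 10 - 12 - 13 + 16.5 = 1.5 \ne 0$: the nonzero interaction terms pick up
exactly this departure from parallel profiles.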

To find estimates, confidence intervals and to conduct tests, we proceed as
follows. Since
µij = µ̄.. + (µ̄i. − µ̄.. ) + (µ̄.j − µ̄.. ) + (µij − µ̄i. − µ̄.j + µ̄.. ) = µ + αi + βj + (αβ)ij ,
we use a similar representation for ǫijk :

ǫijk = ǭ... + (ǭi.. − ǭ... ) + (ǭ.j. − ǭ... ) + (ǭij. − ǭi.. − ǭ.j. + ǭ... ) + (ǫijk − ǭij. ).

Therefore, as in one-way classification,

$$\sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K \epsilon_{ijk}^2 = IJK\bar{\epsilon}_{...}^2 + JK \sum_{i=1}^I (\bar{\epsilon}_{i..} - \bar{\epsilon}_{...})^2 + IK \sum_{j=1}^J (\bar{\epsilon}_{.j.} - \bar{\epsilon}_{...})^2 + K \sum_{i=1}^I \sum_{j=1}^J (\bar{\epsilon}_{ij.} - \bar{\epsilon}_{i..} - \bar{\epsilon}_{.j.} + \bar{\epsilon}_{...})^2 + \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (\epsilon_{ijk} - \bar{\epsilon}_{ij.})^2,$$

since cross products vanish. Noting that $\epsilon_{ijk} = y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij}$, with
$\sum_{i=1}^I \alpha_i = 0$, $\sum_{j=1}^J \beta_j = 0$, $\sum_{i=1}^I (\alpha\beta)_{ij} = 0$ for all $j$ and $\sum_{j=1}^J (\alpha\beta)_{ij} = 0$
for all $i$, we get $\bar{\epsilon}_{...} = \bar{y}_{...} - \mu$, $\bar{\epsilon}_{i..} = \bar{y}_{i..} - \mu - \alpha_i$, $\bar{\epsilon}_{.j.} = \bar{y}_{.j.} - \mu - \beta_j$,
$\bar{\epsilon}_{ij.} = \bar{y}_{ij.} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij}$. Hence,

$$\sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (y_{ijk} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{ij})^2 = IJK(\bar{y}_{...} - \mu)^2 + JK \sum_{i=1}^I (\bar{y}_{i..} - \bar{y}_{...} - \alpha_i)^2 + IK \sum_{j=1}^J (\bar{y}_{.j.} - \bar{y}_{...} - \beta_j)^2 + K \sum_{i=1}^I \sum_{j=1}^J (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...} - (\alpha\beta)_{ij})^2 + \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (y_{ijk} - \bar{y}_{ij.})^2.$$
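This identity is what allows the least squares estimates below to be read off by inspection,
and it can be checked numerically. The following R sketch uses simulated yijk and arbitrary
parameter values satisfying the constraints; all numbers are purely illustrative.

    set.seed(2)
    I <- 3; J <- 4; K <- 5
    y <- array(rnorm(I * J * K, mean = 50, sd = 10), dim = c(I, J, K))

    # Arbitrary parameter values satisfying the identifiability constraints
    mu    <- 48
    alpha <- c(2, -1, -1)                     # sums to zero
    beta  <- c(1, 3, -2, -2)                  # sums to zero
    ab    <- outer(alpha, beta) / 5           # rows and columns sum to zero

    ybar_ij <- apply(y, c(1, 2), mean)        # ybar_ij.
    ybar_i  <- apply(y, 1, mean)              # ybar_i..
    ybar_j  <- apply(y, 2, mean)              # ybar_.j.
    ybar    <- mean(y)                        # ybar_...

    m   <- mu + outer(alpha, beta, "+") + ab  # mu + alpha_i + beta_j + (alpha beta)_ij
    lhs <- sum(sweep(y, c(1, 2), m)^2)
    rhs <- I * J * K * (ybar - mu)^2 +
           J * K * sum((ybar_i - ybar - alpha)^2) +
           I * K * sum((ybar_j - ybar - beta)^2) +
           K * sum((ybar_ij - outer(ybar_i, ybar_j, "+") + ybar - ab)^2) +
           sum(sweep(y, c(1, 2), ybar_ij)^2)
    all.equal(lhs, rhs)                       # TRUE, up to floating point error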

Subject to the identifiability conditions, we obtain the least squares estimates

$$\hat{\mu} = \bar{y}_{...}, \quad \hat{\alpha}_i = \bar{y}_{i..} - \bar{y}_{...}, \quad \hat{\beta}_j = \bar{y}_{.j.} - \bar{y}_{...} \quad \text{and} \quad \widehat{(\alpha\beta)}_{ij} = \bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...}.$$

Therefore, $RSS = \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (y_{ijk} - \bar{y}_{ij.})^2$, as seen earlier.
Consider $H_{AB}: (\alpha\beta)_{ij} = 0$ for all $i, j$. Due to the identifiability constraints on
these parameters, namely $0 = \sum_{i=1}^I (\alpha\beta)_{ij} = \sum_{j=1}^J (\alpha\beta)_{ij} = \sum_{i=1}^I \sum_{j=1}^J (\alpha\beta)_{ij}$,
there are $IJ - I - J + 1 = (I - 1)(J - 1)$ linearly independent equations, so the
$A$ matrix used to express this as a linear hypothesis has rank $(I - 1)(J - 1)$.
Further, by inspection,

$$RSS_{H_{AB}} = \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (y_{ijk} - \bar{y}_{ij.})^2 + K \sum_{i=1}^I \sum_{j=1}^J (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2,$$

since $\hat{\mu}$, $\hat{\alpha}_i$ and $\hat{\beta}_j$ remain as before. Hence

$$RSS_{H_{AB}} - RSS = K \sum_{i=1}^I \sum_{j=1}^J (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2 = K \sum_{i=1}^I \sum_{j=1}^J \widehat{(\alpha\beta)}_{ij}^2,$$

which has d.f. $(I - 1)(J - 1)$. To test $H_{AB}$, use
$$F_{AB} = \frac{(RSS_{H_{AB}} - RSS)/\{(I - 1)(J - 1)\}}{RSS/\{IJ(K - 1)\}} \sim F_{(I-1)(J-1),\, IJ(K-1)}$$
under $H_{AB}$. Now consider $H_A: \alpha_i = 0$ for all $i$. There are $I - 1$ linearly
independent equations here, so the rank of the $A$ matrix is $I - 1$. Again, by
inspection, note that the estimates of the remaining parameters $\mu$, $\beta_j$ and $(\alpha\beta)_{ij}$
remain unchanged, so
$$RSS_{H_A} = \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (y_{ijk} - \bar{y}_{ij.})^2 + JK \sum_{i=1}^I (\bar{y}_{i..} - \bar{y}_{...})^2, \quad \text{so}$$
$$RSS_{H_A} - RSS = JK \sum_{i=1}^I (\bar{y}_{i..} - \bar{y}_{...})^2 = JK \sum_{i=1}^I \hat{\alpha}_i^2$$
with d.f. $I - 1$. Similarly,
$$RSS_{H_B} = \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (y_{ijk} - \bar{y}_{ij.})^2 + IK \sum_{j=1}^J (\bar{y}_{.j.} - \bar{y}_{...})^2, \quad \text{so}$$
$$RSS_{H_B} - RSS = IK \sum_{j=1}^J (\bar{y}_{.j.} - \bar{y}_{...})^2 = IK \sum_{j=1}^J \hat{\beta}_j^2$$
with d.f. $J - 1$. Therefore, for the respective tests use
$$F_A = \frac{(RSS_{H_A} - RSS)/(I - 1)}{RSS/\{IJ(K - 1)\}} \sim F_{I-1,\, IJ(K-1)}$$
under $H_A$ and
$$F_B = \frac{(RSS_{H_B} - RSS)/(J - 1)}{RSS/\{IJ(K - 1)\}} \sim F_{J-1,\, IJ(K-1)}$$
under $H_B$. The decomposition of the total sum of squares along with its d.f.
is as follows.
$$\sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K y_{ijk}^2 = IJK\bar{y}_{...}^2 + JK \sum_{i=1}^I (\bar{y}_{i..} - \bar{y}_{...})^2 + IK \sum_{j=1}^J (\bar{y}_{.j.} - \bar{y}_{...})^2 + K \sum_{i=1}^I \sum_{j=1}^J (\bar{y}_{ij.} - \bar{y}_{i..} - \bar{y}_{.j.} + \bar{y}_{...})^2 + \sum_{i=1}^I \sum_{j=1}^J \sum_{k=1}^K (y_{ijk} - \bar{y}_{ij.})^2,$$
$$IJK = 1 + (I - 1) + (J - 1) + (IJ - I - J + 1) + (IJK - IJ).$$

ANOVA table for 2-factor analysis:

source            d.f.              SS                                            MS                             F
A main effects    I − 1             SSA = JK Σi (ȳi.. − ȳ...)²                    MSA = SSA/(I − 1)              FA = MSA/MSE
B main effects    J − 1             SSB = IK Σj (ȳ.j. − ȳ...)²                    MSB = SSB/(J − 1)              FB = MSB/MSE
AB interactions   (I − 1)(J − 1)    SSAB = K Σi Σj (ȳij. − ȳi.. − ȳ.j. + ȳ...)²   MSAB = SSAB/{(I − 1)(J − 1)}   FAB = MSAB/MSE
Error             IJ(K − 1)         RSS = Σi Σj Σk (yijk − ȳij.)²                 MSE = RSS/{IJ(K − 1)}
Total (c)         IJK − 1           Σi Σj Σk (yijk − ȳ...)²
Mean              1                 IJK ȳ...²
Total             IJK               Σi Σj Σk yijk²

ANOVA for the battery example:

source         d.f.    SS       MS       F
plate           2      10684    5342      7.91  (on 2, 27 d.f.)
temperature     2      39119    19559    28.97  (on 2, 27 d.f.)
interactions    4       9614    2403      3.56  (on 4, 27 d.f.)
error          27      18231     675
total (c)      35      77647
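The table above can be reproduced (up to rounding) in R. The sketch below enters the data
exactly as in the battery life table and fits the two-factor model with interaction; treating
temperature as a factor matches the analysis in the notes, and the sum-to-zero contrasts
correspond to the parametrization used above (in this balanced design they do not change
the ANOVA table).

    # Battery life data: 3 materials x 3 temperatures x 4 replicates
    life <- c(130, 155,  74, 180,   34,  40,  80,  75,   20,  70,  82,  58,  # material 1
              150, 188, 159, 126,  126, 122, 106, 115,   25,  70,  58,  45,  # material 2
              138, 110, 168, 160,  174, 120, 150, 139,   96, 104,  82,  60)  # material 3
    material    <- factor(rep(1:3, each = 12))
    temperature <- factor(rep(rep(c(15, 70, 125), each = 4), times = 3))
    battery <- data.frame(life, material, temperature)

    fit <- aov(life ~ material * temperature, data = battery,
               contrasts = list(material = contr.sum, temperature = contr.sum))
    summary(fit)   # SS approx. 10684, 39119, 9614, 18231 on 2, 2, 4, 27 d.f.;
                   # F approx. 7.91, 28.97, 3.56, as in the table above

    # The same sums of squares from the cell and marginal means:
    I <- 3; J <- 3; K <- 4
    ybar_ij <- tapply(battery$life, list(battery$material, battery$temperature), mean)
    SS_A  <- J * K * sum((rowMeans(ybar_ij) - mean(ybar_ij))^2)
    SS_B  <- I * K * sum((colMeans(ybar_ij) - mean(ybar_ij))^2)
    SS_AB <- K * sum((ybar_ij - outer(rowMeans(ybar_ij), colMeans(ybar_ij), "+")
                      + mean(ybar_ij))^2)
    RSS   <- sum(residuals(fit)^2)
    c(SS_A, SS_B, SS_AB, RSS)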
