
Statistics 512: Applied Linear Models

Topic 3

Topic Overview
This topic will cover
• thinking in terms of matrices

• regression on multiple predictor variables

• case study: CS majors

• Text Example (KNNL 236)

Chapter 5: Linear Regression in Matrix Form


The SLR Model in Scalar Form

Yi = β0 + β1 Xi + εi, where εi ∼ iid N(0, σ²)

Consider now writing an equation for each observation:

Y1 = β0 + β1 X1 + ε1
Y2 = β0 + β1 X2 + ε2
⋮
Yn = β0 + β1 Xn + εn

The SLR Model in Matrix Form


     
\[
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix} \beta_0 + \beta_1 X_1 \\ \beta_0 + \beta_1 X_2 \\ \vdots \\ \beta_0 + \beta_1 X_n \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\]

\[
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\]

(I will try to use bold symbols for matrices. At first, I will also indicate the dimensions as
a subscript to the symbol.)

1
• X is called the design matrix.

• β is the vector of parameters.

• ε is the error vector.

• Y is the response vector.

The Design Matrix


 
\[
\mathbf{X}_{n\times 2} =
\begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}
\]

Vector of Parameters
 
\[
\boldsymbol{\beta}_{2\times 1} =
\begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}
\]

Vector of Error Terms


 
\[
\boldsymbol{\varepsilon}_{n\times 1} =
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
\]

Vector of Responses
 
\[
\mathbf{Y}_{n\times 1} =
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
\]

Thus,

Y = Xβ + ε
Yn×1 = Xn×2 β2×1 + εn×1
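To make the notation concrete, here is a minimal NumPy sketch (not from the text; the data, parameter values, and variable names are made up for illustration) that builds the design matrix X and generates Y = Xβ + ε:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up predictor values and true parameters (illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
beta = np.array([2.0, 0.5])          # beta_0 = 2, beta_1 = 0.5
sigma = 0.3

# design matrix: a column of ones followed by the predictor column
X = np.column_stack([np.ones_like(x), x])   # shape (n, 2)

eps = rng.normal(0.0, sigma, size=len(x))   # epsilon ~ N(0, sigma^2 I)
Y = X @ beta + eps                          # Y = X beta + epsilon

print(X)
print(Y)
```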

Variance-Covariance Matrix
In general, for any set of variables U1 , U2 , . . . , Un , their variance-covariance matrix is defined
to be
 
\[
\sigma^2\{\mathbf{U}\} =
\begin{bmatrix}
\sigma^2\{U_1\} & \sigma\{U_1, U_2\} & \cdots & \sigma\{U_1, U_n\} \\
\sigma\{U_2, U_1\} & \sigma^2\{U_2\} & \ddots & \vdots \\
\vdots & \ddots & \ddots & \sigma\{U_{n-1}, U_n\} \\
\sigma\{U_n, U_1\} & \cdots & \sigma\{U_n, U_{n-1}\} & \sigma^2\{U_n\}
\end{bmatrix}
\]

where σ²{Ui} is the variance of Ui, and σ{Ui, Uj} is the covariance of Ui and Uj.


When variables are uncorrelated, that means their covariance is 0. The variance-covariance
matrix of uncorrelated variables will be a diagonal matrix, since all the covariances are 0.

Note: Variables that are independent are also uncorrelated. So when variables are correlated, they are automatically dependent. However, it is possible for variables to be dependent but uncorrelated, since correlation only measures linear dependence. Normally distributed RVs are a convenient special case: if they are uncorrelated, they are also independent.

Covariance Matrix of ε

\[
\sigma^2\{\boldsymbol{\varepsilon}\}_{n\times n} =
\operatorname{Cov}\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
= \sigma^2 \mathbf{I}_{n\times n} =
\begin{bmatrix}
\sigma^2 & 0 & \cdots & 0 \\
0 & \sigma^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma^2
\end{bmatrix}
\]

Covariance Matrix of Y
 
\[
\sigma^2\{\mathbf{Y}\}_{n\times n} =
\operatorname{Cov}\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
= \sigma^2 \mathbf{I}_{n\times n}
\]

Distributional Assumptions in Matrix Form


ε ∼ N(0, σ²I)

I is an n × n identity matrix.

• Ones in the diagonal elements specify that the variance of each εi is 1 · σ² = σ².

• Zeros in the off-diagonal elements specify that the covariance between different εi is zero.

• This implies that the correlations are zero.

Parameter Estimation
Least Squares
Residuals are ε = Y − Xβ. We want to minimize the sum of squared residuals:

\[
\sum \varepsilon_i^2 = [\varepsilon_1\ \varepsilon_2\ \cdots\ \varepsilon_n]
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}
= \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}
\]

We want to minimize ε′ε = (Y − Xβ)′(Y − Xβ), where the “prime” (′) denotes the transpose of the matrix (exchange the rows and columns).
We take the derivative with respect to the vector β. This is like a quadratic function: think “(Y − Xβ)²”.
The derivative works out to 2 times the derivative of (Y − Xβ)′ with respect to β.
That is, d/dβ ((Y − Xβ)′(Y − Xβ)) = −2X′(Y − Xβ). We set this equal to 0 (a vector of zeros) and solve for β.
So, −2X′(Y − Xβ) = 0, or X′Y = X′Xβ (the “normal” equations).

Normal Equations

X′Y = (X′X)β

Solving this equation for β gives the least squares solution for b = (b0, b1)′.
Multiply on the left by the inverse of the matrix X′X. (Notice that the matrix X′X is a 2 × 2 square matrix for SLR.)

b = (X′X)⁻¹X′Y
REMEMBER THIS.
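Here is a minimal sketch of this formula in NumPy, with made-up data (the course software is SAS; this is only to illustrate the algebra). Solving the normal equations directly is shown alongside NumPy's own least-squares routine as a cross-check:

```python
import numpy as np

# made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.6, 3.1, 3.5, 4.2, 4.4])

X = np.column_stack([np.ones_like(x), x])

# normal equations: (X'X) b = X'Y
b = np.linalg.solve(X.T @ X, X.T @ Y)

# same answer via a numerically safer least-squares routine
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(b)                          # [b0, b1]
print(np.allclose(b, b_lstsq))    # True
```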

Reality Break:
This is just to convince you that we have done nothing new or magical – all we are doing is writing the same old formulas for b0 and b1 in matrix format. Do NOT worry if you cannot reproduce the following algebra, but you SHOULD try to follow it so that you believe me that this is really not a new formula.
Recall in Topic 1, we had
\[
b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} \equiv \frac{SP_{XY}}{SS_X},
\qquad
b_0 = \bar{Y} - b_1 \bar{X}
\]

Now let’s look at the pieces of the new formula:
 
\[
\mathbf{X}'\mathbf{X} =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix}
\begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}
=
\begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}
\]

\[
(\mathbf{X}'\mathbf{X})^{-1} =
\frac{1}{n\sum X_i^2 - \left(\sum X_i\right)^2}
\begin{bmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{bmatrix}
=
\frac{1}{n\,SS_X}
\begin{bmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{bmatrix}
\]

\[
\mathbf{X}'\mathbf{Y} =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_n \end{bmatrix}
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}
=
\begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}
\]
Plug these into the equation for b:

\[
\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}
= \frac{1}{n\,SS_X}
\begin{bmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{bmatrix}
\begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}
\]
\[
= \frac{1}{n\,SS_X}
\begin{bmatrix} \left(\sum X_i^2\right)\left(\sum Y_i\right) - \left(\sum X_i\right)\left(\sum X_i Y_i\right) \\ -\left(\sum X_i\right)\left(\sum Y_i\right) + n\sum X_i Y_i \end{bmatrix}
= \frac{1}{SS_X}
\begin{bmatrix} \bar{Y}\sum X_i^2 - \bar{X}\sum X_i Y_i \\ \sum X_i Y_i - n\bar{X}\bar{Y} \end{bmatrix}
\]
\[
= \frac{1}{SS_X}
\begin{bmatrix} \bar{Y}\sum X_i^2 - \bar{Y}\left(n\bar{X}^2\right) + \bar{X}\left(n\bar{X}\bar{Y}\right) - \bar{X}\sum X_i Y_i \\ SP_{XY} \end{bmatrix}
= \frac{1}{SS_X}
\begin{bmatrix} \bar{Y}\,SS_X - SP_{XY}\,\bar{X} \\ SP_{XY} \end{bmatrix}
=
\begin{bmatrix} \bar{Y} - \frac{SP_{XY}}{SS_X}\bar{X} \\[4pt] \frac{SP_{XY}}{SS_X} \end{bmatrix}
=
\begin{bmatrix} b_0 \\ b_1 \end{bmatrix},
\]

where

\[
SS_X = \sum X_i^2 - n\bar{X}^2 = \sum (X_i - \bar{X})^2,
\qquad
SP_{XY} = \sum X_i Y_i - n\bar{X}\bar{Y} = \sum (X_i - \bar{X})(Y_i - \bar{Y}).
\]
All we have done is to write the same old formulas for b0 and b1 in a fancy new format.
See NKNW page 199 for details. Why have we bothered to do this? The cool part is that
the same approach works for multiple regression. All we do is make X and b into bigger
matrices, and use exactly the same formula.
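If you want a numerical sanity check that the matrix formula really is the same old formula, here is a small sketch with made-up data comparing b = (X′X)⁻¹X′Y to the scalar SPXY/SSX formulas:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.6, 3.1, 3.5, 4.2, 4.4])
X = np.column_stack([np.ones_like(x), x])

# matrix formula
b = np.linalg.solve(X.T @ X, X.T @ Y)

# scalar formulas from Topic 1
SSX  = np.sum((x - x.mean()) ** 2)
SPXY = np.sum((x - x.mean()) * (Y - Y.mean()))
b1 = SPXY / SSX
b0 = Y.mean() - b1 * x.mean()

print(np.allclose(b, [b0, b1]))   # True: same estimates either way
```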

Other Quantities in Matrix Form


Fitted Values
     
\[
\hat{\mathbf{Y}} =
\begin{bmatrix} \hat{Y}_1 \\ \hat{Y}_2 \\ \vdots \\ \hat{Y}_n \end{bmatrix}
=
\begin{bmatrix} b_0 + b_1 X_1 \\ b_0 + b_1 X_2 \\ \vdots \\ b_0 + b_1 X_n \end{bmatrix}
=
\begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \end{bmatrix}
= \mathbf{X}\mathbf{b}
\]

Hat Matrix

Ŷ = Xb
Ŷ = X(X′X)⁻¹X′Y
Ŷ = HY

where H = X(X′X)⁻¹X′. We call this the “hat matrix” because it turns Y’s into Ŷ’s.
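A quick numeric check of the hat matrix with made-up data; it also previews the symmetry and idempotency properties used later for the residuals:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.6, 3.1, 3.5, 4.2, 4.4])
X = np.column_stack([np.ones_like(x), x])

H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix
b = np.linalg.solve(X.T @ X, X.T @ Y)

Y_hat = H @ Y                            # "puts the hat on Y"
print(np.allclose(Y_hat, X @ b))         # True: HY = Xb

print(np.allclose(H, H.T))               # symmetric
print(np.allclose(H @ H, H))             # idempotent: H^2 = H
```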

Estimated Covariance Matrix of b


The vector b is a linear combination of the elements of Y.
These estimates are normal if Y is normal.
These estimates will be approximately normal in general.

A Useful Multivariate Theorem


Suppose U ∼ N(µ, Σ), a multivariate normal vector, and V = c + DU, a linear transformation of U, where c is a vector and D is a matrix. Then V ∼ N(c + Dµ, DΣD′).

Recall: b = (X′X)⁻¹X′Y = [(X′X)⁻¹X′]Y and Y ∼ N(Xβ, σ²I).
Now apply the theorem to b using

U = Y, µ = Xβ, Σ = σ²I
V = b, c = 0, and D = (X′X)⁻¹X′

The theorem tells us the vector b is normally distributed with mean

(X′X)⁻¹(X′X)β = β

and covariance matrix


\[
\sigma^2 \left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right] \mathbf{I} \left[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]'
= \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X})(\mathbf{X}'\mathbf{X})^{-1}
= \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}
\]

using the fact that both X′X and its inverse are symmetric, so ((X′X)⁻¹)′ = (X′X)⁻¹.
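One way to convince yourself of the covariance result is by simulation. The following rough sketch (made-up design, parameters, and replication count) draws many response vectors from the model, re-estimates b each time, and compares the empirical covariance of the estimates to σ²(X′X)⁻¹:

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X = np.column_stack([np.ones_like(x), x])
beta = np.array([2.0, 0.5])
sigma = 0.3

XtX_inv = np.linalg.inv(X.T @ X)

# simulate many data sets from the same model and re-estimate b each time
reps = 10000
bs = np.empty((reps, 2))
for r in range(reps):
    Y = X @ beta + rng.normal(0.0, sigma, size=len(x))
    bs[r] = np.linalg.solve(X.T @ X, X.T @ Y)

print(np.cov(bs, rowvar=False))      # empirical covariance of b
print(sigma ** 2 * XtX_inv)          # theoretical sigma^2 (X'X)^{-1} -- should be close
```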

Next we will use this framework to do multiple regression where we have more than one
explanatory variable (i.e., add another column to the design matrix and additional beta
parameters).

Multiple Regression
Data for Multiple Regression
• Yi is the response variable (as usual)

• Xi,1 , Xi,2 , . . . , Xi,p−1 are the p − 1 explanatory variables for cases i = 1 to n.

• Example – In Homework #1 you considered modeling GPA as a function of entrance exam score. But we could also consider intelligence test scores and high school GPA as potential predictors. This would be 3 variables, so p = 4.

• Potential problem to remember!!! These predictor variables are likely to be correlated with one another. We always want to be careful about using strongly correlated variables together as predictors in the same model.

The Multiple Regression Model

Yi = β0 + β1 Xi,1 + β2 Xi,2 + · · · + βp−1 Xi,p−1 + εi for i = 1, 2, . . . , n

where

• Yi is the value of the response variable for the ith case.

• εi ∼ iid N(0, σ²) (exactly as before!)

• β0 is the intercept (think multidimensionally).

• β1 , β2 , . . . , βp−1 are the regression coefficients for the explanatory variables.

• Xi,k is the value of the kth explanatory variable for the ith case.

• Parameters as usual include all of the β’s as well as σ 2 . These need to be estimated
from the data.

Interesting Special Cases


• Polynomial model:

Yi = β0 + β1 Xi + β2 Xi² + · · · + βp−1 Xi^{p−1} + εi

• X’s can be indicator or dummy variables with X = 0 or 1 (or any other two distinct
numbers) as possible values (e.g. ANOVA model). Interactions between explanatory
variables are then expressed as a product of the X’s:

Yi = β0 + β1 Xi,1 + β2 Xi,2 + β3 Xi,1 Xi,2 + εi

Model in Matrix Form

Yn×1 = Xn×p βp×1 + εn×1

ε ∼ N(0, σ²In×n)
Y ∼ N(Xβ, σ²I)

Design Matrix X:
 
\[
\mathbf{X} =
\begin{bmatrix}
1 & X_{1,1} & X_{1,2} & \cdots & X_{1,p-1} \\
1 & X_{2,1} & X_{2,2} & \cdots & X_{2,p-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & X_{n,1} & X_{n,2} & \cdots & X_{n,p-1}
\end{bmatrix}
\]

Coefficient vector β:

\[
\boldsymbol{\beta} =
\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}
\]

Parameter Estimation
Least Squares
Find b to minimize SSE = (Y − Xb)′(Y − Xb).
Obtain normal equations as before: X′Xb = X′Y

Least Squares Solution

b = (X′X)⁻¹X′Y

Fitted (predicted) values for the mean of Y are

Ŷ = Xb = X(X′X)⁻¹X′Y = HY,

where H = X(X′X)⁻¹X′.
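The same formulas carry over unchanged to multiple regression. A minimal sketch with two made-up predictors (so p = 3); the data and variable names are invented for illustration only:

```python
import numpy as np

# made-up data: n = 6 cases, two predictors
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.1, 3.0, 4.8, 4.5, 6.2, 6.0])

X = np.column_stack([np.ones_like(x1), x1, x2])   # n x p design matrix, p = 3

b = np.linalg.solve(X.T @ X, X.T @ Y)             # b = (X'X)^{-1} X'Y
H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix
Y_hat = H @ Y                                     # fitted values
e = Y - Y_hat                                     # residuals = (I - H) Y

print(b)
print(np.allclose(e, (np.eye(len(Y)) - H) @ Y))   # True
```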

Residuals

e = Y − Ŷ = Y − HY = (I − H)Y

Notice that the matrices H and (I − H) have two special properties. They are
• Symmetric: H = H′ and (I − H)′ = (I − H).
• Idempotent: H² = H and (I − H)(I − H) = (I − H).

Covariance Matrix of Residuals

Cov(e) = σ²(I − H)(I − H)′ = σ²(I − H)

Var(ei) = σ²(1 − hi,i),

where hi,i is the ith diagonal element of H.
Note: hi,i = Xi′(X′X)⁻¹Xi, where Xi′ = [1 Xi,1 · · · Xi,p−1].
Residuals ei are usually somewhat correlated: cov(ei, ej) = −σ²hi,j; this is not unexpected, since they sum to 0.

Estimation of σ
Since we have estimated p parameters, SSE = e′e has dfE = n − p. The estimate for σ² is the usual estimate:

\[
s^2 = \frac{\mathbf{e}'\mathbf{e}}{n - p} = \frac{(\mathbf{Y} - \mathbf{X}\mathbf{b})'(\mathbf{Y} - \mathbf{X}\mathbf{b})}{n - p} = \frac{SSE}{df_E} = MSE
\]

s = √s² = Root MSE
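A short sketch of the variance estimate, using the same made-up two-predictor data as above:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.1, 3.0, 4.8, 4.5, 6.2, 6.0])
X  = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

b = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ b                      # residuals

SSE = e @ e                        # e'e
dfE = n - p
MSE = SSE / dfE                    # s^2
s   = np.sqrt(MSE)                 # Root MSE

print(SSE, dfE, MSE, s)
```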

Distribution of b
We know that b = (X′X)⁻¹X′Y. The only RV involved is Y, so the distribution of b is based on the distribution of Y.

Since Y ∼ N(Xβ, σ²I), and using the multivariate theorem from earlier (if you like, go through the details on your own), we have

E(b) = (X′X)⁻¹X′Xβ = β
σ²{b} = Cov(b) = σ²(X′X)⁻¹

Since σ² is estimated by the MSE s², σ²{b} is estimated by s²(X′X)⁻¹.
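And a sketch of the estimated covariance matrix of b and the resulting standard errors (same made-up data):

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.1, 3.0, 4.8, 4.5, 6.2, 6.0])
X  = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
e = Y - X @ b
MSE = (e @ e) / (n - p)

cov_b = MSE * XtX_inv              # s^2 (X'X)^{-1}
se_b  = np.sqrt(np.diag(cov_b))    # standard errors of b0, b1, ..., b_{p-1}

print(cov_b)
print(se_b)
```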

ANOVA Table
Sources of variation are
• Model (SAS) or Regression (KNNL)
• Error (Residual)
• Total
SS and df add as before:

SSM + SSE = SST
dfM + dfE = dfTotal

but their values are different from SLR.

Sum of Squares

SSM = Σ(Ŷi − Ȳ)²
SSE = Σ(Yi − Ŷi)²
SSTO = Σ(Yi − Ȳ)²

Degrees of Freedom

dfM = p − 1
dfE = n − p
dfTotal = n − 1

The total degrees of freedom have not changed from SLR, but the model df has increased from 1 to p − 1, i.e., the number of X variables. Correspondingly, the error df has decreased from n − 2 to n − p.

Mean Squares

MSM = SSM/dfM = Σ(Ŷi − Ȳ)²/(p − 1)
MSE = SSE/dfE = Σ(Yi − Ŷi)²/(n − p)
MST = SSTO/dfTotal = Σ(Yi − Ȳ)²/(n − 1)

ANOVA Table

Source   df            SS     MS     F
Model    dfM = p − 1   SSM    MSM    MSM/MSE
Error    dfE = n − p   SSE    MSE
Total    dfT = n − 1   SST

F -test
H0 : β1 = β2 = . . . = βp−1 = 0 (all regression coefficients are zero)
HA: βk ≠ 0 for at least one k = 1, . . . , p − 1; at least one of the β’s is non-zero (or, not all the β’s are zero).
F = MSM/MSE
Under H0, F ∼ Fp−1,n−p
Reject H0 if F is larger than the critical value; if using SAS, reject H0 if the p-value < α = 0.05.
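A sketch of the ANOVA decomposition and the overall F-test using NumPy and SciPy (made-up data; SciPy's F distribution supplies the p-value instead of SAS):

```python
import numpy as np
from scipy import stats

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.1, 3.0, 4.8, 4.5, 6.2, 6.0])
X  = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

b = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ b

SSM = np.sum((Y_hat - Y.mean()) ** 2)
SSE = np.sum((Y - Y_hat) ** 2)
SST = np.sum((Y - Y.mean()) ** 2)
print(np.isclose(SSM + SSE, SST))            # decomposition holds

F = (SSM / (p - 1)) / (SSE / (n - p))        # MSM / MSE
p_value = stats.f.sf(F, p - 1, n - p)        # upper-tail probability
print(F, p_value)
```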

What do we conclude?
If H0 is rejected, we conclude that at least one of the regression coefficients is non-zero;
hence at least one of the X variables is useful in predicting Y . (Doesn’t say which one(s)
though). If H0 is not rejected, then we cannot conclude that any of the X variables is useful
in predicting Y .

p-value of F -test
The p-value for the F significance test tells us one of the following:
• there is no evidence to conclude that any of our explanatory variables can help us to
model the response variable using this kind of model (p ≥ 0.05).
• one or more of the explanatory variables in our model is potentially useful for predict-
ing the response in a linear model (p ≤ 0.05).

R2
The squared multiple regression correlation (R2 ) gives the proportion of variation in the
response variable explained by the explanatory variables.
It is sometimes called the coefficient of multiple determination (KNNL, page 236).
R2 = SSM/SST (the proportion of variation explained by the model)
R2 = 1 − (SSE/SST ) (1 − the proportion not explained by the model)
F and R² are related:

\[
F = \frac{R^2/(p-1)}{(1 - R^2)/(n - p)}
\]
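A quick numeric check, with the same made-up data, that the F statistic computed from the ANOVA table matches the one computed from R²:

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.1, 3.0, 4.8, 4.5, 6.2, 6.0])
X  = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

b = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ b

SSM = np.sum((Y_hat - Y.mean()) ** 2)
SSE = np.sum((Y - Y_hat) ** 2)
SST = SSM + SSE

R2 = SSM / SST
F_from_anova = (SSM / (p - 1)) / (SSE / (n - p))
F_from_R2    = (R2 / (p - 1)) / ((1 - R2) / (n - p))

print(R2)
print(np.isclose(F_from_anova, F_from_R2))   # True
```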

Inference for Individual Regression Coefficients


Confidence Interval for βk
We know that b ∼ N(β, σ²(X′X)⁻¹).
Define

s²{b}p×p = MSE × (X′X)⁻¹
s²{bk} = [s²{b}]k,k, the kth diagonal element.

CI for βk: bk ± tc s{bk}, where tc = tn−p(0.975).
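A sketch of 95% confidence intervals for the individual coefficients (made-up data; tc is the 0.975 quantile of the t distribution with n − p df, taken from SciPy):

```python
import numpy as np
from scipy import stats

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.1, 3.0, 4.8, 4.5, 6.2, 6.0])
X  = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
e = Y - X @ b
MSE = (e @ e) / (n - p)
se_b = np.sqrt(np.diag(MSE * XtX_inv))

tc = stats.t.ppf(0.975, df=n - p)
lower = b - tc * se_b
upper = b + tc * se_b
for k in range(p):
    print(f"beta_{k}: {b[k]:.3f}  95% CI ({lower[k]:.3f}, {upper[k]:.3f})")
```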

Significance Test for βk


H0: βk = 0
Same test statistic: t* = bk/s{bk}
Still use dfE, which now equals n − p.
The p-value is computed from the tn−p distribution.
This tests the significance of a variable given that the other variables are already in the model
(i.e., fitted last). Unlike in SLR, the t-tests for β are different from the F -test.
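And the corresponding t-tests, each assessing one coefficient given that the other variables are already in the model (same made-up data as above):

```python
import numpy as np
from scipy import stats

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
Y  = np.array([3.1, 3.0, 4.8, 4.5, 6.2, 6.0])
X  = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ Y
e = Y - X @ b
MSE = (e @ e) / (n - p)
se_b = np.sqrt(np.diag(MSE * XtX_inv))

t_stat = b / se_b                                    # t* = b_k / s{b_k}
p_vals = 2 * stats.t.sf(np.abs(t_stat), df=n - p)    # two-sided p-values
for k in range(p):
    print(f"beta_{k}: t* = {t_stat[k]:.2f}, p = {p_vals[k]:.4f}")
```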
