
Lecture 14: Multiple Linear Regression

1 Review of Simple Linear Regression in Matrix Form


We have Y = (Y1, . . . , Yn)^T and an n × 2 matrix X whose first column is all 1's. The model
is Y = Xβ + ε. The mean squared error (MSE) is n^{-1}(Y − Xβ)^T(Y − Xβ). The derivative of the MSE
with respect to β is
\[
\frac{2}{n}\left(-X^T Y + X^T X\beta\right) \tag{1}
\]
Setting this to zero at the optimum coefficient vector β̂ gives the (matrix) estimating equation
\[
-X^T Y + X^T X\hat{\beta} = 0 \tag{2}
\]
whose solution is
\[
\hat{\beta} = (X^T X)^{-1} X^T Y. \tag{3}
\]
The fitted values are
\[
\hat{Y} \equiv \hat{m} = X\hat{\beta} = HY
\]
where H is the hat matrix. Geometrically, this means that we find the fitted values by taking
the vector of observed responses Y and projecting it onto the column space of X.

2 Multiple Linear Regression


We are now ready to go from the simple linear regression model, with one predictor variable,
to multiple linear regression models, with more than one predictor variable.
In the basic form of the multiple linear regression model,
1. There are p quantitative predictor variables, X1 , X2 , . . . Xp . We make no assumptions
about their distribution; in particular, they may or may not be dependent. X without a
subscript will refer to the vector of all of these taken together. Thus, X = (X1 , . . . , Xp ).
2. There is a single response variable Y .
3. Y = β0 + β1 X1 + · · · + βp Xp + ε, for some constants (coefficients) β0, β1, . . . , βp.

4. The noise variable ε has E[ε|X = x] = 0 (mean zero), Var[ε|X = x] = σ² (constant
variance), and is uncorrelated across observations.

In matrix form, when we have n observations,
\[
Y = X\beta + \epsilon \tag{4}
\]
where X is an n × (p + 1) matrix of random variables whose first column is all 1's. We assume
that E[ε|X] = 0 and Var[ε|X] = σ²I.
Sometimes we further assume that ε ∼ MVN(0, σ²I), independent of X. From these
assumptions, it follows that, conditional on X, Y has a multivariate Gaussian distribution,
\[
Y|X \sim MVN(X\beta, \sigma^2 I). \tag{5}
\]
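To make the design matrix concrete, here is a minimal R sketch (the sample size, predictor names, and values below are invented for illustration); model.matrix() builds exactly this n × (p + 1) matrix, with a leading column of 1's and one column per predictor.

# Small simulated data set with p = 2 predictors (illustrative values only)
set.seed(1)
n <- 10
x1 <- runif(n)
x2 <- runif(n)

# The design matrix: a column of 1's, then the predictors, so n x (p + 1)
X <- model.matrix(~ x1 + x2)
dim(X)    # 10 rows, 3 columns
head(X)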

3 Derivation of the Least Squares Estimator
We now wish to estimate the model by least squares. Fortunately, we did essentially all of
the necessary work last time.
The MSE is
\[
\frac{1}{n}(Y - X\beta)^T (Y - X\beta) \tag{6}
\]
with gradient
\[
\nabla_\beta \mathrm{MSE}(\beta) = \frac{2}{n}\left(-X^T Y + X^T X\beta\right). \tag{7}
\]
The estimating equation is
\[
-X^T Y + X^T X\hat{\beta} = 0 \tag{8}
\]
and the solution, the ordinary least squares (OLS) estimator, is
\[
\hat{\beta} = (X^T X)^{-1} X^T Y. \tag{9}
\]
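To connect equation (9) with software, the following sketch (simulated data with made-up coefficients) computes (X^T X)^{-1} X^T Y directly and checks that it agrees with the coefficients reported by lm(). This is only a numerical illustration, not how lm() solves the problem internally (it uses a QR decomposition).

set.seed(42)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 5 + 2*x1 + 3*x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # n x (p + 1) design matrix
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # (X'X)^{-1} X'Y
drop(beta.hat)

coefficients(lm(y ~ x1 + x2))              # should match, up to rounding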

3.1 Why Multiple Regression Isn't Just a Bunch of Simple Regressions
When we do multiple regression, the slopes we get for each variable aren’t the same as the
ones we’d get if we just did p separate simple regressions. Why not?
Suppose the real model is Y = β0 + β1 X1 + β2 X2 + ε. (Nothing turns on p = 2; it just
keeps things short.) What would happen if we did a simple regression of Y on just X1? We
know that the optimal (population) slope on X1 is
\[
\frac{\mathrm{Cov}[X_1, Y]}{\mathrm{Var}[X_1]} \tag{10}
\]
Let's substitute in the model equation for Y:
\[
\begin{aligned}
\frac{\mathrm{Cov}[X_1, Y]}{\mathrm{Var}[X_1]} &= \frac{\mathrm{Cov}[X_1, \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon]}{\mathrm{Var}[X_1]} && (11)\\
&= \frac{\beta_1 \mathrm{Var}[X_1] + \beta_2 \mathrm{Cov}[X_1, X_2] + \mathrm{Cov}[X_1, \epsilon]}{\mathrm{Var}[X_1]} && (12)\\
&= \beta_1 + \frac{\beta_2 \mathrm{Cov}[X_1, X_2] + 0}{\mathrm{Var}[X_1]} && (13)\\
&= \beta_1 + \beta_2 \frac{\mathrm{Cov}[X_1, X_2]}{\mathrm{Var}[X_1]} && (14)
\end{aligned}
\]

The total covariance between X1 and Y thus includes X1's direct contribution to Y plus an
indirect contribution: X1 is correlated with X2, and X2 contributes to Y in its own right.
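To see equation (14) in action, here is a small simulation sketch (the coefficients and the dependence between X1 and X2 are made up): when X1 and X2 are correlated, the slope from the simple regression of Y on X1 alone is pulled away from β1 by roughly β2 Cov[X1, X2]/Var[X1], while the multiple regression recovers both coefficients.

set.seed(7)
n <- 1e4
x1 <- rnorm(n)
x2 <- 0.8*x1 + rnorm(n)                 # X2 is correlated with X1
y  <- 1 + 2*x1 + 3*x2 + rnorm(n)        # true beta1 = 2, beta2 = 3

coef(lm(y ~ x1))["x1"]                  # simple regression: close to 2 + 3*0.8 = 4.4
coef(lm(y ~ x1 + x2))[c("x1", "x2")]    # multiple regression: close to 2 and 3
2 + 3*cov(x1, x2)/var(x1)               # sample version of the formula in (14)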

3.2 Point Predictions and Fitted Values
Just as with simple regression, the vector of fitted values Ŷ is linear in Y, and given by the
hat matrix:
\[
\hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY. \tag{15}
\]

All of the interpretations given of the hat matrix in the previous lecture still apply. In
particular, H projects Y onto the column space of X.
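As a quick check of equation (15), this sketch (simulated data again) forms the hat matrix explicitly and confirms that HY matches the fitted values from lm(), and that H is idempotent, as a projection must be. Building H explicitly is only reasonable for small n, since it is an n × n matrix.

set.seed(3)
n <- 50
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + x1 + 2*x2 + rnorm(n)

X <- cbind(1, x1, x2)
H <- X %*% solve(t(X) %*% X) %*% t(X)   # the hat matrix

fit <- lm(y ~ x1 + x2)
max(abs(H %*% y - fitted(fit)))         # essentially 0: HY gives the fitted values
max(abs(H %*% H - H))                   # essentially 0: H is idempotent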

4 Properties of the Estimates


As usual, we will treat X as fixed. Now

\[
\hat{\beta} = (X^T X)^{-1} X^T Y \tag{16}
\]
and
\[
Y = X\beta + \epsilon \tag{17}
\]
and so
\[
\hat{\beta} = (X^T X)^{-1} X^T X\beta + (X^T X)^{-1} X^T \epsilon = \beta + (X^T X)^{-1} X^T \epsilon. \tag{18}
\]

4.1 Bias
This is straight-forward:
\[
\begin{aligned}
E\left[\hat{\beta}\right] &= E\left[\beta + (X^T X)^{-1} X^T \epsilon\right] && (19)\\
&= \beta + (X^T X)^{-1} X^T E[\epsilon] && (20)\\
&= \beta && (21)
\end{aligned}
\]

so the least squares estimate is unbiased.
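A hedged simulation sketch of this fact (the fixed design and true coefficients below are arbitrary choices): holding X fixed and redrawing the noise many times, the average of β̂ across replications should sit close to the true β.

set.seed(10)
n <- 100
X <- cbind(1, runif(n), runif(n))        # fixed design with p = 2
beta <- c(5, 2, 3)

beta.hats <- replicate(2000, {
  y <- X %*% beta + rnorm(n)             # fresh noise on each replication
  drop(solve(t(X) %*% X, t(X) %*% y))
})
rowMeans(beta.hats)                      # close to 5, 2, 3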

4.2 Variance and Standard Errors


This needs a little more work. We have
\[
\begin{aligned}
\mathrm{Var}\left[\hat{\beta}\right] &= \mathrm{Var}\left[\beta + (X^T X)^{-1} X^T \epsilon\right] && (22)\\
&= \mathrm{Var}\left[(X^T X)^{-1} X^T \epsilon\right] && (23)\\
&= (X^T X)^{-1} X^T \,\mathrm{Var}[\epsilon]\, X (X^T X)^{-1} && (24)\\
&= (X^T X)^{-1} X^T \sigma^2 I X (X^T X)^{-1} && (25)\\
&= \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} && (26)\\
&= \sigma^2 (X^T X)^{-1} && (27)
\end{aligned}
\]

To understand this a little better, let’s re-write it slightly:
\[
\mathrm{Var}\left[\hat{\beta}\right] = \frac{\sigma^2}{n}\left(\frac{1}{n} X^T X\right)^{-1}. \tag{28}
\]

The first factor, σ²/n, is what we're familiar with from the simple linear model. As n grows,
we expect the entries of X^T X to grow in magnitude, since they are sums over all n data
points; dividing the matrix by n compensates for this. If the sample covariances between all
the predictor variables were 0, taking the inverse would give 1/s^2_{X_i} down the diagonal
(apart from the first diagonal entry, which corresponds to the intercept), just as we got
1/s^2_X in the simple linear model.
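Equation (28) is, up to replacing σ² with an estimate, what R reports through vcov(). The sketch below (simulated data, arbitrary coefficients) checks that σ̂²(X^T X)^{-1}, with σ̂² = RSS/(n − (p + 1)), reproduces vcov() for a fitted lm object.

set.seed(4)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- 5 + 2*x1 + 3*x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)
sigma2.hat <- sum(residuals(fit)^2) / (n - ncol(X))   # RSS / (n - (p + 1))

max(abs(sigma2.hat * solve(t(X) %*% X) - vcov(fit)))  # essentially 0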

5 Collinearity
We have been silently assuming that (X^T X)^{-1} exists, in other words, that X^T X is
"invertible" or "non-singular". There are a number of equivalent conditions for a matrix to
be invertible:
1. Its determinant is non-zero.

2. It is of "full column rank", meaning all of its columns are linearly independent¹.

3. It is of “full row rank”, meaning all of its rows are linearly independent.
The equivalence of these conditions is a mathematical fact, proved in linear algebra.
What does this amount to in terms of our data? It means that the columns of the design
matrix must be linearly independent in our sample. That is, there must be no set of constants
a0, a1, . . . , ap, not all zero, such that, for every row i,
\[
a_0 + \sum_{j=1}^{p} a_j x_{ij} = 0 \tag{29}
\]
This, in other words, means that X must be of full column rank.


To understand why linear dependence among variables is a problem, take an easy case,
where two predictors, say X1 and X2, are exactly equal to each other. It's then not surprising
that we don't have any way of estimating their coefficients separately. If we get one set of
predictions with coefficients β1, β2, we'd get exactly the same predictions from β1 + γ, β2 − γ,
no matter what γ might be. If there is some other exact linear relation between two variables,
we can similarly trade off their coefficients against each other, without any change in anything
we can observe. If there are exact linear relationships among more than two variables, all of
their coefficients become ill-defined.
We will come back in a few lectures to what to do when faced with collinearity. For now,
we’ll just mention a few clear situations:
1
Recall that a set of vectors is linearly independent if no nontrivial linear combination of them (one with at least one non-zero coefficient) is exactly zero.

• If n < p + 1, the data are collinear.

• If one of the predictor variables is constant, the data are collinear.

• If two of the predictor variables are proportional to each other, the data are collinear.

• If two of the predictor variables are otherwise linearly related, the data are collinear.
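To see collinearity bite in practice, here is a sketch (made-up data) where one predictor is an exact copy of another, so X^T X is singular: the explicit formula cannot be applied, and lm() responds by dropping the redundant column and reporting an NA coefficient for it.

set.seed(5)
n <- 50
x1 <- runif(n)
x2 <- x1                        # exactly collinear with x1
y <- 1 + 2*x1 + rnorm(n)

X <- cbind(1, x1, x2)
# solve(t(X) %*% X)             # would fail: the matrix is singular
coef(lm(y ~ x1 + x2))           # x2 comes back as NA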

6 R
>
> pdf("plots.pdf")
>
> n = 100
> x1 = runif(n)
> x2 = runif(n)
> x3 = runif(n)
> y = 5 + 2*x1 + 3*x2 + 7*x3 + rnorm(n)
>
> Z = cbind(x1,x2,x3,y)
>
> pairs(Z,pch=20)
>
>
>
> out = lm(y ~ x1 + x2 + x3)
>
> print(out)

Call:
lm(formula = y ~ x1 + x2 + x3)

Coefficients:
(Intercept) x1 x2 x3
4.619 2.840 2.607 7.286

>
> coefficients(out)
(Intercept) x1 x2 x3
4.618816 2.840239 2.607443 7.285716
>
> confint(out)
2.5 % 97.5 %

(Intercept) 4.017390 5.220243
x1 2.257652 3.422826
x2 2.004926 3.209960
x3 6.659326 7.912105
>
> head(fitted(out))
1 2 3 4 5 6
11.984243 11.300009 12.006982 11.556004 9.205792 11.400073
>
> head(residuals(out))
1 2 3 4 5 6
-0.3574136 0.0609240 -1.4612416 -0.3427516 -0.1730116 0.3899873
>
> summary(out)

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
Min 1Q Median 3Q Max
-1.91902 -0.59934 0.00622 0.65931 1.81582

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6188 0.3030 15.244 < 2e-16 ***
x1 2.8402 0.2935 9.677 7.34e-16 ***
x2 2.6074 0.3035 8.590 1.58e-13 ***
x3 7.2857 0.3156 23.088 < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8557 on 96 degrees of freedom


Multiple R-squared: 0.8656, Adjusted R-squared: 0.8614
F-statistic: 206.1 on 3 and 96 DF, p-value: < 2.2e-16

>
> newx = data.frame(x1 = .2, x2 = .3, x3 = .7)
>
> predict(out,newdata = newx)
1
11.0691
>

> dev.off()

[Figure: pairs plot of x1, x2, x3, and y produced by pairs(Z, pch = 20); the predictors range over [0, 1] and y over roughly 6 to 14.]