Lecture 14: Multiple Linear Regression

1 Review of Simple Linear Regression in Matrix Form
4. The noise variable $\epsilon$ has $\mathbb{E}[\epsilon|X = x] = 0$ (mean zero), $\mathrm{Var}[\epsilon|X = x] = \sigma^2$ (constant variance), and is uncorrelated across observations.
In matrix form, when we have $n$ observations,
$$ Y = X\beta + \epsilon \tag{4} $$
where $X$ is an $n \times (p+1)$ matrix of random variables whose first column is all 1's. We assume that $\mathbb{E}[\epsilon|X] = 0$ and $\mathrm{Var}[\epsilon|X] = \sigma^2 I$.
Sometimes we further assume that $\epsilon \sim MVN(0, \sigma^2 I)$, independent of $X$. From these assumptions, it follows that, conditional on $X$, $Y$ has a multivariate Gaussian distribution,
$$ Y|X \sim MVN(X\beta, \sigma^2 I). \tag{5} $$
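To make the matrix form concrete, here is a minimal R sketch (the dimensions and coefficient values are invented for illustration) that builds a design matrix with a leading column of 1's and generates $Y$ according to the model:

n <- 5; p <- 2
X <- cbind(1, matrix(runif(n * p), nrow = n))  # n x (p+1); first column all 1's
beta <- c(5, 2, 3)                             # p+1 coefficients, intercept first
epsilon <- rnorm(n)                            # E[eps|X] = 0, Var[eps|X] = sigma^2 I, here sigma = 1
Y <- X %*% beta + epsilon                      # Eq. 4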
3 Derivation of the Least Squares Estimator
We now wish to estimate the model by least squares. Fortunately, we did essentially all of
the necessary work last time.
The MSE is
$$ MSE(\beta) = \frac{1}{n} (Y - X\beta)^T (Y - X\beta) \tag{6} $$
with gradient
$$ \nabla_\beta MSE(\beta) = \frac{2}{n} \left( -X^T Y + X^T X \beta \right). \tag{7} $$
The estimating equation is
$$ -X^T Y + X^T X \hat{\beta} = 0 \tag{8} $$
and the solution, the ordinary least squares (OLS) estimator, is
$$ \hat{\beta} = (X^T X)^{-1} X^T Y. \tag{9} $$

3.1 Comparison with Simple Regression

If we instead regressed $Y$ on $X_1$ alone, the simple-regression slope would estimate
$$ \frac{\mathrm{Cov}[X_1, Y]}{\mathrm{Var}[X_1]}. \tag{10} $$
The total covariance between $X_1$ and $Y$ includes $X_1$'s direct contribution to $Y$, plus the indirect contribution through correlation with $X_2$, and $X_2$'s contribution to $Y$.
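To see the difference between Eq. 9 and Eq. 10 numerically, here is a sketch (the coefficients and the correlation structure are invented) with two correlated predictors; the matrix formula recovers both coefficients, while the simple-regression slope on x1 alone absorbs part of x2's effect:

n <- 1000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)                   # x2 is correlated with x1
y <- 1 + 2 * x1 + 3 * x2 + rnorm(n)
X <- cbind(1, x1, x2)
solve(t(X) %*% X, t(X) %*% y)               # Eq. 9: close to (1, 2, 3)
cov(x1, y) / var(x1)                        # Eq. 10: close to 2 + 3 * 0.5 = 3.5, not 2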
3.2 Point Predictions and Fitted Values
Just as with simple regression, the vector of fitted values $\hat{Y}$ is linear in $Y$, and given by the hat matrix:
$$ \hat{Y} = X\hat{\beta} = X(X^T X)^{-1} X^T Y = HY. \tag{15} $$
All of the interpretations given of the hat matrix in the previous lecture still apply. In
particular, H projects Y onto the column space of X.
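A quick sketch (with a small simulated design of my own) confirming that H is a projection and reproduces the fitted values:

n <- 50
X <- cbind(1, runif(n), runif(n))
y <- as.vector(X %*% c(5, 2, 3)) + rnorm(n)
H <- X %*% solve(t(X) %*% X) %*% t(X)          # the hat matrix, Eq. 15
all.equal(H %*% H, H)                          # idempotent: projecting twice = projecting once
all.equal(H, t(H))                             # symmetric
all.equal(as.vector(H %*% y),
          as.vector(fitted(lm(y ~ X - 1))))    # H y matches lm's fitted values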
4 Properties of the Least Squares Estimator

Recall that
$$ \hat{\beta} = (X^T X)^{-1} X^T Y \tag{16} $$
and
$$ Y = X\beta + \epsilon \tag{17} $$
and so
$$ \hat{\beta} = (X^T X)^{-1} X^T X \beta + (X^T X)^{-1} X^T \epsilon = \beta + (X^T X)^{-1} X^T \epsilon. \tag{18} $$
The second term is the sampling error: $\hat{\beta}$ differs from $\beta$ only through the noise $\epsilon$.
4.1 Bias

This is straightforward:
$$ \mathbb{E}\left[\hat{\beta}\right] = \mathbb{E}\left[\beta + (X^T X)^{-1} X^T \epsilon\right] \tag{19} $$
$$ = \beta + (X^T X)^{-1} X^T \mathbb{E}[\epsilon] \tag{20} $$
$$ = \beta \tag{21} $$
so the least squares estimator is unbiased.
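A Monte Carlo sketch of this result (sample size, coefficients, and replication count are all invented): holding $X$ fixed and re-drawing the noise, the average of the estimates should sit on top of the true $\beta$:

n <- 100
beta <- c(5, 2, 3)
X <- cbind(1, runif(n), runif(n))              # fix the design across replications
estimates <- replicate(10000, {
  y <- as.vector(X %*% beta) + rnorm(n)        # new noise each time
  as.vector(solve(t(X) %*% X, t(X) %*% y))     # the OLS estimate, Eq. 9/16
})
rowMeans(estimates)                            # approximately (5, 2, 3)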
To understand this a little better, let's re-write it slightly:
$$ \mathrm{Var}\left[\hat{\beta}\right] = \frac{\sigma^2}{n} \left( \frac{1}{n} X^T X \right)^{-1}. \tag{28} $$
The first term, $\sigma^2/n$, is what we're familiar with from the simple linear model. As $n$ grows, we expect the entries in $X^T X$ to be increasing in magnitude, since they're sums over all $n$ data points; dividing all entries in the matrix by $n$ compensates for this. If the sample covariances between all the predictor variables were 0, then when we took the inverse we'd get $1/s^2_{X_i}$ down the diagonal (except for the first diagonal entry, which corresponds to the intercept), just as we got $1/s^2_X$ in the simple linear model.
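As a check on Eq. 28, here is a sketch (simulated data with invented coefficients) comparing the formula, with $\sigma^2$ replaced by its usual unbiased estimate, against R's built-in vcov():

n <- 100
x1 <- runif(n); x2 <- runif(n)
y <- 5 + 2 * x1 + 3 * x2 + rnorm(n)
out <- lm(y ~ x1 + x2)
X <- cbind(1, x1, x2)
sigma2.hat <- sum(residuals(out)^2) / (n - 3)   # estimate of sigma^2 with p + 1 = 3 coefficients
sigma2.hat * solve(t(X) %*% X)                  # Eq. 28; should match the next line
vcov(out)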
5 Collinearity
We have been silently assuming that $(X^T X)^{-1}$ exists, in other words, that $X^T X$ is “invertible” or “non-singular”. There are a number of equivalent conditions for a matrix to be invertible:
1. Its determinant is non-zero.
2. It is of “full column rank”, meaning all of its columns are linearly independent.
3. It is of “full row rank”, meaning all of its rows are linearly independent.
The equivalence of these conditions is a mathematical fact, proved in linear algebra.
What does this amount to in terms of our data? It means that the variables must be linearly independent in our sample. That is, there must not be any set of constants $a_0, a_1, \ldots, a_p$, not all zero, such that for all rows $i$,
$$ a_0 + \sum_{j=1}^{p} a_j x_{ij} = 0. \tag{29} $$
When such constants do exist, we say the data are collinear. Some ways this can happen:
• If n < p + 1, the data are collinear.
• If two of the predictor variables are proportional to each other, the data are collinear.
• If two of the predictor variables are otherwise linearly related, the data are collinear.
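Here is a short sketch (with an exact linear dependence constructed by hand) of what collinearity looks like in R:

n <- 100
x1 <- runif(n); x2 <- runif(n)
x3 <- 2 * x1 - x2                      # exactly linearly dependent on x1 and x2
y <- 5 + 2 * x1 + 3 * x2 + rnorm(n)
X <- cbind(1, x1, x2, x3)
qr(X)$rank                             # 3, not 4: X is not of full column rank
try(solve(t(X) %*% X))                 # errors: X^T X is (numerically) singular
coef(lm(y ~ x1 + x2 + x3))             # lm() drops the redundant variable, reporting NA for x3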
6 R
>
> pdf("plots.pdf")
>
> n = 100
> x1 = runif(n)
> x2 = runif(n)
> x3 = runif(n)
> y = 5 + 2*x1 + 3*x2 + 7*x3 + rnorm(n)
>
> Z = cbind(x1,x2,x3,y)
>
> pairs(Z,pch=20)
>
>
>
> out = lm(y ~ x1 + x2 + x3)
>
> print(out)
Call:
lm(formula = y ~ x1 + x2 + x3)
Coefficients:
(Intercept) x1 x2 x3
4.619 2.840 2.607 7.286
>
> coefficients(out)
(Intercept) x1 x2 x3
4.618816 2.840239 2.607443 7.285716
>
> confint(out)
2.5 % 97.5 %
(Intercept) 4.017390 5.220243
x1 2.257652 3.422826
x2 2.004926 3.209960
x3 6.659326 7.912105
>
> head(fitted(out))
1 2 3 4 5 6
11.984243 11.300009 12.006982 11.556004 9.205792 11.400073
>
> head(residuals(out))
1 2 3 4 5 6
-0.3574136 0.0609240 -1.4612416 -0.3427516 -0.1730116 0.3899873
>
> summary(out)
Call:
lm(formula = y ~ x1 + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-1.91902 -0.59934 0.00622 0.65931 1.81582
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.6188 0.3030 15.244 < 2e-16 ***
x1 2.8402 0.2935 9.677 7.34e-16 ***
x2 2.6074 0.3035 8.590 1.58e-13 ***
x3 7.2857 0.3156 23.088 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> newx = data.frame(x1 = .2, x2 = .3, x3 = .7)
>
> predict(out,newdata = newx)
1
11.0691
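> # by-hand check: the prediction is the coefficient vector dotted with
> # c(1, newx), so this should reproduce the 11.0691 above
> sum(coefficients(out) * c(1, 0.2, 0.3, 0.7))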
>
> dev.off()
[Figure: scatterplot matrix produced by pairs(Z, pch = 20), showing all pairwise scatterplots of x1, x2, x3, and y.]