intro to regression
1 Course logistics
• Instructor: Ryan Tibshirani, TAs: Robert Lunde, Sonia Tardova
• Class outline: 55 minutes lecture, 5 minutes break, 20 minutes working through R examples
• Again sticking with mean squared error,
\[
\mathrm{MSE}(r) = E\big[(Y - r(X))^2\big] = E\Big[\, E\big[(Y - r(X))^2 \,\big|\, X\big] \Big],
\]
we can see that the optimal function is r(X) = E(Y |X), or r(x) = E(Y |X = x). We call this
the regression function; it’s what we use to predict Y from X
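One way to see this (a short added derivation): for any fixed x and any candidate value c,
\[
E\big[(Y - c)^2 \mid X = x\big]
  = E\big[(Y - E(Y \mid X = x))^2 \mid X = x\big] + \big(E(Y \mid X = x) - c\big)^2,
\]
since the cross term has conditional mean zero. The first term does not involve c, so the conditional MSE is minimized at c = E(Y |X = x), and minimizing pointwise over x gives r(x) = E(Y |X = x)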
• We’re going to assume that we can write
Y = r(X) + ε,
where ε is a random variable, called the error term, that has mean zero and is independent of
X. Check: what happens when we take expectation conditional on X = x on both sides?
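(For reference, the check written out: taking E(· |X = x) on both sides, and using that ε has mean zero and is independent of X,
\[
E(Y \mid X = x) = r(x) + E(\varepsilon \mid X = x) = r(x) + E(\varepsilon) = r(x),
\]
so the model is consistent with r being the regression function.)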
• Disclaimer: this model assumes nothing about causality! Only a predictive relationship. And
even once we are clear about this, assuming that the error ε is independent of X is not always
realistic ... why?
• Given samples (x1, y1), . . . , (xn, yn), a natural estimate of the regression function at a point x is the local average of the responses observed there,
\[
\hat{r}(x) = \frac{1}{n_x} \sum_{i \,:\, x_i = x} y_i, \qquad (1)
\]
where nx = |{i : xi = x}|. But this really only works when X takes discrete values; otherwise,
at any x, we’d typically have only one or zero xi values. Back to the cholesterol example: what
happens if we didn’t observe any person in our study who took exactly x = 31.5 mg weekly of
the drug?
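To make this concrete, here is a quick R sketch (the data are simulated and the function name r.hat is just for illustration):

# Local averaging, i.e., estimator (1): average the responses yi over the
# observations with xi equal to the query point x0. This only works when x
# takes a small number of discrete values.
set.seed(1)
x <- sample(c(10, 20, 30, 40), size = 100, replace = TRUE)  # discrete doses (mg)
y <- 50 - 0.5 * x + rnorm(100, sd = 5)                      # simulated response
r.hat <- function(x0, x, y) {
  ind <- which(x == x0)
  if (length(ind) == 0) return(NA)  # nobody observed at x0: estimate undefined
  mean(y[ind])
}
r.hat(30, x, y)    # fine: several subjects took x = 30
r.hat(31.5, x, y)  # NA: nobody took exactly 31.5 mg, so (1) breaks down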
• One way you already know to estimate the regression function, if you’re willing to use a linear
relationship: linear regression. We find the best fitting linear function, i.e., we find α, β to
minimize
\[
\mathrm{MSE}(\alpha, \beta) = E\big[(Y - \alpha - \beta X)^2\big].
\]
Taking derivatives, it is not hard to see that
\[
\beta = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}, \qquad \alpha = E(Y) - \beta\, E(X)
\]
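To fill in the calculus: setting the partial derivatives of MSE(α, β) to zero gives
\[
\frac{\partial}{\partial \alpha} E\big[(Y - \alpha - \beta X)^2\big] = -2\, E\big[Y - \alpha - \beta X\big] = 0,
\qquad
\frac{\partial}{\partial \beta} E\big[(Y - \alpha - \beta X)^2\big] = -2\, E\big[X (Y - \alpha - \beta X)\big] = 0.
\]
The first equation gives α = E(Y) − βE(X); substituting this into the second gives Cov(X, Y) = β Var(X), which yields the solution above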
• In sample, we can do one of two things: first, just plug in sample estimates to get
\[
\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})\, y_i}{\sum_{i=1}^n (x_i - \bar{x})^2},
\qquad
\hat{\alpha} = \bar{y} - \hat{\beta}\, \bar{x},
\]
where x̄ = (1/n) Σ_{i=1}^n xi and similarly for ȳ; second, find α̂, β̂ to minimize
\[
\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2.
\]
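A quick numerical check in R (simulated data, again just a sketch): the plug-in formulas agree with the coefficients from lm(), which minimizes the sum of squared errors directly.

# Plug-in estimates of the regression coefficients, compared against lm().
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n)
beta.hat  <- sum((x - mean(x)) * y) / sum((x - mean(x))^2)
alpha.hat <- mean(y) - beta.hat * mean(x)
c(alpha.hat, beta.hat)
coef(lm(y ~ x))  # same two numbers, up to numerical error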
• Let’s look at this slightly differently: to make a prediction at an arbitrary point x, we use
\[
\hat{r}(x) = \hat{\alpha} + \hat{\beta} x
           = \hat{\alpha} + \frac{\sum_{i=1}^n (x_i - \bar{x})\, y_i}{\sum_{i=1}^n (x_i - \bar{x})^2} \, x
           = \hat{\alpha} + \sum_{i=1}^n \frac{(x_i - \bar{x})\, x}{n s_x^2} \cdot y_i, \qquad (2)
\]
where we abbreviated s_x^2 = (1/n) Σ_{i=1}^n (xi − x̄)^2
• Both of our estimates so far can be written as a weighted sum of the yi values,
\[
\hat{r}(x) = \sum_{i=1}^n w(x, x_i) \cdot y_i. \qquad (3)
\]
The linear regression prediction (2) is of this form (up to the intercept α̂), with w(x, xi) = (xi − x̄)x/(n s_x^2). So is (1), with w(x, xi) = 1/nx if xi = x and 0 otherwise,
where recall nx = |{i : xi = x}|. In general, we call a regression function of the form (3) a
linear smoother (note this is so-named because it is linear in y, but need not behave linearly
as a function of x!)
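Here is a short R sketch verifying (2) on simulated data (the variable names are just for illustration): the prediction built from the smoother weights matches predict() on the fitted lm object.

# Linear regression prediction at a point x0, written as a linear smoother as
# in (2): r.hat(x0) = alpha.hat + sum_i w(x0, xi) * yi.
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n)
xbar <- mean(x)
s2x  <- mean((x - xbar)^2)               # s_x^2 = (1/n) sum (xi - xbar)^2
beta.hat  <- sum((x - xbar) * y) / (n * s2x)
alpha.hat <- mean(y) - beta.hat * xbar
x0 <- 4.2                                # an arbitrary prediction point
w  <- (x - xbar) * x0 / (n * s2x)        # smoother weights w(x0, xi)
alpha.hat + sum(w * y)                   # prediction via (2)
predict(lm(y ~ x), data.frame(x = x0))   # same value from lm()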
• Note that the weight w(x, xi) = (xi − x̄)x/(n s_x^2) does not actually depend on how far xi is
from x; this is not a problem if the true regression function is a straight line, but otherwise it
is
• Another common, more flexible linear smoother is k-nearest-neighbors regression; for this we
take
\[
w(x, x_i) = \begin{cases} 1/k & \text{if } x_i \text{ is one of the } k \text{ nearest points to } x \\ 0 & \text{otherwise,} \end{cases}
\]
like an extension of (1). In other words,
\[
\hat{r}(x) = \frac{1}{k} \sum_{i \in N_k(x)} y_i, \qquad (4)
\]
where Nk(x) contains the indices of the k points among x1, . . . , xn closest to x
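A minimal base-R implementation of (4) on simulated data (the function name knn.reg.point is just for illustration):

# k-nearest-neighbors regression at a point x0, as in (4): average yi over the
# k training points xi closest to x0.
knn.reg.point <- function(x0, xtr, ytr, k) {
  ind <- order(abs(xtr - x0))[1:k]  # indices in N_k(x0)
  mean(ytr[ind])
}
set.seed(1)
x <- runif(100, 0, 2 * pi)
y <- sin(x) + rnorm(100, sd = 0.3)
knn.reg.point(pi / 2, x, y, k = 5)   # close to sin(pi/2) = 1
knn.reg.point(pi / 2, x, y, k = 50)  # larger k: smoother but more biased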
• One natural idea for choosing k: pick the value of k that minimizes the training error,
\[
\mathrm{TrainErr}(\hat{r}) = \frac{1}{n} \sum_{i=1}^n \big(y_i - \hat{r}(x_i)\big)^2.
\]
This is way too optimistic, and doesn’t have the right shape as a function of k!
• Suppose that we have an independent test sample (x′1, y′1), (x′2, y′2), . . . , (x′m, y′m) (following the same
distribution as our training sample). We could then look at the expected test error,
\[
E\big[\mathrm{TestErr}(\hat{r})\big] = E\Big[\frac{1}{m} \sum_{i=1}^m \big(y_i' - \hat{r}(x_i')\big)^2\Big].
\]
Note that the expectation here is taken over all that is random (both training and test samples).
This really does capture what we want, and has the right behavior with k
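A small simulation sketch in R (the data-generating model and all names here are illustrative): training error keeps shrinking as k decreases, while test error is U-shaped in k.

# Training vs. test error for k-nearest-neighbors regression across several k.
knn.reg.point <- function(x0, xtr, ytr, k) mean(ytr[order(abs(xtr - x0))[1:k]])
set.seed(1)
n <- 100; m <- 1000
x  <- runif(n, 0, 2 * pi); y  <- sin(x)  + rnorm(n, sd = 0.3)  # training sample
x0 <- runif(m, 0, 2 * pi); y0 <- sin(x0) + rnorm(m, sd = 0.3)  # test sample
ks <- c(1, 2, 3, 5, 10, 20, 50)
train.err <- test.err <- numeric(length(ks))
for (j in seq_along(ks)) {
  fit.train <- sapply(x,  knn.reg.point, xtr = x, ytr = y, k = ks[j])
  fit.test  <- sapply(x0, knn.reg.point, xtr = x, ytr = y, k = ks[j])
  train.err[j] <- mean((y  - fit.train)^2)
  test.err[j]  <- mean((y0 - fit.test)^2)
}
rbind(k = ks, train = round(train.err, 3), test = round(test.err, 3))
# Training error is essentially zero at k = 1 and only grows with k; test error
# is U-shaped, smallest at a moderate k.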
• In short, underfitting leads to bad training error, while overfitting leads to good training error;
but both underfitting and overfitting lead to bad test error (which is really what we care about). Our
goal is to find the right balance. We will see soon how to do this without access to test data.
We will also see that underfitting and overfitting are related to two quantities called estimation
bias and estimation variance (or just bias and variance)
• Yet another linear smoother is given by the weights
\[
w(x, x_i) = \frac{\exp\!\big(-(x - x_i)^2 / (2h^2)\big)}{\sum_{j=1}^n \exp\!\big(-(x - x_j)^2 / (2h^2)\big)}.
\]
This is an example of kernel regression, using what’s called a Gaussian kernel. The parameter
h here is called the bandwidth; like k in k-nearest-neighbors regression, it controls the level of
adaptivity/flexibility of the fit. One reason to prefer kernel regression is that it can produce
smoother (less jagged) looking fits than k-nearest-neighbors
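A short R sketch of these weights on simulated data (again, the names are just for illustration):

# Kernel regression with a Gaussian kernel: the prediction at x0 is a weighted
# average of the yi, with weights decaying smoothly in the distance |x0 - xi|.
kernel.reg.point <- function(x0, xtr, ytr, h) {
  w <- exp(-(x0 - xtr)^2 / (2 * h^2))
  sum(w * ytr) / sum(w)  # normalized Gaussian-kernel weights
}
set.seed(1)
x <- runif(100, 0, 2 * pi)
y <- sin(x) + rnorm(100, sd = 0.3)
grid <- seq(0, 2 * pi, length.out = 200)
fit.wiggly <- sapply(grid, kernel.reg.point, xtr = x, ytr = y, h = 0.05)  # small bandwidth
fit.smooth <- sapply(grid, kernel.reg.point, xtr = x, ytr = y, h = 1)     # large bandwidth
# plot(x, y); lines(grid, fit.wiggly); lines(grid, fit.smooth, lty = 2)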