
Introduction and Regression

Advanced Methods for Data Analysis (36-402/36-608)


Spring 2014

1 Course logistics
• Instructor: Ryan Tibshirani, TAs: Robert Lunde, Sonia Tardova

• See course website: http://www.stat.cmu.edu/~ryantibs/advmethods/ for syllabus, office hours, course calendar, etc.
• 12 homeworks, 1 take-home exam, 1 in-class exam, 1 take-home final

• Class outline: 55 minutes lecture, 5 minutes break, 20 minutes working through R examples

2 Regression and prediction


2.1 Point prediction
• In the simplest regression setup, we begin with Y ∈ R, a real-valued random variable
• What is the optimal point prediction for Y? That depends on how we measure error. If we consider mean squared error,

  MSE(r) = E[(Y − r)²],

then the optimal prediction is r = E(Y)
• In practice, how would we make this point prediction? We estimate the expected value: given sample values y1, y2, . . . , yn, we take

  r̂ = (1/n) ∑_{i=1}^n yi.

• Example: suppose I'm taking a particular cholesterol medication; predict my improvement in blood cholesterol level (a short R sketch of this follows below)
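
A minimal R sketch of this point prediction; the improvement values here are simulated, purely for illustration:

  # Simulated improvements in blood cholesterol level (made-up numbers)
  set.seed(1)
  y = rnorm(100, mean = 20, sd = 5)

  # Under mean squared error, the point prediction is the sample mean,
  # which estimates E(Y)
  r.hat = mean(y)
  r.hat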

2.2 Regression function


• Point prediction has limited applicability; more often we want to predict Y based on another
variable X, called the predictor, covariate, or input. In this context Y is now often called the
response, outcome, or output
• We make our prediction for Y a function of X, denoted r(X); e.g., what will be the amount
of improvement in blood cholesterol level as a function of the dose of a particular medication
X?

• Again sticking with mean squared error,

  MSE(r) = E[(Y − r(X))²] = E[ E[(Y − r(X))² | X] ],

we can see that the optimal function is r(X) = E(Y | X), or r(x) = E(Y | X = x). We call this the regression function; it's what we use to predict Y from X
• We’re going to assume that we can write

Y = r(X) + ε,

where ε is a random variable, called the error term, that has mean zero and is independent of
X. Check: what happens when we take expectation conditional on X = x on both sides? (This is worked out briefly at the end of this subsection.)
• Disclaimer: this model assumes nothing about causality! Only a predictive relationship. And
even once we are clear about this, assuming that the error ε is independent of X is not always
realistic ... why?
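
As a quick aside, here is the check worked out, using only that ε has mean zero and is independent of X:

  E(Y | X = x) = E(r(X) + ε | X = x) = r(x) + E(ε | X = x) = r(x) + E(ε) = r(x),

so taking conditional expectation on both sides recovers exactly the regression function r(x) = E(Y | X = x).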

2.3 How to estimate the regression function?


• From samples (x1, y1), (x2, y2), . . . , (xn, yn), we could estimate the regression function via

  r̂(x) = (1/nx) ∑_{i: xi = x} yi,    (1)

where nx = |{i : xi = x}|. But this really only works when X takes discrete values; otherwise,
at any x, we'd typically have only one or zero xi values. Back to the cholesterol example: what
happens if we didn’t observe any person in our study who took exactly x = 31.5 mg weekly of
the drug?

• One way you already know to estimate the regression function, if you're willing to assume a linear relationship: linear regression. We find the best-fitting linear function, i.e., we find α, β to minimize

  MSE(α, β) = E[(Y − α − βX)²].

Taking derivatives, it is not hard to see that

  β = Cov(X, Y) / Var(X),    α = E(Y) − βE(X)

• In sample, we can do one of two things: first, just plug in sample estimates to get

  β̂ = [∑_{i=1}^n (xi − x̄) yi] / [∑_{i=1}^n (xi − x̄)²],    α̂ = ȳ − β̂ x̄,

where x̄ = (1/n) ∑_{i=1}^n xi and similarly for ȳ; second, find α̂, β̂ to minimize

  ∑_{i=1}^n (yi − α − βxi)²,

which ends up yielding the same estimates as above

• Let's look at this slightly differently: to make a prediction at an arbitrary point x, we use

  r̂(x) = α̂ + β̂x
        = α̂ + ( [∑_{i=1}^n (xi − x̄) yi] / [∑_{i=1}^n (xi − x̄)²] ) x
        = α̂ + ∑_{i=1}^n [ (xi − x̄) x / (n s²x) ] · yi,    (2)

where we abbreviated s²x = (1/n) ∑_{i=1}^n (xi − x̄)², the sample variance of the xi (a short R sketch checking these calculations follows below)
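
Here is a short R sketch, on simulated data, checking that the plug-in estimates above agree with what lm() returns; the dose-like values and coefficients are made up for illustration:

  set.seed(1)
  n = 100
  x = runif(n, 0, 60)                # e.g., weekly dose in mg
  y = 2 + 0.5*x + rnorm(n, sd = 3)   # simulated linear relationship plus noise

  # Plug-in estimates of the slope and intercept
  beta.hat  = sum((x - mean(x)) * y) / sum((x - mean(x))^2)
  alpha.hat = mean(y) - beta.hat * mean(x)

  # Least squares via lm() gives the same numbers
  coef(lm(y ~ x))
  c(alpha.hat, beta.hat)

  # Prediction at an arbitrary point x, as in (2)
  x0 = 31.5
  alpha.hat + beta.hat * x0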

2.4 Linear smoothers


• From the above, we can see that the linear regression prediction (2) is of the form

  r̂(x) = ∑_{i=1}^n w(x, xi) · yi,    (3)

with w(x, xi) = (xi − x̄)x/(n s²x). So is (1), with w(x, xi) = 1/nx if xi = x and 0 otherwise, where recall nx = |{i : xi = x}|. In general, we call a regression function of the form (3) a linear smoother (note it is so named because it is linear in y; it need not behave linearly as a function of x!)
• Note that the weight w(x, xi) = (xi − x̄)x/(n s²x) does not actually depend on how far xi is from x; this is not a problem if the true regression function is a straight line, but otherwise it is
• Another common, more flexible linear smoother is k-nearest-neighbors regression; for this we take

  w(x, xi) = 1/k if xi is one of the k nearest points to x, and 0 otherwise,

like an extension of (1). In other words,

  r̂(x) = (1/k) ∑_{i ∈ Nk(x)} yi,    (4)

where Nk(x) gives the k nearest neighbors of x (a minimal R implementation appears at the end of this subsection)


• To use this in practice, we’re going to need to choose k. What are the tradeoffs at either end?
For small k, we risk picking up noisy features of the sample that don’t really have to do with
the true regression function r(x), called overfitting; for large k, we may miss important details,
due to averaging over too many yi values, called underfitting
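
A minimal R implementation of the k-nearest-neighbors smoother (4), written directly from the definition; the function name knn.reg and the simulated data are just for illustration:

  # k-nearest-neighbors regression estimate at a single point x0
  knn.reg = function(x0, x, y, k) {
    d = abs(x - x0)          # distances from x0 (one-dimensional X)
    ind = order(d)[1:k]      # indices of the k nearest training points
    mean(y[ind])             # average their responses
  }

  # Example fit over a grid, with k = 5
  set.seed(1)
  x = runif(100)
  y = sin(2*pi*x) + rnorm(100, sd = 0.3)
  x.grid = seq(0, 1, length = 200)
  r.hat = sapply(x.grid, knn.reg, x = x, y = y, k = 5)
  plot(x, y)
  lines(x.grid, r.hat, col = "red")

Try a few values of k here: small k gives a wiggly fit, large k a nearly flat one, which is exactly the overfitting/underfitting tradeoff described above.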

2.5 Training and test errors


• How are we going to quantify this (overfitting vs. underfitting)? Let's call (x1, y1), . . . , (xn, yn), the sample of data that we used to fit r̂, our training sample. What's wrong with looking at how well we do in fitting the training points themselves, i.e., the expected training error,

  E[TrainErr(r̂)] = E[ (1/n) ∑_{i=1}^n (yi − r̂(xi))² ]?

This is way too optimistic, and doesn't have the right shape as a function of k!

• Suppose that we have an independent test sample (x′1, y′1), (x′2, y′2), . . . , (x′m, y′m) (following the same distribution as our training sample). We could then look at the expected test error,

  E[TestErr(r̂)] = E[ (1/m) ∑_{i=1}^m (y′i − r̂(x′i))² ].

Note that the expectation here is taken over all that is random (both training and test samples). This really does capture what we want, and has the right behavior with k
• In short, underfitting leads to bad training error, while overfitting leads to good training error; but both underfitting and overfitting lead to bad test error, which is really what we care about (the simulation sketch at the end of this subsection illustrates this). Our goal is to find the right balance. We will see soon how to do this without access to test data. We will also see that underfitting and overfitting are related to two quantities called estimation bias and estimation variance (or just bias and variance)
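
A simulation sketch of this point, reusing the knn.reg function from the sketch in Section 2.4; the data-generating model is made up for illustration. Training error keeps shrinking as k gets small, while test error is U-shaped in k:

  set.seed(1)
  n = 100; m = 100
  x = runif(n); y = sin(2*pi*x) + rnorm(n, sd = 0.3)                    # training sample
  x.test = runif(m); y.test = sin(2*pi*x.test) + rnorm(m, sd = 0.3)     # independent test sample

  ks = 1:30
  train.err = test.err = numeric(length(ks))
  for (j in seq_along(ks)) {
    fit.train = sapply(x,      knn.reg, x = x, y = y, k = ks[j])   # fits at training points
    fit.test  = sapply(x.test, knn.reg, x = x, y = y, k = ks[j])   # fits at test points
    train.err[j] = mean((y      - fit.train)^2)
    test.err[j]  = mean((y.test - fit.test)^2)
  }
  matplot(ks, cbind(train.err, test.err), type = "l", lty = 1,
          xlab = "k", ylab = "error")   # test error curve has a minimum at moderate k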

2.6 Smoother linear smoothers


• Let's look back at (3); recall that in k-nearest-neighbors regression, the weights w(x, xi) are either 1/k or 0, depending on xi. This will often produce jagged-looking fits. Why? Are these weights smooth over x?
• How about choosing

  w(x, xi) = exp(−(xi − x)²/(2h)) / ∑_{j=1}^n exp(−(xj − x)²/(2h)) ?

This is an example of kernel regression, using what's called a Gaussian kernel. The parameter h here is called the bandwidth; like k in k-nearest-neighbors regression, it controls the level of adaptivity/flexibility of the fit. One reason to prefer kernel regression is that it can produce smoother (less jagged) looking fits than k-nearest-neighbors (a short R sketch follows below)
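
A small R sketch of this Gaussian-kernel estimator (often called the Nadaraya-Watson estimator), using the weights defined above; the function name kern.reg, the simulated data, and the bandwidth value are all just for illustration:

  # Gaussian kernel regression estimate at a single point x0, with bandwidth h
  kern.reg = function(x0, x, y, h) {
    w = exp(-(x - x0)^2 / (2*h))   # unnormalized Gaussian kernel weights
    sum(w * y) / sum(w)            # normalized weighted average of the responses
  }

  # The kernel fit is typically smoother (less jagged) than k-nearest-neighbors
  set.seed(1)
  x = runif(100)
  y = sin(2*pi*x) + rnorm(100, sd = 0.3)
  x.grid = seq(0, 1, length = 200)
  r.hat = sapply(x.grid, kern.reg, x = x, y = y, h = 0.005)
  plot(x, y)
  lines(x.grid, r.hat, col = "blue")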
