MIT 15.097 Spring 2012, Lecture 4
Of course, we’d like an estimator with low bias and low variance.
Then if we could choose any f we want, what would we choose? Maybe we'd choose f to minimize the least squares error
$$\mathbb{E}_{x,y}\left[(y - f(x))^2\right].$$
It turns out that the $f^*$ that minimizes the above error is the conditional expectation!
Draw a picture
Proposition. $f^*(x) = \mathbb{E}_y[y \mid x]$.
Proof. Consider each x separately. For each x there's a conditional distribution on y. In other words, look at $\mathbb{E}_y[(y - f(x))^2 \mid x]$ for each x. So, pick an x. For this x, define $\bar{y}$ to be $\mathbb{E}_y[y \mid x]$. Now,
$$\mathbb{E}_y\left[(y - f(x))^2 \mid x\right] = \mathbb{E}_y\left[(y - \bar{y} + \bar{y} - f(x))^2 \mid x\right] = \mathbb{E}_y\left[(y - \bar{y})^2 \mid x\right] + (\bar{y} - f(x))^2,$$
where the cross term $2(\bar{y} - f(x))\,\mathbb{E}_y[(y - \bar{y}) \mid x]$ vanishes because $\mathbb{E}_y[(y - \bar{y}) \mid x] = 0$.
So how do we pick f(x)? Well, we can't do anything about the first term, since it doesn't depend on f(x). The best choice of f(x) is the one that minimizes the second term, which happens at $f(x) = \bar{y}$, where (recall) $\bar{y} = \mathbb{E}_y[y \mid x]$.
So we know, for each x, what to choose in order to minimize $\mathbb{E}_y[(y - f(x))^2 \mid x]$. To complete the argument, note that
$$\mathbb{E}_{x,y}\left[(y - f(x))^2\right] = \mathbb{E}_x\left[\mathbb{E}_y\left[(y - f(x))^2 \mid x\right]\right],$$
and we have found the minimum of the inside term for each x. □
Note that if we’re interested instead in the absolute loss Ex,y [|y − f (x)|], it
is possible to show that the best predictor is the conditional median, that is,
f (x) = median[y|x].
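As a quick numerical sanity check (a sketch of my own, not from the notes), we can fix a single x, draw many samples of y from some skewed conditional distribution, and compare a few constant predictions. The lognormal distribution and sample size below are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    # Samples of y for one fixed x; a skewed (lognormal) conditional distribution
    # is used so that the mean and the median differ noticeably.
    y = rng.lognormal(mean=0.0, sigma=1.0, size=200_000)

    candidates = {"mean": y.mean(), "median": np.median(y), "mean+0.5": y.mean() + 0.5}
    for name, c in candidates.items():
        sq_loss = np.mean((y - c) ** 2)    # estimate of E[(y - c)^2 | x]
        abs_loss = np.mean(np.abs(y - c))  # estimate of E[|y - c| | x]
        print(f"{name:9s} squared loss {sq_loss:.3f}   absolute loss {abs_loss:.3f}")

The mean should come out best under squared loss and the median best under absolute loss, matching the two claims above.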
We have a training set $S = (x_1, y_1), \ldots, (x_m, y_m)$, where each example is drawn i.i.d. from $\mathcal{D}$. We want to use S to learn a function $f_S : \mathcal{X} \to \mathcal{Y}$.
First, let's consider some learning algorithm (which produced $f_S$) and its expected prediction error:
$$\mathbb{E}_{x,y,S}\left[(y - f_S(x))^2\right].$$
Remember that the estimator $f_S$ is random, since it depends on the randomly drawn training data. Here, the expectation is taken with respect to a new randomly drawn point $(x, y) \sim \mathcal{D}$ and training data $S \sim \mathcal{D}^m$.
Define the average prediction $\bar{f}(x) = \mathbb{E}_S[f_S(x)]$. In other words, to get this value, we'd get infinitely many training sets, run the learning algorithm on all of them to get infinitely many predictions $f_S(x)$ for each x. Then, for each x, we'd average the predictions to get $\bar{f}(x)$.
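To make $\bar{f}(x)$ concrete, here is a small Monte Carlo sketch (my own illustration; the data-generating distribution, training set size, and learning algorithm are arbitrary choices): we repeatedly draw training sets, fit a least-squares line to each, and average the resulting predictions at one fixed test point.

    import numpy as np

    rng = np.random.default_rng(1)
    m = 20          # training set size
    n_sets = 5_000  # number of independent training sets S
    x0 = 0.25       # fixed test point at which we look at f_S(x0)

    preds = np.empty(n_sets)
    for i in range(n_sets):
        # Synthetic D: x uniform on [0, 1], y = sin(2*pi*x) + Gaussian noise.
        x = rng.uniform(0.0, 1.0, size=m)
        y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=m)
        # Learning algorithm: ordinary least squares fit of a straight line.
        slope, intercept = np.polyfit(x, y, deg=1)
        preds[i] = slope * x0 + intercept  # f_S(x0) for this training set

    print("estimate of f_bar(x0):      ", preds.mean())  # average prediction
    print("estimate of var_S(f_S(x0)): ", preds.var())   # variance of the estimator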
Now fix x and, with $\bar{y} = \mathbb{E}_y[y \mid x]$ as before, expand:
$$\mathbb{E}_{y,S}\left[(y - f_S(x))^2\right] = \mathbb{E}_y\left[(y - \bar{y})^2\right] + \mathbb{E}_S\left[(\bar{y} - f_S(x))^2\right] + 2\,\mathbb{E}_{y,S}\left[(y - \bar{y})(\bar{y} - f_S(x))\right].$$
The third term here is zero, since $\mathbb{E}_{y,S}[(y - \bar{y})(\bar{y} - f_S(x))] = \mathbb{E}_y[y - \bar{y}]\,\mathbb{E}_S[\bar{y} - f_S(x)]$ (the new point and the training set are independent), and the first factor is $\mathbb{E}_y[y - \bar{y}] = 0$.
The first term is the variance of y around its mean. We don't have control over that when we choose $f_S$. This term is zero if y is deterministically related to x.
Next, expand the second term:
$$\mathbb{E}_S\left[(\bar{y} - f_S(x))^2\right] = \mathbb{E}_S\left[(\bar{y} - \bar{f}(x) + \bar{f}(x) - f_S(x))^2\right] = (\bar{y} - \bar{f}(x))^2 + \mathbb{E}_S\left[(\bar{f}(x) - f_S(x))^2\right] + 2(\bar{y} - \bar{f}(x))\,\mathbb{E}_S\left[\bar{f}(x) - f_S(x)\right].$$
The last term is zero, since $(\bar{y} - \bar{f}(x))$ is a constant and $\bar{f}(x)$ is the mean of $f_S(x)$ with respect to S, so $\mathbb{E}_S[\bar{f}(x) - f_S(x)] = 0$. Also, the first term isn't random: it's just $(\bar{y} - \bar{f}(x))^2$.
Putting the pieces together,
$$\mathbb{E}_{y,S}\left[(y - f_S(x))^2\right] = \mathbb{E}_y\left[(y - \bar{y})^2\right] + \mathbb{E}_S\left[(f_S(x) - \bar{f}(x))^2\right] + (\bar{f}(x) - \bar{y})^2.$$
In this expression, the second term is the variance of our estimator around its mean; it controls how our prediction varies around the average prediction. The third term is the bias squared, where the bias is the difference between the average prediction and the true conditional mean.
Theorem. For each fixed x,
$$\mathbb{E}_{y,S}\left[(y - f_S(x))^2\right] = \mathrm{var}_{y|x}(y) + \mathrm{var}_S(f_S(x)) + \mathrm{bias}(f_S(x))^2.$$
So
$$\mathbb{E}_{x,y,S}\left[(y - f_S(x))^2\right] = \mathbb{E}_x\left[\mathrm{var}_{y|x}(y) + \mathrm{var}_S(f_S(x)) + \mathrm{bias}(f_S(x))^2\right].$$
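The decomposition can also be checked numerically (again a sketch of my own with arbitrary distributional choices): simulate many training sets from a known distribution, fit the same learning algorithm each time, and compare a Monte Carlo estimate of the left-hand side at one fixed x with the sum of the noise variance, the estimator variance, and the squared bias.

    import numpy as np

    rng = np.random.default_rng(2)
    m, n_sets, x0, noise_sd = 20, 20_000, 0.25, 0.3

    def true_mean(x):
        return np.sin(2 * np.pi * x)  # plays the role of y_bar = E[y | x]

    preds = np.empty(n_sets)
    sq_errors = np.empty(n_sets)
    for i in range(n_sets):
        x = rng.uniform(0.0, 1.0, size=m)
        y = true_mean(x) + rng.normal(scale=noise_sd, size=m)
        coeffs = np.polyfit(x, y, deg=1)                    # learning algorithm: fit a line
        preds[i] = np.polyval(coeffs, x0)                   # f_S(x0)
        y_new = true_mean(x0) + rng.normal(scale=noise_sd)  # fresh y drawn at x0
        sq_errors[i] = (y_new - preds[i]) ** 2

    lhs = sq_errors.mean()                         # estimate of E_{y,S}[(y - f_S(x0))^2]
    noise_var = noise_sd ** 2                      # var_{y|x}(y)
    est_var = preds.var()                          # var_S(f_S(x0))
    bias_sq = (preds.mean() - true_mean(x0)) ** 2  # bias(f_S(x0))^2
    print("left-hand side:      ", lhs)
    print("noise + var + bias^2:", noise_var + est_var + bias_sq)

Up to Monte Carlo error, the two printed numbers should agree.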
Question: Intuitively, what happens to the second term if $f_S$ fits the data perfectly every time (overfitting)?
Question: Intuitively, what happens to the last two terms if $f_S$ is a flat line every time?
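One way to build intuition for both questions (a sketch of my own, reusing the synthetic setup above) is to compare two extreme learning rules: a 1-nearest-neighbor rule, which fits every training point exactly, and a rule that always predicts a flat line (the sample mean of y).

    import numpy as np

    rng = np.random.default_rng(3)
    m, n_sets, x0, noise_sd = 20, 5_000, 0.25, 0.3
    true_mean = lambda x: np.sin(2 * np.pi * x)

    def predictions(fit_and_predict):
        preds = np.empty(n_sets)
        for i in range(n_sets):
            x = rng.uniform(0.0, 1.0, size=m)
            y = true_mean(x) + rng.normal(scale=noise_sd, size=m)
            preds[i] = fit_and_predict(x, y)
        return preds

    # "Overfit": 1-nearest-neighbor, which reproduces the training data exactly.
    overfit = predictions(lambda x, y: y[np.argmin(np.abs(x - x0))])
    # "Flat line": predict the sample mean of y, regardless of x.
    flat = predictions(lambda x, y: y.mean())

    for name, p in [("1-NN (interpolates)", overfit), ("flat line", flat)]:
        print(name, " variance:", p.var(), " bias^2:", (p.mean() - true_mean(x0)) ** 2)

Typically the interpolating rule shows the larger variance, while the flat line shows a large squared bias.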
The bottom line: In order to predict well, you need to strike a balance between
bias and variance.
• The variance term controls wiggliness, so you'll want to choose simple functions that can't yield predictions that are too varied.
• The bias term controls how close the average model prediction is to the truth, $\bar{y}$. You'll need to pay attention to the data in order to reduce the bias term.
• Since you can't calculate either the bias or the variance term, what we usually do is just impose some "structure" on the functions we're fitting with, so that the class of functions we are working with is small (e.g., low-degree polynomials). We then try to fit the data well using those functions. Hopefully this strikes the right balance of wiggliness (variance) and capturing the mean of the data (bias); a short numerical sketch after this list illustrates the idea.
• One thing we like to do is make assumptions on the distribution $\mathcal{D}$, or at least on the class of functions that might be able to fit well. Those assumptions each lead to a different algorithm (i.e., model). How well the algorithm works depends on how well the assumption holds.
• Even when we're not working with the least squares error, we hope a similar idea holds (and we will work on proving that later in the course). We'll use the same type of idea, where we impose some structure and hope it reduces wiggliness while still giving accurate predictions.
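The sketch below (my own illustration, with arbitrary distributional choices) makes the "impose structure" idea concrete: as the polynomial degree grows, the function class gets richer, the squared bias at a fixed point shrinks, and the variance grows.

    import numpy as np

    rng = np.random.default_rng(4)
    m, n_sets, x0, noise_sd = 20, 5_000, 0.25, 0.3
    true_mean = lambda x: np.sin(2 * np.pi * x)

    for degree in [0, 1, 3, 9]:  # amount of structure: smaller degree = smaller class
        preds = np.empty(n_sets)
        for i in range(n_sets):
            x = rng.uniform(0.0, 1.0, size=m)
            y = true_mean(x) + rng.normal(scale=noise_sd, size=m)
            preds[i] = np.polyval(np.polyfit(x, y, deg=degree), x0)
        bias_sq = (preds.mean() - true_mean(x0)) ** 2
        print(f"degree {degree}:  bias^2 {bias_sq:.4f}   variance {preds.var():.4f}")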
MIT OpenCourseWare
http://ocw.mit.edu
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.