MIT15 097S12 Lec04

Bias/Variance Tradeoff

A parameter is some quantity about a distribution that we would like to know.


We’ll estimate the parameter θ using an estimator θ̂. The bias of estimator θ̂ for parameter θ is defined as:

• Bias(θ̂, θ) := E(θ̂) − θ, where the expectation is with respect to the distribution that θ̂ is constructed from.

An estimator whose bias is 0 is called unbiased. Contrast bias with:

• Var(θ̂) := E[(θ̂ − E(θ̂))²].

[Figure: four dartboard panels illustrating the combinations of low/high bias with low/high variance. Image by MIT OpenCourseWare.]

Of course, we’d like an estimator with low bias and low variance.
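For concreteness, here is a small numerical sketch of these two definitions (my addition, not from the notes). The sample variance with divisor n is a biased estimator of the population variance, while the Bessel-corrected version with divisor n − 1 is unbiased; simulation recovers both facts:

```python
import numpy as np

# Estimate the bias of two estimators of the variance of N(0, 2^2),
# whose true value is sigma^2 = 4, by simulating many samples of size n = 5.
rng = np.random.default_rng(0)
true_var, n = 4.0, 5

samples = rng.normal(0.0, 2.0, size=(200_000, n))
biased = samples.var(axis=1)            # divisor n
unbiased = samples.var(axis=1, ddof=1)  # divisor n - 1 (Bessel's correction)

bias_biased = biased.mean() - true_var      # theory: -sigma^2 / n = -0.8
bias_unbiased = unbiased.mean() - true_var  # theory: 0
```

Note that the biased estimator also has strictly lower variance (it is the unbiased one scaled by (n − 1)/n), a first hint of the tradeoff this lecture is about.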

A little bit of decision theory


(The following is based on notes by David McAllester.)

Let’s say our data come from some distribution D on X × Y, where Y ⊂ R. Usually we don’t know D (we only have data), but for the moment let’s say we know it. We want to learn a function f : X → Y.

Then if we could choose any f we want, what would we choose? Maybe we’d
choose f to minimize the least squares error:

Ex,y∼D[(y − f(x))²].

It turns out that the f* that minimizes the above error is the conditional expectation!

Draw a picture

Proposition.
f*(x) = Ey[y|x].
Proof. Consider each x separately. For each x there’s a marginal distribution on y. In other words, look at Ey[(y − f(x))²|x] for each x. So, pick an x. For this x, define ȳ to be Ey[y|x]. Now,

Ey[(y − f(x))²|x]
= Ey[(y − ȳ + ȳ − f(x))²|x]
= Ey[(y − ȳ)²|x] + Ey[(ȳ − f(x))²|x] + 2Ey[(y − ȳ)(ȳ − f(x))|x]
= Ey[(y − ȳ)²|x] + (ȳ − f(x))² + 2(ȳ − f(x))Ey[(y − ȳ)|x]
= Ey[(y − ȳ)²|x] + (ȳ − f(x))²

where the last step follows from the definition of ȳ.

So how do we pick f(x)? Well, we can’t do anything about the first term; it doesn’t depend on f(x). The best choice of f(x) minimizes the second term, which happens at f(x) = ȳ, where remember ȳ = Ey[y|x].

So we know for each x what to choose in order to minimize Ey[(y − f(x))²|x]. To complete the argument, note that:

Ex,y[(y − f(x))²] = Ex[Ey[(y − f(x))²|x]]

and we have found the minimum of the inside term for each x. •

Note that if we’re interested instead in the absolute loss Ex,y [|y − f (x)|], it
is possible to show that the best predictor is the conditional median, that is,
f (x) = median[y|x].
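A quick simulation (my addition, not part of the notes) illustrates both claims at once. For a skewed distribution of y at a fixed x, where mean and median differ, a grid search over constant predictions finds that squared loss is minimized near the mean while absolute loss is minimized near the median:

```python
import numpy as np

rng = np.random.default_rng(1)
# A skewed conditional distribution of y at some fixed x:
# Exponential(1) has mean 1 and median ln 2 ≈ 0.693, so the two optima differ.
y = rng.exponential(scale=1.0, size=100_000)

candidates = np.linspace(0.0, 2.0, 401)
sq_loss = [np.mean((y - c) ** 2) for c in candidates]
abs_loss = [np.mean(np.abs(y - c)) for c in candidates]

best_sq = candidates[np.argmin(sq_loss)]    # ≈ mean(y)
best_abs = candidates[np.argmin(abs_loss)]  # ≈ median(y)
```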

Back to Bias/Variance Decomposition


Let’s think about the situation where we create our function f using data S. Why would we do that? Because, of course, in practice we don’t know D; we only have data.

We have a training set S = (x1, y1), . . . , (xm, ym), where each example is drawn iid from D. We want to use S to learn a function fS : X → Y.

We want to know what the error of fS is on average. In other words, we want to know what Ex,y,S[(y − fS(x))²] is. That will help us figure out how to minimize it. This is going to be a neat result: it’s going to decompose into bias and variance terms!

First, let’s consider some learning algorithm (which produced fS) and its expected prediction error:
Ex,y,S[(y − fS(x))²].
Remember that the estimator fS is random, since it depends on the randomly drawn training data. Here, the expectation is taken with respect to a new randomly drawn point x, y ∼ D and training data S ∼ D^m.

Let us define the mean prediction of the algorithm at point x to be:

f̄(x) = ES[fS(x)].

In other words, to get this value, we’d draw infinitely many training sets and run the learning algorithm on each of them, giving infinitely many predictions fS(x) for each x. Then for each x we’d average those predictions to get f̄(x).

We can now decompose the error, at a fixed x, as follows:

Ey,S[(y − fS(x))²]
= Ey,S[(y − ȳ + ȳ − fS(x))²]
= Ey(y − ȳ)² + ES(ȳ − fS(x))² + 2Ey,S[(y − ȳ)(ȳ − fS(x))].

The third term here is zero, since Ey,S[(y − ȳ)(ȳ − fS(x))] = Ey(y − ȳ)ES(ȳ − fS(x)), and the first part of that is Ey(y − ȳ) = 0.

The first term is the variance of y around its mean. We don’t have control over
that when we choose fS . This term is zero if y is deterministically related to x.

Let’s look at the second term:

ES(ȳ − fS(x))²
= ES(ȳ − f̄(x) + f̄(x) − fS(x))²
= ES(ȳ − f̄(x))² + ES(f̄(x) − fS(x))² + 2ES[(ȳ − f̄(x))(f̄(x) − fS(x))]

The last term is zero, since (ȳ − f̄(x)) is a constant, and f̄(x) is the mean of fS(x) with respect to S. Also the first term isn’t random: it’s just (ȳ − f̄(x))².

Putting things together, what we have is this (reversing some terms):

Ey,S[(y − fS(x))²] = Ey(y − ȳ)² + ES(f̄(x) − fS(x))² + (ȳ − f̄(x))².

In this expression, the second term is the variance of our estimator around its mean: it controls how our predictions vary around the average prediction. The third term is the bias squared, where the bias is the difference between the average prediction and the true conditional mean.

We’ve just proved the following:

Theorem.

For each fixed x, Ey,S[(y − fS(x))²] = var_{y|x}(y) + var_S(fS(x)) + bias(fS(x))².

So
Ex,y,S[(y − fS(x))²] = Ex[var_{y|x}(y) + var_S(fS(x)) + bias(fS(x))²].

That is the bias-variance decomposition.
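The theorem can be checked numerically. Below is a Monte Carlo sketch (my own example, with made-up choices: true conditional mean sin(2πx), Gaussian noise, and least-squares lines fit by np.polyfit as the learning algorithm). For a fixed x0, it estimates all four quantities and confirms that the expected squared error matches noise + variance + bias²:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, m, x0 = 0.5, 20, 0.8
f_star = lambda x: np.sin(2 * np.pi * x)  # true conditional mean E[y|x]

# Draw many training sets S ~ D^m, fit a line to each, predict at x0.
preds = []
for _ in range(20_000):
    xs = rng.uniform(0.0, 1.0, size=m)
    ys = f_star(xs) + rng.normal(0.0, sigma, size=m)
    slope, intercept = np.polyfit(xs, ys, 1)  # fS: least-squares line
    preds.append(slope * x0 + intercept)      # fS(x0)
preds = np.array(preds)

noise = sigma ** 2                          # var_{y|x}(y)
variance = preds.var()                      # var_S(fS(x0))
bias_sq = (preds.mean() - f_star(x0)) ** 2  # bias(fS(x0))^2

# Direct estimate of E_{y,S}[(y - fS(x0))^2] using fresh draws of y at x0:
y_new = f_star(x0) + rng.normal(0.0, sigma, size=preds.size)
mse = np.mean((y_new - preds) ** 2)
# mse should closely match noise + variance + bias_sq
```

In this setup a line badly underfits the sinusoid, so the squared-bias term stays large no matter how much training data we use; only the variance term shrinks with m.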

The “Bias-Variance” tradeoff: we want to choose fS to balance the second and third terms in order to achieve the lowest MSE. We can’t just minimize one or the other; it needs to be a balance. Sometimes, if you are willing to inject some bias, this can allow you to substantially reduce the variance, e.g., by modeling with lower degree polynomials rather than higher degree polynomials.
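The polynomial example can be made concrete with a small experiment (my own illustration, with arbitrary choices of target function and noise level): fitting degree-1 versus degree-5 polynomials to noisy samples of a sinusoid. At a test point, the low-degree fit has high bias and low variance; the higher-degree fit reverses this:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, m, x0 = 0.5, 30, 0.8
f_star = lambda x: np.sin(2 * np.pi * x)  # true conditional mean E[y|x]

results = {}
for deg in (1, 5):
    preds = []
    for _ in range(10_000):  # many training sets S ~ D^m
        xs = rng.uniform(0.0, 1.0, size=m)
        ys = f_star(xs) + rng.normal(0.0, sigma, size=m)
        preds.append(np.polyval(np.polyfit(xs, ys, deg), x0))
    preds = np.array(preds)
    results[deg] = {"bias_sq": (preds.mean() - f_star(x0)) ** 2,
                    "variance": preds.var()}

# Degree 1 underfits (high bias, low variance); degree 5 tracks the
# sinusoid (low bias) but its predictions vary more across training sets.
```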

Question: Intuitively, what happens to the second term if fS fits the data perfectly every time (overfitting)?

Question: Intuitively, what happens to the last two terms if fS is a flat line
every time?

The bottom line: In order to predict well, you need to strike a balance between
bias and variance.

• The variance term controls wiggliness, so you’ll want to choose simple functions that can’t yield predictions that are too varied.
• The bias term controls how close the average model prediction is to the truth, ȳ. You’ll need to pay attention to the data in order to reduce the bias term.
• Since you can’t calculate either the bias or the variance term, what we usually do is just impose some “structure” on the functions we’re fitting with, so the class of functions we are working with is small (e.g., low degree polynomials). We then try to fit the data well using those functions. Hopefully this strikes the right balance of wiggliness (variance) and capturing the mean of the data (bias).
• One thing we like to do is make assumptions on the distribution D, or at least on the class of functions that might be able to fit well. Those assumptions each lead to a different algorithm (i.e., model). How well the algorithm works depends on how true the assumption is.
• Even when we’re not working with least squares error, we hope a similar
idea holds (and will work on proving that later in the course). We’ll use
the same type of idea, where we impose some structure, and hope it reduces
wiggliness and will still give accurate predictions.

Go back to the other notes!

MIT OpenCourseWare
http://ocw.mit.edu

15.097 Prediction: Machine Learning and Statistics


Spring 2012

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
