Bias Variance Tradeoff
Bias Variance Tradeoff
f ∶ X → Y,
2 / 71
A Theoretical Framework
Player | Height | Weight | Yrs Expr | 2 Points | 3 Points | Salary
1      | ...    | ...    | ...      | ...      | ...      | ...
2      | ...    | ...    | ...      | ...      | ...      | ...
3      | ...    | ...    | ...      | ...      | ...      | ...
...    | ...    | ...    | ...      | ...      | ...      | ...
● We have a dataset D = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )}, where xi
is the feature vector for the i-th player and yi is his/her salary.
● From this data, we want to obtain a fitted model, known as a
hypothesis model fˆ():
fˆ ∶ X → Y
that approximates the unknown target function f ().
● To find fˆ(), we consider a set of candidate models, known as a
hypothesis set H = {h1 , h2 , . . . , hm }.
● The selected hypothesis h∗ ∈ H will be the one used as the
final model fˆ.
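As a toy illustration of this setup (the numbers and the two candidate models below are made up for the sketch, not taken from the slides), one can encode a small dataset D, a hypothesis set H, and pick the hypothesis with the smallest average squared error on D:

import numpy as np

# Hypothetical data: each row of X holds a player's features, y is the salary.
X = np.array([[198.0, 4.0], [185.0, 7.0], [210.0, 2.0]])   # (height cm, years)
y = np.array([2.1, 3.5, 1.2])                              # salaries (made up)

# A tiny hypothesis set H = {h1, h2}: two hand-picked candidate models.
H = [
    lambda x: 0.01 * x[:, 0],     # h1: predict salary from height only
    lambda x: 0.5 * x[:, 1],      # h2: predict salary from experience only
]

# Select h* with the smallest mean squared error on D; h* plays the role of f̂.
errors = [np.mean((h(X) - y) ** 2) for h in H]
f_hat = H[int(np.argmin(errors))]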
3 / 71
A Theoretical Framework
6 / 71
A Theoretical Framework
● The final model fˆ is selected by the learning algorithm from the set
of hypothesis models.
● Ideally, fˆ should be a good approximation of the target function f .
7 / 71
Types of Predictions
8 / 71
Two Types of Predictions
● With two types of data points, we have two corresponding types of
predictions:
1 predictions ŷi of observed/seen values xi
2 predictions ŷ0 of unobserved/unseen values x0
● The predictions of observed data ŷi involve the memorizing aspect.
● The predictions of unobserved data ŷ0 involve the generalization
aspect.
● We are interested in the second type of prediction: we want to
find models whose predictions ŷ0 are as accurate as possible
for the real value y0 .
● Having good predictions ŷi of observed data is often a necessary
condition for a good model, but not a sufficient condition.
● Sometimes we can fit the observed data perfectly, yet perform
terribly on unobserved data x0 .
9 / 71
Error Measure
● We need a way to measure the accuracy of the predictions.
● We need some mechanism to quantify how different the fitted
model fˆ() is from the target function f (): the total amount of
error, or Overall Measure of Error, E(fˆ, f ).
● The overall measure of error is defined in terms of individual errors
erri (ŷi , yi ) that quantify the difference between an observed value
yi and its predicted value ŷi .
10 / 71
Individual Errors
11 / 71
Two Types of Errors
● In machine learning, the overall measures of error are known as the
cost functions or risks.
● There are two types of overall error measures, based on the type of
data that is used to assess the individual errors.
1 In-sample Error, denoted Ein , is the average of individual errors
from data points of the in-sample data Din :
Ein (fˆ, f ) = (1/n) ∑ᵢ errᵢ
2 Out-of-sample Error, denoted Eout , is the theoretical mean, or
expected value, of the individual errors over the entire input space:
Eout (fˆ, f ) = EX [err(fˆ(x), f (x))]
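A minimal sketch of the two error measures, assuming squared individual errors and a simulated setting where the target f is known (in practice Eout can only be estimated, e.g., on held-out data):

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)            # target function (known here by simulation)
f_hat = lambda x: 0.8 * x                  # some fitted hypothesis

x_in = rng.uniform(-1, 1, size=20)         # in-sample points
E_in = np.mean((f_hat(x_in) - f(x_in)) ** 2)

x_out = rng.uniform(-1, 1, size=100_000)   # large fresh sample approximating E_X[·]
E_out = np.mean((f_hat(x_out) - f(x_out)) ** 2)
print(E_in, E_out)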
14 / 71
Probability Perspective
Eout (fˆ) ≈ 0  ⇒   Ein (fˆ) ≈ 0              (practical result)
                   Eout (fˆ) ≈ Ein (fˆ)       (theoretical result)
15 / 71
Noisy Targets
16 / 71
Noisy Targets
18 / 71
Estimation
● Assume that we can draw multiple random samples, all of the same
size m, from the population.
● For each sample, we compute a statistic θ̂.
21 / 71
Sampling Estimators
24 / 71
Distribution of Estimators
● The estimator has expected value E(θ̂) with finite variance var(θ̂).
29 / 71
MSE of an Estimator
30 / 71
Cases of Bias and Variance
Depending on the type of estimator θ̂ and the sample size m, we can
get statistics having different behaviors.
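As a concrete sketch (the population, the statistic, and the sample size below are illustrative assumptions): repeatedly draw samples of size m, compute a deliberately biased statistic θ̂ on each, and check that MSE(θ̂) ≈ var(θ̂) + bias(θ̂)²:

import numpy as np

rng = np.random.default_rng(0)
m, n_samples = 10, 100_000
theta = 4.0                                    # true population variance (sigma = 2)

# Many samples of size m; on each, the plug-in variance estimator (divide by m).
samples = rng.normal(loc=0.0, scale=2.0, size=(n_samples, m))
theta_hat = samples.var(axis=1, ddof=0)        # biased downward by theta/m

bias = theta_hat.mean() - theta
variance = theta_hat.var()
mse = np.mean((theta_hat - theta) ** 2)
print(bias, variance, mse, variance + bias ** 2)   # last two should roughly agree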
31 / 71
Theoretical Framework of Supervised Learning
● The main goal is to find a model fˆ that approximates well the
target function f , i.e., fˆ ≈ f .
● We want to find a model fˆ that gives good predictions on both
types of data points:
1 in-sample data ŷi = fˆ(xi ), where (xi , yi ) ∈ Din
2 out-of-sample data ŷ0 = fˆ(x0 ), where (x0 , y0 ) ∈ Dout
● Two types of predictions involve two types of errors:
1 in-sample error: Ein (fˆ)
2 out-of-sample error: Eout (fˆ)
● To have fˆ ≈ f , we need to achieve two goals:
1 small in-sample error: Ein (fˆ) ≈ 0
2 out-of-sample error similar to in-sample error: Eout (fˆ) ≈ Ein (fˆ)
● We will study the theoretical behavior of Eout (fˆ) from a regression
perspective with Mean Squared Error (MSE).
32 / 71
Motivation Example
● Consider a noiseless target function:
f (x) = sin (πx)
with the input variable x ∈ [−1, 1].
33 / 71
Motivation Example - Two Hypotheses
● Given a dataset of n data points, we fit the data using one of two
hypothesis spaces H0 and H1 :
● H0 : the set of all lines of the form h(x) = b
● H1 : the set of all lines of the form h(x) = b0 + b1 x
34 / 71
Learning from two points
35 / 71
Learning from two points
● We repeat the following 500 times: randomly sample two points in the
interval [−1, 1] and fit both models h0 and h1 (a simulation sketch
follows below).
● The average hypotheses h̄0 and h̄1 are displayed in orange.
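A simulation sketch of this experiment (the random seed and sampling details are assumptions): draw 500 two-point samples from f (x) = sin (πx) on [−1, 1], fit a constant h0 and a line h1 to each, and average the fitted coefficients to obtain h̄0 and h̄1.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)

h0_consts, h1_coefs = [], []
for _ in range(500):
    x = rng.uniform(-1, 1, size=2)
    y = f(x)
    h0_consts.append(y.mean())                     # H0: best constant is the mean of y
    slope = (y[1] - y[0]) / (x[1] - x[0])          # H1: line through the two points
    h1_coefs.append((y[0] - slope * x[0], slope))  # (intercept b0, slope b1)

h0_bar = np.mean(h0_consts)                        # ≈ 0: the horizontal line y = 0
h1_bar = np.mean(h1_coefs, axis=0)                 # average (intercept, slope)
print(h0_bar, h1_bar)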
36 / 71
Learning from two points
● For H0 models: if we average all 500 fitted models, we get h̄0 ,
which corresponds to the horizontal line y = 0.
● All the individual fitted lines have the same slope as the average
hypothesis (namely zero), but different intercepts.
● The class of H0 models has low variance and high bias.
37 / 71
Learning from two points
● For H1 models: if we average all 500 fitted models, we get h̄1
which corresponds to the orange line with positive slope.
● The individual fitted lines have all sorts of slopes: negative, close
to zero, zero, positive, etc. There is a substantial amount of
variability between the average hypothesis h̄1 and the form of any
single fit h1 .
38 / 71
Learning from two points
● For H1 models: the fact that the average hypothesis h̄1 has a
positive slope means that the majority of fitted lines also have
positive slopes.
● The average hypothesis somewhat matches the overall trend of the
target function f () around x ∈ [−0.5, 0.5].
● The class of H1 models has high variance and low bias.
39 / 71
Bias-Variance Derivation
40 / 71
Out-of-Sample Predictions
● To consider the MSE as a theoretical expected value (i.e., not an
empirical average), we suppose the existence of an out-of-sample
data point x0 .
● Given a learning data set D of n points, a hypothesis space H of
hypotheses h(x)’s, the expectation of the Squared Error for a
given out-of-sample point x0 over all possible learning sets is:
ED [(h(D) (x0 ) − f (x0 ))²]
41 / 71
Out-of-Sample Predictions
● We consider the average hypothesis h̄(x0 ) that plays the role of
µθ̂ = E(θ̂):
h̄(x0 ) = ED [h(D) (x0 )]
● We can expand the error for a given out-of-sample point x0 by writing
h(D) (x0 ) − f (x0 ) = a + b, with a = h(D) (x0 ) − h̄(x0 ) and b = h̄(x0 ) − f (x0 ):
42 / 71
Out-of-Sample Predictions
● ED [a²] = ED [(h(D) − h̄)²] = Variance(h)
● ED [b²] = ED [(h̄ − f )²] = Bias²(h)
● We therefore have: ED [(h(D) (x0 ) − f (x0 ))²] = Variance(h) + Bias²(h)
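For completeness, a short sketch of the algebra behind this decomposition, using a = h(D) (x0 ) − h̄(x0 ) and b = h̄(x0 ) − f (x0 ) as above:

\[
E_D\big[(h^{(D)}(x_0) - f(x_0))^2\big]
  = E_D\big[(a + b)^2\big]
  = E_D[a^2] + 2\, b\, E_D[a] + b^2
  = E_D[a^2] + b^2 ,
\]

since b does not depend on D and E_D[a] = E_D[h^{(D)}(x_0)] − h̄(x_0) = 0, so the cross term vanishes.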
43 / 71
Noisy Targets
y = f (x) + ϵ
44 / 71
Types of Theoretical MSEs
1 A MSE that involves a single out-of-sample point x0 , measuring the
performance of a class of hypotheses h ∈ H over multiple learning
data sets D: the expected test MSE.
ED [(h(D) (x0 ) − f (x0 ))²]
45 / 71
Types of Theoretical MSEs
46 / 71
The Bias-Variance Tradeoff Picture
47 / 71
The Bias-Variance Tradeoff Picture
49 / 71
The Bias-Variance Tradeoff Picture
50 / 71
Bias
● The bias term involves h̄(x) − f (x). The average h̄ comes from a
hypothesis class H (e.g., constant models, linear models, etc.).
● h̄ is a prototypical example of a certain class of hypotheses.
● The bias term measures how well a hypothesis class H
approximates the target function f .
51 / 71
Variance
52 / 71
Tradeoff
● A model should have both small variance and small bias. But,
there is a tradeoff between these two (Bias-Variance tradeoff).
● To actually perform the bias-variance decomposition, we need access to
h̄. But computing h̄ requires fitting the model on every possible learning
set D, i.e., obtaining every fitted member of the hypothesis class H
(linear, quadratic, etc.), which is not feasible in practice.
● More complex/flexible models tend to have less bias, and thus have
a better chance to approximate f (x). Complex models tend to
have small in-sample error: Ein ≈ 0.
● More complex models tend to have higher variance. They run a higher
risk of a large out-of-sample error, Eout ≫ 0, and they need more
resources (training data, computing power).
● Less complex models tend to have less variance but more bias: a smaller
chance of approximating f (), but a higher chance that the in-sample error
tracks the out-of-sample error, Ein ≈ Eout , even though both may be
large: Ein ≈ Eout ≫ 0.
53 / 71
Tradeoff
54 / 71
Overfitting
55 / 71
Bias-Variance
● We assume a response variable y = f (x) + ϵ.
● We seek a model h(x) that approximates well the target f ().
● Given a learning dataset D of n points and a hypothesis h(x), the
expected squared prediction error at a given out-of-sample point x0 ,
taken over all possible learning sets and over the noise, is:

ED [(h(D) (x0 ) − y0 )²] = ED [(h(D) (x0 ) − h̄(x0 ))²]      [variance]
                           + (h̄(x0 ) − f (x0 ))²             [bias²]
                           + σ²                               [noise]
                         = variance + bias² + σ²

where y0 = f (x0 ) + ϵ is the noisy response at x0 .
57 / 71
Example
● We consider a noisy target:
y = f (x) + ϵ,    f (x) = sin (1 + x²)
with input variable x ∈ [0, 1] and noise term
ϵ ∼ N (µ = 0, σ = 0.03).
● The graph of the function sin (1 + x²) is shown in the figure.
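A simulation sketch of the bias-variance decomposition for this example (the sample size n = 10, the polynomial degrees, and the number of repetitions are illustrative choices): fit polynomials of several degrees to many learning sets and estimate variance, bias², and the noise floor σ² over a grid of out-of-sample points.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(1 + x ** 2)
sigma, n, n_sets = 0.03, 10, 2000
x0 = np.linspace(0, 1, 50)                      # out-of-sample evaluation grid

for degree in (1, 2, 3, 9):
    preds = np.empty((n_sets, x0.size))
    for d in range(n_sets):
        x = rng.uniform(0, 1, size=n)
        y = f(x) + rng.normal(0, sigma, size=n)
        preds[d] = np.polyval(np.polyfit(x, y, degree), x0)
    h_bar = preds.mean(axis=0)                  # average hypothesis h̄(x0)
    variance = preds.var(axis=0).mean()
    bias2 = np.mean((h_bar - f(x0)) ** 2)
    print(degree, variance, bias2, sigma ** 2)  # decomposition components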
58 / 71
Example
59 / 71
Example
64 / 71
Example - Which Model?
65 / 71
Example - Which Model?
66 / 71
Example - Which Model?
● Three new learning sets of size n = 10.
● For each learning set, the degree-9 polynomial can fit the 10 points
perfectly, but the fits are very volatile.
● The degree-1, 2, and 3 polynomials are more stable.
68 / 71
Overfitting and Underfitting
69 / 71
Overfitting and Underfitting
● The amount of bias does not depend on the size of the in-sample
set.
● This means that increasing the number of learning points won’t
give us a better chance to approximate f ().
● The amount of variance of the model does depend on the number
of learning points.
● As n increases, large-capacity models will experience a reduction in
variability, and their higher flexibility tends to become an
advantage.
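A small sketch of this effect (the sample sizes and the degree-9 model are illustrative): the estimated variance of a flexible polynomial fit shrinks as the number of learning points n grows.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(1 + x ** 2)
x0 = np.linspace(0, 1, 50)

for n in (10, 50, 200):
    preds = np.empty((1000, x0.size))
    for d in range(1000):
        x = rng.uniform(0, 1, size=n)
        y = f(x) + rng.normal(0, 0.03, size=n)
        preds[d] = np.polyval(np.polyfit(x, y, 9), x0)
    print(n, preds.var(axis=0).mean())          # variance decreases with n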
70 / 71
Overfitting and Underfitting
71 / 71