
Bias-Variance Tradeoff

Ngoc Hoang Luong

University of Information Technology (UIT), VNU-HCM

November 20, 2023


A Theoretical Framework
Player Height Weight Yrs Expr 2 Points 3 Points Salary
1 ... ... ... ... ... ...
2 ... ... ... ... ... ...
3 ... ... ... ... ... ...
... ... ... ... ... ... ...

● Predict the salary of a player: the output variable, denoted as Y.


● The rest of the variables are the inputs, denoted as X1 , X2 , . . . , Xp .
● We assume the existence of a target function f ()

f ∶ X → Y,

which is a function mapping from the input space X to the output space Y. This function is unknown.
● We want to “find” this target function.

2 / 71
A Theoretical Framework
Player Height Weight Yrs Expr 2 Points 3 Points Salary
1 ... ... ... ... ... ...
2 ... ... ... ... ... ...
3 ... ... ... ... ... ...
... ... ... ... ... ... ...
● We have a dataset D = {(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn )} where xi
is the feature vectors for the i-th player, and yi is his/her salary.
● From this data, we want to obtain a fitted model, known as a
hypothesis model fˆ():
fˆ ∶ X → Y
that approximates the unknown target function f ().
● To find fˆ(), we consider a set of candidate models, known as a
hypothesis set H = {h1 , h2 , . . . , hm }.
● The selected hypothesis h∗ ∈ H will be the one used as the
final model fˆ.
3 / 71
A Theoretical Framework

● The target function f is unknown, and we can never really


“discover” f . We can only find a good enough approximation to f
by estimating fˆ.
● The idea of a good approximation fˆ ≈ f is also theoretical because
we don’t know f .
4 / 71
A Theoretical Framework

● The dataset D is influenced by the unknown target function.


● The hypothesis set H is the set of ML model types that we want to
try out (e.g., linear models, polynomial models, non-parametric
models, etc.)
5 / 71
A Theoretical Framework

● The learning algorithm A is the set of instructions to be carried


out when learning from data.

6 / 71
A Theoretical Framework

● The final model fˆ is selected by the learning algorithm from the set
of hypothesis models.
● Ideally, fˆ should be a good approximation of the target function f .

7 / 71
Types of Predictions

● What does a “good model” mean?


● We want to estimate an unknown function f with some model fˆ
that gives “good” predictions.
● For a simple linear regression model, a fitted model fˆ(x) can be
used to make two types of predictions:
● For an observed point xi , we can compute ŷi = fˆ(xi ). Note that xi
was part of the learning data used to find fˆ.
● For an unseen point x0 , we can compute ŷ0 = fˆ(x0 ). Note that x0
was not part of the learning data used to find fˆ.
● We have two kinds of dataset:
● In-sample data, denoted by Din , is used to fit a model.
● Out-of-sample data, denoted by Dout , is used to measure the
predictive quality of a model.

8 / 71
Two Types of Predictions
● With two types of data points, we have two corresponding types of
predictions:
1 predictions ŷi of observed/seen values xi
2 predictions ŷ0 of unobserved/unseen values x0
● The predictions of observed data ŷi involve the memorizing aspect.
● The predictions of unobserved data ŷ0 involve the generalization
aspect.
● We are interested in the second type of predictions: we want to
find models that are able to give predictions ŷ0 as accurate as
possible for the real value y0 .
● Having good predictions ŷi of observed data is often a necessary
condition for a good model, but not a sufficient condition.
● Sometimes, we can perfectly fit the observed data, but have a
terrible performance for unobserved data x0 .

9 / 71
Error Measure
● We need a way to measure the accuracy of the predictions.
● We need some mechanism to quantify how different the fitted
model fˆ() is from the target function f (): the total amount of
error - Overall Measure of Error: E(fˆ, f ).
● The overall measure of error is defined in terms of individual errors
erri (ŷi , yi ) that quantify the difference between an observed value
yi and its predicted value ŷi .

E(fˆ, f ) = measure (∑ erri (ŷi , yi ))

● We typically use the mean of the individual errors as the overall error measure:

E(fˆ, f ) = (1/n) ∑i erri (ŷi , yi )

10 / 71
Individual Errors

1 Squared error: err(ŷi , yi ) = (ŷi − yi )²
2 Absolute error: err(ŷi , yi ) = ∣ŷi − yi ∣
3 Misclassification error: err(ŷi , yi ) = 1[ŷi ≠ yi ]
4 ...
● In machine learning, these individual errors are known as loss
functions.
● We can design different individual error functions, but the above
are the most common.
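
● As a minimal illustration, the loss functions above can be written in NumPy as follows (the function names and the use of NumPy are illustrative choices, not part of these slides):

    import numpy as np

    def squared_error(y_hat, y):
        # err(yhat_i, y_i) = (yhat_i - y_i)^2, element-wise
        return (y_hat - y) ** 2

    def absolute_error(y_hat, y):
        # err(yhat_i, y_i) = |yhat_i - y_i|
        return np.abs(y_hat - y)

    def misclassification_error(y_hat, y):
        # err(yhat_i, y_i) = 1[yhat_i != y_i]
        return (y_hat != y).astype(float)

    # Overall error measure: the mean of the individual errors
    def overall_error(err, y_hat, y):
        return np.mean(err(y_hat, y))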

11 / 71
Two Types of Errors
● In machine learning, the overall measures of error are known as the
cost functions or risks.
● There are two types of overall error measures, based on the type of
data that is used to assess the individual errors.
1 In-sample Error, denoted Ein , is the average of individual errors
from data points of the in-sample data Din :
Ein (fˆ, f ) = (1/n) ∑i erri
2 Out-of-sample Error, denoted Eout , is the theoretical mean, or
expected value, of the individual errors over the entire input space:
Eout (fˆ, f ) = EX [err(fˆ(x), f (x))]

● The point x denotes a general data point in the input space X .


● The expectation is taken over the input space X . Thus, the nature
of Eout is highly theoretical; we will never be able to compute this
quantity.
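
● A rough sketch of the distinction: Ein is an average over the in-sample data, while Eout can only be approximated, e.g., by a Monte Carlo average over a large sample from the assumed input distribution (the target function, the distribution, and the fitted hypothesis below are all arbitrary choices for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(np.pi * x)           # stand-in target function (unknown in practice)

    # In-sample data D_in and a large i.i.d. sample standing in for the input space
    x_in  = rng.uniform(-1, 1, size=20)
    y_in  = f(x_in)
    x_out = rng.uniform(-1, 1, size=100_000)  # proxy for "all" out-of-sample points
    y_out = f(x_out)

    # Fit a simple hypothesis (here: a least-squares line) on D_in only
    b1, b0 = np.polyfit(x_in, y_in, deg=1)
    h = lambda x: b0 + b1 * x

    E_in  = np.mean((h(x_in)  - y_in)  ** 2)   # empirical average over D_in
    E_out = np.mean((h(x_out) - y_out) ** 2)   # Monte Carlo estimate of E_X[err]
    print(E_in, E_out)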
12 / 71
Supervised Learning Diagram

● Learning algorithms A use individual error function err().


● The overall measure of error E() is used to determine which model
h() is the best approximation to the target function f ().
13 / 71
Probability Perspective

● Our ultimate goal is to get a good function fˆ ≈ f . Technically, we


want Eout (fˆ) ≈ 0.
● However, out-of-sample data is theoretical; we can never obtain it
entirely. We don’t have access to Eout .
● The best we can do is to obtain a subset of the out-of-sample data
(called the test data).
● We need to assume some probability distribution P over the input
space X . Our data points x1 , x2 , . . . , xn are independent
identically distributed (iid) samples from this distribution P .
● This links the in-sample error to the out-of-sample error.

14 / 71
Probability Perspective



Eout (fˆ) ≈ 0 is achieved through two goals:
● Ein (fˆ) ≈ 0 (practical result)
● Eout (fˆ) ≈ Ein (fˆ) (theoretical result)

15 / 71
Noisy Targets

● In practice, there will be some noise. Instead of y = f (x) where


f ∶ X → Y, it will be:
y = f (x) + ϵ
● We could have the same inputs mapping to different outputs.
● We would have two individuals with the exact inputs xA = xB , but
with different responses yA ≠ yB .
● We need to consider some target conditional distribution P (y∣x).
● Our data can be described as a joint probability distribution
P (x, y):
P (x, y) = P (x)P (y∣x)
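
● A small sketch of this generative view: draw x from P (x), then draw y from P (y∣x) via y = f (x) + ϵ (the particular P (x), f , and noise level below are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(1)
    f = lambda x: np.sin(np.pi * x)     # unknown target (assumed here for the demo)
    sigma = 0.1                         # noise standard deviation (assumption)

    def sample_joint(n):
        x = rng.uniform(-1, 1, size=n)           # x ~ P(x)
        y = f(x) + rng.normal(0, sigma, size=n)  # y | x ~ P(y|x), i.e., y = f(x) + eps
        return x, y

    x, y = sample_joint(5)
    # Two identical inputs can now yield different outputs:
    x_dup = np.array([0.3, 0.3])
    y_dup = f(x_dup) + rng.normal(0, sigma, size=2)   # y_A != y_B in general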

16 / 71
Noisy Targets

● In supervised learning, we want to learn the conditional distribution


P (y∣x), where y = f (x) + ϵ.
● The Hypothesis Set H and the Learning Algorithm A are together
called the Learning Model.
17 / 71
Estimation
● Estimation consists of providing an approximate value for a
parameter of a population, using a (random) sample of
observations drawn from that population.
● We have a population of n objects, and we want to describe them
with some numeric characteristic θ.
● For example, we have a population of all students in a college, and
we want to know their average height. We call this (theoretical)
average the parameter.

18 / 71
Estimation

● To estimate the value of the parameter, we draw a sample of


m < n students from the population and compute a statistic θ̂.
● Ideally, we would like some statistic θ̂ that approximates well the
parameter θ.
19 / 71
Estimation

1 Get a random sample from the population.


2 Use the limited amount of data in the sample to estimate θ using
some formula to compute θ̂.
3 Make a statement about how reliable an estimator θ̂ is.
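
● A toy sketch of these three steps, estimating the average height θ of a finite population from a random sample of size m (all numbers below are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    population = rng.normal(170, 8, size=10_000)   # heights in cm; theta is their mean
    theta = population.mean()                      # unknown in practice

    m = 50
    sample = rng.choice(population, size=m, replace=False)   # step 1: random sample
    theta_hat = sample.mean()                                # step 2: estimate theta

    # Step 3: a rough reliability statement via the standard error of the mean
    std_error = sample.std(ddof=1) / np.sqrt(m)
    print(theta, theta_hat, std_error)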
20 / 71
Sampling Estimators

● Assume that we can draw multiple random samples, all of the same
size m, from the population.
● For each sample, we compute a statistic θ̂.

21 / 71
Sampling Estimators

An estimator is a random variable:


● The first sample of size m will result in θ̂1 .
● The second sample of size m will result in θ̂2 .
● The third sample of size m will result in θ̂3 .
● and so on. . .
22 / 71
Sampling Estimators

An estimator is a random variable:


● Some samples yield a θ̂k that overestimates θ.
● Some samples yield a θ̂k that underestimates θ.
● Some samples yield a θ̂k that matches θ.
23 / 71
Distribution of Estimators
● In theory, we could draw a very large number of samples and
visualize the resulting distribution of θ̂:

● Some estimators will be close to the parameter θ.


● Some estimators will be far away from the parameter θ.

24 / 71
Distribution of Estimators
● The estimator has expected value E(θ̂) with finite variance var(θ̂).

● How different (or similar) is θ̂ from θ? On average, how close do we
expect the estimator to be to the parameter?
● We need a measure to assess the typical distance of estimators
from the parameter.
25 / 71
Distribution of Estimators

The difference θ̂ − θ is the estimation error. The estimation error is also


a random variable:
● The first sample yields an error θ̂1 − θ.
● The second sample yields an error θ̂2 − θ.
● and so on . . . .
26 / 71
Distribution of Estimators

● To measure the size of the estimation errors, we use Mean Squared


Error (MSE) of θ̂.
MSE(θ̂) = E[(θ̂ − θ)2 ]
● MSE is the squared distance from our estimator θ̂ to the true value
θ, averaged over all possible samples.
27 / 71
Distribution of Estimators

(θ̂ − θ)² = (θ̂ − E(θ̂) + E(θ̂) − θ)²
= (θ̂ − µθ̂ + µθ̂ − θ)², with a = θ̂ − µθ̂ and b = µθ̂ − θ
= a² + b² + 2ab
⟹ MSE(θ̂) = E[(θ̂ − θ)²] = E[a² + b² + 2ab]
28 / 71
MSE of an Estimator
● The MSE(θ̂) = E[(θ̂ − θ)2 ] can be decomposed as:
MSE(θ̂) = E[(θ̂ − θ)2 ] = E[a2 + b2 + 2ab]
= E(a2 ) + E(b2 ) + 2E(ab)
= E[(θ̂ − µθ̂ )2 ] + E[(µθ̂ − θ)2 ] + 2E(ab)
● We have:
E(ab) = E[(θ̂ − µθ̂ )(µθ̂ − θ)]
= (µθ̂ − θ)E[(θ̂ − µθ̂ )] //µθ̂ = E(θ̂) and θ are constants
= (µθ̂ − θ)[E(θ̂) − E(µθ̂ )] = 0
● Consequently
MSE(θ̂) = E[(θ̂ − µθ̂ )2 ] + E[(µθ̂ − θ)2 ]
= E[(θ̂ − µθ̂ )2 ] + E[(µθ̂ − θ)]2 = E[(θ̂ − µθ̂ )2 ] + (µθ̂ − θ)2
= Var(θ̂) + Bias2 (θ̂)

29 / 71
MSE of an Estimator

MSE(θ̂) = E[(θ̂ − µθ̂ )²] + (µθ̂ − θ)²
=    Var(θ̂)    +   Bias²(θ̂)

● The MSE of an estimator can be decomposed in terms of Bias and


Variance.
● Bias, µθ̂ − θ, is the tendency of θ̂ to overestimate or underestimate
θ over all possible samples.
● Variance, Var(θ̂), measures the average variability of the
estimators around their mean E(θ̂).
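
● The decomposition can be checked numerically. The sketch below simulates many samples, uses a deliberately biased estimator of a population variance (dividing by m instead of m − 1), and compares the simulated MSE with Var(θ̂) + Bias²(θ̂); the population and sample size are arbitrary assumptions:

    import numpy as np

    rng = np.random.default_rng(3)
    theta = 4.0                      # true parameter: the variance of a N(0, 2) population
    m, n_samples = 10, 100_000

    estimates = np.empty(n_samples)
    for k in range(n_samples):
        sample = rng.normal(0, 2, size=m)
        estimates[k] = ((sample - sample.mean()) ** 2).mean()   # biased (1/m) variance estimator

    mse   = np.mean((estimates - theta) ** 2)
    var   = estimates.var()                      # Var of the estimator over the simulated samples
    bias2 = (estimates.mean() - theta) ** 2      # squared Bias
    print(mse, var + bias2)   # the two sides match (up to floating-point error)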

30 / 71
Cases of Biases and Variance
Depending on the type of estimator θ̂ and the sample size m, we can
get statistics having different behaviors.

31 / 71
Theoretical Framework of Supervised Learning
● The main goal is to find a model fˆ that approximates well the
target function f , i.e., fˆ ≈ f .
● We want to find a model fˆ that gives good predictions on both
types of data points:
1 in-sample data ŷi = fˆ(xi ), where (xi , yi ) ∈ Din
2 out-of-sample data ŷ0 = fˆ(x0 ), where (x0 , y0 ) ∈ Dout
● Two types of predictions involve two types of errors:
1 in-sample error: Ein (fˆ)
2 out-of-sample error: Eout (fˆ)
● To have fˆ ≈ f , we need to achieve two goals:
1 small in-sample error: Ein (fˆ) ≈ 0
2 out-of-sample error similar to in-sample error: Eout (fˆ) ≈ Ein (fˆ)
● We will study the theoretical behavior of Eout (fˆ) from a regression
perspective with Mean Squared Error (MSE).

32 / 71
Motivation Example
● Consider a noiseless target function:
f (x) = sin (πx)
with the input variable x ∈ [−1, 1].

33 / 71
Motivation Example - Two Hypotheses
● Given a dataset of n data points, we fit the data using one of two
hypothesis spaces H0 and H1 :
● H0 : the set of all lines of the form h(x) = b
● H1 : the set of all lines of the form h(x) = b0 + b1 x

34 / 71
Learning from two points

● We assume a dataset of size n = 2. D = {(x1 , y1 ), (x2 , y2 )}, where


x1 , x2 ∈ [−1, 1].
● For H0 , we choose the constant hypothesis that best fits the data:
the horizontal line at the midpoint:
b = (y1 + y2 ) / 2
● For H1 , we choose the line that passes through the two data points
(x1 , y1 ) and (x2 , y2 ).

35 / 71
Learning from two points
● For 500 times, we randomly sample two points in the interval
[−1, 1], and fit both models h0 and h1 .
● The average hypotheses h̄0 and h̄1 are displayed in orange.
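
● A minimal sketch of this experiment, assuming the noiseless target f (x) = sin (πx) from the previous slides; it repeats the two-point sampling 500 times and averages the fitted parameters to obtain h̄0 and h̄1:

    import numpy as np

    rng = np.random.default_rng(4)
    f = lambda x: np.sin(np.pi * x)

    n_runs = 500
    b_const = np.empty(n_runs)                               # H0: h(x) = b
    b0_line = np.empty(n_runs); b1_line = np.empty(n_runs)   # H1: h(x) = b0 + b1*x

    for k in range(n_runs):
        x1, x2 = rng.uniform(-1, 1, size=2)
        y1, y2 = f(x1), f(x2)
        b_const[k] = (y1 + y2) / 2             # best constant: the midpoint
        b1_line[k] = (y2 - y1) / (x2 - x1)     # line through the two points
        b0_line[k] = y1 - b1_line[k] * x1

    # Average hypotheses (both classes are linear in their parameters,
    # so averaging the parameters gives the average hypothesis)
    print(b_const.mean())                  # close to 0  -> h0_bar is the line y = 0
    print(b0_line.mean(), b1_line.mean())  # intercept near 0, positive slope -> h1_bar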

36 / 71
Learning from two points
● For H0 models: if we average all 500 fitted models, we get h̄0
which corresponds to the horizontal line y = 0.
● All the individual fitted lines have the same slope as the average
hypothesis (namely zero), but different intercept values.
● The class of H0 models has low variance and high bias.

37 / 71
Learning from two points
● For H1 models: if we average all 500 fitted models, we get h̄1
which corresponds to the orange line with positive slope.
● The individual fitted lines have all sorts of slopes: negative, close
to zero, zero, positive, etc. There is a substantial amount of
variability between the average hypothesis h̄1 and the form of any
single fit h1 .

38 / 71
Learning from two points
● For H1 models: the fact that the average hypothesis h̄1 has
positive slope means that the majority of fitted lines also have
positive slopes.
● The average hypothesis somewhat matches the overall trend of
target function f () around x ∈ [−0.5, 0.5].
● The class of H1 models has high variance and low bias.

39 / 71
Bias-Variance Derivation

● We know that the Mean Squared Error (MSE) of an estimator θ̂


can be decomposed in terms of bias and variance as:

MSE(θ̂) = E[(θ̂ − µθ̂ )2 ] + (µθ̂ − θ)2

with µθ̂ = E(θ̂)


● Bias, µθ̂ − θ , is the tendency of θ̂ to overestimate or underestimate
θ over all possible samples.
● Variance, Var(θ̂), measures the average variability of the
estimators around their mean E(θ̂).
● Note that θ̂ above is a general estimator.
● Next, we focus on fˆ(), our approximation of a target function f ().

40 / 71
Out-of-Sample Predictions
● To consider the MSE as a theoretical expected value (i.e., not an
empirical average), we suppose the existence of an out-of-sample
data point x0 .
● Given a learning data set D of n points, a hypothesis space H of
hypotheses h(x)’s, the expectation of the Squared Error for a
given out-of-sample point x0 over all possible learning sets is:
ED [(h(D) (x0 ) − f (x0 ))²]

● We assume the target function f () is noiseless.


● h(D) is obtained by fitting on a specific learning data set D.
● h(D) (x0 ) denotes the predicted value of an out-of-sample point x0 .
● h(D) (x0 ) plays the role of θ̂ and f (x0 ) plays the role of θ.

41 / 71
Out-of-Sample Predictions
● We consider the average hypothesis h̄(x0 ) that plays the role of
µθ̂ = E(θ̂):
h̄(x0 ) = ED [h(D) (x0 )]
● We have the error for a given out-of-sample point x0 :

h(D) (x0 ) − f (x0 ) = h(D) (x0 ) − h̄(x0 ) + h̄(x0 ) − f (x0 )


h(D) − f = h(D) − h̄ + h̄ − f
● Same as above, we derive the expectation:
ED [(h(D) − f )²] = ED [(h(D) − h̄ + h̄ − f )²], with a = h(D) − h̄ and b = h̄ − f
= ED [(a + b)²] = ED [a² + 2ab + b²]
= ED [a²] + ED [b²] + ED [2ab]

42 / 71
Out-of-Sample Predictions
● ED [a2 ] = ED [(h(D) − h̄)2 ] = Variance(h)
● ED [b2 ] = ED [(h̄ − f )2 ] = Bias2 (h)
● We have

ED [2ab] = ED [2(h(D) − h̄)(h̄ − f )]


= 2ED [h(D) h̄ − h(D) f − h̄2 + h̄f ]
∝ h̄ED [h(D) ] − f ED [h(D) ] − ED [h̄2 ] + f ED [h̄]
= h̄2 − f h̄ − h̄2 + f h̄ = 0
● Thus, assuming that the target function f () is noiseless, the
expectation of the Squared Error for a given out-of-sample
point x0 , over all possible learning sets, is:

ED [(h(D) (x0 ) − f (x0 ))²] = ED [(h(D) (x0 ) − h̄(x0 ))²] + (h̄(x0 ) − f (x0 ))²
=            variance            +         bias²
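
● Continuing the two-point example, this decomposition can be verified numerically at a single out-of-sample point x0 (the choice x0 = 0.9 and the use of H1 are arbitrary illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(5)
    f = lambda x: np.sin(np.pi * x)
    x0 = 0.9                                   # a fixed out-of-sample point

    n_runs = 100_000
    preds = np.empty(n_runs)                   # h^(D)(x0) for each learning set D
    for k in range(n_runs):
        x1, x2 = rng.uniform(-1, 1, size=2)
        y1, y2 = f(x1), f(x2)
        slope = (y2 - y1) / (x2 - x1)          # H1: line through the two points
        preds[k] = y1 + slope * (x0 - x1)

    h_bar = preds.mean()                                         # average hypothesis at x0
    lhs   = np.mean((preds - f(x0)) ** 2)                        # E_D[(h^(D)(x0) - f(x0))^2]
    rhs   = np.mean((preds - h_bar) ** 2) + (h_bar - f(x0)) ** 2 # variance + bias^2
    print(lhs, rhs)   # the two sides match (the decomposition also holds for the empirical distribution)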

43 / 71
Noisy Targets

● When there is noise in the data, we have:

y = f (x) + ϵ

● If ϵ is a zero-mean noise random variable with variance σ², the
bias-variance decomposition becomes:

ED [(h(D) (x0 ) − y0 )²] = var + bias² + σ²

● Notice that the above equation involves the squared error at just
one out-of-sample point (x0 , y0 ) (i.e., one test point).

44 / 71
Types of Theoretical MSEs
1 MSE involves a single out-of-sample point x0 , measuring the
performance of a class of hypotheses h ∈ H over multiple learning
data sets D - expected test MSE.
ED [(h(D) (x0 ) − f (x0 ))²]

2 MSE involves a single hypothesis h(), measuring its performance


over all out-of-sample points x0 . Notice that h() has been fitted
on just one learning set D
EX [(h(x0 ) − f (x0 ))2 ]
3 MSE measures the performance of a class of hypotheses h ∈ H,
over multiple learning data sets D, over all out-of-sample points x0
- overall expected test MSE.
EX [ED [(h(D) (x0 ) − f (x0 ))²]]

45 / 71
Types of Theoretical MSEs

● These types of MSEs are highly theoretical.


● First, we don’t know the target function f .
● Second, we don’t have access to all out-of-sample points.
● Third, we cannot obtain infinitely many learning sets to compute the average
hypothesis h̄.
● We try to compute an approximation (i.e., an estimate) of an MSE
using a test dataset Dtest .
● Dtest is assumed to be a representative subset (i.e., an unbiased
sample) of the out-of-sample data Dout .

46 / 71
The Bias-Variance Tradeoff Picture

For example, we consider several classes of hypotheses:


● H1 of cubic polynomials.
● H2 of quadratic models.
● H3 of linear models.

47 / 71
The Bias-Variance Tradeoff Picture

● Each non-filled point is a fitted model h(D) based on some


particular dataset D
● Each filled point is the average model of a particular class of
hypotheses.
● For example, if H3 represents the fits based on linear models, each
triangle is a linear polynomial ax + b with coefficients a, b that
change depending on the sample.
48 / 71
The Bias-Variance Tradeoff Picture

● We measure the variability in each class of models.


● The dashed lines represent the deviations of each fitted model
against their average hypothesis.
● The set of all dashed lines shows the variance in each class of
models, i.e., how spread out the models within each class are.

49 / 71
The Bias-Variance Tradeoff Picture

● Suppose we can locate the true model f () in this space. Assume


f () to be of class H1 .
● The solid lines between each average hypothesis and the target
function represent the bias of each class of model.
● Notice that in practice we don’t have access to either the average
models h̄() or the true model f ().

50 / 71
Bias

● The bias term involves h̄(x) − f (x). The average h̄ comes from a
hypothesis class H (e.g., constant models, linear models, etc.).
● h̄ is a prototypical example of a certain class of hypotheses.
● The bias term measures how well a hypothesis class H
approximates the target function f .

MSE = Variance + Bias² + Noise

where Bias² acts as the deterministic noise and Noise (σ²) is the random noise.

51 / 71
Variance

● The variance term ED [(h(D) (x) − h̄)2 ] measures how close a


particular hypothesis h(D) (x) can get to the average hypothesis h̄.
● The variance term measures how precise our particular function
h(D) (x) is compared to the average function h̄(x).

52 / 71
Tradeoff
● A model should have both small variance and small bias. But,
there is a tradeoff between these two (Bias-Variance tradeoff).
● To actually perform bias-variance decomposition, we need access to
h̄. But, computing h̄ requires computing every model of a
hypothesis class H (e.g., linear, quadratic, etc.)
● More complex/flexible models tend to have less bias, and thus have
a better chance to approximate f (x). Complex models tend to
have small in-sample error: Ein ≈ 0.
● More complex models tend to have higher variance. They have a
higher risk of a large out-of-sample error: Eout ≫ 0. We need more
resources (training data, computing power).
● Less complex models tend to have less variance but more bias, i.e.,
a smaller chance to approximate f (), but a higher chance that the
in-sample error tracks the out-of-sample error: Ein ≈ Eout , although
possibly Ein ≈ Eout ≫ 0.

53 / 71
Tradeoff

● To decrease bias, we might need “insider” information. That is, to


truly decrease bias, we need some information on the form of the
unknown target function f ().
● It is thus nearly impossible to have zero bias.
● Therefore, to decrease bias, we tend to use more complex/flexible
models. We need to put our efforts toward decreasing variance.
● Add more training data.
● Reduce the dimensionality of the data (e.g., lower-rank data matrices
through PCA).
● Apply regularization to keep the model parameters small.
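
● As one concrete illustration of the last point, the sketch below compares an unregularized degree-9 polynomial fit with a ridge-regularized one over many small learning sets and reports the variance of their predictions at a test point (the target, noise level, and regularization strength are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(6)
    f = lambda x: np.sin(1 + x ** 2)
    sigma, n, degree, lam = 0.03, 10, 9, 1e-3   # noise, sample size, poly degree, ridge strength

    def design(x, d):
        return np.vander(x, d + 1, increasing=True)   # columns 1, x, x^2, ..., x^d

    x0 = 0.5                                    # a fixed test point
    preds_ols, preds_ridge = [], []
    for _ in range(2000):
        x = rng.uniform(0, 1, size=n)
        y = f(x) + rng.normal(0, sigma, size=n)
        X = design(x, degree)
        # Unregularized least squares vs. ridge (which shrinks the coefficients)
        w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
        w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
        x0_row = design(np.array([x0]), degree)[0]
        preds_ols.append(x0_row @ w_ols)
        preds_ridge.append(x0_row @ w_ridge)

    print(np.var(preds_ols), np.var(preds_ridge))   # ridge predictions typically vary much less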

54 / 71
Overfitting

● In supervised learning, one of the major risks when fitting a model


is to overestimate how well it will do when we use it in the real
world. This risk is commonly called overfitting.
● An analogy: a student studying before taking a test.
● Limited capacity: We only grasp the general idea for some topics,
and do not understand the details. For example, we know the
concept of simple linear regression, but don’t know how to derive
the formula.
● Too much focus: We focus too much on certain topics, memorize
most of their details, but ignore other topics. For example, we
memorize the Normal Equations in linear regression model, but
ignore the projection or the probabilistic perspectives.
● Distraction by “noise”: phone notifications, noise from the
neighbors, etc.

55 / 71
Bias-Variance
● We assume a response variable y = f (x) + ϵ.
● We seek a model h(x) that approximates well the target f ().
● Given a learning dataset D of n points, and a hypothesis h(x), the
expectation of the Squared Error for a given out-of-sample point
x0 , over all possible learning sets, is:
ED [(h(D) (x0 ) − y0 )²] = ED [(h(D) (x0 ) − h̄(x0 ))²]   (variance)
+ (h̄(x0 ) − f (x0 ))²   (bias²)
+ σ²   (noise)
= variance + bias² + σ²

where h̄(x0 ) = ED [h(D) (x0 )] is the average hypothesis.


56 / 71
Bias-Variance

● Large Bias: “limited learning capacity” - the class of models H has


too little capacity to get close enough to the true model.
● Large Variance: focusing too much on certain details at the
expense of other equally or more important details.
● Large Noise: distracted by the noise in the data - The data for
training to obtain h(D) is bad: messy, missing values, poor quality.

57 / 71
Example
● We consider a target function with some noise:
f (x) = sin (1 + x2 ) + ϵ
with input variable x ∈ [0, 1] and the noise term
ϵ ∼ N (µ = 0, σ = 0.03).
● The figure shows the noiseless function sin (1 + x2 ) on [0, 1].

58 / 71
Example

● We sample one in-sample set of 10 points (xi , yi ) and one


out-of-sample set of 10 points (x0 , y0 ).
● Notice that a true out-of-sample set should include all x ∈ [0, 1].

59 / 71
Example

1 First, we fit a linear model (i.e., degree 1 polynomial)


2 Second, we fit a second degree polynomial
3 Third, we fit a third degree polynomial
4 ...
5 With 10 in-sample points, we can fit up to a degree-9 polynomial.
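
● A minimal sketch of this experiment (the random seed, and therefore the exact error values, will differ from the numbers reported on the following slides):

    import numpy as np

    rng = np.random.default_rng(7)
    f_noiseless = lambda x: np.sin(1 + x ** 2)
    sigma = 0.03

    def make_set(n):
        x = rng.uniform(0, 1, size=n)
        return x, f_noiseless(x) + rng.normal(0, sigma, size=n)

    x_in, y_in = make_set(10)     # in-sample set
    x_out, y_out = make_set(10)   # out-of-sample (test) set

    for degree in range(1, 10):
        # (NumPy may warn that the highest-degree fits are poorly conditioned)
        coefs = np.polyfit(x_in, y_in, deg=degree)
        E_in  = np.mean((np.polyval(coefs, x_in)  - y_in)  ** 2)
        E_out = np.mean((np.polyval(coefs, x_out) - y_out) ** 2)
        print(degree, round(E_in, 5), round(E_out, 5))
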
60 / 71
Example - Linear Model

1 The first fitted model is a linear model of the form:


h1 (x) = b0 + b1 x
2 The fitted regression model is the blue line.
3 Ein = 0.00147 and Eout = 0.00215
61 / 71
Example - Quadratic Model

1 The second fitted model is a quadratic model of the form:


h2 (x) = b0 + b1 x + b2 x2
2 The fitted regression model is the blue curve.
3 Ein = 0.00093 and Eout = 0.00137
62 / 71
Example - Nonic Model

1 The 9-th fitted model is a nonic model of the form:


h9 (x) = b0 + b1 x + b2 x2 + . . . + b8 x8 + b9 x9
2 The fitted regression model is the blue curve.
3 Ein = 0.00000 and Eout = 0.00231
63 / 71
Example - Which Model?

● The 9-degree polynomial achieves Ein = 0.0


● Should we choose h9 (x) as the final model, since it fits the
learning data perfectly?
● No! We should consider the out-of-sample error Eout .

64 / 71
Example - Which Model?

65 / 71
Example - Which Model?

Degree   Ein       Eout
1        0.00147   0.00215
2        0.00093   0.00137
3        0.00029   0.00081
4        0.00012   0.00116
5        0.00011   0.00110
6        0.00007   0.00108
7        0.00004   0.00118
8        0.00003   0.00151
9        0.00000   0.00231

● The linear model underfits the data. Among all models, its in-sample error is the largest. A more complex model will reduce Ein and Eout .
● The quadratic model is better than the linear model, but still misses the shape of the signal. It still underfits.
● Polynomials of degrees 3, 4, 5 fit reasonably well.
● The degree-9 polynomial is too flexible. Its Ein = 0.0, but it produces a very large Eout . This model overfits.

66 / 71
Example - Which Model?

● Overfitting means that an attractively small Ein value is no longer


a good indicator of a model’s out-of-sample performance.
67 / 71
Example - More Learning Sets

● Three new learning sets of size n = 10.
● For each learning set, the 9-degree polynomials can fit perfectly. But they are very volatile.
● The 1, 2, 3-degree polynomials are more stable.

68 / 71
Overfitting and Underfitting

● Overfitting happens when we fit the data more than is necessary.
Overfitting is when we choose a model with a smaller Ein , but it
turns out to have a bigger Eout . The in-sample error is no longer a
good indicator of the chosen model's generalization.
● Underfitting occurs with models that perform poorly on unseen
data because of their lack of capacity/flexibility. Underfit models
are not of the right class; they suffer from large bias.
● Low complexity models tend to be biased. With more complex
models, the amount of bias decreases. But it is generally impossible
to know the true class of model for the target function.
● NOTICE: If none of the proposed hypothesis classes contain the
truth, they will all be biased.

69 / 71
Overfitting and Underfitting

● The amount of bias does not depend on the size of the in-sample
set.
● This means that increasing the number of learning points won’t
give us a better chance to approximate f ().
● The amount of variance of the model does depend on the number
of learning points.
● As n increases, large-capacity models will experience a reduction in
variability, and their higher flexibility tends to become an
advantage.
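
● A small sketch of this effect, assuming the same noisy sin (1 + x2 ) target as before: it estimates the variance of a degree-9 polynomial's prediction at a fixed test point for increasing learning-set sizes n:

    import numpy as np

    rng = np.random.default_rng(8)
    f = lambda x: np.sin(1 + x ** 2)
    sigma, degree, x0 = 0.03, 9, 0.5

    for n in (10, 20, 50, 200):
        preds = []
        for _ in range(2000):
            x = rng.uniform(0, 1, size=n)
            y = f(x) + rng.normal(0, sigma, size=n)
            # (NumPy may warn that the degree-9 fit is poorly conditioned for small n)
            coefs = np.polyfit(x, y, deg=degree)
            preds.append(np.polyval(coefs, x0))
        # The variance of the degree-9 model's prediction shrinks as n grows
        print(n, np.var(preds))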

70 / 71
Overfitting and Underfitting

71 / 71
