
CS 57800 Spring 2022

Statistical Machine Learning


Lecture 2.1
Classification: Foundations

Note: These slides are for your personal educational use only. Please do not distribute them.

Slides Acknowledgment:
Tommi Jaakkola, Devavrat Shah, David Sontag, Suvrit Sra (MIT EECS Course 6.867)
Anuran Makur CS 57800 (Spring 2022)

Administrivia

★ Midterm Results
★ Median: 10/30

★ Midterm Questions
★ 1. f) Change of basis does not change subspace of functions

★ 2. 2) TV = 1 iff likelihoods are mutually singular

★ 2. 5) Constant function achieves maximum detection and false

alarm probabilities

★ 3. Pairs of likelihoods can be partitioned into equivalence classes

that have the same NP function. The canonical pair in each class
is (NP as CDF, uniform distribution).

★ 3. f) A function with properties of NP function is an NP function!

CS 57800

Part I: Statistical Inference
Part II: Supervised Learning


Inference vs Learning

Statistical inference:
‣ Hidden parameter Y
‣ Observation X
‣ Goal: Infer Y from X
‣ Likelihoods are known
‣ Optimal estimators or decision rules depend on likelihoods

Supervised learning:
‣ Label Y
‣ Feature X
‣ Goal: Predict Y from X
‣ Likelihoods are unknown, but given training data
‣ Optimal prediction rules depend on training data

Food for thought: Think of any other important differences. Also, can we mix both?
Inference vs Learning

Statistical inference:
‣ Hypothesis testing
‣ Estimation

Supervised learning:
‣ Classification
‣ Regression

Paradigms of inference:
‣ Bayesian
‣ Non-Bayesian

Paradigms of learning:
‣ Generative
‣ Discriminative
CS 57800

Part II: Supervised Learning

Classification: Foundations
Classification: Logistic regression, SVM, Kernels
Classification: Naive Bayes
Classification: Learnability and VC dimension
Regression (2 lectures)
Neural Networks (2 lectures)
Outline

★ Definitions and formal setup
  – Loss function, Risk
  – Bayes Classifier
★ Nearest Neighbors
★ Empirical Risk Minimization
  – Overfitting
  – Regularization
  – Perspectives
Classification

Learn from training data to predict accurately on unseen data

Basic terminology
‣ Data domain: An arbitrary set X. Often just X = R^d
  (assuming that the members of X are represented via feature vectors;
  some authors write φ(x) to emphasize this)
‣ Label domain: A discrete set Y; e.g., {0, 1} or {−1, 1}.
‣ Training data: A finite collection S = {(x_1, y_1), ..., (x_N, y_N)}
  of pairs drawn from X × Y
‣ Classifier: A prediction rule h : X → Y (we'll write h_S to emphasize
  dependence of h on the training data). We call h a hypothesis,
  prediction rule, or classifier.
Probability model, setup

‣ Data distribution: Joint distribution P on X × Y.
  Important assumption: P is fixed but unknown.
  We will write (X, Y) to denote a random variable with X taking values
  in X and Y taking values in Y.
‣ Class conditional distribution: Let Y = {0, 1}. We define

      η(x) := P(Y = 1 | X = x) = E[Y | X = x].

‣ Measuring success: Error of classifier aka risk aka generalization error:

      L(h) ≡ L_P(h) := P(h(X) ≠ Y)

  i.e., the error of classifier h is the probability of randomly choosing
  a pair (x, y) ~ P for which h(x) ≠ y
Bayes Classifier

‣ Goal: Minimize the risk / misclassification error

Intuitively, picking the most likely class given the data makes sense
(notice, not limited to binary classification)

Bayes classifier:

      h*(x) := 1  if η(x) = P(Y = 1 | X = x) > 1/2,
               0  otherwise.

Theorem (BC optimality). For any classifier h : R^d → {0, 1},
P(h*(X) ≠ Y) ≤ P(h(X) ≠ Y), i.e., h* is an optimal classifier.

[Proof: Recall the MAP rule from the hypothesis testing lecture; proof on black-board]
Bayes Classifier

Exercise: Verify the following useful formulae

      L* = inf over h : R^d → {0, 1} of  P(h(X) ≠ Y)
         = E[ min{η(X), 1 − η(X)} ]
         = 1/2 − (1/2) E[ |2η(X) − 1| ].

(Hint: Use notation from above proof)

We call L* the Bayes Error (the minimum error possible by any classifier;
this is an idealized quantity)

Question: What makes the Bayes Classifier idealized?
If we had access to P(x, y), no need to really do much more!
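As a quick numerical illustration (a sketch in Python/NumPy; the joint distribution P below is a made-up toy example, not from the slides), one can compute η, the Bayes classifier h*, and check that the two expressions for the Bayes error L* agree:

import numpy as np

# Hypothetical toy joint distribution P(x, y) on X = {0, 1, 2}, Y = {0, 1};
# rows index x, columns index y, and the entries sum to 1.
P = np.array([[0.30, 0.10],
              [0.05, 0.25],
              [0.12, 0.18]])

p_x = P.sum(axis=1)                 # marginal P(X = x)
eta = P[:, 1] / p_x                 # eta(x) = P(Y = 1 | X = x)

h_star = (eta > 0.5).astype(int)    # Bayes classifier: predict 1 iff eta(x) > 1/2

L_star = np.sum(p_x * np.minimum(eta, 1 - eta))              # E[min{eta, 1 - eta}]
L_star_alt = 0.5 - 0.5 * np.sum(p_x * np.abs(2 * eta - 1))   # 1/2 - (1/2) E[|2 eta - 1|]
print(h_star, L_star, L_star_alt)   # both Bayes-error formulas give 0.27 here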
Nearest Neighbor Classification

One of the simplest possible classifiers!

Training: None required (or rather, memorize the data)

Testing:
  for each test data point 'x' do:
    find the 'k' points in training data nearest to 'x'
    predict label 'y' for 'x' by taking (weighted) majority label
Nearest Neighbor Classification

k-NN can learn complex nonlinear classifiers

[Figure of k-NN decision boundaries; image: Elements of Statistical Learning Theory]
Nearest-Neighbors

We mention now an approach that (seems) distribution free.

‣ Based on the belief: Features that are used to describe the data are
  relevant to the labelings in a way that makes "close by" points likely
  to have the same label

k-NN: Let S = {(x_i, y_i) | 1 ≤ i ≤ N} be the training data. Let

      NN_k(x) := { j : x_j is among the k training points closest to x }

(so that NN_1(x) = argmin over 1 ≤ i ≤ N of dist(x, x_i))

Classifier: h(x) = weighted majority label among NN_k(x).
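A minimal k-NN sketch (assuming NumPy, Euclidean distance, binary 0/1 labels, and an unweighted majority vote; all names and data are illustrative, not from the slides):

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # indices NN_k(x) of the k training points closest to x
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]
    # unweighted majority vote over the neighbors' 0/1 labels
    return int(np.mean(y_train[nn]) > 0.5)

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 0.8], [1.0, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0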
NN and Bayes

Asymptotically, it can be shown that the error of the NN classifier is

      L_NN = E[ 2η(X)(1 − η(X)) ].

Theorem. L_NN ≤ 2L*  (where L* is the Bayes error)

Proof Sketch. Let A(X) = min(η(X), 1 − η(X)). Notice that 2η(1 − η) ≥ A,
and that 2η(1 − η) = 2A(1 − A); thus we have

      L* ≤ L_NN = 2 E[ A(X)(1 − A(X)) ]
                ≤ 2 E[A(X)] E[1 − A(X)]    (Justify!)
                = 2 L*(1 − L*)
                ≤ 2 L*.

In other words, asymptotically (as N → ∞), a NN classifier comes within a
factor 2 of the best possible (Bayes) classifier.
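A quick Monte Carlo sanity check of the chain of inequalities above (a sketch; the Beta distribution used for η(X) is an arbitrary made-up choice):

import numpy as np

rng = np.random.default_rng(0)
eta = rng.beta(2.0, 5.0, size=100_000)   # made-up distribution for eta(X)

A = np.minimum(eta, 1 - eta)
L_star = A.mean()                        # L*   = E[min(eta, 1 - eta)]
L_nn = (2 * eta * (1 - eta)).mean()      # L_NN = E[2 eta (1 - eta)]

# L* <= L_NN <= 2 L* (1 - L*) <= 2 L*
print(L_star <= L_nn <= 2 * L_star * (1 - L_star) <= 2 * L_star)  # True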
NN and Bayes: Remarks

‣ Read Ch. 19 of [SSS] for additional analysis of NN methods
‣ Check out Explaining the Success of Nearest Neighbor Methods in
  Prediction by George Chen and Devavrat Shah (2018).

Ideally, we want non-asymptotic results, to better understand how many
examples (i.e., how large N) we need to attain a certain error rate.
We may also have some prior knowledge about (X, Y) that we may wish to
incorporate.
Noise, robustness, adversarial learning, and other concerns.
All of these can be accommodated; let us look at another, more explicit
paradigm.
Empirical Risk Minimization

Learner does not know P(X, Y), so the true error (Bayes error) is not
known to the learner. However,

‣ Training Error: The error that the classifier incurs on the training data,

      L_S(h) := (1/N) #{ i ∈ [N] : h(x_i) ≠ y_i },

  aka empirical risk
‣ ERM principle: Seek a predictor that minimizes L_S(h)
‣ Pitfall: Overfitting!

Food for thought: What is overfitting? Is it necessarily bad?
Overfitting

Example ("Memorize"): Let S = {(x_i, y_i) | 1 ≤ i ≤ N} be the training data
and define

      h(x) := y_i if x = x_i for some i, and h(x) := 0 otherwise,

where x is distributed uniformly in the unit square
[figure: unit square with regions labeled y = 0 and y = 1].

✴ This classifier has 0 empirical risk!
✴ As bad as a random guess (error prob on unseen data = 1/2)

Food for thought: The method Memorize is not fully "bad", why?
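A sketch of Memorize in code (the ground-truth labeling rule below is a made-up stand-in for the figure's labeled regions):

import numpy as np

rng = np.random.default_rng(1)

def true_label(X):
    # hypothetical ground-truth labeling of the unit square
    return (X[:, 0] + X[:, 1] > 1).astype(int)

X_train = rng.uniform(size=(100, 2))
y_train = true_label(X_train)

def memorize(x):
    # predict the stored label if x is a training point, else 0
    match = np.all(np.isclose(X_train, x), axis=1)
    return int(y_train[np.argmax(match)]) if match.any() else 0

train_err = np.mean([memorize(x) != y for x, y in zip(X_train, y_train)])
X_test = rng.uniform(size=(10_000, 2))
test_err = np.mean([memorize(x) != y for x, y in zip(X_test, true_label(X_test))])
print(train_err, test_err)  # 0.0 on the training set, roughly 0.5 on fresh points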


ERM with inductive bias

‣ Rather than give up on ERM, we search for settings where it may
  actually work.
‣ Inductive bias: Apply ERM over a restricted search space.
  1. The learner chooses a hypothesis class H (i.e., the set of predictors
     it is going to optimize over) in advance, before having seen any
     training data (e.g., linear models, neural networks, random forests, etc.)
  2. ERM_H uses ERM to learn h : X → Y by using S (training data):

         ERM_H(S) ∈ argmin over h ∈ H of  L_S(h)

‣ Note: Ideally H should be governed by knowledge of the data.
  But even "simple" choices of H can overfit if we are not careful.
  Of course, an overly strong inductive bias can lead to underfitting.
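A sketch of ERM_H for one deliberately tiny hypothesis class (a hypothetical example, not from the slides): H = 1-D threshold classifiers h_t(x) = [[x > t]] over a finite grid of thresholds, with the empirical 0/1 risk minimized by exhaustive search:

import numpy as np

def erm_threshold(x, y, thresholds):
    # H = { h_t : h_t(x) = 1[x > t] }; return the t minimizing L_S(h_t)
    risks = [np.mean((x > t).astype(int) != y) for t in thresholds]
    best = int(np.argmin(risks))
    return thresholds[best], risks[best]

x = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([0, 0, 0, 1, 1, 1])
t_hat, emp_risk = erm_threshold(x, y, np.linspace(0, 1, 101))
print(t_hat, emp_risk)  # any threshold in [0.4, 0.7) attains zero training error here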
Loss functions, ERM setup

Recall, risk was defined as

      L(h) = P(h(X) ≠ Y)

We can consider more general ways to measure risk.

‣ Loss function: A nonnegative function quantifying the "loss/risk"
  incurred by a hypothesis,

      ℓ : H × X × Y → R_+

  For a real-valued argument z:
      hinge loss:     ℓ_h(z) := max(0, 1 − z)
      logistic loss:  ℓ_log(z) := log(1 + e^(−z))
      0/1 loss:       ℓ_0/1(z) := [[ z ≤ 0 ]]

  [Figure: plot of ℓ(z) versus z for the three losses]
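The three losses written out as code (a sketch; z is the real-valued score argument used on the slide):

import numpy as np

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)   # max(0, 1 - z)

def logistic_loss(z):
    return np.log1p(np.exp(-z))       # log(1 + e^(-z))

def zero_one_loss(z):
    return (z <= 0).astype(float)     # [[ z <= 0 ]]

z = np.linspace(-3, 3, 7)
print(hinge_loss(z))
print(logistic_loss(z))
print(zero_one_loss(z))
# hinge and logistic are convex surrogates for the (non-convex) 0/1 loss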
Loss functions, ERM setup

‣ Risk function: Expected loss of h ∈ H wrt the data distribution P
  over X × Y,

      L(h) := E[ ℓ(h, X, Y) ]

‣ Empirical Risk:

      L_S(h) := (1/N) Σ_{i=1}^{N} ℓ(h, x_i, y_i)

‣ 0/1-loss:

      ℓ_0/1(h, (x, y)) := 1 if h(x) ≠ y, and 0 if h(x) = y.

Exercise: Verify that with ℓ_0/1 the risk reduces to L(h) = P(h(X) ≠ Y)
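A sketch of L_S(h) for a generic loss, checking that plugging in the 0/1 loss gives exactly the misclassification rate on S (the classifier and data below are illustrative):

import numpy as np

def empirical_risk(h, loss, X, y):
    # L_S(h) = (1/N) sum_i loss(h, x_i, y_i)
    return np.mean([loss(h, x_i, y_i) for x_i, y_i in zip(X, y)])

def zero_one(h, x, y):
    return float(h(x) != y)

h = lambda x: int(x[0] > 0.5)   # some fixed classifier
X = np.array([[0.2], [0.6], [0.7], [0.4]])
y = np.array([0, 1, 0, 1])

print(empirical_risk(h, zero_one, X, y))              # 0.5
print(np.mean([h(x) != yi for x, yi in zip(X, y)]))   # same: the misclassification rate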
ERM: Computation

‣ Computational question:

      min over h ∈ H of  (1/N) Σ_{i=1}^{N} ℓ_0/1(h, x_i, y_i)
        =  min over h ∈ H of  (1/N) Σ_{i=1}^{N} [[ h(x_i) ≠ y_i ]]

  This empirical risk is typically NP-hard to optimize.

Exercise: Provide settings / scenarios where this loss is not hard to
optimize.
ERM: Theory

Question: When does ERM work? In other words, if we minimize L_S(h),
what bearing does that have on L(h)?

The goal of learning theory is to study this (and such) question(s);
it is not the focus of this course.

Informally, if for all h ∈ H, L_S(h) is a good approximation to L(h),
then ERM will also return a good hypothesis:

      L_P(h_S) ≤ min over h ∈ H of  L_P(h)  +  ε

(h_S is learned using ERM; both risks are over the data distribution)

‣ Why may a certain hypothesis class H be "better"? How do we ensure
  learnability, control overfitting, etc.? Let us look at a fundamental
  tradeoff that guides us.
ERM: Bias-complexity tradeoff

‣ Error decomposition: To control overfitting, we introduced inductive
  bias. Let us look at a fundamental error decomposition in ML:

      L_P(h_S) = ε_apx + ε_est

  Thus, the probability of error on random (unseen) data decomposes into

      ε_apx := min over h ∈ H of  L(h)      (APPROXIMATION ERROR)
      ε_est := L_P(h_S) − ε_apx             (ESTIMATION ERROR)
ERM: Bias-complexity tradeoff

‣ Approximation error: The minimum risk achievable by a predictor in H.
  It measures how much risk is due to the inductive bias (observe that it
  does not depend on N or S).
‣ Estimation error: The difference between the approximation error and the
  error achieved by the ERM predictor (on test data). This error arises
  because the training error (empirical risk) is just a proxy for the
  true risk.
‣ Quality of estimation depends on the training set size N and on the
  "richness / complexity" of the hypothesis class (e.g., for a finite
  hypothesis class, estimation error increases as log |H| and decreases
  with N).
‣ Regularization: One way to quantify the complexity of an individual
  hypothesis
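A small simulation (a sketch with made-up noisy data) illustrating the tradeoff with k-NN, where a smaller k means a "richer" effective model: training error keeps shrinking as k decreases, while test error eventually gets worse:

import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    X = rng.uniform(-1, 1, size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    flip = rng.uniform(size=n) < 0.15        # 15% label noise
    return X, np.where(flip, 1 - y, y)

def knn_error(X_tr, y_tr, X_ev, y_ev, k):
    errs = 0
    for x, yi in zip(X_ev, y_ev):
        nn = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
        pred = int(np.mean(y_tr[nn]) > 0.5)
        errs += int(pred != yi)
    return errs / len(y_ev)

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(2000)
for k in [1, 5, 15, 51]:
    print(k, knn_error(X_tr, y_tr, X_tr, y_tr, k), knn_error(X_tr, y_tr, X_te, y_te, k))
# k = 1 memorizes the training set (zero training error) but typically has a
# worse test error than a moderate k on this noisy problem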
Regularized ERM

Here we seek to minimize the regularized empirical risk

      min over h ∈ H of  L_S(h) + λ R(h),

where λ ≥ 0 is a hyper-parameter that regulates the bias-complexity
tradeoff.

NEXT LECTURE: Concrete examples of hypothesis classes, regularizers, and
corresponding optimization methods.

Regularization is a very important idea in machine learning.
The above method is not the only way to regularize!
Models, algorithms, computation, and several other ways to implicitly or
explicitly regularize.
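A sketch of regularized ERM for one concrete, hypothetical choice (linear classifier, logistic loss, R(h) = ||w||^2, labels in {−1, +1}, plain gradient descent; lam plays the role of λ above):

import numpy as np

def fit_regularized_logreg(X, y, lam=0.1, lr=0.1, iters=2000):
    # minimize  (1/N) sum_i log(1 + exp(-y_i <w, x_i>))  +  lam ||w||^2
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        sig = 1.0 / (1.0 + np.exp(margins))              # sigma(-margin)
        grad = -(X * (y * sig)[:, None]).mean(axis=0) + 2 * lam * w
        w -= lr * grad
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = fit_regularized_logreg(X, y, lam=0.1)
print(np.sign(X @ w))   # predictions h_w(x) = sign(<w, x>) on the training points

A larger lam shrinks w (a stronger bias toward "simple" hypotheses); lam = 0 recovers plain ERM with the logistic loss.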

ERM: Bias-complexity tradeoff

Modern viewpoint on generalization: the double-descent curve

[Figure from: Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal,
"Reconciling modern machine-learning practice and the classical
bias-variance trade-off". Curves for training risk (dashed line) and test
risk (solid line). (A) The classical U-shaped risk curve arising from the
bias-variance trade-off. (B) The double-descent risk curve, which
incorporates the U-shaped risk curve (i.e., the "classical" regime)
together with the observed behavior from using high-capacity function
classes (i.e., the "modern" interpolating regime), separated by the
interpolation threshold; predictors to the right of the interpolation
threshold interpolate the training data.]

The classical curve does not predict the second descent beyond the
interpolation threshold. Recently, there has been an emerging recognition
that certain interpolating predictors (not based on ERM) can indeed be
provably statistically optimal or ...
