
CS 57800 Spring 2022

Statistical Machine Learning


Lecture 2.1
Classification: Foundations

Note: These slides are for your personal educational use only. Please do not distribute them.

Slides Acknowledgment:
Tommi Jaakkola, Devavrat Shah, David Sontag, Suvrit Sra (MIT EECS Course 6.867)
Anuran Makur CS 57800 (Spring 2022)

Administrivia

★ Midterm Results
★ Median: 10/30

★ Midterm Questions
★ 1. f) Change of basis does not change subspace of functions

★ 2. 2) TV = 1 iff likelihoods are mutually singular

★ 2. 5) Constant function achieves maximum detection and false

alarm probabilities

★ 3. Pairs of likelihoods can be partitioned into equivalence classes

that have the same NP function. The canonical pair in each class
is (NP as CDF, uniform distribution).

★ 3. f) A function with properties of NP function is an NP function!

CS 57800

Part I: Statistical Inference
Part II: Supervised Learning


Inference vs Learning

Statistical inference:
‣ Hidden parameter Y
‣ Observation X
‣ Goal: Infer Y from X
‣ Likelihoods are known
‣ Optimal estimators or decision rules depend on likelihoods

Supervised learning:
‣ Label Y
‣ Feature X
‣ Goal: Predict Y from X
‣ Likelihoods are unknown, but given training data
‣ Optimal prediction rules depend on training data

Food for thought: Think of any other important differences. Also, can we mix both?
Inference vs Learning

Statistical inference:
‣ Hypothesis testing
‣ Estimation

Supervised learning:
‣ Classification
‣ Regression

Paradigms of inference:
‣ Bayesian
‣ Non-Bayesian

Paradigms of learning:
‣ Generative
‣ Discriminative
CS 57800

Part II: Supervised Learning

Classification: Foundations
Classification: Logistic regression, SVM, Kernels
Classification: Naive Bayes
Classification: Learnability and VC dimension
Regression (2 lectures)
Neural Networks (2 lectures)
Outline

★ Definitions and formal setup
  – Loss function, Risk
  – Bayes Classifier
★ Nearest Neighbors
★ Empirical Risk Minimization
  – Overfitting
  – Regularization
  – Perspectives
Classification

Learn from training data to predict accurately on unseen data

Basic terminology
‣ Data domain: An arbitrary set X. Often just X = R^d
  (assuming that the members of X are represented via feature vectors;
  some authors write φ(x) to emphasize this)
‣ Label domain: A discrete set Y; e.g., {0, 1} or {−1, 1}.
‣ Training data: A finite collection S = {(x_1, y_1), ..., (x_N, y_N)}
  of pairs drawn from X × Y
‣ Classifier: A prediction rule h : X → Y (we'll write h_S to emphasize
  dependence of h on the training data). We call h a hypothesis,
  prediction rule, or classifier.
Probability model, setup

‣ Data distribution: Joint distribution P on X × Y.
  Important assumption: P is fixed but unknown.
  We will write (X, Y) to denote a random variable with X taking values
  in X and Y taking values in Y.
‣ Class conditional distribution: Let Y = {0, 1}. We define

      η(x) := P(Y = 1 | X = x) = E[Y | X = x].

‣ Measuring success: Error of classifier aka risk aka generalization error:

      L(h) ≡ L_P(h) := P(h(X) ≠ Y)

  i.e., the error of classifier h is the probability of randomly choosing
  a pair (x, y) ~ P for which h(x) ≠ y
Bayes Classifier

‣ Goal: Minimize the risk / misclassification error

Intuitively, picking the most likely class given the data makes sense
(notice, not limited to binary classification)

Bayes classifier:

      h*(x) := 1  if η(x) = P(Y = 1 | X = x) > 1/2,
               0  otherwise.

Theorem (BC optimality). For any classifier h : R^d → {0, 1},
P(h*(X) ≠ Y) ≤ P(h(X) ≠ Y), i.e., h* is an optimal classifier.

[Proof: Recall the MAP rule from the hypothesis testing lecture; proof on black-board]
Bayes Classifier

Exercise: Verify the following useful formulae

      L* = inf over h : R^d → {0, 1} of  P(h(X) ≠ Y)
         = E[ min{η(X), 1 − η(X)} ]
         = 1/2 − (1/2) E[ |2η(X) − 1| ].

(Hint: Use notation from above proof)

We call L* the Bayes Error (the minimum error possible by any classifier;
this is an idealized quantity)

Question: What makes the Bayes Classifier idealized?
If we had access to P(x, y), no need to really do much more!
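As a quick numerical illustration (a sketch in Python/NumPy; the joint distribution P below is a made-up toy example, not from the slides), one can compute η, the Bayes classifier h*, and check that the two expressions for the Bayes error L* agree:

import numpy as np

# Hypothetical toy joint distribution P(x, y) on X = {0, 1, 2}, Y = {0, 1};
# rows index x, columns index y, and the entries sum to 1.
P = np.array([[0.30, 0.10],
              [0.05, 0.25],
              [0.12, 0.18]])

p_x = P.sum(axis=1)                 # marginal P(X = x)
eta = P[:, 1] / p_x                 # eta(x) = P(Y = 1 | X = x)

h_star = (eta > 0.5).astype(int)    # Bayes classifier: predict 1 iff eta(x) > 1/2

L_star = np.sum(p_x * np.minimum(eta, 1 - eta))              # E[min{eta, 1 - eta}]
L_star_alt = 0.5 - 0.5 * np.sum(p_x * np.abs(2 * eta - 1))   # 1/2 - (1/2) E[|2 eta - 1|]
print(h_star, L_star, L_star_alt)   # both Bayes-error formulas give 0.27 here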
Nearest Neighbor Classification

One of the simplest possible classifiers!

Training: None required (or rather, memorize the data)

Testing:
  for each test data point 'x' do:
    find the 'k' points in training data nearest to 'x'
    predict label 'y' for 'x' by taking (weighted) majority label
Nearest Neighbor Classification

k-NN can learn complex nonlinear classifiers

[Figure of k-NN decision boundaries; image: Elements of Statistical Learning Theory]
Nearest-Neighbors

We mention now an approach that (seems) distribution free.

‣ Based on the belief: Features that are used to describe the data are
  relevant to the labelings in a way that makes "close by" points likely
  to have the same label

k-NN: Let S = {(x_i, y_i) | 1 ≤ i ≤ N} be the training data. Let

      NN_k(x) := { j : x_j is among the k training points closest to x }

(so that NN_1(x) = argmin over 1 ≤ i ≤ N of dist(x, x_i))

Classifier: h(x) = weighted majority label among NN_k(x).
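A minimal k-NN sketch (assuming NumPy, Euclidean distance, binary 0/1 labels, and an unweighted majority vote; all names and data are illustrative, not from the slides):

import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # indices NN_k(x) of the k training points closest to x
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]
    # unweighted majority vote over the neighbors' 0/1 labels
    return int(np.mean(y_train[nn]) > 0.5)

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.9, 0.8], [1.0, 1.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0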
NN and Bayes

Asymptotically, it can be shown that the error of the NN classifier is

      L_NN = E[ 2η(X)(1 − η(X)) ].

Theorem. L_NN ≤ 2L*  (where L* is the Bayes error)

Proof Sketch. Let A(X) = min(η(X), 1 − η(X)). Notice that 2η(1 − η) ≥ A,
and that 2η(1 − η) = 2A(1 − A); thus we have

      L* ≤ L_NN = 2 E[ A(X)(1 − A(X)) ]
                ≤ 2 E[A(X)] E[1 − A(X)]    (Justify!)
                = 2 L*(1 − L*)
                ≤ 2 L*.

In other words, asymptotically (as N → ∞), a NN classifier comes within a
factor 2 of the best possible (Bayes) classifier.
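A quick Monte Carlo sanity check of the chain of inequalities above (a sketch; the Beta distribution used for η(X) is an arbitrary made-up choice):

import numpy as np

rng = np.random.default_rng(0)
eta = rng.beta(2.0, 5.0, size=100_000)   # made-up distribution for eta(X)

A = np.minimum(eta, 1 - eta)
L_star = A.mean()                        # L*   = E[min(eta, 1 - eta)]
L_nn = (2 * eta * (1 - eta)).mean()      # L_NN = E[2 eta (1 - eta)]

# L* <= L_NN <= 2 L* (1 - L*) <= 2 L*
print(L_star <= L_nn <= 2 * L_star * (1 - L_star) <= 2 * L_star)  # True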
NN and Bayes: Remarks

‣ Read Ch. 19 of [SSS] for additional analysis of NN methods
‣ Check out Explaining the Success of Nearest Neighbor Methods in
  Prediction by George Chen and Devavrat Shah (2018).

Ideally, we want non-asymptotic results, to better understand how many
examples (i.e., how large N) we need to attain a certain error rate.
We may also have some prior knowledge about (X, Y) that we may wish to
incorporate.
Noise, robustness, adversarial learning, and other concerns.
All of these can be accommodated; let us look at another, more explicit
paradigm.
Empirical Risk Minimization

Learner does not know P(X, Y), so the true error (Bayes error) is not
known to the learner. However,

‣ Training Error: The error that the classifier incurs on the training data,

      L_S(h) := (1/N) #{ i ∈ [N] : h(x_i) ≠ y_i },

  aka empirical risk
‣ ERM principle: Seek a predictor that minimizes L_S(h)
‣ Pitfall: Overfitting!

Food for thought: What is overfitting? Is it necessarily bad?
Overfitting

Example ("Memorize"): Let S = {(x_i, y_i) | 1 ≤ i ≤ N} be the training data
and define

      h(x) := y_i if x = x_i for some i, and h(x) := 0 otherwise,

where x is distributed uniformly in the unit square
[figure: unit square with regions labeled y = 0 and y = 1].

✴ This classifier has 0 empirical risk!
✴ As bad as a random guess (error prob on unseen data = 1/2)

Food for thought: The method Memorize is not fully "bad", why?
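A sketch of Memorize in code (the ground-truth labeling rule below is a made-up stand-in for the figure's labeled regions):

import numpy as np

rng = np.random.default_rng(1)

def true_label(X):
    # hypothetical ground-truth labeling of the unit square
    return (X[:, 0] + X[:, 1] > 1).astype(int)

X_train = rng.uniform(size=(100, 2))
y_train = true_label(X_train)

def memorize(x):
    # predict the stored label if x is a training point, else 0
    match = np.all(np.isclose(X_train, x), axis=1)
    return int(y_train[np.argmax(match)]) if match.any() else 0

train_err = np.mean([memorize(x) != y for x, y in zip(X_train, y_train)])
X_test = rng.uniform(size=(10_000, 2))
test_err = np.mean([memorize(x) != y for x, y in zip(X_test, true_label(X_test))])
print(train_err, test_err)  # 0.0 on the training set, roughly 0.5 on fresh points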


ERM with inductive bias

‣ Rather than give up on ERM, we search for settings where it may
  actually work.
‣ Inductive bias: Apply ERM over a restricted search space.
  1. The learner chooses a hypothesis class H (i.e., the set of predictors
     it is going to optimize over) in advance, before having seen any
     training data (e.g., linear models, neural networks, random forests, etc.)
  2. ERM_H uses ERM to learn h : X → Y by using S (training data):

         ERM_H(S) ∈ argmin over h ∈ H of  L_S(h)

‣ Note: Ideally H should be governed by knowledge of the data.
  But even "simple" choices of H can overfit if we are not careful.
  Of course, an overly strong inductive bias can lead to underfitting.
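A sketch of ERM_H for one deliberately tiny hypothesis class (a hypothetical example, not from the slides): H = 1-D threshold classifiers h_t(x) = [[x > t]] over a finite grid of thresholds, with the empirical 0/1 risk minimized by exhaustive search:

import numpy as np

def erm_threshold(x, y, thresholds):
    # H = { h_t : h_t(x) = 1[x > t] }; return the t minimizing L_S(h_t)
    risks = [np.mean((x > t).astype(int) != y) for t in thresholds]
    best = int(np.argmin(risks))
    return thresholds[best], risks[best]

x = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([0, 0, 0, 1, 1, 1])
t_hat, emp_risk = erm_threshold(x, y, np.linspace(0, 1, 101))
print(t_hat, emp_risk)  # any threshold in [0.4, 0.7) attains zero training error here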
Loss functions, ERM setup

Recall, risk was defined as

      L(h) = P(h(X) ≠ Y)

We can consider more general ways to measure risk.

‣ Loss function: A nonnegative function quantifying the "loss/risk"
  incurred by a hypothesis,

      ℓ : H × X × Y → R_+

  For a real-valued argument z:
      hinge loss:     ℓ_h(z) := max(0, 1 − z)
      logistic loss:  ℓ_log(z) := log(1 + e^(−z))
      0/1 loss:       ℓ_0/1(z) := [[ z ≤ 0 ]]

  [Figure: plot of ℓ(z) versus z for the three losses]
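The three losses written out as code (a sketch; z is the real-valued score argument used on the slide):

import numpy as np

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)   # max(0, 1 - z)

def logistic_loss(z):
    return np.log1p(np.exp(-z))       # log(1 + e^(-z))

def zero_one_loss(z):
    return (z <= 0).astype(float)     # [[ z <= 0 ]]

z = np.linspace(-3, 3, 7)
print(hinge_loss(z))
print(logistic_loss(z))
print(zero_one_loss(z))
# hinge and logistic are convex surrogates for the (non-convex) 0/1 loss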
Loss functions, ERM setup

‣ Risk function: Expected loss of h ∈ H wrt the data distribution P
  over X × Y,

      L(h) := E[ ℓ(h, X, Y) ]

‣ Empirical Risk:

      L_S(h) := (1/N) Σ_{i=1}^{N} ℓ(h, x_i, y_i)

‣ 0/1-loss:

      ℓ_0/1(h, (x, y)) := 1 if h(x) ≠ y, and 0 if h(x) = y.

Exercise: Verify that with ℓ_0/1 the risk reduces to L(h) = P(h(X) ≠ Y)
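A sketch of L_S(h) for a generic loss, checking that plugging in the 0/1 loss gives exactly the misclassification rate on S (the classifier and data below are illustrative):

import numpy as np

def empirical_risk(h, loss, X, y):
    # L_S(h) = (1/N) sum_i loss(h, x_i, y_i)
    return np.mean([loss(h, x_i, y_i) for x_i, y_i in zip(X, y)])

def zero_one(h, x, y):
    return float(h(x) != y)

h = lambda x: int(x[0] > 0.5)   # some fixed classifier
X = np.array([[0.2], [0.6], [0.7], [0.4]])
y = np.array([0, 1, 0, 1])

print(empirical_risk(h, zero_one, X, y))              # 0.5
print(np.mean([h(x) != yi for x, yi in zip(X, y)]))   # same: the misclassification rate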
ERM: Computation

‣ Computational question:

      min over h ∈ H of  (1/N) Σ_{i=1}^{N} ℓ_0/1(h, x_i, y_i)
        =  min over h ∈ H of  (1/N) Σ_{i=1}^{N} [[ h(x_i) ≠ y_i ]]

  This empirical risk is typically NP-hard to optimize.

Exercise: Provide settings / scenarios where this loss is not hard to
optimize.
ERM: Theory

Question: When does ERM work? In other words, if we minimize L_S(h),
what bearing does that have on L(h)?

The goal of learning theory is to study this (and such) question(s);
it is not the focus of this course.

Informally, if for all h ∈ H, L_S(h) is a good approximation to L(h),
then ERM will also return a good hypothesis:

      L_P(h_S) ≤ min over h ∈ H of  L_P(h)  +  ε

(h_S is learned using ERM; both risks are over the data distribution)

‣ Why may a certain hypothesis class H be "better"? How do we ensure
  learnability, control overfitting, etc.? Let us look at a fundamental
  tradeoff that guides us.
ERM: Bias-complexity tradeoff

‣ Error decomposition: To control overfitting, we introduced inductive
  bias. Let us look at a fundamental error decomposition in ML:

      L_P(h_S) = ε_apx + ε_est

  Thus, the probability of error on random (unseen) data decomposes into

      ε_apx := min over h ∈ H of  L(h)      (APPROXIMATION ERROR)
      ε_est := L_P(h_S) − ε_apx             (ESTIMATION ERROR)
ERM: Bias-complexity tradeoff

‣ Approximation error: The minimum risk achievable by a predictor in H.
  It measures how much risk is due to the inductive bias (observe that it
  does not depend on N or S).
‣ Estimation error: The difference between the approximation error and the
  error achieved by the ERM predictor (on test data). This error arises
  because the training error (empirical risk) is just a proxy for the
  true risk.
‣ Quality of estimation depends on the training set size N and on the
  "richness / complexity" of the hypothesis class (e.g., for a finite
  hypothesis class, estimation error increases as log |H| and decreases
  with N).
‣ Regularization: One way to quantify the complexity of an individual
  hypothesis
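A small simulation (a sketch with made-up noisy data) illustrating the tradeoff with k-NN, where a smaller k means a "richer" effective model: training error keeps shrinking as k decreases, while test error eventually gets worse:

import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    X = rng.uniform(-1, 1, size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    flip = rng.uniform(size=n) < 0.15        # 15% label noise
    return X, np.where(flip, 1 - y, y)

def knn_error(X_tr, y_tr, X_ev, y_ev, k):
    errs = 0
    for x, yi in zip(X_ev, y_ev):
        nn = np.argsort(np.linalg.norm(X_tr - x, axis=1))[:k]
        pred = int(np.mean(y_tr[nn]) > 0.5)
        errs += int(pred != yi)
    return errs / len(y_ev)

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(2000)
for k in [1, 5, 15, 51]:
    print(k, knn_error(X_tr, y_tr, X_tr, y_tr, k), knn_error(X_tr, y_tr, X_te, y_te, k))
# k = 1 memorizes the training set (zero training error) but typically has a
# worse test error than a moderate k on this noisy problem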
Regularized ERM

Here we seek to minimize the regularized empirical risk

      min over h ∈ H of  L_S(h) + λ R(h),

where λ ≥ 0 is a hyper-parameter that regulates the bias-complexity
tradeoff.

NEXT LECTURE: Concrete examples of hypothesis classes, regularizers, and
corresponding optimization methods.

Regularization is a very important idea in machine learning.
The above method is not the only way to regularize!
Models, algorithms, computation, and several other ways to implicitly or
explicitly regularize.
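A sketch of regularized ERM for one concrete, hypothetical choice (linear classifier, logistic loss, R(h) = ||w||^2, labels in {−1, +1}, plain gradient descent; lam plays the role of λ above):

import numpy as np

def fit_regularized_logreg(X, y, lam=0.1, lr=0.1, iters=2000):
    # minimize  (1/N) sum_i log(1 + exp(-y_i <w, x_i>))  +  lam ||w||^2
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        sig = 1.0 / (1.0 + np.exp(margins))              # sigma(-margin)
        grad = -(X * (y * sig)[:, None]).mean(axis=0) + 2 * lam * w
        w -= lr * grad
    return w

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = fit_regularized_logreg(X, y, lam=0.1)
print(np.sign(X @ w))   # predictions h_w(x) = sign(<w, x>) on the training points

A larger lam shrinks w (a stronger bias toward "simple" hypotheses); lam = 0 recovers plain ERM with the logistic loss.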

ERM: Bias-complexity tradeoff

Modern viewpoint on generalization: the double-descent curve

[Figure from: Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal,
"Reconciling modern machine-learning practice and the classical
bias-variance trade-off". Curves for training risk (dashed line) and test
risk (solid line). (A) The classical U-shaped risk curve arising from the
bias-variance trade-off. (B) The double-descent risk curve, which
incorporates the U-shaped risk curve (i.e., the "classical" regime)
together with the observed behavior from using high-capacity function
classes (i.e., the "modern" interpolating regime), separated by the
interpolation threshold; predictors to the right of the interpolation
threshold interpolate the training data.]

The classical curve does not predict the second descent beyond the
interpolation threshold. Recently, there has been an emerging recognition
that certain interpolating predictors (not based on ERM) can indeed be
provably statistically optimal or ...
