Lecture 2.1
Note: These slides are for your personal educational use only. Please do not distribute them
Slides Acknowledgment:
Tommi Jaakkola, Devavrat Shah, David Sontag, Suvrit Sra (MIT EECS Course 6.867)
Anuran Makur CS 57800 (Spring 2022)
Administrivia
★ Midterm Results
★ Median: 10/30
★ Midterm Questions
★ 1. f) A change of basis does not change the subspace of functions
★ 2. Pairs of distributions can be grouped into equivalence classes by their detection / false alarm probabilities: pairs in a class have the same NP function. The canonical pair in each class is (NP function as CDF, uniform distribution).
Food for thought: Think of any other important differences. Also, can we mix both?
Paradigms of inference
‣ Bayesian
‣ Non-Bayesian
Paradigms of learning
‣ Generative
‣ Discriminative
CS 57800
Supervised Learning
★ Nearest Neighbors
Classification
Learn from training data to predict accurately on unseen data
Basic terminology
‣ Data domain: An arbitrary set X. Often just X = R^d (assuming that the members of X are represented via feature vectors; some authors write φ(x) to emphasize this)
‣ Label domain: A discrete set Y; e.g., {0, 1} or {−1, +1}.
‣ Training data: A finite collection S = {(x_1, y_1), . . . , (x_N, y_N)} of pairs drawn from X × Y
‣ Classifier: A prediction rule h : X → Y (we’ll write h_S to emphasize the dependence of h on the training data). We call h a hypothesis, prediction rule, or classifier.
Probability model, setup
‣ Data distribution: Joint distribution P on X × Y. Important assumption: P is fixed but unknown.
We will write (X, Y) to denote a random variable with X taking values in X and Y taking values in Y.
‣ Class conditional distribution: Let Y = {0, 1}. We define the class conditionals P_0 := P(X ∈ · | Y = 0) and P_1 := P(X ∈ · | Y = 1), together with the prior π := P(Y = 1).
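With densities p_0, p_1 for the class conditionals, Bayes' rule ties these quantities to the posterior η(x) used in the next slides (a standard identity, stated here for completeness):

```latex
P_X = \pi P_1 + (1 - \pi) P_0,
\qquad
\eta(x) = P(Y = 1 \mid X = x)
        = \frac{\pi\, p_1(x)}{\pi\, p_1(x) + (1 - \pi)\, p_0(x)}.
```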
Bayes Classifier
‣ Goal: Minimize the risk / misclassification error
Intuitively, picking the most likely class given the data makes sense (notice, this is not limited to binary classification)
Bayes classifier

$$ h^*(x) := \begin{cases} 1, & \text{if } \eta(x) = P(Y = 1 \mid X = x) > \tfrac{1}{2}, \\ 0, & \text{otherwise.} \end{cases} $$
[Proof: Recall the MAP rule from the hypothesis testing lecture; derivation on the blackboard]
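A minimal sketch of the rule in Python, assuming the posterior η is available as a function `eta` (in practice it is unknown, which is why the Bayes classifier is only an idealized benchmark); the two-Gaussian example below is an illustrative choice:

```python
from scipy.stats import norm

def bayes_classifier(eta, x):
    """Idealized Bayes rule: predict 1 iff eta(x) = P(Y=1 | X=x) > 1/2."""
    return 1 if eta(x) > 0.5 else 0

# Illustrative eta: Gaussian class conditionals N(+1, 1) and N(-1, 1), prior pi = 1/2.
def eta(x, pi=0.5):
    p1, p0 = norm.pdf(x, loc=1.0), norm.pdf(x, loc=-1.0)
    return pi * p1 / (pi * p1 + (1 - pi) * p0)

print(bayes_classifier(eta, 0.3))  # -> 1 (0.3 is closer to the class-1 mean)
```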
Bayes Classifier
Exercise: Verify the following useful formulae

$$ L^* = \inf_{h : \mathbb{R}^d \to \{0,1\}} P(h(X) \neq Y) = \mathbb{E}[\min(\eta(X),\, 1 - \eta(X))] $$
Nearest-Neighbors
We now mention an approach that (seemingly) is distribution-free.
‣ Based on the belief: Features that are used to describe the data are relevant to the labelings in a way that makes “close by” points likely to have the same label
Nearest Neighbor Classification
One of the simplest possible classifiers!
Testing: to classify a new point x, find the training point nearest to x and predict its label.
k-NN: Let S = {(x_i, y_i) | 1 ≤ i ≤ N} be the training data. Let N_k(x) denote the k training points closest to x; predict the majority label among N_k(x).
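A minimal k-NN sketch (brute-force distances, majority vote; the value of k and the Euclidean metric are illustrative choices, not part of the definition):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)     # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                 # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Usage: four labeled points in R^2; the query lands in the "1" cluster.
X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([4.5, 5.2]), k=3))  # -> 1
```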
NN and Bayes
Asymptotically, it can be shown that the error of the NN classifier is

$$ L_{\mathrm{NN}} = \mathbb{E}[2\,\eta(X)(1 - \eta(X))]. $$

Theorem. L_NN ≤ 2L* (where L* is the Bayes error)
Proof Sketch. Let A(X) = min(η(X), 1 − η(X)), so that L* = E[A(X)]. Notice that 2η(1 − η) ≥ A; thus we have L* ≤ L_NN. Moreover, 2η(1 − η) = 2A(1 − A) ≤ 2A, so L_NN ≤ 2E[A(X)] = 2L*.
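A quick numerical sanity check of the sandwich L* ≤ L_NN ≤ 2L* (the uniform distribution of η(X) below is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
eta = rng.uniform(0.0, 1.0, size=100_000)   # draw eta(X) for 100k hypothetical points

L_star = np.mean(np.minimum(eta, 1 - eta))  # Bayes error  E[min(eta, 1-eta)]
L_nn   = np.mean(2 * eta * (1 - eta))       # asymptotic NN error  E[2 eta (1-eta)]

print(f"L* = {L_star:.3f}, L_NN = {L_nn:.3f}, 2L* = {2*L_star:.3f}")
# Expect roughly: L* = 0.250, L_NN = 0.333, 2L* = 0.500, so L* <= L_NN <= 2L*
```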
NN and Bayes: Remarks
‣ Read Ch. 19 of [SSS] for additional analysis of NN methods
‣ Check out Explaining the Success of Nearest Neighbor Methods in Prediction by George Chen and Devavrat Shah (2018).
Ideally, we want non-asymptotic results, to better understand how many examples (i.e., how large an N) we need to attain a certain error rate
We may also have some prior knowledge about (X, Y) that we wish to incorporate
Noise, robustness, adversarial learning, and other concerns
All of these can be accommodated; let us look at another, more explicit paradigm
Empirical Risk Minimization
The learner does not know P(X, Y), so the true error (Bayes error) is not known to the learner. However,
‣ Training Error: The error that the classifier incurs on the training data

$$ L_S(h) := \frac{1}{N}\, \#\{\, i \in [N] \mid h(x_i) \neq y_i \,\}, $$

aka the empirical risk
‣ ERM principle: Seek a predictor that minimizes L_S(h)
‣ Pitfall: Overfitting!
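The training error above translates directly into code; a minimal sketch, where `h` is any callable predictor:

```python
import numpy as np

def empirical_risk(h, X, y):
    """Fraction of training points the classifier h gets wrong: L_S(h)."""
    preds = np.array([h(x) for x in X])
    return np.mean(preds != y)

# Usage: a trivial threshold rule on 1-D data.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
h = lambda x: int(x[0] > 0)
print(empirical_risk(h, X, y))  # -> 0.0, this h separates the sample perfectly
```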
Overfitting

$$ S = \{(x_i, y_i) \mid 1 \le i \le N\} $$

Memorize:

$$ h(x) = \begin{cases} y_i, & \text{if } x = x_i \text{ for some } i, \\ 0, & \text{otherwise} \end{cases} $$

[Figure: training points labeled y = 0 and y = 1, with x distributed uniformly in the unit square]
✴ This classifier has 0 empirical risk!
Food for thought: The method Memorize is not fully “bad”, why?
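Memorize as code (a sketch): a lookup table achieves zero empirical risk yet predicts the constant 0 everywhere off the training set, so its true risk can be terrible.

```python
def make_memorize(S):
    """Return the Memorize classifier for training set S = [(x, y), ...]."""
    table = {x: y for x, y in S}
    return lambda x: table.get(x, 0)   # memorized label if seen, else default to 0

S = [((0.2, 0.7), 1), ((0.9, 0.1), 0)]
h = make_memorize(S)
print(h((0.2, 0.7)))   # -> 1   (zero training error by construction)
print(h((0.5, 0.5)))   # -> 0   (always 0 on unseen points, regardless of the truth)
```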
ERM with inductive bias
‣ Rather than give up on ERM, we search for settings where it may actually work.
‣ Inductive bias: Apply ERM over a restricted search space.
1. The learner chooses a hypothesis class H (i.e., the set of predictors it will optimize over) in advance, before having seen any training data (e.g., linear models, neural networks, random forests, etc.)
2. ERM_H uses ERM to learn h : X → Y by using S (training data)
Loss functions, ERM setup
Recall, the risk was defined as

$$ L(h) = P(h(X) \neq Y) $$
‣ Loss function: ℓ : H × X × Y → R_+
0/1 loss: ℓ_{0/1}(z) := ⟦z ≤ 0⟧
[Plot: the 0/1 loss as a function of z ∈ [−3, 3]]
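Here z is presumably the margin z = y·h(x) for a real-valued score h and y ∈ {−1, +1}, so that

```latex
\ell_{0/1}(y\,h(x)) = [\![\, y\,h(x) \le 0 \,]\!]
                    = [\![\, \operatorname{sign}(h(x)) \neq y \,]\!]
\quad \text{(ties at } h(x) = 0 \text{ counted as errors).}
```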
Loss functions, ERM setup
‣ Risk function: Expected loss of h ∈ H wrt the data distribution P over X × Y, i.e., L(h) := E[ℓ(h, X, Y)]
‣ Empirical Risk:

$$ L_S(h) := \frac{1}{N} \sum_{i=1}^{N} \ell(h, x_i, y_i) $$
‣ 0/1-loss:

$$ \ell_{0/1}(h, (x, y)) := \begin{cases} 1, & h(x) \neq y, \\ 0, & h(x) = y. \end{cases} $$

Exercise: Verify that with the 0/1-loss, the risk reduces to L(h) = P(h(X) ≠ Y)
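The verification is one line, since the expectation of an indicator is a probability:

```latex
L(h) = \mathbb{E}\big[\ell_{0/1}(h, (X, Y))\big]
     = \mathbb{E}\big[\mathbf{1}\{h(X) \neq Y\}\big]
     = P(h(X) \neq Y).
```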
ERM: Computation
‣ Computational question:

$$ \min_{h \in \mathcal{H}}\; \frac{1}{N} \sum_{i=1}^{N} \ell_{0/1}(h, x_i, y_i) \;=\; \min_{h \in \mathcal{H}}\; \frac{1}{N} \sum_{i=1}^{N} \big[\!\big[\, h(x_i) \neq y_i \,\big]\!\big] $$
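For a finite hypothesis class, this computational question has a trivial (if slow) answer: enumerate. A sketch with 1-D threshold classifiers (the class choice is illustrative):

```python
import numpy as np

def erm_thresholds(X, y, thresholds):
    """Brute-force ERM over the finite class H = {x -> [x > t] : t in thresholds}."""
    best_t, best_risk = None, np.inf
    for t in thresholds:
        risk = np.mean((X > t).astype(int) != y)   # empirical 0/1 risk of this threshold
        if risk < best_risk:
            best_t, best_risk = t, risk
    return best_t, best_risk

X = np.array([-2.0, -0.5, 0.4, 1.7])
y = np.array([0, 0, 1, 1])
# Finds some t in [-0.5, 0.4) with zero empirical risk on this sample.
print(erm_thresholds(X, y, thresholds=np.linspace(-3, 3, 61)))
```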
ERM: Theory
Question: When does ERM work? In other words, if we minimize L_S(h), what bearing does that have on L(h)?
The goal of learning theory is to study this (and such) question(s). Not the focus of this course.
Informally, if for all h ∈ H, L_S(h) is a good approximation to L(h), then ERM will also return a good hypothesis
ERM: Bias-complexity tradeoff
‣ Approx error: Min risk achievable by a predictor in H. Measures how much risk is due to the inductive bias (observe, it does not depend on N or S)
‣ Estimation error: Difference between the approx error and the error achieved by the ERM predictor (on test data). This error arises because the training error (empirical risk) is just a proxy for the true risk.
‣ Quality of estimation depends on the training set size N and on the “richness / complexity” of the hypothesis class (e.g., for a finite hypothesis class, the estimation error increases as log |H| and decreases with N).
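The tradeoff in symbols (a standard decomposition, stated here for reference): writing h_S for the ERM output,

```latex
L(h_S) \;=\; \underbrace{\min_{h \in \mathcal{H}} L(h)}_{\varepsilon_{\mathrm{app}}\ \text{(approximation)}}
\;+\; \underbrace{L(h_S) - \min_{h \in \mathcal{H}} L(h)}_{\varepsilon_{\mathrm{est}}\ \text{(estimation)}}
```

Enlarging H shrinks the approximation term but inflates the estimation term, and vice versa.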
Regularized ERM
“… the training data beyond the interpolation threshold. Recently, there has been an emerging recognition that certain interpolating predictors (not based on ERM) can indeed be provably statistically optimal or …”