This document discusses learnability and the VC dimension in statistical machine learning. It introduces the concept of PAC learnability, which is defined as the existence of a sample complexity function and learning algorithm that can return a hypothesis with error less than ε with probability at least 1-δ, given a sample size greater than the complexity function. The key ideas for proving learnability are interpreting errors as failures and successes, considering misleading samples, and applying a union bound over the finite hypothesis class.


CS 57800 Spring 2022

Statistical Machine Learning


Lecture 2.4
Classification: Learnability and VC dimension

Note: These slides are for your personal educational use only. Please do not distribute them.

Slides Acknowledgment:
Tommi Jaakkola, Devavrat Shah, David Sontag, Suvrit Sra (MIT EECS Course 6.867)
Anuran Makur CS 57800 (Spring 2022)

CS 57800

I. Statistical Inference
II. Supervised Learning


CS 57800

II. Supervised Learning

Classification: Foundations
Classification: Logistic regression, SVM, Kernels
Classification: Naive Bayes
Classification: Learnability and VC dimension
Regression (2 lectures)
Neural Networks (2 lectures)
Outline

★ Learnability

★ VC dimension

★ Discussion



Motivation

We saw that ERM without restrictions can quickly overfit (the Memorize method)

We introduced the idea of inductive bias to restrict the hypothesis class

We discussed regularization as a way to regulate the bias-complexity trade-off

? Fundamental question: Which hypothesis classes do not lead to overfitting, e.g., with ERM?

(In practice, carefully balance between over- and underfitting. Also, recall we had talked briefly about regularization as an approach to regulate the "bias-complexity" tradeoff)
Motivation
Why learning theory?
– What does it mean to learn?

– Understand limitations of learning and the necessity for prior knowledge

– What makes learning succeed?

– What learning rules to use?

– Understand the tradeoffs that influence and govern learning

These insights have been deeply influential over the years
Learnability
Aim: What does it mean to be able to learn?

Assumption 1: Suppose our hypothesis class is rich enough so that realizability holds, that is, there is a hypothesis h* ∈ ℋ such that Lℙ(h*) = 0

How to bound the error made by using ERM?

Understanding realizability:
➡ Over a random sample S ~ ℙ, LS(h*) = 0
➡ Exercise: Argue that realizability implies that the empirical risk LS(hS) = 0 for the ERM hypothesis hS
➡ Question: Since a risk of 0 is possible using our hypothesis class and ERM leads to 0 training error, are we done?

Bounding the risk:
➡ Lℙ(hS)? (recall this is the risk on unseen data of the ERM hypothesis hS)
➡ hS ≠ h* is possible!


Learnability: Simple setup

Assumption 2: Suppose the hypothesis class ℋ is finite.

Aim: Since Lℙ(h*) = 0 is possible, we seek to bound Lℙ(hS) ≤ ε

Theorem. Let ℋ be finite. Let δ ∈ (0,1), ε > 0, and let N ≥ log(|ℋ|/δ)/ε. Then, for any distribution ℙ for which realizability holds, with probability at least 1-δ over the choice of dataset S of size N, every ERM hypothesis hS satisfies Lℙ(hS) ≤ ε

Proof sketch: §2.3.1 of [SSS]. Note: Without realizability, a weaker result can be proved.
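As a quick numeric sanity check of the sample-size requirement, a minimal Python sketch; the class size and tolerances below are illustrative choices, not values from the slides:

import math

# Illustrative numbers (not from the slides): a finite class with |H| = 10^6,
# accuracy eps = 0.01, confidence delta = 0.05.
H_size, eps, delta = 10**6, 0.01, 0.05

# Sample size sufficient by the theorem: N >= log(|H|/delta) / eps
N = math.ceil(math.log(H_size / delta) / eps)
print(N)  # 1682 labeled examples suffice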

Observe: The bound involves a confidence parameter δ and an accuracy parameter ε:

1. Lℙ(hS) depends on S; there is a chance that S is not representative of the whole data. To model this chance, we introduce a confidence parameter δ in the bound above.
2. We cannot guarantee perfect label prediction, so we introduce an accuracy parameter ε.

Exercise 2 covers the details; the proof is analogous to how we bounded the parameter error for NB (Hoeffding + union bound).

Let's look at the key ideas…

Key ideas on how to prove the above theorem (see §2.3.1 of [SSS])

➡ Interpret Lℙ(hS) > ε as a failure of the learner and Lℙ(hS) ≤ ε as success


➡ Consider “misleading” samples S (S is a random dataset) for which there is a
hypothesis h that is “bad” (Lℙ(h) > ε) but looks “good” empirically (LS(h) = 0)

➡ Realizability says LS(hS) = 0, so Lℙ(hS) > ε implies that S is misleading

➡ For large enough N, the probability of this event is bounded

➡ Finiteness of the hypothesis class allows extending the above bound via the union bound
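Assembling these ideas into one chain (a sketch of the argument in §2.3.1 of [SSS], using realizability, the i.i.d. sampling, and the finiteness of ℋ):

\begin{aligned}
\Pr_{S \sim \mathbb{P}^N}\big[L_{\mathbb{P}}(h_S) > \varepsilon\big]
&\le \Pr\big[\exists\, h \in \mathcal{H}:\ L_{\mathbb{P}}(h) > \varepsilon \ \text{and}\ L_S(h) = 0\big]
&& \text{(realizability: } L_S(h_S) = 0\text{)}\\
&\le \sum_{h \in \mathcal{H}:\, L_{\mathbb{P}}(h) > \varepsilon} \Pr\big[L_S(h) = 0\big]
&& \text{(union bound over the finite class)}\\
&\le |\mathcal{H}|\,(1-\varepsilon)^N \le |\mathcal{H}|\, e^{-\varepsilon N} \le \delta
&& \text{(whenever } N \ge \log(|\mathcal{H}|/\delta)/\varepsilon\text{)}.
\end{aligned}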

PAC-learnability
(Probably approximately correct learnability)

(Formal definition of the (ε,δ)-style result we just saw)
PAC-learnability

Defn. A hypothesis class ℋ is PAC-learnable if there exists a sample complexity function Nℋ(ε, δ) and a learning algorithm with the following property: For every ε, δ ∈ (0,1) and every distribution ℙ such that the realizability assumption holds, after training using N ≥ Nℋ(ε, δ) i.i.d. examples generated from ℙ, the learning algorithm returns a hypothesis h such that Lℙ(h) ≤ ε with confidence 1-δ (over the choice of samples).

Informally
Say the hypothesis class ℋ is "rich" so that we have realizability. PAC-learnability of the class ℋ means that "enough" random examples drawn from the data distribution will allow approx. risk minimization, i.e., ensure Lℙ(h) ≤ ε, with probability ≥ 1-δ, where "enough" depends on the desired tolerances (ε,δ). The δ arises due to randomness of S drawn from ℙ, while ε arises due to the actual hypothesis picked by our learner.

The ε, δ are inevitable:
– Since training data S ~ ℙ(X, Y) is randomly drawn, there is a chance that it may be non-informative; hence a δ confidence parameter that accounts for this chance.

– Even when S faithfully represents ℙ(X, Y), it may miss some details (it is just a finite sample) and cause our classifier to make minor errors; thus the accuracy ε.
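A minimal simulation sketch of this definition in action. The specifics (X uniform on [0,1], the threshold class 𝕀{x < a}, the true threshold, and the sample size) are illustrative choices, not prescribed by the slides:

import numpy as np

rng = np.random.default_rng(0)

a_star = 0.6                      # true threshold: labels are y = 1{x < a_star}
eps, N, trials = 0.05, 200, 2000  # accuracy, sample size, number of repetitions

def erm_threshold(x, y):
    # ERM for thresholds: any boundary between the largest positively labeled x
    # and the smallest negatively labeled x has zero training error; take the midpoint.
    pos, neg = x[y == 1], x[y == 0]
    lo = pos.max() if pos.size else 0.0
    hi = neg.min() if neg.size else 1.0
    return (lo + hi) / 2.0

failures = 0
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, size=N)
    y = (x < a_star).astype(int)
    a_hat = erm_threshold(x, y)
    true_risk = abs(a_hat - a_star)   # for uniform X, L_P(h_S) = |a_hat - a_star|
    failures += int(true_risk > eps)

print(failures / trials)  # empirical estimate of P[L_P(h_S) > eps]; stays far below any reasonable delta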
PAC-learnability: Questions

Question: True or False? Any finite hypothesis class (assuming realizability) is PAC-learnable with sample complexity

Nℋ(ε, δ) ≤ O( log(|ℋ|/δ) / ε )

Question: What if realizability does not hold?

Question: What about infinite hypothesis classes?
(Agnostic) PAC-learnability

Question: What if realizability does not hold?

By the No-Free-Lunch (NFL) theorem, we know that no learner is guaranteed to match the Bayes classifier in general (an adversarial distribution can always be constructed on which our learner fails while another may succeed).

Then there is no hope of satisfying Lℙ(h) ≤ ε. Let us weaken our aim, and see if we can at least come ε-close to the best possible classifier within our hypothesis class with high probability:

Lℙ(h) ≤ inf_{h′ ∈ ℋ} Lℙ(h′) + ε

(Note: The hypothesis class ℋ may be "bad," but we are just trying to be approximately as good as the best hypothesis within this class w.h.p.)

More generally: L(h) = Eℙ[loss(h, X, Y)] with general loss functions, not necessarily the 0/1-loss.
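For concreteness, the general-loss risk referred to here uses the standard definitions (not specific to these slides); for example, the 0/1 loss for classification and the squared loss for regression:

L_{\mathbb{P}}(h) = \mathbb{E}_{(X,Y)\sim\mathbb{P}}\big[\,\ell\big(h,(X,Y)\big)\big],
\qquad
\ell_{0\text{-}1}\big(h,(x,y)\big) = \mathbb{1}[h(x)\neq y],
\qquad
\ell_{\mathrm{sq}}\big(h,(x,y)\big) = \big(h(x)-y\big)^2 .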
(Agnostic) PAC-learnability

Error decomposition: To control overfitting, we introduced inductive bias. Let us look at a fundamental error decomposition in ML.

Recall from earlier: Lℙ(hS) = ε_apx + ε_est. Thus, the probability of error on random (unseen) data decomposes into

ε_apx := min_{h∈ℋ} Lℙ(h)   (approximation error)
ε_est := Lℙ(hS) − ε_apx   (estimation error)

The approximation error depends on the fit of our prior knowledge (via the inductive bias) to the unknown underlying distribution.

PAC-learnability requires: the estimation error should be bounded uniformly over all distributions for a given hypothesis class.

Finite hypothesis classes are PAC-learnable, but so are infinite ones provided a more refined notion of "size" is considered.
Uniform convergence ⇒ PAC-learnability

Idea: If LS(h) is close to the true risk Lℙ(h) for all h ∈ ℋ, then the ERM solution hS will also have small true risk Lℙ(hS)

This leads us to introduce the notion of an ε-representative data sample

Definition. A dataset S is called ε-representative if

∀h ∈ ℋ, | LS(h) − Lℙ(h) | ≤ ε

Lemma. Assume S is ε/2-representative. Then, any ERM solution hS ∈ argmin_{h∈ℋ} LS(h) satisfies

Lℙ(hS) ≤ min_{h∈ℋ} Lℙ(h) + ε

Proof: for every h ∈ ℋ,

Lℙ(hS) ≤ LS(hS) + ε/2 ≤ LS(h) + ε/2 ≤ Lℙ(h) + ε/2 + ε/2 = Lℙ(h) + ε
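Each inequality in the chain above has a one-line justification; spelled out for any h ∈ ℋ:

\begin{aligned}
L_{\mathbb{P}}(h_S) &\le L_S(h_S) + \varepsilon/2 && \text{($S$ is $\varepsilon/2$-representative, applied to $h_S$)}\\
&\le L_S(h) + \varepsilon/2 && \text{($h_S$ minimizes the empirical risk $L_S$)}\\
&\le L_{\mathbb{P}}(h) + \varepsilon/2 + \varepsilon/2 && \text{($\varepsilon/2$-representativeness, applied to $h$)} .
\end{aligned}

Minimizing over h ∈ ℋ gives Lℙ(hS) ≤ min_{h∈ℋ} Lℙ(h) + ε, which is the lemma.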
Infinite hypothesis classes

Finite hypothesis classes are PAC-learnable, but so are infinite ones provided a more refined notion of "size" is considered.

VC Dimension
(We are in the PAC-learnability setup, so we assume realizability)
The VC dimension: Motivation

2 points in 1D: all possible labelings (+ -, - +, …) are explainable by a linear classifier.

3 points in 1D: labelings such as + + - and + - - are explainable, but + - + cannot be classified exactly with just a linear boundary.
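This counting claim can be verified by brute force; a short Python sketch in which a 1D "linear classifier" is taken to be a threshold with either orientation, and a fine grid of boundaries stands in for all real thresholds:

def realizable_labelings(points):
    # Label vectors achievable on `points` by a 1D linear boundary (threshold),
    # allowing the "+" region to lie on either side of the boundary.
    grid = [i / 100 for i in range(-200, 201)]
    labels = set()
    for a in grid:
        labels.add(tuple(int(x < a) for x in points))   # positive region on the left
        labels.add(tuple(int(x > a) for x in points))   # positive region on the right
    return labels

print(len(realizable_labelings([0.3, 0.7])))        # 4 = 2^2: two 1D points can be labeled every way
print(len(realizable_labelings([0.2, 0.5, 0.8])))   # 6 < 2^3: e.g. (1, 0, 1) is not achievable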
The VC dimension: Motivation

For learnability, what matters is not the literal size of the hypothesis class but the maximum number of data points that can be classified exactly.

Example: The class of thresholds { 𝕀{x < a} : a ∈ ℝ } on a line is PAC-learnable. Thus, if there is a threshold h* s.t. Lℙ(h*) = 0, then we can come within ε of h* in O(log(1/δ)/ε) samples.

Observations:
Suppose the hypothesis class ℋ captures all 2^|C| possible labelings of the points in a finite set C with {0,1}
The PAC-learning definition considers every distribution where realizability holds
Consider any (positive) dist ℙ with support C s.t. ∃h ∈ ℋ, Lℙ(h) = 0
If a random sample S ∼ ℙ contains half the points in C, then the labels of S provide no information about the labels ℙ assigns to the other points in C
Hence, although realizability holds, we cannot learn a good hypothesis h ∈ ℋ based on |C|/2 samples S

Intuition: It is harder (more samples S required) to learn when there is a larger C where ℋ captures all possible labelings
VC dimension: Shattering

Defn. A finite set of points C is shattered by a hypothesis space ℋ if for all possible 2^|C| labelings of the points in C into {0,1}, there is a consistent hypothesis in ℋ (i.e., with 0 error).

Defn. The Vapnik-Chervonenkis (VC) dimension of a hypothesis class ℋ (over the data domain) is the size of the largest subset of the data domain that is shattered by ℋ. We will just write VC(ℋ) (the data space is implicit).

Note: It is important to note the quantifiers in the definition. Shattering stipulates that there exists a set of size VC(ℋ) that can be shattered.

Theorem: If ℋ has infinite VC dimension, then it is not PAC-learnable.
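The definition can be checked by brute force on small examples; a Python sketch, using intervals 𝕀{a ≤ x ≤ b} on a grid as an illustrative hypothesis class (not one discussed on the slides):

def induced_labelings(points, hypotheses):
    # The restriction H_C: all distinct label vectors the class induces on `points`.
    return {tuple(h(x) for x in points) for h in hypotheses}

def is_shattered(points, hypotheses):
    return len(induced_labelings(points, hypotheses)) == 2 ** len(points)

grid = [i / 20 for i in range(21)]
intervals = [lambda x, a=a, b=b: int(a <= x <= b) for a in grid for b in grid if a <= b]

print(is_shattered([0.3, 0.7], intervals))       # True: a 2-point set is shattered, so VC >= 2
print(is_shattered([0.2, 0.5, 0.8], intervals))  # False: the labeling (1, 0, 1) needs a "hole"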
Shattering: Examples

3 points in 2D: a linear classifier shatters these.
Figure 1. Three points in ℝ², shattered by oriented lines. [Fig: C. Burges]

4 points in 2D: a linear classifier cannot shatter!

From §2.3 of the same source, "The VC Dimension and the Number of Parameters": The VC dimension thus gives concreteness to the notion of the capacity of a given set of functions. Intuitively, one might be led to expect that learning machines with many parameters would have high VC dimension, while learning machines with few parameters would have low VC dimension. There is a striking counterexample to this, due to E. Levin and J.S. Denker (Vapnik, 1995): a learning machine with just one parameter, but with infinite VC dimension (a family of classifiers is said to have infinite VC dimension if it can shatter l points, no matter how large l). Define the step function θ(x), x ∈ ℝ: θ(x) = 1 ∀x > 0; θ(x) = −1 ∀x ≤ 0. Consider the one-parameter family of functions defined by f(x, α) ≡ θ(sin(αx)), x, α ∈ ℝ. You choose some number l, and present me with the task of finding l points that can be shattered. I choose them to be x_i = 10^{−i}, i = 1, …, l.
VC dimension
To show that VC(ℋ) = d, we need to prove two things:

1 There exists a set C of size d that is shattered by ℋ

2 No set of size d+1 is shattered by ℋ

Problem: Interpret the bounds d1 ≤ d ≤ d2 on the VC dimension in terms of the above two points.
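A worked instance of this two-step recipe, for the threshold class ℋ = { 𝕀{x < a} : a ∈ ℝ } from the motivation slide: (1) the single point C = {0} is shattered, since a = 1 labels it 1 and a = −1 labels it 0, so VC(ℋ) ≥ 1; (2) no two points x1 < x2 can be shattered, since the labeling (0, 1) would require x1 ≥ a and x2 < a, i.e., a ≤ x1 < x2 < a, which is impossible, so VC(ℋ) ≤ 1. Hence VC(ℋ) = 1.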



Fundamental theorem

We have already shown that a class of infinite VC-dimension is not learnable. The converse statement is also true, leading to the fundamental theorem of statistical learning theory:

Theorem 6.7 (The Fundamental Theorem of Statistical Learning) Let H be a hypothesis class of functions from a domain X to {0, 1} and let the loss function be the 0-1 loss. Then, the following are equivalent:
1. H has the uniform convergence property.
2. Any ERM rule is a successful agnostic PAC learner for H.
3. H is agnostic PAC learnable.
4. H is PAC learnable.
5. Any ERM rule is a successful PAC learner for H.
6. H has a finite VC-dimension.

We already saw 1 → 2. 2 → 3, 3 → 4, and 2 → 5 are easy. The contrapositive of 4 → 6 was also mentioned earlier. For 4 → 6 and 5 → 6, see [Shalev-Shwartz and Ben-David, 2014] if interested. High-level comments on 6 → 1 to follow.

Note: We remarked that having infinite VC dimension implies that a hypothesis class is not PAC-learnable. The above theorem also establishes the converse.
Sample complexity

Not only does the VC-dimension characterize PAC learnability; it even determines the sample complexity.

Theorem 6.8 (The Fundamental Theorem of Statistical Learning, Quantitative Version) Let H be a hypothesis class of functions from a domain X to {0, 1} and let the loss function be the 0-1 loss. Assume that VCdim(H) = d < ∞. Then, there are absolute constants C1, C2 such that:

1. H has the uniform convergence property with sample complexity
   C1 (d + log(1/δ))/ε² ≤ m_H^UC(ε, δ) ≤ C2 (d + log(1/δ))/ε²

2. H is agnostic PAC learnable with sample complexity
   C1 (d + log(1/δ))/ε² ≤ m_H(ε, δ) ≤ C2 (d + log(1/δ))/ε²

3. H is PAC learnable with sample complexity
   C1 (d + log(1/δ))/ε ≤ m_H(ε, δ) ≤ C2 (d log(1/ε) + log(1/δ))/ε

(The book's m_H(ε, δ) is our Nℋ(ε, δ).) The proof of this theorem is given in Chapter 28; see [Shalev-Shwartz and Ben-David, 2014] if interested.

Remark 6.3: We stated the fundamental theorem for binary classification tasks. A similar result holds for some other learning problems such as regression with the absolute loss or the squared loss. However, the theorem does not hold for all learning tasks. In particular, learnability is sometimes possible even though the uniform convergence property does not hold (we will see an example in Chapter 13, Exercise 2). Furthermore, in some situations, the ERM rule fails but learnability is possible with other learning rules.

Lower bound: Let ℋ be any hypothesis class with VC(ℋ) = d. Then any PAC-learning algorithm must use at least Ω(d/ε) samples.
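To see the different ε-dependence of the realizable and agnostic rates, a small Python comparison of the upper-bound shapes above (constants dropped; d, ε, δ are hypothetical choices):

import math

d, eps, delta = 10, 0.01, 0.05
realizable = (d * math.log(1 / eps) + math.log(1 / delta)) / eps   # shape of item 3's upper bound
agnostic   = (d + math.log(1 / delta)) / eps**2                    # shape of item 2's upper bound
print(round(realizable), round(agnostic))   # roughly 4.9e3 vs 1.3e5 samples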


Proof of the fundamental theorem

Two main parts for 6 → 1:

Suppose VC(ℋ) = d. Although ℋ may be infinite, the number of possible labelings it induces on a finite subset C of the data domain, i.e., its "effective size," is only O(|C|^d) (instead of exponential in |C|, as |C| grows large)

The hypothesis class ℋ enjoys the uniform convergence property when its "effective size" is small

How to define "effective size"?
The VC-Dimension

Effective size: Growth function / shatter coefficient

Definition 6.9 (Growth Function) Let H be a hypothesis class. Then the growth function of H, denoted τ_H : ℕ → ℕ, is defined as

τ_H(m) = max_{C ⊂ X : |C| = m} |H_C|.

In words, τ_H(m) is the number of different functions from a set C of size m to {0, 1} that can be obtained by restricting H to C.

(note: m above is the same as our N)
(note: number of different functions = number of different labelings)
(note: ℋ_C = {(h(c) : c ∈ C) | h ∈ ℋ} is the restriction of ℋ to C)

Obviously, if VCdim(H) = d then for any m ≤ d we have τ_H(m) = 2^m. In such cases, H induces all possible functions from C to {0, 1}. The following beautiful lemma, proposed independently by Sauer, Shelah, and Perles, shows that when m becomes larger than the VC-dimension, the growth function increases polynomially rather than exponentially with m.

Lemma 6.10 (Sauer-Shelah-Perles) Let H be a hypothesis class with VCdim(H) ≤ d < ∞. Then, for all m, τ_H(m) ≤ ∑_{i=0}^{d} (m choose i). In particular, if m > d + 1 then τ_H(m) ≤ (em/d)^d.

Verbosely: the growth function is the maximum number of different labelings of a set C of size m (our N) that can be obtained by using the classifiers in our hypothesis class on the members of C.
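A brute-force Python sketch of the growth function for the 1D threshold class from earlier (a grid of thresholds stands in for a ∈ ℝ); for this class the count comes out to m + 1, far below 2^m:

def growth_function(points, hypotheses):
    # |H_C| for C = points: the number of distinct labelings induced on the set.
    return len({tuple(h(x) for x in points) for h in hypotheses})

thresholds = [lambda x, a=a: int(x < a) for a in [i / 1000 for i in range(-100, 1101)]]

for m in [1, 2, 5, 10]:
    points = [i / (m + 1) for i in range(1, m + 1)]    # m distinct points in (0, 1)
    print(m, growth_function(points, thresholds))       # prints m + 1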
Sauer's lemma: Bounding growth functions

Lemma 6.10 (Sauer-Shelah-Perles, restated) Let H be a hypothesis class with VCdim(H) ≤ d < ∞. Then, for all m, τ_H(m) ≤ ∑_{i=0}^{d} (m choose i). In particular, if m > d + 1 then τ_H(m) ≤ (em/d)^d.

Thus, finite VC dimension implies polynomial growth in m, while infinite VC dimension means exponential growth. Polynomial growth is crucial to deducing the uniform convergence property.

Proof idea (see the proof of Sauer's Lemma in [SSS]): It suffices to prove the following stronger claim: for any C = {c1, ..., cm},

∀ H, |H_C| ≤ |{B ⊆ C : H shatters B}|.    (6.3)

Equation (6.3) suffices because if VCdim(H) ≤ d then no set whose size is larger than d is shattered by H, and therefore

|{B ⊆ C : H shatters B}| ≤ ∑_{i=0}^{d} (m choose i).

When m > d + 1, the right-hand side is at most (em/d)^d (see Lemma A.5 in Appendix A of [SSS]). Equation (6.3) itself is proved by induction on m.
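A numeric illustration of the polynomial-versus-exponential gap in Sauer's lemma; d = 3 below is a hypothetical VC dimension, not a class from the slides:

import math

d = 3
for m in [5, 10, 20, 50]:
    sauer = sum(math.comb(m, i) for i in range(d + 1))   # Sauer's bound on tau_H(m)
    loose = (math.e * m / d) ** d                        # the (em/d)^d form, valid for m > d + 1
    print(f"m={m:3d}  2^m={2**m:16d}  Sauer={sauer:6d}  (em/d)^d={loose:10.1f}")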
Uniform convergence with "small effective size"

Theorem. Let ℋ be a class and τ_ℋ its growth function. Then for every distribution ℙ(X, Y) and every δ ∈ (0,1), with probability at least 1 − δ over the choice of S ∼ ℙ of size N, we have, for every h ∈ ℋ,

| Lℙ(h) − LS(h) | ≤ (4 + √(log τ_ℋ(2N))) / (δ √(2N))

Corollary: If VC(ℋ) is finite, then the uniform convergence property holds. In fact,

N_ℋ^UC(ε, δ) ≤ Õ( d / (δε)² )

suffices for the UC property to hold (the O-tilde is hiding logarithmic factors).

Proof: For N > d, using Sauer's lemma we have τ_ℋ(2N) ≤ (2eN/d)^d. Now invoke the above theorem with N ≥ N_ℋ^UC(ε, δ).