This document discusses learnability and the VC dimension in statistical machine learning. It introduces the concept of PAC learnability, which is defined as the existence of a sample complexity function and learning algorithm that can return a hypothesis with error less than ε with probability at least 1-δ, given a sample size greater than the complexity function. The key ideas for proving learnability are interpreting errors as failures and successes, considering misleading samples, and applying a union bound over the finite hypothesis class.


CS 57800 Spring 2022

Statistical Machine Learning


Lecture 2.4
Classification: Learnability and VC dimension

Note: These slides are for your personal educational use only. Please do not distribute them.

Slides Acknowledgment:
Tommi Jaakkola, Devavrat Shah, David Sontag, Suvrit Sra (MIT EECS Course 6.867)
Anuran Makur CS 57800 (Spring 2022)

CS 57800

I. Statistical Inference
II. Supervised Learning


CS 57800

II. Supervised Learning

Classification: Foundations
Classification: Logistic regression, SVM, Kernels
Classification: Naive Bayes
Classification: Learnability and VC dimension
Regression (2 lectures)
Neural Networks (2 lectures)
Outline

★ Learnability

★ VC dimension

★ Discussion



Motivation

We saw that ERM without restrictions can quickly overfit (the Memorize method)

We introduced the idea of inductive bias to restrict the hypothesis class

We discussed regularization as a way to regulate the bias-complexity trade-off

? Fundamental question: Which hypothesis classes do not lead to overfitting, e.g., with ERM?

(In practice, carefully balance between over- and underfitting. Also, recall we had talked briefly about regularization as an approach to regulate the "bias-complexity" tradeoff)
Motivation
Why learning theory?
– What does it mean to learn?

– Understand limitations of learning and the necessity for prior knowledge

– What makes learning succeed?

– What learning rules to use?

– Understand the tradeoffs that influence and govern learning

These insights have been deeply influential over the years
Learnability
Aim: What does it mean to be able to learn?

Assumption 1: Suppose our hypothesis class is rich enough so that realizability holds, that is, there is a hypothesis h* ∈ ℋ such that Lℙ(h*) = 0

How to bound the error made by using ERM?

Understanding realizability:
➡ Over a random sample S ~ ℙ, LS(h*) = 0
➡ Exercise: Argue that realizability implies that the empirical risk LS(hS) = 0 for the ERM hypothesis hS
➡ Question: Since a risk of 0 is possible using our hypothesis class and ERM leads to 0 training error, are we done?

Bounding the risk:
➡ Lℙ(hS)? (recall this is the risk on unseen data of the ERM hypothesis hS)
➡ hS ≠ h* is possible!


Learnability: Simple setup

Assumption 2: Suppose the hypothesis class ℋ is finite.

Aim: Since Lℙ(h*) = 0 is possible, we seek to bound Lℙ(hS) ≤ ε

Theorem. Let ℋ be finite. Let δ ∈ (0,1), ε > 0, and let N ≥ log(|ℋ|/δ)/ε. Then, for any distribution ℙ for which realizability holds, with probability at least 1-δ over the choice of dataset S of size N, every ERM hypothesis hS satisfies Lℙ(hS) ≤ ε

Proof sketch: §2.3.1 of [SSS]. Note: Without realizability, a weaker result can be proved.
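As a quick numeric sanity check of the sample-size requirement, a minimal Python sketch; the class size and tolerances below are illustrative choices, not values from the slides:

import math

# Illustrative numbers (not from the slides): a finite class with |H| = 10^6,
# accuracy eps = 0.01, confidence delta = 0.05.
H_size, eps, delta = 10**6, 0.01, 0.05

# Sample size sufficient by the theorem: N >= log(|H|/delta) / eps
N = math.ceil(math.log(H_size / delta) / eps)
print(N)  # 1682 labeled examples suffice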

Observe: The bound involves a confidence parameter δ and an accuracy parameter ε:

1. Lℙ(hS) depends on S; there is a chance that S is not representative of the whole data. To model this chance, we introduce a confidence parameter δ in the bound above.
2. We cannot guarantee perfect label prediction, so we introduce an accuracy parameter ε.

Exercise 2 covers the details; the proof is analogous to how we bounded the parameter error for NB (Hoeffding + union bound).

Let's look at the key ideas…

Key ideas on how to prove the above theorem (see §2.3.1 of [SSS])

➡ Interpret Lℙ(hS) > ε as a failure of the learner and Lℙ(hS) ≤ ε as success


➡ Consider “misleading” samples S (S is a random dataset) for which there is a
hypothesis h that is “bad” (Lℙ(h) > ε) but looks “good” empirically (LS(h) = 0)

➡ Realizability says LS(hS) = 0, so Lℙ(hS) > ε implies that S is misleading

➡ For large enough N, the probability of this event is bounded

➡ Finiteness of the hypothesis class allows extending the above bound via the union bound
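Assembling these ideas into one chain (a sketch of the argument in §2.3.1 of [SSS], using realizability, the i.i.d. sampling, and the finiteness of ℋ):

\begin{aligned}
\Pr_{S \sim \mathbb{P}^N}\big[L_{\mathbb{P}}(h_S) > \varepsilon\big]
&\le \Pr\big[\exists\, h \in \mathcal{H}:\ L_{\mathbb{P}}(h) > \varepsilon \ \text{and}\ L_S(h) = 0\big]
&& \text{(realizability: } L_S(h_S) = 0\text{)}\\
&\le \sum_{h \in \mathcal{H}:\, L_{\mathbb{P}}(h) > \varepsilon} \Pr\big[L_S(h) = 0\big]
&& \text{(union bound over the finite class)}\\
&\le |\mathcal{H}|\,(1-\varepsilon)^N \le |\mathcal{H}|\, e^{-\varepsilon N} \le \delta
&& \text{(whenever } N \ge \log(|\mathcal{H}|/\delta)/\varepsilon\text{)}.
\end{aligned}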

PAC-learnability
(Probably approximately correct learnability)

(Formal definition of the (ε,δ)-style result we just saw)
PAC-learnability

Defn. A hypothesis class ℋ is PAC-learnable if there exists a sample complexity function Nℋ(ε, δ) and a learning algorithm with the following property: For every ε, δ ∈ (0,1) and every distribution ℙ such that the realizability assumption holds, after training using N ≥ Nℋ(ε, δ) i.i.d. examples generated from ℙ, the learning algorithm returns a hypothesis h such that Lℙ(h) ≤ ε with confidence 1-δ (over the choice of samples).

Informally
Say the hypothesis class ℋ is "rich" so that we have realizability. PAC-learnability of the class ℋ means that "enough" random examples drawn from the data distribution will allow approx. risk minimization, i.e., ensure Lℙ(h) ≤ ε, with probability ≥ 1-δ, where "enough" depends on the desired tolerances (ε,δ). The δ arises due to randomness of S drawn from ℙ, while ε arises due to the actual hypothesis picked by our learner.

The ε, δ are inevitable:
– Since training data S ~ ℙ(X, Y) is randomly drawn, there is a chance that it may be non-informative; hence a δ confidence parameter that accounts for this chance.

– Even when S faithfully represents ℙ(X, Y), it may miss some details (it is just a finite sample) and cause our classifier to make minor errors; thus the accuracy ε.
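A minimal simulation sketch of this definition in action. The specifics (X uniform on [0,1], the threshold class 𝕀{x < a}, the true threshold, and the sample size) are illustrative choices, not prescribed by the slides:

import numpy as np

rng = np.random.default_rng(0)

a_star = 0.6                      # true threshold: labels are y = 1{x < a_star}
eps, N, trials = 0.05, 200, 2000  # accuracy, sample size, number of repetitions

def erm_threshold(x, y):
    # ERM for thresholds: any boundary between the largest positively labeled x
    # and the smallest negatively labeled x has zero training error; take the midpoint.
    pos, neg = x[y == 1], x[y == 0]
    lo = pos.max() if pos.size else 0.0
    hi = neg.min() if neg.size else 1.0
    return (lo + hi) / 2.0

failures = 0
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, size=N)
    y = (x < a_star).astype(int)
    a_hat = erm_threshold(x, y)
    true_risk = abs(a_hat - a_star)   # for uniform X, L_P(h_S) = |a_hat - a_star|
    failures += int(true_risk > eps)

print(failures / trials)  # empirical estimate of P[L_P(h_S) > eps]; stays far below any reasonable delta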
PAC-learnability: Questions

Question: True or False? Any finite hypothesis class (assuming realizability) is PAC-learnable with sample complexity

Nℋ(ε, δ) ≤ O( log(|ℋ|/δ) / ε )

Question: What if realizability does not hold?

Question: What about infinite hypothesis classes?
(Agnostic) PAC-learnability

Question: What if realizability does not hold?

By the No-Free-Lunch (NFL) theorem, we know that no learner is guaranteed to match the Bayes classifier in general (an adversarial distribution can always be constructed on which our learner fails while another may succeed).

Then there is no hope of satisfying Lℙ(h) ≤ ε. Let us weaken our aim, and see if we can at least come ε-close to the best possible classifier within our hypothesis class with high probability:

Lℙ(h) ≤ inf_{h′ ∈ ℋ} Lℙ(h′) + ε

(Note: The hypothesis class ℋ may be "bad," but we are just trying to be approximately as good as the best hypothesis within this class w.h.p.)

More generally: L(h) = Eℙ[loss(h, X, Y)] with general loss functions, not necessarily the 0/1-loss.
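For concreteness, the general-loss risk referred to here uses the standard definitions (not specific to these slides); for example, the 0/1 loss for classification and the squared loss for regression:

L_{\mathbb{P}}(h) = \mathbb{E}_{(X,Y)\sim\mathbb{P}}\big[\,\ell\big(h,(X,Y)\big)\big],
\qquad
\ell_{0\text{-}1}\big(h,(x,y)\big) = \mathbb{1}[h(x)\neq y],
\qquad
\ell_{\mathrm{sq}}\big(h,(x,y)\big) = \big(h(x)-y\big)^2 .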
(Agnostic) PAC-learnability

Error decomposition: To control overfitting, we introduced inductive bias. Let us look at a fundamental error decomposition in ML.

Recall from earlier: Lℙ(hS) = ε_apx + ε_est. Thus, the probability of error on random (unseen) data decomposes into

ε_apx := min_{h∈ℋ} Lℙ(h)   (approximation error)
ε_est := Lℙ(hS) − ε_apx   (estimation error)

The approximation error depends on the fit of our prior knowledge (via the inductive bias) to the unknown underlying distribution.

PAC-learnability requires: the estimation error should be bounded uniformly over all distributions for a given hypothesis class.

Finite hypothesis classes are PAC-learnable, but so are infinite ones provided a more refined notion of "size" is considered.
Uniform convergence ⇒ PAC-learnability

Idea: If LS(h) is close to the true risk Lℙ(h) for all h ∈ ℋ, then the ERM solution hS will also have small true risk Lℙ(hS)

This leads us to introduce the notion of an ε-representative data sample

Definition. A dataset S is called ε-representative if

∀h ∈ ℋ, | LS(h) − Lℙ(h) | ≤ ε

Lemma. Assume S is ε/2-representative. Then, any ERM solution hS ∈ argmin_{h∈ℋ} LS(h) satisfies

Lℙ(hS) ≤ min_{h∈ℋ} Lℙ(h) + ε

Proof: for every h ∈ ℋ,

Lℙ(hS) ≤ LS(hS) + ε/2 ≤ LS(h) + ε/2 ≤ Lℙ(h) + ε/2 + ε/2 = Lℙ(h) + ε
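Each inequality in the chain above has a one-line justification; spelled out for any h ∈ ℋ:

\begin{aligned}
L_{\mathbb{P}}(h_S) &\le L_S(h_S) + \varepsilon/2 && \text{($S$ is $\varepsilon/2$-representative, applied to $h_S$)}\\
&\le L_S(h) + \varepsilon/2 && \text{($h_S$ minimizes the empirical risk $L_S$)}\\
&\le L_{\mathbb{P}}(h) + \varepsilon/2 + \varepsilon/2 && \text{($\varepsilon/2$-representativeness, applied to $h$)} .
\end{aligned}

Minimizing over h ∈ ℋ gives Lℙ(hS) ≤ min_{h∈ℋ} Lℙ(h) + ε, which is the lemma.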
Infinite hypothesis classes

Finite hypothesis classes are PAC-learnable, but so are infinite ones provided a more refined notion of "size" is considered.

VC Dimension
(We are in the PAC-learnability setup, so we assume realizability)
The VC dimension: Motivation

2 points in 1D: all possible labelings (+ -, - +, …) are explainable by a linear classifier.

3 points in 1D: labelings such as + + - and + - - are explainable, but + - + cannot be classified exactly with just a linear boundary.
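This counting claim can be verified by brute force; a short Python sketch in which a 1D "linear classifier" is taken to be a threshold with either orientation, and a fine grid of boundaries stands in for all real thresholds:

def realizable_labelings(points):
    # Label vectors achievable on `points` by a 1D linear boundary (threshold),
    # allowing the "+" region to lie on either side of the boundary.
    grid = [i / 100 for i in range(-200, 201)]
    labels = set()
    for a in grid:
        labels.add(tuple(int(x < a) for x in points))   # positive region on the left
        labels.add(tuple(int(x > a) for x in points))   # positive region on the right
    return labels

print(len(realizable_labelings([0.3, 0.7])))        # 4 = 2^2: two 1D points can be labeled every way
print(len(realizable_labelings([0.2, 0.5, 0.8])))   # 6 < 2^3: e.g. (1, 0, 1) is not achievable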
The VC dimension: Motivation

For learnability, what matters is not the literal size of the hypothesis class but the maximum number of data points that can be classified exactly.

Example: The class of thresholds { 𝕀{x < a} : a ∈ ℝ } on a line is PAC-learnable. Thus, if there is a threshold h* s.t. Lℙ(h*) = 0, then we can come within ε of h* in O(log(1/δ)/ε) samples.

Observations:
Suppose the hypothesis class ℋ captures all 2^|C| possible labelings of the points in a finite set C with {0,1}
The PAC-learning definition considers every distribution where realizability holds
Consider any (positive) dist ℙ with support C s.t. ∃h ∈ ℋ, Lℙ(h) = 0
If a random sample S ∼ ℙ contains half the points in C, then the labels of S provide no information about the labels ℙ assigns to the other points in C
Hence, although realizability holds, we cannot learn a good hypothesis h ∈ ℋ based on |C|/2 samples S

Intuition: It is harder (more samples S required) to learn when there is a larger C where ℋ captures all possible labelings
VC dimension: Shattering

Defn. A finite set of points C is shattered by a hypothesis space ℋ if for all possible 2^|C| labelings of the points in C into {0,1}, there is a consistent hypothesis in ℋ (i.e., with 0 error).

Defn. The Vapnik-Chervonenkis (VC) dimension of a hypothesis class ℋ (over the data domain) is the size of the largest subset of the data domain that is shattered by ℋ. We will just write VC(ℋ) (the data space is implicit).

Note: It is important to note the quantifiers in the definition. Shattering stipulates that there exists a set of size VC(ℋ) that can be shattered.

Theorem: If ℋ has infinite VC dimension, then it is not PAC-learnable.
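The definition can be checked by brute force on small examples; a Python sketch, using intervals 𝕀{a ≤ x ≤ b} on a grid as an illustrative hypothesis class (not one discussed on the slides):

def induced_labelings(points, hypotheses):
    # The restriction H_C: all distinct label vectors the class induces on `points`.
    return {tuple(h(x) for x in points) for h in hypotheses}

def is_shattered(points, hypotheses):
    return len(induced_labelings(points, hypotheses)) == 2 ** len(points)

grid = [i / 20 for i in range(21)]
intervals = [lambda x, a=a, b=b: int(a <= x <= b) for a in grid for b in grid if a <= b]

print(is_shattered([0.3, 0.7], intervals))       # True: a 2-point set is shattered, so VC >= 2
print(is_shattered([0.2, 0.5, 0.8], intervals))  # False: the labeling (1, 0, 1) needs a "hole"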
Shattering: Examples

3 points in 2D: a linear classifier shatters these.
Figure 1. Three points in ℝ², shattered by oriented lines. [Fig: C. Burges]

4 points in 2D: a linear classifier cannot shatter!

From §2.3 of the same source, "The VC Dimension and the Number of Parameters": The VC dimension thus gives concreteness to the notion of the capacity of a given set of functions. Intuitively, one might be led to expect that learning machines with many parameters would have high VC dimension, while learning machines with few parameters would have low VC dimension. There is a striking counterexample to this, due to E. Levin and J.S. Denker (Vapnik, 1995): a learning machine with just one parameter, but with infinite VC dimension (a family of classifiers is said to have infinite VC dimension if it can shatter l points, no matter how large l). Define the step function θ(x), x ∈ ℝ: θ(x) = 1 ∀x > 0; θ(x) = −1 ∀x ≤ 0. Consider the one-parameter family of functions defined by f(x, α) ≡ θ(sin(αx)), x, α ∈ ℝ. You choose some number l, and present me with the task of finding l points that can be shattered. I choose them to be x_i = 10^{−i}, i = 1, …, l.
VC dimension
To show that VC(ℋ) = d, we need to prove two things:

1 There exists a set C of size d that is shattered by ℋ

2 No set of size d+1 is shattered by ℋ

Problem: Interpret the bounds d1 ≤ d ≤ d2 on the VC dimension in terms of the above two points.
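A worked instance of this two-step recipe, for the threshold class ℋ = { 𝕀{x < a} : a ∈ ℝ } from the motivation slide: (1) the single point C = {0} is shattered, since a = 1 labels it 1 and a = −1 labels it 0, so VC(ℋ) ≥ 1; (2) no two points x1 < x2 can be shattered, since the labeling (0, 1) would require x1 ≥ a and x2 < a, i.e., a ≤ x1 < x2 < a, which is impossible, so VC(ℋ) ≤ 1. Hence VC(ℋ) = 1.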



Fundamental theorem

We have already shown that a class of infinite VC-dimension is not learnable. The converse statement is also true, leading to the fundamental theorem of statistical learning theory:

Theorem 6.7 (The Fundamental Theorem of Statistical Learning) Let H be a hypothesis class of functions from a domain X to {0, 1} and let the loss function be the 0-1 loss. Then, the following are equivalent:
1. H has the uniform convergence property.
2. Any ERM rule is a successful agnostic PAC learner for H.
3. H is agnostic PAC learnable.
4. H is PAC learnable.
5. Any ERM rule is a successful PAC learner for H.
6. H has a finite VC-dimension.

We already saw 1 → 2. 2 → 3, 3 → 4, and 2 → 5 are easy. The contrapositive of 4 → 6 was also mentioned earlier. For 4 → 6 and 5 → 6, see [Shalev-Shwartz and Ben-David, 2014] if interested. High-level comments on 6 → 1 to follow.

Note: We remarked that having infinite VC dimension implies that a hypothesis class is not PAC-learnable. The above theorem also establishes the converse.
Sample complexity

Not only does the VC-dimension characterize PAC learnability; it even determines the sample complexity.

Theorem 6.8 (The Fundamental Theorem of Statistical Learning, Quantitative Version) Let H be a hypothesis class of functions from a domain X to {0, 1} and let the loss function be the 0-1 loss. Assume that VCdim(H) = d < ∞. Then, there are absolute constants C1, C2 such that:

1. H has the uniform convergence property with sample complexity
   C1 (d + log(1/δ))/ε² ≤ m_H^UC(ε, δ) ≤ C2 (d + log(1/δ))/ε²

2. H is agnostic PAC learnable with sample complexity
   C1 (d + log(1/δ))/ε² ≤ m_H(ε, δ) ≤ C2 (d + log(1/δ))/ε²

3. H is PAC learnable with sample complexity
   C1 (d + log(1/δ))/ε ≤ m_H(ε, δ) ≤ C2 (d log(1/ε) + log(1/δ))/ε

(The book's m_H(ε, δ) is our Nℋ(ε, δ).) The proof of this theorem is given in Chapter 28; see [Shalev-Shwartz and Ben-David, 2014] if interested.

Remark 6.3: We stated the fundamental theorem for binary classification tasks. A similar result holds for some other learning problems such as regression with the absolute loss or the squared loss. However, the theorem does not hold for all learning tasks. In particular, learnability is sometimes possible even though the uniform convergence property does not hold (we will see an example in Chapter 13, Exercise 2). Furthermore, in some situations, the ERM rule fails but learnability is possible with other learning rules.

Lower bound: Let ℋ be any hypothesis class with VC(ℋ) = d. Then any PAC-learning algorithm must use at least Ω(d/ε) samples.
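To see the different ε-dependence of the realizable and agnostic rates, a small Python comparison of the upper-bound shapes above (constants dropped; d, ε, δ are hypothetical choices):

import math

d, eps, delta = 10, 0.01, 0.05
realizable = (d * math.log(1 / eps) + math.log(1 / delta)) / eps   # shape of item 3's upper bound
agnostic   = (d + math.log(1 / delta)) / eps**2                    # shape of item 2's upper bound
print(round(realizable), round(agnostic))   # roughly 4.9e3 vs 1.3e5 samples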


Proof of the fundamental theorem

Two main parts for 6 → 1:

Suppose VC(ℋ) = d. Although ℋ may be infinite, the number of possible labelings it induces on a finite subset C of the data domain, i.e., its "effective size," is only O(|C|^d) (instead of exponential in |C|, as |C| grows large)

The hypothesis class ℋ enjoys the uniform convergence property when its "effective size" is small

How to define "effective size"?
The VC-Dimension

Effective size: Growth function / shatter coefficient

Definition 6.9 (Growth Function) Let H be a hypothesis class. Then the growth function of H, denoted τ_H : ℕ → ℕ, is defined as

τ_H(m) = max_{C ⊂ X : |C| = m} |H_C|.

In words, τ_H(m) is the number of different functions from a set C of size m to {0, 1} that can be obtained by restricting H to C.

(note: m above is the same as our N)
(note: number of different functions = number of different labelings)
(note: ℋ_C = {(h(c) : c ∈ C) | h ∈ ℋ} is the restriction of ℋ to C)

Obviously, if VCdim(H) = d then for any m ≤ d we have τ_H(m) = 2^m. In such cases, H induces all possible functions from C to {0, 1}. The following beautiful lemma, proposed independently by Sauer, Shelah, and Perles, shows that when m becomes larger than the VC-dimension, the growth function increases polynomially rather than exponentially with m.

Lemma 6.10 (Sauer-Shelah-Perles) Let H be a hypothesis class with VCdim(H) ≤ d < ∞. Then, for all m, τ_H(m) ≤ ∑_{i=0}^{d} (m choose i). In particular, if m > d + 1 then τ_H(m) ≤ (em/d)^d.

Verbosely: the growth function is the maximum number of different labelings of a set C of size m (our N) that can be obtained by using the classifiers in our hypothesis class on the members of C.
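A brute-force Python sketch of the growth function for the 1D threshold class from earlier (a grid of thresholds stands in for a ∈ ℝ); for this class the count comes out to m + 1, far below 2^m:

def growth_function(points, hypotheses):
    # |H_C| for C = points: the number of distinct labelings induced on the set.
    return len({tuple(h(x) for x in points) for h in hypotheses})

thresholds = [lambda x, a=a: int(x < a) for a in [i / 1000 for i in range(-100, 1101)]]

for m in [1, 2, 5, 10]:
    points = [i / (m + 1) for i in range(1, m + 1)]    # m distinct points in (0, 1)
    print(m, growth_function(points, thresholds))       # prints m + 1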
Sauer's lemma: Bounding growth functions

Lemma 6.10 (Sauer-Shelah-Perles, restated) Let H be a hypothesis class with VCdim(H) ≤ d < ∞. Then, for all m, τ_H(m) ≤ ∑_{i=0}^{d} (m choose i). In particular, if m > d + 1 then τ_H(m) ≤ (em/d)^d.

Thus, finite VC dimension implies polynomial growth in m, while infinite VC dimension means exponential growth. Polynomial growth is crucial to deducing the uniform convergence property.

Proof idea (see the proof of Sauer's Lemma in [SSS]): It suffices to prove the following stronger claim: for any C = {c1, ..., cm},

∀ H, |H_C| ≤ |{B ⊆ C : H shatters B}|.    (6.3)

Equation (6.3) suffices because if VCdim(H) ≤ d then no set whose size is larger than d is shattered by H, and therefore

|{B ⊆ C : H shatters B}| ≤ ∑_{i=0}^{d} (m choose i).

When m > d + 1, the right-hand side is at most (em/d)^d (see Lemma A.5 in Appendix A of [SSS]). Equation (6.3) itself is proved by induction on m.
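A numeric illustration of the polynomial-versus-exponential gap in Sauer's lemma; d = 3 below is a hypothetical VC dimension, not a class from the slides:

import math

d = 3
for m in [5, 10, 20, 50]:
    sauer = sum(math.comb(m, i) for i in range(d + 1))   # Sauer's bound on tau_H(m)
    loose = (math.e * m / d) ** d                        # the (em/d)^d form, valid for m > d + 1
    print(f"m={m:3d}  2^m={2**m:16d}  Sauer={sauer:6d}  (em/d)^d={loose:10.1f}")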
Uniform convergence with "small effective size"

Theorem. Let ℋ be a class and τ_ℋ its growth function. Then for every distribution ℙ(X, Y) and every δ ∈ (0,1), with probability at least 1 − δ over the choice of S ∼ ℙ of size N, we have, for every h ∈ ℋ,

| Lℙ(h) − LS(h) | ≤ (4 + √(log τ_ℋ(2N))) / (δ √(2N))

Corollary: If VC(ℋ) is finite, then the uniform convergence property holds. In fact,

N_ℋ^UC(ε, δ) ≤ Õ( d / (δε)² )

suffices for the UC property to hold (the O-tilde is hiding logarithmic factors).

Proof: For N > d, using Sauer's lemma we have τ_ℋ(2N) ≤ (2eN/d)^d. Now invoke the above theorem with N ≥ N_ℋ^UC(ε, δ).