Lecture 2.4
Note: These slides are for your personal educational use only. Please do not distribute them.
Slides Acknowledgment:
Tommi Jaakkola, Devavrat Shah, David Sontag, Suvrit Sra (MIT EECS Course 6.867)
Anuran Makur CS 57800 (Spring 2022)
CS 57800
II  Supervised Learning
★ Learnability
★ VC dimension
★ Discussion
? Fundamental question:
(In practice, carefully balance between over- and underfitting. Also, recall we had talked briefly about regularization as an approach to regulate the “bias-complexity” tradeoff.)
Theorem. Let ℋ be finite. Let δ ∈ (0,1), ε > 0, and let N ≥ log(|ℋ|/δ)/ε.
Then, for any distribution ℙ for which realizability holds, with probability at least 1 − δ over the choice of dataset S of size N, every ERM hypothesis hS satisfies

    Lℙ(hS) ≤ ε

Observe: Confidence parameter δ and accuracy parameter ε.
Proof: §2.3.1 of [SSS]
Let’s look at the key ideas…
– Lℙ(hS) depends on S; there is a chance that S is not representative of the whole data. To model this chance, we introduce a confidence parameter δ in the bound above.
– Even when S faithfully represents ℙ(X, Y), it may miss some details (it is just a finite sample) and cause our classifier to make minor errors; thus the accuracy ε.
➡ Finiteness of the hypothesis class allows extending the above bound via the union bound.

Note: Without realizability, a weaker result can be proved. Exercises 2 covers the details; the proof is analogous to how we bounded parameter error for NB (Hoeffding + union bound).
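As a quick numeric sanity check of the theorem above, here is a minimal sketch (not from the slides; the values of |ℋ|, ε, δ are made up for illustration):

# Sketch: evaluate the sample-size bound N >= log(|H|/delta)/eps from the
# finite-class realizable theorem above, for some illustrative values.
import math

def required_samples(H_size, eps, delta):
    """Smallest integer N with N >= log(|H|/delta) / eps."""
    return math.ceil(math.log(H_size / delta) / eps)

# e.g. |H| = 2**20 hypotheses, accuracy eps = 0.05, confidence delta = 0.05
print(required_samples(2**20, eps=0.05, delta=0.05))  # 338 samples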
PAC-learnability: Questions
Question: True or False? Any finite hypothesis class (assuming realizability) is PAC-learnable with sample complexity

    Nℋ(ε, δ) ≤ O( log(|ℋ|/δ) / ε )
Question: Without realizability (the agnostic setting), we instead ask for a hypothesis whose risk is close to the best in the class:

    Lℙ(hS) ≤ inf_{h′ ∈ ℋ} Lℙ(h′) + ε
(Note: The hypothesis class ℋ may be “bad,” but we are just trying to be
approximately as good as the best hypothesis within this class w.h.p.)
Uniform convergence: ∀h ∈ ℋ, |LS(h) − Lℙ(h)| ≤ ε

If uniform convergence holds with parameter ε/2, then for an ERM hypothesis hS and any h ∈ ℋ:

    Lℙ(hS) ≤ LS(hS) + ε/2 ≤ LS(h) + ε/2 ≤ Lℙ(h) + ε/2 + ε/2 = Lℙ(h) + ε
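To see uniform convergence “in action,” here is a minimal simulation sketch (not from the slides); the hypothesis class, data distribution, and noise level are all assumptions chosen for illustration.

# Sketch: for a small finite class of 1D thresholds h_t(x) = 1{x >= t},
# estimate the uniform deviation max_h |L_S(h) - L_P(h)| as N grows.
import random

random.seed(0)
thresholds = [i / 20 for i in range(21)]   # finite hypothesis class (assumed)
true_t, noise = 0.3, 0.1                   # assumed data distribution

def true_risk(t):
    # X ~ Uniform[0,1], y = 1{x >= true_t} flipped with prob. `noise`
    disagree = abs(t - true_t)             # mass where h_t differs from 1{x >= true_t}
    return disagree * (1 - noise) + (1 - disagree) * noise

def empirical_risk(t, S):
    return sum(int(int(x >= t) != y) for x, y in S) / len(S)

def sample(N):
    S = []
    for _ in range(N):
        x = random.random()
        y = int(x >= true_t)
        if random.random() < noise:
            y = 1 - y
        S.append((x, y))
    return S

for N in [50, 500, 5000]:
    S = sample(N)
    gap = max(abs(empirical_risk(t, S) - true_risk(t)) for t in thresholds)
    print(N, round(gap, 3))   # the uniform deviation shrinks as N grows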
VC Dimension
(We are in the PAC-learnability setup, so we assume realizability)
[Figure: example +/− labelings of 2 points and of 3 points in 1D]

O(log(1/δ)/ε) samples
Observations:
Suppose hyp. class ℋ captures all 2^|C| possible labelings of points in a finite set C with {0,1}
VC dimension: Shattering
Defn. A finite set of points C is shattered by a hypothesis space ℋ if for all possible 2^|C| labelings of the points in C into {0,1}, there is a consistent hypothesis in ℋ (i.e., with 0 error).
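The definition can be checked by brute force for small C. The sketch below is not from the slides; the “signed threshold” class on ℝ (thresholds in both orientations) is an assumed example class that matches the 1D pictures: it shatters any 2 distinct points but no 3 points.

# Sketch: brute-force shattering check. A finite set C is shattered iff the
# hypotheses realize all 2^|C| labelings of C.
def shatters(hypotheses, C):
    achieved = {tuple(h(x) for x in C) for h in hypotheses}
    return len(achieved) == 2 ** len(C)

def signed_thresholds(C):
    """Thresholds in both orientations, one per 'gap' of C (assumed example class)."""
    pts = sorted(C)
    ts = [pts[0] - 1] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1]
    return ([lambda x, t=t: int(x >= t) for t in ts]
            + [lambda x, t=t: int(x < t) for t in ts])

C2, C3 = [0.2, 0.7], [0.2, 0.5, 0.7]
print(shatters(signed_thresholds(C2), C2))  # True: 2 points in 1D are shattered
print(shatters(signed_thresholds(C3), C3))  # False: e.g. labeling (1, 0, 1) unachievable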
3 points in 2D: a linear classifier shatters these.
4 points in 2D: a linear classifier cannot shatter!

The VC dimension thus gives concreteness to the notion of the capacity of a given set of functions. Intuitively, one might be led to expect that learning machines with many parameters would have high VC dimension, while learning machines with few parameters would have low VC dimension. There is a striking counterexample to this, due to E. Levin and J.S. Denker (Vapnik, 1995): A learning machine with just one parameter, but with infinite VC dimension (a family of classifiers is said to have infinite VC dimension if it can shatter l points, no matter how large l). Define the step function θ(x), x ∈ ℝ: {θ(x) = 1 ∀x > 0; θ(x) = −1 ∀x ≤ 0}. Consider the one-parameter family of functions, defined by

    f(x, α) ≡ θ(sin(αx)),  x, α ∈ ℝ.    (4)

You choose some number l, and present me with the task of finding l points that can be shattered. I choose them to be:

    xi = 10^−i,  i = 1, ⋯, l.    (5)
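The quoted construction can be verified numerically. In the sketch below (not from the slides), the particular choice of α used to hit a target labeling, α = π(1 + Σᵢ (1 − yᵢ)10ⁱ/2), is supplied as an assumption, since the excerpt above only gives the points xᵢ.

# Sketch: check that f(x, alpha) = theta(sin(alpha * x)) realizes every
# labeling of x_i = 10^{-i}, i = 1..l, for a suitable alpha per labeling.
import math
from itertools import product

def f(x, alpha):
    return 1 if math.sin(alpha * x) > 0 else -1

l = 4
xs = [10.0 ** (-i) for i in range(1, l + 1)]
for ys in product([-1, 1], repeat=l):
    # assumed choice of alpha for the target labeling ys
    alpha = math.pi * (1 + sum((1 - y) * 10 ** i for i, y in enumerate(ys, start=1)) / 2)
    assert all(f(x, alpha) == y for x, y in zip(xs, ys))
print(f"all {2 ** l} labelings of {l} points realized")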
VC dimension
To show that VC(ℋ) = d, we need to prove two things:
1. There exists a set C of size d that is shattered by ℋ.
2. No set of size d + 1 is shattered by ℋ.
Sample complexity

The proof of the theorem is given in the next section. Not only does the VC-dimension characterize PAC learnability; it even determines the sample complexity.

Theorem 6.8 (The Fundamental Theorem of Statistical Learning – Quantitative Version) Let H be a hypothesis class of functions from a domain X to {0, 1} and let the loss function be the 0-1 loss. Assume that VCdim(H) = d < ∞. Then, there are absolute constants C1, C2 such that:

1. H has the uniform convergence property with sample complexity
       C1 (d + log(1/δ))/ε² ≤ m_H^UC(ε, δ) ≤ C2 (d + log(1/δ))/ε²

2. H is agnostic PAC learnable with sample complexity
       C1 (d + log(1/δ))/ε² ≤ m_H(ε, δ) ≤ C2 (d + log(1/δ))/ε²

3. H is PAC learnable with sample complexity
       C1 (d + log(1/δ))/ε ≤ m_H(ε, δ) ≤ C2 (d log(1/ε) + log(1/δ))/ε

(m_H(ε, δ) is the sample complexity we have been writing as Nℋ(ε, δ).)

The proof of this theorem is given in Chapter 28.

Remark 6.3 We stated the fundamental theorem for binary classification tasks. A similar result holds for some other learning problems such as regression with the absolute loss or the squared loss. However, the theorem does not hold for …

See [Shalev-Shwartz and Ben-David, 2014] if interested.
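For a feel of the rates in the fundamental theorem, here is a tiny sketch (not from the slides; d, ε, δ are assumed values and the absolute constants C1, C2 are dropped):

# Sketch: order-of-magnitude comparison of the realizable-PAC rate
# (d + log(1/delta)) / eps and the agnostic/UC rate (d + log(1/delta)) / eps^2.
import math

d, eps, delta = 20, 0.05, 0.05          # assumed values
base = d + math.log(1 / delta)
print("realizable rate ~", round(base / eps))       # ~ 460
print("agnostic rate   ~", round(base / eps ** 2))  # ~ 9198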
Lower bound: Let ℋ be any hypothesis class with VC(ℋ) = d.

Growth function: τℋ(m) = max_{C ⊂ X : |C| = m} |ℋ_C|

Proof idea: For any C, let B be a subset of C such that ℋ shatters B. Then

    |ℋ_C| ≤ |{B ⊆ C : ℋ shatters B}| ≤ Σ_{i=0}^{d} (m choose i)

From the proof of Sauer’s lemma in [SSS]:

“The reason why Equation (6.3) is sufficient to prove the lemma is that if VCdim(H) ≤ d then no set whose size is larger than d is shattered by H and therefore

    |{B ⊆ C : H shatters B}| ≤ Σ_{i=0}^{d} (m choose i).

When m > d + 1 the right-hand side of the preceding is at most (em/d)^d (see Lemma A.5 in Appendix A). We are left with proving Equation (6.3) and we do it using an inductive argument. For m = 1, no matter what H is, either both sides of Equation (6.3) equal …”
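As a quick numeric check of the quoted bound, here is a sketch with assumed values of m and d (not from the slides):

# Sketch: compare Sauer's bound sum_{i<=d} C(m, i) with the polynomial bound
# (e*m/d)^d quoted above (Lemma A.5), and with the unrestricted count 2^m.
import math

d = 5
for m in [10, 50, 200]:
    sauer = sum(math.comb(m, i) for i in range(d + 1))
    poly = (math.e * m / d) ** d
    print(m, sauer, round(poly), 2 ** m)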
Uniform convergence with “small effective size”
Theorem. Let ℋ be a class and τℋ its growth function. Then for every distribution ℙ(X, Y) and every δ ∈ (0, 1), with probability at least 1 − δ over the choice of S ∼ ℙ, we have, for every h ∈ ℋ,

    |Lℙ(h) − LS(h)| ≤ (4 + √(log τℋ(2N))) / (δ √(2N))
    Nℋ^UC(ε, δ) ≤ Õ( d / (δε)² )

suffices for the UC property to hold (the O-tilde is hiding logarithmic factors).

Proof: For N > d, using Sauer’s lemma we have τℋ(2N) ≤ (2eN/d)^d. Now invoke the above theorem with N ≥ Nℋ^UC(ε, δ).
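Plugging Sauer’s lemma into the theorem gives a concrete number; the sketch below (assumed d, δ, N values, not from the slides) evaluates the resulting deviation bound.

# Sketch: evaluate |L_P(h) - L_S(h)| <= (4 + sqrt(log tau_H(2N))) / (delta * sqrt(2N))
# with the Sauer-based bound tau_H(2N) <= (2eN/d)^d for a class of VC dimension d.
import math

def uc_gap_bound(N, d, delta):
    log_tau = d * math.log(2 * math.e * N / d)   # log of the bound on tau_H(2N)
    return (4 + math.sqrt(log_tau)) / (delta * math.sqrt(2 * N))

d, delta = 10, 0.05                                # assumed values
for N in [10_000, 100_000, 1_000_000]:
    print(N, round(uc_gap_bound(N, d, delta), 3))  # gap shrinks roughly like 1/sqrt(N)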