Introduction To Machine Learning (67577)
Lecture 2: PAC Learning
Shai Shalev-Shwartz
School of CS and Engineering, The Hebrew University of Jerusalem
Example
X = R², representing the color and shape of papayas
Y = {±1}, representing tasty or non-tasty
h(x) = 1 if x is within the inner rectangle
Batch Learning
The learner's input:
Training data: S = ((x_1, y_1), ..., (x_m, y_m)) ∈ (X × Y)^m
The error (risk) of a hypothesis h:
  L_{D,f}(h) ≝ P_{x∼D} [h(x) ≠ f(x)]
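The learner cannot compute L_{D,f}(h), since D and f are unknown, but for a concrete toy setup it can be approximated by sampling. A minimal sketch, assuming an illustrative D (uniform on [0,1]²) and a papaya-style rectangle rule for f; neither is specified in the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical "true" labeling rule: tasty (+1) iff inside an inner rectangle
    return 1 if (0.3 <= x[0] <= 0.7) and (0.4 <= x[1] <= 0.8) else -1

def h(x):
    # some candidate hypothesis: a slightly different rectangle
    return 1 if (0.25 <= x[0] <= 0.7) and (0.4 <= x[1] <= 0.9) else -1

# Monte Carlo approximation of L_{D,f}(h) = P_{x~D}[h(x) != f(x)]
X = rng.uniform(0.0, 1.0, size=(100_000, 2))        # x ~ D, here D = uniform on [0,1]^2
estimated_risk = np.mean([h(x) != f(x) for x in X])
print(f"estimated L_(D,f)(h) ~ {estimated_risk:.3f}")
```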
Data-generation Model
We must assume some relation between the training data and (D, f)
Simple data generation model:
Independently Identically Distributed (i.i.d.): each x_i is sampled independently according to D
Realizability: for every i ∈ [m], y_i = f(x_i)
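Under these two assumptions the training set is obtained by drawing each instance from D and labeling it with f. A short sketch, reusing the illustrative D and f from the previous snippet:

```python
def sample_training_set(m, rng=np.random.default_rng(1)):
    """Draw m i.i.d. instances from D and label each one by f (realizability)."""
    xs = rng.uniform(0.0, 1.0, size=(m, 2))   # each x_i sampled independently from D
    ys = np.array([f(x) for x in xs])         # y_i = f(x_i)
    return xs, ys

S_x, S_y = sample_training_set(50)
```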
No Free Lunch
Suppose that |X| = ∞
For any finite C ⊆ X, take D to be the uniform distribution over C
If the number of training examples is m ≤ |C|/2, the learner has no knowledge about at least half the elements in C
Formalizing the above, it can be shown that no learner can succeed on every (D, f): for any algorithm there exist a distribution D and a target f on which, with constant probability over the training set, the returned hypothesis has error bounded away from 0
Prior Knowledge
Give more knowledge to the learner: the target f comes from some hypothesis class, H ⊆ Y^X
The learner knows H
Is it possible to PAC learn now?
Of course, the answer depends on H, since the No Free Lunch theorem tells us that for H = Y^X the problem is not solvable ...
ERM_H(S)
Input: a training set S = ((x_1, y_1), ..., (x_m, y_m))
Define the empirical risk: L_S(h) = (1/m) |{i : h(x_i) ≠ y_i}|
Output: any h ∈ H that minimizes L_S(h)
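As an illustration, the empirical risk and the ERM rule can be written directly for a finite class, continuing the running example (S_x, S_y from the sampling sketch above). The grid-rectangle class below is an assumption made only for this example, not the lecture's H:

```python
import itertools
import numpy as np

def empirical_risk(h, S_x, S_y):
    """L_S(h) = (1/m) |{i : h(x_i) != y_i}|."""
    return np.mean(np.array([h(x) for x in S_x]) != S_y)

def erm(H, S_x, S_y):
    """Return some h in H minimizing the empirical risk over S."""
    return min(H, key=lambda h: empirical_risk(h, S_x, S_y))

# a small finite class: rectangle classifiers with corners on a coarse grid
grid = np.linspace(0.0, 1.0, 6)
H = [
    (lambda x, a=a, b=b, c=c, d=d: 1 if (a <= x[0] <= b and c <= x[1] <= d) else -1)
    for a, b in itertools.combinations(grid, 2)
    for c, d in itertools.combinations(grid, 2)
]

h_S = erm(H, S_x, S_y)
print("empirical risk of the ERM output:", empirical_risk(h_S, S_x, S_y))
```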
Proof
Let S|x = (x_1, ..., x_m) be the instances of the training set
We would like to prove:
  D^m({S|x : L_{(D,f)}(ERM_H(S)) > ε}) ≤ |H| e^{-εm}
Let H_B be the set of bad hypotheses, H_B = {h ∈ H : L_{(D,f)}(h) > ε}
Let M be the set of misleading samples, M = {S|x : ∃h ∈ H_B, L_S(h) = 0}
Observe (by realizability L_S(ERM_H(S)) = 0, so a bad ERM output can only be returned on a misleading sample):
  {S|x : L_{(D,f)}(ERM_H(S)) > ε} ⊆ M = ∪_{h ∈ H_B} {S|x : L_S(h) = 0}
Proof (Cont.)
Lemma (Union bound): For any two sets A, B and a distribution D we have D(A ∪ B) ≤ D(A) + D(B).
We have shown:
  {S|x : L_{(D,f)}(ERM_H(S)) > ε} ⊆ ∪_{h ∈ H_B} {S|x : L_S(h) = 0}
Therefore, using the union bound:
  D^m({S|x : L_{(D,f)}(ERM_H(S)) > ε}) ≤ Σ_{h ∈ H_B} D^m({S|x : L_S(h) = 0})
Proof (Cont.)
Observe (using the i.i.d. and realizability assumptions):
  D^m({S|x : L_S(h) = 0}) = (1 − L_{D,f}(h))^m
If h ∈ H_B then L_{D,f}(h) > ε and therefore
  D^m({S|x : L_S(h) = 0}) < (1 − ε)^m
We have shown:
  D^m({S|x : L_{(D,f)}(ERM_H(S)) > ε}) < |H_B| (1 − ε)^m
Finally, using 1 − ε ≤ e^{−ε} and |H_B| ≤ |H| we conclude:
  D^m({S|x : L_{(D,f)}(ERM_H(S)) > ε}) < |H| e^{−εm}
The right-hand side is at most δ whenever m ≥ log(|H|/δ)/ε.
PAC learning
Definition (PAC learnability)
A hypothesis class H is PAC learnable if there exists a function m_H : (0, 1)² → N and a learning algorithm with the following property: for every ε, δ ∈ (0, 1), for every distribution D over X, and for every labeling function f : X → {0, 1}, when running the learning algorithm on m ≥ m_H(ε, δ) i.i.d. examples generated by D and labeled by f, the algorithm returns a hypothesis h such that, with probability of at least 1 − δ (over the choice of the examples), L_{(D,f)}(h) ≤ ε.
Leslie Valiant, Turing Award 2010
"For transformative contributions to the theory of computation, including the theory of probably approximately correct (PAC) learning, the complexity of enumeration and of algebraic computation, and the theory of parallel and distributed computing."
Corollary
Let H be a finite hypothesis class.
H is PAC learnable with sample complexity m_H(ε, δ) ≤ ⌈log(|H|/δ)/ε⌉
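For a sense of scale, the corollary's bound is easy to evaluate; the values of |H|, ε, and δ below are arbitrary illustrations:

```python
import math

def finite_class_sample_complexity(H_size, eps, delta):
    """Upper bound ceil(log(|H|/delta)/eps) on m_H(eps, delta) for a finite class."""
    return math.ceil(math.log(H_size / delta) / eps)

# e.g. |H| = 225, eps = 0.1, delta = 0.05  ->  85 examples suffice
print(finite_class_sample_complexity(225, 0.1, 0.05))
```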
[Photos: Alexey Chervonenkis and Vladimir Vapnik]
Example: http://www.youtube.com/watch?v=p_MzP2MZaOo
Pay attention to the retrospective explanations at 5:00
The VC Dimension
A set C ⊆ X is shattered by H if every labeling of the points of C is realized by some h ∈ H
VCdim(H) is the maximal size of a set C ⊆ X that is shattered by H (and ∞ if arbitrarily large sets are shattered)
VC dimension Examples
To show that VCdim(H) = d we need to show that:
1. There exists a set C of size d that is shattered by H
2. Every set C of size d + 1 is not shattered by H
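Both steps can be checked by brute force for small point sets: enumerate the 2^|C| labelings of C and ask whether each is realized by some hypothesis. A minimal sketch; passing a finite (sampled) collection of hypotheses is an assumption made only to keep the check computable:

```python
def is_shattered(C, hypotheses):
    """Return True iff every labeling of the points in C is realized by some h.

    `hypotheses` is a finite stand-in for H, so True is a proof of shattering
    (the witnesses belong to H), while False may only mean no witness was sampled.
    """
    realized = {tuple(h(x) for x in C) for h in hypotheses}
    return len(realized) == 2 ** len(C)
```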
VC dimension Examples
Threshold functions: X = R, H = {x ↦ sign(x − θ) : θ ∈ R}
Show that {0} is shattered
Show that any two points cannot be shattered
VC dimension Examples
Intervals: X = R, H = {h_{a,b} : a < b ∈ R}, where h_{a,b}(x) = 1 iff x ∈ [a, b]
Show that {0, 1} is shattered
Show that any three points cannot be shattered
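Applying the brute-force check to intervals (the finite grid of candidate endpoints is an assumption standing in for all of H):

```python
import numpy as np

def interval(a, b):
    return lambda x: 1 if a <= x <= b else 0

grid = np.linspace(-1.0, 2.0, 31)                       # candidate endpoints
H_intervals = [interval(a, b) for a in grid for b in grid if a < b]

print(is_shattered([0.0, 1.0], H_intervals))            # True: {0, 1} is shattered
print(is_shattered([0.0, 0.5, 1.0], H_intervals))       # False: labeling (1, 0, 1) is impossible
```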
VC dimension Examples
Axis-aligned rectangles: X = R², H = {h_{(a1,a2,b1,b2)} : a1 < a2 and b1 < b2}, where h_{(a1,a2,b1,b2)}(x1, x2) = 1 iff x1 ∈ [a1, a2] and x2 ∈ [b1, b2]
Show:
[Figure: a set of four points that is shattered, and a set of five points c1, ..., c5 that is not shattered]
VC dimension Examples
Finite classes:
Show that the VC dimension of a finite H is at most log₂(|H|)
Show that there can be an arbitrary gap between VCdim(H) and log₂(|H|)
VC dimension Examples
Halfspaces: X = R^d, H = {x ↦ sign(⟨w, x⟩) : w ∈ R^d}
Show that {e_1, ..., e_d} is shattered
Show that any d + 1 points cannot be shattered
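For the first claim there is a direct witness: for any labeling (y_1, ..., y_d) of e_1, ..., e_d, the vector w = (y_1, ..., y_d) gives sign(⟨w, e_i⟩) = y_i. A quick numeric confirmation (d = 4 is an arbitrary choice):

```python
import itertools
import numpy as np

d = 4
E = np.eye(d)                                      # the points e_1, ..., e_d
for ys in itertools.product([-1, 1], repeat=d):    # every possible labeling
    w = np.array(ys, dtype=float)                  # witness halfspace: w = (y_1, ..., y_d)
    assert all(np.sign(E[i] @ w) == ys[i] for i in range(d))
print(f"all {2**d} labelings realized, so {{e_1, ..., e_d}} is shattered")
```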
The Fundamental Theorem of Learning (sample complexity, realizable case):
  C₁ · (d + log(1/δ)) / ε ≤ m_H(ε, δ) ≤ C₂ · (d log(1/ε) + log(1/δ)) / ε
where d = VCdim(H) and C₁, C₂ are absolute constants.
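As with the finite-class bound, the upper bound can be evaluated for concrete numbers; C₂ is an unspecified absolute constant, so taking C₂ = 1 below is purely illustrative:

```python
import math

def vc_upper_bound(d, eps, delta, C2=1.0):
    """Illustrative evaluation of C2 * (d*log(1/eps) + log(1/delta)) / eps."""
    return math.ceil(C2 * (d * math.log(1.0 / eps) + math.log(1.0 / delta)) / eps)

# e.g. d = 4 (axis-aligned rectangles), eps = 0.1, delta = 0.05
print(vc_upper_bound(4, 0.1, 0.05))
```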
Growth-function bound: if VCdim(H) = d, then the number of distinct behaviors of H on m ≥ d points is at most (em/d)^d
The Perceptron
Given a linearly separable training set (x_1, y_1), ..., (x_m, y_m), the goal is to find w such that for every i,
  y_i ⟨w, x_i⟩ > 0 .
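The analysis below refers to the standard Perceptron rule: start from w = 0 and, whenever an example is misclassified (y_i ⟨w, x_i⟩ ≤ 0), update w ← w + y_i x_i. A minimal sketch; the cycling order over the data and the stopping condition are implementation choices, not part of the slides:

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Cycle over the data and update on every mistake; stop after a mistake-free pass."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:     # mistake: y_i <w, x_i> is not positive
                w = w + y_i * x_i        # Perceptron update
                mistakes += 1
        if mistakes == 0:                # every example now satisfies y_i <w, x_i> > 0
            return w
    return w
```

By the theorem that follows, if some w* with y_i ⟨w*, x_i⟩ ≥ 1 for all i exists, the total number of updates is at most ‖w*‖² max_i ‖x_i‖², so the loop above terminates on separable data.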
Analysis
Theorem (Agmon '54, Novikoff '62)
Let (x_1, y_1), ..., (x_m, y_m) be a sequence of examples such that there exists w* ∈ R^d such that for every i, y_i ⟨w*, x_i⟩ ≥ 1. Then, the Perceptron will make at most
  ‖w*‖² · max_i ‖x_i‖²
updates.
Proof
Let w^(t) be the value of w at iteration t
Let (x_t, y_t) be the example used to update w at iteration t
Denote R = max_i ‖x_i‖
The cosine of the angle between w* and w^(t) is ⟨w^(t), w*⟩ / (‖w^(t)‖ ‖w*‖), which is at most 1
We will show: ⟨w^(t+1), w*⟩ ≥ t and ‖w^(t+1)‖ ≤ R√t
Combining:
  1 ≥ ⟨w^(t+1), w*⟩ / (‖w^(t+1)‖ ‖w*‖) ≥ t / (R√t ‖w*‖) = √t / (R ‖w*‖)
and therefore t ≤ R² ‖w*‖², which proves the bound on the number of updates.
Proof (Cont.)
Showing ⟨w^(t+1), w*⟩ ≥ t:
  Initially, ⟨w^(1), w*⟩ = 0 (since w^(1) = 0)
  Whenever we update, ⟨w^(t), w*⟩ increases by at least 1:
    ⟨w^(t+1), w*⟩ = ⟨w^(t) + y_t x_t, w*⟩ = ⟨w^(t), w*⟩ + y_t ⟨x_t, w*⟩ ≥ ⟨w^(t), w*⟩ + 1
  (using the assumption y_t ⟨w*, x_t⟩ ≥ 1)
Showing ‖w^(t+1)‖² ≤ R² t:
  Initially, ‖w^(1)‖² = 0
  Whenever we update, ‖w^(t)‖² increases by at most R²:
    ‖w^(t+1)‖² = ‖w^(t) + y_t x_t‖² = ‖w^(t)‖² + 2 y_t ⟨w^(t), x_t⟩ + y_t² ‖x_t‖² ≤ ‖w^(t)‖² + R²
  (since we update only when y_t ⟨w^(t), x_t⟩ ≤ 0, and ‖x_t‖ ≤ R)
Summary
The PAC learning model
What is PAC learnable?
PAC learning of finite classes using ERM
The VC dimension and the fundamental theorem of learning
Classes of finite VC dimension