
Solutions to

Understanding Machine Learning


by Shai Shalev-Shwartz and Shai Ben-David

Chapter 3 (A Formal Learning Model)

Zahra Taheri¹
Mar 2020

Exercise 3.2
1. As mentioned in the exercise, the realizability assumption here implies that the true hypothesis
f labels all examples in the domain X negatively, except possibly one. Let A be the algorithm that
returns a hypothesis h_S with the following property:

h_S = h_x  if ∃ x ∈ S s.t. f(x) = 1;   h_S = h⁻  otherwise,

where h_x is the hypothesis that labels only x positively and h⁻ is the all-negative hypothesis.

It is easy to see that L_S(h_S) = 0, and so A is an ERM.
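A minimal sketch of this algorithm in Python (the sample encoding as (x, y) pairs and the name erm_singleton are illustrative assumptions, not the book's notation):

```python
def erm_singleton(S):
    """ERM for H_singleton: S is a list of (x, y) pairs with y in {0, 1}.
    Returns h_x if some example is labeled positive, otherwise h^-."""
    positives = [x for x, y in S if y == 1]
    if positives:
        z = positives[0]                   # unique under realizability
        return lambda x: int(x == z)       # h_z: labels only z positively
    return lambda x: 0                     # h^-: the all-negative hypothesis

# Example: the target labels only the point 7 positively.
S = [(3, 0), (7, 1), (12, 0)]
h = erm_singleton(S)
assert [h(3), h(7), h(12)] == [0, 1, 0]    # L_S(h) = 0
```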


2. Let D be a probability distribution over X and ε ∈ (0, 1). If f = h⁻, then A returns the true
hypothesis. So suppose there exists an x ∈ X such that f(x) = 1; such an element is unique, by the
realizability assumption. Let S|_X = (x_1, …, x_m) be the instances of the training set. We would like
to upper bound D^m({S|_X : L_(D,f)(h_S) > ε}). If x ∈ S|_X, then A returns the true hypothesis and so
L_(D,f)(h_S) = 0; we are therefore interested in the case x ∉ S|_X. Also, if D(x) ≤ ε, then L_(D,f)(h_S) ≤ ε
in any case, since h_S is either f itself or h⁻, and L_(D,f)(h⁻) = D(x). So we may assume D(x) > ε.
Then D(x') ≤ 1 − D(x) < 1 − ε for every x' ∈ X \ {x}. Hence we have

{S|_X : L_(D,f)(h_S) > ε} = {S|_X : x ∉ S|_X},

and each of the m i.i.d. draws misses x with probability 1 − D(x) ≤ 1 − ε. Therefore we have:

D^m({S|_X : L_(D,f)(h_S) > ε}) = D^m({S|_X : x ∉ S|_X}) = (1 − D(x))^m ≤ (1 − ε)^m ≤ e^{−εm}.

¹ https://github.com/zahta/Exercises-Understanding-Machine-Learning

Let δ ∈ (0, 1) be such that e^{−εm} ≤ δ. Then m ≥ log(1/δ)/ε. Therefore H_singleton is PAC learnable with
m_{H_singleton}(ε, δ) ≤ ⌈log(1/δ)/ε⌉.
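As a quick arithmetic check of this bound (the specific ε, δ values below are arbitrary):

```python
import math

def m_singleton(eps, delta):
    # sample size sufficient for (eps, delta)-PAC learning H_singleton
    return math.ceil(math.log(1 / delta) / eps)

print(m_singleton(0.1, 0.05))    # 30 examples suffice
print(m_singleton(0.01, 0.01))   # 461
```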

Exercise 3.3
Similar to Exercise 2.3, let A be the algorithm that returns the smallest circle centered at the origin
that encloses all positive examples in the training set S. Let C(S) be the circle returned by A, with
radius r(S), and let A(S) : X → Y be the corresponding hypothesis. As in Exercise 2.3, it is easy to see
that L_S(A(S)) = 0, and so A is an ERM.
Let D be a probability distribution over X, ε ∈ (0, 1), and let f be the target hypothesis in H. By the
realizability assumption, there exists a circle C* with radius r* whose corresponding hypothesis h* has
zero generalization error. By the definitions of C(S) and C* we have C(S) ⊆ C*. Also
we have:

L_(D,f)(A(S)) = D({x ∈ X : A(S)(x) ≠ f(x)}) = D({x ∈ X : x ∉ C(S) and f(x) = 1}) = D(C* \ C(S)).

Let r_1 ≤ r* be a number such that the probability mass (with respect to D) of the strip C_1 = {x ∈
R² : r_1 ≤ ‖x‖ ≤ r*} is ε. If S contains a (positive) example in C_1, then C* \ C(S) ⊆ C_1, and by the
discussion above L_(D,f)(A(S)) = D(C* \ C(S)) ≤ D(C_1) = ε. Now we would like to upper bound
D^m({S|_X : L_(D,f)(A(S)) > ε}). By the discussion above,

{S|_X : L_(D,f)(A(S)) > ε} ⊆ {S|_X : S|_X ∩ C_1 = ∅}.

Therefore we have:

D^m({S|_X : L_(D,f)(A(S)) > ε}) ≤ D^m({S|_X : S|_X ∩ C_1 = ∅}) = (1 − ε)^m ≤ e^{−εm}.

Let δ ∈ (0, 1) be such that e^{−εm} ≤ δ. Then m ≥ log(1/δ)/ε. Therefore H is PAC learnable with
m_H(ε, δ) ≤ ⌈log(1/δ)/ε⌉.
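A sketch of this ERM in Python, assuming instances are 2-D points given as tuples (the function name and sample encoding are illustrative):

```python
import math

def erm_circle(S):
    """Return the hypothesis of the smallest origin-centered circle
    enclosing all positive examples; S is a list of ((x1, x2), y)."""
    radii = [math.hypot(*x) for x, y in S if y == 1]
    r = max(radii, default=0.0)                # r(S) = 0 if no positives
    return lambda x: int(math.hypot(*x) <= r)

S = [((0.5, 0.0), 1), ((0.0, 2.0), 1), ((3.0, 3.0), 0)]
h = erm_circle(S)                              # here r(S) = 2.0
assert all(h(x) == y for x, y in S)            # L_S(h) = 0
```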

Exercise 3.4
1. Let H be the hypothesis class of all conjunctions over d Boolean variables. If we show that H is finite,
then by Corollary 3.2, H is PAC learnable. Let h ∈ H be a hypothesis other than the all-negative one, and
let x = (x_1, …, x_d) ∈ X. Then h(x) = ⋀_{i=1}^d a_i, where a_i ∈ {x_i, x̄_i, none} for all i ∈ [d], and by
'none' we mean that neither the literal x_i nor x̄_i appears in h. Therefore |H| = 3^d + 1, and so by
Corollary 3.2, H is PAC learnable with sample complexity

m_H(ε, δ) ≤ ⌈log(|H|/δ)/ε⌉ = ⌈log((3^d + 1)/δ)/ε⌉.
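The count |H| = 3^d + 1 can be checked by brute force for small d. Below is a quick sketch; the encoding of a conjunction as a choice vector over {literal, negated literal, absent} is an illustrative assumption:

```python
from itertools import product

def conjunction(choices):
    """choices[j] in {1, -1, 0}: include literal x_j, literal x̄_j,
    or neither ('none') for variable j."""
    def h(x):
        return int(all(c == 0 or (c == 1 and xj == 1) or (c == -1 and xj == 0)
                       for c, xj in zip(choices, x)))
    return h

d = 3
X = list(product((0, 1), repeat=d))
# truth tables of the 3^d consistent conjunctions, all distinct ...
tables = {tuple(conjunction(c)(x) for x in X)
          for c in product((1, -1, 0), repeat=d)}
tables.add(tuple(0 for _ in X))   # ... plus the all-negative hypothesis
print(len(tables), 3 ** d + 1)    # 28 28
```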

2. Suppose that S is a training set of size m whose positively labeled instances are x'_1, …, x'_l.
By induction on i ≤ l, we define conjunctions h_i. Let h_0 be the all-negative hypothesis defined by
h_0(x) := ⋀_{j=1}^d (x_j ∧ x̄_j), the conjunction containing all 2d literals. For i + 1 ≤ l, write
x'_{i+1} = (x^{i+1}_1, …, x^{i+1}_d). We obtain h_{i+1} from h_i as follows:

(1) For all j ∈ [d], if x^{i+1}_j = 1 and x̄_j is a literal of h_i, then delete x̄_j.

(2) For all j ∈ [d], if x^{i+1}_j = 0 and x_j is a literal of h_i, then delete x_j.

The algorithm returns h_l (a Python sketch is given after the note below). It is easy to see that h_l
labels x'_1, …, x'_l as positive. Moreover, h_l is the most restrictive conjunction that labels all the
positively labeled members of S positively; since by the realizability assumption the target is a
conjunction consistent with S, h_l is at least as restrictive as the target and therefore also labels the
negative examples of S correctly. Hence L_S(h_l) = 0, and so the algorithm implements the ERM rule.
(Note: The solution of this exercise is explained in Section 8.2.3 of the book)
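A minimal sketch of this procedure in Python, assuming examples are 0/1 tuples; representing a conjunction as a set of signed literals is an implementation choice for illustration, not the book's notation:

```python
def learn_conjunction(S, d):
    """ERM for conjunctions over d Boolean variables.
    S is a list of (x, y) with x a 0/1 tuple of length d.
    A literal is (j, 1) for x_j or (j, 0) for its negation;
    h_0 starts with all 2d literals (the all-negative conjunction)."""
    literals = {(j, b) for j in range(d) for b in (0, 1)}
    for x, y in S:
        if y == 1:
            # delete every literal this positive example contradicts
            literals -= {(j, 1 - x[j]) for j in range(d)}
    return lambda x: int(all(x[j] == b for j, b in literals))

# Toy target: x_1 ∧ x̄_3 (0-indexed: variable 0 positive, variable 2 negated).
S = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 1, 1), 0)]
h = learn_conjunction(S, 3)
assert all(h(x) == y for x, y in S)   # L_S(h) = 0
```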

Exercise 3.5
Here, as in the exercise, each training example x_i is drawn independently from its own distribution D_i,
and L_(D_m,f)(h) := (P_{x∼D_1}[h(x) ≠ f(x)] + ⋯ + P_{x∼D_m}[h(x) ≠ f(x)])/m is the corresponding
(average) generalization error. Let H_B = {h ∈ H : L_(D_m,f)(h) > ε} and M = {S|_X : ∃h ∈ H_B s.t.
L_(S,f)(h) = 0}. Then we have

P[∃h ∈ H s.t. L_(D_m,f)(h) > ε and L_(S,f)(h) = 0] = P[M] = P[⋃_{h ∈ H_B} {S|_X : L_(S,f)(h) = 0}].

So by the union bound we have:

P[∃h ∈ H s.t. L_(D_m,f)(h) > ε and L_(S,f)(h) = 0] ≤ |H| · max_{h ∈ H_B} P[{S|_X : L_(S,f)(h) = 0}]   (1)

On the other hand, if there exists h ∈ H such that L_(D_m,f)(h) > ε, then by the definition of the
generalization error we have:

(P_{x∼D_1}[h(x) ≠ f(x)] + ⋯ + P_{x∼D_m}[h(x) ≠ f(x)]) / m > ε.

Therefore

(P_{x∼D_1}[h(x) = f(x)] + ⋯ + P_{x∼D_m}[h(x) = f(x)]) / m ≤ 1 − ε   (2)
Also, since the examples are drawn independently, we have:

P[{S|_X : L_(S,f)(h) = 0}] = ∏_{i=1}^m P_{x∼D_i}[h(x) = f(x)].

So by the geometric-arithmetic mean inequality, i.e., (a_1 a_2 ⋯ a_m)^{1/m} ≤ (a_1 + a_2 + ⋯ + a_m)/m,
and (2) we have:

P[{S|_X : L_(S,f)(h) = 0}] ≤ ((P_{x∼D_1}[h(x) = f(x)] + ⋯ + P_{x∼D_m}[h(x) = f(x)]) / m)^m ≤ (1 − ε)^m ≤ e^{−εm}.

Therefore by (1) we have:

P[∃h ∈ H s.t. L_(D_m,f)(h) > ε and L_(S,f)(h) = 0] ≤ |H| e^{−εm}.
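This bound can be sanity-checked empirically. Below is a minimal Monte Carlo sketch; the toy finite domain, the threshold class H, and the two alternating source distributions D_i are all illustrative assumptions, not part of the exercise:

```python
import math, random

random.seed(0)
X = list(range(10))
f = lambda x: int(x >= 5)                     # realizable target, f in H
H = [lambda x, t=t: int(x >= t) for t in X]   # toy threshold class

# each example x_i is drawn from its own distribution D_i; here the D_i
# alternate between uniform on X and uniform on {3, 4, 5, 6}
D = [X, [3, 4, 5, 6]]
m, eps, trials = 12, 0.25, 5000

def avg_err(h):
    # average generalization error (1/m) * sum_i P_{x~D_i}[h(x) != f(x)]
    return sum(sum(h(x) != f(x) for x in D[i % 2]) / len(D[i % 2])
               for i in range(m)) / m

bad = [h for h in H if avg_err(h) > eps]      # the set H_B

hits = 0
for _ in range(trials):
    S = [random.choice(D[i % 2]) for i in range(m)]
    if any(all(h(x) == f(x) for x in S) for h in bad):
        hits += 1

# empirical frequency of a consistent bad hypothesis vs. |H| e^{-eps*m}
print(hits / trials, "<=", len(H) * math.exp(-eps * m))
```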

Exercise 3.6
Suppose that H is agnostic PAC learnable. So there exist an algorithm A and a function m_H : (0, 1)² → N
such that for every ε, δ ∈ (0, 1) and for every distribution D over X × Y, when running A on m ≥ m_H(ε, δ)
i.i.d. examples generated by D, A returns a hypothesis h such that, with probability of at least 1 − δ
(over the choice of the m training examples), L_D(h) ≤ min_{h'∈H} L_D(h') + ε.
Now we want to show that if the realizability assumption holds, then H is PAC learnable using A. Let D be
a probability distribution over X and let f ∈ H be the target hypothesis. Consider the distribution D'
over X × {0, 1} obtained by drawing x ∈ X according to D and taking the pair (x, f(x)). By the realizability
assumption, min_{h'∈H} L_{D'}(h') = L_{D'}(f) = 0. Let ε, δ ∈ (0, 1). Therefore, when running A on
m ≥ m_H(ε, δ) i.i.d. examples which are labeled by f (i.e., drawn from D'), A returns a hypothesis h such
that, with probability of at least 1 − δ (over the choice of the m training examples), we have:

L_{D'}(h) ≤ min_{h'∈H} L_{D'}(h') + ε = ε.

Exercise 3.7
The Bayes Optimal Predictor: Given any probability distribution D over X × {0, 1}, the Bayes optimal
predictor is the following label-predicting function from X to {0, 1}:

f_D(x) = 1  if P[Y = 1 | X = x] ≥ 1/2;   f_D(x) = 0  otherwise.

We want to verify that for every probability distribution D, the Bayes optimal predictor f_D is optimal;
that is, for every classifier g : X → {0, 1}, L_D(f_D) ≤ L_D(g).
Note that for every classifier g, since E_{X,Y}[φ(X, Y)] = E_X[E_{Y|X}[φ(X, Y) | X = x]] (the law of total
expectation), we have:

L_D(g) = E_{(x,y)∼D}[1_{g(x)≠y}] = E_{x∼D_X}[E_{y∼D_{Y|x}}[1_{g(x)≠y} | X = x]] = E_{x∼D_X}[P[g(X) ≠ Y | X = x]].

We want to prove that, for all x ∈ X, we have

P[g(X) ≠ Y | X = x] ≥ P[f_D(X) ≠ Y | X = x].



It is easy to see that for every classifier g : X → {0, 1} we have:

P[g(X) ≠ Y | X = x] = 1 − P[g(X) = Y | X = x] = 1 − P[g(X) = 1, Y = 1 | X = x] − P[g(X) = 0, Y = 0 | X = x].

Also we have

(1) P[g(X) = 1, Y = 1 | X = x] = P[g(X) = 1 | X = x] · P[Y = 1 | X = x]
(2) P[g(X) = 0, Y = 0 | X = x] = P[g(X) = 0 | X = x] · P[Y = 0 | X = x]

since, given X = x, the prediction g(X) = g(x) is deterministic. Note that if g(x) = 1 then
P[g(X) = 1 | X = x] = 1, and if g(x) = 0 then P[g(X) = 1 | X = x] = 0; likewise, if g(x) = 0 then
P[g(X) = 0 | X = x] = 1, and if g(x) = 1 then P[g(X) = 0 | X = x] = 0. Therefore

1 − P[g(X) = Y | X = x] = 1 − (1_{g(x)=1} P[Y = 1 | X = x] + 1_{g(x)=0} P[Y = 0 | X = x]).

So we have:

P[f_D(X) = Y | X = x] − P[g(X) = Y | X = x]
  = P[Y = 1 | X = x] (1_{f_D(x)=1} − 1_{g(x)=1}) + P[Y = 0 | X = x] (1_{f_D(x)=0} − 1_{g(x)=0}).

Since P[Y = 0 | X = x] = 1 − P[Y = 1 | X = x] and, for every classifier g, 1_{g(x)=0} = 1 − 1_{g(x)=1}, we
have:

P[f_D(X) = Y | X = x] − P[g(X) = Y | X = x]
  = P[Y = 1 | X = x] (1_{f_D(x)=1} − 1_{g(x)=1}) + (1 − P[Y = 1 | X = x]) (1_{g(x)=1} − 1_{f_D(x)=1})
  = (2 P[Y = 1 | X = x] − 1) (1_{f_D(x)=1} − 1_{g(x)=1}).

By the definition of the Bayes predictor, the two factors never have opposite signs: if P[Y = 1 | X = x] ≥ 1/2
then f_D(x) = 1, so 1_{f_D(x)=1} − 1_{g(x)=1} ≥ 0; and if P[Y = 1 | X = x] < 1/2 then f_D(x) = 0, so
1_{f_D(x)=1} − 1_{g(x)=1} ≤ 0. Hence for all x ∈ X we have

P[f_D(X) = Y | X = x] − P[g(X) = Y | X = x] ≥ 0.


Hence, for all x ∈ X, we have

P[g(X) ≠ Y | X = x] ≥ P[f_D(X) ≠ Y | X = x].

Since E_{X,Y}[φ(X, Y)] = E_X[E_{Y|X}[φ(X, Y) | X = x]], the latter inequality gives:

L_D(g) = E_{(x,y)∼D}[1_{g(x)≠y}] = E_{x∼D_X}[P[g(X) ≠ Y | X = x]] ≥ E_{x∼D_X}[P[f_D(X) ≠ Y | X = x]] = L_D(f_D).
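To make the claim concrete, here is a small numerical sketch assuming a toy joint distribution over a three-point domain (all numbers illustrative): it computes f_D and checks L_D(f_D) ≤ L_D(g) for every classifier g.

```python
import itertools

# A toy joint distribution D over X = {0, 1, 2} and Y = {0, 1}:
# p[x][y] = P[X = x, Y = y]  (illustrative numbers summing to 1)
p = {0: {0: 0.30, 1: 0.10}, 1: {0: 0.05, 1: 0.25}, 2: {0: 0.12, 1: 0.18}}

def eta(x):                     # P[Y = 1 | X = x]
    return p[x][1] / (p[x][0] + p[x][1])

f_D = {x: int(eta(x) >= 0.5) for x in p}    # the Bayes optimal predictor

def loss(g):                    # L_D(g) = P[g(X) != Y]
    return sum(p[x][y] for x in p for y in (0, 1) if g[x] != y)

# f_D is at least as good as each of the 2^3 classifiers g: X -> {0, 1}
for bits in itertools.product((0, 1), repeat=3):
    g = dict(zip(p, bits))
    assert loss(f_D) <= loss(g)
print("L_D(f_D) =", loss(f_D))  # 0.10 + 0.05 + 0.12 = 0.27
```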
