Chapter 3 Solutions: Understanding Machine Learning
Zahra Taheri (https://github.com/zahta/Exercises-Understanding-Machine-Learning)
Mar 2020
Exercise 3.2
1. As mentioned in the exercise, the realizability assumption here implies that the true hypothesis f labels all examples in the domain X negatively, except perhaps a single point. Let A be the algorithm that returns the hypothesis h_S defined by
h_S = h_x if ∃x ∈ S s.t. f(x) = 1, and h_S = h^− otherwise.
Note that L_S(h_S) = 0, so A is an ERM. If f = h^−, then h_S = h^− = f and L_{(D,f)}(h_S) = 0 for every sample. So assume f labels exactly one point x ∈ X positively. If x ∈ S, then h_S = h_x = f and again L_{(D,f)}(h_S) = 0; if x ∉ S, then h_S = h^− and L_{(D,f)}(h_S) = D({x}). Hence

{S|_X : L_{(D,f)}(h_S) > ε} = {S|_X : x ∉ S|_X and D({x}) > ε},

which is empty unless D({x}) > ε, and each of the m i.i.d. draws avoids x with probability 1 − D({x}) ≤ 1 − ε. Therefore we have:

D^m({S|_X : L_{(D,f)}(h_S) > ε}) ≤ (1 − ε)^m ≤ e^{−εm}.
Let δ ∈ (0, 1) and choose m such that e^{−εm} ≤ δ, i.e., m ≥ log(1/δ)/ε. Therefore, H_singleton is PAC learnable with sample complexity m_{H_singleton}(ε, δ) ≤ ⌈log(1/δ)/ε⌉.
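As a small illustration only (not part of the book's solution), the ERM algorithm A above can be written in a few lines of Python; the function name erm_singleton is ours.

# Minimal sketch of the ERM algorithm A for H_singleton.
# A sample S is a list of (x, y) pairs with labels y in {0, 1}.
def erm_singleton(S):
    """Return h_S: the singleton hypothesis h_x if S contains a positive
    example x, and the all-negative hypothesis h^- otherwise."""
    positives = [x for x, y in S if y == 1]
    if positives:
        x_pos = positives[0]                      # realizability: at most one positive point
        return lambda x: 1 if x == x_pos else 0   # h_x
    return lambda x: 0                            # h^-

# Example usage: the positive point is 3, so h_S labels only 3 positively.
h = erm_singleton([(1, 0), (3, 1), (7, 0)])
assert h(3) == 1 and h(5) == 0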
Exercise 3.3
Similar to Exercise 2.3, let A be the algorithm that returns the smallest circle centered at the origin that encloses all the positive examples in the training set S. Let C(S) be the circle returned by A, with radius r(S), and let A(S) : X → Y be the corresponding hypothesis. As in Exercise 2.3, it is easy to see that L_S(A(S)) = 0, and so A is an ERM.
Let D be a probability distribution over X, let ε ∈ (0, 1), and let f be the target hypothesis in H. By the realizability assumption, there exists a circle C* with radius r* whose corresponding hypothesis h* has zero generalization error. By the definitions of C(S) and C*, we have C(S) ⊆ C*. Moreover, since every point misclassified by A(S) lies in C* but outside C(S), we have L_{(D,f)}(A(S)) = D(C* \ C(S)).
Let r_1 ≤ r* be a number such that the probability mass (with respect to D) of the strip C_1 = {x ∈ R² : r_1 ≤ ||x|| ≤ r*} is ε. If S contains a (positive) example in C_1, then r_1 ≤ r(S) ≤ r*, so C* \ C(S) ⊆ C_1 and, by the discussion above, L_{(D,f)}(A(S)) = D(C* \ C(S)) ≤ D(C_1) = ε. We would now like to upper bound D^m({S|_X : L_{(D,f)}(A(S)) > ε}). By the discussion above, this event is contained in the event that none of the m points of S falls in C_1, and each point falls in C_1 with probability ε. Therefore we have:

D^m({S|_X : L_{(D,f)}(A(S)) > ε}) ≤ (1 − ε)^m ≤ e^{−εm}.
Let δ ∈ (0, 1) and choose m such that e^{−εm} ≤ δ, i.e., m ≥ log(1/δ)/ε. Therefore, H is PAC learnable with m_H(ε, δ) ≤ ⌈log(1/δ)/ε⌉.
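For illustration only (our own sketch, not from the book), the ERM A for concentric circles reduces to taking the largest norm among the positive examples; the function name erm_concentric_circle is ours.

import math

# Sketch of the ERM A: return the smallest origin-centered circle
# enclosing all positive examples of S.
def erm_concentric_circle(S):
    """S is a list of ((x1, x2), y) pairs with labels y in {0, 1}."""
    radii = [math.hypot(p[0], p[1]) for p, y in S if y == 1]
    r = max(radii) if radii else 0.0   # r(S); degenerate circle if no positives
    return lambda p: 1 if math.hypot(p[0], p[1]) <= r else 0

# Example usage: r(S) = 2, so points of norm at most 2 are labeled positively.
h = erm_concentric_circle([((1.0, 0.0), 1), ((0.0, 2.0), 1), ((5.0, 5.0), 0)])
assert h((0.5, 0.5)) == 1 and h((3.0, 3.0)) == 0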
Exercise 3.4
1. Let H be the hypothesis class of all conjunctions over d Boolean variables. If we show that H is finite, then by Corollary 3.2, H is PAC learnable. Let h ∈ H be a hypothesis other than the all-negative hypothesis, and let x = (x_1, . . . , x_d) ∈ X. Then h(x) = ⋀_{i=1}^{d} a_i, where a_i ∈ {x_i, x̄_i, none} for all i ∈ [d], and by "none" we mean that neither of the literals x_i and x̄_i appears in h. Therefore |H| = 3^d + 1, and so by Corollary 3.2, H is PAC learnable with sample complexity
m_H(ε, δ) ≤ ⌈log(|H|/δ)/ε⌉.
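As a small sanity check of the count |H| = 3^d + 1 (our own illustration, not part of the solution), one can enumerate, for a small d, the distinct labelings of {0, 1}^d induced by conjunctions of literals:

from itertools import product

d = 3
points = list(product([0, 1], repeat=d))

def labeling(literals):
    # literals is a set of pairs (j, b): (j, True) stands for x_j,
    # (j, False) for the negated literal.
    return tuple(int(all((x[j] == 1) == b for j, b in literals)) for x in points)

all_literals = [(j, b) for j in range(d) for b in (True, False)]
functions = set()
for mask in product([0, 1], repeat=2 * d):
    chosen = {lit for lit, keep in zip(all_literals, mask) if keep}
    functions.add(labeling(chosen))

assert len(functions) == 3 ** d + 1   # matches |H| = 3^d + 1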
2. Suppose that S is a training set of size m and that x′_1, . . . , x′_l are all the positively labeled instances in S. By induction on i ≤ l, we define conjunctions h_i. Let h_0 be the all-negative hypothesis, defined by h_0(x) := ⋀_{j=1}^{d} (x_j ∧ x̄_j), i.e., the conjunction containing all 2d literals. Let i + 1 ≤ l and x′_{i+1} = (x^{i+1}_1, . . . , x^{i+1}_d). We obtain h_{i+1} from h_i by deleting every literal that x′_{i+1} violates: for each j ∈ [d], if x^{i+1}_j = 1 we remove the literal x̄_j from h_i (if present), and if x^{i+1}_j = 0 we remove the literal x_j.
The algorithm returns h_l. It is easy to see that h_l labels x′_1, . . . , x′_l as positive. Moreover, h_l is the most restrictive conjunction that labels positively all the positively labeled members of S, and by the realizability assumption the true conjunction also labels them positively, so every literal of the true conjunction appears in h_l. Hence h_l(x) = 1 implies that the true label of x is 1, so h_l also classifies the negatively labeled examples of S correctly. Therefore L_S(h_l) = 0, and so the algorithm implements the ERM rule.
(Note: The solution of this exercise is explained in Section 8.2.3 of the book)
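A short Python sketch of this learner follows (our own illustration; representing a conjunction as a set of literals is just an implementation choice).

# Sketch of the ERM learner for Boolean conjunctions.
# A conjunction is a set of literals (j, b): (j, True) stands for x_j and
# (j, False) for the negated literal.
def erm_conjunction(S, d):
    """S is a list of (x, y) pairs, x a tuple in {0, 1}^d, y in {0, 1}."""
    # h_0: the all-negative hypothesis, containing all 2d literals.
    literals = {(j, b) for j in range(d) for b in (True, False)}
    for x, y in S:
        if y == 1:
            # Delete every literal violated by the positive example x.
            literals -= {(j, x[j] == 0) for j in range(d)}
    def h(x):
        return 1 if all((x[j] == 1) == b for j, b in literals) else 0
    return h

# Example usage with target f(x) = x_1 ∧ (not x_3), i.e., x[0] and not x[2].
S = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 0, 1), 0)]
h = erm_conjunction(S, 3)
assert all(h(x) == y for x, y in S)   # L_S(h) = 0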
Exercise 3.5
Let H_B = {h ∈ H : L_{(D̄_m,f)}(h) > ε}, where D̄_m = (D_1 + · · · + D_m)/m denotes the average of the distributions D_1, . . . , D_m from which the m instances are drawn, and let M = {S|_X : ∃h ∈ H_B s.t. L_{(S,f)}(h) = 0}. Then we have

P[∃h ∈ H s.t. L_{(D̄_m,f)}(h) > ε and L_{(S,f)}(h) = 0] = P[M] = P[⋃_{h∈H_B} {S|_X : L_{(S,f)}(h) = 0}].

So by the union bound we have:

P[∃h ∈ H s.t. L_{(D̄_m,f)}(h) > ε and L_{(S,f)}(h) = 0] ≤ |H| · max_{h∈H_B} P[{S|_X : L_{(S,f)}(h) = 0}]   (1)
On the other hand, if there exists h ∈ H such that L_{(D̄_m,f)}(h) > ε, then by the definition of the generalization error we have

(1/m) ∑_{i=1}^{m} P_{x∼D_i}[h(x) ≠ f(x)] > ε.

Therefore

(P_{x∼D_1}[h(x) = f(x)] + · · · + P_{x∼D_m}[h(x) = f(x)]) / m ≤ 1 − ε.   (2)
Also we have:

P[{S|_X : L_{(S,f)}(h) = 0}] = ∏_{i=1}^{m} P_{x∼D_i}[h(x) = f(x)] = ((∏_{i=1}^{m} P_{x∼D_i}[h(x) = f(x)])^{1/m})^m.
So by the geometric-arithmetic mean inequality, i.e., (a_1 a_2 · · · a_m)^{1/m} ≤ (a_1 + a_2 + · · · + a_m)/m, and (2), we have:

P[{S|_X : L_{(S,f)}(h) = 0}] ≤ ((P_{x∼D_1}[h(x) = f(x)] + · · · + P_{x∼D_m}[h(x) = f(x)]) / m)^m ≤ (1 − ε)^m ≤ e^{−εm}.

Combining this with (1) yields P[∃h ∈ H s.t. L_{(D̄_m,f)}(h) > ε and L_{(S,f)}(h) = 0] ≤ |H| e^{−εm}, as required.
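As a tiny numerical sanity check (our own, with arbitrary illustrative values a_i standing for the probabilities P_{x∼D_i}[h(x) = f(x)] in (2)), the two inequalities used above can be verified in Python:

import math
import random

random.seed(0)
m, eps = 20, 0.1
# Arbitrary per-example probabilities, each at most 1 - eps, as in (2).
a = [random.uniform(0.5, 1 - eps) for _ in range(m)]

geo = math.prod(a) ** (1 / m)   # geometric mean
ari = sum(a) / m                # arithmetic mean
assert geo <= ari <= 1 - eps    # AM-GM and inequality (2)
assert math.prod(a) <= (1 - eps) ** m <= math.exp(-eps * m)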
Exercise 3.6
Suppose that H is agnostic PAC learnable. So there exist an algorithm A and a function m_H : (0, 1)^2 → N
such that for every ε, δ ∈ (0, 1) and for every distribution D over X ×Y, when running A on m ≥ mH (ε, δ)
i.i.d. examples generated by D, A returns a hypothesis h such that, with probability of at least 1 − δ
(over the choice of the m training examples), L_D(h) ≤ min_{h′∈H} L_D(h′) + ε.
Now we want to show that if the realizability assumption holds, H is PAC learnable using A. Let D be a probability distribution over X and let f ∈ H be the target hypothesis. Consider the distribution D′ over X × {0, 1} obtained by drawing x ∈ X according to D and taking the pair (x, f(x)). By the realizability assumption, min_{h′∈H} L_{D′}(h′) = 0. Let ε, δ ∈ (0, 1). Therefore, when running A on m ≥ m_H(ε, δ) i.i.d. examples labeled by f (that is, i.i.d. examples drawn from D′), A returns a hypothesis h such that, with probability of at least 1 − δ (over the choice of the m training examples),

L_{(D,f)}(h) = L_{D′}(h) ≤ min_{h′∈H} L_{D′}(h′) + ε = ε.

Hence H is PAC learnable using the same algorithm A and the same sample complexity function m_H.
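The reduction above is purely formal; the following Python fragment (our own sketch, with a hypothetical agnostic learner agnostic_A) only records how such a learner would be reused unchanged on realizable data.

# Sketch of the reduction in Exercise 3.6: a (hypothetical) agnostic PAC
# learner is reused, without modification, on data labeled by f.
def pac_learn_realizable(agnostic_A, sample_x, f):
    """sample_x: instances drawn i.i.d. from D; f: the target hypothesis in H.
    The labeled sample (x, f(x)) is distributed exactly as D' in the text, so
    agnostic_A returns h with L_{D'}(h) <= 0 + eps, with prob. at least 1 - delta."""
    S = [(x, f(x)) for x in sample_x]
    return agnostic_A(S)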
Exercise 3.7
The Bayes Optimal Predictor: Given any probability distribution D over X × {0, 1}, the Bayes optimal predictor is the following label-predicting function from X to {0, 1}:

f_D(x) = 1 if P[y = 1 | x] ≥ 1/2, and f_D(x) = 0 otherwise.
We want to verify that for every probability distribution D, the Bayes optimal predictor f_D is optimal, in the sense that for every classifier g : X → {0, 1}, L_D(f_D) ≤ L_D(g).
Note that for every classifier g, since E_{X,Y}[f(X, Y)] = E_X[E_{Y|X}[f(X, Y) | X = x]], we have:

L_D(g) = E_{(x,y)∼D}[1_{g(x)≠y}] = E_{x∼D_X}[E_{y∼D_{Y|x}}[1_{g(x)≠y} | X = x]] = E_{x∼D_X}[P[{g(X) ≠ Y | X = x}]].
Also we have
(1) P [{g(X) = 1, Y = 1|X = x}] = P [{g(X) = 1|X = x}] P [{Y = 1|X = x}]
(2) P [{g(X) = 0, Y = 0|X = x}] = P [{g(X) = 0|X = x}] P [{Y = 0|X = x}]
Note that if g(x) = 1 then P [{g(X) = 1|X = x}] = 1 and if g(x) = 0 then P [{g(X) = 1|X = x}] = 0.
Also if g(x) = 0 then P [{g(X) = 0|X = x}] = 1 and if g(x) = 1 then P [{g(X) = 0|X = x}] = 0.
Therefore, for every classifier g,

1 − P[{g(X) = Y | X = x}] = 1 − (1_{g(x)=1} P[{Y = 1 | X = x}] + 1_{g(x)=0} P[{Y = 0 | X = x}]),

and the same identity holds with g replaced by f_D. Subtracting the two identities, we have:

P[{f_D(X) = Y | X = x}] − P[{g(X) = Y | X = x}]
= P[{Y = 1 | X = x}] (1_{f_D(x)=1} − 1_{g(x)=1}) + P[{Y = 0 | X = x}] (1_{f_D(x)=0} − 1_{g(x)=0}).

Since P[{Y = 0 | X = x}] = 1 − P[{Y = 1 | X = x}] and, for every classifier g, 1_{g(x)=0} = 1 − 1_{g(x)=1}, the right-hand side equals

P[{Y = 1 | X = x}] (1_{f_D(x)=1} − 1_{g(x)=1}) + (1 − P[{Y = 1 | X = x}]) (1 − 1_{f_D(x)=1} − 1 + 1_{g(x)=1})
= (2 P[{Y = 1 | X = x}] − 1) (1_{f_D(x)=1} − 1_{g(x)=1}).

Therefore

P[{f_D(X) = Y | X = x}] − P[{g(X) = Y | X = x}] = (2 P[{Y = 1 | X = x}] − 1) (1_{f_D(x)=1} − 1_{g(x)=1}).
Since E_{X,Y}[f(X, Y)] = E_X[E_{Y|X}[f(X, Y) | X = x]], by the latter equality we have:

L_D(g) − L_D(f_D) = E_{x∼D_X}[P[{f_D(X) = Y | X = x}] − P[{g(X) = Y | X = x}]]
= E_{x∼D_X}[(2 P[{Y = 1 | X = x}] − 1) (1_{f_D(x)=1} − 1_{g(x)=1})] ≥ 0,

since for every x the two factors inside the expectation have the same sign: if f_D(x) = 1 then P[{Y = 1 | X = x}] ≥ 1/2 and 1_{f_D(x)=1} − 1_{g(x)=1} ≥ 0, while if f_D(x) = 0 then P[{Y = 1 | X = x}] < 1/2 and 1_{f_D(x)=1} − 1_{g(x)=1} ≤ 0. Therefore L_D(f_D) ≤ L_D(g) for every classifier g, as required.
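As an illustration only (not part of the proof), the following Python snippet checks, on a tiny discrete distribution with arbitrary values, that the Bayes predictor f_D attains the smallest 0-1 risk among all deterministic classifiers.

from itertools import product

# Tiny discrete joint distribution D over X × {0, 1}; the values are arbitrary.
X = ['a', 'b', 'c']
p_x = {'a': 0.5, 'b': 0.3, 'c': 0.2}    # marginal distribution of X
p_y1 = {'a': 0.9, 'b': 0.4, 'c': 0.5}   # P[Y = 1 | X = x]

def risk(g):
    """0-1 risk L_D(g) = E[1_{g(x) != y}] for a classifier given as a dict."""
    return sum(p_x[x] * (p_y1[x] if g[x] == 0 else 1 - p_y1[x]) for x in X)

# Bayes optimal predictor: f_D(x) = 1 iff P[Y = 1 | X = x] >= 1/2.
f_D = {x: int(p_y1[x] >= 0.5) for x in X}

# Compare against every deterministic classifier g : X -> {0, 1}.
assert all(risk(f_D) <= risk(dict(zip(X, bits))) + 1e-12
           for bits in product([0, 1], repeat=len(X)))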