IE 643: Deep Learning - Theory and Practice
Lecture 4 (August 25, 2020)
P. Balamurugan
1 Recap
Perceptron Training and Convergence
Perceptron

Prediction Rule:
\[ \langle w, x \rangle \ge \theta \implies \text{predict } 1, \qquad \langle w, x \rangle < \theta \implies \text{predict } -1. \]

Geometric Idea: Find a separating hyperplane (w, θ) such that samples with class labels 1 and −1 lie on opposite sides of the hyperplane.
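A minimal sketch of this prediction rule in code (assuming NumPy; the weights w, threshold θ, and inputs below are illustrative, not from the lecture):

```python
import numpy as np

def perceptron_predict(w, theta, x):
    """Predict +1 if <w, x> >= theta, else -1."""
    return 1 if np.dot(w, x) >= theta else -1

# Illustrative weights, threshold and inputs.
w, theta = np.array([2.0, -1.0]), 0.5
print(perceptron_predict(w, theta, np.array([1.0, 0.0])))   # <w, x> = 2.0 >= 0.5  -> predict 1
print(perceptron_predict(w, theta, np.array([0.0, 1.0])))   # <w, x> = -1.0 < 0.5 -> predict -1
```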
Perceptron - Training
First assumption: The data should at least be such that the samples with label 1 can be separated by a hyperplane from the samples with label −1.

Is this assumption sufficient? For the convergence analysis we use a quantitative version of it: there exist w^* and a margin γ > 0 such that
\[ \langle w^*, x^t \rangle > \gamma \quad \text{whenever } y^t = 1, \]
\[ \langle w^*, x^t \rangle < -\gamma \quad \text{whenever } y^t = -1, \]
or equivalently,
\[ y^t \langle w^*, x^t \rangle > \gamma \quad \forall t. \]
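As a concrete illustration (a toy example, not from the lecture): for a given candidate w^* and labelled samples, any γ strictly below min_t y^t ⟨w^*, x^t⟩ satisfies this assumption, and this can be checked directly:

```python
import numpy as np

# Toy labelled samples and a candidate separator w_star (illustrative values).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w_star = np.array([1.0, 1.0])

scores = y * (X @ w_star)   # y_t * <w_star, x_t> for each sample
print(scores)               # all positive  =>  w_star separates the two classes
print(scores.min())         # any gamma strictly below this value satisfies the margin assumption
```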
Perceptron Mistake Bound

If a mistake is made at round t, the Perceptron update w^{t+1} = w^t + y^t x^t gives
\[ \langle w^*, w^{t+1} \rangle - \langle w^*, w^t \rangle = y^t \langle w^*, x^t \rangle > \gamma. \]
If no mistake is made at round t, then w^{t+1} = w^t, and hence
\[ \langle w^*, w^{t+1} \rangle - \langle w^*, w^t \rangle = 0. \]
Summing these differences over t = 1, . . . , T and splitting the sum according to whether a mistake was made at round t:
\[
\sum_{t=1}^{T} \Big( \langle w^*, w^{t+1} \rangle - \langle w^*, w^t \rangle \Big)
= \sum_{\substack{t \in \{1,\dots,T\}:\\ \text{mistake at round } t}} \Big( \langle w^*, w^{t+1} \rangle - \langle w^*, w^t \rangle \Big)
+ \sum_{\substack{t \in \{1,\dots,T\}:\\ \text{no mistake at round } t}} \Big( \langle w^*, w^{t+1} \rangle - \langle w^*, w^t \rangle \Big)
\]
\[
= \sum_{\substack{t \in \{1,\dots,T\}:\\ \text{mistake at round } t}} \Big( \langle w^*, w^{t+1} \rangle - \langle w^*, w^t \rangle \Big)
> M\gamma,
\]
where M denotes the number of rounds (among 1, . . . , T) at which a mistake is made.
Also note that the sum telescopes; using the initialization w^1 = 0 (so that ⟨w^*, w^1⟩ = 0),
\[ \sum_{t=1}^{T} \Big( \langle w^*, w^{t+1} \rangle - \langle w^*, w^t \rangle \Big) = \langle w^*, w^{T+1} \rangle. \]
Hence we have:
\[ \langle w^*, w^{T+1} \rangle > M\gamma. \]
Assumption on boundedness of ‖x^t‖_2

We shall further assume that for all t = 1, 2, . . ., the ℓ_2 norm (length) of x^t is bounded:
\[ \| x^t \|_2 \le R \quad \forall t = 1, 2, \dots \]

Under this assumption, one can also show that
\[ \| w^{T+1} \|_2^2 \le M R^2. \]
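A short sketch of why this bound holds (my filling-in of the standard argument, assuming the homogeneous update w^{t+1} = w^t + y^t x^t on mistake rounds and w^1 = 0): on a mistake round, y^t ⟨w^t, x^t⟩ ≤ 0, so
\[
\| w^{t+1} \|_2^2
= \| w^t + y^t x^t \|_2^2
= \| w^t \|_2^2 + 2\, y^t \langle w^t, x^t \rangle + \| x^t \|_2^2
\le \| w^t \|_2^2 + R^2 .
\]
On no-mistake rounds w^{t+1} = w^t, so the squared norm grows by at most R^2 exactly M times, giving ‖w^{T+1}‖_2^2 ≤ M R^2.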
Combining the two bounds and using the Cauchy-Schwarz inequality:
\[ M\gamma < \langle w^*, w^{T+1} \rangle \le \| w^* \|_2 \, \| w^{T+1} \|_2 \]
\[ \implies M^2 \gamma^2 < \| w^* \|_2^2 \, \| w^{T+1} \|_2^2 \le \| w^* \|_2^2 \, M R^2 \]
\[ \implies M < \frac{\| w^* \|_2^2 \, R^2}{\gamma^2}. \]
Thus, assuming that ‖w^*‖_2 and R can be controlled, the bound on the number of mistakes M is inversely proportional to γ^2, where the margin γ measures how close the data points can come to the separating hyperplane.
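A minimal simulation sketch of this bound (my own illustration, assuming a synthetic linearly separable dataset, the homogeneous update w ← w + y^t x^t on mistakes, and w^1 = 0; the helper name perceptron_mistakes is hypothetical):

```python
import numpy as np

def perceptron_mistakes(X, y, n_epochs=50):
    """Run the homogeneous Perceptron and count the mistakes made during training."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(n_epochs):
        made_mistake = False
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:      # mistake: apply the update w <- w + y_t x_t
                w = w + y_t * x_t
                mistakes += 1
                made_mistake = True
        if not made_mistake:                   # a full pass with no mistakes: converged
            break
    return w, mistakes

rng = np.random.default_rng(0)
w_star = np.array([1.0, -1.0])                 # known separator used to generate labels
X = rng.uniform(-1.0, 1.0, size=(200, 2))
scores = X @ w_star
X, scores = X[np.abs(scores) > 0.2], scores[np.abs(scores) > 0.2]   # enforce a margin
y = np.sign(scores)

gamma = np.min(y * (X @ w_star))               # empirical margin with respect to w_star
R = np.max(np.linalg.norm(X, axis=1))          # bound on ||x_t||_2
w, M = perceptron_mistakes(X, y)
bound = np.linalg.norm(w_star) ** 2 * R ** 2 / gamma ** 2
print(f"mistakes M = {M}, bound ||w*||^2 R^2 / gamma^2 = {bound:.1f}")
```

The bound is a worst-case guarantee, so the observed M is typically much smaller than the computed value.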
Perceptron Mistake Bound (Continued...)
References:
H. D. Block: The perceptron: A model for brain functioning. Reviews of Modern Physics 34, 123-135 (1962).
A. B. J. Novikoff: On convergence proofs on perceptrons. In: Proceedings of the Symposium on the Mathematical Theory of Automata, vol. XII, pp. 615-622 (1962).
First question: how can such a w^* and γ be found? One way to view this is as a feasibility problem with a constant objective, in which any pair (u, µ) satisfying the margin condition y^t ⟨u, x^t⟩ > µ for all t is a solution:
\[ (w^*, \gamma) = \operatorname*{argmin}_{u,\,\mu} \; 0. \]
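Under this (reconstructed) feasibility reading, a minimal sketch is shown below; normalizing the margin to 1 is my assumption, not necessarily the lecture's formulation, and scipy.optimize.linprog is just one off-the-shelf way to solve such a problem:

```python
import numpy as np
from scipy.optimize import linprog

# Toy labelled samples (illustrative values).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# Feasibility LP with constant objective 0: find u such that y_t <u, x_t> >= 1 for all t.
# (Fixing the margin to 1 removes the trivial solution u = 0.)
c = np.zeros(d)                        # constant objective: any feasible u is optimal
A_ub = -(y[:, None] * X)               # y_t <u, x_t> >= 1  <=>  -(y_t x_t)^T u <= -1
b_ub = -np.ones(n)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * d)
print(res.status, res.x)               # status 0 => a separating u was found
```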
Second question:
What is the intuition behind the Perceptron update rule?
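To make the question concrete, here is a small sketch (my own illustration, not the lecture's answer) of a single update w^{t+1} = w^t + y^t x^t on a mistaken sample; it shows that the update increases the score y^t ⟨w, x^t⟩ by exactly ‖x^t‖_2^2, pushing w towards classifying that sample correctly:

```python
import numpy as np

# One Perceptron update on a mistaken sample (x_t, y_t): w <- w + y_t * x_t.
w = np.array([0.5, -1.0])
x_t, y_t = np.array([1.0, 2.0]), 1.0

score_before = y_t * np.dot(w, x_t)     # <= 0 here, i.e. a mistake
w_new = w + y_t * x_t
score_after = y_t * np.dot(w_new, x_t)

# The score increases by y_t^2 * ||x_t||^2 = ||x_t||^2 > 0.
print(score_before, score_after, score_after - score_before, np.dot(x_t, x_t))
```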