212 Handout
Machine Learning Techniques (機器學習技法)
Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models
[figure: a two-layer network: the input $\mathbf{x} = (x_0 = 1, x_1, \ldots, x_d)$ feeds perceptrons $g_1, \ldots, g_T$ (weights $\mathbf{w}_1, \ldots, \mathbf{w}_T$), whose outputs are combined with weights $\alpha_1, \ldots, \alpha_T$ into $G$]

$$G(\mathbf{x}) = \operatorname{sign}\Bigg(\sum_{t=1}^{T} \alpha_t \underbrace{\operatorname{sign}\big(\mathbf{w}_t^T \mathbf{x}\big)}_{g_t(\mathbf{x})}\Bigg)$$

• two layers of weights: $\mathbf{w}_t$ and $\boldsymbol{\alpha}$
• two layers of sign functions: in $g_t$ and in $G$
[figure: in the $(g_1, g_2)$ plane with labels $\pm 1$, AND$(g_1, g_2)$ is linearly separable while XOR$(g_1, g_2)$ is not]

• limitation: XOR not 'linearly separable' under $\phi(\mathbf{x}) = (g_1(\mathbf{x}), g_2(\mathbf{x}))$
[figure: a multi-layer construction: a first layer of two AND perceptrons feeds a second-layer OR perceptron, so that $G \equiv$ XOR$(g_1, g_2)$]

$$\text{XOR}(g_1, g_2) = \text{OR}\big(\text{AND}(-g_1, g_2), \text{AND}(g_1, -g_2)\big)$$

(a small numerical check of this construction follows below)
perceptron (simple) =⇒ aggregation of perceptrons (powerful) =⇒ multi-layer perceptrons (more powerful)
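To make the construction concrete, here is a minimal sketch (not from the handout) of AND, OR, and the two-layer XOR built from ±1-valued sign perceptrons; the particular weight choices, e.g. bias −1 for AND and +1 for OR, are one workable instantiation assumed for illustration.

    import numpy as np

    def sign(s):
        return np.where(s >= 0, 1, -1)      # map scores to {+1, -1}; ties broken toward +1

    def AND(g1, g2):
        return sign(-1 + g1 + g2)           # +1 only when both inputs are +1

    def OR(g1, g2):
        return sign(+1 + g1 + g2)           # -1 only when both inputs are -1

    def XOR(g1, g2):
        # two layers: OR of AND(-g1, g2) and AND(g1, -g2)
        return OR(AND(-g1, g2), AND(g1, -g2))

    for g1 in (+1, -1):
        for g2 in (+1, -1):
            print(g1, g2, int(XOR(g1, g2)))   # -1, +1, +1, -1: exactly XOR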
[figure: illustration by UC Regents Davis campus, brainmaps.org]
Fun Time
Let $g_0(\mathbf{x}) = +1$. Which of the following $(\alpha_0, \alpha_1, \alpha_2)$ allows $G(\mathbf{x}) = \operatorname{sign}\left(\sum_{t=0}^{2} \alpha_t g_t(\mathbf{x})\right)$ to implement OR$(g_1, g_2)$?
1 (−3, +1, +1)
2 (−1, +1, +1)
3 (+1, +1, +1)
4 (+3, +1, +1)
Reference Answer: 3
You can easily verify with all four possibilities of $(g_1(\mathbf{x}), g_2(\mathbf{x}))$.
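As the answer note says, choice 3 can be verified directly; a tiny sketch (not from the handout):

    import numpy as np

    alpha = np.array([+1, +1, +1])              # (alpha0, alpha1, alpha2) from choice 3
    for g1 in (+1, -1):
        for g2 in (+1, -1):
            G = 1 if alpha @ np.array([+1, g1, g2]) >= 0 else -1   # g0(x) = +1
            print(g1, g2, G)                    # +1 unless g1 = g2 = -1, i.e. OR(g1, g2)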
[figure: three networks that differ only in the output error: err = 0/1, err = squared, err = cross-entropy]
Physical Interpretation
[figure: a feed-forward network with weight layers $w^{(1)}_{ij}$, $w^{(2)}_{jk}$, $w^{(3)}_{kq}$ and a tanh transformation at each hidden unit, e.g. $x^{(2)}_3 = \tanh\left(s^{(2)}_3\right)$]
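A minimal forward-pass sketch (not from the handout), assuming tanh transformations on all hidden layers, a single linear output, and a bias input $x_0 = 1$ prepended at every layer:

    import numpy as np

    def forward(x, weights):
        # weights[l] has shape (d_prev + 1, d_next); row 0 holds the bias weights (x0 = 1)
        out = np.asarray(x, float)
        for l, W in enumerate(weights):
            s = W.T @ np.concatenate(([1.0], out))           # scores s^(l+1)
            out = np.tanh(s) if l < len(weights) - 1 else s  # tanh hidden units, linear output
        return out

    rng = np.random.default_rng(0)
    weights = [rng.normal(size=(3 + 1, 5)), rng.normal(size=(5 + 1, 1))]   # a 3-5-1 NNet
    print(forward([0.5, -0.2, 0.8], weights))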
Fun Time
How many weights $\left\{w^{(\ell)}_{ij}\right\}$ are there in a 3-5-1 NNet?
1 9
2 15
3 20
4 26
Reference Answer: 4
There are $(3 + 1) \times 5$ weights in $w^{(1)}_{ij}$, and $(5 + 1) \times 1$ weights in $w^{(2)}_{jk}$.
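The same count works for any layer specification; a one-line helper (an illustration, not from the handout) assuming a +1 bias unit in every layer except the output:

    def num_weights(layers):
        # e.g. layers = [3, 5, 1]; each transition needs (d_prev + 1) * d_next weights
        return sum((a + 1) * b for a, b in zip(layers, layers[1:]))

    print(num_weights([3, 5, 1]))   # (3+1)*5 + (5+1)*1 = 26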
next: efficient computation of $\frac{\partial e_n}{\partial w^{(\ell)}_{ij}}$
Computing $\frac{\partial e_n}{\partial w^{(L)}_{i1}}$ (Output Layer)

$$e_n = \big(y_n - \text{NNet}(\mathbf{x}_n)\big)^2 = \left(y_n - s^{(L)}_1\right)^2 = \left(y_n - \sum_{i=0}^{d^{(L-1)}} w^{(L)}_{i1}\, x^{(L-1)}_i\right)^2$$

$$\frac{\partial e_n}{\partial w^{(L)}_{i1}} = \frac{\partial e_n}{\partial s^{(L)}_1} \cdot \frac{\partial s^{(L)}_1}{\partial w^{(L)}_{i1}} = -2\left(y_n - s^{(L)}_1\right) \cdot x^{(L-1)}_i
\qquad
\frac{\partial e_n}{\partial w^{(\ell)}_{ij}} = \frac{\partial e_n}{\partial s^{(\ell)}_j} \cdot \frac{\partial s^{(\ell)}_j}{\partial w^{(\ell)}_{ij}} = \delta^{(\ell)}_j \cdot x^{(\ell-1)}_i$$

$\delta^{(L)}_1 = -2\left(y_n - s^{(L)}_1\right)$; how about the others?
Computing $\delta^{(\ell)}_j = \frac{\partial e_n}{\partial s^{(\ell)}_j}$

$$s^{(\ell)}_j \;\overset{\tanh}{\Longrightarrow}\; x^{(\ell)}_j \;\overset{w^{(\ell+1)}_{jk}}{\Longrightarrow}\; \left(s^{(\ell+1)}_1, \ldots, s^{(\ell+1)}_k, \ldots\right) \;\Longrightarrow \cdots \Longrightarrow\; e_n$$

$$\delta^{(\ell)}_j = \frac{\partial e_n}{\partial s^{(\ell)}_j}
= \sum_{k=1}^{d^{(\ell+1)}} \frac{\partial e_n}{\partial s^{(\ell+1)}_k} \cdot \frac{\partial s^{(\ell+1)}_k}{\partial x^{(\ell)}_j} \cdot \frac{\partial x^{(\ell)}_j}{\partial s^{(\ell)}_j}
= \sum_{k} \delta^{(\ell+1)}_k \cdot w^{(\ell+1)}_{jk} \cdot \tanh'\left(s^{(\ell)}_j\right)$$

$\delta^{(\ell)}_j$ can be computed backwards from $\delta^{(\ell+1)}_k$
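Putting the two computations together, here is a minimal backpropagation sketch for a single example (an illustrative implementation assuming tanh hidden units, one linear output, and squared error, matching the handout's conventions but not code from it):

    import numpy as np

    def backprop_single(x, y, weights):
        # weights[l]: shape (d_prev + 1, d_next); row 0 multiplies the bias x0 = 1
        xs, ss = [np.asarray(x, float)], []
        for l, W in enumerate(weights):                        # forward pass, caching x^(l), s^(l)
            ss.append(W.T @ np.concatenate(([1.0], xs[-1])))
            xs.append(np.tanh(ss[-1]) if l < len(weights) - 1 else ss[-1])
        grads = [None] * len(weights)
        delta = -2.0 * (y - ss[-1])                            # delta^(L) = -2 (y_n - s^(L))
        for l in reversed(range(len(weights))):
            grads[l] = np.outer(np.concatenate(([1.0], xs[l])), delta)   # delta_j * x_i^(l-1)
            if l > 0:                                          # delta^(l) computed backwards from delta^(l+1)
                delta = (weights[l][1:, :] @ delta) * (1.0 - np.tanh(ss[l - 1]) ** 2)
        return grads

    rng = np.random.default_rng(1)
    W = [rng.normal(size=(3 + 1, 5)), rng.normal(size=(5 + 1, 1))]
    print([g.shape for g in backprop_single([0.5, -0.2, 0.8], 1.0, W)])   # [(4, 5), (6, 1)]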
Fun Time
According to $\frac{\partial e_n}{\partial w^{(L)}_{i1}} = -2\left(y_n - s^{(L)}_1\right) \cdot x^{(L-1)}_i$, when would $\frac{\partial e_n}{\partial w^{(L)}_{i1}} = 0$?
1 $y_n = s^{(L)}_1$
2 $x^{(L-1)}_i = 0$
3 $s^{(L-1)}_i = 0$
4 all of the above
Reference Answer: 4
Note that $x^{(L-1)}_i = \tanh\left(s^{(L-1)}_i\right) = 0$ if and only if $s^{(L-1)}_i = 0$.
Optimization and Regularization
[figure: the feed-forward tanh network again, with weight layers $w^{(1)}_{ij}$, $w^{(2)}_{jk}$, $w^{(3)}_{kq}$]
• basic choice: the L2 regularizer $\sum \left(w^{(\ell)}_{ij}\right)^2$ 'shrinks' weights:
  large weight → large shrink; small weight → small shrink
• want $w^{(\ell)}_{ij} = 0$ (sparse) to effectively decrease $d_{\text{VC}}$
• L1 regularizer: $\sum \left|w^{(\ell)}_{ij}\right|$, but not differentiable
• weight-elimination ('scaled' L2) regularizer:
  large weight → median shrink; small weight → median shrink

weight-elimination regularizer: $\sum \dfrac{\left(w^{(\ell)}_{ij}\right)^2}{1 + \left(w^{(\ell)}_{ij}\right)^2}$
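As a small numerical sketch (not from the handout), the per-weight weight-elimination penalty $w^2/(1 + w^2)$ behaves like $w^2$ for small weights but saturates near 1 for large ones, which is why large and small weights both receive a 'median' amount of shrinking:

    import numpy as np

    def weight_elimination(W):
        # sum of w^2 / (1 + w^2) over all weights in a weight matrix
        return np.sum(W**2 / (1.0 + W**2))

    for w in (0.1, 1.0, 10.0):
        print(w, w**2 / (1 + w**2))   # ~0.0099, 0.5, ~0.990: the penalty is bounded per weight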
[figure: as t increases, the optimization path $w_0 \to w_1 \to w_2 \to w_3$ reaches into larger hypothesis sets such as $H_3$]
[figure: error versus VC dimension $d_{\text{VC}}$: in-sample error keeps dropping, and the best $d^*_{\text{VC}}$ sits in the middle, remember? :-)]

• smaller t effectively decreases $d_{\text{VC}}$
=⇒ early stopping at some $t^*$

[figure: $\log_{10}(\text{error})$ versus iteration t (about $10^2$ to $10^4$): $E_{\text{in}}$ keeps decreasing while $E_{\text{test}}$ is lowest at $t^*$]
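A minimal early-stopping loop (an illustrative sketch, not from the handout), with a held-out validation set standing in for $E_{\text{test}}$ and, for brevity, plain gradient descent on a linear model rather than an NNet:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=100)
    X_tr, y_tr, X_val, y_val = X[:70], y[:70], X[70:], y[70:]    # train / validation split

    w, eta = np.zeros(3), 0.01
    best_err, best_w, best_t = np.inf, w.copy(), 0
    for t in range(1, 2001):
        w -= eta * 2.0 / len(y_tr) * X_tr.T @ (X_tr @ w - y_tr)  # squared-error GD step
        val_err = np.mean((X_val @ w - y_val) ** 2)
        if val_err < best_err:                                   # remember the weights at t*
            best_err, best_w, best_t = val_err, w.copy(), t
    print(best_t, best_err)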
Fun Time
For the weight-elimination regularizer $\sum \frac{\left(w^{(\ell)}_{ij}\right)^2}{1 + \left(w^{(\ell)}_{ij}\right)^2}$, what is $\frac{\partial\,\text{regularizer}}{\partial w^{(\ell)}_{ij}}$?
1 $2 w^{(\ell)}_{ij} \Big/ \left(1 + \left(w^{(\ell)}_{ij}\right)^2\right)$
2 $2 w^{(\ell)}_{ij} \Big/ \left(1 + \left(w^{(\ell)}_{ij}\right)^2\right)^2$
3 $2 w^{(\ell)}_{ij} \Big/ \left(1 + \left(w^{(\ell)}_{ij}\right)^2\right)^3$
4 $2 w^{(\ell)}_{ij} \Big/ \left(1 + \left(w^{(\ell)}_{ij}\right)^2\right)^4$
Reference Answer: 2
Too much calculus in this class, huh? :-)
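A quick finite-difference check of that derivative (an illustration, not from the handout):

    def reg(w):                      # per-weight weight-elimination penalty
        return w**2 / (1.0 + w**2)

    def reg_grad(w):                 # choice 2: 2w / (1 + w^2)^2
        return 2.0 * w / (1.0 + w**2)**2

    w, eps = 0.7, 1e-6
    print(reg_grad(w), (reg(w + eps) - reg(w - eps)) / (2 * eps))   # the two values agree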
Summary
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models