
Machine Learning Techniques

(機器學習技法)

Lecture 12: Neural Network


Hsuan-Tien Lin (林軒田)
[email protected]

Department of Computer Science & Information Engineering
National Taiwan University
(國立台灣大學資訊工程系)



Neural Network

Roadmap
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models

Lecture 11: Gradient Boosted Decision Tree


aggregating trees from functional gradient and
steepest descent subject to any error measure
3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network


Motivation
Neural Network Hypothesis
Neural Network Learning
Optimization and Regularization



Neural Network Motivation

Linear Aggregation of Perceptrons: Pictorial View

[figure: inputs x0 = 1, x1, ..., xd feed T perceptrons g1, ..., gT (weights w1, ..., wT);
 their ±1 outputs are combined with votes α1, ..., αT into G]

G(x) = sign( Σ_{t=1}^{T} α_t · sign(w_t^T x) ),   where g_t(x) = sign(w_t^T x)

• two layers of weights: w_t and α
• two layers of sign functions: in g_t and in G

what boundary can G implement?
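Before answering, here is a minimal numpy sketch of the hypothesis G above (the weights, votes, and input are made-up toy values, not from the lecture):

import numpy as np

def aggregate_perceptrons(x, W, alpha):
    """G(x) = sign( sum_t alpha_t * sign(w_t^T x) ); x must already contain x0 = 1."""
    g = np.sign(W @ x)          # one +/-1 perceptron output g_t(x) per row of W
    return np.sign(alpha @ g)   # weighted vote over the g_t, then a final sign

x = np.array([1.0, 0.8, 0.7])            # (x0, x1, x2)
W = np.array([[-0.5, 1.0, 0.0],          # g1(x) = sign(x1 - 0.5)
              [-0.5, 0.0, 1.0]])         # g2(x) = sign(x2 - 0.5)
alpha = np.array([1.0, 1.0])
print(aggregate_perceptrons(x, W, alpha))   # 1.0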



Neural Network Motivation

Logic Operations with Aggregation


[figure: the +1/−1 regions of g1 and g2, and the region where AND(g1, g2) = +1]

G(x) = sign( −1 + g1(x) + g2(x) )

[figure: network with the constant x0 = 1 voted with α0 = −1,
 and g1, g2 voted with α1 = α2 = +1]

• g1(x) = g2(x) = +1 (TRUE): G(x) = +1 (TRUE)
• otherwise: G(x) = −1 (FALSE)
• G ≡ AND(g1, g2)

OR, NOT can be similarly implemented
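To make the ±1 logic concrete, here is a small sketch (the helper name `vote` is mine) that checks the AND vote above plus OR and NOT over all input combinations:

import itertools
import numpy as np

def vote(alpha0, alphas, gs):
    """sign(alpha0 + sum_t alpha_t * g_t) for ±1-valued g_t."""
    return np.sign(alpha0 + np.dot(alphas, gs))

for g1, g2 in itertools.product([-1, +1], repeat=2):
    AND = vote(-1, [+1, +1], [g1, g2])   # +1 only when both inputs are +1
    OR  = vote(+1, [+1, +1], [g1, g2])   # +1 when at least one input is +1
    NOT = vote( 0, [-1], [g1])           # flips the sign of g1
    print(g1, g2, AND, OR, NOT)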


Neural Network Motivation

Powerfulness and Limitation


[figure: boundaries formed by 8 perceptrons and by 16 perceptrons,
 next to the smooth target boundary]


• ‘convex set’ hypotheses implemented: dVC → ∞, remember? :-)
• powerfulness: enough perceptrons ≈ smooth boundary

[figure: the +1/−1 regions of g1 and g2, and the XOR(g1, g2) pattern]

• limitation: XOR not ‘linearly separable’ under φ(x) = (g1(x), g2(x))

how to implement XOR(g1 , g2 )?


Neural Network Motivation

Multi-Layer Perceptrons: Basic Neural Network


• non-separable data: can use more transform
• how about one more layer of AND transform?

XOR(g1, g2) = OR( AND(−g1, g2), AND(g1, −g2) )

[figure: x feeds perceptrons g1 and g2 (weights w1, w2); a hidden layer computes
 AND(−g1, g2) and AND(g1, −g2); an OR output layer gives G ≡ XOR(g1, g2)]

perceptron (simple)
=⇒ aggregation of perceptrons (powerful)
=⇒ multi-layer perceptrons (more powerful)
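A sketch of this two-layer construction, reusing the ±1 `vote` helper from above (again my own toy code, mirroring the formula rather than anything given in the handout):

import itertools
import numpy as np

def vote(alpha0, alphas, gs):
    return np.sign(alpha0 + np.dot(alphas, gs))

def XOR(g1, g2):
    a = vote(-1, [+1, +1], [-g1,  g2])   # hidden unit: AND(-g1, g2)
    b = vote(-1, [+1, +1], [ g1, -g2])   # hidden unit: AND(g1, -g2)
    return vote(+1, [+1, +1], [a, b])    # output unit: OR(a, b)

for g1, g2 in itertools.product([-1, +1], repeat=2):
    print(g1, g2, XOR(g1, g2))           # +1 exactly when g1 != g2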
Neural Network Motivation

Connection to Biological Neurons


[figures: a small NNet diagram next to photos of biological neurons;
 neuron image by UC Regents Davis campus (brainmaps.org), CC BY 3.0 via Wikimedia Commons;
 photos by Lauris Rubenis (CC BY 2.0, https://flic.kr/p/fkVuZX) and
 Pedro Ribeiro Simões (CC BY 2.0, https://flic.kr/p/adiV7b)]

neural network: bio-inspired model


Neural Network Motivation

Fun Time
Let g0(x) = +1. Which of the following (α0, α1, α2) allows
G(x) = sign( Σ_{t=0}^{2} α_t g_t(x) ) to implement OR(g1, g2)?
1  (−3, +1, +1)
2  (−1, +1, +1)
3  (+1, +1, +1)
4  (+3, +1, +1)

Reference Answer: 3
You can easily verify with all four possibilities of (g1(x), g2(x)).



Neural Network Neural Network Hypothesis

Neural Network Hypothesis: Output


[figure: an NNet whose final layer is a single linear OUTPUT node with weights w]

• OUTPUT: simply a linear model with s = w^T φ^(2)( φ^(1)(x) )
• any linear model can be used—remember? :-)

linear classification:  h(x) = sign(s),  err = 0/1
linear regression:      h(x) = s,        err = squared
logistic regression:    h(x) = θ(s),     err = cross-entropy

will discuss ‘regression’ with squared error


Neural Network Neural Network Hypothesis

Neural Network Hypothesis: Transformation


• transformation function of the score (signal) s — any transformation?
  • linear: whole network linear & thus less useful
  • sign: discrete & thus hard to optimize for w
• popular choice of transformation: tanh(s)
  • ‘analog’ approximation of sign: easier to optimize
  • somewhat closer to biological neuron
  • not that new! :-)

tanh(s) = ( exp(s) − exp(−s) ) / ( exp(s) + exp(−s) ) = 2θ(2s) − 1

will discuss with tanh as transformation function
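A two-line check of the identity above, with θ written out as the logistic function (purely illustrative):

import numpy as np

def theta(s):                                   # logistic function θ(s) = 1 / (1 + exp(−s))
    return 1.0 / (1.0 + np.exp(-s))

s = np.linspace(-3, 3, 7)
print(np.allclose(np.tanh(s), 2 * theta(2 * s) - 1))   # True: tanh(s) = 2θ(2s) − 1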
Neural Network Neural Network Hypothesis

Neural Network Hypothesis


[figure: a layered network; weight blocks w_ij^(1), w_jk^(2), w_kq^(3) connect tanh nodes,
 e.g. score s_3^(2) is transformed into x_3^(2)]

d^(0)–d^(1)–d^(2)–···–d^(L) Neural Network (NNet)

weights      w_ij^(ℓ):  1 ≤ ℓ ≤ L layers,  0 ≤ i ≤ d^(ℓ−1) inputs,  1 ≤ j ≤ d^(ℓ) outputs

score        s_j^(ℓ) = Σ_{i=0}^{d^(ℓ−1)} w_ij^(ℓ) x_i^(ℓ−1)

transformed  x_j^(ℓ) = tanh( s_j^(ℓ) )  if ℓ < L;   x_j^(ℓ) = s_j^(ℓ)  if ℓ = L

apply x as input layer x^(0), go through hidden layers to get x^(ℓ),
predict at output layer x_1^(L)
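A minimal numpy sketch of this forward pass (the 3-5-1 sizes and the random initialization are toy choices of mine, following the x0 = 1 bias convention of the slides):

import numpy as np

def nnet_forward(x, weights):
    """d^(0)-...-d^(L) NNet: tanh on hidden layers, identity on the output layer.

    weights[l] has shape (d^(l) + 1, d^(l+1)); row 0 holds the bias weights w_0j.
    """
    for l, W in enumerate(weights):
        x = np.concatenate(([1.0], x))      # prepend the constant x0 = 1
        s = W.T @ x                         # s_j = sum_i w_ij * x_i
        x = np.tanh(s) if l < len(weights) - 1 else s
    return x

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.1, size=(3 + 1, 5)),   # layer 1 of a toy 3-5-1 network
           rng.normal(scale=0.1, size=(5 + 1, 1))]   # layer 2
print(nnet_forward(np.array([0.5, -1.0, 2.0]), weights))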
Neural Network Neural Network Hypothesis

Physical Interpretation
[figure: the same layered network diagram with weights w_ij^(1), w_jk^(2), w_kq^(3)]

• each layer: a transformation φ^(ℓ) to be learned from data

  φ^(ℓ)(x) = tanh( Σ_{i=0}^{d^(ℓ−1)} w_i1^(ℓ) x_i^(ℓ−1),  …  )

  —whether x ‘matches’ the weight vectors in pattern

NNet: pattern extraction with layers of connection weights
Neural Network Neural Network Hypothesis

Fun Time
How many weights {w_ij^(ℓ)} are there in a 3-5-1 NNet?
1  9
2  15
3  20
4  26

Reference Answer: 4
There are (3 + 1) × 5 weights in w_ij^(1), and (5 + 1) × 1 weights in w_jk^(2).
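The same count as a tiny helper (my own utility, just to generalize the arithmetic):

def num_weights(layer_sizes):
    """Total number of w_ij, counting the x0 = 1 bias input of every layer."""
    return sum((d_in + 1) * d_out for d_in, d_out in zip(layer_sizes, layer_sizes[1:]))

print(num_weights([3, 5, 1]))   # (3+1)*5 + (5+1)*1 = 26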



Neural Network Neural Network Learning

How to Learn the Weights?


[figure: the same layered network diagram with weights w_ij^(1), w_jk^(2), w_kq^(3)]

• goal: learning all {w_ij^(ℓ)} to minimize E_in({w_ij^(ℓ)})
• one hidden layer: simply aggregation of perceptrons
  —gradient boosting to determine hidden neuron one by one
• multiple hidden layers? not easy
• let e_n = (y_n − NNet(x_n))²:
  can apply (stochastic) GD after computing ∂e_n / ∂w_ij^(ℓ)!

next: efficient computation of ∂e_n / ∂w_ij^(ℓ)
Neural Network Neural Network Learning
Computing ∂e_n / ∂w_i1^(L) (Output Layer)

e_n = ( y_n − NNet(x_n) )² = ( y_n − s_1^(L) )² = ( y_n − Σ_{i=0}^{d^(L−1)} w_i1^(L) x_i^(L−1) )²

specially (output layer, 0 ≤ i ≤ d^(L−1)):

  ∂e_n / ∂w_i1^(L) = ∂e_n / ∂s_1^(L) · ∂s_1^(L) / ∂w_i1^(L)
                   = −2 ( y_n − s_1^(L) ) · x_i^(L−1)

generally (1 ≤ ℓ < L;  0 ≤ i ≤ d^(ℓ−1), 1 ≤ j ≤ d^(ℓ)):

  ∂e_n / ∂w_ij^(ℓ) = ∂e_n / ∂s_j^(ℓ) · ∂s_j^(ℓ) / ∂w_ij^(ℓ)
                   = δ_j^(ℓ) · x_i^(ℓ−1)

δ_1^(L) = −2 ( y_n − s_1^(L) ); how about the others?
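Before moving to the other layers, here is a quick numerical sanity check of the output-layer formula on one toy example (the dimensions and values are made up; the central-difference check is mine, not part of the lecture):

import numpy as np

rng = np.random.default_rng(1)
x_prev = np.concatenate(([1.0], rng.normal(size=3)))   # x^(L-1), with the x0 = 1 entry
w = rng.normal(size=4)                                  # output weights w_i1^(L)
y = 0.7

s = w @ x_prev                                          # s_1^(L)
analytic = -2.0 * (y - s) * x_prev                      # −2 (y_n − s_1^(L)) x_i^(L−1)

eps = 1e-6
numeric = np.array([((y - (w + eps * e) @ x_prev) ** 2
                     - (y - (w - eps * e) @ x_prev) ** 2) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(analytic, numeric))                   # True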
Neural Network Neural Network Learning
Computing δ_j^(ℓ) = ∂e_n / ∂s_j^(ℓ)

  s_j^(ℓ)  ─tanh→  x_j^(ℓ)  ─w_jk^(ℓ+1)→  ( s_1^(ℓ+1), …, s_k^(ℓ+1), … )  ⇒ ··· ⇒  e_n

  δ_j^(ℓ) = ∂e_n / ∂s_j^(ℓ)
          = Σ_{k=1}^{d^(ℓ+1)}  ∂e_n / ∂s_k^(ℓ+1)  ·  ∂s_k^(ℓ+1) / ∂x_j^(ℓ)  ·  ∂x_j^(ℓ) / ∂s_j^(ℓ)
          = Σ_k  δ_k^(ℓ+1)  ·  w_jk^(ℓ+1)  ·  tanh′( s_j^(ℓ) )

δ_j^(ℓ) can be computed backwards from δ_k^(ℓ+1)
Neural Network Neural Network Learning

Backpropagation (Backprop) Algorithm


Backprop on NNet

initialize all weights w_ij^(ℓ)
for t = 0, 1, . . . , T
  1 stochastic: randomly pick n ∈ {1, 2, · · · , N}
  2 forward: compute all x_i^(ℓ) with x^(0) = x_n
  3 backward: compute all δ_j^(ℓ) subject to x^(0) = x_n
  4 gradient descent: w_ij^(ℓ) ← w_ij^(ℓ) − η x_i^(ℓ−1) δ_j^(ℓ)
return g_NNet(x) = ( ··· tanh( Σ_j w_jk^(2) · tanh( Σ_i w_ij^(1) x_i ) ) )

sometimes steps 1 to 3 are done (in parallel) many times and
average( x_i^(ℓ−1) δ_j^(ℓ) ) is taken for the update in step 4, called mini-batch

basic NNet algorithm: backprop to compute the gradient efficiently
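A runnable numpy sketch of this procedure for squared error, tanh hidden units, and a linear output (the function name, learning rate, iteration count, and initialization scale are my own choices; the update follows steps 1–4 above):

import numpy as np

def backprop_sgd(X, y, layer_sizes, eta=0.1, T=10000, seed=0):
    """Stochastic GD with backprop on a tanh NNet; x^(l) keeps a leading x0 = 1 entry."""
    rng = np.random.default_rng(seed)
    W = [rng.uniform(-0.1, 0.1, size=(d_in + 1, d_out))        # small random initial weights
         for d_in, d_out in zip(layer_sizes, layer_sizes[1:])]
    L = len(W)
    for _ in range(T):
        n = rng.integers(len(X))                               # 1. stochastic: pick one example
        xs = [np.concatenate(([1.0], X[n]))]                   # 2. forward: compute all x^(l)
        for l in range(L):
            s = W[l].T @ xs[-1]
            if l < L - 1:
                xs.append(np.concatenate(([1.0], np.tanh(s))))
            else:
                xs.append(s)                                   # linear output layer
        delta = -2.0 * (y[n] - xs[-1])                         # 3. backward: delta^(L)
        for l in reversed(range(L)):
            grad = np.outer(xs[l], delta)                      # x_i^(l-1) * delta_j^(l)
            if l > 0:                                          # propagate delta one layer back
                delta = (W[l][1:] @ delta) * (1.0 - xs[l][1:] ** 2)   # tanh'(s) = 1 - tanh(s)^2
            W[l] -= eta * grad                                 # 4. gradient descent update
    return W

# toy usage (values made up): a 2-3-1 network on four points
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
W = backprop_sgd(X, y, [2, 3, 1])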
Neural Network Neural Network Learning

Fun Time
According to ∂e_n / ∂w_i1^(L) = −2 ( y_n − s_1^(L) ) · x_i^(L−1), when would ∂e_n / ∂w_i1^(L) = 0?
1  y_n = s_1^(L)
2  x_i^(L−1) = 0
3  s_i^(L−1) = 0
4  all of the above

Reference Answer: 4
Note that x_i^(L−1) = tanh( s_i^(L−1) ) = 0 if and only if s_i^(L−1) = 0.



Neural Network Optimization and Regularization

Neural Network Optimization


  ! 
N
1 X  X (2) X (1)
Ein (w) = err · · · tanh  wjk · tanh wij xn,i  , yn 
N
n=1 j i

• generally non-convex when multiple hidden layers


• not easy to reach global minimum
• GD/SGD with backprop only gives local minimum
• different initial w_ij^(ℓ) =⇒ different local minimum
• somewhat ‘sensitive’ to initial weights
• large weights =⇒ saturate (small gradient)
• advice: try some random & small ones

NNet: difficult to optimize, but practically works
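A tiny illustration of the saturation bullet above (the score values are arbitrary): large scores drive tanh into its flat region, so the gradient that backprop propagates becomes vanishingly small.

import numpy as np

s = np.array([0.5, 5.0, 50.0])      # scores produced by small, large, and huge weights
print(1.0 - np.tanh(s) ** 2)        # tanh'(s): about 0.79, 1.8e-4, and (numerically) 0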



Neural Network Optimization and Regularization

VC Dimension of Neural Network Model


roughly, with tanh-like transfer functions:
dVC = O(V D) where V = # of neurons, D = # of weights

[figure: the same layered tanh network diagram]

• pros: can approximate ‘anything’ if enough neurons (V large)


• cons: can overfit if too many neurons

NNet: watch out for overfitting!



Neural Network Optimization and Regularization

Regularization for Neural Network


basic choice: old friend weight-decay (L2) regularizer Ω(w) = Σ ( w_ij^(ℓ) )²

• ‘shrink’ weights:
  large weight → large shrink; small weight → small shrink
• want w_ij^(ℓ) = 0 (sparse) to effectively decrease dVC
• L1 regularizer: Σ | w_ij^(ℓ) |, but not differentiable
• weight-elimination (‘scaled’ L2) regularizer:
  large weight → median shrink; small weight → median shrink

weight-elimination regularizer:  Σ ( w_ij^(ℓ) )² / ( 1 + ( w_ij^(ℓ) )² )
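A small sketch of this regularizer and its gradient (the function name is mine; the derivative matches the Fun Time at the end of this section):

import numpy as np

def weight_elimination(w):
    """Omega(w) = sum_ij w_ij^2 / (1 + w_ij^2) and its elementwise gradient."""
    omega = np.sum(w ** 2 / (1.0 + w ** 2))
    grad = 2.0 * w / (1.0 + w ** 2) ** 2     # d/dw [ w^2 / (1 + w^2) ] = 2w / (1 + w^2)^2
    return omega, grad

w = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
omega, grad = weight_elimination(w)
print(omega, grad)   # each large |w| contributes at most 1, and its gradient stays bounded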



Neural Network Optimization and Regularization

Yet Another Regularization: Early Stopping


[figure: out-of-sample error, model complexity, and in-sample error versus VC dimension,
 with the best d*_VC in the middle]

• GD/SGD (backprop) visits more weight combinations as t increases
• smaller t effectively decreases dVC
  (d*_VC in middle, remember? :-))
• better ‘stop in middle’: early stopping

[figure: log10(error) of Ein and Etest versus iteration t, with Etest lowest
 at the early-stopping point t*]

when to stop? validation!
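A minimal sketch of validation-based early stopping (the callback-style interface, names, and defaults are mine, not from the handout): train in short bursts, score on a held-out set, and keep the weights from the iteration t* with the lowest validation error.

import copy
import numpy as np

def train_with_early_stopping(train_step, validation_error, T=10000, check_every=100):
    """train_step(k) runs k more SGD/backprop updates and returns the current weights;
    validation_error(W) scores weights on a held-out validation set."""
    best_err, best_W, best_t = np.inf, None, 0
    for t in range(check_every, T + 1, check_every):
        W = train_step(check_every)
        err = validation_error(W)
        if err < best_err:                       # remember the best weights seen so far
            best_err, best_W, best_t = err, copy.deepcopy(W), t
    return best_W, best_t                        # best_t plays the role of t*

# toy usage with dummy callbacks (stand-ins for backprop training and a validation set)
errs = iter([0.9, 0.5, 0.3, 0.4, 0.6])
W, t_star = train_with_early_stopping(lambda k: {"dummy": k}, lambda W: next(errs),
                                      T=500, check_every=100)
print(t_star)   # 300: the checkpoint where the (fake) validation error was lowest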


Neural Network Optimization and Regularization

Fun Time
For the weight-elimination regularizer Σ ( w_ij^(ℓ) )² / ( 1 + ( w_ij^(ℓ) )² ),
what is ∂regularizer / ∂w_ij^(ℓ)?
1  2 w_ij^(ℓ) / ( 1 + ( w_ij^(ℓ) )² )^1
2  2 w_ij^(ℓ) / ( 1 + ( w_ij^(ℓ) )² )^2
3  2 w_ij^(ℓ) / ( 1 + ( w_ij^(ℓ) )² )^3
4  2 w_ij^(ℓ) / ( 1 + ( w_ij^(ℓ) )² )^4

Reference Answer: 2
Too much calculus in this class, huh? :-)
Neural Network Optimization and Regularization

Summary
1 Embedding Numerous Features: Kernel Models
2 Combining Predictive Features: Aggregation Models
3 Distilling Implicit Features: Extraction Models

Lecture 12: Neural Network


Motivation
multi-layer for power with biological inspirations
Neural Network Hypothesis
layered pattern extraction until linear hypothesis
Neural Network Learning
backprop to compute gradient efficiently
Optimization and Regularization
tricks on initialization, regularizer, early stopping

• next: making neural network ‘deeper’

