Lec3: MLP Optimization

FALL 2023
INTRODUCTION TO ARTIFICIAL INTELLIGENCE
Tianyi Zhou
8/28/23
University of Maryland
Some slides are adapted from Song & Russell @Berkeley and Shao @William and Mary.
TA Office hours

• Wednesday (11:00AM-1:00PM)
Chenxi Liu
Ziyue Li

• Friday (2:00PM-4:00PM)
Hongyu Zhao
Kwesi Cobbina

• Starting this Friday.


• Location will be announced before the date.
Seven Components of this course
[Course-map diagram: an Agent interacts with the World through Acting (action, prediction) and Perception (reward, data, observation), and with Human users; the components shown are Probabilistic Reasoning, Language Models, Embodied & Multi-modal AI, Neural Networks, and Search & Planning.]
Neural networks
• Perceptron and MLP
• Optimization and Backpropagation
• Convolutional Neural Networks
• Recurrent Neural Networks and LSTM
• Attention and Transformer
• Graph Neural Networks

Plan
• Today
– Neural networks optimization and training
• http://neuralnetworksanddeeplearning.com/
– Ch 1, 2, and 6
• Russell and Norvig
– Ch 21
• Goodfellow, Bengio, & Courville
– Ch 6, 8
– NN Playground: https://piazza.com/class/llup7xc8lm44w4/post/9
– Gradient descent: https://piazza.com/class/llup7xc8lm44w4/post/7
– Backprop:
• https://www.youtube.com/watch?v=Ilg3gGewQ5U
• https://jeremykun.com/2012/12/09/neural-networks-and-
backpropagation/

• Next lecture
– Neural network architectures: CNN, LSTM, Transformer
Perceptron in one page
• Binary input x, binary output $f(x) = \mathrm{sign}\left(\sum_i w_i x_i\right)$
• Learning goal: find w such that $f(x) = y$ for most training data $(x, y)$
• Learning algorithm: for every misclassified $(x, y)$, update $w \leftarrow w + yx$
• Simple and elegant proof of convergence
• Applications: classification, approximation to logic gates/circuits

[Neuron diagram: inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ feed a weighted sum $\Sigma$ and an activation function $\sigma$, producing output $y$.]
$z(\boldsymbol{x}) = \boldsymbol{x}\boldsymbol{w}, \quad f(\boldsymbol{x}) = \sigma(z(\boldsymbol{x})) = \sigma(\boldsymbol{x}\boldsymbol{w})$
where $\boldsymbol{x} = [1, x_1, x_2, x_3]$ and $\boldsymbol{w} = [b, w_1, w_2, w_3]^{T}$.
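A minimal sketch of this learning rule in Python (the toy dataset and epoch count are illustrative assumptions, not from the slides):

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Perceptron rule: for every misclassified (x, y), set w <- w + y*x."""
    X = np.hstack([np.ones((len(X), 1)), X])  # prepend 1 so w[0] acts as the bias b
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):              # labels yi are +1 or -1
            if np.sign(xi @ w) != yi:         # misclassified
                w += yi * xi                  # perceptron update
    return w

# Toy linearly separable data (assumed for illustration), labels in {-1, +1}
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, 1])
w = train_perceptron(X, y)
print(w, np.sign(np.hstack([np.ones((4, 1)), X]) @ w))
```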
Sigmoid Neuron
[Neuron diagram: inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, weighted sum $\Sigma$, activation $\sigma$, output $y = f(x)$.]
1. Find the weighted sum of inputs: $z(x) = \sum_i w_i x_i$
2. Apply the sigmoid function: $f(x) = \sigma(z(x)), \quad \sigma(z) = \frac{1}{1 + e^{-z}}$
• Real-valued input and output
• Output is a composition of two functions
• Output is lower and upper bounded
ReLU Neuron
[Same neuron diagram, with ReLU as the activation.]
1. Find the weighted sum of inputs: $z(x) = \sum_i w_i x_i$
2. Apply the ReLU (rectified linear unit) function: $f(x) = \sigma(z(x)), \quad \sigma(z) = \max(0, z)$
• Real-valued input and output
• Output is a composition of two functions
• Output is lower bounded but upper unbounded
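The two neuron types above differ only in the activation. A quick sketch (the example inputs and weights are assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # bounded in (0, 1)

def relu(z):
    return np.maximum(0.0, z)          # lower bounded by 0, unbounded above

x = np.array([1.0, 2.0, -1.0])         # illustrative inputs
w = np.array([0.5, -0.3, 0.8])         # illustrative weights
z = w @ x                              # 1. weighted sum of inputs
print(sigmoid(z), relu(z))             # 2. apply the activation
```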
Why a Non-Linear Activation Function?
A non-linear activation can transform the original training data into a higher-dimensional space in which they become separable.
[Diagram: a neuron with a non-linear activation illustrating the transformation (e.g., the mapping $y = x^2$).]
Activation functions
• Thresholding and information filtering
• Nonlinearity and expressiveness
• Partition of the input space
• Many choices available; critical to performance
Practical tips for activations
• Try ReLU first (the most widely used activation function).
• Be careful about the learning rate.
• For better performance, try LeakyReLU, ELU, SeLU, GeLU, or Maxout $\max(xw^{(1)}, xw^{(2)})$.
• Avoid sigmoid or tanh for hidden-layer neurons (why?). They can still be used at the output if a bounded output is required.
Neural connections in animal brains
• Connectome (a map of neural connections) of the 302 neurons in the brain of the <1 mm worm Caenorhabditis elegans.
• Much more complicated in human brains.
• The architecture of neural networks is important (multi-layer, convolution, attention, recurrent, graph).
MULTI-LAYER PERCEPTRON (MLP)

$\boldsymbol{X} = [x_1, x_2] \in \mathbb{R}^{1 \times 2}$
$\boldsymbol{W}^{(1)} = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix} \in \mathbb{R}^{2 \times 3}$ (in features $\times$ out features)
$\boldsymbol{Z}^{(1)} = \boldsymbol{X}\boldsymbol{W}^{(1)} \in \mathbb{R}^{1 \times 3}, \quad \boldsymbol{H}^{(1)} = \sigma(\boldsymbol{Z}^{(1)}) \in \mathbb{R}^{1 \times 3}$
$\boldsymbol{Y} = \boldsymbol{H}^{(1)}\boldsymbol{W}^{(2)} + \boldsymbol{b}^{(2)} \in \mathbb{R}^{1 \times 2}, \quad \boldsymbol{W}^{(2)} \in \mathbb{R}^{3 \times 2}$
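A minimal sketch of this forward pass with the shapes above (random placeholder weights; sigmoid assumed for $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 2))    # X in R^{1x2}
W1 = rng.normal(size=(2, 3))   # W^(1) in R^{2x3}: in-features x out-features
W2 = rng.normal(size=(3, 2))   # W^(2) in R^{3x2}
b2 = np.zeros((1, 2))

Z1 = X @ W1                    # Z^(1) in R^{1x3}
H1 = 1 / (1 + np.exp(-Z1))     # H^(1) = sigma(Z^(1)), sigmoid assumed
Y = H1 @ W2 + b2               # Y in R^{1x2}
print(Y.shape)                 # (1, 2)
```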
Linear Regions of MLP
• A neural network recursively applies a nonlinear activation $\sigma$ along a pathway (a sequence of neurons), i.e., $f(x) = W^{(L)} \sigma(W^{(L-1)} \sigma(W^{(L-2)} \cdots W^{(2)} \sigma(W^{(1)} x) \cdots))$
• If $\sigma$ is piece-wise linear (e.g., ReLU), the neural network partitions the input space into an exponential number of polytopes (regions where each neuron satisfies $xw > 0$ or $xw \le 0$) and applies a linear function to the data in each polytope.
• Example: the hyperplane $xw = 0$ splits the input space into a left polytope ($xw \le 0$), where the inactive neuron's weight is zeroed and $f(x) = x\boldsymbol{w}^{(1,\text{left})}$, and a right polytope ($xw > 0$), where $f(x) = x\boldsymbol{w}^{(1,\text{right})}$.
• Every polytope is encoded by a sequence (left, right, left, left, …), and a data point in the polytope has an output of $f(x) = x\,\boldsymbol{w}^{(1,\text{left})}\boldsymbol{w}^{(2,\text{right})}\boldsymbol{w}^{(3,\text{left})} \cdots$, a linear function.
Neural Networks are Piecewise-Linear
• If $\sigma$ is piece-wise linear, $f(x) = \sigma(\cdots \sigma(\sigma(x W^{(1)}) W^{(2)}) \cdots W^{(L-1)}) W^{(L)}$ is piece-wise linear: each piece is a polytope, and $f(x) = x\,\boldsymbol{w}^{(1,\text{left})}\boldsymbol{w}^{(2,\text{right})} \cdots \boldsymbol{w}^{(L,\text{right})}$ for x in that polytope.
• In practice: an exponential number of polytopes, but every polytope contains at most one data point.
• Where does generalization come from? Shared facets/edges between polytopes.
Neural Networks are Piecewise-Linear
Training Piecewise-Linear Neural Network
Neural networks are function approximators
• Use (x, y) data to learn a neural network f(x) that approximates y = f(x)

Prediction/Regression (output is a continuous variable):
$f([\text{is weekend?}, \text{is holiday?}, \text{weather}, \text{area income}, \ldots]) \rightarrow$ sales forecast

Classification (output is a discrete label):
$f([\text{is weekend?}, \text{is holiday?}, \text{weather}, \text{area income}, \ldots]) \rightarrow$ is cat?

In RL, we will use a NN for regression: $f(x) \leftarrow Q([s, a])$


DNNs: Feedforward (inference)

Layer 1: $\boldsymbol{H}^{(1)} = \sigma(\boldsymbol{H}^{(0)}\boldsymbol{W}^{(1)} + \boldsymbol{b}^{(1)}) = \sigma(\boldsymbol{X}\boldsymbol{W}^{(1)} + \boldsymbol{b}^{(1)})$

Layer 2: $\boldsymbol{H}^{(2)} = \sigma(\boldsymbol{H}^{(1)}\boldsymbol{W}^{(2)} + \boldsymbol{b}^{(2)})$

Layer l: $\boldsymbol{H}^{(l)} = \sigma(\boldsymbol{H}^{(l-1)}\boldsymbol{W}^{(l)} + \boldsymbol{b}^{(l)})$
Task: regression

Output prediction (logits): $\hat{\boldsymbol{y}} = g(\boldsymbol{H}^{(L)}\boldsymbol{W}^{(L+1)} + \boldsymbol{b}^{(L+1)})$

Loss function: the difference between prediction and label, $\mathcal{L} = L(\boldsymbol{y}_{\text{label}}, \hat{\boldsymbol{y}})$
Loss Function: Regression
Regression problem. Loss function: Mean Squared Error (MSE)

$\text{MSE} = \frac{1}{N}\sum (\boldsymbol{y}_{\text{label}} - \hat{\boldsymbol{y}})^{2}$

Example:

Y_label    10   15    20
Y_predict   9   14.5  21

$\text{MSE} = \frac{1}{3}\left[(10 - 9)^2 + (15 - 14.5)^2 + (20 - 21)^2\right] = 0.75$
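Checking the worked example numerically (a small sketch):

```python
import numpy as np

y_label = np.array([10.0, 15.0, 20.0])
y_pred = np.array([9.0, 14.5, 21.0])
mse = np.mean((y_label - y_pred) ** 2)   # (1 + 0.25 + 1) / 3
print(mse)                                # 0.75
```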
Task: classification
Output: e.g., softmax for classification

Logits: $\hat{\boldsymbol{y}} = g(\boldsymbol{H}^{(L)}\boldsymbol{W}^{(L+1)} + \boldsymbol{b}^{(L+1)})$

Softmax: turns logits into probabilities (positive, summing to one)
$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$

Example: $\boldsymbol{Z} = [-1, 0, 2]$
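Evaluating the example Z = [−1, 0, 2] (a small sketch; subtracting the max before exponentiating is a standard numerical-stability trick not shown on the slide):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift by the max for numerical stability
    return e / e.sum()

print(softmax(np.array([-1.0, 0.0, 2.0])))  # approx [0.042, 0.114, 0.844]
```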
Loss Function: classification
Multi-class classification problem. Loss function: Cross Entropy (do you remember entropy in Decision Trees?)

$\text{loss} = -\sum_{c=1}^{C} y_c \log p_c$

where $p_c$ is the predicted probability of class $c$ and $y_c$ is the observed (one-hot) label of class $c$.

Class      cat  dog  bird
Label Y1    1    0    0    (cat)
Label Y2    0    1    0    (dog)
Loss Function: binary classification
Binary classification, labels in {0, 1}. Loss function: Binary Cross Entropy

$\text{loss} = -\left[y \log p + (1 - y)\log(1 - p)\right]$

where $p$ is the predicted probability of the positive class (yes/no).
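Both losses in a short sketch (the predicted probabilities are assumed for illustration):

```python
import numpy as np

def cross_entropy(y_onehot, p):
    return -np.sum(y_onehot * np.log(p))      # -sum_c y_c log p_c

def binary_cross_entropy(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y1 = np.array([1, 0, 0])                       # label "cat" from the table
p = np.array([0.7, 0.2, 0.1])                  # assumed predicted probabilities
print(cross_entropy(y1, p))                    # -log 0.7 ~ 0.357
print(binary_cross_entropy(1, 0.9))            # -log 0.9 ~ 0.105
```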
Supervised Learning
[Network diagram: an input layer (activation function = identity), one or more hidden layers, and an output layer (which can have multiple outputs), with weights w1 … w14.]

Training inputs and output:

x1    x2    x3     y
0.6   0    -1      10
0.4   1.2   3.1    21
1.7   0.6  -0.3    2.1
1     1     0      0.1

Given training inputs and outputs, find weights so that the network output matches the training outputs. How?
What should w1, w2, b be?
• Want f(x) = y; how do we measure their difference? Minimize a loss function.
• You likely did the following:
  – started with some values of w1, w2, b, say 1, 1, 0
  – computed f(x) using w1 = 1, w2 = 1, b = 0
  – checked whether f(x) matched y, i.e., whether the loss E(w) is small
  – if not, adjusted w1, w2, b slightly
  – rinse and repeat until f(x) matched y

[Diagram: a single unit with inputs x0 = 1 (bias), x1, x2 and weights b, w1, w2.]

x1   x2   y
0    0    1
0    1    1
1    0    1
1    1    0

$f(x) = \begin{cases} 0 & \text{if } w_1 x_1 + w_2 x_2 + b \le 0 \\ 1 & \text{if } w_1 x_1 + w_2 x_2 + b > 0 \end{cases}$
Optimization of the loss function
[Figure: loss surface E(w) over weights w1, w2.]

The loss landscape of a DNN is complicated. We are interested in finding the lowest point on the loss landscape, but it is easy to get trapped in:
• Bad local minima
• Saddle points
• Plateaus

https://tinyurl.com/2fe26wsn
[Figures: local minima; different optimization algorithms escaping a saddle point.]
General Recipe for optimization
• Start with some initial weights w
• Compute f(x) based on these weights
• Check if f(x) matches y – how? Compute the loss function E(w)
• Adjust the weights – how?
• Rinse and repeat

Training inputs and output:

x1    x2    x3     y
0.6   0    -1      10
0.4   1.2   3.1    21
1.7   0.6  -0.3    2.1
1     1     0      0.1
How to adjust weights?
• Let $w_{\text{old}} = [w_1, w_2, w_3, \ldots]^T$ be the current set of weights
• Find $\nabla E(w_{\text{old}}) = \left(\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots\right)$
• $-\nabla E$ is a vector pointing in the direction of steepest descent of E from $w_{\text{old}}$ (towards a local minimum)
• Update the weights so that you move in that direction:
$w_{\text{new}} = w_{\text{old}} - \alpha \nabla E(w_{\text{old}})$
General Recipe of optimization
• Start with some initial weights
• Compute f(x) based on these weights
• Check if f(x) matches y – how? Compute the loss function E(w)
• Adjust the weights – how? Gradient descent
• Rinse and repeat
(Same training data and network as before.)
Gradient Descent on the loss landscape
• We still need to compute $\nabla E(w_{\text{old}}) = \left(\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots\right)$
• I'll show how to do this for a single sigmoid neuron
• Then, we will see how to do this for a neural network – backpropagation
• Most libraries implement automatic differentiation
Sigmoid Neuron
[Diagram: inputs 1, x1, x2 with weights w0, w1, w2 feeding a sigmoid unit with output $f_w(x)$.]

$f_w(x) = \sigma(z(x)), \quad \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z(x) = \sum_i w_i x_i$

$E(w) = \frac{1}{2}\sum_j \|y^j - f_w(x^j)\|^2$

Need to compute $\nabla E(w_{\text{old}}) = \left(\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots\right)$
Finding the weights

$f_w(x) = \sigma(z(x)), \quad \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z(x) = \sum_i w_i x_i, \quad E(w) = \frac{1}{2}\sum_j \|y^j - f_w(x^j)\|^2$

$\frac{\partial E}{\partial w_i} = ?$
Finding the weights

$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial \sigma} \cdot \frac{\partial \sigma}{\partial z} \cdot \frac{\partial z}{\partial w_i}$

Use the chain rule.
Back Propagation with the Chain Rule

$\frac{dy}{dw} = \frac{dy}{dh} \cdot \frac{dh}{dw}$

Example: $y = 3h^2 + 1$, $h = \log w$

$\frac{dy}{dw} = \frac{dy}{dh} \cdot \frac{dh}{dw} = 6h \cdot \frac{1}{w} = \frac{6 \log w}{w}$
Back Propagation with the Chain Rule

$\frac{dy}{dx} = \frac{dy}{dh} \cdot \frac{dh}{dx}$

Practice: $y = h^2 + 2$, $h = e^x$; find $\frac{dy}{dx}$.

$\frac{dy}{dx} = \frac{dy}{dh} \cdot \frac{dh}{dx} = 2h \cdot e^x = 2e^x e^x = 2(e^x)^2 = 2e^{2x}$
Finding the weights
Loss over training samples $(x^j, y^j)$: $E(w) = \frac{1}{2}\sum_j \|y^j - f_w(x^j)\|^2$, with $f_w(x) = \sigma(z(x))$, $\sigma(z) = \frac{1}{1+e^{-z}}$, $z(x) = \sum_i w_i x_i$.

Use the chain rule:

$\frac{\partial E}{\partial w_i} = -\sum_j \left(y^j - \sigma(z(x^j))\right) \cdot \frac{\partial \sigma}{\partial z} \cdot \frac{\partial z}{\partial w_i}$

$\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z)), \quad \frac{\partial z}{\partial w_i} = x_i^j$

where $x_i^j$ is the i-th feature of sample j, $x^j$. Note that $\sigma(z(x^j))$, which appears in every term, is just the output of the neuron, $f_w(x^j)$.
Rewriting the equations
Let $o^j \triangleq \sigma(z(x^j))$, the output of the neuron for input $x^j$. Using $\frac{d\sigma}{dz} = \sigma(z)(1-\sigma(z))$ and $\frac{\partial z}{\partial w_i} = x_i^j$:

$\frac{\partial E}{\partial w_i} = -\sum_j (y^j - o^j)\, o^j (1 - o^j)\, x_i^j$

$\nabla E(w) = \left(\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots\right)$

Update rule: $w_i' = w_i - \eta \frac{\partial E}{\partial w_i}$
Caveats of using sigmoid neurons
• Saturated neurons "kill" the gradients: where the output saturates, the gradient vanishes.
• Gradients of w are either all positive or all negative (assuming x is all positive), since $\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))$ is always positive in $\frac{\partial E}{\partial w_i} = -\sum_j (y^j - o^j) o^j (1 - o^j) x_i^j$.
• exp() is more expensive to compute (vs. ReLU).
Gradient Descent
The change in $w_i$ is proportional to the gradient of the loss w.r.t. $w_i$:

$\nabla E(w) = \left(\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots\right)$

Let $o^j \triangleq \sigma(z(x^j))$, the output of the neuron for input $x^j$:

$\frac{\partial E}{\partial w_i} = -\sum_j (y^j - o^j)\, o^j (1 - o^j)\, x_i^j$

Update rule: $w_i' = w_i - \eta \frac{\partial E}{\partial w_i}$
Training one neuron
• Start with some initial weights w = [w0, w1, …]
• Repeat until convergence:
  – For each input $x^j$, use the current weights w to compute the output of the neuron $f_w(x^j)$ (call it $o^j$)
  – Compute the loss E(w) between $f_w(x^j)$ and $y^j$
  – Compute the gradient of the loss w.r.t. each $w_i$: $\frac{\partial E}{\partial w_i} = -\sum_j (y^j - o^j) o^j (1 - o^j) x_i^j$
  – Update each weight based on the gradient: $w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$
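The training loop above, written out for a single sigmoid neuron (the toy data and learning rate are assumed):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                    # assumed toy inputs
y = (X @ np.array([1.0, -2.0, 0.5]) > 0) * 1.0  # assumed toy targets in {0, 1}
w = np.zeros(3)
eta = 0.5                                       # learning rate (assumed)

for _ in range(200):                            # "repeat until convergence" (fixed count here)
    o = sigmoid(X @ w)                          # o^j = f_w(x^j) for all samples
    grad = -((y - o) * o * (1 - o)) @ X         # dE/dw_i = -sum_j (y^j - o^j) o^j (1 - o^j) x_i^j
    w -= eta * grad                             # w_i <- w_i - eta * dE/dw_i
print(w)
```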
Training a DNN: same idea as a single neuron
• Start with some initial weights w = [w0, w1, …]
• Repeat until convergence:
  – For each input $x^j$, use the current weights w to compute the output of the network $f_w(x^j)$
  – Compute the loss E(w) between $f_w(x^j)$ and $y^j$
  – Compute the gradient of the loss w.r.t. each $w_i$ (a slightly more complicated equation)
  – Update each weight based on the gradient: $w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$
Before training a DNN: Designing a DNN
• Remember, the only unknowns are the weights
• Everything else should be decided by you

• How many layers?


• How many neurons in each layer?
• How to connect the neurons?
• What activation function to use?
• What loss function to use?
• What gradient descent method to use?
Training DNNs: Initialize
Step 1: Randomly initialize the weights W and biases b used to predict y.

Layer l: $\boldsymbol{H}^{(l)} = \sigma(\boldsymbol{H}^{(l-1)}\boldsymbol{W}^{(l)} + \boldsymbol{b}^{(l)})$

Example initialization:
$\boldsymbol{W}^{(l)} = \begin{bmatrix} 0.1 & -0.2 & 0.2 \\ 1.1 & 0.01 & 0.3 \end{bmatrix}, \quad \boldsymbol{b}^{(l)} = [0.1, 0.2, 0]$
Training DNNs: Feedforward
Step 2: Calculate the feedforward pass to estimate the output y using W and b.

Layer 1: $\boldsymbol{H}^{(1)} = \sigma(\boldsymbol{H}^{(0)}\boldsymbol{W}^{(1)} + \boldsymbol{b}^{(1)}) = \sigma(\boldsymbol{X}\boldsymbol{W}^{(1)} + \boldsymbol{b}^{(1)})$
Layer 2: $\boldsymbol{H}^{(2)} = \sigma(\boldsymbol{H}^{(1)}\boldsymbol{W}^{(2)} + \boldsymbol{b}^{(2)})$
Layer l: $\boldsymbol{H}^{(l)} = \sigma(\boldsymbol{H}^{(l-1)}\boldsymbol{W}^{(l)} + \boldsymbol{b}^{(l)})$
Training DNNs: Back Propagation
Step 3: Back-propagate to compute the gradient.

$\boldsymbol{Z}^{(1)} = \boldsymbol{X}\boldsymbol{W}^{(1)} + \boldsymbol{b}^{(1)}, \quad \boldsymbol{H}^{(1)} = \sigma(\boldsymbol{Z}^{(1)})$
$\boldsymbol{Z}^{(l)} = \boldsymbol{H}^{(l-1)}\boldsymbol{W}^{(l)} + \boldsymbol{b}^{(l)}, \quad \boldsymbol{H}^{(l)} = \sigma(\boldsymbol{Z}^{(l)})$
$\hat{\boldsymbol{y}} = \boldsymbol{H}^{(l)}\boldsymbol{W}^{(l+1)} + \boldsymbol{b}^{(l+1)}$

Loss function: $L = (\hat{\boldsymbol{y}} - \boldsymbol{y})^2$, where $\boldsymbol{y}$ is the label.
Step by Step Back Propagation
Step 3: back propagation to compute the gradient.

Loss function: $L = (\hat{\boldsymbol{y}} - \boldsymbol{y})^2$, with $\hat{\boldsymbol{y}} = \boldsymbol{H}^{(l)}\boldsymbol{W}^{(l+1)} + \boldsymbol{b}^{(l+1)}$

Gradient descent needs, step by step:
$d\hat{\boldsymbol{y}} = \frac{\partial L}{\partial \hat{\boldsymbol{y}}} = ?$
$\frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{W}^{(l+1)}} = ?$
$\frac{\partial L}{\partial \boldsymbol{W}^{(l+1)}} = \frac{\partial L}{\partial \hat{\boldsymbol{y}}} \cdot \frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{W}^{(l+1)}}$
Step by Step Back Propagation
Step 3: back propagation to compute the gradient.

Toy example: $\boldsymbol{X} = [x_1, x_2, x_3]$, $\boldsymbol{W} = [w_1, w_2, w_3]^T$, $\boldsymbol{y} = \boldsymbol{X}\boldsymbol{W}$
$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{X}} = \left[\frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \frac{\partial y}{\partial x_3}\right] = [w_1, w_2, w_3] = \boldsymbol{W}^T$

Back to the network:
$d\hat{\boldsymbol{y}} = \frac{\partial L}{\partial \hat{\boldsymbol{y}}} = 2(\hat{\boldsymbol{y}} - \boldsymbol{y})$
$\frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{W}^{(l+1)}} = \boldsymbol{H}^{(l)\,T}$
$\frac{\partial L}{\partial \boldsymbol{W}^{(l+1)}} = \frac{\partial L}{\partial \hat{\boldsymbol{y}}} \cdot \frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{W}^{(l+1)}} = 2\boldsymbol{H}^{(l)\,T}(\hat{\boldsymbol{y}} - \boldsymbol{y})$
Step by Step Back Propagation
Step 3: back propagation to compute the gradient, continuing down to the hidden layer, with $\hat{\boldsymbol{y}} = \boldsymbol{H}^{(l)}\boldsymbol{W}^{(l+1)} + \boldsymbol{b}^{(l+1)}$ and $\boldsymbol{H}^{(l)} = \sigma(\boldsymbol{Z}^{(l)})$:

$\frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{H}^{(l)}} = ? \qquad \frac{\partial L}{\partial \boldsymbol{H}^{(l)}} = \frac{\partial L}{\partial \hat{\boldsymbol{y}}} \cdot \frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{H}^{(l)}} = ?$
$\frac{\partial \boldsymbol{H}^{(l)}}{\partial \boldsymbol{Z}^{(l)}} = ? \qquad \frac{\partial L}{\partial \boldsymbol{Z}^{(l)}} = ?$
Step by Step Back Propagation
Step 3: back propagation to compute the gradient (answers):

$\frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{H}^{(l)}} = \boldsymbol{W}^{(l+1)\,T}$
$\frac{\partial L}{\partial \boldsymbol{H}^{(l)}} = \frac{\partial L}{\partial \hat{\boldsymbol{y}}} \cdot \frac{\partial \hat{\boldsymbol{y}}}{\partial \boldsymbol{H}^{(l)}} = 2(\hat{\boldsymbol{y}} - \boldsymbol{y})\boldsymbol{W}^{(l+1)\,T}$
$\frac{\partial \boldsymbol{H}^{(l)}}{\partial \boldsymbol{Z}^{(l)}} = \sigma'(\boldsymbol{Z}^{(l)})$
$\frac{\partial L}{\partial \boldsymbol{Z}^{(l)}} = \frac{\partial L}{\partial \boldsymbol{H}^{(l)}} \cdot \frac{\partial \boldsymbol{H}^{(l)}}{\partial \boldsymbol{Z}^{(l)}} = 2(\hat{\boldsymbol{y}} - \boldsymbol{y})\boldsymbol{W}^{(l+1)\,T} \odot \sigma'(\boldsymbol{Z}^{(l)})$
Training DNNs: Back Propagation
Step 3: back propagation to compute the gradient (output layer):

$d\hat{\boldsymbol{y}} = \frac{\partial L}{\partial \hat{\boldsymbol{y}}} = 2(\hat{\boldsymbol{y}} - \boldsymbol{y})$
$d\boldsymbol{W}^{(l+1)} = \frac{\partial L}{\partial \boldsymbol{W}^{(l+1)}} = \boldsymbol{H}^{(l)\,T} d\hat{\boldsymbol{y}} = 2\boldsymbol{H}^{(l)\,T}(\hat{\boldsymbol{y}} - \boldsymbol{y})$
$\frac{\partial L}{\partial \boldsymbol{H}^{(l)}} = d\hat{\boldsymbol{y}}\,\boldsymbol{W}^{(l+1)\,T} = 2(\hat{\boldsymbol{y}} - \boldsymbol{y})\boldsymbol{W}^{(l+1)\,T}$
$d\boldsymbol{b}^{(l+1)} = \frac{\partial L}{\partial \boldsymbol{b}^{(l+1)}} = d\hat{\boldsymbol{y}} = 2(\hat{\boldsymbol{y}} - \boldsymbol{y})$
Training DNNs: Back Propagation
Step 3: back propagation to compute the gradient (hidden layer):

$d\boldsymbol{H}^{(l)} = \frac{\partial L}{\partial \boldsymbol{H}^{(l)}} = 2(\hat{\boldsymbol{y}} - \boldsymbol{y})\boldsymbol{W}^{(l+1)\,T} \triangleq \delta$
$d\boldsymbol{W}^{(l)} = d\boldsymbol{H}^{(l)} \cdot \frac{\partial \boldsymbol{H}^{(l)}}{\partial \boldsymbol{Z}^{(l)}} \cdot \frac{\partial \boldsymbol{Z}^{(l)}}{\partial \boldsymbol{W}^{(l)}} = \boldsymbol{H}^{(l-1)\,T}\left(\delta \odot \sigma'(\boldsymbol{Z}^{(l)})\right)$
$d\boldsymbol{b}^{(l)} = d\boldsymbol{H}^{(l)} \cdot \frac{\partial \boldsymbol{H}^{(l)}}{\partial \boldsymbol{Z}^{(l)}} = \delta \odot \sigma'(\boldsymbol{Z}^{(l)})$

($\odot$ is the Hadamard, i.e., element-wise, product.)
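These per-layer formulas transcribe almost directly into NumPy. A sketch for one hidden layer with a sigmoid σ (the array-shape convention, rows = samples, is assumed):

```python
import numpy as np

def backprop_layer(H_prev, Z, W_next, y_hat, y):
    """Gradients for layer l under the squared loss of the slides."""
    delta = 2 * (y_hat - y) @ W_next.T          # dH^(l) = 2(y_hat - y) W^(l+1)T
    sigma = 1 / (1 + np.exp(-Z))
    dZ = delta * sigma * (1 - sigma)            # delta (Hadamard) sigma'(Z^(l))
    dW = H_prev.T @ dZ                          # dW^(l) = H^(l-1)T (delta ⊙ sigma')
    db = dZ.sum(axis=0)                         # db^(l)
    return dW, db
```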
Training DNNs: Back Propagation
Step 4: Update the weights with gradient descent ($\lambda$ is the learning rate):

$\boldsymbol{W}^{(l)} \leftarrow \boldsymbol{W}^{(l)} - \lambda\, d\boldsymbol{W}^{(l)}$
$\boldsymbol{b}^{(l)} \leftarrow \boldsymbol{b}^{(l)} - \lambda\, d\boldsymbol{b}^{(l)}$

Step 5: Repeat Steps 2, 3, and 4.
Example: DNN training
Training example:

Imgs    color (x1)   face (x2)   y
Img 1   1            2           1 (cat)
Img 2   0            1           0

$\boldsymbol{X} = \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}, \quad \boldsymbol{y} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \text{ReLU} = \max\{0, x\}$

Step 1: randomly initialize weights $\boldsymbol{W}^{(1)}, \boldsymbol{W}^{(2)}, \boldsymbol{b}^{(1)}, \boldsymbol{b}^{(2)}$:

$\boldsymbol{W}^{(1)} = \begin{bmatrix} 0.1 & 0.2 & 0.0 \\ 0.0 & 0.1 & 0.1 \end{bmatrix}, \quad \boldsymbol{b}^{(1)} = [0, 0, 0], \quad \boldsymbol{W}^{(2)} = \begin{bmatrix} 1.0 \\ 0.5 \\ 1.0 \end{bmatrix}, \quad \boldsymbol{b}^{(2)} = 0$
Example: DNN Training
Step 2: calculate the feedforward pass to estimate the output y:

$\boldsymbol{Z}^{(1)} = \boldsymbol{X}\boldsymbol{W}^{(1)} + \boldsymbol{b}^{(1)} = ?$
$\boldsymbol{H}^{(1)} = \text{ReLU}(\boldsymbol{Z}^{(1)}) = ?$
$\hat{\boldsymbol{y}} = \boldsymbol{H}^{(1)}\boldsymbol{W}^{(2)} + \boldsymbol{b}^{(2)} = ?$
Example: DNN Training
Step 2: calculate the feedforward pass to estimate the output y:

$\boldsymbol{Z}^{(1)} = \boldsymbol{X}\boldsymbol{W}^{(1)} + \boldsymbol{b}^{(1)} = \begin{bmatrix} 0.1 & 0.4 & 0.2 \\ 0.0 & 0.1 & 0.1 \end{bmatrix} + [0, 0, 0] = \begin{bmatrix} 0.1 & 0.4 & 0.2 \\ 0.0 & 0.1 & 0.1 \end{bmatrix}$

$\boldsymbol{H}^{(1)} = \text{ReLU}(\boldsymbol{Z}^{(1)}) = \begin{bmatrix} 0.1 & 0.4 & 0.2 \\ 0.0 & 0.1 & 0.1 \end{bmatrix}$

Estimated output: $\hat{\boldsymbol{y}} = \boldsymbol{H}^{(1)}\boldsymbol{W}^{(2)} + \boldsymbol{b}^{(2)} = \begin{bmatrix} 0.50 \\ 0.15 \end{bmatrix}$, ground truth $\boldsymbol{y} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$
Example: DNN Training
Step 3: back propagation to compute the gradient. Assume the loss function is mean squared error (MSE), $L = \frac{1}{N}\sum(\hat{\boldsymbol{y}} - \boldsymbol{y})^2$ with $N = 2$:

$d\hat{\boldsymbol{y}} = \frac{\partial L}{\partial \hat{\boldsymbol{y}}} = \frac{2}{N}(\hat{\boldsymbol{y}} - \boldsymbol{y}) = \begin{bmatrix} 0.5 - 1 \\ 0.15 - 0 \end{bmatrix} = \begin{bmatrix} -0.5 \\ 0.15 \end{bmatrix}$

$d\boldsymbol{W}^{(2)} = \frac{\partial L}{\partial \boldsymbol{W}^{(2)}} = \boldsymbol{H}^{(1)\,T} d\hat{\boldsymbol{y}} = \begin{bmatrix} 0.1 & 0.0 \\ 0.4 & 0.1 \\ 0.2 & 0.1 \end{bmatrix}\begin{bmatrix} -0.5 \\ 0.15 \end{bmatrix} = \begin{bmatrix} -0.05 \\ -0.185 \\ -0.085 \end{bmatrix}$

$d\boldsymbol{b}^{(2)} = \frac{\partial L}{\partial \boldsymbol{b}^{(2)}} = \sum_j d\hat{y}_j = -0.5 + 0.15 = -0.35$
Example: DNN Training
Step 3: back propagation to compute the gradient (hidden layer):

$d\boldsymbol{H}^{(1)} = \frac{\partial L}{\partial \boldsymbol{H}^{(1)}} = d\hat{\boldsymbol{y}}\,\boldsymbol{W}^{(2)\,T} = \begin{bmatrix} -0.5 \\ 0.15 \end{bmatrix} [1.0, 0.5, 1.0] = \begin{bmatrix} -0.5 & -0.25 & -0.5 \\ 0.15 & 0.075 & 0.15 \end{bmatrix}$

$d\boldsymbol{W}^{(1)} = d\boldsymbol{H}^{(1)} \cdot \frac{\partial \boldsymbol{H}^{(1)}}{\partial \boldsymbol{Z}^{(1)}} \cdot \frac{\partial \boldsymbol{Z}^{(1)}}{\partial \boldsymbol{W}^{(1)}} = \cdots$

$d\boldsymbol{b}^{(1)} = d\boldsymbol{H}^{(1)} \cdot \frac{\partial \boldsymbol{H}^{(1)}}{\partial \boldsymbol{Z}^{(1)}} \cdot \frac{\partial \boldsymbol{Z}^{(1)}}{\partial \boldsymbol{b}^{(1)}} = \cdots$
Example: DNN Training
Step 4: update the weights W and b using gradient descent with the gradients from Step 3:

$\boldsymbol{W}^{(2)} \leftarrow \boldsymbol{W}^{(2)} - \lambda\, d\boldsymbol{W}^{(2)}$
$\boldsymbol{b}^{(2)} \leftarrow \boldsymbol{b}^{(2)} - \lambda\, d\boldsymbol{b}^{(2)}$
$\boldsymbol{W}^{(1)} \leftarrow \boldsymbol{W}^{(1)} - \lambda\, d\boldsymbol{W}^{(1)}$
$\boldsymbol{b}^{(1)} \leftarrow \boldsymbol{b}^{(1)} - \lambda\, d\boldsymbol{b}^{(1)}$

[Figure: gradient descent on the loss curve, from a starting point down toward the optimum.]

Step 5: repeat Steps 2, 3, and 4 until convergence.
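A sketch reproducing this worked example end to end in NumPy (the learning rate 0.1 is an assumed value):

```python
import numpy as np

# Data and initial parameters from the slides
X = np.array([[1., 2.], [0., 1.]])
y = np.array([[1.], [0.]])
W1 = np.array([[0.1, 0.2, 0.0], [0.0, 0.1, 0.1]])
b1 = np.zeros((1, 3))
W2 = np.array([[1.0], [0.5], [1.0]])
b2 = np.zeros((1, 1))

# Step 2: feedforward
Z1 = X @ W1 + b1            # [[0.1, 0.4, 0.2], [0.0, 0.1, 0.1]]
H1 = np.maximum(0, Z1)      # ReLU
y_hat = H1 @ W2 + b2        # [[0.50], [0.15]]

# Step 3: backprop for MSE L = (1/N) * sum((y_hat - y)**2), N = 2
N = X.shape[0]
d_yhat = 2 * (y_hat - y) / N        # [[-0.5], [0.15]]
dW2 = H1.T @ d_yhat                 # [[-0.05], [-0.185], [-0.085]]
db2 = d_yhat.sum(axis=0)            # -0.35
dH1 = d_yhat @ W2.T
dZ1 = dH1 * (Z1 > 0)                # ReLU derivative
dW1 = X.T @ dZ1
db1 = dZ1.sum(axis=0)

# Step 4: gradient descent update
lr = 0.1
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1
```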
Summary of DNNs
DNNs learn the parameters, such as the weights W and biases b, by optimizing the loss function:

Step 1: randomly initialize weights $\boldsymbol{W}$ and biases $\boldsymbol{b}$
Step 2: calculate the feedforward pass to estimate the output y
Step 3: back-propagate to compute the gradients $d\boldsymbol{W}$ and $d\boldsymbol{b}$
Step 4: update weights $\boldsymbol{W}$ and biases $\boldsymbol{b}$ using gradient descent
Step 5: repeat Steps 2, 3, and 4 until convergence
Any Questions?

Learning Rate
If the learning rate (step size) is too large, gradient descent may overshoot the optimum:

$\boldsymbol{W}^{(l)} \leftarrow \boldsymbol{W}^{(l)} - \lambda\, d\boldsymbol{W}^{(l)}$
$\boldsymbol{b}^{(l)} \leftarrow \boldsymbol{b}^{(l)} - \lambda\, d\boldsymbol{b}^{(l)}$

[Figure: loss curve with updates from a starting point bouncing across the minimum when $\lambda$ is too large.]
Learning Rate schedule
Decay the learning rate as training progresses, e.g.:

$lr = \begin{cases} 0.001 & \text{epoch} < 5 \\ 0.0001 & 5 \le \text{epoch} \le 10 \\ 0.00001 & \text{epoch} > 10 \end{cases}$

$\boldsymbol{W}^{(l)} \leftarrow \boldsymbol{W}^{(l)} - \lambda\, d\boldsymbol{W}^{(l)}, \quad \boldsymbol{b}^{(l)} \leftarrow \boldsymbol{b}^{(l)} - \lambda\, d\boldsymbol{b}^{(l)}$
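A direct transcription of the schedule above:

```python
def lr_schedule(epoch):
    """Step decay of the learning rate, as in the slide."""
    if epoch < 5:
        return 0.001
    elif epoch <= 10:
        return 0.0001
    else:
        return 0.00001
```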
Optimization: Batch Gradient Descent
Given a training dataset of N examples and parameters $\boldsymbol{\theta} = \{\boldsymbol{W}, \boldsymbol{b}\}$:

$f(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} f_i(\boldsymbol{\theta})$

The gradient of the objective function at W and b is

$\nabla f(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} \nabla f_i(\boldsymbol{\theta})$

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \lambda \nabla f(\boldsymbol{\theta})$

The computational cost of each iteration is $\mathcal{O}(N)$.
Stochastic Gradient Descent (SGD)
SGD randomly picks one data point from the whole dataset at each iteration to calculate the derivative:

$\nabla f(\boldsymbol{\theta}) = \nabla f_i(\boldsymbol{\theta}) \quad \text{for a uniformly sampled index } i$

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \lambda \nabla f(\boldsymbol{\theta})$

Limitation: causes the objective function to fluctuate heavily.
Batch Size: Minibatch
• What if N = 100,000 images? Full-batch gradients need large memory and expensive computation.
• How to deal with the issue? Divide the data into M partitions, each minibatch with N/M data samples.

Example: N = 100,000, M = 1,000, so minibatch size = N/M = 100 (number of input images per step).
Minibatch SGD
With a minibatch of N samples stacked in rows:

$\boldsymbol{X} = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{N1} & x_{N2} \end{bmatrix} \in \mathbb{R}^{N \times 2}, \quad \boldsymbol{W}^{(1)} = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \end{bmatrix} \in \mathbb{R}^{2 \times 3}$ (in features $\times$ out features)

$\boldsymbol{Z}^{(1)} = \boldsymbol{X}\boldsymbol{W}^{(1)} + \boldsymbol{b}^{(1)} \in \mathbb{R}^{N \times 3}, \quad \boldsymbol{b}^{(1)} \in \mathbb{R}^{1 \times 3}, \quad \boldsymbol{H}^{(1)} = \sigma(\boldsymbol{Z}^{(1)}) \in \mathbb{R}^{N \times 3}$

$\boldsymbol{Y} = \boldsymbol{H}^{(1)}\boldsymbol{W}^{(2)} + \boldsymbol{b}^{(2)} \in \mathbb{R}^{N \times 2}, \quad \boldsymbol{W}^{(2)} \in \mathbb{R}^{3 \times 2}$
Minibatch SGD
Uniformly sample a minibatch B of the data (e.g., |B| = N/M = 100) and average the gradient over it:

$\nabla f(\boldsymbol{\theta}) = \frac{1}{|B|}\sum_{k \in B} \nabla f_k(\boldsymbol{\theta})$

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \lambda \nabla f(\boldsymbol{\theta})$
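A minimal minibatch-SGD loop (the gradient callback `grad_fn` and its signature are assumptions for illustration):

```python
import numpy as np

def minibatch_sgd(theta, grad_fn, X, y, lr=0.01, batch_size=100, epochs=10):
    """grad_fn(theta, X_batch, y_batch) -> average gradient over the batch (assumed)."""
    N = len(X)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        idx = rng.permutation(N)               # shuffle, then walk through minibatches
        for start in range(0, N, batch_size):
            batch = idx[start:start + batch_size]
            theta -= lr * grad_fn(theta, X[batch], y[batch])
    return theta
```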
Weight Initialization
Xavier Initialization
Assume a fully connected layer without an activation function. Given layer inputs $x_i$ with associated weights $w_{j,i}$, the output is

$o_j = \sum_{i=1}^{N_{in}} w_{j,i} x_i, \quad w_{j,i} \sim N(0, \sigma^2)$

Assume the layer inputs $x_i$ have zero mean and variance $\gamma^2$, i.e., $x_i \sim N(0, \gamma^2)$. Then

$E[o_j] = E\left[\sum_{i=1}^{N_{in}} w_{j,i} x_i\right] = 0$
Weight Initialization
Xavier Initialization
Variance of the output features:

$\mathrm{Var}[o_j] = E[o_j^2] - (E[o_j])^2 = \sum_{i=1}^{N_{in}} E[w_{j,i}^2 x_i^2] = N_{in}\,\sigma^2 \gamma^2$

The same argument on the backward pass gives $N_{out}\,\sigma^2 \gamma^2$. For the output to keep the same variance $\gamma^2$ as the input in both directions, we would need

$N_{in}\,\sigma^2 \gamma^2 = \gamma^2 \quad \text{and} \quad N_{out}\,\sigma^2 \gamma^2 = \gamma^2$

which are satisfied as a compromise by

$\sigma = \sqrt{\frac{2}{N_{in} + N_{out}}}$

where $N_{in}$, $N_{out}$ are the numbers of input/output features.
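A sketch of Gaussian Xavier initialization with this $\sigma$:

```python
import numpy as np

def xavier_init(n_in, n_out, seed=0):
    """Draw weights from N(0, sigma^2) with sigma = sqrt(2 / (n_in + n_out))."""
    sigma = np.sqrt(2.0 / (n_in + n_out))
    return np.random.default_rng(seed).normal(0.0, sigma, size=(n_in, n_out))

W1 = xavier_init(2, 3)   # e.g., the 2x3 layer from the MLP slides
```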
Underfitting and Overfitting
• Overfitting: the model fits the training data too closely (e.g., exactly fitting a few training points) and generalizes poorly.
• Underfitting: the model is too simple for the data; both training and test errors are large.

Remedies for underfitting:
• Increase the model size (depth, width, sharing weights)
• Ensemble
• Train longer
• Lift dimension

Remedies for overfitting:
• Reduce the model size (depth, width, sharing weights)
• Regularization (weight decay)
• Dropout
• Perturb the data (data augmentation)
• Data selection
• Early stopping
Dropout: Preventing Overfitting
Dropout refers to dropping out units (neurons) in a neural network. It is used to prevent overfitting.

$\text{dropout}(p) = \begin{cases} 0 & \text{with probability } p \\ 1 & \text{with probability } 1 - p \end{cases}$

[Figure: (a) standard neural net; (b) after dropout.]
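A sketch of a dropout layer; the 1/(1 − p) rescaling ("inverted dropout") is an assumed implementation detail beyond the slide, used so that expected activations are unchanged:

```python
import numpy as np

def dropout(H, p, seed=0):
    """Zero each unit with probability p; scale survivors by 1/(1-p)."""
    mask = (np.random.default_rng(seed).random(H.shape) >= p).astype(H.dtype)
    return H * mask / (1.0 - p)

H = np.ones((2, 4))
print(dropout(H, p=0.5))   # roughly half the units zeroed, survivors scaled to 2.0
```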


Dropout vs. Ensemble
Create a Decision Tree from a bootstrapped (subset) dataset:

age      income   student   buy_computer
<=30     high     no        no
<=30     high     no        no
31…40    high     no        yes
>40      medium   no        yes
>40      low      yes       yes
>40      low      yes       no
31…40    low      yes       yes
<=30     medium   no        no
<=30     low      yes       yes
>40      medium   yes       yes
<=30     medium   yes       yes
31…40    medium   no        yes
31…40    high     yes       yes
>40      medium   no        no
Dropout vs. Ensemble
An ensemble combines several base models to produce one optimal predictive model.

[Figure: three base models predict Covid-19, Covid-19, and No Covid-19; majority voting outputs Covid-19.]
Batch Normalization (BN)
Batch normalization normalizes the activations in a network across the mini-batch: for each feature, it computes the mean and variance of that feature over the mini-batch.

$\hat{X} = \frac{X_i - \text{mean}}{\text{std}} = \frac{X_i - \mu}{\sigma} \cong \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$ ($\epsilon$ avoids division by 0)

Then re-scale the values: $BN(X) = \gamma \hat{X} + \beta$

Data      Feature 1 (X1)   Feature 2 (X2)   Feature 3 (X3)
Image 1   1                2                5
Image 2   4                1                3
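Verifying the normalization on the table's mini-batch (a sketch; $\gamma = 1$, $\beta = 0$ assumed):

```python
import numpy as np

def batch_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) across the mini-batch."""
    mu = X.mean(axis=0)                 # per-feature mean over the batch
    var = X.var(axis=0)                 # per-feature variance over the batch
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta

X = np.array([[1., 2., 5.],            # Image 1 from the table
              [4., 1., 3.]])           # Image 2
print(batch_norm(X))                   # each column becomes approx [-1, +1] or [+1, -1]
```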
Batch Normalization (BN)
• Batch normalization normalizes the activations in a network across the mini-batch.
• For each feature, it computes the mean and variance of that feature in the mini-batch.

Advantages of Batch Normalization (BN)
• Speeds up training
• Handles internal covariate shift
• Makes training stable
Layer Normalization
Layer normalization normalizes the activations along the feature direction instead of the mini-batch direction: it normalizes the input across the features for each data sample.

$\hat{X} = \frac{X_i - \text{mean}}{\text{std}} = \frac{X_i - \mu}{\sigma} \cong \frac{X_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$

Then re-scale the values: $LN(X) = \gamma \hat{X} + \beta$

Data      Feature 1 (X1)   Feature 2 (X2)   Feature 3 (X3)
Image 1   1                2                5
Image 2   4                1                3
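The only change from the batch-norm sketch above is the axis: statistics are computed per sample across its features:

```python
import numpy as np

def layer_norm(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize across the features (row-wise), independently per sample."""
    mu = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta
```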
Layer Normalization
Advantages:
• Removes the dependency on batches of data samples
• Makes it easier to apply to neural networks
Other normalization layers
Early Stopping
Early stopping is an optimization technique used to reduce overfitting without compromising model accuracy.

[Figure: loss curve over epochs 0–30; stop training when the validation loss starts to rise.]
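A sketch of the stopping logic (the training/validation callbacks and the patience value are assumptions):

```python
def train_with_early_stopping(train_step, val_loss, max_epochs=100, patience=5):
    """Stop when the validation loss hasn't improved for `patience` epochs.
    train_step() and val_loss() are assumed callbacks supplied by the caller."""
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        loss = val_loss()
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                break                  # validation loss rising: stop early
    return best
```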
Any Questions?
