Lec3 MLP Optimization
Lec3 MLP Optimization
INTRODUCTION
TO ARTIFICIAL
INTELLIGENCE
Tianyi Zhou
8/28/23
University of Maryland
Some slides are adapted from Song & Russell
@Berkeley and Shao @William and Mary 1
TA Office hours
• Wednesday (11:00AM-1:00PM)
Chenxi Liu
Ziyue Li
• Friday (2:00PM-4:00PM)
Hongyu Zhao
Kwesi Cobbina
4
Plan
• Today
– Neural networks optimization and training
• http://neuralnetworksanddeeplearning.com/
– Ch 1, 2, and 6
• Russell and Norvig
– Ch 21
• Goodfellow, Bengio, & Courville
– Ch 6, 8
– NN Playground: https://piazza.com/class/llup7xc8lm44w4/post/9
– Gradient descent: https://piazza.com/class/llup7xc8lm44w4/post/7
– Backprop:
• https://www.youtube.com/watch?v=Ilg3gGewQ5U
• https://jeremykun.com/2012/12/09/neural-networks-and-
backpropagation/
• Next lecture
– Neural networks architectures: CNN, LSTM, Transformer
Perceptron in one page
• Binary input x, binary output 𝑓 𝑥 = 𝑠𝑖𝑔𝑛 ∑! 𝑤! 𝑥!
• Learning goal: find w such that 𝑓 𝑥 = 𝑦 for most training data 𝑥, 𝑦
• Learning algorithm: for every misclassified 𝑥, 𝑦 , update 𝑤 = 𝑤 + 𝑦𝑥
• Simple and elegant proof of convergence
• Applications: classification, approximation to logic gates/circuits
𝑥"
𝑧 𝒙 = 𝒙𝒘
Neuron
𝑤" 𝑓 𝒙 = 𝜎 𝑧 𝒙 = 𝜎 𝒙𝒘
𝑥# ∑ 𝜎 𝑦
𝑤# 𝒙 = [1, 𝑥", 𝑥#, 𝑥$]
Activation fun.
𝒘 = [𝑏, 𝑤", 𝑤#, 𝑤$]%
𝑤$
𝑥$
Sigmoid Neuron
𝑥"
f(x) output
Neuron
𝑤"
𝑥# ∑ 𝜎 𝑦
𝑤#
𝑥"
Neuron
𝑤"
𝑥# ∑ 𝜎 𝑦
𝑤#
1. Find weighted sum of inputs
𝑤$ X
𝑥$
z(x) = w i xi
f (x) = (z(x)) i
2. apply ReLU (rectified linear unit) function
• Real-valued Input and Output
•
•
Output is a composition of two functions
Output is only lower bounded but upper unbounded.
𝜎 𝑧 = max(0, 𝑧)
Why Non-Linear Activation Function?
Non-linear can transform the original training data into a higher dimension so that
they are separable.
𝑥" 𝑦 = 𝑥!
Neuron
𝑤"
𝑥# ∑ 𝜎 𝑦
𝑤#
Activation fun.
𝑤$
𝑥$
9
Activation
functions
• Thresholding and information filter
• Nonlinearity and expressiveness
• Partition of the input space
• Many choices available and critical to performance
Practical TIPs of activation
𝒁(") = 𝑿𝑾 "
𝒁(%) ∈ ℝ𝟏×𝟑 𝑯(") = 𝝈(𝒁(") ) 𝑯(%) ∈ ℝ𝟏×𝟑
𝒀 = 𝑯(%) 𝑾 (
+ 𝒃(() 𝑾(() ∈ ℝ𝟑×𝟐 𝒀 ∈ ℝ𝟏×𝟐
13
Linear Regions of MLP
• A neural networks recursively applying a nonlinear activation 𝜎 on a pathway (a sequence of neurons),
i.e., 𝑓 𝑥 = 𝑊 ($) 𝜎(𝑊 ($&!) 𝜎(𝑊 ($&") ⋯ 𝑊 (") 𝜎(𝑊 (!) 𝑥) ⋯ ))
• If 𝜎 is piece-wise linear (e.g., ReLU), the neural network partitions the input space into exponential
number of polytopes (𝑥𝑤 > 0 or 𝑥𝑤 ≤ 0) and applies a linear function to data in each polytope.
0
≤0 𝑤>
0 𝑥𝑤 𝑥
𝒘(",/012) = , 𝑓 𝑥 = 𝑥𝒘(",/012) for x in left polytope
0
𝑤""
𝒘(",34562) = 𝑤 , 𝑓 𝑥 = 𝑥𝒘(",34562) , for x in right polytope
#"
Prediction/Regression Classification
output is a continuous variable output is a discrete label
02 31 02 31
is weekend? is weekend?
B6 is holiday? 7C B6 is holiday? 7C
B6 7C B6 7C
B6 weather 7C B6 weather 7C
f B6 7C ! sales forecast f B6 7C ! is cat?
B6area income7C B6area income7C
@4 5A @4 5A
.. ..
. .
Layer 1:
Layer 2:
Layer l:
𝑯(*) = 𝜎 𝑯(*+%) 𝑾(*) + 𝒃(*)
19
task: regression
Output prediction:
Logits
ℒ = 𝐿(𝒚789:7 , A𝒚)
20
Loss FunctioN: Regression
Ø Regression Problem
Loss fun: Mean Square Error (MSE)
1
𝒚) 𝟐
𝑀𝑆𝐸 = G (𝒚789:7 −A
𝑁
Example:
Y_label 10 15 20
Y_predict 9 14.5 21
1
𝑀𝑆𝐸 = [ 10 − 9 " + 15 − 14.5 " + 20 − 21 " ]
3
21
Task: classification
Output: E.g., softmax for classification
Logits
q Softmax: get probabilities (positive, sum up to one)
𝑒 =!
softmax 𝑧! = ? =
∑!>" 𝑒 !
Example:
𝒁 = [−1, 0, 2]
22
Loss Function: classification
Ø Multiple Classification Problem Do you remember entropy in Decision Tree?
Loss fun: Cross Entropy
𝑦0 class label 𝑐
Cat
Label Y2 0 1 0 Bird
23
Loss Function: binary classification
Ø Binary Classification {0,1}
Loss fun: Binary Cross Entropy
Yes/No
24
Supervised Learning
w1 w13
w2 w14
w3 output layer (can
input layer (activation w4
have multiple
function = identity) outputs)
one or more
hidden layers
Training
Training inputs output
x1 x2 x3 y
0.6 0 -1 10 Given training inputs and outputs,
0.4 1.2 3.1 21 find weights so that network output
1.7 0.6 -0.3 2.1 matches training outputs. How?
1 1 0 0.1
What should w1 ,w2 , b be?
• Want f(x) = y, how to measure their difference?
• Minimize a loss function x0 = 1
x1 x2 y
(
0 0 1 0 if w1 x1 + w2 x2 + b 0
0 1 1 f (x) =
1 if w1 x1 + w2 x2 + b > 0
1 0 1
1 1 0
Optimization of loss function
Loss E(w)
w2
w1
Loss
landscape of
DNN is
complicated
https://tinyurl.com/2fe26wsn
Local minima
Different Optimization Algorithms
(saddle point)
General Recipe for optimization
Training inputs Training output
• Start with some initial weights w
x1 x2 x3 y
• Compute f(x) based on these
0.6 0 -1 10
weights
0.4 1.2 3.1 21 • Check if f(x) matches y – how?
1.7 0.6 -0.3 2.1 Compute loss function E(w)
1 1 0 0.1 • Adjust the weights – how?
• Rinse and repeat
w1 w13
w2 w14
w w3
4
How to adjust weights?
• Let wold = [w1, w2, w3, … ]T be the current set of weights
✓ ◆
• Find @E @E
rE(wold ) = , ,...
@w0 @w1
• ∇𝐸 is a vector pointing in the direction of the local
minima of E from wold
• Update the weights so that you move in the direction
of this vector (i.e., towards the minima)
w2
wnew = wold ↵rE(wold ) w1
General Recipe of optimization
Training inputs Training output • Start with some initial weights
x1 x2 x3 y • Compute f(x) based on these
0.6 0 -1 10 weights
0.4 1.2 3.1 21 • Check if f(x) matches y – how?
Compute loss function E(w)
1.7 0.6 -0.3 2.1
• Adjust the weights – how?
1 1 0 0.1
Gradient descent
w1 w13
• Rinse and repeat
w2 w14
w w3
4
Gradient Descent
on loss landscape
• We still need to compute w2
w1
✓ ◆
@E @E
rE(wold ) = , ,...
@w0 @w1
w2
x2
fw (x) = (z(x)) 1X j
E(w) = ky fw (xj )k2
2 j
1
(z) = need to compute
1+e z
X ✓ ◆
@E @E
z(x) = w i xi rE(wold ) = , ,...
@w0 @w1
i
Finding the weights
fw (x) = (z(x))
1
1X j (z) =
E(w) = ky fw (xj )k2 1+e z
2 j X
z(x) = w i xi
i
@E
=?
@wi
Finding the weights fw (x) = (z(x))
1
1X j (z) =
E(w) = ky fw (xj )k2 1+e z
2 j X
z(x) = w i xi
i
@E @E @ @z
= · ·
@wi @ @z @wi
Example:
𝑦 = 3ℎ4 + 1
ℎ = log 𝑤
𝑑𝑦 𝑑𝑦 𝑑ℎ 1 6 log 𝑤
= = 6ℎ ∗ =
𝑑𝑤 𝑑ℎ 𝑑𝑤 𝑤 𝑤
38
Back Propagation with Chain Rule
,- ,- ,/
=
,. ,/ ,.
Practice:
𝑦 = ℎ4 + 2 , ℎ = 𝑒5
𝑑𝑦
=?
𝑑𝑥
𝑑𝑦 𝑑𝑦 𝑑ℎ
= = 2ℎ ∗ 𝑒 ! = 2𝑒 ! 𝑒 ! = 2 𝑒 ! "
𝑑𝑥 𝑑ℎ 𝑑𝑥
39
Finding the weights fw (x) = (z(x))
Loss of training sample-j (xj,yj) 1
(z) =
1+e z
1X j
E(w) = ky fw (xj )k2 X
2 j z(x) = w i xi
i
@E X @ @z
j j
= y (z(x )) · ·
@wi j
@z @wi
d @z j
Use chain rule = (z)(1 (z)) = xi
dz @wi
xji is the i’th term of sample-j, xj
Finding the weights fw (x) = (z(x))
Loss of training sample-j (xj,yj)
1
(z) =
1+e z
1X j X
E(w) = ky fw (xj )k2
2 j z(x) = w i xi
i
@E X @ @z
j j
= y (z(x )) · ·
@wi j
@z @wi
all three terms are the same – output of the neuron fw(xj)
d @z
= (z)(1 (z)) = xji
dz @wi
xji is the i’th feature of sample-j, xj
Rewriting the equations
@E X d @z 𝜕𝑧
j j
= y (z(x )) = 𝑥6
@wi j
dz @wi 𝜕𝑤6
✓ ◆
d @E @E
= (z)(1 (z)) rE(w) = , ,...
dz @w0 @w1
@E X
= (y j oj )oj (1 oj )xji
@wi j
Update rule: @E
wi0 = wi ⌘
@wi
Caveat of using sigmoid neuron
• Saturated neurons “kills” the gradients.
• Gradient of w are either all-positive or all- Output saturates,
negative (assuming x is all-positive). gradient vanishes
d Output saturates,
Always positive: = (z)(1 (z)) gradient vanishes
dz
@E X
= (y j oj )oj (1 oj )xji
@wi j
Gradient Descent
change in wi is proportional to gradient of loss wrt wi
✓ ◆
@E @E
rE(w) = , ,...
@w0 @w1
@E X
= (y j oj )oj (1 oj )xji
@wi j
Update rule: @E
wi0 = wi ⌘
@wi
Training one neuron
• Start with some initial weights w = [w0, w1, …]
• Repeat until convergence
– For each input xj, use the current weights w to compute the output of the neuron fw(xj) (call it oj)
– Compute the loss E(w) between fw(xj) and yj
– Compute the gradient of the loss wrt each wi
@E X
= (y j oj )oj (1 oj )xji Let oj , (z(xj ))
@wi j
@E
wi = wi ⌘
@wi
Training a DNN:
Same idea as single neuron
• Start with some initial weights w = [w0, w1, …]
• Repeat until convergence
– For each input xj, use the current weights w to compute the output of the network fw(xj)
– Compute the loss E(w) between fw(xj) and yj
– Compute the gradient of the loss wrt each wi
@E
– Update each weight wi based on the gradient wi = wi ⌘
@wi
Before training a dnn:
Designing a DNN
• Remember, the only unknowns are the weights
• Everything else should be decided by you
48
Training DNNs: Feedforward
Ø Step 2: Calculate feedforward to estimate output y using W and b
Layer 1:
Layer 2:
Layer l:
𝑯(*) = 𝜎 𝑯(*+%) 𝑾(*) + 𝒃(*)
49
Training DNNs: Back Propagation
Ø Step 3: back propagation to compute gradient
Loss function:
:−𝑦 4
𝐿= 𝒚
label 50
Step by Step Back Propagation
Ø Step 3: back propagation to compute gradient
Examples: label
Loss function: 𝐿= A−𝒚 #
𝒚
𝜕𝐿 𝜕𝐿 𝜕9𝒚
(*,%)
=
𝜕𝑾 𝒚 𝜕𝑾(*,%)
𝜕9
51
Step by Step Back Propagation
Ø Step 3: back propagation to compute gradient
Examples: label
Loss 𝐿= 𝒚 A−𝒚 #
function:
9 = 𝑯(*) 𝑾(*,%) + 𝒃(*,%)
𝒚 𝑯(*) = σ(𝒁(*) )
𝜕9
𝒚
(𝒍)
=
𝜕𝑯
𝜕𝐿 𝜕𝐿 𝜕9𝒚
(*)
= (*)
=
𝜕𝑯 𝜕9
𝒚 𝜕𝑯
𝜕𝑯(𝒍)
(𝒍)
=
𝜕𝒁
𝜕𝐿
(𝒍)
=
𝜕𝒁 53
Step by Step Back Propagation
Ø Step 3: back propagation to compute gradient
Examples: label
Loss function: 𝐿= A−𝒚 #
𝒚
𝜕9
𝒚 *,% $
(𝒍)
=𝑾
𝜕𝑯
𝜕𝐿 𝜕𝐿 𝜕9𝒚 *,% $
(*)
= (*)
9−𝒚 𝑾
=2 𝒚
𝜕𝑯 𝜕9
𝒚 𝜕𝑯
𝜕𝑯(𝒍) / (𝒁(*) )
= σ
𝜕𝒁(𝒍)
𝜕𝐿 𝜕𝐿 𝜕𝑯(𝒍) *,% $ /
(𝒍)
= (*) (𝒍)
9−𝒚 𝑾
=2 𝒚 σ (𝒁(*) )
𝜕𝒁 𝜕𝑯 𝜕𝒁 54
Training DNNs: Back Propagation
Ø Step 3: back propagation to compute gradient
𝜕𝐿
𝑑9
𝒚= 9−𝒚
=2 𝒚
𝜕9
𝒚
(*,%)
𝜕𝐿 𝜕𝐿 𝜕9 𝒚 (*) * $
𝑑𝑾 = (*,%)
= (*,%)
= 𝑑9
𝒚 𝑯 = 2𝑯 9−𝒚
𝒚
𝜕𝑾 𝜕9
𝒚 𝜕𝑾
𝜕𝐿 𝜕𝐿 𝜕9𝒚 *,% $
(*)
= (*)
9−𝒚 𝑾
=2 𝒚
𝜕𝑯 𝜕9
𝒚 𝜕𝑯
𝜕𝐿 𝜕9 𝒚
𝑑𝒃(*,%) = = 𝑑9 9−𝒚
𝒚=2 𝒚
𝒚 𝜕𝒃(*,%)
𝜕9
55
Training DNNs: Back Propagation
Ø Step 3: back propagation to compute gradient
𝜕𝐿
𝑑9
𝒚= 9−𝒚
=2 𝒚
𝜕9
𝒚
𝜕𝐿 𝜕𝐿 𝜕9𝒚
𝑑𝑯 (*)
= =
𝜕𝑯(*) 𝜕9𝒚 𝜕𝑯(*)
9−𝒚 𝑾
=2 𝒚 *,% $
≜𝛿
𝜕𝑯 (*) 𝜕𝒁(*)
(*+%) 0
𝑑𝑾(*) = 𝑑𝑯(*) (*) (*)
= 𝑯 𝛿⨀𝜎 /
𝒁 *
𝜕𝒁 𝜕𝑾
(*)
𝜕𝑯
𝑑𝒃(*) = 𝑑𝑯(*) (*)
= 𝛿⨀𝜎 /
𝒁 *
𝜕𝒁
Hadamard product
56
Training DNNs: Back Propagation
𝑾(*) ⟵ 𝑾 * − 𝜆𝑑𝑾(*)
𝒃(*) ⟵ 𝒃 * − 𝜆𝑑𝒃(*)
Learning rate
57
Example: DNNs training
Training Example:
1, 2 1
𝑿= 𝒚= ReLU= max{0, 𝑥}
0, 1 0
𝑯(!) = ReLU 𝒁 ! =
𝜕𝐿 𝜕B𝒚
𝑑𝒃(") = (")
= 𝑑B
𝒚 = −0.35
𝜕B
𝒚 𝜕𝒃
61
Example: DNN Training
Example:
Ø Step 3: back propagation to compute gradient
𝜕𝐿 𝜕B𝒚 1.0
𝑑𝑯(!) = (!)
9𝑾(() = −0.35 ∗ 0.5
= 𝑑𝒚 Gradient decent to find optimal value
𝜕B
𝒚 𝜕𝑯
1.0
𝜕𝑯 (%) 𝜕𝒁(%)
𝑑𝑾(!) = 𝑑𝑯(!) =⋯ 𝑾(")
𝜕𝒁(%) 𝜕𝑾(%)
𝜕𝑯 (%) 𝜕𝒁(%)
𝑑𝒃(!) = 𝑑𝑯(!) =⋯
𝜕𝒁(%) 𝜕𝒃(%)
62
Example: DNN Training
Example:
𝑾(() ⟵ 𝑾 (
− 𝜆𝑑𝑾(()
Gradient decent to find optimal value
loss
𝒃(() ⟵ 𝒃 (
− 𝜆𝑑𝒃(()
Starting point
𝑾(%) ⟵ 𝑾 %
− 𝜆𝑑𝑾(%) 𝑾(")
𝒃(%) ⟵ 𝒃 %
− 𝜆𝑑𝒃(%)
parameters
Ø Step 5: repeat Steps 2, 3, and 4 until convergence
63
Summary of DNNs
DNNs aim to learn the parameters, such as weights W and bias b, by
optimizing the loss function
64
Any Questions?
65
Learning Rate
If learning rate (step size) is too large, it may skip the optimal result
𝑾(*) ⟵ 𝑾 * − 𝜆𝑑𝑾(*)
𝒃(*) ⟵ 𝒃 * − 𝜆𝑑𝒃(*)
loss
Starting point
parameters
66
Learning Rate schedule
𝑾(*) ⟵ 𝑾 * − 𝜆𝑑𝑾(*)
loss
Starting point 𝒃(*) ⟵ 𝒃 * − 𝜆𝑑𝒃(*)
67
Optimization: Batch Gradient Descent
Given a training dataset of N examples
parameters
4 𝜽 = {𝑾, 𝒃}
1
𝑓(𝜽) = G 𝑓2 (𝜽)
𝑁
23!
N samples
𝜽 ← 𝜽 − 𝜆∇𝑓(𝜽)
Stochastic gradient descent: SGD randomly picks one data point from the
whole data set at each iteration to calculate the derivatives
6
1
∇𝑓(𝜽) = G ∇𝑓5 (𝜽)
𝐾
53!
𝜽 ← 𝜽 − 𝜆∇𝑓(𝜽)
N samples
Batch = N =10
𝒁(%) 𝑵×𝟑
∈ℝ 𝒃 (%) 𝟏×𝟑
∈ℝ 𝒀 = 𝑯(%) 𝑾 (
+ 𝒃(() 𝑾(() ∈ ℝ𝟑×𝟐 𝒀 ∈ ℝ𝑵×𝟐
71
Minibatch SGD
6
1
∇𝑓(𝜽) = G ∇𝑓5 (𝜽)
𝑁
53!
𝜽 ← 𝜽 − 𝜆∇𝑓(𝜽)
N samples
72
Weights Initialization
Ø Xavier Initialization
Assume fully connected layer without activation function. Given layer feature 𝑥! with the associated weights
𝑤7,2 , then the output 𝑜7
1&'
Assume layer feature 𝑥! have zero mean and variance 𝛾 " , 𝑥3 ~𝑁(0, 𝛾 " )
6
𝐸 𝑜4 = 𝐸 R 𝑤4,3 𝑥3 = 0
35%
73
Weights Initialization
Ø Xavier Initialization
Variance of output features:
" "
𝑉𝑎𝑟 𝑜7 = 𝐸 𝑜7" − 𝐸 𝑜7 𝑉𝑎𝑟 𝑜79! = 𝐸 "
𝑜79! − 𝐸 𝑜79!
75
Dropout: Prevent Overfitting
Dropout refers to dropping out units (neurons) in a neural network. It is used to
prevent Overfitting problem.
0, 𝑝
dropout 𝑝 = c
1, 1 − 𝑝
77
Dropout vs. Ensemble
Ø Ensemble that combines several base models in order to produce one optimal predictive
model
Majority Voting
78
Batch Normalization (BN)
Batch normalization is a method that normalizes activations in a network across the
mini-batch. For each feature, batch normalization computes the mean and variance of
that feature in the mini-batch
𝑋3 − 𝑚𝑒𝑎𝑛 𝑋3 − 𝜇 𝑋3 − 𝜇
𝑋8 = = ≅
𝑠𝑡𝑑 𝜎 𝜎( + 𝜖 Avoid 0
Image 1 1 2 5
Image 2 4 1 3
79
Batch Normalization (BN)
q Batch normalization is a method that normalizes activations in a network across
the mini-batch.
q For each feature, batch normalization computes the mean and variance of that
feature in the mini-batch
BN
80
Advantages of Batch
Normalization (BN)
81
Layer Normalization
Layer normalization normalizes the activations along the feature direction instead of mini-
batch direction. Namely, normalize input across the features for each input data sample.
𝑋3 − 𝑚𝑒𝑎𝑛 𝑋3 − 𝜇 𝑋3 − 𝜇
𝑋8 = = ≅
𝑠𝑡𝑑 𝜎 𝜎( + 𝜖
Re-scale the values
Image 1 1 2 5
Image 2 4 1 3
82
Layer Normalization
Advantages:
q Remove the dependency on batches of data samples
q Make it easier to apply for Neural Networks
83
Other normalization layers
Early Stopping
Early stopping is an optimization technique used to reduce overfitting without
compromising on model accuracy
loss
0 10 20 30 epochs
85
Any Questions?
86