0% found this document useful (0 votes)

28 views

Lec3 MLP Optimization

Uploaded by

kaydee140492

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

28 views

Lec3 MLP Optimization

Uploaded by

kaydee140492

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 86

FALL 2023

INTRODUCTION
TO ARTIFICIAL
INTELLIGENCE

Tianyi Zhou

8/28/23
University of Maryland
Some slides are adapted from Song & Russell
@Berkeley and Shao @William and Mary 1
TA Office hours

• Wednesday (11:00AM-1:00PM)
Chenxi Liu
Ziyue Li

• Friday (2:00PM-4:00PM)
Hongyu Zhao
Kwesi Cobbina

• Starting this Friday.

• Location will be announced before the date.
Seven Components of this course
Action,
Acting Prediction
Probabilistic
Reasoning
Human
users Agent
Language
Models Embodied &
Multi-modal AI World
Neural
Networks

Search & Perception

Planning
Reward, Data,
Observation
Neural networks
• Perceptron and MLP
• Optimization and Backpropagation
• Convolutional Neural Networks
• Recurrent Neural Networks and LSTM
• Attention and Transformer
• Graph Neural Networks

4
Plan
• Today
– Neural networks optimization and training
• http://neuralnetworksanddeeplearning.com/
– Ch 1, 2, and 6
• Russell and Norvig
– Ch 21
• Goodfellow, Bengio, & Courville
– Ch 6, 8
– NN Playground: https://piazza.com/class/llup7xc8lm44w4/post/9
– Gradient descent: https://piazza.com/class/llup7xc8lm44w4/post/7
– Backprop:
• https://www.youtube.com/watch?v=Ilg3gGewQ5U
• https://jeremykun.com/2012/12/09/neural-networks-and-
backpropagation/

• Next lecture
– Neural networks architectures: CNN, LSTM, Transformer
Perceptron in one page
• Binary input x, binary output 𝑓 𝑥 = 𝑠𝑖𝑔𝑛 ∑! 𝑤! 𝑥!
• Learning goal: find w such that 𝑓 𝑥 = 𝑦 for most training data 𝑥, 𝑦
• Learning algorithm: for every misclassified 𝑥, 𝑦 , update 𝑤 = 𝑤 + 𝑦𝑥
• Simple and elegant proof of convergence
• Applications: classification, approximation to logic gates/circuits
𝑥"
𝑧 𝒙 = 𝒙𝒘
Neuron
𝑤" 𝑓 𝒙 = 𝜎 𝑧 𝒙 = 𝜎 𝒙𝒘
𝑥# ∑ 𝜎 𝑦
𝑤# 𝒙 = [1, 𝑥", 𝑥#, 𝑥$]

Activation fun.
𝒘 = [𝑏, 𝑤", 𝑤#, 𝑤$]%
𝑤$
𝑥$
Sigmoid Neuron

𝑥"
f(x) output
Neuron
𝑤"
𝑥# ∑ 𝜎 𝑦
𝑤#

𝑤$ 1. Find weighted sum of inputs

X
𝑥$
z(x) = w i xi
i
f (x) = (z(x)) 2. apply sigmoid function
1
• Real-valued Input and Output (z) = z
• Output is a composition of two functions 1+e
• Output is lower and upper bounded
RELU Neuron

𝑥"
Neuron
𝑤"
𝑥# ∑ 𝜎 𝑦
𝑤#
1. Find weighted sum of inputs
𝑤$ X
𝑥$
z(x) = w i xi
f (x) = (z(x)) i
2. apply ReLU (rectified linear unit) function
• Real-valued Input and Output
•
•
Output is a composition of two functions
Output is only lower bounded but upper unbounded.
𝜎 𝑧 = max(0, 𝑧)
Why Non-Linear Activation Function?
Non-linear can transform the original training data into a higher dimension so that
they are separable.

𝑥" 𝑦 = 𝑥!
Neuron
𝑤"
𝑥# ∑ 𝜎 𝑦
𝑤#

Activation fun.
𝑤$
𝑥$

9
Activation
functions
• Thresholding and information filter
• Nonlinearity and expressiveness
• Partition of the input space
• Many choices available and critical to performance
Practical TIPs of activation

• Try ReLU at first (most widely used activation function).

• Be careful about the learning rate.
• For better performance, try out LeakyReLU, ELU, SeLU, GeLU,
Maxout max(𝑥𝑤 ! , 𝑥𝑤 " )
• Avoid to use Sigmoid or Tanh for hidden layer neurons (why?). But they
can be used for output if a bounded output is required.
Neural connections
in animal brains
• Connectome (a map of neural connections) in brains of
302 neurons in <1mm worms Caenorhabditis elegans.
• Much more complicated in human brains.
• Architecture of neural networks is important (multi-
layer, convolution, attention, recurrent, graph).
MULTI-LAYER PERCEPTRON (MLP)

In feature × out feature

𝑤"", 𝑤"#, 𝑤"$
𝑿 = 𝑥", 𝑥# 𝑿∈ ℝ𝟏×𝟐 𝑾(") = 𝑤 ,𝑤 ,𝑤 𝑾(%) ∈ ℝ𝟐×𝟑
#" ## #$

𝒁(") = 𝑿𝑾 "
𝒁(%) ∈ ℝ𝟏×𝟑 𝑯(") = 𝝈(𝒁(") ) 𝑯(%) ∈ ℝ𝟏×𝟑
𝒀 = 𝑯(%) 𝑾 (
+ 𝒃(() 𝑾(() ∈ ℝ𝟑×𝟐 𝒀 ∈ ℝ𝟏×𝟐
13
Linear Regions of MLP
• A neural networks recursively applying a nonlinear activation 𝜎 on a pathway (a sequence of neurons),
i.e., 𝑓 𝑥 = 𝑊 ($) 𝜎(𝑊 ($&!) 𝜎(𝑊 ($&") ⋯ 𝑊 (") 𝜎(𝑊 (!) 𝑥) ⋯ ))
• If 𝜎 is piece-wise linear (e.g., ReLU), the neural network partitions the input space into exponential
number of polytopes (𝑥𝑤 > 0 or 𝑥𝑤 ≤ 0) and applies a linear function to data in each polytope.
0
≤0 𝑤>
0 𝑥𝑤 𝑥
𝒘(",/012) = , 𝑓 𝑥 = 𝑥𝒘(",/012) for x in left polytope
0
𝑤""
𝒘(",34562) = 𝑤 , 𝑓 𝑥 = 𝑥𝒘(",34562) , for x in right polytope
#"

Every polytope is encoded by (left, right, left, left, …) and

data point in the polytope has an output of
𝑓 𝑥 = 𝑥𝒘(",/012) 𝒘(#,34562) 𝒘(",/012) ⋯, a linear function
Neural Networks are Piecewise-Linear
• If 𝜎 piece-wise linear, 𝑓 𝑥 = 𝜎( ⋯ 𝜎(𝜎(𝑥𝑊 (!) )𝑊 (") ) ⋯ 𝑊 ($&!) )𝑊 ($) is piece-wise linear, e.g., each
piece is a polytope and 𝑓 𝑥 = 𝑥𝒘(!,()*+) 𝒘(",,-./+) 𝒘(!,()*+) ⋯ 𝒘($,,-./+) for x in the polytope
• In practice: exponential number of polytopes but every polytope only have at most one data point.
• How does generalization come from? Shared facets/edges between polytopes.
Neural Networks are Piecewise-Linear
Training Piecewise-Linear Neural Network
Neural networks are
function approximators
• Use (x, y) data to learn a neural network f(x) towards y=f(x)

Prediction/Regression Classification
output is a continuous variable output is a discrete label
02 31 02 31
is weekend? is weekend?
B6 is holiday? 7C B6 is holiday? 7C
B6 7C B6 7C
B6 weather 7C B6 weather 7C
f B6 7C ! sales forecast f B6 7C ! is cat?
B6area income7C B6area income7C
@4 5A @4 5A
.. ..
. .

In RL, we will use NN for regression f(x) ß Q( [s,a] )

DNNs: Feedforward (inference)

Layer 1:

𝑯(%) = 𝜎 𝑯()) 𝑾(%) + 𝒃(%) = 𝜎 𝑿𝑾(%) + 𝒃(%)

Layer 2:

𝑯(() = 𝜎 𝑯(%) 𝑾(() + 𝒃(()

Layer l:
𝑯(*) = 𝜎 𝑯(*+%) 𝑾(*) + 𝒃(*)

19
task: regression

Output prediction:

A = 𝑔 𝑯(7) 𝑾(7;") + 𝒃(7;")

𝒚

Logits

q Loss Function: difference between

prediction and label

ℒ = 𝐿(𝒚789:7 , A𝒚)

20
Loss FunctioN: Regression
Ø Regression Problem
Loss fun: Mean Square Error (MSE)

1
𝒚) 𝟐
𝑀𝑆𝐸 = G (𝒚789:7 −A
𝑁

Example:

Y_label 10 15 20

Y_predict 9 14.5 21

1
𝑀𝑆𝐸 = [ 10 − 9 " + 15 − 14.5 " + 20 − 21 " ]
3
21
Task: classification
Output: E.g., softmax for classification

A = 𝑔 𝑯(7) 𝑾(7;") + 𝒃(7;")

𝒚

Logits
q Softmax: get probabilities (positive, sum up to one)

𝑒 =!
softmax 𝑧! = ? =
∑!>" 𝑒 !

Example:

𝒁 = [−1, 0, 2]

22
Loss Function: classification
Ø Multiple Classification Problem Do you remember entropy in Decision Tree?
Loss fun: Cross Entropy

$ predicted probability observation of class c

𝑙𝑜𝑠𝑠 = − & 𝑦! log 𝑝!
!"#

𝑦0 class label 𝑐
Cat

Class cat dog bird

Dog
Label Y1 1 0 0

Label Y2 0 1 0 Bird
23
Loss Function: binary classification
Ø Binary Classification {0,1}
Loss fun: Binary Cross Entropy

𝑙𝑜𝑠𝑠 = − 𝑦log𝑝 + 1 − 𝑦 log(1 − 𝑝)

predicted probability of a class

Yes/No

24
Supervised Learning
w1 w13
w2 w14
w3 output layer (can
input layer (activation w4
have multiple
function = identity) outputs)

one or more
hidden layers
Training
Training inputs output
x1 x2 x3 y
0.6 0 -1 10 Given training inputs and outputs,
0.4 1.2 3.1 21 find weights so that network output
1.7 0.6 -0.3 2.1 matches training outputs. How?
1 1 0 0.1
What should w1 ,w2 , b be?
• Want f(x) = y, how to measure their difference?
• Minimize a loss function x0 = 1

• You likely did the following: x1 w1 b

– started with some values of w1, w2, b say 1, 1, 0

– computed f(x) using w1 = 1, w2 = 1, b = 0 f(x)
– checked if f(x) matched y, i.e., if the loss E(w) is small
x2 w2
– if not, you adjusted w1, w2, b slightly
– rinse and repeat until f(x) matched y

x1 x2 y
(
0 0 1 0 if w1 x1 + w2 x2 + b  0
0 1 1 f (x) =
1 if w1 x1 + w2 x2 + b > 0
1 0 1
1 1 0
Optimization of loss function

Loss E(w)

w2
w1
Loss
landscape of
DNN is
complicated

We are interested in finding the

lowest point on the loss landscape
but it is easy to be trapped in

• Bad local minima

• Saddle point
• Plateau

https://tinyurl.com/2fe26wsn
Local minima
Different Optimization Algorithms
(saddle point)
General Recipe for optimization
Training inputs Training output
• Start with some initial weights w
x1 x2 x3 y
• Compute f(x) based on these
0.6 0 -1 10
weights
0.4 1.2 3.1 21 • Check if f(x) matches y – how?
1.7 0.6 -0.3 2.1 Compute loss function E(w)
1 1 0 0.1 • Adjust the weights – how?
• Rinse and repeat
w1 w13
w2 w14
w w3
4
How to adjust weights?
• Let wold = [w1, w2, w3, … ]T be the current set of weights
✓ ◆
• Find @E @E
rE(wold ) = , ,...
@w0 @w1
• ∇𝐸 is a vector pointing in the direction of the local
minima of E from wold
• Update the weights so that you move in the direction
of this vector (i.e., towards the minima)
w2
wnew = wold ↵rE(wold ) w1
General Recipe of optimization
Training inputs Training output • Start with some initial weights
x1 x2 x3 y • Compute f(x) based on these
0.6 0 -1 10 weights
0.4 1.2 3.1 21 • Check if f(x) matches y – how?
Compute loss function E(w)
1.7 0.6 -0.3 2.1
• Adjust the weights – how?
1 1 0 0.1
Gradient descent
w1 w13
• Rinse and repeat
w2 w14
w w3
4
Gradient Descent
on loss landscape
• We still need to compute w2
w1
✓ ◆
@E @E
rE(wold ) = , ,...
@w0 @w1

• I’ll show how to do this for a single sigmoid neuron

• Then, we will see how to do this for a neural network – backpropagation
• Most libraries implement automatic differentiation
Sigmoid Neuron
1 w0
fw(x) output
x1 w1
X

w2
x2

fw (x) = (z(x)) 1X j
E(w) = ky fw (xj )k2
2 j
1
(z) = need to compute
1+e z
X ✓ ◆
@E @E
z(x) = w i xi rE(wold ) = , ,...
@w0 @w1
i
Finding the weights
fw (x) = (z(x))
1
1X j (z) =
E(w) = ky fw (xj )k2 1+e z
2 j X
z(x) = w i xi
i

@E
=?
@wi
Finding the weights fw (x) = (z(x))
1
1X j (z) =
E(w) = ky fw (xj )k2 1+e z
2 j X
z(x) = w i xi
i

@E @E @ @z
= · ·
@wi @ @z @wi

Use chain rule

Back Propagation with Chain Rule
,- ,- ,/
=
,. ,/ ,.

Example:

𝑦 = 3ℎ4 + 1
ℎ = log 𝑤

𝑑𝑦 𝑑𝑦 𝑑ℎ 1 6 log 𝑤
= = 6ℎ ∗ =
𝑑𝑤 𝑑ℎ 𝑑𝑤 𝑤 𝑤

38
Back Propagation with Chain Rule
,- ,- ,/
=
,. ,/ ,.

Practice:

𝑦 = ℎ4 + 2 , ℎ = 𝑒5
𝑑𝑦
=?
𝑑𝑥
𝑑𝑦 𝑑𝑦 𝑑ℎ
= = 2ℎ ∗ 𝑒 ! = 2𝑒 ! 𝑒 ! = 2 𝑒 ! "
𝑑𝑥 𝑑ℎ 𝑑𝑥
39
Finding the weights fw (x) = (z(x))
Loss of training sample-j (xj,yj) 1
(z) =
1+e z
1X j
E(w) = ky fw (xj )k2 X
2 j z(x) = w i xi
i

@E X @ @z
j j
= y (z(x )) · ·
@wi j
@z @wi

d @z j
Use chain rule = (z)(1 (z)) = xi
dz @wi
xji is the i’th term of sample-j, xj
Finding the weights fw (x) = (z(x))
Loss of training sample-j (xj,yj)
1
(z) =
1+e z
1X j X
E(w) = ky fw (xj )k2
2 j z(x) = w i xi
i

@E X @ @z
j j
= y (z(x )) · ·
@wi j
@z @wi
all three terms are the same – output of the neuron fw(xj)

d @z
= (z)(1 (z)) = xji
dz @wi
xji is the i’th feature of sample-j, xj
Rewriting the equations
@E X d @z 𝜕𝑧
j j
= y (z(x )) = 𝑥6
@wi j
dz @wi 𝜕𝑤6
✓ ◆
d @E @E
= (z)(1 (z)) rE(w) = , ,...
dz @w0 @w1

Let oj , (z(xj )) output of the neuron for input xj

@E X
= (y j oj )oj (1 oj )xji
@wi j

Update rule: @E
wi0 = wi ⌘
@wi
Caveat of using sigmoid neuron
• Saturated neurons “kills” the gradients.
• Gradient of w are either all-positive or all- Output saturates,
negative (assuming x is all-positive). gradient vanishes

• exp() is more expensive to compute (vs. ReLU).

d Output saturates,
Always positive: = (z)(1 (z)) gradient vanishes
dz
@E X
= (y j oj )oj (1 oj )xji
@wi j
Gradient Descent
change in wi is proportional to gradient of loss wrt wi
✓ ◆
@E @E
rE(w) = , ,...
@w0 @w1

Let oj , (z(xj )) output of the neuron for input xj

@E X
= (y j oj )oj (1 oj )xji
@wi j

Update rule: @E
wi0 = wi ⌘
@wi
Training one neuron
• Start with some initial weights w = [w0, w1, …]
• Repeat until convergence
– For each input xj, use the current weights w to compute the output of the neuron fw(xj) (call it oj)
– Compute the loss E(w) between fw(xj) and yj
– Compute the gradient of the loss wrt each wi

@E X
= (y j oj )oj (1 oj )xji Let oj , (z(xj ))
@wi j

– Update each weight wi based on the gradient

@E
wi = wi ⌘
@wi
Training a DNN:
Same idea as single neuron
• Start with some initial weights w = [w0, w1, …]
• Repeat until convergence
– For each input xj, use the current weights w to compute the output of the network fw(xj)
– Compute the loss E(w) between fw(xj) and yj
– Compute the gradient of the loss wrt each wi

• slightly more complicated equation

@E
– Update each weight wi based on the gradient wi = wi ⌘
@wi
Before training a dnn:
Designing a DNN
• Remember, the only unknowns are the weights
• Everything else should be decided by you

• How many layers?

• How many neurons in each layer?
• How to connect the neurons?
• What activation function to use?
• What loss function to use?
• What gradient descent method to use?
Training DNNs: Initialize
Ø Step 1: Randomly initialize weight W and bias b for predict y
Layer l:

𝑯() = 𝜎 𝑯(+%) 𝑾() + 𝒃()

Initialize weight W and bias b

0.1, −0.2, 0.2
𝑾(*) = 𝒃(*) = 0.1, 0.2, 0
1.1, 0.01. 0.3

48
Training DNNs: Feedforward
Ø Step 2: Calculate feedforward to estimate output y using W and b
Layer 1:

𝑯(%) = 𝜎 𝑯()) 𝑾(%) + 𝒃(%) = 𝜎 𝑿𝑾(%) + 𝒃(%)

Layer 2:

𝑯(() = 𝜎 𝑯(%) 𝑾(() + 𝒃(()

Layer l:
𝑯(*) = 𝜎 𝑯(*+%) 𝑾(*) + 𝒃(*)

49
Training DNNs: Back Propagation
Ø Step 3: back propagation to compute gradient

𝒁(%) = 𝑿𝑾(%) + 𝒃(%)

𝑯(%) = σ(𝒁(%) )
…
𝒁(*) = 𝑯(*+%) 𝑾(*) + 𝒃(*)
𝑯(*) = σ(𝒁(*) )
9 = 𝑯(*) 𝑾(*,%) + 𝒃(*,%)
𝒚

Loss function:
:−𝑦 4
𝐿= 𝒚
label 50
Step by Step Back Propagation
Ø Step 3: back propagation to compute gradient

Examples: label
Loss function: 𝐿= A−𝒚 #
𝒚

9 = 𝑯() 𝑾(,%) + 𝒃(*,%)

𝒚
Gradient decent to find optimal value
𝜕𝐿
𝑑9
𝒚= =
𝜕9
𝒚
𝑾(")
𝜕9
𝒚
(𝒍,%)
=
𝜕𝑾

𝜕𝐿 𝜕𝐿 𝜕9𝒚
(*,%)
=
𝜕𝑾 𝒚 𝜕𝑾(*,%)
𝜕9
51
Step by Step Back Propagation
Ø Step 3: back propagation to compute gradient

Examples: Toy example:

label
# 𝑤%
Loss 𝐿= 𝒚 A−𝒚
𝑿 = 𝑥% , 𝑥( , 𝑥. 𝑾 = 𝑤(
function: 𝑤.
9 = 𝑯(*) 𝑾(*,%) + 𝒃(*,%)
𝒚 𝒚 = 𝑿𝑾
𝜕𝐿
𝑑9
𝒚= 9−𝒚
=2 𝒚 𝜕𝒚 𝜕𝒚 𝜕𝒚 𝜕𝒚
𝜕9
𝒚 = , ,
𝜕𝑿 𝜕𝑥% 𝜕𝑥( 𝜕𝑥.
𝜕9
𝒚 * $
=𝑯 = 𝑤% , 𝑤( , 𝑤.
𝜕𝑾(𝒍,%)
= 𝑾𝑻
𝜕𝐿 𝜕𝐿 𝜕9 𝒚 * $
(*,%)
= (*,%)
= 2𝑯 9−𝒚
𝒚
𝜕𝑾 𝜕9
𝒚 𝜕𝑾
52
Step by Step Back Propagation
Ø Step 3: back propagation to compute gradient

Examples: label
Loss 𝐿= 𝒚 A−𝒚 #
function:
9 = 𝑯(*) 𝑾(*,%) + 𝒃(*,%)
𝒚 𝑯(*) = σ(𝒁(*) )

𝜕9
𝒚
(𝒍)
=
𝜕𝑯
𝜕𝐿 𝜕𝐿 𝜕9𝒚
(*)
= (*)
=
𝜕𝑯 𝜕9
𝒚 𝜕𝑯

𝜕𝑯(𝒍)
(𝒍)
=
𝜕𝒁
𝜕𝐿
(𝒍)
=
𝜕𝒁 53
Step by Step Back Propagation
Ø Step 3: back propagation to compute gradient

Examples: label
Loss function: 𝐿= A−𝒚 #
𝒚

9 = 𝑯() 𝑾(,%) + 𝒃(*,%)

𝒚 𝑯(*) = σ(𝒁(*) )

𝜕9
𝒚 *,% $
(𝒍)
=𝑾
𝜕𝑯
𝜕𝐿 𝜕𝐿 𝜕9𝒚 *,% $
(*)
= (*)
9−𝒚 𝑾
=2 𝒚
𝜕𝑯 𝜕9
𝒚 𝜕𝑯

𝜕𝑯(𝒍) / (𝒁(*) )
= σ
𝜕𝒁(𝒍)
𝜕𝐿 𝜕𝐿 𝜕𝑯(𝒍) *,% $ /
(𝒍)
= (*) (𝒍)
9−𝒚 𝑾
=2 𝒚 σ (𝒁(*) )
𝜕𝒁 𝜕𝑯 𝜕𝒁 54
Training DNNs: Back Propagation
Ø Step 3: back propagation to compute gradient
𝜕𝐿
𝑑9
𝒚= 9−𝒚
=2 𝒚
𝜕9
𝒚

(*,%)
𝜕𝐿 𝜕𝐿 𝜕9 𝒚 (*) * $
𝑑𝑾 = (*,%)
= (*,%)
= 𝑑9
𝒚 𝑯 = 2𝑯 9−𝒚
𝒚
𝜕𝑾 𝜕9
𝒚 𝜕𝑾
𝜕𝐿 𝜕𝐿 𝜕9𝒚 *,% $
(*)
= (*)
9−𝒚 𝑾
=2 𝒚
𝜕𝑯 𝜕9
𝒚 𝜕𝑯
𝜕𝐿 𝜕9 𝒚
𝑑𝒃(*,%) = = 𝑑9 9−𝒚
𝒚=2 𝒚
𝒚 𝜕𝒃(*,%)
𝜕9

55
Training DNNs: Back Propagation
Ø Step 3: back propagation to compute gradient
𝜕𝐿
𝑑9
𝒚= 9−𝒚
=2 𝒚
𝜕9
𝒚
𝜕𝐿 𝜕𝐿 𝜕9𝒚
𝑑𝑯 (*)
= =
𝜕𝑯(*) 𝜕9𝒚 𝜕𝑯(*)
9−𝒚 𝑾
=2 𝒚 *,% $
≜𝛿
𝜕𝑯 (*) 𝜕𝒁(*)
(*+%) 0
𝑑𝑾(*) = 𝑑𝑯(*) (*) (*)
= 𝑯 𝛿⨀𝜎 /
𝒁 *
𝜕𝒁 𝜕𝑾
(*)
𝜕𝑯
𝑑𝒃(*) = 𝑑𝑯(*) (*)
= 𝛿⨀𝜎 /
𝒁 *
𝜕𝒁
Hadamard product

56
Training DNNs: Back Propagation

Ø Step 4: Update weights with gradient decent

𝑾(*) ⟵ 𝑾 * − 𝜆𝑑𝑾(*)

𝒃(*) ⟵ 𝒃 * − 𝜆𝑑𝒃(*)

Learning rate

Ø Step 5: Repeat Step 2, 3, and 4

57
Example: DNNs training
Training Example:

Imgs color (x1) face (x2) y

Img 1 1 2 1 (cat)
Img 2 0 1 0

1, 2 1
𝑿= 𝒚= ReLU= max{0, 𝑥}
0, 1 0

Ø Step 1: randomly initialize weights 𝑾(%) , 𝑾(() , 𝒃(%) and 𝒃(()

0.1, 0.2, 0.0 1.0

𝑾(%) = 𝒃(") = 0, 0, 0 𝑾(") = 0.5 𝒃(#) = 0
0.0, 0.1, 0.1
1.0 58
Example: DNN Training
Example:
1, 2 1 ReLU= max{0, 𝑥}
𝑿= 𝒚=
0, 1 0
0.1, 0.2, 0.0 1.0
𝑾(!) = 𝑾(") = 0.5 𝒃(!) = 0, 0, 0 𝒃(𝟐) = 0
0.0, 0.1, 0.1
1.0

Ø Step 2: calculate feedforward to estimate output y

𝒁(!) = 𝑿𝑾(!) + 𝒃(!) =

𝑯(!) = ReLU 𝒁 ! =

B = 𝑯(!) 𝑾(") + 𝒃(") =

𝒚
59
Example: DNN Training
Example:
1, 2 1
𝑿= 𝒚= ReLU= max{0, 𝑥}
0, 1 0
0.1, 0.2, 0.0 1.0
𝑾(!) = 𝑾(") = 0.5 𝒃(!) = 0, 0, 0 𝒃(𝟐) = 0
0.0, 0.1, 0.1
1.0

Ø Step 2: calculate feedforward to estimate output y

0.1, 0.4, 0.2 0.1, 0.4, 0.2
𝒁(!) = 𝑿𝑾(!) + 𝒃(!) = + 0,0,0 =
0.0, 0.1, 0.1 0.0, 0.1, 0.1

0.1, 0.4, 0.2

𝑯(!) = ReLU 𝒁 ! =
0.0, 0.1, 0.1
Estimate output:
0.50 1
B = 𝑯(!) 𝑾(") + 𝒃(") =
𝒚 𝒚= Ground truth
0.15 0 60
Example: DNN Training
Example:
Ø Step 3: back propagation to compute gradient
1
Assume loss function is mean square error (MSE) 𝒚 −𝒚)𝟐
𝐿 = G(B
𝑁
𝜕𝐿 1 0.5 − 1 −0.5
𝑑B
𝒚= B−𝒚 =G 𝒚
= ∗ 2G 𝒚 B−𝒚 =G =G = −0.35
𝜕B
𝒚 2 0.15 − 0 0.15
𝜕𝐿 𝜕𝐿 𝜕B𝒚 (%) = −0.35 ∗ 0.1, 0.4, 0.2
𝑑𝑾(") = = = 𝑑 9
𝒚 𝑯
𝜕𝑾(") 𝜕B𝒚 𝜕𝑾(") 0.0, 0.1, 0.1

𝜕𝐿 𝜕B𝒚
𝑑𝒃(") = (")
= 𝑑B
𝒚 = −0.35
𝜕B
𝒚 𝜕𝒃

61
Example: DNN Training
Example:
Ø Step 3: back propagation to compute gradient

𝜕𝐿 𝜕B𝒚 1.0
𝑑𝑯(!) = (!)
9𝑾(() = −0.35 ∗ 0.5
= 𝑑𝒚 Gradient decent to find optimal value
𝜕B
𝒚 𝜕𝑯
1.0

𝜕𝑯 (%) 𝜕𝒁(%)
𝑑𝑾(!) = 𝑑𝑯(!) =⋯ 𝑾(")
𝜕𝒁(%) 𝜕𝑾(%)
𝜕𝑯 (%) 𝜕𝒁(%)
𝑑𝒃(!) = 𝑑𝑯(!) =⋯
𝜕𝒁(%) 𝜕𝒃(%)

62
Example: DNN Training
Example:

Ø Step 4: update weights W and b using gradient decent from Step 3

𝑾(() ⟵ 𝑾 (
− 𝜆𝑑𝑾(()
Gradient decent to find optimal value
loss
𝒃(() ⟵ 𝒃 (
− 𝜆𝑑𝒃(()
Starting point

𝑾(%) ⟵ 𝑾 %
− 𝜆𝑑𝑾(%) 𝑾(")

𝒃(%) ⟵ 𝒃 %
− 𝜆𝑑𝒃(%)

parameters
Ø Step 5: repeat Steps 2, 3, and 4 until convergence
63
Summary of DNNs
DNNs aim to learn the parameters, such as weights W and bias b, by
optimizing the loss function

Ø Step 1: randomly initialize weights 𝑾 and bias 𝒃

Ø Step 2: calculate feedforward to estimate output y

Ø Step 3: back propagation to compute gradient: d𝑾 and d𝒃

Ø Step 4: update weights 𝑾 and bias 𝒃 using gradient decent

Ø Step 5: repeat Steps 2, 3, and 4 until convergence

64
Any Questions?

65
Learning Rate
If learning rate (step size) is too large, it may skip the optimal result

𝑾(*) ⟵ 𝑾 * − 𝜆𝑑𝑾(*)

𝒃(*) ⟵ 𝒃 * − 𝜆𝑑𝒃(*)
loss
Starting point

parameters

66
Learning Rate schedule

𝑾(*) ⟵ 𝑾 * − 𝜆𝑑𝑾(*)
loss
Starting point 𝒃(*) ⟵ 𝒃 * − 𝜆𝑑𝒃(*)

0.001, epoch < 5

𝑙𝑟 = V0.0001, 5 ≤ epoch ≤ 10
0.00001, epoch > 11
parameters

67
Optimization: Batch Gradient Descent
Given a training dataset of N examples
parameters
4 𝜽 = {𝑾, 𝒃}
1
𝑓(𝜽) = G 𝑓2 (𝜽)
𝑁
23!
N samples

The gradient of the objective function at W and b is

4
1
∇𝑓(𝜽) = G ∇𝑓2 (𝜽)
𝑁
23!

𝜽 ← 𝜽 − 𝜆∇𝑓(𝜽)

Computational cost for each independent variable iteration is 𝒪(𝑛)

68
Stochastic Gradient Descent (SGD)

Stochastic gradient descent: SGD randomly picks one data point from the
whole data set at each iteration to calculate the derivatives

6
1
∇𝑓(𝜽) = G ∇𝑓5 (𝜽)
𝐾
53!

𝜽 ← 𝜽 − 𝜆∇𝑓(𝜽)
N samples

Limitation: cause the objective function to fluctuate heavily

69
Batch Size: Minibatch

• What if N = 100,000 images?

Large memory and expensive computation complexity

• How to deal with the issue?

Divide data into M different partitions, each batch with N/M data samples

Example: N=100,000, M=1000,

minibatch = N/M=100 (num of input images each time) 70
Minibatch SGD

Batch = N =10

𝑥%% , 𝑥%( In feature × out feature

𝑥(% , 𝑥(( (%)
𝑤%% , 𝑤%( , 𝑤%.
𝑿= ⋮ 𝑿∈ ℝ𝑵×𝟐 𝑾 = 𝑤 ,𝑤 ,𝑤 𝑾(%) ∈ ℝ𝟐×𝟑
(% (( (.
𝑥1% , 𝑥1(

𝒁(%) = 𝑿𝑾 % + 𝒃(%) 𝑯(%) = 𝝈(𝒁(%) ) 𝑯(%) ∈ ℝ𝑵×𝟑

𝒁(%) 𝑵×𝟑
∈ℝ 𝒃 (%) 𝟏×𝟑
∈ℝ 𝒀 = 𝑯(%) 𝑾 (
+ 𝒃(() 𝑾(() ∈ ℝ𝟑×𝟐 𝒀 ∈ ℝ𝑵×𝟐
71
Minibatch SGD

Uniformly sample a subset of data from Minibatch

Example: minibatch = N/M=100

6
1
∇𝑓(𝜽) = G ∇𝑓5 (𝜽)
𝑁
53!

𝜽 ← 𝜽 − 𝜆∇𝑓(𝜽)

N samples

72
Weights Initialization
Ø Xavier Initialization
Assume fully connected layer without activation function. Given layer feature 𝑥! with the associated weights
𝑤7,2 , then the output 𝑜7
1&'

𝑜4 = R 𝑤4,3 𝑥3 𝑤4,3 ~𝑁(0, 𝜎 ( )

35%

Assume layer feature 𝑥! have zero mean and variance 𝛾 " , 𝑥3 ~𝑁(0, 𝛾 " )
6

𝐸 𝑜4 = 𝐸 R 𝑤4,3 𝑥3 = 0
35%

73
Weights Initialization
Ø Xavier Initialization
Variance of output features:
" "
𝑉𝑎𝑟 𝑜7 = 𝐸 𝑜7" − 𝐸 𝑜7 𝑉𝑎𝑟 𝑜79! = 𝐸 "
𝑜79! − 𝐸 𝑜79!

4 " " 4 "

= ∑23!
"#
𝐸 𝑤7,2 𝑥2 = ∑23!
$%&
𝐸 𝑤79!,2 𝑥2"

= 𝑁28 𝜎 " 𝛾 " = 𝑁:;< 𝜎 " 𝛾 "

Since layer 𝑜2 has the same variance 𝛾 " as layer 𝑥7 , so we have

𝑁28 𝜎 " 𝛾 " = 𝛾 " 2

] 𝜎=
𝑁:;< 𝜎 " 𝛾 " = 𝛾 " 𝑁28 + 𝑁:;<

Where 𝑁() , 𝑁*+, is number of input/output features

74
Underfitting and Overfitting
§ Overfitting occurs when a statistical model fits exactly against a few training data.
§ Underfitting: when model is too simple but too many data, both training and test errors are large

• Increase the • Reduce the model

size (depth, width,
model size
sharing weights)
(depth, width, • Regularization
sharing weights) (weight decay)
• Ensemble • Dropout
• Train longer • Perturb data (data
• Lift dimension augmentation)
• Data selection
• Early stopping

75
Dropout: Prevent Overfitting
Dropout refers to dropping out units (neurons) in a neural network. It is used to
prevent Overfitting problem.

0, 𝑝
dropout 𝑝 = c
1, 1 − 𝑝

(a) Standard Neural Nets (b) After Dropout 76

Dropout vs. Ensemble
Ø Create Decision Tree based on Bootstrapped dataset
Subset dataset
age income student buy_computer
<=30 high no no
<=30 high no no
31…40 high no yes
>40 medium no yes
>40 low yes yes
>40 low yes no
31…40 low yes yes
<=30 medium no no
<=30 low yes yes
>40 medium yes yes
<=30 medium yes yes
31…40 medium no yes
31…40 high yes yes
>40 medium no no

77
Dropout vs. Ensemble
Ø Ensemble that combines several base models in order to produce one optimal predictive
model

Covid-19 Covid-19 No Covid-19

Majority Voting
78
Batch Normalization (BN)
Batch normalization is a method that normalizes activations in a network across the
mini-batch. For each feature, batch normalization computes the mean and variance of
that feature in the mini-batch

𝑋3 − 𝑚𝑒𝑎𝑛 𝑋3 − 𝜇 𝑋3 − 𝜇
𝑋8 = = ≅
𝑠𝑡𝑑 𝜎 𝜎( + 𝜖 Avoid 0

Re-scale the values

Feature 1 Feature 2 Feature 3
𝐵𝑁 𝑋 = 𝛾𝑋8 + 𝛽 Data
(X1) (X2) (X3)

Image 1 1 2 5

Image 2 4 1 3
79
Batch Normalization (BN)
q Batch normalization is a method that normalizes activations in a network across
the mini-batch.
q For each feature, batch normalization computes the mean and variance of that
feature in the mini-batch

80
Advantages of Batch
Normalization (BN)

q Speed Up the Training

B
q Handles internal covariate shift N
q Make the training stable

81
Layer Normalization
Layer normalization normalizes the activations along the feature direction instead of mini-
batch direction. Namely, normalize input across the features for each input data sample.

𝑋3 − 𝑚𝑒𝑎𝑛 𝑋3 − 𝜇 𝑋3 − 𝜇
𝑋8 = = ≅
𝑠𝑡𝑑 𝜎 𝜎( + 𝜖
Re-scale the values

𝐵𝑁 𝑋 = 𝛾𝑋8 + 𝛽 Feature 1 Feature 2 Feature 3

Data
(X1) (X2) (X3)

Image 1 1 2 5

Image 2 4 1 3
82
Layer Normalization

Advantages:
q Remove the dependency on batches of data samples
q Make it easier to apply for Neural Networks

83
Other normalization layers
Early Stopping
Early stopping is an optimization technique used to reduce overfitting without
compromising on model accuracy

loss

0 10 20 30 epochs
85
Any Questions?

CHAPTER 2-2 Graphical Method of Solving Linear Programming Problems
No ratings yet
CHAPTER 2-2 Graphical Method of Solving Linear Programming Problems
14 pages
Kagan Lecture2
No ratings yet
Kagan Lecture2
118 pages
MachineLearningSlides PartOne
No ratings yet
MachineLearningSlides PartOne
252 pages
cs188 sp23 Note25
No ratings yet
cs188 sp23 Note25
8 pages
ML_MU_Unit_5NeuralNetworkpdf__2025_04_16_13_47_39
No ratings yet
ML_MU_Unit_5NeuralNetworkpdf__2025_04_16_13_47_39
57 pages
Chapter 2 - 2 Shallow neural network 2_2
No ratings yet
Chapter 2 - 2 Shallow neural network 2_2
34 pages
Week 2 Artificial Neural Networks
No ratings yet
Week 2 Artificial Neural Networks
62 pages
Hacking Neural Networks: A Short Introduction
No ratings yet
Hacking Neural Networks: A Short Introduction
50 pages
2-Mathematical Optimization and Deep Learning
No ratings yet
2-Mathematical Optimization and Deep Learning
53 pages
Advanced Information Retreival: Chapter 02: Modeling - Neural Network Model
No ratings yet
Advanced Information Retreival: Chapter 02: Modeling - Neural Network Model
31 pages
ANN-Unit 6 - Deep Neural Networks
No ratings yet
ANN-Unit 6 - Deep Neural Networks
29 pages
AML 03 Dense Neural Networks
No ratings yet
AML 03 Dense Neural Networks
20 pages
Learning With Linear Neurons: Adapted From Lectures by Geoffrey Hinton and Others Updated by N. Intrator, May 2007
No ratings yet
Learning With Linear Neurons: Adapted From Lectures by Geoffrey Hinton and Others Updated by N. Intrator, May 2007
59 pages
Artificial Neural Networks
No ratings yet
Artificial Neural Networks
100 pages
Lesson 3 Artificial Neural Network
No ratings yet
Lesson 3 Artificial Neural Network
77 pages
Unit-5
No ratings yet
Unit-5
59 pages
UNIT II DNN
No ratings yet
UNIT II DNN
24 pages
Neural-Network(Basics)
No ratings yet
Neural-Network(Basics)
48 pages
Machine Learning and Pattern Recognition Week 8 Neural Net Intro
No ratings yet
Machine Learning and Pattern Recognition Week 8 Neural Net Intro
3 pages
Machine Learning with Artificial Neural Networks
No ratings yet
Machine Learning with Artificial Neural Networks
6 pages
Neural Networks: Artificial Intelligence: Representation and Problem Solving
No ratings yet
Neural Networks: Artificial Intelligence: Representation and Problem Solving
19 pages
CS460 - Deep Learning - W02 & W03
No ratings yet
CS460 - Deep Learning - W02 & W03
44 pages
Week 6 - Lab
No ratings yet
Week 6 - Lab
5 pages
Deep Learning
No ratings yet
Deep Learning
78 pages
ST M Hdstat RNN Deep Learning
No ratings yet
ST M Hdstat RNN Deep Learning
17 pages
3 ArtificialNeuralNetworks PDF
No ratings yet
3 ArtificialNeuralNetworks PDF
77 pages
2. Neural Network Training
No ratings yet
2. Neural Network Training
73 pages
IBest_DeepLearning
No ratings yet
IBest_DeepLearning
123 pages
Artificial Neural Networks
No ratings yet
Artificial Neural Networks
80 pages
UNIT-I.pptx
No ratings yet
UNIT-I.pptx
90 pages
NN unit_1
No ratings yet
NN unit_1
27 pages
26 Neural Nets
No ratings yet
26 Neural Nets
77 pages
Neural Network Presentation
No ratings yet
Neural Network Presentation
33 pages
Building Your Deep Neural Network - Step by Step v8 PDF
No ratings yet
Building Your Deep Neural Network - Step by Step v8 PDF
44 pages
05 ANN Artificial Neural Networks
No ratings yet
05 ANN Artificial Neural Networks
216 pages
Neural Networks in Healthcare Lecture 2_021808
No ratings yet
Neural Networks in Healthcare Lecture 2_021808
73 pages
Slides 11
No ratings yet
Slides 11
48 pages
Reviewer
No ratings yet
Reviewer
7 pages
Chapter 9
No ratings yet
Chapter 9
73 pages
Module1 ECO-598 AI & ML Aug 21
No ratings yet
Module1 ECO-598 AI & ML Aug 21
45 pages
Lesson 7.0 Supervised Learning With Neural Networks (1)
No ratings yet
Lesson 7.0 Supervised Learning With Neural Networks (1)
22 pages
Deep Learning PDF
100% (1)
Deep Learning PDF
87 pages
2501.10465v1
No ratings yet
2501.10465v1
10 pages
Lec10 Handout
No ratings yet
Lec10 Handout
41 pages
Bai 1 Eng
No ratings yet
Bai 1 Eng
10 pages
Data Mining and Machine Learning: Fundamental Concepts and Algorithms
No ratings yet
Data Mining and Machine Learning: Fundamental Concepts and Algorithms
59 pages
AIDL03 EvolutionOfAI
No ratings yet
AIDL03 EvolutionOfAI
22 pages
Competitive Applications
No ratings yet
Competitive Applications
16 pages
Lecture20 Backprop
No ratings yet
Lecture20 Backprop
77 pages
CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2015
No ratings yet
CS 224D: Deep Learning For NLP: Lecture Notes: Part III Spring 2015
14 pages
1.1 Introduction
No ratings yet
1.1 Introduction
73 pages
Deep Learning (1)
No ratings yet
Deep Learning (1)
19 pages
Neural - Networks
No ratings yet
Neural - Networks
47 pages
10 nn1
No ratings yet
10 nn1
162 pages
Lecture NN 2005
No ratings yet
Lecture NN 2005
137 pages
465-Lecture 2-4
No ratings yet
465-Lecture 2-4
43 pages
cst414- Deep learning
No ratings yet
cst414- Deep learning
34 pages
Unit 2.1
No ratings yet
Unit 2.1
37 pages
50 Python Concepts Every Developer Should Know
From Everand
50 Python Concepts Every Developer Should Know
Hernando Abella
No ratings yet
Neural Networks
From Everand
Neural Networks
Sasha Kurzweil
No ratings yet
Bio Inspired Computing: Fundamentals and Applications for Biological Inspiration in the Digital World
From Everand
Bio Inspired Computing: Fundamentals and Applications for Biological Inspiration in the Digital World
Fouad Sabry
No ratings yet
Sri Sharada Vidyalaya, Budigere Summative Assessment - 01 2024-25 Class: 9 Sub: Mathematics Date:09-10-2024 Name: I. State True or False. 5
No ratings yet
Sri Sharada Vidyalaya, Budigere Summative Assessment - 01 2024-25 Class: 9 Sub: Mathematics Date:09-10-2024 Name: I. State True or False. 5
132 pages
2
No ratings yet
2
36 pages
12XM ARMCQ - Linear Programming CBSE
No ratings yet
12XM ARMCQ - Linear Programming CBSE
7 pages
Solved Paper 2022 Sem 1 (1) Maths BQ
No ratings yet
Solved Paper 2022 Sem 1 (1) Maths BQ
6 pages
Facility Planning Solution HK2!22!23.Docx
No ratings yet
Facility Planning Solution HK2!22!23.Docx
7 pages
Numerical Methods For Unconstrained Optimization and Nonlinear Equations
No ratings yet
Numerical Methods For Unconstrained Optimization and Nonlinear Equations
15 pages
ML Lesson Plan
No ratings yet
ML Lesson Plan
4 pages
Quantum Calculus (Victor Kac, Pokman Cheung) (Z Library) 1 122 Pages Pages 97
No ratings yet
Quantum Calculus (Victor Kac, Pokman Cheung) (Z Library) 1 122 Pages Pages 97
1 page
All Pairs of Shortest Paths
No ratings yet
All Pairs of Shortest Paths
18 pages
Math 10 Test Questions
No ratings yet
Math 10 Test Questions
3 pages
Lecture 6 - Improving Local Search
No ratings yet
Lecture 6 - Improving Local Search
28 pages
Module 5 Bisection Method of Solving A Nonlinear Equation
No ratings yet
Module 5 Bisection Method of Solving A Nonlinear Equation
8 pages
8
No ratings yet
8
17 pages
RD Sharma Solutions For Class 12 Maths
No ratings yet
RD Sharma Solutions For Class 12 Maths
10 pages
Model Predictive Control
No ratings yet
Model Predictive Control
12 pages
Data Structure and Algorithm Analysis: Mid Term Examination 2020
No ratings yet
Data Structure and Algorithm Analysis: Mid Term Examination 2020
13 pages
Practice Class 1: Systems of Linear Equations
No ratings yet
Practice Class 1: Systems of Linear Equations
2 pages
AT2 - 2203 - Prac - Solution
No ratings yet
AT2 - 2203 - Prac - Solution
6 pages
Online Instructions For Chapter 3: Decrease-And-Conquer: Algorithms Analysis and Design (CO3031)
No ratings yet
Online Instructions For Chapter 3: Decrease-And-Conquer: Algorithms Analysis and Design (CO3031)
7 pages
Transportation and Assignment Problem
No ratings yet
Transportation and Assignment Problem
84 pages
Factor I Sing Exercises 132
No ratings yet
Factor I Sing Exercises 132
4 pages
1d7656c1-07ac-40b0-820b-f1d2b9cd3065
No ratings yet
1d7656c1-07ac-40b0-820b-f1d2b9cd3065
2 pages
Gauss Alimination Method New
No ratings yet
Gauss Alimination Method New
12 pages
Algebraic Expressions M
No ratings yet
Algebraic Expressions M
52 pages
DS
No ratings yet
DS
4 pages
WATER JUG,DFS,BFS - Jupyter Notebook
No ratings yet
WATER JUG,DFS,BFS - Jupyter Notebook
3 pages
Rohini 50026617495
No ratings yet
Rohini 50026617495
6 pages
Predictive Analytics Notes
No ratings yet
Predictive Analytics Notes
42 pages
TM270-2024-2025-1-TMA
No ratings yet
TM270-2024-2025-1-TMA
6 pages