Lecture 4
The document outlines the training process of Feedforward Neural Networks (FFNN), focusing on forward pass, loss calculation, and backpropagation. It discusses various optimization techniques, including gradient descent and its variants (batch, mini-batch, and stochastic). Additionally, it highlights challenges in mini-batch gradient descent and introduces adaptive learning rates and alternative optimizers like ADAM.


FFNN training

Prepared by: Dr. Doaa Gamal

Assistant Professor, Faculty of Engineering, Suez Canal University
([email protected])
Lecture outline
 Last time we talked about:
   Multi-layer Perceptron (MLP, or FFNN)
   FFNN training (gradient descent)
 Today we are going to talk about:
   FFNN training
   Backpropagation
   Different loss optimizers
TRAINING FFNN
Regression Example
• The following data is a large regression dataset for predicting the savings of workers given their income and the minimum wage in their countries.
• It is required to design an FFNN to fit this problem.
Regression Example (forward direction)
For the first sample:
  Predicted: ŷ(1) = f(x(1); W)
  Actual: y(1)
Regression Example (forward direction)
ŷ(i) = f(x(i); W) is the predicted output for the i-th input sample.
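A minimal NumPy sketch of this forward computation, assuming a hypothetical 2-3-1 architecture (inputs: income and minimum wage; ReLU hidden units; a linear output for the predicted savings) and made-up weights and sample values:

```python
import numpy as np

# Hypothetical architecture: 2 inputs (income, minimum wage), 3 hidden ReLU units,
# 1 linear output (predicted savings). The real network in the slides may differ.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # hidden-layer weights and biases
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output-layer weights and biases

def forward(x):
    """Forward pass: y_hat = f(x; W) for one sample x = [income, min_wage]."""
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer with ReLU activation
    return W2 @ h + b2                 # linear output for regression

x1 = np.array([2500.0, 400.0])         # a made-up first sample x(1)
print("predicted y_hat(1):", forward(x1))
```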
Loss optimization

The gradient of a scalar function f of several variables is the vector ∇f whose value at a point p gives the direction and rate of fastest increase.
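As an illustration (with a made-up function, not one from the lecture), the gradient can be estimated numerically by central differences and compared with the analytic result:

```python
import numpy as np

def f(p):
    """A made-up scalar function of two variables: f(x, y) = x**2 + y**2."""
    x, y = p
    return x**2 + y**2

def numerical_gradient(f, p, eps=1e-6):
    """Central-difference estimate of the gradient of f at point p."""
    grad = np.zeros_like(p)
    for i in range(len(p)):
        step = np.zeros_like(p)
        step[i] = eps
        grad[i] = (f(p + step) - f(p - step)) / (2 * eps)
    return grad

p = np.array([1.0, 2.0])
print(numerical_gradient(f, p))   # ~[2., 4.], matching the analytic gradient (2x, 2y)
# A small move along +grad increases f fastest; a move along -grad decreases it fastest.
```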
Gradient Descent Optimizer

 However, how do we calculate the gradients ∂J(W)/∂W?
BACKPROPAGATION
Computing gradients (Backpropagation)
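As a sketch of how backpropagation applies the chain rule here (assuming, purely for illustration, a squared-error loss J = ½(ŷ − y)² and an output unit ŷ = f(z) with z = Σⱼ wⱼ hⱼ + b, where the hⱼ are the hidden-layer activations):

∂J/∂wⱼ = (∂J/∂ŷ) · (∂ŷ/∂z) · (∂z/∂wⱼ) = (ŷ − y) · f′(z) · hⱼ

For a hidden-layer weight, the same rule is applied one step further back, multiplying in addition by the path from the hidden unit to its own inputs (its activation derivative and the corresponding input value).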
Training FFNN

 Forward Pass: every data point (or a batch of data points) in the training dataset is passed through the model, resulting in a prediction.
 Calculation of Loss: the predictions from the forward pass are compared to the actual targets using a loss function.
 Backward Pass (Backpropagation): the calculated error is then propagated backward through the model. During this phase, gradients (derivatives) of the loss with respect to the model parameters are computed.
 Optimization: an optimizer then adjusts the model parameters, based on the calculated gradients, in a direction that minimizes the loss.
 Repeat: the previous steps are repeated for the entire dataset. Once the dataset is entirely processed, one epoch is completed. This process is then repeated for the desired number of epochs (see the sketch below).
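A minimal, runnable sketch of this loop, using a toy linear model and mean-squared-error loss so that the gradient fits in one line; the FFNN case replaces step 3 with backpropagation, as in the XOR example below:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 100 samples, 2 features, a noisy linear target (made up for illustration).
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=100)

W = np.zeros(2)             # model parameters
lr, n_epochs = 0.1, 50      # learning rate and number of epochs

for epoch in range(n_epochs):
    y_pred = X @ W                          # 1) forward pass
    loss = np.mean((y_pred - y) ** 2)       # 2) loss calculation (MSE)
    grad = 2 * X.T @ (y_pred - y) / len(X)  # 3) backward pass: dJ/dW
    W = W - lr * grad                       # 4) optimizer step (gradient descent)
    # 5) repeat: each pass over the full dataset is one epoch

print(W)   # close to [3, -2]
```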
TRAINING EXAMPLE
Backpropagation example (XOR)
𝒙𝟏   𝒙𝟐   target
0    0    0
0    1    1
1    0    1
1    1    0
Backpropagation example (random initialization)
Backpropagation example (Feedforward)

Backpropagation example (loss calculation)

n: number of samples; in this example n = 1.
Backpropagation example (backward)

Gradient descent:
W_new = W_old − η ΔW, where ΔW = ∂J(W)/∂W
Backpropagation example (weight update)
Gradient descent:
W_new = W_old − η ΔW, where ΔW = ∂J(W)/∂W
Backpropagation example
Iteration 2:
• Repeat the above steps using the second sample (0, 1), with desired output = 1 (see the sketch below).
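A runnable sketch of the whole example, assuming a 2-2-1 network with sigmoid activations, a squared-error loss, random initial weights, and a learning rate of 0.5; the specific initial values and learning rate are assumptions, not the values used on the slides:

```python
import numpy as np

# XOR dataset from the table above.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.uniform(-1, 1, size=(2, 2)), np.zeros(2)   # hidden layer (2 units)
W2, b2 = rng.uniform(-1, 1, size=2), 0.0                # output layer (1 unit)
eta = 0.5                                               # learning rate (assumed)

for epoch in range(10000):
    for x, t in zip(X, T):                   # one sample per iteration, as in the slides
        # forward pass
        h = sigmoid(W1 @ x + b1)
        y = sigmoid(W2 @ h + b2)
        # loss J = 1/2 (y - t)^2; backward pass via the chain rule
        delta_out = (y - t) * y * (1 - y)            # dJ/dz at the output unit
        delta_hid = delta_out * W2 * h * (1 - h)     # dJ/dz at the hidden units
        # gradient-descent updates: W_new = W_old - eta * dJ/dW
        W2 -= eta * delta_out * h;  b2 -= eta * delta_out
        W1 -= eta * np.outer(delta_hid, x);  b1 -= eta * delta_hid

preds = [sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2) for x in X]
print(np.round(preds, 2))   # usually ~[0, 1, 1, 0]; XOR can occasionally stall in a local minimum
```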
Backpropagation example (final result)
OPTIMIZATION: GRADIENT DESCENT ALTERNATIVES
Gradient descent
Gradient Descent (GD) is the classic optimization algorithm. It aims to minimize the loss function by updating each parameter in the direction of the negative gradient.
Gradient descent variants
 There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function:
  i. Batch gradient descent
  ii. Mini-batch gradient descent
  iii. Stochastic gradient descent
Gradient descent variants
 A training dataset can be divided into one or more batches; the batch size alone distinguishes the three variants (see the sketch below):
  Stochastic Gradient Descent: batch size = 1
  Batch Gradient Descent: batch size = size of the training set
  Mini-Batch Gradient Descent: 1 < batch size < size of the training set
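A small sketch of this distinction; the helper iterate_batches and the random data are illustrative, not part of the lecture:

```python
import numpy as np

def iterate_batches(X, y, batch_size, shuffle=True, seed=0):
    """Yield (X_batch, y_batch) pairs; the batch size alone selects the GD variant."""
    idx = np.arange(len(X))
    if shuffle:
        np.random.default_rng(seed).shuffle(idx)
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

X = np.random.default_rng(0).normal(size=(1000, 5))
y = np.random.default_rng(1).normal(size=1000)

sgd_batches  = iterate_batches(X, y, batch_size=1)        # stochastic GD: one sample per update
full_batch   = iterate_batches(X, y, batch_size=len(X))   # batch GD: the whole set per update
mini_batches = iterate_batches(X, y, batch_size=64)       # mini-batch GD: something in between

xb, yb = next(mini_batches)
print(xb.shape, yb.shape)   # (64, 5) (64,)
```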
Stochastic gradient descent
 Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example x(i) and label y(i), i.e. n = 1.
 It is therefore usually much faster and can also be used to learn online.
 SGD performs frequent updates with a high variance that causes the objective function to fluctuate heavily.
Batch gradient descent
 Batch gradient descent computes the gradient of the cost function w.r.t. the parameters θ (weights and biases) over the whole training dataset to perform just one update, i.e. n = dataset size.
 Batch gradient descent is therefore very slow and intractable for datasets that don't fit in memory. It also doesn't allow us to update the model online, i.e. with new examples on-the-fly.
Mini-batch gradient descent
 Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of n training examples, e.g. n = 100.
Mini-batch gradient descent
Mini-batch gradient descent:
a) reduces the variance of the parameter updates, which can lead to more stable convergence;
b) can make use of highly optimized matrix operations;
c) commonly uses mini-batch sizes between 50 and 256, though these can vary for different applications.
Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually also employed when mini-batches are used.
Challenges of mini-batch gradient descent

 Choosing a proper learning rate can be difficult.
 Learning rate schedules try to adjust the learning rate during training by reducing it according to a pre-defined schedule or when the change in objective between epochs falls below a threshold (a sketch of both is given below).
 Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
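Two illustrative sketches of these two ideas, a pre-defined step decay and a reduce-on-plateau rule; the function names and default values are assumptions rather than any particular library's API:

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Pre-defined schedule: halve the learning rate every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

def reduce_on_plateau(lr, prev_loss, curr_loss, threshold=1e-4, factor=0.5):
    """Reduce the learning rate when the improvement in the objective falls below a threshold."""
    if prev_loss - curr_loss < threshold:
        return lr * factor
    return lr

for epoch in [0, 5, 10, 25]:
    print(epoch, step_decay(0.1, epoch))   # 0.1, 0.1, 0.05, 0.025
```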
Adaptive learning rate
Gradient descent: alternative optimizers
Gradient descent optimization algorithms
 ADAM optimizer
 SGD
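A minimal sketch of one ADAM update, following the standard published update rule; the quadratic test objective is only a quick sanity check:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update: adapt the step size per parameter from running moment estimates."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias corrections for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Quick sanity check on a toy objective J(theta) = theta1**2 + theta2**2.
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    grad = 2 * theta                            # dJ/dtheta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
print(theta)                                    # both components end up close to 0
```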
Thank You
