Gradient Descent Method
In deep learning, we have the concept of loss, which tells us how poorly the model is per-
forming at that current instant. Now we need to use this loss to train our network such that
it performs better. Essentially what we need to do is to take the loss and try to minimize it,
because a lower loss means our model is going to perform better. The process of minimiz-
ing (or maximizing) any mathematical expression is called optimization.
Optimizers are algorithms or methods used to change the attributes of the neural network
such as weights and learning rate to reduce the losses. Optimizers are used to solve op-
timization problems by minimizing the function.
Below are the different types of optimizers and how exactly they work to minimize the loss function:
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSprop
9. Adam
1. Gradient Descent
Gradient descent is an optimization algorithm that's used when training a machine learning
model. It's based on a convex function and tweaks its parameters iteratively to minimize a
given function to its local minimum.
You start by defining the initial parameter's values and from there gradient descent uses
calculus to iteratively adjust the values so they minimize the given cost-function.
The weight is initialized using some initialization strategies and is updated with each epoch
according to the update equation.
W_new = W_old − η · (dL/dW_old)
or, written with an iteration index,
w(t) = w(t−1) − η · (dL/dw(t−1))
where
η = learning rate
dL/dW(t−1) = gradient of the loss w.r.t. the weights from the previous iteration
The above equation computes the gradient of the cost function L w.r.t. the parameters/
weights W for the entire training dataset.
Our aim is to get to the bottom of our graph (cost vs. weights), or to a point where we can
no longer move downhill: a local minimum.
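As a rough illustration, here is a minimal sketch of this update rule in Python on a toy one-parameter linear-regression loss (the data, learning rate, and number of epochs are illustrative assumptions, not part of the original text):

import numpy as np

# Toy data: single-feature linear regression with mean-squared-error loss.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

eta = 0.01   # learning rate
W = 0.0      # weight, initialised with a simple strategy (zero)

for epoch in range(100):
    y_pred = W * X
    grad = np.mean(2 * (y_pred - y) * X)   # dL/dW for L = mean((y_pred - y)^2)
    W = W - eta * grad                     # W_new = W_old - eta * dL/dW_old

print(W)   # approaches 2.0, the slope that minimises the loss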
Importance of Learning rate
How big the steps are that gradient descent takes in the direction of the local minimum is
determined by the learning rate, which determines how fast or slow we move towards the
optimal weights.
For gradient descent to reach the local minimum, we must set the learning rate to an ap-
propriate value, which is neither too low nor too high. This is important because if the steps
it takes are too big, it may never reach the local minimum, bouncing back and forth across
the valley of the convex function (see the left image below). If we set the learning rate to a
very small value, gradient descent will eventually reach the local minimum, but that may
take a while (see the right image).
So, the learning rate should never be too high or too low for this reason. You can check
whether your learning rate is doing well by plotting the cost against the number of iterations.
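To make the effect concrete, here is a small sketch (on the toy loss L(w) = w^2, an assumption made only for illustration) showing how different learning rates behave:

def gradient_descent(eta, steps=50, w0=5.0):
    # Minimise L(w) = w**2, whose gradient is dL/dw = 2*w.
    w = w0
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(gradient_descent(eta=0.01))   # too low: moves towards 0 but is still far away
print(gradient_descent(eta=0.1))    # appropriate: ends very close to the minimum at 0
print(gradient_descent(eta=1.1))    # too high: bounces back and forth and diverges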
Advantages:
1. Easy computation.
2. Easy to implement.
3. Easy to understand.
Disadvantages:
1. May get stuck at local minima.
2. Weights are updated only after the gradient is computed over the whole dataset, so convergence can be very slow for large datasets.
3. Requires a lot of memory to load the entire dataset at once to compute the gradient.
2. Stochastic Gradient Descent (SGD)
The SGD algorithm is an extension of the Gradient Descent algorithm and it overcomes some
of the disadvantages of GD. Gradient Descent has the disadvantage that it requires a
lot of memory to load the entire dataset of n points at a time to compute the derivative of
the loss function. In the SGD algorithm, the derivative is computed taking one point at a
time.
SGD performs a parameter update for each training example x(i) and label y(i):
w = w − η · ∇L(w; x(i), y(i))
where {x(i), y(i)} are the training examples.
1. On the left, we have Stochastic Gradient Descent (where m = 1 per step): we take a
gradient descent step for each example. On the right is Gradient Descent (1 step per
entire training set).
2. SGD seems to be quite noisy; at the same time it is much faster, but it may not con-
verge to a minimum.
3. Typically, to get the best of both worlds we use Mini-batch Gradient Descent
(MGD), which looks at a smaller number of training-set examples at once
(usually a power of 2, e.g. 2^6 = 64).
4. Mini-batch Gradient Descent is relatively more stable than Stochastic Gradient De-
scent (SGD) but does have oscillations, as gradient steps are taken in the direction
of a sample of the training set and not the entire set as in batch GD.
It is observed that in SGD the updates take more iterations than gradient descent to reach
the minima. On the right, Gradient Descent takes fewer steps to reach the minima, but the
SGD algorithm is noisier and takes more iterations.
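A minimal sketch of the per-example SGD update on a toy linear-regression problem (the data shapes, learning rate, and epoch count here are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

eta = 0.05
w = np.zeros(3)

for epoch in range(20):
    for i in rng.permutation(len(X)):          # shuffle, then visit one example at a time
        grad = 2 * (X[i] @ w - y[i]) * X[i]    # gradient of the single-example squared error
        w = w - eta * grad                     # parameter update after every single example

print(w)   # the path is noisy, but w ends up close to true_w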
Advantage:
1. Memory requirement is less compared to the GD algorithm, as the derivative is computed taking only one point at a time.
3. Mini Batch Stochastic Gradient Descent (MB-SGD)
The MB-SGD algorithm is an extension of the SGD algorithm and it overcomes the problem of
large time complexity in the case of the SGD algorithm. The MB-SGD algorithm takes a batch
of points, i.e. a subset of the dataset, to compute the derivative.
It is observed that the derivative of the loss function for MB-SGD is almost the same as the
derivative of the loss function for GD after some number of iterations. But the number of iter-
ations to achieve the minima is larger for MB-SGD compared to GD, and the cost of computa-
tion is also larger.
The update of the weights depends on the derivative of the loss for a batch of points. The up-
dates in the case of MB-SGD are noisier because the derivative does not always point towards
the minima.
MB-SGD divides the dataset into various batches, and after every batch the parameters
are updated.
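A minimal sketch of the mini-batch update described above (batch size, data, and learning rate are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

eta, batch_size = 0.05, 64    # batch size chosen as a power of 2 (2**6)
w = np.zeros(3)

for epoch in range(20):
    idx = rng.permutation(len(X))             # shuffle the dataset each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # average gradient over the batch
        w = w - eta * grad                             # parameters updated after every batch

print(w)   # close to true_w, with less noise per step than one-example SGD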
Advantages:
1. Less time taken to converge compared to the SGD algorithm.
2. Requires less memory than GD, since only a batch of points is loaded at a time.
Disadvantages:
1. The updates of MB-SGD are noisier than the updates of the GD algorithm.
2. It takes longer to converge than the GD algorithm.
3. It may get stuck at local minima.
4. SGD with momentum
A major disadvantage of the MB-SGD algorithm is that the updates of the weights are very noisy.
SGD with momentum overcomes this disadvantage by denoising the gradients. The updates of
the weights depend on a noisy derivative, and if we somehow denoise the derivatives then the
convergence time will decrease.
The idea is to denoise the derivative using an exponentially weighted average, that is, to give
more weight to recent updates compared to previous updates.
It accelerates the convergence towards the relevant direction and reduces the fluctuation
in the irrelevant direction. One more hyperparameter is used in this method, known as mo-
mentum, symbolized by ‘γ’.
Momentum at time ‘t’ is computed using all previous updates, giving more weight to re-
cent updates compared to previous ones. This speeds up convergence.
Essentially, when using momentum, we push a ball down a hill. The ball accumulates mo-
mentum as it rolls downhill, becoming faster and faster on the way (until it reaches its ter-
minal velocity if there is air resistance, i.e. γ<1). The same thing happens to our parameter
updates: The momentum term increases for dimensions whose gradients point in the
same directions and reduces updates for dimensions whose gradients change directions.
As a result, we gain faster convergence and reduced oscillation.
The diagram above shows that SGD with momentum denoises the gradients and con-
verges faster compared to SGD.
update(t) = γ · update(t−1) + η · ∇w(t)
w(t+1) = w(t) − update(t)
where γ is the momentum term, η is the learning rate, and ∇w(t) is the gradient of the loss at the current weights.
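As a sketch of these two equations (the quadratic loss L(w) = sum(w^2), γ = 0.9, η = 0.1, and the step count are illustrative assumptions):

import numpy as np

def sgd_momentum(grad_fn, w, eta=0.1, gamma=0.9, steps=100):
    # update(t) = gamma * update(t-1) + eta * grad(w(t));  w(t+1) = w(t) - update(t)
    update = np.zeros_like(w)
    for _ in range(steps):
        grad = grad_fn(w)
        update = gamma * update + eta * grad   # exponentially weighted average of gradients
        w = w - update
    return w

# Example on L(w) = sum(w**2), whose gradient is 2*w.
w_final = sgd_momentum(lambda w: 2 * w, w=np.array([5.0, -3.0]))
print(w_final)   # close to the minimum at [0, 0]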
Advantages:
1. Faster convergence than SGD/MB-SGD.
2. Reduced oscillation, since the momentum term dampens updates for dimensions whose gradients change direction.
5. Nesterov Accelerated Gradient (NAG)
The idea of the NAG algorithm is very similar to SGD with momentum, with a slight variation.
In the case of the SGD with momentum algorithm, the momentum and gradient are com-
puted on the previously updated weight.
As we can see, in the momentum-based gradient, the steps become larger and larger due to
the accumulated momentum, and then we overshoot at the 4th step. We then have to take
steps in the opposite direction to reach the minimum point.
However, the update in NAG happens in two steps. First, a partial step to reach the look-
ahead point, and then the final update. We calculate the gradient at the look-ahead point
and then use it to calculate the final update. If the gradient at the look-ahead point is nega-
tive, our final update will be smaller than that of a regular momentum-based gradient. Like
in the above example, the updates of NAG are similar to that of the momentum-based gra-
dient for the first three steps because the gradient at that point and the look-ahead point
are positive. But at step 4, the gradient of the look-ahead point is negative.
In NAG, the first partial update 4a will be used to go to the look-ahead point and then the
gradient will be calculated at that point without updating the parameters. Since the gradient
at step 4b is negative, the overall update will be smaller than the momentum-based gradi-
ent descent.
We can see in the above example that the momentum-based gradient descent takes six
steps to reach the minimum point, while NAG takes only five steps.
This looking ahead helps NAG to converge to the minimum points in fewer steps and re-
duce the chances of overshooting.
The NAG update can be written in the same notation, but in two steps. First, a partial step using only the accumulated momentum gives the look-ahead point:
w_lookahead = w(t) − γ · update(t−1)
The gradient is then evaluated at this look-ahead point and combined with the previous updates (this is how the gradient of all the previous updates is added to the current update):
update(t) = γ · update(t−1) + η · ∇w_lookahead
Finally, the weight (W) is updated in each iteration:
w(t+1) = w(t) − update(t)
Here η is the learning rate, γ is the momentum term, and ∇w_lookahead is the gradient at the look-ahead point. Using this look-ahead gradient in the update prevents overshooting.
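A sketch of the NAG update in the same style as the momentum sketch above (again, the quadratic loss and the hyperparameters are illustrative assumptions):

import numpy as np

def nag(grad_fn, w, eta=0.1, gamma=0.9, steps=100):
    update = np.zeros_like(w)
    for _ in range(steps):
        w_lookahead = w - gamma * update   # partial step using only the momentum
        grad = grad_fn(w_lookahead)        # gradient evaluated at the look-ahead point
        update = gamma * update + eta * grad
        w = w - update                     # final update
    return w

w_final = nag(lambda w: 2 * w, w=np.array([5.0, -3.0]))
print(w_final)   # reaches the minimum at [0, 0] with less overshoot than plain momentum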
6. Adaptive Gradient (AdaGrad)
Recall that the plain SGD update has the form
w(t) = w(t−1) − η · dL/dw(t−1)
where w(t) = value of w at the current iteration, w(t−1) = value of w at the previous iteration, and
η = learning rate.
In SGD and mini-batch SGD, the value of η used to be the same for each weight, or say
for each parameter. Typically, η = 0.01. But in the Adagrad optimizer the core idea is
that each weight has a different learning rate (η). This modification has great impor-
tance: in real-world datasets, some features are sparse (for example, in Bag of Words most
of the features are zero, so it’s sparse) and some are dense (most of the features will be
non-zero), so keeping the same value of the learning rate for all the weights is not
good for optimization. The weight-updating formula for Adagrad looks like:
w(t) = w(t−1) − η′(t) · dL/dw(t−1), with η′(t) = η / sqrt(alpha(t) + epsilon)
where η′(t) denotes a different learning rate for each weight at each iteration, and alpha(t) is
the running sum of the squared derivatives up to iteration t:
alpha(t) = sum over i = 1..t of (dL/dw(i−1))^2
Here, η is a constant number and epsilon is a small positive number added to avoid a
divide-by-zero error in case alpha(t) becomes 0; if the effective learning rate became zero,
then after multiplying by the derivative we would get w(new) = w(old), and convergence
would stall.
Each term (dL/dw)^2 in alpha(t) is a squared derivative of the loss with respect to a weight
and is always positive, since it is a square term; this means that alpha(t) also remains
positive and implies that alpha(t) >= alpha(t−1).
It can be seen from the formula that alpha(t) and η′(t) are inversely proportional to one an-
other, so as alpha(t) increases, η′(t) decreases. This means that as the number of iterations
increases, the learning rate reduces adaptively, so there is no need to select the learning
rate manually.
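A sketch of the Adagrad rule above (the toy loss, η, ε, and step count are illustrative assumptions):

import numpy as np

def adagrad(grad_fn, w, eta=0.5, eps=1e-8, steps=200):
    alpha = np.zeros_like(w)      # per-weight running sum of squared gradients, alpha(t)
    for _ in range(steps):
        grad = grad_fn(w)
        alpha = alpha + grad ** 2                     # alpha(t) only grows
        w = w - (eta / np.sqrt(alpha + eps)) * grad   # per-weight rate eta / sqrt(alpha + eps)
    return w

# Toy loss L(w) = w[0]**2 + 10 * w[1]**2: one shallow and one steep direction.
w_final = adagrad(lambda w: np.array([2 * w[0], 20 * w[1]]), w=np.array([5.0, 5.0]))
print(w_final)   # both coordinates move towards 0 despite very different gradient scales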
Advantages of Adagrad:
• No manual tuning of the learning rate required.
• Faster convergence
• More reliable
One main disadvantage of the Adagrad optimizer is that alpha(t) can become very large as the
number of iterations increases, and because of this η′(t) decreases at a larger rate. This
makes the old weight almost equal to the new weight, which may lead to slow convergence.