Gradient Descent Overview
• Batch (vanilla) gradient descent update rule:
$\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$
• As we need to calculate the gradients for the whole dataset to perform just
one update, batch gradient descent can be very slow and is intractable for
datasets that do not fit in memory.
• Batch gradient descent is guaranteed to converge to the global minimum
for convex error surfaces and to a local minimum for non-convex surfaces.
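• Below is a minimal NumPy sketch of batch gradient descent on a toy least-squares problem; the data, learning rate, and epoch count are illustrative assumptions, not values from the slides.
```python
# Batch gradient descent on a toy least-squares problem (data and hyperparameters assumed).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # full training set
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.1                                 # learning rate

for epoch in range(200):
    # gradient of the mean-squared-error objective, computed over the WHOLE dataset
    grad = 2.0 * X.T @ (X @ theta - y) / len(y)
    theta = theta - eta * grad            # theta = theta - eta * grad_theta J(theta)

print(theta)                              # close to true_theta
```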
Stochastic Gradient Descent
• Parameter update is done for each training example $(x^{(i)}, y^{(i)})$:
$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$
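• A minimal NumPy sketch of the per-example SGD update on a toy least-squares problem (data and hyperparameters are assumed for illustration):
```python
# Stochastic gradient descent: one parameter update per training example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.05

for epoch in range(20):
    for i in rng.permutation(len(y)):         # visit the examples in random order
        x_i, y_i = X[i], y[i]
        g = 2.0 * x_i * (x_i @ theta - y_i)   # gradient of J(theta; x_i, y_i)
        theta = theta - eta * g

print(theta)                                  # noisy but close to true_theta
```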
Mini-batch Gradient Descent
• Parameter update is performed for every mini-batch of $n$ training examples:
$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}, y^{(i:i+n)})$
• Common mini-batch sizes range between 50 and 256, but can vary for
different applications.
• Mini-batch gradient descent is typically the algorithm of choice when
training a neural network, and the term SGD is usually employed even
when mini-batches are used.
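• A minimal NumPy sketch of mini-batch gradient descent; the batch size of 64 and the other hyperparameters are assumed for illustration.
```python
# Mini-batch gradient descent: one update per mini-batch of n examples.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
eta, n = 0.05, 64                         # learning rate and mini-batch size

for epoch in range(20):
    order = rng.permutation(len(y))       # reshuffle every epoch
    for start in range(0, len(y), n):
        idx = order[start:start + n]      # the slice x^(i:i+n), y^(i:i+n)
        Xb, yb = X[idx], y[idx]
        g = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)
        theta = theta - eta * g

print(theta)                              # close to true_theta
```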
Momentum
• Here we add a fraction 𝛾 of the update vector of the past time step to
the current update vector.
$v_t = \gamma v_{t-1} + \eta \cdot \nabla_\theta J(\theta)$
$\theta = \theta - v_t$
• The idea of momentum is similar to a ball rolling down a hill. The
momentum term increases for dimensions whose gradients point in
the same directions and reduces updates for dimensions whose
gradients change directions.
• The momentum term 𝛾 is
usually set to 0.9 or a similar value.
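• A minimal NumPy sketch of the momentum update on an assumed toy quadratic objective:
```python
# Momentum update on a toy 2-D quadratic J(theta) = 0.5 * theta^T A theta (assumed objective).
import numpy as np

A = np.diag([10.0, 1.0])                  # ill-conditioned bowl: steep in one direction, shallow in the other

def grad(theta):
    return A @ theta                      # gradient of the toy objective

theta = np.array([2.0, 2.0])
v = np.zeros_like(theta)
eta, gamma = 0.05, 0.9                    # learning rate and momentum term (gamma = 0.9 as above)

for t in range(100):
    v = gamma * v + eta * grad(theta)     # v_t = gamma * v_{t-1} + eta * grad J(theta)
    theta = theta - v                     # theta = theta - v_t

print(theta)                              # approaches the minimum at the origin
```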
Nesterov accelerated gradient
• We would like to have a smarter ball, a ball that has a notion of where
it is going so that it knows to slow down before the hill slopes up
again.
• We can now effectively look ahead by calculating the gradient not
w.r.t. to our current parameters θ but w.r.t. the approximate future
position of our parameters:
$v_t = \gamma v_{t-1} + \eta \cdot \nabla_\theta J(\theta - \gamma v_{t-1})$
$\theta = \theta - v_t$
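• A minimal NumPy sketch of the Nesterov look-ahead update on the same kind of toy quadratic (assumed for illustration):
```python
# Nesterov accelerated gradient: the gradient is evaluated at the look-ahead point theta - gamma * v.
import numpy as np

A = np.diag([10.0, 1.0])                  # toy quadratic objective (assumed)

def grad(theta):
    return A @ theta

theta = np.array([2.0, 2.0])
v = np.zeros_like(theta)
eta, gamma = 0.05, 0.9

for t in range(100):
    lookahead = theta - gamma * v         # approximate future position of the parameters
    v = gamma * v + eta * grad(lookahead) # v_t = gamma * v_{t-1} + eta * grad J(theta - gamma * v_{t-1})
    theta = theta - v                     # theta = theta - v_t

print(theta)                              # approaches the minimum at the origin
```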
Adagrad
• For brevity, we set $g_{t,i}$ to be the gradient of the objective function w.r.t. the parameter
$\theta_i$ at time step $t$.
• The SGD update for every parameter $\theta_i$ is given by:
$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}$
• The Adagrad update for every parameter $\theta_i$ is given by:
$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} \cdot g_{t,i}$
• $G_{t,i}$ is the sum of squares of the gradients w.r.t. $\theta_i$ up to time step $t$, and $\epsilon$ is a smoothing
term that avoids division by zero.
• One of Adagrad's main benefits is that it eliminates the need to manually tune the
learning rate; most implementations use a default value of 0.01 and leave it at that. On
the other hand, because the squared gradients keep accumulating in the denominator, the
effective learning rate may eventually become infinitesimally small.
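• A minimal NumPy sketch of the Adagrad update on an assumed toy quadratic; the learning rate here is chosen for the toy problem rather than the 0.01 default mentioned above.
```python
# Adagrad: each parameter gets its own effective learning rate,
# scaled by the accumulated sum of its squared gradients.
import numpy as np

A = np.diag([10.0, 1.0])                  # toy quadratic objective (assumed)

def grad(theta):
    return A @ theta

theta = np.array([2.0, 2.0])
G = np.zeros_like(theta)                  # per-parameter sum of squared gradients G_{t,i}
eta, eps = 0.5, 1e-8                      # eta chosen for this toy problem (0.01 is the common default noted above)

for t in range(500):
    g = grad(theta)
    G = G + g ** 2                        # accumulate squared gradients
    theta = theta - eta / np.sqrt(G + eps) * g

print(theta)                              # approaches the minimum; note the shrinking effective step size
```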
Adadelta
• This can be seen as a slight modification of Adagrad: the accumulated sum of squared gradients is
replaced by a recursively defined, exponentially decaying average of all past squared gradients.
$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$
• With $\gamma = 0.9$, for example:
$E[g^2]_t = 0.9\, E[g^2]_{t-1} + 0.1\, g_t^2$
$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{E[g^2]_{t,i} + \epsilon}} \cdot g_{t,i}$
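• A minimal NumPy sketch of the update described above, where the accumulated sum of squared gradients is replaced by a decaying average (toy objective and hyperparameters are assumed):
```python
# Decaying-average variant: the raw sum of squared gradients is replaced by an
# exponential moving average E[g^2].
import numpy as np

A = np.diag([10.0, 1.0])                  # toy quadratic objective (assumed)

def grad(theta):
    return A @ theta

theta = np.array([2.0, 2.0])
Eg2 = np.zeros_like(theta)                # decaying average of past squared gradients
eta, gamma, eps = 0.05, 0.9, 1e-8

for t in range(500):
    g = grad(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    theta = theta - eta / np.sqrt(Eg2 + eps) * g

print(theta)                              # settles in a small neighbourhood of the minimum
```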
Adam
• Adaptive Moment Estimation (Adam) is another method that computes
adaptive learning rates for each parameter.
• In addition to storing an exponentially decaying average of past squared
gradients 𝑣𝑡 like Adadelta and RMSprop, Adam also keeps an exponentially
decaying average of past gradients 𝑚𝑡 , similar to momentum:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon} \cdot m_{t,i}$
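• A minimal NumPy sketch of the Adam update as written above, i.e. without the bias-correction terms used in the original Adam paper (toy objective and hyperparameters are assumed):
```python
# Adam as written above, without the bias-corrected m_hat, v_hat of the original paper.
import numpy as np

A = np.diag([10.0, 1.0])                  # toy quadratic objective (assumed)

def grad(theta):
    return A @ theta

theta = np.array([2.0, 2.0])
m = np.zeros_like(theta)                  # decaying average of past gradients
v = np.zeros_like(theta)                  # decaying average of past squared gradients
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1000):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g       # m_t
    v = beta2 * v + (1 - beta2) * g ** 2  # v_t
    theta = theta - eta / (np.sqrt(v) + eps) * m

print(theta)                              # approaches the minimum (small residual oscillations are expected)
```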
Visualization around a saddle point
Here SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to
escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope, with Adadelta
leading the charge.
Visualization on Beale Function
• Adadelta, Adagrad, and RMSprop headed off immediately in the right direction and converged fast.
• Momentum and NAG were off track, but NAG eventually corrected its course.