
Overview of Stochastic Gradient Descent Algorithms

Srikumar Ramalingam
Reference
Sebastian Ruder, "An overview of gradient descent optimization algorithms," arXiv:1609.04747, 2017.
https://arxiv.org/pdf/1609.04747.pdf
Notations
• Objective function 𝐽(𝜃), where 𝜃 ∈ ℝⁿ is the parameter set.
• The gradient of the objective function with respect to the parameters is ∇𝜃 𝐽(𝜃).
• 𝜂 is the learning rate.
Batch Gradient Descent
• We compute the gradient of the cost function with respect to the
parameters for the entire dataset:

𝜃 = 𝜃 − 𝜂 ⋅ ∇𝜃 𝐽(𝜃)

• As we need to calculate the gradients for the whole dataset to perform just
one update, batch gradient descent can be very slow and is intractable for
datasets that do not fit in memory.
• Batch gradient descent is guaranteed to converge to the global minimum
for convex error surfaces and to a local minimum for non-convex surfaces
(see the sketch below).
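As a minimal sketch (not from the slides), the full-dataset update can be written in a few lines of NumPy. The least-squares objective, the toy data X and y, and the learning rate eta are all illustrative assumptions:

```python
import numpy as np

# Illustrative objective: J(theta) = ||X @ theta - y||^2 / (2N).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # toy dataset (assumed, not from the slides)
y = X @ np.array([1.0, -2.0, 0.5])      # targets generated from known parameters

theta = np.zeros(3)
eta = 0.1                               # learning rate

for epoch in range(500):
    grad = X.T @ (X @ theta - y) / len(y)  # gradient over the ENTIRE dataset
    theta = theta - eta * grad             # theta = theta - eta * grad_theta J(theta)

print(theta)  # approaches [1.0, -2.0, 0.5]
```

Note that every single update touches all 100 examples, which is exactly what makes batch gradient descent slow on large datasets.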
Stochastic Gradient Descent
• The parameter update is performed for each training example (𝑥^(𝑖), 𝑦^(𝑖)):

𝜃 = 𝜃 − 𝜂 ⋅ ∇𝜃 𝐽(𝜃; 𝑥^(𝑖), 𝑦^(𝑖))

• SGD performs frequent updates with a high variance that cause the
objective function to fluctuate heavily.
• SGD’s fluctuation enables it to jump to new and potentially
better local minima. However, this also complicates convergence
to the exact minimum, as SGD will keep overshooting.
• When we slowly decrease the learning rate, SGD shows the same
convergence behaviour as batch gradient descent, almost
certainly converging to a local or the global minimum for non-
convex and convex optimization respectively (see the sketch below).
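A minimal sketch of the per-example update, reusing the same illustrative least-squares setup (X, y, and eta are assumptions, not from the slides):

```python
import numpy as np

# Same illustrative least-squares problem; one (x_i, y_i) pair per update.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

theta = np.zeros(3)
eta = 0.05

for epoch in range(30):
    for i in rng.permutation(len(y)):      # visit the examples in random order
        g = (X[i] @ theta - y[i]) * X[i]   # gradient of J(theta; x_i, y_i)
        theta = theta - eta * g            # noisy, high-variance update

print(theta)  # fluctuates toward [1.0, -2.0, 0.5]
```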
Mini-batch gradient descent
• Mini-batch gradient descent finally takes the best of both worlds and
performs an update for every mini-batch of 𝑛 training examples:

𝜃 = 𝜃 − 𝜂 ⋅ ∇𝜃 𝐽(𝜃; 𝑥^(𝑖:𝑖+𝑛), 𝑦^(𝑖:𝑖+𝑛))

• Common mini-batch sizes range between 50 and 256, but can vary for
different applications.
• Mini-batch gradient descent is typically the algorithm of choice when
training a neural network, and the term SGD is usually employed even
when mini-batches are used (see the sketch below).
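A minimal sketch of the mini-batch variant under the same illustrative setup. The batch size of 16 is an assumption, chosen small because the toy set has only 100 examples:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

theta = np.zeros(3)
eta, n = 0.1, 16                     # learning rate and mini-batch size

for epoch in range(100):
    idx = rng.permutation(len(y))    # reshuffle the data each epoch
    for s in range(0, len(y), n):
        b = idx[s:s + n]             # indices i : i + n of this mini-batch
        grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
        theta = theta - eta * grad   # one update per mini-batch

print(theta)
```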
Momentum
• Here we add a fraction 𝛾 of the update vector of the past time step to
the current update vector.
ν𝑡 = 𝛾ν𝑡−1 + 𝜂 ⋅ ∇𝜃 𝐽(𝜃)
𝜃 = 𝜃 − ν𝑡
• The idea of momentum is similar to a ball rolling down a hill. The
momentum term increases for dimensions whose gradients point in
the same directions and reduces updates for dimensions whose
gradients change directions.
• The momentum term 𝛾 is usually set to 0.9 or a similar value
(see the sketch below).
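A minimal sketch of the momentum update on an assumed toy quadratic 𝐽(𝜃) = ½ Σᵢ 𝑎ᵢ𝜃ᵢ², whose gradient is simply a * theta; the objective and the constants are illustrative, with 𝛾 = 0.9 as on the slide:

```python
import numpy as np

a = np.array([1.0, 50.0])       # ill-conditioned toy quadratic: J = 0.5 * sum(a * theta**2)
theta = np.array([2.0, 2.0])
v = np.zeros_like(theta)
eta, gamma = 0.01, 0.9          # learning rate and momentum term

for _ in range(200):
    grad = a * theta
    v = gamma * v + eta * grad  # v_t = gamma * v_{t-1} + eta * grad J(theta)
    theta = theta - v           # theta = theta - v_t

print(theta)                    # approaches the minimum at the origin
```

The two very different curvatures (1 vs. 50) are chosen to show the rolling-ball effect: velocity builds up along the shallow axis while oscillations along the steep axis are damped.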
Nesterov accelerated gradient
• We would like to have a smarter ball, a ball that has a notion of where
it is going so that it knows to slow down before the hill slopes up
again.
• We can now effectively look ahead by calculating the gradient not
w.r.t. our current parameters θ but w.r.t. the approximate future
position of our parameters:
ν𝑡 = 𝛾ν𝑡−1 + 𝜂 ⋅ ∇𝜃 𝐽(𝜃 − 𝛾ν𝑡−1)
𝜃 = 𝜃 − ν𝑡
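The same illustrative toy setup with the look-ahead gradient; the only change from plain momentum is where the gradient is evaluated:

```python
import numpy as np

a = np.array([1.0, 50.0])              # same assumed quadratic as above
theta = np.array([2.0, 2.0])
v = np.zeros_like(theta)
eta, gamma = 0.01, 0.9

for _ in range(200):
    lookahead = theta - gamma * v      # approximate future position of theta
    grad = a * lookahead               # gradient w.r.t. the look-ahead point
    v = gamma * v + eta * grad
    theta = theta - v

print(theta)
```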
Adagrad
• For brevity, we set 𝑔𝑡,𝑖 to be the gradient of the objective function w.r.t. the parameter
𝜃𝑖 at time step 𝑡.
• The SGD update for every parameter 𝜃𝑖 is given by:
𝜃𝑡+1,𝑖 = 𝜃𝑡,𝑖 − 𝜂 ⋅ 𝑔𝑡,𝑖
• The Adagrad update for every parameter 𝜃𝑖 is given by:
𝜃𝑡+1,𝑖 = 𝜃𝑡,𝑖 − (𝜂 / √(𝐺𝑡,𝑖 + 𝜖)) ⋅ 𝑔𝑡,𝑖
• 𝐺𝑡,𝑖 is the sum of squares of the gradients w.r.t. 𝜃𝑖 up to time 𝑡, and 𝜖 is a smoothing
term that avoids division by zero.
• One of Adagrad’s main benefits is that it eliminates the need to manually tune the
learning rate. Most implementations use a default value of 0.01 and leave it at that. On
the other hand, the learning rate may eventually become infinitesimally small
(see the sketch below).
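A minimal sketch of the per-parameter Adagrad update on the same assumed quadratic. Note that the accumulator G only ever grows, which is exactly what shrinks the effective learning rate over time:

```python
import numpy as np

a = np.array([1.0, 50.0])
theta = np.array([2.0, 2.0])
G = np.zeros_like(theta)        # running sum of squared gradients, per parameter
eta, eps = 0.5, 1e-8            # 0.01 is the common default; larger here for a tiny toy problem

for _ in range(500):
    g = a * theta
    G = G + g**2                                 # G_{t,i} accumulates g_{t,i}^2
    theta = theta - eta / np.sqrt(G + eps) * g   # per-parameter scaled step

print(theta)
```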
Adadelta
• This can be seen as a slight modification of Adagrad. The sum of gradients is
recursively defined as a decaying average of all past squared gradients.
𝐸[𝑔²]𝑡 = 𝛾𝐸[𝑔²]𝑡−1 + (1 − 𝛾) 𝑔𝑡²

The update while using Adadelta is given below:

𝜃𝑡+1,𝑖 = 𝜃𝑡,𝑖 − (𝜂 / √(𝐸[𝑔²]𝑡,𝑖 + 𝜖)) ⋅ 𝑔𝑡,𝑖

• 𝐸[𝑔²]𝑡,𝑖 is the decaying average over past squared gradients w.r.t. 𝜃𝑖 up to time
𝑡, and 𝜖 is a smoothing term that avoids division by zero (see the sketch below).
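A minimal sketch of the update as stated on this slide, where the decaying average replaces Adagrad's ever-growing sum. (Full Adadelta as described in Ruder's survey goes one step further and replaces 𝜂 with an RMS of past parameter updates; the constants below are illustrative.)

```python
import numpy as np

a = np.array([1.0, 50.0])
theta = np.array([2.0, 2.0])
E = np.zeros_like(theta)                # decaying average of squared gradients
eta, gamma, eps = 0.1, 0.9, 1e-8

for _ in range(500):
    g = a * theta
    E = gamma * E + (1 - gamma) * g**2  # E[g^2]_t = gamma * E[g^2]_{t-1} + (1-gamma) * g_t^2
    theta = theta - eta / np.sqrt(E + eps) * g

print(theta)                            # driven toward the minimum at the origin
```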
RMSProp
• RMSprop and Adadelta were both developed independently around the
same time, stemming from the need to resolve Adagrad’s radically
diminishing learning rates.

𝐸[𝑔²]𝑡 = 0.9 𝐸[𝑔²]𝑡−1 + 0.1 𝑔𝑡²

𝜃𝑡+1,𝑖 = 𝜃𝑡,𝑖 − (𝜂 / √(𝐸[𝑔²]𝑡,𝑖 + 𝜖)) ⋅ 𝑔𝑡,𝑖
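The same sketch with the slide's fixed 0.9 / 0.1 decay. Ruder's survey notes 0.001 as a good default learning rate for RMSprop; the toy objective and other constants are assumptions:

```python
import numpy as np

a = np.array([1.0, 50.0])
theta = np.array([2.0, 2.0])
E = np.zeros_like(theta)
eta, eps = 0.01, 1e-8

for _ in range(2000):
    g = a * theta
    E = 0.9 * E + 0.1 * g**2    # fixed decay from the slide
    theta = theta - eta / np.sqrt(E + eps) * g

print(theta)
```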
Adam
• Adaptive Moment Estimation (Adam) is another method that computes
adaptive learning rates for each parameter.
• In addition to storing an exponentially decaying average of past squared
gradients 𝑣𝑡 like Adadelta and RMSprop, Adam also keeps an exponentially
decaying average of past gradients 𝑚𝑡 , similar to momentum:
𝑚𝑡 = 𝛽₁𝑚𝑡−1 + (1 − 𝛽₁) 𝑔𝑡
𝑣𝑡 = 𝛽₂𝑣𝑡−1 + (1 − 𝛽₂) 𝑔𝑡²

𝜃𝑡+1,𝑖 = 𝜃𝑡,𝑖 − (𝜂 / (√𝑣𝑡,𝑖 + 𝜖)) ⋅ 𝑚𝑡,𝑖

(The original Adam paper additionally bias-corrects 𝑚𝑡 and 𝑣𝑡 before applying this update.)
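A minimal sketch on the same assumed quadratic, including the bias-correction terms from the original Adam paper, which the slide's simplified update omits; 𝛽₁ = 0.9, 𝛽₂ = 0.999, and 𝜖 = 1e-8 are the paper's suggested defaults, while eta is illustrative:

```python
import numpy as np

a = np.array([1.0, 50.0])
theta = np.array([2.0, 2.0])
m = np.zeros_like(theta)        # decaying average of gradients (1st moment)
v = np.zeros_like(theta)        # decaying average of squared gradients (2nd moment)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 1001):
    g = a * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)  # bias correction (from the Adam paper)
    v_hat = v / (1 - beta2**t)
    theta = theta - eta / (np.sqrt(v_hat) + eps) * m_hat

print(theta)
```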
Visualization around a saddle point

Here SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to
escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope, with Adadelta
leading the charge.
Visualization on Beale Function

• Adadelta, Adagrad, and RMSprop headed off immediately in the right direction and converged fast.
• Momentum and NAG went off track, and NAG eventually corrected its course.
