Gradient Descent Overview
• Batch (vanilla) gradient descent update rule:
$\theta = \theta - \eta \cdot \nabla_\theta J(\theta)$
• As we need to calculate the gradients for the whole dataset to perform just
one update, batch gradient descent can be very slow and is intractable for
datasets that do not fit in memory.
• Batch gradient descent is guaranteed to converge to the global minimum
for convex error surfaces and to a local minimum for non-convex surfaces.
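• Below is a minimal NumPy sketch of batch gradient descent on a toy least-squares problem; the data, learning rate, and epoch count are illustrative assumptions, not values from the slides.
```python
# Batch gradient descent on a toy least-squares problem (data and hyperparameters assumed).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # full training set
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.1                                 # learning rate

for epoch in range(200):
    # gradient of the mean-squared-error objective, computed over the WHOLE dataset
    grad = 2.0 * X.T @ (X @ theta - y) / len(y)
    theta = theta - eta * grad            # theta = theta - eta * grad_theta J(theta)

print(theta)                              # close to true_theta
```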
Stochastic Gradient Descent
• Parameter update is done for each training example $(x^{(i)}, y^{(i)})$:
$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i)}, y^{(i)})$
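• A minimal NumPy sketch of the per-example SGD update on a toy least-squares problem (data and hyperparameters are assumed for illustration):
```python
# Stochastic gradient descent: one parameter update per training example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta = np.zeros(3)
eta = 0.05

for epoch in range(20):
    for i in rng.permutation(len(y)):         # visit the examples in random order
        x_i, y_i = X[i], y[i]
        g = 2.0 * x_i * (x_i @ theta - y_i)   # gradient of J(theta; x_i, y_i)
        theta = theta - eta * g

print(theta)                                  # noisy but close to true_theta
```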
Mini-batch Gradient Descent
• Parameter update is performed for every mini-batch of $n$ training examples:
$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}, y^{(i:i+n)})$
• Common mini-batch sizes range between 50 and 256, but can vary for
different applications.
• Mini-batch gradient descent is typically the algorithm of choice when
training a neural network, and the term SGD is usually employed even
when mini-batches are used.
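• A minimal NumPy sketch of mini-batch gradient descent; the batch size of 64 and the other hyperparameters are assumed for illustration.
```python
# Mini-batch gradient descent: one update per mini-batch of n examples.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
eta, n = 0.05, 64                         # learning rate and mini-batch size

for epoch in range(20):
    order = rng.permutation(len(y))       # reshuffle every epoch
    for start in range(0, len(y), n):
        idx = order[start:start + n]      # the slice x^(i:i+n), y^(i:i+n)
        Xb, yb = X[idx], y[idx]
        g = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)
        theta = theta - eta * g

print(theta)                              # close to true_theta
```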
Momentum
• Here we add a fraction 𝛾 of the update vector of the past time step to
the current update vector.
$v_t = \gamma v_{t-1} + \eta \cdot \nabla_\theta J(\theta)$
$\theta = \theta - v_t$
• The idea of momentum is similar to a ball rolling down a hill. The
momentum term increases for dimensions whose gradients point in
the same directions and reduces updates for dimensions whose
gradients change directions.
• The momentum term 𝛾 is
usually set to 0.9 or a similar value.
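• A minimal NumPy sketch of the momentum update on an assumed toy quadratic objective:
```python
# Momentum update on a toy 2-D quadratic J(theta) = 0.5 * theta^T A theta (assumed objective).
import numpy as np

A = np.diag([10.0, 1.0])                  # ill-conditioned bowl: steep in one direction, shallow in the other

def grad(theta):
    return A @ theta                      # gradient of the toy objective

theta = np.array([2.0, 2.0])
v = np.zeros_like(theta)
eta, gamma = 0.05, 0.9                    # learning rate and momentum term (gamma = 0.9 as above)

for t in range(100):
    v = gamma * v + eta * grad(theta)     # v_t = gamma * v_{t-1} + eta * grad J(theta)
    theta = theta - v                     # theta = theta - v_t

print(theta)                              # approaches the minimum at the origin
```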
Nesterov accelerated gradient
• We would like to have a smarter ball, a ball that has a notion of where
it is going so that it knows to slow down before the hill slopes up
again.
• We can now effectively look ahead by calculating the gradient not
w.r.t. to our current parameters θ but w.r.t. the approximate future
position of our parameters:
$v_t = \gamma v_{t-1} + \eta \cdot \nabla_\theta J(\theta - \gamma v_{t-1})$
$\theta = \theta - v_t$
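• A minimal NumPy sketch of the Nesterov look-ahead update on the same kind of toy quadratic (assumed for illustration):
```python
# Nesterov accelerated gradient: the gradient is evaluated at the look-ahead point theta - gamma * v.
import numpy as np

A = np.diag([10.0, 1.0])                  # toy quadratic objective (assumed)

def grad(theta):
    return A @ theta

theta = np.array([2.0, 2.0])
v = np.zeros_like(theta)
eta, gamma = 0.05, 0.9

for t in range(100):
    lookahead = theta - gamma * v         # approximate future position of the parameters
    v = gamma * v + eta * grad(lookahead) # v_t = gamma * v_{t-1} + eta * grad J(theta - gamma * v_{t-1})
    theta = theta - v                     # theta = theta - v_t

print(theta)                              # approaches the minimum at the origin
```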
Adagrad
• For brevity, we set $g_{t,i}$ to be the gradient of the objective function w.r.t. the parameter
$\theta_i$ at time step $t$.
• The SGD update for every parameter $\theta_i$ is given by:
$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}$
• The Adagrad update for every parameter $\theta_i$ is given by:
$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}} \cdot g_{t,i}$
• $G_{t,i}$ is the sum of squares of the gradients w.r.t. $\theta_i$ up to time step $t$, and $\epsilon$ is a smoothing
term that avoids division by zero.
• One of Adagrad's main benefits is that it eliminates the need to manually tune the
learning rate; most implementations use a default value of 0.01 and leave it at that. On
the other hand, because the squared gradients keep accumulating in the denominator, the
effective learning rate may eventually become infinitesimally small.
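• A minimal NumPy sketch of the Adagrad update on an assumed toy quadratic; the learning rate here is chosen for the toy problem rather than the 0.01 default mentioned above.
```python
# Adagrad: each parameter gets its own effective learning rate,
# scaled by the accumulated sum of its squared gradients.
import numpy as np

A = np.diag([10.0, 1.0])                  # toy quadratic objective (assumed)

def grad(theta):
    return A @ theta

theta = np.array([2.0, 2.0])
G = np.zeros_like(theta)                  # per-parameter sum of squared gradients G_{t,i}
eta, eps = 0.5, 1e-8                      # eta chosen for this toy problem (0.01 is the common default noted above)

for t in range(500):
    g = grad(theta)
    G = G + g ** 2                        # accumulate squared gradients
    theta = theta - eta / np.sqrt(G + eps) * g

print(theta)                              # approaches the minimum; note the shrinking effective step size
```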
Adadelta
• This can be seen as a slight modification of Adagrad: the accumulated sum of squared gradients is
replaced by a recursively defined, exponentially decaying average of all past squared gradients.
$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma) g_t^2$
• With $\gamma = 0.9$, for example:
$E[g^2]_t = 0.9\, E[g^2]_{t-1} + 0.1\, g_t^2$
$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{E[g^2]_{t,i} + \epsilon}} \cdot g_{t,i}$
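• A minimal NumPy sketch of the update described above, where the accumulated sum of squared gradients is replaced by a decaying average (toy objective and hyperparameters are assumed):
```python
# Decaying-average variant: the raw sum of squared gradients is replaced by an
# exponential moving average E[g^2].
import numpy as np

A = np.diag([10.0, 1.0])                  # toy quadratic objective (assumed)

def grad(theta):
    return A @ theta

theta = np.array([2.0, 2.0])
Eg2 = np.zeros_like(theta)                # decaying average of past squared gradients
eta, gamma, eps = 0.05, 0.9, 1e-8

for t in range(500):
    g = grad(theta)
    Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
    theta = theta - eta / np.sqrt(Eg2 + eps) * g

print(theta)                              # settles in a small neighbourhood of the minimum
```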
Adam
• Adaptive Moment Estimation (Adam) is another method that computes
adaptive learning rates for each parameter.
• In addition to storing an exponentially decaying average of past squared
gradients 𝑣𝑡 like Adadelta and RMSprop, Adam also keeps an exponentially
decaying average of past gradients 𝑚𝑡 , similar to momentum:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$
$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$
$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{v_{t,i}} + \epsilon} \cdot m_{t,i}$
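• A minimal NumPy sketch of the Adam update as written above, i.e. without the bias-correction terms used in the original Adam paper (toy objective and hyperparameters are assumed):
```python
# Adam as written above, without the bias-corrected m_hat, v_hat of the original paper.
import numpy as np

A = np.diag([10.0, 1.0])                  # toy quadratic objective (assumed)

def grad(theta):
    return A @ theta

theta = np.array([2.0, 2.0])
m = np.zeros_like(theta)                  # decaying average of past gradients
v = np.zeros_like(theta)                  # decaying average of past squared gradients
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1000):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g       # m_t
    v = beta2 * v + (1 - beta2) * g ** 2  # v_t
    theta = theta - eta / (np.sqrt(v) + eps) * m

print(theta)                              # approaches the minimum (small residual oscillations are expected)
```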
Visualization around a saddle point
Here SGD, Momentum, and NAG find it difficult to break symmetry, although the latter two eventually manage to
escape the saddle point, while Adagrad, RMSprop, and Adadelta quickly head down the negative slope, with Adadelta
leading the charge.
Visualization on Beale Function
• Adadelta, Adagrad, and RMSprop headed off immediately in the right direction and converged fast.
• Momentum and NAG were off track, but NAG eventually corrected its course.