Gradient Descent Method
In deep learning, we have the concept of loss, which tells us how poorly the model is per-
forming at that current instant. Now we need to use this loss to train our network such that
it performs better. Essentially what we need to do is to take the loss and try to minimize it,
because a lower loss means our model is going to perform better. The process of minimiz-
ing (or maximizing) any mathematical expression is called optimization.
Optimizers are algorithms or methods used to change the attributes of the neural network
such as weights and learning rate to reduce the losses. Optimizers are used to solve op-
timization problems by minimizing the function.
Below are the different types of optimizers and how exactly they work to minimize the loss function:
1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSprop
9. Adam
1. Gradient Descent
Gradient descent is an optimization algorithm that's used when training a machine learning
model. It's based on a convex function and tweaks its parameters iteratively to minimize a
given function to its local minimum.
You start by defining the initial parameter's values and from there gradient descent uses
calculus to iteratively adjust the values so they minimize the given cost-function.
The weight is initialized using some initialization strategies and is updated with each epoch
according to the update equation.
W_new = W_old − η · (dL/dW_old)
or, written with an iteration index,
w(t) = w(t−1) − η · (dL/dw(t−1))
where
η = learning rate
dL/dW(t−1) = gradient of the loss w.r.t. the weights from the previous iteration
The above equation computes the gradient of the cost function L w.r.t. the parameters/
weights W for the entire training dataset.
Our aim is to get to the bottom of our graph (cost vs. weights), or to a point where we can
no longer move downhill: a local minimum.
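As a rough illustration, here is a minimal sketch of this update rule in Python on a toy one-parameter linear-regression loss (the data, learning rate, and number of epochs are illustrative assumptions, not part of the original text):

import numpy as np

# Toy data: single-feature linear regression with mean-squared-error loss.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

eta = 0.01   # learning rate
W = 0.0      # weight, initialised with a simple strategy (zero)

for epoch in range(100):
    y_pred = W * X
    grad = np.mean(2 * (y_pred - y) * X)   # dL/dW for L = mean((y_pred - y)^2)
    W = W - eta * grad                     # W_new = W_old - eta * dL/dW_old

print(W)   # approaches 2.0, the slope that minimises the loss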
Importance of Learning rate
How big the steps are that gradient descent takes in the direction of the local minimum is
determined by the learning rate, which determines how fast or slow we move towards the
optimal weights.
For gradient descent to reach the local minimum, we must set the learning rate to an ap-
propriate value, which is neither too low nor too high. This is important because if the steps
it takes are too big, it may never reach the local minimum, bouncing back and forth across
the valley of the convex function (see the left image below). If we set the learning rate to a
very small value, gradient descent will eventually reach the local minimum, but that may
take a while (see the right image).
So, the learning rate should never be too high or too low for this reason. You can check
whether your learning rate is doing well by plotting the cost against the number of iterations.
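To make the effect concrete, here is a small sketch (on the toy loss L(w) = w^2, an assumption made only for illustration) showing how different learning rates behave:

def gradient_descent(eta, steps=50, w0=5.0):
    # Minimise L(w) = w**2, whose gradient is dL/dw = 2*w.
    w = w0
    for _ in range(steps):
        w = w - eta * 2 * w
    return w

print(gradient_descent(eta=0.01))   # too low: moves towards 0 but is still far away
print(gradient_descent(eta=0.1))    # appropriate: ends very close to the minimum at 0
print(gradient_descent(eta=1.1))    # too high: bounces back and forth and diverges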
Advantages:
1. Easy computation.
2. Easy to implement.
3. Easy to understand.
Disadvantages:
1. May get stuck at local minima.
2. Weights are updated only after the gradient is computed over the whole dataset, so convergence can be very slow for large datasets.
3. Requires a lot of memory to load the entire dataset at once to compute the gradient.
2. Stochastic Gradient Descent (SGD)
The SGD algorithm is an extension of the Gradient Descent algorithm and it overcomes some
of the disadvantages of GD. Gradient Descent has the disadvantage that it requires a
lot of memory to load the entire dataset of n points at a time to compute the derivative of
the loss function. In the SGD algorithm, the derivative is computed taking one point at a
time.
SGD performs a parameter update for each training example x(i) and label y(i):
w = w − η · ∇L(w; x(i), y(i))
where {x(i), y(i)} are the training examples.
1. On the left, we have Stochastic Gradient Descent (where m = 1 per step): we take a
gradient descent step for each example. On the right is Gradient Descent (1 step per
entire training set).
2. SGD seems to be quite noisy; at the same time it is much faster, but it may not con-
verge to a minimum.
3. Typically, to get the best of both worlds we use Mini-batch Gradient Descent
(MGD), which looks at a smaller number of training-set examples at once
(usually a power of 2, e.g. 2^6 = 64).
4. Mini-batch Gradient Descent is relatively more stable than Stochastic Gradient De-
scent (SGD) but does have oscillations, as gradient steps are taken in the direction
of a sample of the training set and not the entire set as in batch GD.
It is observed that in SGD the updates take more iterations than gradient descent to reach
the minima. On the right, Gradient Descent takes fewer steps to reach the minima, but the
SGD algorithm is noisier and takes more iterations.
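A minimal sketch of the per-example SGD update on a toy linear-regression problem (the data shapes, learning rate, and epoch count here are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

eta = 0.05
w = np.zeros(3)

for epoch in range(20):
    for i in rng.permutation(len(X)):          # shuffle, then visit one example at a time
        grad = 2 * (X[i] @ w - y[i]) * X[i]    # gradient of the single-example squared error
        w = w - eta * grad                     # parameter update after every single example

print(w)   # the path is noisy, but w ends up close to true_w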
Advantage:
1. Memory requirement is less compared to the GD algorithm, as the derivative is computed taking only one point at a time.
3. Mini Batch Stochastic Gradient Descent (MB-SGD)
The MB-SGD algorithm is an extension of the SGD algorithm and it overcomes the problem of
large time complexity in the case of the SGD algorithm. The MB-SGD algorithm takes a batch
of points, i.e. a subset of the dataset, to compute the derivative.
It is observed that the derivative of the loss function for MB-SGD is almost the same as the
derivative of the loss function for GD after some number of iterations. But the number of iter-
ations to achieve the minima is larger for MB-SGD compared to GD, and the cost of computa-
tion is also larger.
The update of the weights depends on the derivative of the loss for a batch of points. The up-
dates in the case of MB-SGD are noisier because the derivative does not always point towards
the minima.
MB-SGD divides the dataset into various batches, and after every batch the parameters
are updated.
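A minimal sketch of the mini-batch update described above (batch size, data, and learning rate are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

eta, batch_size = 0.05, 64    # batch size chosen as a power of 2 (2**6)
w = np.zeros(3)

for epoch in range(20):
    idx = rng.permutation(len(X))             # shuffle the dataset each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)   # average gradient over the batch
        w = w - eta * grad                             # parameters updated after every batch

print(w)   # close to true_w, with less noise per step than one-example SGD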
Advantages:
1. Less time taken to converge compared to the SGD algorithm.
2. Requires less memory than GD, since only a batch of points is loaded at a time.
Disadvantages:
1. The updates of MB-SGD are noisier than the updates of the GD algorithm.
2. It takes longer to converge than the GD algorithm.
3. It may get stuck at local minima.
4. SGD with momentum
A major disadvantage of the MB-SGD algorithm is that the updates of the weights are very noisy.
SGD with momentum overcomes this disadvantage by denoising the gradients. The updates of
the weights depend on a noisy derivative, and if we somehow denoise the derivatives then the
convergence time will decrease.
The idea is to denoise the derivative using an exponentially weighted average, that is, to give
more weight to recent updates compared to previous updates.
It accelerates the convergence towards the relevant direction and reduces the fluctuation
in the irrelevant direction. One more hyperparameter is used in this method, known as mo-
mentum, symbolized by ‘γ’.
Momentum at time ‘t’ is computed using all previous updates, giving more weight to re-
cent updates compared to previous ones. This speeds up convergence.
Essentially, when using momentum, we push a ball down a hill. The ball accumulates mo-
mentum as it rolls downhill, becoming faster and faster on the way (until it reaches its ter-
minal velocity if there is air resistance, i.e. γ<1). The same thing happens to our parameter
updates: The momentum term increases for dimensions whose gradients point in the
same directions and reduces updates for dimensions whose gradients change directions.
As a result, we gain faster convergence and reduced oscillation.
The diagram above shows that SGD with momentum denoises the gradients and con-
verges faster compared to SGD.
update(t) = γ · update(t−1) + η · ∇w(t)
w(t+1) = w(t) − update(t)
where γ is the momentum term, η is the learning rate, and ∇w(t) is the gradient of the loss at the current weights.
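As a sketch of these two equations (the quadratic loss L(w) = sum(w^2), γ = 0.9, η = 0.1, and the step count are illustrative assumptions):

import numpy as np

def sgd_momentum(grad_fn, w, eta=0.1, gamma=0.9, steps=100):
    # update(t) = gamma * update(t-1) + eta * grad(w(t));  w(t+1) = w(t) - update(t)
    update = np.zeros_like(w)
    for _ in range(steps):
        grad = grad_fn(w)
        update = gamma * update + eta * grad   # exponentially weighted average of gradients
        w = w - update
    return w

# Example on L(w) = sum(w**2), whose gradient is 2*w.
w_final = sgd_momentum(lambda w: 2 * w, w=np.array([5.0, -3.0]))
print(w_final)   # close to the minimum at [0, 0]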
Advantages:
1. Faster convergence than SGD/MB-SGD.
2. Reduced oscillation, since the momentum term dampens updates for dimensions whose gradients change direction.
5. Nesterov Accelerated Gradient (NAG)
The idea of the NAG algorithm is very similar to SGD with momentum, with a slight variation.
In the case of the SGD with momentum algorithm, the momentum and gradient are com-
puted on the previously updated weight.
As we can see, in the momentum-based gradient, the steps become larger and larger due to
the accumulated momentum, and then we overshoot at the 4th step. We then have to take
steps in the opposite direction to reach the minimum point.
However, the update in NAG happens in two steps. First, a partial step to reach the look-
ahead point, and then the final update. We calculate the gradient at the look-ahead point
and then use it to calculate the final update. If the gradient at the look-ahead point is nega-
tive, our final update will be smaller than that of a regular momentum-based gradient. Like
in the above example, the updates of NAG are similar to that of the momentum-based gra-
dient for the first three steps because the gradient at that point and the look-ahead point
are positive. But at step 4, the gradient of the look-ahead point is negative.
In NAG, the first partial update 4a will be used to go to the look-ahead point and then the
gradient will be calculated at that point without updating the parameters. Since the gradient
at step 4b is negative, the overall update will be smaller than the momentum-based gradi-
ent descent.
We can see in the above example that the momentum-based gradient descent takes six
steps to reach the minimum point, while NAG takes only five steps.
This looking ahead helps NAG to converge to the minimum points in fewer steps and re-
duce the chances of overshooting.
The NAG update can be written in the same notation, but in two steps. First, a partial step using only the accumulated momentum gives the look-ahead point:
w_lookahead = w(t) − γ · update(t−1)
The gradient is then evaluated at this look-ahead point and combined with the previous updates (this is how the gradient of all the previous updates is added to the current update):
update(t) = γ · update(t−1) + η · ∇w_lookahead
Finally, the weight (W) is updated in each iteration:
w(t+1) = w(t) − update(t)
Here η is the learning rate, γ is the momentum term, and ∇w_lookahead is the gradient at the look-ahead point. Using this look-ahead gradient in the update prevents overshooting.
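A sketch of the NAG update in the same style as the momentum sketch above (again, the quadratic loss and the hyperparameters are illustrative assumptions):

import numpy as np

def nag(grad_fn, w, eta=0.1, gamma=0.9, steps=100):
    update = np.zeros_like(w)
    for _ in range(steps):
        w_lookahead = w - gamma * update   # partial step using only the momentum
        grad = grad_fn(w_lookahead)        # gradient evaluated at the look-ahead point
        update = gamma * update + eta * grad
        w = w - update                     # final update
    return w

w_final = nag(lambda w: 2 * w, w=np.array([5.0, -3.0]))
print(w_final)   # reaches the minimum at [0, 0] with less overshoot than plain momentum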
6. Adaptive Gradient (AdaGrad)
Recall that the plain SGD update has the form
w(t) = w(t−1) − η · dL/dw(t−1)
where w(t) = value of w at the current iteration, w(t−1) = value of w at the previous iteration, and
η = learning rate.
In SGD and mini-batch SGD, the value of η used to be the same for each weight, or say
for each parameter. Typically, η = 0.01. But in the Adagrad optimizer the core idea is
that each weight has a different learning rate (η). This modification has great impor-
tance: in real-world datasets, some features are sparse (for example, in Bag of Words most
of the features are zero, so it’s sparse) and some are dense (most of the features will be
non-zero), so keeping the same value of the learning rate for all the weights is not
good for optimization. The weight-updating formula for Adagrad looks like:
w(t) = w(t−1) − η′(t) · dL/dw(t−1), with η′(t) = η / sqrt(alpha(t) + epsilon)
where η′(t) denotes a different learning rate for each weight at each iteration, and alpha(t) is
the running sum of the squared derivatives up to iteration t:
alpha(t) = sum over i = 1..t of (dL/dw(i−1))^2
Here, η is a constant number and epsilon is a small positive number added to avoid a
divide-by-zero error in case alpha(t) becomes 0; if the effective learning rate became zero,
then after multiplying by the derivative we would get w(new) = w(old), and convergence
would stall.
Each term (dL/dw)^2 in alpha(t) is a squared derivative of the loss with respect to a weight
and is always positive, since it is a square term; this means that alpha(t) also remains
positive and implies that alpha(t) >= alpha(t−1).
It can be seen from the formula that alpha(t) and η′(t) are inversely proportional to one an-
other, so as alpha(t) increases, η′(t) decreases. This means that as the number of iterations
increases, the learning rate reduces adaptively, so there is no need to select the learning
rate manually.
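A sketch of the Adagrad rule above (the toy loss, η, ε, and step count are illustrative assumptions):

import numpy as np

def adagrad(grad_fn, w, eta=0.5, eps=1e-8, steps=200):
    alpha = np.zeros_like(w)      # per-weight running sum of squared gradients, alpha(t)
    for _ in range(steps):
        grad = grad_fn(w)
        alpha = alpha + grad ** 2                     # alpha(t) only grows
        w = w - (eta / np.sqrt(alpha + eps)) * grad   # per-weight rate eta / sqrt(alpha + eps)
    return w

# Toy loss L(w) = w[0]**2 + 10 * w[1]**2: one shallow and one steep direction.
w_final = adagrad(lambda w: np.array([2 * w[0], 20 * w[1]]), w=np.array([5.0, 5.0]))
print(w_final)   # both coordinates move towards 0 despite very different gradient scales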
Advantages of Adagrad:
• No manual tuning of the learning rate required.
• Faster convergence
• More reliable
One main disadvantage of the Adagrad optimizer is that alpha(t) can become very large as the
number of iterations increases, and because of this η′(t) decreases at a larger rate. This
makes the old weight almost equal to the new weight, which may lead to slow convergence.