Gradient Descent Method

The document discusses various optimization algorithms used in deep learning to minimize loss functions, including Gradient Descent, Stochastic Gradient Descent (SGD), Mini-Batch SGD, and others. It explains how each optimizer works, their advantages and disadvantages, and the importance of learning rates in the optimization process. The document emphasizes the need for effective weight updates to improve model performance and convergence speed.

Introduction

In deep learning, we have the concept of loss, which tells us how poorly the model is performing at a given moment. We need to use this loss to train our network so that it performs better. Essentially, we take the loss and try to minimize it, because a lower loss means our model will perform better. The process of minimizing (or maximizing) any mathematical expression is called optimization.

Optimizers are algorithms or methods used to change the attributes of the neural network, such as its weights and learning rate, in order to reduce the loss. In other words, optimizers solve optimization problems by minimizing the loss function.

Below are the different types of optimizers and how exactly they work to minimize the loss function:

1. Gradient Descent
2. Stochastic Gradient Descent (SGD)
3. Mini Batch Stochastic Gradient Descent (MB-SGD)
4. SGD with momentum
5. Nesterov Accelerated Gradient (NAG)
6. Adaptive Gradient (AdaGrad)
7. AdaDelta
8. RMSprop
9. Adam

1. Gradient Descent

Gradient descent is an optimization algorithm used when training a machine learning model. It tweaks the model's parameters iteratively to drive a given (ideally convex) function down to its local minimum.

WHAT IS GRADIENT DESCENT? Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function. Gradient descent is simply used to find the values of a function's parameters (coefficients) that minimize a cost function as far as possible.

You start by defining the initial parameter values, and from there gradient descent uses calculus to iteratively adjust the values so that they minimize the given cost function.

The weight is initialized using some initialization strategies and is updated with each epoch
according to the update equation.

W_new = W_old − η · dL/dW_old

or, written per iteration t,

W_t = W_(t−1) − η · dL/dW_(t−1)

where

W_t = new weight
W_(t−1) = old weight
η = learning rate
dL/dW_(t−1) = gradient of the loss with respect to the old weight
L = loss/cost function

The above equation computes the gradient of the cost function L with respect to the parameters/weights W over the entire training dataset. Our aim is to get to the bottom of the graph of cost versus weights, or to a point where we can no longer move downhill: a local minimum.
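To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent on a toy linear-regression cost. The data X and y, the learning rate, and the epoch count are illustrative assumptions, not values from the text.

import numpy as np

# Toy data for a 1-D linear model y ≈ w * x (illustrative values)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w = 0.0        # weight initialized by some strategy (here: zero)
eta = 0.01     # learning rate
for epoch in range(100):
    # Mean squared error loss L = mean((w*X - y)^2);
    # gradient dL/dw is computed over the ENTIRE dataset, as in batch GD
    grad = 2 * np.mean((w * X - y) * X)
    w = w - eta * grad   # W_new = W_old - eta * dL/dW_old
print(w)  # approaches 2.0, the minimizer of this cost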
Importance of the Learning Rate

How big the steps are that gradient descent takes in the direction of the local minimum is determined by the learning rate, which controls how fast or slow we move towards the optimal weights.

For gradient descent to reach the local minimum, we must set the learning rate to an appropriate value, neither too low nor too high. This is important because if the steps it takes are too big, it may never reach the local minimum, bouncing back and forth across the valley of the convex cost function. If we set the learning rate to a very small value, gradient descent will eventually reach the local minimum, but that may take a long time.

So the learning rate should never be too high or too low. You can check whether your learning rate is doing well by plotting the loss over iterations on a graph.
Advantages:

1. Easy computation.
2. Easy to implement.
3. Easy to understand.

Disadvantages:

1. May get trapped at local minima.
2. Weights are changed only after calculating the gradient on the whole dataset. So if the dataset is too large, convergence to the minima may take a very long time.
3. Requires large memory to calculate the gradient on the whole dataset.

2. Stochastic Gradient Descent (SGD)

The SGD algorithm is an extension of Gradient Descent that overcomes some of the disadvantages of the GD algorithm. Gradient Descent has the disadvantage that it requires a lot of memory to load the entire dataset of n points at a time to compute the derivative of the loss function. In the SGD algorithm, the derivative is computed taking one point at a time.

SGD performs a parameter update for each training example x(i) and label y(i), where {x(i), y(i)} are the training examples. In the notation above, with the loss L evaluated on a single example:

W_t = W_(t−1) − η · dL(x(i), y(i))/dW_(t−1)

(A code sketch of this per-example update appears after the list below.)

1. On the left, we have Stochastic Gradient Descent (where m = 1 per step): we take a gradient descent step for each example. On the right is Gradient Descent (one step per entire training set).
2. SGD is quite noisy; at the same time it is much faster, but it may not converge exactly to a minimum.
3. Typically, to get the best of both worlds, we use Mini-batch Gradient Descent (MGD), which looks at a smaller number of training examples at once (usually a power of 2, e.g. 2^6 = 64).
4. Mini-batch Gradient Descent is relatively more stable than Stochastic Gradient Descent (SGD) but still oscillates, as gradient steps are taken in the direction of a sample of the training set and not the entire set as in batch GD.

It is observed that SGD takes more iterations than gradient descent to reach the minima: Gradient Descent takes fewer steps, while the SGD path is noisier and needs more iterations.

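Here is a minimal NumPy sketch of the per-example SGD update on the same assumed toy data as before; the shuffling, learning rate, and epoch count are illustrative choices.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, eta = 0.0, 0.01
rng = np.random.default_rng(0)
for epoch in range(50):
    for i in rng.permutation(len(X)):          # shuffle the examples each epoch
        grad = 2 * (w * X[i] - y[i]) * X[i]    # gradient from ONE example (x(i), y(i))
        w = w - eta * grad                     # noisy step toward the minimum
print(w)  # close to 2.0, but each step used only one point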

Advantage:

Memory requirement is less compared to the GD algorithm, as the derivative is computed taking only one point at a time.
Disadvantages:

1. The time required to complete one epoch is large compared to the GD algorithm.
2. Takes a long time to converge.
3. May get stuck at local minima.

3. Mini-Batch Stochastic Gradient Descent (MB-SGD)

The MB-SGD algorithm is an extension of the SGD algorithm that overcomes the problem of the large time complexity of SGD. The MB-SGD algorithm takes a batch of points, i.e. a subset of the dataset, to compute the derivative.

It is observed that the derivative of the loss function for MB-SGD is almost the same as the derivative of the loss function for GD after some number of iterations. However, the number of iterations needed to reach the minima is larger for MB-SGD than for GD, and the total cost of computation is also larger.


The update of the weights depends on the derivative of the loss for a batch of points. The updates in the case of MB-SGD are noisier than in GD because the derivative does not always point towards the minima. MB-SGD divides the dataset into batches, and after every batch the parameters are updated, as in the sketch below.
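A minimal sketch of the mini-batch update on the same assumed toy data; the batch size of 2 is an illustrative example value.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, eta, batch_size = 0.0, 0.01, 2   # batch size is an assumed example value
rng = np.random.default_rng(0)
for epoch in range(50):
    idx = rng.permutation(len(X))                      # shuffle, then split into batches
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]              # one mini-batch of indices
        grad = 2 * np.mean((w * X[b] - y[b]) * X[b])   # gradient averaged over the batch
        w = w - eta * grad                             # parameters updated after every batch
print(w)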

Advantages:

Takes less time to converge compared to the standard SGD algorithm.

Disadvantages:

1. The updates of MB-SGD are noisier than the updates of the GD algorithm.
2. Takes longer to converge than the GD algorithm.
3. May get stuck at local minima.

4. SGD with Momentum

A major disadvantage of the MB-SGD algorithm is that the weight updates are very noisy. SGD with momentum overcomes this disadvantage by denoising the gradients: the updates depend on a noisy derivative, and if we can somehow denoise the derivatives, the time to converge will decrease.

The idea is to denoise the derivative using an exponentially weighted average, that is, to give more weight to recent updates than to older ones. This accelerates convergence in the relevant direction and reduces fluctuation in irrelevant directions. One more hyperparameter is used in this method, known as momentum and symbolized by γ.

The momentum term γ is usually set to 0.9 or a similar value.

The momentum at time t is computed using all previous updates, giving more weight to recent updates than to older ones. This speeds up convergence.

Essentially, when using momentum, we push a ball down a hill. The ball accumulates mo-
mentum as it rolls downhill, becoming faster and faster on the way (until it reaches its ter-
minal velocity if there is air resistance, i.e. γ<1). The same thing happens to our parameter
updates: The momentum term increases for dimensions whose gradients point in the
same directions and reduces updates for dimensions whose gradients change directions.
As a result, we gain faster convergence and reduced oscillation.

SGD with momentum thus denoises the gradients and converges faster than plain SGD.

Update rule for momentum-based gradient descent:

In this, momentum is added to the conventional gradient descent equation. The update equation is

w_(t+1) = w_t − update_t

where update_t is calculated by:

update_t = γ · update_(t−1) + η · ∇w_t
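To make the rule concrete, here is a minimal NumPy sketch of the momentum update on the same assumed toy problem, using γ = 0.9 as stated above; the data and learning rate remain illustrative assumptions.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, eta, gamma = 0.0, 0.01, 0.9
update = 0.0                                  # update_0 = 0
for t in range(100):
    grad = 2 * np.mean((w * X - y) * X)       # gradient ∇w_t (full batch here, for clarity)
    update = gamma * update + eta * grad      # update_t = γ·update_(t−1) + η·∇w_t
    w = w - update                            # w_(t+1) = w_t − update_t
print(w)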

Advantages:

1. Has all the advantages of the SGD algorithm.
2. Converges faster than the GD algorithm.

Disadvantages:

We need to compute and store one more variable (the momentum term) for each update.


5. Nesterov Accelerated Gradient (NAG)

The idea of the NAG algorithm is very similar to SGD with momentum, with a slight variant. In the case of the SGD with momentum algorithm, the momentum and the gradient are both computed on the previously updated weight.

As we can see in the momentum-based gradient, the steps become larger and larger due to the accumulated momentum, and we overshoot at the 4th step. We then have to take steps in the opposite direction to reach the minimum point.

The update in NAG, however, happens in two steps: first a partial step to reach the look-ahead point, and then the final update. We calculate the gradient at the look-ahead point and use it to calculate the final update. If the gradient at the look-ahead point is negative, our final update will be smaller than that of a regular momentum-based gradient. In the example above, the updates of NAG are similar to those of the momentum-based gradient for the first three steps, because the gradient at the current point and at the look-ahead point are both positive. But at step 4, the gradient at the look-ahead point is negative.

In NAG, the first partial update (4a) is used to go to the look-ahead point, and then the gradient is calculated at that point without updating the parameters. Since the gradient at step 4b is negative, the overall update is smaller than in momentum-based gradient descent.

In this example, momentum-based gradient descent takes six steps to reach the minimum point, while NAG takes only five. This looking ahead helps NAG converge to the minimum in fewer steps and reduces the chance of overshooting.

How NAG Actually Works


We saw how NAG solves the problem of overshooting by "looking ahead". Let us see how this is calculated and the actual math behind it.

Update rule for gradient descent:

w_(t+1) = w_t − η · ∇w_t

In this equation, the weight w is updated in each iteration, η is the learning rate, and ∇w_t is the gradient.

Update rule for momentum-based gradient descent:

In this, momentum is added to the conventional gradient descent equation. The update equation is

w_(t+1) = w_t − update_t

where update_t is calculated by:

update_t = γ · update_(t−1) + η · ∇w_t

This is how the gradient of all the previous updates is added to the current update.

Update rule for NAG:

w_(t+1) = w_t − update_t

While calculating update_t, we include the look-ahead gradient ∇w_lookahead:

update_t = γ · update_(t−1) + η · ∇w_lookahead

where the look-ahead point is calculated by:

w_lookahead = w_t − γ · update_(t−1)

and ∇w_lookahead is the gradient of the loss evaluated at that point. This look-ahead gradient is used in the update and prevents overshooting. A minimal code sketch follows below.
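The sketch below shows the two-step NAG update on the same assumed toy problem: a partial step to the look-ahead point, then the gradient evaluated there; all values remain illustrative.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

def grad_at(w):
    # Gradient of the toy MSE loss, evaluated at an arbitrary point w
    return 2 * np.mean((w * X - y) * X)

w, eta, gamma = 0.0, 0.01, 0.9
update = 0.0
for t in range(100):
    w_lookahead = w - gamma * update                       # partial step to the look-ahead point
    update = gamma * update + eta * grad_at(w_lookahead)   # gradient taken AT the look-ahead point
    w = w - update                                         # final update: w_(t+1) = w_t − update_t
print(w)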

6. Adaptive Gradient (AdaGrad)

Adagrad stands for Adaptive Gradient Optimizer. The optimizers seen so far, Gradient Descent, Stochastic Gradient Descent, and mini-batch SGD, were all used to reduce the loss function with respect to the weights. The weight update formula is as follows:

w_new = w_old − η · dL/dw_old

Based on iterations, this formula can be written as:

w_t = w_(t−1) − η · dL/dw_(t−1)

where w_t is the value of w at the current iteration, w_(t−1) is the value of w at the previous iteration, and η is the learning rate.
In SGD and mini-batch SGD, the value of η used to be the same for each weight, or say for each parameter (typically η = 0.01). The core idea of the Adagrad optimizer is that each weight has a different learning rate. This modification is important because in real-world datasets some features are sparse (for example, in Bag of Words most of the features are zero) and some are dense (most of the features are non-zero), so keeping the same learning rate for all the weights is not good for optimization. The weight update formula for Adagrad looks like:

w_t = w_(t−1) − η_t · dL/dw_(t−1), with η_t = η / sqrt(α_t + ε)

Here α_t is the accumulator that gives each weight a different effective learning rate η_t at each iteration, η is a constant, and ε is a small positive number added to avoid a divide-by-zero error in case α_t is 0. Note that when α_t grows very large, the effective learning rate becomes nearly zero, which makes w_old ≈ w_new and leads to slow convergence.

α_t accumulates the square of the derivative of the loss with respect to the weight, which is always non-negative since it is a squared term. This means α_t never decreases, i.e. α_t ≥ α_(t−1). It can also be seen from the formula that α_t and the effective learning rate η_t are inversely related: as α_t increases, η_t decreases. So as the number of iterations increases, the learning rate reduces adaptively, and you do not need to select the learning rate manually.
Advantages of Adagrad:
• No manual tuning of the learning rate required.
• Faster convergence.
• More reliable.

One main disadvantage of the Adagrad optimizer is that α_t can become very large as the number of iterations increases, so the effective learning rate decreases sharply. This makes the old weight almost equal to the new weight, which may lead to slow convergence. A sketch follows below.
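Here is a minimal sketch of Adagrad under these formulas on the assumed toy problem from the earlier sections; the initial η and the data are illustrative, and alpha is the running sum of squared gradients.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

w, eta, eps = 0.0, 0.5, 1e-8     # eta can start larger; Adagrad shrinks it adaptively
alpha = 0.0                      # running sum of squared gradients, alpha_0 = 0
for t in range(100):
    grad = 2 * np.mean((w * X - y) * X)
    alpha = alpha + grad ** 2                      # alpha_t = alpha_(t−1) + grad^2 (never decreases)
    w = w - (eta / np.sqrt(alpha + eps)) * grad    # effective rate: eta / sqrt(alpha_t + eps)
print(w)  # approaches 2.0 as the effective learning rate decays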
