DL_26-09 (3)
s(t) = Y(t),                           t = 1
s(t) = β · s(t−1) + (1 − β) · Y(t),    t > 1
The coefficient β represents the degree of weighting decrease, a constant smoothing factor between 0 and 1. A lower β discounts older observations faster.
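As a sketch, the EMA recurrence above can be implemented directly (the function name and the β values used below are illustrative, not from the notes):

```python
def ema(values, beta):
    """Exponential moving average: s(1) = y(1); s(t) = beta*s(t-1) + (1-beta)*y(t)."""
    s = values[0]                      # base case: first observation
    smoothed = [s]
    for y in values[1:]:
        s = beta * s + (1 - beta) * y  # recurrence for t > 1
        smoothed.append(s)
    return smoothed
```

For the series [1, 0, 0, 0], beta=0.5 leaves only 0.125 of the initial value after three steps, while beta=0.9 still retains 0.729 — the lower β discounts the old observation faster, as stated above.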
• To account for the momentum, we can use a moving average over the past gradients.
• In regions where the gradient is high, like AB, weight updates will be large. Thus, in a way we are gathering momentum by taking a moving average over these gradients.
• But there is a problem with this method: it considers all the gradients over iterations with equal weightage.
o That is, the gradient at t=0 has equal weightage as that of the gradient at the current iteration t.
We need to use some sort of weighted average of the past gradients such that the recent gradients are given more weightage.
This can be done by using an Exponential Moving Average (EMA).
o An exponential moving average is a moving average that assigns a greater weight to the most recent values.
Assume the weight update at the zeroth iteration t=0 is zero:
v(0) = 0
This addresses the point "when the gradient at the current moment is negligible or zero, the learning becomes zero". Using momentum with gradient descent, gradients from the past will push the cost further to move around a saddle point.
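A minimal momentum-based gradient descent loop, with v(0) = 0 as assumed above (the toy objective and hyperparameter values here are illustrative):

```python
def momentum_gd(grad_fn, w0, eta=0.1, gamma=0.9, steps=200):
    """Gradient descent with momentum: v(t) = gamma*v(t-1) + eta*grad(t); w <- w - v(t)."""
    w, v = w0, 0.0                          # v(0) = 0, as assumed at the zeroth iteration
    for _ in range(steps):
        v = gamma * v + eta * grad_fn(w)    # accumulate past gradients into the velocity
        w = w - v
    return w

# Usage: minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w_star = momentum_gd(lambda w: 2 * (w - 3), w0=0.0)
```

Setting gamma=0 recovers plain gradient descent, which makes the role of the momentum term easy to isolate in experiments.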
In the cost surface, let's zoom into point C.
How does Momentum fix this?
update(t) = γ · update(t−1) + η · ∇w(t) = Σ_{τ=1..t} γ^(t−τ) · η · ∇w(τ)
• Case 1: When all the past gradients have the same sign
o The summation term will become large and we will take large steps while updating the weights.
• Along the curve BC, even if the learning rate is low, all the gradients along the curve will have the same direction (sign), thus increasing the momentum and accelerating the descent.
• Case 2: When some of the gradients have a +ve sign whereas others have a −ve sign
o The summation term will become small and weight updates will be small.
o If the learning rate is high, the gradient at each iteration around the valley C will alternate its sign between +ve and −ve, and after a few oscillations, the sum of past gradients will become small. Thus, making small updates in the weights from there on and damping the oscillations.
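The two cases can be checked numerically: a sequence of same-sign gradients builds a large accumulated update, while alternating signs largely cancel out (the γ and η values are illustrative):

```python
def accumulated_update(grads, gamma=0.9, eta=0.1):
    """Fold a gradient history into the momentum term v(t) = gamma*v(t-1) + eta*g(t)."""
    v = 0.0
    for g in grads:
        v = gamma * v + eta * g
    return v

same_sign = accumulated_update([1.0] * 10)         # Case 1: momentum builds up
alternating = accumulated_update([1.0, -1.0] * 5)  # Case 2: oscillations damp the sum
```

With these values, same_sign ≈ 0.65 while |alternating| stays below 0.05, matching Case 1 and Case 2 respectively.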
• By adding a momentum term in the gradient descent, gradients accumulated from past
iterations will push the cost further to move around a saddle point even when the current
gradient is negligible or zero.
• Even though momentum with gradient descent converges better and faster, it still doesn't
resolve all the problems.
• First, the hyperparameter η (learning rate) has to be tuned manually.
• Second, in some cases, even if the learning rate is low, the momentum term combined with the current gradient can overshoot the minimum and cause oscillations.
The learning rate problem can be further resolved by using other variations of Gradient Descent like Adaptive Gradient (AdaGrad) and RMSprop.
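As a sketch of the adaptive-learning-rate idea, here is RMSprop's standard per-parameter update (the hyperparameter values are illustrative defaults, not taken from these notes):

```python
import math

def rmsprop_step(w, s, grad, eta=0.01, beta=0.9, eps=1e-8):
    """One RMSprop step: scale the learning rate by an EMA of squared gradients."""
    s = beta * s + (1 - beta) * grad ** 2      # running estimate of gradient magnitude
    w = w - eta * grad / (math.sqrt(s) + eps)  # step shrinks where gradients are large
    return w, s

# Usage: minimize f(w) = (w - 3)^2.
w, s = 0.0, 0.0
for _ in range(1000):
    w, s = rmsprop_step(w, s, 2 * (w - 3))
```

Note how the same EMA machinery introduced earlier reappears here, only applied to squared gradients rather than to the gradients themselves.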
The large momentum problem can be further resolved by using a variation of momentum-based gradient descent called Nesterov Accelerated Gradient Descent.
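Nesterov Accelerated Gradient evaluates the gradient at a look-ahead point rather than at the current point. A minimal sketch (function name and hyperparameters are illustrative):

```python
def nag_step(w, v, grad_fn, gamma=0.9, eta=0.1):
    """One NAG step: look ahead along the momentum, then correct with the gradient there."""
    w_look = w - gamma * v                  # partial update using accumulated momentum
    v = gamma * v + eta * grad_fn(w_look)   # gradient measured at the look-ahead point
    return w - v, v

# Usage: minimize f(w) = (w - 3)^2.
w, v = 0.0, 0.0
for _ in range(200):
    w, v = nag_step(w, v, lambda x: 2 * (x - 3))
```

Because the gradient is taken after the momentum move, NAG can "brake" earlier when the momentum is about to overshoot, which is what reduces the oscillations.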
"1"'"''11 = ()
II JI"(/ t' I V II' I = ,,"VIL' I
= -, . II p,/" t I II ..... I,
11pd(lft t = / · IIJl(iaf, 1-1 + 11V11•, = -,1- I · 11\111•1 + i I-''- · 11V11·1 + ... + 1/\11'1
l
11
l)
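The unrolled form can be checked against the recursive definition numerically (the gradient values below are arbitrary, for illustration only):

```python
def recursive_update(grads, gamma, eta):
    """update(t) = gamma * update(t-1) + eta * grad(t), with update(0) = 0."""
    u = 0.0
    for g in grads:
        u = gamma * u + eta * g
    return u

def unrolled_update(grads, gamma, eta):
    """update(t) = sum over tau of gamma^(t - tau) * eta * grad(tau)."""
    t = len(grads)
    return sum(gamma ** (t - 1 - i) * eta * g for i, g in enumerate(grads))

grads = [0.5, -1.2, 0.3, 0.9]
assert abs(recursive_update(grads, 0.9, 0.1) - unrolled_update(grads, 0.9, 0.1)) < 1e-12
```

The equality holds for any γ and η, since the unrolled sum is just the recursion expanded term by term.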
• Let us change our input data so that we end up with a different error surface and then see
what happens ...
o In this case, the error is high on either side of the minima valley.
o Could momentum be detrimental in such cases ... let's see.
[Figure: contour plot of the error surface with a narrow minima valley, comparing momentum-based and vanilla gradient descent]
o After 100 iterations, the momentum-based method has reached an error of 0.00001, whereas vanilla gradient descent is still stuck at an error of 0.36.
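The flavor of this comparison can be reproduced on a toy quadratic (the surface, starting point, and hyperparameters below are illustrative, not the exact setup from the notes):

```python
def run(grad_fn, w0, eta, gamma, steps):
    """Momentum gradient descent; gamma = 0 reduces it to vanilla gradient descent."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = gamma * v + eta * grad_fn(w)
        w = w - v
    return w

grad = lambda w: 2 * (w - 2)  # gradient of f(w) = (w - 2)^2
vanilla = run(grad, w0=-4.0, eta=0.01, gamma=0.0, steps=100)
momentum = run(grad, w0=-4.0, eta=0.01, gamma=0.9, steps=100)
# With a small learning rate and a fixed iteration budget, the momentum run
# ends far closer to the minimum at w = 2 than the vanilla run.
```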
Question:
Can we do something to reduce these oscillations?
Yes, Nesterov accelerated gradient.
• We see many oscillations. Why? Because we are making greedy decisions.
• Each point is trying to push the parameters in a direction most favorable to it (without being aware of how this affects other points).
• A parameter update which is locally favorable to one point may harm other points (it's almost as if the data points are competing with each other).
• Indeed, we see that there is no guarantee that each local greedy move reduces the global error.