
DL_26-09 (3)

The document discusses the concept of momentum in gradient descent, explaining how it helps in accelerating convergence and reducing oscillations by weighting past gradients. It introduces the Exponential Moving Average (EMA) to give more importance to recent gradients and describes the impact of different momentum constants on weight updates. Additionally, it presents Nesterov Accelerated Gradient Descent (NAG) as an improvement that anticipates future updates to enhance convergence speed and stability.

Uploaded by kyasashreya
© All Rights Reserved

The EMA for a series Y may be calculated recursively:

s(t) = Y(1), for t = 1
s(t) = β · s(t−1) + (1 − β) · Y(t), for t > 1

where
• β: a constant smoothing factor between 0 and 1 that controls how quickly older observations are discounted. A lower β discounts older observations faster.
• Y(t): the value at a period t.
• s(t): the value of the EMA at any period t.

In the gradient descent setting:
• v(t): the new weight update done at iteration t
• β: momentum constant
• ∂(t): the gradient at iteration t
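This recursion is easy to check in code. A minimal sketch (the series values and β below are arbitrary, chosen only to illustrate the discounting):

```python
def ema(Y, beta):
    """Exponential moving average: s(1) = Y(1); s(t) = beta*s(t-1) + (1-beta)*Y(t)."""
    s = Y[0]
    out = [s]
    for y in Y[1:]:
        s = beta * s + (1 - beta) * y
        out.append(s)
    return out

series = [1.0, 2.0, 3.0, 4.0]
# With beta = 0.5 each step mixes the old average and the new value equally:
print(ema(series, beta=0.5))  # [1.0, 1.5, 2.25, 3.125]
# A lower beta discounts older observations faster, so the EMA tracks Y more closely:
print(ema(series, beta=0.1))
```
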
How can this be used and applied to Gradient Descent?

• To account for the momentum, we can use a moving average over the past gradients.
• In regions where the gradient is high, like AB, weight updates will be large. Thus, in a way, we are gathering momentum by taking a moving average over these gradients.
• But there is a problem with this method: it considers all the gradients over iterations with equal weightage.
  o That is, the gradient at t = 0 has equal weightage as that of the gradient at the current iteration t.
• We need to use some sort of weighted average of the past gradients such that the recent gradients are given more weightage.
• This can be done by using an Exponential Moving Average (EMA).
  o An exponential moving average is a moving average that assigns a greater weight to the most recent values.
Assume the weight update at the zeroth iteration t = 0 is zero:

v(0) = 0

v(1) = β · v(0) + (1 − β) · ∂(1)
v(1) = (1 − β) · ∂(1)

v(2) = β · v(1) + (1 − β) · ∂(2)
v(2) = β · {(1 − β) · ∂(1)} + (1 − β) · ∂(2)
v(2) = (1 − β) · {β · ∂(1) + ∂(2)}

v(3) = β · v(2) + (1 − β) · ∂(3)
v(3) = β · {(1 − β) · {β · ∂(1) + ∂(2)}} + (1 − β) · ∂(3)
v(3) = (1 − β) · {β² · ∂(1) + β · ∂(2) + ∂(3)}

Think about the constant β and ignore the term (1 − β) in the above equation.

Note: In many texts, you might find (1 − β) replaced with η, the learning rate.

In general,

v(n) = (1 − β) · Σ_{t=1}^{n} β^(n−t) · ∂(t)
What should be the β value?

What if β is 0.1?
• At n = 3: the gradient at t = 3 will contribute 100% of its value, the gradient at t = 2 will contribute 10% of its value, and the gradient at t = 1 will only contribute 1% of its value.
• Here the contribution from earlier gradients decreases rapidly.

What if β is 0.9?
• At n = 3: the gradient at t = 3 will contribute 100% of its value, t = 2 will contribute 90% of its value, and the gradient at t = 1 will contribute 81% of its value.

From the above, we can deduce that a higher β will accommodate more gradients from the past. Hence, β is generally kept around 0.9 in most cases.

Note: The actual contribution of each gradient to the weight update will be further subjected to the learning rate.
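These percentages are just powers of β. A quick sketch to verify them (ignoring the common (1 − β) factor, as the note above suggests):

```python
def contributions(beta, n):
    """Relative weight beta**(n - t) of the gradient at each step t = 1..n."""
    return [beta ** (n - t) for t in range(1, n + 1)]

# beta = 0.1: contributions from earlier gradients die off almost immediately.
print([round(c, 2) for c in contributions(0.1, 3)])  # [0.01, 0.1, 1.0]
# beta = 0.9: gradients from the past are retained much longer.
print([round(c, 2) for c in contributions(0.9, 3)])  # [0.81, 0.9, 1.0]
```
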

This addresses the point "when the gradient at the current moment is negligible or zero, the learning becomes zero". Using momentum with gradient descent, gradients from the past will push the cost further to move around a saddle point.
In the cost surface, let's zoom into point C.
With gradient descent,
• If the learning rate is too small, the weights will be updated very slowly; hence convergence takes a lot of time even when the gradient is high.
• If the learning rate is too high, the cost oscillates around the minima.
How does Momentum fix this?

Let's look at the summation equation of the momentum again:

v(n) = (1 − β) · Σ_{t=1}^{n} β^(n−t) · ∂(t)

• Case 1: When all the past gradients have the same sign
  o The summation term will become large and we will take large steps while updating the weights.
  o Along the curve BC, even if the learning rate is low, all the gradients along the curve will have the same direction (sign), thus increasing the momentum and accelerating the descent.

• Case 2: When some of the gradients have a +ve sign whereas others have a −ve sign
  o The summation term will become small and the weight updates will be small.
  o If the learning rate is high, the gradient at each iteration around the valley C will alter its sign between +ve and −ve, and after a few oscillations the sum of past gradients will become small, making the weight updates small from there on and damping the oscillations.
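The two cases can be checked numerically. A small sketch (the gradient sequences below are invented purely for illustration; the function computes the EMA v(n) from above recursively):

```python
def momentum_v(grads, beta=0.9):
    """v(n) = (1 - beta) * sum_t beta**(n - t) * grad(t), computed recursively."""
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g
    return v

same_sign   = [1.0, 1.0, 1.0, 1.0, 1.0]    # Case 1: e.g. along the curve BC
alternating = [1.0, -1.0, 1.0, -1.0, 1.0]  # Case 2: oscillating around the valley C

print(momentum_v(same_sign))    # grows steadily -> large weight updates
print(momentum_v(alternating))  # terms cancel  -> small, damped updates
```
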

Gradient Descent with Momentum takes small steps in directions where the gradients oscillate and takes large steps along the direction where the past gradients have the same direction (same sign).

GD with Momentum

• By adding a momentum term in the gradient descent, gradients accumulated from past iterations will push the cost further to move around a saddle point even when the current gradient is negligible or zero.
• Even though momentum with gradient descent converges better and faster, it still doesn't resolve all the problems.
• First, the hyperparameter η (learning rate) has to be tuned manually.
• Second, in some cases, even if the learning rate is low, the momentum term and the current gradient alone can drive and cause oscillations.

The learning rate problem can be further resolved by using other variations of Gradient Descent like Adaptive Gradient (AdaGrad) and RMSprop.
The large momentum problem can be further resolved by using a variation of momentum-based gradient descent called Nesterov Accelerated Gradient Descent.
update_0 = 0
update_1 = γ · update_0 + η∇w_1 = η∇w_1
update_2 = γ · update_1 + η∇w_2 = γ · η∇w_1 + η∇w_2
update_3 = γ · update_2 + η∇w_3 = γ(γ · η∇w_1 + η∇w_2) + η∇w_3
         = γ² · η∇w_1 + γ · η∇w_2 + η∇w_3
update_4 = γ · update_3 + η∇w_4 = γ³ · η∇w_1 + γ² · η∇w_2 + γ · η∇w_3 + η∇w_4
...
update_t = γ · update_{t-1} + η∇w_t = γ^(t-1) · η∇w_1 + γ^(t-2) · η∇w_2 + ... + η∇w_t
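The expansion above can be sketched as a full-batch momentum loop, written in the same style as the NAG listing later in this document (the data X, Y, the gradient functions, and all hyperparameter values here are toy assumptions, not part of the original notes):

```python
def do_momentum_gradient_descent(grad_w, grad_b, X, Y,
                                 init_w=0.0, init_b=0.0,
                                 max_epochs=200, eta=0.1, gamma=0.9):
    """update_t = gamma * update_{t-1} + eta * grad_t;  w_{t+1} = w_t - update_t."""
    w, b = init_w, init_b
    prev_update_w, prev_update_b = 0.0, 0.0
    for _ in range(max_epochs):
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):        # accumulate the full-batch gradient
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)
        update_w = gamma * prev_update_w + eta * dw
        update_b = gamma * prev_update_b + eta * db
        w, b = w - update_w, b - update_b
        prev_update_w, prev_update_b = update_w, update_b
    return w, b

# Toy usage: minimize (w*x + b - y)^2 for the single point (x, y) = (1, 2).
grad_w = lambda w, b, x, y: 2 * (w * x + b - y) * x
grad_b = lambda w, b, x, y: 2 * (w * x + b - y)
w, b = do_momentum_gradient_descent(grad_w, grad_b, X=[1.0], Y=[2.0])
print(w + b)  # converges toward 2.0
```
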
Some observations and questions

• Even in the regions having gentle slopes, momentum-based gradient descent is able to take large steps because the momentum carries it along.
• Is moving fast always good? Would there be a situation where momentum would cause us to run past our goal?

• Let us change our input data so that we end up with a different error surface and then see what happens...
  o In this case, the error is high on either side of the minima valley.
  o Could momentum be detrimental in such cases... let's see.


o Momentum-based gradient descent oscillates in and out of the minima valley as the momentum carries it out of the valley.
o Takes a lot of u-turns before finally converging.
o Despite these u-turns it still converges faster than vanilla gradient descent.
o After 100 iterations the momentum-based method has reached an error of 0.00001 whereas vanilla gradient descent is still stuck at an error of 0.36.

Question:
Can we do something to reduce these oscillations?
Yes, Nesterov accelerated gradient.

Nesterov Accelerated Gradient Descent


Intuition
o Look before you leap.
o Recall that update_t = γ · update_{t-1} + η∇w_t
o So we know that we are going to move by at least γ · update_{t-1} and then a bit more by η∇w_t.
o Why not calculate the gradient (∇w_lookahead) at this partially updated value of w (w_lookahead = w_t − γ · update_{t-1}) instead of calculating it using the current value w_t?

Update rule for NAG

w_lookahead = w_t − γ · update_{t-1}
update_t = γ · update_{t-1} + η∇w_lookahead
w_{t+1} = w_t − update_t

We will have a similar update rule for b_t.


def do_nesterov_accelerated_gradient_descent():
    w, b, eta = init_w, init_b, 1.0
    prev_v_w, prev_v_b, gamma = 0, 0, 0.9
    for i in range(max_epochs):
        dw, db = 0, 0
        # do partial update
        v_w = gamma * prev_v_w
        v_b = gamma * prev_v_b
        for x, y in zip(X, Y):
            # calculate gradients after partial update
            dw += grad_w(w - v_w, b - v_b, x, y)
            db += grad_b(w - v_w, b - v_b, x, y)
        # now do the full update
        v_w = gamma * prev_v_w + eta * dw
        v_b = gamma * prev_v_b + eta * db
        w = w - v_w
        b = b - v_b
        prev_v_w = v_w
        prev_v_b = v_b

Observations about NAG:

• Looking ahead helps NAG in correcting its course quicker than momentum-based gradient descent.
• Hence the oscillations are smaller and the chances of escaping the minima valley are also smaller.
o Notice that the algorithm goes over the entire data once before updating the parameters.
o Why? Because this is the true gradient of the loss as derived earlier (the sum of the gradients of the losses corresponding to each data point).
o No approximation. Hence, theoretical guarantees hold (in other words, each step guarantees that the loss will decrease).
o What's the flipside? Imagine we have a million points in the training data. To make one update to w, b the algorithm makes a million calculations. Obviously very slow!!
o Can we do something better? Yes, let's look at stochastic gradient descent.
def do_stochastic_gradient_descent():
    w, b, eta, max_epochs = -2, -2, 1.0, 1000
    for i in range(max_epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw = grad_w(w, b, x, y)
            db = grad_b(w, b, x, y)
            w = w - eta * dw
            b = b - eta * db

o Notice that the algorithm updates the parameters for every single data point.
o Now if we have a million data points we will make a million updates in each epoch (1 epoch = 1 pass over the data; 1 step = 1 update).
o What is the flipside? It is an approximate (rather stochastic) gradient.
o Stochastic because we are estimating the total gradient based on a single data point. Almost like tossing a coin only once and estimating P(heads).
o No guarantee that each step will decrease the loss.
o Let's see this algorithm in action when we have a few data points.

• We see many oscillations. Why? Because we are making greedy decisions.
• Each point is trying to push the parameters in a direction most favorable to it (without being aware of how this affects other points).
• A parameter update which is locally favorable to one point may harm other points (it is almost as if the data points are competing with each other).
• Indeed we see that there is no guarantee that each local greedy move reduces the global error.


• Can we reduce the oscillations by improving our stochastic estimates of the gradient (currently estimated from just 1 data point at a time)?
  o Yes, mini-batch gradient descent.
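A mini-batch version sums the gradient over a small batch before each update, trading some of SGD's speed for a less noisy gradient estimate. A minimal sketch (the batch size, toy data, gradient functions, and learning rate below are arbitrary illustrative choices):

```python
def do_mini_batch_gradient_descent(grad_w, grad_b, X, Y,
                                   batch_size=2, max_epochs=500, eta=0.1):
    """One parameter update per mini-batch, instead of per point (SGD) or per epoch (batch GD)."""
    w, b = 0.0, 0.0
    for _ in range(max_epochs):
        for start in range(0, len(X), batch_size):
            xb, yb = X[start:start + batch_size], Y[start:start + batch_size]
            dw = sum(grad_w(w, b, x, y) for x, y in zip(xb, yb))
            db = sum(grad_b(w, b, x, y) for x, y in zip(xb, yb))
            w, b = w - eta * dw, b - eta * db
    return w, b

grad_w = lambda w, b, x, y: 2 * (w * x + b - y) * x
grad_b = lambda w, b, x, y: 2 * (w * x + b - y)
X, Y = [0.0, 0.5, 1.0, 1.5], [1.0, 2.0, 3.0, 4.0]  # toy data from y = 2x + 1
w, b = do_mini_batch_gradient_descent(grad_w, grad_b, X, Y)
print(w, b)  # approaches w = 2, b = 1
```
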
