
DL_26-09 (3)

The document discusses the concept of momentum in gradient descent, explaining how it helps in accelerating convergence and reducing oscillations by weighting past gradients. It introduces the Exponential Moving Average (EMA) to give more importance to recent gradients and describes the impact of different momentum constants on weight updates. Additionally, it presents Nesterov Accelerated Gradient Descent (NAG) as an improvement that anticipates future updates to enhance convergence speed and stability.

Uploaded by kyasashreya
© All Rights Reserved

The EMA for a series Y may be calculated recursively:

s(t) = Y(1), for t = 1
s(t) = β · s(t−1) + (1 − β) · Y(t), for t > 1

where
• β: a constant smoothing factor between 0 and 1 that controls how quickly older observations are discounted. A lower β discounts older observations faster.
• Y(t): the value at a period t.
• s(t): the value of the EMA at any period t.

In the gradient descent setting:
• v(t): the new weight update done at iteration t
• β: momentum constant
• ∂(t): the gradient at iteration t
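This recursion is easy to check in code. A minimal sketch (the series values and β below are arbitrary, chosen only to illustrate the discounting):

```python
def ema(Y, beta):
    """Exponential moving average: s(1) = Y(1); s(t) = beta*s(t-1) + (1-beta)*Y(t)."""
    s = Y[0]
    out = [s]
    for y in Y[1:]:
        s = beta * s + (1 - beta) * y
        out.append(s)
    return out

series = [1.0, 2.0, 3.0, 4.0]
# With beta = 0.5 each step mixes the old average and the new value equally:
print(ema(series, beta=0.5))  # [1.0, 1.5, 2.25, 3.125]
# A lower beta discounts older observations faster, so the EMA tracks Y more closely:
print(ema(series, beta=0.1))
```
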
How can this be used and applied to Gradient Descent?

• To account for the momentum, we can use a moving average over the past gradients.
• In regions where the gradient is high, like AB, weight updates will be large. Thus, in a way, we are gathering momentum by taking a moving average over these gradients.
• But there is a problem with this method: it considers all the gradients over iterations with equal weightage.
  o That is, the gradient at t = 0 has equal weightage as that of the gradient at the current iteration t.
• We need to use some sort of weighted average of the past gradients such that the recent gradients are given more weightage.
• This can be done by using an Exponential Moving Average (EMA).
  o An exponential moving average is a moving average that assigns a greater weight to the most recent values.
Assume the weight update at the zeroth iteration t = 0 is zero:

v(0) = 0

v(1) = β · v(0) + (1 − β) · ∂(1)
v(1) = (1 − β) · ∂(1)

v(2) = β · v(1) + (1 − β) · ∂(2)
v(2) = β · {(1 − β) · ∂(1)} + (1 − β) · ∂(2)
v(2) = (1 − β) · {β · ∂(1) + ∂(2)}

v(3) = β · v(2) + (1 − β) · ∂(3)
v(3) = β · {(1 − β) · {β · ∂(1) + ∂(2)}} + (1 − β) · ∂(3)
v(3) = (1 − β) · {β² · ∂(1) + β · ∂(2) + ∂(3)}

Think about the constant β and ignore the term (1 − β) in the above equation.

Note: In many texts, you might find (1 − β) replaced with η, the learning rate.

In general,

v(n) = (1 − β) · Σ_{t=1}^{n} β^(n−t) · ∂(t)
What should be the β value?

What if β is 0.1?
• At n = 3: the gradient at t = 3 will contribute 100% of its value, the gradient at t = 2 will contribute 10% of its value, and the gradient at t = 1 will only contribute 1% of its value.
• Here the contribution from earlier gradients decreases rapidly.

What if β is 0.9?
• At n = 3: the gradient at t = 3 will contribute 100% of its value, t = 2 will contribute 90% of its value, and the gradient at t = 1 will contribute 81% of its value.

From the above, we can deduce that a higher β will accommodate more gradients from the past. Hence, β is generally kept around 0.9 in most cases.

Note: The actual contribution of each gradient to the weight update will be further subjected to the learning rate.
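These percentages are just powers of β. A quick sketch to verify them (ignoring the common (1 − β) factor, as the note above suggests):

```python
def contributions(beta, n):
    """Relative weight beta**(n - t) of the gradient at each step t = 1..n."""
    return [beta ** (n - t) for t in range(1, n + 1)]

# beta = 0.1: contributions from earlier gradients die off almost immediately.
print([round(c, 2) for c in contributions(0.1, 3)])  # [0.01, 0.1, 1.0]
# beta = 0.9: gradients from the past are retained much longer.
print([round(c, 2) for c in contributions(0.9, 3)])  # [0.81, 0.9, 1.0]
```
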

This addresses the point "when the gradient at the current moment is negligible or zero, the learning becomes zero". Using momentum with gradient descent, gradients from the past will push the cost further to move around a saddle point.
In the cost surface, let's zoom into point C.
With gradient descent,
• If the learning rate is too small, the weights will be updated very slowly; hence convergence takes a lot of time even when the gradient is high.
• If the learning rate is too high, the cost oscillates around the minima.
How does Momentum fix this?

Let's look at the summation equation of the momentum again:

v(n) = (1 − β) · Σ_{t=1}^{n} β^(n−t) · ∂(t)

• Case 1: When all the past gradients have the same sign
  o The summation term will become large and we will take large steps while updating the weights.
  o Along the curve BC, even if the learning rate is low, all the gradients along the curve will have the same direction (sign), thus increasing the momentum and accelerating the descent.

• Case 2: When some of the gradients have a +ve sign whereas others have a −ve sign
  o The summation term will become small and the weight updates will be small.
  o If the learning rate is high, the gradient at each iteration around the valley C will alter its sign between +ve and −ve, and after a few oscillations the sum of past gradients will become small, making the weight updates small from there on and damping the oscillations.
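The two cases can be checked numerically. A small sketch (the gradient sequences below are invented purely for illustration; the function computes the EMA v(n) from above recursively):

```python
def momentum_v(grads, beta=0.9):
    """v(n) = (1 - beta) * sum_t beta**(n - t) * grad(t), computed recursively."""
    v = 0.0
    for g in grads:
        v = beta * v + (1 - beta) * g
    return v

same_sign   = [1.0, 1.0, 1.0, 1.0, 1.0]    # Case 1: e.g. along the curve BC
alternating = [1.0, -1.0, 1.0, -1.0, 1.0]  # Case 2: oscillating around the valley C

print(momentum_v(same_sign))    # grows steadily -> large weight updates
print(momentum_v(alternating))  # terms cancel  -> small, damped updates
```
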

Gradient Descent with Momentum takes small steps in directions where the gradients oscillate and takes large steps along the direction where the past gradients have the same direction (same sign).

GD with Momentum

• By adding a momentum term in the gradient descent, gradients accumulated from past iterations will push the cost further to move around a saddle point even when the current gradient is negligible or zero.
• Even though momentum with gradient descent converges better and faster, it still doesn't resolve all the problems.
• First, the hyperparameter η (learning rate) has to be tuned manually.
• Second, in some cases, even if the learning rate is low, the momentum term and the current gradient alone can drive and cause oscillations.

The learning rate problem can be further resolved by using other variations of Gradient Descent like Adaptive Gradient (AdaGrad) and RMSprop.
The large momentum problem can be further resolved by using a variation of momentum-based gradient descent called Nesterov Accelerated Gradient Descent.
update_0 = 0
update_1 = γ · update_0 + η∇w_1 = η∇w_1
update_2 = γ · update_1 + η∇w_2 = γ · η∇w_1 + η∇w_2
update_3 = γ · update_2 + η∇w_3 = γ(γ · η∇w_1 + η∇w_2) + η∇w_3
         = γ² · η∇w_1 + γ · η∇w_2 + η∇w_3
update_4 = γ · update_3 + η∇w_4 = γ³ · η∇w_1 + γ² · η∇w_2 + γ · η∇w_3 + η∇w_4
...
update_t = γ · update_{t-1} + η∇w_t = γ^(t-1) · η∇w_1 + γ^(t-2) · η∇w_2 + ... + η∇w_t
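The expansion above can be sketched as a full-batch momentum loop, written in the same style as the NAG listing later in this document (the data X, Y, the gradient functions, and all hyperparameter values here are toy assumptions, not part of the original notes):

```python
def do_momentum_gradient_descent(grad_w, grad_b, X, Y,
                                 init_w=0.0, init_b=0.0,
                                 max_epochs=200, eta=0.1, gamma=0.9):
    """update_t = gamma * update_{t-1} + eta * grad_t;  w_{t+1} = w_t - update_t."""
    w, b = init_w, init_b
    prev_update_w, prev_update_b = 0.0, 0.0
    for _ in range(max_epochs):
        dw, db = 0.0, 0.0
        for x, y in zip(X, Y):        # accumulate the full-batch gradient
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)
        update_w = gamma * prev_update_w + eta * dw
        update_b = gamma * prev_update_b + eta * db
        w, b = w - update_w, b - update_b
        prev_update_w, prev_update_b = update_w, update_b
    return w, b

# Toy usage: minimize (w*x + b - y)^2 for the single point (x, y) = (1, 2).
grad_w = lambda w, b, x, y: 2 * (w * x + b - y) * x
grad_b = lambda w, b, x, y: 2 * (w * x + b - y)
w, b = do_momentum_gradient_descent(grad_w, grad_b, X=[1.0], Y=[2.0])
print(w + b)  # converges toward 2.0
```
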
Some observations and questions

• Even in the regions having gentle slopes, momentum-based gradient descent is able to take large steps because the momentum carries it along.
• Is moving fast always good? Would there be a situation where momentum would cause us to run past our goal?

• Let us change our input data so that we end up with a different error surface and then see what happens...
  o In this case, the error is high on either side of the minima valley.
  o Could momentum be detrimental in such cases... let's see.


o Momentum-based gradient descent oscillates in and out of the minima valley as the momentum carries it out of the valley.
o Takes a lot of u-turns before finally converging.
o Despite these u-turns it still converges faster than vanilla gradient descent.
o After 100 iterations the momentum-based method has reached an error of 0.00001 whereas vanilla gradient descent is still stuck at an error of 0.36.

Question:
Can we do something to reduce these oscillations?
Yes, Nesterov accelerated gradient.

Nesterov Accelerated Gradient Descent


Intuition
o Look before you leap.
o Recall that update_t = γ · update_{t-1} + η∇w_t
o So we know that we are going to move by at least γ · update_{t-1} and then a bit more by η∇w_t.
o Why not calculate the gradient (∇w_lookahead) at this partially updated value of w (w_lookahead = w_t − γ · update_{t-1}) instead of calculating it using the current value w_t?

Update rule for NAG

w_lookahead = w_t − γ · update_{t-1}
update_t = γ · update_{t-1} + η∇w_lookahead
w_{t+1} = w_t − update_t

We will have a similar update rule for b_t.


def do_nesterov_accelerated_gradient_descent():
    w, b, eta = init_w, init_b, 1.0
    prev_v_w, prev_v_b, gamma = 0, 0, 0.9
    for i in range(max_epochs):
        dw, db = 0, 0
        # do partial update
        v_w = gamma * prev_v_w
        v_b = gamma * prev_v_b
        for x, y in zip(X, Y):
            # calculate gradients after partial update
            dw += grad_w(w - v_w, b - v_b, x, y)
            db += grad_b(w - v_w, b - v_b, x, y)
        # now do the full update
        v_w = gamma * prev_v_w + eta * dw
        v_b = gamma * prev_v_b + eta * db
        w = w - v_w
        b = b - v_b
        prev_v_w = v_w
        prev_v_b = v_b

Observations about NAG:

• Looking ahead helps NAG in correcting its course quicker than momentum-based gradient descent.
• Hence the oscillations are smaller and the chances of escaping the minima valley are also smaller.
o Notice that the algorithm goes over the entire data once before updating the parameters.
o Why? Because this is the true gradient of the loss as derived earlier (the sum of the gradients of the losses corresponding to each data point).
o No approximation. Hence, theoretical guarantees hold (in other words, each step guarantees that the loss will decrease).
o What's the flipside? Imagine we have a million points in the training data. To make one update to w, b the algorithm makes a million calculations. Obviously very slow!!
o Can we do something better? Yes, let's look at stochastic gradient descent.
def do_stochastic_gradient_descent():
    w, b, eta, max_epochs = -2, -2, 1.0, 1000
    for i in range(max_epochs):
        dw, db = 0, 0
        for x, y in zip(X, Y):
            dw = grad_w(w, b, x, y)
            db = grad_b(w, b, x, y)
            w = w - eta * dw
            b = b - eta * db

o Notice that the algorithm updates the parameters for every single data point.
o Now if we have a million data points we will make a million updates in each epoch (1 epoch = 1 pass over the data; 1 step = 1 update).
o What is the flipside? It is an approximate (rather stochastic) gradient.
o Stochastic because we are estimating the total gradient based on a single data point. Almost like tossing a coin only once and estimating P(heads).
o No guarantee that each step will decrease the loss.
o Let's see this algorithm in action when we have a few data points.

• We see many oscillations. Why? Because we are making greedy decisions.
• Each point is trying to push the parameters in a direction most favorable to it (without being aware of how this affects other points).
• A parameter update which is locally favorable to one point may harm other points (it is almost as if the data points are competing with each other).
• Indeed we see that there is no guarantee that each local greedy move reduces the global error.


• Can we reduce the oscillations by improving our stochastic estimates of the gradient (currently estimated from just 1 data point at a time)?
  o Yes, mini-batch gradient descent.
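A mini-batch version sums the gradient over a small batch before each update, trading some of SGD's speed for a less noisy gradient estimate. A minimal sketch (the batch size, toy data, gradient functions, and learning rate below are arbitrary illustrative choices):

```python
def do_mini_batch_gradient_descent(grad_w, grad_b, X, Y,
                                   batch_size=2, max_epochs=500, eta=0.1):
    """One parameter update per mini-batch, instead of per point (SGD) or per epoch (batch GD)."""
    w, b = 0.0, 0.0
    for _ in range(max_epochs):
        for start in range(0, len(X), batch_size):
            xb, yb = X[start:start + batch_size], Y[start:start + batch_size]
            dw = sum(grad_w(w, b, x, y) for x, y in zip(xb, yb))
            db = sum(grad_b(w, b, x, y) for x, y in zip(xb, yb))
            w, b = w - eta * dw, b - eta * db
    return w, b

grad_w = lambda w, b, x, y: 2 * (w * x + b - y) * x
grad_b = lambda w, b, x, y: 2 * (w * x + b - y)
X, Y = [0.0, 0.5, 1.0, 1.5], [1.0, 2.0, 3.0, 4.0]  # toy data from y = 2x + 1
w, b = do_mini_batch_gradient_descent(grad_w, grad_b, X, Y)
print(w, b)  # approaches w = 2, b = 1
```
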
