Ann2018 L5

The document describes the gradient descent algorithm and backpropagation for training neural networks. It provides equations for calculating the error gradient and updating the weights based on this gradient. The gradient is calculated using a chain-rule decomposition involving derivatives of the error with respect to outputs, outputs with respect to inputs, and inputs with respect to weights. Examples demonstrate calculating the gradient and updating weights for single- and multi-layer neural networks.

E = \frac{1}{2}(d_i - o_i)^2 = \frac{1}{2}\bigl(d_i - f(\mathbf{w}_i^t \mathbf{x})\bigr)^2

\nabla E = -(d_i - o_i)\, f'(\mathbf{w}_i^t \mathbf{x})\, \mathbf{x}

The components of the gradient vector are

\frac{\partial E}{\partial w_{ij}} = -(d_i - o_i)\, f'(\mathbf{w}_i^t \mathbf{x})\, x_j \qquad \text{for } j = 1, 2, \ldots, n

\Delta \mathbf{w}_i = -\eta\, \nabla E

\Delta \mathbf{w}_i = \eta\, [d_i - o_i]\, f'(net_i)\, \mathbf{x}

\Delta \mathbf{w} = \eta\, r\, \mathbf{x}, \qquad r = [d_i - f(\mathbf{w}_i^t \mathbf{x})]\, f'(\mathbf{w}_i^t \mathbf{x})

\mathbf{w}^2 = \mathbf{w}^1 + \eta\, [d_i - o_i]\, f'(net_i)\, \mathbf{x} \qquad (d = t, \text{ the target})
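A minimal sketch of this update rule in code (Python/NumPy is assumed; the helper names `f`, `f_prime`, and `delta_rule_step` are illustrative, not from the slides). The activation is the bipolar continuous function used in Example 1 below.

```python
# A minimal sketch of the delta-rule update above for a single continuous neuron.
import numpy as np

def f(net):
    """Bipolar continuous activation: f(net) = 2 / (1 + e^(-net)) - 1."""
    return 2.0 / (1.0 + np.exp(-net)) - 1.0

def f_prime(o):
    """Its derivative written in terms of the output: f'(net) = (1/2)(1 - o^2)."""
    return 0.5 * (1.0 - o ** 2)

def delta_rule_step(w, x, d, eta):
    """One update  w <- w + eta * r * x  with  r = (d - o) f'(net)."""
    net = w @ x                # net = w^t x
    o = f(net)                 # o = f(net)
    r = (d - o) * f_prime(o)   # learning signal r
    return w + eta * r * x
```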


Example 1

\mathbf{x}_1 = [1\ \ {-2}\ \ 0\ \ {-1}]^t, \quad \mathbf{x}_2 = [0\ \ 1.5\ \ {-0.5}\ \ {-1}]^t, \quad \mathbf{x}_3 = [-1\ \ 1\ \ 0.5\ \ {-1}]^t

d_1 = -1, \quad d_2 = -1, \quad d_3 = 1

\mathbf{w}^1 = [1\ \ {-1}\ \ 0\ \ 0.5], \qquad \eta = 0.1

(Figure: a single neuron with four inputs and weights w_1, \ldots, w_4.)

The transfer function is bipolar continuous:

f(net) = \frac{2}{1 + e^{-net}} - 1, \qquad f'(net) = \frac{2e^{-net}}{[1 + e^{-net}]^2} = \frac{1}{2}(1 - o^2)

Step 1:

net^1 = \mathbf{w}^1 \mathbf{x}_1 = [1\ \ {-1}\ \ 0\ \ 0.5][1\ \ {-2}\ \ 0\ \ {-1}]^t = 2.5, \qquad o^1 = f(net^1) = 0.848, \qquad f'(net^1) = 0.140

\mathbf{w}^2 = \eta\, [d_1 - o^1]\, f'(net^1)\, \mathbf{x}_1 + \mathbf{w}^1 = [0.974\ \ {-0.948}\ \ 0\ \ 0.526]^t

Step 2:

net^2 = \mathbf{w}^2 \mathbf{x}_2 = -1.948

Complete this problem for one epoch.
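The remaining steps of the epoch can be carried out mechanically; a possible run in code (a sketch, assuming \eta = 0.1, which reproduces the \mathbf{w}^2 shown above):

```python
# One epoch of Example 1 with the bipolar continuous neuron.
import numpy as np

def f(net):                       # f(net) = 2 / (1 + e^(-net)) - 1
    return 2.0 / (1.0 + np.exp(-net)) - 1.0

X = [np.array([1.0, -2.0, 0.0, -1.0]),
     np.array([0.0, 1.5, -0.5, -1.0]),
     np.array([-1.0, 1.0, 0.5, -1.0])]
D = [-1.0, -1.0, 1.0]
eta = 0.1                         # assumed learning rate (matches the numbers shown)

w = np.array([1.0, -1.0, 0.0, 0.5])          # w^1
for x, d in zip(X, D):                       # one epoch = one pass over x_1, x_2, x_3
    o = f(w @ x)
    r = (d - o) * 0.5 * (1.0 - o**2)         # r = (d - o) f'(net), f'(net) = (1/2)(1 - o^2)
    w = w + eta * r * x
    print(w)                                 # first step: [0.974, -0.948, 0, 0.526]
```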
Example 2

A single neuron receives inputs x_1 = 0.982 and x_2 = 0.5 plus a bias input of 1, with initial weights 2, 4, and -3.93. The target is d = t = 1 and the learning rate is \eta = 0.1.

The transfer function is unipolar continuous (logsig):

o = \frac{1}{1 + e^{-net}}, \qquad f'(net) = (1 - o)\,o

\Delta \mathbf{w} = \eta\,(d - o)(1 - o)\,o\,\mathbf{x} = \eta\,\delta\,\mathbf{x}

Forward pass:

net = 2(0.982) + 4(0.5) - 3.93(1) = 0.034

o = \frac{1}{1 + e^{-0.034}} \approx 0.51, \qquad \text{Error} = 1 - 0.51 = 0.49

Weight update:

\delta = (1 - 0.51)(1 - 0.51)(0.51) = 0.1225

\Delta w_1 = \eta\,\delta\,x_1 = 0.1 \times 0.1225 \times 0.982 = 0.012, \qquad w_1^{new} = 2 + 0.012 = 2.012

w_2^{new} = 4 + 0.1 \times 0.1225 \times 0.5 = 4.0061

w_3^{new} = -3.93 + 0.1 \times 0.1225 \times 1 = -3.9178

After the update:

net = 2.012(0.982) + 4.0061(0.5) - 3.9178(1) = 0.061

o = \frac{1}{1 + e^{-0.061}} = 0.5152, \qquad \text{Error} = 1 - 0.5152 = 0.4848
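These numbers can be checked with a few lines of code (a sketch; `logsig` and the variable names are illustrative):

```python
# Verifying Example 2: one delta-rule step for a logsig neuron with eta = 0.1.
import numpy as np

def logsig(net):
    return 1.0 / (1.0 + np.exp(-net))

x = np.array([0.982, 0.5, 1.0])     # two inputs plus the bias input of 1
w = np.array([2.0, 4.0, -3.93])
d, eta = 1.0, 0.1

o = logsig(w @ x)                   # net = 0.034, o ~ 0.51
delta = (d - o) * (1.0 - o) * o     # ~ 0.1225
w = w + eta * delta * x             # -> [2.012, 4.0061, -3.9178]
o_new = logsig(w @ x)               # net ~ 0.061, o ~ 0.5152
print(w, 1.0 - o_new)               # error drops from 0.49 to ~ 0.4848
```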
By the chain rule:

\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial o_i}\,\frac{\partial o_i}{\partial x_i}\,\frac{\partial x_i}{\partial w_{ij}}

With E = \frac{1}{2}\sum_i (d_i - o_i)^2 and d_i = t_i:

\frac{\partial E}{\partial o_i} = \frac{1}{2}\cdot 2\,(d_i - o_i)(-1) = (o_i - t_i)

\frac{\partial o_i}{\partial x_i} = \frac{\partial}{\partial x_i}\left[\frac{1}{1 + e^{-x_i}}\right] = -\frac{1}{(1 + e^{-x_i})^2}\,(-e^{-x_i}) = \frac{e^{-x_i}}{(1 + e^{-x_i})^2}

= \frac{(1 + e^{-x_i}) - 1}{1 + e^{-x_i}} \cdot \frac{1}{1 + e^{-x_i}} = \left[1 - \frac{1}{1 + e^{-x_i}}\right]\left[\frac{1}{1 + e^{-x_i}}\right] = (1 - o_i)\,o_i

With x_i = \sum_j w_{ij} a_j:

\frac{\partial x_i}{\partial w_{ij}} = a_j

Putting the three factors together:

\frac{\partial E}{\partial w_{ij}} = \underbrace{(o_i - t_i)}_{\text{raw error term}}\;\underbrace{(1 - o_i)\,o_i}_{\text{due to sigmoid}}\;\underbrace{a_j}_{\text{due to incoming (pre-synaptic) activation}}

\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}} \qquad (\text{where } \eta \text{ is an arbitrary learning rate})

w_{ij}^{t+1} = w_{ij}^{t} + \eta\,(t_i - o_i)(1 - o_i)\,o_i\,a_j
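The factored gradient maps directly to code; a minimal sketch for one layer of sigmoid units (the array names `W`, `a`, `t` and the helper `layer_gradient` are assumptions):

```python
# Gradient of E = 1/2 * sum_i (t_i - o_i)^2 with respect to a weight matrix W,
# where o = sigmoid(W @ a). Implements dE/dw_ij = (o_i - t_i)(1 - o_i) o_i a_j.
import numpy as np

def layer_gradient(W, a, t):
    x = W @ a                        # x_i = sum_j w_ij a_j
    o = 1.0 / (1.0 + np.exp(-x))     # o_i = 1 / (1 + e^(-x_i))
    raw_error = o - t                # (o_i - t_i)
    sigmoid_term = (1.0 - o) * o     # (1 - o_i) o_i
    return np.outer(raw_error * sigmoid_term, a)

# Gradient-descent step on the whole matrix:  W <- W - eta * dE/dW
# W -= eta * layer_gradient(W, a, t)
```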


A two-layer network

\mathbf{x} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad d = 1, \qquad \eta = 0.1

The transfer function is unipolar continuous:

o = \frac{1}{1 + e^{-net}}, \qquad f'(net) = (1 - o)\,o

Forward pass:

net_3 = u_3 = 3(1) + 4(0) + 1(1) = 4, \qquad o_3 = \frac{1}{1 + e^{-4}} = 0.982

net_4 = u_4 = 6(1) + 5(0) + (-6)(1) = 0, \qquad o_4 = \frac{1}{1 + e^{0}} = 0.5

net_5 = u_5 = 2(0.982) + 4(0.5) - 3.93(1) = 0.034, \qquad o_5 = \frac{1}{1 + e^{-0.034}} = 0.51

Output-layer update, \Delta \mathbf{w} = \eta\,(d - o)(1 - o)\,o\,\mathbf{x} = \eta\,\delta\,\mathbf{x}:

\delta_5 = (1 - 0.51)(1 - 0.51)(0.51) = 0.1225

\Delta w_{53} = \eta\,\delta_5\,o_3 = 0.1 \times 0.1225 \times 0.982 = 0.012

w_{53}^{new} = w_{53} + 0.012 = 2 + 0.012 = 2.012
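A sketch of the forward pass and the output-node update in code; the grouping of the listed weights by node (3, 4, 5) follows the net computations above, since the original network figure is not reproduced here:

```python
# Two-layer example: forward pass and update of w_53.
import numpy as np

def logsig(net):
    return 1.0 / (1.0 + np.exp(-net))

x = np.array([1.0, 0.0, 1.0])        # [x1, x2, bias]
w3 = np.array([3.0, 4.0, 1.0])       # weights into hidden node 3 (assumed grouping)
w4 = np.array([6.0, 5.0, -6.0])      # weights into hidden node 4
w5 = np.array([2.0, 4.0, -3.93])     # weights into output node 5 (from o3, o4, bias)
d, eta = 1.0, 0.1

o3 = logsig(w3 @ x)                  # 0.982
o4 = logsig(w4 @ x)                  # 0.5
h = np.array([o3, o4, 1.0])
o5 = logsig(w5 @ h)                  # ~ 0.51

delta5 = (d - o5) * (1.0 - o5) * o5  # ~ 0.1225
w5 = w5 + eta * delta5 * h           # w_53: 2 + 0.1 * 0.1225 * 0.982 ~ 2.012
print(w5)
```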
Performance Optimization

Taylor Series Expansion

F(x) = F(x^*) + \left.\frac{dF(x)}{dx}\right|_{x=x^*}(x - x^*) + \frac{1}{2}\left.\frac{d^2F(x)}{dx^2}\right|_{x=x^*}(x - x^*)^2 + \cdots + \frac{1}{n!}\left.\frac{d^nF(x)}{dx^n}\right|_{x=x^*}(x - x^*)^n + \cdots

(MATLAB demo: nnd8ts)
Example

F(x) = e^{-x}

Taylor series of F(x) about x^* = 0:

F(x) = e^{-x} = e^{-0} - e^{-0}(x - 0) + \frac{1}{2}e^{-0}(x - 0)^2 - \frac{1}{6}e^{-0}(x - 0)^3 + \cdots

F(x) = 1 - x + \frac{1}{2}x^2 - \frac{1}{6}x^3 + \cdots

Taylor series approximations:

F(x) \approx F_0(x) = 1

F(x) \approx F_1(x) = 1 - x

F(x) \approx F_2(x) = 1 - x + \frac{1}{2}x^2
Plot of Approximations

(Figure: F(x) = e^{-x} plotted on [-2, 2] together with the approximations F_0(x), F_1(x), and F_2(x).)
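A short numerical check of the three approximations (a sketch; the maximum-error comparison is an addition, not taken from the slides):

```python
# Comparing F(x) = e^(-x) with its Taylor approximations about x* = 0.
import numpy as np

F  = lambda x: np.exp(-x)
F0 = lambda x: np.ones_like(x)            # zeroth order
F1 = lambda x: 1.0 - x                    # first order
F2 = lambda x: 1.0 - x + 0.5 * x ** 2     # second order

xs = np.linspace(-2.0, 2.0, 401)
for name, approx in (("F0", F0), ("F1", F1), ("F2", F2)):
    print(name, "max error on [-2, 2]:", np.max(np.abs(F(xs) - approx(xs))))
```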
Vector Case

F(\mathbf{x}) = F(x_1, x_2, \ldots, x_n)

F(\mathbf{x}) = F(\mathbf{x}^*) + \left.\frac{\partial F(\mathbf{x})}{\partial x_1}\right|_{\mathbf{x}=\mathbf{x}^*}(x_1 - x_1^*) + \left.\frac{\partial F(\mathbf{x})}{\partial x_2}\right|_{\mathbf{x}=\mathbf{x}^*}(x_2 - x_2^*) + \cdots + \left.\frac{\partial F(\mathbf{x})}{\partial x_n}\right|_{\mathbf{x}=\mathbf{x}^*}(x_n - x_n^*)

\qquad + \frac{1}{2}\left.\frac{\partial^2 F(\mathbf{x})}{\partial x_1^2}\right|_{\mathbf{x}=\mathbf{x}^*}(x_1 - x_1^*)^2 + \frac{1}{2}\left.\frac{\partial^2 F(\mathbf{x})}{\partial x_1 \partial x_2}\right|_{\mathbf{x}=\mathbf{x}^*}(x_1 - x_1^*)(x_2 - x_2^*) + \cdots
Matrix Form

F(\mathbf{x}) = F(\mathbf{x}^*) + \nabla F(\mathbf{x})^T\big|_{\mathbf{x}=\mathbf{x}^*}(\mathbf{x} - \mathbf{x}^*) + \frac{1}{2}(\mathbf{x} - \mathbf{x}^*)^T\,\nabla^2 F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*}(\mathbf{x} - \mathbf{x}^*) + \cdots

Gradient:

\nabla F(\mathbf{x}) = \begin{bmatrix} \dfrac{\partial F(\mathbf{x})}{\partial x_1} \\ \dfrac{\partial F(\mathbf{x})}{\partial x_2} \\ \vdots \\ \dfrac{\partial F(\mathbf{x})}{\partial x_n} \end{bmatrix}

Hessian:

\nabla^2 F(\mathbf{x}) = \begin{bmatrix}
\dfrac{\partial^2 F(\mathbf{x})}{\partial x_1^2} & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_1 \partial x_2} & \cdots & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_1 \partial x_n} \\
\dfrac{\partial^2 F(\mathbf{x})}{\partial x_2 \partial x_1} & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_2^2} & \cdots & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_2 \partial x_n} \\
\vdots & \vdots & & \vdots \\
\dfrac{\partial^2 F(\mathbf{x})}{\partial x_n \partial x_1} & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_n \partial x_2} & \cdots & \dfrac{\partial^2 F(\mathbf{x})}{\partial x_n^2}
\end{bmatrix}
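As a sketch of these definitions, the gradient and Hessian can be approximated by central differences (the helper names are illustrative); the test function below is the one used in the directional-derivative example that follows:

```python
# Central-difference approximations of the gradient and Hessian defined above.
import numpy as np

def numerical_gradient(F, x, h=1e-5):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (F(x + e) - F(x - e)) / (2.0 * h)   # dF/dx_i
    return g

def numerical_hessian(F, x, h=1e-4):
    H = np.zeros((x.size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        # row i: derivative of the gradient along x_i
        H[i] = (numerical_gradient(F, x + e) - numerical_gradient(F, x - e)) / (2.0 * h)
    return H

F = lambda x: x[0]**2 + 2.0*x[0]*x[1] + 2.0*x[1]**2
x0 = np.array([0.5, 0.0])
print(numerical_gradient(F, x0))   # ~ [1, 1]
print(numerical_hessian(F, x0))    # ~ [[2, 2], [2, 4]]
```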
Directional Derivatives

First derivative (slope) of F(\mathbf{x}) along the x_i axis: \partial F(\mathbf{x}) / \partial x_i (the i-th element of the gradient).

Second derivative (curvature) of F(\mathbf{x}) along the x_i axis: \partial^2 F(\mathbf{x}) / \partial x_i^2 (the (i, i) element of the Hessian).

First derivative (slope) of F(\mathbf{x}) along a vector \mathbf{p}: \dfrac{\mathbf{p}^T\,\nabla F(\mathbf{x})}{\|\mathbf{p}\|}

Second derivative (curvature) of F(\mathbf{x}) along a vector \mathbf{p}: \dfrac{\mathbf{p}^T\,\nabla^2 F(\mathbf{x})\,\mathbf{p}}{\|\mathbf{p}\|^2}
Example

F(\mathbf{x}) = x_1^2 + 2 x_1 x_2 + 2 x_2^2

\mathbf{x}^* = \begin{bmatrix} 0.5 \\ 0 \end{bmatrix}, \qquad \mathbf{p} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}

\nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}^*} = \begin{bmatrix} \partial F(\mathbf{x})/\partial x_1 \\ \partial F(\mathbf{x})/\partial x_2 \end{bmatrix}_{\mathbf{x}=\mathbf{x}^*} = \begin{bmatrix} 2x_1 + 2x_2 \\ 2x_1 + 4x_2 \end{bmatrix}_{\mathbf{x}=\mathbf{x}^*} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}

\dfrac{\mathbf{p}^T\,\nabla F(\mathbf{x})}{\|\mathbf{p}\|} = \dfrac{\begin{bmatrix} 1 & -1 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix}}{\left\|\begin{bmatrix} 1 \\ -1 \end{bmatrix}\right\|} = \dfrac{0}{\sqrt{2}} = 0
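The same slope, plus the corresponding curvature along p (the curvature value is an added check, not computed on the slide), in code:

```python
# First and second directional derivatives of F(x) = x1^2 + 2 x1 x2 + 2 x2^2 along p.
import numpy as np

grad = lambda x: np.array([2*x[0] + 2*x[1], 2*x[0] + 4*x[1]])   # gradient
H = np.array([[2.0, 2.0], [2.0, 4.0]])                          # Hessian

x_star = np.array([0.5, 0.0])
p = np.array([1.0, -1.0])

slope = p @ grad(x_star) / np.linalg.norm(p)      # p^T grad F / ||p||   -> 0.0
curvature = p @ H @ p / (p @ p)                   # p^T H p / ||p||^2    -> 1.0
print(slope, curvature)
```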
Plots

(Figure: surface and contour plots of F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 with directional derivatives along p. MATLAB demo: nnd8dd)
Minima

Strong Minimum

The point x^* is a strong minimum of F(x) if a scalar \delta > 0 exists such that F(x^*) < F(x^* + \Delta x) for all \Delta x such that \delta > \|\Delta x\| > 0.

Global Minimum

The point x^* is a unique global minimum of F(x) if F(x^*) < F(x^* + \Delta x) for all \Delta x \neq 0.

Weak Minimum

The point x^* is a weak minimum of F(x) if it is not a strong minimum, and a scalar \delta > 0 exists such that F(x^*) \le F(x^* + \Delta x) for all \Delta x such that \delta > \|\Delta x\| > 0.
Scalar Example

F(x) = 3x^4 - 7x^2 - \frac{1}{2}x + 6

(Figure: plot of F(x) on [-2, 2], marking a strong maximum, a strong minimum, and the global minimum.)
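A sketch that locates the stationary points shown in the plot by solving F'(x) = 0 and checking the sign of F''(x):

```python
# Stationary points of F(x) = 3x^4 - 7x^2 - x/2 + 6.
import numpy as np

F = lambda x: 3*x**4 - 7*x**2 - 0.5*x + 6
roots = np.roots([12.0, 0.0, -14.0, -0.5])        # roots of F'(x) = 12x^3 - 14x - 0.5
for r in sorted(roots.real):
    curvature = 36.0 * r**2 - 14.0                # F''(x) = 36x^2 - 14
    kind = "strong minimum" if curvature > 0 else "strong maximum"
    print(f"x = {r:+.3f}  F(x) = {F(r):.3f}  ({kind})")
# The minimum with the smaller F value is the global minimum.
```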
Quadratic Functions

F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} + \mathbf{d}^T \mathbf{x} + c \qquad (\text{symmetric } A)

Gradient of the quadratic function:

\nabla F(\mathbf{x}) = A\mathbf{x} + \mathbf{d}

Hessian of the quadratic function:

\nabla^2 F(\mathbf{x}) = A
• If the eigenvalues of the Hessian matrix are all positive,
the function will have a single strong minimum.
• If the eigenvalues are all negative, the function will
have a single strong maximum.
• If some eigenvalues are positive and other eigenvalues
are negative, the function will have a single saddle
point.
• If the eigenvalues are all nonnegative, but some
eigenvalues are zero, then the function will either have
a weak minimum or will have no stationary point.
• If the eigenvalues are all nonpositive, but some
eigenvalues are zero, then the function will either have
a weak maximum or will have no stationary point.
Stationary point nature summary

x^T A x          \lambda_i        Definiteness of H        Nature of x^*
> 0              > 0              Positive definite        Minimum
>= 0             >= 0             Positive semidefinite    Valley
indefinite       mixed signs      Indefinite               Saddle point
<= 0             <= 0             Negative semidefinite    Ridge
< 0              < 0              Negative definite        Maximum
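A sketch that applies this table in code: classify a symmetric Hessian by its eigenvalues (the helper name `classify_hessian` is illustrative):

```python
# Classifying the stationary point of a quadratic from the eigenvalues of A.
import numpy as np

def classify_hessian(A, tol=1e-10):
    lam = np.linalg.eigvalsh(A)                   # eigenvalues of the symmetric matrix A
    if np.all(lam > tol):
        return "positive definite -> minimum"
    if np.all(lam < -tol):
        return "negative definite -> maximum"
    if np.any(lam > tol) and np.any(lam < -tol):
        return "indefinite -> saddle point"
    if np.all(lam > -tol):
        return "positive semidefinite -> valley (weak minimum or no stationary point)"
    return "negative semidefinite -> ridge (weak maximum or no stationary point)"

print(classify_hessian(np.array([[2.0, 2.0], [2.0, 4.0]])))   # minimum
```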
Steepest Descent

F(\mathbf{x}) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_1

\mathbf{x}_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}, \qquad \alpha = 0.1

\nabla F(\mathbf{x}) = \begin{bmatrix} \partial F(\mathbf{x})/\partial x_1 \\ \partial F(\mathbf{x})/\partial x_2 \end{bmatrix} = \begin{bmatrix} 2x_1 + 2x_2 + 1 \\ 2x_1 + 4x_2 \end{bmatrix}, \qquad \mathbf{g}_0 = \nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}

\mathbf{x}_1 = \mathbf{x}_0 - \alpha\,\mathbf{g}_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} - 0.1\begin{bmatrix} 3 \\ 3 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix}

\mathbf{x}_2 = \mathbf{x}_1 - \alpha\,\mathbf{g}_1 = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix} - 0.1\begin{bmatrix} 1.8 \\ 1.2 \end{bmatrix} = \begin{bmatrix} 0.02 \\ 0.08 \end{bmatrix}

(Figure: contour plot of F(x) on [-2, 2] x [-2, 2] showing the descent trajectory.)
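The iteration is easy to reproduce in code (a sketch):

```python
# Two steepest-descent steps on F(x) = x1^2 + 2 x1 x2 + 2 x2^2 + x1.
import numpy as np

grad = lambda x: np.array([2*x[0] + 2*x[1] + 1.0, 2*x[0] + 4*x[1]])

x = np.array([0.5, 0.5])
alpha = 0.1
for k in range(2):
    x = x - alpha * grad(x)       # x_{k+1} = x_k - alpha * g_k
    print(x)                      # [0.2, 0.2] then [0.02, 0.08]
```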

If A is a symmetric matrix with eigenvalues \lambda_i, then the eigenvalues of I - \alpha A are 1 - \alpha\lambda_i.
Stable Learning Rates (Quadratic)

F(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T A \mathbf{x} + \mathbf{d}^T \mathbf{x} + c

\nabla F(\mathbf{x}) = A\mathbf{x} + \mathbf{d}

\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\,\mathbf{g}_k = \mathbf{x}_k - \alpha\,(A\mathbf{x}_k + \mathbf{d}) \quad\Rightarrow\quad \mathbf{x}_{k+1} = [I - \alpha A]\,\mathbf{x}_k - \alpha\,\mathbf{d}

Stability is determined by the eigenvalues of the matrix [I - \alpha A]:

[I - \alpha A]\,\mathbf{z}_i = \mathbf{z}_i - \alpha\,A\mathbf{z}_i = \mathbf{z}_i - \alpha\lambda_i\,\mathbf{z}_i = (1 - \alpha\lambda_i)\,\mathbf{z}_i

(\lambda_i is an eigenvalue of A and \mathbf{z}_i the corresponding eigenvector, so the eigenvalues of [I - \alpha A] are 1 - \alpha\lambda_i.)

Stability requirement:

|1 - \alpha\lambda_i| < 1 \quad\Rightarrow\quad \alpha < \frac{2}{\lambda_i} \quad\Rightarrow\quad \alpha < \frac{2}{\lambda_{max}}
Example

A = \begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix}, \qquad \lambda_1 = 0.764,\ \mathbf{z}_1 = \begin{bmatrix} 0.851 \\ -0.526 \end{bmatrix}, \qquad \lambda_2 = 5.24,\ \mathbf{z}_2 = \begin{bmatrix} 0.526 \\ 0.851 \end{bmatrix}

\alpha < \frac{2}{\lambda_{max}} = \frac{2}{5.24} = 0.38

(Figure: steepest-descent trajectories for \alpha = 0.37, which converges, and \alpha = 0.39, which diverges.)
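A sketch that checks the bound numerically; the test function F(x) = (1/2) x^T A x (with d = 0, c = 0) and the starting point are assumptions, since the slide only gives A:

```python
# Checking alpha < 2 / lambda_max for A = [[2, 2], [2, 4]] by iterating
# x_{k+1} = x_k - alpha * A x_k (steepest descent on F(x) = 1/2 x^T A x).
import numpy as np

A = np.array([[2.0, 2.0], [2.0, 4.0]])
lam_max = np.linalg.eigvalsh(A).max()
print("bound:", 2.0 / lam_max)                 # ~ 0.38

for alpha in (0.37, 0.39):
    x = np.array([1.0, 1.0])                   # arbitrary starting point (assumed)
    for _ in range(100):
        x = x - alpha * (A @ x)
    print(alpha, np.linalg.norm(x))            # 0.37 shrinks toward 0, 0.39 blows up
```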
