AML 04 Backpropagation

The document provides an overview of backpropagation and gradient descent for training neural networks. It discusses how neural networks can be represented as nested functions and how the chain rule of calculus is used to efficiently compute gradients through backpropagation. It explains that taking steps in the direction of the negative gradient can minimize the loss function during training. The document also introduces concepts like the Jacobian and Hessian matrices which are important for understanding how the gradient and learning rate change across multiple layers of a neural network.


Advanced Machine Learning

Backpropagation
Amit Sethi
Electrical Engineering, IIT Bombay
Learning objectives
• Write the derivative of a nested function using the chain rule
• Articulate how storage of partial derivatives leads to an efficient gradient descent for neural networks
• Write gradient descent as matrix operations

Overall function of a neural network
• $f(\mathbf{x}_i) = g_l(\mathbf{W}_l\, g_{l-1}(\mathbf{W}_{l-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$ (see the code sketch after this list)
• The weights of each layer form a matrix
• The outputs of the previous layer form a vector
• The activation (nonlinear) function is applied point-wise to the weights times the input
• Design questions (hyperparameters):
  – Number of layers
  – Number of neurons in each layer (rows of the weight matrices)
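To make the nested-function view concrete, here is a minimal NumPy sketch of a forward pass; the layer sizes, the tanh/identity activations, and all variable names are illustrative choices rather than anything prescribed by the slides.

```python
import numpy as np

# A minimal sketch of f(x_i) = g_l(W_l g_{l-1}( ... g_1(W_1 x_i) ... ));
# layer sizes and the tanh/identity activations are illustrative choices.
rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]                       # input dim, two hidden widths, output dim
Ws = [rng.standard_normal((m, n)) * 0.1          # W_k has one row per neuron of layer k
      for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
activations = [np.tanh, np.tanh, lambda z: z]    # g_1, g_2, g_3 applied point-wise

def forward(x, Ws, activations):
    """Apply the nested function layer by layer: a_k = g_k(W_k @ a_{k-1})."""
    a = x
    for W, g in zip(Ws, activations):
        a = g(W @ a)
    return a

x_i = rng.standard_normal(layer_sizes[0])
print(forward(x_i, Ws, activations))             # the network's estimate of y_i
```

In practice each layer also adds a bias term and the input is a whole batch of samples, but the nesting structure is the same.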
Training the neural network
• Given $\mathbf{x}_i$ and $y_i$
• Think of what hyperparameters and neural network design might work
• Form a neural network:
  $f(\mathbf{x}_i) = g_l(\mathbf{W}_l\, g_{l-1}(\mathbf{W}_{l-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
• Compute $f_{\mathbf{w}}(\mathbf{x}_i)$ as an estimate of $y_i$ for all samples
• Compute the loss:
  $\frac{1}{N}\sum_{i=1}^{N} L(f_{\mathbf{w}}(\mathbf{x}_i), y_i) = \frac{1}{N}\sum_{i=1}^{N} l_i(\mathbf{w})$
• Tweak $\mathbf{w}$ to reduce the loss (optimization algorithm)
• Repeat the last three steps
Gradient ascent
• If you didn’t know the shape of a mountain
• But at every step you knew the slope
• Can you reach the top of the mountain?
Gradient descent minimizes the loss function
• At every point, compute
  – Loss (scalar): $l_i(\mathbf{w})$
  – Gradient of the loss with respect to the weights (vector): $\nabla_{\mathbf{w}}\, l_i(\mathbf{w})$
• Take a step towards the negative gradient (see the sketch below):
  $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \frac{1}{N}\sum_{i=1}^{N} \nabla_{\mathbf{w}}\, l_i(\mathbf{w})$
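A minimal sketch of this update rule on a toy problem; the squared-error loss $l_i(\mathbf{w}) = \frac{1}{2}(\mathbf{w}^\top \mathbf{x}_i - y_i)^2$ is a stand-in chosen only because its per-sample gradient has a simple closed form, and the data, learning rate, and step count are arbitrary.

```python
import numpy as np

# Gradient descent: w <- w - eta * (1/N) * sum_i grad l_i(w), with the toy loss
# l_i(w) = 0.5 * (w . x_i - y_i)^2 so that grad l_i(w) = (w . x_i - y_i) * x_i.
rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.standard_normal((N, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

w = np.zeros(d)
eta = 0.1
for step in range(200):
    grads = (X @ w - y)[:, None] * X        # one gradient per sample, shape (N, d)
    w -= eta * grads.mean(axis=0)           # average gradient over all N samples
print(w)                                    # approaches w_true
```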
Derivative of a function of a scalar

E.g. $f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

• The derivative $f'(x) = \frac{d f(x)}{dx}$ is the rate of change of $f(x)$ with $x$
• It is zero where the function is flat (horizontal), such as at the minimum or maximum of $f(x)$
• It is positive when $f(x)$ is sloping up, and negative when $f(x)$ is sloping down
• To move towards the maximum, take a small step in the direction of the derivative (a tiny numerical check follows below)
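A tiny numerical check of these bullet points, with arbitrary illustrative coefficients ($a = -1$, $b = 4$, $c = 0$, so the maximum sits at $x = 2$):

```python
# Check the bullets for f(x) = a x^2 + b x + c with a = -1, b = 4, c = 0
# (a < 0 gives a maximum at x = -b/(2a) = 2; the numbers are only for illustration).
a, b, c = -1.0, 4.0, 0.0
f = lambda x: a * x**2 + b * x + c
df = lambda x: 2 * a * x + b            # derivative

print(df(2.0))                          # 0.0: flat at the maximum
print(df(1.0), df(3.0))                 # positive before it, negative after it
x = 1.0
for _ in range(50):
    x += 0.1 * df(x)                    # small step in the direction of the derivative
print(x)                                # moves towards the maximum at x = 2
```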
Gradient of a function of a vector
• Derivative with respect to each dimension, holding other dimensions constant
• $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix}$
• At a minima or a maxima the gradient is a zero vector: the function is flat in every direction

Original image source unknown
Gradient of a function of a vector
• Gradient gives a direction for moving towards the minima
• Take a small step towards the negative of the gradient

Original image source unknown


Example of gradient
• Let $f(\mathbf{x}) = f(x_1, x_2) = 5x_1^2 + 3x_2^2$
• Then $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 10x_1 \\ 6x_2 \end{bmatrix}$
• At the location $(2, 1)$, a step in the $\begin{bmatrix} 20 \\ 6 \end{bmatrix}$ or $\begin{bmatrix} 0.958 \\ 0.287 \end{bmatrix}$ direction will lead to the maximal increase in the function (see the NumPy check below)
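A short NumPy check of this example; the step size of 0.05 and the number of iterations are arbitrary illustrative choices.

```python
import numpy as np

# Check the gradient of f(x1, x2) = 5*x1**2 + 3*x2**2 at (2, 1), then descend.
grad = lambda x: np.array([10.0 * x[0], 6.0 * x[1]])

g = grad(np.array([2.0, 1.0]))
print(g, g / np.linalg.norm(g))   # [20. 6.] and approximately [0.958 0.287]

x = np.array([2.0, 1.0])
for _ in range(100):
    x = x - 0.05 * grad(x)        # small step towards the NEGATIVE gradient
print(x)                          # approaches the minimum at (0, 0)
```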
This story is unfolding in multiple dimensions

Original image source unknown


Backpropagation
• Backpropagation is an efficient method to do gradient descent
• It saves the gradient w.r.t. the upper layer output to compute the gradient w.r.t. the weights immediately below
• It is linked to the chain rule of derivatives
• All intermediary functions must be differentiable, including the activation functions

[Figure: a fully connected network with inputs x1 … xd, hidden units h11 … h1n, and outputs y1 … yn]
Chain rule of differentiation
• Very handy for complicated functions
  – Especially functions of functions
  – E.g. NN outputs are functions of previous layers
  – For example, let $f(x) = g(h(x))$
  – Let $y = h(x)$, $z = g(y) = g(h(x))$
  – Then $f'(x) = \frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx} = g'(y)\, h'(x)$
  – For example: $\frac{d}{dx}\sin(x^2) = 2x\cos(x^2)$ (checked numerically below)
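A quick numerical check of the last example, comparing the chain-rule result against a central finite difference at an arbitrary point:

```python
import math

# Check d/dx sin(x^2) = 2x cos(x^2) against a central finite difference at x = 1.3.
x, h = 1.3, 1e-6
analytic = 2 * x * math.cos(x**2)
numeric = (math.sin((x + h)**2) - math.sin((x - h)**2)) / (2 * h)
print(analytic, numeric)    # the two values agree to several decimal places
```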
Backpropagation makes use of the chain rule of derivatives
• Chain rule: $\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \cdot \frac{\partial g(x)}{\partial x}$

[Figure: computation graph — x and (W1, b1) produce Z1, ReLU gives A1; A1 and (W2, b2) produce Z2, Softmax gives A2; A2 and the target give the CE loss. A sketch of this forward and backward pass follows below.]
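Below is a sketch of the forward and backward passes for this computation graph, assuming a single sample, a one-hot target, and small illustrative layer sizes; the variable names follow the diagram (Z1, A1, Z2, A2), but the code itself is only a hand-written illustration.

```python
import numpy as np

# Forward and backward pass for: x -> W1,b1 -> Z1 -> ReLU -> A1 -> W2,b2 -> Z2
# -> Softmax -> A2 -> cross-entropy loss. Single sample, one-hot target;
# the sizes (4 inputs, 5 hidden units, 3 classes) are only illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
target = np.array([0.0, 1.0, 0.0])
W1, b1 = rng.standard_normal((5, 4)) * 0.1, np.zeros(5)
W2, b2 = rng.standard_normal((3, 5)) * 0.1, np.zeros(3)

# Forward pass (save intermediate outputs for reuse in the backward pass)
Z1 = W1 @ x + b1
A1 = np.maximum(Z1, 0.0)                     # ReLU
Z2 = W2 @ A1 + b2
A2 = np.exp(Z2 - Z2.max()); A2 /= A2.sum()   # softmax
loss = -np.sum(target * np.log(A2))          # cross-entropy

# Backward pass: chain rule applied layer by layer, reusing the saved outputs
dZ2 = A2 - target                            # d(loss)/dZ2 for softmax + cross-entropy
dW2 = np.outer(dZ2, A1)
db2 = dZ2
dA1 = W2.T @ dZ2                             # gradient w.r.t. the lower layer's output
dZ1 = dA1 * (Z1 > 0)                         # ReLU derivative
dW1 = np.outer(dZ1, x)
db1 = dZ1
print(loss, dW1.shape, dW2.shape)
```

Saving Z1 and A1 during the forward pass and dA1 during the backward pass is exactly the storage of partial derivatives that makes backpropagation efficient, as described on the earlier slide.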
Vector valued functions and Jacobians
• We often deal with functions that give multiple outputs
• Let $\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} f_1(x_1, x_2, x_3) \\ f_2(x_1, x_2, x_3) \end{bmatrix}$
• Thinking in terms of a vector of functions can make the representation less cumbersome and computations more efficient
• Then the Jacobian is
• $\mathbf{J}(\mathbf{f}) = \begin{bmatrix} \frac{\partial \mathbf{f}}{\partial x_1} & \frac{\partial \mathbf{f}}{\partial x_2} & \frac{\partial \mathbf{f}}{\partial x_3} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \frac{\partial f_1}{\partial x_3} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \frac{\partial f_2}{\partial x_3} \end{bmatrix}$ (see the numerical check below)
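A small sketch that estimates a Jacobian by central finite differences; the particular $\mathbf{f}: \mathbb{R}^3 \to \mathbb{R}^2$ is made up purely for illustration.

```python
import numpy as np

# Finite-difference Jacobian for a made-up f: R^3 -> R^2; row i holds the
# partial derivatives of f_i with respect to x1, x2, x3.
f = lambda x: np.array([x[0] * x[1], x[1] + x[2]**2])

def jacobian(f, x, h=1e-6):
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([1.0, 2.0, 3.0])
print(jacobian(f, x))   # analytically [[x2, x1, 0], [0, 1, 2*x3]] = [[2, 1, 0], [0, 1, 6]]
```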
Jacobian of each layer
• Compute the derivatives of a higher layer’s output with respect to those of the lower layer
• What if we scale all the weights by a factor R?
• What happens a few layers down?


Role of step size and learning rate
• Tale of two loss functions
– Same value, and
– Same gradient (first derivative), but
– Different Hessian (second derivative)
– Different step sizes needed
• Success not guaranteed
The perfect step size is impossible to guess
• Goldilocks finds the perfect balance only in a fairy tale
• The step size is decided by the learning rate $\eta$ and the gradient
Double derivative

E.g. $f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

• The double derivative $f''(x) = \frac{d^2 f(x)}{dx^2}$ is the derivative of the derivative of $f(x)$
• The double derivative is positive for convex functions (which have a single minimum), and negative for concave functions (which have a single maximum); a small numerical check follows below
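A small finite-difference check of the double derivative, with arbitrary coefficients:

```python
# Check that f''(x) = 2a for f(x) = a x^2 + b x + c (a = 3, b = 1, c = 2 chosen
# arbitrarily); the second derivative is the derivative of the derivative.
a, b, c = 3.0, 1.0, 2.0
f = lambda x: a * x**2 + b * x + c
x, h = 0.7, 1e-4
second = (f(x + h) - 2 * f(x) + f(x - h)) / h**2
print(second, 2 * a)            # both approximately 6.0
```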
Double derivative

$f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

• The double derivative tells how far the minimum might be from a given point
• From $x = 0$ the minimum is closer for the red dashed curve than for the blue solid curve, because the former has a larger second derivative (its slope reverses faster)
Perfect step size for a paraboloid
• Let $f(x) = ax^2 + bx + c$
• Assuming $a > 0$
• The minimum is at $x^* = -\frac{b}{2a}$
• For any $x$, the perfect step would be: $-\frac{b}{2a} - x = -\frac{2ax + b}{2a} = -\frac{f'(x)}{f''(x)}$
• So, the perfect learning rate is $\eta^* = \frac{1}{f''(x)}$ (see the one-step sketch below)
• In multiple dimensions, $\mathbf{x} \leftarrow \mathbf{x} - H(f(\mathbf{x}))^{-1}\, \nabla f(\mathbf{x})$
• Practically, we do not want to compute the inverse of a Hessian matrix, so we approximate the Hessian inverse
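A one-step sketch of the perfect step $-f'(x)/f''(x)$ on a parabola with arbitrary coefficients ($a = 2$, $b = -8$, $c = 1$, minimum at $x = 2$):

```python
# The "perfect step" -f'(x)/f''(x) for f(x) = a x^2 + b x + c lands exactly on
# the minimum in one step (a = 2, b = -8, c = 1, so the minimum is at x = 2).
a, b, c = 2.0, -8.0, 1.0
df = lambda x: 2 * a * x + b
ddf = lambda x: 2 * a            # constant second derivative for a parabola

x = 10.0                         # arbitrary starting point
x = x - df(x) / ddf(x)           # equivalently, gradient descent with eta = 1/f''(x)
print(x)                         # 2.0, i.e. -b / (2a)
```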
Hessian of a function of a vector
• The double derivatives with respect to each pair of dimensions form the Hessian matrix:
  $H(f(\mathbf{x})) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix}$
• If all eigenvalues of a Hessian matrix are positive, then the function is convex

Original image source unknown
Example of Hessian
• Let $f(\mathbf{x}) = f(x_1, x_2) = 5x_1^2 + 3x_2^2 + 4x_1 x_2$
• Then $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 10x_1 + 4x_2 \\ 6x_2 + 4x_1 \end{bmatrix}$
• And $H(f(\mathbf{x})) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 10 & 4 \\ 4 & 6 \end{bmatrix}$ (eigenvalues checked below)
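A quick check of the convexity criterion for this example; the eigenvalues of the Hessian above are both positive.

```python
import numpy as np

# Hessian of f(x1, x2) = 5*x1**2 + 3*x2**2 + 4*x1*x2 from the example above;
# all eigenvalues are positive, so the function is convex.
H = np.array([[10.0, 4.0],
              [4.0, 6.0]])
print(np.linalg.eigvalsh(H))    # roughly [3.53, 12.47], both positive
```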
Saddle points, Hessian and long local furrows

• Some variables may have reached a local minima while others have not
• Some weights may have almost zero gradient
• At least some eigenvalues of the Hessian may not be negative
Image source: Wikipedia
Complicated loss functions

[Figure: a complicated loss surface with annotations "Global minima?" and "Saddle point"; original image source unknown]
A realistic picture

[Figure: a realistic neural network loss landscape with local minima and local maxima marked]

Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/
