AML 04 Backpropagation

The document provides an overview of backpropagation and gradient descent for training neural networks. It discusses how neural networks can be represented as nested functions and how the chain rule of calculus is used to efficiently compute gradients through backpropagation. It explains that taking steps in the direction of the negative gradient can minimize the loss function during training. The document also introduces concepts like the Jacobian and Hessian matrices which are important for understanding how the gradient and learning rate change across multiple layers of a neural network.


Advanced Machine Learning

Backpropagation
Amit Sethi
Electrical Engineering, IIT Bombay
Learning objectives
• Write the derivative of a nested function using the chain rule
• Articulate how storage of partial derivatives leads to an efficient gradient descent for neural networks
• Write gradient descent as matrix operations

Overall function of a neural network
• $f(\mathbf{x}_i) = g_l(\mathbf{W}_l\, g_{l-1}(\mathbf{W}_{l-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$ (see the code sketch after this list)
• The weights of each layer form a matrix
• The outputs of the previous layer form a vector
• The activation (nonlinear) function is applied point-wise to the weights times the input
• Design questions (hyperparameters):
  – Number of layers
  – Number of neurons in each layer (rows of the weight matrices)
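To make the nested-function view concrete, here is a minimal NumPy sketch of a forward pass; the layer sizes, the tanh/identity activations, and all variable names are illustrative choices rather than anything prescribed by the slides.

```python
import numpy as np

# A minimal sketch of f(x_i) = g_l(W_l g_{l-1}( ... g_1(W_1 x_i) ... ));
# layer sizes and the tanh/identity activations are illustrative choices.
rng = np.random.default_rng(0)
layer_sizes = [4, 8, 8, 3]                       # input dim, two hidden widths, output dim
Ws = [rng.standard_normal((m, n)) * 0.1          # W_k has one row per neuron of layer k
      for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
activations = [np.tanh, np.tanh, lambda z: z]    # g_1, g_2, g_3 applied point-wise

def forward(x, Ws, activations):
    """Apply the nested function layer by layer: a_k = g_k(W_k @ a_{k-1})."""
    a = x
    for W, g in zip(Ws, activations):
        a = g(W @ a)
    return a

x_i = rng.standard_normal(layer_sizes[0])
print(forward(x_i, Ws, activations))             # the network's estimate of y_i
```

In practice each layer also adds a bias term and the input is a whole batch of samples, but the nesting structure is the same.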
Training the neural network
• Given $\mathbf{x}_i$ and $y_i$
• Think of what hyperparameters and neural network design might work
• Form a neural network:
  $f(\mathbf{x}_i) = g_l(\mathbf{W}_l\, g_{l-1}(\mathbf{W}_{l-1} \cdots g_1(\mathbf{W}_1 \mathbf{x}_i) \cdots))$
• Compute $f_{\mathbf{w}}(\mathbf{x}_i)$ as an estimate of $y_i$ for all samples
• Compute the loss:
  $\frac{1}{N}\sum_{i=1}^{N} L(f_{\mathbf{w}}(\mathbf{x}_i), y_i) = \frac{1}{N}\sum_{i=1}^{N} l_i(\mathbf{w})$
• Tweak $\mathbf{w}$ to reduce the loss (optimization algorithm)
• Repeat the last three steps
Gradient ascent
• If you didn’t know the shape of a mountain
• But at every step you knew the slope
• Can you reach the top of the mountain?
Gradient descent minimizes the loss function
• At every point, compute
  – Loss (scalar): $l_i(\mathbf{w})$
  – Gradient of the loss with respect to the weights (vector): $\nabla_{\mathbf{w}}\, l_i(\mathbf{w})$
• Take a step towards the negative gradient (see the sketch below):
  $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \frac{1}{N}\sum_{i=1}^{N} \nabla_{\mathbf{w}}\, l_i(\mathbf{w})$
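A minimal sketch of this update rule on a toy problem; the squared-error loss $l_i(\mathbf{w}) = \frac{1}{2}(\mathbf{w}^\top \mathbf{x}_i - y_i)^2$ is a stand-in chosen only because its per-sample gradient has a simple closed form, and the data, learning rate, and step count are arbitrary.

```python
import numpy as np

# Gradient descent: w <- w - eta * (1/N) * sum_i grad l_i(w), with the toy loss
# l_i(w) = 0.5 * (w . x_i - y_i)^2 so that grad l_i(w) = (w . x_i - y_i) * x_i.
rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.standard_normal((N, d))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true

w = np.zeros(d)
eta = 0.1
for step in range(200):
    grads = (X @ w - y)[:, None] * X        # one gradient per sample, shape (N, d)
    w -= eta * grads.mean(axis=0)           # average gradient over all N samples
print(w)                                    # approaches w_true
```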
Derivative of a function of a scalar

E.g. $f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

• The derivative $f'(x) = \frac{d f(x)}{dx}$ is the rate of change of $f(x)$ with $x$
• It is zero where the function is flat (horizontal), such as at the minimum or maximum of $f(x)$
• It is positive when $f(x)$ is sloping up, and negative when $f(x)$ is sloping down
• To move towards the maximum, take a small step in the direction of the derivative (a tiny numerical check follows below)
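A tiny numerical check of these bullet points, with arbitrary illustrative coefficients ($a = -1$, $b = 4$, $c = 0$, so the maximum sits at $x = 2$):

```python
# Check the bullets for f(x) = a x^2 + b x + c with a = -1, b = 4, c = 0
# (a < 0 gives a maximum at x = -b/(2a) = 2; the numbers are only for illustration).
a, b, c = -1.0, 4.0, 0.0
f = lambda x: a * x**2 + b * x + c
df = lambda x: 2 * a * x + b            # derivative

print(df(2.0))                          # 0.0: flat at the maximum
print(df(1.0), df(3.0))                 # positive before it, negative after it
x = 1.0
for _ in range(50):
    x += 0.1 * df(x)                    # small step in the direction of the derivative
print(x)                                # moves towards the maximum at x = 2
```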
Gradient of a function of a vector
• Derivative with respect to each dimension, holding other dimensions constant
• $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix}$
• At a minima or a maxima the gradient is a zero vector: the function is flat in every direction

Original image source unknown
Gradient of a function of a vector
• Gradient gives a direction for moving towards the minima
• Take a small step towards the negative of the gradient

Original image source unknown


Example of gradient
• Let $f(\mathbf{x}) = f(x_1, x_2) = 5x_1^2 + 3x_2^2$
• Then $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 10x_1 \\ 6x_2 \end{bmatrix}$
• At the location $(2, 1)$, a step in the $\begin{bmatrix} 20 \\ 6 \end{bmatrix}$ or $\begin{bmatrix} 0.958 \\ 0.287 \end{bmatrix}$ direction will lead to the maximal increase in the function (see the NumPy check below)
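A short NumPy check of this example; the step size of 0.05 and the number of iterations are arbitrary illustrative choices.

```python
import numpy as np

# Check the gradient of f(x1, x2) = 5*x1**2 + 3*x2**2 at (2, 1), then descend.
grad = lambda x: np.array([10.0 * x[0], 6.0 * x[1]])

g = grad(np.array([2.0, 1.0]))
print(g, g / np.linalg.norm(g))   # [20. 6.] and approximately [0.958 0.287]

x = np.array([2.0, 1.0])
for _ in range(100):
    x = x - 0.05 * grad(x)        # small step towards the NEGATIVE gradient
print(x)                          # approaches the minimum at (0, 0)
```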
This story is unfolding in multiple dimensions

Original image source unknown


Backpropagation
• Backpropagation is an efficient method to do gradient descent
• It saves the gradient w.r.t. the upper layer output to compute the gradient w.r.t. the weights immediately below
• It is linked to the chain rule of derivatives
• All intermediary functions must be differentiable, including the activation functions

[Figure: a fully connected network with inputs x1 … xd, hidden units h11 … h1n, and outputs y1 … yn]
Chain rule of differentiation
• Very handy for complicated functions
  – Especially functions of functions
  – E.g. NN outputs are functions of previous layers
  – For example, let $f(x) = g(h(x))$
  – Let $y = h(x)$, $z = g(y) = g(h(x))$
  – Then $f'(x) = \frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx} = g'(y)\, h'(x)$
  – For example: $\frac{d}{dx}\sin(x^2) = 2x\cos(x^2)$ (checked numerically below)
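A quick numerical check of the last example, comparing the chain-rule result against a central finite difference at an arbitrary point:

```python
import math

# Check d/dx sin(x^2) = 2x cos(x^2) against a central finite difference at x = 1.3.
x, h = 1.3, 1e-6
analytic = 2 * x * math.cos(x**2)
numeric = (math.sin((x + h)**2) - math.sin((x - h)**2)) / (2 * h)
print(analytic, numeric)    # the two values agree to several decimal places
```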
Backpropagation makes use of the chain rule of derivatives
• Chain rule: $\frac{\partial f(g(x))}{\partial x} = \frac{\partial f(g(x))}{\partial g(x)} \cdot \frac{\partial g(x)}{\partial x}$

[Figure: computation graph — x and (W1, b1) produce Z1, ReLU gives A1; A1 and (W2, b2) produce Z2, Softmax gives A2; A2 and the target give the CE loss. A sketch of this forward and backward pass follows below.]
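Below is a sketch of the forward and backward passes for this computation graph, assuming a single sample, a one-hot target, and small illustrative layer sizes; the variable names follow the diagram (Z1, A1, Z2, A2), but the code itself is only a hand-written illustration.

```python
import numpy as np

# Forward and backward pass for: x -> W1,b1 -> Z1 -> ReLU -> A1 -> W2,b2 -> Z2
# -> Softmax -> A2 -> cross-entropy loss. Single sample, one-hot target;
# the sizes (4 inputs, 5 hidden units, 3 classes) are only illustrative.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
target = np.array([0.0, 1.0, 0.0])
W1, b1 = rng.standard_normal((5, 4)) * 0.1, np.zeros(5)
W2, b2 = rng.standard_normal((3, 5)) * 0.1, np.zeros(3)

# Forward pass (save intermediate outputs for reuse in the backward pass)
Z1 = W1 @ x + b1
A1 = np.maximum(Z1, 0.0)                     # ReLU
Z2 = W2 @ A1 + b2
A2 = np.exp(Z2 - Z2.max()); A2 /= A2.sum()   # softmax
loss = -np.sum(target * np.log(A2))          # cross-entropy

# Backward pass: chain rule applied layer by layer, reusing the saved outputs
dZ2 = A2 - target                            # d(loss)/dZ2 for softmax + cross-entropy
dW2 = np.outer(dZ2, A1)
db2 = dZ2
dA1 = W2.T @ dZ2                             # gradient w.r.t. the lower layer's output
dZ1 = dA1 * (Z1 > 0)                         # ReLU derivative
dW1 = np.outer(dZ1, x)
db1 = dZ1
print(loss, dW1.shape, dW2.shape)
```

Saving Z1 and A1 during the forward pass and dA1 during the backward pass is exactly the storage of partial derivatives that makes backpropagation efficient, as described on the earlier slide.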
Vector valued functions and Jacobians
• We often deal with functions that give multiple outputs
• Let $\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(\mathbf{x}) \\ f_2(\mathbf{x}) \end{bmatrix} = \begin{bmatrix} f_1(x_1, x_2, x_3) \\ f_2(x_1, x_2, x_3) \end{bmatrix}$
• Thinking in terms of a vector of functions can make the representation less cumbersome and computations more efficient
• Then the Jacobian is
• $\mathbf{J}(\mathbf{f}) = \begin{bmatrix} \frac{\partial \mathbf{f}}{\partial x_1} & \frac{\partial \mathbf{f}}{\partial x_2} & \frac{\partial \mathbf{f}}{\partial x_3} \end{bmatrix} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \frac{\partial f_1}{\partial x_3} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \frac{\partial f_2}{\partial x_3} \end{bmatrix}$ (see the numerical check below)
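A small sketch that estimates a Jacobian by central finite differences; the particular $\mathbf{f}: \mathbb{R}^3 \to \mathbb{R}^2$ is made up purely for illustration.

```python
import numpy as np

# Finite-difference Jacobian for a made-up f: R^3 -> R^2; row i holds the
# partial derivatives of f_i with respect to x1, x2, x3.
f = lambda x: np.array([x[0] * x[1], x[1] + x[2]**2])

def jacobian(f, x, h=1e-6):
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x); e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x = np.array([1.0, 2.0, 3.0])
print(jacobian(f, x))   # analytically [[x2, x1, 0], [0, 1, 2*x3]] = [[2, 1, 0], [0, 1, 6]]
```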
Jacobian of each layer
• Compute the derivatives of a higher layer’s output with respect to those of the lower layer
• What if we scale all the weights by a factor R?
• What happens a few layers down?


Role of step size and learning rate
• Tale of two loss functions
– Same value, and
– Same gradient (first derivative), but
– Different Hessian (second derivative)
– Different step sizes needed
• Success not guaranteed
The perfect step size is impossible to guess
• Goldilocks finds the perfect balance only in a fairy tale
• The step size is decided by the learning rate $\eta$ and the gradient
Double derivative

E.g. $f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

• The double derivative $f''(x) = \frac{d^2 f(x)}{dx^2}$ is the derivative of the derivative of $f(x)$
• The double derivative is positive for convex functions (which have a single minimum), and negative for concave functions (which have a single maximum); a small numerical check follows below
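A small finite-difference check of the double derivative, with arbitrary coefficients:

```python
# Check that f''(x) = 2a for f(x) = a x^2 + b x + c (a = 3, b = 1, c = 2 chosen
# arbitrarily); the second derivative is the derivative of the derivative.
a, b, c = 3.0, 1.0, 2.0
f = lambda x: a * x**2 + b * x + c
x, h = 0.7, 1e-4
second = (f(x + h) - 2 * f(x) + f(x - h)) / h**2
print(second, 2 * a)            # both approximately 6.0
```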
Double derivative

$f(x) = ax^2 + bx + c$, $f'(x) = 2ax + b$, $f''(x) = 2a$

• The double derivative tells how far the minimum might be from a given point
• From $x = 0$ the minimum is closer for the red dashed curve than for the blue solid curve, because the former has a larger second derivative (its slope reverses faster)
Perfect step size for a paraboloid
• Let $f(x) = ax^2 + bx + c$
• Assuming $a > 0$
• The minimum is at $x^* = -\frac{b}{2a}$
• For any $x$, the perfect step would be: $-\frac{b}{2a} - x = -\frac{2ax + b}{2a} = -\frac{f'(x)}{f''(x)}$
• So, the perfect learning rate is $\eta^* = \frac{1}{f''(x)}$ (see the one-step sketch below)
• In multiple dimensions, $\mathbf{x} \leftarrow \mathbf{x} - H(f(\mathbf{x}))^{-1}\, \nabla f(\mathbf{x})$
• Practically, we do not want to compute the inverse of a Hessian matrix, so we approximate the Hessian inverse
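A one-step sketch of the perfect step $-f'(x)/f''(x)$ on a parabola with arbitrary coefficients ($a = 2$, $b = -8$, $c = 1$, minimum at $x = 2$):

```python
# The "perfect step" -f'(x)/f''(x) for f(x) = a x^2 + b x + c lands exactly on
# the minimum in one step (a = 2, b = -8, c = 1, so the minimum is at x = 2).
a, b, c = 2.0, -8.0, 1.0
df = lambda x: 2 * a * x + b
ddf = lambda x: 2 * a            # constant second derivative for a parabola

x = 10.0                         # arbitrary starting point
x = x - df(x) / ddf(x)           # equivalently, gradient descent with eta = 1/f''(x)
print(x)                         # 2.0, i.e. -b / (2a)
```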
Hessian of a function of a vector
• The double derivatives with respect to each pair of dimensions form the Hessian matrix:
  $H(f(\mathbf{x})) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix}$
• If all eigenvalues of a Hessian matrix are positive, then the function is convex

Original image source unknown
Example of Hessian
• Let $f(\mathbf{x}) = f(x_1, x_2) = 5x_1^2 + 3x_2^2 + 4x_1 x_2$
• Then $\nabla f(\mathbf{x}) = \nabla f(x_1, x_2) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 10x_1 + 4x_2 \\ 6x_2 + 4x_1 \end{bmatrix}$
• And $H(f(\mathbf{x})) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix} = \begin{bmatrix} 10 & 4 \\ 4 & 6 \end{bmatrix}$ (eigenvalues checked below)
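A quick check of the convexity criterion for this example; the eigenvalues of the Hessian above are both positive.

```python
import numpy as np

# Hessian of f(x1, x2) = 5*x1**2 + 3*x2**2 + 4*x1*x2 from the example above;
# all eigenvalues are positive, so the function is convex.
H = np.array([[10.0, 4.0],
              [4.0, 6.0]])
print(np.linalg.eigvalsh(H))    # roughly [3.53, 12.47], both positive
```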
Saddle points, Hessian and long local furrows

• Some variables may have reached a local minima while others have not
• Some weights may have almost zero gradient
• At least some eigenvalues of the Hessian may not be negative
Image source: Wikipedia
Complicated loss functions

[Figure: a complicated loss surface with annotations "Global minima?" and "Saddle point"; original image source unknown]
A realistic picture

[Figure: a realistic neural network loss landscape with local minima and local maxima marked]

Image source: https://www.cs.umd.edu/~tomg/projects/landscapes/
