Gradient Notes
Kevin Clark
1 Introduction
The purpose of these notes is to demonstrate how to quickly compute neural
network gradients in a completely vectorized way. It is complementary to the
last part of lecture 3 in CS224n 2019, which goes over the same material.
2 Vectorized Gradients
While it is a good exercise to compute the gradient of a neural network with respect to a single parameter (e.g., a single element in a weight matrix), in practice this tends to be quite slow. Instead, it is more efficient to keep everything in matrix/vector form. The basic building block of vectorized gradients is the Jacobian matrix. Suppose we have a function $f: \mathbb{R}^n \to \mathbb{R}^m$ that maps a vector of length $n$ to a vector of length $m$: $f(x) = [f_1(x_1, \dots, x_n), f_2(x_1, \dots, x_n), \dots, f_m(x_1, \dots, x_n)]$.
Then its Jacobian is the following m × n matrix:
$$\frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

That is, $\left(\frac{\partial f}{\partial x}\right)_{ij} = \frac{\partial f_i}{\partial x_j}$, which is just a standard (non-vector) derivative.
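A Jacobian like this is also easy to approximate numerically, which is handy for sanity-checking hand-derived gradients. Below is a minimal finite-difference sketch in NumPy (not part of the original notes); the helper name numerical_jacobian is just a label chosen for these examples, and it is reused in the later checks.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Approximate the m x n Jacobian of func: R^n -> R^m at x using central differences."""
    x = np.asarray(x, dtype=float)
    y = np.atleast_1d(func(x))
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (np.atleast_1d(func(x + dx)) - np.atleast_1d(func(x - dx))) / (2 * eps)
    return J
```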
As a little illustration of this, suppose we have a function f (x) = [f1 (x), f2 (x)]
taking a scalar to a vector of size 2 and a function g(y) = [g1 (y1 , y2 ), g2 (y1 , y2 )]
taking a vector of size two to a vector of size two. Now let’s compose them to
get g(x) = [g1 (f1 (x), f2 (x)), g2 (f1 (x), f2 (x))]. Using the regular chain rule, we
can compute the derivative of g as the Jacobian
$$\frac{\partial g}{\partial x} = \begin{bmatrix} \frac{\partial}{\partial x} g_1(f_1(x), f_2(x)) \\ \frac{\partial}{\partial x} g_2(f_1(x), f_2(x)) \end{bmatrix} = \begin{bmatrix} \frac{\partial g_1}{\partial f_1}\frac{\partial f_1}{\partial x} + \frac{\partial g_1}{\partial f_2}\frac{\partial f_2}{\partial x} \\ \frac{\partial g_2}{\partial f_1}\frac{\partial f_1}{\partial x} + \frac{\partial g_2}{\partial f_2}\frac{\partial f_2}{\partial x} \end{bmatrix}$$
And we see this is the same as multiplying the two Jacobians:

$$\frac{\partial g}{\partial x} = \frac{\partial g}{\partial f}\frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial g_1}{\partial f_1} & \frac{\partial g_1}{\partial f_2} \\ \frac{\partial g_2}{\partial f_1} & \frac{\partial g_2}{\partial f_2} \end{bmatrix} \begin{bmatrix} \frac{\partial f_1}{\partial x} \\ \frac{\partial f_2}{\partial x} \end{bmatrix}$$
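As a quick numerical check of this chain rule (again a sketch, not from the original notes), we can compare the Jacobian of the composition against the product of the two Jacobians, using the numerical_jacobian helper above and arbitrary smooth choices for f and g with the shapes used in the illustration:

```python
import numpy as np

# Arbitrary f: R -> R^2 and g: R^2 -> R^2 matching the shapes in the illustration
f = lambda x: np.array([np.sin(x[0]), x[0] ** 2])
g = lambda y: np.array([y[0] * y[1], y[0] + y[1]])

x = np.array([0.7])
J_composed = numerical_jacobian(lambda v: g(f(v)), x)               # Jacobian of g(f(x))
J_product = numerical_jacobian(g, f(x)) @ numerical_jacobian(f, x)  # (dg/df)(df/dx)
print(np.allclose(J_composed, J_product))                           # True (up to numerical error)
```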
3 Useful Identities
This section will now go over how to compute the Jacobian for several simple functions. It will provide some useful identities you can apply when taking neural network gradients.
(1) Matrix times column vector with respect to the column vector
($z = Wx$, what is $\frac{\partial z}{\partial x}$?)

Suppose $W \in \mathbb{R}^{n \times m}$, so $z$ is a function taking an $m$-dimensional vector to an $n$-dimensional vector, with $z_i = \sum_{k=1}^m W_{ik} x_k$. An entry $\left(\frac{\partial z}{\partial x}\right)_{ij}$ of the Jacobian will be

$$\left(\frac{\partial z}{\partial x}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j}\sum_{k=1}^m W_{ik} x_k = \sum_{k=1}^m W_{ik}\frac{\partial}{\partial x_j} x_k = W_{ij}$$

because $\frac{\partial}{\partial x_j} x_k = 1$ if $k = j$ and 0 otherwise. So we see that $\frac{\partial z}{\partial x} = W$.
(2) Row vector times matrix with respect to the row vector
($z = xW$, what is $\frac{\partial z}{\partial x}$?)

A computation similar to (1) shows that $\frac{\partial z}{\partial x} = W^T$.
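Here is a quick numerical check of identities (1) and (2), reusing the numerical_jacobian helper from the sketch in Section 2; the sizes n = 3 and m = 4 are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))      # W in R^{n x m} with n = 3, m = 4

# Identity (1): the Jacobian of z = Wx with respect to the column vector x is W.
x_col = rng.standard_normal(4)
print(np.allclose(numerical_jacobian(lambda v: W @ v, x_col), W))     # True

# Identity (2): the Jacobian of z = xW with respect to the row vector x is W^T.
x_row = rng.standard_normal(3)
print(np.allclose(numerical_jacobian(lambda v: v @ W, x_row), W.T))   # True
```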
(3) A vector with itself
($z = x$, what is $\frac{\partial z}{\partial x}$?)

We have $z_i = x_i$. So

$$\left(\frac{\partial z}{\partial x}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} x_i = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

So we see that the Jacobian $\frac{\partial z}{\partial x}$ is a diagonal matrix where the entry at $(i, i)$ is 1. This is just the identity matrix: $\frac{\partial z}{\partial x} = I$. When applying the chain rule, this term will disappear because a matrix or vector multiplied by the identity matrix does not change.
(4) An elementwise function applied to a vector

($z = f(x)$, what is $\frac{\partial z}{\partial x}$?)

Since $f$ is being applied elementwise, we have $z_i = f(x_i)$. So

$$\left(\frac{\partial z}{\partial x}\right)_{ij} = \frac{\partial z_i}{\partial x_j} = \frac{\partial}{\partial x_j} f(x_i) = \begin{cases} f'(x_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

So we see that the Jacobian $\frac{\partial z}{\partial x}$ is a diagonal matrix where the entry at $(i, i)$ is the derivative of $f$ applied to $x_i$. We can write this as $\frac{\partial z}{\partial x} = \text{diag}(f'(x))$. Since multiplication by a diagonal matrix is the same as doing elementwise multiplication by the diagonal, we could also write $\circ f'(x)$ when applying the chain rule.
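As a sanity check of identity (4) (again a sketch reusing the numerical_jacobian helper above), take f = tanh, whose elementwise derivative is 1 − tanh(x)²:

```python
import numpy as np

# Identity (4): for elementwise f, the Jacobian of z = f(x) is diag(f'(x)).
x = np.array([0.5, -1.0, 2.0])
J_numeric = numerical_jacobian(np.tanh, x)    # finite-difference Jacobian
J_identity = np.diag(1 - np.tanh(x) ** 2)     # diag(f'(x)) for f = tanh
print(np.allclose(J_numeric, J_identity))     # True
```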
(5) Matrix times column vector with respect to the matrix

($z = Wx$, $\delta = \frac{\partial J}{\partial z}$, what is $\frac{\partial J}{\partial W} = \delta \frac{\partial z}{\partial W}$?)

This is a bit more complicated than the other identities. The reason for including $\frac{\partial J}{\partial z}$ in the above problem formulation will become clear in a moment.
First suppose we have a loss function $J$ (a scalar) and are computing its gradient with respect to a matrix $W \in \mathbb{R}^{n \times m}$. Then we could think of $J$ as a function of $W$ taking $nm$ inputs (the entries of $W$) to a single output ($J$). This means the Jacobian $\frac{\partial J}{\partial W}$ would be a $1 \times nm$ vector. But in practice this is not a very useful way of arranging the gradient. It would be much nicer if the derivatives were in an $n \times m$ matrix like this:

$$\frac{\partial J}{\partial W} = \begin{bmatrix} \frac{\partial J}{\partial W_{11}} & \cdots & \frac{\partial J}{\partial W_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial J}{\partial W_{n1}} & \cdots & \frac{\partial J}{\partial W_{nm}} \end{bmatrix}$$
Since this matrix has the same shape as $W$, we could just subtract it (times the learning rate) from $W$ when doing gradient descent. So (in a slight abuse of notation) let's find this matrix as $\frac{\partial J}{\partial W}$ instead.

This way of arranging the gradients becomes complicated when computing $\frac{\partial z}{\partial W}$. Unlike $J$, $z$ is a vector. So if we are trying to rearrange the gradients like with $\frac{\partial J}{\partial W}$, $\frac{\partial z}{\partial W}$ would be an $n \times m \times n$ tensor! Luckily, we can avoid the issue by taking the gradient with respect to a single weight $W_{ij}$ instead. $\frac{\partial z}{\partial W_{ij}}$ is just a vector, which is much easier to deal with. We have

$$z_k = \sum_{l=1}^m W_{kl} x_l$$

$$\frac{\partial z_k}{\partial W_{ij}} = \sum_{l=1}^m x_l \frac{\partial}{\partial W_{ij}} W_{kl}$$
Note that $\frac{\partial}{\partial W_{ij}} W_{kl} = 1$ if $i = k$ and $j = l$ and 0 otherwise. So if $k \neq i$, everything in the sum is zero and the gradient is zero. Otherwise, the only nonzero element of the sum is when $l = j$, so we just get $x_j$. Thus we find $\frac{\partial z_k}{\partial W_{ij}} = x_j$ if $k = i$ and 0 otherwise. Another way of writing this is

$$\frac{\partial z}{\partial W_{ij}} = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ x_j \\ 0 \\ \vdots \\ 0 \end{bmatrix} \leftarrow i\text{th element}$$
Now let's compute $\frac{\partial J}{\partial W_{ij}}$:

$$\frac{\partial J}{\partial W_{ij}} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial W_{ij}} = \delta\frac{\partial z}{\partial W_{ij}} = \sum_{k=1}^n \delta_k \frac{\partial z_k}{\partial W_{ij}} = \delta_i x_j$$

(the only nonzero term in the sum is $\delta_i \frac{\partial z_i}{\partial W_{ij}}$). To get $\frac{\partial J}{\partial W}$ we want a matrix where entry $(i, j)$ is $\delta_i x_j$. This matrix is equal to the outer product

$$\frac{\partial J}{\partial W} = \delta^T x^T$$
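To check identity (5) numerically, here is a sketch reusing the numerical_jacobian helper above; the toy loss $J = \frac{1}{2}\|z\|^2$ is an arbitrary choice for which $\delta = z^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)

def loss(w_flat):
    """Toy scalar loss J = 0.5 * ||Wx||^2, viewed as a function of the entries of W."""
    z = w_flat.reshape(3, 4) @ x
    return 0.5 * np.sum(z ** 2)

delta = W @ x                                   # dJ/dz for this toy loss
analytic = np.outer(delta, x)                   # delta^T x^T, entries delta_i * x_j
numeric = numerical_jacobian(lambda w: np.array([loss(w)]), W.ravel()).reshape(3, 4)
print(np.allclose(analytic, numeric))           # True
```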
(6) Row vector times matrix with respect to the matrix

($z = xW$, $\delta = \frac{\partial J}{\partial z}$, what is $\frac{\partial J}{\partial W} = \delta \frac{\partial z}{\partial W}$?)

A computation similar to (5) shows that $\frac{\partial J}{\partial W} = x^T \delta$.
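And the analogous check for identity (6), with the same arbitrary toy loss so that $\delta = z$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(3)                      # row vector
W = rng.standard_normal((3, 4))

delta = x @ W                                   # dJ/dz for J = 0.5 * ||xW||^2
analytic = np.outer(x, delta)                   # x^T delta
numeric = numerical_jacobian(
    lambda w: np.array([0.5 * np.sum((x @ w.reshape(3, 4)) ** 2)]), W.ravel()
).reshape(3, 4)
print(np.allclose(analytic, numeric))           # True
```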
(7) Cross-entropy loss with respect to logits

($\hat{y} = \text{softmax}(\theta)$, $J = CE(y, \hat{y})$, what is $\frac{\partial J}{\partial \theta}$?)

The gradient is $\frac{\partial J}{\partial \theta} = \hat{y} - y$ (or $(\hat{y} - y)^T$ if $y$ is a column vector).
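This identity is also easy to verify numerically (a sketch reusing the numerical_jacobian helper; the logits and one-hot label below are arbitrary):

```python
import numpy as np

def softmax(theta):
    e = np.exp(theta - np.max(theta))           # shift for numerical stability
    return e / np.sum(e)

theta = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])                   # one-hot label

cross_entropy = lambda t: np.array([-np.sum(y * np.log(softmax(t)))])
numeric = numerical_jacobian(cross_entropy, theta).ravel()
print(np.allclose(numeric, softmax(theta) - y))  # True: dJ/dtheta = y_hat - y
```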
These identities will be enough to let you quickly compute the gradients for many
neural networks. However, it’s important to know how to compute Jacobians
for other functions as well in case they show up. Some examples if you want
practice: dot product of two vectors, elementwise product of two vectors, 2-norm
of a vector. Feel free to use these identities in the assignments. One option is
just to memorize them. Another option is to figure them out by looking at the
dimensions. For example, only one ordering/orientation of $\delta$ and $x$ will produce the correct shape for $\frac{\partial J}{\partial W}$ (assuming $W$ is not square).
4 Gradient Layout
Jacobian formulation is great for applying the chain rule: you just have to multiply the Jacobians. However, when doing SGD it's more convenient to follow the convention "the shape of the gradient equals the shape of the parameter" (as we did when computing $\frac{\partial J}{\partial W}$). That way subtracting the gradient times the learning rate from the parameters is easy. We expect answers to homework questions to follow this convention. Therefore if you compute the gradient of a column vector using Jacobian formulation, you should take the transpose when reporting your final answer so the gradient is a column vector. Another option is to always follow the convention. In this case the identities may not work, but you can still figure out the answer by making sure the dimensions of your derivatives match up. Up to you which of these options you choose!
5 Example: 1-Layer Neural Network

This section provides an example of computing the gradients of a full neural network: a one-layer neural network trained with cross-entropy loss. The forward pass of the model is as follows:

$$z = Wx + b_1$$
$$h = \text{ReLU}(z)$$
$$\theta = Uh + b_2$$
$$\hat{y} = \text{softmax}(\theta)$$
$$J = CE(y, \hat{y})$$

Here $x$ is the input, $h \in \mathbb{R}^{D_h}$ is the hidden layer (of size $D_h$), and $\theta, \hat{y} \in \mathbb{R}^{N_c}$, where $N_c$ is the number of classes.

In this example, we will compute all of the network's gradients:

$$\frac{\partial J}{\partial U} \quad \frac{\partial J}{\partial b_2} \quad \frac{\partial J}{\partial W} \quad \frac{\partial J}{\partial b_1} \quad \frac{\partial J}{\partial x}$$
To start with, recall that $\text{ReLU}(x) = \max(x, 0)$. This means

$$\text{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} = \text{sgn}(\text{ReLU}(x))$$
where sgn is the signum function. Note that we are able to write the derivative
of the activation in terms of the activation itself.
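A one-line check of this fact (a sketch; the inputs are arbitrary and avoid the non-differentiable point x = 0):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.3, 1.7])
relu_grad = (x > 0).astype(float)                              # ReLU'(x)
print(np.array_equal(relu_grad, np.sign(np.maximum(x, 0))))    # True: ReLU'(x) = sgn(ReLU(x))
```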
Now let's write out the chain rule for $\frac{\partial J}{\partial U}$ and $\frac{\partial J}{\partial b_2}$:

$$\frac{\partial J}{\partial U} = \frac{\partial J}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial \theta}\frac{\partial \theta}{\partial U}$$

$$\frac{\partial J}{\partial b_2} = \frac{\partial J}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial \theta}\frac{\partial \theta}{\partial b_2}$$

Notice that $\frac{\partial J}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial \theta} = \frac{\partial J}{\partial \theta}$ is present in both gradients. This makes the math a bit cumbersome. Even worse, if we're implementing the model without automatic differentiation, computing $\frac{\partial J}{\partial \theta}$ twice will be inefficient. So it will help us to define some variables to represent the intermediate derivatives:

$$\delta_1 = \frac{\partial J}{\partial \theta} \qquad \delta_2 = \frac{\partial J}{\partial z}$$
These can be thought of as the error signals passed down to $\theta$ and $z$ when doing backpropagation. We can compute them as follows:

$$\delta_1 = \frac{\partial J}{\partial \theta} = (\hat{y} - y)^T \qquad \text{this is just identity (7)}$$

$$\begin{aligned}
\delta_2 = \frac{\partial J}{\partial z} &= \frac{\partial J}{\partial \theta}\frac{\partial \theta}{\partial h}\frac{\partial h}{\partial z} && \text{using the chain rule} \\
&= \delta_1 \frac{\partial \theta}{\partial h}\frac{\partial h}{\partial z} && \text{substituting in } \delta_1 \\
&= \delta_1 U \frac{\partial h}{\partial z} && \text{using identity (1)} \\
&= \delta_1 U \circ \text{ReLU}'(z) && \text{using identity (4)} \\
&= \delta_1 U \circ \text{sgn}(h) && \text{we computed this earlier}
\end{aligned}$$
A good way of checking our work is by looking at the dimensions of the Jacobians:

$$\underbrace{\frac{\partial J}{\partial z}}_{(1 \times D_h)} = \underbrace{\delta_1}_{(1 \times N_c)}\ \underbrace{U}_{(N_c \times D_h)} \circ \underbrace{\text{sgn}(h)}_{(D_h)}$$
We see that the dimensions of all the terms in the gradient match up (i.e., the
number of columns in a term equals the number of rows in the next term). This
will always be the case if we computed our gradients correctly.
Now we can use the error terms to compute our gradients. Note that we transpose our answers when computing the gradients for column vectors to follow the shape convention.

$$\frac{\partial J}{\partial U} = \frac{\partial J}{\partial \theta}\frac{\partial \theta}{\partial U} = \delta_1 \frac{\partial \theta}{\partial U} = \delta_1^T h^T \qquad \text{using identity (5)}$$

$$\frac{\partial J}{\partial b_2} = \frac{\partial J}{\partial \theta}\frac{\partial \theta}{\partial b_2} = \delta_1 \frac{\partial \theta}{\partial b_2} = \delta_1^T \qquad \text{using identity (3) and transposing}$$

$$\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial W} = \delta_2 \frac{\partial z}{\partial W} = \delta_2^T x^T \qquad \text{using identity (5)}$$

$$\frac{\partial J}{\partial b_1} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial b_1} = \delta_2 \frac{\partial z}{\partial b_1} = \delta_2^T \qquad \text{using identity (3) and transposing}$$

$$\frac{\partial J}{\partial x} = \frac{\partial J}{\partial z}\frac{\partial z}{\partial x} = (\delta_2 W)^T \qquad \text{using identity (1) and transposing}$$
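Putting it all together, here is a minimal NumPy sketch (not part of the original notes) that runs the forward pass and computes all five gradients using the error signals $\delta_1$ and $\delta_2$; the dimensions and random initialization are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
Dx, Dh, Nc = 5, 4, 3
x = rng.standard_normal((Dx, 1))
W, b1 = rng.standard_normal((Dh, Dx)), rng.standard_normal((Dh, 1))
U, b2 = rng.standard_normal((Nc, Dh)), rng.standard_normal((Nc, 1))
y = np.zeros((Nc, 1)); y[1] = 1.0                 # one-hot label

# Forward pass
z = W @ x + b1
h = np.maximum(z, 0)                              # ReLU
theta = U @ h + b2
y_hat = np.exp(theta - theta.max()); y_hat /= y_hat.sum()   # softmax
J = -float(np.sum(y * np.log(y_hat)))             # cross-entropy loss

# Backward pass with the error signals (kept as row vectors, as in the notes)
delta1 = (y_hat - y).T                            # dJ/dtheta, identity (7)
delta2 = (delta1 @ U) * np.sign(h).T              # dJ/dz = delta1 U ∘ sgn(h)

dU = delta1.T @ h.T                               # identity (5)
db2 = delta1.T                                    # identity (3), transposed
dW = delta2.T @ x.T                               # identity (5)
db1 = delta2.T                                    # identity (3), transposed
dx = (delta2 @ W).T                               # identity (1), transposed

# Each gradient matches the shape of its parameter (the shape convention)
assert dU.shape == U.shape and db2.shape == b2.shape
assert dW.shape == W.shape and db1.shape == b1.shape and dx.shape == x.shape
```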