Notes On Backpropagation
Notations
Let’s begin with a notation which lets us refer to weights in the network in an unambiguous way. We’ll use $w^l_{jk}$ to denote the weight for the connection from the $k$th neuron in the $(l-1)$th layer to the $j$th neuron in the $l$th layer. So, for example, the diagram below shows the weight on a connection from the 4th neuron in the 2nd layer to the 2nd neuron in the 3rd layer of a network:
We use a similar notation for the network’s biases and activations. Explicitly, we use $b^l_j$ for the bias of the $j$th neuron in the $l$th layer. And we use $a^l_j$ for the activation of the $j$th neuron in the $l$th layer. The following diagram shows examples of these notations in use:
With these notations, the activation $a^l_j$ of the $j$th neuron in the $l$th layer is related to the activations in the $(l-1)$th layer by the equation

$$a^l_j = \sigma\left(\sum_{k=1}^{n_{l-1}} w^l_{jk} \times a^{l-1}_k + b^l_j\right)$$

where $n_{l-1}$ is the number of neurons in the $(l-1)$th layer, and $\sigma$ is an activation function, such as the sigmoid, tanh, or ReLU.
To rewrite this expression in matrix form, we define a weight matrix $w^l$ for each layer $l$. The entries of the weight matrix $w^l$ are just the weights connecting to the $l$th layer of neurons; that is, the entry in the $j$th row and $k$th column is $w^l_{jk}$. Similarly, for each layer $l$ we define a bias vector $b^l$, whose $j$th entry is $b^l_j$. And finally, we define an activation vector $a^l$ whose components are the activations $a^l_j$.
The last ingredient we need for the matrix form is the idea of vectorizing a function such as $\sigma$. The idea is that we want to apply a function such as $\sigma$ to every element in a vector $v$. We use the obvious notation $\sigma(v)$ to denote this kind of elementwise application of a function. That is, the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$.
With these notations in mind, the above equation can be rewritten in the beautiful and compact vectorized form:

$$a^l = \sigma(w^l a^{l-1} + b^l)$$
That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we’ve taken until now. Think of it as a way of escaping index hell, while remaining precise about what’s going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization.
When using the above vectorized equation to compute $a^l$, we compute the intermediate quantity

$$z^l = w^l a^{l-1} + b^l$$

so we also write

$$a^l = \sigma(z^l)$$

$z^l$ has components

$$z^l_j = \sum_k w^l_{jk} \times a^{l-1}_k + b^l_j$$

$z^l_j$ is just the weighted input to the activation function for neuron $j$ in layer $l$.
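As a concrete illustration, the feedforward step for a single layer can be sketched in NumPy; the layer sizes and values below are made up for the example:

```python
import numpy as np

def sigmoid(z):
    """Elementwise sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical layer: 3 neurons in layer l-1, 2 neurons in layer l.
rng = np.random.default_rng(0)
w = rng.standard_normal((2, 3))       # w[j, k] holds the weight w^l_{jk}
b = rng.standard_normal((2, 1))       # bias vector b^l
a_prev = rng.standard_normal((3, 1))  # activations a^{l-1}

z = w @ a_prev + b  # weighted input z^l = w^l a^{l-1} + b^l
a = sigmoid(z)      # activation a^l = sigma(z^l)
```

Storing activations as column vectors makes `w @ a_prev` line up exactly with the index convention $w^l_{jk}$ above.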
Cost function
To train a neural network you need some measure of error between the computed outputs and the desired target outputs of the training data. The most common measure of error is called mean squared error. However, some research results suggest that a different measure, called cross-entropy error, is sometimes preferable to mean squared error.
So, which is better for neural network training: mean squared error or mean cross-entropy error? The answer, as usual, is that it depends on the particular problem. Research results in this area are rather difficult to compare. If one of the error functions were clearly superior to the other in all situations, there would be no need for articles like this one. The consensus opinion among my immediate colleagues is that it’s best to try mean cross-entropy error first; then, if you have time, try mean squared error.
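To make the two measures concrete, here is a minimal NumPy sketch of both. The function names are mine, not a standard API, and each row of `outputs`/`targets` is assumed to be one training example:

```python
import numpy as np

def mean_squared_error(outputs, targets):
    """MSE over a batch: average of 0.5 * ||a - y||^2 per example."""
    return np.mean(0.5 * np.sum((outputs - targets) ** 2, axis=1))

def mean_cross_entropy_error(outputs, targets, eps=1e-12):
    """Mean cross-entropy for outputs in (0, 1), e.g. from a sigmoid."""
    a = np.clip(outputs, eps, 1.0 - eps)  # avoid log(0)
    per_example = -np.sum(targets * np.log(a) + (1 - targets) * np.log(1 - a), axis=1)
    return np.mean(per_example)

a = np.array([[0.8, 0.2]])  # network outputs for one example
y = np.array([[1.0, 0.0]])  # desired targets
mse = mean_squared_error(a, y)
ce = mean_cross_entropy_error(a, y)
```

Both are small when the outputs match the targets; cross-entropy grows much faster than MSE as an output approaches the wrong extreme, which is one intuition for why it is often tried first.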
The goal of backpropagation is to compute the partial derivatives $\partial C / \partial w^l$ and $\partial C / \partial b^l$ of the cost function $C$ with respect to the weights $w^l$ and biases $b^l$ in each layer $l$ of the network. For backpropagation to work we need to make two main assumptions about the form of the cost function.
The first assumption is that the cost function can be written as an average over the cost functions of individual training examples:

$$C = \frac{1}{M} \sum_{m=1}^{M} C_m$$

where $M$ is the number of training examples, and $C_m$ is the cost of the $m$th training example.
The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network:

$$C = C^L(a^L_1, \cdots, a^L_i, \cdots, a^L_{n_L}) = C^L(\sigma(z^L_1), \cdots, \sigma(z^L_i), \cdots, \sigma(z^L_{n_L})) = f^L(z^L_1, \cdots, z^L_i, \cdots, z^L_{n_L})$$

Before going further, recall the multivariable chain rule. Suppose

$$y = y(x_1, \cdots, x_i, \cdots, x_n), \qquad x_i = x_i(t_1, \cdots, t_j, \cdots, t_m); \quad 1 \le i \le n$$

Based on the chain rule for partial derivatives of multivariable functions, we get

$$\frac{\partial y}{\partial t_j} = \sum_{i=1}^{n} \frac{\partial y}{\partial x_i} \times \frac{\partial x_i}{\partial t_j}; \quad 1 \le j \le m$$
If we slice a neural network and keep the layers from $l$ to $L$, and treat all neurons in these layers as one big black box, then we can also rewrite the cost $C$ as a new function with inputs $(z^l_1, \cdots, z^l_j, \cdots, z^l_{n_l})$.
Note: We can also rewrite $z^l_j$ as a new function with inputs $(z^{l-1}_1, \cdots, z^{l-1}_k, \cdots, z^{l-1}_{n_{l-1}})$:

$$z^l_j = \sum_{k=1}^{n_{l-1}} w^l_{jk} \times a^{l-1}_k + b^l_j = z^l_j(a^{l-1}_1, \cdots, a^{l-1}_k, \cdots, a^{l-1}_{n_{l-1}})$$

$$= \sum_{k=1}^{n_{l-1}} w^l_{jk} \times \sigma(z^{l-1}_k) + b^l_j = z^l_j(\sigma(z^{l-1}_1), \cdots, \sigma(z^{l-1}_k), \cdots, \sigma(z^{l-1}_{n_{l-1}}))$$
We will show how to apply the above chain rule to the backpropagation algorithm in neural networks shortly.
Suppose $s$ and $t$ are two vectors of the same dimension. Then we use $s \odot t$ to denote the elementwise product of the two vectors. Thus the components of $s \odot t$ are just $(s \odot t)_j = s_j \times t_j$. For example:

$$\begin{bmatrix} s_1 \\ \vdots \\ s_j \\ \vdots \\ s_n \end{bmatrix} \odot \begin{bmatrix} t_1 \\ \vdots \\ t_j \\ \vdots \\ t_n \end{bmatrix} = \begin{bmatrix} s_1 \times t_1 \\ \vdots \\ s_j \times t_j \\ \vdots \\ s_n \times t_n \end{bmatrix}$$
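In NumPy, this elementwise (Hadamard) product is simply the `*` operator on arrays of the same shape:

```python
import numpy as np

s = np.array([1.0, 2.0, 3.0])
t = np.array([4.0, 5.0, 6.0])
hadamard = s * t  # elementwise product: [4., 10., 18.]
```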
Calculation of Backpropagation
We first introduce an intermediate quantity, $\delta^l_j$, which we call the error in the $j$th neuron in the $l$th layer. Backpropagation will give us a procedure to compute the error $\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$. We define

$$\delta^l_j = \frac{\partial C}{\partial z^l_j}$$
Since $\partial z^l_j / \partial b^l_j = 1$, we have $\partial C / \partial b^l_j = \delta^l_j$. The above equation can be written in vectorized form, and we define the derivative of $C$ with respect to $b^l$ as

$$\nabla_{b^l} C = \frac{\partial C}{\partial b^l} = \delta^l$$
$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^l_j} \times \frac{\partial z^l_j}{\partial w^l_{jk}} = \delta^l_j \times \frac{\partial \left( \sum_{k=1}^{n_{l-1}} w^l_{jk} \times a^{l-1}_k + b^l_j \right)}{\partial w^l_{jk}} = \delta^l_j \times a^{l-1}_k$$
The above equation can be written in vectorized form, and we define the derivative of $C$ with respect to $w^l$ as

$$\nabla_{w^l} C = \frac{\partial C}{\partial w^l} = \delta^l \times (a^{l-1})^T$$
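Since $\delta^l$ and $a^{l-1}$ are column vectors, $\delta^l (a^{l-1})^T$ is an outer product whose $(j, k)$ entry is $\delta^l_j \times a^{l-1}_k$. A quick NumPy check with made-up numbers:

```python
import numpy as np

delta = np.array([[0.1], [0.3]])          # delta^l for a layer of 2 neurons
a_prev = np.array([[1.0], [2.0], [0.5]])  # a^{l-1} for a layer of 3 neurons

grad_w = delta @ a_prev.T  # shape (2, 3); entry (j, k) is delta_j * a_prev_k
grad_b = delta             # gradient with respect to the bias vector
```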
Note: From $\partial C / \partial w^l_{jk} = \delta^l_j \times a^{l-1}_k$, when the activation $a^{l-1}_k$ is small ($\approx 0$), the gradient term will also tend to be small. In this case, we’ll say the weight learns slowly, meaning that it’s not changing much during gradient descent. In other words, one consequence is that weights associated with low-activation neurons learn slowly.
The key issue is how to calculate $\delta^l_j$ for each layer $l$ ($L \ge l \ge 2$). For the output layer $L$,

$$\delta^L_j = \frac{\partial C}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \times \frac{\partial a^L_j}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \times \frac{\partial \sigma(z^L_j)}{\partial z^L_j} = \frac{\partial C}{\partial a^L_j} \times \sigma'(z^L_j)$$

In vectorized form,

$$\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$$
We can start from layer $L$ and go backwards layer by layer, calculating the $\delta^{l-1}_k$ of the previous layer $l-1$ from the $\delta^l_j$ of the current layer $l$, using the multivariable chain rule introduced before:

$$\delta^{l-1}_k = \frac{\partial C}{\partial z^{l-1}_k} = \sum_{j=1}^{n_l} \frac{\partial f^l}{\partial z^l_j} \times \frac{\partial z^l_j}{\partial z^{l-1}_k} = \sum_{j=1}^{n_l} \frac{\partial C}{\partial z^l_j} \times \frac{\partial z^l_j}{\partial z^{l-1}_k}$$
We have

$$\frac{\partial z^l_j}{\partial z^{l-1}_k} = \frac{\partial \left( \sum_{k=1}^{n_{l-1}} w^l_{jk} \times a^{l-1}_k + b^l_j \right)}{\partial z^{l-1}_k} = \frac{\partial \left( \sum_{k=1}^{n_{l-1}} w^l_{jk} \times \sigma(z^{l-1}_k) + b^l_j \right)}{\partial z^{l-1}_k}$$

$$= \frac{\partial \left( \sum_{k=1}^{n_{l-1}} w^l_{jk} \times \sigma(z^{l-1}_k) + b^l_j \right)}{\partial \sigma(z^{l-1}_k)} \times \frac{\partial \sigma(z^{l-1}_k)}{\partial z^{l-1}_k} = w^l_{jk} \times \sigma'(z^{l-1}_k)$$
So

$$\delta^{l-1}_k = \sum_{j=1}^{n_l} \delta^l_j \times w^l_{jk} \times \sigma'(z^{l-1}_k)$$
Note: From the above equation, we can get $\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$. Consider the term $\sigma'(z^l_k)$: if the activation function $\sigma$ is the sigmoid function, then $\sigma$ becomes very flat when $\sigma(z^l_k)$ is approximately 0 or 1, and when this occurs we have $\sigma'(z^l_k) \approx 0$. We know $\partial C / \partial w^l_{jk} = \delta^l_j \times a^{l-1}_k$, so the lesson is that a weight in the layer will learn slowly if the neuron activation $\sigma(z^l_k)$ is either low ($\approx 0$) or high ($\approx 1$). In this case it’s common to say the neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly). Similar remarks hold for the biases.
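A quick numeric illustration of saturation, using the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ for the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z={z:5.1f}  sigma(z)={sigmoid(z):.6f}  sigma'(z)={sigmoid_prime(z):.6f}")
# sigma'(0) = 0.25 is the maximum; by z = 10 the neuron is saturated
# (sigma(z) is nearly 1) and sigma'(z) nearly vanishes.
```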
In summary, the four fundamental equations of backpropagation are

$$\delta^L = \nabla_{a^L} C \odot \sigma'(z^L)$$

$$\delta^{l-1} = \left((w^l)^T \delta^l\right) \odot \sigma'(z^{l-1})$$

$$\nabla_{b^l} C = \delta^l$$

$$\nabla_{w^l} C = \delta^l \times (a^{l-1})^T$$
Implementation of Backpropagation
The previous section derived backpropagation; let’s now write the implementation out explicitly in the form of an algorithm.
1. Input vector $x^{(m)}$: Assign $x^{(m)}$ to the activation $a^1$ for the input layer.
2. Feedforward: For each layer $l = 2, 3, \cdots, L$ compute $z^l = w^l a^{l-1} + b^l$ and $a^l = \sigma(z^l)$.
3. Error in the last layer $\delta^L$: Compute the vector $\delta^L = \nabla_{a^L} C_m \odot \sigma'(z^L)$.
4. Backpropagate the error to previous layers: For each $l = L-1, L-2, \cdots, 2$ compute $\delta^l = \left((w^{l+1})^T \delta^{l+1}\right) \odot \sigma'(z^l)$.
5. Output: The gradient of the cost function is given by $\nabla_{b^l} C_m = \delta^l$ and $\nabla_{w^l} C_m = \delta^l \times (a^{l-1})^T$.
Combining these gradients with gradient descent, the weights and biases are updated as

$$w^l \to w^l - \frac{\eta}{M} \sum_{m=1}^{M} \nabla_{w^l} C_m$$

$$b^l \to b^l - \frac{\eta}{M} \sum_{m=1}^{M} \nabla_{b^l} C_m$$

where $\eta$ is the learning rate.
Of course, to implement stochastic gradient descent in practice you also need an outer loop generating mini-batches of training examples, and an outer loop stepping through multiple epochs of training. We’ve omitted those for simplicity.
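The five steps above, plus the gradient-descent update, can be sketched end to end in NumPy. This is an illustrative implementation with my own naming; it assumes the quadratic cost $C_m = \frac{1}{2}\|a^L - y\|^2$, for which $\nabla_{a^L} C_m = a^L - y$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Return (grad_w, grad_b) for one training example, quadratic cost assumed."""
    # Steps 1-2: feedforward, storing every z^l and a^l.
    a = x
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Step 3: output-layer error, delta^L = (a^L - y) * sigma'(z^L).
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_w = [None] * len(weights)
    grad_b = [None] * len(biases)
    grad_w[-1] = delta @ activations[-2].T
    grad_b[-1] = delta
    # Steps 4-5: propagate backwards, delta^l = (w^{l+1})^T delta^{l+1} * sigma'(z^l).
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grad_w[l] = delta @ activations[l].T
        grad_b[l] = delta
    return grad_w, grad_b

# One gradient-descent step on a tiny made-up network: sizes 3 -> 4 -> 2,
# using a single example (i.e. a mini-batch of size 1).
rng = np.random.default_rng(1)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal((4, 1)), rng.standard_normal((2, 1))]
x = rng.standard_normal((3, 1))
y = np.array([[1.0], [0.0]])

eta = 0.5  # learning rate
grad_w, grad_b = backprop(weights, biases, x, y)
weights = [w - eta * gw for w, gw in zip(weights, grad_w)]
biases = [b - eta * gb for b, gb in zip(biases, grad_b)]
```

For a real mini-batch you would average `grad_w` and `grad_b` over the examples before applying the update, exactly as in the update equations above.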
One simple approach to computing the gradient is the finite-difference approximation

$$\frac{\partial C}{\partial w_j} \approx \frac{C(w + \epsilon e_j) - C(w)}{\epsilon}$$

where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit vector in the $j$th direction. In other words, we can estimate $\partial C / \partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$. The same idea lets us compute the partial derivatives $\partial C / \partial b$ with respect to the biases.
Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to compute $C(w + \epsilon e_j)$ in order to compute $\partial C / \partial w_j$. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute $C(w)$ as well, so that’s a total of a million and one passes through the network.
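Despite its slowness, the finite-difference idea is very useful as a correctness check for a backpropagation implementation: perturb one parameter, recompute the cost, and compare against the analytic gradient. A small sketch for a single sigmoid neuron (all names here are illustrative, and the cost is the quadratic cost used earlier):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, x, y):
    """Quadratic cost of a single sigmoid neuron with weight row-vector w (no bias)."""
    a = sigmoid(w @ x)
    return 0.5 * np.sum((a - y) ** 2)

w = np.array([[0.5, -0.3]])  # shape (1, 2)
x = np.array([[1.0], [2.0]])  # shape (2, 1)
y = np.array([[1.0]])

# Analytic gradient via the chain rule: dC/dw = (a - y) * sigma'(z) * x^T,
# using sigma'(z) = a * (1 - a).
z = w @ x
a = sigmoid(z)
grad_analytic = ((a - y) * a * (1 - a)) @ x.T

# Numerical gradient: one extra forward pass per weight (the slow part).
eps = 1e-6
grad_numeric = np.zeros_like(w)
for j in range(w.size):
    w_plus = w.copy()
    w_plus.flat[j] += eps
    grad_numeric.flat[j] = (cost(w_plus, x, y) - cost(w, x, y)) / eps
```

If the two gradients disagree beyond the expected finite-difference error, the backpropagation code has a bug.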
What’s clever about backpropagation is that it enables us to compute all the partial derivatives at the same time, using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as that of the forward pass. (This should be plausible, but it requires some analysis to make a careful statement. It’s plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it’s multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost.) And so the total cost of backpropagation is roughly the same as making just two forward passes through the network.
So even though backpropagation appears superficially more complex than the obvious simple approach above, it’s actually much, much faster!