Backpropagation

Uploaded by

aimad baigouar
Copyright
© All Rights Reserved

Backpropagation

Chain rule refresher
Applying the chain rule
Saving work with memoization
Code example

The goal of backpropagation is straightforward: adjust each weight in the network in proportion to how
much it contributes to overall error. If we iteratively reduce the error attributable to each weight, we
eventually arrive at a set of weights that produces good predictions.

Chain rule refresher


Forward propagation can be viewed as a long series of nested equations. If you think of
feed forward this way, then backpropagation is merely an application of the chain rule to find the
derivative of cost with respect to any variable in the nested equation. Given a forward propagation
function:
f(x) = A(B(C(x)))
A, B, and C are activation functions at different layers. Using the chain rule we can easily calculate the
derivative of f(x) with respect to x:
f'(x) = f'(A) \cdot A'(B) \cdot B'(C) \cdot C'(x)
How about the derivative with respect to B? To find it, you can pretend B(C(x)) is a constant,
replace it with a placeholder variable B, and proceed to find the derivative normally
with respect to B.
f'(B) = f'(A) \cdot A'(B)
This simple technique extends to any variable within a function and lets us pinpoint the
impact each variable has on the total output.
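
As a quick numerical check of the chain rule, here is a minimal Python sketch. The concrete choices for A, B, and C (tanh, squaring, and a linear map) are assumptions made only for illustration; the text does not specify them. The sketch compares the chain-rule product against a finite-difference estimate.

```python
import math

# Hypothetical stand-ins for the three nested functions (not from the text).
def C(x): return 3.0 * x + 1.0
def dC(x): return 3.0

def B(c): return c ** 2
def dB(c): return 2.0 * c

def A(b): return math.tanh(b)
def dA(b): return 1.0 - math.tanh(b) ** 2

def f(x):
    return A(B(C(x)))

def df_dx(x):
    # Each derivative is evaluated at the value its function received.
    c = C(x)
    b = B(c)
    return dA(b) * dB(c) * dC(x)

def df_dB(x):
    # Treat B(C(x)) as a single placeholder value and differentiate A only.
    b = B(C(x))
    return dA(b)

def numeric_derivative(g, x, eps=1e-6):
    # Central finite difference, used only as a sanity check.
    return (g(x + eps) - g(x - eps)) / (2 * eps)

x0 = 0.1
analytic = df_dx(x0)
numeric = numeric_derivative(f, x0)
print(analytic, numeric)
```

In this sketch f'(A) equals 1 because f is simply the output of A, so that factor is left out of both products.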

Applying the chain rule


Let's use the chain rule to calculate the derivative of cost with respect to any weight in the network. The
chain rule will help us identify how much each weight contributes to our overall error and the direction to
update each weight to reduce our error. Here are the equations we need to make a prediction and
calculate total error, or cost:

Given a network consisting of a single neuron, total cost could be calculated as:
Cost = C(R(Z(XW)))
Using the chain rule we can easily find the derivative of Cost with respect to weight W.
C'(W) = C'(R) \cdot R'(Z) \cdot Z'(W)
      = (\hat{y} - y) \cdot R'(Z) \cdot X
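
The single-neuron derivative can be checked numerically. The sketch below assumes R is ReLU and C is half the squared error, which is consistent with the (\hat{y} - y) factor used in this document but is not stated in the text; all function names are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def cost(y_hat, y):
    # Assumed cost: half squared error, whose derivative is (y_hat - y).
    return 0.5 * (y_hat - y) ** 2

def forward(x, w):
    z = x * w            # Z = XW
    return z, relu(z)    # prediction y_hat = R(Z)

def cost_derivative_wrt_w(x, w, y):
    z, y_hat = forward(x, w)
    # C'(W) = (y_hat - y) * R'(Z) * X
    return (y_hat - y) * relu_prime(z) * x

def numeric_cost_derivative(x, w, y, eps=1e-6):
    # Central finite difference on the weight, used as a sanity check.
    up = cost(forward(x, w + eps)[1], y)
    down = cost(forward(x, w - eps)[1], y)
    return (up - down) / (2 * eps)

x, w, y = 2.0, 0.5, 3.0
analytic = cost_derivative_wrt_w(x, w, y)
numeric = numeric_cost_derivative(x, w, y)
print(analytic, numeric)
```

A negative derivative here means that increasing W would reduce the cost, which is the direction information used when updating the weight.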
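Now that we have an equation to calculate the derivative of cost with respect to any weight, let's go back to
our toy neural network example: an input X, a hidden layer with weight W_h and activation H, and an
output layer with weight W_o.
What is the derivative of cost with respect to W_o?
C'(W_o) = C'(\hat{y}) \cdot \hat{y}'(Z_o) \cdot Z_o'(W_o)
        = (\hat{y} - y) \cdot R'(Z_o) \cdot H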
And how about with respect to W_h? To find out we just keep going further back in our function, applying the
chain rule recursively until we get to the function that has the W_h term.
C'(W_h) = C'(\hat{y}) \cdot \hat{y}'(Z_o) \cdot Z_o'(H) \cdot H'(Z_h) \cdot Z_h'(W_h)
        = (\hat{y} - y) \cdot R'(Z_o) \cdot W_o \cdot R'(Z_h) \cdot X
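And just for fun, what if our network had 10 hidden layers? What is the derivative of cost for the first weight
w_1?
C'(w_1) = \frac{dC}{d\hat{y}} \cdot \frac{d\hat{y}}{dZ_{11}} \cdot \frac{dZ_{11}}{dH_{10}} \cdot \frac{dH_{10}}{dZ_{10}} \cdots \frac{dH_1}{dZ_1} \cdot \frac{dZ_1}{dw_1}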
See the pattern? The number of calculations required to compute cost derivatives increases as our
network grows deeper. Notice also the redundancy in our derivative calculations. Each layer's cost
derivative appends two new terms to the terms that have already been calculated by the layers above it.
What if there were a way to save our work and avoid these duplicate calculations?

Saving work with memoization


Memoization is a computer science term which simply means: don’t recompute the same thing over and
over. In memoization we store previously computed results to avoid recalculating the same function. It's
handy for speeding up recursive functions, of which backpropagation is one. Notice the pattern in the
derivative equations for the two weights of the toy network:
C'(W_o) = (\hat{y} - y) \cdot R'(Z_o) \cdot H
C'(W_h) = (\hat{y} - y) \cdot R'(Z_o) \cdot W_o \cdot R'(Z_h) \cdot X

Each of these layers is recomputing the same derivatives! Instead of writing out long derivative equations
for every weight, we can use memoization to save our work as we backprop error through the network. To
do this, we define 3 equations (below), which together encapsulate all the calculations needed for
backpropagation. The math is the same, but the equations provide a nice shorthand we can use to track
which calculations we've already performed and save our work as we move backwards through the
network.
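
To make the saving concrete, here is a hedged sketch on a hypothetical chain of scalar layers with one weight per layer. It assumes ReLU for R and a half squared error cost, and it counts only the multiplications in the backward calculation. The naive version rebuilds each weight's derivative from the output layer; the memoized version carries one running error value backwards.

```python
def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def forward(x, weights):
    activations = [x]   # activations[k] is the input to layer k
    zs = []             # zs[k] is the weighted input of layer k
    for w in weights:
        zs.append(w * activations[-1])
        activations.append(relu(zs[-1]))
    return zs, activations

def naive_gradients(x, y, weights):
    zs, acts = forward(x, weights)
    n = len(weights)
    grads, multiplications = [], 0
    for k in range(n):
        d = acts[-1] - y                      # start again from the cost
        for j in range(n - 1, k, -1):         # walk back through later layers
            d = d * relu_prime(zs[j]) * weights[j]
            multiplications += 2
        d = d * relu_prime(zs[k]) * acts[k]   # finish at layer k
        multiplications += 2
        grads.append(d)
    return grads, multiplications

def memoized_gradients(x, y, weights):
    zs, acts = forward(x, weights)
    n = len(weights)
    grads, multiplications = [0.0] * n, 0
    error = (acts[-1] - y) * relu_prime(zs[-1])   # output layer error
    multiplications += 1
    for k in range(n - 1, -1, -1):
        grads[k] = error * acts[k]                # layer error times layer input
        multiplications += 1
        if k > 0:
            # Reuse the stored error instead of recomputing it from the output.
            error = error * weights[k] * relu_prime(zs[k - 1])
            multiplications += 2
    return grads, multiplications

weights = [0.9 + 0.02 * k for k in range(10)]
naive, naive_count = naive_gradients(1.5, 1.0, weights)
memo, memo_count = memoized_gradients(1.5, 1.0, weights)
print(naive_count, memo_count)
```

For n layers the naive count in this sketch grows as n^2 + n, while the memoized count grows as 3n - 1, and both versions return the same derivatives up to floating-point rounding.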

We first calculate the output layer error and pass the result to the hidden layer before it. After calculating
that hidden layer's error, we pass its error value back to the hidden layer before it, and so on. As we move
back through the network we apply the third formula at every layer to calculate the
derivative of cost with respect to that layer's weights. The resulting derivative tells us in which direction to
adjust our weights to reduce overall cost.

Note
The term layer error refers to the derivative of cost with respect to a layer's input. It answers the
question: how does the cost function output change when the input to that layer changes?

Output layer error


To calculate output layer error we need to find the derivative of cost with respect to the output layer input,
Z_o. It answers the question: how are the final layer's weights impacting overall error in the network? The
derivative is then:
C'(Z_o) = (\hat{y} - y) \cdot R'(Z_o)
To simplify notation, ML practitioners typically replace the sequence (\hat{y} - y) \cdot R'(Z_o) with the term E_o. So
our formula for output layer error equals:
E_o = (\hat{y} - y) \cdot R'(Z_o)

Hidden layer error

To calculate hidden layer error we need to find the derivative of cost with respect to the hidden layer input,
Z_h.
C'(Z_h) = (\hat{y} - y) \cdot R'(Z_o) \cdot W_o \cdot R'(Z_h)
Next we can swap in the E_o term above to avoid duplication and create a new simplified equation for
hidden layer error:
E_h = E_o \cdot W_o \cdot R'(Z_h)
This formula is at the core of backpropagation. We calculate the current layer's error, and pass the
weighted error back to the previous layer, continuing the process until we arrive at our first hidden layer.
Along the way we update the weights using the derivative of cost with respect to each weight.
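
The two layer-error formulas can be sanity-checked on a scalar version of the toy network. This is a sketch under assumptions: R is taken to be ReLU, the cost is half the squared error, and the helper names and input values are hypothetical. Each error is compared with a finite-difference derivative of cost with respect to that layer's input.

```python
def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def cost_from_z_o(z_o, y):
    # Cost as a function of the output layer input only.
    return 0.5 * (relu(z_o) - y) ** 2

def cost_from_z_h(z_h, w_o, y):
    # Cost as a function of the hidden layer input only.
    return cost_from_z_o(relu(z_h) * w_o, y)

def layer_errors(x, y, w_h, w_o):
    z_h = x * w_h
    h = relu(z_h)
    z_o = h * w_o
    y_hat = relu(z_o)
    e_o = (y_hat - y) * relu_prime(z_o)    # E_o = (y_hat - y) * R'(Z_o)
    e_h = e_o * w_o * relu_prime(z_h)      # E_h = E_o * W_o * R'(Z_h)
    return z_h, z_o, e_o, e_h

def numeric_derivative(g, v, eps=1e-6):
    return (g(v + eps) - g(v - eps)) / (2 * eps)

x, y, w_h, w_o = 1.0, 2.0, 0.8, 1.5
z_h, z_o, e_o, e_h = layer_errors(x, y, w_h, w_o)
numeric_e_o = numeric_derivative(lambda z: cost_from_z_o(z, y), z_o)
numeric_e_h = numeric_derivative(lambda z: cost_from_z_h(z, w_o, y), z_h)
print(e_o, numeric_e_o)
print(e_h, numeric_e_h)
```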

Derivative of cost with respect to any weight

Let’s return to our formula for the derivative of cost with respect to the output layer weight W_o.
C'(W_o) = (\hat{y} - y) \cdot R'(Z_o) \cdot H
We know we can replace the first part with our equation for output layer error E_o. H represents the hidden
layer activation.
C'(W_o) = E_o \cdot H
So to find the derivative of cost with respect to any weight in our network, we simply multiply the
corresponding layer's error by its input (the previous layer's output).
C'(w) = CurrentLayerError \cdot CurrentLayerInput

Note
Input refers to the activation from the previous layer, not the weighted input, Z.
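
When a layer has several inputs and outputs, the same rule applies entry by entry: the derivative for the weight connecting input i to output j is the input's activation times the output's error. The sketch below illustrates this on a single hypothetical layer with 2 inputs and 3 outputs. The ReLU activation, the half squared error cost summed over outputs, and all names and values are assumptions rather than part of the original text.

```python
def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def layer_forward(inputs, W):
    # W[i][j] connects input i to output j, so z_j = sum_i inputs[i] * W[i][j].
    n_out = len(W[0])
    zs = [sum(inputs[i] * W[i][j] for i in range(len(inputs)))
          for j in range(n_out)]
    return zs, [relu(z) for z in zs]

def cost(outputs, targets):
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))

def weight_gradient(layer_error, layer_input):
    # C'(w) = CurrentLayerError * CurrentLayerInput, for every (i, j) pair.
    return [[a * e for e in layer_error] for a in layer_input]

inputs = [1.0, 2.0]                       # activations from the previous layer
W = [[0.1, 0.4, 0.3],
     [0.2, 0.5, 0.6]]
targets = [1.0, 0.0, 2.0]

zs, outputs = layer_forward(inputs, W)
layer_error = [(o - t) * relu_prime(z)
               for o, t, z in zip(outputs, targets, zs)]
grad = weight_gradient(layer_error, inputs)
for row in grad:
    print(row)
```

The gradient has the same shape as W, so each weight can be updated with its own entry.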

Summary
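Here are the final 3 equations that together form the foundation of backpropagation.
E_o = (\hat{y} - y) \cdot R'(Z_o)
E_h = E_o \cdot W_o \cdot R'(Z_h)
C'(w) = CurrentLayerError \cdot CurrentLayerInput
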
[Figure: the process visualized using our toy neural network example]

Code example
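
The following is a minimal sketch of the three equations applied to a scalar version of the toy network (input, one hidden unit, one output unit). It is not the original listing for this section. The ReLU activation, the half squared error cost, the function names, the starting weights, and the learning rate are all assumptions chosen for illustration.

```python
def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def cost(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

def feed_forward(x, w_h, w_o):
    z_h = x * w_h          # hidden layer weighted input
    h = relu(z_h)          # hidden layer activation
    z_o = h * w_o          # output layer weighted input
    y_hat = relu(z_o)      # prediction
    return z_h, h, z_o, y_hat

def backprop(x, y, w_h, w_o):
    z_h, h, z_o, y_hat = feed_forward(x, w_h, w_o)
    e_o = (y_hat - y) * relu_prime(z_o)   # output layer error
    e_h = e_o * w_o * relu_prime(z_h)     # hidden layer error, reusing e_o
    d_w_o = e_o * h                       # layer error times layer input
    d_w_h = e_h * x
    return d_w_h, d_w_o

def train(x, y, w_h, w_o, learning_rate=0.1, steps=50):
    history = []
    for _ in range(steps):
        history.append(cost(feed_forward(x, w_h, w_o)[3], y))
        d_w_h, d_w_o = backprop(x, y, w_h, w_o)
        # Step each weight against its derivative to reduce the cost.
        w_h -= learning_rate * d_w_h
        w_o -= learning_rate * d_w_o
    history.append(cost(feed_forward(x, w_h, w_o)[3], y))
    return w_h, w_o, history

w_h, w_o, history = train(x=2.0, y=1.0, w_h=0.5, w_o=0.5)
print(f"cost before: {history[0]:.4f}, cost after: {history[-1]:.2e}")
```

With a single training example the loop simply drives the prediction toward the target. A practical implementation would use matrices of weights and batches of examples, but the order of operations (output error, hidden error, weight derivatives) stays the same.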