Backpropagation
The goals of backpropagation are straightforward: adjust each weight in the network in proportion to how
much it contributes to overall error. If we iteratively reduce each weight's contribution to the error, eventually we'll have a
set of weights that produce good predictions.
Given a network consisting of a single neuron, total cost could be calculated as:
Cost = C(R(Z(XW)))
Using the chain rule we can easily find the derivative of Cost with respect to weight W.
C'(W) &= C'(R) \cdot R'(Z) \cdot Z'(W) \\
      &= (\hat{y} - y) \cdot R'(Z) \cdot X
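To make this concrete, here is a minimal sketch (not part of the original text) that evaluates this derivative for a single neuron and checks it against a finite-difference estimate. The choice of ReLU for R and a squared-error cost (which produces the (\hat{y} - y) term), along with all function and variable names, are assumptions for illustration.

# Single-neuron gradient check: C'(W) = (y_hat - y) * R'(Z) * x
def relu(z):
    return max(z, 0.0)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def cost(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

x, w, y = 2.0, 0.5, 3.0                      # input, weight, target

z = x * w                                    # Z = XW
y_hat = relu(z)                              # prediction = R(Z)
grad = (y_hat - y) * relu_prime(z) * x       # chain rule: C'(R) * R'(Z) * Z'(W)

# Finite-difference estimate of dCost/dW for comparison
eps = 1e-6
grad_check = (cost(relu(x * (w + eps)), y) - cost(relu(x * (w - eps)), y)) / (2 * eps)
print(grad, grad_check)                      # both approximately -4.0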
Now that we have an equation to calculate the derivative of cost with respect to any weight, let's go back to
our toy neural network example above.
What is the derivative of cost with respect to the output weight, W_o?
C'(W_o) &= C'(\hat{y}) \cdot \hat{y}'(Z_o) \cdot Z_o'(W_o) \\
        &= (\hat{y} - y) \cdot R'(Z_o) \cdot H
And how about with respect to the hidden weight, W_h? To find out we just keep going further back in our function, applying the
chain rule recursively until we get to the function that has the W_h term.
C'(W_h) &= C'(\hat{y}) \cdot O'(Z_o) \cdot Z_o'(H) \cdot H'(Z_h) \cdot Z_h'(W_h) \\
        &= (\hat{y} - y) \cdot R'(Z_o) \cdot W_o \cdot R'(Z_h) \cdot X
And just for fun, what if our network had 10 hidden layers? What is the derivative of cost with respect to the first weight, w_1?
C'(w_1) = \frac{dC}{d\hat{y}} \cdot \frac{d\hat{y}}{dZ_{11}} \cdot \frac{dZ_{11}}{dH_{10}} \cdot \frac{dH_{10}}{dZ_{10}} \cdot \frac{dZ_{10}}{dH_9} \cdot \frac{dH_9}{dZ_9} \cdots \frac{dZ_2}{dH_1} \cdot \frac{dH_1}{dZ_1} \cdot \frac{dZ_1}{dw_1}
See the pattern? The number of calculations required to compute cost derivatives increases as our
network grows deeper. Notice also the redundancy in our derivative calculations. Each layer's cost
derivative appends two new terms to the terms that have already been calculated by the layers above it.
What if there was a way to save our work somehow and avoid these duplicate calculations?
Each of these layers is recomputing the same derivatives! Instead of writing out long derivative equations
for every weight, we can use memoization to save our work as we backprop error through the network. To
do this, we define 3 equations (below), which together encapsulate all the calculations needed for
backpropagation. The math is the same, but the equations provide a nice shorthand we can use to track
which calculations we've already performed and save our work as we move backwards through the
network.
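Since the original listing of these equations is not reproduced here, the following is a reconstruction consistent with the chain-rule results above (activation function R, squared-error cost); E_o and E_h denote the output and hidden layer errors:

E_o = (\hat{y} - y) \cdot R'(Z_o)
E_h = E_o \cdot W_o \cdot R'(Z_h)
C'(W_{layer}) = E_{layer} \cdot \text{(layer input)}

For the toy network this gives C'(W_o) = E_o \cdot H and C'(W_h) = E_h \cdot X, which match the derivatives above while reusing E_o inside E_h instead of recomputing it.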
We first calculate the output layer error and pass the result to the hidden layer before it. After calculating
the hidden layer error, we pass its error value back to the hidden layer before that, and so on and so
forth. As we move back through the network we apply the 3rd formula at every layer to calculate the
derivative of cost with respect to that layer's weights. This resulting derivative tells us in which direction to
adjust our weights to reduce overall cost (a short sketch of this backward pass follows the notes below).
Note
The term layer error refers to the derivative of cost with respect to a layer's input. It answers the
question: how does the cost function output change when the input to that layer changes?
Note
Input refers to the activation from the previous layer, not the weighted input, Z.
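To make the backward pass concrete, here is a minimal sketch (an illustration, not the original code example) of this layer-by-layer process for an arbitrary stack of fully connected layers. The use of NumPy, ReLU activations, and all variable names are assumptions.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    return (z > 0).astype(float)

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]                                  # input, two hidden layers, output
weights = [rng.normal(size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(1, 3))
y = np.array([[1.0]])

# Forward pass, memoizing each layer's weighted input Z and activation A
A, Z = [x], []
for W in weights:
    Z.append(A[-1] @ W)
    A.append(relu(Z[-1]))
y_hat = A[-1]

# Backward pass: compute the output layer error once, then reuse each
# layer's error to build the error of the layer before it (memoization)
error = (y_hat - y) * relu_prime(Z[-1])               # output layer error
grads = [None] * len(weights)
for l in reversed(range(len(weights))):
    grads[l] = A[l].T @ error                         # C'(W_l) = layer input * layer error
    if l > 0:
        error = (error @ weights[l].T) * relu_prime(Z[l - 1])   # hidden layer error

print([g.shape for g in grads])                       # one gradient per weight matrix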
Summary
Here are the final 3 equations that together form the foundation of backpropagation.
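Restated from the reconstruction above:

E_o = (\hat{y} - y) \cdot R'(Z_o)
E_h = E_o \cdot W_o \cdot R'(Z_h)
C'(W_{layer}) = E_{layer} \cdot \text{(layer input)}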
Here is the process visualized using our toy neural network example above.
Code example
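Below is a hedged sketch for the single-hidden-layer toy network that implements the three equations directly; the scalar setup, ReLU activations, squared-error cost, and the names backprop, Wh, Wo, and lr are illustrative assumptions rather than the original listing.

def relu(z):
    return max(z, 0.0)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def backprop(x, y, Wh, Wo, lr):
    # Forward pass, memoizing Zh, H, Zo, and the prediction y_hat
    Zh = x * Wh
    H = relu(Zh)
    Zo = H * Wo
    y_hat = relu(Zo)

    # Layer errors (equations 1 and 2)
    Eo = (y_hat - y) * relu_prime(Zo)
    Eh = Eo * Wo * relu_prime(Zh)

    # Cost derivative for each layer's weights (equation 3)
    dWo = Eo * H
    dWh = Eh * x

    # Adjust weights in the direction that reduces cost
    Wh -= lr * dWh
    Wo -= lr * dWo
    return Wh, Wo

# Example usage: repeated updates drive the prediction toward the target
Wh, Wo = 0.5, 0.4
for _ in range(50):
    Wh, Wo = backprop(x=1.5, y=0.8, Wh=Wh, Wo=Wo, lr=0.1)
print(relu(relu(1.5 * Wh) * Wo))   # approaches 0.8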