Lec10 Handout
University of Toronto
CSC411 Lec10 1 / 41
Today
Multi-layer Perceptron
Forward propagation
Backward propagation
CSC411 Lec10 2 / 41
Motivating Examples
CSC411 Lec10 3 / 41
Are You Excited about Deep Learning?
CSC411 Lec10 4 / 41
Limitations of Linear Classifiers
Figure: Four inputs (0,0), (0,1), (1,0), (1,1) labelled output = 0 or output = 1 (the XOR pattern); no linear classifier can separate the two classes.
CSC411 Lec10 6 / 41
Inspiration: The Brain
Many machine learning methods are inspired by biology, e.g., the (human) brain
Our brain has ∼ 10^11 neurons, each of which communicates with (is connected to) ∼ 10^4 other neurons
Sigmoid: σ(z) = 1/(1 + exp(−z))
Tanh: tanh(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z))
CSC411 Lec10 9 / 41
Neural Network Architecture (Multi-Layer Perceptron)
Figure: Two different visualizations of a 2-layer neural network. In this example: 3 input units, 4 hidden units and 2 output units.
Each unit computes its value by applying an activation function to a linear combination of the values of the units that point into it
[http://cs231n.github.io/neural-networks-1/]
CSC411 Lec10 10 / 41
Neural Network Architecture (Multi-Layer Perceptron)
Network with one layer of four hidden units:
Figure: Two different visualizations of a 2-layer neural network. In this example: 3 input units, 4 hidden units and 2 output units.
Going deeper: a 3-layer neural network with two layers of hidden units
Figure: A 3-layer neural net with 3 input units, 4 hidden units in each of the first and second hidden layers, and 1 output unit
[http://cs231n.github.io/neural-networks-1/]
CSC411 Lec10 12 / 41
Representational Power
A neural network with at least one hidden layer is a universal approximator (it can approximate any continuous function arbitrarily well, given enough hidden units).
Proof in: Approximation by Superpositions of a Sigmoidal Function, Cybenko, 1989
The capacity of the network increases with more hidden units and more
hidden layers
Why go deeper (still largely an open theoretical question)? A single hidden layer may need an exponential number of neurons to represent some functions; a deep network can be much more compact.
[http://cs231n.github.io/neural-networks-1/]
CSC411 Lec10 13 / 41
Demo
CSC411 Lec10 14 / 41
Neural Networks
CSC411 Lec10 15 / 41
Forward Pass: What does the Network Compute?
Output of the network can be
written as:
hj(x) = f(vj0 + Σ_{i=1}^{D} xi vji)
ok(x) = g(wk0 + Σ_{j=1}^{J} hj(x) wkj)
σ(z) = 1/(1 + exp(−z)),  tanh(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)),  ReLU(z) = max(0, z)
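A minimal NumPy sketch of this forward pass, assuming one hidden layer with weight matrix V (J×D), output weights W (K×J), and bias vectors v0, w0 (all names are mine):

```python
import numpy as np

def forward(x, v0, V, w0, W, f=np.tanh, g=lambda z: z):
    # h_j(x) = f(v_j0 + sum_i x_i v_ji)    -- hidden activities
    h = f(v0 + V @ x)
    # o_k(x) = g(w_k0 + sum_j h_j(x) w_kj) -- network outputs
    o = g(w0 + W @ h)
    return h, o
```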
CSC411 Lec10 16 / 41
Special Case
What is a single-layer network (no hidden units) with a sigmoid activation function?
Network:
ok(x) = 1/(1 + exp(−zk))
zk = wk0 + Σ_{j=1}^{J} xj wkj
Logistic regression!
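As a quick check, a sketch of this special case in NumPy (names mine); it computes exactly the logistic regression model:

```python
import numpy as np

def single_layer_sigmoid(x, w0, W):
    z = w0 + W @ x                   # z_k = w_k0 + sum_j x_j w_kj
    return 1.0 / (1.0 + np.exp(-z))  # o_k = sigma(z_k): logistic regression
```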
CSC411 Lec10 17 / 41
Feedforward network
CSC411 Lec10 18 / 41
How do we train?
CSC411 Lec10 19 / 41
Training Neural Networks
CSC411 Lec10 20 / 41
Training Neural Networks: Back-propagation
Given any error function E and activation functions g(·) and f(·), we just need to derive the gradients
CSC411 Lec10 21 / 41
Key Idea behind Backpropagation
We don’t have targets for a hidden unit, but we can compute how fast the
error changes as we change its activity
▶ Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities
▶ Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined
▶ We can compute error derivatives for all the hidden units efficiently
▶ Once we have the error derivatives for the hidden activities, it's easy to get the error derivatives for the weights going into a hidden unit
This is just the chain rule!
CSC411 Lec10 22 / 41
Useful Derivatives
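For the activations defined earlier, the derivatives needed below are standard (a reference sketch; these are textbook facts, not taken from the slide):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)            # sigma'(z) = sigma(z) * (1 - sigma(z))

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2    # tanh'(z) = 1 - tanh(z)^2

def relu_prime(z):
    return (z > 0).astype(float)    # ReLU'(z) = 1 if z > 0 else 0
```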
CSC411 Lec10 23 / 41
Computing Gradients: Single Layer Network
Let’s take a single layer network and draw it a bit differently
CSC411 Lec10 24 / 41
Computing Gradients: Single Layer Network
The error gradient is computable for any smooth activation function g(·) and any smooth error function
CSC411 Lec10 25 / 41
Computing Gradients: Single Layer Network
∂E/∂wki = (∂E/∂ok) · (∂ok/∂zk) · (∂zk/∂wki),   where δk^o := ∂E/∂ok
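A sketch of these three factors for one training case with a sigmoid output and squared error (squared error is introduced on the next slide; all names are mine):

```python
import numpy as np

def single_layer_grad(x, t, w0, W):
    z = w0 + W @ x                      # z_k = w_k0 + sum_i x_i w_ki
    o = 1.0 / (1.0 + np.exp(-z))        # o_k = g(z_k), sigmoid here
    dE_do = o - t                       # delta^o_k = dE/do_k (squared error)
    do_dz = o * (1.0 - o)               # g'(z_k) for the sigmoid
    # dz_k/dw_ki = x_i, so dE/dw_ki = delta^o_k * g'(z_k) * x_i
    return np.outer(dE_do * do_dz, x)   # shape (K, D)
```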
CSC411 Lec10 26 / 41
Gradient Descent for Single Layer Network
Assuming the error function is mean-squared error (MSE), on a single
training example n, we have
∂E/∂ok^(n) = ok^(n) − tk^(n) := δk^o
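Putting the pieces together, one stochastic gradient descent step for this single-layer sigmoid network with squared error might look like this (a sketch, assuming a learning rate `lr`; names mine):

```python
import numpy as np

def sgd_step(x, t, w0, W, lr=0.1):
    z = w0 + W @ x
    o = 1.0 / (1.0 + np.exp(-z))       # sigmoid outputs
    delta_o = o - t                    # dE/do_k = o_k - t_k
    delta_z = delta_o * o * (1.0 - o)  # dE/dz_k = delta_o_k * g'(z_k)
    W -= lr * np.outer(delta_z, x)     # dE/dw_ki = delta_z_k * x_i
    w0 -= lr * delta_z                 # bias gradients
    return w0, W
```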
CSC411 Lec10 30 / 41
Multi-layer Neural Network
CSC411 Lec10 31 / 41
Back-propagation: Sketch on One Training Case
Convert discrepancy between each output and its target value into an error
derivative
E = (1/2) Σ_k (ok − tk)^2 ;   ∂E/∂ok = ok − tk
Compute error derivatives in each hidden layer from error derivatives in layer
above. [assign blame for error at k to each unit j according to its influence
on k (depends on wkj )]
Use error derivatives w.r.t. activities to get error derivatives w.r.t. the weights, as in the sketch below.
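A minimal sketch of these three steps for the one-hidden-layer network from the forward pass above, with tanh hidden units, linear outputs, and squared error (all names are mine):

```python
import numpy as np

def backprop_one_example(x, t, v0, V, w0, W):
    # Forward pass
    h = np.tanh(v0 + V @ x)                  # hidden activities
    o = w0 + W @ h                           # linear outputs
    # 1. Error derivative at the outputs: dE/do_k = o_k - t_k
    delta_o = o - t
    # 2. Assign blame to each hidden unit j: dE/dh_j = sum_k delta_o_k * w_kj
    delta_h = W.T @ delta_o
    # 3. Error derivatives w.r.t. the weights
    dW = np.outer(delta_o, h)                # dE/dw_kj = delta_o_k * h_j
    dV = np.outer(delta_h * (1 - h**2), x)   # tanh'(z) = 1 - h_j^2
    return dV, dW
```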
CSC411 Lec10 32 / 41
Gradient Descent for Multi-layer Network
The output weight gradients for a multi-layer network are the same as for a single layer network:
∂E/∂wkj = Σ_{n=1}^{N} (∂E/∂ok^(n)) (∂ok^(n)/∂zk^(n)) (∂zk^(n)/∂wkj) = Σ_{n=1}^{N} δk^{z,(n)} hj^(n)
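Over a batch of N examples this sum is a single matrix product; a sketch, assuming `delta_z` stacks δk^{z,(n)} row-wise and `H` stacks the hidden activities hj^(n):

```python
import numpy as np

def output_weight_grad(delta_z, H):
    # delta_z: (N, K), H: (N, J)
    # returns dE/dw_kj = sum_n delta_k^{z,(n)} * h_j^{(n)}, shape (K, J)
    return delta_z.T @ H
```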
CSC411 Lec10 35 / 41
Backprop in deep networks
The exact same ideas (and math) apply when we have multiple hidden layers: compute ∂E/∂hj^L and use it to compute ∂E/∂wij^L and ∂E/∂hj^{L−1}
Two phases:
▶ Forward: Compute the outputs layer by layer (in order)
▶ Backward: Compute the gradients layer by layer (in reverse order)
Modern software packages (Theano, TensorFlow, PyTorch) do this automatically.
▶ You define the computation graph, and it takes care of the rest, as in the sketch below.
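For example, a minimal PyTorch sketch (layer sizes chosen to match the earlier figure; not taken from the slides): you only write the forward computation, and `backward()` fills in every gradient.

```python
import torch
import torch.nn as nn

# 3 inputs -> 4 hidden (sigmoid) -> 2 outputs, as in the earlier figure
model = nn.Sequential(nn.Linear(3, 4), nn.Sigmoid(), nn.Linear(4, 2))

x = torch.randn(5, 3)                  # a small batch of 5 examples
t = torch.randn(5, 2)                  # targets
loss = nn.functional.mse_loss(model(x), t)

loss.backward()                        # backward phase: all gradients, in reverse order
print(model[0].weight.grad.shape)      # gradient w.r.t. the first Linear layer's weights
```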
CSC411 Lec10 37 / 41
Training neural networks
CSC411 Lec10 38 / 41
Activation functions
CSC411 Lec10 39 / 41
Initialization
CSC411 Lec10 40 / 41
Momentum
CSC411 Lec10 41 / 41