Chapter 4
Recurrent neural networks
Deep Learning and Edge Intelligence
Dr. Pratibha Shingare, Assistant Professor, E&TC Department
Getting targets when modeling sequences
•When applying machine learning to sequences, we often want to turn an input
sequence into an output sequence that lives in a different domain.
– E.g. turn a sequence of sound pressures into a sequence of word identities.
•When there is no separate target sequence, we can get a teaching signal by trying to
predict the next term in the input sequence.
– The target output sequence is the input sequence with an advance of 1 step.
– This seems much more natural than trying to predict one pixel in an image
from the other pixels, or one patch of an image from the rest of the image.
– For temporal sequences there is a natural order for the predictions.
•Predicting the next term in a sequence blurs the distinction between supervised and
unsupervised learning.
– It uses methods designed for supervised learning, but it doesn’t require a
separate teaching signal.
Geoffrey Hinton
Beyond memoryless models
• If we give our generative model some hidden state, and if we give
this hidden state its own internal dynamics, we get a much more
interesting kind of model.
– It can store information in its hidden state for a long time.
– If the dynamics is noisy and the way it generates outputs from its
hidden state is noisy, we can never know its exact hidden state.
– The best we can do is to infer a probability distribution over the
space of hidden state vectors.
• This inference is only tractable for two types of hidden state model.
Linear Dynamical Systems (engineers love them!)
• These are generative models. They have a real-valued hidden state that cannot be observed directly.
– The hidden state has linear dynamics with Gaussian noise and produces the observations using a linear model with Gaussian noise.
– There may also be driving inputs.
• To predict the next output (so that we can shoot down the missile) we need to infer the hidden state.
– A linearly transformed Gaussian is a Gaussian. So the distribution over the hidden state given the data so far is Gaussian. It can be computed using “Kalman filtering”.
[Figure: the model unrolled in time; at each time step a driving input feeds the hidden state, the hidden state produces an output, and the hidden state feeds the next hidden state.]
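As a concrete illustration of that inference step, here is a minimal numpy sketch of Kalman filtering for a linear dynamical system. The matrices A, B, C and noise covariances Q, R are illustrative placeholders, not values from the slides.

```python
import numpy as np

def kalman_filter(ys, us, A, B, C, Q, R, mu0, P0):
    """Gaussian posterior over the hidden state after each observation.

    x_t = A x_{t-1} + B u_t + w_t,  w_t ~ N(0, Q)   (linear dynamics, Gaussian noise)
    y_t = C x_t + v_t,              v_t ~ N(0, R)   (linear observation model, Gaussian noise)
    """
    mu, P = mu0, P0
    posteriors = []
    for y, u in zip(ys, us):
        # Predict: a linearly transformed Gaussian is still a Gaussian.
        mu_pred = A @ mu + B @ u
        P_pred = A @ P @ A.T + Q
        # Update: condition the Gaussian prediction on the new observation.
        S = C @ P_pred @ C.T + R                 # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
        mu = mu_pred + K @ (y - C @ mu_pred)
        P = P_pred - K @ C @ P_pred
        posteriors.append((mu, P))
    return posteriors
```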
Hidden Markov Models (computer scientists love them!)
• Hidden Markov Models have a discrete one-of-N hidden state. Transitions between states are stochastic and controlled by a transition matrix. The outputs produced by a state are stochastic.
– We cannot be sure which state produced a given output. So the state is “hidden”.
– It is easy to represent a probability distribution across N states with N numbers.
• To predict the next output we need to infer the probability distribution over hidden states.
– HMMs have efficient algorithms for inference and learning.
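A minimal sketch of that inference, assuming a transition matrix `T`, emission matrix `E` and initial distribution `pi` (all hypothetical parameters): the forward algorithm keeps exactly the N numbers that represent the distribution over hidden states.

```python
import numpy as np

def forward_filter(obs, pi, T, E):
    """p(state_t | obs_1..t) for an HMM with N hidden states.

    pi: (N,) initial state distribution
    T:  (N, N) transition matrix, T[i, j] = p(j | i)
    E:  (N, M) emission matrix,  E[i, k] = p(symbol k | state i)
    """
    alpha = pi * E[:, obs[0]]
    alpha /= alpha.sum()
    filtered = [alpha]
    for o in obs[1:]:
        alpha = (alpha @ T) * E[:, o]   # propagate, then weight by the evidence
        alpha /= alpha.sum()            # N numbers represent the whole distribution
        filtered.append(alpha)
    return filtered

def predict_next_output(alpha, T, E):
    """Distribution over the next symbol given the filtered state distribution."""
    return (alpha @ T) @ E
```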
A fundamental limitation of HMMs
• Consider what happens when a hidden Markov model generates
data.
– At each time step it must select one of its hidden states. So with N
hidden states it can only remember log(N) bits about what it generated
so far.
• Consider the information that the first half of an utterance contains
about the second half:
– The syntax needs to fit (e.g. number and tense agreement).
– The semantics needs to fit. The intonation needs to fit.
– The accent, rate, volume, and vocal tract characteristics must all fit.
• All these aspects combined could be 100 bits of information that the
first half of an utterance needs to convey to the second half. 2^100
is big!
Recurrent neural networks
• RNNs are very powerful, because they combine two properties:
– Distributed hidden state that allows them to store a lot of information about the past efficiently.
– Non-linear dynamics that allows them to update their hidden state in complicated ways.
• With enough neurons and time, RNNs can compute anything that can be computed by your computer.
[Figure: the RNN unrolled in time; at each time step an input feeds the hidden state, which produces an output and feeds the next hidden state.]
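A minimal sketch of the non-linear update that gives an RNN its distributed hidden state (the weight shapes and the tanh non-linearity are illustrative assumptions):

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh, b_h):
    """One step of a vanilla RNN: the new hidden state is a non-linear
    function of the previous hidden state and the current input."""
    return np.tanh(W_hh @ h_prev + W_xh @ x + b_h)

# The hidden state is a real-valued vector, so n units give a combinatorial
# (distributed) state space rather than one of N discrete states.
```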
Do generative models need to be stochastic?
• Linear dynamical systems and hidden Markov models are stochastic models.
– But the posterior probability distribution over their hidden states given the observed data so far is a deterministic function of the data.
• Recurrent neural networks are deterministic.
– So think of the hidden state of an RNN as the equivalent of the deterministic probability distribution over hidden states in a linear dynamical system or hidden Markov model.
Recurrent neural networks
• What kinds of behaviour can RNNs exhibit?
– They can oscillate. Good for motor control?
– They can settle to point attractors. Good for retrieving memories?
– They can behave chaotically. Bad for information processing?
– RNNs could potentially learn to implement lots of small programs
that each capture a nugget of knowledge and run in parallel,
interacting to produce very complicated effects.
• But the computational power of RNNs makes them very hard to train.
– For many years we could not exploit the computational power of
RNNs despite some heroic efforts (e.g. Tony Robinson’s speech
recognizer).
The equivalence between feedforward nets and recurrent nets
• Assume that there is a time delay of 1 in using each connection.
• The recurrent net is just a layered net that keeps reusing the same weights.
[Figure: a recurrent net with weights w1, w2, w3, w4 unrolled into a layered net; the same four weights are reused at time = 0, 1, 2 and 3.]
Backpropagation through time
• We can think of the recurrent net as a layered, feedforward net with shared weights, train it with backpropagation, and add together the error derivatives from every copy of a shared weight.
• We can specify inputs in several ways:
– Specify the initial states of a subset of the units.
– Specify the states of the same subset of the units at every time step.
• This is the natural way to model most sequential data.
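A minimal numpy sketch of backpropagation through time for the unrolled net above, assuming a tanh hidden layer and a squared-error target at the final step (the shapes and the single-target setup are illustrative); note how the derivatives for each shared weight are added up over all the time steps at which it is reused.

```python
import numpy as np

def bptt(xs, target, h0, W_hh, W_xh, W_hy):
    """Forward pass through time, then backpropagate through time.
    xs: list of input vectors; target: desired final output (squared error)."""
    # Forward pass: store the hidden state of every "layer" (time step).
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(W_hh @ hs[-1] + W_xh @ x))
    y = W_hy @ hs[-1]

    # Backward pass: the same weights appear at every step, so their gradients add up.
    dW_hh = np.zeros_like(W_hh)
    dW_xh = np.zeros_like(W_xh)
    dW_hy = np.outer(y - target, hs[-1])
    dh = W_hy.T @ (y - target)               # dE/dh at the final step
    for t in reversed(range(len(xs))):
        da = dh * (1.0 - hs[t + 1] ** 2)     # back through the tanh
        dW_hh += np.outer(da, hs[t])         # contribution of this copy of W_hh
        dW_xh += np.outer(da, xs[t])
        dh = W_hh.T @ da                     # pass the derivative one step further back
    return dW_hh, dW_xh, dW_hy
```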
Teaching signals for recurrent networks
• We can specify targets in several ways:
– Specify desired final activities of all the units.
– Specify desired activities of all the units for the last few steps.
• Good for learning attractors.
• It is easy to add in extra error derivatives as we backpropagate.
– Specify the desired activity of a subset of the units.
• The other units are input or hidden units.
A good toy problem for a recurrent network
• We can train a feedforward net to do binary addition, but there are obvious regularities that it cannot capture efficiently.
– We must decide in advance the maximum number of digits in each number.
– The processing applied to the beginning of a long number does not generalize to the end of the long number because it uses different weights.
• As a result, feedforward nets do not generalize well on the binary addition task.
[Figure: a feedforward net whose hidden units map the two input numbers 00100110 and 10100110 to their sum 11001100.]
The algorithm for binary addition
[Figure: a finite state automaton with four states – (no carry, print 1), (carry, print 1), (no carry, print 0), (carry, print 0) – whose transitions are determined by the pair of bits in the next column.]
This is a finite state automaton. It decides what transition to make by looking at the next column. It prints after making the transition. It moves from right to left over the two input numbers.
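A sketch of this automaton in Python. The bit ordering (least-significant column first) matches the right-to-left sweep described above; the function and variable names are just illustrative.

```python
def add_binary(a_bits, b_bits):
    """Binary addition as a finite state automaton.
    a_bits, b_bits: lists of bits, least-significant column first."""
    state = "no carry"
    out = []
    for a, b in zip(a_bits, b_bits):
        column_sum = a + b + (1 if state == "carry" else 0)
        out.append(column_sum % 2)                       # what the state "prints"
        state = "carry" if column_sum >= 2 else "no carry"
    if state == "carry":
        out.append(1)
    return out

# Example: 00100110 + 10100110 = 11001100 (the toy problem above).
a = [0, 1, 1, 0, 0, 1, 0, 0]    # 00100110, least-significant bit first
b = [0, 1, 1, 0, 0, 1, 0, 1]    # 10100110, least-significant bit first
print(add_binary(a, b)[::-1])   # -> [1, 1, 0, 0, 1, 1, 0, 0]
```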
A recurrent net for binary addition
• The network has two input units and one output unit.
• It is given two input digits at each time step.
• The desired output at each time step is the output for the column that was provided as input two time steps ago.
– It takes one time step to update the hidden units based on the two input digits.
– It takes another time step for the hidden units to cause the output.
[Figure: the two input streams 00110100 and 01001101 and the output stream 10000001 plotted against time; each output bit lags its input column by two time steps.]
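A small sketch of how the inputs and targets line up under this two-step delay, using the numbers from the figure (the `None` padding for the first two steps is just an illustrative convention):

```python
# The two numbers from the figure, one column per time step,
# least-significant column first (the net reads right to left).
a = [0, 0, 1, 0, 1, 1, 0, 0]         # 00110100
b = [1, 0, 1, 1, 0, 0, 1, 0]         # 01001101
inputs = list(zip(a, b))             # two input digits at each time step

sum_bits = [1, 0, 0, 0, 0, 0, 0, 1]  # 10000001, the column-wise sum

# The desired output at time t is the sum bit for the column presented at
# time t-2: one step to update the hidden units, one more for the hidden
# units to drive the output. (The last two sum bits would need two extra steps.)
targets = [None, None] + sum_bits[:-2]
```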
The connectivity of the network
• The 3 hidden units are fully interconnected in both directions.
– This allows a hidden activity pattern at one time step to vote for the hidden activity pattern at the next time step.
• The input units have feedforward connections that allow them to vote for the next hidden activity pattern.
[Figure: 3 fully interconnected hidden units.]
What the network learns
• It learns four distinct patterns of activity for the 3 hidden units. These patterns correspond to the nodes in the finite state automaton.
– Do not confuse units in a neural network with nodes in a finite state automaton. Nodes are like activity vectors.
– The automaton is restricted to be in exactly one state at each time. The hidden units are restricted to have exactly one vector of activity at each time.
• A recurrent network can emulate a finite state automaton, but it is exponentially more powerful. With N hidden neurons it has 2^N possible binary activity vectors (but only N^2 weights).
– This is important when the input stream has two separate things going on at once.
– A finite state automaton needs to square its number of states.
– An RNN needs to double its number of units.
The backward pass is linear
• There is a big difference between the
forward and backward passes.
• In the forward pass we use squashing
functions (like the logistic) to prevent the
activity vectors from exploding.
• The backward pass is completely linear. If you double the error derivatives at the final layer, all the error derivatives will double.
– The forward pass determines the slope
of the linear function used for
backpropagating through each neuron.
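A tiny numpy demonstration of this linearity (the one-layer logistic example is arbitrary): once the forward pass has fixed the slopes, scaling the error derivatives at the output scales every backpropagated derivative by the same factor.

```python
import numpy as np

def backprop(delta_out, activations, W):
    """Backpropagate output-layer error derivatives through one logistic layer.
    The slope a*(1-a) is fixed by the forward pass, so this map is linear in delta_out."""
    slope = activations * (1.0 - activations)      # derivative of the logistic
    return W.T @ (delta_out * slope)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
a = 1.0 / (1.0 + np.exp(-rng.normal(size=3)))      # activities from a forward pass
delta = rng.normal(size=3)

d1 = backprop(delta, a, W)
d2 = backprop(2.0 * delta, a, W)
print(np.allclose(d2, 2.0 * d1))   # True: doubling the final derivatives doubles everything
```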
The problem of exploding or vanishing gradients
• What happens to the magnitude of the gradients as we backpropagate through many layers?
– If the weights are small, the gradients shrink exponentially.
– If the weights are big, the gradients grow exponentially.
• Typical feed-forward neural nets can cope with these exponential effects because they only have a few hidden layers.
• In an RNN trained on long sequences (e.g. 100 time steps) the gradients can easily explode or vanish.
– We can avoid this by initializing the weights very carefully.
• Even with good initial weights, it's very hard to detect that the current target output depends on an input from many time-steps ago.
– So RNNs have difficulty dealing with long-range dependencies.
Why the back-propagated gradient blows up
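A rough sketch of the effect: backpropagating through time multiplies the error derivative by roughly the same recurrent factor at every step, so a factor a little above 1 (e.g. 1.7) blows up exponentially and a factor below 1 dies away. The scalar recurrent weight and 100-step horizon are illustrative assumptions.

```python
# Scalar caricature of an RNN: dE/dh_0 = (dE/dh_T) * prod_t (w * slope_t).
def backprop_factor(w, slope=1.0, steps=100):
    g = 1.0
    for _ in range(steps):
        g *= w * slope
    return g

print(backprop_factor(1.7))   # ~1e23: the gradient explodes
print(backprop_factor(0.5))   # ~8e-31: the gradient vanishes
```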
Reading cursive handwriting
• This is a natural task for an RNN.
• The input is a sequence of (x, y, p) coordinates of the tip of the pen, where p indicates whether the pen is up or down.
• The output is a sequence of characters.
• Graves & Schmidhuber (2009) showed that RNNs with LSTM are currently the best systems for reading cursive writing.
– They used a sequence of small images as input rather than pen coordinates.
A demonstration of online handwriting recognition by an
RNN with Long Short Term Memory (from Alex Graves)
• The movie that follows shows several different things:
• Row 1: This shows when the characters are recognized.
– It never revises its output so difficult decisions are more delayed.
• Row 2: This shows the states of a subset of the memory cells.
– Notice how they get reset when it recognizes a character.
• Row 3: This shows the writing. The net sees the x and y coordinates.
– Optical input actually works a bit better than pen coordinates.
• Row 4: This shows the gradient backpropagated all the way to the x and
y inputs from the currently most active character.
– This lets you see which bits of the data are influencing the decision.
SHOW ALEX GRAVES’ MOVIE
How much can we reduce the error by moving in a given direction?
• If we choose a direction to move in and we keep
going in that direction, how much does the error
decrease before it starts rising again? We assume
the curvature is constant (i.e. it’s a quadratic error surface).
– Assume the magnitude of the gradient decreases as we
move down the gradient (i.e. the error surface is convex
upward).
• The maximum error reduction depends on the ratio of the
gradient to the curvature. So a good direction to move in is one
with a high ratio of gradient to curvature, even if the gradient
itself is small.
– How can we find directions like these?
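A short worked version of this claim, for a one-dimensional quadratic slice of the error surface with gradient g and (positive) curvature c along the chosen direction:

```latex
E(\Delta w) = E_0 + g\,\Delta w + \tfrac{1}{2}\, c\,\Delta w^{2},
\qquad
\frac{dE}{d\Delta w} = g + c\,\Delta w = 0
\;\Rightarrow\;
\Delta w^{\ast} = -\frac{g}{c},
\qquad
E_0 - E(\Delta w^{\ast}) = \frac{g^{2}}{2c}.
```

So the maximum reduction is g²/(2c): it is governed by the ratio of (squared) gradient to curvature, which is why a direction with a small gradient but a much smaller curvature can still be a very good direction to move in.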
How to avoid inverting a huge matrix
• The curvature matrix has too many terms to be of use in a big network.
– Maybe we can get some benefit from just using the terms along the
leading diagonal (Le Cun). But the diagonal terms are only a tiny
fraction of the interactions (they are the self-interactions).
• The curvature matrix can be approximated in many different ways
– Hessian-free methods, LBFGS, …
• In the HF method, we make an approximation to the curvature matrix
and then, assuming that approximation is correct, we minimize the error
using an efficient technique called conjugate gradient. Then we make
another approximation to the curvature matrix and minimize again.
– For RNNs it's important to add a penalty for changing any of the hidden activities too much.
Conjugate gradient
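A minimal numpy sketch of the linear conjugate gradient iteration used inside methods like HF to minimize a quadratic approximation ½xᵀAx − bᵀx without ever inverting A. This is a generic textbook version under those assumptions, not the specific implementation referred to above.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, iters=50, tol=1e-10):
    """Minimize 0.5 x^T A x - b^T x for symmetric positive definite A."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x          # residual = negative gradient of the quadratic
    d = r.copy()           # first search direction is steepest descent
    for _ in range(iters):
        rr = r @ r
        if rr < tol:
            break
        Ad = A @ d         # in HF this would be a curvature-vector product, never an explicit A
        alpha = rr / (d @ Ad)       # exact minimizer along d for a quadratic
        x = x + alpha * d
        r = r - alpha * Ad
        beta = (r @ r) / rr         # make the next direction conjugate to d
        d = r + beta * d
    return x
```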
Modeling character strings
[Figure: at each step the current character (a 1-of-86 input) updates the hidden state, and a softmax layer gives the predicted distribution for the next character.]
It’s a lot easier to predict 86 characters than 100,000 words.
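A small sketch of that prediction step (the shapes are illustrative): the hidden state is mapped to 86 logits and a softmax turns them into the predicted distribution for the next character.

```python
import numpy as np

def predict_next_char(h, W_hy, b_y):
    """Softmax over the 86 characters given the current hidden state h."""
    logits = W_hy @ h + b_y      # W_hy has shape (86, hidden_size)
    logits -= logits.max()       # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()           # predicted distribution for the next character
```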
A sub-tree in the tree of all character strings
• There are exponentially many nodes in the tree of all character strings of length N.
• In an RNN, each node is a hidden state vector. The next character must transform this to a new node.
[Figure: the sub-tree below the node “…fix”; the character i leads to “…fixi”, e leads to “…fixe”, and from “…fixi” the character n leads to “…fixin”.]
Learning method for echo state networks
• Fit a linear model that takes the states of the hidden units as input and produces a single scalar output.
[Figure (example from Scholarpedia): the target and the predicted outputs after learning.]
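A minimal sketch of that procedure, assuming a fixed, randomly initialized reservoir (the sizes and scaling constants are illustrative): only the linear readout is fitted, by least squares.

```python
import numpy as np

def echo_state_readout(inputs, targets, n_hidden=200, seed=0):
    """Drive a fixed random reservoir with the input sequence, then fit a
    linear model from hidden states to the scalar targets by least squares."""
    rng = np.random.default_rng(seed)
    W_in = rng.normal(scale=0.5, size=(n_hidden, 1))
    W = rng.normal(size=(n_hidden, n_hidden))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # keep the dynamics stable ("echo" property)

    h = np.zeros(n_hidden)
    states = []
    for u in inputs:                                   # inputs: 1-D sequence of scalars
        h = np.tanh(W @ h + W_in[:, 0] * u)
        states.append(h.copy())
    H = np.array(states)

    # The only learned parameters: a linear readout, fitted by least squares.
    w_out, *_ = np.linalg.lstsq(H, np.asarray(targets), rcond=None)
    return w_out, H @ w_out                            # readout weights and predicted outputs
```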
Beyond echo state networks
• Good aspects of ESNs
– Echo state networks can be trained very fast because they just fit a linear model.
– They demonstrate that it's very important to initialize weights sensibly.
– They can do impressive modeling of one-dimensional time-series, but they cannot compete seriously for high-dimensional data like pre-processed speech.
• Bad aspects of ESNs
– They need many more hidden units for a given task than an RNN that learns the hidden-to-hidden weights.
• Ilya Sutskever (2012) has shown that if the weights are initialized using the ESN methods, RNNs can be trained very effectively.
– He uses rmsprop with momentum.