CS7015 (Deep Learning) : Lecture 14
Sequence Learning
Mitesh M. Khapra
Module 14.1: Sequence Learning Problems
In feedforward and convolutional neural networks the size of the input was always fixed

For example, we fed fixed size (32 × 32) images to convolutional neural networks for image classification

[Figure: a convolutional network mapping a fixed-size image to scores for classes such as apple, bus, car, . . .]

Similarly, in word2vec we fed a fixed window (k) of words to the network

[Figure: the word2vec network taking the context words “he” and “sat” through shared Wcontext weights and producing P(chair | sat, he), P(man | sat, he), P(on | sat, he), P(he | sat, he), . . .]

Further, each input to the network was independent of the previous or future inputs

For example, the computations, outputs and decisions for two successive images are completely independent of each other
In many applications the input is not of a fixed size

Further, successive inputs may not be independent of each other

For example, consider the task of auto completion

Given the first character ‘d’ you want to predict the next character ‘e’ and so on

[Figure: a network reading the characters d, e, e, p one at a time and predicting e, e, p, ⟨stop⟩]
Notice a few things

First, successive inputs are no longer independent (while predicting ‘e’ you would want to know what the previous input was in addition to the current input)

Second, the length of the inputs and the number of predictions you need to make is not fixed (for example, “learn”, “deep”, “machine” have different numbers of characters)

Third, each network (the orange-blue-green structure in the figure) is performing the same task (input : character, output : character)
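As a small illustration of these observations, the inputs and targets for the word “deep” can be laid out one pair per step (a hypothetical Python sketch; the ⟨stop⟩ token name is an assumption):

```python
word = "deep"
inputs = list(word)                    # ['d', 'e', 'e', 'p'], one character per step
targets = list(word[1:]) + ["<stop>"]  # ['e', 'e', 'p', '<stop>'], the next character at each step
for t, (x, y) in enumerate(zip(inputs, targets), start=1):
    print(f"step {t}: input {x!r} -> target {y!r}")
```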
These are known as sequence learning problems

We need to look at a sequence of (dependent) inputs and produce an output (or outputs)

Each input corresponds to one time step

Let us look at some more examples of such problems
Consider the task of predicting the part of speech tag (noun, adverb, adjective, verb) of each word in a sentence

Once we see an adjective (social) we are almost sure that the next word should be a noun (man)

Thus the current output (noun) depends on the current input as well as the previous input

Further, the size of the input is not fixed (sentences could have an arbitrary number of words)

Notice that here we are interested in producing an output at each time step

Each network is performing the same task (input : word, output : tag)

[Figure: the sentence “man is a social animal” tagged as noun, verb, article, adjective, noun]
Sometimes we may not be interested in producing an output at every stage

Instead we would look at the full sequence and then produce an output

For example, consider the task of predicting the polarity of a movie review

The prediction clearly does not depend only on the last word but also on some words which appear before

Here again we could think that the network is performing the same task at each step (input : word, output : +/−) but it’s just that we don’t care about intermediate outputs

[Figure: the review “The movie was boring and long” with intermediate outputs marked “don’t care” and a single +/− prediction at the end]
Sequences could be composed of anything (not just words)

For example, a video could be treated as a sequence of images

We may want to look at the entire sequence and detect the activity being performed

[Figure: frames from a video of the activity Surya Namaskar treated as a sequence of images]
Module 14.2: Recurrent Neural Networks
How do we model such tasks involving sequences?
Wishlist
Account for dependence between inputs
Account for variable number of inputs
Make sure that the function executed at each time step is the same
We will focus on each of these to arrive at a model for dealing with sequences
What is the function being executed at each time step?

si = σ(U xi + b)
yi = O(V si + c)
i = timestep

Since we want the same function to be executed at each timestep we should share the same network (i.e., same parameters at each timestep)

[Figure: two copies of the same network, each mapping xi through U to si and through V to yi]
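As a concrete illustration, here is a minimal NumPy sketch of this shared per-timestep function (the dimensions, the choice of a sigmoid for σ and a softmax for the output function O are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): input dim 10, state dim 5, output dim 4
U = rng.standard_normal((5, 10))   # input -> state weights
V = rng.standard_normal((4, 5))    # state -> output weights
b = np.zeros(5)
c = np.zeros(4)

def sigma(a):
    # assuming sigma is the logistic sigmoid
    return 1.0 / (1.0 + np.exp(-a))

def O(a):
    # assuming O is a softmax over the output scores
    e = np.exp(a - a.max())
    return e / e.sum()

def step(x):
    # the same function, with the same parameters U, V, b, c, at every timestep
    s = sigma(U @ x + b)   # s_i = sigma(U x_i + b)
    y = O(V @ s + c)       # y_i = O(V s_i + c)
    return y

# apply the identical function to every input in the sequence
xs = [rng.standard_normal(10) for _ in range(4)]
ys = [step(x) for x in xs]
```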
This parameter sharing also ensures that the network becomes agnostic to the length (size) of the input

Since we are simply going to compute the same function (with the same parameters) at each timestep, the number of timesteps doesn’t matter

We just create multiple copies of the network and execute them at each timestep

[Figure: n copies of the network, with x1, x2, x3, x4, . . ., xn mapped through shared U to s1, s2, s3, s4, . . ., sn and through shared V to y1, y2, y3, y4, . . ., yn]
How do we account for dependence between inputs?

Let us first see an infeasible way of doing this

At each timestep we will feed all the previous inputs to the network

Is this okay?

No, it violates the other two items on our wishlist

How? Let us see

[Figure: separate networks where the i-th network takes all of x1, . . ., xi as input and produces yi]
First, the function being computed at each time-step now is different

y1 = f1 (x1)
y2 = f2 (x1, x2)
y3 = f3 (x1, x2, x3)
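To make this concrete, here is a purely illustrative sketch (all shapes are assumptions) of why feeding all previous inputs also breaks parameter sharing: the weight matrix multiplying the inputs would need a different shape at every timestep, so f1, f2, f3, . . . cannot be the same function:

```python
import numpy as np

input_dim, hidden_dim = 10, 5

# At timestep i the network sees the concatenation [x_1, ..., x_i], so the
# first weight matrix would need i * input_dim columns: different parameters
# (and a different function) at every timestep.
for i in range(1, 5):
    u_i = np.zeros((hidden_dim, i * input_dim))
    print(f"timestep {i}: u would need shape {u_i.shape}")
```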
The solution is to add a recurrent connection in the network,

si = σ(U xi + W si−1 + b)
yi = O(V si + c)
or
yi = f (xi, si−1, W, U, V, b, c)

si is the state of the network at timestep i

The parameters are W, U, V, b, c, which are shared across timesteps

The same network (and parameters) can be used to compute y1, y2, . . ., y10 or y100

[Figure: the unrolled network with recurrent W connections between successive states s1, s2, . . ., sn]
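A minimal NumPy sketch of this recurrence, unrolled over a sequence (again, the dimensions and the choices of a sigmoid for σ and a softmax for O are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): input dim 10, state dim 5, output dim 4
U = rng.standard_normal((5, 10))   # input  -> state
W = rng.standard_normal((5, 5))    # state  -> state (the recurrent connection)
V = rng.standard_normal((4, 5))    # state  -> output
b, c = np.zeros(5), np.zeros(4)

def sigma(a):
    # assuming sigma is the logistic sigmoid
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    # assuming the output function O is a softmax
    e = np.exp(a - a.max())
    return e / e.sum()

def rnn(xs, s0=None):
    # W, U, V, b, c are shared across timesteps, so the same code
    # works for a sequence of length 10 or 100
    s = np.zeros(5) if s0 is None else s0
    ys = []
    for x in xs:                       # one iteration per timestep i
        s = sigma(U @ x + W @ s + b)   # s_i = sigma(U x_i + W s_{i-1} + b)
        y = softmax(V @ s + c)         # y_i = O(V s_i + c)
        ys.append(y)
    return ys, s

xs = [rng.standard_normal(10) for _ in range(6)]
ys, s_final = rnn(xs)
```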
This can be represented more compactly

[Figure: a single cell with input xi, state si, output yi, and a self-loop labelled W]
Let us revisit the sequence learning problems that we saw earlier

We now have recurrent connections between time steps which account for dependence between inputs

[Figure: the earlier examples (auto completion, part of speech tagging, movie review polarity, activity recognition from video) redrawn with recurrent connections between time steps]