Recurrent Neural Networks
Learning goals
Why do we need them?
How do they work?
Computational Graph
Recurrent Networks
Motivation
Deep Learning – 1 / 20
MOTIVATION FOR RECURRENT NETWORKS
The two types of neural network architectures that we’ve seen so
far are fully-connected networks and CNNs.
Their input layers have a fixed size and (typically) only handle
fixed-length inputs.
The primary reason: if we vary the size of the input layer, we would
also have to vary the number of learnable weights in the network.
This in particular relates to sequence data such as time-series,
audio and text.
Recurrent Neural Networks (RNNs) are a class of architectures
that allows for varying input lengths and properly accounts for the
ordering in sequence data.
Deep Learning – 2 / 20
RNNS - INTRODUCTION
Suppose we have some text data and our task is to analyse the
sentiment in the text.
For example, given an input sentence, such as "This is good news.", the
network has to classify it as either ’positive’ or ’negative’.
We would like to train a simple neural network (such as the one below) to
perform the task.
Figure: Two equivalent visualizations of a dense net with a single hidden layer;
the left one is more abstract, showing the network at the level of layers.
Deep Learning – 3 / 20
RNNS - INTRODUCTION
Because sentences can be of varying lengths, we need to modify
the dense net architecture to handle such a scenario.
One approach is to draw inspiration from the way a human reads a
sentence; that is, one word at a time.
An important cognitive mechanism that makes this possible is
"short-term memory".
As we read a sentence from beginning to end, we retain some
information about the words that we have already read and use
this information to understand the meaning of the entire sentence.
Therefore, in order to feed the words in a sentence sequentially to
a neural network, we need to give it the ability to retain some
information about past inputs.
Deep Learning – 4 / 20
RNNS - INTRODUCTION
When words in a sentence are fed to the network one at a time,
the inputs are no longer independent. It is much more likely that
the word "good" is followed by "morning" rather than "plastic".
Hence, we also need to model this (long-term) dependency.
Each word must still be encoded as a fixed-length vector because
the size of the input layer will remain fixed.
Here, for the sake of the visualization, each word is represented as
a ’one-hot coded’ vector of length 5. (<eos> = ’end of sequence’)
Deep Learning – 5 / 20
RNNS - INTRODUCTION
Our goal is to feed the words to the network sequentially in
discrete time-steps.
A regular dense neural network with a single hidden layer only has
two sets of weights: 'input-to-hidden' weights W and 'hidden-to-output' weights U.
Deep Learning – 6 / 20
RNNS - INTRODUCTION
In order to enable the network to retain information about past inputs, we
introduce an additional set of weights V, from the hidden neurons at
time-step t to the hidden neurons at time-step t + 1.
Having this additional set of weights makes the activations of the hidden
layer depend on both the current input and the activations for the
previous input.
Deep Learning – 7 / 20
RNNS - INTRODUCTION
With this additional set of hidden-to-hidden weights V, the network
is now a Recurrent Neural Network (RNN).
In a regular feed-forward network, the activations of the hidden
layer are only computed using the input-hidden weights W (and
bias b).
$z = \sigma(W^\top x + b)$
In an RNN, the activations of the hidden layer (at time-step t) are
computed using both the input-to-hidden weights W and the
hidden-to-hidden weights V.
Deep Learning – 9 / 20
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 0, we feed the word "This" to the network and obtain $z^{[0]}$.
$z^{[0]} = \sigma(W^\top x^{[0]} + b)$
Because this is the very first input, there is no past state (or,
equivalently, the state is initialized to 0).
Deep Learning – 10 / 20
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 1, we feed the second word to the network to obtain $z^{[1]}$.
$z^{[1]} = \sigma(V^\top z^{[0]} + W^\top x^{[1]} + b)$
Deep Learning – 11 / 20
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 2, we feed the next word in the sentence.
$z^{[2]} = \sigma(V^\top z^{[1]} + W^\top x^{[2]} + b)$
Deep Learning – 12 / 20
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 3, we feed the next word ("news") in the sentence.
$z^{[3]} = \sigma(V^\top z^{[2]} + W^\top x^{[3]} + b)$
Deep Learning – 13 / 20
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
Once the entire input sequence has been processed, the
prediction of the network can be generated by feeding the
activations of the final time-step to the output neuron(s).
$f = \sigma(U^\top z^{[4]} + c)$, where $c$ is the bias of the output neuron.
Deep Learning – 14 / 20
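To make the recurrence concrete, the following minimal NumPy sketch runs the seq-to-one forward pass described above; the vocabulary, the hidden size, and the random weights are illustrative assumptions, not the exact setup of the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary, one-hot coded as in the visualization (5 words incl. <eos>).
vocab = ["This", "is", "good", "news", "<eos>"]
word_to_id = {w: i for i, w in enumerate(vocab)}

n_in, n_hidden = len(vocab), 3          # input and hidden layer sizes (assumed)
W = rng.normal(scale=0.1, size=(n_in, n_hidden))      # input-to-hidden weights
V = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden-to-hidden weights
U = rng.normal(scale=0.1, size=(n_hidden, 1))         # hidden-to-output weights
b = np.zeros(n_hidden)                  # hidden bias
c = np.zeros(1)                         # output bias

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def one_hot(word):
    x = np.zeros(n_in)
    x[word_to_id[word]] = 1.0
    return x

# Seq-to-one forward pass: z[t] = sigma(V^T z[t-1] + W^T x[t] + b),
# with the same W, V reused at every time step.
sentence = ["This", "is", "good", "news", "<eos>"]
z = np.zeros(n_hidden)                  # no past state at t = 0
for word in sentence:
    z = sigmoid(V.T @ z + W.T @ one_hot(word) + b)

# Prediction from the final hidden state: f = sigma(U^T z + c).
f = sigmoid(U.T @ z + c)
print("P(positive) =", f.item())
```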
PARAMETER SHARING
This way, the network can process the sentence one word at a
time and the length of the network can vary based on the length of
the sequence.
It is important to note that no matter how long the input sequence
is, the matrices W and V are the same in every time-step. This is
another example of parameter sharing.
Therefore, the number of weights in the network is independent of
the length of the input sequence.
Deep Learning – 15 / 20
RNNS - USE CASE SPECIFIC ARCHITECTURES
RNNs are very versatile. They can be applied to a wide range of tasks.
Figure: RNNs can be used in tasks that involve multiple inputs and/or multiple outputs.
Examples:
Sequence-to-One: Sentiment analysis, document classification.
One-to-Sequence: Image captioning.
Sequence-to-Sequence: Language modelling, machine translation,
time-series prediction.
Deep Learning – 16 / 20
Computational Graph
Deep Learning – 17 / 20
RNNS - COMPUTATIONAL GRAPH
Deep Learning – 18 / 20
RECURRENT OUTPUT-HIDDEN CONNECTIONS
Recurrent connections do not need to map from hidden to hidden
neurons!
Figure: RNN with feedback connection from the output to the hidden layer.
The RNN is only allowed to send $f$ to future time points; hence, $z^{[t-1]}$ is
connected to $z^{[t]}$ only indirectly, via the predictions $f^{[t-1]}$.
Deep Learning – 19 / 20
SEQ-TO-ONE MAPPINGS
RNNs do not need to produce an output at each time step. Often only
one output is produced after processing the whole sequence.
Figure: Time-unfolded recurrent neural network with a single output at the end
of the sequence. Such a network can be used to summarize a sequence and
produce a fixed size representation.
Deep Learning – 20 / 20
Deep Learning
Learning goals
How does Backpropagation work for RNNs?
Exploding and Vanishing Gradients
SIMPLE EXAMPLE: CHARACTER LEVEL LANGUAGE MODEL
Task: Learn character probability distribution from input text
Suppose we only had a vocabulary of four possible letters: “h”, “e”,
“l” and “o”
We want to train an RNN on the training sequence “hello”.
This training sequence is in fact a source of 4 separate training
examples:
“e” should be likely given the context of “h”
“l” should be likely in the context of “he”
“l” should also be likely given the context of “hel”
and “o” should be likely given the context of “hell”
Deep Learning – 1 / 12
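A tiny sketch (the integer encoding is an assumption, not taken from the slides) of how these four training examples can be extracted from the training sequence as (input character, target character) pairs:

```python
# Derive the 4 training examples from the training sequence "hello":
# at each step the input is the current character, the target is the next one.
text = "hello"
vocab = sorted(set(text))                      # ['e', 'h', 'l', 'o']
char_to_id = {ch: i for i, ch in enumerate(vocab)}

pairs = [(text[i], text[i + 1]) for i in range(len(text) - 1)]
print(pairs)        # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
print([(char_to_id[x], char_to_id[y]) for x, y in pairs])
```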
BACKPROPAGATION THROUGH TIME
For training the RNN, we need to compute $\frac{dL}{du_{i,j}}$, $\frac{dL}{dv_{i,j}}$, and $\frac{dL}{dw_{i,j}}$.
To do so, during backpropagation at time step t for an arbitrary RNN, we need to compute
$$\frac{dL}{dz^{[1]}} = \frac{dL}{dz^{[t]}} \frac{dz^{[t]}}{dz^{[t-1]}} \cdots \frac{dz^{[2]}}{dz^{[1]}}$$
Deep Learning – 6 / 12
LONG-TERM DEPENDENCIES
Here, $z^{[t]} = \sigma(V^\top z^{[t-1]} + W^\top x^{[t]} + b)$.
It follows that:
$$\frac{dz^{[t]}}{dz^{[t-1]}} = \mathrm{diag}\big(\sigma'(V^\top z^{[t-1]} + W^\top x^{[t]} + b)\big) V^\top = D^{[t-1]} V^\top$$
$$\frac{dz^{[t-1]}}{dz^{[t-2]}} = \mathrm{diag}\big(\sigma'(V^\top z^{[t-2]} + W^\top x^{[t-1]} + b)\big) V^\top = D^{[t-2]} V^\top$$
$$\vdots$$
$$\frac{dz^{[2]}}{dz^{[1]}} = \mathrm{diag}\big(\sigma'(V^\top z^{[1]} + W^\top x^{[2]} + b)\big) V^\top = D^{[1]} V^\top$$
$$\frac{dL}{dz^{[1]}} = \frac{dL}{dz^{[t]}} \frac{dz^{[t]}}{dz^{[t-1]}} \cdots \frac{dz^{[2]}}{dz^{[1]}} = \frac{dL}{dz^{[t]}} D^{[t-1]} D^{[t-2]} \cdots D^{[1]} (V^\top)^{t-1}$$
Deep Learning – 7 / 12
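The following NumPy sketch accumulates the Jacobian product $D^{[t-1]} V^\top \cdots D^{[1]} V^\top$ along a random run of the recurrence; the sizes and weight scales are arbitrary assumptions, chosen only to illustrate how the norm of $dz^{[t]}/dz^{[1]}$ shrinks or blows up depending on the largest eigenvalue of $V^\top$.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_hidden, T = 4, 50
V = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # assume input dim == hidden dim
b = np.zeros(n_hidden)

# Run the recurrence and accumulate the Jacobian product D[t-1] V^T ... D[1] V^T.
z = np.zeros(n_hidden)
jacobian = np.eye(n_hidden)              # will hold dz[t]/dz[1]
for t in range(1, T):
    x = rng.normal(size=n_hidden)
    pre = V.T @ z + W.T @ x + b
    z = sigmoid(pre)
    D = np.diag(sigmoid(pre) * (1.0 - sigmoid(pre)))   # diag(sigma'(...))
    jacobian = D @ V.T @ jacobian

print("||dz[T]/dz[1]|| =", np.linalg.norm(jacobian))   # tiny for small V, huge for large V
print("largest |eigenvalue| of V^T =", np.max(np.abs(np.linalg.eigvals(V.T))))
```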
LONG-TERM DEPENDENCIES
In general, for an arbitrary time-step $i < t$ in the past, $\frac{dz^{[t]}}{dz^{[i]}}$ will
contain the term $(V^\top)^{t-i}$ (this follows from the chain rule).
Depending on the largest eigenvalue of $V^\top$, the presence of the term
$(V^\top)^{t-i}$ can result in either vanishing or exploding gradients.
This problem is quite severe for RNNs (compared to feedforward
networks) because the same matrix $V^\top$ is multiplied many times.
As the gap between t and i increases, the instability worsens.
It is thus quite challenging for RNNs to learn long-term
dependencies. The gradients either vanish (most of the time) or
explode (rarely, but with much damage to the optimization).
That happens simply because we propagate errors over very
many stages backwards.
Deep Learning – 8 / 12
LONG-TERM DEPENDENCIES
Deep Learning – 9 / 12
LONG-TERM DEPENDENCIES
Recall, that we can counteract exploding gradients by
implementing gradient clipping.
To avoid exploding gradients, we simply clip the norm of the
gradient at some threshold h (see chapter 4):
$$\text{if } \lVert \nabla W \rVert > h: \quad \nabla W \leftarrow \frac{h}{\lVert \nabla W \rVert} \nabla W$$
Deep Learning – 10 / 12
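A minimal sketch of this clipping rule in plain NumPy (in PyTorch, torch.nn.utils.clip_grad_norm_ implements the same idea):

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Rescale the gradient if its norm exceeds the threshold h."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = (threshold / norm) * grad
    return grad

g = np.array([3.0, 4.0])                 # ||g|| = 5
print(clip_gradient(g, threshold=1.0))   # rescaled to norm 1: [0.6, 0.8]
```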
LONG-TERM DEPENDENCIES
Deep Learning – 11 / 12
LONG-TERM DEPENDENCIES
Even for a stable RNN (gradients not exploding), there will be
exponentially smaller weights for long-term interactions compared
to short-term ones and a more sophisticated solution is needed for
this vanishing gradient problem (discussed in the next chapters).
The vanishing gradient problem heavily depends on the choice of
the activation functions.
Sigmoid maps a real number into a “small” range (i.e. [0, 1])
and thus even huge changes in the input will only produce a
small change in the output. Hence, the gradient will be small.
This becomes even worse when we stack multiple layers.
We can avoid this problem by using activation functions which
do not “squash” the input.
The most popular choice is ReLU with gradients being either
0 or 1, i.e., they never saturate and thus don’t vanish.
The downside of this is that we can obtain a “dead” ReLU.
Deep Learning – 12 / 12
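A quick numeric illustration of the squashing argument: the sigmoid derivative is at most 0.25, so a product over many time steps collapses, whereas ReLU derivatives on an active path stay at 1. The numbers below are just this back-of-the-envelope calculation, not a trained network.

```python
# Worst case for sigmoid: each factor in the chain-rule product is at most 0.25.
steps = 50
sigmoid_prime_max = 0.25
print("sigmoid, 50 steps:", sigmoid_prime_max ** steps)   # ~7.9e-31, vanishes
print("ReLU (active path), 50 steps:", 1.0 ** steps)      # 1.0, does not vanish
```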
Deep Learning
Learning goals
LSTM cell
GRU cell
Bidirectional RNNs
Long Short-Term Memory (LSTM)
Deep Learning – 1 / 14
LONG SHORT-TERM MEMORY (LSTM)
The LSTM provides a way of dealing with vanishing gradients and
modelling long-term dependencies.
Deep Learning – 2 / 14
LONG SHORT-TERM MEMORY (LSTM)
Forget gate $e^{[t]}$: indicates which information of the old cell state
we should forget.
Intuition: Think of a model trying to predict the next word based on
all the previous ones. The cell state might include the gender of
the present subject, so that the correct pronouns can be used.
When we now see a new subject, we want to forget the gender of
the old one.
Deep Learning – 3 / 14
LONG SHORT-TERM MEMORY (LSTM)
Output gate $o^{[t]}$: indicates which information from the cell state is filtered.
It is given by $o^{[t]} = \sigma(b_o + V_o^\top z^{[t-1]} + W_o^\top x^{[t]})$, with gate-specific
weights $W_o$, $V_o$.
Deep Learning – 3 / 14
LONG SHORT-TERM MEMORY (LSTM)
Finally, the new state $z^{[t]}$ of the LSTM is a function of the cell state,
multiplied by the output gate (a sketch of the full cell follows below).
Deep Learning – 3 / 14
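Below is a from-scratch NumPy sketch of one LSTM step following the standard formulation; the input gate, candidate, and cell-state symbols (i, s_tilde, s) are not spelled out on these slides, so their names and the weight initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_hidden = 5, 3

def init(shape):
    return rng.normal(scale=0.1, size=shape)

# One weight pair (W_*, V_*) and one bias per gate, as for the output gate above.
params = {g: (init((n_in, n_hidden)), init((n_hidden, n_hidden)), np.zeros(n_hidden))
          for g in ["e", "i", "o", "cand"]}   # forget, input, output gates + candidate

def lstm_step(x, z_prev, s_prev):
    """One LSTM step: returns the new state z[t] and the new cell state s[t]."""
    def gate(name, squash):
        W, V, b = params[name]
        return squash(W.T @ x + V.T @ z_prev + b)

    e = gate("e", sigmoid)            # forget gate e[t]
    i = gate("i", sigmoid)            # input gate
    o = gate("o", sigmoid)            # output gate o[t]
    s_tilde = gate("cand", np.tanh)   # candidate cell state
    s = e * s_prev + i * s_tilde      # new cell state: forget old, add new
    z = o * np.tanh(s)                # new state z[t]: cell state filtered by o[t]
    return z, s

z, s = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(4):
    z, s = lstm_step(rng.normal(size=n_in), z, s)
print(z)
```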
Gated Recurrent Units (GRU)
Deep Learning – 4 / 14
GATED RECURRENT UNITS (GRU)
The key distinction between regular RNNs and GRUs is that the
latter support gating of the hidden state.
Here, we have dedicated mechanisms for when a hidden state
should be updated and also when it should be reset.
These mechanisms are learned to:
avoid the vanishing/exploding gradient problem which comes
with a standard recurrent neural network.
solve the vanishing gradient problem by using an update gate
and a reset gate.
control the information that flows into (update gate) and out of
(reset gate) memory.
Deep Learning – 5 / 14
GATED RECURRENT UNITS (GRU)
For a given time step t, the hidden state of the last time step is
$z^{[t-1]}$. The update gate $u^{[t]}$ is computed as follows:
$$u^{[t]} = \sigma(W_u^\top x^{[t]} + V_u^\top z^{[t-1]} + b_u)$$
Deep Learning – 6 / 14
GATED RECURRENT UNITS (GRU)
The update gate $u^{[t]}$ determines how much of the old state $z^{[t-1]}$ and
the new candidate state $\tilde{z}^{[t]}$ is used:
$$z^{[t]} = u^{[t]} \odot z^{[t-1]} + (1 - u^{[t]}) \odot \tilde{z}^{[t]}$$
Deep Learning – 9 / 14
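A minimal NumPy sketch of one GRU step under the standard formulation; the reset-gate and candidate-state weights (W_r, V_r, W_c, V_c) are assumptions, since only the update gate is written out explicitly above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_hidden = 5, 3

def init_gate():
    return (rng.normal(scale=0.1, size=(n_in, n_hidden)),
            rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
            np.zeros(n_hidden))

(Wu, Vu, bu), (Wr, Vr, br), (Wc, Vc, bc) = init_gate(), init_gate(), init_gate()

def gru_step(x, z_prev):
    u = sigmoid(Wu.T @ x + Vu.T @ z_prev + bu)               # update gate u[t]
    r = sigmoid(Wr.T @ x + Vr.T @ z_prev + br)               # reset gate r[t]
    z_tilde = np.tanh(Wc.T @ x + Vc.T @ (r * z_prev) + bc)   # candidate state
    return u * z_prev + (1.0 - u) * z_tilde                  # convex combination

z = np.zeros(n_hidden)
for t in range(4):
    z = gru_step(rng.normal(size=n_in), z)
print(z)
```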
GATED RECURRENT UNITS (GRU)
Figure: GRU
Deep Learning – 10 / 14
GRU VS LSTM
Deep Learning – 11 / 14
Bidirectional RNNs
Deep Learning – 12 / 14
BIDIRECTIONAL RNNS
Another generalization of the simple RNN are bidirectional RNNs.
These allow us to process sequential data depending on both past
and future inputs, e.g. an application predicting missing words,
which probably depend on both preceding and following words.
One RNN processes the inputs in the forward direction from $x^{[1]}$ to $x^{[T]}$,
computing a sequence of hidden states $(z^{[1]}, \ldots, z^{[T]})$; another RNN
processes them in the backward direction from $x^{[T]}$ to $x^{[1]}$, computing
hidden states $(g^{[T]}, \ldots, g^{[1]})$.
Predictions are then based on both hidden states, which could be
concatenated.
With connections going back in time, the whole input sequence
must be known in advance to train and infer from the model.
Bidirectional RNNs are often used for the encoding of a sequence
in machine translation.
Deep Learning – 13 / 14
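A short PyTorch sketch (arbitrary sizes, untrained weights) showing that a bidirectional recurrent layer returns a forward and a backward hidden state per time step, concatenated along the feature dimension:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_in, n_hidden, seq_len = 5, 8, 7
birnn = nn.GRU(input_size=n_in, hidden_size=n_hidden,
               batch_first=True, bidirectional=True)

x = torch.randn(1, seq_len, n_in)      # one input sequence of length 7
outputs, h_n = birnn(x)

# Each time step carries the forward state z[t] and the backward state g[t],
# concatenated along the feature dimension.
print(outputs.shape)   # torch.Size([1, 7, 16])  -> 2 * n_hidden
print(h_n.shape)       # torch.Size([2, 1, 8])   -> final state per direction
```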
BIDIRECTIONAL RNNS
Computational graph of a bidirectional RNN:
Deep Learning – 14 / 14
Deep Learning
Applications of RNNs
Learning goals
Understand application to Language Modelling
Get to know Encoder-Decoder architectures
Learn about further RNN applications
Language Modelling
Deep Learning – 1 / 20
Seq-to-Seq (Type I)
Deep Learning – 2 / 20
RNNS - LANGUAGE MODELLING
In an earlier example, we built a ’sequence-to-one’ RNN model to
perform ’sentiment analysis’.
Another common task in Natural Language Processing (NLP) is
’language modelling’.
Input: word/character, encoded as a one-hot vector.
Output: probability distribution over words/characters given
previous words
$$P(y^{[1]}, \ldots, y^{[T]}) = \prod_{i=1}^{T} P(y^{[i]} \mid y^{[1]}, \ldots, y^{[i-1]})$$
Deep Learning – 3 / 20
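A small NumPy sketch of this factorization: given per-step softmax outputs of an RNN (the probabilities below are made-up placeholders), the probability of a sequence is the product of the per-step conditionals.

```python
import numpy as np

# Toy per-step conditional distributions P(y[i] | y[1], ..., y[i-1]) over a
# 4-character vocabulary, e.g. produced by the softmax output layer of an RNN.
vocab = ["h", "e", "l", "o"]
step_probs = np.array([
    [0.1, 0.7, 0.1, 0.1],   # after "h":    P(. | "h")
    [0.1, 0.1, 0.7, 0.1],   # after "he":   P(. | "he")
    [0.1, 0.1, 0.6, 0.2],   # after "hel":  P(. | "hel")
    [0.1, 0.1, 0.1, 0.7],   # after "hell": P(. | "hell")
])
targets = ["e", "l", "l", "o"]

# P(y[1], ..., y[T]) = prod_i P(y[i] | y[1], ..., y[i-1])
ids = [vocab.index(ch) for ch in targets]
per_step = step_probs[np.arange(len(ids)), ids]
print("P(sequence)     =", np.prod(per_step))
print("log P(sequence) =", np.sum(np.log(per_step)))
```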
RNNS - LANGUAGE MODELLING
In this example, we will feed the characters in the word "hello" one
at a time to a ’seq-to-seq’ RNN.
For the sake of the visualization, the characters "h", "e", "l" and "o"
are one-hot coded as vectors of length 4 and the output layer
only has 4 neurons, one for each character (we ignore the <eos>
token).
At each time step, the RNN has to output a probability distribution
(softmax) over the 4 possible characters that might follow the
current input.
Naturally, if the RNN has been trained on words in the English
language:
The probability of “e” should be likely, given the context of “h”.
“l” should be likely in the context of “he”.
“l” should also be likely, given the context of “hel”.
and, finally, “o” should be likely, given the context of “hell”.
Deep Learning – 4 / 20
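A compact PyTorch sketch of training such a seq-to-seq character RNN on "hello"; the hidden size, optimizer, and number of epochs are illustrative assumptions rather than the lecture's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab = ["h", "e", "l", "o"]
char_to_id = {ch: i for i, ch in enumerate(vocab)}

text = "hello"
inputs = torch.tensor([[char_to_id[ch] for ch in text[:-1]]])   # "hell"
targets = torch.tensor([[char_to_id[ch] for ch in text[1:]]])   # "ello"
x = nn.functional.one_hot(inputs, num_classes=4).float()        # (1, 4, 4)

rnn = nn.RNN(input_size=4, hidden_size=16, batch_first=True)
head = nn.Linear(16, 4)                                          # hidden -> char logits
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()                                  # softmax + negative log-likelihood

for epoch in range(200):
    hidden_states, _ = rnn(x)                 # (1, 4, 16): one state per time step
    logits = head(hidden_states)              # (1, 4, 4): a distribution per step
    loss = loss_fn(logits.reshape(-1, 4), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

probs = torch.softmax(logits, dim=-1)[0]
print(probs.argmax(dim=-1))   # should recover the targets "e", "l", "l", "o"
```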
RNNS - LANGUAGE MODELLING
Deep Learning – 5 / 20
RNNS - LANGUAGE MODELLING
Deep Learning – 6 / 20
WORD EMBEDDINGS
Word embeddings represent each word as a dense, learned vector; their
dimensionality is typically much smaller than the number of words in the dictionary.
Using them gives you a "warm start" for any NLP task. It is an
easy way to incorporate prior knowledge into your model and a
rudimentary form of transfer learning.
Two very popular approaches to learn word embeddings are
word2vec by Google and GloVe by Stanford. These embeddings
are typically 100 to 1000 dimensional.
Even though these embeddings capture the meaning of each word
to an extent, they do not capture the semantics of the word in a
given context because each word has a static precomputed
representation. For example, depending on the context, the word
"bank" might refer to a financial institution or to a river bank.
Deep Learning – 7 / 20
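A minimal sketch of what an embedding lookup is: a row of a (here randomly initialized) embedding matrix per word id. Note that the same static vector is returned for "bank" no matter the context.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy embedding matrix: one dense row per word in the vocabulary.
vocab = ["the", "bank", "river", "money", "good", "news"]
word_to_id = {w: i for i, w in enumerate(vocab)}
embedding_dim = 8                     # real embeddings are typically 100-1000 dimensional
E = rng.normal(size=(len(vocab), embedding_dim))

# Looking up a word embedding is just indexing a row.
sentence = ["the", "bank", "good", "news"]
embedded = E[[word_to_id[w] for w in sentence]]   # shape (4, 8)
print(embedded.shape)
```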
Encoder-Decoder Architectures
Deep Learning – 8 / 20
Seq-to-Seq (Type II)
Deep Learning – 9 / 20
ENCODER-DECODER NETWORK
For many interesting applications such as question answering,
dialogue systems, or machine translation, the network needs to
map an input sequence to an output sequence of different length.
This is what an encoder-decoder (also called
sequence-to-sequence architecture) enables us to do!
Deep Learning – 10 / 20
ENCODER-DECODER NETWORK
Figure: In the first part of the network, information from the input is encoded in
the context vector, here the final hidden state, which is then passed on to
every hidden state of the decoder, which produces the target sequence.
Deep Learning – 11 / 20
ENCODER-DECODER NETWORK
An input/encoder RNN processes the input sequence of length $n_x$
and computes a fixed-length context vector $C$, usually the final
hidden state or a simple function of the hidden states.
One time step after the other, information from the input sequence
is processed, added to the hidden state, and passed forward in
time through the recurrent connections between hidden states in
the encoder.
The context vector summarizes important information from the
input sequence, e.g. the intent of a question in a question
answering task or the meaning of a text in the case of machine
translation.
The decoder RNN uses this information to predict the output, a
sequence of length $n_y$, which can differ from $n_x$.
Deep Learning – 12 / 20
ENCODER-DECODER NETWORK
In machine translation, the decoder is a language model with
recurrent connections between the output at one time step and the
hidden state at the next time step as well as recurrent connections
between the hidden states:
$$P(y^{[1]}, \ldots, y^{[n_y]} \mid x^{[1]}, \ldots, x^{[n_x]}) = \prod_{t=1}^{n_y} P(y^{[t]} \mid C; y^{[1]}, \ldots, y^{[t-1]})$$
Deep Learning – 14 / 20
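A minimal PyTorch sketch of such an encoder-decoder: the encoder's final hidden state acts as the context vector C and initializes the decoder, which feeds its own prediction back in at each step. Vocabulary sizes, special token ids, and the greedy decoding loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

src_vocab, tgt_vocab, emb_dim, hidden = 10, 12, 8, 16
SOS, EOS = 0, 1                                     # assumed special token ids

src_emb = nn.Embedding(src_vocab, emb_dim)
tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
encoder = nn.GRU(emb_dim, hidden, batch_first=True)
decoder = nn.GRU(emb_dim, hidden, batch_first=True)
head = nn.Linear(hidden, tgt_vocab)

def translate(src_ids, max_len=10):
    # Encoder: compress the input sequence of length n_x into the context C.
    _, context = encoder(src_emb(src_ids))          # context: (1, 1, hidden)

    # Decoder: generate an output sequence of length n_y (possibly != n_x),
    # feeding back the previous prediction y[t-1] at every step.
    y_prev = torch.tensor([[SOS]])
    state, output = context, []
    for _ in range(max_len):
        dec_out, state = decoder(tgt_emb(y_prev), state)
        y_prev = head(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
        output.append(y_prev.item())
        if y_prev.item() == EOS:
            break
    return output

print(translate(torch.tensor([[2, 5, 7, 3]])))      # untrained: random token ids
```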
SOME MORE SOPHISTICATED APPLICATIONS
Figure: Generating Sequences With Recurrent Neural Networks. The top row
shows real data; the rest was generated by various RNNs (Graves, 2014).
Deep Learning – 17 / 20
SOME MORE SOPHISTICATED APPLICATIONS
Figure: Convolutional and recurrent nets for detecting emotion from audio
data (Anand, 2015).
Deep Learning – 18 / 20
SOME MORE SOPHISTICATED APPLICATIONS
Deep Learning – 19 / 20
Deep Learning
Learning goals
Familiarize yourself with the most recent sequence data modeling techniques:
Attention Mechanism
Transformers
Get to know the CNN alternative to RNNs
Attention
Deep Learning – 1 / 15
ATTENTION
In a classical encoder-decoder RNN, all information about the input
sequence must be incorporated into the final hidden state, which is
then passed as an input to the decoder network.
With a long input sequence, this fixed-size context vector is
unlikely to capture all relevant information about the past.
Each hidden state contains mostly information from recent inputs.
Key idea: Allow the decoder to access all the hidden states of the
encoder (instead of just the final one) so that it can dynamically
decide which ones are relevant at each time-step in the decoding.
This means the decoder can choose to "focus" on different hidden
states (of the encoder) at different time-steps of the decoding
process similar to how the human eye can focus on different
regions of the visual field.
This is known as an attention mechanism.
Deep Learning – 2 / 15
ATTENTION
The attention mechanism is implemented by an additional
component in the decoder.
For example, this can be a simple single-hidden layer feed-forward
neural network which is trained along with the RNN.
At any given time-step i of the decoding process, the network
computes the relevance of each encoder state $z^{[j]}$.
Deep Learning – 3 / 15
ATTENTION
The attention mechanism allows the decoder network to focus on
different parts of the input sequence by adding connections from
all hidden states of the encoder to each hidden state of the
decoder.
Figure: Attention at i = t + 1
Deep Learning – 4 / 15
ATTENTION
At each time step $i$, a set of weights $(\alpha^{[j]})^{[i]}$ is computed which
determines how to combine the hidden states of the encoder into a
context vector $g^{[i]} = \sum_{j=1}^{n_x} (\alpha^{[j]})^{[i]} z^{[j]}$, which holds the necessary
information to predict the correct output.
Figure: Attention at i = t + 2
Deep Learning – 5 / 15
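A small NumPy sketch of this weighting step; a dot-product score is used here for brevity, whereas the slides describe a small feed-forward network trained jointly with the RNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    a = a - a.max()
    return np.exp(a) / np.exp(a).sum()

n_x, hidden = 6, 4
encoder_states = rng.normal(size=(n_x, hidden))     # z[1], ..., z[n_x]
decoder_state = rng.normal(size=hidden)             # decoder state at step i

# Relevance scores: a simple dot product here (an assumption; the slides use
# a small feed-forward network instead).
scores = encoder_states @ decoder_state             # one score per encoder state z[j]
alpha = softmax(scores)                             # attention weights (alpha[j])[i]

# Context vector g[i] = sum_j (alpha[j])[i] * z[j]
g = alpha @ encoder_states                          # shape (hidden,)
print(alpha.round(3), g.shape)
```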
ATTENTION
Deep Learning – 6 / 15
ATTENTION
Figure: Attention for image captioning: the attention mechanism tells the
network roughly which pixels to pay attention to when writing the text (Xu et
al., 2016).
Deep Learning – 7 / 15
Transformers
Deep Learning – 8 / 15
TRANSFORMERS
Advanced RNNs have limitations similar to those of vanilla RNNs:
RNNs process the input data sequentially.
Difficulties in learning long term dependency (although GRU
or LSTM perform better than vanilla RNNs, they sometimes
struggle to remember the context introduced earlier in long
sequences).
These challenges are tackled by transformer networks.
Deep Learning – 9 / 15
TRANSFORMERS
Transformers are solely based on attention (no RNN or CNN).
In fact, the paper which coined the term transformer is called
"Attention Is All You Need".
They are the state-of-the-art networks in natural language
processing (NLP) tasks since 2017.
Transformer architectures like BERT (Bidirectional Encoder
Representations from Transformers, 2018) and GPT-3 (Generative
Pre-trained Transformer-3, 2020) are pre-trained on a large corpus
and can be fine-tuned to specific language tasks.
Deep Learning – 10 / 15
TRANSFORMERS
Deep Learning – 11 / 15
CNNs or RNNs?
Deep Learning – 12 / 15
CNNS OR RNNS?
Historically, RNNs were the default for sequence processing tasks.
However, some families of CNNs (especially those based on Fully
Convolutional Networks (FCNs)) can be used to process
variable-length sequences such as text or time-series data.
If a CNN doesn’t contain any fully-connected layers, the total
number of weights in the network is independent of the spatial
dimensions of the input because of weight-sharing in the
convolutional layers.
Recent research [Bai et al., 2018] indicates that such
convolutional architectures, so-called Temporal Convolutional
Networks (TCNs), can outperform RNNs on a wide range of tasks.
A major advantage of TCNs is that the entire input sequence can
be fed to the network at once (as opposed to sequentially).
Deep Learning – 13 / 15
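A minimal PyTorch sketch of the building block behind TCNs, a causal dilated 1D convolution; the channel count, kernel size, and dilation are arbitrary assumptions. Note that sequences of different lengths pass through the same fixed set of weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One causal dilated convolution layer: pad only on the left so that the
# output at time t never depends on inputs after t.
class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return self.conv(x)

# The whole (variable-length) sequence is processed in one call; the number
# of weights does not depend on the sequence length.
layer = CausalConv1d(channels=3, kernel_size=2, dilation=4)
for seq_len in (10, 25):
    x = torch.randn(1, 3, seq_len)
    print(layer(x).shape)                      # same time dimension as the input
```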
CNNS OR RNNS?
Figure: A TCN (we have already seen this in the CNN lecture!) is simply a
variant of the one-dimensional FCN which uses a special type of dilated
convolutions called causal dilated convolutions (Roy, 2019).
Deep Learning – 14 / 15
REFERENCES
Roy, R. (2019, February 4). Temporal Convolutional Networks. Medium.
https://medium.com/@raushan2807/
temporal-convolutional-networks-bfea16e6d7d2
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Kaiser, L., & Polosukhin, I. (2023). Attention Is All You Need.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., &
Bengio, Y. (2016). Show, Attend and Tell: Neural Image Caption Generation with
Visual Attention.
Loye, G. (2019, September 15). Attention mechanism. FloydHub Blog.
https://blog.floydhub.com/attention-mechanism/
Deep Learning – 15 / 15