Recurrent Neural Networks
Learning goals
Why do we need them?
How do they work?
Computational Graph
Recurrent Networks
Motivation
Deep Learning – 1 / 20
MOTIVATION FOR RECURRENT NETWORKS
The two types of neural network architectures that we’ve seen so
far are fully-connected networks and CNNs.
Their input layers have a fixed size and (typically) only handle
fixed-length inputs.
The primary reason: if we vary the size of the input layer, we would
also have to vary the number of learnable weights in the network.
This in particular relates to sequence data such as time-series,
audio and text.
Recurrent Neural Networks (RNNs) are a class of architectures
that allows for varying input lengths and properly accounts for the
ordering in sequence data.
Deep Learning – 2 / 20
RNNS - INTRODUCTION
Suppose we have some text data and our task is to analyse the
sentiment in the text.
For example, given an input sentence, such as "This is good news.", the
network has to classify it as either ’positive’ or ’negative’.
We would like to train a simple neural network (such as the one below) to
perform the task.
Figure: Two equivalent visualizations of a dense net with a single hidden layer;
the left one is more abstract, showing the network at the level of layers.
Deep Learning – 3 / 20
RNNS - INTRODUCTION
Because sentences can be of varying lengths, we need to modify
the dense net architecture to handle such a scenario.
One approach is to draw inspiration from the way a human reads a
sentence; that is, one word at a time.
An important cognitive mechanism that makes this possible is
"short-term memory".
As we read a sentence from beginning to end, we retain some
information about the words that we have already read and use
this information to understand the meaning of the entire sentence.
Therefore, in order to feed the words in a sentence sequentially to
a neural network, we need to give it the ability to retain some
information about past inputs.
Deep Learning – 4 / 20
RNNS - INTRODUCTION
When words in a sentence are fed to the network one at a time,
the inputs are no longer independent. It is much more likely that
the word "good" is followed by "morning" rather than "plastic".
Hence, we also need to model this (long-term) dependency.
Each word must still be encoded as a fixed-length vector because
the size of the input layer will remain fixed.
Here, for the sake of the visualization, each word is represented as
a ’one-hot coded’ vector of length 5. (<eos> = ’end of sequence’)
Deep Learning – 5 / 20
RNNS - INTRODUCTION
Our goal is to feed the words to the network sequentially in
discrete time-steps.
A regular dense neural network with a single hidden layer only has
two sets of weights: 'input-to-hidden' weights W and 'hidden-to-output' weights U.
Deep Learning – 6 / 20
RNNS - INTRODUCTION
In order to enable the network to retain information about past inputs, we
introduce an additional set of weights V, from the hidden neurons at
time-step t to the hidden neurons at time-step t + 1.
Having this additional set of weights makes the activations of the hidden
layer depend on both the current input and the activations for the
previous input.
Deep Learning – 7 / 20
RNNS - INTRODUCTION
With this additional set of hidden-to-hidden weights V, the network
is now a Recurrent Neural Network (RNN).
In a regular feed-forward network, the activations of the hidden
layer are only computed using the input-hidden weights W (and
bias b).
$z = \sigma(W^\top x + b)$
In an RNN, the activations of the hidden layer (at time-step t) are
computed using both the input-to-hidden weights W and the
hidden-to-hidden weights V.
Deep Learning – 9 / 20
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 0, we feed the word "This" to the network and obtain $z^{[0]}$.
$z^{[0]} = \sigma(W^\top x^{[0]} + b)$
Because this is the very first input, there is no past state (or,
equivalently, the state is initialized to 0).
Deep Learning – 10 / 20
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 1, we feed the second word to the network to obtain $z^{[1]}$.
$z^{[1]} = \sigma(V^\top z^{[0]} + W^\top x^{[1]} + b)$
Deep Learning – 11 / 20
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 2, we feed the next word in the sentence.
$z^{[2]} = \sigma(V^\top z^{[1]} + W^\top x^{[2]} + b)$
Deep Learning – 12 / 20
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
At t = 3, we feed the next word ("news") in the sentence.
$z^{[3]} = \sigma(V^\top z^{[2]} + W^\top x^{[3]} + b)$
Deep Learning – 13 / 20
APPLICATION EXAMPLE - SENTIMENT ANALYSIS
Once the entire input sequence has been processed, the
prediction of the network can be generated by feeding the
activations of the final time-step to the output neuron(s).
$f = \sigma(U^\top z^{[4]} + c)$, where $c$ is the bias of the output neuron.
Deep Learning – 14 / 20
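To make the recurrence concrete, the following minimal NumPy sketch runs the seq-to-one forward pass described above; the vocabulary, the hidden size, and the random weights are illustrative assumptions, not the exact setup of the figures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary, one-hot coded as in the visualization (5 words incl. <eos>).
vocab = ["This", "is", "good", "news", "<eos>"]
word_to_id = {w: i for i, w in enumerate(vocab)}

n_in, n_hidden = len(vocab), 3          # input and hidden layer sizes (assumed)
W = rng.normal(scale=0.1, size=(n_in, n_hidden))      # input-to-hidden weights
V = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden-to-hidden weights
U = rng.normal(scale=0.1, size=(n_hidden, 1))         # hidden-to-output weights
b = np.zeros(n_hidden)                  # hidden bias
c = np.zeros(1)                         # output bias

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def one_hot(word):
    x = np.zeros(n_in)
    x[word_to_id[word]] = 1.0
    return x

# Seq-to-one forward pass: z[t] = sigma(V^T z[t-1] + W^T x[t] + b),
# with the same W, V reused at every time step.
sentence = ["This", "is", "good", "news", "<eos>"]
z = np.zeros(n_hidden)                  # no past state at t = 0
for word in sentence:
    z = sigmoid(V.T @ z + W.T @ one_hot(word) + b)

# Prediction from the final hidden state: f = sigma(U^T z + c).
f = sigmoid(U.T @ z + c)
print("P(positive) =", f.item())
```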
PARAMETER SHARING
This way, the network can process the sentence one word at a
time and the length of the network can vary based on the length of
the sequence.
It is important to note that no matter how long the input sequence
is, the matrices W and V are the same in every time-step. This is
another example of parameter sharing.
Therefore, the number of weights in the network is independent of
the length of the input sequence.
Deep Learning – 15 / 20
RNNS - USE CASE SPECIFIC ARCHITECTURES
RNNs are very versatile. They can be applied to a wide range of tasks.
Figure: RNNs can be used in tasks that involve multiple inputs and/or multiple outputs.
Examples:
Sequence-to-One: Sentiment analysis, document classification.
One-to-Sequence: Image captioning.
Sequence-to-Sequence: Language modelling, machine translation,
time-series prediction.
Deep Learning – 16 / 20
Computational Graph
Deep Learning – 17 / 20
RNNS - COMPUTATIONAL GRAPH
Deep Learning – 18 / 20
RECURRENT OUTPUT-HIDDEN CONNECTIONS
Recurrent connections do not need to map from hidden to hidden
neurons!
Figure: RNN with feedback connection from the output to the hidden layer.
The RNN is only allowed to send $f$ to future time points; hence, $z^{[t-1]}$ is
connected to $z^{[t]}$ only indirectly, via the predictions $f^{[t-1]}$.
Deep Learning – 19 / 20
SEQ-TO-ONE MAPPINGS
RNNs do not need to produce an output at each time step. Often only
one output is produced after processing the whole sequence.
Figure: Time-unfolded recurrent neural network with a single output at the end
of the sequence. Such a network can be used to summarize a sequence and
produce a fixed size representation.
Deep Learning – 20 / 20
Deep Learning
Learning goals
How does Backpropagation work for RNNs?
Exploding and Vanishing Gradients
SIMPLE EXAMPLE: CHARACTER LEVEL LANGUAGE MODEL
Task: Learn character probability distribution from input text
Suppose we only had a vocabulary of four possible letters: “h”, “e”,
“l” and “o”
We want to train an RNN on the training sequence “hello”.
This training sequence is in fact a source of 4 separate training
examples:
“e” should be likely given the context of “h”
“l” should be likely in the context of “he”
“l” should also be likely given the context of “hel”
and “o” should be likely given the context of “hell”
Deep Learning – 1 / 12
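A tiny sketch (the integer encoding is an assumption, not taken from the slides) of how these four training examples can be extracted from the training sequence as (input character, target character) pairs:

```python
# Derive the 4 training examples from the training sequence "hello":
# at each step the input is the current character, the target is the next one.
text = "hello"
vocab = sorted(set(text))                      # ['e', 'h', 'l', 'o']
char_to_id = {ch: i for i, ch in enumerate(vocab)}

pairs = [(text[i], text[i + 1]) for i in range(len(text) - 1)]
print(pairs)        # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
print([(char_to_id[x], char_to_id[y]) for x, y in pairs])
```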
BACKPROPAGATION THROUGH TIME
For training the RNN, we need to compute $\frac{dL}{du_{i,j}}$, $\frac{dL}{dv_{i,j}}$, and $\frac{dL}{dw_{i,j}}$.
To do so, during backpropagation at time step t for an arbitrary RNN, we need to compute
$$\frac{dL}{dz^{[1]}} = \frac{dL}{dz^{[t]}} \frac{dz^{[t]}}{dz^{[t-1]}} \cdots \frac{dz^{[2]}}{dz^{[1]}}$$
Deep Learning – 6 / 12
LONG-TERM DEPENDENCIES
Here, $z^{[t]} = \sigma(V^\top z^{[t-1]} + W^\top x^{[t]} + b)$.
It follows that:
$$\frac{dz^{[t]}}{dz^{[t-1]}} = \mathrm{diag}\big(\sigma'(V^\top z^{[t-1]} + W^\top x^{[t]} + b)\big) V^\top = D^{[t-1]} V^\top$$
$$\frac{dz^{[t-1]}}{dz^{[t-2]}} = \mathrm{diag}\big(\sigma'(V^\top z^{[t-2]} + W^\top x^{[t-1]} + b)\big) V^\top = D^{[t-2]} V^\top$$
$$\vdots$$
$$\frac{dz^{[2]}}{dz^{[1]}} = \mathrm{diag}\big(\sigma'(V^\top z^{[1]} + W^\top x^{[2]} + b)\big) V^\top = D^{[1]} V^\top$$
$$\frac{dL}{dz^{[1]}} = \frac{dL}{dz^{[t]}} \frac{dz^{[t]}}{dz^{[t-1]}} \cdots \frac{dz^{[2]}}{dz^{[1]}} = \frac{dL}{dz^{[t]}} D^{[t-1]} D^{[t-2]} \cdots D^{[1]} (V^\top)^{t-1}$$
Deep Learning – 7 / 12
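The following NumPy sketch accumulates the Jacobian product $D^{[t-1]} V^\top \cdots D^{[1]} V^\top$ along a random run of the recurrence; the sizes and weight scales are arbitrary assumptions, chosen only to illustrate how the norm of $dz^{[t]}/dz^{[1]}$ shrinks or blows up depending on the largest eigenvalue of $V^\top$.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_hidden, T = 4, 50
V = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # assume input dim == hidden dim
b = np.zeros(n_hidden)

# Run the recurrence and accumulate the Jacobian product D[t-1] V^T ... D[1] V^T.
z = np.zeros(n_hidden)
jacobian = np.eye(n_hidden)              # will hold dz[t]/dz[1]
for t in range(1, T):
    x = rng.normal(size=n_hidden)
    pre = V.T @ z + W.T @ x + b
    z = sigmoid(pre)
    D = np.diag(sigmoid(pre) * (1.0 - sigmoid(pre)))   # diag(sigma'(...))
    jacobian = D @ V.T @ jacobian

print("||dz[T]/dz[1]|| =", np.linalg.norm(jacobian))   # tiny for small V, huge for large V
print("largest |eigenvalue| of V^T =", np.max(np.abs(np.linalg.eigvals(V.T))))
```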
LONG-TERM DEPENDENCIES
In general, for an arbitrary time-step $i < t$ in the past, $\frac{dz^{[t]}}{dz^{[i]}}$ will
contain the term $(V^\top)^{t-i}$ (this follows from the chain rule).
Depending on the largest eigenvalue of $V^\top$, the presence of the term
$(V^\top)^{t-i}$ can result in either vanishing or exploding gradients.
This problem is quite severe for RNNs (compared to feedforward
networks) because the same matrix $V^\top$ is multiplied many times.
As the gap between t and i increases, the instability worsens.
It is thus quite challenging for RNNs to learn long-term
dependencies. The gradients either vanish (most of the time) or
explode (rarely, but with much damage to the optimization).
That happens simply because we propagate errors over very
many stages backwards.
Deep Learning – 8 / 12
LONG-TERM DEPENDENCIES
Deep Learning – 9 / 12
LONG-TERM DEPENDENCIES
Recall, that we can counteract exploding gradients by
implementing gradient clipping.
To avoid exploding gradients, we simply clip the norm of the
gradient at some threshold h (see chapter 4):
$$\text{if } \lVert \nabla W \rVert > h: \quad \nabla W \leftarrow \frac{h}{\lVert \nabla W \rVert} \nabla W$$
Deep Learning – 10 / 12
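A minimal sketch of this clipping rule in plain NumPy (in PyTorch, torch.nn.utils.clip_grad_norm_ implements the same idea):

```python
import numpy as np

def clip_gradient(grad, threshold):
    """Rescale the gradient if its norm exceeds the threshold h."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = (threshold / norm) * grad
    return grad

g = np.array([3.0, 4.0])                 # ||g|| = 5
print(clip_gradient(g, threshold=1.0))   # rescaled to norm 1: [0.6, 0.8]
```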
LONG-TERM DEPENDENCIES
Deep Learning – 11 / 12
LONG-TERM DEPENDENCIES
Even for a stable RNN (gradients not exploding), there will be
exponentially smaller weights for long-term interactions compared
to short-term ones and a more sophisticated solution is needed for
this vanishing gradient problem (discussed in the next chapters).
The vanishing gradient problem heavily depends on the choice of
the activation functions.
Sigmoid maps a real number into a “small” range (i.e. [0, 1])
and thus even huge changes in the input will only produce a
small change in the output. Hence, the gradient will be small.
This becomes even worse when we stack multiple layers.
We can avoid this problem by using activation functions which
do not “squash” the input.
The most popular choice is ReLU with gradients being either
0 or 1, i.e., they never saturate and thus don’t vanish.
The downside of this is that we can obtain a “dead” ReLU.
Deep Learning – 12 / 12
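A quick numeric illustration of the squashing argument: the sigmoid derivative is at most 0.25, so a product over many time steps collapses, whereas ReLU derivatives on an active path stay at 1. The numbers below are just this back-of-the-envelope calculation, not a trained network.

```python
# Worst case for sigmoid: each factor in the chain-rule product is at most 0.25.
steps = 50
sigmoid_prime_max = 0.25
print("sigmoid, 50 steps:", sigmoid_prime_max ** steps)   # ~7.9e-31, vanishes
print("ReLU (active path), 50 steps:", 1.0 ** steps)      # 1.0, does not vanish
```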
Deep Learning
Learning goals
LSTM cell
GRU cell
Bidirectional RNNs
Long Short-Term Memory (LSTM)
Deep Learning – 1 / 14
LONG SHORT-TERM MEMORY (LSTM)
The LSTM provides a way of dealing with vanishing gradients and
modelling long-term dependencies.
Deep Learning – 2 / 14
LONG SHORT-TERM MEMORY (LSTM)
Forget gate $e^{[t]}$: indicates which information of the old cell state
we should forget.
Intuition: Think of a model trying to predict the next word based on
all the previous ones. The cell state might include the gender of
the present subject, so that the correct pronouns can be used.
When we now see a new subject, we want to forget the gender of
the old one.
Deep Learning – 3 / 14
LONG SHORT-TERM MEMORY (LSTM)
Output gate $o^{[t]}$: indicates which information from the cell state is filtered.
It is given by $o^{[t]} = \sigma(b_o + V_o^\top z^{[t-1]} + W_o^\top x^{[t]})$, with gate-specific
weights $W_o$, $V_o$.
Deep Learning – 3 / 14
LONG SHORT-TERM MEMORY (LSTM)
Finally, the new state $z^{[t]}$ of the LSTM is a function of the cell state,
multiplied by the output gate (a sketch of the full cell follows below).
Deep Learning – 3 / 14
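Below is a from-scratch NumPy sketch of one LSTM step following the standard formulation; the input gate, candidate, and cell-state symbols (i, s_tilde, s) are not spelled out on these slides, so their names and the weight initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_hidden = 5, 3

def init(shape):
    return rng.normal(scale=0.1, size=shape)

# One weight pair (W_*, V_*) and one bias per gate, as for the output gate above.
params = {g: (init((n_in, n_hidden)), init((n_hidden, n_hidden)), np.zeros(n_hidden))
          for g in ["e", "i", "o", "cand"]}   # forget, input, output gates + candidate

def lstm_step(x, z_prev, s_prev):
    """One LSTM step: returns the new state z[t] and the new cell state s[t]."""
    def gate(name, squash):
        W, V, b = params[name]
        return squash(W.T @ x + V.T @ z_prev + b)

    e = gate("e", sigmoid)            # forget gate e[t]
    i = gate("i", sigmoid)            # input gate
    o = gate("o", sigmoid)            # output gate o[t]
    s_tilde = gate("cand", np.tanh)   # candidate cell state
    s = e * s_prev + i * s_tilde      # new cell state: forget old, add new
    z = o * np.tanh(s)                # new state z[t]: cell state filtered by o[t]
    return z, s

z, s = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(4):
    z, s = lstm_step(rng.normal(size=n_in), z, s)
print(z)
```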
Gated Recurrent Units (GRU)
Deep Learning – 4 / 14
GATED RECURRENT UNITS (GRU)
The key distinction between regular RNNs and GRUs is that the
latter support gating of the hidden state.
Here, we have dedicated mechanisms for when a hidden state
should be updated and also when it should be reset.
These mechanisms are learned to:
avoid the vanishing/exploding gradient problem which comes
with a standard recurrent neural network.
solve the vanishing gradient problem by using an update gate
and a reset gate.
control the information that flows into (update gate) and out of
(reset gate) memory.
Deep Learning – 5 / 14
GATED RECURRENT UNITS (GRU)
For a given time step t, the hidden state of the last time step is
$z^{[t-1]}$. The update gate $u^{[t]}$ is computed as follows:
$$u^{[t]} = \sigma(W_u^\top x^{[t]} + V_u^\top z^{[t-1]} + b_u)$$
Deep Learning – 6 / 14
GATED RECURRENT UNITS (GRU)
The update gate $u^{[t]}$ determines how much of the old state $z^{[t-1]}$ and
the new candidate state $\tilde{z}^{[t]}$ is used:
$$z^{[t]} = u^{[t]} \odot z^{[t-1]} + (1 - u^{[t]}) \odot \tilde{z}^{[t]}$$
Deep Learning – 9 / 14
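A minimal NumPy sketch of one GRU step under the standard formulation; the reset-gate and candidate-state weights (W_r, V_r, W_c, V_c) are assumptions, since only the update gate is written out explicitly above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

n_in, n_hidden = 5, 3

def init_gate():
    return (rng.normal(scale=0.1, size=(n_in, n_hidden)),
            rng.normal(scale=0.1, size=(n_hidden, n_hidden)),
            np.zeros(n_hidden))

(Wu, Vu, bu), (Wr, Vr, br), (Wc, Vc, bc) = init_gate(), init_gate(), init_gate()

def gru_step(x, z_prev):
    u = sigmoid(Wu.T @ x + Vu.T @ z_prev + bu)               # update gate u[t]
    r = sigmoid(Wr.T @ x + Vr.T @ z_prev + br)               # reset gate r[t]
    z_tilde = np.tanh(Wc.T @ x + Vc.T @ (r * z_prev) + bc)   # candidate state
    return u * z_prev + (1.0 - u) * z_tilde                  # convex combination

z = np.zeros(n_hidden)
for t in range(4):
    z = gru_step(rng.normal(size=n_in), z)
print(z)
```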
GATED RECURRENT UNITS (GRU)
Figure: GRU
Deep Learning – 10 / 14
GRU VS LSTM
Deep Learning – 11 / 14
Bidirectional RNNs
Deep Learning – 12 / 14
BIDIRECTIONAL RNNS
Another generalization of the simple RNN are bidirectional RNNs.
These allow us to process sequential data depending on both past
and future inputs, e.g. an application predicting missing words,
which probably depend on both preceding and following words.
One RNN processes the inputs in the forward direction from $x^{[1]}$ to $x^{[T]}$,
computing a sequence of hidden states $(z^{[1]}, \ldots, z^{[T]})$; another RNN
processes them in the backward direction from $x^{[T]}$ to $x^{[1]}$, computing
hidden states $(g^{[T]}, \ldots, g^{[1]})$.
Predictions are then based on both hidden states, which could be
concatenated.
With connections going back in time, the whole input sequence
must be known in advance to train and infer from the model.
Bidirectional RNNs are often used for the encoding of a sequence
in machine translation.
Deep Learning – 13 / 14
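A short PyTorch sketch (arbitrary sizes, untrained weights) showing that a bidirectional recurrent layer returns a forward and a backward hidden state per time step, concatenated along the feature dimension:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_in, n_hidden, seq_len = 5, 8, 7
birnn = nn.GRU(input_size=n_in, hidden_size=n_hidden,
               batch_first=True, bidirectional=True)

x = torch.randn(1, seq_len, n_in)      # one input sequence of length 7
outputs, h_n = birnn(x)

# Each time step carries the forward state z[t] and the backward state g[t],
# concatenated along the feature dimension.
print(outputs.shape)   # torch.Size([1, 7, 16])  -> 2 * n_hidden
print(h_n.shape)       # torch.Size([2, 1, 8])   -> final state per direction
```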
BIDIRECTIONAL RNNS
Computational graph of a bidirectional RNN:
Deep Learning – 14 / 14
Deep Learning
Applications of RNNs
Learning goals
Understand application to Language Modelling
Get to know Encoder-Decoder architectures
Learn about further RNN applications
Language Modelling
Deep Learning – 1 / 20
Seq-to-Seq (Type I)
Deep Learning – 2 / 20
RNNS - LANGUAGE MODELLING
In an earlier example, we built a ’sequence-to-one’ RNN model to
perform ’sentiment analysis’.
Another common task in Natural Language Processing (NLP) is
’language modelling’.
Input: word/character, encoded as a one-hot vector.
Output: probability distribution over words/characters given
previous words
$$P(y^{[1]}, \ldots, y^{[T]}) = \prod_{i=1}^{T} P(y^{[i]} \mid y^{[1]}, \ldots, y^{[i-1]})$$
Deep Learning – 3 / 20
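A small NumPy sketch of this factorization: given per-step softmax outputs of an RNN (the probabilities below are made-up placeholders), the probability of a sequence is the product of the per-step conditionals.

```python
import numpy as np

# Toy per-step conditional distributions P(y[i] | y[1], ..., y[i-1]) over a
# 4-character vocabulary, e.g. produced by the softmax output layer of an RNN.
vocab = ["h", "e", "l", "o"]
step_probs = np.array([
    [0.1, 0.7, 0.1, 0.1],   # after "h":    P(. | "h")
    [0.1, 0.1, 0.7, 0.1],   # after "he":   P(. | "he")
    [0.1, 0.1, 0.6, 0.2],   # after "hel":  P(. | "hel")
    [0.1, 0.1, 0.1, 0.7],   # after "hell": P(. | "hell")
])
targets = ["e", "l", "l", "o"]

# P(y[1], ..., y[T]) = prod_i P(y[i] | y[1], ..., y[i-1])
ids = [vocab.index(ch) for ch in targets]
per_step = step_probs[np.arange(len(ids)), ids]
print("P(sequence)     =", np.prod(per_step))
print("log P(sequence) =", np.sum(np.log(per_step)))
```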
RNNS - LANGUAGE MODELLING
In this example, we will feed the characters in the word "hello" one
at a time to a ’seq-to-seq’ RNN.
For the sake of the visualization, the characters "h", "e", "l" and "o"
are one-hot coded as vectors of length 4 and the output layer
only has 4 neurons, one for each character (we ignore the <eos>
token).
At each time step, the RNN has to output a probability distribution
(softmax) over the 4 possible characters that might follow the
current input.
Naturally, if the RNN has been trained on words in the English
language:
The probability of “e” should be likely, given the context of “h”.
“l” should be likely in the context of “he”.
“l” should also be likely, given the context of “hel”.
and, finally, “o” should be likely, given the context of “hell”.
Deep Learning – 4 / 20
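A compact PyTorch sketch of training such a seq-to-seq character RNN on "hello"; the hidden size, optimizer, and number of epochs are illustrative assumptions rather than the lecture's setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab = ["h", "e", "l", "o"]
char_to_id = {ch: i for i, ch in enumerate(vocab)}

text = "hello"
inputs = torch.tensor([[char_to_id[ch] for ch in text[:-1]]])   # "hell"
targets = torch.tensor([[char_to_id[ch] for ch in text[1:]]])   # "ello"
x = nn.functional.one_hot(inputs, num_classes=4).float()        # (1, 4, 4)

rnn = nn.RNN(input_size=4, hidden_size=16, batch_first=True)
head = nn.Linear(16, 4)                                          # hidden -> char logits
optimizer = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()                                  # softmax + negative log-likelihood

for epoch in range(200):
    hidden_states, _ = rnn(x)                 # (1, 4, 16): one state per time step
    logits = head(hidden_states)              # (1, 4, 4): a distribution per step
    loss = loss_fn(logits.reshape(-1, 4), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

probs = torch.softmax(logits, dim=-1)[0]
print(probs.argmax(dim=-1))   # should recover the targets "e", "l", "l", "o"
```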
RNNS - LANGUAGE MODELLING
Deep Learning – 5 / 20
RNNS - LANGUAGE MODELLING
Deep Learning – 6 / 20
WORD EMBEDDINGS
Word embeddings represent each word as a dense, learned vector; their
dimensionality is typically much smaller than the number of words in the dictionary.
Using them gives you a "warm start" for any NLP task. It is an
easy way to incorporate prior knowledge into your model and a
rudimentary form of transfer learning.
Two very popular approaches to learn word embeddings are
word2vec by Google and GloVe by Stanford. These embeddings
are typically 100 to 1000 dimensional.
Even though these embeddings capture the meaning of each word
to an extent, they do not capture the semantics of the word in a
given context because each word has a static precomputed
representation. For example, depending on the context, the word
"bank" might refer to a financial institution or to a river bank.
Deep Learning – 7 / 20
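A minimal sketch of what an embedding lookup is: a row of a (here randomly initialized) embedding matrix per word id. Note that the same static vector is returned for "bank" no matter the context.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy embedding matrix: one dense row per word in the vocabulary.
vocab = ["the", "bank", "river", "money", "good", "news"]
word_to_id = {w: i for i, w in enumerate(vocab)}
embedding_dim = 8                     # real embeddings are typically 100-1000 dimensional
E = rng.normal(size=(len(vocab), embedding_dim))

# Looking up a word embedding is just indexing a row.
sentence = ["the", "bank", "good", "news"]
embedded = E[[word_to_id[w] for w in sentence]]   # shape (4, 8)
print(embedded.shape)
```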
Encoder-Decoder Architectures
Deep Learning – 8 / 20
Seq-to-Seq (Type II)
Deep Learning – 9 / 20
ENCODER-DECODER NETWORK
For many interesting applications such as question answering,
dialogue systems, or machine translation, the network needs to
map an input sequence to an output sequence of different length.
This is what an encoder-decoder (also called
sequence-to-sequence architecture) enables us to do!
Deep Learning – 10 / 20
ENCODER-DECODER NETWORK
Figure: In the first part of the network, information from the input is encoded in
the context vector, here the final hidden state, which is then passed on to
every hidden state of the decoder, which produces the target sequence.
Deep Learning – 11 / 20
ENCODER-DECODER NETWORK
An input/encoder RNN processes the input sequence of length $n_x$
and computes a fixed-length context vector $C$, usually the final
hidden state or a simple function of the hidden states.
One time step after the other, information from the input sequence
is processed, added to the hidden state, and passed forward in
time through the recurrent connections between hidden states in
the encoder.
The context vector summarizes important information from the
input sequence, e.g. the intent of a question in a question
answering task or the meaning of a text in the case of machine
translation.
The decoder RNN uses this information to predict the output, a
sequence of length $n_y$, which can differ from $n_x$.
Deep Learning – 12 / 20
ENCODER-DECODER NETWORK
In machine translation, the decoder is a language model with
recurrent connections between the output at one time step and the
hidden state at the next time step as well as recurrent connections
between the hidden states:
$$P(y^{[1]}, \ldots, y^{[n_y]} \mid x^{[1]}, \ldots, x^{[n_x]}) = \prod_{t=1}^{n_y} P(y^{[t]} \mid C; y^{[1]}, \ldots, y^{[t-1]})$$
Deep Learning – 14 / 20
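A minimal PyTorch sketch of such an encoder-decoder: the encoder's final hidden state acts as the context vector C and initializes the decoder, which feeds its own prediction back in at each step. Vocabulary sizes, special token ids, and the greedy decoding loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

src_vocab, tgt_vocab, emb_dim, hidden = 10, 12, 8, 16
SOS, EOS = 0, 1                                     # assumed special token ids

src_emb = nn.Embedding(src_vocab, emb_dim)
tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
encoder = nn.GRU(emb_dim, hidden, batch_first=True)
decoder = nn.GRU(emb_dim, hidden, batch_first=True)
head = nn.Linear(hidden, tgt_vocab)

def translate(src_ids, max_len=10):
    # Encoder: compress the input sequence of length n_x into the context C.
    _, context = encoder(src_emb(src_ids))          # context: (1, 1, hidden)

    # Decoder: generate an output sequence of length n_y (possibly != n_x),
    # feeding back the previous prediction y[t-1] at every step.
    y_prev = torch.tensor([[SOS]])
    state, output = context, []
    for _ in range(max_len):
        dec_out, state = decoder(tgt_emb(y_prev), state)
        y_prev = head(dec_out[:, -1]).argmax(dim=-1, keepdim=True)
        output.append(y_prev.item())
        if y_prev.item() == EOS:
            break
    return output

print(translate(torch.tensor([[2, 5, 7, 3]])))      # untrained: random token ids
```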
SOME MORE SOPHISTICATED APPLICATIONS
Figure: Generating Sequences With Recurrent Neural Networks. The top row
shows real data; the rest was generated by various RNNs (Graves, 2014).
Deep Learning – 17 / 20
SOME MORE SOPHISTICATED APPLICATIONS
Figure: Convolutional and recurrent nets for detecting emotion from audio
data (Anand, 2015).
Deep Learning – 18 / 20
SOME MORE SOPHISTICATED APPLICATIONS
Deep Learning – 19 / 20
Deep Learning
Learning goals
Familiarize yourself with the most recent sequence data modeling techniques:
Attention Mechanism
Transformers
Get to know the CNN alternative to RNNs
Attention
Deep Learning – 1 / 15
ATTENTION
In a classical encoder-decoder RNN, all information about the input
sequence must be incorporated into the final hidden state, which is
then passed as an input to the decoder network.
With a long input sequence, this fixed-size context vector is
unlikely to capture all relevant information about the past.
Each hidden state contains mostly information from recent inputs.
Key idea: Allow the decoder to access all the hidden states of the
encoder (instead of just the final one) so that it can dynamically
decide which ones are relevant at each time-step in the decoding.
This means the decoder can choose to "focus" on different hidden
states (of the encoder) at different time-steps of the decoding
process similar to how the human eye can focus on different
regions of the visual field.
This is known as an attention mechanism.
Deep Learning – 2 / 15
ATTENTION
The attention mechanism is implemented by an additional
component in the decoder.
For example, this can be a simple single-hidden layer feed-forward
neural network which is trained along with the RNN.
At any given time-step i of the decoding process, the network
computes the relevance of each encoder state $z^{[j]}$.
Deep Learning – 3 / 15
ATTENTION
The attention mechanism allows the decoder network to focus on
different parts of the input sequence by adding connections from
all hidden states of the encoder to each hidden state of the
decoder.
Figure: Attention at i = t + 1
Deep Learning – 4 / 15
ATTENTION
At each time step $i$, a set of weights $(\alpha^{[j]})^{[i]}$ is computed which
determines how to combine the hidden states of the encoder into a
context vector $g^{[i]} = \sum_{j=1}^{n_x} (\alpha^{[j]})^{[i]} z^{[j]}$, which holds the necessary
information to predict the correct output.
Figure: Attention at i = t + 2
Deep Learning – 5 / 15
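A small NumPy sketch of this weighting step; a dot-product score is used here for brevity, whereas the slides describe a small feed-forward network trained jointly with the RNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    a = a - a.max()
    return np.exp(a) / np.exp(a).sum()

n_x, hidden = 6, 4
encoder_states = rng.normal(size=(n_x, hidden))     # z[1], ..., z[n_x]
decoder_state = rng.normal(size=hidden)             # decoder state at step i

# Relevance scores: a simple dot product here (an assumption; the slides use
# a small feed-forward network instead).
scores = encoder_states @ decoder_state             # one score per encoder state z[j]
alpha = softmax(scores)                             # attention weights (alpha[j])[i]

# Context vector g[i] = sum_j (alpha[j])[i] * z[j]
g = alpha @ encoder_states                          # shape (hidden,)
print(alpha.round(3), g.shape)
```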
ATTENTION
Deep Learning – 6 / 15
ATTENTION
Figure: Attention for image captioning: the attention mechanism tells the
network roughly which pixels to pay attention to when writing the text (Xu et
al., 2016).
Deep Learning – 7 / 15
Transformers
Deep Learning – 8 / 15
TRANSFORMERS
Advanced RNNs have limitations similar to those of vanilla RNNs:
RNNs process the input data sequentially.
Difficulties in learning long term dependency (although GRU
or LSTM perform better than vanilla RNNs, they sometimes
struggle to remember the context introduced earlier in long
sequences).
These challenges are tackled by transformer networks.
Deep Learning – 9 / 15
TRANSFORMERS
Transformers are solely based on attention (no RNN or CNN).
In fact, the paper which coined the term transformer is called
"Attention Is All You Need".
They are the state-of-the-art networks in natural language
processing (NLP) tasks since 2017.
Transformer architectures like BERT (Bidirectional Encoder
Representations from Transformers, 2018) and GPT-3 (Generative
Pre-trained Transformer-3, 2020) are pre-trained on a large corpus
and can be fine-tuned to specific language tasks.
Deep Learning – 10 / 15
TRANSFORMERS
Deep Learning – 11 / 15
CNNs or RNNs?
Deep Learning – 12 / 15
CNNS OR RNNS?
Historically, RNNs were the default for sequence processing tasks.
However, some families of CNNs (especially those based on Fully
Convolutional Networks (FCNs)) can be used to process
variable-length sequences such as text or time-series data.
If a CNN doesn’t contain any fully-connected layers, the total
number of weights in the network is independent of the spatial
dimensions of the input because of weight-sharing in the
convolutional layers.
Recent research [Bai et al., 2018] indicates that such
convolutional architectures, so-called Temporal Convolutional
Networks (TCNs), can outperform RNNs on a wide range of tasks.
A major advantage of TCNs is that the entire input sequence can
be fed to the network at once (as opposed to sequentially).
Deep Learning – 13 / 15
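A minimal PyTorch sketch of the building block behind TCNs, a causal dilated 1D convolution; the channel count, kernel size, and dilation are arbitrary assumptions. Note that sequences of different lengths pass through the same fixed set of weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# One causal dilated convolution layer: pad only on the left so that the
# output at time t never depends on inputs after t.
class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.left_pad, 0))
        return self.conv(x)

# The whole (variable-length) sequence is processed in one call; the number
# of weights does not depend on the sequence length.
layer = CausalConv1d(channels=3, kernel_size=2, dilation=4)
for seq_len in (10, 25):
    x = torch.randn(1, 3, seq_len)
    print(layer(x).shape)                      # same time dimension as the input
```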
CNNS OR RNNS?
Figure: A TCN (we have already seen this in the CNN lecture!) is simply a
variant of the one-dimensional FCN which uses a special type of dilated
convolutions called causal dilated convolutions (Roy, 2019).
Deep Learning – 14 / 15
REFERENCES
Roy, R. (2019, February 4). Temporal Convolutional Networks. Medium.
https://medium.com/@raushan2807/
temporal-convolutional-networks-bfea16e6d7d2
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Kaiser, L., & Polosukhin, I. (2023). Attention Is All You Need.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., &
Bengio, Y. (2016). Show, Attend and Tell: Neural Image Caption Generation with
Visual Attention.
Loye, G. (2019, September 15). Attention mechanism. FloydHub Blog.
https://blog.floydhub.com/attention-mechanism/
Deep Learning – 15 / 15