
Artificial Neural Networks and Deep Learning

- Recurrent Neural Networks -

Matteo Matteucci, PhD ([email protected])


Artificial Intelligence and Robotics Laboratory
Politecnico di Milano
Sequence Modeling

So far we have considered only «static» datasets


[Figure: a feedforward network mapping a single input pattern x = (x_1, …, x_I) through hidden units to outputs g_1(x|w) … g_K(x|w)]
Sequence Modeling

So far we have considered only «static» datasets

[Figure: a sequence of input vectors X_0, X_1, X_2, X_3, …, X_t, each with components x_1 … x_I, unrolled along the time axis]
Sequence Modeling

Different ways to deal with «dynamic» data:

Memoryless models (fixed lag):
• Autoregressive models
• Feedforward neural networks

Models with memory (unlimited):
• Linear dynamical systems
• Hidden Markov models
• Recurrent Neural Networks
• ...

[Figure: the same input sequence X_0, X_1, X_2, X_3, …, X_t over time]
Memoryless Models for Sequences (1/2)

Autoregressive models
• Predict the next input from previous ones using «delay taps»

Linear models with fixed lag
• Predict the next output from previous inputs using «delay taps»

[Figure: delay-tap weights W_{t-2}, W_{t-1} connect the lagged inputs to the next input X_t (first model) or to the outputs Y_0 … Y_t (second model)]
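
To make the «delay taps» idea concrete, here is a minimal NumPy sketch (not from the slides; the function names fit_ar and predict_next are purely illustrative) that fits an order-p autoregressive model by least squares and predicts the next sample from the previous p ones.

```python
import numpy as np

def fit_ar(x, p):
    """Fit an order-p autoregressive model x[t] ~ w0 + sum_k w[k] * x[t-k-1] by least squares."""
    # Build the "delay taps": each row holds the p previous samples of the series.
    X = np.column_stack([x[p - k - 1 : len(x) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(X)), X])      # bias term
    y = x[p:]                                      # targets: the next sample
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_next(x, w):
    """Predict the next sample from the last p observed ones."""
    p = len(w) - 1
    taps = x[-1 : -p - 1 : -1]                     # most recent sample first, as in fit_ar
    return w[0] + taps @ w[1:]

# Toy usage: a noisy sine wave
t = np.arange(500)
x = np.sin(0.1 * t) + 0.05 * np.random.randn(500)
w = fit_ar(x, p=4)
print(predict_next(x, w))                          # should be close to sin(0.1 * 500)
```
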
Memoryless Models for Sequences (2/2)

Feedforward neural networks
• Generalize autoregressive models using non-linear hidden layers

Feedforward neural networks with delays
• Predict the next output from previous inputs and previous outputs using «delay taps»

[Figure: a hidden layer sits between the delayed inputs X_{t-2}, X_{t-1}, X_t (weights W_{t-2}, W_{t-1}, W_t) and the prediction; in the second model the previous outputs Y are also fed back through delay taps (weights V_{t-2}, …)]
Dynamical Systems (Models with Memory)

Generative models with a hidden state which cannot be observed directly:
• The hidden state has some dynamics, possibly affected by noise, and produces the output
• To compute the output we need to infer the hidden state
• Inputs are treated as driving inputs

In linear dynamical systems this becomes:
• The state is continuous, with Gaussian uncertainty
• Transformations are assumed to be linear
• The state can be estimated using Kalman filtering

These are stochastic systems ...

[Figure: hidden states evolve over time, driven by the inputs X_0, X_1, …, X_t and emitting the outputs Y_0, Y_1, …, Y_t]
Dynamical Systems (Models with Memory)

Generative models with a hidden state which cannot be observed directly:
• The hidden state has some dynamics, possibly affected by noise, and produces the output
• To compute the output we need to infer the hidden state
• Inputs are treated as driving inputs

In hidden Markov models this becomes:
• The state is assumed to be discrete; state transitions are stochastic (transition matrix)
• The output is a stochastic function of the hidden states
• The state can be estimated via the Viterbi algorithm

These are stochastic systems ...

[Figure: the same graphical model as in the previous slide]
Recurrent Neural Networks
(deterministic systems ...)

Memory via recurrent connections:
• A distributed hidden state allows information to be stored efficiently
• Non-linear dynamics allow complex updates of the hidden state

"With enough neurons and time, RNNs can compute anything that can be computed by a computer."
(Computation Beyond the Turing Limit, Hava T. Siegelmann, 1995)

[Figure: a recurrent network with hidden units h_j^t(x^t, W^(1), c^(t-1), V^(1)), context units c_b^t(x^t, W_B^(1), c^(t-1), V_B) fed back from the previous step, and output g^t(x|w)]
Recurrent Neural Networks

Memory via recurrent connections:
• A distributed hidden state allows information to be stored efficiently
• Non-linear dynamics allow complex updates of the hidden state

$$g^t(x_n \mid w) = g\!\left( \sum_{j=0}^{J} w_{1j}^{(2)} \cdot h_j^t(\cdot) + \sum_{b=0}^{B} v_{1b}^{(2)} \cdot c_b^t(\cdot) \right)$$

$$h_j^t(\cdot) = h_j\!\left( \sum_{i=0}^{I} w_{ji}^{(1)} \cdot x_{i,n}^t + \sum_{b=0}^{B} v_{jb}^{(1)} \cdot c_b^{t-1} \right)$$

$$c_b^t(\cdot) = c_b\!\left( \sum_{i=0}^{I} w_{bi}^{(1)} \cdot x_{i,n}^t + \sum_{b'=0}^{B} v_{bb'} \cdot c_{b'}^{t-1} \right)$$

[Figure: the recurrent network, with the context units c_1^{t-1} … c_B^{t-1} feeding back into the network at the next time step]
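
The equations above translate almost line by line into code. Below is a minimal NumPy sketch of the forward pass, assuming a zero initial context and omitting the bias terms (index 0 in the sums); the weight names W1, V1, WB, VB, W2, V2 mirror the slide notation, but the implementation itself is only illustrative.

```python
import numpy as np

def rnn_forward(X, params, h_act=np.tanh, c_act=np.tanh, out_act=lambda a: a):
    """Forward pass of the recurrent network sketched above.
    X: (T, I) input sequence; params: dict of weight matrices mirroring the slide notation."""
    W1, V1 = params["W1"], params["V1"]     # hidden units:  inputs -> h,  context -> h
    WB, VB = params["WB"], params["VB"]     # context units: inputs -> c,  context -> c (recurrent loop)
    W2, V2 = params["W2"], params["V2"]     # output:        hidden -> g,  context -> g
    c = np.zeros(WB.shape[0])               # context state c^{t-1}, initialised to zero
    outputs = []
    for t in range(X.shape[0]):
        h_t = h_act(W1 @ X[t] + V1 @ c)     # h_j^t = h_j( sum_i w_ji x_i + sum_b v_jb c_b^{t-1} )
        c_t = c_act(WB @ X[t] + VB @ c)     # c_b^t = c_b( sum_i w_bi x_i + sum_b' v_bb' c_b'^{t-1} )
        y_t = out_act(W2 @ h_t + V2 @ c_t)  # g^t = g( sum_j w_1j h_j^t + sum_b v_1b c_b^t )
        outputs.append(y_t)
        c = c_t                             # the context is fed back at the next step
    return np.array(outputs)

# Toy shapes (I inputs, J hidden, B context, 1 output); biases omitted for brevity.
I, J, B = 3, 5, 4
rng = np.random.default_rng(0)
params = {"W1": rng.normal(size=(J, I)), "V1": rng.normal(size=(J, B)),
          "WB": rng.normal(size=(B, I)), "VB": rng.normal(size=(B, B)),
          "W2": rng.normal(size=(1, J)), "V2": rng.normal(size=(1, B))}
y = rnn_forward(rng.normal(size=(10, I)), params)   # (10, 1) output sequence
```
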
Backpropagation Through Time

[Figure: the recurrent network from the previous slide]
Backpropagation Through Time

[Figure: the network unrolled over several time steps; all the replicated weights at the different time steps should be the same]
Backpropagation Through Time

• Perform network unroll for U steps
• Initialize the W_B, V_B replicas to be the same
• Compute gradients and update the replicas with the average of their gradients

$$W_B = W_B - \eta \cdot \frac{1}{U} \sum_{u=0}^{U-1} \frac{\partial E}{\partial W_B^{t-u}} \qquad\qquad V_B = V_B - \eta \cdot \frac{1}{U} \sum_{u=0}^{U-1} \frac{\partial E}{\partial V_B^{t-u}}$$

[Figure: the unrolled network with the weight replicas V_B^{t-3}, V_B^{t-2}, V_B^{t-1}, V_B^{t}]
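
A toy NumPy sketch of one truncated-BPTT update for a scalar recurrent unit, following the recipe above: unroll for U steps, backpropagate, and update the shared weights with the average of the replica gradients. The model and the names (tbptt_step, w, v, w2) are assumptions for illustration, not the course's code.

```python
import numpy as np

def tbptt_step(x_window, y_target, params, lr=0.01):
    """One truncated-BPTT update on a window of U steps of the toy RNN
    h^t = tanh(w*x^t + v*h^{t-1}),  y = w2*h^U,  E = 0.5*(y - y_target)^2."""
    w, v, w2 = params["w"], params["v"], params["w2"]
    U = len(x_window)
    # ---- forward: unroll the network for U steps (all replicas share w and v) ----
    h = np.zeros(U + 1)                        # h[0] is the initial state
    for u in range(U):
        h[u + 1] = np.tanh(w * x_window[u] + v * h[u])
    err = w2 * h[U] - y_target
    # ---- backward: compute the gradient of each replica, then average ----
    dw_replicas, dv_replicas = np.zeros(U), np.zeros(U)
    dh = err * w2                              # dE/dh^U
    for u in reversed(range(U)):
        da = dh * (1.0 - h[u + 1] ** 2)        # through the tanh
        dw_replicas[u] = da * x_window[u]      # gradient w.r.t. the replica of w at step u
        dv_replicas[u] = da * h[u]             # gradient w.r.t. the replica of v at step u
        dh = da * v                            # propagate to the previous time step
    params["w"]  -= lr * dw_replicas.mean()    # update the shared weight with the averaged gradient
    params["v"]  -= lr * dv_replicas.mean()
    params["w2"] -= lr * err * h[U]
    return 0.5 * err ** 2

# Toy usage
params = {"w": 0.5, "v": 0.9, "w2": 1.0}
loss = tbptt_step(np.random.randn(20), y_target=1.0, params=params)
```
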
How much should we go back in time?

Sometimes the output might be related to some input that happened quite a long time before:

Jane walked into the room. John walked in too.
It was late in the day. Jane said hi to <???>

However, backpropagation through time was not able to train recurrent neural networks significantly back in time ...

This was due to not being able to backprop through many layers ...

[Figure: the recurrent network from the previous slides]
How much can we go back in time?

To better understand why it was not working consider a simplified case:


ℎ𝑡 = ℎ(𝑣 1
⋅ ℎ𝑡−1 + 𝑤 1
⋅ 𝑥) 𝑦 𝑡 = 𝑔(𝑤 2 ⋅ ℎ𝑡 )
𝑥

Backpropagation over an entire sequence 𝑆 is computed as


𝑆 𝑡 𝑡 𝑡
𝜕𝐸 𝜕𝐸 𝑡 𝜕𝐸 𝑡 𝜕𝐸 𝑡 𝜕𝑦 𝑡 𝜕ℎ𝑡 𝜕ℎ𝑘 𝜕ℎ𝑡 𝜕ℎ𝑖 1 1
=෍ =෍ 𝑡 𝑡 𝑘 = ෑ = ෑ 𝑣 ℎ′ 𝑣 ⋅ ℎ𝑖−1 + 𝑤 (1) ⋅ 𝑥
𝜕𝑤 𝜕𝑤 𝜕𝑤 𝜕𝑦 𝜕ℎ 𝜕ℎ 𝜕𝑤 𝜕ℎ𝑘 𝜕ℎ𝑖−1
𝑡=1 𝑡=1 𝑖=𝑘+1 𝑖=𝑘+1

If 𝛾𝑣 ⋅ 𝛾ℎ′ < 1this


If we consider the norm of these terms converges to 0 ...
𝜕ℎ𝑖 1
𝜕ℎ𝑡 𝑡−𝑘
≤ 𝑣 ℎ′ ⋅ 𝑘
≤ 𝛾𝑣 ⋅ 𝛾ℎ′
𝜕ℎ𝑖−1 𝜕ℎ
With Sigmoids and Tanh we
have vanishing gradients
15
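
A small numerical illustration of the bound above (my own sketch, not from the slides): for a scalar tanh RNN the factor ∂h^t/∂h^k is the product of per-step terms v·h'(a_i), and it collapses geometrically towards zero when γ_v·γ_{h'} < 1.

```python
import numpy as np

rng = np.random.default_rng(0)
v, w = 0.9, 0.5                       # recurrent and input weights of the toy RNN
x = rng.normal(size=100)
h = 0.0
grad = 1.0                            # running product  d h^t / d h^0
norms = []
for t in range(len(x)):
    a = v * h + w * x[t]
    h = np.tanh(a)
    grad *= v * (1.0 - h ** 2)        # one factor v * tanh'(a) per time step
    norms.append(abs(grad))

print(norms[0], norms[9], norms[49])  # |dh^t/dh^0| shrinks towards 0 as t grows
```
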
Which Activation Function?

Sigmoid activation function:

$$g(a) = \frac{1}{1 + \exp(-a)} \qquad g'(a) = g(a)\,(1 - g(a)) \qquad g'(0) = g(0)\,(1 - g(0)) = \frac{1}{1 + \exp(0)} \cdot \frac{\exp(0)}{1 + \exp(0)} = 0.25$$

Tanh activation function:

$$g(a) = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)} \qquad g'(a) = 1 - g(a)^2 \qquad g'(0) = 1 - g(0)^2 = 1 - \left(\frac{\exp(0) - \exp(0)}{\exp(0) + \exp(0)}\right)^2 = 1$$
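
A quick numerical check of the two derivative values at 0, using central finite differences (purely illustrative):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
eps = 1e-6
print((sigmoid(eps) - sigmoid(-eps)) / (2 * eps))   # ~0.25, matches g'(0) for the sigmoid
print((np.tanh(eps) - np.tanh(-eps)) / (2 * eps))   # ~1.0,  matches g'(0) for tanh
```
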
Dealing with Vanishing Gradient

Force all gradients to be either 0 or 1

𝑔 𝑎 = 𝑅𝑒𝐿𝑢 𝑎 = max 0, 𝑎
𝑔′ 𝑎 = 1𝑎>0

Build Recurrent Neural Networks using small modules that are designed
to remember values for a long time.
ℎ𝑡 = 𝑣 (1) ℎ𝑡−1 + 𝑤 (1) 𝑥 𝑦 𝑡 = 𝑔(𝑤 2
⋅ ℎ𝑡 )
𝑥

𝑣 (1) = 1
It only accumulates
the input ...
17
Long Short-Term Memories (LSTM)

Hochreiter & Schmidhuber (1997) solved the problem of vanishing


gradient designing a memory cell using logistic and linear units with
multiplicative interactions:

• Information gets into the cell


whenever its “write” gate is on.
• The information stays in the cell
so long as its “keep” gate is on.
• Information is read from the cell
by turning on its “read” gate.
Can backpropagate
through this since the
loop has fixed weight.
18
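
Below is a minimal NumPy sketch of a single LSTM step, written in the now-standard formulation where the slide's "write", "keep" and "read" gates are the input, forget and output gates. The weight names and sizes are illustrative assumptions; biases initialisation tricks and peephole connections are omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: 'write' = input gate i, 'keep' = forget gate f, 'read' = output gate o."""
    z = np.concatenate([x, h_prev])        # gates see the input and the previous hidden state
    i = sigmoid(p["Wi"] @ z + p["bi"])     # write: how much new information enters the cell
    f = sigmoid(p["Wf"] @ z + p["bf"])     # keep:  how much of the old cell state is retained
    o = sigmoid(p["Wo"] @ z + p["bo"])     # read:  how much of the cell is exposed as output
    g = np.tanh(p["Wg"] @ z + p["bg"])     # candidate values to write into the cell
    c = f * c_prev + i * g                 # cell state: gradients flow through the "+" self-loop
    h = o * np.tanh(c)                     # hidden state read out of the cell
    return h, c

# Toy usage with random weights (n inputs, m cell units)
n, m = 4, 8
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=(m, n + m)) for k in ("Wi", "Wf", "Wo", "Wg")}
p.update({k: np.zeros(m) for k in ("bi", "bf", "bo", "bg")})
h, c = np.zeros(m), np.zeros(m)
for x in rng.normal(size=(10, n)):         # run the cell over a length-10 sequence
    h, c = lstm_step(x, h, c, p)
```
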
RNN vs. LSTM

[Figures: the repeating module of a standard RNN vs. that of an LSTM. Images from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Long Short-Term Memory

[Figure: the LSTM cell. Images from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Long Short-Term Memory

Input gate

[Figure: the LSTM input gate. Images from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Long Short-Term Memory

Forget gate

[Figure: the LSTM forget gate. Images from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Long Short-Term Memory

Memory gate

[Figure: the LSTM memory gate (cell state update). Images from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Long Short-Term Memory

Output gate

[Figure: the LSTM output gate. Images from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
Gated Recurrent Unit (GRU)

It combines the forget and input gates into a single “update gate.” It also
merges the cell state and hidden state, and makes some other changes.

[Figure: the GRU cell. Images from: https://colah.github.io/posts/2015-08-Understanding-LSTMs/]
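
For comparison, here is a minimal NumPy sketch of one GRU step (following the convention of the cited blog post; names are illustrative): a single update gate z interpolates between the old state and a candidate state, and the cell state and hidden state are merged into h.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU step: the update gate z replaces the LSTM's input and forget gates."""
    zx = np.concatenate([x, h_prev])
    z = sigmoid(p["Wz"] @ zx + p["bz"])             # update gate: old state vs. new candidate
    r = sigmoid(p["Wr"] @ zx + p["br"])             # reset gate: how much past state enters the candidate
    h_tilde = np.tanh(p["Wh"] @ np.concatenate([x, r * h_prev]) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde         # merged state update

# Toy usage (n inputs, m units)
n, m = 4, 8
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=(m, n + m)) for k in ("Wz", "Wr", "Wh")}
p.update({k: np.zeros(m) for k in ("bz", "br", "bh")})
h = np.zeros(m)
for x in rng.normal(size=(10, n)):
    h = gru_step(x, h, p)
```
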
LSTM Networks

You can build a computation graph with continuous transformations.

[Figure: a chain of hidden (LSTM) states unrolled over time, mapping the inputs X_0, X_1, …, X_t to the outputs Y_0, Y_1, …, Y_t]
Multiple Layers and Bidirectional LSTM Networks

A computation graph in time with continuous transformations.

[Figure: stacked LSTM layers followed by ReLU layers build a hierarchical representation of the sequence, mapping the inputs X_0, X_1, …, X_t to the outputs Y_0, Y_1, …, Y_t]
Tips & Tricks

When conditioning on the full input sequence, bidirectional RNNs can exploit it (see the sketch below):
• Have one RNN traverse the sequence left-to-right
• Have another RNN traverse the sequence right-to-left
• Use the concatenation of their hidden layers as the feature representation
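
A hedged Keras sketch of this pattern (the layer sizes and the sequence-level sigmoid readout are my own illustrative choices): tf.keras.layers.Bidirectional runs one LSTM left-to-right and one right-to-left and, by default, concatenates their hidden states.

```python
import tensorflow as tf

n_features = 16                                           # illustrative input dimensionality
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, n_features)),      # variable-length sequences
    # One LSTM reads left-to-right, a second reads right-to-left;
    # their hidden states are concatenated (default merge_mode="concat").
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(1, activation="sigmoid"),        # e.g. a sequence-level label
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```
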
Multiple Layers and Bidirectional LSTM Networks

A computation graph in time with continuous transformations.

[Figure: left, stacked LSTM and ReLU layers build a hierarchical representation; right, the same stack also processes the sequence in the reverse direction X_t, X_{t-1}, …, X_0 (bidirectional processing)]
Tips & Tricks

When conditioning on the full input sequence, bidirectional RNNs can exploit it:
• Have one RNN traverse the sequence left-to-right
• Have another RNN traverse the sequence right-to-left
• Use the concatenation of their hidden layers as the feature representation

When initializing an RNN we need to specify the initial state (see the sketch below):
• We could initialize it to a fixed value (such as 0)
• It is better to treat the initial state as learned parameters:
  • Start off with random guesses of the initial state values
  • Backpropagate the prediction error through time all the way to the initial state values and compute the gradient of the error with respect to these
  • Update these parameters by gradient descent
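
A hedged Keras sketch of the "learned initial state" idea (the class name, sizes and regression readout are illustrative assumptions): the initial state is a trainable variable, gradients reach it through backpropagation through time, and the optimizer updates it like any other weight.

```python
import tensorflow as tf

class RNNWithLearnedInit(tf.keras.Model):
    """Sketch: the initial hidden state is a trainable variable updated by gradient descent."""
    def __init__(self, hidden_size=32):
        super().__init__()
        self.rnn = tf.keras.layers.SimpleRNN(hidden_size)
        # Random guess of the initial state values, refined during training.
        self.h0 = tf.Variable(tf.random.normal([1, hidden_size], stddev=0.01),
                              trainable=True, name="learned_initial_state")
        self.readout = tf.keras.layers.Dense(1)

    def call(self, x):                                    # x: (batch, time, features)
        batch = tf.shape(x)[0]
        h0 = tf.tile(self.h0, [batch, 1])                 # share the learned state across the batch
        h_last = self.rnn(x, initial_state=[h0])          # BPTT carries the error back into h0
        return self.readout(h_last)

model = RNNWithLearnedInit()
model.compile(optimizer="adam", loss="mse")
# model.fit(x_train, y_train, ...)   # x_train: (batch, time, features) -- placeholder data
```
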
Sequential Data Problems

• Fixed-sized input to fixed-sized output (e.g., image classification)
• Sequence output (e.g., image captioning takes an image and outputs a sentence of words)
• Sequence input (e.g., sentiment analysis, where a given sentence is classified as expressing positive or negative sentiment)
• Sequence input and sequence output (e.g., machine translation: an RNN reads a sentence in English and then outputs a sentence in French)
• Synced sequence input and output (e.g., video classification, where we wish to label each frame of the video)

Images credits: Andrej Karpathy
Sequence to Sequence Learning Examples (1/3)

Image Captioning: input a single image and get a sequence of words as output which describes it. The image has a fixed size, but the output has varying length.
Sequence to Sequence Learning Examples (2/3)

Sentiment Classification/Analysis: input a sequence of characters or words, e.g., a tweet, and classify the sequence into positive or negative sentiment. The input has varying length; the output is of a fixed type and size.
Sequence to Sequence Learning Examples (3/3)

Language Translation: given some text in a particular language, e.g., English, we wish to translate it into another, e.g., French. Each language has its own semantics, and the same sentence may have different lengths in different languages.
