
Lecture 10: Recurrent Neural Networks
Instructor: Jackie CK Cheung & David Adelani
COMP-550
Primer by Yoav Goldberg:
https://arxiv.org/abs/1510.00726
Eisenstein, Section 7.6
J&M Chapter 9 (3rd ed)
Reminders
• RA1 is due Friday, Oct 11
• Week of Oct 14: Thanksgiving & Fall reading break
• Tutorials on RNNs!
• Check Ed for schedule

2
Outline
Review of LC-CRF
Review of neural networks and deep learning
Recurrent neural networks
Long short-term memory networks
LSTM-CRFs

3
Discriminative Sequence Model
The parallel to an HMM in the discriminative case:
linear-chain conditional random fields (linear-chain
CRFs) (Lafferty et al., 2001)

$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_t \sum_k \theta_k \, f_k(y_t, y_{t-1}, x_t) \right)$$

(outer sum over all time-steps $t$, inner sum over all features $k$)

Z(X) is a normalization constant:

$$Z(X) = \sum_{\mathbf{y}} \exp\left( \sum_t \sum_k \theta_k \, f_k(y_t, y_{t-1}, x_t) \right)$$

(sum over all possible sequences of hidden states $\mathbf{y}$)


4
Features in CRFs
Standard HMM probabilities as CRF features:
• Transition from state DT to state NN
  $f_{DT \to NN}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_{t-1} = DT)\,\mathbf{1}(y_t = NN)$
• Emit the from state DT
  $f_{DT \to the}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_t = DT)\,\mathbf{1}(x_t = the)$
• Initial state is DT
  $f_{DT}(y_1, x_1) = \mathbf{1}(y_1 = DT)$

Indicator function:
Let $\mathbf{1}(condition) = \begin{cases} 1 & \text{if } condition \text{ is true} \\ 0 & \text{otherwise} \end{cases}$

5
Features in CRFs
Additional features that may be useful:
• Word is capitalized
  $f_{cap}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_t = ?)\,\mathbf{1}(x_t \text{ is capitalized})$
• Word ends in –ed
  $f_{ed}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_t = ?)\,\mathbf{1}(x_t \text{ ends with } ed)$

• Let's brainstorm and propose more features:
  • $f_{short}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_t = ?)\,\mathbf{1}(\mathrm{len}(x_t) < 5)$
  • $f_{first}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_t = ?)\,\mathbf{1}(t = 1)$
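For concreteness, here is a minimal Python sketch of indicator-style feature functions like the ones above. The function names and the specific tags (e.g. NNP, VBD) in the last two are illustrative assumptions, not part of the lecture:

# Indicator-style CRF feature functions (illustrative sketch).

def f_dt_to_nn(y_t, y_prev, x_t):
    """Transition feature: previous tag DT and current tag NN."""
    return 1 if (y_prev == "DT" and y_t == "NN") else 0

def f_dt_emits_the(y_t, y_prev, x_t):
    """Emission feature: current tag DT and current word 'the'."""
    return 1 if (y_t == "DT" and x_t.lower() == "the") else 0

def f_cap(y_t, y_prev, x_t, tag="NNP"):
    """Orthographic feature: word is capitalized and tag is (say) NNP."""
    return 1 if (y_t == tag and x_t[:1].isupper()) else 0

def f_ends_ed(y_t, y_prev, x_t, tag="VBD"):
    """Suffix feature: word ends in -ed and tag is (say) VBD."""
    return 1 if (y_t == tag and x_t.endswith("ed")) else 0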

6
Inference with LC-CRFs
Dynamic programming still works – modify the forward
and the Viterbi algorithms to work with the weight-
feature products.

                     HMM                                LC-CRF
Forward algorithm    $P(X \mid \theta)$                 $Z(X)$
Viterbi algorithm    $\arg\max_Y P(X, Y \mid \theta)$   $\arg\max_Y P(Y \mid X, \theta)$
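As an illustration of the modified Viterbi algorithm, here is a minimal NumPy sketch that decodes over per-timestep score matrices. The emit/trans layout (holding weight–feature sums rather than log probabilities) is an assumption for illustration:

# Viterbi decoding over additive scores (sketch).
import numpy as np

def viterbi(emit, trans):
    """emit[t][y]: score of tag y at position t; trans[y_prev][y]: transition score."""
    n, k = emit.shape
    delta = np.full((n, k), -np.inf)      # best score ending in each tag
    back = np.zeros((n, k), dtype=int)    # backpointers
    delta[0] = emit[0]
    for t in range(1, n):
        cand = delta[t - 1][:, None] + trans + emit[t][None, :]   # (k, k)
        back[t] = cand.argmax(axis=0)
        delta[t] = cand.max(axis=0)
    # Follow backpointers from the best final tag.
    y = [int(delta[-1].argmax())]
    for t in range(n - 1, 0, -1):
        y.append(int(back[t][y[-1]]))
    return y[::-1]

Each timestep considers every (previous tag, current tag) pair, so decoding takes O(nk²) time, polynomial as the comparison slide later notes.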

7
Gradient of Log-Likelihood
Find the gradient of the log likelihood of the training corpus:

$$\ell(\theta) = \log \prod_i P(Y^{(i)} \mid X^{(i)})$$

Here it is:

$$\frac{\partial \ell}{\partial \theta_k} = \sum_i \sum_t f_k\big(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\big) \;-\; \sum_i \sum_t \sum_{y, y'} f_k\big(y, y', x_t^{(i)}\big)\, P\big(y, y' \mid X^{(i)}\big)$$

Derivation on the next slide.

8
(Artificial) Neural Networks
A kind of learning model which automatically learns
non-linear functions from input to output
Biologically inspired metaphor:
• Network of computational units called neurons
• Each neuron takes scalar inputs, and produces a scalar
output, very much like a logistic regression model
$$\mathrm{Neuron}(\vec{x}) = g(a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b)$$

As a whole, the network can theoretically compute any computable function, given enough neurons. (These notions can be formalized.)
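A tiny sketch of the neuron equation above, with g chosen as the logistic sigmoid (one possible choice of non-linearity):

import numpy as np

def neuron(x, a, b, g=lambda z: 1 / (1 + np.exp(-z))):
    """Weighted sum of the inputs plus a bias, passed through g."""
    return g(np.dot(a, x) + b)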

9
Feedforward Neural Networks
All connections flow forward (no loops); each layer of
hidden units is fully connected to the next.

Figure from Goldberg (2015)

10
Inference in a FF Neural Network
Perform computations forwards through the graph:

$$\mathbf{h}_1 = g_1(\mathbf{x}\,\mathbf{W}_1 + \mathbf{b}_1)$$
$$\mathbf{h}_2 = g_2(\mathbf{h}_1\,\mathbf{W}_2 + \mathbf{b}_2)$$
$$\mathbf{y} = \mathbf{h}_2\,\mathbf{W}_3$$

Note that we are now representing each layer as a vector, combining all of the weights in a layer across the units into a weight matrix.
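A minimal NumPy sketch of this forward pass; the layer sizes and random weights are illustrative assumptions:

import numpy as np

def g(z):                         # non-linearity, e.g. tanh
    return np.tanh(z)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                 # input row vector
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3 = rng.normal(size=(8, 3))                # 3 output classes

h1 = g(x @ W1 + b1)    # first hidden layer
h2 = g(h1 @ W2 + b2)   # second hidden layer
y = h2 @ W3            # output scores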

11
Training Neural Networks
Typically done by stochastic gradient descent
• For one training example, find gradient of loss function
wrt parameters of the network (i.e., the weights of each
layer); “travel along in that direction”.
Network has very many parameters!
Efficient algorithm to compute the gradient with
respect to all parameters: backpropagation (Rumelhart
et al., 1986)
• Boils down to an efficient way to use the chain rule of
derivatives to propagate the error signal from the loss
function backwards through the network back to the
inputs

12
SGD Overview
Inputs:
• Function computed by neural network, $f(\mathbf{x}; \theta)$
• Training samples $\{(\mathbf{x}^k, \mathbf{y}^k)\}$
• Loss function $L$

Repeat for a while:
  Sample a training case $(\mathbf{x}^k, \mathbf{y}^k)$
  Compute loss $L(f(\mathbf{x}^k; \theta), \mathbf{y}^k)$   [forward pass]
  Compute gradient $\nabla L(\mathbf{x}^k)$ wrt the parameters $\theta$   [in neural networks, by backpropagation]
  Update $\theta \leftarrow \theta - \eta\,\nabla L(\mathbf{x}^k)$
Return $\theta$
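Here is a sketch of the same loop in PyTorch; the framework choice and the layout of data as a list of (x, y) pairs are assumptions, since the slide itself is framework-agnostic:

import torch

def sgd_train(model, data, loss_fn, lr=0.1, steps=1000):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for step in range(steps):
        x_k, y_k = data[step % len(data)]    # sample a training case
        y_hat = model(x_k)                   # forward pass
        loss = loss_fn(y_hat, y_k)           # compute loss L(f(x; θ), y)
        opt.zero_grad()
        loss.backward()                      # gradient via backpropagation
        opt.step()                           # θ ← θ − η ∇L
    return model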

13
Example: Forward Pass
$$\mathbf{h}_1 = g_1(\mathbf{x}\,\mathbf{W}_1 + \mathbf{b}_1)$$
$$\mathbf{h}_2 = g_2(\mathbf{h}_1\,\mathbf{W}_2 + \mathbf{b}_2)$$
$$f(\mathbf{x}) = \mathbf{y} = g_3(\mathbf{h}_2) = \mathbf{h}_2\,\mathbf{W}_3$$

Loss function: $L(\mathbf{y}, \mathbf{y}_{gold})$

Save the values for $\mathbf{h}_1$, $\mathbf{h}_2$, $\mathbf{y}$ too!

14
Example: Time Delay Neural Network
Let’s draw a neural network architecture for POS
tagging using a feedforward neural network.
We’ll construct a context window around each word,
and predict the POS tag of that word as the output.

Limitations of this approach?

15
TDNN POS Tagger
(Figure: the tag $Q_t$ is predicted from the context window of words $w_{t-1}, w_t, w_{t+1}$.)
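A sketch of this windowed-tagger idea in PyTorch; the embedding and hidden sizes are illustrative assumptions:

import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    def __init__(self, vocab_size, n_tags, emb_dim=50, hidden=100, window=1):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        width = 2 * window + 1                      # w_{t-1}, w_t, w_{t+1}
        self.ff = nn.Sequential(
            nn.Linear(width * emb_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_tags),
        )

    def forward(self, window_ids):                  # (batch, width)
        e = self.emb(window_ids)                    # (batch, width, emb_dim)
        return self.ff(e.flatten(start_dim=1))      # tag scores for w_t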

16
Recurrent Neural Networks
A neural network sequence model:
$$\mathrm{RNN}(\mathbf{s}_0, \mathbf{x}_{1:n}) = \mathbf{s}_{1:n},\, \mathbf{y}_{1:n}$$
$$\mathbf{s}_i = R(\mathbf{s}_{i-1}, \mathbf{x}_i) \qquad (\mathbf{s}_i: \text{state vector})$$
$$\mathbf{y}_i = O(\mathbf{s}_i) \qquad\qquad\;\; (\mathbf{y}_i: \text{output vector})$$
$R$ and $O$ are parts of the neural network that compute the next state vector and the output vector.
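A minimal NumPy sketch of this recurrence for a simple (Elman-style) RNN, where R is a single tanh layer and O is a linear layer; the parameter shapes are assumptions:

import numpy as np

def rnn(s0, xs, W_s, W_x, b, W_o):
    """Apply R and O over the input sequence, returning all states and outputs."""
    states, outputs, s = [], [], s0
    for x in xs:
        s = np.tanh(s @ W_s + x @ W_x + b)   # s_i = R(s_{i-1}, x_i)
        y = s @ W_o                          # y_i = O(s_i)
        states.append(s)
        outputs.append(y)
    return states, outputs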

17
Long-Term Dependencies in Language
There can be dependencies between words that are
arbitrarily far apart.
• I will look the word that you have described that doesn’t
make sense to her up.
• Can you think of some other examples in English of long-
range dependencies?

Cannot easily model with HMMs or even LC-CRFs, but can with RNNs.

18
Comparing LC-CRFs and RNNs
                       LC-CRFs                              RNNs

At each timestep       Linear model (linear between          A neural network (more
                       feature values and weights),          expressive model, may need
                       similar to logistic regression        more data to train)

Feature engineering    Necessary for performance             Less critical; the neural
                                                             network can learn useful
                                                             features given enough
                                                             labelled data

Inference properties   Exact polynomial-time inference       Approximate inference only
                       due to strong local                   (greedy or beam search)
                       independence assumptions
21
Different RNN Architectures
Different architectures for different use cases

Which architecture would you use for:


• document classification?
• language modelling?
• POS tagging?
22
Long Short-Term Memory Networks
Popular RNN architecture for NLP
(Hochreiter and Schmidhuber, 1997)
Model includes a “memory” cell
• A vector of weights in a hidden layer
• Learned operations to:
• Forget old information
• Extract new information
• Integrate relevant information into memory
• Predict output at current timestep

23
Basic Operations / Components
Masking
• Multiply a vector by numbers between 0 and 1 to model
forgetting vs. retaining information
• Ingredients: Sigmoid function, component-wise
multiplication

Integrating information
• Done via component-wise addition and concatenation

Non-linearities:
• Sigmoid function (output between 0 and 1)
• Tanh function (output between -1 and 1)
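A minimal NumPy sketch of how these ingredients combine in one LSTM step; the stacked-parameter layout and shapes are assumptions for compactness:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep. W (4h x d), U (4h x h), b (4h) hold the stacked gate parameters."""
    z = W @ x + U @ h_prev + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates (masks) in (0, 1)
    g = np.tanh(g)                                 # candidate values in (-1, 1)
    c = f * c_prev + i * g        # forget old info, integrate new info
    h = o * np.tanh(c)            # expose (masked) memory as the output
    return h, c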

24
Visual step-by-step explanation
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

25
Vanishing and Exploding Gradients
If 𝑅 and 𝑂 are simple fully connected layers, we have a
problem. In the unrolled network, the gradient signal
can get lost on its way back to the words far in the past:
• Suppose it is $\mathbf{W}_1$ that we want to modify, and there are $N$ layers between that and the loss function.

$$\frac{\partial L}{\partial \mathbf{W}_1} = \frac{\partial L}{\partial g_N}\, \frac{\partial g_N}{\partial g_{N-1}} \cdots \frac{\partial g_2}{\partial g_1}\, \frac{\partial g_1}{\partial \mathbf{W}_1}$$

• If the gradient norms are small (<1), the gradient will vanish to near-zero (or explode to near-infinity if >1)
• This happens especially because we have repeated
applications of the same weight matrices in the
recurrence
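A tiny numerical illustration of this point: multiplying many per-layer factors with norm below (or above) 1 shrinks (or blows up) the overall gradient signal.

import numpy as np

factors_small = np.full(50, 0.9)   # 50 layers, each factor 0.9
factors_large = np.full(50, 1.1)   # 50 layers, each factor 1.1
print(np.prod(factors_small))      # ~0.005 -> vanishing
print(np.prod(factors_large))      # ~117   -> exploding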

26
Fix for Vanishing Gradients
In LSTMs, we can propagate a cell state directly, fixing
the vanishing gradient problem:

There is no repeated weight application between the internal states across time!

27
Bidirectional LSTMs – Motivation
Standard LSTMs go forward in time

For some applications, this is necessary


For others, we would like to use both past context and
future context

28
BiLSTMs
Have two LSTM layers, forward and backward in time

Concatenate their outputs to make final prediction
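A sketch of this in PyTorch; the sizes are illustrative, and bidirectional=True runs the forward and backward LSTMs and concatenates their outputs:

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, emb_dim=100, hidden=128, n_tags=17):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)   # concatenated outputs

    def forward(self, embedded):          # (batch, seq_len, emb_dim)
        h, _ = self.lstm(embedded)        # (batch, seq_len, 2*hidden)
        return self.out(h)                # per-token tag scores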

29
Combining LSTMs and CRFs
LSTMs allow learning of complex relationships between
input and output
(LC-)CRFs allow learning of relationships between
output labels in a sequence

Can we combine advantages of both?

30
LSTM-CRFs
Add a linear-chain CRF layer on top of a (Bi)LSTM
(Huang et al., 2015; Lample et al., 2016)!

31
LSTM-CRF
Features of the CRF:
• The output scores of the LSTM
• Transition probabilities between tags (need to be learned)
Score of a sequence in LSTM-CRF:

(Figure: the score combines the tag-transition scores, a matrix of size (k × k), with the scores from the LSTM, a matrix of size (n × k), for a sequence of length n with k possible tags.)
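As a concrete sketch of this score, following the standard formulation in Lample et al. (2016) cited on the previous slide:

$$s(X, \mathbf{y}) = \sum_{t=2}^{n} A_{y_{t-1}, y_t} + \sum_{t=1}^{n} P_{t, y_t}$$

where $A$ is the $(k \times k)$ matrix of tag-transition scores and $P$ is the $(n \times k)$ matrix of per-token scores output by the (Bi)LSTM.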

32
LSTM-CRF: Inference and Training
The LSTM-CRF is trained to minimize negative log
likelihood on the training corpus (same as regular CRF):
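Concretely (a sketch using the sequence score $s(X, \mathbf{y})$ above), the per-sentence loss is:

$$-\log P(\mathbf{y} \mid X) = -\,s(X, \mathbf{y}) + \log \sum_{\mathbf{y}'} \exp s(X, \mathbf{y}')$$

where the sum over all possible tag sequences $\mathbf{y}'$ is computed efficiently with the forward algorithm, as in a regular LC-CRF.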

33
Summary
LSTMs are the backbone of many modern NLP models!
LSTM-CRFs perform very well on many basic sequence labelling tasks:
• Named entity recognition
• POS tagging
• Chunking
• Competitive with Transformers without language
modelling pre-training.
Next, we will start looking at models of hierarchical
structure in language.

34
