Lecture 10: Long Short-Term Memory Networks
Instructors: Jackie CK Cheung & David Adelani
COMP-550
Primer by Yoav Goldberg:
https://arxiv.org/abs/1510.00726
Eisenstein, Section 7.6
J&M Chapter 9 (3rd ed)
Reminders
• RA1 is due Friday, Oct 11
• Week of Oct 14: Thanksgiving & Fall reading break
• Tutorials on RNNs!
• Check Ed for schedule
2
Outline
Review of LC-CRF
Review of neural networks and deep learning
Recurrent neural networks
Long short-term memory networks
LSTM-CRFs
3
Discriminative Sequence Model
The parallel to an HMM in the discriminative case:
linear-chain conditional random fields (linear-chain
CRFs) (Lafferty et al., 2001)
$$P(Y \mid X) = \frac{1}{Z(X)} \exp\left(\sum_i \sum_k \theta_k\, f_k(y_i, y_{i-1}, x_i)\right)$$
where the outer sum is over all time-steps $i$ and the inner sum is over all features $k$.
Z(X) is a normalization constant:
$$Z(X) = \sum_{\mathbf{y}} \exp\left(\sum_i \sum_k \theta_k\, f_k(y_i, y_{i-1}, x_i)\right)$$
Indicator function: let $\mathbf{1}(condition) = 1$ if $condition$ is true, and $0$ otherwise.
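To make these two formulas concrete, here is a minimal brute-force sketch in Python (the feature functions, weights, and tag set are hypothetical, and real implementations compute $Z(X)$ with dynamic programming rather than enumerating every tag sequence):

```python
import itertools
import math

def score(y_seq, x_seq, features, theta):
    # Sum over time-steps i and features k of theta_k * f_k(y_i, y_{i-1}, x_i).
    total = 0.0
    for i in range(len(x_seq)):
        y_prev = y_seq[i - 1] if i > 0 else "<s>"
        for k, f_k in enumerate(features):
            total += theta[k] * f_k(y_seq[i], y_prev, x_seq[i])
    return total

def crf_probability(y_seq, x_seq, features, theta, tagset):
    # P(Y|X) = exp(score(Y, X)) / Z(X), with Z(X) summed over all tag sequences.
    numerator = math.exp(score(y_seq, x_seq, features, theta))
    Z = sum(math.exp(score(list(y), x_seq, features, theta))
            for y in itertools.product(tagset, repeat=len(x_seq)))
    return numerator / Z
```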
5
Features in CRFs
Additional features that may be useful
• Word is capitalized
$f_{\mathrm{cap}}(y_i, y_{i-1}, x_i) = \mathbf{1}(y_i = \,?\,)\,\mathbf{1}(x_i \text{ is capitalized})$
• Word ends in –ed
$f_{\mathrm{ed}}(y_i, y_{i-1}, x_i) = \mathbf{1}(y_i = \,?\,)\,\mathbf{1}(x_i \text{ ends with } ed)$
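These two features might be written as follows (a sketch; the tags NNP and VBD stand in for the "?" above):

```python
def f_cap(y_i, y_prev, x_i):
    # Fires when the current word is capitalized and tagged as a proper noun (NNP).
    return 1 if y_i == "NNP" and x_i[:1].isupper() else 0

def f_ed(y_i, y_prev, x_i):
    # Fires when the current word ends in -ed and is tagged as a past-tense verb (VBD).
    return 1 if y_i == "VBD" and x_i.endswith("ed") else 0
```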
6
Inference with LC-CRFs
Dynamic programming still works – modify the forward
and the Viterbi algorithms to work with the weight-
feature products.
[Figure: side-by-side comparison of the HMM and LC-CRF formulations]
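A minimal Viterbi sketch for an LC-CRF, where the HMM's log-probabilities are replaced by weight-feature products (the tag set, feature functions, and weights are assumed to be defined as in the earlier sketches):

```python
def viterbi_crf(x_seq, tagset, features, theta):
    # Local score of tag y at position i, given the previous tag.
    def local(y, y_prev, x_i):
        return sum(theta[k] * f_k(y, y_prev, x_i) for k, f_k in enumerate(features))

    # delta[i][y]: best score of any partial tag sequence ending in y at position i.
    delta = [{y: local(y, "<s>", x_seq[0]) for y in tagset}]
    backptr = [{}]
    for i in range(1, len(x_seq)):
        delta.append({})
        backptr.append({})
        for y in tagset:
            prev = max(tagset, key=lambda yp: delta[i - 1][yp] + local(y, yp, x_seq[i]))
            delta[i][y] = delta[i - 1][prev] + local(y, prev, x_seq[i])
            backptr[i][y] = prev

    # Recover the best sequence by following back-pointers from the best final tag.
    best = max(tagset, key=lambda y: delta[-1][y])
    path = [best]
    for i in range(len(x_seq) - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))
```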
7
Gradient of Log-Likelihood
Find the gradient of the log likelihood of the training corpus:
Here it is:
$$\frac{\partial \ell}{\partial \theta_k} = \sum_j \sum_i f_k\big(y_i^{(j)}, y_{i-1}^{(j)}, x_i^{(j)}\big) \;-\; \sum_j \sum_i \sum_{y, y'} f_k\big(y, y', x_i^{(j)}\big)\, P\big(y, y' \mid X^{(j)}\big)$$
where $j$ indexes training examples and $i$ indexes time-steps: the observed feature counts minus the expected feature counts under the model.
8
(Artificial) Neural Networks
A kind of learning model which automatically learns
non-linear functions from input to output
Biologically inspired metaphor:
• Network of computational units called neurons
• Each neuron takes scalar inputs, and produces a scalar
output, very much like a logistic regression model
$\mathrm{Neuron}(\vec{x}) = g(a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b)$
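A single neuron in a few lines of Python, with $g$ chosen to be the sigmoid (the weights and bias are illustrative):

```python
import math

def neuron(x, a, b):
    # Weighted sum of the inputs plus a bias, passed through a non-linearity g.
    z = sum(a_j * x_j for a_j, x_j in zip(a, x)) + b
    return 1.0 / (1.0 + math.exp(-z))  # g = sigmoid, as in logistic regression
```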
9
Feedforward Neural Networks
All connections flow forward (no loops); each layer of
hidden units is fully connected to the next.
10
Inference in a FF Neural Network
Perform computations forwards
through the graph:
$\mathbf{h_1} = g_1(\mathbf{x}\mathbf{W}^1 + \mathbf{b}^1)$
$\mathbf{h_2} = g_2(\mathbf{h_1}\mathbf{W}^2 + \mathbf{b}^2)$
$\mathbf{y} = \mathbf{h_2}\mathbf{W}^3$
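A numpy sketch of this computation (the layer sizes and the choice of tanh for $g_1$ and $g_2$ are assumptions):

```python
import numpy as np

def feedforward(x, W1, b1, W2, b2, W3):
    h1 = np.tanh(x @ W1 + b1)   # first hidden layer
    h2 = np.tanh(h1 @ W2 + b2)  # second hidden layer
    return h2 @ W3              # linear output layer

# Example with random parameters: 4 inputs, two hidden layers of 8 units, 3 outputs.
rng = np.random.default_rng(0)
y = feedforward(rng.normal(size=4),
                rng.normal(size=(4, 8)), np.zeros(8),
                rng.normal(size=(8, 8)), np.zeros(8),
                rng.normal(size=(8, 3)))
```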
11
Training Neural Networks
Typically done by stochastic gradient descent
• For one training example, find gradient of loss function
wrt parameters of the network (i.e., the weights of each
layer); “travel along in that direction”.
Network has very many parameters!
Efficient algorithm to compute the gradient with
respect to all parameters: backpropagation (Rumelhart
et al., 1986)
• Boils down to an efficient way to use the chain rule of
derivatives to propagate the error signal from the loss
function backwards through the network back to the
inputs
12
SGD Overview
Inputs:
• Function computed by neural network, 𝑓(𝐱; 𝜃)
• Training samples {𝐱 𝐤, 𝐲 𝐤}
• Loss function 𝐿
Repeat for a while:
Sample a training case $(\mathbf{x_k}, \mathbf{y_k})$
Compute the loss $L(f(\mathbf{x_k}; \theta), \mathbf{y_k})$ (the forward pass)
Compute the gradient $\nabla_\theta L$ with respect to the parameters $\theta$ (in neural networks, by backpropagation)
Update $\theta \leftarrow \theta - \eta \nabla_\theta L$
Return $\theta$
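A sketch of this loop in Python (grad stands in for whatever routine computes $\nabla_\theta L$, i.e. backpropagation; the learning rate and stopping criterion are arbitrary choices):

```python
import random

def sgd(f, loss, grad, data, theta, eta=0.01, steps=10_000):
    # data: list of (x_k, y_k) training pairs; theta: parameter vector (e.g. a numpy array).
    for _ in range(steps):
        x_k, y_k = random.choice(data)   # sample a training case
        L = loss(f(x_k, theta), y_k)     # forward pass (in practice, logged to monitor training)
        g = grad(x_k, y_k, theta)        # gradient of L w.r.t. theta (backpropagation)
        theta = theta - eta * g          # take a small step against the gradient
    return theta
```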
13
Example: Forward Pass
$\mathbf{h_1} = g_1(\mathbf{x}\mathbf{W}^1 + \mathbf{b}^1)$
$\mathbf{h_2} = g_2(\mathbf{h_1}\mathbf{W}^2 + \mathbf{b}^2)$
$f(\mathbf{x}) = \mathbf{y} = g_3(\mathbf{h_2}) = \mathbf{h_2}\mathbf{W}^3$
14
Example: Time Delay Neural Network
Let’s draw a neural network architecture for POS
tagging using a feedforward neural network.
We’ll construct a context window around each word,
and predict the POS tag of that word as the output.
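A sketch of the context-window idea: look up embeddings for $w_{i-1}$, $w_i$, $w_{i+1}$, concatenate them, and feed the result through a feedforward network that scores each POS tag (the embedding table and layer shapes are hypothetical):

```python
import numpy as np

def tag_word(i, words, emb, W1, b1, W2, b2, pad="<PAD>"):
    # Build the context window around position i, padding at sentence boundaries.
    window = [words[i - 1] if i > 0 else pad,
              words[i],
              words[i + 1] if i + 1 < len(words) else pad]
    x = np.concatenate([emb[w] for w in window])  # concatenated word embeddings
    h = np.tanh(x @ W1 + b1)                      # hidden layer
    scores = h @ W2 + b2                          # one score per POS tag
    return int(np.argmax(scores))                 # index of the predicted tag
```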
15
TDNN POS Tagger
[Figure: feedforward network over the context window $w_{i-1}, w_i, w_{i+1}$, predicting tag $Q_i$]
16
Recurrent Neural Networks
A neural network sequence model:
$\mathrm{RNN}(\mathbf{s_0}, \mathbf{x_{1:n}}) = \mathbf{s_{1:n}}, \mathbf{y_{1:n}}$
$\mathbf{s_i} = R(\mathbf{s_{i-1}}, \mathbf{x_i})$   ($\mathbf{s_i}$: state vector)
$\mathbf{y_i} = O(\mathbf{s_i})$   ($\mathbf{y_i}$: output vector)
𝑅 and 𝑂 are parts of the neural network that compute the
next state vector and the output vector
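A minimal Elman-style instantiation of $R$ and $O$ in numpy (the tanh state update and softmax output are one common choice, not the only one):

```python
import numpy as np

def rnn(x_seq, s0, W_s, W_x, b_s, W_y, b_y):
    # Returns the state vectors s_1..s_n and the output vectors y_1..y_n.
    states, outputs = [], []
    s = s0
    for x in x_seq:
        s = np.tanh(s @ W_s + x @ W_x + b_s)   # R(s_{i-1}, x_i): next state
        z = s @ W_y + b_y                      # O(s_i): scores over output classes
        y = np.exp(z - z.max())                # softmax over the scores
        y = y / y.sum()
        states.append(s)
        outputs.append(y)
    return states, outputs
```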
17
Long-Term Dependencies in Language
There can be dependencies between words that are
arbitrarily far apart.
• I will look the word that you have described that doesn’t
make sense to her up.
• Can you think of some other examples in English of long-
range dependencies?
18
Comparing LC-CRFs and RNNs
LC-CRFs RNNs
Feature
engineering
Inference
properties
19
Different RNN Architectures
Different architectures for different use cases
23
Basic Operations / Components
Masking
• Multiply a vector by numbers between 0 and 1 to model
forgetting vs. retaining information (see the sketch at the end of this list)
• Ingredients: Sigmoid function, component-wise
multiplication
Integrating information
• Done via component-wise addition and concatenation
Non-linearities:
• Sigmoid function (output between 0 and 1)
• Tanh function (output between -1 and 1)
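A tiny numpy sketch of these ingredients, with made-up numbers: a sigmoid gate masks a memory vector component-wise, and new (tanh-squashed) information is integrated by addition:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

memory = np.array([2.0, -1.0, 0.5])
gate = sigmoid(np.array([4.0, -4.0, 0.0]))       # roughly [0.98, 0.02, 0.5]
masked = gate * memory                           # keep, forget, half-keep
candidate = np.tanh(np.array([0.3, 2.0, -1.0]))  # new content, squashed to (-1, 1)
updated = masked + candidate                     # integrate via component-wise addition
```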
24
Visual step-by-step explanation
http://colah.github.io/posts/2015-08-Understanding-
LSTMs/
25
Vanishing and Exploding Gradients
If 𝑅 and 𝑂 are simple fully connected layers, we have a
problem. In the unrolled network, the gradient signal
can get lost on its way back to the words far in the past:
• Suppose it is $\mathbf{W}^1$ that we want to modify, and there are $N$ layers between that and the loss function.
$$\frac{\partial L}{\partial \mathbf{W}^1} = \frac{\partial L}{\partial g^N} \cdot \frac{\partial g^N}{\partial g^{N-1}} \cdots \frac{\partial g^2}{\partial g^1} \cdot \frac{\partial g^1}{\partial \mathbf{W}^1}$$
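A quick numerical illustration (the depth, weight scale, and activations are arbitrary): multiplying the gradient by many Jacobian factors whose entries are small drives its norm toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = np.ones(50)
W = rng.normal(scale=0.1, size=(50, 50))         # a fully connected layer's weights
for layer in range(20):
    h = rng.normal(size=50)                       # stand-in for that layer's pre-activation
    grad = W.T @ (grad * (1 - np.tanh(h) ** 2))   # chain rule through tanh and W
    print(layer, np.linalg.norm(grad))            # the norm shrinks rapidly toward 0
```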
26
Fix for Vanishing Gradients
In LSTMs, we can propagate a cell state directly, fixing
the vanishing gradient problem:
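A numpy sketch of a standard LSTM cell (parameter shapes are assumed; this follows the common formulation described in the post linked earlier). The key point is the additive cell-state update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$, which gives the gradient a direct path back through time:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    v = np.concatenate([h_prev, x_t])   # [h_{t-1}; x_t]
    f_t = sigmoid(v @ W_f + b_f)        # forget gate: what to erase from c_{t-1}
    i_t = sigmoid(v @ W_i + b_i)        # input gate: how much new content to write
    c_tilde = np.tanh(v @ W_c + b_c)    # candidate cell content
    c_t = f_t * c_prev + i_t * c_tilde  # additive cell-state update
    o_t = sigmoid(v @ W_o + b_o)        # output gate
    h_t = o_t * np.tanh(c_t)            # new hidden state
    return h_t, c_t
```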
27
Bidirectional LSTMs – Motivation
Standard LSTMs go forward in time, so the representation of a word can only depend on the words to its left.
28
BiLSTMs
Have two LSTM layers, forward and backward in time
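A sketch of the bidirectional idea, reusing the lstm_cell function from the previous sketch: run one LSTM left-to-right and a second one right-to-left (with its own parameters), then concatenate the two hidden states at each position:

```python
import numpy as np

def run_lstm(x_seq, params, hidden_size):
    h, c = np.zeros(hidden_size), np.zeros(hidden_size)
    states = []
    for x_t in x_seq:
        h, c = lstm_cell(x_t, h, c, *params)   # lstm_cell as defined above
        states.append(h)
    return states

def bilstm(x_seq, fwd_params, bwd_params, hidden_size):
    fwd = run_lstm(x_seq, fwd_params, hidden_size)
    bwd = run_lstm(x_seq[::-1], bwd_params, hidden_size)[::-1]
    # Each position now sees both its left and its right context.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```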
29
Combining LSTMs and CRFs
LSTMs allow learning of complex relationships between
input and output
(LC-)CRFs allow learning of relationships between
output labels in a sequence
30
LSTM-CRFs
Add a linear-chain CRF layer on top of a (Bi)LSTM
(Huang et al., 2015; Lample et al., 2016)!
31
LSTM-CRF
Features of the CRF:
• The output scores of the LSTM
• Transition scores between tags (these need to be learned)
Score of a sequence in LSTM-CRF:
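One standard way to write this score, following Lample et al. (2016): let $P_{i, y_i}$ be the (Bi)LSTM's output score for tag $y_i$ at position $i$, and let $A$ be the learned matrix of transition scores (with special start and end tags $y_0$ and $y_{n+1}$). Then

$$s(X, \mathbf{y}) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$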
32
LSTM-CRF: Inference and Training
The LSTM-CRF is trained to minimize negative log
likelihood on the training corpus (same as regular CRF):
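In the notation above, a standard formulation of the per-sentence loss is

$$-\log P(\mathbf{y} \mid X) = -s(X, \mathbf{y}) + \log \sum_{\mathbf{y}'} \exp\, s(X, \mathbf{y}')$$

where the log-sum-exp over all candidate tag sequences $\mathbf{y}'$ is computed efficiently with the forward algorithm, and Viterbi decoding gives the best tag sequence at inference time.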
33
Summary
LSTMs are the backbone of many modern NLP models!
LSTM-CRFs perform very well on many basic
sequence labelling tasks:
• Named entity recognition
• POS tagging
• Chunking
• Competitive with Transformers without language
modelling pre-training.
Next, we will start looking at models of hierarchical
structure in language.
34