CPSC 436N
Term 1
Lecture 6: Text Classification 2 - Neural Models
[Title-slide figure: example documents classified into the classes Technology and Sports]
Naïve Bayes: introduces terminology and common representations; interesting relation with n-gram LMs.
Logistic Regression: strong baseline; fundamental relation with MLP; generalizes to sequence models (CRFs).
Multi-Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
Today: Text Classification
• MLP classifier
• CNN classifier
Binary Logistic Regression as a 1-layer Network
(we don't count the input layer in the layers!)
Output layer (one σ node): y = σ(w ⋅ x + b), where y is a scalar
Weights: w = (w1, …, wn) is a vector; b is a scalar bias
Input layer: the vector x = (x1, …, xn), plus a constant +1 input for the bias
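A minimal numpy sketch of this one-layer network; the feature values, weights, and bias below are made-up numbers for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 3-dimensional input; in practice x holds the document's features.
x = np.array([0.2, 1.0, 0.5])    # input vector x = (x1, ..., xn)
w = np.array([1.5, -2.0, 0.3])   # weight vector w (one weight per input)
b = 0.1                          # scalar bias (the weight on the constant +1 input)

y = sigmoid(np.dot(w, x) + b)    # scalar output y = sigma(w . x + b), in (0, 1)
```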
Two-Layer Network with Scalar Output
Output layer: y = σ(W2h + b2), where y is a scalar
Hidden units: h = tanh(W1x + b1)   (tanh could also be ReLU)
Input layer: the vector x = (x1, …, xn), plus a constant +1 input for the bias
Multinomial Logistic Regression as a 1-layer Network
Output layer (softmax nodes): y = softmax(Wx + b), where y is a vector with one entry per class
Weights: W is a matrix; b is a vector
Input layer: the scalars x1, …, xn, plus a constant +1 input for the bias
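The multinomial case differs only in that the weights form a matrix and the output is a probability vector. A minimal numpy sketch with made-up sizes (4 input features, 3 classes):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=4)       # input features x1 ... xn
W = rng.normal(size=(3, 4))  # one row of weights per output class
b = np.zeros(3)              # one bias per output class

y = softmax(W @ x + b)       # y is a vector of class probabilities (sums to 1)
```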
Multinomial Two-Layer Network
Output layer: y = softmax(W2h + b2)
Hidden units: h = tanh(W1x + b1)   (tanh could also be ReLU)
Input layer: the vector x = (x1, …, xn), plus a constant +1 input for the bias
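A sketch of this two-layer network in PyTorch, with arbitrary sizes; the scalar-output network from the earlier slide is the same except that the final softmax is replaced by a sigmoid:

```python
import torch
import torch.nn as nn

n_features, n_hidden, n_classes = 100, 32, 3   # hypothetical sizes

model = nn.Sequential(
    nn.Linear(n_features, n_hidden),   # W1 x + b1
    nn.Tanh(),                         # could be nn.ReLU()
    nn.Linear(n_hidden, n_classes),    # W2 h + b2
    nn.Softmax(dim=-1),                # y = softmax(W2 h + b2)
)

x = torch.randn(n_features)            # one input vector
y = model(x)                           # probability distribution over the classes
```

(In practice the final nn.Softmax is usually omitted during training: nn.CrossEntropyLoss expects raw logits and applies the softmax internally.)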
How do we represent a document in a neural network for sentiment classification?
Neural Language Model (Lecture 4) (J&M Chapter 7)
Neural Net Classification with embeddings as input features!
Issue: Texts Come in Different Lengths
• This assumes a fixed input length (3 words)!
• Kind of unrealistic.
• Some simple solutions follow (more sophisticated solutions later).
Issue: Texts Come in Different Lengths
B. Make the input the length of the longest review:
   • If a review is shorter, pad it with zero embeddings.
   • Truncate if you get longer reviews at test time.
C. Create a single "sentence embedding" (with the same dimensionality as a word embedding) to represent all the words, as in the sketch below:
   • Take the mean of all the word embeddings, or
   • Take the element-wise max of all the word embeddings (for each dimension, pick the max value across all words).
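A minimal numpy sketch of option C, assuming the review's word embeddings have already been looked up and stacked into a matrix (random placeholder values here):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_e = 7, 5                   # hypothetical review length and embedding size
E = rng.normal(size=(n_words, d_e))   # one row per word: its embedding

mean_embedding = E.mean(axis=0)       # C1: average of all word embeddings
max_embedding = E.max(axis=0)         # C2: per dimension, the max over all words

# Either vector has dimensionality d_e and can be fed to the classifier
# regardless of how many words the review contains.
```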
CNN Example: Predicting the Sentiment of a Sentence
[Figure: the words w1 … wn, each represented by an embedding of dimension de, are fed to a CNN sentiment classifier that outputs Negative / Neutral / Positive; the resulting sentence representation has dimension d ≪ de ⋅ n]
(Task-Specific) N-gram Detectors: Convolutional Neural Networks (CNNs)
CNN: A Feature-Extracting Architecture
Not a standalone, useful network on its own:
• It is meant to be integrated into a larger network, and trained to work in tandem with it in order to produce an end result.
• The CNN layer's responsibility is to extract meaningful sub-structures that are useful for the overall prediction task at hand.
• When applied to images, the architecture uses 2D (grid) convolutions.
• When applied to text, we are mainly concerned with 1D (sequence) convolutions.
1D Convolution over a Sentence
[Figure: k = 3; each word is represented by its embedding (de = 2 here for simplicity), so each 3-word window is a concatenated vector of dimension de ⋅ k = 6]
• A "filter" is applied to each k-word sliding window: the window vector is multiplied by a vector u ∈ ℝ^(de⋅k) (here ℝ^6) and a non-linear function is applied to the result.
• The output is a scalar value for each window.
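A minimal numpy sketch of a single filter sliding over a sentence, with de = 2 and k = 3 as on the slide; the embeddings and the filter u are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d_e, k, n_words = 2, 3, 6
E = rng.normal(size=(n_words, d_e))      # word embeddings, one row per word
u = rng.normal(size=d_e * k)             # the filter: a vector of length de * k = 6

outputs = []
for i in range(n_words - k + 1):         # slide a k-word window over the sentence
    window = E[i:i + k].reshape(-1)      # concatenate the k embeddings (length 6)
    outputs.append(np.tanh(window @ u))  # multiply by u, apply a non-linearity
outputs = np.array(outputs)              # one scalar per window
```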
CNN: Convolution and Pooling in NLP
[Figure: each 6-dimensional window vector xi is multiplied by a matrix W whose columns are filters, e.g. the second filter's response is tanh(xi ⋅ u2); with 3 filters, each window yields a 3-dimensional output]
Intuition
• The goal is to focus on the most important "features" in the sentence, regardless of their location.
• Each filter extracts a different indicator from the window.
• The pooling operation zooms in on the important indicators.
• The resulting l-dimensional vector (one value per filter) is then fed further into a network that is used for prediction, as in the sketch below.
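Continuing the previous sketch with several filters (the columns of a matrix W) followed by max pooling over the windows; all values are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d_e, k, n_words, n_filters = 2, 3, 6, 3
E = rng.normal(size=(n_words, d_e))         # word embeddings
W = rng.normal(size=(d_e * k, n_filters))   # each column of W is one filter

conv = []
for i in range(n_words - k + 1):
    window = E[i:i + k].reshape(-1)         # concatenated k-word window
    conv.append(np.tanh(window @ W))        # one value per filter for this window
conv = np.array(conv)                       # shape: (n_windows, n_filters)

pooled = conv.max(axis=0)                   # max pooling: strongest response per filter,
                                            # regardless of where it occurred in the sentence
```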
Training
• Gradients are back-propagated according to the loss on the task.
• The weights of the filter function W are trained to highlight the aspects of the data that are important for the task.
• Intuitively, when the sliding window of size k is run over a sequence, the filter function learns to identify informative k-grams.
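A minimal PyTorch sketch of this idea for a single filter u and a single window xi (both random placeholders): the task loss is back-propagated into the filter weights, which are then updated by one gradient step.

```python
import torch

xi = torch.randn(6)                      # concatenated 3-word window (de * k = 6)
u = torch.randn(6, requires_grad=True)   # filter weights to be learned
p = torch.sigmoid(torch.dot(xi, u))      # filter response used for a binary prediction
loss = -torch.log(p)                     # e.g. negative log-likelihood of the true class
loss.backward()                          # back-propagate: u.grad now holds dloss/du
with torch.no_grad():
    u -= 0.1 * u.grad                    # one gradient-descent step on the filter
```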
Capturing k-grams of Varying Length
• Several convolutional layers may be applied in parallel, as in the sketch below.
• E.g. 4 convolutional layers, each with a different window size in the range 2-5, capturing k-grams of varying lengths.
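A sketch of this architecture in PyTorch; all sizes (vocabulary, embedding dimension, number of filters, number of classes) are arbitrary choices for illustration. Four Conv1d layers with window sizes 2-5 run in parallel, each is max-pooled over time, and the pooled responses are concatenated before the prediction layer:

```python
import torch
import torch.nn as nn

class MultiWindowCNN(nn.Module):
    """Hypothetical CNN sentence classifier with parallel window sizes."""
    def __init__(self, vocab_size=10000, d_e=50, n_filters=20, n_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_e)
        # One convolutional layer per window size k = 2, 3, 4, 5
        self.convs = nn.ModuleList(
            [nn.Conv1d(d_e, n_filters, kernel_size=k) for k in (2, 3, 4, 5)]
        )
        self.out = nn.Linear(4 * n_filters, n_classes)

    def forward(self, token_ids):                  # token_ids: (batch, n_words)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, d_e, n_words) for Conv1d
        pooled = [torch.tanh(conv(x)).max(dim=2).values for conv in self.convs]
        h = torch.cat(pooled, dim=1)               # concatenate the pooled filter responses
        return self.out(h)                         # class scores (logits)

model = MultiWindowCNN()
scores = model(torch.randint(0, 10000, (1, 12)))   # one 12-word sentence
```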
CNNs Less Popular Today in NLP, but…
• Still a component in multimodal applications (vision & language).
• GCNs (graph convolutional networks): graph embeddings, useful in enhancing neural networks with structured knowledge from a knowledge base.
Summary
Naïve Bayes: introduces terminology and common representations; interesting relation with n-gram LMs.
Logistic Regression: strong baseline; fundamental relation with MLP; generalizes to sequence models (CRFs).
Multi-Layer Perceptron: key component of modern neural solutions (e.g. fine-tuning).
For Next Time
• Optional Reading:
• J&M: https://web.stanford.edu/~jurafsky/slp3/
• Sequence labeling: Chapter 8, up to Section 8.4.3