CS585 Lecture October 17th
2
Plan for Today
Word Embeddings (word2vec)
3
Encoding Word Relationships:
Vector Representations
Word Embeddings
4
Exercise: Word2Vec
https://www.cs.cmu.edu/~dst/WordEmbeddingDemo/index.html
5
Challenge
We know word relationships exist
How can we quantify them in an automated
fashion?
How do we represent them in a numerical
way?
How can we use them in computational
models and processes?
6
Vector Semantics: Two Ideas
Idea 1:
Let's define the meaning of a word by its
distribution in language use (neighboring
words or grammatical environments)
Idea 2:
Let's define the meaning of a word as a
point in space
7
Bag of Words: Strings Representation
Some document:
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!
Word: Frequency:
it 6
I 5
the 4
to 3
and 3
seen 2
yet 1
whimsical 1
times 1
... ...
10
Vector Semantics
The idea:
represent a word as a point in a
multidimensional semantic space that is
derived from the distributions of word
neighbors
11
Point in Space Based on Distribution
Each word = a vector
not just "good" or "word45"
Similar words: "nearby in semantic space"
We build this space automatically by seeing
which words are nearby in text
12
Vector Semantics: Words as Vectors
Source: Signorelli, Camilo & Arsiwalla, Xerxes. (2019). Moral Dilemmas for Artificial Intelligence: a position paper on an application of
Compositional Quantum Cognition
13
Word Embedding: Definition
Word Embedding:
a term used for the representation of words for text analysis,
typically in the form of a real-valued vector that encodes the
meaning of the word such that the words that are closer in
the vector space are expected to be similar in meaning
from Wikipedia
14
Word Embedding
Embedding:
“embedded into a space”
mapping from one space or structure to
another
The standard way to represent meaning in
NLP
Fine-grained model of meaning for
similarity
15
The Why: Sentiment Analysis
Using words only:
a feature is a word identity
for example, a feature could be the identity of the previous word
16
The Why: Sentiment Analysis
Using embeddings:
a feature is a word vector
the previous word is represented by the vector [35, 22, 17]
now in the test set we might see a similar
vector [34, 21, 14]
we can generalize to similar but unseen words
17
Term-Document Matrix
Each document is represented by a vector
of words
18
Term-Document Matrix
Vectors are similar for the two comedies
“As you like it” and “Twelfth Night”
21
Words as Vectors
battle is "the kind of word that occurs in
Julius Caesar and Henry V"
22
Word-Word (Term-Context) Matrix
Two words are similar in meaning if their
context vectors are similar
23
Document Vector Visualization
24
Document Vector Visualization
Note vector length and direction
25
Vector Dot / Scalar Product
Given two vectors a and b (N - vector space dimension):
a = (a_1, a_2, ..., a_N) and b = (b_1, b_2, ..., b_N)
their vector dot/scalar product is:
a · b = a_1 b_1 + a_2 b_2 + ... + a_N b_N = Σ_{i=1}^{N} a_i b_i
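A minimal Python sketch of the dot product (numpy and the example vectors are illustrative assumptions, not from the slides):

import numpy as np

a = np.array([1.0, 3.0, 2.0])   # example vector, N = 3 dimensions
b = np.array([4.0, 0.0, 1.0])

# dot/scalar product: sum of the element-wise products
dot_manual = sum(x * y for x, y in zip(a, b))
dot_numpy = np.dot(a, b)
print(dot_manual, dot_numpy)    # both are 1*4 + 3*0 + 2*1 = 6.0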
26
Vector Dot / Scalar Product
Vector dot/scalar product is a scalar:
a · b = Σ_{i=1}^{N} a_i b_i ∈ ℝ
Vector dot/scalar product:
has high values when the two vectors have large values in the same dimensions
is a useful similarity measure
27
Vectors and Dot / Scalar Product
28
Vector Dot / Scalar Product: Problem
Dot product favors long vectors: it is higher if a vector is longer (has higher values in many dimensions)
Vector length:
|a| = √( Σ_{i=1}^{N} a_i² )
30
Word Similarity | Cosine Similarity
cosine(a, b) = (a · b) / (|a| |b|)
normalizes the dot product by the lengths of the two vectors
33
Word Similarity Visualization
34
Word Similarity | Cosine Similarity
Co-occurrence counts:
              pie    data   computer
cherry        442       8          2
digital         5    1683       1670
information     5    3982       3325
Low similarity: e.g., cherry vs. information
High similarity: e.g., digital vs. information
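A small Python sketch (assuming numpy) that reproduces the two comparisons above from the co-occurrence counts:

import numpy as np

# rows of the table above: counts with context words [pie, data, computer]
cherry      = np.array([442,    8,    2])
digital     = np.array([  5, 1683, 1670])
information = np.array([  5, 3982, 3325])

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(cherry, information))   # low similarity  (about 0.02)
print(cosine(digital, information))  # high similarity (about 1.00)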
35
Cosine Similarity Visualization
36
Words as Vectors: Issues
We saw how to build vectors to represent words:
one-hot and count-based encodings: binary, count, tf*idf
Some problems:
Large dimensionality of word vectors
Lack of meaningful relationships between words
37
Vector Embeddings: Methods
tf-idf
popular in Information Retrieval
sparse vectors
word represented by (a simple function of) the
counts of nearby words
Word2vec
dense vectors
representation is created by training a classifier to
predict whether a word is likely to appear nearby
38
Sparse vs. Dense Vectors
Sparse vectors have a lot of values set to
zero.
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2]
Dense vector: most of the values are non-
zero.
better use of storage
carries more information
[3, 1, 5, 0, 1, 4, 9, 8, 7, 1, 1, 2, 2, 2, 2]
39
Sparse vs. Dense Vectors
tf-idf vectors are typically:
long (length 20,000 to 50,000)
sparse (most elements are zero)
40
Short / Dense Vectors: Benefits
Why short/dense vectors?
short vectors may be easier to use as features in
machine learning (fewer weights to tune)
dense vectors may generalize better than explicit
counts
dense vectors may do better at capturing synonymy:
car and automobile are synonyms, but they are distinct dimensions in a sparse representation
a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
In practice, they work better
41
Short/Dense Vectors: Methods
“Neural Language Model”-inspired models
Word2vec, GloVe
43
Language Models: Application
44
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of an N-gram:
P(w_N | w_{N-K+1} w_{N-K+2} ... w_{N-1}) = C(w_{N-K+1} w_{N-K+2} ... w_{N-1} w_N) / C(w_{N-K+1} w_{N-K+2} ... w_{N-1})
where:
w_i - the i-th word / token, C(...) - the count of the word sequence in the training corpus
45
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of an N-gram:
P(w_N | w_{N-K+1} w_{N-K+2} ... w_{N-1}) = C(w_{N-K+1} w_{N-K+2} ... w_{N-1} w_N) / C(w_{N-K+1} w_{N-K+2} ... w_{N-1})
context w_{N-K+1} ... w_{N-1}: e.g., "thou shalt"
w_N: the next query word, e.g., "rest"
where:
w_i - the i-th word / token
46
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of an N-gram:
P(w_N | w_{N-K+1} w_{N-K+2} ... w_{N-1}) = C(w_{N-K+1} w_{N-K+2} ... w_{N-1} w_N) / C(w_{N-K+1} w_{N-K+2} ... w_{N-1})
context w_{N-K+1} ... w_{N-1}: e.g., "thou shalt"
w_N: the next query word, e.g., "rest"
where:
w_i - the i-th word / token
Looks at PAST words to predict the NEXT word (the word with the highest probability)!
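A minimal bigram (N = 2, K = 2) sketch of MLE counting and next-word prediction; the toy corpus is an assumption chosen to match the slide example:

from collections import Counter, defaultdict

corpus = "thou shalt not bear false witness against thy neighbour".split()

# count bigrams C(w_{N-1} w_N) grouped by their context word w_{N-1}
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def p_mle(nxt, prev):
    # MLE: P(w_N | w_{N-1}) = C(w_{N-1} w_N) / C(w_{N-1})
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

# predict the next word after "shalt" (the word with the highest probability)
context = "shalt"
print(max(bigram_counts[context], key=lambda w: p_mle(w, context)))  # -> "not"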
47
N-gram Language Models: Prediction
Given:
S: word_{t-N} ... word_{t-2} word_{t-1} _____
Input: word_{t-N}, word_{t-N+1}, ..., word_{t-2}, word_{t-1}  →  Model  →  Output: word_t
48
N-gram Language Models: Prediction
Given:
S: word_{t-N} ... word_{t-2} word_{t-1} _____
Features: word_{t-N}, word_{t-N+1}, ..., word_{t-2}, word_{t-1}  →  Model  →  Prediction: word_t
49
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt _____
Input: thou, shalt  →  Model  →  Output:
anchor  P(S) = 0.0001
...
not     P(S) = 0.8
...
zebra   P(S) = 0.0002
50
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not _____
Input: shalt, not  →  Model  →  Output:
anchor  P(S) = 0.0001
...
bear    P(S) = 0.45
...
zebra   P(S) = 0.0002
51
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not bear _____
Input: not, bear  →  Model  →  Output:
anchor  P(S) = 0.0001
...
false   P(S) = 0.4
...
zebra   P(S) = 0.0002
52
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not bear false _____
Input: bear, false  →  Model  →  Output:
anchor   P(S) = 0.0001
...
witness  P(S) = 0.75
...
zebra    P(S) = 0.0002
53
Trained Language Models: Prediction
Given input and a model (word embeddings):
S: thou shalt _____
Input: thou, shalt  →  Look up word embeddings  →  Calculate predictions  →  Project to output vocabulary  →  Output:
anchor  P(S) = 0.0001
...
not     P(S) = 0.8
...
zebra   P(S) = 0.0002
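A schematic numpy sketch of the three stages in the diagram (look up word embeddings, calculate predictions, project to the output vocabulary); the tiny vocabulary, the random weights, and the averaging of context embeddings are illustrative assumptions:

import numpy as np

vocab = ["anchor", "bear", "not", "shalt", "thou", "zebra"]
d = 4                                   # embedding size (made up)
E = np.random.rand(len(vocab), d)       # word embedding lookup table
W_out = np.random.rand(d, len(vocab))   # projection to the output vocabulary

def predict(context_words):
    vecs = [E[vocab.index(w)] for w in context_words]   # 1. look up embeddings
    hidden = np.mean(vecs, axis=0)                       # 2. combine / calculate
    scores = hidden @ W_out                              # 3. project to vocabulary
    probs = np.exp(scores) / np.exp(scores).sum()        #    softmax over all words
    return dict(zip(vocab, probs))

print(predict(["thou", "shalt"]))   # a probability for every vocabulary word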
54
N-gram Language Models
N-gram language model will handle cases such as:
Tomorrow is ______
but not:
Tomorrow is ______ to be
where:
"Tomorrow is" / "to be" are the context words
______ is the word to be predicted: the target word
55
N-gram Language Models
N-gram language model will handle cases such as:
Tomorrow is ______
Tomorrow is ______ to be
context words
56
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be
57
Word2Vec: Idea
58
Word2Vec: Idea
Instead of counting how often each word w occurs near
"apricot"
Train a classifier on a binary prediction task:
Is w likely to show up near "apricot"?
We don’t actually care about this task
but we'll take the learned classifier weights as the word
embeddings
Use self-supervision:
A word c that occurs near “apricot” in the corpus acts as the
gold "correct answer" for supervised learning
No need for human labels
59
Available Tools
Word2vec (Mikolov et al)
https://code.google.com/archive/p/word2vec/
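The slide lists the original word2vec release; in practice, a common way to train word2vec from Python is the gensim library (gensim is not mentioned on the slide; the sketch below assumes gensim 4.x and a toy corpus):

from gensim.models import Word2Vec

sentences = [["thou", "shalt", "not", "bear", "false", "witness"],
             ["tomorrow", "is", "another", "day"]]

# sg=1 selects the skip-gram variant; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["witness"])               # the learned embedding vector
print(model.wv.most_similar("witness"))  # nearest neighbors in embedding space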
60
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings
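A minimal Python sketch of steps 1-2 (positive pairs from a context window, negative pairs by random sampling); for simplicity the negatives here are drawn uniformly, whereas word2vec actually samples them from a weighted unigram distribution:

import random

sentence = "thou shalt not bear false witness".split()
vocab = sorted(set(sentence) | {"anchor", "zebra", "zone"})
L = 2          # +/- 2 word window
k = 2          # negative examples per positive example

pairs = []     # (target, context, label)
for i, target in enumerate(sentence):
    window = sentence[max(0, i - L): i] + sentence[i + 1: i + 1 + L]
    for context in window:
        pairs.append((target, context, 1))                       # positive example
        for _ in range(k):                                       # negative examples
            noise = random.choice([w for w in vocab if w != target])
            pairs.append((target, noise, 0))

print(pairs[:6])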
61
Word2Vec: the Approach
Given the set of positive and negative training instances, and an initial set of embedding vectors, the goal of learning is to adjust the embeddings to maximize the similarity of the (target, context) positive pairs and minimize the similarity of the negative pairs
63
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be
64
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be
65
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be
66
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be
67
CBOW Word2Vec
Given (window size 2):
word_{t-N} ... word_{t-2} word_{t-1} _____ word_{t+1} word_{t+2} ... word_{t+N}
Input: word_{t-2}, word_{t-1}, word_{t+1}, word_{t+2}  →  sum  →  predict word_t
68
Skip Gram Word2Vec
Given (window size 2):
word_{t-N} ... word_{t-2} word_{t-1} _____ word_{t+1} word_{t+2} ... word_{t+N}
Input: word_t  →  predict each of word_{t-2}, word_{t-1}, word_{t+1}, word_{t+2}
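A small sketch contrasting the training pairs the two architectures build from the same window (window size 2, reusing the example sentence from the earlier slides; the code itself is an illustration, not part of the slides):

sentence = "thou shalt not bear false witness".split()
L = 2
t = sentence.index("not")                               # target position
context = sentence[max(0, t - L): t] + sentence[t + 1: t + 1 + L]

# CBOW: the (combined) context predicts the target word
cbow_pair = (context, sentence[t])       # (['thou', 'shalt', 'bear', 'false'], 'not')

# Skip-gram: the target word predicts each context word
skipgram_pairs = [(sentence[t], c) for c in context]
print(cbow_pair, skipgram_pairs)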
69
Skip Gram Word2Vec
Predict context given target word:
thou shalt _not_ bear false witness
Input: not  →  predict each of the context words: thou, shalt, bear, false
70
Skip Gram Word2Vec
Predict context given target word:
thou shalt _not_ bear false witness
Target: not  →  for each context word, estimate:
thou:  P(+ | not, thou)   P(- | not, thou)
shalt: P(+ | not, shalt)  P(- | not, shalt)
bear:  P(+ | not, bear)   P(- | not, bear)
false: P(+ | not, false)  P(- | not, false)
71
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings
72
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors
74
Word2Vec: the Approach
75
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given a training sentence with a target word and its context words:
76
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
77
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
78
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
79
Target and Context Embeddings
Two embedding matrices, each with |V| rows (one per vocabulary word: aardvark, ..., not, ..., shalt, ..., thou, ..., zebra/zone) and d columns (embedding size d):
Target embeddings: one d-dimensional vector per target word
Context embeddings: one d-dimensional vector per context word
80
Intuition: Target & Context Similar
The target and context embedding matrices (vocabulary size |V|, embedding size d), as on the previous slide
81
Cosine Similarity Visualization
Target (w) and context (c) are similar when the dot product c · w is high
(w: a row of the target matrix, c: a row of the context matrix; vocabulary size |V|, embedding size d)
83
Intuition: Target & Context Similar
Similarity(c, w) is measured by the dot product c · w between the target embedding w and the context embedding c
84
Intuition: Target & Context Similar
Similarity(c, w) = c · w
... but the dot product is NOT a probability though!
85
Intuition: Target & Context Similar
Similarity(c, w) = c · w
to turn the dot product into a probability, use the sigmoid function!
86
Similarity Probability
P(+ | w, c) = σ(c · w) = 1 / (1 + exp(-c · w))
P(- | w, c) = 1 - P(+ | w, c) = σ(-c · w) = 1 / (1 + exp(c · w))
87
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)
88
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)
89
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)
In general:
P(+ | w, c_{1:L}) = Π_{i=1}^{L} σ(c_i · w)
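A numpy sketch of these probabilities for the window above; the 3-dimensional embedding values are made up for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w = np.array([0.5, -0.1, 0.3])                          # target embedding (apricot)
contexts = {"tablespoon": np.array([ 0.4,  0.0,  0.2]), # c1
            "of":         np.array([ 0.1, -0.2,  0.0]), # c2
            "jam":        np.array([ 0.6, -0.1,  0.4]), # c3
            "a":          np.array([-0.3,  0.2, -0.1])} # c4

p_pos = {c: sigmoid(np.dot(vec, w)) for c, vec in contexts.items()}   # P(+ | w, c_i)
p_neg = {c: 1.0 - p for c, p in p_pos.items()}                        # P(- | w, c_i)

# P(+ | w, c_{1:L}) = product of the individual sigmoids
p_window = np.prod(list(p_pos.values()))
print(p_pos, p_window)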
90
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)
91
Skip Gram Classifier: Summary
A probabilistic classifier, given
a test target word w
its context window of L words c1:L
Estimates probability that w occurs in this window
based on similarity of w (embeddings) to c1:L
(embeddings).
93
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings
94
Word2Vec: Training
Assume a +/- 2 (L = 4) word window, given training
sentence:
95
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors
97
Classifier: Learning Process
How to learn?
use stochastic gradient descent
98
Gradient Descent: Single Step
99
Loss Function Derivatives
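The equations for this part (the gradient descent step and the loss function derivatives) did not survive extraction; as a reference, the standard skip-gram negative-sampling loss for one target word w, one positive context c_pos, and k sampled negatives c_neg_i, together with its gradients, is (sketched in LaTeX):

L_{CE} = -\log\Big[ P(+ \mid w, c_{pos}) \prod_{i=1}^{k} P(- \mid w, c_{neg_i}) \Big]
       = -\Big[ \log \sigma(c_{pos} \cdot w) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i} \cdot w) \Big]

\frac{\partial L_{CE}}{\partial c_{pos}} = \big[\sigma(c_{pos} \cdot w) - 1\big]\, w
\frac{\partial L_{CE}}{\partial c_{neg_i}} = \sigma(c_{neg_i} \cdot w)\, w
\frac{\partial L_{CE}}{\partial w} = \big[\sigma(c_{pos} \cdot w) - 1\big]\, c_{pos} + \sum_{i=1}^{k} \sigma(c_{neg_i} \cdot w)\, c_{neg_i}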
100
Gradient Descent: Updates
Start with randomly initialized W and C matrices
Target matrix W and Context matrix C, each with |V| rows (aardvark, ..., not, ..., shalt, ..., thou, ..., zebra/zone) and d columns (embedding size d)
101
Gradient Descent: Updates
... then incrementally do updates using gradient descent with a learning rate η:
θ^{(t+1)} = θ^{(t)} - η ∇L(θ^{(t)})   (θ stands for the embedding being updated: w, c_pos, or c_neg)
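A one-step numpy sketch of these updates (the learning rate and embedding values are made up; the gradients follow the negative-sampling loss sketched above):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

eta   = 0.1                              # learning rate
w     = np.array([ 0.2, -0.4,  0.1])     # target embedding (row of W)
c_pos = np.array([ 0.3,  0.1, -0.2])     # positive context embedding (row of C)
c_neg = np.array([-0.1,  0.5,  0.3])     # one sampled negative (row of C)

# gradients of the negative-sampling loss with respect to each vector
g_w     = (sigmoid(c_pos @ w) - 1) * c_pos + sigmoid(c_neg @ w) * c_neg
g_c_pos = (sigmoid(c_pos @ w) - 1) * w
g_c_neg = sigmoid(c_neg @ w) * w

# incremental updates: step against each gradient, scaled by the learning rate
w     -= eta * g_w
c_pos -= eta * g_c_pos
c_neg -= eta * g_c_neg
print(w, c_pos, c_neg)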
102
Skip Gram Word2Vec: Summary
Start with |V| random d-dimensional vectors as initial
embeddings
Train a classifier based on embedding similarity
Take a corpus and take pairs of words that co-occur as positive
examples
Take pairs of words that don't co-occur as negative examples
Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
Throw away the classifier code and keep the embeddings.
103
Sliding Window Size
Small windows (+/- 2): nearest words are syntactically similar words in the same taxonomy
Hogwarts nearest neighbors are other fictional
schools
Sunnydale, Evernight, Blandings