
CS 585

Natural Language Processing

October 17, 2024


Announcements / Reminders
 Please follow the Week 09 To Do List instructions (if you haven't
already)

 Programming Assignment #02 due on Sunday (10/20/24) at 11:59 PM CST
(originally 10/13/24)

 New written assignment will be posted soon

2
Plan for Today
 Word Embeddings (word2vec)

3
Encoding Word Relationships:
Vector Representations
Word Embeddings

4
Exercise: Word2Vec
https://www.cs.cmu.edu/~dst/WordEmbeddingDemo/index.html

5
Challenge
 We know word relationships exist
 How can we quantify them in an automated
fashion?
 How do we represent them in a numerical
way?
 How can we use them in computational
models and processes?

6
Vector Semantics: Two Ideas
 Idea 1:
 Let's define the meaning of a word by its
distribution in language use (neighboring
words or grammatical environments)

 Idea 2:
 Let's define the meaning of a word as a
point in space
7
Bag of Words: Strings Representation
Some document:

I love this movie! It's sweet, but with satirical humor. The dialogue
is great and the adventure scenes are fun... It manages to be
whimsical and romantic while laughing at the conventions of the
fairy tale genre. I would recommend it to just about anyone. I've
seen it several times, and I'm always happy to see it again
whenever I have a friend who hasn't seen it yet!

Word:       Frequency:
it          6
I           5
the         4
to          3
and         3
seen        2
yet         1
whimsical   1
times       1
...         ...

Bag of words assumption: word/token position does not matter.


8
Bag of Words: Meaning Ignored!
Some document:

I love this movie! It's sweet, but with satirical humor. The dialogue
is great and the adventure scenes are fun... It manages to be
whimsical and romantic while laughing at the conventions of the
fairy tale genre. I would recommend it to just about anyone. I've
seen it several times, and I'm always happy to see it again
whenever I have a friend who hasn't seen it yet!

Word:       Frequency:
it          6
I           5
the         4
to          3
and         3
seen        2
yet         1
whimsical   1
times       1
...         ...

Bag of words assumption: word/token position does not matter.


9
Connotation as a Point in Space
 Words seem to vary along three affective DIMENSIONS:
 valence: the pleasantness of the stimulus
 arousal: the intensity of emotion provoked by the stimulus
 dominance: the degree of control exerted by the stimulus

Dimension    Word        Score    Word        Score
valence      love        1.000    toxic       0.008
             happy       1.000    nightmare   0.005
arousal      elated      0.960    mellow      0.069
             frenzy      0.965    napping     0.046
dominance    powerful    0.991    weak        0.045
             leadership  0.983    empty       0.081

Source: NRC VAD Lexicon (https://saifmohammad.com/WebPages/nrc-vad.html)

10
Vector Semantics
 The idea:
 represent a word as a point in a
multidimensional semantic space that is
derived from the distributions of word
neighbors

11
Point in Space Based on Distribution
 Each word = a vector
 not just "good" or "word45"
 Similar words: “nearby in semantic space"
 We build this space automatically by seeing
which words are nearby in text

12
Vector Semantics: Words as Vectors

Source: Signorelli, Camilo & Arsiwalla, Xerxes. (2019). Moral Dilemmas for Artificial Intelligence: a position paper on an application of
Compositional Quantum Cognition

13
Word Embedding: Definition
Word Embedding:
a term used for the representation of words for text analysis,
typically in the form of a real-valued vector that encodes the
meaning of the word such that the words that are closer in
the vector space are expected to be similar in meaning
from Wikipedia

14
Word Embedding
 Embedding:
 “embedded into a space”
 mapping from one space or structure to
another
 The standard way to represent meaning in
NLP
 Fine-grained model of meaning for
similarity

15
The Why: Sentiment Analysis
 Using words only:
 a feature is a word identity
 for example: a feature might be the identity of the previous word
 requires the exact same word to appear in both training and test data

16
The Why: Sentiment Analysis
 Using embeddings:
 a feature is a word vector
 the previous word was vector [35, 22, 17]
 now in the test set we might see a similar
vector [34, 21, 14]
 we can generalize to similar but unseen words

17
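To make the generalization point concrete, here is a small sketch (not from the original slides) that compares the two example vectors above with cosine similarity; numpy and the cosine helper are added purely for illustration.

import numpy as np

# Toy illustration: the two 3-dimensional example vectors from the slide.
train_word_vec = np.array([35.0, 22.0, 17.0])   # word seen in training
test_word_vec = np.array([34.0, 21.0, 14.0])    # similar word seen only at test time

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(train_word_vec, test_word_vec))    # close to 1.0, so the model can generalize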
Term-Document Matrix
 Each document is represented by a vector
of words

18
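As a quick illustration of a term-document matrix (a sketch only: the two toy documents and the use of scikit-learn's CountVectorizer are not part of the slides, and this is not the Shakespeare data that follows):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "fool wit fool love",          # hypothetical comedy-like document
    "battle soldier battle king",  # hypothetical history-like document
]

vectorizer = CountVectorizer()
term_doc = vectorizer.fit_transform(docs)   # shape: (number of documents, vocabulary size)

print(vectorizer.get_feature_names_out())   # the vocabulary (column labels)
print(term_doc.toarray().T)                 # transposed: rows = terms, columns = documents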
Term-Document Matrix
 Vectors are similar for the two comedies
 “As you like it” and “Twelfth Night”

 But comedies are different from the other two
 more fools and wit and fewer battles
19
Term-Document Matrix
 Vectors are similar for the two comedies
 “As you like it” and “Twelfth Night”

 But comedies are different from the other two
 more fools and wit and fewer battles
20
Document Vector Visualization

21
Words as Vectors
 battle is "the kind of word that occurs in
Julius Caesar and Henry V"

 fool is "the kind of word that occurs in


comedies, especially Twelfth Night"

22
Word-Word (Term-Context) Matrix
 Two words are similar in meaning if their
context vectors are similar

23
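A minimal sketch of how such a term-context matrix can be built by counting co-occurrences inside a window; the toy corpus and the +/- 2 window are assumptions chosen for illustration.

from collections import defaultdict

corpus = [
    "thou shalt not bear false witness".split(),
    "thou shalt love thy neighbour".split(),
]
window = 2

# cooc[target][context] = number of times context appears within the window of target
cooc = defaultdict(lambda: defaultdict(int))
for tokens in corpus:
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[target][tokens[j]] += 1

# Two words are similar in meaning if their rows (context vectors) are similar.
print(dict(cooc["thou"]))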
Document Vector Visualization

24
Document Vector Visualization

Note vector
length and
direction
25
Vector Dot / Scalar Product
Given two vectors a and b (N = vector space dimension):

a = (a_1, a_2, ..., a_N)  and  b = (b_1, b_2, ..., b_N)

their vector dot/scalar product is:

a · b = a_1 b_1 + a_2 b_2 + ... + a_N b_N = Σ_{i=1}^{N} a_i b_i

Using matrix representation: a · b = aᵀ b
26
Vector Dot / Scalar Product
 Vector dot/scalar product is a scalar:

 Vector dot/scalar:
 high values when the two vectors have large
values in the same dimensions
 useful similarity measure

27
Vectors and Dot / Scalar Product

28
Vector Dot / Scalar Product: Problem
 Dot product favors long vectors: higher if a vector is
longer (has higher values in many dimensions)
 Vector length: |a| = sqrt( Σ_{i=1}^{N} a_i² )

 Frequent words (of, the, you) have long vectors (since
they occur many times with other words).
 dot product overly favors frequent words
29
Alternative: Cosine Similarity
Euclidean distance Cosine similarity

30
Word Similarity | Cosine Similarity

cosine(v, w) = (v · w) / (|v| |w|) = Σ_{i=1}^{N} v_i w_i / ( sqrt(Σ_{i=1}^{N} v_i²) sqrt(Σ_{i=1}^{N} w_i²) )

Where: v and w are two different word vectors


31
Word Similarity | Cosine Similarity
 -1: vectors point in opposite directions
 +1: vectors point in the same direction
 0: vectors are orthogonal

But since raw frequency values are non-negative, the


cosine for term-term matrix vectors ranges from 0–1
32
Word Similarity
 Two words are similar in meaning if their
context vectors are similar

33
Word Similarity Visualization

34
Word Similarity | Cosine Similarity
            pie    data  computer
cherry      442       8         2
digital       5    1683      1670
information   5    3982      3325

cherry vs. information: low similarity
digital vs. information: high similarity (see the sketch below)

35
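The similarities suggested by the table can be computed directly from those counts; the short sketch below (an addition, not part of the slides) reproduces the low/high pattern with numpy.

import numpy as np

# Context-count vectors from the table above (dimensions: pie, data, computer).
cherry = np.array([442.0, 8.0, 2.0])
digital = np.array([5.0, 1683.0, 1670.0])
information = np.array([5.0, 3982.0, 3325.0])

def cosine(v, w):
    return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

print("cherry vs. information: ", cosine(cherry, information))   # low similarity (~0.02)
print("digital vs. information:", cosine(digital, information))  # high similarity (~1.0)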
Cosine Similarity Visualization

36
Words as Vectors: Issues
 We saw how to build vectors to represent
words:
 one-hot encoding:
 binary, count, tf*idf
 Some problems
 Large dimensionality of word vectors
 Lack of meaningful relationships between
words

37
Vector Embeddings: Methods
 tf-idf
 popular in Information Retrieval
 sparse vectors
 word represented by (a simple function of) the
counts of nearby words

 Word2vec
 dense vectors
 representation is created by training a classifier to
predict whether a word is likely to appear nearby

38
Sparse vs. Dense Vectors
 Sparse vectors have a lot of values set to
zero.
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2]
 Dense vector: most of the values are non-
zero.
 better use of storage
 carries more information
[3, 1, 5, 0, 1, 4, 9, 8, 7, 1, 1, 2, 2, 2, 2]
39
Sparse vs. Dense Vectors
 tf-idf vectors are typically:
 long (length 20,000 to 50,000)
 sparse (most elements are zero)

 What if we could learn vectors that are


 short (length 50-1000)
 dense (most elements are non-zero)

40
Short / Dense Vectors: Benefits
 Why short/dense vectors?
 short vectors may be easier to use as features in
machine learning (fewer weights to tune)
 dense vectors may generalize better than explicit
counts
 dense vectors may do better at capturing synonymy:
 car and automobile are synonyms; but are distinct
dimensions
 a word with car as a neighbor and a word with
automobile as a neighbor should be similar, but aren't
 In practice, they work better

41
Short/Dense Vectors: Methods
 “Neural Language Model”-inspired models
 Word2vec, GloVe

 Singular Value Decomposition (SVD)


 A special case of this is called LSA – Latent Semantic
Analysis
 Alternative to these "static embeddings":
 Contextual Embeddings (ELMo, BERT)
 Compute distinct embeddings for a word in its
context
 Separate embeddings for each token of a word
42
Word2Vec

43
Language Models: Application

we want to predict the “rest” of the query

44
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:

P(w_N | w_{N-K+1} ... w_{N-1}) = C(w_{N-K+1} ... w_{N-1} w_N) / C(w_{N-K+1} ... w_{N-1})

where:
w_i - the i-th word / token

In MLE, the resulting parameter set maximizes the likelihood of the
training set T given the model M (i.e., P(T | M)).

45
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:

P(w_N | w_{N-K+1} ... w_{N-1}) = C(w_{N-K+1} ... w_{N-1} w_N) / C(w_{N-K+1} ... w_{N-1})

w_{N-K+1} ... w_{N-1} = the query so far ("thou shalt")
w_N = the "rest" | next query word

where:
w_i - the i-th word / token

In MLE, the resulting parameter set maximizes the likelihood of the
training set T given the model M (i.e., P(T | M)).

46
N-gram Language Models
General Maximum Likelihood Estimation (MLE) of
an N-gram:

P(w_N | w_{N-K+1} ... w_{N-1}) = C(w_{N-K+1} ... w_{N-1} w_N) / C(w_{N-K+1} ... w_{N-1})

w_{N-K+1} ... w_{N-1} = the query so far ("thou shalt")
w_N = the "rest" | next query word

where:
w_i - the i-th word / token

Looks at PAST words to predict the NEXT word!
(the NEXT word with the highest P())

In MLE, the resulting parameter set maximizes the likelihood of the
training set T given the model M (i.e., P(T | M)).

47
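A minimal count-based sketch of the bigram (N = 2) case of the MLE estimate above; the toy corpus and the p_mle helper are made up for illustration.

from collections import Counter

corpus = "thou shalt not bear false witness thou shalt not kill".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_mle(next_word, prev_word):
    # P(next_word | prev_word) = C(prev_word next_word) / C(prev_word)
    return bigram_counts[(prev_word, next_word)] / unigram_counts[prev_word]

print(p_mle("not", "shalt"))   # 2/2 = 1.0 in this toy corpus
print(p_mle("bear", "not"))    # 1/2 = 0.5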
N-gram Language Models: Prediction
Given:
S: wordt-N ... wordt-2 wordt-1 _____

Input: wordt-N, wordt-N+1, ..., wordt-2, wordt-1  →  Model  →  Output: wordt

48
N-gram Language Models: Prediction
Given:
S: wordt-N ... wordt-2 wordt-1 _____

Features: wordt-N, wordt-N+1, ..., wordt-2, wordt-1  →  Model  →  Prediction: wordt

49
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt _____

Input: thou, shalt  →  Model  →  Output (per candidate next word):
  anchor   P(S) = 0.0001
  ...
  not      P(S) = 0.8
  ...
  zebra    P(S) = 0.0002

50
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not _____

Input: shalt, not  →  Model  →  Output (per candidate next word):
  anchor   P(S) = 0.0001
  ...
  bear     P(S) = 0.45
  ...
  zebra    P(S) = 0.0002

51
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not bear _____

Input: not, bear  →  Model  →  Output (per candidate next word):
  anchor   P(S) = 0.0001
  ...
  false    P(S) = 0.4
  ...
  zebra    P(S) = 0.0002

52
N-gram Language Models: Prediction
Given (N = 2):
S: thou shalt not bear false _____

Input: bear, false  →  Model  →  Output (per candidate next word):
  anchor   P(S) = 0.0001
  ...
  witness  P(S) = 0.75
  ...
  zebra    P(S) = 0.0002

53
Trained Language Models: Prediction
Given input and a model (word embeddings):
S: thou shalt _____

Input: thou, shalt  →  Look up word embeddings  →  Calculate predictions  →
Project to output vocabulary  →  Output (per candidate next word):
  anchor   P(S) = 0.0001
  ...
  not      P(S) = 0.8
  ...
  zebra    P(S) = 0.0002

54
N-gram Language Models
N-gram language model will handle cases such as:
Tomorrow is ______

but not:
Tomorrow is ______ to be

where:
 context words
 a word to be predicted: target word
55
N-gram Language Models
N-gram language model will handle cases such as:
Tomorrow is ______

but not:

Tomorrow is ______ to be

(the blank is the target word; "Tomorrow is" and "to be" are the context words)

56
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

How would you go about it?

57
Word2Vec: Idea

DON’T count - Predict!

58
Word2Vec: Idea
 Instead of counting how often each word w occurs near
"apricot"
 Train a classifier on a binary prediction task:
 Is w likely to show up near "apricot"?
 We don’t actually care about this task
 but we'll take the learned classifier weights as the word
embeddings
 Use self-supervision:
 A word c that occurs near “apricot” in the corpus acts as the
gold "correct answer" for supervised learning
 No need for human labels

59
Available Tools
 Word2vec (Mikolov et al)
https://code.google.com/archive/p/word2vec/

 GloVe (Pennington, Socher, Manning)
http://nlp.stanford.edu/projects/glove/

60
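Although it is not listed on the slide, the gensim library is a common way to experiment with word2vec in Python. The sketch below assumes gensim 4.x parameter names (vector_size, window, sg, negative) and a toy two-sentence corpus; it is an illustration, not the course's reference implementation.

from gensim.models import Word2Vec

sentences = [
    ["thou", "shalt", "not", "bear", "false", "witness"],
    ["thou", "shalt", "love", "thy", "neighbour"],
]   # a real corpus would be far larger

model = Word2Vec(
    sentences,
    vector_size=50,   # embedding size d
    window=2,         # +/- 2 sliding window
    min_count=1,
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative samples per positive pair
)

vec = model.wv["thou"]                  # the learned dense embedding
print(model.wv.most_similar("thou"))    # nearest neighbours by cosine similarity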
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings

61
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors

The goal of learning is to adjust those word vectors


such that we:
 maximize the similarity of the target word,
context word pairs (w, cpos) drawn from the
positive data
 minimize the similarity of the (w, cneg) pairs
drawn from the negative data.
62
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Two approaches possible:


 use the context words to predict the target word
 use the target word to predict context words

63
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Let’s generalize it a bit:


wordt-2 wordt-1 wordt wordt+1 wordt+2

64
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Let’s generalize it even further:


wordt-N ... wordt-2 wordt-1 wordt wordt+1 wordt+2 ... wordt+N

65
Sliding Window
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

But we don’t need to look at ALL words in:


wordt-N ... wordt-2 wordt-1 wordt wordt+1 wordt+2 ... wordt+N

We can reduce the size of the context:


wordt-N ... wordt-2 wordt-1 wordt wordt+1 wordt+2 ... wordt+N
sliding window +/- 2

66
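A small sketch of how a +/- 2 sliding window turns raw text into (target, context) pairs; the positive_pairs helper and the toy sentence are assumptions for illustration.

def positive_pairs(tokens, window=2):
    # For each target word, pair it with every word inside the +/- window.
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "thou shalt not bear false witness".split()
for target, context in positive_pairs(sentence):
    print(target, "->", context)   # e.g. not -> thou, not -> shalt, not -> bear, not -> false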
Predicting the Missing Word
Say, we want to predict the missing word _____ in:
Tomorrow is ______ to be

Two approaches possible:


 use the context words to predict the target word:
Continuous Bag of Words model (CBOW)
 use the target word to predict context words:
Skip Gram model

67
CBOW Word2Vec
Given (window size 2):
wordt-N ... wordt-2 wordt-1 _____ wordt+1 wordt+2 ... wordt+N

Input: wordt-2, wordt-1, wordt+1, wordt+2  →  Projection (sum)  →  Output: wordt

68
Skip Gram Word2Vec
Given (window size 2):
wordt-N ... wordt-2 wordt-1 _____ wordt+1 wordt+2 ... wordt+N

Input: wordt  →  Projection (sum)  →  Output: wordt-2, wordt-1, wordt+1, wordt+2

69
Skip Gram Word2Vec
Predict context given target word:
thou shalt _not_ bear false witness

Input: not  →  Projection (sum)  →  Output: thou, shalt, bear, false

70
Skip Gram Word2Vec
Predict context given target word:
thou shalt _not_ bear false witness

Input: not  →  Projection (sum)  →  Output:
  thou:   P(+|not, thou),   P(-|not, thou)
  shalt:  P(+|not, shalt),  P(-|not, shalt)
  bear:   P(+|not, bear),   P(-|not, bear)
  false:  P(+|not, false),  P(-|not, false)

71
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings

72
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors

The goal of learning is to adjust those word vectors


such that we:
 maximize the similarity of the target word,
context word pairs (w, cpos) drawn from the
positive data
 minimize the similarity of the (w, cneg) pairs
drawn from the negative data.
73
Word2Vec: the Approach

74
Word2Vec: the Approach

75
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:

…lemon, a [tablespoon of apricot jam, a] pinch…

target word: apricot; context words: tablespoon, of, jam, a

76
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4

77
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4

Goal 1: train a classifier that is given a candidate


(word, context word) pair: (apricot, jam), (apricot,
aardvark), etc.

78
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4

Goal 2: assign probabilities to every (word, context


word) pair:
P(+ | w, ci) and
P(- | w, ci) = 1 - P(+ | w, ci)

79
Target and Context Embeddings
Two embedding matrices, one row per vocabulary word
(aardvark, ..., not, ..., shalt, ..., thou, ..., zebra / zone):

Target (W):  vocabulary size |V| rows × embedding size d columns
Context (C): vocabulary size |V| rows × embedding size d columns

80
Intuition: Target & Context Similar
Target (W) and Context (C) embedding matrices, each of size |V| × d,
one row per vocabulary word (aardvark, ..., not, ..., shalt, ..., thou, ...).

81
Cosine Similarity Visualization

Two vectors are similar if they have


a high dot product | cosine similarity
82
Intuition: Target & Context Similar
Take the target word's row w from Target (W) and the context word's row c
from Context (C) (each matrix is |V| × d).

Target & Context are similar when c · w is high.

83
Intuition: Target & Context Similar
With w a row of Target (W) and c a row of Context (C) (each |V| × d):

Similarity(c, w) ≈ c · w

84
Intuition: Target & Context Similar
With w a row of Target (W) and c a row of Context (C) (each |V| × d):

Similarity(c, w) ≈ c · w

NOT a PROBABILITY though!

85
Intuition: Target & Context Similar
With w a row of Target (W) and c a row of Context (C) (each |V| × d):

Similarity(c, w) ≈ c · w

Use the sigmoid function to turn it into a probability!

86
Similarity → Probability

P(+ | w, c) = σ(c · w) = 1 / (1 + exp(-c · w))

P(- | w, c) = 1 - P(+ | w, c) = σ(-c · w) = 1 / (1 + exp(c · w))

87
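A small numpy sketch of the two probabilities above; the random stand-in embeddings and the dimension d = 4 are assumptions for illustration.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
w = rng.normal(size=4)   # target word embedding (a row of W)
c = rng.normal(size=4)   # context word embedding (a row of C)

p_pos = sigmoid(np.dot(c, w))   # P(+ | w, c)
p_neg = 1.0 - p_pos             # P(- | w, c), equal to sigmoid(-c . w)

print(p_pos, p_neg, sigmoid(-np.dot(c, w)))   # the last two values match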
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

OK, but we have lots of possible context words!

88
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

Assuming word independence, calculate:

P(+ | w, c1, c2, c3, c4) = Π_{i=1}^{4} σ(c_i · w)

89
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

In general:
P(+ | w, c_{1:L}) = Π_{i=1}^{L} σ(c_i · w)

90
Skip Gram Classifier
Assume a +/- 2 (L = 4) word window, given training
sentence:
w
…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 c3 c4
P(+ | w, c1) P(+ | w, c2) P(+ | w, c3) P(+ | w, c4)
P(- | w, c1) P(- | w, c2) P(- | w, c3) P(- | w, c4)

In general [with sums instead of products]:

log P(+ | w, c_{1:L}) = Σ_{i=1}^{L} log σ(c_i · w)

91
Skip Gram Classifier: Summary
A probabilistic classifier, given
 a test target word w
 its context window of L words c1:L
Estimates probability that w occurs in this window
based on similarity of w (embeddings) to c1:L
(embeddings).

To compute this, we just need embeddings for all


the words.
92
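A sketch of that whole-window score, combining the per-context-word probabilities via the log-sum form from the previous slide; the random embeddings and the log_p_window helper are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_p_window(w_vec, context_vecs):
    # log P(+ | w, c_1:L) under the independence assumption.
    return float(sum(np.log(sigmoid(np.dot(c, w_vec))) for c in context_vecs))

rng = np.random.default_rng(1)
d = 4
w_vec = rng.normal(size=d)                              # embedding of the target word
context_vecs = [rng.normal(size=d) for _ in range(4)]   # c_1 .. c_4

print(log_p_window(w_vec, context_vecs))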
Parameters: Target (W) and Context (C)

93
Word2Vec: the Approach
1. Treat the target word t and a neighboring context
word c as positive examples.
2. Randomly sample other words in the lexicon to
get negative examples
3. Use logistic regression to train a classifier to
distinguish those two cases
4. Use the learned classifier weights as the
embeddings

94
Word2Vec: Training
Assume a +/- 2 (L = 4) word window, given training
sentence:

…lemon, a [tablespoon of apricot jam, a] pinch…

Positive (+) examples:
(apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)

Negative (-) examples (K per positive pair; typically twice as many negatives as positives):
(apricot, aardvark), (apricot, my), (apricot, where), (apricot, coaxial)
(apricot, seven), (apricot, forever), (apricot, dear), (apricot, if)

95
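A sketch of how the negative examples might be drawn. word2vec samples negative words in proportion to unigram frequency raised to the 0.75 power; the toy vocabulary counts and the sample_negatives helper below are made up for illustration.

import random
from collections import Counter

vocab_counts = Counter({"aardvark": 2, "my": 50, "where": 30, "coaxial": 1,
                        "seven": 5, "forever": 4, "dear": 6, "if": 40})

words = list(vocab_counts)
weights = [vocab_counts[w] ** 0.75 for w in words]   # weighted unigram frequency

def sample_negatives(target, k=2):
    # Draw k negative context words for one positive (target, context) pair.
    negatives = []
    while len(negatives) < k:
        candidate = random.choices(words, weights=weights, k=1)[0]
        if candidate != target:   # avoid sampling the target word itself
            negatives.append(candidate)
    return negatives

print(sample_negatives("apricot", k=2))   # e.g. ['my', 'if']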
Word2Vec: the Approach
Given the set of positive and negative training
instances, and an initial set of embedding vectors

The goal of learning is to adjust those word vectors


such that we:
 maximize the similarity of the target word,
context word pairs (w, cpos) drawn from the
positive data
 minimize the similarity of the (w, cneg) pairs
drawn from the negative data.
96
Loss Function
Loss function for one w with c_pos, c_neg1 ... c_negk:

L_CE = -log [ P(+ | w, c_pos) · Π_{i=1}^{k} P(- | w, c_negi) ]
     = -[ log σ(c_pos · w) + Σ_{i=1}^{k} log σ(-c_negi · w) ]

Maximize the similarity of the target with the actual
context words (+), and minimize the similarity of the target
with the k negative sampled non-neighbor words (-).

97
Classifier: Learning Process
 How to learn?
 use stochastic gradient descent

 Adjust the word weights to:


 make the positive pairs more likely
 and the negative pairs less likely,
 ... for the entire training set.

98
Gradient Descent: Single Step

99
Loss Function Derivatives

∂L_CE/∂c_pos = [σ(c_pos · w) − 1] w
∂L_CE/∂c_negi = [σ(c_negi · w)] w
∂L_CE/∂w = [σ(c_pos · w) − 1] c_pos + Σ_{i=1}^{k} [σ(c_negi · w)] c_negi

100
Gradient Descent: Updates
Start with randomly initialized W and C matrices:

Target (W):  vocabulary size |V| rows × embedding size d columns
Context (C): vocabulary size |V| rows × embedding size d columns

101
Gradient Descent: Updates
... then incrementally do updates using:

θ^{t+1} = θ^t − η ∇_θ L_CE        (η = learning rate)

102
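A numpy sketch of a single update step for one training item (one target word, one positive context word, k sampled negatives), using the gradients of the loss above; the random initial vectors, the dimensions, and the learning rate are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
d, k, eta = 4, 2, 0.1               # embedding size, number of negatives, learning rate
w = rng.normal(size=d)              # row of W for the target word
c_pos = rng.normal(size=d)          # row of C for the positive context word
c_negs = rng.normal(size=(k, d))    # rows of C for the k negative samples

# Gradients of L_CE = -[log sigma(c_pos . w) + sum_i log sigma(-c_neg_i . w)]
grad_c_pos = (sigmoid(c_pos @ w) - 1.0) * w
grad_c_negs = sigmoid(c_negs @ w)[:, None] * w
grad_w = (sigmoid(c_pos @ w) - 1.0) * c_pos + (sigmoid(c_negs @ w)[:, None] * c_negs).sum(axis=0)

# One stochastic gradient descent step: theta <- theta - eta * gradient
c_pos -= eta * grad_c_pos
c_negs -= eta * grad_c_negs
w -= eta * grad_w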
Skip Gram Word2Vec: Summary
 Start with |V| random d-dimensional vectors as initial
embeddings
 Train a classifier based on embedding similarity
 Take a corpus and take pairs of words that co-occur as positive
examples
 Take pairs of words that don't co-occur as negative examples
 Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
 Throw away the classifier code and keep the embeddings.

103
Sliding Window Size
 Small windows (+/- 2) : nearest words are
syntactically similar words in same taxonomy
 Hogwarts nearest neighbors are other fictional
schools
 Sunnydale, Evernight, Blandings

 Large windows (+/- 5) : nearest words are


related words in same semantic field
 Hogwarts nearest neighbors are Harry Potter world:
 Dumbledore, half-blood, Malfoy
104
