Lecture 7: PPMI

Chapter 6 of 'Speech and Language Processing' discusses vector semantics, defining words by their usage and surrounding context. It introduces embeddings as a way to represent words as vectors in a space, facilitating tasks like sentiment analysis and information retrieval. The chapter also covers methods such as tf-idf and Positive Pointwise Mutual Information (PPMI) for measuring word similarity based on their co-occurrence in contexts.

Dan Jurafsky and James Martin

Speech and Language Processing

Chapter 6:
Vector Semantics
Let's define words by their
usages
In particular, words are defined by their
environments (the words around them)

Zellig Harris (1954): If A and B have almost identical environments we say that they are synonyms.
What does ong choi mean?
Suppose you see these sentences:
• Ongchoi is delicious sautéed with garlic.
• Ongchoi is superb over rice
• Ongchoi leaves with salty sauces
And you've also seen these:
• …spinach sautéed with garlic over rice
• Chard stems and leaves are delicious
• Collard greens and other salty leafy greens
Conclusion:
◦ Ongchoi is a leafy green like spinach, chard, or collard
greens
Ong choi: Ipomoea aquatica
"Water Spinach"

Yamaguchi, Wikimedia Commons, public domain


Build a new model of meaning
focusing on similarity
Each word = a vector
Similar words are "nearby in space"

[2D visualization of an embedding space: negative words (not good, bad, worse, worst, incredibly bad, dislike) cluster in one region; positive words (good, nice, very good, incredibly good, amazing, fantastic, terrific, wonderful) cluster in another; function words (to, by, that, now, a, i, you, than, with, is) sit in between]
Define a word as a vector
Called an "embedding" because it's embedded
into a space
The standard way to represent meaning in NLP
Fine-grained model of meaning for similarity
◦ NLP tasks like sentiment analysis
◦ With words, requires same word to be in training and test
◦ With embeddings: ok if similar words occurred!!!
◦ Question answering, conversational agents, etc
2 kinds of embeddings
Tf-idf
◦ A common baseline model
◦ Sparse vectors
◦ Words are represented by a simple function of the counts
of nearby words
Word2vec
◦ Dense vectors
◦ Representation is created by training a classifier to
distinguish nearby and far-away words
Review: words, vectors, and
co-occurrence matrices
Term-document matrix
Each document is represented by a vector of words
Visualizing document vectors
[Plot: document vectors in the (fool, battle) plane — As You Like It [36,1], Twelfth Night [58,0], Julius Caesar [1,7], Henry V [4,13]; x-axis: fool, y-axis: battle]
Vectors are the basis of information retrieval
Vectors are similar for the two comedies, but different from the history plays.
Comedies have more fools and wit and fewer battles.
Words can be vectors too
battle is "the kind of word that occurs in Julius Caesar and Henry V"
fool is "the kind of word that occurs in comedies, especially Twelfth Night"
More common: word-word matrix
(or "term-context matrix")
Two words are similar in meaning if their context vectors
are similar

             aardvark  computer  data  pinch  result  sugar  …
apricot          0         0       0     1      0       1
pineapple        0         0       0     1      0       1
digital          0         2       1     0      1       0
information      0         1       6     0      4       0
[Plot: context vectors in the (data, result) plane — information [6,4], digital [1,1]; x-axis: data, y-axis: result]
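To make the windowed counting concrete, here is a minimal Python sketch (not from the lecture; the function name cooccurrence_counts and the toy corpus are illustrative assumptions) that builds such context counts with a ±2-word window:

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each context word appears within +/- `window`
    positions of each target word, over a list of tokenized sentences."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

# Toy corpus (hypothetical): each sentence is already tokenized and lowercased.
corpus = [
    "ali and hassan ate apple and oranges".split(),
    "ali ate apple not oranges".split(),
]
counts = cooccurrence_counts(corpus, window=2)
print(dict(counts["apple"]))
# e.g. {'hassan': 1, 'ate': 2, 'and': 1, 'oranges': 2, 'ali': 1, 'not': 1}
```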
Visualizing cosines
(well, angles)
[Plot: vectors for apricot, digital, and information, showing the angles between them; Dimension 1: 'large', Dimension 2: 'data']
Cosine for computing similarity (Sec. 6.3)
v_i is the count for word v in context i
w_i is the count for word w in context i
cos(v, w) is the cosine similarity of v and w
Cosine as a similarity metric
-1: vectors point in opposite directions
+1: vectors point in the same direction
0: vectors are orthogonal

Frequencies are non-negative, so in practice cosine ranges from 0 to 1

             large  data  computer
apricot        1     0       0
digital        0     1       2
information    1     6       1

$\cos(v, w) = \frac{v \cdot w}{|v|\,|w|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}$

Which pair of words is more similar?

cosine(apricot, information) = $\frac{1 + 0 + 0}{\sqrt{1+0+0}\,\sqrt{1+36+1}} = \frac{1}{\sqrt{38}} = .16$

cosine(digital, information) = $\frac{0 + 6 + 2}{\sqrt{0+1+4}\,\sqrt{1+36+1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} = .58$

cosine(apricot, digital) = $\frac{0 + 0 + 0}{\sqrt{1+0+0}\,\sqrt{0+1+4}} = 0$
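As a sanity check on these numbers, here is a small Python sketch of the cosine formula applied to the three count vectors above (the helper function and layout are my own, not the lecture's):

```python
import math

def cosine(v, w):
    """Cosine similarity between two count vectors of equal length."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_v * norm_w)

# Counts in the contexts (large, data, computer) from the slide above.
apricot     = [1, 0, 0]
digital     = [0, 1, 2]
information = [1, 6, 1]

print(round(cosine(apricot, information), 2))  # 0.16
print(round(cosine(digital, information), 2))  # 0.58
print(round(cosine(apricot, digital), 2))      # 0.0
```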
But raw frequency is a bad
representation
Frequency is clearly useful; if sugar appears a lot
near apricot, that's useful information.
But overly frequent words like the, it, or they are
not very informative about the context
Need a function that resolves this frequency
paradox!
tf-idf: combine two factors
tf: term frequency. frequency count (usually log-transformed):

$\text{tf}_{t,d} = \begin{cases} 1 + \log_{10} \text{count}(t,d) & \text{if count}(t,d) > 0 \\ 0 & \text{otherwise} \end{cases}$

idf: inverse document frequency:

$\text{idf}_i = \log_{10}\left(\frac{N}{\text{df}_i}\right)$

where N = total # of docs in the collection and df_i = # of docs that contain word i.
Words like "the" or "good" have very low idf.

tf-idf value for word t in document d:

$w_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$

[Table: a tf-idf weighted term-document matrix for four words in four Shakespeare plays]
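A minimal Python sketch of these definitions, assuming the log-scaled tf (1 + log10 of the count) that the worked example on the next slides uses; the helper names tf, idf, and tf_idf are mine, not the lecture's:

```python
import math

def tf(count):
    """Log-scaled term frequency: 1 + log10(count) for counts > 0, else 0."""
    return (1 + math.log10(count)) if count > 0 else 0.0

def idf(n_docs, df):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)

def tf_idf(count, n_docs, df):
    """tf-idf weight for a term with the given count and document frequency."""
    return tf(count) * idf(n_docs, df)

# A word that appears in every document gets weight 0, however frequent:
print(tf_idf(count=100, n_docs=3, df=3))          # 0.0
# A rarer, more informative word gets a positive weight:
print(round(tf_idf(count=2, n_docs=3, df=1), 2))  # 0.62
```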
Summary: tf-idf
Compare two words using tf-idf cosine to see
if they are similar
Compare two documents
◦ Take the centroid of vectors of all the words in
the document
◦ Centroid document vector: $d = \frac{t_1 + t_2 + \cdots + t_k}{k}$, where $t_1, \ldots, t_k$ are the tf-idf vectors of the $k$ words in the document
Example of Tf*Idf Vector
Represent the word "apple" as a vector using the following corpus. Use TF.IDF weights.
Assume the window size for word context is 2.
Document 1: I like to ride cycle often.
Document 2: Ali and [ Hassan ate apple and oranges ].
Document 3: [ Ali ate apple not oranges ].
Context words of "apple" = Hassan, ate, and, oranges, Ali, not

Dimension   Raw count   TF    IDF    TF.IDF weight
I               0        0    0.48     0
like            0        0    0.48     0
to              0        0    0.48     0
ride            0        0    0.48     0
cycle           0        0    0.48     0
often           0        0    0.48     0
Ali             1        1    0.18     0.18
and             1        1    0.48     0.48
Hassan          1        1    0.48     0.48
ate             2        1.3  0.18     0.23
apple           0        0    0.18     0
oranges         2        1.3  0.18     0.23
not             1        1    0.48     0.48
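The weights in this table can be reproduced with a short Python sketch (mine, not the lecture's); it recomputes the non-zero entries from the raw window counts:

```python
import math

docs = [
    "I like to ride cycle often".lower().split(),
    "Ali and Hassan ate apple and oranges".lower().split(),
    "Ali ate apple not oranges".lower().split(),
]
# Raw context counts for "apple" with a +/-2 window (from the slide above).
raw = {"ali": 1, "and": 1, "hassan": 1, "ate": 2, "oranges": 2, "not": 1}

n_docs = len(docs)
for term, count in raw.items():
    df = sum(term in d for d in docs)      # document frequency of the context word
    tf = 1 + math.log10(count)             # all counts here are > 0
    idf = math.log10(n_docs / df)
    print(term, round(tf * idf, 2))
# ali 0.18, and 0.48, hassan 0.48, ate 0.23, oranges 0.23, not 0.48
```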
An alternative to tf-idf
Ask whether a context word is particularly
informative about the target word.
◦ Positive Pointwise Mutual Information (PPMI)

Pointwise Mutual Information
Asks whether the probability of x and y occurring together is higher than what we would expect if x and y were unrelated (independent).

Pointwise mutual information:
Do events x and y co-occur more than if they were independent?

$\text{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$

PMI between two words (Church & Hanks 1989):
Do words word1 and word2 co-occur more than if they were independent?

$\text{PMI}(\text{word}_1, \text{word}_2) = \log_2 \frac{P(\text{word}_1, \text{word}_2)}{P(\text{word}_1)\,P(\text{word}_2)}$
Positive Pointwise Mutual Information
◦ PMI ranges from −∞ to + ∞
◦ But the negative values are problematic
◦ Things are co-occurring less than we expect by chance
◦ So we just replace negative PMI values by 0
◦ Positive PMI (PPMI) between word1 and word2:
$\text{PPMI}(\text{word}_1, \text{word}_2) = \max\left(\log_2 \frac{P(\text{word}_1, \text{word}_2)}{P(\text{word}_1)\,P(\text{word}_2)},\ 0\right)$
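A tiny Python sketch of this definition (the function name ppmi is mine); it is checked against the information/data probabilities computed a few slides below:

```python
import math

def ppmi(p_xy, p_x, p_y):
    """Positive PMI from a joint probability and two marginals."""
    if p_xy == 0:
        return 0.0                      # log2(0) would be -inf; clipped to 0
    return max(math.log2(p_xy / (p_x * p_y)), 0.0)

# Using the probabilities for (information, data) from the worked example:
print(round(ppmi(6/19, 11/19, 7/19), 2))   # 0.57
```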
Computing PPMI on a term-context
matrix
Matrix F with W rows (words) and C columns (contexts)
fij is # of times wi occurs in context cj
$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}} \qquad p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$

$\text{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}} \qquad \text{ppmi}_{ij} = \begin{cases} \text{pmi}_{ij} & \text{if pmi}_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}$
p(w = information, c = data) = 6/19 = .32
p(w = information) = 11/19 = .58
p(c = data) = 7/19 = .37

p(w, context) and p(w):

             computer  data  pinch  result  sugar   p(w)
apricot        0.00    0.00   0.05   0.00    0.05   0.11
pineapple      0.00    0.00   0.05   0.00    0.05   0.11
digital        0.11    0.05   0.00   0.05    0.00   0.21
information    0.05    0.32   0.00   0.21    0.00   0.58

p(context)     0.16    0.37   0.11   0.26    0.11

pmi(information, data) = log2( .32 / (.37 × .58) ) = .58   (.57 using full precision)

PPMI(w, context):

             computer  data  pinch  result  sugar
apricot         -       -    2.25     -     2.25
pineapple       -       -    2.25     -     2.25
digital        1.66    0.00    -     0.00     -
information    0.00    0.57    -     0.47     -
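The PPMI table above can be reproduced from the raw term-context counts with the following Python sketch (not part of the lecture; variable names are my own):

```python
import math

words    = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
F = [                       # co-occurrence counts from the earlier term-context matrix
    [0, 0, 1, 0, 1],        # apricot
    [0, 0, 1, 0, 1],        # pineapple
    [2, 1, 0, 1, 0],        # digital
    [1, 6, 0, 4, 0],        # information
]

total = sum(sum(row) for row in F)                       # 19
p_w = [sum(row) / total for row in F]                    # row (word) marginals
p_c = [sum(F[i][j] for i in range(len(F))) / total       # column (context) marginals
       for j in range(len(contexts))]

for i, w in enumerate(words):
    for j, c in enumerate(contexts):
        p_wc = F[i][j] / total
        if p_wc > 0:
            ppmi = max(math.log2(p_wc / (p_w[i] * p_c[j])), 0.0)
            print(f"PPMI({w}, {c}) = {ppmi:.2f}")
# e.g. PPMI(apricot, pinch) = 2.25, PPMI(digital, computer) = 1.66,
#      PPMI(information, data) = 0.57, PPMI(information, result) = 0.47
```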
Weighting PMI
PMI is biased toward infrequent events
◦ Very rare words have very high PMI values
Two solutions:
◦ Give rare context words slightly higher probabilities (raise them to a power α < 1, next slide)
◦ Use add-one (Laplace) smoothing, which has a similar effect
Weighting PMI: Giving rare
context words slightly higher
probability
Raise the context probabilities to the power α = 0.75:

$P_\alpha(c) = \frac{\text{count}(c)^{\alpha}}{\sum_{c'} \text{count}(c')^{\alpha}}$

This helps because $P_\alpha(c) > P(c)$ for rare c.

Consider two events, P(a) = .99 and P(b) = .01:

$P_\alpha(a) = \frac{.99^{.75}}{.99^{.75} + .01^{.75}} = .97 \qquad P_\alpha(b) = \frac{.01^{.75}}{.99^{.75} + .01^{.75}} = .03$
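A small Python sketch of this α-weighting (the function name p_alpha is mine), reproducing the slide's two-event example by treating the probabilities as counts:

```python
def p_alpha(counts, alpha=0.75):
    """Context probabilities with counts raised to the power alpha before normalizing."""
    powered = {c: n ** alpha for c, n in counts.items()}
    z = sum(powered.values())
    return {c: v / z for c, v in powered.items()}

# The slide's example: the rare event b gets boosted from .01 to about .03.
print(p_alpha({"a": 0.99, "b": 0.01}))   # roughly {'a': 0.97, 'b': 0.03}
```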

Summary for Part I
• Idea of Embeddings: Represent a word as a
function of its distribution with other words
• Tf-idf
• Cosines
• PPMI
