Lecture 7: PPMI
Chapter 6:
Vector Semantics
Let's define words by their
usages
In particular, words are defined by their
environments (the words around them)
[Figure: 2D plot of words placed by the contexts they appear in; visible labels include not good, bad, dislike, worst, incredibly bad, worse, to, by, 's, that, now, are, a, i, you, than, with, is, and fool]
Vectors are the basis of
information retrieval
[Figure: the vector [1,1] plotted in a two-dimensional space with axes labeled 'digital' and 'data']
Visualizing cosines
(well, angles)
[Figure: the words apricot, digital, and information plotted as vectors, with Dimension 1 = 'large' and Dimension 2 = 'data', to visualize the angles between them]
Cosine for computing similarity (Sec. 6.3)

$$\cos(\vec{v},\vec{w}) = \frac{\vec{v}\cdot\vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\;\sqrt{\sum_{i=1}^{N} w_i^2}}$$

Term-context counts:

              large   data   computer
apricot         1      0        0
digital         0      1        2
information     1      6        1

Which pair of words is more similar?

$$\text{cosine}(\text{apricot},\text{information}) = \frac{1+0+0}{\sqrt{1+0+0}\,\sqrt{1+36+1}} = \frac{1}{\sqrt{38}} = .16$$

$$\text{cosine}(\text{digital},\text{information}) = \frac{0+6+2}{\sqrt{0+1+4}\,\sqrt{1+36+1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} = .58$$

$$\text{cosine}(\text{apricot},\text{digital}) = \frac{0+0+0}{\sqrt{1+0+0}\,\sqrt{0+1+4}} = 0$$
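As a sanity check, here is a minimal Python sketch (not from the slides) that recomputes these cosines from the count table above:

```python
import math

# Term-context counts from the slide (contexts: large, data, computer)
vectors = {
    "apricot":     [1, 0, 0],
    "digital":     [0, 1, 2],
    "information": [1, 6, 1],
}

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi ** 2 for vi in v))
    norm_w = math.sqrt(sum(wi ** 2 for wi in w))
    return dot / (norm_v * norm_w)

print(cosine(vectors["apricot"], vectors["information"]))  # ~0.16
print(cosine(vectors["digital"], vectors["information"]))  # ~0.58
print(cosine(vectors["apricot"], vectors["digital"]))      # 0.0
```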
But raw frequency is a bad
representation
Frequency is clearly useful; if sugar appears a lot
near apricot, that's useful information.
But overly frequent words like the, it, or they are not very informative about the context.
We need a function that resolves this frequency paradox!
tf-idf: combine two factors

tf: term frequency. frequency count (usually log-transformed):
$$\text{tf}_{t,d} = 1 + \log_{10}\text{count}(t,d) \text{ if count}(t,d) > 0, \text{ else } 0$$
idf: inverse document frequency:
$$\text{idf}_i = \log_{10}\frac{N}{\text{df}_i}, \text{ where } N = \text{total \# of docs and } \text{df}_i = \text{\# of docs that have word } i$$
Words like "the" or "good" appear in many documents and so have very low idf.
Example (toy document vector):

Dimension    I     Like  to    ride  cycle often Ali   and   Hassan ate   apple oranges
Raw Count    0     0     0     0     0     0     1     1     1      2     0     2
TF           0     0     0     0     0     0     1     1     1      1.3   0     1.3
IDF          0.48  0.48  0.48  0.48  0.48  0.48  0.18  0.48  0.48   0.18  0.18  0.18
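A minimal Python sketch of these two factors, using a hypothetical two-document toy corpus (the sentences are assumptions for illustration, so the numbers will not match the table above exactly):

```python
import math
from collections import Counter

# Hypothetical toy corpus (assumed for illustration; not the corpus behind the table above)
docs = [
    "I like to ride cycle often".lower().split(),
    "Ali and Hassan ate apple oranges".lower().split(),
]

def tf(count):
    """Log-transformed term frequency: 1 + log10(count), or 0 if the word is absent."""
    return 1 + math.log10(count) if count > 0 else 0.0

def idf(word, docs):
    """Inverse document frequency: log10(N / df_i), where df_i = # of docs containing the word."""
    df = sum(1 for d in docs if word in d)
    return math.log10(len(docs) / df) if df else 0.0

# tf-idf weights for one document
counts = Counter(docs[1])
for word, count in counts.items():
    print(word, round(tf(count) * idf(word, docs), 2))
```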
Pointwise Mutual Information
PMI measures whether the probability of two words x and y occurring together is higher than what we would expect if x and y were unrelated (independent):

$$\text{PMI}(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)}$$
Positive Pointwise Mutual Information
◦ PMI ranges from −∞ to + ∞
◦ But the negative values are problematic
◦ Things are co-occurring less than we expect by chance
◦ So we just replace negative PMI values by 0
◦ Positive PMI (PPMI) between word1 and word2:
$$\text{PPMI}(word_1, word_2) = \max\!\left(\log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)},\; 0\right)$$
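A minimal Python sketch of these two definitions (the function names and the probabilities you would pass in are assumptions, not from the slides):

```python
import math

def pmi(p_xy, p_x, p_y):
    """PMI: log2 of the observed joint probability over the expected-if-independent probability."""
    return math.log2(p_xy / (p_x * p_y))

def ppmi(p_xy, p_x, p_y):
    """Positive PMI: negative PMI values are replaced by 0."""
    return max(pmi(p_xy, p_x, p_y), 0.0)
```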
Computing PPMI on a term-context matrix

Matrix F with W rows (words) and C columns (contexts)
f_ij is the # of times word w_i occurs in context c_j

$$p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W}\sum_{j=1}^{C} f_{ij}}$$
$$p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N} \qquad p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}$$

p(w=information, c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37

p(w, context) and the marginals p(w), p(context):

              computer   data   pinch   result   sugar     p(w)
apricot         0.00     0.00    0.05    0.00     0.05     0.11
pineapple       0.00     0.00    0.05    0.00     0.05     0.11
digital         0.11     0.05    0.00    0.05     0.00     0.21
information     0.05     0.32    0.00    0.21     0.00     0.58
p(context)      0.16     0.37    0.11    0.26     0.11
$$\text{pmi}_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\,p_{*j}}$$

where p_{i*} = p(w_i) and p_{*j} = p(c_j) are the row and column marginals in the table above.
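A minimal Python sketch (not from the slides) that recomputes PPMI from the joint-probability table above:

```python
import math

words    = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]

# p(w, context) values from the slide (rounded to two decimals)
p_joint = [
    [0.00, 0.00, 0.05, 0.00, 0.05],   # apricot
    [0.00, 0.00, 0.05, 0.00, 0.05],   # pineapple
    [0.11, 0.05, 0.00, 0.05, 0.00],   # digital
    [0.05, 0.32, 0.00, 0.21, 0.00],   # information
]

# Marginals: p(w_i) = row sum, p(c_j) = column sum
p_w = [sum(row) for row in p_joint]
p_c = [sum(p_joint[i][j] for i in range(len(words))) for j in range(len(contexts))]

def ppmi(i, j):
    """PPMI(w_i, c_j) = max(log2(p_ij / (p(w_i) * p(c_j))), 0); defined as 0 when p_ij is 0."""
    if p_joint[i][j] == 0:
        return 0.0
    return max(math.log2(p_joint[i][j] / (p_w[i] * p_c[j])), 0.0)

# e.g. PPMI(information, data) comes out around 0.58 with these rounded probabilities
print(round(ppmi(words.index("information"), contexts.index("data")), 2))
```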
Weighting PMI
PMI is biased toward infrequent events
◦ Very rare words have very high PMI values
Two solutions:
◦ Give rare words slightly higher probabilities
◦ Use add-one smoothing (which has a similar effect)
Weighting PMI: Giving rare context words slightly higher probability

Raise the context probabilities to the power of α = 0.75:

$$\text{PPMI}_{\alpha}(w, c) = \max\!\left(\log_2 \frac{P(w,c)}{P(w)\,P_{\alpha}(c)},\; 0\right) \qquad P_{\alpha}(c) = \frac{\text{count}(c)^{\alpha}}{\sum_{c'} \text{count}(c')^{\alpha}}$$

Because α < 1, large counts are shrunk more than small ones, so rare contexts get slightly higher probability than their raw relative frequency.
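A minimal Python sketch of the α-weighted context distribution, using context counts consistent with the p(context) row above (assuming the total of 19 from the worked example):

```python
ALPHA = 0.75

# Context counts implied by the p(context) row above with N = 19 (3/19 ≈ .16, 7/19 ≈ .37, ...)
context_counts = {"computer": 3, "data": 7, "pinch": 2, "result": 5, "sugar": 2}

def weighted_context_probs(counts, alpha=ALPHA):
    """P_alpha(c) = count(c)^alpha / sum over c' of count(c')^alpha.
    Raising counts to a power below 1 shrinks large counts more than small ones,
    so rare contexts end up with slightly higher probability than count(c)/N."""
    denom = sum(c ** alpha for c in counts.values())
    return {ctx: (c ** alpha) / denom for ctx, c in counts.items()}

for ctx, p in weighted_context_probs(context_counts).items():
    raw = context_counts[ctx] / sum(context_counts.values())
    print(ctx, round(raw, 3), "->", round(p, 3))
```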
Summary for Part I
• Idea of embeddings: represent a word as a function of its co-occurrence distribution with other words
• Tf-idf
• Cosines
• PPMI