CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part I
Word Vectors I: Introduction, SVD and Word2Vec
Course Instructors: Christopher Manning, Richard Socher
Authors: Francois Chaubard, Michael Fang, Guillaume Genthial, Rohit Mundra, Richard Socher
Winter 2019
Examples of NLP tasks, at varying levels of difficulty:

Easy
• Spell Checking
• Keyword Search
• Finding Synonyms

Medium
• Parsing information from websites, documents, etc.

Hard
• Machine Translation (e.g. translate Chinese text to English)
• Semantic Analysis (what is the meaning of a query statement?)
• Coreference (e.g. what does "he" or "it" refer to given a document?)
• Question Answering (e.g. answering Jeopardy questions)
2 Word Vectors
indicate tense (past vs. present vs. future), count (singular vs. plural), and gender (masculine vs. feminine).

So let's dive into our first word vector, and arguably the simplest: the one-hot vector. Represent every word as an $\mathbb{R}^{|V| \times 1}$ vector with all 0s and one 1 at the index of that word in the sorted English language. In this notation, $|V|$ is the size of our vocabulary. Word vectors in this type of encoding would appear as the following:
$$w^{aardvark} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\quad w^{a} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\quad w^{at} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix},\quad \cdots,\quad w^{zebra} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$$

Fun fact: the term "one-hot" comes from digital circuit design, meaning "a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)".

We represent each word as a completely independent entity. As we previously discussed, this word representation does not directly give us any notion of similarity. For instance,

$$(w^{hotel})^T w^{motel} = (w^{hotel})^T w^{cat} = 0$$
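As a quick concrete illustration, here is a minimal numpy sketch of one-hot vectors over a toy vocabulary (the vocabulary and helper names are invented, not from the notes); it shows why dot products between such vectors carry no similarity information:

```python
import numpy as np

# Toy sorted vocabulary; |V| = 5 (illustrative only).
vocab = ["a", "aardvark", "at", "hotel", "motel"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the R^{|V|} one-hot vector for `word`."""
    v = np.zeros(len(vocab))
    v[word_to_index[word]] = 1.0
    return v

w_hotel, w_motel = one_hot("hotel"), one_hot("motel")
# The dot product of any two distinct one-hot vectors is 0,
# so "hotel" looks no more similar to "motel" than to "a".
print(w_hotel @ w_motel)  # 0.0
```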
We can cut the singular values off at some index $k$ based on the desired percentage of variance captured:

$$\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}$$
Applying SVD to $X$:

$$X = \underset{|V| \times |V|}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}} \; \underset{|V| \times |V|}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}} \; \underset{|V| \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$
Reducing dimensionality by selecting the first $k$ singular vectors:

$$\hat{X} = \underset{|V| \times k}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}} \; \underset{k \times k}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}} \; \underset{k \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$
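A minimal numpy sketch of this truncation on an invented toy co-occurrence matrix (the matrix values and the choice $k = 2$ are placeholders):

```python
import numpy as np

# Toy symmetric word-word co-occurrence matrix X (|V| = 4), invented for illustration.
X = np.array([[0, 2, 1, 0],
              [2, 0, 3, 1],
              [1, 3, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# Full SVD: X = U @ diag(s) @ Vt, with singular values s in decreasing order.
U, s, Vt = np.linalg.svd(X)

k = 2                                            # keep the first k singular vectors
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of X
word_vectors = U[:, :k] * s[:k]                  # one common choice of k-dim word vectors

print("variance captured:", s[:k].sum() / s.sum())
```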
Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information but are associated with many other problems:

• The dimensions of the matrix change very often (new words are added very frequently and the corpus changes in size).
• The matrix is extremely sparse since most words do not co-occur.
• The matrix is very high dimensional in general ($\approx 10^6 \times 10^6$).

SVD-based methods do not scale well for big matrices and it is hard to incorporate new words or documents. The computational cost for an $m \times n$ matrix is $O(mn^2)$.
The idea is to design a model whose parameters are the word vectors. Then, train the model on a certain objective. At every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error. Thus, we learn our word vectors. This idea is a very old one, dating back to [Rumelhart et al., 1988].

Iteration-based methods capture co-occurrence of words one at a time instead of capturing all co-occurrence counts directly as in SVD-based methods.
We want to assign a probability to any sequence of $n$ words:

$$P(w_1, w_2, \cdots, w_n)$$
We can take the unary language model approach and break apart
this probability by assuming the word occurrences are completely
independent:
$$P(w_1, w_2, \cdots, w_n) = \prod_{i=1}^{n} P(w_i)$$

However, we know this is a bit ludicrous because we know the next word is highly contingent upon the previous sequence of words. And the silly sentence example might actually score highly. So perhaps we let the probability of the sequence depend on the pairwise
probability of a word in the sequence and the word next to it. We call
this the bigram model and represent it as:
$$P(w_1, w_2, \cdots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$

Again this is certainly a bit naive since we are only concerning ourselves with pairs of neighboring words rather than evaluating a
whole sentence, but as we will see, this representation gets us pretty
far along. Note in the Word-Word Matrix with a context of size 1, we
basically can learn these pairwise probabilities. But again, this would
require computing and storing global information about a massive
dataset.
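As a rough sketch, here is what maximum-likelihood estimates of these two models look like when computed from raw counts; the tiny corpus and function names are invented for illustration:

```python
from collections import Counter

# Tiny invented corpus, pre-tokenized for illustration.
corpus = [["the", "cat", "jumped", "over", "the", "puddle"],
          ["the", "dog", "jumped", "over", "the", "cat"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent)))
total = sum(unigrams.values())

def p_unigram(sentence):
    """P(w_1, ..., w_n) = prod_i P(w_i), with MLE unigram probabilities."""
    p = 1.0
    for w in sentence:
        p *= unigrams[w] / total
    return p

def p_bigram(sentence):
    """P(w_1, ..., w_n) = prod_{i=2..n} P(w_i | w_{i-1}), with MLE bigram probabilities."""
    p = 1.0
    for prev, w in zip(sentence, sentence[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

print(p_unigram(["the", "cat", "jumped"]), p_bigram(["the", "cat", "jumped"]))
```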
Now that we understand how we can think about a sequence of
tokens having a probability, let us observe some example models that
could learn these probabilities.
In the continuous bag-of-words model (CBOW), we treat the context words as input and try to predict the center word. We break down the way this model works in these steps:

1. We generate our one-hot word vectors for the input context of size $m$: $(x^{(c-m)}, \ldots, x^{(c-1)}, x^{(c+1)}, \ldots, x^{(c+m)} \in \mathbb{R}^{|V|})$.

2. We get our embedded word vectors for the context: $(v_{c-m} = V x^{(c-m)}, \; v_{c-m+1} = V x^{(c-m+1)}, \ldots, v_{c+m} = V x^{(c+m)} \in \mathbb{R}^{n})$.

3. Average these vectors to get $\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \ldots + v_{c+m}}{2m} \in \mathbb{R}^{n}$.

4. Generate a score vector $z = U \hat{v} \in \mathbb{R}^{|V|}$.
5. Turn the scores into probabilities $\hat{y} = \text{softmax}(z) \in \mathbb{R}^{|V|}$.

6. We desire our probabilities generated, $\hat{y} \in \mathbb{R}^{|V|}$, to match the true probabilities, $y \in \mathbb{R}^{|V|}$, which also happens to be the one-hot vector of the actual word.

The softmax is an operator that we'll use very frequently. It transforms a vector into a vector whose $i$-th component is $\frac{e^{\hat{y}_i}}{\sum_{k=1}^{|V|} e^{\hat{y}_k}}$.
We use the cross-entropy $H(\hat{y}, y)$ to measure the distance between these two distributions:

$$H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log(\hat{y}_j)$$
Since $y$ is a one-hot vector, the loss simplifies to $H(\hat{y}, y) = -y_c \log(\hat{y}_c)$, where $c$ is the index where the correct word's one-hot vector is 1. We can now consider the case where our prediction was perfect and thus $\hat{y}_c = 1$. We can then calculate $H(\hat{y}, y) = -1 \log(1) = 0$. Thus, for a perfect prediction, we face no penalty or loss. Now let us consider the opposite case where our prediction was very bad and thus $\hat{y}_c = 0.01$. As before, we can calculate our loss to be $H(\hat{y}, y) = -1 \log(0.01) \approx 4.605$. We can thus see that for probability distributions, cross entropy provides us with a good measure of distance. We thus formulate our optimization objective as:

$$\text{minimize } J = -\log P(w_c \mid w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}) = -u_c^T \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^T \hat{v})$$

Note that $\hat{y} \mapsto H(\hat{y}, y)$ is minimized when $\hat{y} = y$. So if we find a $\hat{y}$ such that $H(\hat{y}, y)$ is close to the minimum, then $\hat{y} \approx y$, which means our model is very good at predicting the center word. To learn the vectors (the matrices $U$ and $V$), CBOW defines a cost that measures how good it is at predicting the center word; we then optimize this cost by updating the matrices $U$ and $V$ via stochastic gradient descent.
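To make steps 5 and 6 and the loss concrete, here is a minimal numpy sketch of a CBOW-style forward pass with its cross-entropy loss; the dimensions, random matrices $U$ and $V$, and the chosen context/center indices are illustrative placeholders, not values from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n = 10, 4                      # |V| and embedding dimension, chosen arbitrarily
V = rng.normal(size=(n, V_size))       # input word matrix  (columns are input vectors v_i)
U = rng.normal(size=(V_size, n))       # output word matrix (rows are output vectors u_i)

def softmax(z):
    z = z - z.max()                    # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

context_ids, center_id = [1, 2, 4, 5], 3
v_hat = V[:, context_ids].mean(axis=1) # average the context input vectors
z = U @ v_hat                          # scores for every word in the vocabulary
y_hat = softmax(z)                     # predicted distribution over the vocabulary

# Cross entropy against the one-hot target reduces to -log(y_hat[c]).
loss = -np.log(y_hat[center_id])
print(loss)
```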
Another approach is to treat the center word "jumped" as given and predict the surrounding words; here we call the word "jumped" the context. We call this type of model a Skip-Gram model.

Let's discuss the Skip-Gram model above. The setup is largely the same, but we essentially swap our $x$ and $y$, i.e. $x$ in CBOW is now $y$ and vice-versa. The input one-hot vector (center word) we will represent with an $x$ (since there is only one), and the output vectors with $y^{(j)}$. We define $V$ and $U$ the same as in CBOW.

Notation for the Skip-Gram model:
• $w_i$: word $i$ from vocabulary $V$
• $V \in \mathbb{R}^{n \times |V|}$: input word matrix
• $v_i$: $i$-th column of $V$, the input vector representation of word $w_i$

1. We generate our one-hot input vector $x \in \mathbb{R}^{|V|}$ of the center word.
As with CBOW, to train this model via stochastic gradient descent we need:

• an objective function
• gradients
• update rules
$$J = -\sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} \; - \sum_{(w,c) \in \tilde{D}} \log \frac{1}{1 + \exp(u_w^T v_c)}$$

where $D$ is the set of (word, context) pairs observed in the corpus and $\tilde{D}$ is a set of sampled "negative" pairs that do not appear in it.
For skip-gram, our new objective function for observing the context word $c - m + j$ given the center word $c$ would be

$$-\log \sigma(u_{c-m+j}^T \cdot v_c) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^T \cdot v_c)$$

To compare, the regular softmax loss for skip-gram is $-u_{c-m+j}^T v_c + \log \sum_{k=1}^{|V|} \exp(u_k^T v_c)$.
For CBOW, our new objective function for observing the center word $u_c$ given the context vector $\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \ldots + v_{c+m}}{2m}$ would be

$$-\log \sigma(u_c^T \cdot \hat{v}) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^T \cdot \hat{v})$$

To compare, the regular softmax loss for CBOW is $-u_c^T \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^T \hat{v})$.

In the above formulation, $\{\tilde{u}_k \mid k = 1 \ldots K\}$ are sampled from $P_n(w)$.
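A minimal numpy sketch of the skip-gram negative-sampling term above; the vector dimension n, the number of negatives K, and the random vectors are illustrative placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, K = 4, 5                               # embedding size and number of negative samples
v_c = rng.normal(size=n)                  # input vector of the center word
u_o = rng.normal(size=n)                  # output vector of the observed context word
u_neg = rng.normal(size=(K, n))           # output vectors of K sampled negative words

# loss = -log sigma(u_o . v_c) - sum_k log sigma(-u_k~ . v_c)
loss = -np.log(sigmoid(u_o @ v_c)) - np.sum(np.log(sigmoid(-u_neg @ v_c)))
print(loss)
```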
Let's discuss what $P_n(w)$ should be. While there is much discussion about what makes the best approximation, what seems to work best is the Unigram Model raised to the power of 3/4. Why 3/4? Here's an example that might help gain some intuition:

is: $0.9^{3/4} = 0.92$
constitution: $0.09^{3/4} = 0.16$
bombastic: $0.01^{3/4} = 0.032$

"bombastic" is now 3x more likely to be sampled, while "is" only becomes marginally more likely.
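As a rough illustration of the effect of the 3/4 power, here is a small numpy sketch on an invented three-word unigram distribution (the counts are placeholders):

```python
import numpy as np

# Invented unigram counts: a very frequent word, a mid-frequency word, a rare word.
counts = np.array([900.0, 90.0, 10.0])
unigram = counts / counts.sum()

noise = unigram ** 0.75          # raise to the 3/4 power ...
noise = noise / noise.sum()      # ... and renormalize to get P_n(w)

# Rare words gain relative probability mass and frequent words lose some,
# so negative samples are less dominated by very common words.
print(unigram)                   # [0.9  0.09 0.01]
print(noise)
```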
Let $L(w)$ be the number of nodes in the path from the root to the leaf $w$, and let $n(w, i)$ be the $i$-th node on this path with associated vector $v_{n(w,i)}$. So $n(w, 1)$ is the root, while $n(w, L(w))$ is the father of $w$. Now for each inner node $n$, we arbitrarily choose one of its children and call it $ch(n)$ (e.g. always the left node). Then, we can compute the probability as

$$P(w \mid w_i) = \prod_{j=1}^{L(w)-1} \sigma\big( [n(w, j+1) = ch(n(w, j))] \cdot v_{n(w,j)}^T v_{w_i} \big)$$

where

$$[x] = \begin{cases} 1 & \text{if } x \text{ is true} \\ -1 & \text{otherwise} \end{cases}$$

and $\sigma(\cdot)$ is the sigmoid function.
This formula is fairly dense, so let’s examine it more closely.
First, we are computing a product of terms based on the shape of the path from the root ($n(w, 1)$) to the leaf ($w$). If we assume $ch(n)$ is always the left node of $n$, then the term $[n(w, j+1) = ch(n(w, j))]$ returns 1 when the path goes left, and -1 if it goes right.
Furthermore, the term $[n(w, j+1) = ch(n(w, j))]$ provides normalization. At a node $n$, if we sum the probabilities for going to the left and right node, you can check that for any value of $v_n^T v_{w_i}$,

$$\sigma(v_n^T v_{w_i}) + \sigma(-v_n^T v_{w_i}) = 1$$
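A tiny numpy sketch of the product above for one hypothetical path of inner nodes; the path length, node vectors, and direction signs are invented for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n = 4
v_wi = rng.normal(size=n)                   # input vector of the observed word w_i

# Hypothetical path from the root to the leaf for word w: three inner nodes,
# each with its own learned vector, plus the sign [n(w, j+1) = ch(n(w, j))].
path_vectors = rng.normal(size=(3, n))      # v_{n(w,1)}, v_{n(w,2)}, v_{n(w,3)}
signs = np.array([1, -1, 1])                # +1 = path goes to ch(n), -1 otherwise

# P(w | w_i) = prod_j sigma(sign_j * v_{n(w,j)} . v_{w_i})
p = np.prod(sigmoid(signs * (path_vectors @ v_wi)))
print(p)

# Sanity check of the normalization argument: sigma(x) + sigma(-x) == 1.
x = path_vectors[0] @ v_wi
print(sigmoid(x) + sigmoid(-x))             # 1.0
```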
References
[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A
neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.
[Collobert et al., 2011] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu,
K., and Kuksa, P. P. (2011). Natural language processing (almost) from scratch.
CoRR, abs/1103.0398.
[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient
estimation of word representations in vector space. CoRR, abs/1301.3781.
[Rong, 2014] Rong, X. (2014). word2vec parameter learning explained. CoRR,
abs/1411.2738.
[Rumelhart et al., 1988] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning representations by back-propagating errors. In Neurocomputing: Foundations of Research, pages 696–699. MIT Press, Cambridge, MA, USA.