
CS224n: Natural Language Processing with Deep Learning
Lecture Notes: Part I
Word Vectors I: Introduction, SVD and Word2Vec
Winter 2019

Course Instructors: Christopher Manning, Richard Socher
Authors: Francois Chaubard, Michael Fang, Guillaume Genthial, Rohit Mundra, Richard Socher

Keyphrases: Natural Language Processing. Word Vectors. Singular Value Decomposition. Skip-gram. Continuous Bag of Words (CBOW). Negative Sampling. Hierarchical Softmax. Word2Vec.
This set of notes begins by introducing the concept of Natural
Language Processing (NLP) and the problems NLP faces today. We
then move forward to discuss the concept of representing words as
numeric vectors. Lastly, we discuss popular approaches to designing
word vectors.

1 Introduction to Natural Language Processing

We begin with a general discussion of what NLP is.

1.1 What is so special about NLP?


What's so special about human (natural) language? Human language is a system specifically constructed to convey meaning, and is not produced by a physical manifestation of any kind. In that way, it is very different from vision or any other machine learning task: natural language is a discrete/symbolic/categorical system.
Most words are just symbols for an extra-linguistic entity: the word is a signifier that maps to a signified (idea or thing). For instance, the word "rocket" refers to the concept of a rocket, and by extension can designate an instance of a rocket. There are some exceptions, when we use words and letters for expressive signaling, like in "Whooompaa". On top of this, the symbols of language can be encoded in several modalities (voice, gesture, writing, etc.) that are transmitted via continuous signals to the brain, which itself appears to encode things in a continuous manner. (A lot of work in philosophy of language and linguistics has been done to conceptualize human language and distinguish words from their references, meanings, etc. Among others, see works by Wittgenstein, Frege, Russell and Mill.)

1.2 Examples of tasks


There are different levels of tasks in NLP, from speech processing to semantic interpretation and discourse processing. The goal of NLP is to be able to design algorithms to allow computers to "understand" natural language in order to perform some task. Example tasks come in varying levels of difficulty:

Easy

• Spell Checking

• Keyword Search

• Finding Synonyms

Medium

• Parsing information from websites, documents, etc.

Hard

• Machine Translation (e.g. Translate Chinese text to English)

• Semantic Analysis (What is the meaning of a query statement?)

• Coreference (e.g. What does "he" or "it" refer to given a document?)

• Question Answering (e.g. Answering Jeopardy questions).

1.3 How to represent words?


The first and arguably most important common denominator across all NLP tasks is how we represent words as input to any of our models. Much of the earlier NLP work that we will not cover treats words as atomic symbols. To perform well on most NLP tasks we first need to have some notion of similarity and difference between words. With word vectors, we can quite easily encode this ability in the vectors themselves (using distance measures such as Jaccard, cosine, Euclidean, etc.).
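For example, cosine similarity between two word vectors reduces to a normalized dot product. The following is a minimal sketch; the vectors and vocabulary are made-up placeholders, not learned embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 4-dimensional embeddings (made-up numbers, purely for illustration).
word_vectors = {
    "hotel": np.array([0.8, 0.1, 0.3, 0.4]),
    "motel": np.array([0.7, 0.2, 0.3, 0.5]),
    "cat":   np.array([0.1, 0.9, 0.0, 0.2]),
}

print(cosine_similarity(word_vectors["hotel"], word_vectors["motel"]))  # relatively high
print(cosine_similarity(word_vectors["hotel"], word_vectors["cat"]))    # lower
```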

2 Word Vectors

There are an estimated 13 million tokens for the English language, but are they all completely unrelated? Feline to cat, hotel to motel? I think not. Thus, we want to encode word tokens each into some vector that represents a point in some sort of "word" space. This is paramount for a number of reasons, but the most intuitive reason is that perhaps there actually exists some N-dimensional space (such that N ≪ 13 million) that is sufficient to encode all semantics of our language. Each dimension would encode some meaning that we transfer using speech. For instance, semantic dimensions might indicate tense (past vs. present vs. future), count (singular vs. plural), and gender (masculine vs. feminine).
So let's dive into our first word vector and arguably the most simple, the one-hot vector: represent every word as an R^{|V|×1} vector with all 0s and one 1 at the index of that word in the sorted English language. In this notation, |V| is the size of our vocabulary.
Word vectors in this type of encoding would appear as the following:

$$w^{\text{aardvark}} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
w^{\text{a}} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
w^{\text{at}} = \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \cdots, \quad
w^{\text{zebra}} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$$

Fun fact: the term "one-hot" comes from digital circuit design, meaning "a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)".
We represent each word as a completely independent entity. As we previously discussed, this word representation does not give us directly any notion of similarity. For instance,

$$(w^{\text{hotel}})^T w^{\text{motel}} = (w^{\text{hotel}})^T w^{\text{cat}} = 0$$


So maybe we can try to reduce the size of this space from R^{|V|} to something smaller and thus find a subspace that encodes the relationships between words.
Denotational semantics: the concept of representing an idea as a symbol (a word or a one-hot vector). It is sparse and cannot capture similarity. This is a "localist" representation.
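As a minimal sketch of this localist representation (the tiny sorted vocabulary below is a made-up placeholder), one-hot vectors can be built as follows; the dot product between any two distinct words is always 0, confirming that this encoding carries no similarity information.

```python
import numpy as np

# Hypothetical tiny sorted vocabulary, standing in for the full English vocabulary.
vocab = ["a", "aardvark", "at", "cat", "hotel", "motel", "zebra"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the R^{|V| x 1} one-hot vector for a word."""
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

w_hotel, w_motel, w_cat = one_hot("hotel"), one_hot("motel"), one_hot("cat")
print(np.dot(w_hotel, w_motel))  # 0.0 -- no notion of similarity
print(np.dot(w_hotel, w_cat))    # 0.0
```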

3 SVD Based Methods

For this class of methods to find word embeddings (otherwise known as word vectors), we first loop over a massive dataset and accumulate word co-occurrence counts in some form of a matrix X, and then perform Singular Value Decomposition on X to get a USV^T decomposition. We then use the rows of U as the word embeddings for all words in our dictionary. Let us discuss a few choices of X.

3.1 Word-Document Matrix


Distributional semantics: the concept of representing the meaning of a word based on the context in which it usually appears. It is dense and can better capture similarity.
As our first attempt, we make the bold conjecture that words that are related will often appear in the same documents. For instance, "banks", "bonds", "stocks", "money", etc. are probably likely to appear together. But "banks", "octopus", "banana", and "hockey" would probably not consistently appear together. We use this fact to build a word-document matrix X in the following manner: loop over billions of documents and for each time word i appears in document j, we add one to entry X_ij. This is obviously a very large matrix (R^{|V|×M}) and it scales with the number of documents (M). So perhaps we can try something better.
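A minimal sketch of building such a word-document count matrix over a toy corpus (the two "documents" below are made-up placeholders):

```python
import numpy as np

# Hypothetical toy documents, purely for illustration.
documents = [
    "banks bonds stocks money banks",
    "octopus banana hockey",
]

vocab = sorted({w for doc in documents for w in doc.split()})
word_index = {w: i for i, w in enumerate(vocab)}

# X is |V| x M: X[i, j] counts how many times word i appears in document j.
X = np.zeros((len(vocab), len(documents)))
for j, doc in enumerate(documents):
    for w in doc.split():
        X[word_index[w], j] += 1

print(vocab)
print(X)
```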

3.2 Window based Co-occurrence Matrix


The same kind of logic applies here; however, the matrix X stores co-occurrences of words, thereby becoming an affinity matrix. In this method we count the number of times each word appears inside a window of a particular size around the word of interest. We calculate this count for all the words in the corpus. We display an example below. Let our corpus contain just three sentences and the window size be 1:

1. I enjoy flying.
2. I like NLP.
3. I like deep learning.

Using a word-word co-occurrence matrix:
• Generate the |V| × |V| co-occurrence matrix, X.
• Apply SVD on X to get X = USV^T.
• Select the first k columns of U to get k-dimensional word vectors.
• The ratio $\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}$ indicates the amount of variance captured by the first k dimensions.

The resulting counts matrix will then be:

$$X = \begin{array}{c|cccccccc}
 & \text{I} & \text{like} & \text{enjoy} & \text{deep} & \text{learning} & \text{NLP} & \text{flying} & \text{.} \\ \hline
\text{I} & 0 & 2 & 1 & 0 & 0 & 0 & 0 & 0 \\
\text{like} & 2 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
\text{enjoy} & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
\text{deep} & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\
\text{learning} & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
\text{NLP} & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 \\
\text{flying} & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
\text{.} & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0
\end{array}$$
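A minimal sketch of building this window-based co-occurrence matrix from the three sentences above (tokenization by whitespace and the row/column ordering are simplifying assumptions):

```python
import numpy as np

corpus = ["I enjoy flying .", "I like NLP .", "I like deep learning ."]
window = 1

# Vocabulary in order of first appearance (the notes order the rows differently).
vocab = []
for sentence in corpus:
    for w in sentence.split():
        if w not in vocab:
            vocab.append(w)
idx = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how often word j appears within `window` words of word i.
X = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    tokens = sentence.split()
    for pos, w in enumerate(tokens):
        for k in range(max(0, pos - window), min(len(tokens), pos + window + 1)):
            if k != pos:
                X[idx[w], idx[tokens[k]]] += 1

print(vocab)
print(X)  # up to row/column ordering, this reproduces the counts matrix shown above
```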

3.3 Applying SVD to the co-occurrence matrix


We now perform SVD on X, observe the singular values (the diagonal entries in the resulting S matrix), and cut them off at some index k based on the desired percentage variance captured:

$$\frac{\sum_{i=1}^{k} \sigma_i}{\sum_{i=1}^{|V|} \sigma_i}$$

We then take the submatrix of U_{1:|V|, 1:k} to be our word embedding matrix. This would thus give us a k-dimensional representation of every word in the vocabulary.

Applying SVD to X:

$$\underset{|V| \times |V|}{X} = \underset{|V| \times |V|}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}}
\underset{|V| \times |V|}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}}
\underset{|V| \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$

Reducing dimensionality by selecting the first k singular vectors:

$$\underset{|V| \times |V|}{\hat{X}} = \underset{|V| \times k}{\begin{bmatrix} | & | & \\ u_1 & u_2 & \cdots \\ | & | & \end{bmatrix}}
\underset{k \times k}{\begin{bmatrix} \sigma_1 & 0 & \cdots \\ 0 & \sigma_2 & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}}
\underset{k \times |V|}{\begin{bmatrix} - & v_1 & - \\ - & v_2 & - \\ & \vdots & \end{bmatrix}}$$
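A hedged sketch of this SVD pipeline with NumPy; the co-occurrence matrix X is assumed to come from the previous step, and k is a free choice guided by the singular-value spectrum:

```python
import numpy as np

def svd_word_vectors(X, k):
    """Truncated-SVD word vectors from a |V| x |V| co-occurrence matrix X."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Fraction of "variance" captured by the first k singular values, as defined in the notes.
    captured = S[:k].sum() / S.sum()
    # Row i of U[:, :k] is the k-dimensional embedding of word i.
    return U[:, :k], captured

# Example usage with the 8 x 8 window co-occurrence matrix built above:
# embeddings, captured = svd_word_vectors(X, k=2)
```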

Both of these methods give us word vectors that are more than sufficient to encode semantic and syntactic (part of speech) information but are associated with many other problems:

• The dimensions of the matrix change very often (new words are added very frequently and the corpus changes in size).

• The matrix is extremely sparse since most words do not co-occur.

• The matrix is very high dimensional in general (≈ 10^6 × 10^6).

• Quadratic cost to train (i.e. to perform SVD).

• Requires the incorporation of some hacks on X to account for the drastic imbalance in word frequency.

In short, SVD-based methods do not scale well for big matrices and it is hard to incorporate new words or documents; the computational cost for an m × n matrix is O(mn^2). On the other hand, count-based methods make efficient use of the statistics.

Some solutions exist to resolve some of the issues discussed above:

• Ignore function words such as "the", "he", "has", etc.

• Apply a ramp window, i.e. weight the co-occurrence count based on the distance between the words in the document.

• Use Pearson correlation and set negative counts to 0 instead of using just the raw count.

As we see in the next section, iteration based methods solve many of these issues in a far more elegant manner.

4 Iteration Based Methods - Word2vec


Let us step back and try a new approach. Instead of computing and storing global information about some huge dataset (which might be billions of sentences), we can try to create a model that will be able to learn one iteration at a time and eventually be able to encode the probability of a word given its context. (For an overview of word2vec, a note map can be found here: https://myndbook.com/view/4900. A detailed summary of word2vec models can also be found in [Rong, 2014].)
The idea is to design a model whose parameters are the word vectors. Then, train the model on a certain objective. At every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error. Thus, we learn our word vectors. Iteration-based methods capture the co-occurrence of words one at a time instead of capturing all co-occurrence counts directly as SVD-based methods do. This idea is a very old one, dating back to 1986. We call this method "backpropagating" the errors (see [Rumelhart et al., 1988]). The simpler the model and the task, the faster it will be to train it.
Several approaches have been tested. [Collobert et al., 2011] design models for NLP whose first step is to transform each word into a vector. For each special task (Named Entity Recognition, Part-of-Speech tagging, etc.) they train not only the model's parameters but also the vectors and achieve great performance, while computing good word vectors! Other interesting reading would be [Bengio et al., 2003].
In this class, we will present a simpler, more recent, probabilistic method by [Mikolov et al., 2013]: word2vec. This model relies on a very important hypothesis in linguistics, distributional similarity, the idea that similar words have similar contexts. The context of a word is the set of m surrounding words. For instance, the m = 2 context of the word "fox" in the sentence "The quick brown fox jumped over the lazy dog" is {"quick", "brown", "jumped", "over"}. Word2vec is a software package that actually includes:
- 2 algorithms: continuous bag-of-words (CBOW) and skip-gram. CBOW aims to predict a center word from the surrounding context in terms of word vectors. Skip-gram does the opposite, and predicts the distribution (probability) of context words from a center word.
- 2 training methods: negative sampling and hierarchical softmax. Negative sampling defines an objective by sampling negative examples, while hierarchical softmax defines an objective using an efficient tree structure to compute probabilities for all the vocabulary.

4.1 Language Models (Unigrams, Bigrams, etc.)


First, we need to create such a model that will assign a probability to
a sequence of tokens. Let us start with an example:

"The cat jumped over the puddle."

A good language model will give this sentence a high probability because this is a completely valid sentence, syntactically and semantically. Similarly, the sentence "stock boil fish is toy" should have a very low probability because it makes no sense. Mathematically, we can call this probability on any given sequence of n words:

$$P(w_1, w_2, \cdots, w_n)$$

We can take the unary language model approach and break apart this probability by assuming the word occurrences are completely independent. This is the unigram model:

$$P(w_1, w_2, \cdots, w_n) = \prod_{i=1}^{n} P(w_i)$$

However, we know this is a bit ludicrous because we know the next word is highly contingent upon the previous sequence of words. And the silly sentence example might actually score highly. So perhaps we let the probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it. We call this the bigram model and represent it as:

$$P(w_1, w_2, \cdots, w_n) = \prod_{i=2}^{n} P(w_i \mid w_{i-1})$$

Again this is certainly a bit naive since we are only concerning ourselves with pairs of neighboring words rather than evaluating a whole sentence, but as we will see, this representation gets us pretty far along. Note that with the word-word matrix with a context of size 1, we can basically learn these pairwise probabilities. But again, this would require computing and storing global information about a massive dataset.
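A minimal sketch of estimating these unigram and bigram probabilities from raw counts over a toy corpus (the corpus is a made-up placeholder and no smoothing is applied):

```python
from collections import Counter

# Tiny hypothetical corpus, purely for illustration.
corpus = [
    "the cat jumped over the puddle".split(),
    "the cat sat on the mat".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i - 1], sent[i]) for sent in corpus for i in range(1, len(sent)))
total = sum(unigrams.values())

def p_unigram(sentence):
    """Unigram model: P(w_1, ..., w_n) = product of P(w_i)."""
    p = 1.0
    for w in sentence:
        p *= unigrams[w] / total
    return p

def p_bigram(sentence):
    """Bigram model: product of P(w_i | w_{i-1})."""
    p = 1.0
    for i in range(1, len(sentence)):
        p *= bigrams[(sentence[i - 1], sentence[i])] / unigrams[sentence[i - 1]]
    return p

test = "the cat jumped over the puddle".split()
print(p_unigram(test), p_bigram(test))
```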
Now that we understand how we can think about a sequence of
tokens having a probability, let us observe some example models that
could learn these probabilities.

4.2 Continuous Bag of Words Model (CBOW)


One approach is to treat {"The", "cat", "over", "the", "puddle"} as a context and from these words, be able to predict or generate the center word "jumped". This type of model we call a Continuous Bag of Words (CBOW) Model: predicting a center word from the surrounding context.
Let's discuss the CBOW Model above in greater detail. First, we set up our known parameters. Let the known parameters in our model be the sentence represented by one-hot word vectors. The input one-hot vectors or context we will represent with x^{(c)}, and the output as y^{(c)}; in the CBOW model, since we only have one output, we just call this y, which is the one-hot vector of the known center word. Now let's define our unknowns in our model.
For each word, we want to learn two vectors:
• v (input vector): when the word is in the context
• u (output vector): when the word is the center word
We create two matrices, V ∈ R^{n×|V|} and U ∈ R^{|V|×n}, where n is an arbitrary size which defines the size of our embedding space. V is the input word matrix such that the i-th column of V is the n-dimensional embedded vector for word w_i when it is an input to this model. We denote this n × 1 vector as v_i. Similarly, U is the output word matrix. The j-th row of U is an n-dimensional embedded vector for word w_j when it is an output of the model. We denote this row of U as u_j. Note that we do in fact learn two vectors for every word w_i (i.e. input word vector v_i and output word vector u_i).
Notation for the CBOW model:
• w_i: word i from vocabulary V
• V ∈ R^{n×|V|}: input word matrix
• v_i: i-th column of V, the input vector representation of word w_i
• U ∈ R^{|V|×n}: output word matrix
• u_i: i-th row of U, the output vector representation of word w_i

We break down the way this model works in these steps:

1. We generate our one-hot word vectors for the input context of size m: x^{(c−m)}, . . . , x^{(c−1)}, x^{(c+1)}, . . . , x^{(c+m)} ∈ R^{|V|}.

2. We get our embedded word vectors for the context: v_{c−m} = V x^{(c−m)}, v_{c−m+1} = V x^{(c−m+1)}, . . . , v_{c+m} = V x^{(c+m)} ∈ R^n.

3. Average these vectors to get $\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \ldots + v_{c+m}}{2m} \in \mathbb{R}^n$.

4. Generate a score vector z = U v̂ ∈ R^{|V|}. As the dot product of similar vectors is higher, it will push similar words close to each other in order to achieve a high score.

5. Turn the scores into probabilities ŷ = softmax(z) ∈ R^{|V|}.

6. We desire our probabilities generated, ŷ ∈ R^{|V|}, to match the true probabilities, y ∈ R^{|V|}, which also happens to be the one-hot vector of the actual word.

The softmax is an operator that we'll use very frequently. It transforms a vector into a vector whose i-th component is $\frac{e^{\hat{y}_i}}{\sum_{k=1}^{|V|} e^{\hat{y}_k}}$: exponentiating makes every component positive, and dividing by $\sum_{k=1}^{|V|} e^{\hat{y}_k}$ normalizes the vector ($\sum_{k=1}^{|V|} \hat{y}_k = 1$) so that it gives a probability distribution.

So now that we have an understanding of how our model would work if we had a V and U, how would we learn these two matrices? Well, we need to create an objective function. Very often when we are trying to learn a probability from some true probability, we look to information theory to give us a measure of the distance between two distributions. Here, we use a popular choice of distance/loss measure, cross entropy H(ŷ, y).
The intuition for the use of cross-entropy in the discrete case can be derived from the formulation of the loss function:

$$H(\hat{y}, y) = -\sum_{j=1}^{|V|} y_j \log(\hat{y}_j)$$

Let us concern ourselves with the case at hand, which is that y is a one-hot vector. Thus we know that the above loss simplifies to simply:

$$H(\hat{y}, y) = -y_c \log(\hat{y}_c)$$

In this formulation, c is the index where the correct word's one-hot vector is 1. We can now consider the case where our prediction was perfect and thus ŷ_c = 1. We can then calculate H(ŷ, y) = −1 log(1) = 0. Thus, for a perfect prediction, we face no penalty or loss. Now let us consider the opposite case where our prediction was very bad and thus ŷ_c = 0.01. As before, we can calculate our loss to be H(ŷ, y) = −1 log(0.01) ≈ 4.605. We can thus see that for probability distributions, cross entropy provides us with a good measure of distance. Note that ŷ ↦ H(ŷ, y) is minimized when ŷ = y; so if we find a ŷ such that H(ŷ, y) is close to the minimum, then ŷ ≈ y, which means our model is very good at predicting the center word.

(Figure 1: This image demonstrates how CBOW works and how we must learn the transfer matrices.)

To learn the vectors (the matrices U and V), CBOW defines a cost that measures how good it is at predicting the center word; we then optimize this cost by updating U and V with stochastic gradient descent. We thus formulate our optimization objective as:

$$\begin{aligned}
\text{minimize } J &= -\log P(w_c \mid w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m}) \\
&= -\log P(u_c \mid \hat{v}) \\
&= -\log \frac{\exp(u_c^T \hat{v})}{\sum_{j=1}^{|V|} \exp(u_j^T \hat{v})} \\
&= -u_c^T \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^T \hat{v})
\end{aligned}$$
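A hedged sketch of a single CBOW forward pass and its loss with NumPy, following the notation above; the vocabulary size, embedding dimension, and random initialization are placeholders, and no gradient step is shown:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n = 10, 4                    # |V| and embedding dimension (placeholders)
V = rng.normal(size=(n, V_size))     # input word matrix, columns are v_i
U = rng.normal(size=(V_size, n))     # output word matrix, rows are u_i

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_loss(context_ids, center_id):
    """Cross-entropy loss -log P(center word | context) for one training example."""
    v_hat = V[:, context_ids].mean(axis=1)   # average of the context input vectors
    y_hat = softmax(U @ v_hat)               # predicted distribution over the vocabulary
    return -np.log(y_hat[center_id])

# Example: context word ids [1, 2, 4, 5] predicting center word id 3.
print(cbow_loss([1, 2, 4, 5], 3))
```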

We use stochastic gradient descent to update all relevant word vectors u_c and v_j. Stochastic gradient descent (SGD) computes gradients for a window and updates the parameters:

$$U^{\text{new}} \leftarrow U^{\text{old}} - \alpha \nabla_U J, \qquad V^{\text{new}} \leftarrow V^{\text{old}} - \alpha \nabla_V J$$

4.3 Skip-Gram Model
Vold ← Vold − α∇V J
Another approach is to create a model such that given the center word "jumped", the model will be able to predict or generate the surrounding words "The", "cat", "over", "the", "puddle". Here we call the word "jumped" the center word. We call this type of model a Skip-Gram model: it predicts surrounding context words given a center word.
Let's discuss the Skip-Gram model above. The setup is largely the same as for CBOW, but we essentially swap our x and y, i.e. what was x in CBOW is now y and vice-versa. The input one-hot vector (center word) we will represent with an x (since there is only one), and the output vectors with y^{(j)}. We define V and U the same as in CBOW.
Notation for the Skip-Gram model:
• w_i: word i from vocabulary V
• V ∈ R^{n×|V|}: input word matrix
• v_i: i-th column of V, the input vector representation of word w_i
• U ∈ R^{|V|×n}: output word matrix
• u_i: i-th row of U, the output vector representation of word w_i
We break down the way this model works in these steps:

1. We generate our one-hot input vector x ∈ R^{|V|} of the center word.

2. We get our embedded word vector for the center word v_c = V x ∈ R^n.

3. Generate a score vector z = U v_c.

4. Turn the score vector into probabilities, ŷ = softmax(z). Note that ŷ_{c−m}, . . . , ŷ_{c−1}, ŷ_{c+1}, . . . , ŷ_{c+m} are the probabilities of observing each context word.

5. We desire our probability vector generated to match the true probabilities, which are y^{(c−m)}, . . . , y^{(c−1)}, y^{(c+1)}, . . . , y^{(c+m)}, the one-hot vectors of the actual output.

(Figure 2: This image demonstrates how Skip-Gram works and how we must learn the transfer matrices.)
As in CBOW, we need to generate an objective function for us to evaluate the model. A key difference here is that we invoke a Naive Bayes assumption to break out the probabilities. If you have not seen this before, then simply put, it is a strong (naive) conditional independence assumption. In other words, given the center word, all output words are completely independent.

$$\begin{aligned}
\text{minimize } J &= -\log P(w_{c-m}, \ldots, w_{c-1}, w_{c+1}, \ldots, w_{c+m} \mid w_c) \\
&= -\log \prod_{j=0,\, j \neq m}^{2m} P(w_{c-m+j} \mid w_c) \\
&= -\log \prod_{j=0,\, j \neq m}^{2m} P(u_{c-m+j} \mid v_c) \\
&= -\log \prod_{j=0,\, j \neq m}^{2m} \frac{\exp(u_{c-m+j}^{T} v_c)}{\sum_{k=1}^{|V|} \exp(u_k^{T} v_c)} \\
&= -\sum_{j=0,\, j \neq m}^{2m} u_{c-m+j}^{T} v_c + 2m \log \sum_{k=1}^{|V|} \exp(u_k^{T} v_c)
\end{aligned}$$

With this objective function, we can compute the gradients with respect to the unknown parameters and at each iteration update them via Stochastic Gradient Descent.
Only one probability vector ŷ is computed: skip-gram treats each context word equally, i.e. the model computes the probability of each word appearing in the context independently of its distance to the center word.
Note that

$$J = -\sum_{j=0,\, j \neq m}^{2m} \log P(u_{c-m+j} \mid v_c) = \sum_{j=0,\, j \neq m}^{2m} H(\hat{y}, y_{c-m+j})$$

where H(ŷ, y_{c−m+j}) is the cross-entropy between the probability vector ŷ and the one-hot vector y_{c−m+j}.
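A hedged sketch of the corresponding skip-gram loss for one (center word, context window) pair, mirroring the CBOW sketch above; all sizes and initializations are again placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, n = 10, 4                    # |V| and embedding dimension (placeholders)
V = rng.normal(size=(n, V_size))     # input word matrix, columns are v_i
U = rng.normal(size=(V_size, n))     # output word matrix, rows are u_i

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def skipgram_loss(center_id, context_ids):
    """Sum of cross-entropies -log P(context word | center word) over the window."""
    v_c = V[:, center_id]            # input vector of the center word
    y_hat = softmax(U @ v_c)         # one probability vector, shared by every context position
    return -np.sum(np.log(y_hat[context_ids]))

# Example: center word id 3 with context word ids [1, 2, 4, 5].
print(skipgram_loss(3, [1, 2, 4, 5]))
```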

4.4 Negative Sampling


Let's take a second to look at the objective function. The loss functions J for CBOW and Skip-Gram are expensive to compute because of the softmax normalization, where we sum over all |V| scores! Any update we do or evaluation of the objective function would take O(|V|) time, which if we recall is in the millions. A simple idea is that we could instead just approximate it.
For every training step, instead of looping over the entire vocabulary, we can just sample several negative examples! We "sample" from a noise distribution P_n(w) whose probabilities match the ordering of the frequency of the vocabulary. To augment our formulation of the problem to incorporate Negative Sampling, all we need to do is update the:

• objective function

• gradients

• update rules

Mikolov et al. present Negative Sampling in Distributed Representations of Words and Phrases and their Compositionality. While negative sampling is based on the Skip-Gram model, it is in fact optimizing a different objective. Consider a pair (w, c) of word and context. Did this pair come from the training data? Let's denote by P(D = 1 | w, c) the probability that (w, c) came from the corpus data. Correspondingly, P(D = 0 | w, c) will be the probability that (w, c) did not come from the corpus data.
The sigmoid function σ(x) = 1/(1 + e^{−x}) is the 1D version of the softmax and can be used to model a probability. First, let's model P(D = 1 | w, c) with the sigmoid function:

$$P(D = 1 \mid w, c, \theta) = \sigma(v_c^T v_w) = \frac{1}{1 + e^{-v_c^T v_w}}$$
(Figure 3: Sigmoid function.)

Now, we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and maximize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum likelihood approach of these two probabilities. (Here we take θ to be the parameters of the model, and in our case it is V and U.)

$$\begin{aligned}
\theta &= \operatorname*{argmax}_{\theta} \prod_{(w,c) \in D} P(D = 1 \mid w, c, \theta) \prod_{(w,c) \in \tilde{D}} P(D = 0 \mid w, c, \theta) \\
&= \operatorname*{argmax}_{\theta} \prod_{(w,c) \in D} P(D = 1 \mid w, c, \theta) \prod_{(w,c) \in \tilde{D}} \bigl(1 - P(D = 1 \mid w, c, \theta)\bigr) \\
&= \operatorname*{argmax}_{\theta} \sum_{(w,c) \in D} \log P(D = 1 \mid w, c, \theta) + \sum_{(w,c) \in \tilde{D}} \log\bigl(1 - P(D = 1 \mid w, c, \theta)\bigr) \\
&= \operatorname*{argmax}_{\theta} \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \in \tilde{D}} \log\Bigl(1 - \frac{1}{1 + \exp(-u_w^T v_c)}\Bigr) \\
&= \operatorname*{argmax}_{\theta} \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} + \sum_{(w,c) \in \tilde{D}} \log \frac{1}{1 + \exp(u_w^T v_c)}
\end{aligned}$$

Note that maximizing the likelihood is the same as minimizing the negative log likelihood

$$J = -\sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^T v_c)} - \sum_{(w,c) \in \tilde{D}} \log \frac{1}{1 + \exp(u_w^T v_c)}$$

Note that D̃ is a "false" or "negative" corpus, in which we would have unnatural sentences like "stock boil fish is toy" that should get a low probability of ever occurring. We can generate D̃ on the fly by randomly sampling these negatives from the word bank.

For skip-gram, our new objective function for observing the context word c − m + j given the center word c would be

$$-\log \sigma(u_{c-m+j}^{T} \cdot v_c) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^{T} \cdot v_c)$$

To compare, the regular softmax loss for skip-gram is $-u_{c-m+j}^{T} v_c + \log \sum_{k=1}^{|V|} \exp(u_k^{T} v_c)$.

For CBOW, our new objective function for observing the center word u_c given the context vector $\hat{v} = \frac{v_{c-m} + v_{c-m+1} + \ldots + v_{c+m}}{2m}$ would be

$$-\log \sigma(u_c^{T} \cdot \hat{v}) - \sum_{k=1}^{K} \log \sigma(-\tilde{u}_k^{T} \cdot \hat{v})$$

To compare, the regular softmax loss for CBOW is $-u_c^{T} \hat{v} + \log \sum_{j=1}^{|V|} \exp(u_j^{T} \hat{v})$.

In the above formulation, {ũ_k | k = 1 . . . K} are sampled from P_n(w). Let's discuss what P_n(w) should be. While there is much discussion of what makes the best approximation, what seems to work best is the Unigram Model raised to the power of 3/4. Why 3/4? Here's an example that might help gain some intuition:

is: $0.9^{3/4} = 0.92$
Constitution: $0.09^{3/4} = 0.16$
bombastic: $0.01^{3/4} = 0.032$

"Bombastic" is now 3x more likely to be sampled, while "is" only went up marginally.
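A minimal sketch of building the noise distribution P_n(w) as the unigram distribution raised to the 3/4 power and drawing K negatives from it (the word counts below are made-up placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unigram counts over a tiny vocabulary, purely for illustration.
counts = {"is": 90, "Constitution": 9, "bombastic": 1}
words = list(counts)

# Noise distribution P_n(w): unigram frequencies raised to the 3/4 power, renormalized.
freqs = np.array([counts[w] for w in words], dtype=float)
p_noise = freqs ** 0.75
p_noise /= p_noise.sum()

def sample_negatives(K):
    """Draw K negative words from P_n(w)."""
    return rng.choice(words, size=K, p=p_noise)

print(dict(zip(words, p_noise.round(3))))
print(sample_negatives(5))
```

Relative to the raw unigram distribution, rare words such as "bombastic" are boosted while very frequent words are dampened, which is the intended effect of the 3/4 exponent.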

4.5 Hierarchical Softmax


Mikolov et al. also present hierarchical softmax as a much more efficient alternative to the normal softmax. In practice, hierarchical softmax tends to be better for infrequent words, while negative sampling works better for frequent words and lower dimensional vectors.
Hierarchical softmax uses a binary tree to represent all words in the vocabulary. Each leaf of the tree is a word, and there is a unique path from root to leaf. In this model, there is no output representation for words. Instead, each node of the graph (except the root and the leaves) is associated with a vector that the model is going to learn.
In this model, the probability of a word w given a vector w_i, P(w | w_i), is equal to the probability of a random walk starting at the root and ending at the leaf node corresponding to w. The main advantage of computing the probability this way is that the cost is only O(log(|V|)), corresponding to the length of the path, instead of O(|V|).

(Figure 4: Binary tree for hierarchical softmax.)

Let's introduce some notation. Let L(w) be the number of nodes in the path from the root to the leaf w. For instance, L(w_2) in Figure 4 is 3. Let's write n(w, i) as the i-th node on this path with associated vector v_{n(w,i)}. So n(w, 1) is the root, while n(w, L(w)) is the father of w. Now, for each inner node n, we arbitrarily choose one of its children and call it ch(n) (e.g. always the left node). Then, we can compute the probability as

$$P(w \mid w_i) = \prod_{j=1}^{L(w)-1} \sigma\Bigl( [n(w, j+1) = \mathrm{ch}(n(w, j))] \cdot v_{n(w,j)}^{T} v_{w_i} \Bigr)$$

where

$$[x] = \begin{cases} 1 & \text{if } x \text{ is true} \\ -1 & \text{otherwise} \end{cases}$$

and σ(·) is the sigmoid function.
This formula is fairly dense, so let's examine it more closely.
First, we are computing a product of terms based on the shape of the path from the root (n(w, 1)) to the leaf (w). If we assume ch(n) is always the left node of n, then the term [n(w, j+1) = ch(n(w, j))] returns 1 when the path goes left, and -1 if it goes right.
Furthermore, the term [n(w, j+1) = ch(n(w, j))] provides normalization. At a node n, if we sum the probabilities for going to the left and right node, you can check that for any value of $v_n^T v_{w_i}$,

$$\sigma(v_n^T v_{w_i}) + \sigma(-v_n^T v_{w_i}) = 1$$

The normalization also ensures that $\sum_{w=1}^{|V|} P(w \mid w_i) = 1$, just as in the original softmax.
Finally, we compare the similarity of our input vector $v_{w_i}$ to each inner node vector $v_{n(w,j)}$ using a dot product. Let's run through an example. Taking w_2 in Figure 4, we must take two left edges and then a right edge to reach w_2 from the root, so

$$\begin{aligned}
P(w_2 \mid w_i) &= p(n(w_2, 1), \text{left}) \cdot p(n(w_2, 2), \text{left}) \cdot p(n(w_2, 3), \text{right}) \\
&= \sigma(v_{n(w_2,1)}^{T} v_{w_i}) \cdot \sigma(v_{n(w_2,2)}^{T} v_{w_i}) \cdot \sigma(-v_{n(w_2,3)}^{T} v_{w_i})
\end{aligned}$$
To train the model, our goal is still to minimize the negative log
likelihood − log P(w|wi ). But instead of updating output vectors per
word, we update the vectors of the nodes in the binary tree that are
in the path from root to leaf node.
The speed of this method is determined by the way in which the
binary tree is constructed and words are assigned to leaf nodes.
Mikolov et al. use a binary Huffman tree, which assigns frequent
words shorter paths in the tree.
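A hedged sketch of computing P(w | w_i) as a product of sigmoids along a word's root-to-leaf path; here the tree is encoded simply as a list of (inner-node id, direction sign) steps, and the node vectors are random placeholders rather than the Huffman tree and trained parameters word2vec actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dim, n_inner = 4, 7                              # embedding size and number of inner nodes (placeholders)
node_vectors = rng.normal(size=(n_inner, n_dim))   # one learned vector per inner node

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hierarchical_softmax_prob(path, v_wi):
    """P(w | w_i): product of sigmoids along the root-to-leaf path of w.

    `path` lists (inner_node_id, sign) pairs, with sign = +1 when the walk moves
    to the designated child ch(n) (say, the left child) and -1 otherwise.
    """
    p = 1.0
    for node_id, sign in path:
        p *= sigmoid(sign * node_vectors[node_id] @ v_wi)
    return p

# Example: a word reached by going left, left, then right from the root (as for w_2 in Figure 4).
v_wi = rng.normal(size=n_dim)
print(hierarchical_softmax_prob([(0, +1), (1, +1), (3, -1)], v_wi))
```

Only the vectors on this single path are touched per word, which is where the O(log(|V|)) cost comes from.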

References
[Bengio et al., 2003] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. (2003). A
neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155.

[Collobert et al., 2011] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu,
K., and Kuksa, P. P. (2011). Natural language processing (almost) from scratch.
CoRR, abs/1103.0398.
[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient
estimation of word representations in vector space. CoRR, abs/1301.3781.
[Rong, 2014] Rong, X. (2014). word2vec parameter learning explained. CoRR,
abs/1411.2738.
[Rumelhart et al., 1988] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988).
Neurocomputing: Foundations of research. chapter Learning Representations by
Back-propagating Errors, pages 696–699. MIT Press, Cambridge, MA, USA.
