
Natural Language Processing

Lecture 03 Word Embeddings

Qun Liu, Valentin Malykh


Huawei Noah’s Ark Lab

Autumn 2020
A course delivered at MIPT, Moscow

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 1 / 94
Content

1 Distributional semantics

2 Word embeddings

3 Word2Vec

4 GloVe

5 Evaluation of word embeddings

6 Fasttext

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 2 / 94
Distributional semantics

Content

1 Distributional semantics

2 Word embeddings

3 Word2Vec

4 GloVe

5 Evaluation of word embeddings

6 Fasttext

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 3 / 94
Distributional semantics

Word representations

In rule-based approaches, i.e., grammars, automata, etc., words are represented as symbols.
However, if we want to apply machine learning algorithms, we should represent linguistic units (words, phrases, etc.) as numerical vectors.
The question is then: how can we represent words as numerical vectors?

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 4 / 94
Distributional semantics

Representing words by their context

“You shall know a word by the company it keeps”

(J. R. Firth, 1957)


Distributional hypothesis:
Linguistic items with similar distributions have similar meanings.
⇒ Distributional Semantics

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 5 / 94
Distributional semantics

Distributional semantics

Distributional semantics is a research area that develops and


studies theories and methods for quantifying and categorizing
semantic similarities between linguistic items based on their
distributional properties in large samples of language data.
(Wikipedia)
Idea: Collect distributional information in high-dimensional vectors, and define distributional/semantic similarity in terms of vector similarity.

[Figure: a sample concordance]

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 6 / 94
Distributional semantics

Distributional semantic models

Distributional semantic models differ primarily with respect to the


following parameters:
Context type (text regions vs. linguistic items)
Context window (size, extension, etc.)
Frequency weighting (e.g. entropy, pointwise mutual information,
etc.)
Dimension reduction (e.g. random indexing, singular value
decomposition, etc.)
Similarity measure (e.g. cosine similarity, Minkowski distance, etc.)

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 7 / 94
Distributional semantics

Example: Window based co-occurrence matrix


• Window length 1 (more common: 5–10)
• Symmetric (irrelevant whether left or right context)
• Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.

16
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 8 / 94
Distributional semantics

Window based co-occurrence matrix


• Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.
counts I like enjoy deep learning NLP flying .
I 0 2 1 0 0 0 0 0
like 2 0 0 1 0 1 0 0
enjoy 1 0 0 0 0 0 1 0
deep 0 1 0 0 1 0 0 0
learning 0 0 0 1 0 0 0 1
NLP 0 1 0 0 0 0 0 1
flying 0 0 1 0 0 0 0 1
. 0 0 0 0 1 1 1 0
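A minimal sketch (not from the slides) of how this window-based co-occurrence matrix can be built with plain Python and numpy, using the toy corpus above; tokenization is a simple whitespace split.

# Build the window-1, symmetric co-occurrence matrix for the toy corpus above.
import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
sentences = [s.split() for s in corpus]

vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

window = 1
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:                        # skip the word itself
                X[idx[w], idx[sent[j]]] += 1  # symmetric counts

print(vocab)
print(X)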
17
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 9 / 94
Distributional semantics

Problems with simple co-occurrence vectors

Increase in size with vocabulary

Very high dimensional: requires a lot of storage

Subsequent classification models have sparsity issues

→ Models are less robust

18
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 10 / 94
Distributional semantics

Solution: Low dimensional vectors


• Idea: store “most” of the important information in a fixed, small
number of dimensions: a dense vector

• Usually 25–1000 dimensions, similar to word2vec

• How to reduce the dimensionality?

19
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 11 / 94
Distributional semantics

Method: Dimensionality Reduction on X (HW1)


Singular Value Decomposition of the co-occurrence matrix X:
factorize X into $U \Sigma V^\top$, where U and V are orthonormal.

Retain only the k largest singular values, in order to generalize.

$\hat{X}_k$ is the best rank-k approximation to X, in terms of least squares.
Classic linear algebra result. Expensive to compute for large matrices.
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 12 / 94
Distributional semantics

Simple SVD word vectors in Python


Corpus:
I like deep learning. I like NLP. I enjoy flying.

21
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 13 / 94
Distributional semantics

Simple SVD word vectors in Python


Corpus: I like deep learning. I like NLP. I enjoy flying.
Printing first two columns of U corresponding to the 2 biggest singular values
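The code on the original slide is an image and is not reproduced here; the following is a sketch in the same spirit, assuming numpy and matplotlib and hard-coding the co-occurrence matrix from the earlier slide.

# SVD of the toy co-occurrence matrix, plotting words along the first two columns of U.
import numpy as np
import matplotlib.pyplot as plt

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([[0,2,1,0,0,0,0,0],
              [2,0,0,1,0,1,0,0],
              [1,0,0,0,0,0,1,0],
              [0,1,0,0,1,0,0,0],
              [0,0,0,1,0,0,0,1],
              [0,1,0,0,0,0,0,1],
              [0,0,1,0,0,0,0,1],
              [0,0,0,0,1,1,1,0]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Plot each word at the coordinates given by the first two columns of U
# (the directions associated with the two largest singular values).
for i, w in enumerate(words):
    plt.text(U[i, 0], U[i, 1], w)
plt.xlim(U[:, 0].min() - 0.2, U[:, 0].max() + 0.2)
plt.ylim(U[:, 1].min() - 0.2, U[:, 1].max() + 0.2)
plt.show()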

22
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 14 / 94
Distributional semantics

Hacks to X (several used in Rohde et al. 2005)

Scaling the counts in the cells can help a lot


• Problem: function words (the, he, has) are too frequent → syntax has too much impact. Some fixes:
• min(X,t), with t ≈ 100
• Ignore them all

• Ramped windows that count closer words more


• Use Pearson correlations instead of counts, then set
negative values to 0
• Etc.
23
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 15 / 94
Distributional semantics

Interesting syntactic patterns emerge in the vectors

[Figure 11: Multidimensional scaling of present, past, progressive, and past participle forms for eight verb families (CHOOSE/CHOSE/CHOSEN/CHOOSING, STEAL/STOLE/STOLEN/STEALING, TAKE/TOOK/TAKEN/TAKING, SPEAK/SPOKE/SPOKEN/SPEAKING, THROW/THREW/THROWN/THROWING, SHOW/SHOWED/SHOWN/SHOWING, EAT/ATE/EATEN/EATING, GROW/GREW/GROWN/GROWING)]

COALS model, from: An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, Rohde et al. ms., 2005

Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 16 / 94
Distributional semantics

Interesting semantic patterns emerge in the vectors

[Figure 13: Multidimensional scaling for nouns and their associated verbs (DRIVER–DRIVE, SWIMMER–SWIM, TEACHER–TEACH, STUDENT–LEARN, DOCTOR–TREAT, BRIDE–MARRY, PRIEST–PRAY, JANITOR–CLEAN)]

[Table 10: The 10 nearest neighbors and their percent correlation similarities for a set of nouns (gun, point, mind, monopoly, cardboard, lipstick, leningrad, feet) under the COALS-14K model; e.g. the nearest neighbors of gun are handgun (46.4), firearms (41.1), firearm (41.0), ...]

COALS model, from: An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, Rohde et al. ms., 2005

Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 17 / 94
Word embeddings

Content

1 Distributional semantics

2 Word embeddings

3 Word2Vec

4 GloVe

5 Evaluation of word embeddings

6 Fasttext

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 18 / 94
Word embeddings

Count-based representations

The vectors used in distributional semantics are based on frequencies; such representations are also called count-based representations.
Count-based representations were successful in many similarity-related tasks; however, their usage could not be extended to other NLP tasks.
In order to take full advantage of machine learning / deep learning approaches, it is necessary to represent words solely as vectors. In other words, we must use vectors to replace words completely, and get rid of symbols in computing, except for the output layer.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 19 / 94
Word embeddings

Prediction-based representations

If a word can be predicted from the vectors of its context words, then we can expect that these vectors can replace the words in any NLP task.
Prediction-based word representations:
Randomly assign a vector to each word in the vocabulary;
Prepare a corpus;
For each word (referred to as the current word) in the corpus, repeat:
Calculate the probability of all context words given the current word (or vice versa);
Adjust the word vectors to maximize the above probability.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 20 / 94
Word embeddings

Word embeddings

Various prediction-based word representations, or word embeddings, have been developed, including:
Word2Vec, GloVe, FastText, etc.
Word embeddings are the first step towards deep learning (neural network) based NLP.
In some cases, pretrained word embeddings (like Word2Vec) can be directly used to solve NLP problems.
However, this is not always the case.
Instead, word embeddings are defined as components of the whole NLP system, whose parameters are tuned together with all other parameters.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 21 / 94
Word2Vec

Content

1 Distributional semantics

2 Word embeddings

3 Word2Vec

4 GloVe

5 Evaluation of word embeddings

6 Fasttext

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 22 / 94
Word2Vec

Word2Vec

T Mikolov, I Sutskever, K Chen, GS Corrado, J Dean, Distributed representations


of words and phrases and their compositionality, NIPS 2013
Word2Vec (includes Skip-gram (SG) and Continuous Bag of Words (CBOW))
Y Goldberg, O Levy. word2vec Explained: deriving Mikolov et al.’s
negative-sampling word-embedding method. arXiv:1402.3722
A Joulin, É Grave, P Bojanowski, T Mikolov, Bag of Tricks for Efficient Text
Classification. EACL 2017.
FastText

Tomas Mikolov

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 23 / 94
Word2Vec Basic idea

Content

3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 24 / 94
Word2Vec Basic idea

3. Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning
word vectors

Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word
c and context (“outside”) words o
• Use the similarity of the word vectors for c and o to calculate
the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability

18
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 25 / 94
Word2Vec Basic idea

Word2Vec Overview
• Example windows and process for computing P(w_{t+j} | w_t)

P(w_{t-2} | w_t)   P(w_{t-1} | w_t)   P(w_{t+1} | w_t)   P(w_{t+2} | w_t)

… problems turning into banking crises as …

outside context words in window of size 2 | center word at position t | outside context words in window of size 2

19
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 26 / 94
Word2Vec Cross-entropy loss function

Content

3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 28 / 94
Word2Vec Cross-entropy loss function

Word2vec: objective function


For each position t = 1, …, T, predict context words within a window of fixed size m, given center word w_t:

$$\text{Likelihood} = L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t;\, \theta)$$

θ is all variables to be optimized.

The objective function J(θ) (sometimes called the cost or loss function) is the (average) negative log likelihood:

$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t;\, \theta)$$

Minimizing the objective function ⟺ Maximizing predictive accuracy


21
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 29 / 94
Word2Vec Cross-entropy loss function

Cross-entropy loss function

The Negative Log Likelihood Loss

$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t;\, \theta)$$

is also called the Cross-Entropy Loss.


Cross-entropy is a measure of the difference between two
probability distributions for a given random variable or set of
events.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 30 / 94
Word2Vec Cross-entropy loss function

Cross-entropy loss function

Consider two probability distributions p and q defined on a set of events X = {x_1, x_2, ..., x_n}; the cross-entropy between q and p is:

$$H(q, p) = -\sum_{x \in X} q(x) \log p(x)$$

Assume X is the vocabulary, p(x) is the model-generated probability distribution over the vocabulary, and q(x) is the actual distribution of the context word at position t + j:

$$q(x) = \begin{cases} 1, & \text{for } x = w_{t+j} \\ 0, & \text{otherwise} \end{cases}$$

then:

$$H(q, p) = -\log p(w_{t+j})$$
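A tiny numeric illustration (toy numbers, not from the slides): with a one-hot q, the cross-entropy collapses to −log p(w_{t+j}).

# Cross-entropy with a one-hot target reduces to -log p(w_{t+j}).
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # model distribution over a 3-word vocabulary
q = np.array([0.0, 1.0, 0.0])   # the actual context word is word #1

H = -np.sum(q * np.log(p))
print(H, -np.log(p[1]))          # both print 1.609...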

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 31 / 94
Word2Vec Cross-entropy loss function

Word2vec: objective function

• We want to minimize the objective function:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t;\, \theta)$$

• Question: How to calculate P(w_{t+j} | w_t; θ)?

• Answer: We will use two vectors per word w:
  • v_w when w is a center word
  • u_w when w is a context word

• Then for a center word c and a context word o:

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 32 / 94
Word2Vec Cross-entropy loss function

Word2Vec Overview with Vectors


• Example windows and process for computing P(w_{t+j} | w_t)
• P(u_problems | v_into) is short for P(problems | into; u_problems, v_into, θ)

P(u_problems | v_into)   P(u_turning | v_into)   P(u_banking | v_into)   P(u_crises | v_into)

… problems turning into banking crises as …

outside context words in window of size 2 | center word at position t | outside context words in window of size 2
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 33 / 94
Word2Vec Softmax

Content

3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 34 / 94
Word2Vec Softmax

Word2vec: prediction function


$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$

① The dot product compares the similarity of o and c: $u^\top v = u \cdot v = \sum_{i=1}^{n} u_i v_i$. A larger dot product means a larger probability.
② Exponentiation makes anything positive.
③ Normalize over the entire vocabulary to give a probability distribution.

• This is an example of the softmax function $\mathbb{R}^n \to (0, 1)^n$:

$$\text{softmax}(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)} = p_i$$

• The softmax function maps arbitrary values x_i to a probability distribution p_i
• “max” because it amplifies the probability of the largest x_i
• “soft” because it still assigns some probability to smaller x_i
• Frequently used in Deep Learning
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 35 / 94
Word2Vec Softmax

Softmax

In mathematics, the softmax function, also known as softargmax or the normalized exponential function, is a function that takes as input a vector of K real numbers and normalizes it. We can interpret the result as a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers.
The standard (unit) softmax function $\sigma : \mathbb{R}^K \to \mathbb{R}^K$ is defined by the formula:

$$y_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K \text{ and } z = (z_1, \ldots, z_K) \in \mathbb{R}^K$$
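As a small illustration (not from the slides), the softmax function can be written in a few lines of numpy; subtracting the maximum is a standard numerical-stability trick.

# Minimal softmax sketch.
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())    # softmax is shift-invariant, so this is safe
    return e / e.sum()

print(softmax([1.0, 2.0, 3.0]))  # probabilities that sum to 1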

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 36 / 94
Word2Vec Softmax

Softmax

[Figure omitted] Figure source: https://blog.csdn.net/xg123321123/article/details/80781611

In Word2Vec, because the softmax function is calculated over all


words in the vocabulary, it is quite expensive computationally.

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 37 / 94
Word2Vec Skip-gram Model

Content

3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 38 / 94
Word2Vec Skip-gram Model

Word2Vec: Skip-gram Model
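The architecture figure is not reproduced here. As a rough sketch (not the original figure), the skip-gram forward pass for one center word is: look up its vector v_c, score it against every context vector u_w, and normalize with softmax. The matrix names and sizes below are illustrative assumptions.

# Naive-softmax skip-gram forward pass (toy sizes, random init).
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 1000, 100
V = rng.normal(scale=0.1, size=(vocab_size, dim))  # center-word vectors v_w
U = rng.normal(scale=0.1, size=(vocab_size, dim))  # context-word vectors u_w

def skipgram_forward(center_id):
    scores = U @ V[center_id]          # u_w^T v_c for every word w
    scores -= scores.max()             # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs                       # P(o | c) for all candidate context words o

probs = skipgram_forward(42)
print(probs.shape, probs.sum())        # (1000,) 1.0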

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 39 / 94
Word2Vec Training

Content

3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 40 / 94
Word2Vec Training

Training a model by optimizing parameters


To train a model, we adjust parameters to minimize a loss
E.g., below, for a simple convex function over two parameters
Contour lines show levels of objective function

25
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 41 / 94
Word2Vec Training

To train the model: Compute all vector gradients!


• Recall: θ represents all model parameters, in one long vector
• In our case, with d-dimensional vectors and V-many words:

$$\theta = \begin{bmatrix} v_{\text{aardvark}} \\ \vdots \\ v_{\text{zebra}} \\ u_{\text{aardvark}} \\ \vdots \\ u_{\text{zebra}} \end{bmatrix} \in \mathbb{R}^{2dV}$$

• Remember: every word has two vectors

• We optimize these parameters by walking down the gradient
26
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 42 / 94
Word2Vec Derivation of gradients

Content

3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 43 / 94
Word2Vec Derivation of gradients

Derivation of gradients for Word2Vec model

$$J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p(w_{t+j} \mid w_t;\, \theta) = -\frac{1}{T} \sum_{c \in \text{corpus}} \sum_{o \in \text{context}(c)} \log p(o \mid c;\, u, v)$$

$$\frac{\partial}{\partial \theta} J(\theta) = -\frac{1}{T} \sum_{c \in \text{corpus}} \sum_{o \in \text{context}(c)} \frac{\partial}{\partial \theta} \log p(o \mid c;\, u, v)$$

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 44 / 94
Word2Vec Derivation of gradients

Derivation of gradients for Word2Vec model


$$\begin{aligned}
\frac{\partial}{\partial v_c} \log p(o \mid c;\, u, v)
&= \frac{\partial}{\partial v_c} \log \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \\
&= \frac{\partial}{\partial v_c} (u_o^\top v_c) - \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^\top v_c) \\
&= u_o - \frac{1}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \sum_{w=1}^{V} \frac{\partial}{\partial v_c} \exp(u_w^\top v_c) \\
&= u_o - \frac{1}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \sum_{w=1}^{V} \exp(u_w^\top v_c)\, u_w \\
&= u_o - \sum_{x=1}^{V} \frac{\exp(u_x^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}\, u_x \\
&= u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x
\end{aligned}$$
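A small sketch (not from the slides) that checks the derived gradient u_o − Σ_x p(x|c) u_x against finite differences on random toy vectors.

# Numerical check of d/dv_c log p(o|c) = u_o - sum_x p(x|c) u_x.
import numpy as np

rng = np.random.default_rng(1)
V_size, d = 6, 4
U = rng.normal(size=(V_size, d))   # context vectors u_w
vc = rng.normal(size=d)            # center vector v_c
o = 2                              # index of the observed context word

def log_p(o, vc):
    s = U @ vc
    return s[o] - np.log(np.exp(s).sum())

p = np.exp(U @ vc); p /= p.sum()
analytic = U[o] - p @ U            # u_o - sum_x p(x|c) u_x

eps = 1e-6
numeric = np.array([(log_p(o, vc + eps * e) - log_p(o, vc - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.allclose(analytic, numeric, atol=1e-5))   # True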

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 45 / 94
Word2Vec Derivation of gradients

Derivation of gradients for Word2Vec model

$$\frac{\partial}{\partial \theta} J(\theta) = -\frac{1}{T} \sum_{c \in \text{corpus}} \sum_{o \in \text{context}(c)} \frac{\partial}{\partial \theta} \log p(o \mid c;\, u, v)$$

$$\frac{\partial}{\partial v_c} J(\theta) = -\frac{1}{T} \sum_{c \in \text{corpus}} \sum_{o \in \text{context}(c)} \left[ u_o - \sum_{x=1}^{V} p(x \mid c)\, u_x \right]$$

$$\frac{\partial}{\partial u_o} J(\theta) = -\frac{1}{T} \sum_{c \in \text{corpus}} \sum_{o \in \text{context}(c)} \left[ v_c - \sum_{x=1}^{V} p(o \mid x)\, v_x \right]$$

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 46 / 94
Word2Vec Derivation of gradients

Calculating all gradients!


• We went through gradient for each center vector v in a window
• We also need gradients for outside vectors u
• Derive at home!
• Generally in each window we will compute updates for all
parameters that are being used in that window. For example:

P(u_turning | v_banking)   P(u_into | v_banking)   P(u_crises | v_banking)   P(u_as | v_banking)

… problems turning into banking crises as …

outside context words in window of size 2 | center word at position t | outside context words in window of size 2
30
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 47 / 94
Word2Vec Stochastic Gradient Descent

Content

3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 48 / 94
Word2Vec Stochastic Gradient Descent

5. Optimization: Gradient Descent


• We have a cost function J(θ) we want to minimize
• Gradient Descent is an algorithm to minimize J(θ)
• Idea: for the current value of θ, calculate the gradient of J(θ), then take a small step in the direction of the negative gradient. Repeat.

Note: Our objectives may not be convex like this :(

32

Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 49 / 94
Word2Vec Stochastic Gradient Descent

Gradient Descent
• Update equation (in matrix notation):

$$\theta^{\text{new}} = \theta^{\text{old}} - \alpha\, \nabla_\theta J(\theta)$$

α = step size or learning rate

• Update equation (for a single parameter):

$$\theta_j^{\text{new}} = \theta_j^{\text{old}} - \alpha\, \frac{\partial}{\partial \theta_j} J(\theta)$$

• Algorithm (see the sketch below):
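The algorithm box on the original slide is an image; below is a minimal sketch of the update θ_new = θ_old − α ∇_θ J(θ) applied to a toy quadratic, not the lecture's code.

# Gradient descent on J(theta) = 0.5 * ||theta||^2 (toy example).
import numpy as np

def grad_J(theta):
    return theta                      # gradient of 0.5 * ||theta||^2

theta = np.array([3.0, -2.0])
alpha = 0.1                           # step size / learning rate
for _ in range(100):
    theta = theta - alpha * grad_J(theta)

print(theta)                          # close to the minimizer [0, 0]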

33
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 50 / 94
Word2Vec Stochastic Gradient Descent

Stochastic Gradient Descent


• Problem: J(θ) is a function of all windows in the corpus (potentially billions!)
• So ∇_θ J(θ) is very expensive to compute
• You would wait a very long time before making a single update!

• Very bad idea for pretty much all neural nets!

• Solution: Stochastic gradient descent (SGD)
• Repeatedly sample windows, and update after each one
• Algorithm (a sketch follows):
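The algorithm box is again an image; a minimal sketch of the SGD loop is given below, where windows and grad_J_window are hypothetical stand-ins for the corpus windows and the per-window gradient derived earlier.

# SGD loop: sample one window, take one gradient step, repeat.
import numpy as np
import random

def sgd(theta, windows, grad_J_window, alpha=0.025, steps=100_000):
    theta = np.asarray(theta, dtype=float)
    for _ in range(steps):
        window = random.choice(windows)                    # sample one window
        theta = theta - alpha * grad_J_window(theta, window)
    return theta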

34
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 51 / 94
Word2Vec Stochastic Gradient Descent

Stochastic gradients with word vectors!


• Iteratively take gradients at each such window for SGD
• But in each window, we only have at most 2m + 1 words, so ∇_θ J_t(θ) is very sparse!

Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 52 / 94
Word2Vec Stochastic Gradient Descent

Stochastic gradients with word vectors!


• We might only update the word vectors that actually appear!

• Solution: either you need sparse matrix update operations to only update certain rows of the full embedding matrices U and V, or you need to keep around a hash for word vectors

[Figure: a |V| × d embedding matrix, of which only a few rows are updated]

• If you have millions of word vectors and do distributed


computing, it is important to not have to send gigantic
updates around!

10
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 53 / 94
Word2Vec More details

Content

3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 54 / 94
Word2Vec More details

Word2vec: More details


Why two vectors? → Easier optimization. Average both at the end.

Two model variants:


1. Skip-grams (SG)
Predict context (”outside”) words (position independent) given center
word
2. Continuous Bag of Words (CBOW)
Predict center word from (bag of) context words
This lecture so far: Skip-gram model

Additional efficiency in training:


1. Negative sampling
So far: Focus on naïve softmax (simpler training method)
31
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 55 / 94
Word2Vec More details

Skip-gram Model vs. CBOW Model
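The comparison figure is not reproduced here. As an aside (an assumption, not part of the lecture), the gensim library (4.x) exposes both variants through a single flag, which makes the difference easy to try out:

# sg=1 trains Skip-gram, sg=0 trains CBOW (gensim is an assumption here).
from gensim.models import Word2Vec

sentences = [["I", "like", "deep", "learning"], ["I", "like", "NLP"], ["I", "enjoy", "flying"]]
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow     = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
print(skipgram.wv["NLP"].shape)   # (50,)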

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 56 / 94
Word2Vec More details

Negative Sampling

$$J(u_o, C) = \sum_{w \in C} \exp(u_o^\top u_w) + \sum_{w \notin C} \exp(-u_o^\top u_w)$$

C is a context (set of words); the first part is over positive samples, the second part over negative samples.
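For reference (not shown on the slide), the negative-sampling objective for one (center c, context o) pair as written by Mikolov et al. (2013), with σ the logistic function and k negative words w_i drawn from a noise distribution P_n(w), is:

$$\log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-u_{w_i}^\top v_c) \right]$$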

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 57 / 94
GloVe

Content

1 Distributional semantics

2 Word embeddings

3 Word2Vec

4 GloVe

5 Evaluation of word embeddings

6 Fasttext

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 58 / 94
GloVe

4. Towards GloVe: Count based vs. direct prediction

Count based:
• LSA, HAL (Lund & Burgess); COALS, Hellinger-PCA (Rohde et al., Lebret & Collobert)
• Fast training
• Efficient usage of statistics
• Primarily used to capture word similarity
• Disproportionate importance given to large counts

Direct prediction:
• Skip-gram/CBOW (Mikolov et al.); NNLM, HLBL, RNN (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton)
• Scales with corpus size
• Inefficient usage of statistics
• Generates improved performance on other tasks
• Can capture complex patterns beyond word similarity

26
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 59 / 94
GloVe

Encoding meaning in vector differences


[Pennington, Socher, and Manning, EMNLP 2014]

Crucial insight: Ratios of co-occurrence probabilities can encode meaning components

                          x = solid   x = gas   x = water   x = random
P(x | ice)                large       small     large       small
P(x | steam)              small       large     large       small
P(x | ice) / P(x | steam) large       small     ~1          ~1

Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 60 / 94
GloVe

Encoding meaning in vector differences


[Pennington, Socher, and Manning, EMNLP 2014]

Crucial insight: Ratios of co-occurrence probabilities can encode meaning components

                          x = solid    x = gas      x = water    x = fashion
P(x | ice)                1.9 x 10^-4  6.6 x 10^-5  3.0 x 10^-3  1.7 x 10^-5
P(x | steam)              2.2 x 10^-5  7.8 x 10^-4  2.2 x 10^-3  1.8 x 10^-5
P(x | ice) / P(x | steam) 8.9          8.5 x 10^-2  1.36         0.96

Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 61 / 94
GloVe

Encoding meaning in vector differences


Q: How can we capture ratios of co-occurrence probabilities as
linear meaning components in a word vector space?

A: Log-bilinear model: w_i · w_j = log P(i|j)

with vector differences: w_x · (w_a − w_b) = log P(x|a) / P(x|b)

Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 62 / 94
GloVe

Encoding meaning in vector differences


Q: How can we capture ratios of co-occurrence probabilities as
linear meaning components in a word vector space?

A: Log-bilinear model: w_i · w_j = log P(i|j)

with vector differences: w_x · (w_a − w_b) = log P(x|a) / P(x|b)

Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n

Note: P(i|j) ≠ P(j|i), so w and w̃ should be defined separately!

Correction: Log-bilinear model: w_i · w̃_j = log P(i|j)

with vector differences: w_x · (w̃_a − w̃_b) = log P(x|a) / P(x|b)

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 63 / 94
GloVe

Combining the best of both worlds


GloVe [Pennington et al., EMNLP 2014]

• Fast training
• Scalable to huge corpora
• Good performance even with
small corpus and small vectors
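For reference (the formula itself is not reproduced on this slide), the GloVe training objective from Pennington et al. (EMNLP 2014) is the weighted least-squares loss

$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$

where X_ij is the co-occurrence count of words i and j and f is a weighting function that caps the influence of very frequent pairs.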
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 64 / 94
GloVe

GloVe results

Nearest words to
frog:

1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus

[Images of litoria, leptodactylidae, rana and eleutherodactylus are shown on the slide]
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 65 / 94
Evaluation of word embeddings

Content

1 Distributional semantics

2 Word embeddings

3 Word2Vec

4 GloVe

5 Evaluation of word embeddings

6 Fasttext

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 66 / 94
Evaluation of word embeddings

5. How to evaluate word vectors?


• Related to general evaluation in NLP: Intrinsic vs. extrinsic
• Intrinsic:
• Evaluation on a specific/intermediate subtask
• Fast to compute
• Helps to understand that system
• Not clear if really helpful unless correlation to real task is established
• Extrinsic:
• Evaluation on a real task
• Can take a long time to compute accuracy
• Unclear if the subsystem is the problem or its interaction or other
subsystems
• If replacing exactly one subsystem with another improves accuracy → Winning!

32
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 67 / 94
Evaluation of word embeddings

Intrinsic word vector evaluation


• Word Vector Analogies: a : b :: c : ?

  man : woman :: king : ?

• Evaluate word vectors by how well their cosine distance after addition captures intuitive semantic and syntactic analogy questions
• Discarding the input words from the search!
• Problem: What if the information is there but not linear?

[Figure: vector offsets among man, woman and king illustrating the analogy]
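As an illustration (using the gensim library, which is an assumption here, and a hypothetical pretrained vector file), the analogy query can be run as:

# man : woman :: king : ?  ->  vector(king) - vector(man) + vector(woman)
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)  # hypothetical path
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))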

33
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 68 / 94
Evaluation of word embeddings

Glove Visualizations

34
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 69 / 94
Evaluation of word embeddings

Glove Visualizations: Company - CEO

35
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 70 / 94
Evaluation of word embeddings

Glove Visualizations: Superlatives

36
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 71 / 94
Evaluation of word embeddings

Analogy evaluation and hyperparameters

Results on the word analogy task (percent accuracy). SG and CBOW results are from Mikolov et al. (2013a,b); (i)vLBL results are from Mnih et al. (2013); SG† and CBOW† were trained with the word2vec tool. See the GloVe paper for details and a description of the SVD models.

Model   Dim.  Size  Sem.  Syn.  Tot.
ivLBL   100   1.5B  55.9  50.1  53.2
HPCA    100   1.6B   4.2  16.4  10.8
GloVe   100   1.6B  67.5  54.3  60.3
SG      300   1B    61    61    61
CBOW    300   1.6B  16.1  52.6  36.1
vLBL    300   1.5B  54.2  64.8  60.0
ivLBL   300   1.5B  65.2  63.0  64.0
GloVe   300   1.6B  80.8  61.5  70.3
SVD     300   6B     6.3   8.1   7.3
SVD-S   300   6B    36.7  46.6  42.1
SVD-L   300   6B    56.6  63.0  60.1
CBOW†   300   6B    63.6  67.4  65.7
SG†     300   6B    73.0  66.0  69.1
GloVe   300   6B    77.4  67.0  71.7
CBOW    1000  6B    57.3  68.9  63.7
SG      1000  6B    66.1  65.1  65.6
SVD-L   300   42B   38.4  58.2  49.2
GloVe   300   42B   81.9  69.3  75.0

Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 72 / 94
Evaluation of word embeddings

Analogy evaluation and hyperparameters

• More data helps
  • Wikipedia is better than news text!
• Dimensionality
  • Good dimension is ~300

[Figure 3: Accuracy on the analogy task for 300-dimensional vectors trained on different corpora: Wiki2010 (1B tokens), Wiki2014 (1.6B), Gigaword5 (4.3B), Wiki2014 + Gigaword5 (6B), Common Crawl (42B)]

[Figure 2: Semantic, syntactic and overall accuracy on the analogy task as a function of vector dimension and of window size, trained on the 6 billion token corpus; (a) symmetric context]

Wikipedia entries are updated to assimilate new knowledge, whereas Gigaword is a fixed news repository with outdated and possibly incorrect information.

Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 73 / 94
Evaluation of word embeddings

Another intrinsic word vector evaluation


• Word vector distances and their correlation with human judgments
• Example dataset: WordSim353
http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

Word 1 Word 2 Human (mean)


tiger cat 7.35
tiger tiger 10
book paper 7.46
computer internet 7.58
plane car 5.77
professor doctor 6.62
stock phone 1.62
stock CD 1.31
stock jaguar 0.92
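A minimal sketch (not from the lecture) of how such a correlation is computed, assuming gensim-style pretrained vectors at a hypothetical path and scipy for the Spearman coefficient; the word pairs are taken from the table above.

# Spearman correlation between human similarity scores and model cosine similarities.
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

wv = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)  # hypothetical path

pairs = [("tiger", "cat", 7.35), ("book", "paper", 7.46),
         ("computer", "internet", 7.58), ("plane", "car", 5.77),
         ("professor", "doctor", 6.62), ("stock", "phone", 1.62),
         ("stock", "CD", 1.31), ("stock", "jaguar", 0.92)]

human = [score for _, _, score in pairs]
model = [wv.similarity(w1, w2) for w1, w2, _ in pairs]

rho, pval = spearmanr(human, model)
print(rho)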
39
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 74 / 94
Evaluation of word embeddings
Correlation evaluation

• Word vector distances and their correlation with human judgments

Table 3: Spearman rank correlation on word similarity tasks. All vectors are 300-dimensional. The CBOW* vectors are from the word2vec website and differ in that they contain phrase vectors.

Model   Size  WS353  MC    RG    SCWS  RW
SVD     6B    35.3   35.1  42.5  38.3  25.6
SVD-S   6B    56.5   71.5  71.0  53.6  34.7
SVD-L   6B    65.7   72.7  75.1  56.5  37.0
CBOW†   6B    57.2   65.6  68.2  57.0  32.5
SG†     6B    62.8   65.2  69.7  58.1  37.2
GloVe   6B    65.8   72.7  77.8  53.9  38.1
SVD-L   42B   74.0   76.4  74.1  58.3  39.9
GloVe   42B   75.9   83.6  82.9  59.6  47.8
CBOW*   100B  68.4   79.6  75.4  59.4  45.5

• Some ideas from the GloVe paper have been shown to improve skip-gram (SG) also (e.g. using both vectors)

Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 75 / 94
Evaluation of word embeddings

Extrinsic word vector evaluation

• Extrinsic evaluation of word vectors: All subsequent tasks in this class

• One example where good word vectors should help directly: named entity recognition: finding a person, organization or location

Table 4: F1 score on NER task with 50d vectors. Discrete is the baseline without word vectors. We use publicly-available vectors for HPCA, HSMN, and CW. See text for details.

Model     Dev   Test  ACE   MUC7
Discrete  91.0  85.4  77.4  73.4
SVD       90.8  85.7  77.3  73.7
SVD-S     91.0  85.5  77.6  74.3
SVD-L     90.5  84.8  73.6  71.5
HPCA      92.6  88.7  81.7  80.7
HSMN      90.5  85.7  78.7  74.7
CW        92.2  87.4  81.7  80.2
CBOW      93.1  88.2  82.2  81.1
GloVe     93.2  88.3  82.9  82.2

• Next: How to use word vectors in neural net models!

Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 76 / 94
Evaluation of word embeddings

6. Word senses and word sense ambiguity


• Most words have lots of meanings!
• Especially common words
• Especially words that have existed for a long time

• Example: pike

• Does one vector capture all these meanings or do we have a


mess?
42
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 77 / 94
Evaluation of word embeddings

pike
• A sharp point or staff
• A type of elongated fish
• A railroad line or system
• A type of road
• The future (coming down the pike)
• A type of body position (as in diving)
• To kill or pierce with a pike
• To make one’s way (pike along)
• In Australian English, pike means to pull out from doing
something: I reckon he could have climbed that cliff, but he
piked!

43
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 78 / 94
Evaluation of word embeddings

Improving Word Representations Via Global Context


And Multiple Word Prototypes (Huang et al. 2012)
• Idea: Cluster word windows around words, retrain with each
word assigned to multiple different clusters bank1, bank2, etc

44
Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 79 / 94
Evaluation of word embeddings

Linear Algebraic Structure of Word Senses, with


Applications to Polysemy (Arora, …, Ma, …, TACL 2018)
• Different senses of a word reside in a linear superposition (weighted
sum) in standard word embeddings like word2vec
• $v_{\text{pike}} = \alpha_1 v_{\text{pike}_1} + \alpha_2 v_{\text{pike}_2} + \alpha_3 v_{\text{pike}_3}$
• Where $\alpha_1 = \dfrac{f_1}{f_1 + f_2 + f_3}$, etc., for frequency f
• Surprising result:
• Because of ideas from sparse coding you can actually separate out
the senses (providing they are relatively common)

45

Christopher Manning, Natural Language Processing with Deep Learning, Standford U. CS224n
Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 80 / 94
Fasttext

Content

1 Distributional semantics

2 Word embeddings

3 Word2Vec

4 GloVe

5 Evaluation of word embeddings

6 Fasttext

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 81 / 94
Fasttext

Limitation of Skip-Gram
 It is difficult for good representations of rare words to be learned with traditional word2vec.
 There could be words in the NLP task that were not present in the word2vec training corpus.
- This limitation is more pronounced in the case of morphologically rich languages.
EX) In French or Spanish, most verbs have more than forty different inflected forms, while Finnish has fifteen cases for nouns.
-> It is possible to improve vector representations for morphologically rich languages by using character-level information.

IDS Lab.
Piotr Bonjanowski, 5
Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 82 / 94
Fasttext

Example
 German verb : ‘sein’ (English verb : ‘be’)

IDS Lab.
Piotr Bonjanowski, 6
Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 83 / 94
Fasttext

Character n-gram based model


• The basic skip-gram model described above ignores the internal structure of the word.
• However, the character n-gram based model incorporates information about this structure in terms of character n-gram embeddings.
• The paper supposes that each word w is represented as a bag of character n-grams.
EX) For the word where with n = 3, it will be represented by the character n-grams
<wh, whe, her, ere, re>
and the special sequence <where>.
Note that the word <her> is different from the tri-gram her taken from the word where.

IDS Lab.
Piotr Bonjanowski, 8
Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 84 / 94
Fasttext

Character n-gram based model (cont’d)

• We represent a word by the sum of the vector representations of its n-grams. We thus obtain the scoring function:

$$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^\top v_c$$

w: the given word
$\mathcal{G}_w$: the set of n-grams appearing in word w
z_g: the vector representation of each n-gram g
v_c: the word vector of the context word c
• We extract all the n-grams with 3 ≤ n ≤ 6.
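A minimal sketch (not the FastText implementation) of extracting the character n-grams with boundary symbols and computing the score s(w, c); the n-gram vector table z is a hypothetical dictionary here.

# Character n-grams with boundary symbols, plus the score s(w, c) = sum_g z_g^T v_c.
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"
    grams = {w[i:i + n] for n in range(n_min, n_max + 1)
                         for i in range(len(w) - n + 1)}
    grams.add(w)                    # the special sequence for the whole word
    return grams

def score(word, v_c, z, dim):
    zero = np.zeros(dim)
    z_sum = sum((z.get(g, zero) for g in char_ngrams(word)), zero)
    return z_sum @ v_c

print(sorted(g for g in char_ngrams("where") if len(g) == 3))
# ['<wh', 'ere', 'her', 're>', 'whe']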

IDS Lab. 9

Piotr Bonjanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 85 / 94
Fasttext

Computing word vector representation

Piotr Bonjanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides
IDS Lab. 10

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 86 / 94
Fasttext

Experiments Settings
 Target Languages
• German / English / French / Spanish /
Arabic / Romanian / Russian / Czech

 Kind of tasks
1. Human similarity judgement
2. Word analogy tasks
3. Comparison with morphological representations
4. Effect of the size of the training data
5. Effect of the size of n-grams

IDS Lab.
Piotr Bonjanowski, 11
Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 87 / 94
Fasttext

1. Human similarity judgement


 Correlation between human judgement and similarity scores on word
similarity datasets.

RW : Rare Words dataset

sg : Skip-Gram
cbow : continuous bag of words
sisg- : Subword Information Skip-Gram
(Treat unseen words as a null vector)
sisg : Subword Information Skip-Gram
(Treat unseen words by summing the n-gram vectors)

IDS Lab.
Piotr Bonjanowski, 12
Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 88 / 94
Fasttext

2. Word analogy tasks


 Accuracy of our model and baselines on word analogy tasks for Czech,
German, English and Italian

* It is observed that morphological


information significantly improves the
syntactic tasks; our approach outperforms
the baselines. In contrast, it does not help for
semantic questions, and even degrades the
performance for German and Italian.
IDS Lab. 13
Piotr Bonjanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 89 / 94
Fasttext

3. Comparison with morphological representations

 Spearman’s rank correlation coefficient between human judgement and


model scores for different methods using morphology to learn word
representations.

IDS Lab.
Piotr Bonjanowski, 14
Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 90 / 94
Fasttext

4. Effect of the size of the training data

 Influence of size of the training data on performance


(Data : full Wikipedia dump / Task : 1. similarity task)

 The sisg model is more robust to the size of the training data.

 However, the performance of the baseline cbow model gets better as more and more data becomes available. The sisg model, on the other hand, seems to quickly saturate, and adding more data does not always lead to improved results.
 It is observed that the performance of sisg trained on a very small dataset is already better than that of the baseline cbow model.

IDS Lab.
Piotr Bonjanowski, 15
Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 91 / 94
Fasttext

5. Effect of the size of n-grams


[Table: accuracy as a function of the minimum and maximum values of n]

 The choice of the n boundaries is observed to be language and task dependent.
 Results are always improved by taking n ≥ 3 rather than n ≥ 2, which shows that character 2-grams are not informative for this task.

IDS Lab.
Piotr Bonjanowski, 16
Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 92 / 94
Fasttext

Conclusion
 The general Skip-Gram model has some limitations (e.g. OOV words).
 But this can be overcome by using subword information (character n-grams).
 The model is simple. Because of this simplicity, it trains fast and does not require any preprocessing or supervision.
 It works better for certain languages (e.g. German).

IDS Lab.
Piotr Bonjanowski, 17
Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information, slides

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 93 / 94
Summary

Content

1 Distributional semantics

2 Word embeddings

3 Word2Vec

4 GloVe

5 Evaluation of word embeddings

6 Fasttext

Qun Liu & Valentin Malykh (Huawei) Natural Language Processing Autumn 2020 94 / 94
