NLP Course Lecture 03 - Huawei Noah's Ark Lab
Autumn 2020
A course delivered at MIPT, Moscow
Qun Liu & Valentin Malykh (Huawei)
Content
1 Distributional semantics
2 Word embeddings
3 Word2Vec
4 GloVe
5 Evaluation of word embeddings
6 Fasttext
Distributional semantics
Content
1 Distributional semantics
2 Word embeddings
3 Word2Vec
4 GloVe
5 Evaluation of word embeddings
6 Fasttext
Distributional semantics
Word representations
A sample of concordance
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Distributional semantics
Figure 10: Multidimensional scaling of three verb semantic classes. Inflected forms of the same verb (e.g. STEAL/STOLE/STOLEN/STEALING, SPEAK/SPOKE/SPOKEN, TAKE/TOOK/TAKEN, THROW/THREW/THROWN, SHOW/SHOWED/SHOWN, EAT/ATE/EATEN, GROW/GREW/GROWN) cluster together.
Distributional semantics
Figure 13: Multidimensional scaling for nouns and their associated verbs (e.g. JANITOR-CLEAN, SWIMMER-SWIM, TEACHER-TEACH, STUDENT-LEARN, DOCTOR-TREAT, PRIEST-PRAY, BRIDE-MARRY), from the COALS model.
An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence, Rohde et al. ms., 2005
Table 10: The 10 nearest neighbors and their percent correlation similarities for a set of nouns, under the COALS-14K model. Only the top three neighbors per noun survive in this extraction:
  gun:        46.4 handgun,    41.1 firearms,      41.0 firearm
  point:      32.4 points,     29.2 argument,      25.4 question
  mind:       33.5 minds,      24.9 consciousness, 23.2 thoughts
  monopoly:   39.9 monopolies, 27.8 monopolistic,  26.5 corporations
  cardboard:  47.4 plastic,    37.2 foam,          36.7 plywood
  lipstick:   42.9 shimmery,   40.8 eyeliner,      38.8 clinique
  leningrad:  24.0 moscow,     22.7 sevastopol,    22.7 petersburg
  feet:       59.5 inches,     57.7 foot,          52.0 metres
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Word embeddings
Content
1 Distributional semantics
2 Word embeddings
3 Word2Vec
4 GloVe
5 Evaluation of word embeddings
6 Fasttext
Word embeddings
Count-based representations
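The count-based recipe named above can be made concrete in a few lines. The following is a minimal sketch (toy corpus and window size of my own, not from the slides): build a word-word co-occurrence matrix with a symmetric window, then reduce it with a truncated SVD to obtain dense vectors, in the spirit of the SVD/COALS models discussed later.

import numpy as np

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"],
          ["i", "enjoy", "flying"]]
window = 1   # symmetric context window (illustrative value)

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts: X[i, j] = how often word j appears within
# `window` positions of word i.
X = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for t, center in enumerate(sent):
        for j in range(max(0, t - window), min(len(sent), t + window + 1)):
            if j != t:
                X[idx[center], idx[sent[j]]] += 1

# Truncated SVD: keep the top-k left singular vectors as dense word vectors.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * S[:k]
print(dict(zip(vocab, np.round(word_vectors, 2))))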
Word embeddings
Prediction-based representations
Word embeddings
Word2Vec
Content
1 Distributional semantics
2 Word embeddings
3 Word2Vec
4 GloVe
5 Evaluation of word embeddings
6 Fasttext
Word2Vec
Tomas Mikolov
Word2Vec Basic idea
Content
3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details
Word2Vec Basic idea
3. Word2vec: Overview
Word2vec (Mikolov et al. 2013) is a framework for learning
word vectors
Idea:
• We have a large corpus of text
• Every word in a fixed vocabulary is represented by a vector
• Go through each position t in the text, which has a center word
c and context (“outside”) words o
• Use the similarity of the word vectors for c and o to calculate
the probability of o given c (or vice versa)
• Keep adjusting the word vectors to maximize this probability
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
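A minimal sketch of the loop described in the bullets above (the sentence and the window size m = 2 are illustrative choices of mine): walk over each position t, take the center word c and its context ("outside") words o, and collect the (c, o) pairs whose probabilities P(o | c) the model is then trained to raise.

def center_context_pairs(tokens, m=2):
    # Yield (center, outside) pairs for every position t and offset within m.
    for t, center in enumerate(tokens):
        for j in range(max(0, t - m), min(len(tokens), t + m + 1)):
            if j != t:
                yield center, tokens[j]

tokens = "problems turning into banking crises as".split()
for c, o in center_context_pairs(tokens, m=2):
    print(f"P({o} | {c})")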
Word2Vec Basic idea
Word2Vec Overview
• Example windows and process for computing P(w_{t+j} | w_t):
  P(w_{t−2} | w_t), P(w_{t−1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Word2Vec Cross-entropy loss function
Content
3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details
Word2Vec Cross-entropy loss function
For each position t = 1, …, T, predict context words within a window of fixed size m, given the center word w_t:

Likelihood = L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t ; θ)

θ is all the variables to be optimized.

The objective function J(θ) = −(1/T) log L(θ) is sometimes called the cost or loss function.
Word2Vec Cross-entropy loss function
P(o | c) = exp(u_o^T v_c) / ∑_{w ∈ V} exp(u_w^T v_c)

Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
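A small numpy sketch of the prediction formula above, with placeholder random vectors: P(o | c) is a softmax over the dot products u_w^T v_c for all words w in the vocabulary.

import numpy as np

V, d = 10, 4                      # vocabulary size, embedding dimension (toy values)
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))       # "outside" vectors u_w, one row per word
v_c = rng.normal(size=d)          # center-word vector v_c

scores = U @ v_c                                 # u_w^T v_c for all w
probs = np.exp(scores) / np.exp(scores).sum()    # softmax

o = 3
print(probs[o], probs.sum())      # P(o | c), and a sanity check that the probabilities sum to 1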
Word2Vec Softmax
Content
3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details
Word2Vec Softmax
Softmax
Figure source: https://blog.csdn.net/xg123321123/article/details/80781611
Word2Vec Skip-gram Model
Content
3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details
Word2Vec Skip-gram Model
Word2Vec Training
Content
3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details
Word2Vec Training
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Word2Vec Derivation of gradients
Content
3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details
Word2Vec Derivation of gradients
J(θ) = −(1/T) log L(θ)
     = −(1/T) ∑_{t=1}^{T} ∑_{−m ≤ j ≤ m, j ≠ 0} log p(w_{t+j} | w_t ; θ)
     = −(1/T) ∑_{c ∈ corpus} ∑_{o ∈ context(c)} log p(o | c ; u, v)

∂J(θ)/∂θ = −(1/T) ∑_{c ∈ corpus} ∑_{o ∈ context(c)} ∂/∂θ log p(o | c ; u, v)
Word2Vec Derivation of gradients
∂/∂v_c [ u_o^T v_c − log ∑_{w=1}^{V} exp(u_w^T v_c) ]
  = u_o − 1/(∑_{w=1}^{V} exp(u_w^T v_c)) · ∂/∂v_c ∑_{w=1}^{V} exp(u_w^T v_c)
  = u_o − 1/(∑_{w=1}^{V} exp(u_w^T v_c)) · ∑_{w=1}^{V} ∂/∂v_c exp(u_w^T v_c)
  = u_o − 1/(∑_{w=1}^{V} exp(u_w^T v_c)) · ∑_{w=1}^{V} exp(u_w^T v_c) u_w
  = u_o − ∑_{x=1}^{V} [ exp(u_x^T v_c) / ∑_{w=1}^{V} exp(u_w^T v_c) ] u_x
  = u_o − ∑_{x=1}^{V} p(x | c) u_x
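A quick numerical check (my own sketch, not from the slides) that the result of the derivation above, ∂/∂v_c log p(o | c) = u_o − ∑_x p(x | c) u_x, matches a finite-difference gradient.

import numpy as np

rng = np.random.default_rng(1)
V, d, o = 8, 5, 2                 # toy vocabulary size, dimension, observed outside word
U = rng.normal(size=(V, d))       # outside vectors u_w
v_c = rng.normal(size=d)          # center vector v_c

def log_p(v):
    # log p(o | c) = u_o^T v - log sum_w exp(u_w^T v)
    scores = U @ v
    return U[o] @ v - np.log(np.exp(scores).sum())

probs = np.exp(U @ v_c)
probs /= probs.sum()
analytic = U[o] - probs @ U       # u_o - sum_x p(x|c) u_x

eps = 1e-6
numeric = np.array([(log_p(v_c + eps * e) - log_p(v_c - eps * e)) / (2 * eps)
                    for e in np.eye(d)])
print(np.max(np.abs(analytic - numeric)))   # tiny (finite-difference error): the derivation checks out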
Word2Vec Derivation of gradients
∂J(θ)/∂θ = −(1/T) ∑_{c ∈ corpus} ∑_{o ∈ context(c)} ∂/∂θ log p(o | c ; u, v)

∂J(θ)/∂v_c = −(1/T) ∑_{c ∈ corpus} ∑_{o ∈ context(c)} [ u_o − ∑_{x=1}^{V} p(x | c) u_x ]

∂J(θ)/∂u_o = −(1/T) ∑_{c ∈ corpus} ∑_{o ∈ context(c)} [ v_c − ∑_{x=1}^{V} p(o | x) v_x ]
Word2Vec Stochastic Gradient Descent
Content
3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details
Word2Vec Stochastic Gradient Descent
Note: Our objectives may not be convex like this :(
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Word2Vec Stochastic Gradient Descent
Gradient Descent
• Update equation (in matrix notation):
• Algorithm:
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
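The update equation itself is not reproduced in this extraction; in its standard form each step moves the parameters against the gradient, θ_new = θ_old − α ∇_θ J(θ). Below is a minimal sketch with a toy quadratic objective standing in for J: word2vec would plug in the gradients derived earlier, and stochastic gradient descent would estimate them from one window (or a small minibatch) at a time instead of the whole corpus.

import numpy as np

def grad_J(theta):
    # Gradient of a toy objective J(theta) = ||theta - target||^2 (placeholder for word2vec's J).
    target = np.array([1.0, -2.0, 0.5])
    return 2 * (theta - target)

theta = np.zeros(3)
alpha = 0.1                        # learning rate (step size)
for step in range(100):
    theta = theta - alpha * grad_J(theta)   # theta_new = theta_old - alpha * grad J(theta)
print(theta)                       # converges to the target [1, -2, 0.5]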
Word2Vec Stochastic Gradient Descent
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Word2Vec More details
Content
3 Word2Vec
Basic idea
Cross-entropy loss function
Softmax
Skip-gram Model
Training
Derivation of gradients
Stochastic Gradient Descent
More details
Word2Vec More details
Negative Sampling

J(u_o, C) = ∑_{w ∈ C} exp(u_o^T u_w) + ∑_{w ∉ C} exp(−u_o^T u_w)
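The formula above contrasts observed context words (w ∈ C) with sampled ones (w ∉ C). A common concrete form is the skip-gram negative-sampling objective of Mikolov et al. (2013), which for one (center, outside) pair maximizes log σ(u_o^T v_c) + ∑_k log σ(−u_k^T v_c) over K sampled negatives. A minimal numpy sketch follows; the vectors and the uniform noise distribution are placeholder choices (word2vec samples negatives from a smoothed unigram distribution).

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
V, d, K = 20, 8, 5                       # toy vocabulary size, dimension, number of negatives
U = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w
v_c = rng.normal(scale=0.1, size=d)      # center vector v_c
o = 7                                    # observed outside word
negatives = rng.integers(0, V, size=K)   # negatives sampled here from a uniform noise distribution

objective = (np.log(sigmoid(U[o] @ v_c))
             + np.log(sigmoid(-U[negatives] @ v_c)).sum())
print(objective)   # training pushes this up via SGD on U and v_c, using only K+1 rows per update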
GloVe
Content
1 Distributional semantics
2 Word embeddings
3 Word2Vec
4 GloVe
5 Evaluation of word embeddings
6 Fasttext
GloVe
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
GloVe
Ratios of co-occurrence probabilities: P(x | ice) / P(x | steam) is large for words related to ice but not steam (e.g. solid), small for words related to steam but not ice (e.g. gas), and ≈ 1 for words related to both or to neither (e.g. water, fashion).
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
GloVe
A: Log-bilinear model:
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
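The log-bilinear model leads to the GloVe weighted least-squares objective (Pennington et al., 2014): J = ∑_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j − log X_ij)², with a clipped weighting function f. A minimal sketch that just evaluates this loss on placeholder co-occurrence counts (the x_max = 100 and α = 0.75 values follow the paper; everything else is a toy stand-in):

import numpy as np

def f(x, x_max=100.0, alpha=0.75):
    # Clipped weighting: (x / x_max)^alpha below x_max, 1 above it.
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

rng = np.random.default_rng(3)
V, d = 50, 10
X = rng.integers(1, 200, size=(V, V)).astype(float)   # toy co-occurrence counts
W = rng.normal(scale=0.1, size=(V, d))                 # word vectors w_i
W_tilde = rng.normal(scale=0.1, size=(V, d))           # context vectors w~_j
b, b_tilde = np.zeros(V), np.zeros(V)                  # biases

residual = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - np.log(X)
J = np.sum(f(X) * residual ** 2)
print(J)   # training minimizes J with stochastic gradient descent / AdaGrad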
GloVe
• Fast training
• Scalable to huge corpora
• Good performance even with small corpus and small vectors
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
GloVe
GloVe results
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Evaluation of word embeddings
Content
1 Distributional semantics
2 Word embeddings
3 Word2Vec
4 GloVe
5 Evaluation of word embeddings
6 Fasttext
Evaluation of word embeddings
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Evaluation of word embeddings
a:b :: c:?
man:woman :: king:?
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
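A minimal sketch of how the analogy a : b :: c : ? is answered in practice: take the word whose vector is closest, by cosine similarity, to v_b − v_a + v_c, excluding a, b and c themselves. The tiny embedding table below is made up for illustration; real evaluations use trained GloVe or word2vec vectors.

import numpy as np

emb = {                       # toy 3-d vectors, for illustration only
    "man":   np.array([1.0, 0.0, 0.1]),
    "woman": np.array([1.0, 1.0, 0.1]),
    "king":  np.array([0.9, 0.0, 0.9]),
    "queen": np.array([0.9, 1.0, 0.9]),
    "apple": np.array([0.0, 0.2, 0.0]),
}

def analogy(a, b, c):
    # a : b :: c : ?  ->  word closest to v_b - v_a + v_c (excluding a, b, c)
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))

print(analogy("man", "woman", "king"))   # -> "queen" on this toy table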
Evaluation of word embeddings
GloVe Visualizations
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Evaluation of word embeddings
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Evaluation of word embeddings
Analogy evaluation and hyperparameters

Results on the word analogy task (percent accuracy). (i)vLBL results are from (Mnih et al., 2013); skip-gram (SG) and CBOW results are from (Mikolov et al., 2013a,b); SG† and CBOW† were trained with the word2vec tool. See the GloVe paper for details and a description of the SVD models.

Model    Dim.  Size   Sem.   Syn.   Tot.
ivLBL    100   1.5B   55.9   50.1   53.2
HPCA     100   1.6B    4.2   16.4   10.8
GloVe    100   1.6B   67.5   54.3   60.3
SG       300   1B     61     61     61
CBOW     300   1.6B   16.1   52.6   36.1
vLBL     300   1.5B   54.2   64.8   60.0
ivLBL    300   1.5B   65.2   63.0   64.0
GloVe    300   1.6B   80.8   61.5   70.3
SVD      300   6B      6.3    8.1    7.3
SVD-S    300   6B     36.7   46.6   42.1
SVD-L    300   6B     56.6   63.0   60.1
CBOW†    300   6B     63.6   67.4   65.7
SG†      300   6B     73.0   66.0   69.1
GloVe    300   6B     77.4   67.0   71.7
CBOW     1000  6B     57.3   68.9   63.7
SG       1000  6B     66.1   65.1   65.6
SVD-L    300   42B    38.4   58.2   49.2
GloVe    300   42B    81.9   69.3   75.0

Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Evaluation of word embeddings
Plots: analogy task accuracy (%) — semantic, syntactic and overall — as a function of vector dimension and of training corpus (Wiki2010 1B tokens, Wiki2014 1.6B tokens, Gigaword5 4.3B tokens, Wiki2014 + Gigaword5 6B tokens, Common Crawl 42B tokens).
Evaluation of word embeddings
Correlation evaluation
• Word vector distances and their correlation with human judgments

Table 3: Spearman rank correlation on word similarity tasks. All vectors are 300-dimensional. The CBOW* vectors are from the word2vec website and differ in that they contain phrase vectors.

Model   Size   WS353  MC    RG    SCWS  RW
SVD     6B     35.3   35.1  42.5  38.3  25.6
SVD-S   6B     56.5   71.5  71.0  53.6  34.7
SVD-L   6B     65.7   72.7  75.1  56.5  37.0
CBOW†   6B     57.2   65.6  68.2  57.0  32.5
SG†     6B     62.8   65.2  69.7  58.1  37.2
GloVe   6B     65.8   72.7  77.8  53.9  38.1
SVD-L   42B    74.0   76.4  74.1  58.3  39.9
GloVe   42B    75.9   83.6  82.9  59.6  47.8
CBOW*   100B   68.4   79.6  75.4  59.4  45.5

• Some ideas from the GloVe paper have been shown to improve skip-gram (SG) as well (e.g. using both vectors).
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
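A minimal sketch of the correlation evaluation reported above: score each word pair by the cosine similarity of its vectors and compute the Spearman rank correlation with human judgements. The word pairs, scores and vectors below are placeholders standing in for datasets such as WordSim-353; scipy is assumed available for spearmanr.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
emb = {w: rng.normal(size=8) for w in
       ["tiger", "cat", "car", "automobile", "king", "cabbage"]}   # placeholder vectors

pairs = [("tiger", "cat"), ("car", "automobile"), ("king", "cabbage")]
human = [7.35, 8.94, 0.23]        # placeholder human judgements on a 0-10 scale

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

model = [cos(emb[a], emb[b]) for a, b in pairs]
rho, _ = spearmanr(model, human)
print(rho)                        # the kind of number reported in tables like the one above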
Evaluation of word embeddings
Extrinsic evaluation: named entity recognition, i.e. finding a person, organization or location. F1 scores with different word vectors (only the first rows survive in this extraction; see the GloVe paper for details):

Model     Dev   Test  ACE   MUC7
Discrete  91.0  85.4  77.4  73.4
SVD       90.8  85.7  77.3  73.7
SVD-S     91.0  85.5  77.6  74.3
SVD-L     90.5  84.8  73.6  71.5

• Example: pike
Evaluation of word embeddings
pike
• A sharp point or staff
• A type of elongated fish
• A railroad line or system
• A type of road
• The future (coming down the pike)
• A type of body position (as in diving)
• To kill or pierce with a pike
• To make one’s way (pike along)
• In Australian English, pike means to pull out from doing something: I reckon he could have climbed that cliff, but he piked!
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Evaluation of word embeddings
Christopher Manning, Natural Language Processing with Deep Learning, Stanford U. CS224n
Fasttext
Content
1 Distributional semantics
2 Word embeddings
3 Word2Vec
4 GloVe
5 Evaluation of word embeddings
6 Fasttext
Fasttext
Limitation of Skip-Gram
It is difficult for good representations of rare words to be learned with traditional word2vec.
There could be words in the NLP task that were not present in the word2vec training corpus.
- This limitation is more pronounced for morphologically rich languages.
  Ex: In French or Spanish, most verbs have more than forty different inflected forms, while the Finnish language has fifteen cases for nouns.
-> It is possible to improve vector representations for morphologically rich languages by using character-level information.
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information (slides, IDS Lab.)
Fasttext
Example
German verb : ‘sein’ (English verb : ‘be’)
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information (slides, IDS Lab.)
Fasttext
s(w, c) = ∑_{g ∈ G_w} z_g^T v_c

w : the given word
G_w : the set of n-grams appearing in word w
z_g : the vector representation of each n-gram g
v_c : the word vector of the center word c

• We extract all the n-grams with 3 ≤ n ≤ 6.
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information (slides, IDS Lab.)
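A minimal sketch of the subword scoring above: pad the word with the boundary markers < and >, extract all character n-grams with 3 ≤ n ≤ 6 plus the whole padded word, and score the pair by summing z_g^T v_c over those n-grams. The vector lookup here is a random placeholder keyed by the n-gram string; a trained model would store one learned vector per n-gram.

import numpy as np

def ngrams(word, n_min=3, n_max=6):
    # Character n-grams of the word padded with boundary symbols, plus the whole padded word.
    padded = f"<{word}>"
    grams = {padded[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(padded) - n + 1)}
    grams.add(padded)
    return grams

d = 16
rng = np.random.default_rng(5)

def vec(key):
    # Placeholder lookup: a deterministic random vector per n-gram string.
    return np.random.default_rng(abs(hash(key)) % (2**32)).normal(size=d)

def score(word, v_c):
    # s(w, c) = sum over n-grams g of z_g^T v_c
    return sum(vec(g) @ v_c for g in ngrams(word))

v_c = rng.normal(size=d)
print(sorted(ngrams("where")))   # e.g. "<wh", "whe", "her", "ere", "re>", "<where>", ...
print(score("where", v_c))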
Fasttext
Experiments Settings
Target Languages
• German / English / French / Spanish /
Arabic / Romanian / Russian / Czech
Kinds of tasks
1. Human similarity judgement
2. Word analogy tasks
3. Comparison with morphological representations
4. Effect of the size of the training data
5. Effect of the size of n-grams
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information (slides, IDS Lab.)
Fasttext
sg : Skip-Gram
cbow : continuous bag of words
sisg- : Subword Information Skip-Gram
(Treat unseen words as a null vector)
sisg : Subword Information Skip-Gram
(Treat unseen words by summing the n-gram vectors)
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information (slides, IDS Lab.)
Fasttext
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information (slides, IDS Lab.)
Fasttext
Minimum value of n
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information (slides, IDS Lab.)
Fasttext
Conclusion
The general Skip-Gram model has some limitations (e.g. OOV words).
These can be overcome by using subword information (character n-grams).
The model is simple; because of this simplicity, it trains fast and does not require any preprocessing or supervision.
It works better for certain languages (e.g. German).
Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov, Enriching Word Vectors with Subword Information (slides, IDS Lab.)
Summary
Content
1 Distributional semantics
2 Word embeddings
3 Word2Vec
4 GloVe
5 Evaluation of word embeddings
6 Fasttext