Vector semantics and word embedding

 Introduction
 Concepts of word senses
 Vector semantics & embedding
 Tf-idf
 Word2vec
Introduction
 In natural language, the meaning of a word is fully reflected
by its context and its contextual relations.
 Semantics can address meaning at the levels of words,
phrases, sentences, or larger units of discourse.
 Word representations are inputs for the learning models in
Natural Language Understanding tasks.
 Word embedding is a term used for the representation of words for text analysis, typically as a real-valued vector that encodes the meaning of the word.
 Words that are closer in the vector space are expected to be similar in meaning.
CS3TM20©XH 2
Concepts of word senses
 Word senses have a complex many-to-many association with words (homonymy, multiple senses).
 Word senses have relations with each other:
 Synonymy
 Antonymy
 Similarity
 Relatedness
 Connotation

CS3TM20©XH 3
 The notion of word similarity is very useful in larger semantic tasks.
 While words don't have many synonyms, most words do have lots of similar words.
 Cat is not a synonym of dog, but cats and dogs are certainly similar words.
 Knowing how similar two words are can help in computing how similar the meaning of two phrases or sentences is.
CS3TM20©XH 4
 Word relatedness is also called "word association".
 One common kind of relatedness between words is if they belong to the same semantic field.
 A semantic field is a set of words which cover a particular semantic domain and can bear structured relations with each other.
 hospitals: surgeon, scalpel, nurse, anaesthetic, hospital
 restaurants: waiter, menu, plate, food, chef
 houses: door, roof, kitchen, family, bed
CS3TM20©XH 5
 Affective meanings or connotations are related to a writer's or reader's emotions, sentiment, opinions, or evaluations.
 Some words have positive connotations (happy) while others have negative connotations (sad).
 Compare the differences in connotation between fake, knockoff, forgery and copy, replica, reproduction.
 Some words describe positive evaluation (great, love) and others negative evaluation (terrible, hate).
 The analysis of positive or negative evaluative language is called sentiment analysis.
 It has many applications in NLP, e.g. business forms, customer product reviews, and social media analysis.
CS3TM20©XH 6
Vector semantics & embedding
 In vector semantics, we define meaning as a point in space, based on distribution.
 Similar words are nearby in semantic space.
 Crucially, as we'll see, we build this space automatically by seeing which words are nearby in text.
 Every modern NLP algorithm uses embeddings as the representation of word meaning.
CS3TM20©XH 7
 Imagine we have a collection of documents, e.g. Shakespeare's plays.
 Term-document matrix: each row represents a word in the vocabulary and each column represents a document from the collection.
 Each cell in the matrix records the number of times a particular word (row) occurs in a particular document (column).
CS3TM20©XH 8
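As an illustration (not part of the original slides), here is a minimal Python sketch that builds a term-document matrix for a made-up two-document toy corpus; the document texts and names are assumptions for the example.

from collections import Counter

# toy corpus: document name -> text (made-up mini "documents")
docs = {
    "Doc1": "fool fool battle fool",
    "Doc2": "battle battle fool",
}

# vocabulary = all distinct words across the collection
vocab = sorted({word for text in docs.values() for word in text.split()})

# term-document matrix: one row per word, one column per document,
# each cell = count of that word in that document
counts = {name: Counter(text.split()) for name, text in docs.items()}
term_doc = {word: [counts[name][word] for name in docs] for word in vocab}

for word, row in term_doc.items():
    print(word, row)   # e.g. battle [1, 2], fool [3, 1]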
 Term-document matrix: each document is represented by its column vector of word counts.
 Cosine similarity:
cos(v, w) = (v · w) / (|v| |w|)
where the dot product is v · w = Σᵢ vᵢwᵢ and |v| = √(Σᵢ vᵢ²) is the vector length.
CS3TM20©XH 9
Example: Based on the vector space of [fool, battle], calculate the cosine similarity of word counts between the following pairs of documents:
1. Henry V, Julius Caesar
2. Julius Caesar, Twelfth Night
CS3TM20©XH 10
Example: Based on the vector space of [fool, battle], calculate the cosine similarity of term counts between pairs of documents.
Count vectors from the term-document matrix: Henry V = (4, 13), Julius Caesar = (1, 7), Twelfth Night = (58, 0).
1. Henry V, Julius Caesar
Dot product = 4×1 + 13×7 = 95
|Henry V| = √(4² + 13²) = 13.60
|Julius Caesar| = √(1² + 7²) = 7.07
cos = 95 / (13.60 × 7.07) = 0.988
2. Julius Caesar, Twelfth Night
Dot product = 1×58 + 7×0 = 58
|Julius Caesar| = 7.07, |Twelfth Night| = 58
cos = 58 / (58 × 7.07) = 0.14
CS3TM20©XH 11
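The arithmetic above can be checked with a short Python sketch (added here for illustration) using the count vectors from the slide.

import math

# word-count vectors over the dimensions [fool, battle], as on the slide
henry_v       = [4, 13]
julius_caesar = [1, 7]
twelfth_night = [58, 0]

def cosine(v, w):
    # cos(v, w) = (v . w) / (|v| |w|)
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm = lambda x: math.sqrt(sum(xi * xi for xi in x))
    return dot / (norm(v) * norm(w))

print(round(cosine(henry_v, julius_caesar), 3))        # 0.988
print(round(cosine(julius_caesar, twelfth_night), 3))  # 0.141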
tf-idf
 tf (term frequency) may be defined as the ratio of the number of times the term t appears in a document d to the total number of words in that document:
tf_{t,d} = Count(t, d) / N_d
 There are other variant definitions of tf (e.g. raw or log-scaled counts).
 Frequency is clearly useful; if sugar appears a lot near apricot, that's useful information.
 But overly frequent words like the, it, or they are not very informative about the context.
CS3TM20©XH 12
tf-idf
 idf: inverse document frequency
idf_t = log(N / df_t)
where N is the total number of documents in the collection and df_t is the number of documents that contain the term t.
 Words like "the" or "good" have very low idf.
 tf-idf value for term t in document d:
tf-idf_{t,d} = tf_{t,d} × idf_t
CS3TM20©XH 13
Example:
Document 1: "I love artificial intelligence, big love!"
Document 2: "I like computational intelligence."

word          | index | tf (D1) | tf (D2) | idf (D1)   | idf (D2)   | tf-idf (D1)    | tf-idf (D2)
I             | 0     | 1/6     | 1/4     | log(2/2)=0 | log(2/2)=0 | 0              | 0
love          | 1     | 2/6     | 0       | log(2/1)   | -          | 2/6 × log(2/1) | 0
like          | 2     | 0       | 1/4     | -          | log(2/1)   | 0              | 1/4 × log(2/1)
artificial    | 3     | 1/6     | 0       | log(2/1)   | -          | 1/6 × log(2/1) | 0
computational | 4     | 0       | 1/4     | -          | log(2/1)   | 0              | 1/4 × log(2/1)
intelligence  | 5     | 1/6     | 1/4     | log(2/2)=0 | log(2/2)=0 | 0              | 0
big           | 6     | 1/6     | 0       | log(2/1)   | -          | 1/6 × log(2/1) | 0
CS3TM20©XH 14
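For illustration, here is a short Python sketch (not from the slides) that reproduces the table above; it assumes base-10 logarithms, matching the worked exam answer later in the deck.

import math

# the two example documents, lower-cased with punctuation stripped
docs = {
    "D1": "i love artificial intelligence big love".split(),
    "D2": "i like computational intelligence".split(),
}
N = len(docs)                                    # number of documents
vocab = sorted({w for words in docs.values() for w in words})

# document frequency: number of documents containing each term
df = {t: sum(t in words for words in docs.values()) for t in vocab}

for name, words in docs.items():
    for t in vocab:
        tf = words.count(t) / len(words)         # tf_{t,d} = Count(t, d) / N_d
        idf = math.log10(N / df[t])              # idf_t = log(N / df_t)
        print(name, t, round(tf * idf, 3))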
Exam style question:
The table below contains three documents, each consisting of one sentence.
a) Consider all three documents. Identify the words with zero tf-idf value for each document.
b) Which word in these documents has the highest tf value?
c) Which country name in these documents has the highest tf-idf value?
Doc 1: Germany, Germany, France.
Doc 2: Has Germany won over France?
Doc 3: England lost to Germany.

CS3TM20©XH 15
Doc 1: Germany, Germany, France.
Doc 2: Has Germany won over France?
Doc 3: England lost to Germany.

a) Consider all three documents. Identify the words with zero tf-idf value in each document.
Since Germany appears in all three documents, its idf multiplier is log(3/3) = 0, so its tf-idf is zero in every document.
b) Which word in these documents has the highest tf value?
Germany in Doc 1, with tf = 2/3, as Doc 1 is the shortest document.
c) Which country name in any document has the highest tf-idf value?
Comparing France and England:
Doc 1: France: (1/3) × log(3/2) = 0.0587
Doc 3: England: (1/4) × log(3/1) = 0.119, so England has the highest tf-idf value.
CS3TM20©XH 16
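A quick check of part (c), added for illustration (base-10 logarithm assumed, as in the answer above):

import math

# tf-idf of the two candidate country names
france_doc1  = (1 / 3) * math.log10(3 / 2)   # ~ 0.0587
england_doc3 = (1 / 4) * math.log10(3 / 1)   # ~ 0.119, so England wins
print(round(france_doc1, 4), round(england_doc3, 4))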
Word2vec
 The word2vec algorithm learns word associations from a large corpus of text.
 Each distinct word is represented by a particular list of numbers, i.e. a vector.
 It is a popular embedding method and is very fast to train.
 Word2vec provides various options. We'll do skip-gram with negative sampling (SGNS).
 The intuition of word2vec is that instead of counting how often each word w occurs near, say, apricot, we'll instead train a classifier on a binary prediction task:
"Is word w likely to show up near apricot?"
CS3TM20©XH 17
 Avoids the need for any sort of hand-labeled supervision signal.
 Trains a logistic regression classifier instead of a multi-layer neural network.
 Semantic and syntactic patterns can be reproduced using vector arithmetic:
"Brother" − "Man" + "Woman" produces a result which is closest to the vector representation of "Sister" in the model.
 You can download word2vec from https://code.google.com/archive/p/word2vec/
CS3TM20©XH 18
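As an illustrative sketch (not part of the slides), the analogy above can be reproduced with the gensim library, assuming the pretrained GoogleNews vectors file from the archive above has been downloaded locally:

from gensim.models import KeyedVectors

# path assumes the pretrained GoogleNews vectors downloaded from the link above
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# vector arithmetic: brother - man + woman is closest to sister
print(kv.most_similar(positive=["brother", "woman"], negative=["man"], topn=3))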
[Diagram: Skip-Gram architecture. Input: target word vector V(w(t)), e.g. "apricot"; projection layer feeding a logistic model; output: context word vectors V(w(t−2)) "tablespoon", V(w(t−1)) "of", V(w(t+1)) "jam", V(w(t+2)) "a".]
CS3TM20©XH 19
 Assuming a +/- 2 word window:
…lemon, a [tablespoon of apricot jam, a] pinch…
(c1 = tablespoon, c2 = of, target = apricot, c3 = jam, c4 = a)
 Goal: train a classifier that is given a candidate (word, context) pair, e.g.
(apricot, jam)
(apricot, aardvark)
 and assigns each pair a probability:
P(+|w, c) or P(−|w, c) = 1 − P(+|w, c)
CS3TM20©XH 20
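A minimal sketch (illustrative only, with random vectors standing in for trained embeddings and an arbitrary dimension) of how the classifier turns a (word, context) pair into a probability via the dot product and a sigmoid:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d = 50                          # embedding dimension (arbitrary choice)
w     = rng.normal(size=d)      # target-word embedding, e.g. "apricot"
c_pos = rng.normal(size=d)      # real context word, e.g. "jam"
c_neg = rng.normal(size=d)      # noise word, e.g. "aardvark"

p_plus  = sigmoid(c_pos @ w)    # P(+ | w, c): c is a real context word
p_minus = 1.0 - p_plus          # P(- | w, c) = 1 - P(+ | w, c)
print(p_plus, sigmoid(c_neg @ w))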
Skip-Gram Training data
…lemon, a [tablespoon of apricot jam, a] pinch…
(c1 = tablespoon, c2 = of, target = apricot, c3 = jam, c4 = a)

 K = ratio of negative samples to positive samples; here K = 2.
 Negative samples are noise words chosen because they have a low probability of being true context words (in standard word2vec they are drawn from a weighted unigram distribution; see the sketch below).
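A small sketch of negative sampling, added for illustration with made-up counts; standard word2vec draws noise words from a unigram distribution weighted by count^0.75:

import random

# made-up unigram counts from a corpus
counts = {"the": 500, "of": 300, "apricot": 5, "jam": 8, "aardvark": 1}

alpha = 0.75                                  # standard word2vec weighting exponent
weights = [c ** alpha for c in counts.values()]

# draw K = 2 noise context words per positive (word, context) example
negatives = random.choices(list(counts), weights=weights, k=2)
print(negatives)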


 Word2vec: How to learn vectors
Given the set of positive and negative training instances, and an initial set of embedding vectors, the goal of learning is to adjust those word vectors such that we:
• Maximize the similarity of the target word, context word pairs (w, cpos) drawn from the positive data
• Minimize the similarity of the (w, cneg) pairs drawn from the negative data.

Loss function for one w with cpos, cneg1 … cnegK:
L = −[ log σ(cpos · w) + Σ_{i=1..K} log σ(−cnegi · w) ]
where
σ(c · w) = 1 / (1 + exp(−c · w))
This is to be minimised by updating the vectors w and c.
Two sets of embeddings
 Start with V random d-dimensional vectors as initial
embeddings
 SGNS learns two sets of embeddings
Target embeddings matrix W
Context embedding matrix C
 It's common to just add them together, representing word
i as the vector wi + ci
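To make the loss concrete, here is a minimal numpy sketch (illustrative, not the original word2vec code) of the SGNS loss for one target word and one stochastic-gradient update of its vector; dimensions, learning rate, and the random vectors are assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c_pos, c_negs):
    # L = -[ log sigma(c_pos . w) + sum_k log sigma(-c_neg_k . w) ]
    loss = -np.log(sigmoid(c_pos @ w))
    for c_neg in c_negs:
        loss -= np.log(sigmoid(-c_neg @ w))
    return loss

def sgd_step_on_w(w, c_pos, c_negs, eta=0.1):
    # gradient of L with respect to the target vector w
    grad = (sigmoid(c_pos @ w) - 1.0) * c_pos
    for c_neg in c_negs:
        grad += sigmoid(c_neg @ w) * c_neg
    return w - eta * grad          # move w to decrease the loss

rng = np.random.default_rng(0)
w, c_pos = rng.normal(size=10), rng.normal(size=10)
c_negs = [rng.normal(size=10) for _ in range(2)]      # K = 2 negatives
print(sgns_loss(w, c_pos, c_negs))
print(sgns_loss(sgd_step_on_w(w, c_pos, c_negs), c_pos, c_negs))  # lower after the step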
Exam style question:
A word2vec algorithm, as shown below, employs a skip-gram with a +/- 1 word window and takes the incoming sentence for training.
a) Suggest 4 pairs of positive {word, context} examples drawn from the sentence.
b) Explain why negative examples are needed and how these can be generated.
"I like to eat chicken and French fries"
[Diagram: skip-gram with a +/- 1 word window: input w(t), logistic model, outputs w(t−1) and w(t+1).]
CS3TM20©XH 25
a) Suggest 4 pairs of positive {word, context} examples drawn from the sentence
"I like to eat chicken and French fries".
Positive:
w    | c
like | I
like | to
to   | like
to   | eat        (or other choices)
b) Explain why negative examples are needed and how these can be generated.
The logistic model needs negative samples to train. For each positive example, a noisy context word that is unlikely to appear near the input word is used as a negative sample. There are generally more negative samples than positive ones, e.g. twice as many.
Negative:
w    | c              w    | c
like | zeal           eat  | sellotape     (or other choices)
like | saki           like | cradle
to   | tooth          to   | kechi
to   | deadly         to   | Samsung
CS3TM20©XH 26
