Week 5
Introduction
Concepts of word senses
Vector semantics & embedding
Tf-idf
Word2vec
Introduction
In natural language, the meaning of a word is fully reflected
by its context and its contextual relations.
Semantics can address meaning at the levels of words,
phrases, sentences, or larger units of discourse.
Word representations are inputs for the learning models in
Natural Language Understanding tasks.
Word embedding is a term used for the representation of
words for text analysis, typically as a real-valued vector that
encodes the meaning of the word.
Words that are closer in the vector space are expected to be
similar in meaning.
Concepts of word senses
Word senses have a complex many-to-many association with
words (homonymy, multiple senses).
The notion of word similarity is
very useful in larger semantic
tasks.
While words don’t have many
synonyms, most words do have
lots of similar words.
Cat is not a synonym of dog, but
cats and dogs are certainly similar
words.
Knowing how similar two words are can help in computing
how similar the meanings of two phrases or sentences are.
Word relatedness is also called “word association”
One common kind of relatedness between words is that they
belong to the same semantic field.
A semantic field is a set of words which cover a particular
semantic domain and can bear structured relations with
each other.
hospitals: surgeon, scalpel, nurse, anaesthetic, hospital
restaurants: waiter, menu, plate, food, chef
houses: door, roof, kitchen, family, bed
Affective meanings or connotations are related to a writer or
reader’s emotions, sentiment, opinions, or evaluations.
Some words have positive connotations (happy) while others
have negative connotations (sad).
Consider the differences in connotation between fake,
knockoff, forgery and copy, replica, reproduction.
Some words describe positive evaluation (great, love) and
others negative evaluation (terrible, hate).
The analysis of positive or negative evaluative language is
called sentiment analysis.
It is a common application of NLP, e.g. in analysing business
forms, customer product reviews, and social media.
Vector semantics & embedding
In vector semantics, we define meaning
as a point in space based on
distribution.
Similar words are nearby in semantic
space.
Crucially, as we'll see, we build this
space automatically by seeing which
words are nearby in text.
Every modern NLP algorithm uses
embeddings as the representation of
word meaning.
Imagine we have a collection of documents, e.g. Shakespeare's plays.
Term-document matrix : Each row represents a word in the
vocabulary and each column represents a document from the
collection.
Each cell in this matrix records the number of times a
particular word occurs in a particular document.
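As a quick illustration (my own sketch, not part of the slides), such a matrix can be built directly from raw counts in Python; the two documents below are made up.

```python
from collections import Counter

# Toy documents (made up for illustration)
docs = {
    "doc1": "the fool and the battle",
    "doc2": "the fool laughed at the fool",
}

# Count how often each word occurs in each document
counts = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted({word for c in counts.values() for word in c})

# Term-document matrix: rows are words, columns are documents
for word in vocab:
    row = [counts[name][word] for name in docs]
    print(f"{word:10s} {row}")
```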
Term-document matrix: (the slide shows a table of counts of words such as fool and battle across several Shakespeare plays)
Example: Based on the vector
space of [fool, battle], calculate
cosine similarity of word counts
between pairs of documents
Example: Based on the vector space of [fool, battle], calculate
cosine similarity of term counts between pairs of documents.
1. Henry V (4, 13) and Julius Caesar (1, 7)
Dot product = 4*1 + 13*7 = 95
|Henry V| = sqrt(4^2 + 13^2) = 13.60
|Julius Caesar| = sqrt(1^2 + 7^2) = 7.07
cosine = 95 / (13.60 * 7.07) = 0.988
2. Julius Caesar (1, 7) and Twelfth Night (58, 0)
Dot product = 1*58 + 7*0 = 58
|Julius Caesar| = 7.07, |Twelfth Night| = 58
cosine = 58 / (7.07 * 58) = 0.14
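The same calculation can be checked in a few lines of NumPy (a sketch I'm adding; the vectors are the [fool, battle] counts used above).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# [fool, battle] counts from the term-document matrix
henry_v       = np.array([4, 13])
julius_caesar = np.array([1, 7])
twelfth_night = np.array([58, 0])

print(cosine(henry_v, julius_caesar))        # ~0.988
print(cosine(julius_caesar, twelfth_night))  # ~0.14
```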
tf-idf
tf (term frequency) may be defined as the ratio of the number
of times a term t appears in a document d to the total number
of words in that document:
tf_{t,d} = \frac{\mathrm{Count}(t,d)}{N_d}
There are other variant definitions of tf.
Frequency is clearly useful; if sugar appears a lot near
apricot, that's useful information.
But overly frequent words like the, it, or they are not very
informative about the context
tf-idf
idf: inverse document frequency
idf_t = \log\left(\frac{N}{df_t}\right)
where N is the total number of documents in the collection and
df_t is the number of documents that contain the term t.
Words like "the" or "good" have very low idf
tf-idf value for term t in document d:
tf\text{-}idf_{t,d} = tf_{t,d} \times idf_t
Example:
Document 1: I love artificial intelligence, big love!
Document 2: I like computational intelligence.
word            index  tf(D1)  tf(D2)  idf(D1)      idf(D2)      tf-idf(D1)       tf-idf(D2)
I               0      1/6     1/4     log(2/2)=0   log(2/2)=0   0                0
love            1      2/6     0       log(2/1)     -            (2/6)*log(2/1)   0
like            2      0       1/4     -            log(2/1)     0                (1/4)*log(2/1)
artificial      3      1/6     0       log(2/1)     -            (1/6)*log(2/1)   0
computational   4      0       1/4     -            log(2/1)     0                (1/4)*log(2/1)
intelligence    5      1/6     1/4     log(2/2)=0   log(2/2)=0   0                0
big             6      1/6     0       log(2/1)     -            (1/6)*log(2/1)   0
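The table can be reproduced with a short script (my own sketch, not part of the slides); it uses log base 10, as in the exam-style answer later, and strips punctuation so the document lengths are 6 and 4 words.

```python
import math
import re

raw_docs = [
    "I love artificial intelligence, big love!",
    "I like computational intelligence.",
]
# Keep only alphabetic tokens, lowercased: 6 words in D1, 4 words in D2
docs = [re.findall(r"[a-z]+", d.lower()) for d in raw_docs]

def tf(term, doc):
    # term frequency: count of term in doc over document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency, log base 10 as in the worked examples
    df = sum(term in doc for doc in docs)
    return math.log10(len(docs) / df)

for i, doc in enumerate(docs, start=1):
    for term in sorted(set(doc)):
        print(f"D{i} {term:15s} tf-idf = {tf(term, doc) * idf(term, docs):.4f}")
```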
Exam style question:
The table below contains three documents, each consisting of
one sentence.
a) Consider all three documents. Identify words with zero Tf-idf
value for each document.
b) Which word in these documents has the highest tf value?
c) Which country name in these documents has the highest
tf-idf value?
Doc 1 Germany, Germany, France.
Doc 2 Has Germany won over France?
Doc 3 England lost to Germany.
Doc 1 Germany, Germany, France.
Doc 2 Has Germany won over France?
Doc 3 England lost to Germany.
a) Consider all three documents. Identify words with zero Tf-idf value in
each document.
Since Germany appears in all three documents, its idf
= log(3/3) = 0, so its tf-idf is zero in each document.
b) Which word in these documents has the highest tf value?
Germany in Doc 1, with tf = 2/3, since Doc 1 is the shortest document.
c) Which country name in any document has the highest tf-idf value?
Comparing France and England:
Doc 1: France: (1/3) log(3/2) = 0.0587
Doc 3: England: (1/4) log(3/1) = 0.119
So England has the highest tf-idf value.
Word2vec
The word2vec algorithm learns word associations from a
large corpus of text.
Each distinct word is represented by a particular list of
numbers, i.e. a vector.
It is a popular embedding method and very fast to train.
Word2vec provides various options. We'll do skip-gram with
negative sampling (SGNS)
The intuition of word2vec is that, instead of counting how
often each word w occurs near, say, apricot, we train a
classifier on a binary prediction task:
“Is word w likely to show up near apricot?”
Avoids the need for any sort of hand-labeled supervision
signal.
Train a logistic regression classifier instead of a multi-layer
neural network.
Semantic and syntactic patterns can be reproduced using
vector arithmetic.
"Brother" - "Man" + "Woman" produces a result which is
closest to the vector representation of "Sister" in the model.
You can download word2vec from
https://code.google.com/archive/p/word2vec/
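As an aside (not on the slides), the same algorithm is available off the shelf, e.g. in the gensim library (gensim 4.x parameter names assumed); the toy corpus below is only illustrative, so a large corpus or the pre-trained vectors above are needed for meaningful analogies.

```python
from gensim.models import Word2Vec

# Toy corpus (illustrative only; real training needs a large corpus)
sentences = [
    ["the", "man", "walked", "with", "his", "brother"],
    ["the", "woman", "walked", "with", "her", "sister"],
    ["the", "king", "rules", "the", "country"],
    ["the", "queen", "rules", "the", "country"],
]

# sg=1 selects skip-gram; negative=5 enables negative sampling (SGNS)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1, negative=5, epochs=100, seed=1)

print(model.wv["brother"][:5])  # first few dimensions of one embedding
# Analogy-style query: Brother - Man + Woman
print(model.wv.most_similar(positive=["brother", "woman"],
                            negative=["man"], topn=3))
```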
(Diagram: skip-gram architecture; the output is a score over the V vocabulary words as possible context words.)
Positive (target, context) training pairs are drawn from a +/- 2
word window around each target word.
Loss function for one target word w with one positive context c_pos and
k negative samples c_neg1 ... c_negk:
L_{CE} = -\left[\log \sigma(c_{pos} \cdot w) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i} \cdot w)\right]
where
\sigma(c \cdot w) = \frac{1}{1 + \exp(-c \cdot w)}
This is to be minimised by updating the embeddings w and c.
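A minimal NumPy sketch (mine, not from the slides) of one gradient-descent step on this loss; the embedding dimension, learning rate, and k = 2 negatives are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c_pos, c_negs, lr=0.1):
    """One SGD step on L = -[log sigma(c_pos.w) + sum_i log sigma(-c_neg_i.w)].

    w      : target-word embedding, shape (d,)
    c_pos  : positive context embedding, shape (d,)
    c_negs : k negative context embeddings, shape (k, d)
    """
    g_pos = sigmoid(c_pos @ w) - 1.0        # error term for the positive pair
    g_negs = sigmoid(c_negs @ w)            # error terms for the k negative pairs

    grad_w      = g_pos * c_pos + g_negs @ c_negs   # dL/dw
    grad_c_pos  = g_pos * w                         # dL/dc_pos
    grad_c_negs = np.outer(g_negs, w)               # dL/dc_neg_i

    return w - lr * grad_w, c_pos - lr * grad_c_pos, c_negs - lr * grad_c_negs

# Toy usage: 50-dimensional embeddings, k = 2 negative samples
rng = np.random.default_rng(0)
w, c_pos, c_negs = rng.normal(size=50), rng.normal(size=50), rng.normal(size=(2, 50))
w, c_pos, c_negs = sgns_step(w, c_pos, c_negs)
```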
Two sets of embeddings
Start with V random d-dimensional vectors as initial
embeddings
SGNS learns two sets of embeddings
Target embeddings matrix W
Context embedding matrix C
It's common to just add them together, representing word
i as the vector w_i + c_i.
Exam style question:
A word2vec algorithm, as shown below, employs a skip-gram
with a +/- 1 word window and takes the incoming sentence for
training.
a) Suggest 4 pairs of positive {word, context} examples drawn
from the sentence.
b) Explain why negative examples are needed and how these
can be generated.
“I like to eat chicken and French fries”
(Diagram: the input word W(t) feeds a logistic model that predicts the context words W(t-1) and W(t+1).)
a) Suggest 4 pairs of positive {word, context} examples drawn from the sentence.
Sentence: "I like to eat chicken and French fries"
Positive examples:
w      c
like   I
like   to
to     like
to     eat
(or other choices)
b) Explain why negative examples are needed and how these can be generated.
Since the logistic model needs negative samples to train, each positive example is
matched with noise context words that are unlikely to appear near the input word.
There are generally more negative samples than positive ones, e.g. twice as many.
Negative examples:
w      c              w      c
like   zeal           like   sellotape
like   saki           like   cradle
to     tooth          to     kechi
to     deadly         to     Samsung
(or other choices)
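A small sketch (mine, not from the slides) of how such positive and negative pairs could be generated automatically; the noise vocabulary and k = 2 negatives per positive pair are illustrative choices. In practice, word2vec samples negatives from a weighted unigram distribution over the corpus vocabulary rather than uniformly.

```python
import random

sentence = "I like to eat chicken and French fries".split()
# Noise words added to the vocabulary purely for illustration
vocab = set(sentence) | {"zeal", "saki", "tooth", "sellotape", "cradle"}

def positive_pairs(tokens, window=1):
    """(target, context) pairs from a +/- window around each target word."""
    pairs = []
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((w, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def negative_pairs(pairs, vocab, k=2, seed=1):
    """k random noise contexts per positive pair, avoiding the true pair."""
    rng = random.Random(seed)
    negs = []
    for w, c in pairs:
        candidates = [v for v in vocab if v not in (w, c)]
        negs.extend((w, rng.choice(candidates)) for _ in range(k))
    return negs

pos = positive_pairs(sentence)
print(pos[:4])                    # e.g. ('I', 'like'), ('like', 'I'), ('like', 'to'), ...
print(negative_pairs(pos, vocab)[:4])
```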