Unit 3: Distributional Semantics and Word Embeddings

Distributed representations aim to capture semantic relationships between words by representing words as dense, low-dimensional vectors based on their distributional properties and contexts of usage. This addresses limitations of previous high-dimensional, sparse representations that failed to capture relationships. Distributional semantics hypothesizes that words appearing in similar contexts have similar meanings, so their vector representations should also be close together. Word embedding algorithms like Word2Vec and GloVe learn vector representations from large corpora to map words with similar distributions to nearby points in the shared vector space.


Unit 3

Distributional Semantics and Word Embeddings


Word Embeddings
Limitations of the text representations discussed so far
• They fail to capture relationships between words.
• The feature vectors are sparse and high-dimensional.
• The dimensionality increases with the size of the vocabulary, with most values being zero for any vector. This hampers learning.
• Further, the high dimensionality makes these representations computationally inefficient.
• They cannot handle out-of-vocabulary (OOV) words.
This is an issue for NLP tasks, since we want to be able to capture the relations between words. Clearly, “coffee” is closer to “tea” than to “laptop”.
Distributed Representations: dense, low-dimensional representations of words and texts
• Distributional similarity: the meaning of a word can be understood from the context in which the word appears.
• This is also known as connotation: meaning is defined by context.
• This is opposed to denotation: the literal meaning of any word.
• For example, in “NLP rocks,” the literal meaning of the word “rocks” is “stones,” but from the context it is used to refer to something good and fashionable.
Distributional hypothesis
• In linguistics, the hypothesis that words that occur in similar contexts have similar meanings.
• For example, the English words “dog” and “cat” occur in similar contexts.
• Thus, according to the distributional hypothesis, there must be a strong similarity between the meanings of these two words.
• The meaning of a word is represented by a vector.
• Thus, if two words often occur in similar contexts, their corresponding representation vectors must also be close to each other, as the sketch below illustrates.
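To make the last point concrete, here is a minimal Python sketch (not from the slides): the three-dimensional word vectors are invented toy values, and cosine similarity, a standard measure of vector closeness, is used to check that “dog” and “cat” end up closer to each other than to “laptop”.

```python
# Toy illustration only: real embeddings are learned from corpora, not hand-written.
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 = very similar direction."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical dense vectors chosen by hand for illustration.
vectors = {
    "dog":    np.array([0.90, 0.80, 0.10]),
    "cat":    np.array([0.85, 0.75, 0.20]),
    "laptop": np.array([0.10, 0.20, 0.90]),
}

print(cosine_similarity(vectors["dog"], vectors["cat"]))     # high: similar contexts
print(cosine_similarity(vectors["dog"], vectors["laptop"]))  # low: dissimilar contexts
```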
Distributional representation
• Obtained from the distribution of words over the contexts in which the words appear.
• Mathematically, distributional representation schemes use high-dimensional vectors to represent words.
• These vectors are obtained from a co-occurrence matrix that captures the co-occurrence of words and contexts (see the sketch after this list).
• The dimension of this matrix is equal to the size of the vocabulary of the corpus.
• The four schemes that we’ve seen so far (one-hot, bag of words, bag of n-grams, and TF-IDF) all fall under the umbrella of distributional representation.
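The following is a minimal sketch, under simple assumptions (a two-sentence toy corpus and a symmetric context window of 1), of how such a word-context co-occurrence matrix can be built; the matrix is |V| × |V| and mostly zeros, which is exactly the sparse, high-dimensional situation described above.

```python
# Build a toy word-context co-occurrence matrix from a tiny corpus.
from collections import defaultdict

corpus = ["i drink coffee every morning", "i drink tea every evening"]
window = 1  # symmetric context window size (assumed for this sketch)

counts = defaultdict(int)
vocab = sorted({w for sent in corpus for w in sent.split()})

for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                counts[(w, words[j])] += 1  # count each (word, context word) pair

# Assemble the |V| x |V| co-occurrence matrix, one row per vocabulary word.
matrix = [[counts[(w, c)] for c in vocab] for w in vocab]
for w, row in zip(vocab, matrix):
    print(f"{w:>8}", row)
```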
Distributed representation
• Distributed representation schemes significantly compress the dimensionality.
• This results in vectors that are compact (i.e., low-dimensional) and dense (i.e., hardly any zeros).
• The resulting vector space is known as a distributed representation.
Embedding

• For the set of words in a corpus, an embedding is a mapping from the vector space coming from the distributional representation to the vector space coming from the distributed representation (one classic way to realize this mapping is sketched below).
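As a hedged illustration of that mapping (the slides do not name a specific method), one classic way to go from the sparse distributional space to a dense distributed space is a truncated SVD of the co-occurrence matrix, as in latent semantic analysis; the counts below are invented toy values.

```python
# Map a sparse co-occurrence (distributional) matrix to dense (distributed) vectors via SVD.
import numpy as np

# Toy co-occurrence counts for a 5-word vocabulary (rows = words, columns = context words).
vocab = ["coffee", "tea", "drink", "laptop", "keyboard"]
X = np.array([
    [0, 2, 5, 0, 0],   # coffee
    [2, 0, 4, 0, 0],   # tea
    [5, 4, 0, 0, 0],   # drink
    [0, 0, 0, 0, 3],   # laptop
    [0, 0, 0, 3, 0],   # keyboard
], dtype=float)

U, S, Vt = np.linalg.svd(X)       # factorize the distributional matrix
k = 2                             # target dimensionality of the distributed space (toy value)
embeddings = U[:, :k] * S[:k]     # dense k-dimensional vector per word

for word, vec in zip(vocab, embeddings):
    print(f"{word:>8}", np.round(vec, 3))
```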
Vector semantics
• This refers to the set of NLP methods that aim to learn word representations based on the distributional properties of words in a large corpus.
• The idea is to build a model in which words that are used in the same contexts are semantically similar to each other.
• We could use the phrase “a word is characterized by the company it keeps.”
• Let’s consider the following example of “tea” and “coffee” and think about which words they should be close to.

Word Embeddings: Intuition

We want a model that can state that “coffee” and “tea” are close to each other, and that they are also close to words like “cup,” “caffeine,” “drink,” and “sugar.”
Word Embeddings: The Algorithms
• All the available algorithms are based on the following principles:
• Semantically similar words are mapped to nearby points.
• The basic idea is the distributional hypothesis: words that appear in the same contexts share semantic meaning, like “tea” and “coffee.”
• The most common algorithms are Word2Vec (Mikolov et al., 2013, at Google) and GloVe (2014, Stanford). They take a large corpus of text as input and produce a vector space, typically of 100-300 dimensions.
• The corresponding word embeddings would then place “coffee” and “tea” close together and “laptop” far away; a minimal training sketch follows.
Language Construct: Semantics vs. Pragmatics

Semantics: the study of the meaning of words, sentences, and phrases.
• Lexical semantics: words and the meaning relationships among words.
• Phrasal/sentential semantics: syntactic units larger than a word.
Semantics, and why?
Semantics is the study of meaning.

It investigates questions such as:

What is meaning?

How come words and sentences have meaning?

What is the meaning of words and sentences?

How can the meanings of words combine to form the meaning of sentences?

Do two people mean the same thing when they utter the word ‘cat’?

How do we communicate? And so on.


Distributional semantics
• Distributional semantics is a subfield of Natural Language Processing that learns meaning from word usage.
• Distributional semantics relies on a specific view of meaning, i.e., that meaning comes from usage. In other words:
• Words as well as sentences are represented as vectors or tensors of real numbers. Vectors for words are obtained by observing how these words co-occur with other words in document collections.
For example, if one person shows a ring to another, saying “Here is the
ring”, there are several layers of meaning that could be examined.

At the level of semantics,


• the pronoun here indicates a proximal location;
• the verb be signifies existence in a location;
• the determiner the shows that both speaker and hearer have previous
knowledge of this ring; and the word ring picks out a particular type of
object in the world.
At the level of pragmatics,
depending on the context, the hearer might infer that this speech act is a
proposal of marriage, or a request for a divorce, or a directive to embark
upon a magical quest.
Distributional Semantics
• Words as well as sentences are represented as vectors or tensors of
real numbers.
• Vectors for words are obtained by observing how these words co-occur with other words in document collections.
• “You shall know a word by the company it keeps” (John Rupert Firth, 1957).
• The meaning of a word is the set of contexts in which it occurs in texts.
• Important aspects of the meaning of a word are a function of (can be approximated by) the set of contexts in which it occurs in texts.
• The distributional hypothesis suggests that we can induce (aspects of the) meaning of words from texts.
• This is its biggest selling point in computational linguistics: it is a “theory of meaning” that can easily be operationalized into a procedure to extract “meaning” from text corpora on a large scale.
Introduction to Word Vectors
• Word vectors represent a significant leap forward in advancing our
ability to analyze relationships across words, sentences, and
documents.
• In doing so, they advance technology by providing machines with
much more information about words than has previously been
possible using traditional representations of words.
• Word vectors are simply vectors of numbers that represent the
meaning of a word.
Word2Vec training (figure labels): one-hot encoding input, weights updated, self-supervised learning.

King – Man + Woman = ?
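A hedged sketch of this analogy using pretrained GloVe vectors through gensim’s downloader; the model name “glove-wiki-gigaword-100” is an assumption (any pretrained set of vectors would do), and most_similar with positive and negative word lists performs the vector arithmetic king − man + woman.

```python
# Answer "king - man + woman = ?" with pretrained word vectors.
import gensim.downloader as api

# Assumed pretrained model; downloads on first use.
wv = api.load("glove-wiki-gigaword-100")

result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3)
print(result)   # "queen" is typically ranked among the top answers
```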
