
Unit 2
Vector Space Models and Machine Translation
Vector Space Models
•Vector Space Models (VSMs) are fundamental tools in Natural Language Processing (NLP) that represent text (such as words, phrases, or entire documents) as vectors in a continuous, high-dimensional space.
•This representation enables various operations like comparison, clustering, and classification.
•It allows for the mathematical manipulation of text data and makes it possible to apply various machine learning algorithms to textual information.
Key Concepts of Vector Space Models
• Vector Representation:
•Each word or document is represented as a vector in an n-dimensional space.
•The dimensions correspond to unique terms or features (e.g., words, phrases).
•The value in each dimension typically represents the frequency (or weight) of the corresponding term in the document.
• Term-Document Matrix:
•A matrix where rows represent documents and columns represent
terms (or vice versa).
•Each entry in the matrix can be a simple count of terms (Term
Frequency, TF), a weighted frequency (e.g., TF-IDF), or other
measures.
Key Concepts of Vector Space Models
• Cosine Similarity:
•A common method to measure the similarity between two vectors.
•It computes the cosine of the angle between two vectors. A cosine
similarity of 1 indicates identical vectors, 0 indicates no similarity,
and -1 indicates completely opposite vectors.
• Dimensionality Reduction:
•Techniques like Latent Semantic Analysis (LSA) and PCA reduce the number of dimensions by capturing the underlying semantic structure.
•This helps in overcoming issues like synonymy (different words with
similar meanings) and polysemy (same word with different
meanings).
Example: Text Classification
• Imagine you're building a spam classifier. You have a dataset of emails labeled as "spam" or "not spam." To use a Vector Space Model (a minimal code sketch follows these steps):
• Tokenization and Preprocessing:
• Tokenize the text into words.
• Remove noise like stopwords (if desired) and perform stemming or
lemmatization.
• Vectorization:
• Construct a Term-Document Matrix, where each email is represented as a
vector of word frequencies.
• Apply weighting like TF-IDF to give more importance to rare but informative
words.
• Similarity Calculation:
• Calculate cosine similarity between new emails and existing labeled examples to
classify them as spam or not.
• Modeling:
• Use the vectors as features in machine learning models like logistic regression,
SVM, or neural networks for classification.
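A minimal sketch of this pipeline, assuming scikit-learn is available; the four example emails and their labels are made up purely for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative dataset (a real spam corpus would be far larger)
emails = ["win a free prize now", "limited offer click here",
          "meeting agenda for monday", "lunch with the project team"]
labels = ["spam", "spam", "not spam", "not spam"]

# Vectorization (TF-IDF over the term-document matrix) + modeling (logistic regression)
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
clf.fit(emails, labels)

print(clf.predict(["free prize offer", "monday project meeting"]))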
Applications of VSM in NLP

•Document Retrieval: Finding relevant documents based on a query.
•Text Classification: Categorizing documents into predefined
categories.
•Word Embeddings: Capturing semantic similarity between
words.
•Clustering: Grouping similar documents or words together.
Link of CODE

•https://colab.research.google.com/drive/1ut0s6idkttKNLWK2uI0VztJNoyre7UDY?usp=sharing
Word2vec Model
Types of Vector Space Models:
1.Bag of Words (BoW)
2.TF-IDF (Term Frequency-Inverse Document Frequency)
3.Word Embeddings

1. Bag of Words (BoW):


• Simplest form of VSM where each dimension represents a
unique word in the vocabulary.
• BOW Represents text (such as a sentence or document) as a
vector of word frequencies.
• Each element of the vector represents the frequency of a word
in the text.

• Example: For the sentences "apple banana apple" and "banana orange", with a vocabulary {apple, banana, orange}:
• "apple banana apple" = [2, 1, 0]
• "banana orange" = [0, 1, 1]
Example of a Simple BoW Model:
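A minimal BoW sketch (assuming scikit-learn) applied to the two example sentences above; note that CountVectorizer orders the columns by its alphabetically sorted vocabulary.

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["apple banana apple", "banana orange"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # ['apple' 'banana' 'orange']
print(bow.toarray())   # [[2 1 0], [0 1 1]] -- raw word counts per sentence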
2. TF-IDF (Term Frequency-Inverse Document Frequency):

• Enhances BoW by scaling down the impact of frequent but less informative words.
• Combines term frequency (TF) and inverse document frequency
(IDF) to provide a more meaningful representation.
• Weighs the frequency of words in a document relative to how
commonly they appear across all documents.
• This reduces the weight of common words like "the" and "is" and
increases the weight of rare but significant words.
• Example: Given two documents:
• Document 1: "apple banana apple"
• Document 2: "banana orange"
• The TF-IDF score will be higher for words that are unique to each document, as the sketch below shows.
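A minimal TF-IDF sketch for the same two documents, assuming scikit-learn; its smoothed IDF down-weights shared words rather than zeroing them out.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["apple banana apple", "banana orange"]
tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())      # ['apple' 'banana' 'orange']
print(scores.toarray().round(2))
# In each row, the word unique to that document ("apple" or "orange")
# receives a higher weight than "banana", which appears in both documents.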
3. Word Embeddings:
• Word embeddings convert words into numerical vectors in such a way that
words with similar meanings are positioned close to each other in this vector
space.
• These embeddings are learned from large text corpora and aim to capture
the context of a word within a sentence, its syntactic role, and its semantic
meaning.
• Dense vector representations that capture semantic
relationships.
• Represents words in a continuous vector space where
semantically similar words are closer together.
• Examples include Word2Vec, GloVe, and FastText. These
embeddings capture more semantic meaning than one-hot or
BoW models.
• Example: Using Word2Vec, "king" and "queen" might be
represented as vectors:
• king = [0.25, 0.8, -0.1, ..., 0.5]
Word Embedding
Capturing Semantic Meaning

•Semantic meaning in NLP refers to the meaning conveyed by words, phrases, or texts. Capturing semantic meaning involves understanding not just the individual words, but their relationships, contexts, and the nuances of their use.

•Word Embeddings: Word embeddings are dense vector representations of words where words with similar meanings have similar representations. They capture semantic relationships through context learned from large corpora of text.
Semantic Similarity and Relationships:
In vector space models, semantic similarity between words, phrases, or documents is
measured using mathematical functions like cosine similarity, Euclidean distance, or
dot product. Words with similar meanings or usage contexts tend to have vectors that
are close to each other in the vector space.
This allows models to:

• Identify Synonyms: Words like "car" and "automobile" will have similar vectors due
to their similar usage contexts.

• Detect Analogies: Word2Vec models, for example, can perform vector arithmetic to
solve analogies (e.g., "king - man + woman ≈ queen").

• Cluster Words by Topic: Words related to specific themes (like "sports," "finance,"
etc.) cluster together.
Semantic Similarity Example
• The Analogy Example: "king - man + woman ≈ queen"

• Word Vectors: Assume that each word—"king," "man," "woman," and "queen"—is
represented by a vector in a high-dimensional space.

1.Vector Arithmetic for Analogies:


•The expression "king - man + woman" involves vector subtraction and
addition:

• "king" - "man": This operation computes the vector difference between the vectors for
"king" and "man." This difference captures the concept of "royalty minus male," which
abstracts out the notion of "masculinity.“

• Resultant Vector + "woman": Adding the vector for "woman" to the resultant vector
effectively transforms the concept of "royalty minus male" into "royalty plus female." The
result is a vector that is close to the vector for "queen."
2. Solving the Analogy:
• The model looks for the word vector closest to the resultant vector of "king - man + woman". In most cases, the closest vector will be "queen."
• Mathematically, if V("king"), V("man"), V("woman"), and V("queen") are the vectors for these words, then:
V("king") − V("man") + V("woman") ≈ V("queen")

3. Capturing Relationships:
• This arithmetic reflects the understanding of the relationship between the
words:
• "King" is to "man" as "queen" is to "woman" (both represent royal titles for
different genders).
• The model has captured not just the meaning of individual words but also how they relate to each other in a broader semantic space, as the sketch below illustrates.
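A minimal sketch of this analogy using gensim; it assumes internet access to download the small pre-trained "glove-wiki-gigaword-50" vectors (any pre-trained word-vector set would do).

import gensim.downloader as api

# Load small pre-trained word vectors (downloaded on first use)
vectors = api.load("glove-wiki-gigaword-50")

# V("king") - V("man") + V("woman") ~= V("queen")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' is typically the closest word to the resulting vector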
Applications in Natural Language Processing (NLP):
Semantic meaning captured by vector space models is widely used in numerous
NLP applications:
• Machine Translation: Vector representations allow for translating meaning across
languages.
• Information Retrieval: Search engines use vector models to match query
semantics with relevant documents.
• Sentiment Analysis: Understanding the sentiment of a sentence involves
capturing the semantic meaning of words and phrases.
• Question Answering: Contextual embeddings enable models to understand and
generate responses accurately.
In sentiment analysis, VSMs help classify the sentiment of a text (e.g., positive or
negative) by representing words and documents as vectors. Words with similar
sentiment meanings (like "happy" and "joyful") will be close to each other in the
vector space, making it easier for machine learning models to distinguish between
different sentiments.
Popular Methods to Capture Semantic Meaning:

1. Word2Vec:
•Trained using neural networks on large corpora to create word vectors.
•Two main models: Continuous Bag of Words (CBOW) and Skip-gram.
•CBOW predicts the target word from surrounding context words.
•Skip-gram predicts surrounding context words given the target word.

2. GloVe (Global Vectors for Word Representation):


•Combines the benefits of matrix factorization and local context
windows.
•Trains on a global word-word co-occurrence matrix, capturing more
global semantics.
Relationships Between Words
• Understanding relationships between words is crucial for various NLP tasks. Word
embeddings, like Word2Vec, GloVe, and FastText, allow us to capture these
relationships by representing words as vectors in a continuous vector space. In this
space, words with similar meanings are located closer together, while dissimilar
words are farther apart.

• Word2Vec is a popular model for creating word vectors and capturing word
relationships. It comes in two flavors: Continuous Bag of Words (CBOW) and
Skip-gram.

• CBOW predicts the target word from its context words.

• Skip-gram predicts context words from the target word.


Continuous Bag of Words (CBOW):

•Objective: The CBOW model predicts a target word (center word) from a given context (surrounding words). It is based on the idea that the meaning of a word can be inferred from the words around it.

•How It Works: Given a set of surrounding words, CBOW predicts the central word.

•Example: In the sentence "The cat sat on the mat," given the context words "The," "cat," "on," and "the," CBOW would try to predict the central word "sat."
Example of CBOW Model
• Consider the sentence: "The quick brown fox jumps over the lazy dog."

• If we want to predict the word "fox" using the CBOW model with a context window of 2,
the context words would be ["quick", "brown", "jumps", "over"].

• Training Objective: Maximize the probability of the target word given the context words:
P(target word | context words)

• For the example above, the model aims to maximize:

P("fox" | "quick", "brown", "jumps", "over")

• The model takes the context words, averages their word embeddings (vector representations), and feeds this average into a neural network to predict the target word, as sketched below.
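A minimal numpy sketch of the CBOW forward pass for this example; the vocabulary, embedding size, and random (untrained) weights are assumptions made purely to show the averaging step.

import numpy as np

# Toy vocabulary and randomly initialised (untrained) weights -- illustrative only
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_id = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
embed_dim = 10
W_in = rng.normal(size=(len(vocab), embed_dim))    # context (input) embeddings
W_out = rng.normal(size=(embed_dim, len(vocab)))   # output projection to vocabulary scores

context = ["quick", "brown", "jumps", "over"]      # window of 2 around the target "fox"

# CBOW: average the context word embeddings ...
h = np.mean([W_in[word_to_id[w]] for w in context], axis=0)

# ... then score every vocabulary word and turn the scores into probabilities (softmax)
scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print("P('fox' | context) =", probs[word_to_id["fox"]])
# Training would adjust W_in and W_out to make this probability as high as possible.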
CBOW Model
Skip-gram
• Objective: The Skip-gram model is the inverse of the CBOW model. It predicts
context words given a target word. This model is particularly effective for
capturing rare word representations and learning more about the context of the
target word.

• How It Works: Given a central word, Skip-gram predicts the surrounding context
words.

• Example: In the same sentence, given the word "sat," Skip-gram would try to
predict the context words "The," "cat," "on," and "the."
Example of Skip Gram Model

• Using the same sentence: "The quick brown fox jumps over the lazy dog."

• If "fox" is the target word and the context window is 2, the Skip-gram model tries to
predict the context words ["quick", "brown", "jumps", "over"] based on the target
word "fox".

• Training Objective: Maximize the probability of context words given the target word:
P(context words | target word)

• For the example above, the model aims to maximize:
P("quick", "brown", "jumps", "over" | "fox")

• The model uses the target word to predict its surrounding context words. A gensim training sketch covering both CBOW and Skip-gram follows.
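A minimal gensim sketch, assuming gensim is installed, that trains both variants on the toy sentence; the sg flag switches between CBOW (sg=0) and Skip-gram (sg=1).

from gensim.models import Word2Vec

sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]]

# sg=0 -> CBOW: predict the target word from its context
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)

# sg=1 -> Skip-gram: predict the context words from the target word
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv.most_similar("fox", topn=3))
print(skipgram.wv.most_similar("fox", topn=3))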
Key Differences Between CBOW and
Skip-gram
1.Prediction Task:
• CBOW: Predicts the target word given the context words.
• Skip-gram: Predicts the context words given the target word.

2. Efficiency:
• CBOW: More efficient for training on large datasets, as it uses context words to predict the
target word and works well with frequent words.
• Skip-gram: Slower to train than CBOW but performs better with infrequent words and can
capture finer-grained word relationships.

3. Performance:
• CBOW: Better for smaller datasets and captures more frequent words effectively.
• Skip-gram: Better for larger datasets and provides better representations for infrequent words
and more nuanced word relationships.
• Both the CBOW and Skip-gram models are foundational approaches to learning word embeddings, each with its strengths. CBOW is more efficient and works well with frequent words, while Skip-gram is better for capturing detailed relationships and handling rare words. The choice between them often depends on the specific NLP task and the nature of the dataset.
Cosine similarity

• It is a metric used to measure how similar two vectors are. In the context of
natural language processing (NLP), it is often used to measure the similarity
between two word vectors. The value of cosine similarity ranges from -1 to 1,
where:

• 1 indicates that the two vectors are pointing in the exact same direction
(perfectly similar).

• 0 indicates that the two vectors are orthogonal (completely dissimilar).

• -1 indicates that the two vectors are pointing in opposite directions (perfectly
dissimilar).
Formula for Cosine Similarity
cosine_similarity(A, B) = (A · B) / (||A|| × ||B||)
1. Calculate the dot product A · B.
2. Calculate the magnitudes ||A|| and ||B||.
3. Divide the dot product by the product of the magnitudes.
Interpretation of Cosine Similarity Result
The cosine similarity between vectors A and B is
approximately 0.9759. This value is close to 1, indicating that the
two vectors are highly similar and point in almost the same
direction in the vector space.
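A minimal numpy sketch of the calculation; the slide's original vectors A and B are not shown, so the values below are assumed for illustration and give roughly 0.99 rather than exactly 0.9759.

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (A . B) / (||A|| * ||B||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Assumed example vectors (not the slide's original A and B)
A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 3.0, 4.0])
print(round(cosine_similarity(A, B), 4))  # ~0.9926: close to 1, so highly similar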
Global Vectors for Word Representation
• Purpose: GloVe is a technique used to create word embeddings (word
representations as vectors) that capture the meaning and relationships between
words.

• How It Works: GloVe builds a co-occurrence matrix that counts how often words
appear together in a large text dataset.

• It then uses this matrix to learn word vectors, where the distances and directions
between vectors reflect meaningful word relationships.

• Key Idea: Words that appear in similar contexts will have similar vector
representations.

• For example, words like "king" and "queen" are related to "man" and "woman,"
respectively. GloVe can capture this relationship in the vectors it learns.
GLOVE
•GloVe (Global Vectors for Word Representation) is a word embedding
technique developed by researchers at Stanford University, including
Jeffrey Pennington, Richard Socher, and Christopher Manning.

• It was introduced in their 2014 paper titled "GloVe: Global Vectors for Word Representation."

• GloVe is designed to capture both the global statistical information of a corpus and the local context of words, effectively combining the strengths of both co-occurrence matrix-based methods and predictive models like Word2Vec.
Key Concepts of GloVe

•Co-occurrence Matrix:
•GloVe begins with constructing a word co-occurrence matrix from a
large corpus of text. This matrix counts how frequently pairs of
words appear together within a certain context window. For
example, if the context window is 5, it considers the 5 words before
and after a target word.
•Counting Word Pairs:GloVe looks at a huge amount of text and
counts how often pairs of words appear close to each other. For
example, the words "cat" and "meow" might often be found near
each other.
•Global Context:
• Unlike Word2Vec, which is focused on predicting a word based on its immediate
context, GloVe leverages the entire co-occurrence matrix to capture global
statistics. This means that GloVe can better understand the overall distribution
of words in a corpus.
• Understanding Word Relationships: It uses these counts to understand the
relationships between words. If "cat" is often found near "meow" but less often
near "bark," GloVe captures this difference.
• Creating Word Vectors:
• GloVe turns each word into a vector (a list of numbers). Words that have similar
meanings or are used in similar ways will have vectors that are close to each other
in this numerical space.
•Ratio of Co-occurrences:
• GloVe's key insight is that the ratio of word-word co-occurrence probabilities is
meaningful. For example, the ratio of the probability of co-occurrence of "king"
with "man" to the probability of co-occurrence of "king" with "woman" should be
similar to the ratio of the co-occurrence of "queen" with "man" to "queen" with
"woman". This helps in capturing analogies (e.g., "king" is to "queen" as "man" is
to "woman").
• Learning from Ratios:
• One of the smart things GloVe does is to look at the ratios of how often words
appear together. For example, the relationship between "man" and "woman" is
similar to that between "king" and "queen." GloVe captures this by comparing
these word pairs.
• Objective Function:
• The objective function used in GloVe minimizes the difference between the dot
product of the word vectors and the logarithm of the co-occurrence probability.
• It also includes a weighting function that gives less importance to very frequent or
very rare co-occurrences.
• Training Process:
• The model adjusts these vectors by minimizing the difference between what it
predicts and the actual word pairs in the text. After training, these vectors can
represent the meanings of words in a way that makes it easy to do things like
finding similar words or understanding word analogies.
• Interpretation:
• The word vectors generated by GloVe are dense, low-dimensional, and can be
used in various NLP tasks. These vectors capture both syntactic and semantic
relationships between words, allowing for tasks like word similarity, analogy
solving, and even as features in downstream tasks like text classification.
Short Example: GloVe Word Embeddings
• Imagine a tiny corpus with just two sentences:
1."I love pizza."
2."I love pasta."
• From this corpus, we will create word embeddings using GloVe.
• Step-by-Step Process
1. Create a Co-occurrence Matrix:
• We'll first build a co-occurrence matrix for the words in our small corpus. We'll use a window size of 1,
meaning we count how often each word appears next to every other word within a one-word distance.
• The words in our corpus are: I, love, pizza, pasta.
1. The co-occurrence matrix will look like this:

        I   love  pizza  pasta
I       0    2     0      0
love    2    0     1      1
pizza   0    1     0      0
pasta   0    1     0      0

This matrix shows how many times each word appears next to each other word in the corpus.
Computing the cosine similarity between the rows for "pizza" and "pasta" gives a value close to 1, so we can see that "pizza" and "pasta" have a high similarity because they often appear in similar contexts ("I love ..."), as the sketch below shows.
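A minimal numpy sketch using the co-occurrence rows above as stand-in word vectors (real GloVe learns dense vectors from this matrix rather than using the raw counts directly).

import numpy as np

# Co-occurrence matrix from the example (window size 1); rows/columns: I, love, pizza, pasta
X = np.array([[0, 2, 0, 0],   # I
              [2, 0, 1, 1],   # love
              [0, 1, 0, 0],   # pizza
              [0, 1, 0, 0]])  # pasta

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# "pizza" and "pasta" have identical co-occurrence rows, so their similarity is 1.0
print(cos_sim(X[2], X[3]))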
Capturing Word Relationships with GloVe:
GloVe embeddings can perform simple math with words to find relationships:
• For example, "king" - "man" + "woman" should result in "queen.“

• This shows that GloVe understands word relationships, like gender differences or other
semantic analogies.

Benefits:
• GloVe effectively captures both the overall context of words in a large text (global
information) and their specific usage (local context).

• This makes it a powerful tool for various natural language processing tasks, like
understanding text meaning, sentiment analysis, and translation.

By learning from how words co-occur in large amounts of text, GloVe provides vectors
that represent word meanings and relationships in a way that computers can use to
understand language.
Example of GloVe Model
Visualizing Relationships in Two Dimensions Using PCA
• Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of
dimensions of a dataset while retaining most of its variability. In NLP, PCA can be used to visualize high-
dimensional word embeddings in two or three dimensions, making it easier to understand the relationships
between words.
• Why Use PCA?
• Simplification: Reducing the dimensionality of word vectors simplifies the visualization process.
• Understanding Relationships: It helps in understanding the semantic relationships between words by plotting
them in a 2D or 3D space.
• Data Exploration: Provides insights into the structure of word embeddings and their clustering behavior.
• Example: Visualizing Word Relationships with PCA

We'll use the sklearn library to perform PCA on word vectors obtained from a pre-trained Word2Vec model.
Step-by-Step Process:
• Train a Word2Vec Model
• Extract Word Vectors
• Apply PCA
• Visualize the Result
PCA
• Principal Component Analysis (PCA) is a dimensionality reduction technique often
used in NLP (Natural Language Processing) to simplify the representation of high-
dimensional data while retaining as much information as possible.
• 1. High-Dimensional Data in NLP
• In NLP, text data is often represented in a high-dimensional space. For example:
• One-Hot Encoding: If you have a vocabulary of 10,000 words, each word is
represented as a 10,000-dimensional vector where only one element is 1, and the
rest are 0.
• Word Embeddings: Techniques like Word2Vec, GloVe, or FastText produce dense
vectors, but even these vectors can be of substantial dimensionality, like 300 or
400 dimensions.
•2. The Curse of Dimensionality
•High-dimensional data can lead to several problems:
•Computational Complexity: Operations on high-dimensional
vectors (e.g., clustering, classification) can be computationally
expensive.
•Overfitting: High-dimensional data might lead to models that
overfit because they capture noise rather than relevant
patterns.
•Visualization: It becomes difficult to visualize and interpret high-dimensional data.
3. PCA Basics
•PCA addresses these issues by reducing the number of dimensions
while preserving as much of the data’s variance as possible:
•Variance: PCA identifies directions (called principal components) in the
data that have the highest variance. The first principal component
captures the most variance, the second captures the next most, and so
on.
•Orthogonality: Principal components are orthogonal (uncorrelated) to
each other, which helps in reducing redundancy.
•Projection: Data is projected onto these principal components,
reducing the number of dimensions.
4.Applying PCA in NLP
•In NLP, PCA can be applied to reduce the dimensionality of:
•Word Embeddings: After generating word embeddings (e.g., using
Word2Vec or GloVe), PCA can reduce the dimensionality of these
vectors, making them more manageable for downstream tasks.
•Document-Term Matrices: When representing documents as term-
frequency vectors, PCA can reduce the dimensionality, helping with
tasks like topic modeling or clustering.
•Latent Semantic Analysis (LSA): PCA is the foundation of LSA, where it
is used to reduce the dimensionality of term-document matrices to
uncover latent (hidden) topics in text data.
5. PCA in Action: An Example
• Imagine you have word embeddings of 300 dimensions. You may want to reduce
these to 2 or 3 dimensions for visualization or to 50 dimensions for faster
computations while retaining most of the original information.
• Step-by-Step:
• Compute the Covariance Matrix: Calculate the covariance matrix of the data
(word embeddings).
• Eigen Decomposition: Compute the eigenvalues and eigenvectors of the
covariance matrix.
• Select Principal Components: Sort the eigenvectors by their corresponding
eigenvalues in descending order. The top eigenvectors form the principal
components.
• Project the Data: Multiply the centered data by the selected principal components to get the reduced-dimensional representation. A minimal numpy sketch of these steps follows.
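A minimal numpy sketch of these steps on a small, assumed 2-D dataset, reducing it to 1 dimension (the same 2-to-1 reduction discussed on the next slide).

import numpy as np

# Small assumed 2-D dataset, purely for illustration
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# 1. Centre the data and compute the covariance matrix
X_centred = X - X.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)

# 2. Eigen decomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Select principal components: sort eigenvectors by descending eigenvalue, keep the top one
order = np.argsort(eigvals)[::-1]
top_component = eigvecs[:, order[:1]]

# 4. Project the centred data onto that component (2 dimensions -> 1 dimension)
X_reduced = X_centred @ top_component
print(X_reduced.ravel())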
•PCA helped us reduce the dimensions from 2 to 1 by finding the
direction (principal component) that captures the most variance in our
dataset. This simplified example illustrates the core steps of PCA:
centering the data, computing the covariance matrix, finding
eigenvalues and eigenvectors, selecting principal components, and
transforming the data into a lower-dimensional space. PCA is widely
used in machine learning and data analysis to simplify datasets, reduce
noise, and make patterns more apparent.
• 6. Benefits of PCA in NLP
• Speed: Reduced dimensions lead to faster processing and model training.
• Noise Reduction: By focusing on components with the most variance, PCA can
help filter out noise.
• Interpretability: Lower-dimensional data is easier to visualize and interpret.
• 7. Limitations
• Linear Assumption: PCA assumes linear relationships in data, which may not
capture more complex, non-linear patterns.
• Loss of Information: Reducing dimensions inevitably leads to some loss of
information, which might affect the performance of NLP models.
PCA with Random Values
PCA with Simple dataset
Example: Visualizing Word Relationships with PCA

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Example corpus
sentences = [['this', 'is', 'the', 'first', 'document'],
             ['this', 'document', 'is', 'the', 'second', 'document'],
             ['and', 'this', 'is', 'the', 'third', 'one'],
             ['is', 'this', 'the', 'first', 'document']]

# Train the Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Extract word vectors for a few sample words
words = ['this', 'document', 'first', 'second', 'third', 'one']
word_vectors = [model.wv[word] for word in words]

# Apply PCA to reduce dimensions to 2
pca = PCA(n_components=2)
word_vectors_pca = pca.fit_transform(word_vectors)

# Visualize the result: plot the words in the 2D space
plt.figure(figsize=(10, 7))
for word, (x, y) in zip(words, word_vectors_pca):
    plt.scatter(x, y)
    plt.text(x + 0.01, y + 0.01, word, fontsize=12)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('2D PCA Visualization of Word Vectors')
plt.grid(True)
plt.show()
Explanation of the Code

1.Train a Word2Vec Model: We use a small example corpus to train the Word2Vec model
and create word embeddings.
2.Extract Word Vectors: We select a few sample words and extract their corresponding
word vectors from the trained model.
3.Apply PCA: PCA is applied to reduce the dimensionality of the word vectors from 100
dimensions to 2 dimensions.
4.Visualize the Result: We plot the 2D representations of the word vectors using
Matplotlib, which allows us to visualize and understand the relationships between the
words.
In the resulting plot, words that are semantically similar or related appear closer together,
while unrelated words are farther apart. This visualization helps in intuitively understanding
the structure and relationships captured by the word embeddings.
Locality Sensitive Hashing
•Locality Sensitive Hashing (LSH) is a technique used to find similar
items in large datasets efficiently.
• Instead of comparing every item with every other one (which can be
computationally expensive)
•LSH maps items into buckets using hash functions, such that similar
items are more likely to be placed in the same bucket.
•This reduces the number of comparisons needed to find similar pairs.
Key Concept
•Hash Functions: LSH uses multiple hash functions that map similar
items (in some high-dimensional space, like vectors) to the same hash
buckets with a high probability.
•Buckets: Items are placed into buckets based on the hash values. You
only compare items within the same bucket, significantly reducing the
number of comparisons.
•Approximate Nearest Neighbor Search: LSH is particularly useful for
this type of search, where you aim to find items that are
approximately close to each other in some metric space.
Example in Context

•Let's say you have a large collection of text documents, and you want to find all pairs of documents that are similar (using, for example, cosine similarity). Comparing each document with every other document would be too slow. Instead, LSH can group similar documents into the same bucket, so you only compare documents within the same bucket, which speeds up the process.
SparseRandomProjection
•SparseRandomProjection is used to hash data points into a lower-
dimensional space.
•After transformation, you can compute the similarity among the
hashed items, but only need to compare within buckets, not the entire
dataset.
•LSH can be used in areas like image recognition, document similarity,
and recommendation systems, making it easier to handle large-scale
similarity searches efficiently.
Locality Sensitive Hashing (LSH) with sample text data

• To demonstrate Locality Sensitive Hashing (LSH) with sample text data, we will first convert the text documents into a numerical format (using a technique like TF-IDF), and then apply LSH using random projections to group similar documents, as sketched after the steps below.
• Steps:
• Preprocess Text Data: Use TF-IDF to convert text into numerical vectors.
• Apply LSH: Use random projections to hash similar documents into the same
bucket.
• Find Similar Documents: After the projection, compare documents within the
same hash bucket.
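A minimal sketch of these steps, assuming scikit-learn; the four sample documents and the bucketing scheme (the signs of the projected components) are illustrative choices rather than the only way to build an LSH index.

from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection

# 1. Preprocess text data: convert sample documents into TF-IDF vectors
docs = ["the cat sat on the mat",
        "a cat sat on a mat",
        "dogs are loyal animals",
        "the dog played with the ball"]
tfidf = TfidfVectorizer().fit_transform(docs)

# 2. Apply LSH: random projection to a few dimensions, then use the signs
#    of the projected components as a hash signature (the bucket key)
projector = SparseRandomProjection(n_components=4, random_state=42)
projected = projector.fit_transform(tfidf).toarray()

buckets = defaultdict(list)
for doc_id, vec in enumerate(projected):
    signature = tuple((vec > 0).astype(int))
    buckets[signature].append(doc_id)

# 3. Find similar documents: only documents sharing a bucket need to be compared
for signature, doc_ids in buckets.items():
    print(signature, [docs[i] for i in doc_ids])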
Difference of PCA AND LSH
• Principal Component Analysis (PCA) and Locality Sensitive Hashing (LSH) are
both dimensionality reduction techniques, but they serve different purposes and
work in fundamentally different ways.
• 1. Goal:
• PCA:
• The goal of PCA is to reduce the dimensionality of the data while preserving as much of the
original variance as possible.
• It finds the directions (principal components) along which the data has the most variance and
projects the data onto these directions.
• LSH:
• The goal of LSH is to efficiently find approximate nearest neighbors by hashing similar data
points into the same buckets, thus avoiding the need to compute similarities between all data
points.
• It sacrifices some accuracy for speed by using random projections to map similar points into
similar buckets in a lower-dimensional space.
• How They Work:
• PCA:
• PCA works by calculating the covariance matrix of the data and finding its
eigenvectors and eigenvalues.
• The eigenvectors corresponding to the largest eigenvalues are the principal
components, which capture the most significant variance in the data.
• It is a deterministic method, meaning that the same input always produces the
same result.
• LSH:
• LSH uses random projections and hash functions to assign similar data points to
the same hash bucket with a high probability.
• It is a probabilistic method, meaning that two similar points are likely, but not
guaranteed, to be placed in the same bucket.
• The random nature of LSH allows it to scale well to large datasets where exact
comparisons would be too slow.
• Output:
• PCA:
•PCA provides a lower-dimensional representation of the data while
attempting to preserve the original structure and variance of the
data.
•Each dimension in the reduced space is a linear combination of the
original features (principal components).
• LSH:
•LSH provides a lower-dimensional representation, but it focuses on
grouping similar points together based on hash functions.
•It does not necessarily preserve global structure or variance. Instead,
it emphasizes approximate similarity and nearest neighbor
relationships.
• Use Cases:
• PCA:
• Used when the goal is to reduce dimensionality while retaining as much
information as possible, such as in data compression, noise reduction, and
visualization.
• Common in tasks like image compression, exploratory data analysis, and
feature extraction for machine learning.
• LSH:
• Used when the goal is to perform fast approximate nearest neighbor search in
large datasets, such as in recommendation systems, document similarity
search, and large-scale clustering.
• Common in tasks where exact comparisons are computationally prohibitive,
such as identifying similar items in large-scale search engines or social networks.
• 5. Deterministic vs. Probabilistic:
• PCA: Deterministic, meaning the output is always the same for a given input dataset.
• LSH: Probabilistic, meaning similar items are likely, but not guaranteed, to be placed in the same bucket, and
it may produce different results depending on the hash functions used.
• 6. Dimensionality Reduction Purpose:
• PCA:
• Reduces dimensions while preserving variance.
• The reduced space is designed to capture the maximum possible variance of the
original data.
• LSH:
• Reduces dimensions primarily for fast similarity search rather than preserving
variance.
• It doesn't care about preserving global variance but focuses on local similarity
(i.e., making sure similar items are hashed together).
• 7. Scaling:
• PCA:
•Computationally expensive for large datasets, as it requires eigenvalue decomposition of the covariance matrix.
• LSH:
•Scales much better to large datasets since it relies on hashing rather
than eigenvalue decomposition, making it faster and more efficient
for approximate searches.
Machine Translation and Document Search
Machine Translation (MT):
• MT involves translating text from one language to another. Techniques in MT can benefit from word vectors
and LSH as follows:
• Word Embeddings: Use pre-trained word embeddings to represent words in the source and target languages.
• Subsetting with LSH: Efficiently handle large vocabulary sizes by using LSH to group similar word
embeddings, thereby speeding up the translation process.

• Machine Translation Example


• Scenario: Translating a sentence from English to French using pre-trained word vectors and LSH.
1.Translate "The cat sat on the mat": Convert each word to its vector representation.
1. "The": [0.1, -0.3, 0.2, ...]
2. "cat": [0.25, -0.1, 0.2, ...]
3. "sat": [0.3, -0.2, 0.1, ...]
4. "on": [0.05, -0.1, 0.15, ...]
5. "the": [0.1, -0.3, 0.2, ...]
6. "mat": [0.2, -0.15, 0.3, ...]
2.Translate: Use an NMT model to map these vectors to the target language vectors and then generate the
French sentence, e.g., "Le chat est sur le tapis."
Document Search:
• Document Search involves retrieving relevant documents from a corpus based on a user's query. The
process can be enhanced by word vectors and LSH:
• Query and Document Representation: Represent both queries and documents using word vectors.
• Efficient Retrieval: Use LSH to quickly find documents with similar word vector representations to the
query, improving search efficiency and scalability.
Example: Searching for documents related to "cat" in a large corpus.
1.Index Documents: Represent each document with word vectors. For instance:
• Document 1: "Cats are great pets." → [0.27, -0.12, 0.22, ...] (for "Cats")
• Document 2: "Dogs are loyal animals." → [0.23, -0.05, 0.18, ...] (for "Dogs")
2.Query Vector: Represent the query "cat" as a vector: [0.25, -0.1, 0.2, ...].
3.Search: Use LSH to quickly locate documents with vectors similar to the query vector. This might
involve:
• Bucket for Query: Identifying the bucket(s) containing vectors similar to [0.25, -0.1, 0.2, ...].
• Retrieve Documents: Document 1 is found in the relevant bucket and is returned as relevant to the
query.
Transforming Word Vectors:
Transforming word vectors involves processing these embeddings to better suit the needs of specific NLP tasks. Some
common transformations include:
• Dimensionality Reduction: Techniques like Principal Component Analysis (PCA) or t-Distributed Stochastic
Neighbor Embedding (t-SNE) reduce the dimensionality of word vectors while retaining their most important
information.
• Contextualization: Transformer-based models like BERT provide contextual word embeddings, where the meaning
of a word is influenced by the words around it in a sentence.
2. Assign to Subsets using Locality Sensitive Hashing (LSH)
• Locality Sensitive Hashing (LSH):
• LSH is a technique used to hash similar items into the same buckets with high probability. It is particularly useful for
tasks involving nearest neighbor search in high-dimensional spaces, which is a common requirement in NLP
applications like document search and similarity detection.
Steps in LSH:
• Hash Functions: Select hash functions that maximize the probability of collision for similar vectors while
minimizing it for dissimilar ones.
• Buckets: Suppose you hash vectors into buckets. Vectors for "cat," "dog," and "ball" might end up in
different buckets depending on their similarity to each other.
• Bucket 1: [0.25, -0.1, 0.2, ...] ("cat"), [0.23, -0.05, 0.18, ...] ("dog")
• Bucket 2: [0.3, -0.2, 0.15, ...] ("ball")
Example of Transforming Word Vectors and Contextual Transformation with BERT
• Example: Word Embeddings with Word2Vec
• Scenario: Suppose we are working with a text corpus containing sentences about animals, such as
"The cat sat on the mat" and "The dog played with the ball."
• Word2Vec will generate word vectors like:
• "cat": [0.25, -0.1, 0.2, ...]
• "dog": [0.23, -0.05, 0.18, ...]
• "ball": [0.3, -0.2, 0.15, ...]
• These vectors capture semantic similarities. For instance, "cat" and "dog" might have similar
vectors because they are both animals, whereas "ball" would have a different vector as it represents
an object.
• Contextual Transformation with BERT
• Scenario: In a sentence like "The cat is on the mat," BERT provides contextual embeddings:
• "cat": [0.27, -0.12, 0.22, ...] (in this context, emphasizing the cat as a subject)
• "mat": [0.31, -0.08, 0.19, ...] (in this context, emphasizing the mat as the location)
• The embeddings change based on context, providing more nuanced meanings.
