
A Simple Guide to Retrieval Augmented Generation

What are Embeddings?


All Machine Learning/AI models work with numerical data. Before any operation can be
performed, text, image, audio and video data has to be transformed into a numerical
representation. Embeddings are vector representations of data that capture meaningful
relationships between entities. As a general definition, embeddings are data that have been
transformed into n-dimensional matrices for use in deep learning computations. A word
embedding is a vector representation of a word.

What are Vectors?


In computer science and machine learning, the idea of a vector is an abstract
representation of data, and the representation is an array or list of numbers.
These numbers represent the features or attributes of the data. In NLP, a vector
can represent a document, a sentence or even a word. The length of the array or
list is the number of dimensions of the vector.

[Figures: words in a unidimensional vector space; words in a two-dimensional vector space]

The process of embedding transforms data (like text) into vectors. It compresses the input
information, resulting in an embedding space specific to the training data.
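As an illustration, here is a minimal sketch in Python of what a word embedding looks like in code: each word maps to a fixed-length list of numbers. The words, the values and the 4-dimensional size are invented purely for illustration; real models use hundreds or thousands of dimensions.

# A toy "embedding space": each word maps to a 4-dimensional vector.
# The values here are invented for illustration only.
toy_embeddings = {
    "king":  [0.80, 0.65, 0.10, 0.05],
    "queen": [0.78, 0.68, 0.12, 0.90],
    "apple": [0.05, 0.10, 0.95, 0.40],
}

vector = toy_embeddings["king"]
print(len(vector))  # number of dimensions of the vector (4 in this toy example)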


Embeddings in RAG

The reason embeddings are popular is that they help in establishing semantic
relationships between words, phrases, and documents. In the simplest methods of
searching or text matching, we use keywords: if the keywords match, we show the
matching documents as results of the search. However, this approach fails to consider
the semantic relationships, or meanings, of the words while searching. This challenge
is overcome by using embeddings.

In the retrieval step, the user query is matched with relevant documents based on how
similar the documents are to the query.

How is similarity calculated using embeddings?

[Figure: Cosine similarity of vectors in a two-dimensional vector space is the cosine of the angle between them]

Closeness is calculated by the distance between the points in the vector space. One of the
most common measures of similarity is Cosine Similarity. Cosine similarity is calculated as
the cosine of the angle between the two vectors. Recall from trigonometry that the cosine of
parallel vectors (angle = 0°) is 1, the cosine of a right angle (90°) is 0, and the cosine of
opposite vectors (angle = 180°) is -1. Therefore, cosine similarity lies between -1 and 1,
where unrelated terms have a value close to 0 and related terms have a value close to 1.
Terms that are opposite in meaning have a value close to -1.
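As a concrete sketch of the calculation described above, cosine similarity is the dot product of two vectors divided by the product of their lengths. The short Python function below is a generic implementation of that formula; the two example vectors are invented for illustration.

import math

def cosine_similarity(a, b):
    # dot product of the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    # lengths (Euclidean norms) of each vector
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two invented 4-dimensional vectors for illustration
query_vector = [0.80, 0.65, 0.10, 0.05]
doc_vector = [0.78, 0.68, 0.12, 0.90]
print(cosine_similarity(query_vector, doc_vector))  # a value between -1 and 1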


Popular Embedding Algorithms


Word2Vec - Google's Word2Vec is one of the most popular pre-trained word embeddings.
The official paper - https://arxiv.org/pdf/1301.3781.pdf

GloVe - The 'Global Vectors' model is so termed because it captures statistics directly at a
global level. The official paper - https://nlp.stanford.edu/pubs/glove.pdf

fastText - From Facebook's AI Research, fastText builds embeddings composed of character
n-grams instead of whole words. The official paper - https://arxiv.org/pdf/1607.04606.pdf

ELMo - Embeddings from Language Models are learnt from the internal states of a
bidirectional LSTM. The official paper - https://arxiv.org/pdf/1802.05365.pdf

BERT - Bidirectional Encoder Representations from Transformers is a transformer-based
approach. The official paper - https://arxiv.org/pdf/1810.04805.pdf

Common Pre-trained Embeddings Models


The good news for anyone building RAG-enabled systems is that embeddings, once created,
can also generalize across tasks and domains. There are a variety of proprietary and
open-source pre-trained embeddings models available to use. This is also one of the reasons
why the usage of embeddings has exploded in popularity across machine learning applications.

OpenAI's text-embedding-ada-002 was released in December 2022. It has a dimension of 1536,
meaning that it converts text into a vector of 1536 dimensions.
text-embedding-3-small is the latest small embedding model of 1536 dimensions, released in
January 2024. The flexibility it provides over the ada-002 model is that users can adjust the
size of the dimensions according to their needs (a minimal sketch of this option follows below).
text-embedding-3-large is a large embedding model of 3072 dimensions, released together
with the text-embedding-3-small model. It is the best performing embedding model released
by OpenAI yet.
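To illustrate the adjustable dimensions mentioned above, here is a minimal sketch using the openai Python client (1.x style); the API key, the sample text and the choice of 256 dimensions are placeholders for illustration, not recommendations.

from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")  # placeholder key

# Ask text-embedding-3-small for a shortened 256-dimension vector
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Embeddings are vector representations of data.",
    dimensions=256,  # optional; omit to get the model's full 1536 dimensions
)
print(len(response.data[0].embedding))  # 256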


text-embedding-004 (last updated in April 2024) is the model offered by Google Gemini. It
offers elastic embedding sizes of up to 768 dimensions and can be accessed via the Gemini API.

Voyage AI embeddings models are recommended by Anthropic, the providers of the Claude
series of Large Language Models. Voyage offers several embeddings models, like -
voyage-large-2-instruct is a 1024-dimension embeddings model that has become a leader
among embeddings models.
voyage-law-2 is a 1024-dimension model that has been optimized for legal documents.
voyage-code-2 is a 1536-dimension model that has been optimized for code retrieval.
voyage-large-2 is a 1536-dimension general purpose model optimized for retrieval.

Mistral is the company behind LLMs like Mistral and Mixtral. They offer a 1024-dimension
embeddings model by the name of mistral-embed. This is an open-source embeddings model.

Cohere, the developers of the Command, Command R and Command R+ LLMs, also offer a variety
of embeddings models. Some of these are -
embed-english-v3.0 is a 1024-dimension model for English-language text only.
embed-english-light-v3.0 is a lighter version of the embed-english model, with 384 dimensions.
embed-multilingual-v3.0 offers multilingual support for over 100 languages.

These five providers are in no way recommendations, but just a list of popular
embeddings models. Apart from these providers, almost all LLM developers, like
Meta, TII and LMSYS, also offer pre-trained embeddings models. One place to check
out all the popular embeddings models is the MTEB (Massive Text Embedding Benchmark)
Leaderboard on HuggingFace (https://huggingface.co/spaces/mteb/leaderboard). The MTEB
benchmark compares the embeddings models on tasks like classification, retrieval,
clustering and more.


How to Choose Embeddings?


Ever since the release of ChatGPT and the advent of the aptly described LLM Wars, there
has been a mad rush to develop embeddings models. There are many evolving standards for
evaluating LLMs and embeddings alike. When building RAG-powered LLM apps, there is no
right answer to "Which embeddings model should I use?". However, you may notice particular
embeddings working better for specific use cases (like summarization, text generation,
classification, etc.).

OpenAI used to recommend different embeddings models for different use cases. However,
now they recommend ada v2 for all tasks.

The MTEB Leaderboard at Hugging Face evaluates almost all available embedding models
across seven use cases - Classification, Clustering, Pair Classification, Reranking,
Retrieval, Semantic Textual Similarity (STS) and Summarization.

Another important consideration is cost. With OpenAI models, you can incur significant
costs if you are working with a lot of documents. The cost of open-source models will
depend on the implementation.


Creating Embeddings

Once you've chosen your embedding model, there are several ways of creating the
embeddings. Sometimes, our friends LlamaIndex and LangChain come in pretty handy to
convert documents (split into chunks) into vector embeddings. Other times you can use
the service from a provider directly or get the embeddings from HuggingFace.

Example : OpenAI text-embedding-ada-002
using the Embedding.create() function from the openai library

You'll need an OpenAI API key to create these embeddings. You can get one here -
https://platform.openai.com/api-keys
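The original example appears as a screenshot; below is a minimal sketch of the same idea using the pre-1.0 openai Python library that exposes Embedding.create() (newer 1.x versions use client.embeddings.create() instead). The API key and the sample text are placeholders.

import openai

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder; load from an environment variable in practice

# Create an embedding for a single chunk of text
response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="Retrieval Augmented Generation combines retrieval with text generation.",
)

embedding = response["data"][0]["embedding"]  # a list of 1536 floats
print(len(embedding))  # 1536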


Example Response

response.data[0].embedding will give the created embeddings that can be stored for retrieval.
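The original shows the response as a screenshot; as a rough sketch, the object returned by the embeddings endpoint has roughly the abridged shape below (values shortened and invented for illustration).

# Abridged shape of the response object (values invented for illustration):
example_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.0023, -0.0091, 0.0154]},
    ],
    "model": "text-embedding-ada-002",
    "usage": {"prompt_tokens": 1014, "total_tokens": 1014},
}

# With the openai library, the vector itself is available as
# response.data[0].embedding (or response["data"][0]["embedding"])
vector = example_response["data"][0]["embedding"]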

Cost

In this example, 1014 tokens will cost about $0.0001. Recall that for this YouTube
transcript we got 14 chunks, so creating the embeddings for the entire transcript will
cost about 0.14 cents (14 chunks × roughly $0.0001 per ~1,000-token chunk ≈ $0.0014).
This may seem low, but when you scale up to thousands of documents being updated
frequently, the cost can become a concern.


Example : msmarco-bert-base-dot-v5
using HuggingFaceEmbeddings from langchain.embeddings
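The book shows this example as a screenshot; here is a minimal sketch of how it is typically done with LangChain's HuggingFaceEmbeddings wrapper, which uses sentence-transformers under the hood. The full "sentence-transformers/..." model path and the sample texts are assumptions for illustration.

from langchain.embeddings import HuggingFaceEmbeddings

# Loads the sentence-transformers model locally (downloaded from the HuggingFace Hub)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/msmarco-bert-base-dot-v5")

# Embed a list of document chunks and a single query
doc_vectors = embeddings.embed_documents(["chunk one of the transcript", "chunk two of the transcript"])
query_vector = embeddings.embed_query("What is Retrieval Augmented Generation?")

print(len(query_vector))  # 768 dimensions for this BERT-base model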

Example : embed-english-light-v3.0
using CohereEmbeddings from langchain.embeddings
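Similarly, a minimal sketch with LangChain's CohereEmbeddings wrapper is shown below; the API key and the sample texts are placeholders, and the cohere package needs to be installed.

from langchain.embeddings import CohereEmbeddings

embeddings = CohereEmbeddings(
    model="embed-english-light-v3.0",
    cohere_api_key="YOUR_COHERE_API_KEY",  # placeholder
)

doc_vectors = embeddings.embed_documents(["chunk one", "chunk two"])
query_vector = embeddings.embed_query("What is Retrieval Augmented Generation?")
print(len(query_vector))  # 384 dimensions for the light v3.0 model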

All the available embeddings classes on LangChain are listed in the LangChain documentation.



Hello!
I’m Abhinav...
A data science and AI professional with over 15
years in the industry. Passionate about AI
advancements, I constantly explore emerging
technologies to push the boundaries and create
positive impacts in the world.

A Simple Guide to Retrieval Augmented Generation is now available for Early Access

**Subscribe Now**
Avail Early Access Discounts to Chapters 1-3 (raw & unedited):
Ch 1 : LLMs & the need for RAG
Ch 2 : RAG enabled systems & their design
Ch 3 : Indexing Pipeline - Creating a knowledge base for RAG based applications

Complete book coming soon

Launch Offer : Use Code mlkimothi for 50% discount, valid till July 4th, 2024
