Embeddings - A Simple Guide To Rag
The process of embedding transforms data (like text) into vectors and compresses the input information, resulting in an embedding space specific to the training data.
Embeddings in RAG
The reason embeddings are popular is that they help establish semantic relationships between words, phrases, and documents. In the simplest methods of searching or text matching, we use keywords: if the keywords match, we show the matching documents as results of the search. However, this approach fails to consider the semantic relationships, or meanings, of the words being searched. This challenge is overcome by using embeddings.
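To make the contrast concrete, here is a toy sketch (entirely illustrative; the documents and queries are made up) of keyword matching missing a semantically relevant document:

```python
documents = ["The vehicle crashed into the barrier.",
             "Stock prices fell sharply today."]

def keyword_search(query, docs):
    # Naive keyword matching: a document matches only if it
    # contains every query word verbatim.
    words = query.lower().split()
    return [d for d in docs if all(w in d.lower() for w in words)]

print(keyword_search("car accident", documents))  # [] - no literal match
print(keyword_search("vehicle", documents))       # finds the first document
```

The query "car accident" returns nothing, even though the first document clearly describes one; an embedding-based search would place "car accident" and "vehicle crashed" close together in the vector space.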
Closeness is measured by the distance between points in the vector space. One of the most common measures of similarity is cosine similarity, calculated as the cosine of the angle between the two vectors. Recall from trigonometry that the cosine of 0° (parallel vectors) is 1, the cosine of 90° (a right angle) is 0, and, at the other end, the cosine of 180° (opposite vectors) is -1. Cosine similarity therefore lies between -1 and 1, where unrelated terms have a value close to 0, related terms have a value close to 1, and terms that are opposite in meaning have a value close to -1.
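As an illustration, cosine similarity can be computed in a few lines of plain Python (toy 2-D vectors here; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [2, 0]))   # parallel   (0°)   -> 1.0
print(cosine_similarity([1, 0], [0, 1]))   # orthogonal (90°)  -> 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # opposite   (180°) -> -1.0
```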
Word2Vec
The official paper - https://arxiv.org/pdf/1301.3781.pdf
ELMo
Embeddings from Language Models (ELMo) are learnt from the internal states of a bidirectional LSTM. The official paper - https://arxiv.org/pdf/1802.05365.pdf
text-embedding-004 (last updated in April 2024) is the embedding model offered by Google Gemini. It offers elastic embedding sizes of up to 768 dimensions and can be accessed via the Gemini API.
Mistral, the company behind LLMs like Mistral and Mixtral, offers a 1024-dimension embedding model by the name of mistral-embed, accessible via the Mistral API.
Cohere, the developers of the Command, Command R and Command R+ LLMs, also offers a variety of embedding models. Some of these are -
embed-english-v3.0 is a 1024-dimension model that produces embeddings for English text only.
embed-english-light-v3.0 is a lighter, 384-dimension version of the embed-english model.
embed-multilingual-v3.0 offers multilingual support for over 100 languages.
These five models are in no way recommendations, just a list of popular embedding models. Apart from these providers, almost all LLM developers, like Meta, TII and LMSYS, also offer pre-trained embedding models. One place to check out all the popular embedding models is the MTEB (Massive Text Embedding Benchmark) Leaderboard on HuggingFace (https://huggingface.co/spaces/mteb/leaderboard). The MTEB benchmark compares embedding models on tasks like classification, retrieval, clustering and more.
**Subscribe Now**
Early Access to Chapters 1-3
Ch 1 : LLMs & the need for RAG
Ch 2 : RAG enabled systems & their design
Another important consideration is cost. With OpenAI models, you can incur significant costs if you are working with a lot of documents. The cost of open-source models will depend on the implementation.
Creating Embeddings
Once you've chosen your embedding model, there are several ways of creating the embeddings. Sometimes, our friends LlamaIndex and LangChain come in pretty handy for converting documents (split into chunks) into vector embeddings. Other times, you can use a provider's service directly or get the embeddings from HuggingFace.
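As one illustration, here is a minimal sketch using LangChain's OpenAIEmbeddings wrapper. The model name and chunk texts are assumptions for the example, and an OpenAI API key must be set in the environment:

```python
from langchain.embeddings import OpenAIEmbeddings  # pip install langchain openai

# Assumed example chunks; in practice these come from your document splitter.
chunks = ["RAG combines retrieval with generation.",
          "Embeddings map text to vectors."]

# Requires the OPENAI_API_KEY environment variable.
embeddings_model = OpenAIEmbeddings(model="text-embedding-ada-002")

vectors = embeddings_model.embed_documents(chunks)       # one vector per chunk
query_vector = embeddings_model.embed_query("What is RAG?")
print(len(vectors), len(vectors[0]))  # 2 chunks, 1536-dimensional vectors
```

The same `embed_documents` / `embed_query` interface is shared by LangChain's other embedding wrappers, which makes swapping providers straightforward.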
Example Response
Cost
In this example, 1014 tokens will cost about $0.0001. Recall that for this YouTube transcript we got 14 chunks, so creating the embeddings for the entire transcript will cost about 0.14 cents. This may seem low, but when you scale up to thousands of documents being updated frequently, the cost can become a concern.
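A quick back-of-the-envelope check of these figures, assuming OpenAI's ada-002 embedding price of $0.0001 per 1K tokens and treating each chunk as roughly 1014 tokens:

```python
PRICE_PER_1K_TOKENS = 0.0001  # assumed ada-002 embedding price in USD

cost_one_chunk = 1014 / 1000 * PRICE_PER_1K_TOKENS  # ~ $0.0001
cost_transcript = 14 * cost_one_chunk               # ~ $0.0014, i.e. about 0.14 cents

print(f"${cost_one_chunk:.6f} per chunk, ${cost_transcript:.6f} for the transcript")
```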
Example : msmarco-bert-base-dot-v5
using HuggingFaceEmbeddings from langchain.embeddings
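A minimal sketch of this example, assuming the model lives under the `sentence-transformers` namespace on the HuggingFace Hub; the model is downloaded and run locally, so no API key is needed:

```python
from langchain.embeddings import HuggingFaceEmbeddings  # pip install langchain sentence-transformers

# Runs locally via sentence-transformers; downloads the model on first use.
embeddings_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/msmarco-bert-base-dot-v5"
)

vector = embeddings_model.embed_query("What is Retrieval Augmented Generation?")
print(len(vector))  # BERT-base produces 768-dimensional vectors
```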
Example : embed-english-light-v3.0
using CohereEmbeddings from langchain.embeddings
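And a corresponding sketch for the Cohere model, assuming a Cohere API key is available; the document texts are placeholders for the example:

```python
from langchain.embeddings import CohereEmbeddings  # pip install langchain cohere

# Requires a Cohere API key (or the COHERE_API_KEY environment variable).
embeddings_model = CohereEmbeddings(model="embed-english-light-v3.0")

vectors = embeddings_model.embed_documents(
    ["RAG grounds LLM answers in retrieved documents."]
)
print(len(vectors[0]))  # embed-english-light-v3.0 returns 384-dimensional vectors
```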