Weaviate Advanced RAG Techniques eBook
Retrieval-Augmented Generation (RAG) enhances large language models by retrieving relevant information from an external knowledge source to help reduce hallucinations and increase the factual accuracy of generated responses.

This ebook discusses various advanced techniques you can apply to improve the performance of your RAG pipeline.

A vanilla RAG pipeline consists of an embedding model, a vector database, a prompt template, and a generative LLM. At inference time, it embeds the user query to retrieve relevant chunks of information from the vector database, which it stuffs into the LLM's prompt to generate an answer, as shown below:
[Figure: A vanilla RAG pipeline (Documents → Chunks → Embedding Model → Vector Database → Prompt Template → LLM → Response), annotated with the advanced techniques covered in this ebook: Chunking Strategies, Query Transformation, Query Decomposition, Query Routing, Metadata Filtering, Hybrid Search, Re-ranking, Context Post-processing, Prompt Engineering, and LLM Fine-tuning.]
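In code, such a vanilla pipeline can be only a few lines. The following is a minimal sketch, not a prescribed implementation: it assumes a local Weaviate instance with a "Documents" collection (with a vectorizer configured and a "text" property) and an OpenAI chat model, all of which are illustrative choices.

```python
# Minimal sketch of a vanilla RAG pipeline (collection, property, and
# model names are illustrative assumptions).
import weaviate
from openai import OpenAI

client = weaviate.connect_to_local()        # vector database
llm = OpenAI()                              # generative LLM
docs = client.collections.get("Documents")  # assumed collection name

def answer(query: str) -> str:
    # 1. Embed the query and retrieve the closest chunks from the vector DB.
    results = docs.query.near_text(query=query, limit=3)
    context = "\n".join(obj.properties["text"] for obj in results.objects)
    # 2. Stuff the retrieved chunks into the prompt template.
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
    # 3. Generate the final answer with the LLM.
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer("What is hybrid search?"))
client.close()
```

Each of the techniques in this ebook targets one of the stages in this loop: indexing, the query, retrieval, or generation.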
Data Pre-Processing
Data pre-processing is fundamental to the success of any RAG system, as the quality of your processed data directly impacts the overall performance. By thoughtfully transforming raw data into a structured format suitable for LLMs, you can significantly enhance your system's effectiveness before considering more complex optimizations.
While there are several common pre-processing techniques available, the optimal approach and sequence should be tailored to your specific use case and requirements.

The process usually begins with data acquisition and integration, where diverse document types from multiple sources are collected and consolidated into a 'knowledge base'.

[Figure: Raw data from multiple data sources (Source 1, Source 2, Source 3) is consolidated into a knowledge base that feeds the RAG pipeline.]
Indexing Optimization Techniques

Indexing optimization techniques enhance retrieval accuracy by structuring external data in more organized, searchable ways. These techniques can be applied to both the data pre-processing and chunking stages of the RAG pipeline.

Data Extraction and Parsing

Data extraction and parsing take place over the raw data so that it is accurately processed for downstream tasks. For text-based formats like Markdown, Word documents, and plain text, extraction techniques focus on preserving structure while capturing relevant content.

Scanned documents, images, and PDFs containing image-based text and tables require OCR (Optical Character Recognition) technology to convert them into an 'LLM-ready' format. However, recent advancements in multimodal retrieval models, such as ColPali and ColQwen, have revolutionized this process. These models can directly embed images of documents, potentially making traditional OCR obsolete.
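As a concrete illustration, here is a small sketch of both extraction paths. The choice of pypdf for digital PDFs and pytesseract for OCR is an assumption for the example, not a recommendation from this ebook.

```python
# Sketch: extracting text from digital vs. scanned documents
# (library choices are illustrative assumptions).
from pypdf import PdfReader

def extract_digital_pdf(path: str) -> str:
    # Works when the PDF contains an embedded text layer.
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def extract_scanned_page(image_path: str) -> str:
    # Scanned pages and images need OCR to become 'LLM-ready' text.
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(image_path))

print(extract_digital_pdf("report.pdf")[:500])
```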
Chunking Strategies

With semantic chunking, for example, the text is first split into smaller units, such as sentences, which are then vectorized. These units are then combined into chunks based on the semantic similarity of their embeddings.
Each of the discussed techniques has its strengths, and the choice depends on the RAG system's
specific requirements and the nature of the documents being processed. New approaches continue
to emerge, such as late chunking, which processes text through long-context embedding models
before splitting it into chunks to better preserve document-wide context.
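As a simple illustration of the general idea, here is a sketch of fixed-size chunking with overlap, one common baseline among the strategies discussed above. The chunk size and overlap values are illustrative.

```python
# Sketch: fixed-size chunking with overlap (parameter values are illustrative).
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    chunks = []
    step = chunk_size - overlap  # each chunk shares `overlap` chars with the next
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

document = "..." * 1000  # placeholder document text
chunks = chunk_text(document)
print(len(chunks), "chunks; the overlap preserves context across boundaries")
```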
Query Transformation
Using the user query directly as the search query for retrieval can lead to poor search results. That's why turning the raw user query into an optimized search query is essential. Query transformation refines and expands unclear, complex, or ambiguous user queries to improve the quality of search results.
Query Rewriting involves reformulating the original user query to make it more suitable for retrieval. This is particularly useful in scenarios where user queries are not optimally phrased or are expressed differently. This can be achieved by using an LLM to rephrase the original user query or by employing specialized smaller language models trained specifically for this task. This approach is called 'Rewrite-Retrieve-Read', in contrast to the traditional 'Retrieve-then-Read' paradigm.

[Figure: Rewrite-Retrieve-Read pipeline: Raw Query → Query Re-writer (LLM) → Rewritten Query → Retriever → Retrieved Documents]
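A minimal sketch of an LLM-based query re-writer is shown below; the model name and prompt wording are assumptions.

```python
# Sketch: LLM-based query rewriting ('Rewrite-Retrieve-Read').
# Model name and prompt wording are illustrative assumptions.
from openai import OpenAI

llm = OpenAI()

def rewrite_query(raw_query: str) -> str:
    prompt = (
        "Rewrite this user question as a concise, well-formed search query "
        f"for a vector database. Return only the query.\n\nQuestion: {raw_query}"
    )
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Rewrite first, then retrieve with the optimized query, then read/generate.
print(rewrite_query("my excel formulas broke after the update??"))
```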
Query Expansion focuses on broadening the original query to capture more relevant
information. This involves using an LLM to generate multiple similar queries based on the user's
initial input. These expanded queries are then used in the retrieval process, increasing both the
number and relevance of retrieved documents.
Note: Due to the increased quantity of retrieved documents, a re-ranking step is often necessary to prioritize the most relevant results (see Re-ranking).

[Figure: Query expansion: the original query "What are the benefits of meditation?" is expanded into related queries such as "Can meditation improve focus and concentration?" and "What are the long-term mental health benefits of meditation?"]
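A minimal sketch of LLM-based query expansion follows; the model name and the number of variants are illustrative assumptions.

```python
# Sketch: LLM-based query expansion followed by retrieval with every variant.
# Model name and the number of variants are illustrative assumptions.
from openai import OpenAI

llm = OpenAI()

def expand_query(query: str, n: int = 3) -> list[str]:
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Generate {n} alternative search queries for: {query}\n"
                       "Return one query per line.",
        }],
    )
    variants = response.choices[0].message.content.splitlines()
    return [query] + [v.strip() for v in variants if v.strip()]

for q in expand_query("What are the benefits of meditation?"):
    print(q)  # each variant is sent to the retriever; the merged results
              # are typically re-ranked afterwards (see Re-ranking)
```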
Query Routing

The process can include agentic elements, where AI agents decide how to handle each query.
These agents evaluate factors such as query complexity and domain to determine the optimal
approach. For example, fact-based questions may be routed to one pipeline, while those
requiring summarization or interpretation are sent to another.
Agentic RAG functions like a network of specialized agents, each with different expertise. It can
choose from various data stores, retrieval strategies (keyword-based, semantic, or hybrid),
query transformations (for poorly structured queries), and specialized tools or APIs, such as
text-to-SQL converters or even web search capabilities.
[Figure: Single-agent RAG system (router): a query agent routes queries to tools such as a vector search engine over a document collection or a web search.]

Query Decomposition

Query decomposition is a technique that breaks down complex queries into simpler sub-queries. This is useful for answering multifaceted questions requiring diverse information sources, leading to more precise and relevant search results.
The process typically involves two main stages: decomposing the original query into smaller, focused sub-queries using an LLM, and then processing these sub-queries to retrieve relevant information. For example, a complex question about diet and energy levels might be decomposed into sub-queries such as:

What are the common dietary factors that can cause fatigue?
What are some popular diet trends and their effects on energy levels?
How can I determine if my diet is balanced and supports my energy needs?
Each sub-query targets a specific aspect, enabling the retriever to find relevant documents or chunks. Sub-queries can also be processed in parallel to improve efficiency. Additional techniques like keyword extraction and metadata filter extraction can help identify both key search terms and structured filtering criteria, enabling more precise searches. After retrieval, the system aggregates and synthesizes results from all sub-queries to generate a comprehensive answer to the original complex query.
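The following sketch outlines this two-stage flow. The model name is an assumption, and the retrieve() helper stands in for the vector search shown in earlier sketches.

```python
# Sketch: query decomposition with parallel sub-query retrieval.
# Model name and the retrieve() helper are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

llm = OpenAI()

def decompose(query: str) -> list[str]:
    # Ask an LLM to split the complex question into focused sub-queries.
    response = llm.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content":
                   f"Break this question into simple sub-queries, one per line:\n{query}"}],
    )
    return [q.strip() for q in response.choices[0].message.content.splitlines() if q.strip()]

def retrieve(sub_query: str) -> str:
    ...  # vector search for this sub-query (see earlier sketches)

sub_queries = decompose("Why am I always tired, and how does my diet affect my energy?")
with ThreadPoolExecutor() as pool:  # sub-queries can be processed in parallel
    contexts = list(pool.map(retrieve, sub_queries))
# Finally, aggregate all retrieved contexts into one prompt to answer the
# original complex query.
```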
Metadata Filtering
Metadata is the additional information attached to each document or chunk in a vector database, providing valuable context to enhance retrieval. This supplementary data can include timestamps, categories, author info, source references, languages, file types, etc.

When retrieving content from a vector database, metadata helps refine results by filtering out irrelevant objects, even when they are semantically similar to the query. This narrows the search scope and improves the relevance of the retrieved information.

For example, by filtering on timestamp metadata, the system can prioritize recent information, ensuring the retrieved knowledge remains current and relevant. This is particularly useful in domains where information freshness is critical.

To get the most out of metadata filtering, it's important to plan carefully and choose metadata that improves search without adding unnecessary complexity.

Retrieval Optimization Strategies

Retrieval optimization strategies aim to improve retrieval results by directly manipulating the way in which external data is retrieved in relation to the user query. This can involve refining the search query, such as using metadata to filter candidates or excluding outliers, or even fine-tuning an embedding model on external data to improve the quality of the underlying embeddings themselves.
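As an illustration, here is a sketch of a filtered vector search using the Weaviate Python client (v4). The collection and property names are assumptions.

```python
# Sketch: combining vector search with a metadata filter in Weaviate
# (collection and property names are illustrative assumptions).
from datetime import datetime, timezone
import weaviate
from weaviate.classes.query import Filter

client = weaviate.connect_to_local()
articles = client.collections.get("Articles")

# Only consider chunks published after 2023, even if older ones are
# semantically similar to the query.
response = articles.query.near_text(
    query="current best practices for RAG evaluation",
    filters=Filter.by_property("publish_date").greater_than(
        datetime(2023, 1, 1, tzinfo=timezone.utc)
    ),
    limit=5,
)
for obj in response.objects:
    print(obj.properties["title"], obj.properties["publish_date"])
client.close()
```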
The most straightforward approach to defining the number of returned results is explicitly setting a value for the top k (top_k) results. If you set top_k to 5, you'll get the five closest vectors, regardless of their relevance. While easy to implement, this can include poor matches just because they made the cutoff.

Here are two techniques to manage the number of search results implicitly that can help with excluding outliers:

Distance thresholding adds a quality check by setting a maximum allowed distance between vectors. Any result with a distance score above this threshold gets filtered out, even if it would have made the top_k cutoff. This helps remove obvious outliers.

Autocut is more dynamic: it looks at how the result distances are clustered. Instead of using fixed limits, it groups results based on their relative distances and cuts the result set off where there is a significant jump in distance between groups.

Hybrid Search

Hybrid search combines the strengths of vector-based semantic search with traditional keyword-based methods. This technique aims to improve the relevance and accuracy of retrieved information in RAG systems.

[Figure: Hybrid search: the ranked results of a vector search and a keyword search are merged by a fusion algorithm into a single result list.]

The key to hybrid search lies in the 'alpha' (α) parameter, which controls the balance between semantic and keyword-based search methods:

α = 1: Pure semantic search
α = 0: Pure keyword-based search
0 < α < 1: Weighted combination of both methods

Consider a technical support knowledge base for a software company. A user might submit a query like "Excel formula not calculating correctly after update". In this scenario, semantic search helps understand the context of the problem, potentially retrieving articles about formula errors, calculation issues, or software update impacts. Meanwhile, keyword search ensures that documents containing specific terms like "Excel" and "formula" are not overlooked.
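To illustrate, here is a minimal sketch of a hybrid query with an explicit alpha using the Weaviate Python client (v4); the collection and property names are assumptions.

```python
# Sketch: hybrid search with the alpha parameter in Weaviate
# (collection and property names are illustrative assumptions).
import weaviate

client = weaviate.connect_to_local()
kb = client.collections.get("SupportArticles")

response = kb.query.hybrid(
    query="Excel formula not calculating correctly after update",
    alpha=0.5,  # 0 = pure keyword (BM25), 1 = pure vector search
    limit=5,
)
for obj in response.objects:
    print(obj.properties["title"])
client.close()
```

Weaviate also exposes autocut through the optional auto_limit query parameter, which cuts off the result list at the first significant jump in score.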
Embedding Model Fine-tuning

Fine-tuning embedding models on custom datasets can significantly improve the quality of embeddings, subsequently improving performance on downstream tasks like RAG. Fine-tuning improves embeddings to better capture the dataset's meaning and context, leading to more accurate and relevant retrieval results.

The process typically involves training the model on domain-specific data. During this process, the loss function adjusts the model's embeddings so that semantically similar items are placed closer together in the embedding space. To evaluate a fine-tuned embedding model, you can use a validation set of curated query-answer pairs to assess the quality of retrieval in your RAG pipeline. Now, the model is ready to generate more accurate and representative embeddings for your specific dataset.

The more niche your dataset is, the more it can benefit from embedding model fine-tuning. Datasets with specialized vocabularies, like medical or legal datasets, are ideal for embedding model fine-tuning, which helps extend out-of-domain vocabularies and enhance the accuracy and relevance of information retrieval and generation in RAG pipelines.
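Below is a sketch of such a fine-tuning run using the sentence-transformers library; the base model, training pairs, and hyperparameters are illustrative assumptions.

```python
# Sketch: fine-tuning an embedding model on (query, answer) pairs with
# sentence-transformers (model, data, and hyperparameters are illustrative).
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# Curated domain-specific pairs: a query and a passage that answers it.
pairs = [
    ("What does HbA1c measure?", "HbA1c reflects average blood glucose..."),
    ("Symptoms of hyponatremia?", "Low serum sodium can cause nausea..."),
]
train_examples = [InputExample(texts=[q, a]) for q, a in pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=16)

# This loss pulls semantically similar items closer together in embedding space.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("finetuned-domain-embedder")
```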
Re-Ranking
One proven method to improve the performance of your information retrieval system is to leverage a retrieve-and-rerank pipeline. A retrieve-and-rerank pipeline combines the speed of vector search with the contextual richness of a re-ranking model.

In vector search, the query and documents are processed separately. First, the documents are pre-indexed. Then, at query time, the query is processed, and the documents closest in vector space are retrieved. While vector search is a fast method to retrieve candidates, it can miss contextual nuances.

This is where re-ranking models come into play. Because re-ranking models process the query and the documents together at query time, they can capture more contextual nuances. However, they are usually complex and resource-intensive and thus not suitable for first-stage retrieval like vector search.

By combining vector search with re-ranking models, you can quickly cast a wide net of potential candidates and then re-order them to improve the quality of relevant context in your prompt. Note that when using a re-ranking model, you should over-retrieve chunks to filter out less relevant ones later.
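Here is a sketch of the second stage, using a cross-encoder from sentence-transformers as the re-ranking model; the first-stage vector_search() helper is an assumption standing in for the retrieval sketches above.

```python
# Sketch: retrieve-and-rerank with a cross-encoder. Over-retrieve
# (e.g., 50 candidates) and keep only the top 5 after re-ranking.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "Excel formula not calculating after update"
candidates = vector_search(query, top_k=50)  # assumed first-stage retriever

# The cross-encoder scores query and document together, capturing nuances
# that separate query/document embeddings can miss.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
top_context = reranked[:5]  # only the best candidates go into the prompt
```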
[Figure: Retrieve-and-rerank pipeline: Query → Embedding Model → Vector Database → Re-ranked Context → Prompt Template → LLM → Response]

Post-Retrieval Optimization

Post-retrieval optimization techniques aim to enhance the quality of generated responses, meaning that their work begins after the retrieval process has been completed. This diverse group of techniques includes using models to re-rank retrieved results, enhancing or compressing the retrieved context, prompt engineering, and fine-tuning the generative LLM on external data.
Context Post-Processing

After retrieval, it can be beneficial to post-process the retrieved context for generation. For example, if the retrieved context might benefit from additional information, you can enhance it with metadata. On the other hand, if it contains redundant data, you can compress it.

Context Enhancement with Metadata

One post-processing technique is to use metadata to enhance the retrieved context with additional information to improve generation accuracy. While you can simply add additional information from the metadata, such as timestamps, document names, etc., you can also apply more creative techniques.

Context enhancement is particularly useful when data is pre-processed into smaller chunk sizes to achieve better retrieval precision but doesn't contain enough contextual information to generate high-quality responses. In this case, you can apply a technique called 'sentence window retrieval'. This technique chunks the initial document into smaller pieces (usually single sentences) but stores a larger context window in its metadata. At retrieval time, the smaller chunks help improve retrieval precision. After retrieval, the retrieved smaller chunks are replaced with the larger context window to improve generation quality.

[Figure: Sentence window retrieval: each indexed sentence stores the n sentences before and after it as a larger context window in its metadata.]
Context Compression

RAG systems rely on diverse knowledge sources to retrieve relevant information. However, this often results in the retrieval of irrelevant or redundant data, which can lead to suboptimal responses and costly LLM calls (more tokens).

Context compression effectively addresses this challenge by extracting only the most meaningful information from the retrieved data. This process begins with a base retriever that retrieves documents/chunks related to the query. These documents/chunks are then passed through a document compressor that shortens them and eliminates irrelevant content, ensuring that valuable data is not lost in a sea of extraneous information.

Contextual compression reduces data volume, lowering retrieval and operational costs. Current research focuses on two main approaches: embedding-based and lexical-based compression, both of which aim to retain essential information while easing computational demands on RAG systems.

[Figure: Context compression pipeline: Query → Embedding Model → Vector Database → Compressed Context → Prompt Template → LLM → Response]
Prompt Engineering
The generated outputs of LLMs are greatly influenced by the quality, tone, length, and structure of their
corresponding prompts. Prompt engineering is the practice of optimizing LLM prompts to improve the quality and
accuracy of generated output. Often one of the lowest-hanging fruits when it comes to techniques for improving RAG
systems, prompt engineering does not require making changes to the underlying LLM itself. This makes it an efficient
and accessible way to enhance performance without complex modifications.
There are several different prompting techniques that are especially useful in improving RAG pipelines.
[Figure: Chain of Thought prompting: Prompt → Thought → ... → Thought → Response]
Chain of Thought (CoT) prompting involves asking the model to "think step-by-step" and break down complex reasoning tasks into a series of intermediate steps. This can be especially useful when retrieved documents contain conflicting or dense information that requires careful analysis.

Tree of Thoughts (ToT) prompting builds on CoT by instructing the model to evaluate its responses at each step in the problem-solving process or even generate several different solutions to a problem and choose the best result. This is useful in RAG when there are many potential pieces of evidence, and the model needs to weigh different possible answers based on multiple retrieved documents.

ReAct (Reasoning and Acting) prompting combines CoT with agents, creating a system in which the model can generate thoughts and delegate actions to agents that interact with external data sources in an iterative process. ReAct can improve RAG pipelines by enabling LLMs to dynamically interact with retrieved documents, updating reasoning and actions based on external knowledge to provide more accurate and contextually relevant responses.
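As a small illustration, here is one way a CoT instruction might be embedded in a RAG prompt template; the wording is illustrative, not a prescribed template.

```python
# Sketch: a RAG prompt template with a Chain of Thought instruction
# (the wording is an illustrative assumption).
COT_RAG_PROMPT = """You are a helpful assistant.

Context:
{context}

Question: {question}

Think step-by-step: first list the relevant facts from the context,
note any conflicts between sources, then reason to a final answer.
Answer:"""

prompt = COT_RAG_PROMPT.format(
    context="(retrieved chunks go here)",
    question="Which update broke the formula calculation?",
)
```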
LLM Fine-tuning

Pre-trained LLMs are trained on large, diverse datasets to acquire a sense of general knowledge, including language and grammar patterns, extensive vocabularies, and the ability to perform general tasks. When it comes to RAG, using pre-trained LLMs can sometimes result in generated output that is too generic, factually incorrect, or fails to directly address the retrieved context.

High-quality domain-specific data is crucial for fine-tuning LLMs. Labeled datasets, like positive and negative customer reviews, can help fine-tuned models better perform downstream tasks like text classification or sentiment analysis. Unlabeled datasets, on the other hand, like the latest articles published on PubMed, can help fine-tuned models gain more domain-specific knowledge and expand their vocabularies.

During the fine-tuning process, the model weights of the pre-trained LLM (also referred to as the base model) are iteratively updated through a process called backpropagation to learn from the domain-specific dataset. The result is a fine-tuned LLM that better captures the nuances and vocabulary of the domain-specific dataset.

[Figure: A domain-specific dataset is used to fine-tune the generative model in the RAG pipeline: Context → Prompt Template → Fine-tuned LLM → Response]
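As an illustration, here is a sketch of a parameter-efficient (LoRA) fine-tuning run using Hugging Face transformers and peft; the base model, dataset, and hyperparameters are placeholders, and LoRA is one common choice rather than the only approach.

```python
# Sketch: parameter-efficient (LoRA) fine-tuning of a base LLM on
# domain-specific text (model, data, and hyperparameters are illustrative).
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "gpt2"  # stand-in for your base model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = get_peft_model(
    AutoModelForCausalLM.from_pretrained(base),
    LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),
)

texts = ["Domain-specific article 1...", "Domain-specific article 2..."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

Trainer(
    model=model,  # only the small LoRA adapter weights are updated
    args=TrainingArguments(output_dir="finetuned-llm", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```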
Summary

Each of the techniques covered in this ebook targets a different stage of the RAG pipeline:

Indexing optimization techniques, like data pre-processing and chunking, focus on formatting external data to improve its efficiency and searchability.

Pre-retrieval techniques aim to optimize the user query itself by rewriting, reformatting, or routing queries to specialized pipelines.

Retrieval optimization strategies often focus on refining search results during the retrieval phase.

Post-retrieval optimization strategies aim to improve the accuracy of generated results through a variety of techniques, including re-ranking retrieved results, enhancing or compressing the (retrieved) context, and manipulating the prompt or generative model (LLM).

We recommend implementing a validation pipeline to identify which parts of your RAG system need optimization and to assess the effectiveness of advanced techniques. Evaluating your RAG pipeline enables continuous monitoring and refinement, ensuring that optimizations positively impact retrieval quality and model performance.

Ready to supercharge your RAG applications?

Try Now | Contact Us