
Advanced RAG Techniques

A guide on different techniques to improve the performance of your Retrieval-Augmented Generation applications.

Retrieval-augmented generation (RAG) provides generative large language models (LLMs) with information from an external knowledge source to help reduce hallucinations and increase the factual accuracy of the generated responses.

A naive RAG pipeline consists of four components: an embedding model, a vector database, a prompt template, and a generative LLM. At inference time, it embeds the user query to retrieve relevant document chunks from the vector database, which it stuffs into the LLM's prompt to generate an answer.

While this naive approach is straightforward, it has many limitations and can often lead to low-quality responses.

This e-book discusses various advanced techniques you can apply to improve the performance of your RAG system. These techniques can be applied at various stages in the RAG pipeline, as shown below:

[Diagram: the RAG pipeline and the stages at which each group of techniques applies. Documents are chunked and indexed into a vector database; at query time the user query passes through pre-retrieval, embedding, retrieval, and post-retrieval stages before the retrieved context, the prompt template, and the LLM produce the response.]

Indexing Optimization Techniques
    Data Pre-processing
    Chunking Strategies

Pre-retrieval Optimization Techniques
    Query Transformation
    Query Decomposition
    Query Routing

Retrieval Optimization Strategies
    Metadata Filtering
    Excluding Vector Search Outliers
    Hybrid Search
    Embedding Model Fine-tuning

Post-retrieval Optimization Techniques
    Re-ranking
    Context Post-processing
    Prompt Engineering
    LLM Fine-tuning
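To make the starting point concrete, here is a minimal sketch of such a naive RAG pipeline, assuming a populated Weaviate collection named "Document" with a configured vectorizer and an OpenAI-compatible chat model; all names are illustrative rather than prescribed by this guide.

    import weaviate
    from openai import OpenAI

    weaviate_client = weaviate.connect_to_local()
    llm = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def naive_rag(question: str) -> str:
        # 1. Embed the query and retrieve the closest chunks from the vector database.
        docs = weaviate_client.collections.get("Document")
        results = docs.query.near_text(query=question, limit=3)
        context = "\n\n".join(obj.properties["text"] for obj in results.objects)

        # 2. Stuff the retrieved chunks into the prompt and generate an answer.
        prompt = f"Answer the question using only this context:\n\n{context}\n\nQuestion: {question}"
        response = llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    print(naive_rag("What is hybrid search?"))
    weaviate_client.close()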

Indexing Optimization Techniques

Indexing optimization techniques enhance retrieval accuracy by structuring external data in more organized, searchable ways. These techniques can be applied at both the data pre-processing and chunking stages of the RAG pipeline, ensuring that relevant information is effectively retrieved.

Data Pre-Processing

Data pre-processing is fundamental to the success of any RAG system, as the quality of your processed data directly impacts the overall performance. By thoughtfully transforming raw data into a structured format suitable for LLMs, you can significantly enhance your system's effectiveness before considering more complex optimizations.

While there are several common pre-processing techniques available, the optimal approach and sequence should be tailored to your specific use case and requirements.

The process usually begins with data acquisition and integration, where diverse document types from multiple sources are collected and consolidated into a 'knowledge base'.

[Diagram: multiple data sources (Source 1, Source 2, Source 3) are collected and consolidated into raw data.]

Data Extraction and Data Parsing

Data extraction and parsing take place over the raw data so that it is accurately processed for downstream tasks. For text-based formats like Markdown, Word documents, and plain text, extraction techniques focus on preserving structure while capturing relevant content.

Scanned documents, images, and PDFs containing image-based text or tables require OCR (Optical Character Recognition) technology to convert them into an 'LLM-ready' format. However, recent advancements in multimodal retrieval models, such as ColPali and ColQwen, have revolutionized this process: these models can directly embed images of documents, potentially making traditional OCR obsolete.

Web content often involves HTML parsing, utilizing DOM traversal to extract structured data, while spreadsheets demand specialized parsing to handle cell relationships. Metadata extraction is also crucial across file types, pulling key details like author, timestamps, and other document properties (see Metadata Filtering).

Data Cleaning

Data cleaning and noise reduction involve removing irrelevant information (such as headers, footers, or boilerplate text), correcting inconsistencies, and handling missing values while maintaining the extracted data's structural integrity.

Data Transformation

This involves converting all extracted and processed content into a standardized schema, regardless of the original file type. It's at this stage that document partitioning (not to be confused with chunking) occurs, separating document content into logical units or elements (e.g., paragraphs, sections, tables).

Chunking Strategies

Chunking divides large documents into smaller, semantically meaningful segments. This process optimizes retrieval by balancing context preservation with manageable chunk sizes. Various common techniques exist for effective chunking in RAG, some of which are discussed below:

Fixed-size chunking is a simple technique that splits text into chunks of a predetermined size, regardless of content structure. While it's cost-effective, it lacks contextual awareness. This can be improved by using overlapping chunks, allowing adjacent chunks to share some content.

Recursive chunking offers more flexibility by initially splitting text using a primary separator (like paragraphs) and then applying secondary separators (like sentences) if chunks are still too large. This technique respects the document's structure and adapts well to various use cases.

Document-based chunking creates chunks based on the natural divisions within a document, such as headings or sections. It's particularly effective for structured data like HTML, Markdown, or code files but less useful when the data lacks clear structural elements.

Semantic chunking divides text into meaningful units, which are then vectorized. These units are then combined into chunks based on the cosine distance between their embeddings, with a new chunk formed whenever a significant context shift is detected. This method balances semantic coherence with chunk size.

LLM-based chunking is an advanced technique that uses an LLM to generate chunks by processing text and creating semantically isolated sentences or propositions. While highly accurate, it's also the most computationally demanding approach.

Each of the discussed techniques has its strengths, and the choice depends on the RAG system's specific requirements and the nature of the documents being processed. New approaches continue to emerge, such as late chunking, which processes text through long-context embedding models before splitting it into chunks to better preserve document-wide context.
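As an illustration of the simplest of these strategies, the following is a minimal sketch of fixed-size chunking with overlapping chunks; sizes are measured in words here, although token-based splitting is equally common, and the example document is invented.

    def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
        # Split on whitespace and slide a window of chunk_size words, stepping
        # forward by (chunk_size - overlap) so adjacent chunks share some content.
        words = text.split()
        step = chunk_size - overlap
        chunks = []
        for start in range(0, len(words), step):
            chunk = " ".join(words[start:start + chunk_size])
            if chunk:
                chunks.append(chunk)
            if start + chunk_size >= len(words):
                break
        return chunks

    document = "Retrieval-augmented generation provides LLMs with external knowledge ..."
    for i, chunk in enumerate(fixed_size_chunks(document, chunk_size=50, overlap=10)):
        print(i, chunk)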

Pre-retrieval Optimization Techniques

Pre-retrieval optimization techniques aim to optimize the user query itself before retrieval, for example by rewriting, expanding, decomposing, or routing queries to specialized pipelines.

Query Transformation

Using the user query directly as the search query for retrieval can lead to poor search results. That's why turning the raw user query into an optimized search query is essential. Query transformation refines and expands unclear, complex, or ambiguous user queries to improve the quality of search results.

Query Rewriting involves reformulating the original user query to make it more suitable for retrieval. This is particularly useful in scenarios where user queries are not optimally phrased or are expressed differently. It can be achieved by using an LLM to rephrase the original user query or by employing specialized smaller language models trained specifically for this task.

This approach is called 'Rewrite-Retrieve-Read', in contrast to the traditional 'Retrieve-then-Read' paradigm.

[Diagram: query rewriting. The raw query "Can you tell me which movies were popular last summer? I'm trying to find a blockbuster film." is rewritten by a query re-writer (LLM) into "What were the top-grossing movies released last summer?" before being passed to the retriever.]

Query Expansion focuses on broadening the original query to capture more relevant information. This involves using an LLM to generate multiple similar queries based on the user's initial input. These expanded queries are then used in the retrieval process, increasing both the number and relevance of retrieved documents.

Note: Due to the increased quantity of retrieved documents, a re-ranking step is often necessary to prioritize the most relevant results (see Re-ranking).

[Diagram: query expansion. The raw query "What are the benefits of meditation?" is expanded by a query re-writer (LLM) into queries such as "How does meditation reduce stress and anxiety?", "Can meditation improve focus and concentration?", "What are the long-term mental health benefits of meditation?", and "How does meditation affect sleep quality?", which are then passed to the retriever.]
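A minimal sketch of LLM-based query expansion, assuming an OpenAI-compatible client; the prompt wording and model name are illustrative choices, not taken from this guide. The same pattern, with a different instruction, covers query rewriting.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def expand_query(raw_query: str, n: int = 4) -> list[str]:
        prompt = (
            f"Generate {n} alternative search queries that capture different aspects "
            f"of the following question. Return one query per line.\n\n"
            f"Question: {raw_query}"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        lines = response.choices[0].message.content.splitlines()
        return [line.strip("-• ").strip() for line in lines if line.strip()]

    # Each expanded query is sent to the retriever, and the combined results are
    # typically re-ranked afterwards (see Re-ranking).
    print(expand_query("What are the benefits of meditation?"))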

Query Decomposition

Query decomposition is a technique that breaks down complex queries into simpler sub-queries. This is useful for answering multifaceted questions requiring diverse information sources, leading to more precise and relevant search results.

The process typically involves two main stages: decomposing the original query into smaller, focused sub-queries using an LLM, and then processing these sub-queries to retrieve relevant information.

For example, the complex query "Why am I always so tired even though I eat healthy? Should I be doing something different with my diet or maybe try some diet trends?" can be decomposed into the following three simpler sub-queries:

    What are the common dietary factors that can cause fatigue?
    What are some popular diet trends and their effects on energy levels?
    How can I determine if my diet is balanced and supports my energy needs?

Each sub-query targets a specific aspect, enabling the retriever to find relevant documents or chunks. Sub-queries can also be processed in parallel to improve efficiency. Additional techniques like keyword extraction and metadata filter extraction can help identify both key search terms and structured filtering criteria, enabling more precise searches. After retrieval, the system aggregates and synthesizes results from all sub-queries to generate a comprehensive answer to the original complex query.

Query Routing

Query routing is a technique that directs queries to specific pipelines based on their content and intent, enabling a RAG system to handle diverse scenarios effectively. It works by analyzing each query and choosing the best retrieval method or processing pipeline to provide an accurate response. This often requires implementing multi-index strategies, where different types of information are organized into separate, specialized indexes.

The process can include agentic elements, where AI agents decide how to handle each query. These agents evaluate factors such as query complexity and domain to determine the optimal approach. For example, fact-based questions may be routed to one pipeline, while those requiring summarization or interpretation are sent to another.

Agentic RAG functions like a network of specialized agents, each with different expertise. It can choose from various data stores, retrieval strategies (keyword-based, semantic, or hybrid), query transformations (for poorly structured queries), and specialized tools or APIs, such as text-to-SQL converters or even web search capabilities.

[Diagram: a single-agent RAG system (router). A query agent routes each query to tools such as vector search engine A (Collection A), vector search engine B (Collection B), a calculator, or web search, and the retrieved results are passed to the LLM to generate the response.]
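A minimal sketch of the decomposition stage, again assuming an OpenAI-compatible client; the prompt, model name, and example query are illustrative. Each returned sub-query would then be sent to the retriever (possibly in parallel) and the results aggregated before generation.

    from openai import OpenAI

    client = OpenAI()

    def decompose_query(complex_query: str) -> list[str]:
        prompt = (
            "Break the following question into at most three focused sub-queries, "
            "one per line, that together cover all of its aspects.\n\n"
            f"Question: {complex_query}"
        )
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        lines = response.choices[0].message.content.splitlines()
        return [line.strip("-• ").strip() for line in lines if line.strip()]

    print(decompose_query(
        "Why am I always so tired even though I eat healthy? "
        "Should I be doing something different with my diet or maybe try some diet trends?"
    ))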

Retrieval Optimization Strategies

Retrieval optimization strategies aim to improve retrieval results by directly manipulating the way in which external data is retrieved in relation to the user query. This can involve refining the search query, such as using metadata to filter candidates or excluding outliers, or even fine-tuning an embedding model on the external data to improve the quality of the underlying embeddings themselves.

Metadata Filtering

Metadata is the additional information attached to each document or chunk in a vector database, providing valuable context to enhance retrieval. This supplementary data can include timestamps, categories, author information, source references, languages, file types, etc.

When retrieving content from a vector database, metadata helps refine results by filtering out irrelevant objects, even when they are semantically similar to the query. This narrows the search scope and improves the relevance of the retrieved information.

Another benefit of using metadata is time-awareness. By incorporating timestamps as metadata, the system can prioritize recent information, ensuring the retrieved knowledge remains current and relevant. This is particularly useful in domains where information freshness is critical.

To get the most out of metadata filtering, it's important to plan carefully and choose metadata that improves search without adding unnecessary complexity.
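A minimal sketch of metadata filtering with the Weaviate Python client (v4); the collection name, property names, and filter values are illustrative assumptions about your schema.

    from datetime import datetime, timezone

    import weaviate
    from weaviate.classes.query import Filter

    client = weaviate.connect_to_local()
    articles = client.collections.get("Article")

    # Vector search restricted to a category and to recently published objects.
    response = articles.query.near_text(
        query="Excel formula not calculating correctly",
        limit=5,
        filters=(
            Filter.by_property("category").equal("spreadsheets")
            & Filter.by_property("published_at").greater_than(
                datetime(2024, 1, 1, tzinfo=timezone.utc)
            )
        ),
    )
    for obj in response.objects:
        print(obj.properties["title"])

    client.close()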

Excluding Vector Search Outliers

The most straightforward approach to defining the number of returned results is explicitly setting a value for the top k (top_k) results. If you set top_k to 5, you'll get the five closest vectors, regardless of their relevance. While easy to implement, this can include poor matches just because they made the cutoff.

Here are two techniques that manage the number of search results implicitly and can help with excluding outliers:

Distance thresholding adds a quality check by setting a maximum allowed distance between vectors. Any result with a distance score above this threshold gets filtered out, even if it would have made the top_k cutoff. This helps remove the obvious bad matches but requires careful threshold adjustment.

Autocut is more dynamic: it looks at how the result distances are clustered. Instead of using fixed limits, it groups results based on their relative distances from your query vector. When there's a big jump in distance scores between groups, Autocut can cut off the results at that jump. This catches outliers that might slip through top_k or basic distance thresholds.

Hybrid Search

Hybrid search combines the strengths of vector-based semantic search with traditional keyword-based methods. This technique aims to improve the relevance and accuracy of retrieved information in RAG systems.

[Diagram: hybrid search. Vector search and keyword search each produce their own ranking of the results, and a fusion algorithm merges the two rankings into one.]

The key to hybrid search lies in the 'alpha' (α) parameter, which controls the balance between semantic and keyword-based search methods:

    α = 1: pure semantic search
    α = 0: pure keyword-based search
    0 < α < 1: weighted combination of both methods

This approach is particularly beneficial when you need both contextual understanding and exact keyword matching.

Consider a technical support knowledge base for a software company. A user might submit a query like "Excel formula not calculating correctly after update". In this scenario, semantic search helps understand the context of the problem, potentially retrieving articles about formula errors, calculation issues, or software update impacts. Meanwhile, keyword search ensures that documents containing specific terms like "Excel" and "formula" are not overlooked.

Therefore, when implementing hybrid search, it's crucial to adjust the alpha parameter based on your specific use case to optimize performance.
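Both ideas map directly onto query parameters in the Weaviate Python client (v4); a minimal sketch follows, where the collection name and the threshold values are illustrative assumptions.

    import weaviate

    client = weaviate.connect_to_local()
    articles = client.collections.get("Article")
    query = "Excel formula not calculating correctly after update"

    # Distance thresholding and Autocut: cap the allowed vector distance and let
    # the result list be cut at the first large jump in distance scores.
    semantic = articles.query.near_text(
        query=query,
        distance=0.6,   # drop anything farther than this from the query vector
        auto_limit=1,   # keep only the first group of closely clustered results
    )

    # Hybrid search: alpha balances vector search (1.0) against keyword search (0.0).
    hybrid = articles.query.hybrid(
        query=query,
        alpha=0.75,     # lean toward semantic search while keeping exact-term matching
        limit=5,
    )

    for obj in hybrid.objects:
        print(obj.properties)

    client.close()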

Embedding Model Fine-Tuning

Off-the-shelf embedding models are usually trained on large general datasets to embed a wide range of data inputs. However, embedding models can fail to capture the context and nuances of smaller, domain-specific datasets.

Fine-tuning embedding models on custom datasets can significantly improve the quality of embeddings, subsequently improving performance on downstream tasks like RAG. Fine-tuning improves embeddings to better capture the dataset's meaning and context, leading to more accurate and relevant retrievals in RAG applications.

The more niche your dataset is, the more it can benefit from embedding model fine-tuning. Datasets with specialized vocabularies, like medical or legal datasets, are ideal for embedding model fine-tuning, which helps extend out-of-domain vocabularies and enhance the accuracy and relevance of information retrieval and generation in RAG pipelines.

To fine-tune an existing embedding model, you first need to select a base model that you would like to improve. Next, you begin the fine-tuning process by providing the model with your domain-specific data. During this process, the loss function adjusts the model's embeddings so that semantically similar items are placed closer together in the embedding space. To evaluate the fine-tuned embedding model, you can use a validation set of curated query-answer pairs to assess the quality of retrieval in your RAG pipeline. The model is then ready to generate more accurate and representative embeddings for your specific dataset.
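A minimal fine-tuning sketch using the sentence-transformers library; the base model, the loss function, and the toy query-passage pairs are illustrative assumptions, and a real run would use thousands of domain-specific pairs plus a held-out validation set of query-answer pairs for evaluation.

    from sentence_transformers import InputExample, SentenceTransformer, losses
    from torch.utils.data import DataLoader

    model = SentenceTransformer("all-MiniLM-L6-v2")  # base model to improve

    # Domain-specific (query, relevant passage) pairs.
    train_examples = [
        InputExample(texts=[
            "What does HNSW stand for?",
            "HNSW (Hierarchical Navigable Small World) is a graph-based vector index.",
        ]),
        InputExample(texts=[
            "How does hybrid search weight results?",
            "The alpha parameter balances vector search against keyword search.",
        ]),
    ]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

    # This loss pulls each query toward its paired passage in the embedding space
    # and pushes it away from the other passages in the batch.
    train_loss = losses.MultipleNegativesRankingLoss(model)

    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
    model.save("finetuned-embedding-model")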

Post-Retrieval Optimization Techniques

Post-retrieval optimization techniques aim to enhance the quality of generated responses, meaning that their work begins after the retrieval process has been completed. This diverse group of techniques includes using models to re-rank retrieved results, enhancing or compressing the retrieved context, prompt engineering, and fine-tuning the generative LLM on external data.

Re-Ranking

One proven method to improve the performance of your information retrieval system is to leverage a retrieve-and-rerank pipeline, which combines the speed of vector search with the contextual richness of a re-ranking model.

In vector search, the query and documents are processed separately. First, the documents are pre-indexed. Then, at query time, the query is processed, and the documents closest in vector space are retrieved. While vector search is a fast method to retrieve candidates, it can miss contextual nuances.

This is where re-ranking models come into play. Because re-ranking models process the query and the documents together at query time, they can capture more contextual nuances. However, they are usually complex and resource-intensive and thus not suitable for first-stage retrieval like vector search.

By combining vector search with re-ranking models, you can quickly cast a wide net of potential candidates and then re-order them to improve the quality of relevant context in your prompt. Note that when using a re-ranking model, you should over-retrieve chunks so that less relevant ones can be filtered out later.

[Diagram: a retrieve-and-rerank pipeline. The query is embedded, candidate context is retrieved from the vector database, a re-ranking model re-orders it, and the re-ranked context is passed to the prompt template and the LLM to generate the response.]
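A minimal retrieve-and-rerank sketch in which Weaviate retrieves candidates and a cross-encoder from sentence-transformers re-orders them; the collection name, property names, and model choice are illustrative assumptions.

    import weaviate
    from sentence_transformers import CrossEncoder

    client = weaviate.connect_to_local()
    articles = client.collections.get("Article")
    query = "How do I fix a formula that stopped calculating after an update?"

    # Over-retrieve candidates with fast first-stage vector search.
    candidates = articles.query.near_text(query=query, limit=20)
    texts = [obj.properties["body"] for obj in candidates.objects]

    # Re-rank by scoring each (query, document) pair together with a cross-encoder.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, text) for text in texts])
    reranked = [text for _, text in
                sorted(zip(scores, texts), key=lambda pair: pair[0], reverse=True)]

    top_context = reranked[:5]  # only the best chunks go into the prompt
    client.close()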

Context Post-Processing

After retrieval, it can be beneficial to post-process the retrieved context for generation. For example, if the retrieved context might benefit from additional information, you can enhance it with metadata. On the other hand, if it contains redundant data, you can compress it.

Context Enhancement with Metadata

One post-processing technique is to use metadata to enhance the retrieved context with additional information to improve generation accuracy. While you can simply add extra information from the metadata, such as timestamps, document names, etc., you can also apply more creative techniques.

Context enhancement is particularly useful when data needs to be pre-processed into smaller chunk sizes to achieve better retrieval precision, but those small chunks don't contain enough contextual information to generate high-quality responses. In this case, you can apply a technique called "sentence window retrieval". This technique chunks the initial document into smaller pieces (usually single sentences) but stores a larger context window in each chunk's metadata. At retrieval time, the smaller chunks help improve retrieval precision. After retrieval, the retrieved smaller chunks are replaced with the larger context window to improve generation quality.

[Diagram: sentence window retrieval. Each sentence is embedded individually, while a window of n sentences before and after it is stored in its metadata and substituted for the sentence after retrieval.]

Context Compression

RAG systems rely on diverse knowledge sources to retrieve relevant information. However, this often results in the retrieval of irrelevant or redundant data, which can lead to suboptimal responses and costly LLM calls (more tokens).

Context compression effectively addresses this challenge by extracting only the most meaningful information from the retrieved data. This process begins with a base retriever that retrieves documents or chunks related to the query. These documents or chunks are then passed through a document compressor that shortens them and eliminates irrelevant content, ensuring that valuable data is not lost in a sea of extraneous information.

Contextual compression reduces data volume, lowering retrieval and operational costs. Current research focuses on two main approaches: embedding-based and lexical-based compression, both of which aim to retain essential information while easing computational demands on RAG systems.

[Diagram: context compression. The query is embedded, context is retrieved from the vector database, a compressor (LLM) condenses it, and the compressed context is passed to the prompt template and the LLM to generate the response.]
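A minimal, self-contained sketch of the sentence window idea: single sentences are what get embedded and matched, and each one carries a wider window that replaces it after retrieval. The property names and the in-memory "index" are illustrative; in practice the window would be stored as metadata on each object in the vector database.

    sentences = [
        "Hybrid search merges vector and keyword results.",
        "The alpha parameter controls the balance between the two methods.",
        "Re-ranking models reorder candidates using the query and document together.",
    ]

    def build_index(sentences: list[str], window: int = 1) -> list[dict]:
        indexed = []
        for i, sentence in enumerate(sentences):
            indexed.append({
                "text": sentence,  # what gets embedded and matched at query time
                "window": " ".join(sentences[max(0, i - window): i + window + 1]),  # stored as metadata
            })
        return indexed

    def post_process(retrieved: list[dict]) -> list[str]:
        # After retrieval, swap each small chunk for its larger context window.
        return [obj["window"] for obj in retrieved]

    index = build_index(sentences)
    retrieved = [index[1]]           # pretend the retriever matched the middle sentence
    print(post_process(retrieved))   # the LLM receives the surrounding window instead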

Prompt Engineering

The generated outputs of LLMs are greatly influenced by the quality, tone, length, and structure of their corresponding prompts. Prompt engineering is the practice of optimizing LLM prompts to improve the quality and accuracy of the generated output. Often one of the lowest-hanging fruits when it comes to techniques for improving RAG systems, prompt engineering does not require making changes to the underlying LLM itself. This makes it an efficient and accessible way to enhance performance without complex modifications.

There are several different prompting techniques that are especially useful in improving RAG pipelines.

[Diagram: Chain of Thought prompting as a sequence: prompt, a series of intermediate thoughts, then the response.]

Chain of Thought (CoT) prompting involves asking the model to "think step-by-step" and break down complex reasoning tasks into a series of intermediate steps. This can be especially useful when retrieved documents contain conflicting or dense information that requires careful analysis.

Tree of Thoughts (ToT) prompting builds on CoT by instructing the model to evaluate its responses at each step in the problem-solving process, or even to generate several different solutions to a problem and choose the best result. This is useful in RAG when there are many potential pieces of evidence and the model needs to weigh different possible answers based on multiple retrieved documents.

ReAct (Reasoning and Acting) prompting combines CoT with agents, creating a system in which the model can generate thoughts and delegate actions to agents that interact with external data sources in an iterative process. ReAct can improve RAG pipelines by enabling LLMs to dynamically interact with retrieved documents, updating reasoning and actions based on external knowledge to provide more accurate and contextually relevant responses.
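As a small illustration, here is a sketch of a RAG prompt template that applies Chain of Thought style instructions to the retrieved context; the wording is an assumption, not a template from this guide.

    RAG_PROMPT = """You are an assistant that answers questions using only the
    provided context.

    Context:
    {context}

    Question: {question}

    Think step by step: first list the relevant facts from the context, note any
    contradictions between sources, and only then state your final answer. If the
    context does not contain the answer, say so."""

    def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
        return RAG_PROMPT.format(context="\n\n".join(retrieved_chunks), question=question)

    print(build_prompt(
        "What does the alpha parameter control?",
        ["The alpha parameter balances vector search against keyword search."],
    ))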

LLM Fine-Tuning

Pre-trained LLMs are trained on large, diverse datasets to acquire a sense of general knowledge, including language and grammar patterns, extensive vocabularies, and the ability to perform general tasks. When it comes to RAG, using pre-trained LLMs can sometimes result in generated output that is too generic, factually incorrect, or fails to directly address the retrieved context.

Fine-tuning a pre-trained model involves training it further on a specific dataset or task to adapt the model's general knowledge to the nuances of that particular domain, improving its performance in that area. Using a fine-tuned model in RAG pipelines can help improve the quality of generated responses, especially when the topic at hand is highly specialized.

High-quality domain-specific data is crucial for fine-tuning LLMs. Labeled datasets, like positive and negative customer reviews, can help fine-tuned models better perform downstream tasks like text classification or sentiment analysis. Unlabeled datasets, on the other hand, like the latest articles published on PubMed, can help fine-tuned models gain more domain-specific knowledge and expand their vocabularies.

During the fine-tuning process, the model weights of the pre-trained LLM (also referred to as a base model) are iteratively updated through a process called backpropagation to learn from the domain-specific dataset. The result is a fine-tuned LLM that better captures the nuances and requirements of the new data, such as specific terminology, style, or tone.

[Diagram: LLM fine-tuning within the RAG pipeline. A pre-trained LLM is fine-tuned on a domain-specific dataset, and the resulting fine-tuned LLM replaces the generic LLM as the generative model in the pipeline.]
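A minimal causal language model fine-tuning sketch with Hugging Face transformers on an unlabeled domain corpus; the small base model, file name, and hyperparameters are illustrative assumptions, and in practice you would likely start from a larger instruction-tuned model, often with parameter-efficient methods.

    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    model_name = "distilgpt2"  # small stand-in for a real base LLM
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Unlabeled domain-specific text, e.g. recent articles from your domain.
    dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
        batched=True, remove_columns=["text"],
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="finetuned-llm", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()       # backpropagation updates the base model's weights
    trainer.save_model()  # the fine-tuned LLM then replaces the generic one in the pipeline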

Summary

RAG enhances generative models by enabling them to reference external data, improving response accuracy and relevance while mitigating hallucinations and information gaps. Naive RAG retrieves documents based on query similarity and directly feeds them into a generative model for response generation. However, more advanced techniques, like the ones detailed in this guide, can significantly improve the quality of RAG pipelines by enhancing the relevance and accuracy of the retrieved information.

This e-book reviewed advanced RAG techniques that can be applied at various stages of the RAG pipeline to improve retrieval quality and the accuracy of generated responses:

    Indexing optimization techniques, like data pre-processing and chunking, focus on formatting external data to improve its efficiency and searchability.
    Pre-retrieval techniques aim to optimize the user query itself by rewriting, reformatting, or routing queries to specialized pipelines.
    Retrieval optimization strategies focus on refining search results during the retrieval phase.
    Post-retrieval optimization strategies aim to improve the accuracy of generated results through a variety of techniques, including re-ranking retrieved results, enhancing or compressing the retrieved context, and manipulating the prompt or the generative model (LLM).

We recommend implementing a validation pipeline to identify which parts of your RAG system need optimization and to assess the effectiveness of advanced techniques. Evaluating your RAG pipeline enables continuous monitoring and refinement, ensuring that optimizations positively impact retrieval quality and model performance.

Ready to supercharge your RAG applications?

Start building today with a 14-day free trial of Weaviate Cloud (WCD).

Try Now | Contact Us
