
Don’t Do RAG:
When Cache-Augmented Generation is All You Need for Knowledge Tasks

Brian J Chan∗, Chao-Ting Chen∗, Jui-Hung Cheng∗
Department of Computer Science
National Chengchi University
Taipei, Taiwan
{110703065,110703038,110703007}@nccu.edu.tw

Hen-Hsen Huang
Institute of Information Science
Academia Sinica
Taipei, Taiwan
[email protected]

∗The three authors contributed equally to this research.

arXiv:2412.15605v1 [cs.CL] 20 Dec 2024
Abstract

Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. However, RAG introduces challenges such as retrieval latency, potential errors in document selection, and increased system complexity. With the advent of large language models (LLMs) featuring significantly extended context windows, this paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval. Our method involves preloading all relevant resources, especially when the documents or knowledge for retrieval are of a limited and manageable size, into the LLM’s extended context and caching its runtime parameters. During inference, the model utilizes these preloaded parameters to answer queries without additional retrieval steps. Comparative analyses reveal that CAG eliminates retrieval latency and minimizes retrieval errors while maintaining context relevance. Performance evaluations across multiple benchmarks highlight scenarios where long-context LLMs either outperform or complement traditional RAG pipelines. These findings suggest that, for certain applications, particularly those with a constrained knowledge base, CAG provides a streamlined and efficient alternative to RAG, achieving comparable or superior results with reduced complexity.

[Figure 1: Comparison of traditional RAG and our CAG workflows. The upper section illustrates the RAG pipeline, including real-time retrieval and reference text input during inference, while the lower section depicts our CAG approach, which preloads the KV-cache, eliminating the retrieval step and reference text input at inference.]

CCS Concepts
• Computing methodologies → Discourse, dialogue and pragmatics; Natural language generation; • Information systems → Specialized information retrieval.

Keywords
Large Language Models, Retrieval Augmented Generation, Retrieval-Free Question Answering

1 Introduction

The advent of retrieval-augmented generation (RAG) [1, 3] has significantly enhanced the capabilities of large language models (LLMs) by dynamically integrating external knowledge sources. RAG systems have proven effective in handling open-domain questions and specialized tasks, leveraging retrieval pipelines to provide contextually relevant answers. However, RAG is not without its drawbacks. The need for real-time retrieval introduces latency, while errors in selecting or ranking relevant documents can degrade the quality of the generated responses. Additionally, integrating retrieval and generation components increases system complexity, necessitating careful tuning and adding to the maintenance overhead.

This paper proposes an alternative paradigm, cache-augmented generation (CAG), leveraging the capabilities of long-context LLMs to address these challenges. Instead of relying on a retrieval pipeline, as shown in Figure 1, our approach involves preloading the LLM with all relevant documents in advance and precomputing the key-value (KV) cache, which encapsulates the inference state of the LLM. The preloaded context enables the model to provide rich, contextually accurate answers without the need for additional retrieval during runtime. This approach eliminates retrieval latency, mitigates retrieval errors, and simplifies system architecture, all while maintaining high-quality responses by ensuring the model processes all relevant context holistically.

Recent advances in long-context LLMs have extended their ability to process and reason over substantial textual inputs. By accommodating larger context windows, these models can assimilate extensive information in a single inference step, making them well-suited for tasks like document comprehension, multi-turn dialogue, and summarization of lengthy texts. This capability eliminates the dependency on real-time retrieval, as all necessary information can be preloaded into the model. These developments create opportunities to streamline workflows for knowledge-intensive tasks, potentially reducing or even eliminating the need for traditional RAG systems.

Recent studies [2, 4] have investigated the performance of long-context models in RAG tasks, revealing that state-of-the-art models like GPT-o1, GPT-4, and Claude 3.5 can effectively process large amounts of retrieved data, outperforming traditional systems in many scenarios. Findings suggest that as long as all documents fit within the extended context length, traditional RAG systems can be replaced by these long-context models. Similarly, Lu et al. [5] have demonstrated the benefits of precomputed KV caching to improve efficiency, albeit with the need for position ID rearrangement to enable proper functioning. Nonetheless, these methods remain vulnerable to retrieval failures inherent to RAG systems.

Through a series of experiments comparing traditional RAG workflows with our proposed approach, we identify scenarios where long-context LLMs outperform RAG in both efficiency and accuracy. By addressing the technical and practical implications, this paper aims to provide insights into when and why CAG may serve as a streamlined, effective alternative to RAG, particularly for cases where the documents or knowledge for retrieval are of limited, manageable size. Our findings challenge the default reliance on RAG for knowledge integration tasks, offering a simplified, robust solution to harness the growing capabilities of long-context LLMs.

Our contributions are threefold:

• Retrieval-Free Long-Context Paradigm: We introduce a novel approach leveraging long-context LLMs with preloaded documents and precomputed KV caches, eliminating retrieval latency, errors, and system complexity.
• Performance Comparison: We conduct extensive experiments showing scenarios where long-context LLMs outperform traditional RAG systems, especially with manageable knowledge bases.
• Practical Insights: We provide actionable insights into optimizing knowledge-intensive workflows, demonstrating the viability of retrieval-free methods for specific applications. Our CAG framework is released publicly (https://github.com/hhhuang/CAG).

2 Methodology

Our CAG framework leverages the extended context capabilities of long-context LLMs to enable retrieval-free knowledge integration. By preloading external knowledge sources, such as a collection of documents D = {d_1, d_2, ...}, and precomputing the key-value (KV) cache C_KV, we address the computational challenges and inefficiencies inherent to real-time retrieval in traditional RAG systems. The operation of our framework is divided into three phases (a code sketch of all three appears at the end of this section):

(1) External Knowledge Preloading. In this phase, a curated collection of documents D relevant to the target application is preprocessed and formatted to fit within the model's extended context window. The LLM M, with parameters θ, processes D, transforming it into a precomputed KV cache:

    C_KV = KV-Encode(D)    (1)

This KV cache, which encapsulates the inference state of the LLM, is stored on disk or in memory for future use. The computational cost of processing D is incurred only once, regardless of the number of subsequent queries.

(2) Inference. During inference, the precomputed KV cache C_KV is loaded alongside the user's query Q. The LLM utilizes this cached context to generate responses:

    R = M(Q | C_KV)    (2)

By preloading the external knowledge, this phase eliminates retrieval latency and reduces risks of errors or omissions that arise from dynamic retrieval. The combined prompt P = Concat(D, Q) ensures a unified understanding of both the external knowledge and the user query.

(3) Cache Reset. To maintain system performance across multiple inference sessions, the KV cache, stored in memory, can be reset efficiently. As the KV cache grows in an append-only manner, with new tokens t_1, t_2, ..., t_k sequentially appended, resetting involves truncating these new tokens:

    C_KV^reset = Truncate(C_KV, t_1, t_2, ..., t_k)    (3)

This allows for rapid reinitialization without reloading the entire cache from disk, ensuring sustained speed and responsiveness.

The proposed methodology offers several significant advantages over traditional RAG systems:

• Reduced Inference Time: By eliminating the need for real-time retrieval, the inference process becomes faster and more efficient, enabling quicker responses to user queries.
• Unified Context: Preloading the entire knowledge collection into the LLM provides a holistic and coherent understanding of the documents, resulting in improved response quality and consistency across a wide range of tasks.
• Simplified Architecture: By removing the need to integrate retrievers and generators, the system becomes more streamlined, reducing complexity, improving maintainability, and lowering development overhead.

Looking forward, our approach is poised to become even more powerful with the anticipated advancements in LLMs. As future models continue to expand their context length, they will be able to process increasingly larger knowledge collections in a single inference step. Additionally, the improved ability of these models to extract and utilize relevant information from long contexts will further enhance their performance. These two trends will significantly extend the usability of our approach, enabling it to handle more complex and diverse applications. Consequently, our methodology is well-positioned to become a robust and versatile solution for knowledge-intensive tasks, leveraging the growing capabilities of next-generation LLMs.
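To make the three phases concrete, the following is a minimal sketch using Hugging Face transformers with a Llama 3.1 8B Instruct checkpoint. It is an illustration under stated assumptions, not the authors' released implementation: the function names (kv_encode, answer, reset), the greedy decoding loop, and the file knowledge.txt are ours, and it assumes a recent transformers version whose DynamicCache is updated in place and exposes get_seq_length() and crop().

```python
# Sketch of the three CAG phases (Eqs. 1-3); assumptions noted in the lead-in.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import DynamicCache

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def kv_encode(documents: str) -> tuple[DynamicCache, int]:
    """Phase 1 (Eq. 1): run the knowledge text once and keep its KV cache."""
    ids = tok(documents, return_tensors="pt").input_ids.to(model.device)
    cache = DynamicCache()
    with torch.no_grad():
        model(input_ids=ids, past_key_values=cache, use_cache=True)
    return cache, cache.get_seq_length()  # remember the preloaded length

def answer(query: str, cache: DynamicCache, max_new_tokens: int = 64) -> str:
    """Phase 2 (Eq. 2): greedy decoding conditioned on the preloaded cache."""
    next_ids = tok(query, return_tensors="pt").input_ids.to(model.device)
    generated = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            # The cache already holds the knowledge tokens; only new tokens are fed.
            out = model(input_ids=next_ids, past_key_values=cache, use_cache=True)
            next_ids = out.logits[:, -1:].argmax(dim=-1)  # [1, 1] greedy token
            if next_ids.item() == tok.eos_token_id:
                break
            generated.append(next_ids.item())
    return tok.decode(generated, skip_special_tokens=True)

def reset(cache: DynamicCache, preload_len: int) -> None:
    """Phase 3 (Eq. 3): truncate query/answer tokens, keep the knowledge prefix."""
    cache.crop(preload_len)

# Encode the knowledge once, then answer many queries against the same cache.
kv_cache, n_preload = kv_encode(open("knowledge.txt").read())
for q in ["Question 1: ...", "Question 2: ..."]:
    print(answer(q, kv_cache))
    reset(kv_cache, n_preload)
```

The key design point is that the expensive forward pass over the knowledge text happens once in kv_encode; every query only pays for its own tokens, and reset restores the cache to the preloaded prefix instead of re-encoding the documents.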

3 Experiments

3.1 Experimental Setup

To evaluate the effectiveness of our proposed method, we conducted experiments using two widely recognized question-answering benchmarks: the Stanford Question Answering Dataset (SQuAD) 1.0 [6] and the HotPotQA dataset [7]. These datasets provide complementary challenges, with SQuAD focusing on precise, context-aware answers within single passages and HotPotQA emphasizing multi-hop reasoning across multiple documents. Each dataset consists of documents D = {d_1, d_2, ...} paired with questions Q = {q_1, q_2, ...} and golden responses R = {r_1, r_2, ...}. Together, they provide a robust platform for assessing both single-context comprehension and complex multi-hop reasoning.

To investigate how different levels of reference text length impact retrieval difficulty, we created three test sets for each dataset, varying the size of the reference text. For example, in the HotPotQA-small configuration, we sampled 16 documents D_s ⊂ D from the HotPotQA document set to form a long reference text. QA pairs associated with D_s were selected as test instances. The same methodology was applied to create test sets for SQuAD.

The dataset statistics are summarized in Table 1. As the number of documents (and hence the length of the reference text) increases, the task becomes more challenging, particularly for RAG systems. Longer reference texts increase the difficulty of accurately retrieving the correct information, which is crucial for LLMs to generate high-quality responses.

Table 1: Overview of the SQuAD and HotPotQA test sets with varying reference text lengths, highlighting the number of documents, questions, and associated responses for each configuration.

Source     Size     # Docs   # Tokens   # QA Pairs
HotPotQA   Small    16       21k        1,392
HotPotQA   Medium   32       43k        1,056
HotPotQA   Large    64       85k        1,344
SQuAD      Small    3        21k        500
SQuAD      Medium   4        32k        500
SQuAD      Large    7        50k        500

The primary task involves generating accurate and contextually relevant answers R̂ = {r̂_1, r̂_2, ...} for the SQuAD and HotPotQA questions, based on the respective preloaded passages. By leveraging the precomputed key-value cache C_KV = KV-Encode(D), our system generates responses r̂_i = M(q_i | C_KV) without relying on retrieval mechanisms during inference. This unified approach allows for direct performance comparisons against traditional RAG systems, highlighting the strengths and limitations of our method across diverse QA challenges.

The experiments were executed on 8 Tesla V100 32GB GPUs. For all experiments, we used the Llama 3.1 8B Instruct model as the underlying LLM across all systems, including both the RAG baselines and our proposed method. This model supports input sizes of up to 128k tokens, enabling the processing of extensive contexts. For our proposed method, the context of each dataset was preloaded into the model via a precomputed key-value (KV) cache. For SQuAD, the documents D_S were encoded into a KV cache C_KV^S = KV-Encode(D_S), while for HotPotQA, the documents D_H were encoded into C_KV^H = KV-Encode(D_H). These caches were stored offline and loaded during inference to eliminate the need for real-time retrieval, ensuring comprehensive access to all relevant information for each dataset.

3.2 Baseline Systems

The baseline RAG systems were implemented using the LlamaIndex framework (https://www.llamaindex.ai/framework), employing two retrieval strategies: BM25 for sparse retrieval and OpenAI Indexes for dense retrieval. Each dataset, SQuAD and HotPotQA, was evaluated separately, with retrieval systems configured to fetch passages exclusively from the respective dataset to ensure focused and fair evaluation. The details of each baseline system are as follows (a code sketch of the retrieval step appears at the end of this subsection):

(1) Sparse Retrieval System (BM25): The first baseline system employed BM25 indexes for retrieval. BM25, a sparse retrieval algorithm, ranks documents based on term frequency-inverse document frequency (TF-IDF) and document length normalization. Given a query q_i, BM25 retrieves the top-k passages P_k = {p_1, p_2, ..., p_k} from the indexed collection D. These passages were then passed to the generator M to synthesize answers:

    r̂_i = M(q_i | P_k)    (4)

BM25 provides a robust and interpretable retrieval mechanism, suited for tasks involving keyword matching.

(2) Dense Retrieval System (OpenAI Indexes): The second baseline utilized OpenAI indexes (https://cookbook.openai.com/examples/evaluation/evaluate_rag_with_llamaindex), which employ dense embeddings to represent both documents and queries in a shared semantic space. For a query q_i, dense retrieval selects the top-k passages P_k that semantically align with the query, offering improved contextual understanding compared to sparse methods. These passages were similarly passed to the generator for answer synthesis as in Equation 4. This system is particularly effective for questions requiring nuanced contextual matching beyond exact term overlap.

Our experiments were conducted on both the SQuAD and HotPotQA datasets to evaluate the performance of different systems in terms of similarity to ground-truth answers, measured using BERTScore [8], as sketched below. For the RAG baselines, the top-1, top-3, top-5, and top-10 retrieved passages were used for inference. In contrast, our CAG utilized the preloaded context specific to each dataset to generate answers without retrieval constraints.
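The sketch below illustrates the retrieval step that feeds Equation 4. The paper's baselines are built on LlamaIndex with BM25 and OpenAI indexes; here the rank_bm25 package and the raw OpenAI embeddings API stand in for those components, so the library choices, the text-embedding-3-small model name, and the prompt template are illustrative assumptions rather than the authors' configuration.

```python
# Sketch of the sparse and dense retrieval baselines behind Eq. (4).
# Library choices and the embedding model name are assumptions (see lead-in).
from typing import Callable

import numpy as np
from openai import OpenAI
from rank_bm25 import BM25Okapi


def sparse_top_k(passages: list[str], query: str, k: int) -> list[str]:
    """BM25 (sparse) retrieval: rank passages by lexical overlap with the query."""
    bm25 = BM25Okapi([p.lower().split() for p in passages])
    return bm25.get_top_n(query.lower().split(), passages, n=k)


def dense_top_k(passages: list[str], query: str, k: int) -> list[str]:
    """Dense retrieval: embed passages and query, rank by cosine similarity."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    embs = client.embeddings.create(model="text-embedding-3-small",
                                    input=passages + [query]).data
    vecs = np.array([e.embedding for e in embs])
    docs, q = vecs[:-1], vecs[-1]
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    return [passages[i] for i in np.argsort(-sims)[:k]]


def rag_answer(passages: list[str], query: str, k: int,
               retrieve: Callable, generate: Callable[[str], str]) -> str:
    """Eq. (4): the generator M sees only the k retrieved passages P_k."""
    context = "\n\n".join(retrieve(passages, query, k))
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```

Either retriever plugs into the same generator M, which is what makes the top-k comparisons in Table 2 directly comparable across the sparse, dense, and CAG settings.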

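Answer quality is reported as BERTScore similarity against the golden responses. The snippet below is a minimal sketch of that evaluation step using the bert-score package [8]; reporting the F1 component and averaging it over the test set is our assumption about how the scores in Table 2 are aggregated.

```python
# Minimal sketch of the BERTScore evaluation step; aggregation choice is assumed.
from bert_score import score

def bertscore_f1(predictions: list[str], references: list[str]) -> float:
    """Average BERTScore F1 over (generated answer, golden response) pairs."""
    _, _, f1 = score(predictions, references, lang="en")
    return f1.mean().item()

# Example: compare one system's generated answers against the golden responses.
print(bertscore_f1(["Paris is the capital of France."], ["Paris"]))
```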
3.3 Results

As shown in Table 2, the experimental results revealed clear distinctions between our proposed method and traditional RAG systems. Our proposed approach achieved the highest BERTScore in most situations, outperforming both RAG baselines. By preloading the entire context from the test set, our system eliminates retrieval errors and ensures holistic reasoning over all relevant information. This advantage is particularly evident in scenarios where RAG systems might retrieve incomplete or irrelevant passages, leading to suboptimal answer generation. These results underscore the robustness and efficiency of our method, especially for tasks requiring a unified understanding of the source material. While dense retrieval methods such as OpenAI Indexes perform better than sparse retrieval methods like BM25, both are inherently limited by their dependence on retrieval accuracy and ranking heuristics. Our approach bypasses these challenges, leveraging the long-context capabilities of the Llama 3.1 model to achieve superior performance.

Table 2: Experimental Results

Size     System       Top-k   HotPotQA BERTScore   SQuAD BERTScore
Small    Sparse RAG   1       0.0673               0.7469
Small    Sparse RAG   3       0.0673               0.7999
Small    Sparse RAG   5       0.7549               0.8022
Small    Sparse RAG   10      0.7461               0.8191
Small    Dense RAG    1       0.7079               0.6445
Small    Dense RAG    3       0.7509               0.7304
Small    Dense RAG    5       0.7414               0.7583
Small    Dense RAG    10      0.7516               0.8035
Small    CAG (Ours)   -       0.7759               0.8265
Medium   Sparse RAG   1       0.6652               0.7036
Medium   Sparse RAG   3       0.7619               0.7471
Medium   Sparse RAG   5       0.7616               0.7467
Medium   Sparse RAG   10      0.7238               0.7420
Medium   Dense RAG    1       0.7135               0.6188
Medium   Dense RAG    3       0.7464               0.6869
Medium   Dense RAG    5       0.7278               0.7047
Medium   Dense RAG    10      0.7451               0.7350
Medium   CAG (Ours)   -       0.7696               0.7512
Large    Sparse RAG   1       0.6567               0.7135
Large    Sparse RAG   3       0.7424               0.7510
Large    Sparse RAG   5       0.7495               0.7543
Large    Sparse RAG   10      0.7358               0.7548
Large    Dense RAG    1       0.6969               0.6057
Large    Dense RAG    3       0.7426               0.6908
Large    Dense RAG    5       0.7300               0.7169
Large    Dense RAG    10      0.7398               0.7499
Large    CAG (Ours)   -       0.7527               0.7640

Table 3 compares our CAG approach with standard in-context learning, where the reference text is provided dynamically during inference, requiring real-time KV-cache computation. The results demonstrate that CAG dramatically reduces generation time, particularly as the reference text length increases. This efficiency stems from preloading the KV-cache, which eliminates the need to process the reference text on the fly.

Table 3: Comparison of Generation Time

Dataset    Size     System    Generation Time (s)
HotPotQA   Small    CAG       0.85292
HotPotQA   Small    w/o CAG   9.24734
HotPotQA   Medium   CAG       1.66132
HotPotQA   Medium   w/o CAG   28.81642
HotPotQA   Large    CAG       2.32667
HotPotQA   Large    w/o CAG   94.34917
SQuAD      Small    CAG       1.06509
SQuAD      Small    w/o CAG   10.29533
SQuAD      Medium   CAG       1.73114
SQuAD      Medium   w/o CAG   13.35784
SQuAD      Large    CAG       2.40577
SQuAD      Large    w/o CAG   31.08368

Moreover, CAG is also faster than traditional RAG systems, as it bypasses the retrieval stage entirely. Unlike RAG, CAG does not require retrieval or reference text input during inference, streamlining the process and further enhancing efficiency. These advantages make CAG an optimal solution for scenarios with extensive reference contexts, offering substantial time savings without compromising performance.

4 Conclusion

As long-context LLMs evolve, we present a compelling case for rethinking traditional RAG workflows. While our work emphasizes eliminating retrieval latency, there is potential for hybrid approaches that combine preloading with selective retrieval. For example, a system could preload a foundation context and use retrieval only to augment edge cases or highly specific queries. This would balance the efficiency of preloading with the flexibility of retrieval, making it suitable for scenarios where context completeness and adaptability are equally important.

References

[1] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2023. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv preprint arXiv:2312.10997 (2023).
[2] Quinn Leng, Jacob Portes, Sam Havens, Matei Zaharia, and Michael Carbin. 2024. Long Context RAG Performance of Large Language Models. arXiv preprint arXiv:2411.03538 (2024).
[3] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems 33 (2020), 9459–9474.
[4] Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky. 2024. Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. Association for Computational Linguistics, Miami, Florida, US, 881–893. https://doi.org/10.18653/v1/2024.emnlp-industry.66
[5] Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, and Yaohua Tang. 2024. TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text. arXiv:2410.07590 [cs.CV]. https://arxiv.org/abs/2410.07590
[6] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 2383–2392. https://doi.org/10.18653/v1/D16-1264
[7] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[8] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.
