Enhanced Retrieval-Augmented Reasoning With Open-Source Large Language Models
[Figure 1 graphic: three example retrieval panels for a (multi-hop) user query, each showing two knowledge passages (Andover, Kansas; Andover USD 385; Melvern, Kansas; Prince, West Virginia).]
Figure 1: Inference pipeline in our framework, OPEN-RAG. It learns to generate retrieval/no_retrieval tokens, contrasts between relevant and irrelevant contexts, and categorizes answers as partially, fully, or not supported. At inference, given a (multi-hop) user query, we first enforce the model to generate an answer conditioned on no_retrieval as input, and based on the model confidence we dynamically determine whether retrieval is needed.
grounded generation, they struggle with navigating irrelevant or misleading information, especially when dealing with complex queries such as multi-hop retrieval tasks. This limitation arises since the models are not explicitly trained to contrast harder distractor passages and adhere to the facts from the retrievals.

To confront this challenge, our framework OPEN-RAG transforms an arbitrary dense LLM into a parameter-efficient (PEFT) sparse mixture-of-experts (MoE) model (Wu et al., 2024; Komatsuzaki et al., 2022) capable not only of self-reflection but also of handling complex reasoning tasks, including both single- and multi-hop queries. It uniquely trains the model to navigate challenging distractors that appear relevant but are misleading, while expanding the MoE only in the adapters and maintaining the model's scale. By combining contrastive learning, architectural transformation, and reflection-based generation, OPEN-RAG leverages latent learning, dynamically selects relevant experts, and integrates external knowledge effectively for more accurate and contextually supported response generation, along with estimates of their usefulness.

State-of-the-art (SoTA) open-LLM-based RAG models use external models to determine if retrieval is needed; e.g., Asai et al. (2024) use GPT-4 distillation and Jeong et al. (2024b) use a fine-tuned FlanT5-XXL for Llama2. However, since LLMs possess different parametric knowledge, it may not be effective to rely on another LLM to fully determine the retrieval necessity. To determine retrieval on-demand and balance performance and speed, we propose a hybrid adaptive retrieval method with two threshold alternatives based on model confidence. We train our model to generate retrieval/no_retrieval reflection tokens and measure the confidence of outputs conditioned on enforced no_retrieval during inference. If retrieval is needed, following Asai et al. (2024), we process all retrieved passages in parallel and rank them using the weighted linear sum of the reflection token probabilities. Differently from other multi-step active or adaptive retrieval methods (Jeong et al., 2024b; Jiang et al., 2023a; Trivedi et al., 2023a), this eliminates the need for iterative generations.

In experiments, we evaluate our framework on a wide range of single/multi-hop short/long-form knowledge-intensive reasoning tasks, including the PopQA, TriviaQA, PubQA, Bio, ALCE-ASQA, HotpotQA, MuSiQue, and 2WikiMultiHopQA benchmarks. Results show that OPEN-RAG significantly improves overall factual accuracy and reasoning capabilities w.r.t. prior open-source RAG models, often matching or outperforming state-of-the-art proprietary LLMs and their RAG models. In multiple tasks, OPEN-RAG, based on Llama2-7B, sets new benchmarks, surpassing ChatGPT-RAG, Self-RAG, RAG 2.0, and 104B RAG-Command R+. Through detailed ablations, examples, and analysis, we provide further insights into the effectiveness of OPEN-RAG.
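To make the confidence-based retrieval decision concrete, the sketch below is a minimal illustration (ours, not the released OPEN-RAG code) of enforcing no_retrieval and thresholding the confidence of the resulting output; the model name, the textual [NoRT] marker, the mean-token-probability score, and the threshold value are all assumptions.

```python
# Minimal sketch of the hybrid adaptive retrieval decision (our illustration,
# not the official OPEN-RAG code). We force a no-retrieval generation, measure
# the model's confidence on that output, and retrieve only if the confidence is
# too low. Model name, prompt format, and threshold are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def no_retrieval_confidence(query: str, max_new_tokens: int = 64) -> float:
    """Generate an answer conditioned on an enforced [NoRT] marker and return
    the mean token probability of the generated answer as a confidence score
    (a min-probability variant is the other threshold alternative)."""
    prompt = f"{query} [NoRT]"            # enforce the no-retrieval reflection token
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            output_scores=True,
            return_dict_in_generate=True,
        )
    # Probability of each generated token under the model.
    gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    probs = []
    for step, token_id in enumerate(gen_tokens):
        step_probs = torch.softmax(out.scores[step][0], dim=-1)
        probs.append(step_probs[token_id].item())
    return sum(probs) / max(len(probs), 1)

def needs_retrieval(query: str, threshold: float = 0.8) -> bool:
    """Retrieve only when the forced no-retrieval answer is not confident enough."""
    return no_retrieval_confidence(query) < threshold
```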
[Figure 2 graphic: given a query q, a critic/LLM assigns reflection tokens (Retrieve / Do not Retrieve, No retrieval, Relevant / Irrelevant, Fully supported / Partially supported, Utility U) to outputs y, with and without retrieved passages.]
Figure 2: OPEN-RAG training data preparation involves generating four variations of new training instances from each original pair (q, y), each incorporating different reflection tokens using a ground-truth/LLM critic and retrieved passages. Our approach enables an LLM not only to reflect on generation quality but also to contrast distractors.
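To make this data preparation concrete, the following is a hedged sketch (our reconstruction, not the released pipeline): it turns one (q, y) pair into augmented instances carrying reflection tokens, where the `retrieve` and `critic` callables, the token-dictionary layout, and the input/output formats are assumptions.

```python
# Illustrative sketch of the training-data augmentation in Figure 2 (our
# reconstruction under stated assumptions, not the released pipeline).
# `retrieve` and `critic` are assumed callables; the critic plays the role of
# the ground-truth/LLM critic C that assigns reflection tokens.
from typing import Callable, Dict, List

def build_instances(
    q: str,
    y: str,
    retrieve: Callable[[str, int], List[str]],          # returns top-k passages
    critic: Callable[[str, str, str], Dict[str, str]],  # returns reflection tokens
    k: int = 3,
) -> List[Dict[str, str]]:
    """Create new supervised instances from one (q, y) pair by attaching
    Retrieval, Relevance, Grounding, and Utility reflection tokens."""
    instances = []
    # The critic first decides whether retrieval is needed at all.
    retrieval_token = critic(q, y, "")["retrieval"]       # "[RT]" or "[NoRT]"
    if retrieval_token == "[NoRT]":
        instances.append({"input": f"{q} [NoRT]", "output": y})
        return instances

    for passage in retrieve(q, k):
        tokens = critic(q, y, passage)
        # e.g. tokens = {"relevance": "[Relevant]", "grounding": "[Fully Supported]",
        #                "utility": "[U:5]"}; irrelevant (distractor) passages are
        # kept so the model learns to contrast them.
        instances.append({
            "input": f"{q} [RT] {passage}",
            "output": f"{tokens['relevance']} {y} {tokens['grounding']} {tokens['utility']}",
        })
    return instances
```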
2 OPEN-RAG: Enhanced Retrieval-Augmented Reasoning

OPEN-RAG transforms an arbitrary dense LLM into a parameter-efficient sparse MoE model capable not only of self-reflection but also of handling complex reasoning tasks. Additionally, we devise an adaptive hybrid retrieval schema to balance the retrieval frequency and speed trade-off. Below we first present the overview of OPEN-RAG and then discuss the training, including dataset and fine-tuning, and hybrid adaptive inference.

2.1 Overview

We define OPEN-RAG LLM as a model M_G that, given an input query q¹, generates an output sequence of m tokens o = [o_1, o_2, ..., o_m]. To control model behavior and generate more context-supported responses, we adopt the reflection-based generation from Self-RAG (Asai et al., 2024) and augment output vocabularies with four types of special reflection tokens: Retrieval, Relevance, Grounding and Utility. During training, given q, the model learns to first generate the Retrieval tokens ([RT]/[NoRT]) that indicate whether retrieval is necessary to answer q.² During inference, we employ a hybrid adaptive retrieval schema, leveraging both the Retrieval tokens and model confidence. If no retrieval is needed, M_G generates the response using only the parametric knowledge of the LLM (i.e., return o as y_pred). If retrieval is needed, for both single- or multiple-hop from an external knowledge source D = {d_i}_{i=1}^{N_d}, we use a user-defined frozen retriever R to retrieve the top-k documents S = {s_t}_{t=1}^{k}, where each s_t consists of {r_j}_{j=1}^{N_H} with r_j ∈ D and N_H denoting the hop size. For each retrieved content s_t, M_G generates a Relevance token, the output response y_t, a Grounding token, and a Utility token. The Relevance tokens ([Relevant]/[Irrelevant]) indicate if s_t is relevant to q, the Grounding tokens ([Fully Supported]/[Partially Supported]/[No Support]) indicate if y_t is supported by s_t, and the Utility tokens ([U:1]-[U:5]) define how useful y_t is to q. We process each s_t in parallel and generate the final answer y_pred by ranking them (i.e., all y_t) based on the weighted sum of the normalized confidence of the corresponding predicted Relevance, Grounding, and Utility tokens (see Figure 1).³

¹ With additional contexts if provided.
² For long-form generation, we also use the [Continue] token, which indicates that the model can continue to use information from the previous segment.
³ For long-form generation, we use the same segment-level beam search strategy as in Self-RAG (Asai et al., 2024) to obtain the Top-B segments, where B is the beam size, and return the best sequence at the end of generation.
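A minimal sketch of the passage-ranking step described in the Overview above (our illustration; the weights, the per-token probability fields, and the toy candidates are assumptions rather than the paper's exact scoring function):

```python
# Minimal sketch of ranking the parallel per-passage answers y_t by a weighted
# linear sum of normalized reflection-token confidences (our illustration).
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    answer: str
    p_relevant: float        # normalized P([Relevant]) among Relevance tokens
    p_supported: float       # normalized P([Fully Supported]) among Grounding tokens
    p_utility: float         # normalized confidence over [U:1]..[U:5]

def rank_candidates(
    candidates: List[Candidate],
    w_rel: float = 1.0, w_sup: float = 1.0, w_use: float = 0.5,  # assumed weights
) -> Candidate:
    """Return the candidate with the highest weighted linear sum of the
    normalized reflection-token confidences (single pass, no iterative generation)."""
    def score(c: Candidate) -> float:
        return w_rel * c.p_relevant + w_sup * c.p_supported + w_use * c.p_utility
    return max(candidates, key=score)

# Usage: each retrieved passage, processed in parallel, yields one Candidate;
# rank_candidates then picks y_pred.
cands = [
    Candidate("answer derived from passage 1", 0.91, 0.88, 0.74),
    Candidate("answer derived from passage 2", 0.12, 0.35, 0.40),
]
print(rank_candidates(cands).answer)
```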
2.2 OPEN-RAG Training

Here, we discuss our training data collection (Sec 2.2.1) and parameter-efficient MoE fine-tuning (Sec 2.2.2).

2.2.1 Data Collection

To empower OPEN-RAG to tackle retrieval-free queries, as well as single- and multi-hop queries that require retrieval, we build our training data using various types of tasks and datasets. Given an input-output data pair (q, y) in an original dataset, we augment the data with reflection tokens (Sec. 2.1) leveraging ground truth annotation or critic LLM C to create supervised data. If the corresponding Retrieval token added by C is [RT], we further augment the data and create three different new instances accordingly as follows. First, we use R to retrieve the top-k documents S. For each retrieved document s_t, C evaluates whether s_t is relevant or not and returns the Relevance token. To address both single- and multi-hop queries, we
equip our data pipeline with a hop-unified heuristic: if at least one passage {r_j} ∈ s_t is relevant, we add the Relevance token as [Relevant]; otherwise, we use [Irrelevant]. When [Relevant] ...

[Architecture figure: a Dense Block (Norm, Attention, Norm, FFN) has its weights copied into a Parameter-Efficient MoE FFN; components are marked as Frozen or Trainable.]
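As a rough companion to the architecture figure above, the following is our loose PyTorch sketch of a parameter-efficient MoE FFN in the sparse-upcycling spirit (Komatsuzaki et al., 2022; Wu et al., 2024): the copied dense FFN stays frozen and only a router plus small adapter experts are trained. The expert count, adapter width, and the dense-for-clarity routing below are illustrative assumptions, not the exact OPEN-RAG layer.

```python
# Sketch of a parameter-efficient MoE FFN: frozen copied dense weights plus
# trainable adapter experts selected by a top-2 router (our loose illustration).
import torch
import torch.nn as nn

class PEFTMoEFFN(nn.Module):
    def __init__(self, dense_ffn: nn.Module, d_model: int,
                 num_experts: int = 8, adapter_dim: int = 64, top_k: int = 2):
        super().__init__()
        self.dense_ffn = dense_ffn
        for p in self.dense_ffn.parameters():   # copied dense weights stay frozen
            p.requires_grad = False
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, adapter_dim), nn.SiLU(),
                          nn.Linear(adapter_dim, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, d_model)
        base = self.dense_ffn(x)                            # frozen dense path
        gate = torch.softmax(self.router(x), dim=-1)        # (tokens, num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)        # top-2 experts per token
        # Dense-for-clarity routing: run every expert, keep only top-k gate weights.
        sparse_gate = torch.zeros_like(gate).scatter(-1, idx, weights)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (tokens, d, E)
        adapter_out = (expert_out * sparse_gate.unsqueeze(1)).sum(-1)
        return base + adapter_out

# Usage with small toy dimensions.
ffn = nn.Sequential(nn.Linear(64, 256), nn.SiLU(), nn.Linear(256, 64))
layer = PEFTMoEFFN(ffn, d_model=64)
print(layer(torch.randn(4, 64)).shape)   # torch.Size([4, 64])
```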
Table 1: Model performances on RAG tasks. Pop, TQA, Pub, Bio, Hotpot, MuSiQue, and 2WikiMH denote PopQA, TriviaQA, PubHealth, Biography Generations, HotpotQA, MuSiQue-Ans, and 2WikiMultihopQA. Acc, FS, SM, rg, mau, EM, and F1 denote accuracy, FactScore (factuality), str-em, rouge (correctness), MAUVE (fluency), exact match, and F1 scores. #: evaluated using 'gpt-3.5-turbo-instruct' instead of 'text-davinci-003'. *: using a 4-bit quantized model. †: using a proprietary retriever with Tree-of-Thought prompting. ‡: OPEN-RAG model with 7.8B total and 7.0B active parameters. Gray results are best performances with larger/proprietary models.
trained and reinforced with private data such as ChatGPT (Ouyang et al., 2022). For instruction-tuned LMs, we utilize the official system prompt or instruction format of the corresponding model.

Baselines with retrievals. We evaluate models incorporating retrieval during both testing and training phases, focusing on standard Retrieval-Augmented Generation (RAG) baselines with open-source Large Language Models (LLMs) like Llama2, Alpaca, and LongChat (Li et al., 2023). These models generate outputs based on queries alongside top retrieved documents using our retriever. We also present results for RAG baselines utilizing private data, including RAG-ChatGPT, RAG 2.0 (Contextual.AI, 2024), and RAG-Command R+ (Cohere Team, 2024), which prepend top-retrieved documents to the query. Additionally, we assess RQ-RAG (Chan et al., 2024), which employs proprietary retriever models. Finally, our comparisons extend to Perplexity.ai, Self-RAG (Asai et al., 2024), and SAIL (Luo et al., 2023), which are also finetuned with retrieved texts.

4 Results and Analysis

Here, we (i) evaluate the RAG models, (ii) demonstrate the effectiveness of our adaptive retrieval in balancing performance and speed, and (iii) present ablation studies and further analysis.

4.1 Main Results

Comparison against baselines without retrieval. Table 1 (top and middle blocks) shows the performance of open-source baselines without retrieval.
[Figure 4 plots: Accuracy (%) vs. Retrieval Proportion (top) and Accuracy (%) vs. Model Confidence Partition (bottom) for PopQA, PubHealth, and TriviaQA under the scoring functions fmeanp, fminp, and fret.]
Figure 4: (Top) Performance vs. retrieval by different adaptive retrieval strategies. (Bottom) Performance vs. confidence scores from adaptive retrieval. fret denotes the probability score from the external-model-distilled/predicted reflection token.
OPEN-RAG demonstrates substantial performance gains over all supervised fine-tuned LLMs, many of which are larger in size (e.g., 65B CoVE), and even outperforms ChatGPT across all metrics and tasks. Particularly in multi-hop reasoning tasks such as HotpotQA, OPEN-RAG achieves a significant EM score of 63.3%, surpassing Alpaca-13B's 0.7%. In contrast, while ChatGPT achieves a decent score of 22.4% EM in HotpotQA, its performance drops notably in other multi-hop tasks like MuSiQue, where it achieves only 3.1% EM while OPEN-RAG achieves a much higher 41.6% EM, highlighting OPEN-RAG's robustness and effectiveness in complex query handling compared to both open-source and proprietary LLMs.

Comparison against baselines with retrieval. As shown in Table 1 (bottom), OPEN-RAG consistently outperforms existing open-source RAG models, even those larger in size. It achieves the top performance among non-proprietary LM-based models across all tasks, with the exception of TriviaQA and PubQA, where it is marginally surpassed (by 1.2% and 0.4%, respectively) by the larger Self-RAG-13B model, and of a single metric within the ALCE-ASQA dataset, where it is surpassed by Alpaca-13B.

We observe that while baseline open-source RAG models achieve higher accuracy in single-hop reasoning tasks, even surpassing strong proprietary models like RAG-ChatGPT, their performance significantly lags in multi-hop reasoning tasks. Our contrastive learning of the distractor contexts substantially enhances the reasoning in OPEN-RAG and empowers it to outperform the proprietary RAG-ChatGPT in all complex multi-hop datasets.
Moreover, OPEN-RAG surpasses RAG 2.0 and 104B Command R+, which are specifically built for RAG tasks, in HotpotQA (63.3% vs. 60.0% EM) and PubQA (75.9% vs. 46.3% Acc). In long-form generation, proprietary models often achieve higher scores, but ours remains highly competitive. For instance, RAG-Command R+ attains a FactScore (FS) of 84.0% in Bio, slightly outperforming OPEN-RAG's 82.2%. In addition, our OPEN-RAG 13B+8×213M model outperforms all baselines in all multi-hop tasks, and all open baselines in all short-form tasks, and shows competitive performance with the proprietary models. These results highlight the superior ability of OPEN-RAG to effectively integrate and utilize retrieved information, enhancing both reasoning accuracy and fluency across varying complexities and both short- and long-form generations.

4.2 Performance-Speed by Adaptive Retrieval

As discussed in Sec 2.3, given a query, the adaptive retrieval method provides a probability/confidence score from the model. By thresholding on that score, we can control the retrieval frequency, balance the performance-speed trade-off, and determine when retrieval is needed. A better scoring method should achieve higher accuracy at any retrieval frequency. To demonstrate our hybrid adaptive retrieval scoring over the existing reflection-token probability-based method fret in Self-RAG, in Figure 4 we plot the downstream accuracy vs. retrieval frequency (top) and accuracy vs. confidence score (bottom) for the PopQA, PubHealth, and TriviaQA datasets by sweeping different threshold values γ (larger γ causes less retrieval) from 0 to 1. In Figure 4 (bottom), we notice that for fmeanp or fminp the accuracy increases with higher confidence values, with fmeanp being more robust, showing monotonically increasing accuracy with higher confidence scores consistently across all datasets. In the case of fret, no such pattern exists. Overall (top), as these benchmarks are knowledge-intensive, they typically perform better with retrieved contexts, and our adaptive scoring shows a better determination of when to retrieve and when not to, resulting in higher accuracy at any retrieval frequency. In fact, the advantage is more amplified in PubHealth, where we can find a clear threshold confidence score above which retrieved data are less effective than the parametric knowledge. This yields a peak accuracy 1% higher than always retrieving, which cannot be determined by Self-RAG.
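For a concrete, simplified view of these scoring functions, the sketch below is our illustration of fminp and fmeanp computed over the token probabilities of the enforced no-retrieval answer, plus a threshold sweep that traces an accuracy-versus-retrieval-frequency curve. The decision direction, threshold grid, and toy data are assumptions and do not reproduce the paper's exact γ parameterization.

```python
# Sketch of the two confidence scores used for hybrid adaptive retrieval and of
# a Figure-4-style threshold sweep (our illustration). `token_probs` holds the
# per-token probabilities of the answer generated under enforced no_retrieval.
from statistics import mean
from typing import Callable, List, Sequence, Tuple

def f_minp(token_probs: Sequence[float]) -> float:
    """Confidence as the minimum token probability of the forced no-retrieval answer."""
    return min(token_probs)

def f_meanp(token_probs: Sequence[float]) -> float:
    """Confidence as the mean token probability of the forced no-retrieval answer."""
    return mean(token_probs)

def sweep_thresholds(
    examples: List[Tuple[Sequence[float], bool, bool]],
    score_fn: Callable[[Sequence[float]], float],
    thresholds: Sequence[float] = tuple(i / 20 for i in range(21)),
):
    """For each threshold, retrieve when confidence < threshold (assumed rule) and
    report (retrieval proportion, accuracy). Each example is
    (token_probs, correct_without_retrieval, correct_with_retrieval)."""
    curve = []
    for t in thresholds:
        hits, retrievals = 0, 0
        for probs, correct_no_ret, correct_ret in examples:
            if score_fn(probs) < t:       # low confidence -> retrieve
                retrievals += 1
                hits += correct_ret
            else:
                hits += correct_no_ret
        curve.append((retrievals / len(examples), hits / len(examples)))
    return curve

# Toy usage: one query answerable from parametric knowledge, one that is not.
toy = [([0.95, 0.9, 0.97], True, True), ([0.3, 0.5, 0.4], False, True)]
print(sweep_thresholds(toy, f_meanp))
```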
4.3 Ablation Studies

NE | k | Epochs | PopQA (Acc) | PubHealth (Acc) | MuSiQue (EM) | MuSiQue (F1)
8  | 2 | 1      | 59.8        | 74.6            | 39.6         | 54.4
16 | 2 | 1      | 59.2        | 74.6            | 40.5         | 54.4
16 | 4 | 1      | 59.0        | 72.4            | 40.5         | 54.5
8  | 2 | 2      | 58.3        | 75.9            | 41.6         | 55.3
Table 2: Ablation study model performances.

[Chart residue: Performance (%) bars for CRAG 88.8, Self-RAG 86.2, Self-CRAG 82.2, Open-RAG 78.3, 81.8.]
... its potential for improvement with high-quality contexts.

Routing Analysis of OPEN-RAG. We perform routing analysis for the PopQA, PubHealth, HotpotQA, and 2WikiMultihopQA tasks to demonstrate Top-2 expert activation in different layers during retrieval-free generation by OPEN-RAG, as illustrated in Figure 6. We observe that E7 is a general expert that is highly activated in the first (Layer 1), middle (Layer 16), and final (Layer 32) layers for all datasets, whereas E2 is activated in the first layer while E6 is activated mostly in the final layer. In the middle layer, we also observe a higher activation of E5 and a lower activation of E7 in the PopQA and PubHealth datasets (single-hop), but the opposite in the case of multi-hop datasets, showing that the experts implicitly learn to identify query complexity and play important roles across layers for ...

[Figure 6 panels: expert activation vs. Expert Index (1-8) for each layer.]
Figure 6: Layer-wise expert activation on single-hop (PopQA, PubHealth) vs multi-hop tasks (HotpotQA, MuSiQue).
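To illustrate how such layer-wise routing statistics might be collected, the snippet below is a hypothetical analysis sketch: it hooks each layer's router, counts the top-2 experts selected per token, and summarizes them per layer. The router interface, layer indices, and toy modules are assumptions, not the paper's analysis code.

```python
# Hypothetical sketch of collecting layer-wise top-2 expert-activation counts
# (our illustration, not the released analysis code).
from collections import Counter
from typing import Dict
import torch
import torch.nn as nn

def attach_routing_counters(routers: Dict[int, nn.Module], top_k: int = 2):
    """routers maps layer index -> router module producing per-token expert logits."""
    counts: Dict[int, Counter] = {layer: Counter() for layer in routers}

    def make_hook(layer: int):
        def hook(_module, _inputs, output):
            # output: (tokens, num_experts) expert logits for this layer
            top = output.topk(top_k, dim=-1).indices        # (tokens, top_k)
            counts[layer].update(top.flatten().tolist())
        return hook

    handles = [m.register_forward_hook(make_hook(layer)) for layer, m in routers.items()]
    return counts, handles

# Toy usage: three layers, 8 experts; random inputs stand in for a real run.
routers = {layer: nn.Linear(16, 8) for layer in (1, 16, 32)}
counts, handles = attach_routing_counters(routers)
for router in routers.values():
    router(torch.randn(10, 16))            # simulated forward passes
print({layer: c.most_common(2) for layer, c in counts.items()})
for h in handles:
    h.remove()
```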
[Figure 7 chart: Performance (%) of Self-RAG, Open-RAG-Dense, and Open-RAG-MoE on MuSiQue, PopQA, HotpotQA, PubHealth, and ALCE-ASQA.]
Figure 7: Performances (MAUVE for ALCE-ASQA; EM for HotpotQA and MuSiQue-Ans; and accuracy for PopQA and PubHealth) with different architectures.

5 Related work

Complex factual reasoning requires contextualizing information from multiple documents (Trivedi et al., 2022; Yang et al., 2018b). Prior works (Khattab et al., 2022; Press et al., 2023; Pereira et al., 2023; Khot et al., 2023) proposed decomposing multi-hop queries into single-hop queries, then repeatedly using LLMs and retrievers. In addition, Jiang et al. (2023b) retrieved new documents if the tokens within generated sentences have low confidence. However, the performance improvement of these approaches often comes at the cost of resource-intensive techniques such as interleaved Chain-of-Thought (Yao et al., 2023; Trivedi et al., 2023b; Zhang et al., 2024b) or Tree-of-Thought (Chan et al., 2024) reasoning with document retrieval, and of requiring external models (Jeong et al., 2024b). In this work, we train a single MoE model capable of answering complex questions in one iteration with a minimal increase in model complexity.

6 Conclusion

To enhance reasoning capabilities in RAG models with open-source LLMs, we develop OPEN-RAG featuring a PEFT MoE architecture, contrastive learning, and adaptive retrieval. OPEN-RAG shows significant performance improvements in complex reasoning tasks, outperforming SoTA methods. However, there is still a gap in tasks like long-form generation compared to proprietary models, which we aim to address in future work.

7 Limitations

OPEN-RAG has a higher memory footprint due to an increase in total parameters (7.81B) in comparison to the Llama2-7B family baselines (6.74B). But our OPEN-RAG outperforms open LLMs with total parameters ranging from 7B to 65B, rivaling proprietary models such as ChatGPT, Perplexity.ai, and Command R+ in various downstream tasks. Thus, relative to its performance, OPEN-RAG eventually reduces the compute and memory cost, with 7.01B active parameters during inference. Additionally, as our framework is general, a future direction can be building stronger sparse-upcycled LLMs based on recent models such as Llama3-8B and Mistral-7B utilizing the OPEN-RAG multi-hop training dataset. Although our approach is theoretically applicable to any domain, future work can explore developing high-performance domain-specific RAG based on our OPEN-RAG.

Acknowledgement

We thank the anonymous reviewers for their valuable feedback on the paper. We also thank Mohamed El Banani and Amr Keleg for fruitful discussions. We are grateful to the Qatar Computing Research Institute for providing compute and OpenAI APIs. Shayekh Bin Islam is supported by the Fatima Al-Fihri Predoctoral Fellowship sponsored by Hugging Face. This work was supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019, the University of California Office of the President, and the University of California San Diego's California Institute for Telecommunications and Information Technology/Qualcomm Institute. Thanks to CENIC for the 100Gbps networks.
References

Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. 2023. Retrieval-based language models and applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pages 41–46, Toronto, Canada. Association for Computational Linguistics.

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.

Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open LLM leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.

Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. RQ-RAG: Learning to refine queries for retrieval augmented generation. arXiv preprint arXiv:2404.00610.

Cohere Team. 2024. Introducing Command R+: A scalable LLM built for business. https://cohere.com/blog/command-r-plus-microsoft-azure. [Accessed 14-06-2024].

Contextual.AI. 2024. Introducing RAG 2.0. https://contextual.ai/introducing-rag2/. [Accessed 14-06-2024].

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint.

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387.

Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023a. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488, Singapore. Association for Computational Linguistics.

Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023b. Enabling large language models to generate text with citations. arXiv preprint arXiv:2305.14627.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. CoRR, abs/2011.01060.

Minbyul Jeong, Jiwoong Sohn, Mujeen Sung, and Jaewoo Kang. 2024a. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. arXiv preprint arXiv:2401.15269.

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. 2024b. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. arXiv preprint arXiv:2403.14403.

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023a. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983.

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023b. Active retrieval augmented generation. In EMNLP 2023.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024.

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2023. Decomposed Prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. 2022. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474.

Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. 2023. How long can context length of open-source LLMs truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.

Xin Liu, Muhammad Khalifa, and Lu Wang. 2023. LitCab: Lightweight calibration of language models on outputs of varied lengths. arXiv preprint arXiv:2310.19208.

Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. 2022. QUARK: Controllable text generation with reinforced unlearning. In Advances in Neural Information Processing Systems.

Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen Meng, and James Glass. 2023. SAIL: Search-augmented instruction learning. arXiv preprint arXiv:2305.15225.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511.

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023a. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore. Association for Computational Linguistics.

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023b. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.

Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems.

Md Rizwan Parvez. 2024. Evidence to Generate (E2G): A single-agent two-step prompting for context grounded and retrieval augmented reasoning. arXiv preprint arXiv:2401.05787.

Jayr Alencar Pereira, Robson do Nascimento Fidalgo, Roberto de Alencar Lotufo, and Rodrigo Frassetto Nogueira. 2023. Visconde: Multi-document QA with GPT-3 and neural reranking. In Advances in Information Retrieval - 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2-6, 2023, Proceedings, Part II, volume 13981 of Lecture Notes in Computer Science, pages 534–543. Springer.

Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. 2021. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In Advances in Neural Information Processing Systems.

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR.

Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. 2022. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023a. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Association for Computational Linguistics.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023b. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 10014–10037. Association for Computational Linguistics.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: Reinforced ranker-reader for open-domain question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. 2023. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377.

Haoyuan Wu, Haisheng Zheng, and Bei Yu. 2024. Parameter-efficient sparsity crafting from dense to mixture-of-experts for instruction tuning on general tasks. arXiv preprint arXiv:2401.02731.

Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023. RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation. arXiv preprint arXiv:2310.04408.

Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. 2024. Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018a. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018b. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.