
OPEN-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models

arXiv:2410.01782v1 [cs.CL] 2 Oct 2024

Shayekh Bin Islam*^{1,6,7}, Md Asib Rahman*^{1}, K S M Tozammel Hossain^{2}, Enamul Hoque^{3}, Shafiq Joty^{4}, Md Rizwan Parvez^{5}

^{1}Bangladesh University of Engineering and Technology, ^{2}University of North Texas, ^{3}York University, Canada, ^{4}Salesforce Research, ^{5}Qatar Computing Research Institute (QCRI), ^{6}Fatima Al-Fihri Predoctoral Fellowship, ^{7}Cohere For AI Community

[email protected], [email protected]

*Equal contribution.

Abstract

Retrieval-Augmented Generation (RAG) has been shown to enhance the factual accuracy of Large Language Models (LLMs), but existing methods often suffer from limited reasoning capabilities in effectively using the retrieved evidence, particularly when using open-source LLMs. To mitigate this gap, we introduce a novel framework, OPEN-RAG, designed to enhance reasoning capabilities in RAG with open-source LLMs. Our framework transforms an arbitrary dense LLM into a parameter-efficient sparse mixture of experts (MoE) model capable of handling complex reasoning tasks, including both single- and multi-hop queries. OPEN-RAG uniquely trains the model to navigate challenging distractors that appear relevant but are misleading. As a result, OPEN-RAG leverages latent learning, dynamically selecting relevant experts and integrating external knowledge effectively for more accurate and contextually relevant responses. In addition, we propose a hybrid adaptive retrieval method to determine retrieval necessity and balance the trade-off between performance gain and inference speed. Experimental results show that the Llama2-7B-based OPEN-RAG outperforms state-of-the-art LLMs and RAG models such as ChatGPT, Self-RAG, and Command R+ in various knowledge-intensive tasks. We open-source our code and models at https://openragmoe.github.io/

1 Introduction

The rapid advancement of Large Language Models (LLMs) has significantly improved various NLP tasks (Beeching et al., 2023). However, these models often suffer from factual inaccuracies (Min et al., 2023a; Mallen et al., 2022). Retrieval-Augmented Generation (RAG) has emerged as a promising approach to integrate LLMs with external knowledge, thereby improving generation accuracy (Asai et al., 2023; Lewis et al., 2020). Despite this, existing RAG methods demonstrate limited reasoning capabilities, particularly when employing open-source LLMs and addressing high-complexity queries such as multi-hop retrieval-augmented tasks (Jeong et al., 2024b; Zhang et al., 2024b). Thus, building an effective RAG model using open-source LLMs remains an open challenge. To address this gap, we present OPEN-RAG, a novel framework aimed at improving reasoning capabilities in RAG with open-source LLMs.

Reasoning over retrieved documents is particularly difficult. In general, retrievers are imperfect and can return noisy passages (Shi et al., 2023). The generated outputs can also be inconsistent with retrieved passages (Gao et al., 2023a) or can even override the LLM's accurate parametric knowledge (Parvez, 2024). Approaches like re-ranking or filtering retrieved documents (Xu et al., 2023; Nogueira and Cho, 2020; Wang et al., 2018) and active retrieval methods (i.e., retrieve only when needed) (Mallen et al., 2023; Jiang et al., 2023a; Trivedi et al., 2023a) have shown promising success in tackling these, but they require substantial human annotations, can filter out useful information, often perform sequential and repetitive calls (hence slow), and can still suffer from distracting content, even in relevant passages (Wang et al., 2023).
Figure 1: Inference pipeline in our framework, OPEN-RAG. It learns to generate retrieval/no_retrieval tokens, contrasts between relevant and irrelevant contexts, and categorizes answers as partially, fully, or not supported. Then at inference, given a (multi-hop) user query, we first enforce the model to generate an answer conditioned on no_retrieval as input, and based on the model confidence we dynamically determine if retrieval is needed.

To address and control these behaviors, such as the retrieval frequency of the RAG model, and to guide the generation to be contextually consistent, Self-RAG and its variants (Asai et al., 2024; Yan et al., 2024; Jeong et al., 2024a) adopt a self-reflection-based method. During training, these models learn to generate both the task output and intermittent special reflection or critic tokens (e.g., is_supported, is_relevant, etc.), leveraging knowledge distillation from proprietary models like GPT-4. At inference, these generated tokens determine the usability of each candidate output. While these methods enable the model to effectively rank candidate outputs from different retrievals and partially improve grounded generation, they struggle with navigating irrelevant or misleading information, especially when dealing with complex queries such as multi-hop retrieval tasks. This limitation arises since the models are not explicitly trained to contrast harder distractor passages and adhere to the facts from the retrievals.

To confront this challenge, our framework OPEN-RAG transforms an arbitrary dense LLM into a parameter-efficient (PEFT) sparse mixture of experts (MoE) model (Wu et al., 2024; Komatsuzaki et al., 2022) capable not only of self-reflection but also of handling complex reasoning tasks, including both single- and multi-hop queries. It uniquely trains the model to navigate challenging distractors that appear relevant but are misleading, while expanding the MoE only in the adapters, maintaining the model's scale. By combining contrastive learning, architectural transformation, and reflection-based generation, OPEN-RAG leverages latent learning, dynamically selects relevant experts, and integrates external knowledge effectively for more accurate and contextually supported response generation, along with estimates of its usefulness.

State-of-the-art (SoTA) open-LLM-based RAG models use external models to determine if retrieval is needed; e.g., Asai et al. (2024) use GPT-4 distillation and Jeong et al. (2024b) use a fine-tuned FlanT5-XXL for Llama2. However, since LLMs possess different parametric knowledge, it may not be effective to rely on another LLM to fully determine the retrieval necessity. To determine retrieval on-demand and balance performance and speed, we propose a hybrid adaptive retrieval method with two threshold alternatives based on model confidence. We train our model to generate retrieval/no_retrieval reflection tokens and measure the confidence of outputs conditioned on enforced no_retrieval during inference. If retrieval is needed, following Asai et al. (2024), we process all retrieved passages in parallel and rank them using the weighted linear sum of reflection token probabilities. Differently from other multi-step active or adaptive retrieval methods (Jeong et al., 2024b; Jiang et al., 2023a; Trivedi et al., 2023a), this eliminates the need for iterative generations.

In experiments, we evaluate our framework on a wide range of single/multi-hop short/long-form knowledge-intensive reasoning tasks, including the PopQA, TriviaQA, PubQA, Bio, ALCE-ASQA, HotpotQA, MuSiQue, and 2WikiMultiHopQA benchmarks. Results show that OPEN-RAG significantly improves the overall factual accuracy and reasoning capabilities w.r.t. prior open-source RAG models, often matching or outperforming state-of-the-art proprietary LLMs and their RAG models. In multiple tasks, OPEN-RAG, based on Llama2-7B, sets new benchmarks, surpassing ChatGPT-RAG, Self-RAG, RAG 2.0, and 104B RAG-Command R+. Through detailed ablations, examples, and analysis, we provide further insights into the effectiveness of OPEN-RAG.
Figure 2: OPEN-RAG training data preparation involves generating four variations of new training instances from each original pair (q, y), each incorporating different reflection tokens using ground truth/LLM critic and retrieved passages. Our approach enables an LLM not only to reflect on generation quality but also to contrast distractors.
2 OPEN-RAG: Enhanced Retrieval-Augmented Reasoning

OPEN-RAG transforms an arbitrary dense LLM into a parameter-efficient sparse MoE model capable not only of self-reflection but also of handling complex reasoning tasks. Additionally, we devise an adaptive hybrid retrieval schema to balance the retrieval frequency and speed trade-off. Below we first present the overview of OPEN-RAG and then discuss the training, including the dataset and fine-tuning, and the hybrid adaptive inference.

2.1 Overview

We define the OPEN-RAG LLM as a model M_G that, given an input query q (with additional contexts if provided), generates an output sequence of m tokens o = [o_1, o_2, ..., o_m]. To control model behavior and generate more context-supported responses, we adopt the reflection-based generation from Self-RAG (Asai et al., 2024) and augment the output vocabulary with four types of special reflection tokens: Retrieval, Relevance, Grounding, and Utility. During training, given q, the model learns to first generate the Retrieval tokens ([RT]/[NoRT]) that indicate whether retrieval is necessary to answer q. (For long-form generation, we also use the [Continue] token, which indicates that the model can continue to use information from the previous segment.) During inference, we employ a hybrid adaptive retrieval schema, leveraging both the Retrieval tokens and model confidence.

If no retrieval is needed, M_G generates the response using only the parametric knowledge of the LLM (i.e., returns o as y_pred). If retrieval is needed, for both single- and multi-hop queries, we use a user-defined frozen retriever R over an external knowledge source D = {d_i}_{i=1}^{N_d} to retrieve the top-k documents S = {s_t}_{t=1}^{k}, where each s_t consists of {r_j}_{j=1}^{N_H} with r_j ∈ D and N_H denoting the hop size. For each retrieved content s_t, M_G generates a Relevance token, the output response y_t, a Grounding token, and a Utility token. The Relevance tokens ([Relevant]/[Irrelevant]) indicate if s_t is relevant to q, the Grounding tokens ([Fully Supported]/[Partially Supported]/[No Support]) indicate if y_t is supported by s_t, and the Utility tokens ([U:1]-[U:5]) define how useful y_t is to q. We process each s_t in parallel and generate the final answer y_pred by ranking them (i.e., all y_t) based on the weighted sum of the normalized confidence of the corresponding predicted Relevance, Grounding, and Utility tokens (see Figure 1). (For long-form generation, we use the same segment-level beam search strategy as in Self-RAG (Asai et al., 2024) to obtain the top-B segments, where B is the beam size, and return the best sequence at the end of generation.)
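To make the parallel ranking concrete, the sketch below shows one way the weighted sum of normalized reflection-token confidences could be computed in Python. It is a minimal illustration, not the released implementation: the candidate dictionary layout is a hypothetical schema, and only the most desirable token of each group is credited, whereas the actual scoring may also weight intermediate tokens; the default weights (1.0, 1.0, 0.5) follow Appendix B.1.

```python
# Default reflection-token weights from the paper (Appendix B.1):
# Relevance = 1.0, Grounding = 1.0, Utility = 0.5.
TOKEN_WEIGHTS = {"relevance": 1.0, "grounding": 1.0, "utility": 0.5}

def rank_candidates(candidates, weights=TOKEN_WEIGHTS):
    """Rank per-passage answers y_t and return the best one.

    Each candidate is a dict (hypothetical schema), e.g.
      {"answer": "11,791",
       "relevance": {"[Relevant]": 0.97, "[Irrelevant]": 0.03},
       "grounding": {"[Fully Supported]": 0.90,
                     "[Partially Supported]": 0.08, "[No Support]": 0.02},
       "utility": {"[U:5]": 0.60, "[U:4]": 0.30, "[U:3]": 0.10}}
    where each inner dict holds the model's probabilities over one
    reflection-token group for that passage.
    """
    def share(probs, desirable):
        # Normalize within the token group, then take the
        # desirable token's share of the probability mass.
        total = sum(probs.values())
        return probs.get(desirable, 0.0) / total if total else 0.0

    def score(c):
        return (weights["relevance"] * share(c["relevance"], "[Relevant]")
                + weights["grounding"] * share(c["grounding"], "[Fully Supported]")
                + weights["utility"] * share(c["utility"], "[U:5]"))

    return max(candidates, key=score)
```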
2.2 OPEN-RAG Training

Here, we discuss our training data collection (Sec 2.2.1) and parameter-efficient MoE fine-tuning (Sec 2.2.2).

2.2.1 Data Collection

To empower OPEN-RAG to tackle retrieval-free queries, as well as single- and multi-hop queries that require retrieval, we build our training data using various types of tasks and datasets. Given an input-output data pair (q, y) in an original dataset, we augment the data with reflection tokens (Sec. 2.1), leveraging ground truth annotation or a critic LLM C to create supervised data. If the corresponding Retrieval token added by C is [RT], we further augment the data and create three different new instances accordingly as follows. First, we use R to retrieve the top-k documents S. For each retrieved document s_t, C evaluates whether s_t is relevant or not and returns the Relevance token. To address both single- and multi-hop queries, we
equip our data pipeline with a hop-unified heuristic: if at least one passage {r_j} ∈ s_t is relevant, we add the Relevance token as [Relevant]; otherwise, we use [Irrelevant]. When [Relevant] is predicted, to enable M_G to contrast between useful and distractor contexts in s_t in a more fine-grained way, we design a data-contrastive heuristic: (i) for single-hop RAG datasets, we use C directly to label the Grounding token; (ii) for multi-hop RAG datasets, if all passages {r_j} ∈ s_t are individually predicted as [Fully Supported], then we add [Fully Supported] as the Grounding token; otherwise, we use [Partially Supported]. Finally, regardless of the prediction of the Relevance token, we use C to provide a Utility score for y with respect to q. We depict an example of the training data collection for a 2-hop question in Figure 2.

2.2.2 Parameter-Efficient MoE Finetuning

RAG tasks are inherently complex, composed of various components such as queries with single (single-hop) or multiple (multi-hop) passages. The ability to leverage different parts of the model selectively based on such complexities can facilitate more adaptive and fine-grained reasoning capabilities over versatile input contexts. Therefore, instead of traditional dense models that treat all parts uniformly, we propose to transform M_G into a MoE architecture on the fly, which learns to selectively activate the most suitable experts dynamically for each query with versatile complexity (e.g., single/multi-hop). This selective activation is learned (fine-tuned) using our tailored training data, ensuring that the model learns to differentiate between useful and misleading information.

As open-source models are often used in low-compute settings, OPEN-RAG employs sparse upcycling (Komatsuzaki et al., 2022; Wu et al., 2024) to transform M_G into a parameter-efficient sparse MoE. This approach adds only a few million adapter parameters, preserving the same order of active parameters as in the original LLM. The sparse MoE OPEN-RAG model augments the FFN layer of the dense backbone LLM with a parameter-efficient MoE transformer block consisting of a set of expert layers E = {E_e}_{e=1}^{N_E} along with an efficient routing mechanism, as in Figure 3. Each expert layer comprises a replicated original shared FFN layer weight, adapted by an adapter module A_e with parameters θ_e. To ensure parameter efficiency, in each expert, we keep the FFN layer frozen and train the adapter module A_e only. In this way, we are only required to store one FFN replica, keeping the model size unchanged except for the increase in the parameters of the adapter and router modules. The rest of the layers, such as Norm and Attention, are copied from the dense model.

Figure 3: Architecture transformation (dense to PEFT MoE) in OPEN-RAG. Router R is trained from scratch. The FFN layer is kept frozen and adapted by parallel-adapter-based experts E. Other layers are copied.

For a given input x, the router module R activates the Top-k experts out of N_E experts based on the normalized output x_in of the attention layer. Given that W_|·| denotes the weight of the corresponding module, we define the router module as follows:

R(x_in) = Softmax(Top-k(W_R · x_in))    (1)

We formulate the adapter A_e as:

A_e(x) = σ(x W_e^down) W_e^up + x.    (2)

The efficiency of the OPEN-RAG model results from the setup that |θ_e| = |W_e^down| + |W_e^up| ≪ |φ_o|, where we keep φ_o from the dense LLM frozen during fine-tuning. Finally, we express the output y of a parameter-efficient expert module as:

y = Σ_{e=1}^{N_E} R(x)_e A_e(E_e(x)).    (3)

In our implementation, we use N_E = 8 and k = 2 if not otherwise specified. In other words, only 2 of the 8 experts are active during training and inference. We train OPEN-RAG with QLoRA (Dettmers et al., 2023) adapters during fine-tuning, with a load-balancing objective along with the standard conditional language modeling objective. To mitigate the approximation error in the expert adapters, we use adapters with a dimension of 512 by default.
2.3 Hybrid Approach for Adaptive Retrieval

Since LLMs possess different parametric knowledge, instead of using other LLMs, we propose a hybrid adaptive retrieval method with two threshold alternatives based on model confidence to determine retrieval on-demand and balance the performance-speed trade-off. We take motivation from both control-token-based (Asai et al., 2024; Lu et al., 2022) and confidence-based (Liu et al., 2023; Jiang et al., 2023a) inference methods.

During training, M_G learns to generate the Retrieval reflection tokens ([RT] and [NoRT]). At inference, we measure the confidence of the output sequence o conditioned on an enforced no-retrieval setting by adding [NoRT] to the input, such that q̂ = q ⊕ [NoRT]. We design two different confidence scores f_|·|: (i) f_minp, the minimum value of the probabilities of the individual tokens, and (ii) f_meanp, the geometric mean of the probabilities of the individual tokens in the generated sequence.

f_minp(o|q̂) = min_{i=1,...,m} p(o_i | q̂, o_{<i})    (4)

f_meanp(o|q̂) = (∏_{i=1}^{m} p(o_i | q̂, o_{<i}))^{1/m}    (5)

We control the retrieval frequency with a tunable threshold γ, where retrieval occurs if f_|·| < γ.
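Both scores can be computed directly from the token log-probabilities of the answer generated with [NoRT] forced into the input, as in the minimal sketch below. The `token_logprobs` input format is an assumption, and `sweep_gamma` is a hypothetical helper illustrating how sweeping γ over [0, 1] traces the accuracy-vs-retrieval-frequency trade-off analyzed in Section 4.2.

```python
import math

def f_minp(token_logprobs):
    # Eq. (4): minimum token probability of o given q ⊕ [NoRT].
    return math.exp(min(token_logprobs))

def f_meanp(token_logprobs):
    # Eq. (5): geometric mean of token probabilities, computed
    # stably in log space as exp(mean of log-probs).
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def needs_retrieval(token_logprobs, gamma, score=f_meanp):
    # Retrieval is triggered when the no-retrieval answer is
    # low-confidence, i.e., f(o|q̂) < gamma.
    return score(token_logprobs) < gamma

def sweep_gamma(examples, gammas, score=f_meanp):
    # Trace retrieval frequency across thresholds (cf. Figure 4).
    # Each example is assumed to carry cached 'logprobs' for its
    # enforced-no-retrieval answer (hypothetical evaluation schema).
    return [{"gamma": g,
             "retrieval_rate": sum(score(ex["logprobs"]) < g
                                   for ex in examples) / len(examples)}
            for g in gammas]
```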
3 Experiments

3.1 Tasks and Datasets

Single-hop short-form tasks include PopQA (Mallen et al., 2022), TriviaQA-unfiltered (Joshi et al., 2017), and PubHealth (Zhang et al., 2023). These datasets involve answering factual questions and verifying public health facts, using retrieved contexts provided by Self-RAG. We use the accuracy metric for evaluation.

Single-hop long-form generation tasks cover biography generation (Bio) (Min et al., 2023b) and the long-form QA benchmark ALCE-ASQA (Gao et al., 2023b; Stelmakh et al., 2022). Biographies are evaluated with FactScore (Min et al., 2023b), while ALCE-ASQA uses official metrics for correctness (str-em) and fluency based on MAUVE (Pillutla et al., 2021).

Multi-hop reasoning tasks include HotpotQA (distractor dev split) (Yang et al., 2018a), MuSiQue-Ans (Trivedi et al., 2022), and 2WikiMultihopQA (Ho et al., 2020), which require systems to answer complex multi-hop questions. We use the official EM and F1 metrics for evaluation.

3.2 Experimental Settings

Training Data and Settings. In our data curation process, as detailed in Section 2.2.1, we compile a diverse set of instruction-following input-output pairs encompassing retrieval-free, single-hop, and multi-hop datasets requiring retrieval. For the no-retrieval and single-hop datasets, we utilize 150K instruction-output pairs curated by Self-RAG. For the multi-hop dataset, we randomly sample 16K two-hop instances from the HotpotQA (Yang et al., 2018b) Distractor train split, each with 10 passages annotated with ground-truth Relevance tokens. Using our data collection method from Section 2.2.1, we generate 28K new multi-hop training instances. All other reflection tokens are labeled by the Llama2-7B (Touvron et al., 2023) critic LLM in Self-RAG, which is distilled from GPT-4. Additional information regarding training is provided in Appendix Section A. Following previous works and for a fair comparison, we use Llama2-7B (Touvron et al., 2023) as the base RAG model M_G. OPEN-RAG is transformed into a MoE model with N_E = 8 and k = 2, incorporating adapters with a dimension of 512, totaling an additional 8×135M adapter parameters. Moreover, we train a larger version of OPEN-RAG based on Llama2-13B with an additional 8×213M parameters to demonstrate the scalability of our framework. By the OPEN-RAG model, we mean OPEN-RAG 7B+8×135M if not explicitly mentioned.

Inference Data and Settings. We assign default weights of 1.0, 1.0, and 0.5 to the Relevance, Grounding, and Utility tokens, respectively. Following Self-RAG, we compare model performances with always retrieval and vary the retrieval frequency as discussed in Sec 2.3 only to demonstrate optimum thresholding and performance-speed trade-offs. In multi-hop evaluations, from the corresponding retrieval candidate passages, we use Beam Retriever (Zhang et al., 2024a) to retrieve the Top-3 multi-hop contexts, each with the aforementioned N_H number of passages. For single-hop tasks, we use Self-RAG's setup (see Appendix B).

3.3 Baselines

Baselines without retrievals. We compare ours with several strong, publicly available pre-trained LLMs, including Llama2-7B,13B (Touvron et al., 2023) and SAIL-7B (Luo et al., 2023), as well as instruction-tuned models, Alpaca-7B,13B (Dubois et al., 2023). Additionally, we consider models trained and reinforced with private data, such as ChatGPT (Ouyang et al., 2022). For instruction-tuned LMs, we utilize the official system prompt or instruction format of the corresponding model.
| LM | Pop Acc | TQA Acc | Pub Acc | Bio FS | ASQA SM | ASQA rg | ASQA mau | Hotpot EM | Hotpot F1 | MuSiQue EM | MuSiQue F1 | 2WikiMH EM | 2WikiMH F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LMs with proprietary data/retriever |
| Perplexity.ai | – | – | – | 71.2 | – | – | – | – | – | – | – | – | – |
| RAG 2.0 | – | – | – | – | – | – | – | 54.0 | – | – | – | – | – |
| ChatGPT | 29.3 | 74.3 | 70.1 | 71.8 | 35.3 | 36.2 | 68.8 | 22.4 | 30.0 | 3.1 | 7.3 | 18.7 | 21.7 |
| RAG-ChatGPT | 50.8 | 65.7 | 54.7 | – | 40.7 | 39.9 | 79.7 | 55.3 | 69.9 | 31.2 | 43.5 | 44.7 | 54.8 |
| RAG-Command R+ 104B | 59.9 | 74.0 | 46.3 | 84.0 | – | – | – | 60.0 | 75.8 | 41.3 | 55.4 | 57.1 | 66.1 |
| RQ-RAG 7B (ToT)† | 57.1 | – | – | – | – | – | – | 62.6 | – | 41.7 | – | 44.8 | – |
| Baselines without retrieval |
| Llama2-7B | 14.7 | 30.5 | 34.2 | 44.5 | 7.9 | 15.3 | 19.0 | 3.8 | 9.3 | 2.0 | 3.3 | 8.0 | 14.5 |
| Alpaca-7B | 23.6 | 54.5 | 49.8 | 45.8 | 18.8 | 29.4 | 61.7 | 4.7 | 11.5 | 2.5 | 3.8 | 15.3 | 20.0 |
| SAIL-7B | 22.8 | – | – | – | – | – | – | – | – | – | – | – | – |
| Llama2-13B | 14.7 | 38.5 | 29.4 | 53.4 | 7.2 | 12.4 | 16.0 | 14.9 | 21.6 | 1.3 | 5.4 | 21.4 | 25.2 |
| Alpaca-13B | 24.4 | 61.3 | 55.5 | 50.2 | 22.9 | 32.0 | 70.6 | 0.7 | 6.1 | 0.0 | 3.3 | 3.1 | 12.0 |
| CoVE-65B | – | – | – | 71.2 | – | – | – | – | – | – | – | – | – |
| Baselines with retrieval |
| Llama2-7B | 38.2 | 48.8 | 30.0 | 78.0 | 15.2 | 22.1 | 32.0 | 5.9 | 19.4 | 3.4 | 10.5 | 11.9 | 19.2 |
| Alpaca-7B | 46.7 | 64.1 | 40.2 | 76.6 | 30.9 | 33.3 | 57.9 | 23.0 | 35.6 | 6.4 | 14.8 | 18.2 | 23.8 |
| SAIL-7B | 44.0 | – | 69.2 | – | – | – | – | – | – | – | – | – | – |
| Self-RAG-7B | 54.9 | 66.1 | 72.0 | 78.6 | 30.2 | 35.7 | 74.9 | 40.2 | 54.3 | 22.1 | 33.2 | 24.6 | 35.8 |
| Llama2-13B | 38.2 | 42.5 | 30.0 | 78.0 | 15.2 | 22.1 | 32.0 | 26.7 | 38.5 | 10.8 | 18.6 | 20.2 | 27.4 |
| Alpaca-13B | 46.1 | 66.9 | 51.1 | 77.7 | 34.8 | 36.7 | 56.6 | 12.3 | 27.3 | 2.6 | 10.7 | 7.0 | 17.1 |
| Self-RAG-13B | 56.0 | 67.5 | 76.3 | 81.1 | 31.6 | 35.9 | 69.7 | 44.2 | 58.2 | 22.2 | 40.0 | 17.7 | 31.8 |
| LongChat-13B | – | – | – | – | – | – | – | 25.0 | 40.6 | 7.9 | 18.9 | 18.2 | 29.2 |
| OPEN-RAG 7B+8×135M‡ | 58.3 | 66.3 | 75.9 | 82.2 | 31.9 | 36.7 | 84.3 | 63.3 | 76.9 | 41.6 | 55.3 | 51.5 | 61.0 |
| OPEN-RAG 13B+8×213M# | 59.5 | 69.6 | 77.2 | 81.7 | 36.3 | 38.1 | 80.0 | 66.2 | 80.1 | 46.0 | 60.1 | 60.7 | 70.9 |

Table 1: Model performances on RAG tasks. Pop, TQA, Pub, Bio, Hotpot, MuSiQue, 2WikiMH denote PopQA, TriviaQA, PubHealth, Biography Generations, HotpotQA, MuSiQue-Ans, 2WikiMultihopQA. Acc, FS, SM, rg, mau, EM, and F1 denote accuracy, FactScore (factuality), str-em, rouge (correctness), MAUVE (fluency), exact match, and F1 scores. #: evaluated using 'gpt-3.5-turbo-instruct' instead of 'text-davinci-003'. *: using 4-bit quantized model. †: using a proprietary retriever with Tree-of-Thought prompting. ‡: OPEN-RAG model with 7.8B total and 7.0B active parameters. Gray results are best performances with larger/proprietary models.

Baselines with retrievals. We evaluate models incorporating retrieval during both testing and training phases, focusing on standard Retrieval-Augmented Generation (RAG) baselines with open-source Large Language Models (LLMs) like Llama2, Alpaca, and LongChat (Li et al., 2023). These models generate outputs based on queries alongside the top retrieved documents using our retriever. We also present results for RAG baselines utilizing private data, including RAG-ChatGPT, RAG 2.0 (Contextual.AI, 2024), and RAG-Command R+ (Cohere Team, 2024), which prepend top-retrieved documents to the query. Additionally, we assess RQ-RAG (Chan et al., 2024), which employs proprietary retriever models. Finally, our comparisons extend to Perplexity.ai, Self-RAG (Asai et al., 2024), and SAIL (Luo et al., 2023), which are also finetuned with retrieved texts.

4 Results and Analysis

Here, we (i) evaluate the RAG models, (ii) demonstrate the effectiveness of our adaptive retrieval in balancing performance and speed, and (iii) present ablation studies and further analysis.
Figure 4: (Top) Performance vs. retrieval frequency for different adaptive retrieval strategies on PopQA, PubHealth, and TriviaQA. (Bottom) Performance vs. scores from adaptive retrieval. f_ret denotes the probability score from the external-model-distilled/predicted reflection token.
4.1 Main Results

Comparison against baselines without retrieval. Table 1 (top and middle blocks) shows the performance of open-source baselines without retrieval. OPEN-RAG demonstrates substantial performance gains over all supervised fine-tuned LLMs, many of which are larger in size (e.g., 65B CoVE), and OPEN-RAG even outperforms ChatGPT across all metrics and tasks. Particularly in multi-hop reasoning tasks such as HotpotQA, OPEN-RAG achieves a significant EM score of 63.3%, surpassing Alpaca-13B's 0.7%. In contrast, while ChatGPT achieves a decent score of 22.4% EM in HotpotQA, its performance drops notably in other multi-hop tasks like MuSiQue, where it achieves only 3.1% EM while OPEN-RAG achieves a much higher score of 41.6% EM, highlighting OPEN-RAG's robustness and effectiveness in complex query handling compared to both open-source and proprietary LLMs.

Comparison against baselines with retrieval. As shown in Table 1 (bottom), OPEN-RAG consistently outperforms existing open-source RAG models, even those larger in size. It achieves the top performance among non-proprietary LM-based models across all tasks, with the exception of TriviaQA and PubQA, where it is marginally surpassed (by 1.2% and 0.4%, respectively) by the larger Self-RAG-13B model, and by Alpaca-13B in a single metric within the ALCE-ASQA dataset.

We observe that while baseline open-source RAG models achieve higher accuracy, even surpassing strong proprietary models like RAG-ChatGPT in single-hop reasoning tasks, their performance significantly lags in multi-hop reasoning tasks. Our contrastive learning of the distractor contexts substantially enhances the reasoning in OPEN-RAG and empowers it to outperform the proprietary RAG-ChatGPT in all complex multi-hop datasets.

Moreover, OPEN-RAG surpasses RAG 2.0 and 104B Command R+, which are specifically built for RAG tasks, in HotpotQA (63.3% vs. 60.0% EM) and PubQA (75.9% vs. 46.3% Acc). In long-form generation, proprietary models often achieve higher scores, but ours remains highly competitive. For instance, RAG-Command R+ attains a FactScore (FS) of 84.0% in Bio, slightly outperforming OPEN-RAG's 82.2%. In addition, our OPEN-RAG 13B+8×213M model outperforms all baselines in all multi-hop tasks, outperforms all open baselines in all short-form tasks, and shows competitive performance with the proprietary models. These results highlight the superior ability of OPEN-RAG to effectively integrate and utilize retrieved information, enhancing both reasoning accuracy and fluency across varying complexities and both short- and long-form generations.

4.2 Performance-Speed by Adaptive Retrieval

As discussed in Sec 2.3, given the query, the adaptive retrieval method provides a probability/confidence score from the model. By thresholding on that score, we can control the retrieval frequency and balance the performance-speed trade-off, and this can also guide determining when retrieval is needed. A better scoring method should achieve higher accuracy at any retrieval frequency. In order to demonstrate our hybrid adaptive retrieval scoring over the existing reflection-token-probability-based method f_ret in Self-RAG, in Figure 4, we plot
the downstream accuracy vs. retrieval frequency (top), and accuracy vs. confidence score (bottom), for the PopQA, PubHealth, and TriviaQA datasets by sweeping across different threshold values γ (larger γ causes less retrieval) from 0 to 1. In Figure 4 (bottom), we notice that for f_meanp or f_minp the accuracy increases with higher values of confidence, while f_meanp is more robust, showing monotonically increasing accuracy with higher confidence scores consistently in all datasets. But in the case of f_ret, no such pattern exists. Overall (top), as these benchmarks are knowledge-intensive, they typically perform better with retrieved contexts, and our adaptive scoring shows a better determination of when to retrieve and when not, resulting in higher accuracy at any retrieval frequency. In fact, the advantage is more amplified in PubHealth, where we can find a clear threshold confidence score beyond which retrieved data are less effective than the parametric knowledge. This gives us a peak accuracy 1% higher than always retrieving, which cannot be determined by Self-RAG.

4.3 Ablation Studies

Robustness to Different Retrieval (CRAG) Methods. CRAG (Yan et al., 2024) proposes a corrective RAG method where, if corpus (e.g., Wikipedia) retrievals are detected as low-quality, a web search is performed to obtain new retrievals, which are then fed into the system. The Self-CRAG method combines both reflection-based models and CRAG-based datasets (Self-RAG + CRAG dataset). We evaluate OPEN-RAG and OPEN-CRAG (OPEN-RAG + CRAG datasets) on the PopQA, PubHealth, and Bio benchmarks using CRAG, Self-RAG (Asai et al., 2024), and Self-CRAG as baselines, as illustrated in Figure 5. OPEN-CRAG outperforms all baselines across all tasks. Specifically, OPEN-RAG achieves 2% and 4% higher accuracy than Self-CRAG in (Bio, PopQA) and PubHealth, respectively. This demonstrates OPEN-RAG's robustness to retrieval quality and its potential for improvement with high-quality contexts.

Figure 5: Model performances utilizing CRAG contexts.

Routing Analysis of OPEN-RAG. We perform routing analysis for the PopQA, PubHealth, HotpotQA, and 2WikiMultihopQA tasks to demonstrate Top-2 expert activation in different layers during retrieval-free generation by OPEN-RAG, as illustrated in Figure 6. We observe that E7 is a general expert that is highly activated in the first (Layer 1), middle (Layer 16), and final (Layer 32) layers for all datasets, whereas E2 is activated in the first layer and E6 is activated mostly in the final layer. In the middle layer, we also observe a higher activation of E5 and a lower activation of E7 in the PopQA and PubHealth datasets (single-hop), but the opposite in the case of the multi-hop datasets, showing that the experts implicitly learn to identify query complexity and play important roles across layers for different kinds of task complexities.

Sparse Upcycling Hyperparameters. We experiment with different hyper-parameters of OPEN-RAG, as shown in Table 2. We observe that increasing the number of experts N_E slightly improves the performance in MuSiQue, and that training longer (epoch 1 vs. 2) improves performance. Increasing the number of active experts k from 2 to 4 causes performance degradation, showing the necessity of fewer active experts.

| N_E | k | Epochs | PopQA Acc | PubHealth Acc | MuSiQue EM | MuSiQue F1 |
|---|---|---|---|---|---|---|
| 8 | 2 | 1 | 59.8 | 74.6 | 39.6 | 54.4 |
| 16 | 2 | 1 | 59.2 | 74.6 | 40.5 | 54.4 |
| 16 | 4 | 1 | 59.0 | 72.4 | 40.5 | 54.5 |
| 8 | 2 | 2 | 58.3 | 75.9 | 41.6 | 55.3 |

Table 2: Ablation study model performances.

Impact of Modules. It is important to understand how much gain comes from our contrastive learning and how much from the architectural transformation. In Figure 7, with reference to Self-RAG, we plot OPEN-RAG performances with both dense and MoE architectures. OPEN-RAG-Dense outperforms Self-RAG-7B by 1.8% in PopQA, 1.6% in PubHealth, 4.2% in ASQA (MAUVE), 17.9% in MuSiQue (EM), and 21.7% in HotpotQA (EM). Moreover, OPEN-RAG-MoE improves over OPEN-RAG-Dense by 1.6% in PopQA, 2.2% in PubHealth, 5.2% in ASQA (MAUVE), 1.6% in MuSiQue (EM), and 1.4% in HotpotQA (EM). Both components enhance the model significantly, with contrastive learning contributing the most.
Figure 6: Layer-wise expert activation on single-hop (PopQA, PubHealth) vs. multi-hop tasks (HotpotQA, MuSiQue).

Figure 7: Performances (MAUVE for ALCE-ASQA; EM for HotpotQA and MuSiQue-Ans; and accuracy for PopQA and PubHealth) with different architectures.

5 Related work

Complex factual reasoning requires contextualizing information from multiple documents (Trivedi et al., 2022; Yang et al., 2018b). Prior works (Khattab et al., 2022; Press et al., 2023; Pereira et al., 2023; Khot et al., 2023) proposed decomposing multi-hop queries into single-hop queries, then repeatedly using LLMs and retrievers. In addition, Jiang et al. (2023b) retrieved new documents if the tokens within generated sentences have low confidence. However, the performance improvement of these approaches often comes at the cost of resource-intensive techniques such as interleaved Chain-of-Thought (Yao et al., 2023; Trivedi et al., 2023b; Zhang et al., 2024b) or Tree-of-Thought (Chan et al., 2024) reasoning with document retrieval, and of requiring external models (Jeong et al., 2024b). In this work, we train a single MoE model capable of answering complex questions in one iteration with a minimal increase in model complexity.

6 Conclusion

To enhance reasoning capabilities in RAG models with open-source LLMs, we develop OPEN-RAG, featuring a PEFT MoE architecture, contrastive learning, and adaptive retrieval. OPEN-RAG shows significant performance improvements in complex reasoning tasks, outperforming SoTA methods. However, there is still a gap in tasks like long-form generation compared to proprietary models, which we aim to address in future work.

7 Limitations

OPEN-RAG has a higher memory footprint due to an increase in total parameters (7.81B) in comparison to the Llama2-7B family baselines (6.74B). But OPEN-RAG outperforms open LLMs with total parameters ranging from 7B to 65B, rivaling proprietary models such as ChatGPT, Perplexity.ai, and Command R+ in various downstream tasks. Thus, OPEN-RAG eventually reduces the compute and memory cost, with 7.01B active parameters during inference, relative to its performance. Additionally, as our framework is general, a future direction is to build stronger sparse-upcycled LLMs based on recent models such as Llama3-8B and Mistral-7B utilizing the OPEN-RAG multi-hop training dataset. Although our approach is theoretically applicable to any domain, future work can explore developing high-performance domain-specific RAG based on OPEN-RAG.

Acknowledgement

We thank the anonymous reviewers for their valuable feedback on the paper. We also thank Mohamed El Banani and Amr Keleg for fruitful discussions. We are grateful to the Qatar Computing Research Institute for providing compute and OpenAI APIs. Shayekh Bin Islam is supported by the Fatima Al-Fihri Predoctoral Fellowship sponsored by Hugging Face. This work was supported in part by National Science Foundation (NSF) awards CNS-1730158, ACI-1540112, ACI-1541349, OAC-1826967, OAC-2112167, CNS-2100237, CNS-2120019, the University of California Office of the President, and the University of California San Diego's California Institute for Telecommunications and Information Technology/Qualcomm Institute. Thanks to CENIC for the 100Gbps networks.
References

Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. 2023. Retrieval-based language models and applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pages 41–46, Toronto, Canada. Association for Computational Linguistics.

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.

Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open LLM leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.

Chi-Min Chan, Chunpu Xu, Ruibin Yuan, Hongyin Luo, Wei Xue, Yike Guo, and Jie Fu. 2024. RQ-RAG: Learning to refine queries for retrieval augmented generation. arXiv preprint arXiv:2404.00610.

Cohere Team. 2024. Introducing Command R+: A scalable LLM built for business. https://cohere.com/blog/command-r-plus-microsoft-azure. [Accessed 14-06-2024].

Contextual.AI. 2024. Introducing RAG 2.0. https://contextual.ai/introducing-rag2/. [Accessed 14-06-2024].

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint.

Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387.

Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023a. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488, Singapore. Association for Computational Linguistics.

Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023b. Enabling large language models to generate text with citations. arXiv preprint arXiv:2305.14627.

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. CoRR, abs/2011.01060.

Minbyul Jeong, Jiwoong Sohn, Mujeen Sung, and Jaewoo Kang. 2024a. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. arXiv preprint arXiv:2401.15269.

Soyeong Jeong, Jinheon Baek, Sukmin Cho, Sung Ju Hwang, and Jong C Park. 2024b. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. arXiv preprint arXiv:2403.14403.

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023a. Active retrieval augmented generation. arXiv preprint arXiv:2305.06983.

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. 2023b. Active retrieval augmented generation. In EMNLP 2023.

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024.

Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2023. Decomposed Prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. 2022. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474.

Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. 2023. How long can context length of open-source LLMs truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.

Xin Liu, Muhammad Khalifa, and Lu Wang. 2023. LitCab: Lightweight calibration of language models on outputs of varied lengths. arXiv preprint arXiv:2310.19208.

Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, and Yejin Choi. 2022. QUARK: Controllable text generation with reinforced unlearning. In Advances in Neural Information Processing Systems.

Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Danny Fox, Helen Meng, and James Glass. 2023. SAIL: Search-augmented instruction learning. arXiv preprint arXiv:2305.15225.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511.

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511.

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023a. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore. Association for Computational Linguistics.

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023b. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.

Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems.

Md Rizwan Parvez. 2024. Evidence to generate (E2G): A single-agent two-step prompting for context grounded and retrieval augmented reasoning. arXiv preprint arXiv:2401.05787.

Jayr Alencar Pereira, Robson do Nascimento Fidalgo, Roberto de Alencar Lotufo, and Rodrigo Frassetto Nogueira. 2023. Visconde: Multi-document QA with GPT-3 and neural reranking. In Advances in Information Retrieval - 45th European Conference on Information Retrieval, ECIR 2023, Dublin, Ireland, April 2-6, 2023, Proceedings, Part II, volume 13981 of Lecture Notes in Computer Science, pages 534–543. Springer.

Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. 2021. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In Advances in Neural Information Processing Systems.

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. 2023. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR.

Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. 2022. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. MuSiQue: Multi-hop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023a. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Association for Computational Linguistics.

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2023b. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 10014–10037. Association for Computational Linguistics.

Shuohang Wang, Mo Yu, Xiaoxiao Guo, Zhiguo Wang, Tim Klinger, Wei Zhang, Shiyu Chang, Gerry Tesauro, Bowen Zhou, and Jing Jiang. 2018. R3: Reinforced ranker-reader for open-domain question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Zhiruo Wang, Jun Araki, Zhengbao Jiang, Md Rizwan Parvez, and Graham Neubig. 2023. Learning to filter context for retrieval-augmented generation. arXiv preprint arXiv:2311.08377.

Haoyuan Wu, Haisheng Zheng, and Bei Yu. 2024. Parameter-efficient sparsity crafting from dense to mixture-of-experts for instruction tuning on general tasks. arXiv preprint arXiv:2401.02731.

Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023. RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation. Preprint, arXiv:2310.04408.

Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. 2024. Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018a. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018b. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational Linguistics.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Jiahao Zhang, Haiyang Zhang, Dongmei Zhang, Yong Liu, and Shen Huang. 2024a. End-to-end beam retrieval for multi-hop question answering. In 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics.

Tianhua Zhang, Hongyin Luo, Yung-Sung Chuang, Wei Fang, Luc Gaitskell, Thomas Hartvigsen, Xixin Wu, Danny Fox, Helen Meng, and James Glass. 2023. Interpretable unified language checking. arXiv preprint arXiv:2304.03728.

Tianjun Zhang, Shishir G Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E Gonzalez. 2024b. RAFT: Adapting language model to domain specific RAG. arXiv preprint arXiv:2403.10131.
A Training Details

We train both MoE and Dense models with LoRA rank 64, LoRA α 16, and LoRA dropout 0.1. We optimize the models with the AdamW optimizer with a linear learning rate scheduler and a weight decay of 0.0. Both models have a context length of 4096 to facilitate long-context multi-hop QAs. Other training hyper-parameters are listed in Table 3.

| LM | LR | Epochs | Quantization | Adapter Dim |
|---|---|---|---|---|
| Dense-7B | 1 × 10^-4 | 3 | None | – |
| MoE-7B | 2 × 10^-4 | 2 | QLoRA (NF4) | 512 |
| MoE-13B | 1 × 10^-4 | 2 | QLoRA (NF4) | 512 |

Table 3: Training hyper-parameters.

We train OPEN-RAG models using NVIDIA A100 GPUs with 80GB VRAM. About 40 GPU days have been spent in total during training and model development.
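For reference, the Table 3 settings correspond roughly to a PEFT configuration like the following sketch. The use of Hugging Face's peft API and the target modules are assumptions, since the paper does not state them; the target names given are typical Llama-2 projection layers, not reported values.

```python
from peft import LoraConfig

# Hedged sketch of the MoE-7B row of Table 3 as a peft config;
# target_modules are assumed, not reported by the paper.
lora_config = LoraConfig(
    r=64,                 # LoRA rank
    lora_alpha=16,        # LoRA alpha
    lora_dropout=0.1,     # LoRA dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```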
A.1 Dataset Details

The complete breakdown of the OPEN-RAG training dataset is displayed in Table 4. Algorithm 1 shows the process of the multi-hop training data preparation.

| Dataset Name | Source | Number of Instances |
|---|---|---|
| Instruction-Following |
| GPT-4 Alpaca | Open-Instruct | 26,168 |
| Stanford Alpaca | Open-Instruct | 25,153 |
| FLAN-V2 | Open-Instruct | 17,817 |
| ShareGPT | Open-Instruct | 13,406 |
| Open Assistant 1 | Open-Instruct | 9,464 |
| Knowledge-Intensive (Single-Hop) |
| Wizard of Wikipedia | KILT | 17,367 |
| Natural Questions | KILT | 15,535 |
| FEVER | KILT | 9,966 |
| OpenBookQA | HF Dataset | 4,699 |
| Arc-Easy | HF Dataset | 2,147 |
| ASQA | ASQA | 3,897 |
| Knowledge-Intensive (Multi-Hop) |
| HotpotQA (Ours) | HotpotQA | 28,117 |

Table 4: The generator LM training data statistics. Instruction-following and single-hop knowledge-intensive samples are from Self-RAG (Asai et al., 2024). We curate the multi-hop knowledge-intensive samples with reflection tokens.

B Inference Details

B.1 Inference Hyper-parameters

The weights of the Relevance, Grounding, and Utility token types are 1.0, 1.0, and 0.5, respectively, during inference of OPEN-RAG and Self-RAG. During long-form generation, we use a maximum search depth of 7 and a beam size of 2, following Self-RAG. To evaluate performance in the retrieval setting, we report the performance in the always-retrieval setup in Table 1. Next, we employ greedy decoding for OPEN-RAG and Self-RAG, and top-p (nucleus) sampling for open baseline models with temperature 0.8 and p = 0.95. We discuss the different soft retrieval constraints in Section 2.3 and Section 4.2. Moreover, we identify a bug in the implementation of the soft constraint for adaptive retrieval in Self-RAG, where the implementation utilizes the log-probability of the Retrieval token instead of the probability.

B.2 Instruction Format

We utilize a standard prompt without any complex prompting, such as Chain-of-Thought (CoT). For single-hop tasks, we follow the instruction format in Self-RAG, whereas the instruction format for multi-hop question answering is shown in Table 5.

Instructions
You are a question answering agent. Given a context and a question, your task is to answer the question based on the context. Instead of a full sentence, your answer must be the shortest word or phrase or named entity. Some example outputs 'answer' are: yes; no; Ibn Sina; Doha, Qatar; 2,132 seats, Los Angeles, California etc.

### Instruction
What administrative territorial entity is the owner of Ciudad Deportiva located?

### Response:

Table 5: Instruction example for multi-hop QAs.
Algorithm 1 OPEN-RAG Multi-Hop Training Data Preparation

Require: Critic model C, multi-hop reasoning QA collection (Q, Y) with a set of supporting contexts P_i and a set of non-supporting contexts N_i for each QA pair (q_i, y_i).
1: Output: Multi-hop input-output pairs D̂.
2: C predicts Retrieval for q_i and the Utility U of y_i for answering q_i.
3: Initialize an empty list D̂
4: for (q_i, y_i) ∈ {Q, Y} do
5:   if Retrieval == [NoRT] then
6:     ρ_0 = [NoRT] ⊕ y_i ⊕ U
7:     D̂ := D̂ ∪ {(q_i, ρ_0)}
8:   else if Retrieval == [RT] then
9:     // Relevant and fully supported context
10:    Without replacement, uniformly sample two contexts (p_i^1, p_i^2) ⊆ P_i
11:    ρ_1 = [RT] ⊕ <p> ⊕ p_i^1 ⊕ p_i^2 ⊕ </p> ⊕ [Relevant] ⊕ y_i ⊕ [Fully supported] ⊕ U
12:    // Relevant and partially supported context
13:    Randomly sample one context p_i^3 ∈ P_i
14:    Randomly sample one context n_i^1 ∈ N_i
15:    ρ_2 = [RT] ⊕ <p> ⊕ p_i^3 ⊕ n_i^1 ⊕ </p> ⊕ [Relevant] ⊕ y_i ⊕ [Partially supported] ⊕ U
16:    // Irrelevant context
17:    Without replacement, uniformly sample two contexts (n_i^2, n_i^3) ⊆ N_i
18:    ρ_3 = [RT] ⊕ <p> ⊕ n_i^2 ⊕ n_i^3 ⊕ </p> ⊕ [Irrelevant] ⊕ y_i ⊕ U
19:    D̂ := D̂ ∪ {(q_i, ρ_1), (q_i, ρ_2), (q_i, ρ_3)}
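For readers who prefer code, a direct Python transcription of Algorithm 1 for a single QA pair might look like the sketch below. The `critic` interface is illustrative, not the actual critic API, and the sampling assumes at least two supporting and two non-supporting contexts are available.

```python
import random

def prepare_multihop_instances(q, y, supporting, non_supporting, critic):
    """Build the training variants of Algorithm 1 for one QA pair.

    `critic` is assumed (hypothetically) to expose
    predict_retrieval(q) -> "[RT]"/"[NoRT]" and utility(q, y) -> "[U:k]".
    Each returned instance pairs the query with a token/context sequence.
    """
    utility = critic.utility(q, y)
    if critic.predict_retrieval(q) == "[NoRT]":
        # Retrieval-free variant (rho_0).
        return [(q, ["[NoRT]", y, utility])]
    # rho_1: two supporting contexts, relevant and fully supported.
    p1, p2 = random.sample(supporting, 2)
    rho1 = ["[RT]", "<p>", p1, p2, "</p>", "[Relevant]",
            y, "[Fully supported]", utility]
    # rho_2: one supporting plus one distractor, partially supported.
    p3 = random.choice(supporting)
    n1 = random.choice(non_supporting)
    rho2 = ["[RT]", "<p>", p3, n1, "</p>", "[Relevant]",
            y, "[Partially supported]", utility]
    # rho_3: two distractors, irrelevant.
    n2, n3 = random.sample(non_supporting, 2)
    rho3 = ["[RT]", "<p>", n2, n3, "</p>", "[Irrelevant]", y, utility]
    return [(q, rho) for rho in (rho1, rho2, rho3)]
```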
