Prompt                                      Llama-3.1-8B-Instruct            Llama-3.1-70B-Instruct
                                            HotpotQA  MuSiQue  2WikiMQA      HotpotQA  MuSiQue  2WikiMQA
Default                                     55.5      33.0     66.0          60.0      54.0     77.0
Direct answer                               49.0      28.5     55.0          61.5      51.5     74.0
Think step-by-step (Kojima et al., 2022)    62.5      50.5     77.5          75.5      62.5     85.0
Fact-and-reflection (Zhao et al., 2024b)    67.0      49.0     76.5          78.0      62.0     84.0
Plan-and-solve (Wang et al., 2023a)         64.0      49.5     82.0          74.0      68.5     85.5
Table 1: Comparison of various prompting methods. The best result is highlighted in bold.
Self-improving method for rEAsoning over LONG-contexts (SEALONG). This involves first sampling multiple reasoning trajectories from the LLM, then scoring each based on Minimum Bayes Risk (MBR) (Bickel and Doksum, 1977), which prioritizes outputs that are more consistent with the others. This idea is intuitive, as reasoning trajectories that deviate from the majority are more likely to be hallucinations (Manakul et al., 2023; Farquhar et al., 2024). Following this, we can either conduct supervised fine-tuning using high-scoring outputs or apply preference optimization using both high-scoring and low-scoring outputs.

We apply SEALONG to several leading LLMs and conduct evaluations on multiple long-context reasoning tasks (Bai et al., 2023; Yang et al., 2018; Trivedi et al., 2022; Ho et al., 2020; Dasigi et al., 2021). The results reveal that LLMs can self-improve in long-context reasoning. Specifically, SEALONG raises the score of Llama-3.1-8B-Instruct (Dubey et al., 2024) from 50.8 to 55.0. Additionally, SEALONG enables Qwen-2.5-14B-Instruct (Yang et al., 2024a) to outperform its 32B variant (54.7 vs. 53.1). Compared with previous synthetic data, SEALONG demonstrates notable improvements without requiring human or expert-model annotation. We hope that SEALONG can pave the way for self-improving approaches in long-context scenarios, supporting the continual advancement of LLM capabilities.

2 Understanding the Potential of LLMs in Long-context Reasoning

We explore the potential of LLMs in long-context reasoning through experiments on three reasoning-intensive tasks from LongBench (Bai et al., 2023): HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022) and 2WikiMQA (Ho et al., 2020). These tasks involve handling multiple documents within the context and addressing multi-hop questions that span several paragraphs. Following previous work (Mallen et al., 2023; Asai et al., 2024; Yen et al., 2024b), we use substring exact match (SubEM) for evaluation, assessing whether the golden answer is included in the output.
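To make the metric concrete, the following is a minimal sketch of SubEM as described above. The normalization (lowercasing, whitespace collapsing) is our own assumption for illustration; the paper does not specify these details.

```python
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace; the exact normalization used in the
    # paper is not specified, so this choice is an illustrative assumption.
    return re.sub(r"\s+", " ", text.lower()).strip()

def subem(prediction: str, golden_answers: list[str]) -> float:
    # Substring exact match: 1.0 if any golden answer appears verbatim
    # (after normalization) inside the model output, else 0.0.
    pred = normalize(prediction)
    return float(any(normalize(ans) in pred for ans in golden_answers))

# Example: the answer string only needs to be contained in the output.
print(subem("Let's think step by step ... so the answer is Paris.", ["Paris"]))  # 1.0
```

Because SubEM only checks containment, longer outputs can be favored by the metric; this possible shortcut is examined later in §4.3 (Tab. 4).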
2.1 Prompting Strategies Matter

Numerous long-context evaluation benchmarks assess LLMs by simply asking them to respond to a query based on a long context (Bai et al., 2023; An et al., 2024a; Zhang et al., 2024d; Wang et al., 2024b; Yen et al., 2024b). We suggest that this approach may underestimate LLMs' potential in long-context scenarios, particularly for questions requiring complex, multi-step reasoning to arrive at an answer. To further investigate this, we examine various prompting strategies for long-context reasoning, including:

• Default: Prompting the LLM with the long context and a question.

• Direct Answer: Asking the LLM to directly answer the question based on the long context.

• Think Step-by-step: Providing the LLM with the context, question, and an instruction to think step-by-step (Kojima et al., 2022).

• Fact-and-reflection: Providing the LLM with the long context, question, and an instruction to first identify the relevant information from the long context, then carry out step-by-step reasoning, and provide the answer (Zhao et al., 2024b; Li et al., 2024a).

• Plan-and-solve: Providing the LLM with the long context, question, and an instruction to first devise a plan and then follow it to solve the problem step-by-step (Wang et al., 2023a).

The detailed prompts for these strategies are presented in Tab. 9 (Appx. B). As shown in Tab. 1, prompting strategies play a crucial role in long-context reasoning.
Figure 1: Scaling up the number of sampled outputs improves the performance of both the oracle sample and MBR
decoding (§3.1). The results are based on Llama-3.1-8B-Instruct.
A notable performance gap exists between default prompting and reasoning-targeted prompting strategies, aligning with observations in short-context tasks (Wei et al., 2022; Zhou et al., 2023). Manual inspection reveals that with an appropriate prompting strategy, the LLM breaks down multi-hop questions into simpler parts, addresses each part using the long context, and ultimately arrives at an answer.

2.2 The Potential of LLMs for Correct Long-context Reasoning

We further investigate the potential of LLMs for long-context reasoning by expanding the generation space. Specifically, we use temperature sampling to produce multiple outputs per question, evaluate each with SubEM, and designate the highest-scoring output as the oracle sample. As shown in Fig. 1, there is a notable gap between oracle performance and that of greedy search, even with just 8 outputs. Scaling up to 128 samples achieves over 90% correct answers. These results underscore the potential of LLMs for long-context reasoning and motivate the development of methods that enable LLMs to self-improve in this area.
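The oracle-sample procedure can be sketched as follows. The `generate` callable is a hypothetical wrapper around the model's sampling API (not an interface defined in the paper), the temperature value is illustrative, and the sketch reuses the SubEM helper shown in §2.

```python
from typing import Callable

def oracle_sample(generate: Callable[[str, float], str], prompt: str,
                  golden_answers: list[str], n: int = 128,
                  temperature: float = 0.7) -> tuple[str, float]:
    """Draw n temperature samples and return the best one under SubEM (the oracle)."""
    outputs = [generate(prompt, temperature) for _ in range(n)]
    # subem() is the helper sketched in Section 2; oracle accuracy over a dataset
    # is then the fraction of questions whose best sampled output scores 1.0.
    scored = [(subem(o, golden_answers), o) for o in outputs]
    best_score, best_output = max(scored, key=lambda pair: pair[0])
    return best_output, best_score
```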
3 SEALONG

Motivated by the potential of LLMs in long-context reasoning (§2), we propose SEALONG, a self-improving method for reasoning over long contexts. This approach consists of two stages: creating self-supervision and fine-tuning the model. An overview of SEALONG is provided in Fig. 2.

3.1 Creating Self-supervision

In the first stage, we sample multiple reasoning trajectories for each question and its corresponding long context. The primary challenge lies in evaluating these outputs. The fundamental idea behind SEALONG is that correct reasoning trajectories typically exhibit higher semantic consistency. For example, they tend to follow similar planning steps and reference the same information within the long context. This observation aligns with hallucination detection methods (Manakul et al., 2023; Farquhar et al., 2024), where less consistent outputs are more likely to indicate hallucinations, representing incorrect reasoning in our scenario.

We formalize this idea using Minimum Bayes Risk (MBR) (Bickel and Doksum, 1977; Bertsch et al., 2023; Wu et al., 2024), which prioritizes outputs that exhibit higher consistency with others. In the MBR literature, the quality of an output is assessed by its expected utility under the model distribution (Bickel and Doksum, 1977; Kumar and Byrne, 2004; Tromble et al., 2008):

\[ s(y) = \mathbb{E}_{y^{*} \sim \pi_{\theta}(y \mid x)}\big[u(y, y^{*})\big] \]

Here, s(y) is the score assigned to output y, where x denotes the input, including the long context, question, and instruction. The term π_θ(y | x) represents the policy distribution of the LLM, and the utility metric u(y, y*) assesses y based on y*. We approximate this expectation using the Monte Carlo method with N sampled outputs:

\[ s(y) \approx \frac{1}{N} \sum_{i=1}^{N} u\big(y, y^{(i)}\big) \]

where y^(1), …, y^(N) are outputs sampled from π_θ(y | x).
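As a toy illustration of how consensus drives this score (our own example, not taken from the paper), suppose N = 3 trajectories are sampled and the utility is a similarity in [0, 1]:

```latex
% Toy example with N = 3 sampled trajectories y^{(1)}, y^{(2)}, y^{(3)}.
% Suppose y^{(1)} and y^{(2)} reach the same answer via similar steps,
% while y^{(3)} hallucinates, giving pairwise utilities
%   u(y^{(1)}, y^{(2)}) = 0.9,  u(y^{(1)}, y^{(3)}) = 0.2,  u(y^{(2)}, y^{(3)}) = 0.3.
% Including the self-term u(y, y) = 1 for simplicity, the Monte Carlo scores are
\begin{align*}
s(y^{(1)}) &\approx \tfrac{1}{3}(1.0 + 0.9 + 0.2) = 0.70,\\
s(y^{(2)}) &\approx \tfrac{1}{3}(0.9 + 1.0 + 0.3) = 0.73,\\
s(y^{(3)}) &\approx \tfrac{1}{3}(0.2 + 0.3 + 1.0) = 0.50,
\end{align*}
% so the outlier trajectory y^{(3)} receives the lowest score and the
% consensus trajectories are preferred.
```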
Figure 2: SEALONG consists of two stages: self-supervision creation and fine-tuning. Given a long context and a corresponding query, multiple outputs are sampled, each assigned a score based on Minimum Bayes Risk. Fine-tuning is then conducted using either the highest-scoring output for supervised fine-tuning or both high-scoring and low-scoring outputs for preference optimization.
The utility metric u(y, y*) measures the semantic alignment between the two reasoning trajectories. Formally:

\[ u(y, y^{*}) = \mathrm{Sim}\big(\mathrm{Emb}(y), \mathrm{Emb}(y^{*})\big) \]

We employ a lightweight RoBERTa-based model (Liu, 2019) to embed outputs and measure similarity with the inner product. This approach allows us to assign each output y a score s(y), and selecting the output with the highest score is referred to as MBR decoding (Bickel and Doksum, 1977; Bertsch et al., 2023; Wu et al., 2024). As demonstrated in Fig. 1, MBR decoding substantially surpasses greedy search, with absolute improvements of 11.5% on MuSiQue (Trivedi et al., 2022), 5.0% on HotpotQA (Yang et al., 2018), and 5.0% on 2WikiMultihopQA (Ho et al., 2020) when N = 128. These results highlight the potential for LLMs to self-improve by leveraging multiple samples and an effective evaluation metric based on output consensus, eliminating the need for human experts or advanced models. Furthermore, this evaluation approach produces preference pairs by contrasting high-scoring and low-scoring outputs, allowing straightforward preference optimization (Ouyang et al., 2022; Rafailov et al., 2024).
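The scoring step amounts to a few lines of linear algebra. The sketch below assumes a sentence-embedding callable that returns unit-normalized vectors (so the inner product serves as the similarity); it is an illustration of the procedure described above, not the released implementation.

```python
import numpy as np

def mbr_scores(outputs: list[str], embed) -> np.ndarray:
    """Monte Carlo MBR scores: average pairwise utility of each output.

    embed: callable mapping a list of strings to an (N, d) array of
    (assumed unit-normalized) sentence embeddings.
    """
    E = np.asarray(embed(outputs))      # (N, d)
    U = E @ E.T                         # pairwise inner-product utilities u(y_i, y_j)
    return U.mean(axis=1)               # s(y_i) ~= (1/N) * sum_j u(y_i, y_j)

def build_preference_pair(outputs: list[str], embed, rng=np.random):
    """Chosen = MBR decoding output; rejected = a random low-scoring output."""
    scores = mbr_scores(outputs, embed)
    chosen = outputs[int(scores.argmax())]
    # "Low-scoring" is taken here as the lower-scoring half; the exact cutoff
    # used by the authors is an assumption on our part.
    low_half = np.argsort(scores)[: max(1, len(outputs) // 2)]
    rejected = outputs[int(rng.choice(low_half))]
    return chosen, rejected, scores
```

Note that the diagonal self-similarity term is included in the average above; with unit-normalized embeddings it adds the same constant to every score and therefore does not change the ranking.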
3.2 Fine-tuning

Leveraging self-provided supervision, we can either perform supervised fine-tuning on the highest-scoring outputs or apply preference optimization using preference pairs.

Supervised Fine-tuning. For supervised fine-tuning (SFT), we minimize the negative log-likelihood of the output as follows:

\[ \mathcal{L}_{\mathrm{SFT}} = -\frac{1}{|y|}\log \pi_{\theta}(y \mid x) = -\frac{1}{|y|}\sum_{i=1}^{|y|}\log \pi_{\theta}(y_{i} \mid x, y_{<i}) \]

Here, y denotes the MBR decoding output.

Preference Optimization. Alternatively, we can conduct preference optimization to reinforce the tendency toward high-scoring outputs and reduce the likelihood of low-scoring outputs. Among the various preference optimization methods, we adopt the monolithic odds ratio preference optimization (ORPO) algorithm (Hong et al., 2024) due to its strong empirical performance. ORPO introduces an odds ratio loss to minimize the negative log odds ratio between a preferred output y_w and a less-preferred output y_l:

\[ \mathcal{L}_{\mathrm{OR}} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_{\theta}(y_{w} \mid x)}{\mathrm{odds}_{\theta}(y_{l} \mid x)}\right) \]

Here, σ represents the sigmoid function, and odds_θ(y | x) measures how much more likely y is to be generated than not:

\[ \mathrm{odds}_{\theta}(y \mid x) = \frac{\pi_{\theta}(y \mid x)}{1 - \pi_{\theta}(y \mid x)} \]

The final objective in ORPO combines the SFT and OR losses, with a hyperparameter β controlling their relative importance:

\[ \mathcal{L}_{\mathrm{ORPO}} = \mathcal{L}_{\mathrm{SFT}} + \beta \cdot \mathcal{L}_{\mathrm{OR}} \]
Model                                        Qasper  MultiFieldQA-En  HotpotQA  MuSiQue  2WikiMQA  Avg.
Qwen-2.5-7B-Instruct (Yang et al., 2024a)    21.0    28.0             70.5      48.0     77.5      49.0
+ SEALONG                                    26.0    29.3             72.5      51.5     79.5      51.8
Qwen-2.5-14B-Instruct (Yang et al., 2024a)   21.0    32.0             73.0      52.0     83.0      52.2
+ SEALONG                                    24.0    30.0             75.0      57.0     87.5      54.7
Llama-3.1-8B-Instruct (Dubey et al., 2024)   29.0    29.3             64.0      49.5     82.0      50.8
+ SEALONG                                    32.5    31.3             68.0      58.5     84.5      55.0
Qwen-2.5-32B-Instruct (Yang et al., 2024a)   24.5    26.0             72.0      55.0     88.0      53.1
Qwen-2.5-72B-Instruct (Yang et al., 2024a)   27.0    28.7             74.5      58.5     89.0      55.5
Llama-3.1-70B-Instruct (Dubey et al., 2024)  30.0    33.3             74.0      68.5     85.5      58.3
GPT-4o (Hurst et al., 2024)                  21.5    28.0             74.5      64.0     84.0      54.4
Table 2: Main evaluation results. Substring exact match (SubEM) serves as the evaluation metric, with the top-performing results emphasized in bold. SEALONG utilizes the training set of MuSiQue with self-supervision (§3.1), and its performance on other tasks demonstrates the generalization ability of SEALONG.
Task             # Examples  Max Tokens  Avg. Tokens
Qasper           200         21,110      4,921
MultiFieldQA-en  150         14,947      6,888
HotpotQA         200         16,322      12,779
MuSiQue          200         16,335      15,542
2WikiMultihopQA  200         16,319      7,096

Table 3: Statistics of evaluation tasks, with token counts calculated using the tokenizer of Llama-3.1-8B-Instruct.

Model                  Avg. Long-context  Avg. Output Tokens
Qwen-2.5-Instruct 7B   49.0               375
+ SEALONG              51.8               371
Llama-3.1-Instruct 8B  50.8               289
+ SEALONG              55.0               295

Table 4: Average performance on long-context tasks (Tab. 2) and average token count in model predictions for these tasks, measured with the model's tokenizer.

In our implementation, we use the MBR decoding output as y_w and randomly select a low-scoring output to serve as y_l.
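Written out in code, the combined objective looks roughly as follows. This is a minimal PyTorch-style sketch of the losses defined in §3.2, operating on per-token log-probabilities that the caller has already gathered for the chosen and rejected responses; it is not the training implementation used in the paper (see §4.1 and Appx. A for the actual setup).

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor, rejected_logps: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """ORPO objective from per-token log-probs of y_w (chosen) and y_l (rejected).

    chosen_logps / rejected_logps: 1-D tensors holding log pi_theta(y_i | x, y_<i)
    for each token of the respective response (already gathered by the caller).
    """
    # Length-normalized sequence log-likelihoods, (1/|y|) * log pi_theta(y | x).
    logp_w = chosen_logps.mean()
    logp_l = rejected_logps.mean()

    # L_SFT: negative log-likelihood of the chosen (MBR decoding) output.
    sft_loss = -logp_w

    # log odds(y | x) = log(p / (1 - p)), computed in log space for stability
    # via log(1 - exp(logp)) = log1p(-exp(logp)).
    log_odds_w = logp_w - torch.log1p(-torch.exp(logp_w))
    log_odds_l = logp_l - torch.log1p(-torch.exp(logp_l))

    # L_OR: -log sigmoid of the log odds ratio between chosen and rejected.
    or_loss = -F.logsigmoid(log_odds_w - log_odds_l)

    return sft_loss + beta * or_loss
```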
4 Experiments

4.1 Implementation

SEALONG requires query and long-context pairs to synthesize training data. Specifically, we leverage the training dataset of MuSiQue (Trivedi et al., 2022), where each question is related to several Wikipedia documents. To achieve a specified number of tokens in the context, we randomly sample some unrelated documents, shuffle them with the related ones, and concatenate them into a single context. We use the original questions in MuSiQue without the annotated answers, relying on the LLM to produce self-supervision (§3.1). For each question, we sample N = 32 outputs with a sampling temperature of 0.7. By default, we synthesize 2048 examples for fine-tuning, with context lengths randomly specified between 4K and 31K tokens. We conduct experiments using the Llama-3.1 models (Dubey et al., 2024) and Qwen-2.5 models (Yang et al., 2024a), with jina-embeddings-v3 serving as the sentence embedding model (Sturua et al., 2024). ORPO (Hong et al., 2024) is employed as the default fine-tuning method. More training details can be found in Appx. A.
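The context-assembly step described above can be sketched as follows; the document fields, tokenizer handle, and token-budgeting logic are our own assumptions for illustration rather than the paper's preprocessing script.

```python
import random

def build_long_context(related_docs: list[str], distractor_pool: list[str],
                       tokenizer, target_tokens: int) -> str:
    """Pad the related (gold) documents with random distractors up to a token budget,
    then shuffle so the relevant evidence is not always in the same position."""
    def n_tokens(text: str) -> int:
        return len(tokenizer.encode(text))

    docs = list(related_docs)
    budget = target_tokens - sum(n_tokens(d) for d in docs)

    pool = distractor_pool[:]
    random.shuffle(pool)
    for doc in pool:
        cost = n_tokens(doc)
        if cost > budget:
            continue
        docs.append(doc)
        budget -= cost

    random.shuffle(docs)
    return "\n\n".join(docs)

# Hypothetical usage: assemble a ~16K-token context for one MuSiQue question.
# context = build_long_context(gold_docs, wiki_distractors, tok, target_tokens=16_000)
```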
4.2 Evaluation Setup

We conduct evaluations in long-context scenarios across a wide range of tasks. For single-document QA, we include Qasper (Dasigi et al., 2021) and MultiFieldQA-En (Bai et al., 2023) from the LongBench benchmark (Bai et al., 2023). For multi-document QA, we use HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022) and 2WikiMultihopQA (Ho et al., 2020), also from LongBench. Task statistics are presented in Tab. 3. We adopt plan-and-solve prompting for evaluation due to its strong performance (Tab. 1). Following previous research (Mallen et al., 2023; Asai et al., 2024; Yen et al., 2024b), we use substring exact match (SubEM) as the evaluation metric, measuring whether the output contains the golden answer.

4.3 Main Results

SEALONG Improves Various Models. We implement SEALONG on the leading open-source LLMs, including Qwen-2.5 models (Yang et al., 2024a) and Llama-3.1 models (Dubey et al., 2024). As illustrated in Tab. 2, SEALONG brings notable improvements: when implemented on Qwen-2.5-7B-Instruct, it closes the performance gap with Qwen-2.5-14B-Instruct (51.8 vs. 52.2); when applied to Qwen-2.5-14B-Instruct, it even exceeds the performance of Qwen-2.5-32B-Instruct (54.7 vs. 53.1). Additionally, SEALONG yields an absolute improvement of 4.2 on Llama-3.1-8B-Instruct, outperforming GPT-4o (Hurst et al., 2024) (55.0 vs. 54.4). Although SEALONG utilizes MuSiQue for data synthesis, it achieves strong performance across other tasks as well, highlighting its generalization potential. One possible shortcut of SEALONG is producing more tokens, as the evaluation metric, SubEM, might favor outputs with more tokens, which are more likely to contain the golden answer. To explore this, we examine output token counts. As shown in Tab. 4, SEALONG has minimal effect on the number of output tokens.
Model                     Qasper  MultiFieldQA-En  HotpotQA  MuSiQue  2WikiMQA  Avg.
Llama-3.1-8B-Instruct     29.0    29.3             64.0      49.5     82.0      50.8
Supervised Fine-tuning
+ TULU-V2-mix             26.5    27.3             49.5      27.5     54.0      37.0
+ WildChat                20.5    29.3             46.5      28.0     58.0      36.5
+ LongAlpaca              22.5    31.3             48.0      31.0     45.0      35.6
+ LongAlign               25.0    36.7             58.5      47.5     76.0      48.7
+ LongMIT                 20.0    30.0             56.0      36.0     66.5      41.7
+ LongReward-SFT          22.0    28.7             58.0      52.0     76.5      47.4
+ GPT-4o-MuSiQue          21.5    31.3             64.0      54.0     83.5      50.9
+ SEALONG-SFT             28.5    30.7             68.5      50.5     84.0      52.4
Preference Optimization
+ UltraFeedback           26.0    27.3             47.5      28.5     46.0      35.1
+ LongReward-Preference   26.5    32.0             63.5      52.0     80.5      50.9
+ SEALONG                 32.5    31.3             68.0      58.5     84.5      55.0

Table 5: A comparison between SEALONG and previous datasets. The results are based on Llama-3.1-8B-Instruct fine-tuned on the corresponding dataset. To ensure fairness, 2K examples are randomly sampled from each dataset, with the exception of TULU-V2-mix, WildChat, and UltraFeedback, where the longest 2K examples are selected. The preference optimization strategy is ORPO (Hong et al., 2024).
Dataset                        Supervision    Avg. Tokens
TULU-V2-mix (2023)             [1], [2], [3]  3,788
WildChat (2024a)               [2], [3]       32,230
LongAlpaca (2024b)             [1], [4]       9,160
LongAlign (2024)               [4]            16,881
LongMIT (2024c)                [5]            78,412
LongReward-SFT (2024b)         [6]            22,206
LongReward-Preference (2024b)  [6]            22,689
UltraFeedback (2023)           [3]            1,356
GPT-4o-MuSiQue                 [7]            18,476
SEALONG                        [8]            18,532

Table 6: Dataset statistics, including supervision source and average token count, measured with the Llama-3.1-8B-Instruct tokenizer. Sources: [1] Human, [2] GPT-3.5-Turbo (OpenAI, 2022), [3] GPT-4 (Achiam et al., 2023), [4] Claude (Anthropic, 2023), [5] Qwen2-72B-Instruct (Yang et al., 2024a), [6] GLM-4 (GLM et al., 2024), [7] GPT-4o (Hurst et al., 2024), and [8] Self.

SEALONG Competes with Previous Datasets. We compare SEALONG with several previous datasets, including short-context datasets such as TULU-V2-mix (Ivison et al., 2023), WildChat (Zhao et al., 2024a), and UltraFeedback (Cui et al., 2023), as well as long-context datasets including LongAlpaca (Chen et al., 2024b), LongAlign (Bai et al., 2024), LongMIT (Chen et al., 2024c), and LongReward (Zhang et al., 2024b). Additionally, we utilize GPT-4o to synthesize data using the same questions and long contexts as SEALONG, creating a dataset we term GPT-4o-MuSiQue. Dataset statistics are presented in Tab. 6. To ensure fairness, 2K examples are randomly sampled from each dataset, with the exception of TULU-V2-mix, WildChat, and UltraFeedback, where the longest 2K examples are selected. As demonstrated in Tab. 5, most previous datasets negatively affect the performance of Llama-3.1-8B-Instruct, consistent with the observation of Gao et al. (2024). We hypothesize that this is because Llama-3.1-8B-Instruct already has strong long-context processing capabilities, and additional training on low-quality synthetic data could diminish its performance. However, we observe a performance improvement with SEALONG (50.8 to 55.0), indicating that self-improvement holds promise, which is particularly encouraging as current LLMs advance rapidly.
4.4 Analysis

Table 7: Comparison of various scoring methods and greedy search. Each scoring method evaluates 16 outputs sampled from Llama-3.1-8B-Instruct. The results indicate the performance of the highest-scoring output for each method.

… examples provide limited benefit. This suggests that SEALONG is unlocking the inherent potential of LLMs for long-context reasoning rather than introducing a new skill that would require more data.
Model                  Long-Context Avg.  MMLU  GSM8K  ARC-Challenge  HellaSwag  Winogrande  TruthfulQA  Short-Context Avg.
Qwen-2.5-7B-Instruct   49.0               74.2  82.4   67.1           81.5       74.7        64.7        74.1
+ SEALONG              51.8               74.1  83.2   66.5           81.3       74.4        64.8        74.1
Llama-3.1-8B-Instruct  50.8               68.3  77.7   60.2           80.1       77.4        54.1        69.6
+ SEALONG              55.0               68.4  77.8   60.3           79.9       77.3        53.8        69.6

Table 8: Evaluation results on short-context tasks from the Open LLM Leaderboard (Beeching et al., 2023), with the long-context average performance referenced from Tab. 2. SEALONG demonstrates a marked improvement in long-context performance, with minimal impact on short-context performance.
As shown in Tab. 8, while SEALONG achieves substantial improvements in long-context performance, it has minimal impact on short-context performance.

5 Related Work

Long-context Language Modeling. Numerous studies explore methods to extend the long-context processing abilities of LLMs. One line of research addresses this challenge from a model-centered perspective, with some studies focusing on minimal modifications to existing LLMs, such as adjustments to position embeddings (Chen et al., 2023; Peng et al., 2024; Ding et al., 2024; Zhu et al., 2024; Xiong et al., 2024) and refinements to the attention mechanism (Ding et al., 2023; Jin et al., 2024; An et al., 2024b,c). Additionally, some works propose novel architectures for efficient long-context processing (Wu et al., 2022; Bertsch et al., 2024; Wang et al., 2024d; Yen et al., 2024a; Lieber et al., 2024; Ye et al., 2024; Sun et al., 2024). Another line of research adopts a data-centric perspective, focusing on data engineering strategies. For example, Dubey et al. (2024); Lieber et al. (2024); Fu et al. (2024); Gao et al. (2024) continue pre-training models on long sequences, while An et al. (2024d); Bai et al. (2024); Zhang et al. (2024b); Chen et al. (2024c,b) leverage expert models or human annotations to create long-context data for fine-tuning. In contrast to these approaches, this work aims to facilitate the self-improvement of LLMs in long-context reasoning.

Self-improving. The self-improvement of LLMs has become a vital area of research as these models advance toward human-level intelligence. Research in this area follows two main approaches. The first approach investigates the self-reflection capabilities of LLMs, where models are prompted to assess and refine their own outputs (Ganguli et al., 2023; Madaan et al., 2024; Shinn et al., 2024; Xie et al., 2024; Gou et al., 2024; Chen et al., 2024a; Pan et al., 2024). However, the reliability of such self-refinement has been questioned in recent studies (Huang et al., 2024; Jiang et al., 2024). The second approach generates synthetic training data through the models themselves. This process typically involves generating multiple outputs for a given input, filtering out inaccurate results based on ground truths, and using the remaining correct responses for model fine-tuning (Zelikman et al., 2022; Hosseini et al., 2024; Pang et al., 2024; Wang et al., 2024c; Gulcehre et al., 2023; Zhang et al., 2024a). Additionally, Yuan et al. (2024) fine-tune LLMs to assign rewards to their own outputs using human preference data and facilitate continual improvement in instruction following. To reduce reliance on human annotations, some studies adopt consensus-based supervision, designating the output with the higher consensus across multiple outputs as better, with applications in areas such as arithmetic and logical reasoning (Huang et al., 2023; Prasad et al., 2024), machine translation (Finkelstein and Freitag, 2024; Wang et al., 2024a; Yang et al., 2024b), and instruction following (Wu et al., 2024). SEALONG first reveals the underestimated potential of LLMs in long-context reasoning and then leverages a consensus-based supervision strategy to enable LLMs to self-improve in long-context reasoning.

6 Conclusion

In this study, we investigate the potential of LLMs to self-improve in long-context reasoning and propose SEALONG for this purpose. This method achieves substantial improvements across multiple long-context reasoning tasks. We hope this research will open new avenues for self-improvement in long-context reasoning, which is vital for the sustained progress of LLMs, particularly as they advance toward surpassing human intelligence.
Limitations

We recognize that this work has several limitations that warrant further investigation.

Scoring Method. To establish self-supervision (§3.1), we score each output according to Minimum Bayes Risk (MBR), which reflects consensus across multiple sampled outputs. However, a substantial performance gap remains between the highest MBR-scored output and the oracle sample (see Fig. 1 for details). Future research should explore more effective approaches for self-evaluation of outputs. One possible direction could involve examining the critic capabilities of LLMs in long-context scenarios (Lan et al., 2024b; Lin et al., 2024; Lan et al., 2024a).

Synthetic Data. Another limitation of this work is its reliance on MuSiQue (Trivedi et al., 2022) for synthetic data, which consists of multi-hop questions spanning multiple paragraphs. While this approach has enabled some progress, MuSiQue does not cover all challenging question types, such as those requiring full-context reasoning, which remains a key limitation of current long-context LLMs (Karpinska et al., 2024; Wang et al., 2024b; Vodrahalli et al., 2024; Yen et al., 2024b). We advocate for future work to prioritize the creation of high-quality prompt sets, which are essential for the development of long-context LLMs.

Experimental Setup. Due to computational limitations, we restrict the implementation of SEALONG to LLMs with up to 14B parameters, though its effectiveness at larger scales warrants further investigation. Likewise, the maximum sequence length is set to 32K tokens, whereas current leading LLMs support context lengths of up to 128K tokens or more. We leave the exploration of longer context lengths for future work.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. 2024a. L-eval: Instituting standardized evaluation for long context language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand. Association for Computational Linguistics.
Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. 2024b. Training-free long-context scaling of large language models. In Forty-first International Conference on Machine Learning.
Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. 2024c. Why does the effective context length of llms fall short? arXiv preprint arXiv:2410.18745.
Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, and Jian-Guang Lou. 2024d. Make your llm fully utilize the context. arXiv preprint arXiv:2404.16811.
Anthropic. 2023. Anthropic: Introducing claude 2.1.
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.
Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. 2024. Longalign: A recipe for long context alignment of large language models. arXiv preprint arXiv:2401.18058.
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. 2023. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open llm leaderboard.
Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew Gormley. 2024. Unlimiformer: Long-range transformers with unlimited length input. Advances in Neural Information Processing Systems, 36.
Amanda Bertsch, Alex Xie, Graham Neubig, and Matthew R Gormley. 2023. It's mbr all the way down: Modern generation techniques through the lens of minimum bayes risk. In Proceedings of the Big Picture Workshop, pages 108–122.
P.J. Bickel and K.A. Doksum. 1977. Mathematical Statistics: Basic Ideas and Selected Topics. Prentice Hall.
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024a. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations.
Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2024b. LongLoRA: Efficient fine-tuning of long-context large language models. In The Twelfth International Conference on Learning Representations.
Zhi Chen, Qiguang Chen, Libo Qin, Qipeng Guo, Haijun Lv, Yicheng Zou, Wanxiang Che, Hang Yan, Kai Chen, and Dahua Lin. 2024c. What are the essential factors in crafting effective long context multi-hop instruction datasets? insights and best practices. arXiv preprint arXiv:2409.01893.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. 2021. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36.
Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei. 2023. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.
Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. 2024. LongRoPE: Extending LLM context window beyond 2 million tokens. In Forty-first International Conference on Machine Learning.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630.
Mara Finkelstein and Markus Freitag. 2024. MBR and QE finetuning: Training-time distillation of the best and most expensive decoding methods. In The Twelfth International Conference on Learning Representations.
Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. 2024. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171.
Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, et al. 2023. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459.
Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. 2024. How to train long-context language models (effectively). arXiv preprint arXiv:2410.02660.
Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. 2024. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793.
Zhibin Gou, Zhihong Shao, Yeyun Gong, yelong shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2024. CRITIC: Large language models can self-correct with tool-interactive critiquing. In The Twelfth International Conference on Learning Representations.
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. 2023. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In International Conference on Learning Representations.
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625.
Jiwoo Hong, Noah Lee, and James Thorne. 2024. Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691.
Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. 2024. V-STaR: Training verifiers for self-taught reasoners. In First Conference on Language Modeling.
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. 2024. Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654.
Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2023. Large language models can self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1051–1068.
Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations.
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276.
Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A Smith, Iz Beltagy, et al. 2023. Camels in a changing climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702.
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. 2023. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509.
Dongwei Jiang, Jingyu Zhang, Orion Weller, Nathaniel Weir, Benjamin Van Durme, and Daniel Khashabi. 2024. Self-[in]correct: Llms struggle with refining self-generated responses. arXiv preprint arXiv:2404.04298.
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations.
Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. 2024. LLM maybe longLM: Selfextend LLM context window without tuning. In Forty-first International Conference on Machine Learning.
Greg Kamradt. 2023. Needle in a haystack - pressure testing llms.
Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, and Mohit Iyyer. 2024. One thousand and one pairs: A "novel" challenge for long-context language models. arXiv preprint arXiv:2406.16264.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.
Shankar Kumar and Bill Byrne. 2004. Minimum bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176.
Tian Lan, Wenwei Zhang, Chengqi Lyu, Shuaibin Li, Chen Xu, Heyan Huang, Dahua Lin, Xian-Ling Mao, and Kai Chen. 2024a. Training language models to critique with multi-agent feedback. arXiv preprint arXiv:2410.15287.
Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen, and Xian-ling Mao. 2024b. Criticbench: Evaluating large language models as critic. arXiv preprint arXiv:2402.13764.
Mosh Levy, Alon Jacoby, and Yoav Goldberg. 2024. Same task, more tokens: the impact of input length on the reasoning performance of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand. Association for Computational Linguistics.
Huayang Li, Pat Verga, Priyanka Sen, Bowen Yang, Vijay Viswanathan, Patrick Lewis, Taro Watanabe, and Yixuan Su. 2024a. A retrieve-then-reason framework for long-context question answering. arXiv preprint arXiv:2410.03227.
Mo Li, Songyang Zhang, Yunxin Liu, and Kai Chen. 2024b. Needlebench: Can llms do retrieval and reasoning in 1 million context window? arXiv preprint arXiv:2407.11963.
Yanyang Li, Shuo Liang, Michael Lyu, and Liwei Wang. 2024c. Making long-context language models better multi-hop reasoners. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2462–2475.
Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. 2024. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887.
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.
Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, and Yujiu Yang. 2024. CriticBench: Benchmarking LLMs for critique-correct reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand. Association for Computational Linguistics.
Yinhan Liu. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. 2024. Agentboard: An analytical evaluation board of multi-turn llm agents. arXiv preprint arXiv:2401.13178.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36.
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822.
Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017.
OpenAI. 2022. Chatgpt blog post. https://openai.com/blog/chatgpt.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. 2024. Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies. Transactions of the Association for Computational Linguistics, 12:484–506.
Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason E Weston. 2024. Iterative reasoning preference optimization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2024. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations.
Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, and Jane Yu. 2024. Self-consistency preference optimization. Preprint, arXiv:2411.04109.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.
Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, et al. 2024. jina-embeddings-v3: Multilingual embeddings with task lora. arXiv preprint arXiv:2409.10173.
Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. 2024. You only cache once: Decoder-decoder architectures for language models. arXiv preprint arXiv:2405.05254.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. Musique: Multi-hop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554.
Roy Tromble, Shankar Kumar, Franz Josef Och, and Wolfgang Macherey. 2008. Lattice minimum bayes-risk decoding for statistical machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 620–629.
Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, et al. 2024. Michelangelo: Long context evaluations beyond haystacks via latent structure queries. arXiv preprint arXiv:2409.12640.
Jun Wang, Eleftheria Briakou, Hamid Dadkhahi, Rishabh Agarwal, Colin Cherry, and Trevor Cohn. 2024a. Don't throw away data: Better sequence knowledge distillation. arXiv preprint arXiv:2407.10456.
Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023a. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2609–2634.
Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. 2024b. Leave no document behind: Benchmarking long-context llms with extended multi-doc qa. arXiv preprint arXiv:2406.17419.
Tianduo Wang, Shichen Li, and Wei Lu. 2024c. Self-training with direct preference optimization improves chain-of-thought reasoning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11917–11928.
Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. 2024d. Augmenting language models with long-term memory. Advances in Neural Information Processing Systems, 36.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Pakazad, and Graham Neubig. 2024. Better instruction-following through minimum bayes risk. arXiv preprint arXiv:2410.02902.
Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing transformers. In International Conference on Learning Representations.
Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Xie. 2024. Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems, 36.
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. 2024. Effective long-context scaling of foundation models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico. Association for Computational Linguistics.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. 2024a. Qwen2 technical report. arXiv preprint arXiv:2407.10671.
Guangyu Yang, Jinghong Chen, Weizhe Lin, and Bill Byrne. 2024b. Direct preference optimization for neural machine translation with minimum bayes risk decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 391–398.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.
Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei. 2024. Differential transformer. arXiv preprint arXiv:2410.05258.
Howard Yen, Tianyu Gao, and Danqi Chen. 2024a. Long-context language modeling with parallel context encoding. arXiv preprint arXiv:2402.16617.
Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izasak, Moshe Wasserblat, and Danqi Chen. 2024b. Helmet: How to evaluate long-context language models effectively and thoroughly. arXiv preprint arXiv:2410.02694.
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason E Weston. 2024. Self-rewarding language models. In Forty-first International Conference on Machine Learning.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800.
Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, and Jie Tang. 2024a. Rest-mcts*: Llm self-training via process reward guided tree search. arXiv preprint arXiv:2406.03816.
Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. 2024. PoSE: Efficient context window extension of LLMs via positional skip-wise training. In The Twelfth International Conference on Learning Representations.
A Training Details
To support efficient fine-tuning for long-context
scenarios, we implement sequence parallelization
(Jacobs et al., 2023) with a parallel size of 8. Addi-
tionally, we utilize QLoRA (Dettmers et al., 2024)
to reduce memory consumption during fine-tuning.
The LoRA rank, alpha, and dropout are set to 128,
128, and 0.05, respectively, with all attention and
feedforward linear layers designated as target mod-
ules. All models are fine-tuned for one epoch. The
batch size, learning rate, and maximum sequence
length are set to 8, 5e-5, and 32K, respectively.
The β for ORPO is configured to 0.1. All exper-
iments are conducted on a computing setup with
8 × H100 GPUs.
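As a rough starting point, the settings above can be expressed with the Hugging Face peft and trl libraries as sketched below. This is our own illustrative mapping of the stated hyperparameters, not the authors' released training script; exact argument names may differ across library versions, and the sequence parallelism (DeepSpeed-Ulysses) and 4-bit quantization pieces of QLoRA are configured separately and omitted here.

```python
from peft import LoraConfig
from trl import ORPOConfig

# QLoRA adapter settings from Appx. A: rank 128, alpha 128, dropout 0.05,
# applied to all attention and feed-forward linear layers (Llama/Qwen module names).
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# ORPO training settings: one epoch, effective batch size 8, learning rate 5e-5,
# 32K maximum sequence length, beta = 0.1.
orpo_config = ORPOConfig(
    output_dir="sealong-orpo",       # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=1,   # with accumulation to reach an effective batch of 8
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    max_length=32768,
    beta=0.1,
    bf16=True,
)
```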
B Prompts
We provide the prompts for various prompting
strategies (§2.1) in Tab. 9, and the prompts for the
reference-free and reference-based self-evaluation
strategies (§4.4) in Tab. 10.
Strategy: Default
Prompt:
{context}
{input}

Strategy: Direct Answer
Prompt:
{context}
{input}
Let's answer the question directly.

Strategy: Think step-by-step (Kojima et al., 2022)
Prompt:
{context}
{input}
Let's think step by step.

Strategy: Fact-and-reflection (Zhao et al., 2024b)
Prompt:
{context}
{input}
Let's first identify the relevant information from the long context and list it. Then, carry out step-by-step reasoning based on that information, and finally, provide the answer.

Strategy: Plan-and-solve (Wang et al., 2023a)
Prompt:
{context}
{input}
Let's first understand the problem and devise a plan to solve it. Then, let's carry out the plan and solve the problem step-by-step.
Table 9: The prompts for various prompting strategies (§2.1), where {context} and {input} serve as placeholders for
the long context and input query, respectively.
Strategy: Reference-free Self-Evaluation
Prompt:
[Context]
{context}
[Question]
{question}
[Predicted Response]
{prediction}
Please evaluate the correctness of the predicted response based on the context and the question. Begin your evaluation by providing a brief explanation. Be as objective as possible. After giving your explanation, you must rate the response on a scale from 1 to 5, following this format exactly: "[[rating]]". For example, "Rating: [[3]]".

Strategy: Reference-based Self-Evaluation
Prompt:
Here is a question along with two responses: one is the reference response, and the other is the predicted response. Please determine whether the two responses provide the same answer to the question. Respond with "True" or "False" directly.
[Question]
{question}
[Reference Response]
{reference}
[Predicted Response]
{prediction}
Table 10: The prompts for the reference-free and reference-based self-evaluation strategies (§4.4), where {question},
{reference}, {prediction}, and {context} serve as placeholders for their respective elements.