Prompt                                      Llama-3.1-8B-Instruct            Llama-3.1-70B-Instruct
                                            HotpotQA  MuSiQue  2WikiMQA      HotpotQA  MuSiQue  2WikiMQA
Default                                     55.5      33.0     66.0          60.0      54.0     77.0
Direct answer                               49.0      28.5     55.0          61.5      51.5     74.0
Think step-by-step (Kojima et al., 2022)    62.5      50.5     77.5          75.5      62.5     85.0
Fact-and-reflection (Zhao et al., 2024b)    67.0      49.0     76.5          78.0      62.0     84.0
Plan-and-solve (Wang et al., 2023a)         64.0      49.5     82.0          74.0      68.5     85.5
Table 1: Comparison of various prompting methods. The best result is highlighted in bold.
Self-improving method for rEAsoning over LONG-contexts (SEALONG). This involves first sampling multiple reasoning trajectories from the LLM, then scoring each based on Minimum Bayes Risk (MBR) (Bickel and Doksum, 1977), which prioritizes outputs that are more consistent with the others. This idea is intuitive, as reasoning trajectories that deviate from the majority are more likely to be hallucinations (Manakul et al., 2023; Farquhar et al., 2024). Following this, we can either conduct supervised fine-tuning using high-scoring outputs or apply preference optimization using both high-scoring and low-scoring outputs.

We apply SEALONG to several leading LLMs and conduct evaluations on multiple long-context reasoning tasks (Bai et al., 2023; Yang et al., 2018; Trivedi et al., 2022; Ho et al., 2020; Dasigi et al., 2021). The results reveal that LLMs can self-improve in long-context reasoning. Specifically, SEALONG raises the score of Llama-3.1-8B-Instruct (Dubey et al., 2024) from 50.8 to 55.0. Additionally, SEALONG enables Qwen-2.5-14B-Instruct (Yang et al., 2024a) to outperform its 32B variant (54.7 vs. 53.1). Compared with previous synthetic data, SEALONG demonstrates notable improvements without requiring human or expert-model annotation. We hope that SEALONG can pave the way for self-improving approaches in long-context scenarios, supporting the continual advancement of LLM capabilities.

2 Understanding the Potential of LLMs in Long-context Reasoning

We explore the potential of LLMs in long-context reasoning through experiments on three reasoning-intensive tasks from LongBench (Bai et al., 2023): HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022) and 2WikiMQA (Ho et al., 2020). These tasks involve handling multiple documents within the context and addressing multi-hop questions that span several paragraphs. Following previous work (Mallen et al., 2023; Asai et al., 2024; Yen et al., 2024b), we use substring exact match (SubEM) for evaluation, assessing whether the golden answer is included in the output.
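To make the metric concrete, the following is a minimal sketch of SubEM as described above. The normalization (lowercasing, whitespace collapsing) is our own assumption for illustration; the paper does not specify these details.

```python
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace; the exact normalization used in the
    # paper is not specified, so this choice is an illustrative assumption.
    return re.sub(r"\s+", " ", text.lower()).strip()

def subem(prediction: str, golden_answers: list[str]) -> float:
    # Substring exact match: 1.0 if any golden answer appears verbatim
    # (after normalization) inside the model output, else 0.0.
    pred = normalize(prediction)
    return float(any(normalize(ans) in pred for ans in golden_answers))

# Example: the answer string only needs to be contained in the output.
print(subem("Let's think step by step ... so the answer is Paris.", ["Paris"]))  # 1.0
```

Because SubEM only checks containment, longer outputs can be favored by the metric; this possible shortcut is examined later in §4.3 (Tab. 4).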
2.1 Prompting Strategies Matter

Numerous long-context evaluation benchmarks assess LLMs by simply asking them to respond to a query based on a long context (Bai et al., 2023; An et al., 2024a; Zhang et al., 2024d; Wang et al., 2024b; Yen et al., 2024b). We suggest that this approach may underestimate LLMs' potential in long-context scenarios, particularly for questions requiring complex, multi-step reasoning to arrive at an answer. To further investigate this, we examine various prompting strategies for long-context reasoning, including:

• Default: Prompting the LLM with the long context and a question.

• Direct Answer: Asking the LLM to directly answer the question based on the long context.

• Think Step-by-step: Providing the LLM with the context, question, and an instruction to think step-by-step (Kojima et al., 2022).

• Fact-and-reflection: Providing the LLM with the long context, question, and an instruction to first identify the relevant information from the long context, then carry out step-by-step reasoning, and provide the answer (Zhao et al., 2024b; Li et al., 2024a).

• Plan-and-solve: Providing the LLM with the long context, question, and an instruction to first devise a plan and then follow it to solve the problem step-by-step (Wang et al., 2023a).

The detailed prompts for these strategies are presented in Tab. 9 (Appx. B). As shown in Tab. 1, prompting strategies play a crucial role in long-context reasoning.
Figure 1: Scaling up the number of sampled outputs improves the performance of both the oracle sample and MBR
decoding (§3.1). The results are based on Llama-3.1-8B-Instruct.
A notable performance gap exists between default prompting and reasoning-targeted prompting strategies, aligning with observations in short-context tasks (Wei et al., 2022; Zhou et al., 2023). Manual inspection reveals that with an appropriate prompting strategy, the LLM breaks down multi-hop questions into simpler parts, addresses each part using the long context, and ultimately arrives at an answer.

2.2 The Potential of LLMs for Correct Long-context Reasoning

We further investigate the potential of LLMs for long-context reasoning by expanding the generation space. Specifically, we use temperature sampling to produce multiple outputs per question, evaluate each with SubEM, and designate the highest-scoring output as the oracle sample. As shown in Fig. 1, there is a notable gap between oracle performance and that of greedy search, even with just 8 outputs. Scaling up to 128 samples achieves over 90% correct answers. These results underscore the potential of LLMs for long-context reasoning and motivate the development of methods that enable LLMs to self-improve in this area.
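The oracle-sample procedure can be sketched as follows. The `generate` callable is a hypothetical wrapper around the model's sampling API (not an interface defined in the paper), the temperature value is illustrative, and the sketch reuses the SubEM helper shown in §2.

```python
from typing import Callable

def oracle_sample(generate: Callable[[str, float], str], prompt: str,
                  golden_answers: list[str], n: int = 128,
                  temperature: float = 0.7) -> tuple[str, float]:
    """Draw n temperature samples and return the best one under SubEM (the oracle)."""
    outputs = [generate(prompt, temperature) for _ in range(n)]
    # subem() is the helper sketched in Section 2; oracle accuracy over a dataset
    # is then the fraction of questions whose best sampled output scores 1.0.
    scored = [(subem(o, golden_answers), o) for o in outputs]
    best_score, best_output = max(scored, key=lambda pair: pair[0])
    return best_output, best_score
```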
3 SEALONG

Motivated by the potential of LLMs in long-context reasoning (§2), we propose SEALONG, a self-improving method for reasoning over long contexts. This approach consists of two stages: creating self-supervision and fine-tuning the model. An overview of SEALONG is provided in Fig. 2.

3.1 Creating Self-supervision

In the first stage, we sample multiple reasoning trajectories for each question and its corresponding long context. The primary challenge lies in evaluating these outputs. The fundamental idea behind SEALONG is that correct reasoning trajectories typically exhibit higher semantic consistency. For example, they tend to follow similar planning steps and reference the same information within the long context. This observation aligns with hallucination detection methods (Manakul et al., 2023; Farquhar et al., 2024), where less consistent outputs are more likely to indicate hallucinations, representing incorrect reasoning in our scenario.

We formalize this idea using Minimum Bayes Risk (MBR) (Bickel and Doksum, 1977; Bertsch et al., 2023; Wu et al., 2024), which prioritizes outputs that exhibit higher consistency with others. In the MBR literature, the quality of an output is assessed by its expected utility under the model distribution (Bickel and Doksum, 1977; Kumar and Byrne, 2004; Tromble et al., 2008):

\[ s(y) = \mathbb{E}_{y^{*} \sim \pi_{\theta}(y \mid x)}\big[u(y, y^{*})\big] \]

Here, s(y) is the score assigned to output y, where x denotes the input, including the long context, question, and instruction. The term π_θ(y | x) represents the policy distribution of the LLM, and the utility metric u(y, y*) assesses y based on y*. We approximate this expectation using the Monte Carlo method with N sampled outputs:

\[ s(y) \approx \frac{1}{N} \sum_{i=1}^{N} u\big(y, y^{(i)}\big) \]

where y^(1), …, y^(N) are outputs sampled from π_θ(y | x).
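As a toy illustration of how consensus drives this score (our own example, not taken from the paper), suppose N = 3 trajectories are sampled and the utility is a similarity in [0, 1]:

```latex
% Toy example with N = 3 sampled trajectories y^{(1)}, y^{(2)}, y^{(3)}.
% Suppose y^{(1)} and y^{(2)} reach the same answer via similar steps,
% while y^{(3)} hallucinates, giving pairwise utilities
%   u(y^{(1)}, y^{(2)}) = 0.9,  u(y^{(1)}, y^{(3)}) = 0.2,  u(y^{(2)}, y^{(3)}) = 0.3.
% Including the self-term u(y, y) = 1 for simplicity, the Monte Carlo scores are
\begin{align*}
s(y^{(1)}) &\approx \tfrac{1}{3}(1.0 + 0.9 + 0.2) = 0.70,\\
s(y^{(2)}) &\approx \tfrac{1}{3}(0.9 + 1.0 + 0.3) = 0.73,\\
s(y^{(3)}) &\approx \tfrac{1}{3}(0.2 + 0.3 + 1.0) = 0.50,
\end{align*}
% so the outlier trajectory y^{(3)} receives the lowest score and the
% consensus trajectories are preferred.
```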
Figure 2: SEALONG consists of two stages: self-supervision creation and fine-tuning. Given a long context and a corresponding query, multiple outputs are sampled, each assigned a score based on Minimum Bayes Risk. Fine-tuning is then conducted using either the highest-scoring output for supervised fine-tuning or both high-scoring and low-scoring outputs for preference optimization.
The utility metric u(y, y*) measures the semantic alignment between the two reasoning trajectories. Formally:

\[ u(y, y^{*}) = \mathrm{Sim}\big(\mathrm{Emb}(y), \mathrm{Emb}(y^{*})\big) \]

We employ a lightweight RoBERTa-based model (Liu, 2019) to embed outputs and measure similarity with the inner product. This approach allows us to assign each output y a score s(y), and selecting the output with the highest score is referred to as MBR decoding (Bickel and Doksum, 1977; Bertsch et al., 2023; Wu et al., 2024). As demonstrated in Fig. 1, MBR decoding substantially surpasses greedy search, with absolute improvements of 11.5% on MuSiQue (Trivedi et al., 2022), 5.0% on HotpotQA (Yang et al., 2018), and 5.0% on 2WikiMultihopQA (Ho et al., 2020) when N = 128. These results highlight the potential for LLMs to self-improve by leveraging multiple samples and an effective evaluation metric based on output consensus, eliminating the need for human experts or advanced models. Furthermore, this evaluation approach produces preference pairs by contrasting high-scoring and low-scoring outputs, allowing straightforward preference optimization (Ouyang et al., 2022; Rafailov et al., 2024).
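The scoring step amounts to a few lines of linear algebra. The sketch below assumes a sentence-embedding callable that returns unit-normalized vectors (so the inner product serves as the similarity); it is an illustration of the procedure described above, not the released implementation.

```python
import numpy as np

def mbr_scores(outputs: list[str], embed) -> np.ndarray:
    """Monte Carlo MBR scores: average pairwise utility of each output.

    embed: callable mapping a list of strings to an (N, d) array of
    (assumed unit-normalized) sentence embeddings.
    """
    E = np.asarray(embed(outputs))      # (N, d)
    U = E @ E.T                         # pairwise inner-product utilities u(y_i, y_j)
    return U.mean(axis=1)               # s(y_i) ~= (1/N) * sum_j u(y_i, y_j)

def build_preference_pair(outputs: list[str], embed, rng=np.random):
    """Chosen = MBR decoding output; rejected = a random low-scoring output."""
    scores = mbr_scores(outputs, embed)
    chosen = outputs[int(scores.argmax())]
    # "Low-scoring" is taken here as the lower-scoring half; the exact cutoff
    # used by the authors is an assumption on our part.
    low_half = np.argsort(scores)[: max(1, len(outputs) // 2)]
    rejected = outputs[int(rng.choice(low_half))]
    return chosen, rejected, scores
```

Note that the diagonal self-similarity term is included in the average above; with unit-normalized embeddings it adds the same constant to every score and therefore does not change the ranking.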
3.2 Fine-tuning

Leveraging self-provided supervision, we can either perform supervised fine-tuning on the highest-scoring outputs or apply preference optimization using preference pairs.

Supervised Fine-tuning. For supervised fine-tuning (SFT), we minimize the negative log-likelihood of the output as follows:

\[ \mathcal{L}_{\mathrm{SFT}} = -\frac{1}{|y|}\log \pi_{\theta}(y \mid x) = -\frac{1}{|y|}\sum_{i=1}^{|y|}\log \pi_{\theta}(y_{i} \mid x, y_{<i}) \]

Here, y denotes the MBR decoding output.

Preference Optimization. Alternatively, we can conduct preference optimization to reinforce the tendency toward high-scoring outputs and reduce the likelihood of low-scoring outputs. Among the various preference optimization methods, we adopt the monolithic odds ratio preference optimization (ORPO) algorithm (Hong et al., 2024) due to its strong empirical performance. ORPO introduces an odds ratio loss to minimize the negative log odds ratio between a preferred output y_w and a less-preferred output y_l:

\[ \mathcal{L}_{\mathrm{OR}} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_{\theta}(y_{w} \mid x)}{\mathrm{odds}_{\theta}(y_{l} \mid x)}\right) \]

Here, σ represents the sigmoid function, and odds_θ(y | x) measures how much more likely y is to be generated than not:

\[ \mathrm{odds}_{\theta}(y \mid x) = \frac{\pi_{\theta}(y \mid x)}{1 - \pi_{\theta}(y \mid x)} \]

The final objective in ORPO combines the SFT and OR losses, with a hyperparameter β controlling their relative importance:

\[ \mathcal{L}_{\mathrm{ORPO}} = \mathcal{L}_{\mathrm{SFT}} + \beta \cdot \mathcal{L}_{\mathrm{OR}} \]
Model                                        Qasper  MultiFieldQA-En  HotpotQA  MuSiQue  2WikiMQA  Avg.
Qwen-2.5-7B-Instruct (Yang et al., 2024a)    21.0    28.0             70.5      48.0     77.5      49.0
+ SEALONG                                    26.0    29.3             72.5      51.5     79.5      51.8
Qwen-2.5-14B-Instruct (Yang et al., 2024a)   21.0    32.0             73.0      52.0     83.0      52.2
+ SEALONG                                    24.0    30.0             75.0      57.0     87.5      54.7
Llama-3.1-8B-Instruct (Dubey et al., 2024)   29.0    29.3             64.0      49.5     82.0      50.8
+ SEALONG                                    32.5    31.3             68.0      58.5     84.5      55.0
Qwen-2.5-32B-Instruct (Yang et al., 2024a)   24.5    26.0             72.0      55.0     88.0      53.1
Qwen-2.5-72B-Instruct (Yang et al., 2024a)   27.0    28.7             74.5      58.5     89.0      55.5
Llama-3.1-70B-Instruct (Dubey et al., 2024)  30.0    33.3             74.0      68.5     85.5      58.3
GPT-4o (Hurst et al., 2024)                  21.5    28.0             74.5      64.0     84.0      54.4
Table 2: Main evaluation results. Substring exact match (SubEM) serves as the evaluation metric, with the top-performing results emphasized in bold. SEALONG utilizes the training set of MuSiQue with self-supervision (§3.1), and its performance on other tasks demonstrates the generalization ability of SEALONG.
Task             # Examples  Max Tokens  Avg. Tokens
Qasper           200         21,110      4,921
MultiFieldQA-en  150         14,947      6,888
HotpotQA         200         16,322      12,779
MuSiQue          200         16,335      15,542
2WikiMultihopQA  200         16,319      7,096

Table 3: Statistics of evaluation tasks, with token counts calculated using the tokenizer of Llama-3.1-8B-Instruct.

Model                  Avg. Long-context  Avg. Output Tokens
Qwen-2.5-Instruct 7B   49.0               375
+ SEALONG              51.8               371
Llama-3.1-Instruct 8B  50.8               289
+ SEALONG              55.0               295

Table 4: Average performance on long-context tasks (Tab. 2) and average token count in model predictions for these tasks, measured with the model's tokenizer.

In our implementation, we use the MBR decoding output as y_w and randomly select a low-scoring output to serve as y_l.
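Written out in code, the combined objective looks roughly as follows. This is a minimal PyTorch-style sketch of the losses defined in §3.2, operating on per-token log-probabilities that the caller has already gathered for the chosen and rejected responses; it is not the training implementation used in the paper (see §4.1 and Appx. A for the actual setup).

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor, rejected_logps: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """ORPO objective from per-token log-probs of y_w (chosen) and y_l (rejected).

    chosen_logps / rejected_logps: 1-D tensors holding log pi_theta(y_i | x, y_<i)
    for each token of the respective response (already gathered by the caller).
    """
    # Length-normalized sequence log-likelihoods, (1/|y|) * log pi_theta(y | x).
    logp_w = chosen_logps.mean()
    logp_l = rejected_logps.mean()

    # L_SFT: negative log-likelihood of the chosen (MBR decoding) output.
    sft_loss = -logp_w

    # log odds(y | x) = log(p / (1 - p)), computed in log space for stability
    # via log(1 - exp(logp)) = log1p(-exp(logp)).
    log_odds_w = logp_w - torch.log1p(-torch.exp(logp_w))
    log_odds_l = logp_l - torch.log1p(-torch.exp(logp_l))

    # L_OR: -log sigmoid of the log odds ratio between chosen and rejected.
    or_loss = -F.logsigmoid(log_odds_w - log_odds_l)

    return sft_loss + beta * or_loss
```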
4 Experiments

4.1 Implementation

SEALONG requires query and long-context pairs to synthesize training data. Specifically, we leverage the training dataset of MuSiQue (Trivedi et al., 2022), where each question is related to several Wikipedia documents. To achieve a specified number of tokens in the context, we randomly sample some unrelated documents, shuffle them with the related ones, and concatenate them into a single context. We use the original questions in MuSiQue without the annotated answers, relying on the LLM to produce self-supervision (§3.1). For each question, we sample N = 32 outputs with a sampling temperature of 0.7. By default, we synthesize 2048 examples for fine-tuning, with context lengths randomly specified between 4K and 31K tokens. We conduct experiments using the Llama-3.1 models (Dubey et al., 2024) and Qwen-2.5 models (Yang et al., 2024a), with jina-embeddings-v3 serving as the sentence embedding model (Sturua et al., 2024). ORPO (Hong et al., 2024) is employed as the default fine-tuning method. More training details can be found in Appx. A.
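The context-assembly step described above can be sketched as follows; the document fields, tokenizer handle, and token-budgeting logic are our own assumptions for illustration rather than the paper's preprocessing script.

```python
import random

def build_long_context(related_docs: list[str], distractor_pool: list[str],
                       tokenizer, target_tokens: int) -> str:
    """Pad the related (gold) documents with random distractors up to a token budget,
    then shuffle so the relevant evidence is not always in the same position."""
    def n_tokens(text: str) -> int:
        return len(tokenizer.encode(text))

    docs = list(related_docs)
    budget = target_tokens - sum(n_tokens(d) for d in docs)

    pool = distractor_pool[:]
    random.shuffle(pool)
    for doc in pool:
        cost = n_tokens(doc)
        if cost > budget:
            continue
        docs.append(doc)
        budget -= cost

    random.shuffle(docs)
    return "\n\n".join(docs)

# Hypothetical usage: assemble a ~16K-token context for one MuSiQue question.
# context = build_long_context(gold_docs, wiki_distractors, tok, target_tokens=16_000)
```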
4.2 Evaluation Setup

We conduct evaluations in long-context scenarios across a wide range of tasks. For single-document QA, we include Qasper (Dasigi et al., 2021) and MultiFieldQA-En (Bai et al., 2023) from the LongBench benchmark (Bai et al., 2023). For multi-document QA, we use HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022) and 2WikiMultihopQA (Ho et al., 2020), also from LongBench. Task statistics are presented in Tab. 3. We adopt plan-and-solve prompting for evaluation due to its strong performance (Tab. 1). Following previous research (Mallen et al., 2023; Asai et al., 2024; Yen et al., 2024b), we use substring exact match (SubEM) as the evaluation metric, measuring whether the output contains the golden answer.

4.3 Main Results

SEALONG Improves Various Models. We implement SEALONG on the leading open-source LLMs, including Qwen-2.5 models (Yang et al., 2024a) and Llama-3.1 models (Dubey et al., 2024). As illustrated in Tab. 2, SEALONG brings notable improvements: when implemented on Qwen-2.5-7B-Instruct, it closes the performance gap with Qwen-2.5-14B-Instruct (51.8 vs. 52.2); when applied to Qwen-2.5-14B-Instruct, it even exceeds the performance of Qwen-2.5-32B-Instruct (54.7 vs. 53.1). Additionally, SEALONG yields an absolute improvement of 4.2 on Llama-3.1-8B-Instruct, outperforming GPT-4o (Hurst et al., 2024) (55.0 vs. 54.4). Although SEALONG utilizes MuSiQue for data synthesis, it achieves strong performance across other tasks as well, highlighting its generalization potential. One possible shortcut of SEALONG is producing more tokens, as the evaluation metric, SubEM, might favor outputs with more tokens, which are more likely to contain the golden answer. To explore this, we examine output token counts. As shown in Tab. 4, SEALONG has minimal effect on the number of output tokens.
Model                     Qasper  MultiFieldQA-En  HotpotQA  MuSiQue  2WikiMQA  Avg.
Llama-3.1-8B-Instruct     29.0    29.3             64.0      49.5     82.0      50.8
Supervised Fine-tuning
+ TULU-V2-mix             26.5    27.3             49.5      27.5     54.0      37.0
+ WildChat                20.5    29.3             46.5      28.0     58.0      36.5
+ LongAlpaca              22.5    31.3             48.0      31.0     45.0      35.6
+ LongAlign               25.0    36.7             58.5      47.5     76.0      48.7
+ LongMIT                 20.0    30.0             56.0      36.0     66.5      41.7
+ LongReward-SFT          22.0    28.7             58.0      52.0     76.5      47.4
+ GPT-4o-MuSiQue          21.5    31.3             64.0      54.0     83.5      50.9
+ SEALONG-SFT             28.5    30.7             68.5      50.5     84.0      52.4
Preference Optimization
+ UltraFeedback           26.0    27.3             47.5      28.5     46.0      35.1
+ LongReward-Preference   26.5    32.0             63.5      52.0     80.5      50.9
+ SEALONG                 32.5    31.3             68.0      58.5     84.5      55.0

Table 5: A comparison between SEALONG and previous datasets. The results are based on Llama-3.1-8B-Instruct fine-tuned on the corresponding dataset. To ensure fairness, 2K examples are randomly sampled from each dataset, with the exception of TULU-V2-mix, WildChat, and UltraFeedback, where the longest 2K examples are selected. The preference optimization strategy is ORPO (Hong et al., 2024).
Dataset                        Supervision    Avg. Tokens
TULU-V2-mix (2023)             [1], [2], [3]  3,788
WildChat (2024a)               [2], [3]       32,230
LongAlpaca (2024b)             [1], [4]       9,160
LongAlign (2024)               [4]            16,881
LongMIT (2024c)                [5]            78,412
LongReward-SFT (2024b)         [6]            22,206
LongReward-Preference (2024b)  [6]            22,689
UltraFeedback (2023)           [3]            1,356
GPT-4o-MuSiQue                 [7]            18,476
SEALONG                        [8]            18,532

Table 6: Dataset statistics, including supervision source and average token count, measured with the Llama-3.1-8B-Instruct tokenizer. Sources: [1] Human, [2] GPT-3.5-Turbo (OpenAI, 2022), [3] GPT-4 (Achiam et al., 2023), [4] Claude (Anthropic, 2023), [5] Qwen2-72B-Instruct (Yang et al., 2024a), [6] GLM-4 (GLM et al., 2024), [7] GPT-4o (Hurst et al., 2024), and [8] Self.

SEALONG Competes with Previous Datasets. We compare SEALONG with several previous datasets, including short-context datasets such as TULU-V2-mix (Ivison et al., 2023), WildChat (Zhao et al., 2024a), and UltraFeedback (Cui et al., 2023), as well as long-context datasets including LongAlpaca (Chen et al., 2024b), LongAlign (Bai et al., 2024), LongMIT (Chen et al., 2024c), and LongReward (Zhang et al., 2024b). Additionally, we utilize GPT-4o to synthesize data using the same questions and long contexts as SEALONG, creating a dataset we term GPT-4o-MuSiQue. Dataset statistics are presented in Tab. 6. To ensure fairness, 2K examples are randomly sampled from each dataset, with the exception of TULU-V2-mix, WildChat, and UltraFeedback, where the longest 2K examples are selected. As demonstrated in Tab. 5, most previous datasets negatively affect the performance of Llama-3.1-8B-Instruct, consistent with the observation of Gao et al. (2024). We hypothesize that this is because Llama-3.1-8B-Instruct already has strong long-context processing capabilities, and additional training on low-quality synthetic data could diminish its performance. However, we observe a performance improvement with SEALONG (50.8 to 55.0), indicating that self-improvement holds promise, which is particularly encouraging as current LLMs advance rapidly.
4.4 Analysis

Table 7: Comparison of various scoring methods and greedy search. Each scoring method evaluates 16 outputs sampled from Llama-3.1-8B-Instruct. The results indicate the performance of the highest-scoring output for each method.

… examples provide limited benefit. This suggests that SEALONG is unlocking the inherent potential of LLMs for long-context reasoning rather than introducing a new skill that would require more data.
Model                  Long-Context Avg.  MMLU  GSM8K  ARC-Challenge  HellaSwag  Winogrande  TruthfulQA  Short-Context Avg.
Qwen-2.5-7B-Instruct   49.0               74.2  82.4   67.1           81.5       74.7        64.7        74.1
+ SEALONG              51.8               74.1  83.2   66.5           81.3       74.4        64.8        74.1
Llama-3.1-8B-Instruct  50.8               68.3  77.7   60.2           80.1       77.4        54.1        69.6
+ SEALONG              55.0               68.4  77.8   60.3           79.9       77.3        53.8        69.6

Table 8: Evaluation results on short-context tasks from the Open LLM Leaderboard (Beeching et al., 2023), with the long-context average performance referenced from Tab. 2. SEALONG demonstrates a marked improvement in long-context performance, with minimal impact on short-context performance.
As shown in Tab. 8, while SEALONG achieves substantial improvements in long-context performance, it has minimal impact on short-context performance.

5 Related Work

Long-context Language Modeling. Numerous studies explore methods to extend the long-context processing abilities of LLMs. One line of research addresses this challenge from a model-centered perspective, with some studies focusing on minimal modifications to existing LLMs, such as adjustments to position embeddings (Chen et al., 2023; Peng et al., 2024; Ding et al., 2024; Zhu et al., 2024; Xiong et al., 2024) and refinements to the attention mechanism (Ding et al., 2023; Jin et al., 2024; An et al., 2024b,c). Additionally, some works propose novel architectures for efficient long-context processing (Wu et al., 2022; Bertsch et al., 2024; Wang et al., 2024d; Yen et al., 2024a; Lieber et al., 2024; Ye et al., 2024; Sun et al., 2024). Another line of research adopts a data-centric perspective, focusing on data engineering strategies. For example, Dubey et al. (2024); Lieber et al. (2024); Fu et al. (2024); Gao et al. (2024) continue pre-training models on long sequences, while An et al. (2024d); Bai et al. (2024); Zhang et al. (2024b); Chen et al. (2024c,b) leverage expert models or human annotations to create long-context data for fine-tuning. In contrast to these approaches, this work aims to facilitate the self-improvement of LLMs in long-context reasoning.

Self-improving. The self-improvement of LLMs has become a vital area of research as these models advance toward human-level intelligence. Research in this area follows two main approaches. The first approach investigates the self-reflection capabilities of LLMs, where models are prompted to assess and refine their own outputs (Ganguli et al., 2023; Madaan et al., 2024; Shinn et al., 2024; Xie et al., 2024; Gou et al., 2024; Chen et al., 2024a; Pan et al., 2024). However, the reliability of such self-refinement has been questioned in recent studies (Huang et al., 2024; Jiang et al., 2024). The second approach generates synthetic training data through the models themselves. This process typically involves generating multiple outputs for a given input, filtering out inaccurate results based on ground truths, and using the remaining correct responses for model fine-tuning (Zelikman et al., 2022; Hosseini et al., 2024; Pang et al., 2024; Wang et al., 2024c; Gulcehre et al., 2023; Zhang et al., 2024a). Additionally, Yuan et al. (2024) fine-tune LLMs to assign rewards to their own outputs using human preference data and facilitate continual improvement in instruction following. To reduce reliance on human annotations, some studies adopt consensus-based supervision, designating the output with the higher consensus across multiple outputs as better, with applications in areas such as arithmetic and logical reasoning (Huang et al., 2023; Prasad et al., 2024), machine translation (Finkelstein and Freitag, 2024; Wang et al., 2024a; Yang et al., 2024b), and instruction following (Wu et al., 2024). SEALONG first reveals the underestimated potential of LLMs in long-context reasoning and then leverages a consensus-based supervision strategy to enable LLMs to self-improve in long-context reasoning.

6 Conclusion

In this study, we investigate the potential of LLMs to self-improve in long-context reasoning and propose SEALONG for this purpose. This method achieves substantial improvements across multiple long-context reasoning tasks. We hope this research will open new avenues for self-improvement in long-context reasoning, which is vital for the sustained progress of LLMs, particularly as they advance toward surpassing human intelligence.
Limitations

We recognize that this work has several limitations that warrant further investigation.

Scoring Method. To establish self-supervision (§3.1), we score each output according to Minimum Bayes Risk (MBR), which reflects consensus across multiple sampled outputs. However, a substantial performance gap remains between the highest MBR-scored output and the oracle sample (see Fig. 1 for details). Future research should explore more effective approaches for self-evaluation of outputs. One possible direction could involve examining the critic capabilities of LLMs in long-context scenarios (Lan et al., 2024b; Lin et al., 2024; Lan et al., 2024a).

Synthetic Data. Another limitation of this work is its reliance on MuSiQue (Trivedi et al., 2022) for synthetic data, which consists of multi-hop questions spanning multiple paragraphs. While this approach has enabled some progress, MuSiQue does not cover all challenging question types, such as those requiring full-context reasoning, which remains a key limitation of current long-context LLMs (Karpinska et al., 2024; Wang et al., 2024b; Vodrahalli et al., 2024; Yen et al., 2024b). We advocate for future work to prioritize the creation of high-quality prompt sets, which are essential for the development of long-context LLMs.

Experimental Setup. Due to computational limitations, we restrict the implementation of SEALONG to LLMs with up to 14B parameters, though its effectiveness at larger scales warrants further investigation. Likewise, the maximum sequence length is set to 32K tokens, whereas current leading LLMs support context lengths of up to 128K tokens or more. We leave the exploration of longer context lengths for future work.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Chenxin An, Shansan Gong, Ming Zhong, Xingjian Zhao, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. 2024a. L-eval: Instituting standardized evaluation for long context language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand. Association for Computational Linguistics.
Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong. 2024b. Training-free long-context scaling of large language models. In Forty-first International Conference on Machine Learning.
Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, and Lingpeng Kong. 2024c. Why does the effective context length of llms fall short? arXiv preprint arXiv:2410.18745.
Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, and Jian-Guang Lou. 2024d. Make your llm fully utilize the context. arXiv preprint arXiv:2404.16811.
Anthropic. 2023. Anthropic: Introducing claude 2.1.
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.
Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. 2024. Longalign: A recipe for long context alignment of large language models. arXiv preprint arXiv:2401.18058.
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. 2023. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. 2023. Open llm leaderboard.
Amanda Bertsch, Uri Alon, Graham Neubig, and Matthew Gormley. 2024. Unlimiformer: Long-range transformers with unlimited length input. Advances in Neural Information Processing Systems, 36.
Amanda Bertsch, Alex Xie, Graham Neubig, and Matthew R Gormley. 2023. It's mbr all the way down: Modern generation techniques through the lens of minimum bayes risk. In Proceedings of the Big Picture Workshop, pages 108–122.
P.J. Bickel and K.A. Doksum. 1977. Mathematical Statistics: Basic Ideas and Selected Topics. Prentice Hall.
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2024a. Teaching large language models to self-debug. In The Twelfth International Conference on Learning Representations.
Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2024b. LongLoRA: Efficient fine-tuning of long-context large language models. In The Twelfth International Conference on Learning Representations.
Zhi Chen, Qiguang Chen, Libo Qin, Qipeng Guo, Haijun Lv, Yicheng Zou, Wanxiang Che, Hang Yan, Kai Chen, and Dahua Lin. 2024c. What are the essential factors in crafting effective long context multi-hop instruction datasets? insights and best practices. arXiv preprint arXiv:2409.01893.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. 2021. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2024. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36.
Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei. 2023. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.
Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. 2024. LongRoPE: Extending LLM context window beyond 2 million tokens. In Forty-first International Conference on Machine Learning.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. 2024. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630.
Mara Finkelstein and Markus Freitag. 2024. MBR and QE finetuning: Training-time distillation of the best and most expensive decoding methods. In The Twelfth International Conference on Learning Representations.
Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. 2024. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171.
Deep Ganguli, Amanda Askell, Nicholas Schiefer, Thomas I Liao, Kamilė Lukošiūtė, Anna Chen, Anna Goldie, Azalia Mirhoseini, Catherine Olsson, Danny Hernandez, et al. 2023. The capacity for moral self-correction in large language models. arXiv preprint arXiv:2302.07459.
Tianyu Gao, Alexander Wettig, Howard Yen, and Danqi Chen. 2024. How to train long-context language models (effectively). arXiv preprint arXiv:2410.02660.
Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. 2024. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793.
Zhibin Gou, Zhihong Shao, Yeyun Gong, yelong shen, Yujiu Yang, Nan Duan, and Weizhu Chen. 2024. CRITIC: Large language models can self-correct with tool-interactive critiquing. In The Twelfth International Conference on Learning Representations.
Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, et al. 2023. Reinforced self-training (rest) for language modeling. arXiv preprint arXiv:2308.08998.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In International Conference on Learning Representations.
Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. 2020. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625.
Jiwoo Hong, Noah Lee, and James Thorne. 2024. Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691.
Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, and Rishabh Agarwal. 2024. V-STaR: Training verifiers for self-taught reasoners. In First Conference on Language Modeling.
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. 2024. Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654.
Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. 2023. Large language models can self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1051–1068.
Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large language models cannot self-correct reasoning yet. In The Twelfth International Conference on Learning Representations.
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276.
Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A Smith, Iz Beltagy, et al. 2023. Camels in a changing climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702.
Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. 2023. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509.
Dongwei Jiang, Jingyu Zhang, Orion Weller, Nathaniel Weir, Benjamin Van Durme, and Daniel Khashabi. 2024. Self-[in]correct: Llms struggle with refining self-generated responses. arXiv preprint arXiv:2404.04298.
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. 2024. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations.
Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. 2024. LLM maybe longLM: Selfextend LLM context window without tuning. In Forty-first International Conference on Machine Learning.
Greg Kamradt. 2023. Needle in a haystack - pressure testing llms.
Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, and Mohit Iyyer. 2024. One thousand and one pairs: A "novel" challenge for long-context language models. arXiv preprint arXiv:2406.16264.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.
Shankar Kumar and Bill Byrne. 2004. Minimum bayes-risk decoding for statistical machine translation. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 169–176.
Tian Lan, Wenwei Zhang, Chengqi Lyu, Shuaibin Li, Chen Xu, Heyan Huang, Dahua Lin, Xian-Ling Mao, and Kai Chen. 2024a. Training language models to critique with multi-agent feedback. arXiv preprint arXiv:2410.15287.
Tian Lan, Wenwei Zhang, Chen Xu, Heyan Huang, Dahua Lin, Kai Chen, and Xian-ling Mao. 2024b. Criticbench: Evaluating large language models as critic. arXiv preprint arXiv:2402.13764.
Mosh Levy, Alon Jacoby, and Yoav Goldberg. 2024. Same task, more tokens: the impact of input length on the reasoning performance of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand. Association for Computational Linguistics.
Huayang Li, Pat Verga, Priyanka Sen, Bowen Yang, Vijay Viswanathan, Patrick Lewis, Taro Watanabe, and Yixuan Su. 2024a. A retrieve-then-reason framework for long-context question answering. arXiv preprint arXiv:2410.03227.
Mo Li, Songyang Zhang, Yunxin Liu, and Kai Chen. 2024b. Needlebench: Can llms do retrieval and reasoning in 1 million context window? arXiv preprint arXiv:2407.11963.
Yanyang Li, Shuo Liang, Michael Lyu, and Liwei Wang. 2024c. Making long-context language models better multi-hop reasoners. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2462–2475.
Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. 2024. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887.
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252.
Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, and Yujiu Yang. 2024. CriticBench: Benchmarking LLMs for critique-correct reasoning. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand. Association for Computational Linguistics.
Yinhan Liu. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. 2024. Agentboard: An analytical evaluation board of multi-turn llm agents. arXiv preprint arXiv:2401.13178.
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. 2024. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36.
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822.
Potsawee Manakul, Adian Liusie, and Mark Gales. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017.
OpenAI. 2022. Chatgpt blog post. https://openai.com/blog/chatgpt.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Liangming Pan, Michael Saxon, Wenda Xu, Deepak Nathani, Xinyi Wang, and William Yang Wang. 2024. Automatically correcting large language models: Surveying the landscape of diverse automated correction strategies. Transactions of the Association for Computational Linguistics, 12:484–506.
Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason E Weston. 2024. Iterative reasoning preference optimization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. 2024. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations.
Archiki Prasad, Weizhe Yuan, Richard Yuanzhe Pang, Jing Xu, Maryam Fazel-Zarandi, Mohit Bansal, Sainbayar Sukhbaatar, Jason Weston, and Jane Yu. 2024. Self-consistency preference optimization. Preprint, arXiv:2411.04109.
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2024. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36.
Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, et al. 2024. jina-embeddings-v3: Multilingual embeddings with task lora. arXiv preprint arXiv:2409.10173.
Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei. 2024. You only cache once: Decoder-decoder architectures for language models. arXiv preprint arXiv:2405.05254.
Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. 2022. Musique: Multi-hop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554.
Roy Tromble, Shankar Kumar, Franz Josef Och, and Wolfgang Macherey. 2008. Lattice minimum bayes-risk decoding for statistical machine translation. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 620–629.
Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, et al. 2024. Michelangelo: Long context evaluations beyond haystacks via latent structure queries. arXiv preprint arXiv:2409.12640.
Jun Wang, Eleftheria Briakou, Hamid Dadkhahi, Rishabh Agarwal, Colin Cherry, and Trevor Cohn. 2024a. Don't throw away data: Better sequence knowledge distillation. arXiv preprint arXiv:2407.10456.
Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023a. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2609–2634.
Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, et al. 2024b. Leave no document behind: Benchmarking long-context llms with extended multi-doc qa. arXiv preprint arXiv:2406.17419.
Tianduo Wang, Shichen Li, and Wei Lu. 2024c. Self-training with direct preference optimization improves chain-of-thought reasoning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11917–11928.
Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. 2024d. Augmenting language models with long-term memory. Advances in Neural Information Processing Systems, 36.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023b. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
Ian Wu, Patrick Fernandes, Amanda Bertsch, Seungone Kim, Sina Pakazad, and Graham Neubig. 2024. Better instruction-following through minimum bayes risk. arXiv preprint arXiv:2410.02902.
Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. 2022. Memorizing transformers. In International Conference on Learning Representations.
Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, and Michael Xie. 2024. Self-evaluation guided beam search for reasoning. Advances in Neural Information Processing Systems, 36.
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad, Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang, and Hao Ma. 2024. Effective long-context scaling of foundation models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico. Association for Computational Linguistics.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. 2024a. Qwen2 technical report. arXiv preprint arXiv:2407.10671.
Guangyu Yang, Jinghong Chen, Weizhe Lin, and Bill Byrne. 2024b. Direct preference optimization for neural machine translation with minimum bayes risk decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 391–398.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380.
Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei. 2024. Differential transformer. arXiv preprint arXiv:2410.05258.
Howard Yen, Tianyu Gao, and Danqi Chen. 2024a. Long-context language modeling with parallel context encoding. arXiv preprint arXiv:2402.16617.
Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izasak, Moshe Wasserblat, and Danqi Chen. 2024b. Helmet: How to evaluate long-context language models effectively and thoroughly. arXiv preprint arXiv:2410.02694.
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason E Weston. 2024. Self-rewarding language models. In Forty-first International Conference on Machine Learning.
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. 2022. STaR: Bootstrapping reasoning with reasoning. In Advances in Neural Information Processing Systems.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800.
Dan Zhang, Sining Zhoubian, Yisong Yue, Yuxiao Dong, and Jie Tang. 2024a. Rest-mcts*: Llm self-training via process reward guided tree search. arXiv preprint arXiv:2406.03816.
Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. 2024. PoSE: Efficient context window extension of LLMs via positional skip-wise training. In The Twelfth International Conference on Learning Representations.
A Training Details
To support efficient fine-tuning for long-context
scenarios, we implement sequence parallelization
(Jacobs et al., 2023) with a parallel size of 8. Addi-
tionally, we utilize QLoRA (Dettmers et al., 2024)
to reduce memory consumption during fine-tuning.
The LoRA rank, alpha, and dropout are set to 128,
128, and 0.05, respectively, with all attention and
feedforward linear layers designated as target mod-
ules. All models are fine-tuned for one epoch. The
batch size, learning rate, and maximum sequence
length are set to 8, 5e-5, and 32K, respectively.
The β for ORPO is configured to 0.1. All exper-
iments are conducted on a computing setup with
8 × H100 GPUs.
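As a rough starting point, the settings above can be expressed with the Hugging Face peft and trl libraries as sketched below. This is our own illustrative mapping of the stated hyperparameters, not the authors' released training script; exact argument names may differ across library versions, and the sequence parallelism (DeepSpeed-Ulysses) and 4-bit quantization pieces of QLoRA are configured separately and omitted here.

```python
from peft import LoraConfig
from trl import ORPOConfig

# QLoRA adapter settings from Appx. A: rank 128, alpha 128, dropout 0.05,
# applied to all attention and feed-forward linear layers (Llama/Qwen module names).
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# ORPO training settings: one epoch, effective batch size 8, learning rate 5e-5,
# 32K maximum sequence length, beta = 0.1.
orpo_config = ORPOConfig(
    output_dir="sealong-orpo",       # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=1,   # with accumulation to reach an effective batch of 8
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    max_length=32768,
    beta=0.1,
    bf16=True,
)
```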
B Prompts
We provide the prompts for various prompting
strategies (§2.1) in Tab. 9, and the prompts for the
reference-free and reference-based self-evaluation
strategies (§4.4) in Tab. 10.
Strategy: Default
Prompt:
{context}
{input}

Strategy: Direct Answer
Prompt:
{context}
{input}
Let's answer the question directly.

Strategy: Think step-by-step (Kojima et al., 2022)
Prompt:
{context}
{input}
Let's think step by step.

Strategy: Fact-and-reflection (Zhao et al., 2024b)
Prompt:
{context}
{input}
Let's first identify the relevant information from the long context and list it. Then, carry out step-by-step reasoning based on that information, and finally, provide the answer.

Strategy: Plan-and-solve (Wang et al., 2023a)
Prompt:
{context}
{input}
Let's first understand the problem and devise a plan to solve it. Then, let's carry out the plan and solve the problem step-by-step.
Table 9: The prompts for various prompting strategies (§2.1), where {context} and {input} serve as placeholders for
the long context and input query, respectively.
Strategy: Reference-free Self-Evaluation
Prompt:
[Context]
{context}
[Question]
{question}
[Predicted Response]
{prediction}
Please evaluate the correctness of the predicted response based on the context and the question. Begin your evaluation by providing a brief explanation. Be as objective as possible. After giving your explanation, you must rate the response on a scale from 1 to 5, following this format exactly: "[[rating]]". For example, "Rating: [[3]]".

Strategy: Reference-based Self-Evaluation
Prompt:
Here is a question along with two responses: one is the reference response, and the other is the predicted response. Please determine whether the two responses provide the same answer to the question. Respond with "True" or "False" directly.
[Question]
{question}
[Reference Response]
{reference}
[Predicted Response]
{prediction}
Table 10: The prompts for the reference-free and reference-based self-evaluation strategies (§4.4), where {question},
{reference}, {prediction}, and {context} serve as placeholders for their respective elements.