ART: Automatic multi-step reasoning and tool-use for large language models

Abstract

Large language models (LLMs) can perform complex reasoning in few- and zero-shot settings by generating intermediate chain of thought (CoT) reasoning steps. Further, each reasoning step can rely on external tools to support computation beyond the core LLM capabilities (e.g. search/running code). Prior work on CoT prompting and tool use typically requires hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task, ART selects demonstrations of multi-step reasoning and tool use from a task library; at test time, it pauses generation whenever external tools are called, and integrates their output before resuming generation. ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks, and matches the performance of hand-crafted CoT prompts on a majority of these tasks.

[Figure 1: example ART programs run by a frozen LLM — an anachronism question ("When was George H. W. Bush president?" / "When was the Gulf War fought?") answered with the search tool, an arithmetic question answered with code generation and code execution, and a Pig Latin translation program, with tool calls and optional human feedback.]
1 Introduction

In-context learning allows large language models (LLMs) to quickly adapt to new tasks simply by using natural language instructions and a few demonstrations as a prompt to the LLM (Xie et al., 2021; Brown et al., 2020; Chowdhery et al., 2022). While this circumvents annotating large datasets or even hosting the LLM itself (since many are available through APIs), there are severe performance limitations around multi-step reasoning (Liu et al., 2022), math (Patel et al., 2021), having up-to-date information (Komeili et al., 2022), and others. To address these limitations, recent work proposes prompting LLMs to mimic a chain of thought (CoT) for multi-step reasoning (Wei et al.; Zhou et al., 2022; Wang et al., 2022; Press et al., 2022; Khot et al., 2022; Arora et al., 2022) or providing them with access to tools (e.g. a calculator or QA model) to enable more complex reasoning steps (Gao et al., 2022; Chen et al., 2022; Press et al., 2022; Wei et al.; Schick et al., 2023). However, existing methods for chained reasoning with tool use are difficult to extend to new tasks and tools, requiring fine-tuning or prompt-engineering tailored to a specific task (Parisi et al., 2022) or tool (Schick et al., 2023).

In this paper, we present Automatic Reasoning and Tool-use (ART), a framework that automatically generates decompositions (multi-step reasoning) for instances of new tasks. The framework also selects and uses the most appropriate available tools (like search engines and code execution) in individual steps. Given a new task, ART retrieves demonstrations of related tasks from a task library to enable few-shot decomposition and tool use. These demonstrations follow a flexible but structured query language (Beurer-Kellner et al., 2022), such that it is easy to parse intermediate steps, stop generation to call external tools, and resume it after including the output of such tools (Figure 1). ART provides the LLM with demonstrations of how to decompose instances of several related tasks, and how to select and use any tool from the tool library that is represented in these demonstrations. This encourages the model to generalize from demonstrations to decompose a new task and use tools in appropriate places, zero-shot. It also enables users to fix any mistakes in the reasoning chain or add new tools by simply updating the task and tool libraries, providing new demonstrations where necessary (e.g. for the task at hand).

We construct a task library for 15 diverse BigBench (Srivastava et al., 2022) tasks, and evaluate ART on 19 unseen test tasks from BigBench, 6 MMLU tasks, and various tasks used by related work on tool use (SQuAD, TriviaQA, SVAMP, MAWPS). ART consistently matches or outperforms automatically generated CoT reasoning chains on 32/34 BigBench and all MMLU tasks, by an average of over 22 percentage points. Tool use in particular improves performance on test tasks by an average of over 12.3 percentage points, compared to when no tools are allowed (Table 3). ART improves over direct few-shot prompting by 10.8 percentage points on average across unseen BigBench and MMLU tasks. Improvements are particularly notable on unseen tasks requiring arithmetic and algorithmic reasoning, where ART improves over direct few-shot prompting by 12.5 percentage points, and over previous best-known results for GPT-3 that use supervision for decomposition and/or tool use by 6.1 percentage points (Table 3).

Finally, ART enables human intervention and improvement of the reasoning process by simply updating the task and tool libraries with new demonstrations, making it very easy to improve performance on any specific task with minor human feedback. On 12 test tasks, ART with additional human feedback surpasses the best-known results for GPT-3 by an average of over 20 percentage points (Table 6).¹

¹Code is available at https://github.com/bhargaviparanjape/language-programmes/

2 Related Work

Scaled finetuning for low-resource adaptation Recent work has shown that finetuning LLMs on a broad range of public NLP datasets (with prefixed instructions) is an effective technique for cross-task generalization (Mishra et al., 2021; Sanh et al., 2021; Khashabi et al., 2020; Wei et al., 2021) in both the zero-shot and few-shot settings. Ouyang et al. (2022) show that aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback for desired model behavior (InstructGPT) further improves in-context learning performance on complex NLP tasks. Chung et al. (2022) show that finetuning on an aggregated mixture of tasks (T0, CoT, dialog, and code datasets), together with scaling models to 540B parameters, achieves state-of-the-art in-context learning performance on several benchmarks such as BigBench and MMLU. ART uses API access to InstructGPT and Codex (an LLM fine-tuned on code (Chen et al., 2021)) to leverage their emergent in-context learning abilities. Future improvements in scaled finetuning of LLMs will likely improve the performance of ART.

Prompting with intermediate reasoning steps Chain-of-thought (CoT) prompting (Wei et al., 2022; Suzgun et al., 2022) is a popular gradient-free technique that encourages LLMs to generate intermediate reasoning steps prior to the final answer, with multiple task-specific variants (e.g. Least-to-most prompting (Zhou et al., 2022), Self-Ask (Press et al., 2022), Ask-me-anything (Arora et al., 2022), Successive prompting (Dua et al., 2022), Decomposed prompting (Khot et al., 2022)). While such prompts were initially hand-crafted, recent work (Kojima et al., 2022) showed that LLMs can generate CoT-style multi-step reasoning in a zero-shot manner when prompted with the prefix "Let's think step-by-step". Zhang et al. (2022) use
LLMs to automatically generate such CoT-style prompts—AutoCoT—which are competitive with hand-crafted prompts in their performance on arithmetic and commonsense reasoning tasks. We compare ART, CoT, and AutoCoT in Table 1. ART builds on this line of work, introducing a common language that enables cross-task demonstrations and flexible, extensible tool use, improving the accuracy of intermediate reasoning steps.

Tool use There is growing interest in overcoming LLM limitations with external tools such as search engines, web browsers, calculators, translation systems, and Python interpreters (Komeili et al., 2022; Thoppilan et al., 2022; Lazaridou et al., 2022; Shuster et al., 2022; Nakano et al., 2021; Cobbe et al., 2021; Gao et al., 2022; Chen et al., 2022). Most of these approaches either require large amounts of human supervision (Thoppilan et al., 2022; Komeili et al., 2022) or carefully constructed prompts tailored to specific tasks and particular tools. An alternative line of recent work uses self-supervision to teach LLMs to use search, translation, and a calculator (Schick et al., 2023)—Toolformer. In contrast, since ART does not require any additional training or tool-specific prompts, it allows users flexibility both in terms of replacing the underlying LLM (e.g. when a new version of GPT-3 is released), and in replacing or adding new tools (either general-purpose tools or tools that are important for a specific task of interest). We compare ART and Toolformer in Table 1. In Section 3.4, we show how human-in-the-loop feedback — analyzing and debugging LLM generations and extending tool use — can provide a large boost in the performance of ART while also extending it with new tools. This built-in feedback loop and adaptive capability of ART extends the capabilities of LLMs that are finetuned to follow instructions and use tools.

Table 1: Comparing ART with related approaches for multi-step reasoning and tool-use.

Feature              | CoT | AutoCoT | Toolformer | ART
Multi-step reasoning | X   | X       |            | X
Limited supervision  |     | X       | X          | X
Tool use             |     |         | X          | X
Extendable libraries |     |         |            | X
Cross-task transfer  |     | X       | X          | X
Human feedback       | X   |         |            | X

3 ART

With ART, a frozen LLM decomposes instances of a new task into multiple steps (using external tools whenever appropriate), despite not having explicit supervision for decomposition or tool use. In this section, we present an overview of ART, followed by more thorough descriptions of each individual component. We use the Physics Question Answering (PQA) task as a running example, which consists of high-school physics problems.

3.1 Overview

In Figure 2, ART is presented with a new task description and input instance. We also assume access to a few input-output pairs (not shown), with no decomposition or tool use supervision.

Prompt building. ART retrieves similar tasks from a task library (Figure 2(A); Section 3.2), and adds instances of those tasks as demonstrations in the prompt. A demonstration in the task library is written in a specific format, defined by a custom parsing expression grammar (PEG) (Section 3.2). The grammar is defined such that each task instance is decomposed into a sequence of sub-steps. Some of these sub-steps contain symbols corresponding to tools in a tool library (Section 3.3). We refer to these decompositions as programs, since the sequential reasoning steps and symbolic calls to tools are similar to a conventional program with function calls. The resultant prompt consists of programs from related tasks and teaches the LLM how to effectively decompose instances of a new task—related sub-steps and tools in these programs can be used by the LLM for cross-task generalization. In Figure 2(A), the demonstrations include calls to both search and code tools.

Generation. At generation time (Figure 2(B)), the LLM writes its own program. ART parses the program as it is generated, and pauses generation whenever a tool call is encountered in the generated text, resuming generation after the tool is called and its output is integrated back into the program. As illustrated in the figure, a search engine is used to find the appropriate physics formula, and then the LLM uses code generation and execution to substitute the given values and compute the answer.

Human feedback (optional). Humans can add new decomposition demonstrations to the task library, or add/edit tools in the tool library, in order to improve performance on the task at hand (Section 3.4).
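To make the generation step of Section 3.1 concrete, the following is a minimal sketch of the parse-pause-resume cycle. It is an illustration rather than the authors' implementation: llm_complete and the entries in TOOLS are hypothetical placeholders for an LLM API call and for tool wrappers such as those described in Section 3.3.

import re

def llm_complete(prompt, stop="\n"):
    # Hypothetical call to a frozen LLM; returns the next chunk of text, cut at `stop`.
    raise NotImplementedError

TOOLS = {
    "search": lambda query: "<search snippet>",         # placeholder tool
    "code execute": lambda code: "<execution output>",  # placeholder tool
}

# A sub-task query line looks like "Q2: [code execute] Execute snippet".
QUERY = re.compile(r"^Q(?P<i>\d+): \[(?P<name>[^\]]+)\]\s*(?P<arg>.*)")

def run_program(prompt, task_input, max_lines=30):
    """Generate a program line by line, pausing whenever a tool symbol is generated."""
    program = prompt + "Input: " + task_input + "\n"
    prev_answer = task_input   # some tools consume the previous step's output
    for _ in range(max_lines):
        line = llm_complete(program).strip()
        program += line + "\n"
        if line.startswith("Ans:"):
            return line[len("Ans:"):].strip(), program
        if line.startswith("#"):
            prev_answer = line.split(":", 1)[-1].strip()
        match = QUERY.match(line)
        if match and match.group("name") in TOOLS:
            # Pause generation: call the tool and splice its output back into the program.
            arg = match.group("arg") or prev_answer
            output = TOOLS[match.group("name")](arg)
            program += "#" + match.group("i") + ": " + output + "\n"
            prev_answer = output
    return None, program

Sub-steps whose query names do not match a tool (e.g. "[subquestion]" or "[EOQ]") are simply left to the LLM, which generates the corresponding answer line itself on the next iteration.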
[Figure 2 shows the new task input — "Hector yanks on the chain with a 72.0 N force at an angle of 35.0° above the horizontal. Determine the horizontal components of the tension force." — together with task-library demonstration programs (arithmetic code generation, anachronism search) and the program generated for PQA, ending with "Ans: 58.9789".]

Figure 2: A run-through of ART on a new task, Physics QA. (A) Programs of related tasks like anachronisms and Math QA provide few-shot supervision to the LLM — related sub-steps and tools in these programs can be used by the LLM for cross-task generalization (shown in purple). (B) Tool use: Search is used to find the appropriate physics formula, and code generation and execution are used to substitute given values and compute the answer (shown in orange).
Figure 3: Human feedback to ART, shown for (a) PQA, where a generated program is corrected by adding additional reasoning steps, and (b) word unscrambling, where the tool library is augmented with additional tool-use examples and a new lookup tool.
3.2 Task Library

Each sub-step in a program consists of a sub-task query Qi, made up of a sub-task name (e.g. "[search]") followed by the sub-task input ("Q1: [search] What is the formula..."), while the sub-task answer #i is simply the output of the sub-task ("#1: The horizontal component (Fx) can be calculated..."). The program ends with a dummy sub-task ("Q3: [EOQ]"), followed by a final answer node ("Ans: 59N"). All examples in Figures 1 and 2 follow this format.

Task Retrieval Given a new task, ART retrieves N tasks from the task library to construct a dynamic multi-task prompt. We explore two strategies to retrieve similar tasks, depending on what data is available. If a small number of labeled examples for the new task is available (≈50), we iterate over all five task clusters and select a few task programs from each cluster to compose the prompt. Ultimately, the task cluster with the highest performance on the held-out set of examples is chosen when predicting on all unlabeled examples from the task. While this strategy requires a held-out set of input-output pairs, no additional supervision is needed to generate a decomposed program.

In the second strategy, we craft a few-shot prompt (Appendix A.2) with task pairs, where each task includes a name, instructions, and a few input-output examples. For each pair, we provide a label of "Similar" or "Not similar", and reasoning (e.g. "These are related because they require solving arithmetic word problems"). At run time, we pair the test task with every task in the task library, and choose the highest-ranked ones based on the log probability ratio between "Similar" and "Not similar". We explore both strategies in Appendix A.2.
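A sketch of this second strategy is shown below. It assumes a hypothetical llm_logprob helper that returns the log-probability the frozen LLM assigns to a continuation, and the prompt wording is illustrative rather than the exact few-shot prompt of Appendix A.2.

def llm_logprob(prompt, continuation):
    # Hypothetical: log-probability the frozen LLM assigns to `continuation` given `prompt`.
    raise NotImplementedError

def rank_library_tasks(test_task, library_tasks, few_shot_prefix, top_n=3):
    """Rank library tasks by log P("Similar") - log P("Not similar") for each pair."""
    scored = []
    for lib_task in library_tasks:
        pair_prompt = (
            few_shot_prefix                       # labeled (task A, task B, label) examples
            + "Task A: " + test_task
            + "\nTask B: " + lib_task
            + "\nAre these tasks similar? "
        )
        score = llm_logprob(pair_prompt, "Similar") - llm_logprob(pair_prompt, "Not similar")
        scored.append((score, lib_task))
    scored.sort(reverse=True)
    return [task for _, task in scored[:top_n]]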
3.3 Tool Library

Whenever a sub-task query name matches a tool name in the task library (e.g. "Qi: [search]"), generation is stopped and resumed after the tool is called and its output is incorporated into the partially completed program. We seed the tool library with the following tools (all of which have demonstrations in the task library). In particular, we describe the symbols used to represent these tools and their inputs, and specify how the tool output is incorporated back into the program. Tool-specific implementation details and other tools added to ART during feedback (Section 3.4) are in Appendix A.3.

Search We use SerpAPI², which provides an API for Google search. The input to search is the sequence generated by the LLM after "Qi: [search]". We extract answer box snippets when they are available, or combine the top-2 search result snippets together. For PQA in Figure 2(B), the search query is the original input followed by "What is the formula for the horizontal component of tension force?", and the output is "... horizontal component (Fx) can be calculated as Ftens*cosine(θ) ...".

²https://serpapi.com
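For illustration, a search wrapper along these lines could look as follows. This is a sketch rather than the authors' code: it assumes SerpAPI's documented REST endpoint and JSON fields (answer_box, organic_results), and SERPAPI_KEY is a placeholder for a real key.

import requests

SERPAPI_KEY = "YOUR_API_KEY"  # assumption: a valid SerpAPI key

def search(query, k=2):
    """Return an answer-box snippet if present, otherwise the top-k organic snippets."""
    resp = requests.get(
        "https://serpapi.com/search",
        params={"q": query, "api_key": SERPAPI_KEY, "engine": "google"},
        timeout=30,
    )
    data = resp.json()
    box = data.get("answer_box") or {}
    for field in ("answer", "snippet", "result"):
        if box.get(field):
            return box[field]
    snippets = [r.get("snippet", "") for r in data.get("organic_results", [])[:k]]
    return " ".join(s for s in snippets if s)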
Code Generation We use the Codex (Chen et al., 2021) model for code generation. The input to code generation is the sequence generated by the LM after the sub-task query symbol "Qi: [generate python code]". This argument is an instruction for code generation and is prompted to Codex as a multi-line comment in Python. For example, in Figure 2, Codex is prompted with the instruction "Use the formula Fx = Ftens * cosine(θ) to solve..." as a comment and generates "T = 72.0, theta = 35.0, ..., Fx = T*math.cos(radians)", which is appended to the incomplete program.

Code Execution We run Python code in a virtual Python environment with arithmetic, symbolic, and scientific computing packages pre-installed. The argument to code execute is the previous sub-task's answer sequence "#(i − 1): ...", i.e. the Python code snippet to be executed. For i = 1, the task input is used as the argument, since it potentially contains the code snippet to be executed. In Figure 2, the code snippet generated in the previous step is executed and the value of the variable "Fx" is added to the incomplete program.
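The snippet below sketches one way such an execution tool could work: run the generated code in a fresh namespace, capture printed output, and optionally read back a named variable such as "Fx" or "ans". It is a simplified stand-in for the virtual environment described above, not the actual implementation; a real deployment should isolate execution properly.

import contextlib
import io
import math

def execute_code(snippet, result_var=None):
    """Run a generated Python snippet and return its printed output or a named variable."""
    namespace = {"math": math}          # pre-import packages the generated programs expect
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            exec(snippet, namespace)    # NOTE: run inside a proper sandbox in practice
    except Exception as err:
        return f"{type(err).__name__}: {err}"
    if result_var is not None and result_var in namespace:
        return str(namespace[result_var])
    return stdout.getvalue().strip()

Called on the snippet generated in Figure 2(B), such a helper would return approximately 58.98 for "Fx".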
3.4 Human feedback

ART is specifically designed to be amenable to human feedback, since it does not require additional finetuning. Consequently, users can incorporate feedback immediately into ART by editing the task library and/or the tool library. Since ART generates multi-step reasoning programs that are interpretable, we explore feedback in the form of debugging, i.e. users edit existing programs rather than creating programs from scratch. These edits can be in the form of correcting sub-step outputs, adding/removing sub-steps (with appropriate inputs and answers), adding calls to new tools, etc. For example, in Figure 3(a) the user edits a program by adding two sub-steps, in order to round the answer to the nearest integer and include the appropriate unit of measurement in the answer. This feedback demonstrates appropriate decompositions for the task, as these operations are still performed by the LLM (the tool library does not have "[arithmetic]" or "[add unit]" APIs). In contrast, in Figure 3(b) the user demonstrates the use of a dictionary "[lookup]" and implements it as a tool in the tool library. While most of our experiments do not rely on such feedback (and thus measure "zero-shot" task transfer with no supervision for reasoning/tool use), we show that simple operations like these can drastically improve performance on target tasks.

4 Experimental Setup

Evaluation Datasets In addition to the 15 tasks in the task library (Section 3.2), we evaluate ART on 19 additional test tasks from BigBench which also belong to the five task clusters identified in Section 3.2. To check for cross-benchmark generalization, we further evaluate ART on a random subset of tasks from the MMLU benchmark (Hendrycks et al., 2020). Finally, we also evaluate on a subset of the tasks used to evaluate Toolformer (Schick et al., 2023), in order to compare ART to a model fine-tuned for tool use.

Details We use InstructGPT (text-davinci-002) as the frozen LLM, and Codex as the code generation tool, with temperature set to 0.3. We set the number of seed tasks in the prompt to N = 3 and use 2 demonstration programs from each task. We measure the preferred scoring metric for each task as in Srivastava et al. (2022), and report performance averaged over 5 runs.

Baselines ART proposes an automatic framework to generate multi-step reasoning decompositions and to use relevant available external tools within those decompositions. We compare with the following baselines:
• Few-shot/Direct: Prompting LLMs with input-output pairs (but no intermediate reasoning). We use 3 examples for BigBench and 5 examples for MMLU, as done in prior work (Suzgun et al., 2022). We evaluate this baseline for both GPT-3 and Codex, and report the higher of the two.
• AutoCoT: A baseline that automatically generates multi-step reasoning in natural language. A random subset of 5 examples is first used to elicit CoT-style reasoning (Input + "Let's think step-by-step."). These examples and their generated output form the prompt for other unseen examples of the task. This baseline is free-form and does not include tools, and thus allows us to verify the effectiveness of our query language and task library. We evaluate this baseline for GPT-3.
• ART w/o Tool Use: ART with tool use turned off, i.e. the LLM generates the output of every sub-step, to verify the gains from tool use.
• GPT-3 Best: The best published GPT-3/Codex (175B) result with multi-step decomposition and/or tool use. These often include additional human supervision to decompose reasoning steps, and external tools to boost performance (with carefully constructed prompts).
Additional details about baselines and GPT-3 Best models are in Appendix A.4.

5 Results

We evaluate ART (without human feedback) on tasks in the task library (Section 5.1), and on a variety of test tasks from BigBench, MMLU, and QA benchmarks (Section 5.2). We then show that ART can be further improved with more compute (self-consistency) and with human feedback (Section 5.3).
Task Name (Cluster)                | Few Shot | AutoCoT | ART w/o Tool Use | ART   | GPT-3 Best
Anachronisms (Search)              | 71.35    | 51.48   | 70.87            | 75.66 | -
Musique (Search)                   | 2.035    | 12.88   | 10.04            | 19.19 | 15.23
Hindu Knowledge (Search)           | 85.02    | 73.03   | 83.42            | 87.98 | -
Known Unknown (Search)             | 68.90    | 56.09   | 80.43            | 80.43 | -
∆ with ART (Search)                | +9.0     | +17.44  | +4.6             |       | +4.0
Elementary Math QA (Arithmetic)    | 56.40    | 74.52   | 58.04            | 68.04 | -
Aqua-rat (Arithmetic)              | 20.547   | 34.41   | 36.29            | 54.20 | 54.14
GSM8K (Arithmetic)                 | 7.797    | 21.99   | 53.4             | 71.00 | 71.64
Navigate (Arithmetic)              | 60.77    | 61.7    | 72.4             | 72.4  | 85.901
∆ with ART (Arithmetic)            | +30.0    | +18.25  | +11.4            |       | -4.7
K'th letter concatenation (String) | 3.2      | 0.64    | 8.19             | 40.00 | 98.02
Language games (String)            | 35.145   | 18.58   | 11.19            | 23.08 | -
Date Understanding (String)        | 37.535   | 38.90   | 52.05            | -     | 70.411
Auto Debugging (Code)              | 62.945   | 38.24   | 55.29            | 62.94 | -
Code Description (Code)            | 97.997   | 88.67   | 84.67            | 88.00 | -
Formal Fallacies (CoT)             | 44.845   | 56.4    | 64.76            | -     | 58.41
Hyperbation (CoT)                  | 62.725   | 55.4    | 80.80            | -     | 72.41
∆ with ART (Misc)                  | +9.6     | +16.4   | +13.7            |       | -15.4
∆ with ART (Overall)               | +14.90   | +17.17  | +7.91            |       | -9.0

Table 2: ART performance on tasks in the task library. (1: Human-crafted CoT (Wei et al., 2022; Suzgun et al., 2022), 2: Decomposed Prompting (Khot et al., 2022), 3: Self-Ask (Press et al., 2022), 4: PoT (Chen et al., 2022), 5: InstructGPT (Ouyang et al., 2022), 7: Code-davinci-002 (Chen et al., 2021)). (-) For tasks using CoT reasoning, no tool use is used.
5.1 Results on the task library

For tasks in the task library, demonstrations in the prompt include two instances of the task itself, along with other instances from tasks in the same cluster. We present results in Table 2, where tasks are organized by skill cluster. Even with decomposition demonstrations for only two instances, ART drastically improves performance over few-shot learning (+14.9 percentage points on average), in line with prior work on CoT. It does not do as well on language games, code description, and auto debugging — tasks that use code generation and/or code editing models. We observe that code generation errors often lead to cascading errors in reasoning.

Similarly, ART outperforms AutoCoT on most tasks even without any tool use (by 8 percentage points on average). We hypothesize that the program format (and PEG grammar) is better at eliciting multi-step reasoning from models than free-form CoT, due to the added structure of the reasoning. When tool use is turned on, ART outperforms AutoCoT on all tasks but one (+17.7 percentage points). Tools are called in ≈95% of test instances, and significantly improve performance (+7.91 percentage points). Gains from tool use are particularly significant for arithmetic tasks, which benefit from representing the arithmetic problem as code that executes complex arithmetic accurately (+21.85 on average). This has also been noted in prior work (Chen et al., 2022; Gao et al., 2022).

Compared to the best published GPT-3 results, ART is stronger or comparable on 5/8 tasks. For the others, further investigation indicates that the demonstrations provided by Khot et al. (2022) and Suzgun et al. (2022) are simply more effective than the two programs we author for these tasks (we explore further human feedback for these in Appendix A.5). In sum, ART is stronger than few-shot learning and AutoCoT on the library tasks (where we provide 2 labeled decompositions), and comparable to the best published GPT-3 results.

5.2 Test tasks (cross-task transfer)

We measure cross-task generalization on test tasks, where ART does not use explicit supervision for decomposition and tool use. ART retrieves demonstrations from the task library according to the first strategy in Section 3.2, which uses a small amount of labeled input-output pairs to pick a task cluster and sample demonstration programs from that cluster.³

³We compare both strategies in Appendix A.2.
Task Name (Cluster)                     | Few Shot | AutoCoT | ART w/o Tool Use | ART   | GPT-3 Best
Test Tasks
Sentence Ambiguity (Search)             | 70.675   | 51.47   | 71.00            | 73.33 | -
Strategy QA (Search)                    | 55.495   | 27.22   | 59.37            | 66.44 | -
Physics (Search)                        | 70.095   | 61.83   | 59.13            | 67.55 | -
∆ with ART (Search)                     | +3.7     | +22.27  | +5.9             |       |
Physics Questions (Arithmetic)          | 7.02     | 5.56    | 6.30             | 20.37 | -
Operators (Arithmetic)                  | 71.237   | 75.52   | 71.80            | 92.00 | -
Unit interpretation (Arithmetic)        | 58.27    | 41.20   | 51.4             | 53.99 | -
Repeat copy logic (Arithmetic)          | 50.017   | 15.63   | 31.25            | 44.38 | -
Object Counting (Arithmetic)            | 39.27    | 26.80   | 42.2             | 87.00 | 81.201
Penguins in a table (Arithmetic)        | 58.237   | 40.40   | 68.86            | 77.85 | 72.341
Reasoning about objects (Arithmetic)    | 71.007   | 33.33   | 45.35            | 64.34 | 52.691
Tracking shuffled objects (Arithmetic)  | 22.397   | 19.44   | 18.14            | 37.67 | 36.321
∆ with ART (Arithmetic)                 | +19.0    | +36.7   | +23.1            |       | +6.1
Word Unscramble (String)                | 40.72    | 32.44   | 23.03            | 42.7  | -
Simple Text Editing (Code)              | 35.315   | 30.21   | 20.74            | 27.65 | -
CS Algorithms (Code)                    | 73.487   | 0.0     | 41.59            | 88.11 | -
Sports Understanding (CoT)              | 69.745   | 51.47   | 92.89            | -     | 86.591
Snarks (CoT)                            | 54.585   | 57.24   | 57.13            | -     | 65.21
Disambiguation QA (Free-form)           | 55.035   | 48.45   | 55.89            | -     | 60.621
Temporal sequences (CoT)                | 55.807   | 19.70   | 49.5             | -     | 81.81
Ruin names (CoT)                        | 71.015   | 55.28   | 60.22            | -     | -
∆ with ART (Misc)                       | +2.4     | +22.5   | +24.37           |       | -9.4
∆ with ART (Overall)                    | +6.9     | +24.6   | +16.7            |       | -1.7
MMLU
College Computer Science (Search)       | 41.00    | 43.99   | 63.40            | 67.80 | 63.66
Astronomy (Search)                      | 62.10    | 41.48   | 76.71            | 79.1  | 62.56
Business Ethics (Search)                | 61.60    | 48.8    | 77.17            | 81.16 | 72.76
Virology (Search)                       | 50.03    | 49.52   | 71.60            | 71.49 | 50.726
Geography (Search)                      | 77.67    | 57.07   | 70.30            | 71.71 | 81.86
Mathematics (Arithmetic)                | 36.67    | 33.77   | 39.50            | 45.66 | 34.56
∆ with ART (MMLU)                       | +14.6    | +23.7   | +3.0             |       | +8.5

Table 3: ART performance on BigBench tasks and MMLU tasks. (1: Human-crafted CoT (Wei et al., 2022; Suzgun et al., 2022), 5: InstructGPT (Ouyang et al., 2022), 6: Scaled instruction finetuning (Chung et al., 2022), 7: Code-davinci-002 (Chen et al., 2021)).
Table 4: Comparing ART results on the GPT-3 (175B) model with Toolformer (Schick et al., 2023), a smaller GPT-J model finetuned for tool use. Toolformer results are reported from their paper (their code and models are not publicly available).
BigBench test tasks Even though there is no decomposition or tool use supervision, the results in Table 3 are similar to those for tasks in the task library. ART outperforms few-shot learning (+6.9 percentage points). In particular, ART has significant improvements on arithmetic tasks (+19.0) and is comparable to few-shot performance on search tasks. Non-grammatical choices in ruin names and choices not in the input in temporal sequences are often incorrect, which the few-shot baseline may potentially learn to ignore, while ART attempts to explicitly reason about them. As with library tasks, we observe that string manipulation tasks like simple text editing, word unscrambling, and repeat copy logic suffer from code generation errors.

Method             | Simple Text Editing | CS Algorithms | Strategy QA | Physics Questions | Unit Interpretation | Reasoning about colored objects
ART                | 27.65               | 88.11         | 66.44       | 20.37             | 53.99               | 64.34
+ Self-Consistency | 30.67 (+3.0)        | 90.99 (+2.9)  | 70.76 (+4.3)| 24.07 (+3.7)      | 57.20 (+3.2)        | 69.11 (+4.8)

Table 5: Improving ART via self-consistency (Wang et al., 2022). Ensembling model generations over 15 runs further boosts performance.

Table 6: Improving ART and free-form CoT via self-consistency and human-in-the-loop feedback. (*) indicates that human-in-the-loop improvement was done over automatically generated CoT reasoning for these tasks. Feedback for ART includes correcting sub-steps in programs ("C:"), adding additional sub-steps ("A:"), and defining new tools ("T:"). Note that only five examples were edited for each task.
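As a concrete illustration of the self-consistency setup behind Table 5, the sketch below samples several programs and majority-votes over their final answers. It is an illustration under stated assumptions: run_program stands for the hypothetical generation loop sketched after Section 3.1, and the answer normalization is ours.

from collections import Counter

def self_consistency_answer(prompt, task_input, n_runs=15):
    """Sample several programs (non-zero temperature) and return the most frequent answer."""
    answers = []
    for _ in range(n_runs):
        answer, _ = run_program(prompt, task_input)   # hypothetical sampler from the earlier sketch
        if answer is not None:
            answers.append(answer.strip().lower())
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]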
As observed in the case of library tasks, ART is better than AutoCoT on almost all tasks (+24.6 percentage points). Tools are once again called very frequently (89% of instances), and are responsible for a significant fraction of the gains over baselines.

When compared to the best published GPT-3 results, ART performs favorably on average, especially on arithmetic tasks (+6.1 percentage points). As before, it does worse on tasks where good human demonstrations of how to decompose the task itself (provided by Suzgun et al. (2022)) have a big impact. We re-evaluate ART with more human feedback on these tasks in Section 5.3, but even without that we conclude that ART is competitive on BigBench even when we do not have supervision for decompositions for the task at hand (i.e. there is cross-task generalization).

MMLU Results for MMLU tasks are also reported in Table 3, where ART is more effective than all baselines on 5/6 tasks (+8.5 points better than the few-shot baseline on average), despite having no supervision for demonstrations or tool use. MMLU requires extensive world knowledge, and thus most of these tasks benefit the most from the search tool.

In Table 4, we compare ART on a random subset of the tasks used to evaluate Toolformer (Schick et al., 2023), a model finetuned to use a variety of tools. The comparison is not exact, since Toolformer uses a smaller GPT-J model, but it is informative that ART outperforms Toolformer by a large margin on 5/6 of these tasks. To make sure these gains are not simply a result of model scale, we also use vanilla GPT-3 as a baseline, which yields much worse results than ART on all tasks. Besides improved performance, we note again that ART does not require additional fine-tuning when new tools or new base LLMs are introduced, and it is also amenable to further improvement at the cost of compute or human feedback.
Tushar Khot, Harsh Trivedi, Matthew Finlayson, Yao Fu, Kyle Richardson, Peter Clark, and Ashish Sabharwal. 2022. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916.

Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. Internet-augmented dialogue generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8460–8478, Dublin, Ireland. Association for Computational Linguistics.

Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115.

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin Raffel. 2022. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. arXiv preprint arXiv:2205.05638.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.

Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, and Jason Weston. 2022. Language models that seek for knowledge: Modular search & generation for dialogue and prompt completion. arXiv preprint arXiv:2203.13224.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, et al. 2022. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239.

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H. Chi, Quoc V. Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2021. An explanation of in-context learning as implicit bayesian inference. arXiv preprint arXiv:2111.02080.

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493.

Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Olivier Bousquet, Quoc Le, and Ed Chi. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.

A Appendix

A.1 Task Library

Library Design We analyzed input-output instances of all 200 tasks in BigBench, filtered out text classification and short answer generation tasks in English, and created a list of reasoning skills that were relevant to solving each task. We do not focus on long text understanding, long text generation, and multi-lingual tasks in this work. We find that most of these tasks rely on a few common skills, mentioned below: Visual Reasoning, Temporal Reasoning, Propositional Logic, Natural Logic, Machine Translation, Web Search, Knowledge Base or Database Lookup, Recursive Sub-question Decomposition, Long Text Understanding, Database Operations, Algebra and Arithmetic, Code Generation and Editing, Text Tagging/Annotation (linguistic markers), Specialized Search (e.g. looking up linguistic knowledge, scientific knowledge, etc.), String Editing, Recursive Operations over Multiple Choices, Topic Classification, Evidence Extraction, Conditional Text Generation/Editing, and Sentence Similarity.

In this work, we choose to focus on the five most used skills, which cover a significant proportion of BigBench tasks for classification (over 50 of the 91 tasks that remained after filtering out long-text understanding, generation, and multi-lingual tasks). We randomly select 2-4 tasks from each of these 5 task clusters and author decomposed programs with appropriate tool use for these tasks. This results in a total of 15 tasks that compose the task library.
• Arithmetic: Elementary MathQA, Grade school math (GSM8K), arithmetic questions about ratios (Aqua-Rat), Navigate
• Code: Auto Debugging, Code Description
• Search and question decomposition: Anachronisms, Multi-step question answering (Musique), Hindu Knowledge, Known Unknown
• Free-form reasoning: Formal Fallacies, Hyperbation
• String Operations: Kth letter concatenation, Language games, Date understanding

Cluster Programs The programs written for tasks in each task cluster are shown in Table 7 for tasks involving string editing and manipulation, in Table 8 for arithmetic and algebra tasks, in Table 10 for code generation, editing, and debugging tasks, in Table 9 for tasks that benefit from search of world knowledge, and in Table 11 for tasks that benefit from eliciting chain-of-thought reasoning following the prompt "Let's think step-by-step".

Program Format We define a parsing expression grammar (PEG) (shown in Figure 4) that describes the language used to write multi-step reasoning programs. This grammar is designed to parse full programs of the form "Input: ... Q1: ... #1: ... Qn: [EOQ] Ans: ...". We use the Python library parsimonious⁴ to construct the grammar and parse programs generated by LLMs.

⁴https://pypi.org/project/parsimonious/
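The full grammar appears in Figure 4 and is not reproduced here. As an illustration of its general shape, the simplified grammar below, written with parsimonious, parses toy programs of this form; the rule names and exact line format are ours and only approximate the real grammar.

from parsimonious.grammar import Grammar

program_grammar = Grammar(r"""
    program     = input_line step* eoq answer
    input_line  = "Input: " text nl
    step        = !eoq query answer_line?
    query       = qnum "[" tool_name "]" text nl
    answer_line = ~"#[0-9]+: " text nl
    eoq         = qnum "[EOQ]" nl
    answer      = "Ans: " text nl?
    qnum        = ~"Q[0-9]+: "
    tool_name   = ~"[^\]]+"
    text        = ~"[^\n]*"
    nl          = ~"\n"
""")

example = (
    "Input: Viola had 167 breads. Nancy took 137 from him. How many does Viola have now?\n"
    "Q1: [generate code] Write down arithmetic as python code\n"
    "#1: ans = 167 - 137\n"
    "Q2: [code execute] Execute snippet\n"
    "#2: 30\n"
    "Q3: [EOQ]\n"
    "Ans: 30\n"
)
tree = program_grammar.parse(example)   # raises ParseError if the program is malformed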
String Operations
In these examples, you are given a task description and an input. Break the input down into subtasks in order to solve the
task. You can use string operations like splitting, reformatting, editing or merging. You can also use other operations like
arithmetic and logic.
Description: (Date Understanding) Find the required date in MM/DD/YYYY using information about related events and
dates in the input. Clue: First find what day is today.
Input: The deadline is Jun 1, 2021, which is 2 days away from now. What is the date 24 hours later in MM/DD/YYYY?
Q1: [string reformat] Jun 1, 2021 in MM/DD/YYYY
#1: 06/01/2021
Q2: [arithmetic] 06/01/2021 is 2 days away from now. What date is today?
#2: Today is 04/01/2021
Q3: [arithmetic] What date is 24 hours later than today?
#3: 05/01/2021
Q4: [EOQ]
Ans: 05/31/2021
—-
Description: (Language games) Translate English into Pig Latin.
Input: (English) Sami made his way across the bar and hugged Layla.
Q1: [string split] What are the words in "Sami made his way across the bar and hugged Layla."?
#1: ["Sami", "made", "his", "way", "across", "the", "bar", "and", "hugged", "Layla", "."]
Q2: [string edit] Transfer the initial consonant of each word to the end of the word and adding "ay" after it.
#2: ["Amisay", "ademay", "ishay", "ayway", "acrossyay", "ethay", "arbay", "andyay", "uggedhay", "Aylalay", "."]
Q3: [string merge] Concatenate #2 into a full sentence.
#3: Amisay ademay ishay ayway acrossyay ethay arbay andyay uggedhay Aylalay.
Q4: [EOQ]
Ans: Amisay ademay ishay ayway acrossyay ethay arbay andyay uggedhay Aylalay.
—-
Description: (Kth letter concatenation) Take the letters at position 3 of the words in a list of words and concatenate them
using a space.
Input: Take the letters at position 3 of the words in "Savita Saeed Ramos Sato Yadav" and concatenate them using a space.
Q1: [string split] What are the words in "Savita Saeed Ramos Sato Yadav"?
#1: ["Savita", "Saeed", "Ramos", "Sato", "Yadav"]
Q2: [string index] What is the third letter of words in the list in #1?
#2: ["v", "e", "m", "t", "d"]
Q3: [string merge] Concatenate #2 with spaces
#3: "v e m t d"
Q4: [EOQ]
Ans: v e m t d
—-
Description: %s
Input: %s
Q1:
Table 7: Programs in the task library for tasks requiring string manipulation.
Arithmetic
In these examples, you are given a task description and an input. Break the input down into subtasks in order to solve the task.
You can generate python code to solve arithmetic and algebra equations using functions from sympy.
from sympy import Symbol
from sympy import simplify
import math
from sympy import solve_it
# solve_it(equations, variable): solving the equations and return the variable value.
Description: (Aqua-rat) Solve the following arithmetic problems on ratios and fractions, writing out intermediate arithmetic
calculations as python code. Store your result as a variable named ’ans’.
Input: In a flight of 600 km, an aircraft was slowed down due to bad weather. Its average speed for the trip was reduced by
200 km/hr and the time of flight increased by 30 minutes. The duration of the flight is: A)1 hour B)2 hours C)3 hours D)4
hours E)5 hours
Q1: [generate python code] write python code to solve the problem, using math and sympy.
#1:
duration = Symbol(’duration’, positive=True)
delay = 30 / 60
total_distance = 600
original_speed = total_distance / duration
reduced_speed = total_distance / (duration + delay)
solution = solve_it(original_speed - reduced_speed - 200, duration)
ans = solution[duration]
print(ans)
Q2: [code execute] Execute the python code in #1 and get the value of "ans"
#2:
1.0
Q3: [compare] Which of the options among A)1 hour B)2 hours C)3 hours D)4 hours E)5 hours is most similar to the answer?
#3: A
Q4: [EOQ]
Ans: A
—-
Description: (Elementary Math) Solve the following middle-school arithmetic problems, writing out intermediate arithmetic
calculations as python code. Store your result as a variable named ’ans’.
Input: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every
day with four. She sells the remainder at the farmers’ market daily for $2 per fresh duck egg. How much in dollars does she
make every day at the farmers’ market?
Q1: [generate python code] write down the arithmetic or algebra equations as python code, storing the answer as ’ans’
#1:
total_eggs = 16
eaten_eggs = 3
baked_eggs = 4
sold_eggs = total_eggs - eaten_eggs - baked_eggs
dollars_per_egg = 2
ans = sold_eggs * dollars_per_egg
print(ans)
Q2: [code execute] Execute the python code in #1 and get the value of "ans"
#2: 18
Q3: [EOQ]
Ans:18
—-
Description: (Grade school math) Solve the following middle-school arithmetic problems, writing out intermediate arithmetic
calculations as python code. Store your result as a variable named ’ans’.
Input: Joseph and Getty went to buy ice creams, they together bought 36 ice creams. On the way back, Joseph ate 12 of the
ice creams, and he has 2 ice creams left now.
Q1: [generate python code] write down the arithmetic or algebra equations as python code, storing the answer as ’ans’
#1:
num_ice_creams_bought_by_joseph = 2 + 12
total_ice_creams = 36
ans = total_ice_creams - num_ice_creams_bought_by_joseph
print(ans)
Q2: [code execute] Execute the python code in #1 and get the value of "ans"
#2: 22
Q3: [EOQ]
Ans: 22
—-
Description: %s
Input: %s
Q1:
Table 8: Programs in the task library for tasks requiring arithmetic operations.
Search
In these examples, you are given a task description and an input. Break the input down into subtasks in order to solve the task.
You can use search functions like Google search in one or more of your substeps, if there is insufficient information. Other
functions like arithmetic and logical operations can also be used.
Description: (Known or Unknown) Choose the option that best answers the question. If the question does not have a known
answer, choose "Unknown".
Input: How many hairs were on Neil Armstrong’s head when he landed on the moon?
choice: Unknown
choice: Five million
Q1: [search] How many hairs were on Neil Armstrong’s head when he landed on the moon?
#1:
Apollo 11 (July 16–24, 1969) was the American spaceflight that first landed humans on the Moon. Commander Neil
Armstrong and lunar module pilot Buzz Aldrin.
Neil Alden Armstrong (August 5, 1930 – August 25, 2012) was an American astronaut and aeronautical engineer who became
the first person to walk on the Moon.
Q2: [subquestion] Does the information help answer the question? There could be no definitive answer because the question is
too specific, about personal details not in public record, because the answer is not yet known, or the question is opinion-based.
#2: No. The question is too specific
Q3: [compare] What is the final answer?
#3: Unknown
Q4: [EOQ]
Ans: Unknown
—-
Description: (Anachronisms) An anachronism is a mistake in chronology, or a person, thing, or event that is out of its proper
time. Does the sentence contain an anachronism? Answer Yes/No.
Input: President George H. W. Bush called his generals to the Oval Office at the outset of the Gulf War.
Q1: [tag] What are the entities in this sentence?
#1:
President George H. W. Bush
Gulf War
Q2: [search] When was President George H. W. Bush president?
#2: George H. W. Bush’s tenure as the 41st president of the United States began with his inauguration on January 20, 1989,
and ended on January 20, 1993.
Q3: [search] When was the Gulf War fought?
#3: The Gulf War[b] was a 1990–1991 armed campaign waged by a 35-country military coalition in response to the Iraqi
invasion of Kuwait.
Q4: [subquestion] Could these entities have co-existed based on their time periods alone?
#4: Yes. Their time periods intersect.
Q5: [generate output] Is this an anachronism?
#5: No
Q6: [EOQ]
Ans: No
—-
Description: (Hindu Knowledge) Answer questions about Hindu mythology by choosing the option that best answers the
question.
Input: In the Mahabharata, Karna is cursed to forget the incantations needed to use which weapon?
choice: Anjalikastra
choice: Narayanastra
choice: Agneyastra
choice: Brahmastra
Q1: [search] In the Mahabharata, Karna is cursed to forget the incantations needed to use which weapon?
#1: As a result, he cursed Karna, saying that HIS MARTIAL SKILLS, including the use of BRAHMASTRA, would abandon
him when he needed them most. Indra, the King of Gods, stung Karna in the form of a bee to get him cursed by Parshuram.
Karna walked through the woods in despair, feeling dejected by the curse. A skilled & devoted warrior...
Q2: [compare] Which option is the answer in #1 most similar to?
#2: Brahmastra
Q3: [EOQ]
Ans: Brahmastra
—-
Description: %s
Input: %s
Q1:
Table 9: Programs in the task library for tasks requiring search for world knowledge.
Code Generation and Editing
In these examples, you are given a task description and an input. Break the input down into subtasks in order to solve the task.
You can use a python code generation and execution function in one or more of your substeps, if required. Other functions
like arithmetic and logical operations can also be used.
Description: (Auto Debugging) Debug the following code snippets by finding the answer or the error message.
Input:
“‘
if x < 5:
pass
“‘
What error does this program surface?
Q1: [execute] Execute the following python code snippet.
if x < 5:
pass
#1:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name ’x’ is not defined
Q2: [generate answer] What is the final error message?
#2: NameError: name ’x’ is not defined
Q3: [EOQ]
Ans: NameError: name ’x’ is not defined
—-
Description: (Code description) Given a python code snippet, choose the option that is the best description of the code
snippet.
Input:
Python code:
try:
n = int(input())
m = int(input())
integer_sum = int(n) + int(m)
print(integer_sum)
except:
print(’error’)
choice: prints number between 5 and 6
choice: try input and except error
choice: inputs the string ’try’
choice: prints sum of two input numbers only if they are integers otherwise raises error
Q1: [code generate] prints number between 5 and 6
#1:
import random
print(random.uniform(5,6))
Q2: [code generate] try input and except error
#2:
try:
file = open(file_name, "r")
data = file.read()
file.close()
except:
print(’error’)
Q3: [code generate] inputs the string ’try’
#3: print(’try’)
Q4: [code generate] prints sum of two input numbers only if they are integers otherwise raises error
#4:
#!/usr/bin/python
Table 10: Programs in the task library for tasks requiring code generation or editing.
CoT Reasoning
In these examples, you are given a task description and an input. Break the input down into subtasks in order to solve the task.
Thinking though the problem explicitly can be one of the substeps you use.
Description: (Sports Understanding) Determine whether an artificially constructed sentence relating to sports is plausible.
The final answer should be "yes" or "no".
Input: Is the following sentence plausible? "Santi Cazorla scored a touchdown."
Q1: [think step-by-step]
#1: Let’s think step-by-step. Santi Cazorla is a soccer player. Touchdown is part of American football and rugby. So the
answer is no.
Q2: [EOQ]
Ans: no
—-
Description: (Hyperbation) Identify correct adjective ordering from the two choices. This involves selecting what would be
considered the more inexplicably "intuitive" sentence by a native English speaker.
Input: Which sentence has the correct adjective order:
Options:
(A) repulsive small Brazilian exercise ship
(B) Brazilian repulsive exercise small ship
Q1: [think step-by-step]
#1: Let’s think step-by-step. When there is more than one adjective before a noun, the adjectives need to respect the following
order before a noun: "[1. opinion] [2. size] [3. age] [4. shape] [5. color] [6. origin] [7. material] [8. purpose] noun".
Option (A): "repulsive small Brazilian exercise ship". (1) "repulsive" falls into the opinion category. (2) "small" falls into the
size category. (3) "Brazilian" falls into the origin category. (4) "exercise" falls into the purpose category. Option (A) has the
following adjective order: [1. opinion] [2. size] [6. origin] [8. purpose] (or, in numeric terms, 1 2 6 8). Because 1 < 2 < 6 < 8
is correct, (A) has the correct ordering.
Option (B): "Brazilian repulsive exercise small ship". Option (B) has the following adjective order: [6. origin] [1. opinion] [8.
purpose] [2. size] (or, in numeric terms, 6 1 8 2). Because 6 < 1 < 8 < 2 is not correct, (B) does not have the correct ordering.
So the answer is (A).
Q2: [EOQ]
Ans: (A)
—-
Description: (Formal Fallacies) Distinguish deductively valid syllogistic arguments from formal fallacies, paying specific
attention to negations.
Input: "It is not always easy to see who is related to whom – and in which ways. The following argument pertains to this
question: To begin with, Lesley is a close friend of Fernando. Moreover, being a close friend of Fernando or a schoolmate of
Lowell is sufficient for being a great-grandfather of Leroy. It follows that Lesley is a great-grandfather of Leroy."
Is the argument, given the explicitly stated premises, deductively valid or invalid?
Options:
- valid
- invalid
Q1: [think step-by-step]
#1:
Let’s think step-by-step.
(1) Lesley is a close friend of Fernando: Lesley = friend(Fernando).
(2) Being a close friend of Fernando or a schoolmate of Lowell is sufficient for being a great-grandfather of Leroy: If X =
friend(Fernando) OR SCHOOLMATE(Lowell), then X = great-grandfather(Leroy).
Hypothesis: Does it follow that Lesley is a great-grandfather of Leroy: Lesley = great-grandfather(Leroy)?
Let’s see whether the Hypothesis can be deduced from the arguments (1) and (2) by logical reasoning?
By (1), we have Lesley = friend(Fernando). By (2), we have if Lesley = friend(Fernando), then Lesley = great-
grandfather(Leroy).
So, it is true that Lesley is a great-grandfather of Leroy. So the answer is valid.
Q2: [EOQ]
Ans: valid
—-
Description: (Reasoning about colored objects) Given a collection of colored objects in the text input, answer the question at
the end of the input.
Input: On the nightstand, there is a red pencil, a purple mug, a burgundy keychain, a fuchsia teddy bear, a black plate, and a
blue stress ball. What color is the stress ball?
Q1: [think step-by-step]
#1: Let’s think step-by-step. According to this question, the color of the stress ball is blue. So the answer is blue.
Q2: [EOQ]
Ans: blue
—-
Description: %s
Input: %s
Q1:
Table 11: Programs in the task library for tasks requiring free-form chain-of-thought style reasoning about logic and linguistics.
A.2 Task Selection

When provided with a new task description and input instance, ART retrieves N tasks from the task library to construct a dynamic multi-task prompt. We explore two strategies for task selection.

Task-Cluster based We use 50 examples for tuning, except in cases with fewer than 100 examples, where we reduce this number to 10. We iterate over all five task clusters in the library, prompting the LLM with demonstration programs from just one cluster at a time. For example, we only use programs from arithmetic tasks as demonstrations in the prompt in one such iteration. The task cluster with the highest performance on the held-out set of examples (≈50) is chosen. This strategy requires as many API calls as there are task clusters, and a held-out set of input-output pairs for the new task. Note that no additional supervision is needed for the new task to generate a decomposed program. Measuring task similarity with the LLM itself (the second strategy of Section 3.2, with no additional supervision) is still a challenging task for LLMs, and we will explore this direction further in future work.

A.3 Tool Use

Code Generation We use the Codex (Chen et al., 2021) model for code generation. The argument for code generation is the previous sub-task's answer sequence "#i − 1: ..." and the sequence generated by the LM after the sub-task query symbol "Qi: [generate python code]". When i = 1, the instance input is used as the first argument. We include the previous answer/input since it often contains information relevant to generating accurate code, like the arithmetic word problem for which code needs to be generated (see Table 8 for examples). Both arguments are provided to Codex as a multi-line Python comment, while maintaining their original formatting. To keep the answer variable consistent, we also append an additional instruction: "Store the final answer in variable 'ans' and print it."
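As an illustration, the prompt for the code-generation tool could be assembled as follows. codex_complete is a hypothetical stand-in for a call to the Codex API; the helper only mirrors the comment formatting and the 'ans' convention described above.

def codex_complete(prompt, stop="\n\n"):
    # Hypothetical call to the Codex code-generation model.
    raise NotImplementedError

def build_codegen_prompt(instruction, previous_answer=""):
    """Format the previous step's answer and the sub-task instruction as a
    multi-line Python comment, ending with the fixed 'ans' instruction."""
    lines = previous_answer.splitlines() + instruction.splitlines()
    lines.append("Store the final answer in variable 'ans' and print it.")
    return "\n".join("# " + line for line in lines) + "\n"

def generate_code(instruction, previous_answer=""):
    return codex_complete(build_codegen_prompt(instruction, previous_answer))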
Table 12: Comparing ART results on the GPT-3 (175B) model with the two task selection strategies. LLM-based similarity is worse on average compared to just choosing the best task cluster.

A.4 Baselines

The best published GPT-3/Codex results (GPT-3 Best) are taken from the following sources:
• (Wei et al., 2022; Suzgun et al., 2022): Human-crafted CoT, adapting their prompt to generate answers in the output space, which is a fair comparison with ART. Hence, the results reported in Table 2 and Table 3 may differ from the results reported in Suzgun et al. (2022).
• (Khot et al., 2022): Decomposed prompting for list reversal and kth letter comparison.
• (Press et al., 2022): Decomposed multi-step QA for Musique.
• (Chen et al., 2022): Program-of-Thought: prompts are designed to convert arithmetic problems into code and execute them.
• (Ouyang et al., 2022): InstructGPT.
• (Chung et al., 2022): PaLM language model. Note that we only consider the highest reported text-davinci-002 or code-davinci-002 results from this paper to compare with ART.
• (Chen et al., 2021): Codex.

A.5 Human Feedback

Examples of human feedback include:
• Table 20 for unscrambling English words, where code is edited in step 2 to get correct permutations, and a new tool is defined in step 3 to look up the English dictionary and return permutations that are actual words.
In Table 15, we present human feedback results for the training tasks in the task library.

Table 15: Improving ART and free-form CoT via self-consistency and human-in-the-loop feedback on library tasks. (*) indicates that human-in-the-loop improvement was done over automatically generated CoT reasoning for these tasks.