ToRA: A Tool-Integrated Reasoning Agent for Mathematical Problem Solving

ABSTRACT
Large language models have made significant progress in various language tasks,
yet they still struggle with complex mathematics. In this paper, we propose ToRA,
a series of Tool-integrated Reasoning Agents designed to solve challenging math-
ematical problems by seamlessly integrating natural language reasoning with the
utilization of external tools (e.g., computation libraries and symbolic solvers),
thereby amalgamating the analytical prowess of language and the computational
efficiency of tools. To train ToRA, we curate interactive tool-use trajectories on
mathematical datasets, apply imitation learning on the annotations, and propose
output space shaping to further refine models’ reasoning behavior. As a result,
ToRA models significantly outperform open-source models on 10 mathematical
reasoning datasets across all scales with 13%-19% absolute improvements on av-
erage. Notably, ToRA-7B reaches 44.6% on the competition-level dataset MATH,
surpassing the best open-source model WizardMath-70B by 22% absolute. ToRA-CODE-34B
is also the first open-source model that achieves an accuracy exceeding
50% on MATH, which significantly outperforms GPT-4’s CoT result, and is com-
petitive with GPT-4 solving problems with programs. Additionally, we conduct a
comprehensive analysis of the benefits and remaining challenges of tool interaction
for mathematical reasoning, providing valuable insights for future research.1
[Figure 1 (chart): Accuracy (%) of open-source models at 7B, 13B, and 70B scales on MATH (left) and GSM8k (right); bar values omitted.]
∗ Equal contribution. See Contributions section for details. Work done during an internship at Microsoft.
† Corresponding authors.
1 Code and models will be available at https://github.com/microsoft/ToRA.
[Figure 2 (graphic): The example problem reads "Suppose that the sum of the squares of two complex numbers x and y is 7 and the sum of their cubes is 10. List all possible values for x + y, separated by commas." Panel (a) shows a natural language rationale that factors x³ + y³ = (x + y)(x² − xy + y²) but reaches a wrong answer (x + y ∈ {1, 2, 5, 10}); panel (b) shows a program-only attempt that fails with "SyntaxError: cannot assign to expression"; panel (c) shows the tool-integrated format: a rationale r1 ("Let's use Sympy to calculate and print all possible values for x + y"), a SymPy program a1 that solves the equation system, the execution output o1 (>>> [-5, -5, 1, 1, 4, 4]), and a final rationale r2 concluding that the possible values are \boxed{-5, 1, 4}.]
Figure 2: Examples of three reasoning formats for mathematical reasoning: (a) Rationale-based
methods (e.g., CoT prompting) generate step-by-step natural language rationales, (b) Program-based
methods (e.g., PAL prompting) solve tasks with program synthesis, and (c) our proposed Tool-
integrated Reasoning format interleaves rationales with program-based tool use. For brevity, we
present a simple example of single-round tool interaction, where the model creates rationale r1 for
analysis, writes program a1 to call an external solver, obtains the execution output o1 , and then
generates rationale r2 to finalize the answer.
1 INTRODUCTION
Large language models (LLMs), such as GPT-4 (OpenAI, 2023) and PaLM-2 (Anil et al., 2023), have
demonstrated remarkable progress in a wide range of language tasks, particularly in the longstanding
challenge of mathematical reasoning (Feigenbaum et al., 1963; Hosseini et al., 2014). However,
open-source models, such as LLaMA-2 (Touvron et al., 2023a;b) and Falcon (Penedo et al., 2023),
still struggle with advanced mathematical reasoning tasks.
Existing works improve mathematical performance of language models either with step-by-step
natural language reasoning (Wei et al., 2022) as illustrated in Fig 2 (a), or by synthesizing and
executing programs to obtain the answers (Gao et al., 2022; Chen et al., 2022), as depicted in Fig 2
(b). Both approaches exhibit complementary advantages. Natural language is suitable for semantic
analysis, planning, and abstract reasoning (e.g., commonsense reasoning), but struggles with precise
computation, symbolic manipulation, and algorithmic processing. Conversely, programs excel in
rigorous operations, and can outsource intricate calculations to specialized tools like equation solvers.
To leverage the benefits of both natural language reasoning and program-based tool use, we train
open-source models such as LLaMA-2 to reason in a way where natural language reasoning is
interleaved with program-based tool use synergistically (as depicted in Fig 2 (c)), thereby largely
reducing the gap with closed-source models like GPT-4 in mathematical reasoning. Specifically, we
first design the interleaving format of reasoning, curate corresponding interactive tool-use trajectories
for mathematical problems from the popular GSM8k (Cobbe et al., 2021) and MATH (Hendrycks
et al., 2021) datasets, and then apply imitation learning on the high-quality annotations, leading to a
better performance than any existing open-source model. Furthermore, since the curated data is far
from exhausting all valid trajectories for a problem, relying solely on imitation learning restricts a
model’s output space, hindering the flexibility in exploring plausible trajectories during testing. To
improve the diversity of plausible reasoning steps and mitigate improper tool-use behavior, we apply
output space shaping which additionally trains the models on both self-sampled valid trajectories
and invalid ones that have been corrected by a teacher model (e.g., a 34B model can serve as the
teacher for a 7B model). Output space shaping significantly boosts reasoning performance, allowing
Figure 3: Training ToRA contains two steps. ① Imitation Learning: Prompt LLMs like GPT-4 to
generate Tool-integrated Reasoning trajectories (ToRA-CORPUS) and use this corpus to fine-tune a
model M; ② Output Space Shaping: Sample diverse tool-use trajectories with M, keep the valid
ones, correct the invalid ones with a teacher model M′, and retrain M on the union of sampled valid
trajectories, corrected ones, and the initial ToRA-CORPUS to obtain ToRA.
open-source models to attain an accuracy exceeding 50% on the competition-level MATH dataset for
the first time.
We evaluate the resulting suite of Tool-integrated Reasoning Agents (ToRA), ranging from 7B to
70B, on 10 diverse mathematical reasoning datasets. As shown in Fig 1, the ToRA series significantly
outperforms open-source models across all scales. Notably, on the competition-level MATH dataset,
ToRA-7B outperforms the previous SoTA WizardMath-70B (Luo et al., 2023) by 22% absolute.
ToRA-CODE-34B beats GPT-4's CoT result (Bubeck et al., 2023) by 8.3% absolute (50.8% vs.
42.5%), and is competitive with GPT-4 solving problems with code (GPT-4-Code, 51.8%). In addition,
we analyze the benefits and remaining challenges of tool interaction for mathematical reasoning,
providing valuable insights for future work.
2.1 OVERVIEW
The ToRA series solves challenging mathematical problems by leveraging both natural language reasoning
and program-based tool use. As shown in Fig 2 (c), given a mathematical problem q, ToRA reasons
with natural language, producing r1. When reaching a point where program-based tool use is more
appropriate for the subsequent task, e.g., equation solving, ToRA generates a program a1 for tool use
following the natural language guidance r1. The execution output o1 is then fed to ToRA for subsequent
processing, including tool-use adjustment, sub-task solving, or answer finalization. We repeat the
process until the model places its answer within "\boxed{}". The resulting trajectory is denoted as
τ = r1a1o1...rn−1an−1on−1rn, where rn contains the answer.
Fig 3 presents the training pipeline of ToRA. We first collect interactive tool-use trajectories on
popular mathematical datasets. We then apply imitation learning on the resulting annotations, as well
as output space shaping to further refine models’ reasoning behavior.
2.2 COLLECTING INTERACTIVE TOOL-USE TRAJECTORIES

Existing mathematical reasoning datasets primarily contain annotations in either natural language or
code, posing a challenge for training tool-integrated agents due to the absence of interactive tool-use
annotations. To address this, we utilize GPT-4 to synthesize high-quality trajectories on the GSM8k
and MATH training sets. We select GSM8k and MATH as they exhibit diverse reasoning patterns,
spanning multiple domains and difficulty levels.
Algorithm 1 Inference of Tool-Integrated Reasoning
Require: problem q, model G, prompt ℘, external tools E, stop condition Stop(·), maximum iteration rounds n
1: τ0 ← "" ▷ Trajectory Initialization
2: for i ← 1 to n do
3: ri ∼ PG (·|℘ ⊕ q ⊕ τi−1 ) ▷ Rationale Generation (Eq. 1)
4: if Stop(ri ) then ▷ Stopping Criteria
5: return τi−1 ⊕ ri
6: end if
7: ai ∼ PG (·|℘ ⊕ q ⊕ τi−1 ⊕ ri ) ▷ Program Generation (Eq. 2)
8: oi ← E(ai ) ▷ Tool Execution
9: τi ← τi−1 ⊕ ri ⊕ ai ⊕ oi ▷ Trajectory Update (Eq. 3)
10: end for
11: return τn
Prompt Curation We compose instructions along with diverse few-shot examples, utilizing an in-
terleaved format as depicted in Fig 2 (c). These examples showcase interactive tool usage trajectories,
incorporating descriptive variable names and combined program outputs. Please refer to Appendix C
for the assembled prompts.
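As a rough illustration of how such a prompt could be assembled, the sketch below concatenates an instruction file and a file of interleaved few-shot demonstrations with the new question. The file names and helper function are our own placeholders, not part of the released ToRA code; the actual instructions and examples are those in Appendix C.

```python
# Minimal sketch of prompt assembly (hypothetical file names; the real instruction
# and few-shot demonstrations are listed in Appendix C).
def build_prompt(question: str,
                 instruction_file: str = "instruction.txt",
                 demo_file: str = "fewshot_demos.txt") -> str:
    """Compose the prompt ℘ followed by the new question q."""
    with open(instruction_file) as f:
        instruction = f.read().strip()
    with open(demo_file) as f:
        # Demonstrations interleave rationales, python code blocks, and output blocks.
        demos = f.read().strip()
    return f"{instruction}\n\n{demos}\n\n---\n\nQuestion: {question}\n\nSolution:\n"
```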
Inference Procedure We follow Algorithm 1 and feed GPT-4 (G) with the composed prompt ℘ to
generate a tool-use trajectory τ for each question q from the training set. The trajectory is initialized
as an empty string τ0. For each interaction round i, we first generate a rationale:
ri ∼ PG (·|℘ ⊕ q ⊕ τi−1 ) (1)
where ⊕ means concatenation. If ri includes an answer within “\boxed{}” (i.e., the stopping
condition Stop(ri )), we cease generation, otherwise the model continues to write a program for tool
use:
ai ∼ PG (·|℘ ⊕ q ⊕ τi−1 ⊕ ri ) (2)
In line with Gou et al. (2023), if the model triggers the code execution stop words like "```output",
we supply it with the corresponding execution output oi, obtained by calling the tools with oi ← E(ai),
facilitating the generation of subsequent steps. Then, we update the trajectory by concatenating it
with the newly generated rationale ri , program ai , and output oi :
τi ← τi−1 ⊕ ri ⊕ ai ⊕ oi (3)
We repeat the above interaction process until we reach the maximum rounds n.
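For concreteness, the following is a minimal Python sketch of this interaction loop (Algorithm 1). The `generate(text, stop)` and `execute_code(program)` callables stand in for an LLM API and a sandboxed Python interpreter; both interfaces, as well as the exact stop strings, are assumptions for exposition rather than the authors' implementation.

```python
# Sketch of Algorithm 1: interleave rationales, programs, and tool outputs.
# Fence strings are built by concatenation so this listing itself stays well-formed.
CODE_FENCE = "`" * 3 + "python"
OUTPUT_FENCE = "`" * 3 + "output"
PLAIN_FENCE = "`" * 3

def tool_integrated_inference(question, prompt, generate, execute_code, max_rounds=3):
    trajectory = ""                                     # tau_0
    for _ in range(max_rounds):
        # Rationale generation (Eq. 1): stop once the model opens a code block.
        rationale = generate(prompt + question + trajectory, stop=CODE_FENCE)
        if "\\boxed{" in rationale:                     # stopping criterion Stop(r_i)
            return trajectory + rationale
        # Program generation (Eq. 2): stop at the output marker.
        program = generate(prompt + question + trajectory + rationale, stop=OUTPUT_FENCE)
        code = program.replace(CODE_FENCE, "").replace(PLAIN_FENCE, "")
        output = execute_code(code)                     # tool execution o_i = E(a_i)
        # Trajectory update (Eq. 3): append r_i, a_i, and o_i.
        trajectory += rationale + program + OUTPUT_FENCE + "\n" + output + "\n" + PLAIN_FENCE + "\n"
    return trajectory
```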
Trajectory Sampling We set n = 3 and perform inference using GPT-4 with greedy decoding,
retaining trajectories that yield correct answers. For questions where GPT-4 fails with greedy
decoding, we apply nucleus sampling with a sample size of 10 and keep up to 4 valid trajectories per
question. Ultimately, we successfully annotate trajectories for 98.2% of GSM8k questions and 83.1%
of MATH questions. After filtering out invalid trajectories with tool-use errors or wrong answers,
we obtain 16k annotations which constitute our dataset ToRA-CORPUS. Table 1 compares ToRA-CORPUS
with recently proposed mathematical reasoning datasets, while Table 5 in the Appendix
displays MATH annotation accuracy details.
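The annotation loop can be pictured roughly as follows; `greedy_infer`, `sample_infer`, and `is_correct` wrap the GPT-4 inference of Algorithm 1 and the answer check, and are assumed helpers rather than released code.

```python
def annotate_question(question, reference_answer, greedy_infer, sample_infer,
                      is_correct, num_samples=10, max_valid=4):
    """Sketch of trajectory annotation: greedy decoding first, then nucleus sampling."""
    trajectory = greedy_infer(question)
    if is_correct(trajectory, reference_answer):
        return [trajectory]
    valid = []
    for _ in range(num_samples):             # nucleus sampling with sample size 10
        candidate = sample_infer(question)
        if is_correct(candidate, reference_answer):
            valid.append(candidate)
        if len(valid) == max_valid:           # keep up to 4 valid trajectories
            break
    return valid
```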
2.3 TRAINING

Imitation Learning We apply imitation learning on ToRA-CORPUS by maximizing the log-likelihood of each annotated trajectory τ given its question q, i.e., M = arg maxM Σ(q,τ)∈ToRA-CORPUS log PM(τ | q),
where M is the resulting model. After imitation learning, we can simply apply the same procedure
in Algorithm 1 by setting the prompt to empty (℘ = "") for inference. Imitation learning leads to
state-of-the-art mathematical reasoning performance despite the small scale of ToRA-CORPUS.
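A minimal sketch of how one (q, τ) pair could be converted into a causal-LM training example is shown below; masking the question tokens with -100 (the Hugging Face ignore index) is a common convention we assume here, not a detail stated in the paper.

```python
# Sketch of turning one (q, tau) pair from ToRA-CORPUS into a training example
# (Hugging Face-style tokenizer assumed; loss applied only to trajectory tokens).
def build_training_example(tokenizer, question, trajectory, max_len=2048):
    prompt_ids = tokenizer(question, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(trajectory, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + target_ids + [tokenizer.eos_token_id]
    labels = [-100] * len(prompt_ids) + target_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids[:max_len], "labels": labels[:max_len]}
```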
Table 1: Compared with mathematical reasoning datasets, ToRA-CORPUS uniquely combines
natural language rationales with program-based tool usage. Note that ToRA-CORPUS only employs
questions from the original training sets of MATH and GSM8k.
Output Space Shaping For each question, ToRA-CORPUS mostly demonstrates only one valid
interactive tool-use trajectory, which may restrict a model’s output space, rendering it inflexible in
exploring plausible trajectories during testing. We therefore propose output space shaping in order to
encourage the diversity of plausible reasoning steps and reduce improper tool-use behavior.
To explore diverse valid trajectories, we apply nucleus sampling to imitation learning models M to
sample 64 trajectories per training question q, following the inference procedure in Section 2.2. We
retain valid trajectories with correct answers and no tool-use errors. As many samples are duplicates,
to further improve diversity and in an attempt to correct models’ improper behavior, we seek to
leverage invalid trajectories as well. We observe that trajectories with wrong answers are mostly
incorrect halfway (Li et al., 2023), and the preceding reasoning is still plausible; in other words, we
can obtain valid trajectories by correcting the subsequent portions. Specifically, a wrong trajectory
τ̃, when written in text, can be represented as a sequence of lines separated by line breaks, i.e.,
τ̃ = l1...lm, where m is the total number of lines in τ̃. We enumerate possible preceding portions of
wrong trajectories, i.e., τ̃[: j] = l1...lj, and leverage a teacher model M′ to complete the subsequent
steps with greedy decoding: τ ← PM′(·|q ⊕ τ̃[: j]), where we abuse the notation PM′(·) to denote
the interactive tool-use process following Section 2.2. Finally, corrected trajectories as well as valid
trajectory samples will be used for model training, thereby shaping the output space.
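Sketched in code, the correction step might look like the following, where `teacher_complete` continues a trajectory prefix with the tool-integrated inference of Section 2.2 (greedy decoding) and `is_valid` checks the final answer and the absence of tool-use errors; both helpers are assumptions for exposition.

```python
def correct_wrong_trajectory(question, wrong_trajectory, teacher_complete, is_valid):
    """Sketch of the correction step: enumerate prefixes of a wrong trajectory
    and let the teacher model M' complete each one."""
    lines = wrong_trajectory.splitlines()
    corrected = []
    for j in range(len(lines) - 1, 0, -1):            # keep l_1..l_j, regenerate the rest
        prefix = "\n".join(lines[:j]) + "\n"
        completion = teacher_complete(question, prefix)   # tau <- P_M'(. | q ⊕ tau~[: j])
        candidate = prefix + completion
        if is_valid(candidate):
            corrected.append(candidate)
    return corrected
```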
In our experiments, we always use CodeLLaMA-34B trained on ToRA-CORPUS as the teacher
model, and apply sampling with the CodeLLaMA series (ranging from 7B to 34B, with imitation
learning). We obtain a total of 233k distinct valid trajectory samples and 69k corrected ones. From
this combined dataset, we randomly select up to 4 trajectories per GSM8k and MATH problem,
merge them with ToRA-CORPUS, and then train all ToRA models on the resulting 69k annotations.
3 EXPERIMENTS

3.1 IMPLEMENTATION DETAILS
We fine-tuned the LLaMA-2 (Touvron et al., 2023b) and CodeLLaMA (Rozière et al., 2023) series
(ranging from 7B to 70B) using ToRA-CORPUS with output space shaping, yielding the ToRA and
ToRA-CODE series respectively. We used a learning rate of 2e-5 by default, except that we used 1e-5
for the 34B and 70B models. We set the global batch size to 128 and used a linear scheduler with a
3% warm-up period for 3 epochs. We trained all models with DeepSpeed ZeRO Stage3 (Rajbhandari
et al., 2021) and Flash-Attention 2 (Dao, 2023). We used greedy decoding for all results, with the
maximum sequence length set to 2,048 and the maximum number of tool executions set to 3.
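For reference, the reported hyperparameters map roughly onto the following Hugging Face-style training arguments; the per-device batch size, accumulation steps, GPU count, and file names are illustrative assumptions rather than the authors' released script.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters (a sketch, not the released script).
training_args = TrainingArguments(
    output_dir="tora-code-7b",           # hypothetical output path
    learning_rate=2e-5,                  # 1e-5 for the 34B and 70B models
    per_device_train_batch_size=4,       # together with accumulation and GPU count,
    gradient_accumulation_steps=4,       # chosen so the global batch size is 128
    num_train_epochs=3,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,                   # 3% warm-up
    bf16=True,
    deepspeed="ds_zero3_config.json",    # DeepSpeed ZeRO Stage 3 (hypothetical config file)
)
```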
Datasets We evaluated models on GSM8k (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021),
along with 8 out-of-distribution datasets, namely GSM-Hard (Gao et al., 2022), SVAMP (Patel et al.,
2021), ASDIV (Miao et al., 2020), TabMWP (Lu et al., 2023), SingleEQ, SingleOP, AddSub, and
MultiArith (Koncel-Kedziorski et al., 2016), as illustrated in Table 4 in Appendix. The 10 assorted
datasets collectively encompass mathematical problems spanning basic arithmetic to competition
level, covering middle and high school curricula and various mathematical domains. The problem
formats comprise tabular-based, free-form, and multiple-choice questions, ensuring a thorough
assessment of the model’s mathematical reasoning aptitude.
Table 2: Results on 10 mathematical reasoning tasks. MAWPS results are averaged over four tasks:
SingleEq, SingleOp, AddSub, and MultiArith. Vanilla models are tested with CoT. The best results in
each section are in blue, the second-best results are underlined, while the results of our best model
are bolded. ∗ ZS: Zero-shot inference without demonstrations.
Model Size Tools ZS∗ GSM8k MATH GSM-Hard SVAMP TabMWP ASDiv MAWPS AVG
Used for training? ✓ ✓ ✗ ✗ ✗ ✗ ✗
Proprietary Models
GPT-4 - ✗ ✗ 92.0 42.5 64.7 93.1 67.1 91.3 97.6 78.3
GPT-4 (PAL) - ✓ ✗ 94.2 51.8 77.6 94.8 95.9 92.6 97.7 86.4
ChatGPT - ✗ ✗ 80.8 35.5 55.9 83.0 69.1 87.3 94.6 72.3
ChatGPT (PAL) - ✓ ✗ 78.6 38.7 67.6 77.8 79.9 81.0 89.4 73.3
Claude-2 - ✗ ✗ 85.2 32.5 - - - - - -
PaLM-2 540B ✗ ✗ 80.7 34.3 - - - - - -
Open-Source Models
LLaMA-2 7B ✗ ✗ 13.3 4.1 7.8 38.0 31.1 50.7 60.9 29.4
LLaMA-2 SFT 7B ✗ ✓ 41.3 7.2 16.1 31.9 27.8 47.4 60.0 33.1
LLaMA-2 RFT 7B ✗ ✓ 51.2 - - - - - - -
Platypus-2 7B ✗ ✗ 14.4 5.4 8.6 36.7 26.5 47.9 58.4 28.3
WizardMath 7B ✗ ✓ 54.9 10.7 20.6 57.3 38.1 59.1 73.7 44.9
CodeLLaMA (PAL) 7B ✓ ✗ 34.0 16.6 33.6 59.0 47.3 61.4 79.6 47.4
Toolformer† 7B ✓ ✓ - - - 29.4 - 40.4 44.0 -
ToRA 7B ✓ ✓ 68.8 40.1 54.6 68.2 42.4 73.9 88.8 62.4
ToRA-CODE 7B ✓ ✓ 72.6 44.6 56.0 70.4 51.6 78.7 91.3 66.5 (+19)
LLaMA-2 13B ✗ ✗ 24.3 6.3 13.6 43.1 39.5 56.3 70.4 36.2
LLaMA-2 SFT 13B ✗ ✓ 51.1 9.2 22.3 46.3 35.8 58.6 75.0 42.6
LLaMA-2 RFT 13B ✗ ✓ 55.3 - - - - - - -
Platypus-2 13B ✗ ✗ 23.7 7.1 14.3 50.7 45.3 55.1 69.6 38.0
WizardMath 13B ✗ ✓ 63.9 14.0 28.4 64.3 46.7 65.8 79.7 51.8
CodeLLaMA (PAL) 13B ✓ ✗ 39.9 19.9 39.0 62.4 59.5 65.3 86.0 53.1
ToRA 13B ✓ ✓ 72.7 43.0 57.3 72.9 47.2 77.2 91.3 65.9
ToRA-CODE 13B ✓ ✓ 75.8 48.1 60.5 75.7 65.4 81.4 92.5 71.3 (+18)
LLaMA-1 RFT 34B ✗ ✓ 57.9 - - - - - - -
CodeLLaMA (PAL) 34B ✓ ✗ 53.3 23.9 49.4 71.0 63.1 72.4 91.5 60.7
ToRA-CODE 34B ✓ ✓ 80.7 50.8 63.7 80.5 70.5 84.2 93.3 74.8 (+14)
LLaMA-2 70B ✗ ✗ 57.8 14.4 36.0 73.6 57.5 76.0 92.4 58.2
LLaMA-2 SFT 70B ✗ ✓ 69.3 14.9 39.0 64.0 53.0 71.3 84.8 56.6
LLaMA-2 RFT 70B ✗ ✓ 64.8 - - - - - - -
Platypus-2 70B ✗ ✗ 45.9 15.0 24.6 74.3 47.3 72.7 91.1 53.0
WizardMath 70B ✗ ✓ 81.6 22.7 50.3 80.0 49.8 76.2 86.2 63.8
LLaMA-2 (PAL) 70B ✓ ✗ 55.2 18.3 50.0 74.6 59.5 71.9 92.8 60.3
ToRA 70B ✓ ✓ 84.3 49.7 67.2 82.7 74.0 86.8 93.8 76.9 (+13)
Metrics We report accuracies of predicted answers. For numerical values, we perform rounding,
while for expressions, we employ sympy 2 for parsing. Since the SingleEQ, SingleOP, AddSub, and
MultiArith datasets focus on different aspects of basic arithmetic, we report their average results
under the collective term MAWPS (Koncel-Kedziorski et al., 2016) for all methods.
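A minimal sketch of such an answer check is given below; the rounding precision and the fallback to symbolic comparison via sympy are illustrative choices, not necessarily the exact evaluation script.

```python
from sympy import simplify, sympify

def answers_match(prediction: str, reference: str, ndigits: int = 4) -> bool:
    """Compare numeric answers after rounding; fall back to symbolic equivalence."""
    try:
        return round(float(prediction), ndigits) == round(float(reference), ndigits)
    except ValueError:
        pass  # not plain numbers; try parsing as symbolic expressions
    try:
        return simplify(sympify(prediction) - sympify(reference)) == 0
    except Exception:
        return prediction.strip() == reference.strip()
```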
3.3 BASELINES
Proprietary Models We present results from an array of SoTA LLMs, such as OpenAI’s GPT-4,
ChatGPT (gpt-3.5-turbo), Google’s PaLM-2, and Anthropic’s Claude-2. By default, we report
CoT prompting results, and include PAL (Gao et al., 2022) prompting results for selected models.
Open-Source Models Base models comprise LLaMA-2 and CodeLLaMA with CoT and PAL
prompting. Supervised Fine-Tuning (SFT) employs CoT rationales from the original GSM8k and
MATH dataset (15k samples) for fine-tuning. Rejection sampling Fine-Tuning (RFT) leverages
multiple models to generate diverse reasoning paths for fine-tuning (Yuan et al., 2023). WizardMath
augments data using ChatGPT, and conducts SFT and RLHF. Platypus-2, the top model on the LLM
Leaderboard3, is fine-tuned with Open-Platypus reasoning datasets (Lee et al., 2023). We also
compare ToRA with Toolformer (Schick et al., 2023), which is a model trained to utilize calculators.
2 https://www.sympy.org
3 https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
Table 3: Results on MATH subtopics.
[Figure 4 (chart): MATH accuracy (%) of LLaMA-2-7B, LLaMA-2-13B, LLaMA-2-70B, and GPT-4 under the three reasoning formats; bar values omitted.]
Figure 4: Comparison of three formats: (1) Rationale-only: step-by-step natural language reasoning
like CoT; (2) Program-only: solving problems with programs like PAL; (3) Tool-integrated Reasoning
used by ToRA: interweaving rationale and program execution to solve problems. We evaluated
GPT-4 with few-shot prompting. We trained LLaMA-2 models to reason in the three types of formats,
respectively. For a fair comparison, we do not apply output space shaping to any of the LLaMA-2 models.
3.5.1 COMPARISONS OF FORMATTING
To evaluate the efficacy of the reasoning format adopted by ToRA, which interleaves rationales with
programs, we compared it with the Rationale-only and Program-only formats using GPT-4 and LLaMA-2
trained with the same amount of data from MATH. As shown in Fig 4, the ToRA format consistently
surpasses the Rationale-only and Program-only approaches. Remarkably, using LLaMA-2, the ToRA
format achieves substantial improvements of 29.0% and 6.7% over Rationale-only and Program-only,
respectively. With the closed-source GPT-4, the improvements are 19.1% and 9.8%, respectively.
This emphasizes the effectiveness of integrating natural language rationales with programs.
[Figure 5 (chart): GSM8k (left) and MATH (right) accuracy (%) of the 7B and 13B models under the three shaping settings; bar values omitted.]
Figure 5: Ablation on output space shaping strategies using CodeLLaMA: (1) ToRA−Shaping−Correction is
trained on ToRA-CORPUS without shaping; (2) ToRA−Correction employs only the sampling strategy
for shaping, trained with up to 4 additional valid trajectory samples per problem; (3) ToRA utilizes
both sampling and correction, also trained with up to 4 additional trajectories per problem.
We assess the effectiveness of the output space shaping strategies presented in Section 2.3, specifically
sampling and correction. As shown in Fig 5: (1) Output space shaping yields a considerable average
improvement of 3.4% and 4.0% absolute for GSM8k and MATH, respectively, with greater benefits
for smaller models; (2) Applying the sampling strategy results in a 2.7% absolute improvement
on average, while additionally incorporating correction offers a modest yet significant average
improvement of 0.8% to 1.2% absolute; (3) Output space shaping benefits even the largest model
ToRA-70B, with a notable improvement from 47.3% to 49.7% on MATH. These findings highlight
the effectiveness of our shaping strategies across different model sizes and datasets.
3.6 ANALYSIS
We investigate the benefits, detailed patterns, and remaining challenges of tool interaction for
mathematical reasoning on the challenging MATH dataset. Performance breakdowns on all subtopics
of MATH are reported in Table 3.
Benefits from Tool-Integration for MATH Sub-topics As shown in Table 3, ToRA outperforms
WizardMath by around 45% in Algebra and Number Theory, which is attributed to stimulating and
shaping tool-use behavior. Problems from the two sub-topics typically need intricate computation
and data manipulation. Algebra mainly focuses on solving equations and application problems, while
many Number Theory problems can be tackled using brute-force approaches through code.
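As a toy illustration (our own example, not drawn from ToRA-CORPUS), a short brute-force search of the kind that tool use enables:

```python
# Toy example: smallest positive integer n with n ≡ 2 (mod 3), n ≡ 3 (mod 5), n ≡ 4 (mod 7).
def smallest_solution(limit=10_000):
    for n in range(1, limit):
        if n % 3 == 2 and n % 5 == 3 and n % 7 == 4:
            return n
    return None

print(smallest_solution())  # 53
```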
Patterns of Library Usage for Problem Solving Fig 6 presents the most frequently used libraries
for different sub-topics and the corresponding accuracies of their solutions. Tool-use behavior on
different mathematical areas demonstrates distinct patterns. sympy and its internal solvers are
primarily employed for algebra-related topics. Precalculus exhibits extensive matrix operations via
matrices, resulting in a high accuracy. Number Theory depends on algorithms like gcd
and lcm. Geometry mainly uses the rational library for fraction-based computations, while the
application of other tools is limited, signifying the potential for improvement.
Detailed Impact of Rationale on Different Topics Table 3 shows that using an interleaved format,
in contrast to merely writing the program, leads to significant improvements across all subtopics,
especially in Precalculus, Algebra, and Geometry, where notable increases range from 8.6% to 18.8%.
[Figure 6 (charts): library usage frequency (%) and accuracy (%) for each MATH sub-topic, broken down by library (sympy, solvers, rational, calculus, matrices, binomial, algorithm); chart values omitted.]
Figure 6: Library usage frequency and accuracy on each sub-topic of MATH.
Appendix D.1 provides representative examples demonstrating how the rationale aids in planning,
multi-round self-correction, and finalizing answers.
Remaining Challenges in Mathematical Reasoning for ToRA Although ToRA has made notable
progress in various mathematical domains, substantial improvements are still needed in topics like
Geometry, Precalculus, and Intermediate Algebra. In Geometry, as illustrated by failure cases in
Listing 6 in Appendix, a deeper understanding of geometric space is essential, encompassing visual
modalities and interactions with images for auxiliary information, while incorporating computational
tools like SymPy offers limited benefits. For Intermediate Algebra and Precalculus problems, as
shown in Listing 5, direct brute-force solutions are often infeasible, resulting in timeout exceptions.
Addressing these challenges requires complex symbolic reasoning over algebraic expressions and the
given conditions, along with sophisticated problem-solving and proof techniques involving forward
and backward reasoning, as well as result verification.
4 RELATED WORKS
Mathematical Reasoning Recent research has greatly improved reasoning in LLMs with step-by-
step natural language reasoning (Wei et al., 2022; Zhou et al., 2023; Zhu et al., 2023; Huang et al.,
2022; Liang et al., 2023). However, natural language reasoning struggles with complex computations
and symbolic manipulations. To overcome the limitations, recent research has exploited tools like
calculators (Cobbe et al., 2021; Shao et al., 2022), code interpreters (Mishra et al., 2022), and symbolic
solvers (Zhang et al., 2023). Program-based methods (Gao et al., 2022; Chen et al., 2022; Shao
et al., 2023a) transform reasoning tasks into program synthesis tasks, thus offering complementary
advantages over natural language reasoning, but they face challenges in nuanced reasoning, planning,
and error handling (Gou et al., 2023), where natural language reasoning should be more suitable.
Tool-Augmented Language Models Augmenting LLMs with tools can largely alleviate LLMs’
limitations and improve reasoning and generation performance (Parisi et al., 2022; Mialon et al.,
2023; Yao et al., 2023). Recent work demonstrates the benefits of integrating retrievers (Borgeaud
et al., 2022; Shao et al., 2023b), search engines (Nakano et al., 2021), and multi-tool approaches
(Schick et al., 2023; Paranjape et al., 2023; Gou et al., 2023) to improve generation.
Knowledge Distillation Knowledge distillation (KD) transfers knowledge from teacher models to
student models (Buciluǎ et al., 2006; Hinton et al., 2015). Using LLM-generated trajectories for
fine-tuning is a form of KD (Fu et al., 2023; Taori et al., 2023; Peng et al., 2023; Ho et al., 2023).
Our proposed ToRA shows that learning interactive tool-use trajectories is a promising direction to
adapt language models to reasoning tasks.
5 CONCLUSION
This paper presents ToRA, a series of novel Tool-integrated Reasoning Agents that synergistically
combine natural language rationales with program-based tool use for mathematical problem solving.
Our approach demonstrates the potential of integrating external tools in the reasoning process,
enabling language models to effectively tackle complex quantitative tasks. ToRA achieves state-of-
the-art performance on 10 diverse mathematical reasoning tasks, substantially outperforming existing
rationale-based and program-based approaches. Furthermore, our systematic analysis of the benefits
and remaining challenges of tool interaction provides valuable insights for future research, paving the
way for the development of more advanced and versatile reasoning agents.
6 AUTHOR CONTRIBUTIONS
Zhibin Gou proposed the interleaved tool-use format of ToRA and curated the ToRA-CORPUS dataset,
implemented the training and evaluation pipeline, conducted experiments and analysis on all datasets,
implemented baselines, and was a main contributor to the paper writing. Zhihong Shao proposed the
project, conducted preliminary experiments, proposed and implemented the training and evaluation
pipelines, proposed and trained all ToRA models with output space shaping as well as ToRA variants
in the ablation study, designed and oversaw experimental analysis, and contributed to many parts of
the paper writing. Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu
Chen provided research mentorship, oversaw project coordination, and advised and contributed to
many parts of the writing.
REFERENCES
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos,
Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv
preprint arXiv:2305.10403, 2023.
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican,
George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al.
Improving language models by retrieving from trillions of tokens. In International conference on
machine learning, pp. 2206–2240. PMLR, 2022.
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar,
Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Túlio
Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4.
CoRR, abs/2303.12712, 2023. doi: 10.48550/arXiv.2303.12712. URL https://doi.org/10.
48550/arXiv.2303.12712.
Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings
of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp.
535–541, 2006.
Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompt-
ing: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint
arXiv:2211.12588, 2022.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John
Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/
abs/2110.14168.
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. 2023.
Edward A Feigenbaum, Julian Feldman, et al. Computers and thought, volume 7. New York
McGraw-Hill, 1963.
Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller lan-
guage models towards multi-step reasoning. In Andreas Krause, Emma Brunskill, Kyunghyun
Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Confer-
ence on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume
202 of Proceedings of Machine Learning Research, pp. 10421–10430. PMLR, 2023. URL
https://proceedings.mlr.press/v202/fu23d.html.
Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and
Graham Neubig. Pal: Program-aided language models. arXiv preprint arXiv:2211.10435, 2022.
Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen.
Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint
arXiv:2305.11738, 2023.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song,
and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS,
2021.
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv
preprint arXiv:1503.02531, 2015.
Namgyu Ho, Laura Schmid, and Se-Young Yun. Large language models are reasoning teachers. In
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pp. 14852–14882, Toronto, Canada, July 2023. Association for Computational
Linguistics. doi: 10.18653/v1/2023.acl-long.830. URL https://aclanthology.org/
2023.acl-long.830.
Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to
solve arithmetic word problems with verb categorization. In Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP), pp. 523–533, 2014.
Jiaxin Huang, Shixiang Shane Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han.
Large language models can self-improve. CoRR, abs/2210.11610, 2022. doi: 10.48550/arXiv.2210.
11610. URL https://doi.org/10.48550/arXiv.2210.11610.
Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS:
A math word problem repository. In Proceedings of the 2016 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.
1152–1157, San Diego, California, June 2016. Association for Computational Linguistics. doi:
10.18653/v1/N16-1136. URL https://aclanthology.org/N16-1136.
Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of
llms. arXiv preprint arXiv:2308.07317, 2023.
Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making
language models better reasoners with step-aware verifier. In Proceedings of the 61st Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5315–5333,
2023.
Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Zhaopeng Tu,
and Shuming Shi. Encouraging divergent thinking in large language models through multi-agent
debate. arXiv preprint arXiv:2305.19118, 2023.
Pan Lu, Liang Qiu, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Tanmay Rajpurohit, Peter
Clark, and Ashwin Kalyan. Dynamic prompt learning via policy gradient for semi-structured
mathematical reasoning. In The Eleventh International Conference on Learning Representations,
2023. URL https://openreview.net/forum?id=DHyHRBwJUTN.
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng,
Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical
reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583,
2023.
Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christoforos Nalmpantis, Ram Pasunuru, Roberta
Raileanu, Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, et al. Augmented
language models: a survey. arXiv preprint arXiv:2302.07842, 2023.
Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing
English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, pp. 975–984, Online, July 2020. Association for Computational
Linguistics. doi: 10.18653/v1/2020.acl-main.92. URL https://aclanthology.org/
2020.acl-main.92.
Swaroop Mishra, Matthew Finlayson, Pan Lu, Leonard Tang, Sean Welleck, Chitta Baral, Tanmay
Rajpurohit, Oyvind Tafjord, Ashish Sabharwal, Peter Clark, and Ashwin Kalyan. Lila: A unified
benchmark for mathematical reasoning. In Proceedings of the 2022 Conference on Empirical
Methods in Natural Language Processing (EMNLP), 2022.
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher
Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted
question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
OpenAI. Gpt-4 technical report, 2023.
Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and
Marco Tulio Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language models.
arXiv preprint arXiv:2303.09014, 2023.
Aaron Parisi, Yao Zhao, and Noah Fiedel. Talm: Tool augmented language models. arXiv preprint
arXiv:2205.12255, 2022.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple
math word problems? In Proceedings of the 2021 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, pp. 2080–2094,
Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.
168. URL https://aclanthology.org/2021.naacl-main.168.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli,
Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb
dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv
preprint arXiv:2306.01116, 2023. URL https://arxiv.org/abs/2306.01116.
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with
gpt-4. arXiv preprint arXiv:2304.03277, 2023.
Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. Zero-infinity:
Breaking the gpu memory wall for extreme scale deep learning. In Proceedings of the International
Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–14, 2021.
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi
Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code.
arXiv preprint arXiv:2308.12950, 2023.
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer,
Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to
use tools. arXiv preprint arXiv:2302.04761, 2023.
Zhihong Shao, Fei Huang, and Minlie Huang. Chaining simultaneous thoughts for numerical
reasoning. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Findings of the As-
sociation for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates,
December 7-11, 2022, pp. 2533–2547. Association for Computational Linguistics, 2022. doi:
10.18653/v1/2022.findings-emnlp.187. URL https://doi.org/10.18653/v1/2022.
findings-emnlp.187.
Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Synthetic
prompting: Generating chain-of-thought demonstrations for large language models. In Andreas
Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett
(eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu,
Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 30706–30775.
PMLR, 2023a. URL https://proceedings.mlr.press/v202/shao23a.html.
Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing
retrieval-augmented large language models with iterative retrieval-generation synergy. CoRR,
abs/2305.15294, 2023b. doi: 10.48550/arXiv.2305.15294. URL https://doi.org/10.
48550/arXiv.2305.15294.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy
Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model.
https://github.com/tatsu-lab/stanford_alpaca, 2023.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and
efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian
Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin
Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar
Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann,
Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana
Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor
Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan
Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang,
Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen
Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic,
Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models.
CoRR, abs/2307.09288, 2023b. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/
10.48550/arXiv.2307.09288.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V
Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models.
In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in
Neural Information Processing Systems, 2022. URL https://openreview.net/forum?
id=_VjQlMeSB_J.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan
Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International
Conference on Learning Representations, 2023. URL https://openreview.net/forum?
id=WE_vluYUL-X.
Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Chuanqi Tan, and Chang Zhou. Scaling
relationship on learning mathematical reasoning with large language models. arXiv preprint
arXiv:2308.01825, 2023.
Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen.
Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint
arXiv:2309.05653, 2023.
Beichen Zhang, Kun Zhou, Xilin Wei, Wayne Xin Zhao, Jing Sha, Shijin Wang, and Ji-Rong Wen.
Evaluating and improving tool-augmented computation-intensive math reasoning. arXiv preprint
arXiv:2306.02408, 2023.
Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans,
Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. Least-to-most prompting enables
complex reasoning in large language models. In The Eleventh International Conference on
Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
URL https://openreview.net/pdf?id=WZH7099tgfM.
Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Ruyi Gan, Jiaxing Zhang, and
Yujiu Yang. Solving math word problems via cooperative reasoning induced language models. In
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pp. 4471–4485, Toronto, Canada, July 2023. Association for Computational
Linguistics. doi: 10.18653/v1/2023.acl-long.245. URL https://aclanthology.org/
2023.acl-long.245.
A EVALUATION DATASETS
Table 4: Statistics and examples of the 10 evaluation datasets. In the main result table, we present
the average accuracy of SingleEq, SingleOp, AddSub, and MultiArith under the collective name
MAWPS.
SingleEq (Koncel-Kedziorski et al., 2016) | OOD | 508 | Alyssa's dog had puppies. She gave 7 to her friends. She now has 5 puppies left. How many puppies did she have to start with?
SingleOp (Koncel-Kedziorski et al., 2016) | OOD | 562 | Rachel removes 47 bottle caps from a jar. There were originally 87 bottle caps in the jar. How many bottle caps are left in the jar?
AddSub (Koncel-Kedziorski et al., 2016) | OOD | 395 | Sam went to 14 football games this year. He went to 29 games last year. How many football games did Sam go to in all?
MultiArith (Koncel-Kedziorski et al., 2016) | OOD | 600 | Paige had 43 math problems and 12 science problems for homework. If she finished 44 of the problems at school, how many problems did she have to do for homework?
B ADDITIONAL EXPERIMENTS

B.1 ACCURACIES OF CLOSED-SOURCE MODELS ON MATH
Table 5: Accuracies of ChatGPT and GPT-4 on the MATH dataset, with breakdown w.r.t. different
mathematical subjects. We apply PAL prompting and the Tool-integrated Reasoning method used by
ToRA to the two closed-source models.
Table 5 presents the detailed accuracies of GPT-4 on the MATH dataset. The Tool-integrated
Reasoning method used by ToRA significantly outperforms PAL prompting when directly applied to
the closed-source GPT-4, further demonstrating the benefits of synergizing natural language reasoning
and program-based tool use.
As described in Section 2.2, we annotated interactive tool-use trajectories for the training questions
from MATH with GPT-4. GPT-4 achieves a success rate below 65% using greedy decoding. As
MATH is originally annotated with natural language rationales, to improve the annotation success
rate, we tried to provide GPT-4 with the rationales as hints. However, when using this method, GPT-4
tends to replicate the hints and ignore tool-use outputs especially when the outputs are inconsistent
with the hints, thus failing to produce high-quality trajectories. Hence, we deferred the utilization of
the already-annotated natural language rationales for future investigations. Instead, we employed
nucleus sampling to recall valid trajectories for questions that remained unsolved through greedy
decoding. This approach significantly boosted annotation accuracy to 83.1%.
[Figure 7 (chart): GSM8k (left) and MATH (right) accuracy (%) of the 7B and 13B models as the number of additional trajectories per question used for shaping varies over 0, 1, 2, and 4; values omitted.]
Figure 7: Effects of using different numbers of additional valid trajectories per question for output
space shaping.
As shown in Fig 7, it is beneficial to increase the number of additional valid trajectories for output
space shaping.
C PROMPTS
We present instructions and example few-shot prompts of Tool-integrated Reasoning for querying
GPT-4.
- Analyze the question and write functions to solve the problem; the
function should not take any arguments.
- Present the final result in LaTeX using a `\boxed{}` without any units.
- Utilize the `pi` symbol and `Rational` from Sympy for $\pi$ and
fractions, and simplify all fractions and square roots without
converting them to decimal values.
---
Solution:
To convert the point $(x, y, z)$ in rectangular coordinates to spherical
coordinates $(\rho, \theta, \phi)$, we use the following formulas:
1. $\rho = \sqrt{x^2 + y^2 + z^2}$
2. $\theta = \arctan{\frac{y}{x}}$
3. $\phi = \arccos{\frac{z}{\rho}}$
```python
from sympy import sqrt, atan2, acos, pi

def rectangular_to_spherical():
    x, y, z = 0, -3*sqrt(3), 3
    rho = sqrt(x**2 + y**2 + z**2)
    theta = atan2(y, x)
    phi = acos(z/rho)
    return rho, theta, phi

spherical_coordinates = rectangular_to_spherical()
print(spherical_coordinates)
```
```output
(6, -pi/2, pi/3)
```
The required form of the answer is $(\rho, \theta, \phi)$, where $\rho >
0$, $0 \le \theta < 2\pi$, and $0 \le \phi \le \pi$. Since our
initial output has $\theta = -\pi/2$, which doesn’t meet the
condition for $\theta$, we add $2\pi$ to adjust its value.
Hence, the spherical coordinates for the point $(0, -3\sqrt{3}, 3)$ are $
\boxed{(6, \frac{3\pi}{2}, \frac{\pi}{3})}$.
---
Solution:
Let’s write a single python function for binary computation:
```python
def binary_sum_diff():
    # Convert binary to decimal
    num1 = int("1011", 2)
    num2 = int("101", 2)
    num3 = int("1100", 2)
    num4 = int("1101", 2)
    # Combine the numbers; the question text is not shown above, so the operations
    # num1 + num2 - num3 + num4 are inferred from the printed output 10001
    result = num1 + num2 - num3 + num4
    # Convert the decimal result back to binary (without the "0b" prefix)
    result_binary = format(result, "b")
    return result_binary

result = binary_sum_diff()
print(result)
```
```output
10001
```
The answer expressed in base 2 is $\boxed{10001_2}$.
---
Question: Find the matrix that corresponds to projecting onto the vector
$\begin{pmatrix} 2 \\ -3 \end{pmatrix}.$
Solution:
```python
from sympy import Matrix, simplify

def projection_matrix():
    v = Matrix([2, -3])
    P = (v * v.T) / (v.T * v)[0]
    return P

result = projection_matrix()
print(result)
```
```output
Matrix([[4/13, -6/13], [-6/13, 9/13]])
```
---
Solution:
Let's use `sympy` to solve the inequality:
```python
from sympy import symbols, simplify
from sympy.solvers.inequalities import solve_univariate_inequality
from sympy.core.relational import LessThan

def solve_inequality():
    x = symbols('x')
    expression = (x**2 - 25) / (x + 5)
    inequality = LessThan(expression, 0)
    solution = solve_univariate_inequality(inequality, x, relational=False)
    simplified_solution = simplify(solution)
    return simplified_solution

result = solve_inequality()
print(result)
```
```output
Union(Interval.open(-oo, -5), Interval.Lopen(-5, 5))
```
The answer is $\boxed{(-\infty,-5)\cup(-5,5)}$
---
D EXAMPLES

D.1 SUCCESS CASES
Solution:
The domain of the function is all real numbers, so we need to find the
values of x for which the denominator of the function is zero.
The denominator of the function is the sum of the absolute values of the
two terms inside the absolute value.
Each of these terms is a quadratic in x, and the roots of a quadratic
equation are real unless the equation is identically zero.
So, we need to find the values of x for which the two quadratic equations
are identically zero.
```python
from sympy import symbols, Abs, solveset, S
from sympy.core.relational import Ne

def non_domain_values():
    x = symbols('x')
    # These are the values of x for which the function is not defined
    roots1 = solveset(Ne(x**2 - 6*x + 8, 0), x, domain=S.Reals)
    roots2 = solveset(Ne(x**2 + x - 6, 0), x, domain=S.Reals)
    # The roots of a quadratic equation are always in the domain of the function,
    # so we need to find the values of x that are not in the union of the roots
    non_domain_values = S.Reals - (roots1.union(roots2))
    return non_domain_values

result = non_domain_values()
print(result)
```
```output
{2}
```
The real value of $x$ that is not in the domain of the function is $\boxed{2}$.
Listing 3: Success case for ToRA: Self-Correcting Errors with tool feedback.
Question: The difference between two perfect squares is 133. What is the
smallest possible sum of the two perfect squares?
Solution:
We can solve this problem by noting that any perfect square can be
written as $(a + b)^2$ or $(a - b)^2$ for some integers $a$ and $b$.
So we can rewrite the question as the following problem: Find the
minimum possible value for $(a + b)^2 + (a - b)^2$ given that $(a + b)
^2 - (a - b)^2 = 133$.
```python
from sympy import symbols, Eq, solve

def min_sum_perfect_squares():
    """
    Find the smallest possible sum of two perfect squares
    given that the difference between them is 133.

    Returns:
    Integer: The minimum possible sum of the two perfect squares.
    """
    a, b = symbols('a b')
    # The middle of this listing was lost in extraction; the lines below are a
    # reconstruction consistent with the printed output: impose
    # (a + b)**2 - (a - b)**2 = 133 and substitute the solution for b.
    b_expr = solve(Eq((a + b)**2 - (a - b)**2, 133), b)[0]  # b = 133/(4*a)
    min_sum_squares = (a + b_expr)**2 + (a - b_expr)**2
    return min_sum_squares

result = min_sum_perfect_squares()
print(result)
```
```output
(a - 133/(4*a))**2 + (a + 133/(4*a))**2
```
The problem seems to be more complex than anticipated and the solution
was incorrect.
Since the difference between two squares can be factored, we can write:
$133$ has two pairs of factors: 1 and 133, and 7 and 19.
So, either $x+y = 133$ and $x-y = 1$, or $x+y = 19$ and $x-y = 7$.
(13, 6) result in the smallest sum of the two perfect squares.
```python
def min_sum_perfect_squares():
    # Reconstructed (middle of the listing lost in extraction):
    # x + y = 19 and x - y = 7 give (x, y) = (13, 6).
    min_sum_squares = 13**2 + 6**2
    return min_sum_squares

print(min_sum_perfect_squares())
```
```output
205
```
Solution:
```python
from sympy import symbols, Sum, factorial, oo, simplify

def infinite_sum():
    k = symbols('k')  # define the summation index
    expr = (k + 2) / (factorial(k) + factorial(k + 1) + factorial(k + 2))
    sum_expr = Sum(expr, (k, 1, oo))
    result = sum_expr.doit()
    return result

result = infinite_sum()
print(result)
```
```output
Sum((k + 2)/(factorial(k) + factorial(k + 1) + factorial(k + 2)), (k, 1, oo))
```
The sum does not converge to a simple expression, so the answer is $\boxed{\sum_{k = 1}^\infty \frac{k + 2}{k! + (k + 1)! + (k + 2)!}}$.
Listing 5: Failure case for ToRA: Failed fallback from program timeout to rationale-based solution.
Question: For some positive integer $n,$ $0 < n < 180,$
\[\csc (2^3)^\circ + \csc (2^4)^\circ + \csc (2^5)^\circ + \dots + \csc
(2^{2019})^\circ = \sec n^\circ.\]Find $n.$
Solution:
```python
from sympy import symbols, solve, csc, sec, pi

def find_n():
    n = symbols('n')
    lhs = sum(csc(2**i) for i in range(3, 2020))
    rhs = sec(n)
    equation = lhs - rhs
    solution = solve(equation, n)
    # Filter out solutions that are not in the interval (0, 180)
    solution = [sol.evalf() for sol in solution if 0 < sol.evalf() < 180]
    return solution[0]

result = find_n()
print(result)
```
```output
timeout_decorator.timeout_decorator.TimeoutError: 'Timed Out'
```
The solution to this problem is not straightforward to solve using sympy
due to the complexity of the equation and the large number of terms
in the sum.
However, we can solve this problem using a different approach. The sum on
the left hand side is a telescoping sum, which means that many of
the terms cancel out. After cancelling, we are left with $\csc
(2^{2019}) - \csc(2^3)$.
Since $0 < n < 180$, the only solution is $n = 90$. So, the value of $n$
is $\boxed{90}$.
**Ground Truth**: 82
[asy]
size(150);
draw((0,0)--(10,0)--(10,10)--(0,10)--(0,0),linewidth(0.7));
draw((5*sqrt(2)/(1 + sqrt(2)),0)--(0,5*sqrt(2)/(1 + sqrt(2))),linewidth
(0.7));
draw((10 - (5*sqrt(2)/(1 + sqrt(2))),0)--(10,5*sqrt(2)/(1 + sqrt(2))),
linewidth(0.7));
draw((0,(10-(5*sqrt(2)/(1+sqrt(2)))))--(5*sqrt(2)/(1 + sqrt(2)),10),
linewidth(0.7));
draw((10 - (5*sqrt(2)/(1 + sqrt(2))),10)--(10,10 - 5*sqrt(2)/(1 + sqrt(2)
)),linewidth(0.7));
[/asy]
Solution:
The side length of the octagon is the hypotenuse of the right triangle
formed by the side of the square and the diagonal of the octagon
The hypotenuse is the sum of the side length of the square and the length
of the diagonal of the octagon
The length of the diagonal of the octagon is 5*sqrt(2)/(1 + sqrt(2))
```python
from sympy import sqrt, Rational

def octagon_side_length():
    square_side_length = 10
    octagon_side_length = square_side_length + 5*sqrt(2)/(1 + sqrt(2))
    return round(octagon_side_length, 2)

result = octagon_side_length()
print(result)
```
```output
12.93
```
The length of one side of the octagon is $\boxed{12.93}$ inches.
**Ground Truth**:
Recall that the ratio of the hypotenuse to the leg of an isosceles right
triangle is $\sqrt{2}$. Looking at the removed triangles, we see that
the three segments that make up the side of square are $s/\sqrt{2}$,
$s$, and $s/\sqrt{2}$, where $s$ is the side length of the octagon.
Setting the sum of these three side lengths equal to 10 inches, we
find \begin{align*}
\frac{s}{\sqrt{2}}+s+\frac{s}{\sqrt{2}}&=10 \implies \\
s\left(\frac{1}{\sqrt{2}}+1+\frac{1}{\sqrt{2}}\right)&=10 \implies \\
s(\sqrt{2}+1)&=10 \implies \\
s&=\frac{10}{\sqrt{2}+1}=10(\sqrt{2}-1),
\end{align*} where we have rationalized the denominator twice: \[
\frac{1}{\sqrt{2}}=\frac{1}{\sqrt{2}}\cdot \frac{\sqrt{2}}{\sqrt{2}}=\frac{\sqrt{2}}{2},
\] and \[
\frac{10}{\sqrt{2}+1}=\frac{10}{\sqrt{2}+1}\cdot\frac{\sqrt{2}-1}{\sqrt{2}-1}=10(\sqrt{2}-1).
\] To the nearest hundredth, $s=\boxed{4.14}$ inches.
[asy]
size(150);
defaultpen(linewidth(0.7)+fontsize(10));
real s = 10/(1+sqrt(2));
draw((0,0)--(10,0)--(10,10)--(0,10)--(0,0));
draw((5*sqrt(2)/(1 + sqrt(2)),0)--(0,5*sqrt(2)/(1 + sqrt(2))));
draw((10 - (5*sqrt(2)/(1 + sqrt(2))),0)--(10,5*sqrt(2)/(1 + sqrt(2))));
draw((0,(10-(5*sqrt(2)/(1+sqrt(2)))))--(5*sqrt(2)/(1 + sqrt(2)),10));
draw((10 - (5*sqrt(2)/(1 + sqrt(2))),10)--(10,10 - 5*sqrt(2)/(1 + sqrt(2)
)));
label("$s$",(10-s/(2*sqrt(2)),10-s/(2*sqrt(2))),SW);
label("$\displaystyle{\frac{s}{\sqrt{2}}}$",(10,10-s/(2*sqrt(2))),E);
label("$\displaystyle{\frac{s}{\sqrt{2}}}$",(10,s/(2*sqrt(2))),E);
label("$s$",(10,5),E);
draw(rightanglemark((10,0),(10,10),(0,10)));[/asy]