ARB: Advanced Reasoning Benchmark for Large Language Models
Tomohiro Sawada1,2,∗, Daniel Paleka1,3, Alexander Havrilla1,2, Pranav Tadepalli1,2, Paula Vidas1,
1 DuckAI  2 Georgia Tech  3 ETH Zürich  4 Nomos AI  5 Stanford University Center for Legal Informatics  6 MILA
∗ Email: [email protected].  ‡‡ Email: [email protected].
Abstract
1 Introduction
In recent years, models such as GPT-3 [Brown et al., 2020], GPT-4 [OpenAI, 2023], PaLM [Chowdh-
ery et al., 2022], and Chinchilla [Hoffmann et al., 2022] have shown increasing performance across a
wide variety of natural language tasks ranging from translation to reasoning [Bubeck et al., 2023,
Laskar et al., 2023]. This rapid progress has been closely tracked and assessed by evaluating LLMs on
benchmarks, which test model capabilities on a set of standardized problems. The GLUE benchmark
[Wang et al., 2019b] for language understanding was first released in April 2018; but models such
as BERT [Devlin et al., 2019] and GPT-2 [Radford et al., 2019] in the following year were already
powerful enough to necessitate the “SuperGLUE” benchmark [Wang et al., 2019a]. Since then, the
race between language models and benchmarks has increasingly favored the former.
Scaling up model sizes and datasets alike has led to rapid improvements on various natural language
tasks on benchmarks like BIG-bench [Srivastava et al., 2022] and HELM [Liang et al., 2022]. Neural
scaling laws [Kaplan et al., 2020, Caballero et al., 2023, Alabdulmohsin et al., 2022] have been used
to predict the behavior of large scale models on various metrics. Nevertheless, LLM performance
often increases unpredictably [Wei et al., 2022a], especially on tasks that require reasoning abilities.
Predictions of performance on ML benchmarks often underestimate the rate of progress [Steinhardt,
2022]. Since progress has been faster than anticipated, new benchmarks need to be more difficult.
Models such as ChatGPT have shown the ability to pass entry-level examinations in fields such as
law [Bommarito II and Katz, 2022], medicine [Kung et al., 2023], economics [Caplan, 2023], and
mathematics [Shakarian et al., 2023]. Nevertheless, LLM understanding of many fields is reportedly
shallow and unreliable [Shapira et al., 2023]. Expert reasoning in domains with specialized knowledge
is essential for automated systems to augment skilled professionals [Noy and Zhang, 2023].
In this paper, we introduce a new benchmark dataset, ARB (Advanced Reasoning Benchmark),
designed to evaluate expert reasoning abilities in mathematics, physics, chemistry, biology, and law.
To make the benchmark more challenging than previous benchmarks, we extract graduate-level tasks
from resources intended for domain professionals. The performance of current models such as GPT-4
on the quantitative parts of ARB is very low using standard prompting methods.
Our dataset offers improvements over existing benchmarks:
• Hundreds of problems requiring expert reasoning in quantitative subjects, where LLMs are
known to underperform;
• A large percentage of the problems are short-answer and open response questions, in contrast
to the multiple-choice questions that dominated earlier benchmarks.
2 Related Work
Improving the reasoning capabilities of LLMs has been a subject of recent interest, with a particular
focus on advanced prompting techniques [Wei et al., 2022b, Kojima et al., 2023, Wang et al., 2023,
Yao et al., 2023, Nye et al., 2021]. Such techniques have seen increasingly successful applications
in solving reasoning problems involving commonsense reasoning and mathematics, by promoting
active reasoning processes within the LLMs before yielding final answers.
Model architectures such as Minerva [Lewkowycz et al., 2022] have exemplified the enhancement of
reasoning capabilities through fine-tuning on extensive datasets covering math and reasoning tasks.
This has yielded improved performance across several benchmarks, including MATH [Hendrycks
et al., 2021], GSM8K [Cobbe et al., 2021], and MMLU [Hendrycks et al., 2020]. Concurrently, other
lines of research [Li et al., 2023, Lightman et al., 2023, Cobbe et al., 2021] have investigated the
application of verification techniques to augment and enhance LLM performance.
Most of the aforementioned work has typically evaluated techniques against math benchmarks
(e.g., GSM8K [Cobbe et al., 2021], MATH [Hendrycks et al., 2021], SVAMP [Patel et al., 2021],
ASDiv [Miao et al., 2020], AQuA [Ling et al., 2017], MAWPS [Koncel-Kedziorski et al., 2016],
MultiArith [Roy and Roth, 2016]) and commonsense reasoning tasks (e.g., CSQA [Talmor et al.,
2018], StrategyQA [Geva et al., 2021], HotpotQA [Yang et al., 2018]). Recently, several new
benchmarks have been introduced for reasoning and planning tasks, such as the GPT-Planning
Benchmark [Valmeekam et al., 2023], ALERT Reasoning Benchmark [Yu et al., 2022], JEEBench
[Arora et al., 2023], and [Gendron et al., 2023]. Additionally, comprehensive evaluation suites like
the Chain-of-Thought Hub [Fu et al., 2023] have been proposed.
Despite their utility, existing benchmarks are limited in difficulty, represent a restricted range of
reasoning challenges, and do not necessarily mirror real-world tasks demanding complex reasoning.
Moreover, recent advancements such as Minerva [Lewkowycz et al., 2022] have revealed that these
benchmarks may not offer sufficient challenge.
The rapid progress in LLM capabilities has led many to explore using LLMs in the LLM evaluation
pipeline. Apart from using LLMs to generate evaluation tasks [Zhang et al., 2022, Perez et al., 2022],
LLMs have increasingly been used as a proxy for human evaluation [Chiang and Lee, 2023, Liu et al.,
2023, Fu et al., 2023, Kocmi and Federmann, 2023]. Useful LLM-based evaluation for alignment has
been done using rubrics [Bai et al., 2022]. We explore the efficacy of rubrics for evaluation when
applied to highly complex math and physics problems.
3 Benchmark
The key considerations when building a machine learning benchmark are:
• Difficulty. Most tasks have to be out of reach of current models; a benchmark where many
models score over 95% is not useful for tracking differential AI development.
• Usefulness. The tested skills should correlate with generally useful human skills.
• Ease of evaluation. It should be straightforward for the model creators to compare the
performances of different models. The scores should be interpretable.
• Minimizing data contamination. A consistent issue with popular benchmarks is that
the recent LLMs contain some tasks in their training data [OpenAI, 2023]. This leads to
overestimation of true model capabilities.
• Connection to general capabilities. If a model is trained on data similar to the benchmark,
it is possible it achieves high performance without generalization or “intelligence”, failing to
solve novel tasks of similar difficulty [Chollet, 2019]. Conversely, problems should not be
pathological or overly adversarial, to avoid the dangers of underclaiming [Bowman, 2021].
3.1 Formatting
The benchmark consists of three types of questions: multiple choice, short answer, and open response,
in descending order of proportion in the dataset.
• Multiple choice questions consist of a question and four to five possible answers, and
the correct answer is the one that best answers the question. They were sourced from
standardized tests, such as the MCAT and bar exam prep, and make up a large proportion of
the dataset due to their ease of grading.
• Short answer questions, on the other hand, ask for final answers in the format of a short
phrase or mathematical expression. They were sourced from problem books such as Souza
and Silva [2008], Gelca and Andreescu [2017], and physics book series Lim and Qiang
[2001], Lim [2007], Lim [1998], Lim et al. [2019], and Lim [1996]. We generally avoided
algebraic expressions, because of technical difficulties in the grading process.
A given algebraic expression may have several equivalent forms (e.g. nontrivial functional
relations for the functions appearing in the final answer), and a grading scheme which
accounts for all possible variations across our entire dataset is not feasible. Moreover,
physics problems often require answers introducing new notation that is not explicitly
mentioned in the problem statement.
• Open response questions are more challenging: they consist of a question and a blank
space for the answer. They were sourced from problem books and exams, such as the
Harvard PhD comprehensive exams in mathematics [Harvard University, 2021]. Such tasks
require manual grading. These questions are aspirational in nature, as current systems (e.g.
ChatGPT) cannot produce satisfactory responses, even for the “elementary” problems.
3.2 Mathematics
This part of the dataset is the most diverse. It includes contest mathematics problems as well as
“university mathematics” (i.e. mathematics traditionally taught in universities at the undergraduate
and beginning graduate level). The contest problems are sourced from Gelca and Andreescu [2017]
and Brayman and Kukush [2018], and the university mathematics problems are sourced from Souza
and Silva [2008] and Harvard University [2021]. The dataset does not include high school contest
problems because those are already present in other well-known benchmarks [Hendrycks et al., 2021].
The Putnam and Brayman books both contain official solutions, which we also include in the dataset.
This can be useful for fully automating the grading process, which we leave to future work.
For university mathematics, we pick Souza and Silva [2008] for its large selection of “standard”
undergraduate mathematics problems, as well as many problems suitable for the short answer portions.
We also select Harvard University [2021] because it covers topics that other collections of exams
rarely cover, such as representation theory of finite groups and algebraic topology.
Table 1: Types of problems in the benchmark by subject area.

Subject          Answer Type                   Number
Mathematics      Numerical                     52
                 Symbolic                      34
                 Proof-like                    19
Physics          Numerical                     80
                 Numerical (w/ image)          18
                 Symbolic                      18
                 Symbolic (w/ image)           13
Law              Multiple Choice               627
MCAT (Reading)   Multiple Choice               165
MCAT (Science)   Multiple Choice               144
                 Multiple Choice (w/ image)    37
3.3 Physics
The physics problems are structured similarly to the math problems. The main difference is that some
physics problems contain figures, and there are more problems with numerical answers. The problems
were sourced from the Major American Universities PhD Qualifying Questions and Solutions series
[Zhongguo-Kexue-Jishu-Daxue, 1990].
3.4 MCAT
The MCAT test contains multiple choice problems testing biology, psychology, chemistry, physics,
and reading comprehension. The MCAT problems are sampled from the third edition of McGraw-
Hill Education 3 MCAT Practice Tests [Campbell et al., 2017] and cover both science and reading
questions. This book was chosen as very few of these problems appear in standard web-searchable
sources, limiting contamination. As in the previous categories, we pick problems which are self-
contained. Because some MCAT science questions are accompanied by images, we accompany such
questions with corresponding image files.
3.5 Law
Applying law involves logical reasoning in addition to grasping legal knowledge. This
makes assessments of legal skills an especially attractive type of language model benchmark, where
we are attempting to assess the reasoning and intelligence of these models. Furthermore, if the models
better understand law, they can be more reliable and ultimately more useful in real-world applications,
potentially even increasing the efficiency and transparency of governments more broadly.
Most lawyers in the U.S. go to law school, graduate, then study for the Bar Examination, and then
must pass the bar before going on to practice law professionally. To evaluate legal understanding of
the models, we use an older Bar Examination practice set that, to the best of our knowledge, is not
available online in a way that could have led to its inclusion in training data for the language models
that we are assessing. The practice bar exam we administer to the various language models covers
most major areas of law and therefore it tests legal reasoning and broad U.S. legal knowledge.
4 Evaluation
We evaluate current LLMs on all text-only problems in our dataset. Other LLM benchmark papers do
not evaluate on multimodal tasks due to the lack of good multimodal models; we follow suit. Given
public communications about GPT-4 [OpenAI, 2023] and Gemini [Ghahramani, 2023], it is likely the
physics and MCAT image problems will be useful for testing multimodal LLMs soon.
Models We evaluate ChatGPT (gpt-3.5-turbo-0301), GPT-3.5 (text-davinci-003),
GPT-4 with 8k context length (gpt-4-0314), and Claude (claude-v1.3-100k). We evaluate
all question types using task-specific instructions and chain of thought. In chat models, we put the
instructions as the system prompt; otherwise we put them at the beginning of the prompt.
In all problem types, in order to extract the model’s final answer, we instruct the model to write its
final answer at the end of the response after the delimiter “ANSWER:”. We then parse the model-generated
final answer as the remaining text after the delimiter. The response is marked as incorrect if
the delimiter is not found. Due to the differences in evaluation for multiple choice versus open-ended
responses, we adopt multiple evaluation procedures.
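For concreteness, the extraction step can be sketched as follows; the helper name and exact string handling are illustrative rather than the exact implementation used for ARB.

```python
from typing import Optional

def extract_final_answer(response: str, delimiter: str = "ANSWER:") -> Optional[str]:
    """Return the text after the last occurrence of the delimiter.

    Returns None when the delimiter is absent, in which case the response is
    marked as incorrect, as described above.
    """
    if delimiter not in response:
        return None
    return response.rsplit(delimiter, 1)[1].strip()
```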
Multiple choice To evaluate multiple choice questions, we can simply compare the extracted final
answer to the ground truth. A response is considered correct if the extracted choice matches the
ground truth choice. With appropriate prompting, all models output a parsable answer > 97% of the
time. We conduct a separate manual evaluation on a sampled subset of the questions to check that our
parsing procedure is not mischaracterizing the true performance of the model.
Numerical To evaluate problems with a numerical final answer, we first extract the delimited model
answer as above. In the physics problems, many answers are in units; we prompt the model with
information about the unit, and instruct it to fully simplify its answer and omit any units. However,
sometimes the model forgets to do either or both, and so we apply a series of regexes to remove units.
We then attempt to parse the result into a mathematical expression using Python’s SymPy library
[Meurer et al., 2017]. If this parsing fails, the answer is marked as incorrect. Once parsed, we score the model answer as correct if $\frac{|\text{model\_answer} - \text{ground\_truth}|}{\text{ground\_truth}} < 0.01$.
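A minimal sketch of this numerical check, assuming the unit-stripping regexes have already been applied to the extracted answer (the function name and error handling here are illustrative):

```python
import sympy

def score_numerical_answer(extracted: str, ground_truth: float, rel_tol: float = 0.01) -> bool:
    """Parse the extracted answer with SymPy and accept it when the relative error
    with respect to the ground truth is below rel_tol."""
    try:
        value = float(sympy.sympify(extracted).evalf())
    except (sympy.SympifyError, TypeError, ValueError):
        return False  # unparsable answers are marked incorrect
    return abs(value - ground_truth) / abs(ground_truth) < rel_tol
```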
Symbolic Problems with symbolic answers are less structured and harder to parse. To do so,
we again leverage SymPy, first normalizing expressions to contain a default set of variable names
and then checking for equivalence up to a permutation of the variables. However this approach is
error-prone and only works for the subset of symbolic responses in a function form. More advanced
responses, such as those containing set notation, require human evaluation.
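The functional-form check can be sketched as follows: canonicalize both expressions with SymPy and test equivalence under permutations of the free symbols. The exact normalization used for ARB may differ, and, as noted, this breaks down for answers that are not plain functional expressions.

```python
from itertools import permutations
import sympy

def symbolically_equivalent(model_answer: str, reference_answer: str) -> bool:
    """Check whether two expressions agree up to a renaming of their free symbols."""
    try:
        expr_a = sympy.sympify(model_answer)
        expr_b = sympy.sympify(reference_answer)
    except (sympy.SympifyError, SyntaxError):
        return False
    syms_a = sorted(expr_a.free_symbols, key=str)
    syms_b = sorted(expr_b.free_symbols, key=str)
    if len(syms_a) != len(syms_b):
        return False
    # Try every assignment of the model's variables to the reference variables.
    for perm in permutations(syms_b):
        candidate = expr_a.subs(list(zip(syms_a, perm)), simultaneous=True)
        if sympy.simplify(candidate - expr_b) == 0:
            return True
    return False
```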
Proof-like Natural language proofs cannot be evaluated automatically; the authors with training in
mathematics grade the proofs. Further manual human evaluation requires a thorough inspection of
the intermediate reasoning steps. This makes evaluation expensive in practice.
Model-based evaluation To address the difficulties in developing automated metrics for evaluating
more advanced problems, we experiment with two model based approaches. First, we prompt
ChatGPT to grade the equivalence of two symbolic expressions, with score options: 0 when the answer is totally
incorrect, 0.5 when the symbolic expressions are nearly the same (e.g., equivalent up to a constant), and
1 when they are an exact match. Our prompting strategy can be found in the supplementary material.
More generally, we evaluate the capabilities of GPT-4 to grade intermediate reasoning chains via a
rubric-based evaluation approach. For symbolic and proof-like problems, we few-shot prompt GPT-4
to create a 10-point rubric. This is done by handwriting a small set of initial rubrics for proof-like
problems and prompting the model with these examples and the ground truth reference solution. The
model assigns point values to intermediate steps using the reference solution as a guide. This process
is illustrated in the supplementary material.
With model generated rubrics in hand, we then evaluate each question against its rubric. This is done
by again prompting GPT-4 to go step by step through the model answer and assign partial credit
based on the rubric. This provides a denser automatic evaluation metric on increasingly unstructured
answers. As a nice side benefit, it makes human evaluation of complex symbolic questions much
easier, significantly reducing the amount of time required per question.
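A rough sketch of the two-stage procedure is given below. The prompt wording is illustrative only (the actual prompts are in the supplementary material), and call_llm is a hypothetical helper standing in for whichever chat-completion API is used.

```python
from typing import Callable

def rubric_based_grade(problem: str, reference_solution: str, candidate_solution: str,
                       few_shot_rubrics: str, call_llm: Callable[[str], str]) -> str:
    """Generate a rubric from the reference solution, then grade the candidate against it."""
    # Stage 1: few-shot prompt the grader model to write a 10-point rubric.
    rubric_prompt = (
        "You design grading rubrics for advanced math and physics problems.\n\n"
        f"{few_shot_rubrics}\n\n"
        f"Problem:\n{problem}\n\nReference solution:\n{reference_solution}\n\n"
        "Write a 10-point rubric that assigns points to the key intermediate steps."
    )
    rubric = call_llm(rubric_prompt)

    # Stage 2: grade the candidate solution step by step against the generated rubric.
    grading_prompt = (
        f"Rubric:\n{rubric}\n\nCandidate solution:\n{candidate_solution}\n\n"
        "Go step by step through the candidate solution, assign partial credit "
        "according to the rubric, and finish with a line 'SCORE: <points out of 10>'."
    )
    return call_llm(grading_prompt)
```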
4.1 Results
Figure 1: Accuracy of models over automatically scored components of the ARB benchmark. Numerical questions are evaluated with a relative error threshold of $10^{-2}$.
We see models generally do quite well on the multiple choice Law and MCAT subsets, but struggle
significantly on questions with numerical final answers. GPT-4 is the only model capable of reliably
simplifying complex expressions, but even GPT-4 struggles to reliably perform arithmetic and
symbolic manipulations over long contexts.
On the multiple-choice questions, the only model that cannot reliably follow the answer formatting
instructions is gpt-3.5-turbo. This happens for a variety of reasons, including the model refusing
to answer or to commit to a single answer choice. On the Law benchmark, gpt-3.5-turbo does
not output a parsable answer around 25% of the time. The other models exhibit this failure in less
than 5% of multiple-choice questions, with GPT-4 being correctly parsed over 99% of the time.
We see a similarly low performance profile across models on symbolic problems, reported in Table 2.
The GPT-4 evaluation paper [Bubeck et al., 2023] classified errors GPT-4 makes in single-pass
evaluation on GSM8K [Cobbe et al., 2021] and MATH [Hendrycks et al., 2021] into three types:
arithmetic mistakes, misunderstood statement, and wrong approach. We make a more fine-grained
analysis and extend it to math and physics problems in our dataset. The results are in Table 3.
The errors current LLMs make on the Mathematics part of ARB fall into five general types:
• Misunderstanding / answering only a part of the question / misread problem;
• Wrong approach: the model’s early chain of thought does not guess the right approach;
• Logical errors: the model uses a false implication between two statements;
• Hallucinating facts or theorems: the model confabulates a statement that is false in general, or not applicable in context;
• Arithmetic/calculation error: the model multiplies incorrectly, omits a term in an expression, gives a wrong numerical value for a fraction, and other similar mistakes.

Table 3: Mistakes on mathematics and physics problems in ARB, GPT-4.

                    Misread   Wrong      Logical error       Arithmetic   Correct   Correct
                    problem   approach   or hallucination    mistake      answer    reasoning
Math Numerical      0%        25%        88%                 48%          3%        3%
Math Symbolic       16%       50%        29%                 4%           16%       16%
Math Proof-like     5%        50%        72%                 16%          n/a       5%
Physics Numerical   0%        80%        53%                 6%           6%        6%
Physics Symbolic    0%        37%        68%                 31%          28%       12%
We grade GPT-4 using the above as a guideline. Our grading of the model’s CoT answers is
not mutually exclusive; if the model both uses an approach that doesn’t go anywhere and makes
a calculation error in it, we count it towards both categories. Note that the errors might not be
independent: arithmetic mistakes could be more or less frequent in wrong approach solutions as
opposed to solutions with the correct idea. We notice that the model is likely to make incorrect
simplifications to get to some final answer in approaches that cannot work; this is expected, as
prompting the model to produce a solution with a final answer often leads it to produce some final
answer by any means.
When the model outputs a chain of implications, it is not always clear whether some false statement is
due to a logical error, or it is a straight-out confabulation. We merge those two error types in Table 3.
Some problems ask for multiple things to be proven or calculated. Our graders gave the model a
score of 0.5 if it correctly derived at least half of the "subproblems" (for example, homology groups
of a given manifold). With this more benevolent form of grading, the performance of GPT-4 on the
Proof-like problems jumps to 16%. Where applicable, slight discrepancy with automatic evaluation
is also possible due to the error tolerance.
We note that many of the problems in Physics Symbolic have correct symbolic answers even when
there are flaws in the chain of thought reasoning of GPT-4. This is likely due to some kind of
memorization, although not necessarily from the same sources: see Table 12 for an example.
It is possible that our graders underestimate the rate of arithmetic mistakes in some cases, especially
when the approach is clearly wrong, or it is not clear whether a given error is due to faulty reasoning
or due to a missed term in the calculations.
For the larger subsets (see Table 1), we subsample the problems to between 20 and 40 per subject
area; this is enough for a ballpark estimate of the frequency of different errors, and is not worth
increasing because attributing error types is inherently fuzzy.
As reasoning tasks increase in complexity, it gets harder to evaluate model performance. Symbolic
final answers are in some cases difficult to grade automatically. Further, we are often more interested
in the correctness of the reasoning used to produce the final answer; but evaluating intermediate
reasoning steps requires expert human supervision. An ideal solution would be to use LLMs as
evaluators based on a reference solution; unfortunately, there are major reliability issues.
To improve reliability, we propose generating rubrics as an important component of the evaluation
process. The model generates the rubric from the reference solution, then evaluates any solution
based on the generated rubric. To aid rubric generation, we give few-shot examples of human-written
rubrics to the rubric-generating model. We study this approach by conducting a human evaluation
of GPT-4 generated rubrics and the GPT-4 grading of its own solutions using the generated rubrics.
We rate the quality of GPT-4 generated rubrics by hand in the first two rows of Table 4. Likert scores
from 1-5 are assigned to both the coverage of the rubric, i.e. how well it captures key subproblems,
and the point breakdown. Rubric quality scores are reported in Table 4 for symbolic and proof-like
problems. We find GPT-4 designs rubrics which cover the crucial solution steps well, but struggles to
properly allocate points to each step based on relative importance. However, it is much better than
GPT-3.5-turbo, which tends to over-allocate points to only one or two solution steps.
Table 4: Evaluations of rubric quality and GPT-4 rubric evaluation failure cases. Rubric coverage
and rubric point spread are on a 1-5 Likert scale. Alternative solutions is the percentage of correct
solutions found not covered by the rubric. Extra/reduced credit track how often GPT-4 erroneously
assigns or deducts points. Hallucinated rubric tracks how often GPT-4 assigns points by referring to a
rubric item not actually present in the rubric.
The obvious limitation of rubric scoring is the case of correct solutions not covered by the rubric. We
find that on our benchmark, GPT-4 rarely generates a fully correct, or even mostly correct, solution
that does not follow the rubric. After rating the model-generated rubrics, we manually
grade GPT-4’s solutions according to each rubric and compare the result to GPT-4’s evaluation. We
also annotate, for each problem, both whether GPT-4 assigns credit inappropriately or fails to assign
credit when it should.
Table 5: Average scores (out of 10 points) when assigned by human annotators versus GPT-4.
Correlation is the Pearson correlation coefficient between the two scores, over all problems.
We find a moderately high correlation between GPT-4’s evaluation score and the manual score. In
some cases, the model assigns an extra point or two when compared to the annotated rubric score.
However, the self-eval score almost never deviates more than two points from the ground truth. The
main failure mode we detect is the assignment of partial credit to attempted solutions completely
outside the problem rubric, where the human evaluation score is always zero. Taken together, we
believe these results suggest that rubric-based evaluation is a promising automated evaluation method.
Having established rubric-based evaluation as an (imperfect) proxy for correctness, we now comment
on the GPT-4 performance graded by the rubric. Table 5 shows GPT-4 is best at generating correct
intermediate reasoning steps for physics questions. Inspecting the model outputs suggests that GPT-4
is good at recalling relevant and useful concepts in physics for solving the relevant problem; however,
it can struggle with the mathematical manipulations required to solve the problem. The model is
worse at recognizing the correct concepts and formulating an appropriate plan for the math questions,
particularly for proof-like problems.
5 Conclusion
We introduced ARB, a benchmark of advanced reasoning problems sourced from graduate-level exams and professional resources. Despite advancements in current LLMs, their performance on the quantitative subjects in ARB remains very low. We also
proposed a rubric-based self-evaluation method, enabling LLMs to grade their own reasoning. This
method is not yet reliable enough to replace human grading. We hope this method can be extended to
more reliable and cheap testing of complex model outputs.
As with all other benchmarks that are not created anew and kept secret, it is possible there is data
contamination. For example, the MCAT books are not available for free in most jurisdictions, but it is
certainly possible that some model creators have trained on them anyway.
Finally, the benchmark does not remotely cover all aspects of human ability; a model solving
this benchmark perfectly could still be much worse than most educated people in many aspects.
Nevertheless, we hope that increasing the difficulty standards helps the research community ground
the performance of increasingly powerful models more accurately.
References
Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai. Revisiting neural scaling laws in language
and vision. Advances in Neural Information Processing Systems, 35:22300–22312, 2022.
Daman Arora, Himanshu Gaurav Singh, and Mausam. Have LLMs advanced enough? A challenging problem
solving benchmark for Large Language Models, 2023.
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen,
Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah,
Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr,
Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael
Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson,
Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy
Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben
Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI:
Harmlessness from AI feedback, 2022.
Barbri. Barbri Practice Questions: Multistate Testing Practice Questions. Thomson/Bar/Bri, 2007. ISBN
9780314174017.
Michael Bommarito II and Daniel Martin Katz. GPT takes the bar exam. arXiv preprint arXiv:2212.14402,
2022.
Samuel R. Bowman. The dangers of underclaiming: Reasons for caution when reporting how NLP systems fail,
2021.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen
Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter,
Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark,
Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models
are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165.
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee,
Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments
with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
Ethan Caballero, Kshitij Gupta, Irina Rish, and David Krueger. Broken neural scaling laws, 2023.
Candice McCloskey Campbell, Shaun Murphree, Jennifer M. Warner, Amy B. Wachholz, Kathy A. Zahler,
and George J. Hademenos. McGraw-Hill Education 3 MCAT Practice Tests, Third Edition. McGraw-Hill
Education, Jan 2017. ISBN 1259859622.
Bryan Caplan. GPT retakes my midterm and gets an A, 2023. URL https://betonit.substack.com/
p/gpt-retakes-my-midterm-and-gets-an.
Cheng-Han Chiang and Hung-yi Lee. Can Large Language Models be an alternative to human evaluations?
arXiv e-prints, art. arXiv:2305.01937, may 2023. doi: 10.48550/arXiv.2305.01937.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul
Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha
Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prab-
hakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael
Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk
Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito,
David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani
Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor
Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang,
Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck,
Jeff Dean, Slav Petrov, and Noah Fiedel. PaLM: Scaling language modeling with Pathways, 2022. URL
https://arxiv.org/abs/2204.02311.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and
John Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021. URL
https://arxiv.org/abs/2110.14168.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional
transformers for language understanding. In Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), pages 4171–4186, Minneapolis, Minnesota, jun 2019. Association for Computational
Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as you desire. arXiv e-prints,
art. arXiv:2302.04166, feb 2023. doi: 10.48550/arXiv.2302.04166.
Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A continuous
effort to measure large language models’ reasoning performance, 2023.
Răzvan Gelca and Titu Andreescu. Putnam and beyond. Springer, 2017.
Gaël Gendron, Qiming Bao, Michael Witbrock, and Gillian Dobbie. Large language models are not abstract
reasoners, 2023.
Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. Did aristotle use a
laptop? a question answering benchmark with implicit reasoning strategies, 2021.
Department of Mathematics Harvard University. Qualifying examination for fall 2021, Aug 2021. URL
https://www.math.harvard.edu/media/quals-F21_with_solutions.pdf.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt.
Measuring massive multitask language understanding, 2020. URL https://arxiv.org/abs/2009.
03300.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob
Steinhardt. Measuring mathematical problem solving with the MATH dataset. CoRR, abs/2103.03874, 2021.
URL https://arxiv.org/abs/2103.03874.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford,
Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie
Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich
Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models, 2022.
URL https://arxiv.org/abs/2203.15556.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray,
Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020.
Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. GPT-4 passes the bar exam.
Available at SSRN 4389233, 2023.
Tom Kocmi and Christian Federmann. Large language models are state-of-the-art evaluators of translation
quality. arXiv e-prints, art. arXiv:2302.14520, feb 2023. doi: 10.48550/arXiv.2302.14520.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language
models are zero-shot reasoners, 2023.
Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS: A
math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San
Diego, California, jun 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1136. URL
https://aclanthology.org/N16-1136.
Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria
Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. Performance of ChatGPT on
USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2
(2):e0000198, 2023.
Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq Joty, and
Jimmy Xiangji Huang. A systematic study and comprehensive evaluation of ChatGPT on benchmark datasets,
2023.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh,
Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy
Gur-Ari, and Vedant Misra. Solving quantitative reasoning problems with language models, 2022. URL
https://arxiv.org/abs/2206.14858.
Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making large
language models better reasoners with step-aware verifier, 2023.
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang,
Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang,
Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson,
Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav
Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri
Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar,
Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang,
Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. Holistic evaluation of language models, 2022.
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John
Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step, 2023.
Swee Cheng Lim, Choy Heng Lai, and Leong Chuan Kwek. Problems and solutions on optics. World Scientific,
2019.
Yung-kuo Lim. Problems and solutions on thermodynamics and Statistical Mechanics. World Scientific, 1996.
Yung-kuo Lim. Problems and solutions on quantum mechanics: Major American universities Ph.D. qualifying
questions and solutions. World Scientific, 1998.
Yung-kuo Lim. Problems and solutions on electromagnetism. World Scientific Pub. Co, 2007.
Yung-kuo Lim and Yuan-qi Qiang. Problems and solutions on Mechanics. World Scientific, 2001.
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation:
Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167, Vancouver,
Canada, jul 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1015. URL https:
//aclanthology.org/P17-1015.
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation
using GPT-4 with better human alignment. arXiv e-prints, art. arXiv:2303.16634, mar 2023. doi: 10.48550/
arXiv.2303.16634.
Aaron Meurer, Christopher P. Smith, Mateusz Paprocki, Ondřej Čertík, Sergey B. Kirpichev, Matthew Rocklin,
AMiT Kumar, Sergiu Ivanov, Jason K. Moore, Sartaj Singh, Thilina Rathnayake, Sean Vig, Brian E. Granger,
Richard P. Muller, Francesco Bonazzi, Harsh Gupta, Shivam Vats, Fredrik Johansson, Fabian Pedregosa,
Matthew J. Curry, Andy R. Terrel, Štěpán Roučka, Ashutosh Saboo, Isuru Fernando, Sumith Kulal, Robert
Cimrman, and Anthony Scopatz. Sympy: Symbolic computing in python. PeerJ Computer Science, 3:
e103, jan 2017. ISSN 2376-5992. doi: 10.7717/peerj-cs.103. URL https://doi.org/10.7717/
peerj-cs.103.
Shen-yun Miao, Chao-Chun Liang, and Keh-Yih Su. A diverse corpus for evaluating and developing English
math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pages 975–984, Online, jul 2020. Association for Computational Linguistics. doi: 10.18653/v1/
2020.acl-main.92. URL https://aclanthology.org/2020.acl-main.92.
Shakked Noy and Whitney Zhang. Experimental evidence on the productivity effects of generative artificial
intelligence. Available at SSRN 4375283, 2023.
Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber,
David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Charles Sutton, and Augustus Odena. Show
your work: Scratchpads for intermediate computation with language models, 2021.
OpenAI. GPT-4 technical report, 2023.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word
problems?, 2021.
Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine
Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written
evaluations. arXiv preprint arXiv:2212.09251, 2022.
Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards: Purposeful and Transparent Dataset
Documentation for Responsible AI. arXiv e-prints, art. arXiv:2204.01075, April 2022. doi: 10.48550/arXiv.
2204.01075.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are
unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Subhro Roy and Dan Roth. Solving general arithmetic word problems, 2016.
Paulo Shakarian, Abhinav Koyyalamudi, Noel Ngu, and Lakshmivihari Mareedu. An independent evaluation of
ChatGPT on mathematical word problems (MWP). arXiv preprint arXiv:2302.13814, 2023.
Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, Yoav Goldberg, Maarten Sap, and
Vered Shwartz. Clever Hans or Neural Theory of Mind? Stress testing social reasoning in large language
models. arXiv preprint arXiv:2305.14763, 2023.
Paulo N de Souza and Jorge N. Silva. Berkeley problems in Mathematics. Springer New York, 2008.
Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch,
Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game:
Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
doi: 10.48550/ARXIV.2206.04615. URL https://arxiv.org/abs/2206.04615.
Jacob Steinhardt. AI forecasting: One year in, 2022. URL https://bounded-regret.ghost.io/
ai-forecasting-one-year-in/.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering
challenge targeting commonsense knowledge. CoRR, abs/1811.00937, 2018. URL http://arxiv.org/
abs/1811.00937.
Karthik Valmeekam, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Large language models
still can’t plan (a benchmark for LLMs on planning and reasoning about change), 2023.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy,
and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding
systems, 2019a.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A
multi-task benchmark and analysis platform for natural language understanding. In Proceedings of ICLR, 2019b.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and
Denny Zhou. Self-consistency improves chain of thought reasoning in language models, 2023.
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten
Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff
Dean, and William Fedus. Emergent abilities of large language models, 2022a.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and
Denny Zhou. Chain of thought prompting elicits reasoning in large language models, 2022b. URL https:
//arxiv.org/abs/2201.11903.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christo-
pher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering, 2018.
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan.
Tree of thoughts: Deliberate problem solving with large language models, 2023.
Ping Yu, Tianlu Wang, Olga Golovneva, Badr Alkhamissy, Gargi Ghosh, Mona Diab, and Asli Celikyilmaz.
ALERT: Adapting language models to reasoning tasks, 2022.
Sarah Zhang, Reece Shuttleworth, Derek Austin, Yann Hicke, Leonard Tang, Sathwik Karnik, Darnell Granberry,
and Iddo Drori. A dataset and benchmark for automatically answering and generating machine learning final
exams. arXiv preprint arXiv:2206.05442, 2022.
Hefei Zhongguo-Kexue-Jishu-Daxue. Major American universities Ph. D. qualifying questions and solutions. 5.
Problems and solutions on thermodynamics and statistical mechanics. World Scientific, 1990.
A Datasheet
We present the data card, following the format proposed by Pushkarna et al. [2022].
Dataset Owners. [email protected].
Example: Typical Data Point. Each data point of the dataset consists of a problem statement paired with a
ground truth solution. Table 7, Table 9, and Table 10 show the problem statements and ground truth
solutions of typical data points.
Sensitive Human Attributes. We have not found any sensitive human attributes in our dataset.
Data Distributions. Table 1 in Section 3.1 shows the number of problems for each subject area and
answer type. Text entries (problem statement, ground truth solution, ground truth answer) for all
categories are in LaTeX (although the non-quantitative subjects contain very few mathematical
expressions).
B Dataset format
The benchmark dataset is available in .jsonl format, containing problem statements, ground truth
solutions, and final ground truth answers for each entry. We additionally include metadata such as
subject names and problem topics, where available.
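As an illustration, each split can be read line by line with the standard json module; the file name and field names below are hypothetical, inferred from the description above, and may differ from the released keys.

```python
import json
from typing import Dict, List

def load_arb_jsonl(path: str) -> List[Dict]:
    """Load one .jsonl split of the benchmark; each line is a self-contained JSON object."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical file name and keys, for illustration only.
problems = load_arb_jsonl("arb_math_symbolic.jsonl")
for entry in problems[:3]:
    print(entry.get("subject"), "-", entry.get("problem_statement", "")[:80])
```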
We chose the four subject areas discussed earlier for several reasons. Primarily, the dataset focuses
on math and physics, as these subjects present more challenging problems than existing benchmarks.
However, to ensure a comprehensive evaluation of models, we also included subjects like Law and
MCAT. This inclusion allows for assessing model performance across a wider range of technical
domains, beyond the quantitative sciences.
Although previous works have evaluated recent models on law [Katz et al., 2023], we draw upon
the established importance of broad benchmarks like SuperGLUE [Wang et al., 2019a]. Making
a benchmark more comprehensive expands the evaluation scope while enhancing the dataset’s
significance in the wider AI research context.
We facilitate access to the dataset through API calls* . The problems can be accessed by the different
splits and subject areas.
We use an API to host the data (rather than making it publicly available on popular platforms such as
HuggingFace or Github) out of concern for data contamination. Many models released in the last few
years have been trained on benchmark datasets [OpenAI, 2023], an approach that could inadvertently
inflate the model’s performance on the associated benchmark. There are several proposed strategies
to mitigate this problem, including dataset poisoning and canary text methods. However, restricting
dataset accessibility to web crawlers seemed to be the only way to ensure integrity of our dataset for
future uses.
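A request against the hosting API might look roughly like the sketch below; the endpoint URL, query parameters, and subject identifier are placeholders rather than the actual ARB interface.

```python
import requests

# Placeholder endpoint; the real location is given in the project's documentation.
API_URL = "https://arb.example.org/v1/problems"

def fetch_problems(split: str, subject: str) -> list:
    """Fetch the problems for a given split and subject area from the hosting API."""
    response = requests.get(API_URL, params={"split": split, "subject": subject}, timeout=30)
    response.raise_for_status()
    return response.json()

# Example usage with a hypothetical subject identifier.
physics_problems = fetch_problems(split="test", subject="physics_symbolic")
```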
Table 7 presents a GPT-4 generated rubric and self-evaluation for symbolic math questions. GPT-4
does a good job breaking the problem into important sub-parts of the reference solution. Further, it
ensures a well-balanced point distribution among sub-problems, assigning similar point values to
each sub-part. In contrast, less powerful models like GPT-3.5-turbo struggle to generate good rubrics,
as they tend to under-specify sub-problems and allocate too many points to irrelevant steps.
The main downside of evaluating solutions only on the final answer is that the metric is not smooth;
a small mistake in the middle of the solution always leads to an incorrect final answer. Grading
using a rubric shows that GPT-4 generates some useful ideas, for example the correct formula for
parameterizing an ellipse. The model also correctly identifies that the question’s area is optimized by
an isosceles triangle. Despite this, it is unable to correctly compute the final answer due to an earlier
mistake in the response. This indicates that GPT-4 has some problem-solving abilities, but struggles
to detect or recover from earlier errors in generation.
D.1 Using ChatGPT for Symbolic Evaluation
Unlike GPT-4, GPT-3.5-turbo is not able to write rubrics with good coverage of the reference solution
and a fair point breakdown. Often the model will over-simplify the rubric and allocate far too many
points to non-essential parts of the problem. However, GPT-3.5-turbo does possess some ability to
reason about complex symbolic expressions. Motivated by this, we asked the model to grade the
final answers to symbolic math and physics problems. While much easier to grade than intermediate
reasoning steps, more involved symbolic expressions still require human evaluation to compare
accurately. Using cheap models like GPT-3.5-turbo to automate this symbolic equivalence boosts
our ability to evaluate models on this more complicated class of reasoning problems. We prompt
GPT-3.5-turbo to compare extracted model generated symbolic final answers from GPT-4 to the
reference answer and record results in Table 8.
GPT-3.5-turbo is surprisingly good at not equating non-equivalent symbolic statements, achieving
a false positive rate of 0. However, this comes at the cost of the model often not assigning credit when it
should, leading to an underestimate of the true performance of the graded answers on symbolic tasks.
In particular, GPT-3.5-turbo often fails when comparing expressions with more than 3 variables.
Furthermore, we observe in several cases the student model GPT-4 is able to correctly generate the
final answer, but does not present it in the expected format, making it impossible for GPT-3.5-turbo
to correctly compare against the ground truth reference. These errors could potentially be avoided by
giving the model access to the entire student-generated reasoning trace, but we expect this to be of
minimal help, and perhaps even to harm performance.
E Interesting Examples
Math Example. One common way GPT-4 gets numerical questions wrong is when it performs some
formal manipulations followed by a black-box numerical calculation, after which the model outputs an
(incorrect) answer (Table 10). The formal manipulation does not simplify the problem, and so all the
work of solving the problem is done in the final step of the calculation.
MCAT Example. GPT-4 gets confused when the meanings of words are implicit in prerequisite
knowledge or context. In one example about DNA replication (Table 9), while the model correctly
identifies that the radioactive thymine is present in the two strands of nucleotides from the original
DNA, it fails to deduce that both of the resulting double helices are radioactive. This seems to be
because the model confuses the word "DNA" with "strands" of the DNA. When looking at choice C,
the model (incorrectly) assumes that each of the four strands in the new double helices are radioactive,
when it is clear from context that the choice is referring to the radioactive molecule being present
somewhere in each double helix (not necessarily in each strand). Because of this misconception, the
model chooses D.
Law Example. An unexpected mistake from GPT-4 in answering law questions is where the model
reads too much into an answer choice. For example, GPT-4 incorrectly produced this answer and
reasoning:
“B: This answer choice suggests that when two crossing offers are identical, one
will be treated as an offer and the other as an acceptance. This accurately reflects
the situation between Mom and Huck, as they both agreed on the same price.”
“Yes, because when two crossing offers are identical in import, one will be treated
as an offer and the other as an acceptance.”
The error GPT-4 made is treating the statement in the answer choice (“when two crossing offers are
identical, one will be treated as an offer and the other as an acceptance”) as a fact, and then making a
choice based on that fact being true. Better prompting could likely prevent errors of this type.
F Memorization
While it is difficult to quantify potential problem leakage and memorization of similar problems, some
outputs suggest this might be a relevant factor. In Table 12, GPT-3.5-turbo appears to be memorizing
the answer to this symbolic physics problem, given that it is getting the formula for capacitance C
correct despite having the wrong reasoning. In contrast, GPT-4 stays faithful to its reasoning chain,
resulting in a slightly inaccurate final answer. Note that the overall performance of all models we test
is still somewhat low, hence not majorly affected by memorization of similar problems.
H Checklist Information.
Dataset Intended Uses. The ARB benchmark dataset, documented within the paper, is primarily
intended for research purposes. We do not endorse using this dataset to train models that help students
cheat on exams. We hope that the research community will use this benchmark to better assess
reasoning capabilities of language models.
Author Statement and License. We bear all responsibility in case of violation of rights. The ARB
dataset is licensed under CC BY 4.0, and all helper code we release is under the MIT license. For all
problems originating in books listed in Section 3, following [Hendrycks et al., 2021], we abide by
Fair Use §107: "the fair use of a copyrighted work, including such use by ... scholarship, or research,
is not an infringement of copyright", where fair use is determined by "the purpose and character of the
use, including whether such use is of a commercial nature or is for nonprofit educational purposes"
and "the effect of the use upon the potential market for or value of the copyrighted work".
Table 7: Example model-generated rubric and evaluation. Evaluation with the rubric reveals that the
model makes good partial progress but is unable to arrive at the correct final answer.
Problem Statement: Find the maximum area of all triangles that can be inscribed in an ellipse with semiaxes $a$ and $b$, and describe the triangles that have maximum area.

Ground Truth Solution (excerpt): With the parameterization $x = a \cos t$, $y = b \sin t$, a triple of points on the ellipse is given by $(a \cos t_i, b \sin t_i)$, $i = 1, 2, 3$. So the area of an inscribed triangle is given by
$$\frac{ab}{2} \begin{vmatrix} 1 & \cos t_1 & \sin t_1 \\ 1 & \cos t_2 & \sin t_2 \\ 1 & \cos t_3 & \sin t_3 \end{vmatrix} \quad (1)$$
Model Generated Rubric:
• Connect the problem to the maximum area of an inscribed triangle in the unit circle (2 points).
• Correctly identify that the maximum area occurs when the inscribed triangle is equilateral (2 points).
• Find the maximum area of an inscribed triangle in the ellipse, and provide the correct answer, $\frac{3ab\sqrt{3}}{4}$ (2 points).
Table 7 (continued)
Model Response: To maximize the area of the inscribed triangle, we can use the following approach:

1. Parameterize the ellipse: The equation of an ellipse with semiaxes $a$ and $b$ is given by $\frac{x^2}{a^2} + \frac{y^2}{b^2} = 1$. We can parameterize the ellipse using the parameter $t$ as follows: $x = a\cos(t)$ and $y = b\sin(t)$, where $0 \le t \le 2\pi$.

2. Determine the vertices of the triangle: Let $A(a\cos(t_1), b\sin(t_1))$, $B(a\cos(t_2), b\sin(t_2))$, and $C(a\cos(t_3), b\sin(t_3))$ be the vertices of the inscribed triangle, where $0 \le t_1, t_2, t_3 \le 2\pi$.

3. Calculate the area of the triangle: Using the Shoelace Theorem, the area of the triangle is given by:
$$\text{Area} = \frac{1}{2}\left|\sum_{i=1}^{3} x_i y_{i+1} - \sum_{i=1}^{3} x_{i+1} y_i\right|$$
$$\text{Area} = \frac{1}{2}\left|a^2\cos(t_1)\sin(t_2) + a^2\cos(t_2)\sin(t_3) + a^2\cos(t_3)\sin(t_1) - a^2\cos(t_2)\sin(t_1) - a^2\cos(t_3)\sin(t_2) - a^2\cos(t_1)\sin(t_3)\right|$$
Simplifying the expression, we get:
$$\text{Area} = \frac{a^2}{2}\left|\cos(t_1)\sin(t_2) - \cos(t_2)\sin(t_1) + \cos(t_2)\sin(t_3) - \cos(t_3)\sin(t_2) + \cos(t_3)\sin(t_1) - \cos(t_1)\sin(t_3)\right|$$
Using the sine difference identity, $\sin(A - B) = \sin(A)\cos(B) - \cos(A)\sin(B)$, we can rewrite the area as:
$$\text{Area} = \frac{a^2}{2}\left|\sin(t_1 - t_2) + \sin(t_2 - t_3) + \sin(t_3 - t_1)\right|$$

4. Maximize the area: Notice that the maximum value of each sine term is 1, and this occurs when the angles $t_1, t_2, t_3$ are equally spaced around the ellipse, i.e., when the triangle is equilateral. In this case, the area becomes:
$$\text{Area} = \frac{a^2}{2}|1 + 1 + 1| = \frac{3a^2}{2}$$

5. Relate the maximum area to the semiaxes: Since the triangle is equilateral, the ratio of the area of the triangle to the area of the ellipse is constant. The area of the ellipse is given by $\pi a b$. Thus, we can write:
$$\frac{\text{Area}_{\max}}{\pi a b} = \frac{3a^2/2}{\pi a b}$$
Solving for the maximum area, we get:
$$\text{Area}_{\max} = \frac{3a^2}{2} \cdot \pi a b = \frac{3\pi a^3 b}{2}$$
Thus, the maximum area of an inscribed triangle in an ellipse with semiaxes $a$ and $b$ is $\frac{3\pi a^3 b}{2}$, and the triangles with maximum area are equilateral.

ANSWER: $\frac{3\pi a^3 b}{2}$
Table 7 (continued)
Rubric-Based Evaluation:
1. The student parameterized the ellipse correctly using $x = a\cos(t)$ and $y = b\sin(t)$. (2 points)
2. The student used the Shoelace Theorem to find the area of the triangle, but did not use the determinant formula as specified in the rubric. (0 points)
3. The student did not explicitly connect the problem to the maximum area of an inscribed triangle in the unit circle. (0 points)
4. The student correctly identified that the maximum area occurs when the inscribed triangle is equilateral. (2 points)
5. The student found the maximum area of the inscribed triangle in the ellipse, but their answer is incorrect. The correct answer is $\frac{3ab\sqrt{3}}{4}$, not $\frac{3\pi a^3 b}{2}$. (0 points)
Rubric Score: 4 points
Table 8: Performance of GPT-3.5-turbo on symbolic equivalence versus human ground truth. The model achieves a false positive rate of 0 but produces many false negatives, suggesting there are many cases in which it does not recognize equivalent expressions.
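For comparison only (this is not the GPT-3.5-turbo grader evaluated in Table 8), a rule-based symbolic-equivalence check can be sketched with sympy; it exhibits the same failure mode of false negatives whenever simplification cannot reduce the difference of two equivalent answers to zero. The function name and example expressions are hypothetical.

```python
# Hypothetical rule-based baseline, not the model-based grader of Table 8:
# two answers are accepted as equivalent only if sympy can simplify their
# difference to zero, so false negatives (missed equivalences) are expected.
import sympy as sp

def symbolically_equivalent(reference: str, candidate: str) -> bool:
    try:
        diff = sp.simplify(sp.sympify(reference) - sp.sympify(candidate))
        return diff == 0
    except (sp.SympifyError, TypeError):
        return False

print(symbolically_equivalent("sin(x)**2 + cos(x)**2", "1"))             # True
print(symbolically_equivalent("(x**2 - 1)/(x - 1)", "x + 1"))            # True
print(symbolically_equivalent("log(sqrt(2) + 1)", "-log(sqrt(2) - 1)"))  # may be a false negative, depending on the sympy version
```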
Table 9: Problem, solution and GPT-4 response for a question from the MCAT part of the benchmark. The model confuses the two double helices produced by replication with the two complementary strands of a single DNA molecule.
Problem Statement: If you replaced the thymine in a double helix with radioactive thymine and you allowed DNA replication to occur once, which of the following results would you expect?
Choices:
A: The DNA would not be radioactive in either of the double helices.
B: The mRNA made from the double helices would be radioactive.
C: The DNA in each of the double helices would be radioactive.
D: The DNA in one of the double helices would be radioactive but not in the other one.
Ground Truth Solution: The correct answer is C. During semiconservative replication, the DNA helix unwinds so that both strands of DNA can serve as a template. Both strands are copied, producing two double helices. Each one consists of a template strand and a newly synthesized strand.
Model Response: First, let’s consider the process of DNA replication. DNA replication is a semi-conservative process, which means that each new double helix formed contains one original strand and one newly synthesized strand. In this case, the original strand contains radioactive thymine.
D: The DNA in one of the double helices would be radioactive but not in the other one. This choice aligns with the concept of semi-conservative replication. Each new double helix contains one original strand (with radioactive thymine) and one newly synthesized strand (without radioactive thymine).
ANSWER: D
Table 10: Problem, solution and GPT-4 response for a question from the Math Numerical part of the benchmark. The model often returns a numerical answer after doing some internal black-box numerical calculation, which is rarely correct.
Problem Statement: Compute $\displaystyle\int_0^{\pi} \frac{x\sin x}{1+\sin^2 x}\,dx$.
Ground Truth Solution: We use the example from the introduction for the particular function $f(x) = \frac{x}{1+x^2}$ to transform the integral into
$$\pi \int_0^{\pi/2} \frac{\sin x}{1+\sin^2 x}\, dx.$$
This is the same as
$$\pi \int_0^{\pi/2} -\frac{d(\cos x)}{2-\cos^2 x},$$
which with the substitution $t = \cos x$ becomes
$$\pi \int_0^{1} \frac{dt}{2-t^2} = \frac{\pi}{2\sqrt{2}} \ln\frac{\sqrt{2}+t}{\sqrt{2}-t}\bigg|_0^1 = \frac{\pi}{2\sqrt{2}} \ln\frac{\sqrt{2}+1}{\sqrt{2}-1}.$$
Model Response: Unfortunately, this integral does not have a closed-form solution in terms of elementary functions. However, it is possible to compute its numerical value using numerical integration methods, such as the trapezoidal rule, Simpson’s rule, or Gaussian quadrature.
Using numerical integration, we can approximate the value of the integral to be approximately $-1.9348$. Therefore, the final answer is: ANSWER: $-1.9348$.
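As an aside (this check is illustrative and not part of the benchmark), the ground-truth closed form can be confirmed numerically, which also shows that the model’s value $-1.9348$ has the wrong sign and magnitude:

```python
# Illustrative check, not part of ARB: the integral in Table 10 evaluates to
# pi/(2*sqrt(2)) * ln((sqrt(2)+1)/(sqrt(2)-1)) ~ 1.958, not -1.9348.
import numpy as np
from scipy.integrate import quad

numeric, _ = quad(lambda x: x * np.sin(x) / (1 + np.sin(x) ** 2), 0, np.pi)
closed_form = np.pi / (2 * np.sqrt(2)) * np.log((np.sqrt(2) + 1) / (np.sqrt(2) - 1))
print(numeric, closed_form)  # both ~1.958
```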
Table 11: Problem, solution and GPT-4 response for a question from the law part of the benchmark.
Problem Statement: Mom, a wealthy woman, wished to buy her son, Sylvester, an expensive Rolls-Royce for a wedding present. She visited Dealership, a Rolls-Royce dealer, several times, looked at many cars, and discussed possible sales prices with Huck, a salesman. On May 15, after much discussion, Huck and Mom signed a writing that stated, "If we can agree on price on or before June 1, Mom agrees to buy and Huck, for Dealership, agrees to sell one yellow Rolls-Royce Silver Streak, serial number XO857623." On May 20, Mom dispatched a letter to Huck stating, "I will buy the Rolls-Royce for $150,000." On the same day, Huck dispatched a letter to Mom, stating, "I will sell the Rolls-Royce for $150,000." Has a valid contract been formed between Dealership and Mom?
Answer Candidates:
A.) Yes, because the May 15 writing constitutes a contract with a missing price term, and that term was filled by the crossing offers.
B.) Yes, because when two crossing offers are identical in import, one will be treated as an offer and the other as an acceptance.
C.) No, because there were two crossing offers and no acceptance; hence there was no mutual assent.
D.) No, but the result would be different if Mom were a merchant.
Ground Truth Solution: C
Model Response: First, let’s analyze the May 15 writing. The writing states that if the price can be agreed
upon by June 1, Mom and Huck will enter into a contract for the purchase of the specific
Rolls-Royce. At this point, there is no agreement for the price, so there is no valid contract
formed yet.
Next, let’s examine the crossing offers on May 20. Mom sent a letter stating she would
buy the car for $150,000, and Huck sent a letter stating he would sell the car for $150,000.
Both letters were dispatched on the same day, and they agreed on the same price. Since
the price was one of the main terms that needed to be agreed upon, and both parties agreed
on the same price, this could be seen as one party making an offer and the other accepting.
Thus, a valid contract could be formed.
Now, let’s look at the answer choices:
A: This answer choice argues that the May 15 writing is a contract with a missing price term. However, the May 15 writing itself stated that an agreement would only be formed
if the price was agreed upon. Therefore, this answer choice is incorrect.
B: This answer choice suggests that when two crossing offers are identical, one will be
treated as an offer and the other as an acceptance. This accurately reflects the situation
between Mom and Huck, as they both agreed on the same price.
C: This answer choice argues that there was no mutual assent because there were two
crossing offers and no acceptance. However, since both offers agreed on the same price,
one could be seen as an offer and the other as an acceptance, forming a valid contract.
D: This answer choice states that the result would be different if Mom were a merchant.
However, the fact that Mom is not a merchant does not affect the formation of a valid
contract in this situation.
Given the analysis of the answer choices, the correct answer is:
ANSWER: B
Table 12: Example of memorization in ARB. GPT-3.5-turbo’s response indicates memorization of
the correct expression. GPT-4 commits to its reasoning better, resulting in the wrong final answer.
Reference Solution: Supposing that the charge per unit length of the inner wire is $-\lambda$ and using cylindrical coordinates $(r, \varphi, z)$, we find the electric field intensity in the capacitor by Gauss’ theorem to be
$$\mathbf{E} = -\frac{\lambda}{2\pi\varepsilon r}\,\mathbf{e}_r = \frac{-Q}{2\pi\varepsilon L r}\,\mathbf{e}_r.$$
The potential difference between the inner and outer conductors is
$$V = -\int_a^b \mathbf{E}\cdot d\mathbf{r} = \frac{\lambda}{2\pi\varepsilon}\ln\frac{b}{a}.$$
Hence the capacitance is
$$C = \frac{\lambda L}{V} = \frac{2\pi\varepsilon L}{\ln\frac{b}{a}}.$$
Table 12 (continued)
gpt-4 Solution: To find the capacitance of the cylindrical capacitor, we will use the general formula for capacitance: $C = \frac{Q}{V}$, where $Q$ is the charge stored in the capacitor and $V$ is the potential difference between the inner conductor wire and the outer conducting shell.
First, let’s find the electric field $E$ between the two conductors. We can use Gauss’s law, which states that the electric flux through a closed surface is equal to the charge enclosed by the surface divided by the permittivity of free space $\varepsilon_0$:
$$\oint \vec{E}\cdot d\vec{A} = \frac{Q_{\text{enclosed}}}{\varepsilon_0}.$$
We will choose a cylindrical Gaussian surface of radius $r$ and length $L$, where $a < r < b$. The electric field is radial, so the electric flux through the curved surface of the cylinder is:
$$\oint \vec{E}\cdot d\vec{A} = \oint E\, dA = E(2\pi r L).$$
The charge enclosed by the Gaussian surface is the charge on the inner conductor wire, which is given by its linear charge density $\lambda$ multiplied by the length of the Gaussian surface, $L$: $Q_{\text{enclosed}} = \lambda L$.
Now we can use Gauss’s law to find the electric field:
$$E(2\pi r L) = \frac{\lambda L}{\varepsilon_0} \;\Rightarrow\; E = \frac{\lambda}{2\pi r \varepsilon_0}.$$
However, we need to account for the dielectric material in between the conductors. The electric field in a dielectric material is related to the electric field in vacuum by the dielectric constant $\varepsilon$:
$$E_{\text{dielectric}} = \frac{E}{\varepsilon}.$$
So, the electric field in the dielectric material is:
$$E_{\text{dielectric}} = \frac{\lambda}{2\pi r \varepsilon_0 \varepsilon}.$$
Now we can find the potential difference between the inner conductor wire and the outer conducting shell by integrating the electric field along the radial direction:
$$V = -\int_a^b E_{\text{dielectric}}\, dr = -\int_a^b \frac{\lambda}{2\pi r \varepsilon_0 \varepsilon}\, dr.$$
Integrating, we get:
$$V = -\frac{\lambda}{2\pi \varepsilon_0 \varepsilon}\ln\Big(\frac{b}{a}\Big).$$
Finally, we can find the capacitance using the formula $C = \frac{Q}{V}$:
$$C = \frac{\lambda L}{-\frac{\lambda}{2\pi\varepsilon_0\varepsilon}\ln\big(\frac{b}{a}\big)}.$$
Simplifying, we get:
ANSWER: $C = \dfrac{2\pi\varepsilon_0\varepsilon L}{\ln\big(\frac{b}{a}\big)}$.
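As a purely illustrative evaluation (the values below are assumptions, not given in the problem), plugging example dimensions into the reference result $C = 2\pi\varepsilon L / \ln(b/a)$, with the permittivity written out as $\varepsilon_0\varepsilon_r$ as in the GPT-4 response, gives a capacitance on the order of picofarads:

```python
# Illustrative evaluation with assumed values, not from the problem statement:
# capacitance of a cylindrical capacitor, C = 2*pi*eps0*eps_r*L / ln(b/a).
import math

eps0 = 8.854e-12           # vacuum permittivity, F/m
eps_r = 2.3                # assumed relative permittivity of the dielectric
a, b, L = 1e-3, 5e-3, 0.1  # assumed inner radius, outer radius, length (m)

C = 2 * math.pi * eps0 * eps_r * L / math.log(b / a)
print(C)  # ~7.95e-12 F, i.e. about 8 pF
```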
Table 13: Prompt used for multiple-choice MCAT and Law problems.
System: You are a top graduate student taking an open-ended qualifying exam. Your final answer should always be in the last line of your response, preceded by ANSWER:.
User: You are a top graduate student taking a qualifying exam. Below you will find a multiple choice question.
Question: {Problem_Statement}
Solution:
Table 14: Prompt used for numerical problems.
System: You are a top graduate student taking an open-ended qualifying exam. Your final answer should always be in the last line of your response, preceded by ANSWER:.
User: You are a top graduate student taking an open-ended qualifying exam. Below you will find a question requiring you to compute a numerical value.
Question: {Problem_Statement}
Now it is time to give your answer. Think carefully and go step by step. Make sure to justify all your work. Please simplify all expressions as much as possible and do not leave any variables in your final answer.
Your final answer should NOT contain units and should be given at the end of your work and preceded by ANSWER:
For example, if you think the answer is 2.4 meters, the last line of your answer should be ANSWER: 2.4.
Solution:
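For concreteness, the sketch below shows one way the System/User rows of Tables 13–14 might be assembled into chat messages; it is an assumed illustration, not code from the ARB release, and build_messages and the truncated example question are hypothetical.

```python
# Hypothetical assembly of the Table 13 prompt into OpenAI-style chat messages;
# the roles mirror the System/User rows of the table. Not taken from the ARB code.
SYSTEM_PROMPT = (
    "You are a top graduate student taking an open-ended qualifying exam. "
    "Your final answer should always be in the last line of your response, "
    "preceded by ANSWER:."
)

MC_USER_TEMPLATE = (
    "You are a top graduate student taking a qualifying exam. "
    "Below you will find a multiple choice question.\n\n"
    "Question: {problem_statement}\n\nSolution:"
)

def build_messages(problem_statement: str) -> list[dict]:
    # Fill the {Problem_Statement} slot and pair it with the shared system prompt.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": MC_USER_TEMPLATE.format(problem_statement=problem_statement)},
    ]

messages = build_messages("If you replaced the thymine in a double helix ...")
```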
Table 15: Prompt used for symbolic problems.
System: You are a top graduate student taking an open-ended qualifying exam. Your final answer should always be in the last line of your response, preceded by ANSWER:.
User: You are a top graduate student taking an open-ended qualifying exam. Below you will find a question requiring you to give a symbolic answer.
Question: {Problem_Statement}
Now it is time to give your answer. Think carefully and go step by step. Make sure to justify all your work.
Your final answer should NOT contain units and should be given at the end of your work and preceded by ANSWER:
For example, if you think the answer is x ∗ y, the last line of your answer should be ANSWER: x ∗ y
Solution:
Table 16: Prompt used for proof-based problems.
System: You are a top graduate student taking an open-ended qualifying exam. Your final answer should always be in the last line of your response, preceded by ANSWER:.
User: You are a top graduate student taking an open-ended qualifying exam. Below you will find a question requiring you to prove the given statement.
Question: {Problem_Statement}
Now it is time to give your answer. Think carefully and go step by step. Make sure to justify all your work.
Solution:
Table 17: Prompt used for GPT-3.5-turbo symbolic evaluation.
User: Now it is time to grade the model answer. If the solution is incorrect give GRADE: 0. If the solution is nearly correct up to a constant give GRADE: 0.5. If the solution is correct give GRADE: 1. Before coming to a final grade, think carefully and go step by step. DO NOT TRY TO SOLVE THE PROBLEM. If a variable name is not specified and the reference answer and the model answer are the same up to the name of a variable, give a score of GRADE: 1. For example, if the reference answer is
$f(x) = x^2$
and the model answer is
$f(y) = y^2$
give a score of GRADE: 1.
Table 18: Prompt used for rubric-based evaluation.
User: Now it is time to grade the student answer. Make sure to check each point of the rubric step by step. And make sure to print the total number of earned points at the end of your grading. For example, if the student earned 8 points, print Rubric Score: 8 points
Rubric Evaluation:
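Both grading prompts ask the model to finish with a machine-readable line ("GRADE: x" in Table 17, "Rubric Score: N points" above), so a small post-processing step can recover the scores; the helpers below are an assumed sketch, not the paper’s evaluation code.

```python
# Hypothetical post-processing helpers, not from the ARB codebase: extract the
# final scores requested by the grading prompts in Tables 17-18.
import re

def parse_symbolic_grade(text: str) -> float | None:
    # Matches "GRADE: 0", "GRADE: 0.5", or "GRADE: 1" in the grader output.
    m = re.search(r"GRADE:\s*(0\.5|0|1)", text)
    return float(m.group(1)) if m else None

def parse_rubric_score(text: str) -> int | None:
    # Matches "Rubric Score: 8 points" and similar.
    m = re.search(r"Rubric Score:\s*(\d+)\s*point", text)
    return int(m.group(1)) if m else None

print(parse_symbolic_grade("... the answers match up to a constant. GRADE: 0.5"))  # 0.5
print(parse_rubric_score("... Rubric Score: 4 points"))                            # 4
```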
Table 19: Prompt used for GPT-4 rubric design.
Rubric: