Green Wizard
This guide covers the elements required to produce a high-quality task.

Welcome! What is this project?
The Reasoning Annotation Project uses prompts and responses to improve the AI model's logical reasoning capabilities. The model relies on human expertise to ensure each step in its reasoning is accurate and logical.

How does it work?
You will write a prompt that can stump the model (meaning the model makes an error). The error must be a reasoning error (as opposed to a simple calculation error). More on this later, but we are looking to help models reason through math problems better, as opposed to checking whether they can divide large numbers. The model will output its answer broken into step-by-step chunks (like a train of thought). You will check each step to determine which steps are correct and which are incorrect. You will correct the model's reasoning by rewriting any incorrect steps, which involves writing a brief justification explaining the error. Once you correct a step, you will regenerate the response. This restarts the model's train of thought beginning at the step you just corrected. If any new steps are incorrect, you will fix them and repeat the process. If the new steps are correct, you are done! The task instructions are attached here for your reference.
Here are some Key Definitions you should understand:

The User Prompt: the question you will ask the model.
The Model Response: the model's attempt to solve your question. In this task, responses will be chunked into a step-by-step structure.
Response Labels: what you'll use to label whether a step is 'Correct' or 'Incorrect'.
Step 1: Write an Elegant Problem
Make sure the problem is:

Appropriate: The idea behind this project is to stump the model. An appropriate task is complex enough to stump the model and aligns with the assigned category.
Clean: It does not contain any spelling, grammar, or formatting issues.
Solvable: Avoid problems that are open-ended (i.e., have many potential answers), as well as problems that don't contain the information necessary to solve them.

Note: All mathematical expressions, equations, and formulas should be written in LaTeX.

Step 2: Solve Your Own Problem
Calculate the final answer to the prompt you generated.
When asked to rate whether each step is done correctly or incorrectly, you will see 4 possible step labels. The easiest way to think of each step label is that it answers two different questions:

1. Is the model's step correct or incorrect?
2. Regardless of the correctness of the model's output, should this step ideally be solved using the model's LLM or Python capabilities?

The 4 labels are the combinations of those two answers: Correct (LLM), Correct (Python), Incorrect (LLM), and Incorrect (Python).

An astute reader might wonder how to evaluate the second question. This is what we will explain in the next section!
Python vs LLM - What does this even mean?
Regardless of the correctness of the model's output, should this step ideally be solved using the model's LLM or Python capabilities?

This question stems from a development in large language models called “implicit code execution,” or ICE for short. ICE is a technique in which the model runs code in the background (usually Python) to solve some mathematical tasks. The idea is that the effort the model exerts in writing this code and running it under the hood will lead to more accurate answers.

By evaluating this question and choosing the step label, you are helping inform when the model should rely on its own text generation (LLM) and when it should use code (Python/ICE).
When the model solves a step using its LLM capabilities, it's faster and lower-effort, since it can produce text quickly. This is helpful for things like reasoning, logic, proofs, and some calculations. Using code is helpful for more complex computations, calculations, and approximations. It's used when you need a formulaic, inflexible, numerical approach that an LLM would get wrong.

For example, a step like “Divide $9371/391$” should be solved by Python, since this is a large computation that the LLM will likely get wrong. However, a step like “An identity matrix satisfies $I = I^2 = II$” does not need Python or computation skills.
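To make this concrete, here is a minimal sketch (purely illustrative, not the model's actual internals) of the kind of Python an ICE step might run in the background for the division example above:

from fractions import Fraction

# Compute 9371/391 exactly instead of having the LLM guess at the digits.
quotient = Fraction(9371, 391)
print(quotient)         # 9371/391 (already in lowest terms)
print(float(quotient))  # ≈ 23.967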
Python vs LLM - When should the model use each one?

Selecting LLM is ideal when the step involves something like:

1. Deducing or applying proofs, theorems, propositions, and corollaries, where the model relies on step-by-step reasoning, intuition, and context
2. Abstract math, such as measure theory (e.g., proving a certain Lebesgue set is measurable) or number theory (e.g., determining whether an extension of a field has finitely many intermediate fields)

Selecting Python is ideal when the step involves something like:

1. Applying number tests that don't rely on large computation, e.g., calculating a limit or applying a ratio test or root test
2. Trigonometry and the evaluation of calculus concepts (e.g., finding the roots of $x^2 - 7x + 4$, or evaluating $\cos^2(x) + \sin(x^2)$ at $x = 0$; see the sketch after this list)
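As a concrete illustration of a Python-suited step, here is a short sketch (our own example, not part of the task interface) that finds the roots of $x^2 - 7x + 4$ with the quadratic formula:

import math

# Roots of x^2 - 7x + 4 = 0 via the quadratic formula.
a, b, c = 1, -7, 4
disc = b**2 - 4*a*c                     # discriminant: 49 - 16 = 33
root1 = (-b + math.sqrt(disc)) / (2*a)  # (7 + sqrt(33))/2 ≈ 6.372
root2 = (-b - math.sqrt(disc)) / (2*a)  # (7 - sqrt(33))/2 ≈ 0.628
print(root1, root2)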
There are 3 things you should not treat as mistakes when reviewing the model's steps:

Bad LaTeX: While we require good LaTeX in rewrites, it's acceptable if the model makes mistakes in LaTeX formatting.
Suboptimal (But Correct) Solution: If the model solves the problem in a valid but less efficient way, do not label it as incorrect.
Suboptimal (But Correct) Expression: If the model's answer could be expressed more precisely (for example, as a fraction rather than a rounded decimal), do not label it as incorrect.
When rewriting an incorrect step:

1. Preserve the order of reasoning. Do not combine multiple logical steps into one, even if they seem closely related.
2. Ensure that the step logically follows from the previous step and leads naturally into the next one.
3. Articulate the correct reasoning as clearly and concisely as possible. Use simple language and ensure that the step is self-contained and understandable on its own, while still being part of the larger reasoning sequence. Write in clear, plain language without jargon, at a level a high school student could understand.
4. Only describe the step needed to solve the problem; do not include extraneous information such as definitions of basic concepts.
Once you're done, hit Submit and the model will run again based on the corrected step you've written.
In mathematical problem-solving, mistakes can happen due to two main types of errors: reasoning errors and calculation errors. It's important to recognize why an error occurred and provide a logical explanation.

For this project, you must identify at least 1 reasoning error per response. If the only error is a calculation error, you must modify or rewrite your prompt.
Problem:
A rectangle has a length of $8$ meters and a width of $5$ meters. If the length is doubled and the width is increased by $3$ meters, what is the new area of the rectangle?

Reasoning error: $16 + 8 = 24$ square meters.
(Error: The user incorrectly adds the new dimensions instead of multiplying them to find the area.)

Calculation error: the user correctly sets up $16 \times 8$ but computes the product incorrectly.
(Error: The user makes a simple arithmetic mistake when multiplying 16 and 8; the correct new area is $16 \times 8 = 128$ square meters.)
Problem:
A car rental company charges a flat fee of $50 per day and an additional $0.20 per mile driven. How much will it cost to rent the car for 3 days and drive 150 miles?

Reasoning error: The user mistakenly multiplies the cost per mile by the number of days instead of adding it once:
$50 \times 3 + 0.20 \times 3 \times 150 = 150 + 90 = 240$.
(Error: The user incorrectly applies the mileage cost per day, rather than just per mile.)

Correct solution:
$50 \times 3 + 0.20 \times 150 = 150 + 30 = 180$.
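If you want to double-check arithmetic like this, a tiny Python sketch (purely illustrative; the verification tools recommended below work just as well) could look like:

# Verify the rental-cost arithmetic from the example above.
daily_fee = 50      # dollars per day
per_mile = 0.20     # dollars per mile
days, miles = 3, 150

total = daily_fee * days + per_mile * miles
print(total)  # 180.0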
Remember that your prompt must produce at least 1 reasoning error. If your prompt only produces a calculation error, you are not done with the task: you must rework your prompt until it produces a reasoning error.
Writing with LaTeX
All prompts and rewrites should be written in proper single-dollar ($...$) LaTeX. If you are unfamiliar with LaTeX, only ask for help writing the LaTeX, as opposed to asking for help solving the problem. Remember that the impetus for this project is that LLMs are often very wrong at math. We don't want an incorrect solution from an LLM biasing your own solution!
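As a minimal illustration of single-dollar formatting (reusing the rectangle example from earlier), a prompt might be written like this:

A rectangle has a length of $8$ meters and a width of $5$ meters.
Doubling the length gives $2 \times 8 = 16$ meters, and increasing the
width by $3$ meters gives $5 + 3 = 8$ meters, so the new area is
$16 \times 8 = 128$ square meters.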
To help catch mistakes and ensure that your responses are accurate in mathematical calculations, we highly recommend using a dedicated math-solving or calculation-verification tool, such as:
WolframAlpha
Desmos
Symbolab
GeoGebra
To help catch mistakes and make sure that your responses don’t get dinged for minor errors, we highly recommend that you install and use a grammar checker, such as:

Quillbot: Google Chrome, Microsoft Edge
Grammarly: Google Chrome
LanguageTool: Safari, Firefox, Google Chrome