Green Wizard

Uploaded by keizhaandreista

This course has been created to ensure you understand the project, task components, and elements required to produce a high-quality task. Welcome!

What is this project? The Reasoning Annotation Project uses prompts and responses to improve the AI model's logical reasoning capabilities. The AI model relies on human expertise to ensure each step in its reasoning is accurate and logical.

How does it work? You will write a prompt that can stump the model (meaning the model makes an error). The error must be a reasoning error (as opposed to a simple calculation error). More on this later, but we are looking to help models reason through math problems better, not to check whether they can divide large numbers. The model will output its answer broken into step-by-step chunks (like a train of thought). You will check each step to determine which steps are correct and which are incorrect. You will correct the model's reasoning by rewriting any incorrect steps. This involves writing a brief justification explaining the error. Once you correct a step, you will regenerate the response. This restarts the model's train of thought beginning at the step you just corrected. If the new steps are incorrect, you will fix them and repeat the process. If the new steps are correct, you are done! The task instructions are attached here for your reference.
Here are some Key Definitions you should understand:

The User Prompt: this is the question you will ask the model.
The Model Response: this is the model's attempt to solve your question. In this task, responses will be chunked into a Step-by-Step structure.
Response Labels: this is what you'll use to label whether a step is 'Correct' or 'Incorrect'.
Step 1: Write an Elegant Problem

Make sure the problem is:

Appropriate: The idea behind this project is to stump the model. An appropriate task is complex enough to stump the model and aligns with the assigned category.
Clean: It does not contain any spelling, grammar, or formatting issues.
Solvable: Avoid problems that are open-ended (i.e., have many potential answers) as well as problems that don't contain the information necessary to solve them.

Note: All mathematical expressions, equations, and formulas should be written in LaTeX.

Step 2: Solve Your Own Problem

Calculate the final answer to the prompt you generated.
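A quick script can help verify the ground-truth answer you calculate in Step 2, rather than trusting mental arithmetic. As a minimal sketch, assuming a hypothetical prompt of "Solve for $x$ in $3x + 7 = 3$":

```python
from fractions import Fraction

# Hypothetical prompt: "Solve for x in 3x + 7 = 3."
# Fraction keeps the arithmetic exact, avoiding floating-point rounding.
x = Fraction(3 - 7, 3)   # isolate x: x = (3 - 7) / 3
assert 3 * x + 7 == 3    # substitute back into the original equation
print(x)                 # -4/3
```

Substituting the candidate answer back into the original equation is a cheap way to catch your own slips before comparing against the model.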

Step 3: Generate AI Model Response


Your job in Step 3 is to evaluate whether the model provides an incorrect answer (this is what 'stumping' the model means). If not, you will return to Step 1 and improve your original prompt.
Step 4: Label the Steps

Once you have created a prompt that the model has answered incorrectly, you can move on to labeling each step. Review and label EACH step of the bot's response as Correct or Incorrect.

Step 5: Correct and Explain


Next you will correct each step. When you identify an incorrect step, do the following: Label the step as Incorrect. Explain why you rated the step as incorrect. Make sure the explanation has no grammar issues.
Step 6: Rewrite
Rewrite the step and header so that it is correct and aligns with the previous step. After rewriting
the step, you will be asked "Is this Step the final Step, and contains the solution to the prompt?"
Choose “Yes” if the rewritten step is the final answer of the response. The model will stop
generating subsequent steps, which means that the response is complete. Choose “No” if the
rewritten step is not the final answer of the response. This will cause the model to regenerate the
response with the rewritten step.
Based on the rewrite, the model refreshes the following steps. Continue reviewing the next step(s) and selecting Correct or Incorrect until the model arrives at the correct answer (Ground Truth Final Answer).
Understanding Step Labels

When asked to rate if each step is done correctly or incorrectly, you will see 4 possible step
labels which are displayed in this example below.

The easiest way to think of each step label is that the labels answer two different questions!

1. Is the math displayed in the step performed correctly or incorrectly?

2. Regardless of the correctness of the model's output, should this step ideally be solved
using a model's LLM or Python capabilities?

So, to summarize the possible labels you see here:


Evaluating the first question, "Is the math displayed in the step performed correctly or
incorrectly?", is straightforward. It's all about checking the step's math and confirming the model's
output and intermediary steps are correct.

An astute reader might wonder how to evaluate the second question. This is what we will explain
in the next section!
Python vs LLM - What does this even mean?

Regardless of the correctness of the model's output, should this step ideally be solved using a
model's LLM or Python capabilities?

This question stems from a development in large language models called “implicit code
execution” or ICE for short. ICE is a technique in which the model will run code in the
background (usually Python) to solve some mathematical tasks. The idea is that the effort
exerted by the model in writing this code and running it under the hood to solve your question
will lead to more accurate answers.

What you're doing by evaluating this question and choosing the step label is helping inform
when the model should answer using its own LLM capabilities and when it should use Python (ICE).

When the model solves a step using its LLM capabilities, it's faster and takes less effort, since it can produce text quickly. This is helpful for things like reasoning, logic, proofs, and some calculations. Using code is helpful for more complex computations, calculations, and approximations. It's used when you need a formulaic, inflexible, and numerical approach that an LLM would get wrong.

For example, a step like "Divide 9371/391" should be solved by Python, since this is a large computation that the LLM will likely get wrong. However, a step like "An identity matrix satisfies $I = I^2 = II$" does not need Python or computational skills.
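To make the "Divide 9371/391" case concrete, here is a rough, purely illustrative sketch of what ICE-style code execution amounts to (the real mechanism is internal to the model):

```python
# Illustrative only: with ICE, the model runs code like this under the hood.
# A division with an awkward quotient is trivial for Python but easy for a
# pure text model to get wrong.
quotient = 9371 / 391
print(round(quotient, 4))   # 23.9668
```

The identity-matrix fact, by contrast, follows directly from the definition and needs no computation at all.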
Python vs LLM - When should the model use each one?

Thus, to conclude, you should:

Select "LLM is ideal to solve the problem" when the step is similar to:

1. Word problems that require an understanding of natural language

2. Pattern recognition, like simple sequences

3. Deducing or applying proofs, theorems, propositions, and corollaries where the model
relies on step-by-step reasoning, intuition, and context

4. Abstract math, such as measure theory (e.g. prove a certain Lebesgue set is measurable)
or number theory (e.g., determining if an extension of a field has finitely many
intermediate fields)

5. Theoretical probability (e.g. what is P(A)P(B) for independent A and B?)

Select "Python is ideal to solve the problem" when the step is similar to:

1. Simple variable-based calculus (e.g. differentiate x^2 + 2x + 7)

2. Applying number tests that don’t rely on large computation, e.g. calculating a limit or
applying a ratio test or root test

3. Manipulating or isolating quantifiers or variables, where a solution can be calculated


(e.g. isolate y in 2x + 3 = 7y; or substitute y = 3z+3 when x = 10+7y)

4. Basic arithmetic with small numbers (under 100, e.g. 25 + 37)

5. Algebra or solving for variables (e.g. solve for x in 3x+7=3)


6. Precise numerical computation or complex arithmetic, where there are large numbers or multi-step calculations that might introduce errors when using an LLM (e.g., what is 938193 x 93189318?)

7. Trigonometry and evaluation of calculus concepts (e.g. find the quadratic roots of x^2 - 7x + 4; or solve cos(x)^2 + sin(x^2) = 0 at x = 0)

8. Matrix operations or calculations involving vectors (e.g. find a vector orthogonal to v1 = [3,0,2]; or multiply M_1 and M_2 where each is a matrix; or find the determinant of M_3)

9. Applied probability and statistics, such as calculating standard deviations or regressions (e.g. what is the standard deviation of this dataset; or a binomial distribution, such as what is the probability of getting exactly 8 heads in 15 coin flips, assuming the coin is fair?)

10. Distributions or calculations that require calculating areas under a curve, evaluating an integral, or a volume (e.g. what is the probability that a normally distributed variable with mean 100 and standard deviation 15 falls between 90 and 120?; or, "what is the volume of this solid of revolution (3x+1)^0.5 at these boundaries …")

11. Numerical approximations or evaluating series (e.g. a Monte Carlo approximation, a binomial distribution, applying Simpson's rule or a Stirling approximation)
Note: you may be tempted to stump the model by asking contrived questions with multiple
parts. Please avoid this.
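As an illustration, the binomial question from item 9 (exactly 8 heads in 15 fair flips) is the kind of step Python handles reliably. A minimal sketch:

```python
from math import comb

# P(exactly k heads in n fair flips) = C(n, k) * 0.5**n
n, k = 15, 8
p = comb(n, k) * 0.5 ** n
print(f"{p:.4f}")   # 0.1964
```

A text-only model can recall the binomial formula, but evaluating C(15, 8) / 2^15 accurately is exactly the kind of arithmetic that benefits from code.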

There are 3 things you must do once you've identified that the model has made a mistake:

 Mark the step as "Incorrect"
 Explain what the mistake is and why it is a mistake
 Rewrite the model's reasoning
Guidelines for Labeling a Step as Incorrect

Main Criterion for Incorrect Labeling:

 There is a mathematical error. Simple as that.

Cases Where You Should NOT Label a Step as Incorrect:

 Bad LaTeX: While we require good LaTeX in rewrites, it's acceptable if the model
makes mistakes in LaTeX formatting.

 Suboptimal (But Correct) Solution: If the model solves the problem in a valid but less
efficient way, do not label it as incorrect.

 Suboptimal (But Correct) Expression: If the model's answer could be expressed more precisely (e.g., as a fraction), do not label it as incorrect.

 Accurate Preamble or Summary: If the model provides a factually accurate preamble at the beginning of a task or a summary at the end, this should not be labeled as incorrect.

Cases Where It Is Acceptable to Label a Response as Incorrect:

 The model cuts off before finishing its sentence.


Tips and Guidelines

1. Maintain Sequential Integrity
When rewriting the step, make sure to:
 Preserve the order of reasoning. Do not combine multiple logical steps into one, even if they seem closely related.
 Maintain the same level of detail as other steps to ensure consistency.
 Ensure that the step logically follows from the previous step and leads naturally into the next one.

2. Rewrite the Step Clearly
 Articulate the correct reasoning as clearly and concisely as possible. Use simple language and ensure that the step is self-contained and understandable on its own, while still being part of the larger reasoning sequence. Write in clear, plain language without jargon, at a level that a high school student could understand.
 Only describe the step that solves the problem; do not include extraneous information such as definitions of basic concepts.

Once you're done, hit submit and the model will run again based on the corrected step you've written for it.
In mathematical problem-solving, mistakes can happen due to two main types of errors:

1. Calculation Errors (simple mistakes in arithmetic or math operations).

2. Reasoning Errors (errors in understanding the logic or process behind a solution).

It's important to recognize why an error occurred and provide a logical explanation.

For this project, you must identify at least 1 Reasoning Error per response. If the
only error is a calculation error, you must modify or rewrite your prompt.

Example 1 Reasoning vs Calculation Errors

Problem:

A rectangle has a length of 8 meters and a width of 5 meters. If the length is doubled and
the width is increased by 3 meters, what is the new area of the rectangle?

Answer 1 (Reasoning Error):

Length is doubled: 8 × 2 = 16 meters.

Width is increased by 3: 5 + 3 = 8 meters.

The new area is the product of these two:

16 + 8 = 24 square meters.

(Error: The user incorrectly adds the dimensions instead of multiplying them to find the
area.)

Answer 2 (Calculation Error):

Length is doubled: 8 × 2 = 16 meters.


Width is increased by 3: 5 + 3 = 8 meters.

The new area is the product of these two:

16 × 8 = 138 square meters.

(Error: The user makes a simple arithmetic mistake, incorrectly multiplying 16 and 8.)
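A one-line check in Python (the sort of self-verification this project recommends) confirms the correct new area:

```python
# Rectangle: length 8 m is doubled, width 5 m is increased by 3 m.
new_length = 8 * 2                  # 16 meters
new_width = 5 + 3                   # 8 meters
new_area = new_length * new_width   # area = length * width, not a sum
print(new_area)                     # 128 square meters (not 24 or 138)
```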

Example 2 Reasoning vs Calculation Errors

Problem:

A car rental company charges a flat fee of $50 per day and an additional $0.20 per mile
driven. How much will it cost to rent the car for 3 days and drive 150 miles?

Answer 1 (Reasoning Error):

The user mistakenly multiplies the cost per mile by the number of days instead of adding
it:

Cost = $50 \times 3 + 0.20 \times 3 \times 150 = 150 + 90 = 240$.

(Error: The user incorrectly applies the mileage cost per day, rather than just per mile.)

Answer 2 (Calculation Error):

The user correctly understands the cost structure:

Cost = $50 \times 3 + 0.20 \times 150 = 150 + 30 = 180$.

However, they make a mistake in adding:


Final cost = 190.

(Error: The logic is correct, but there’s a simple arithmetic error.)
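Again, a quick Python check pins down the ground-truth cost and makes both errors easy to spot:

```python
# Car rental: $50 per day flat fee plus $0.20 per mile driven.
days, miles = 3, 150
cost = 50 * days + 0.20 * miles   # mileage charge is per mile, not per day
print(cost)                       # 180.0 (not 240, and not 190)
```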

Remember that your prompt must produce at least 1 reasoning error. If you
only produce a calculation error, you are not done with the task. You must
rework your prompt to produce a reasoning error.
Writing with LaTeX

All prompts and rewrites should be written in proper Single $ LaTeX. If you are unfamiliar with
LaTeX:

 Refer to the style guides that are linked in the task.

 Feel free to use an LLM to help you. Make sure to:

 Ask it to write the expression in Single $ LaTeX

 Only ask for help writing the LaTeX, as opposed to asking for help solving the problem.
Remember that the impetus for this project is that LLMs are often very wrong at math.
We don't want an incorrect solution from an LLM biasing your own solution!
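As an illustration of Single $ formatting (the problem wording here is just an example), a prompt might read:

```latex
A rectangle has a length of $8$ meters and a width of $5$ meters.
If the length is doubled and the width is increased by $3$ meters,
the new area is $A = (2 \times 8)(5 + 3) = 128$ square meters.
```

Every mathematical symbol, number used as math, and expression sits inside its own pair of single dollar signs, while the surrounding prose stays as plain text.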

To help catch mistakes and ensure that your responses are accurate in mathematical calculations,
we highly recommend using a dedicated math-solving or calculation verification tool.

WolframAlpha

Great for general math problems, calculus, and symbolic computation.

Desmos

Perfect for graphing and exploring functions interactively.

Symbolab

Useful for step-by-step solutions in calculus, algebra, and more.

GeoGebra

A powerful tool for dynamic geometry, algebra, calculus, and statistics.


By using one of these tools, you can reduce errors and enhance the precision of mathematical
answers.

To help catch mistakes and make sure that your responses don’t get dinged for minor errors, we
highly recommend that you install and use a Grammar Checker.

You only need to install one of the following extensions:

Quillbot:

 Google Chrome

 Microsoft Edge

Grammarly:

 Google Chrome

LanguageTool:

 Safari

 Firefox

 Google Chrome
