
GreaterPrompt: A Unified, Customizable, and High-Performing Open-Source Toolkit for Prompt Optimization

Wenliang Zheng, Sarkar Snigdha Sarathi Das, Yusen Zhang, Rui Zhang
Penn State University
{wmz5132,sfd5525,yfz5488,rmz5227}@psu.edu

arXiv:2504.03975v1 [cs.LG] 4 Apr 2025

Abstract

LLMs have gained immense popularity among researchers and the general public for their impressive capabilities on a variety of tasks. Notably, the efficacy of LLMs remains significantly dependent on the quality and structure of the input prompts, making prompt design a critical factor for their performance. Recent advancements in automated prompt optimization have introduced diverse techniques that automatically enhance prompts to better align model outputs with user expectations. However, these methods often suffer from a lack of standardization and compatibility across techniques, limited flexibility in customization, and inconsistent performance across model scales, and they often rely exclusively on expensive proprietary LLM APIs. To fill this gap, we introduce GreaterPrompt, a novel framework that democratizes prompt optimization by unifying diverse methods under a single, customizable API while delivering highly effective prompts for different tasks. Our framework flexibly accommodates various model scales by leveraging both text feedback-based optimization for larger LLMs and internal gradient-based optimization for smaller models to achieve powerful and precise prompt improvements. Moreover, we provide a user-friendly Web UI that ensures accessibility for non-expert users, enabling broader adoption and enhanced performance across various user groups and application scenarios. GreaterPrompt is available via GitHub, PyPI, and web user interfaces at https://github.com/psunlpgroup/GreaterPrompt.

1 Introduction

LLMs have demonstrated impressive capabilities across a wide variety of natural language tasks, establishing prompting as the primary means of communication between humans and machines (Gu et al., 2023). However, despite these remarkable advances, LLMs remain highly sensitive to prompt design and formulation: subtle variations in the wording of a prompt can dramatically alter model outputs and affect performance (Chatterjee et al., 2024; Zhuo et al., 2024). This persistent prompt sensitivity implies that even recent state-of-the-art LLMs do not eliminate the need for careful prompt design, which traditionally relies on human expertise and iterative trial and error (Chen et al., 2024; Wu et al., 2024). In response, recent efforts have increasingly focused on developing automated prompt optimization methods (Pryzant et al., 2023; Ye et al., 2023; Zhou et al., 2023; Yuksekgonul et al., 2024; Das et al., 2024) that systematically enhance prompt quality and ensure robust model performance without exhaustive manual tuning (Wu et al., 2024; Tang et al., 2025).

However, as shown in Table 1, existing prompt optimization techniques and tools vary considerably in usability and scope, and their performance often fluctuates across model scales. This variability and specialized nature make them challenging for non-expert users, who could otherwise derive substantial benefits from prompt optimization while using LLMs. Moreover, many existing prompt optimization methods rely on expensive proprietary LLM APIs, significantly undermining their affordability and privacy protection.

To bridge these gaps and encourage broad adoption of prompt optimization techniques, we introduce GreaterPrompt, a novel framework designed to enhance the accessibility, adaptability, and efficacy of prompt optimization. As shown in Figure 1, GreaterPrompt provides a streamlined workflow from inputs and model initialization to optimization execution, supporting flexible optimizer configurations that can be easily customized. GreaterPrompt is designed and implemented based on the following three principles:
Method | Text-based Optimization | Gradient-based Optimization | Zero-Shot Prompt | Custom Metric Support | Integration | Web UI Support | Local Model Support | Smaller Model Compatibility | Larger Model Compatibility
LangChain Promptim (LangChain, 2025) | ✓ | × | ✓ | ✓ | Library (Python API) | × | ✓ | Low | High
Stanford DSPy (Khattab et al., 2024) | ✓ | × | Few-Shot | ✓ | Library (Python) | × | ✓ | Low | High
AutoPrompt (Levi et al., 2024) | ✓ | × | Few-Shot | ✓ | Library (Python) | × | × | None | Limited
Google Vertex Prompt Optimizer (Google Cloud, 2025) | ✓ | × | Few-Shot | ✓ | Cloud | × | × | None | Proprietary Models Only
AWS Bedrock Optimizer (Amazon Web Service, 2025) | Single-step rewrite | × | Few-Shot | × | Cloud | × | × | None | Proprietary Models Only
Anthropic Claude Improver (Anthropic, 2025) | LLM heuristic guided | × | ✓ | × | Cloud | × | × | None | Proprietary Models Only
Jina PromptPerfect (Jina, 2025) | LLM heuristic guided | × | ✓ | × | Cloud | ✓ | × | None | Limited
GreaterPrompt (Ours) | ✓ | ✓ | ✓ | ✓ | Library (Python) | ✓ | ✓ | High | High

Table 1: Comparison of different prompt optimization tools. While all existing methods rely on LLM feedback for text-based optimization, GreaterPrompt uniquely also supports gradient-based optimization for more precise prompt tuning. Unlike methods requiring few-shot prompts, GreaterPrompt delivers zero-shot optimized prompts. It also allows optimization for custom metrics, an option missing in many proprietary tools. Finally, GreaterPrompt is the only method offering both an intuitive Web UI and a Python library, with high compatibility across both small (locally deployed) and large (API-based) LLMs.

[Figure 1: Architecture Overview of GreaterPrompt. The figure depicts the workflow: set optimizer configs and initialize the agent models (e.g., optimizer_configs = {"intersect_q": 5, "candidates_topk": 10} and model = AutoModel.from_pretrained("google/gemma-2-9b-it")); build a dataloader from formatted JSONL batch inputs or manual single inputs (dataset = GreaterDataLoader(data_path, custom_input)); then run any of the five optimizers (GReaTer, TextGrad, APO, APE, PE2) via Optimizer.optimize().]

1) Methodological Perspective: GreaterPrompt unifies diverse prompt optimization methodologies under a cohesive implementation framework. Currently, GreaterPrompt supports five prominent prompt optimization techniques across two families based on model scale: i) iterative prompt rewriting through LM feedback (Zhou et al., 2023; Ye et al., 2023; Pryzant et al., 2023; Yuksekgonul et al., 2024), and ii) gradient-based prompt optimization (Das et al., 2024). This unification ensures users can leverage different types of LM feedback or gradient computations for optimizing LLM prompts.

2) Model-Centric Perspective: Larger, closed-source API-based LLMs like GPT (Achiam et al., 2023) and Gemini (Team et al., 2024a) generally offer superior performance but are computationally expensive and require transmitting sensitive data externally; in contrast, smaller open-source LLMs like Llama 3 (Grattafiori et al., 2024) and Gemma 2 (Team et al., 2024b) provide cost-effective alternatives that better ensure data confidentiality. Recognizing the critical importance of model flexibility, GreaterPrompt provides extensive support across both closed-source and open-source model families, ranging from compact, efficient models suitable for local deployment to large-scale models available via cloud APIs. By incorporating both gradient-based optimization techniques suitable for smaller models and feedback-driven optimization techniques for larger models, our framework ensures strong performance irrespective of model choice and resource constraints.

3) Integration Perspective: GreaterPrompt is designed with ease of use and integrability as key principles. To make it accessible to both expert and non-expert users, GreaterPrompt offers both a Python package (a GitHub repository and a pip package) for simple incorporation into any existing pipeline and a user-friendly Web UI (Figure 2) tailored for non-expert users. This dual-interface design democratizes prompt optimization by enabling both expert and general users to benefit equally from state-of-the-art techniques. As shown in Table 1, GreaterPrompt offers a comprehensive set of features compared to other libraries, supporting both text-based and gradient-based optimization while maintaining broad compatibility with smaller and larger models.

Overall, GreaterPrompt combines flexible methodological support, extensive model compatibility, seamless integration, and comprehensive evaluation functionalities. GreaterPrompt not only advances the state of the art in prompt optimization but also makes these sophisticated techniques accessible to a broader audience, bridging the gap between research innovations and practical applications. Our release includes a GitHub repository (https://github.com/psunlpgroup/GreaterPrompt), a PyPI-installable package (https://pypi.org/project/greaterprompt/), and a demo video (https://youtu.be/aSLSnE17lBQ).

2 Background

2.1 Prompt Optimization Algorithms

Given a task execution language model $f_{LLM}$ and a small representative task dataset $D_{task} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the goal of prompt optimization is to find a prompt $p^*$ such that:

$$p^* = \arg\max_{p} \sum_{(x,y) \in D_{task}} m\big(f_{LLM}(x; p),\, y\big) \qquad (1)$$

where $f_{LLM}(x; p)$ is the output of $f_{LLM}$ upon channeling the input $x$ with the prompt $p$, and $m(\cdot)$ is the evaluation function for the task.

Textual Feedback Based Prompt Optimization. Prompt optimization methods based on textual feedback use an optimizer model $f_{optimizer}$, which is usually substantially larger and more expensive than $f_{LLM}$ (Zhou et al., 2023; Ye et al., 2023; Pryzant et al., 2023). Conceptually, $f_{optimizer}\big(m(f_{LLM}(x; p), y) \mid (x, y) \in D_{task}\big)$ drives the optimization process by assessing and providing feedback for refining the prompt.

Gradient Based Prompt Optimization. Traditional textual feedback-based prompt optimization relies heavily on the heuristic capabilities of large language models (LLMs) and often performs poorly when applied to smaller models. To overcome this, GreaterPrompt introduces a stronger optimization signal in the form of loss gradients. The method begins by generating reasoning chains for task inputs using a small model. Then, it extracts final answer logits via a formatted prompt and computes a loss. By backpropagating through this reasoning-informed output, the optimizer obtains gradients with respect to candidate prompt tokens. These gradients are used to select optimal tokens, enabling efficient and effective prompt refinement, even for lightweight models.

2.2 Prompt Optimization Services/Libraries

Looking at Table 1, we can observe the various prompt optimization methods currently available in the field. LangChain Promptim (LangChain, 2025) offers text-based optimization with zero-shot capabilities and Python API integration. Stanford DSPy (Khattab et al., 2022, 2024) and AutoPrompt (Levi et al., 2024) provide similar text-based approaches with few-shot capabilities. Google Vertex Prompt Optimizer (Google Cloud, 2025), AWS Bedrock Optimizer (Amazon Web Service, 2025), and Anthropic Claude Improver (Anthropic, 2025) are cloud-based solutions with varying optimization techniques but limited model compatibility. Jina PromptPerfect (Jina, 2025) offers cloud integration with Web UI support but likewise limited model compatibility. GreaterPrompt stands out by combining both text-based and gradient-based optimization approaches while supporting diverse integration options and high compatibility across model sizes. Each of these existing services has downsides, and there was no unified way to use them until the introduction of GreaterPrompt.

2.3 Evaluation Metrics and Datasets for Prompt Optimization

To evaluate the efficacy of the prompts produced by our library, our experiments (Section 3.3) use two popular datasets for performance evaluation: BBH (Suzgun et al., 2022), a diverse suite of tasks testing capabilities beyond current language models, and GSM8K (Cobbe et al., 2021) for mathematical reasoning assessment. All optimizers demonstrated performance improvements over the Zero-Shot Chain-of-Thought baseline (Kojima et al., 2022).

3 GreaterPrompt

GreaterPrompt is a unified (Section 3.1), customizable (Section 3.2), and high-performing (Section 3.3) library for prompt optimization.

3.1 Unified Implementation

GreaterPrompt unifies the following five prompt optimization methods under a single API. Even though the existing methods have released code, they remain challenging for beginners to use day to day. Our library therefore provides a unified data loading class that supports both manual inputs, by passing a list to the dataloader class, and batch inputs, by loading a JSONL file. With this data loader, users can drive all the supported optimizers through the same function call, eliminating the need to initialize and optimize each method separately.
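The unified interface can be illustrated with a minimal, self-contained sketch. The `GreaterDataloader` name and the `.optimize(dataloader, p_init=...)` signature follow the usage snippets shown later in Section 4.1; the two toy optimizers below are illustrative stand-ins, not the library's real implementations.

```python
# Sketch of the unified optimizer interface: every optimizer consumes the
# same dataloader and exposes the same .optimize() call. The two
# "optimizers" here are toy stand-ins, not GreaterPrompt's real ones.

class GreaterDataloader:
    """Holds (question, prompt, answer) samples from a list or a JSONL file."""
    def __init__(self, custom_inputs=None):
        self.samples = list(custom_inputs or [])

class UppercaseOptimizer:
    def optimize(self, dataloader, p_init):
        # Toy "optimization": emphasize the instruction.
        return p_init.upper()

class SuffixOptimizer:
    def optimize(self, dataloader, p_init):
        # Toy "optimization": append a reasoning cue.
        return p_init + " Let's think step by step."

data = GreaterDataloader(custom_inputs=[
    {"question": "2 + 2 =", "prompt": "Answer directly.", "answer": "4"},
])

# The same call signature works for every optimizer -- the point of the API.
for opt in (UppercaseOptimizer(), SuffixOptimizer()):
    print(opt.optimize(data, p_init="think step by step"))
# prints:
# THINK STEP BY STEP
# think step by step Let's think step by step.
```

Because all optimizers share this shape, swapping a feedback-based method for a gradient-based one is a one-line change at the call site.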
[Figure 2: Screenshot of the Web UI for GreaterPrompt. The top left bar lists the optimizers (GReaTer, APE, APO, PE2, TextGrad); the bottom left bar holds model choices and advanced settings for each optimizer (device selection across CUDA:0, CUDA:1, and CPU, plus parameters such as intersect Q, perplexity loss, and rounds). The main area provides a textbox for the model path and an area to upload the user's prompt data. "P Extractor" is a system prompt used by the GReaTer optimizer to extract the answer for loss calculation.]

1) APE: APE (Automatic Prompt Engineer) (Zhou et al., 2023) is an optimization method that iteratively refines prompts for LLMs by automatically generating, evaluating, and evolving variations based on performance metrics. Inspired by evolutionary algorithms, it selects high-performing prompts, applies mutations, and repeats the process without requiring gradient-based tuning. This method reduces manual effort, enhances prompt effectiveness across tasks, and improves LLM performance in instruction following, reasoning, and factual accuracy.

2) APO: APO (Automated Prompt Optimization) (Pryzant et al., 2023) is a technique that systematically refines prompts for large language models through iterative improvements. It evaluates multiple prompt variations, selects high-performing ones, and applies controlled modifications to enhance clarity, coherence, and effectiveness. Unlike manual tuning, APO automates the process using heuristic or model-driven feedback, ensuring better task-specific performance. This approach minimizes human effort, improves response reliability, and adapts efficiently to diverse use cases.

3) GReaTer: GReaTer (Das et al., 2024) is a novel prompt optimization method that enhances smaller language models by leveraging numerical gradients over task-specific reasoning instead of relying solely on textual feedback from large LLMs. Unlike existing techniques that depend on costly proprietary models like GPT-4, GReaTer enables self-optimization by computing loss gradients over generated reasoning steps. This approach improves prompt effectiveness and task performance on various reasoning benchmarks, achieving results comparable to or surpassing those of prompts optimized using massive LLMs.

4) TextGrad: TextGrad (Yuksekgonul et al., 2024) is a prompt optimization method that automates prompt refinement by leveraging textual feedback as a form of "textual gradient." Instead of using numerical loss gradients like GReaTer, TextGrad iteratively improves prompts based on feedback from a larger LLM, which critiques and suggests modifications to enhance task performance. This method relies on natural-language evaluations of prompt effectiveness, guiding optimization without requiring direct gradient computations. While effective at improving reasoning tasks, TextGrad can be computationally expensive and highly dependent on the quality of the feedback provided by the larger model.

5) PE2: PE2 (Prompt Engineering a Prompt Engineer) (Ye et al., 2023) is a prompt optimization method that enhances prompts through meta-prompt engineering techniques. It iteratively refines prompts by analyzing model responses and leveraging structured feedback from large LLMs. PE2 systematically improves prompts by identifying patterns in successful completions and making targeted adjustments to optimize performance. While effective at improving reasoning and structured tasks, its reliance on external LLM-generated feedback can introduce variability, making optimization results dependent on the feedback model's quality.

3.2 User Customization

GreaterPrompt allows users to choose task exemplar samples, evaluation functions, and model choices.
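As a toy illustration of what a user-supplied evaluation (loss) function can look like, the sketch below contrasts the default cross-entropy on the gold answer token's probability with a user-defined margin loss. The function names and the bare-list "distribution" are illustrative only; the library's actual custom-loss hook operates on model tensors during backpropagation.

```python
import math

# Default loss: cross-entropy of the gold answer token under the
# model's (toy) output distribution.
def cross_entropy(probs, gold_idx):
    return -math.log(probs[gold_idx])

# A user-defined alternative: a margin loss that is zero once the gold
# token beats its strongest competitor by at least `margin`. A function
# like this would be passed to the optimizer in place of cross-entropy.
def margin_loss(probs, gold_idx, margin=0.2):
    competitor = max(p for i, p in enumerate(probs) if i != gold_idx)
    return max(0.0, margin + competitor - probs[gold_idx])

probs = [0.1, 0.6, 0.3]  # toy answer-token distribution, gold index 1
print(round(cross_entropy(probs, 1), 3))  # 0.511
print(round(margin_loss(probs, 1), 3))    # 0.0 (gold already ahead by >= margin)
```

The margin variant stops penalizing a prompt once the correct answer is confidently ranked first, which can suit classification-style tasks better than driving the gold probability all the way to 1.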
Optimizer | movie_rec. | object_count. | tracking_five. | hyperbaton | causal | Average
ZS-CoT | 47 | 67 | 49 | 68 | 51 | 56.4
TextGrad (Yuksekgonul et al., 2024) | 48 | 80 | 55 | 66 | 42 | 58.2
GReaTer (Das et al., 2024) | 57 | 90 | 70 | 84 | 57 | 71.6

Table 2: Performance on BBH tasks with the GReaTer and TextGrad optimizers, using the Llama-3-8B-Instruct model. ZS-CoT refers to the Zero-Shot Chain-of-Thought prompt, i.e., "Let's think step by step."

Optimizer | movie_rec. | object_count. | tracking_five. | hyperbaton | causal | Average
ZS-CoT | 54 | 64 | 56 | 86 | 48 | 61.6
APE (Zhou et al., 2023) | 66 | 70 | 44 | 92 | 49 | 64.2
APO (Pryzant et al., 2023) | 66 | 66 | 58 | 92 | 59 | 68.2
PE2 (Ye et al., 2023) | 68 | 74 | 60 | 90 | 56 | 69.6

Table 3: Performance on BBH tasks with APE-, APO-, and PE2-optimized (with gpt-4-turbo) prompts used on the gpt-3.5-turbo task model. ZS-CoT refers to the Zero-Shot Chain-of-Thought prompt, i.e., "Let's think step by step."
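Scores like those in Tables 2 and 3 come from evaluating each candidate prompt with the task metric and keeping the best, which is exactly the objective in Eq. (1) of Section 2.1. A toy brute-force version of that objective is sketched below; `run_model`, `exact_match`, and the two candidate prompts are hypothetical stand-ins (the "model" is a stub, not an LLM call).

```python
# Toy brute-force instance of the Eq. (1) objective:
# pick the prompt maximizing the summed metric over a small task set.

def run_model(question, prompt):
    # Stub LLM: "answers" arithmetic questions only if told to compute.
    return str(eval(question.rstrip("= "))) if "compute" in prompt else "?"

def exact_match(pred, gold):
    return 1.0 if pred == gold else 0.0

def best_prompt(candidates, dataset, metric):
    # argmax over prompts of sum_{(x, y) in D_task} m(f(x; p), y)
    def score(p):
        return sum(metric(run_model(x, p), y) for x, y in dataset)
    return max(candidates, key=score)

dataset = [("2 + 3 =", "5"), ("10 - 4 =", "6")]
print(best_prompt(["think step by step", "compute the value"], dataset, exact_match))
# -> compute the value
```

Real optimizers replace the brute-force search with LLM feedback or token gradients, but the scoring loop is the shared backbone.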

User-defined Task Examples. Users can upload their task examples, consisting of input-output pairs in JSON format, providing a demonstration of the target task that our library can use as an oracle to produce optimized prompts.

Customized Task Evaluation Functions. We found that cross-entropy does not meet the needs of all tasks. To address this, we added support for custom loss functions in the GReaTer optimizer. Users can define their own loss functions and pass them as a parameter to the model. The custom loss is then used during backpropagation and gradient computation to help the optimizer choose better tokens.

Flexible Model Choices. Our library supports two types of model deployment: API-based and local. Both deployment modes are compatible with all model sizes. For the GReaTer method, users can choose smaller models like meta-llama/Meta-Llama-3-8B-Instruct for efficient optimization, or larger models like meta-llama/Meta-Llama-3-70B-Instruct to generate higher-quality token replacements. For the APO, APE, and PE2 methods, users can flexibly select GPT models ranging from the legacy gpt-3.5-turbo to the latest gpt-4o for evaluation and testing.

3.3 High-Performing Prompts

To demonstrate the performance of our five optimizers, we randomly sampled 5 subtasks from BBH for evaluation. For the GReaTer and TextGrad optimizers, we choose Llama-3-8B-Instruct as the optimization model; evaluation results can be found in Table 2 and Table 4. For the APE, APO, and PE2 optimizers, we use gpt-4-turbo as the optimization model; results can be found in Table 3 and Table 5. The resulting prompts are listed in Table 6.

The results demonstrate noteworthy performance differences between the various optimizers across the BBH subtasks. With the Llama-3-8B-Instruct model, GReaTer achieves the highest average performance (71.6), outperforming both TextGrad (58.2) and the ZS-CoT baseline (56.4). With gpt-4-turbo as the optimization model, PE2 shows the best overall performance (69.6), followed by APO (68.2), APE (64.2), and the ZS-CoT baseline (61.6). Notably, all optimizers demonstrate task-specific strengths: hyperbaton is particularly receptive to optimization across both model types, while performance on causal reasoning remains more challenging. These results highlight the effectiveness of our optimizers across both large and small models on different tasks.

4 Usage Examples

GreaterPrompt supports two modes of usage: a Python package (Section 4.1) and a web UI (Section 4.2). Our demo video shows more details.

4.1 Python Package

The following code snippets give a quick view of our library as a Python package.

Data Loading. GreaterPrompt supports two ways to build the dataloader. Users can either provide a JSONL file path to the predefined GreaterDataloader, which automatically loads batch inputs, or manually input samples. Each sample only needs to contain three mandatory keys: question, prompt, and answer.

    # method 1: load a jsonl file for batch inputs
    dataset1 = GreaterDataloader(
        data_path="./data/boolean_expressions.jsonl"
    )

    # method 2: manually customize inputs
    dataset2 = GreaterDataloader(custom_inputs=[
        {"question": "((-1 + 2 + 9 * 5) - (-2 + -4 + -4 * -7)) =",
         "prompt": "Use logical reasoning and think step by step.",
         "answer": "24"},
        {"question": "((-9 * -5 - 6 + -2) - (-8 - -6 * -3 * 1)) =",
         "prompt": "Use logical reasoning and think step by step.",
         "answer": "63"},
    ])

Configs Initialization. GreaterPrompt supports comprehensive and flexible configurations for each optimizer. Users can choose their desired model for optimization, either local or online. The GReaTer optimizer exposes more advanced settings, and users can even customize the loss function to meet the expectations of different tasks. For beginners, these fields can be left blank, and the optimizers will initialize with default configurations.

    optimize_config = {
        "task_model": "openai_gpt35_turbo_instruct",
        "optim_model": "openai_gpt4_turbo",
    }

Optimizer Loading and Prompt Optimization. Initializing an optimizer is also very simple. If configurations have been defined, users can pass them to the optimizer as a parameter at initialization; otherwise, they can be left blank. After that, users only need to call .optimize() on each optimizer, passing the predefined dataloader and the initial prompt. After a brief wait, the optimizer returns either a single optimized prompt or a sequence of optimized prompts. The whole process is simple and highly integrated, requiring no specialized domain knowledge.

    # config is optional
    ape_optimizer = ApeOptimizer(optimize_config=optimize_config)
    pe2_optimizer = Pe2Optimizer(optimize_config=optimize_config)

    ape_result = ape_optimizer.optimize(dataset1, p_init="think step by step")
    pe2_result = pe2_optimizer.optimize(dataset2, p_init="think step by step")

4.2 User Friendly Web Interface

A primary goal in building our library, GreaterPrompt, is to democratize prompt optimization for both expert and non-expert users. Traditionally, as discussed in Section 2, prompt optimization techniques have required a significant degree of technical expertise and coding proficiency, rendering them inaccessible to many end users. GreaterPrompt addresses this barrier through a comprehensive and user-friendly web interface (see Figure 2) that brings the power of automated prompt optimization to a broader audience. Through this interface, users only have to: (i) select from various prompt optimization methods; (ii) for API-based models, simply provide their model API key; and (iii) for locally hosted models, specify the model path and select the target GPU. Finally, the interface exposes all core functionalities of the code-based library, including hyperparameter tuning, via intuitive controls such as steppers and dropdown menus, with no coding required. We believe this UI-driven solution lowers the barrier to entry, making prompt optimization accessible to users with varying levels of technical expertise.

5 Conclusion

GreaterPrompt is a comprehensive open-source toolkit that supports features many other prompt optimization libraries lack. As shown in our comparison, it uniquely offers both iterative LLM-rewrite and gradient-guided optimization alongside zero-shot prompting and custom metrics. Its user-friendly web interface makes advanced prompt engineering accessible even to non-programmers, while supporting both smaller and larger models. We hope this tool will prove useful to a wide range of users, and that contributors will continue to enhance the platform by adding support for future prompt optimization techniques.
References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

Amazon Web Service. 2025. AWS Bedrock Optimizer.

Anthropic. 2025. Claude Improver.

Anwoy Chatterjee, H. S. V. N. S. Kowndinya Renduchintala, Sumit Bhatia, and Tanmoy Chakraborty. 2024. POSIX: A prompt sensitivity index for large language models. In Findings of EMNLP 2024, pages 14550-14565. Association for Computational Linguistics.

Yongchao Chen, Jacob Arkin, Yilun Hao, Yang Zhang, Nicholas Roy, and Chuchu Fan. 2024. Prompt optimization in multi-step tasks (PROMST): Integrating human feedback and heuristic-based sampling. In Proceedings of EMNLP 2024, pages 3859-3920. Association for Computational Linguistics.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Sarkar Snigdha Sarathi Das, Ryo Kamoi, Bo Pang, Yusen Zhang, Caiming Xiong, and Rui Zhang. 2024. GReaTer: Gradients over reasoning makes smaller language models strong prompt optimizers. arXiv preprint arXiv:2412.09722.

Google Cloud. 2025. Vertex AI Prompt Optimizer.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, and Philip Torr. 2023. A systematic survey of prompt engineering on vision-language foundation models. arXiv preprint arXiv:2307.12980.

Jina. 2025. PromptPerfect.

Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP. arXiv preprint arXiv:2212.14024.

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2024. DSPy: Compiling declarative language model calls into self-improving pipelines. In The Twelfth International Conference on Learning Representations.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199-22213.

LangChain. 2025. LangChain Promptim.

Elad Levi, Eli Brosh, and Matan Friedmann. 2024. Intent-based prompt calibration: Enhancing prompt optimization with synthetic boundary cases.

Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. 2023. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, et al. 2022. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.

Xinyu Tang, Xiaolei Wang, Wayne Xin Zhao, Siyuan Lu, Yaliang Li, and Ji-Rong Wen. 2025. Unleashing the potential of large language models as prompt optimizers: Analogical analysis with gradient-based model optimizers. arXiv preprint arXiv:2402.17564.

Gemini Team, R. Anil, S. Borgeaud, Y. Wu, J. B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. 2024a. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. 2024b. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

Zhaoxuan Wu, Xiaoqiang Lin, Zhongxiang Dai, Wenyang Hu, Yao Shu, See-Kiong Ng, Patrick Jaillet, and Bryan Kian Hsiang Low. 2024. Prompt optimization with EASE? Efficient ordering-aware automated selection of exemplars. arXiv preprint arXiv:2405.16122.

Qinyuan Ye, Maxamed Axmed, Reid Pryzant, and Fereshte Khani. 2023. Prompt engineering a prompt engineer. arXiv preprint arXiv:2311.05661.

Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng Liu, Zhi Huang, Carlos Guestrin, and James Zou. 2024. TextGrad: Automatic "differentiation" via text. arXiv preprint arXiv:2406.07496.

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2023. Large language models are human-level prompt engineers. Preprint, arXiv:2211.01910.

Jingming Zhuo, Songyang Zhang, Xinyu Fang, Haodong Duan, Dahua Lin, and Kai Chen. 2024. ProSA: Assessing and understanding the prompt sensitivity of LLMs. In Findings of EMNLP 2024, pages 1950-1976. Association for Computational Linguistics.

A GSM8K Results

For mathematical reasoning, we compare the performance of different optimization algorithms in GREATERPROMPT on GSM8K (Cobbe et al., 2021). We evaluate prompt performance on the dedicated test set of 1,319 examples. Table 4 shows the performance of GReaTer (Das et al., 2024) and TextGrad (Yuksekgonul et al., 2024) with prompts optimized for Llama-3-8B-Instruct.

Optimizer                              GSM8K
ZS-CoT                                 79.6
TextGrad (Yuksekgonul et al., 2024)    81.1
GReaTer (Das et al., 2024)             82.6

Table 4: GSM8K performance for ZS-CoT, TextGrad, and GReaTer with Llama-3-8B-Instruct.
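The GSM8K scores above reflect exact-match grading of the final numeric answer in each model response. A minimal sketch of such a scorer is below; the helper names are illustrative and not part of the GREATERPROMPT API:

```python
import re

def extract_final_number(text: str):
    """Return the last number in a model response, GSM8K-style."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    # Strip thousands separators so "12,000" compares equal to "12000".
    return matches[-1].replace(",", "")

def accuracy(responses, gold_answers):
    """Fraction of responses whose final number matches the gold answer."""
    correct = sum(
        extract_final_number(r) == g for r, g in zip(responses, gold_answers)
    )
    return correct / len(gold_answers)

# Toy check on two synthetic responses:
preds = ["Step 1: 3 + 4 = 7. The answer is 7.", "So she has 12,000 apples."]
gold = ["7", "12000"]
print(accuracy(preds, gold))  # 1.0
```

In practice the 1,319 test responses would be generated by the task model under each optimized prompt before being scored this way.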

Larger Models  For a prompt optimization performance comparison with larger models, we compare performance on GSM8K with APE, APO, and PE2, as shown in Table 5. Prompts are tested on Mistral-7B-Instruct-v0.2, following Ye et al. (2023).

Optimizer                    GSM8K
ZS-CoT                       48.1
APE (Zhou et al., 2023)      49.7
APO (Pryzant et al., 2023)   51.0
PE2 (Ye et al., 2023)        50.5

Table 5: GSM8K performance for ZS-CoT, APE, APO, and PE2, with gpt-4-turbo as the optimizer and Mistral-7B-Instruct-v0.2.

B Optimized Prompts

Table 6 lists the optimized prompts produced by different prompt optimizers in GREATERPROMPT for 5 randomly sampled BBH tasks.
Movie Recommendation
  TextGrad: You will answer a reasoning question by explicitly connecting the events and outcomes, considering multiple perspectives and potential counterarguments.
  GReaTer: Use causal diagram. The correct option asks whether the variable C has a causal relationship with D, based on changes in the probability P that C occurs given E.
  APE: Approach each stage sequentially.
  APO: Identify the direct cause of the outcome: was it the immediate action or condition without which the event wouldn't have occurred?
  PE2: Determine if the action was intentional and a contributing factor to the outcome. Answer 'Yes' if intentional and causative, 'No' otherwise.

Object Counting
  TextGrad: You will answer a reasoning question about counting objects. Think step by step, considering the context of the question and using it to inform your answer. Be explicit in your counting process, breaking it down.
  GReaTer: Use only addition. Add step by step. Finally, give the correct answer.
  APE: Let's continue by taking systematic, sequential steps.
  APO: Let's think step by step.
  PE2: Let's identify and count the instances of the specified category of items mentioned, tallying multiples to determine their total quantity.

Tracking Shuffled Objects
  TextGrad: You will answer a reasoning question by providing a step-by-step breakdown of the process. Use vivid and descriptive language to describe the events, and make sure to highlight the key connections...
  GReaTer: Use this process as an explanation stepwise for each step until you get to as given above Alice has got originaly the following as follows.
  APE: We'll tackle this systematically, one stage at a time.
  APO: Track ball swaps and position changes separately. List each swap, update positions and ball ownership after each, and determine final states for both.
  PE2: Let's carefully track each player's position swaps step by step to determine their final positions.

Hyperbaton
  TextGrad: You will answer a reasoning question. Think step by step. Provide explicit explanations for each step. Consider breaking down complex concepts into smaller, more manageable parts...
  GReaTer: Use the reasoning and examples you would step. Finally give the actual correct answer.
  APE: Approach this gradually, step by step.
  APO: Choose the sentence with adjectives in the correct order: opinion, size, age, shape, color, origin, material, purpose, noun.
  PE2: Let's think step by step, considering the standard order of adjectives in English: opinion, size, age, shape, color, origin, material, purpose.

Causal Judgment
  TextGrad: You will answer a reasoning question by explicitly connecting the events and outcomes, considering multiple perspectives and potential counterarguments...
  GReaTer: Use causal diagram. The correct option ask about whether there the variable C of about whether a specific cause is sufficient. The answer a causal relationship between C to D if the probability P that C occurs given E changes.
  APE: Approach each stage sequentially.
  APO: Identify the direct cause of the outcome: was it the immediate action or condition without which the event wouldn't have occurred?
  PE2: Determine if the action was intentional and a contributing factor to the outcome. Answer 'Yes' if intentional and causative, 'No' otherwise.

Table 6: Results for 5 randomly sampled BBH tasks by 5 different optimizers.
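Each prompt in Table 6 is a task-level instruction that is combined with every BBH question at inference time. A minimal sketch of that composition follows, assuming a simple Q/A template; the template and function name are illustrative, not the toolkit's actual format:

```python
def build_query(optimized_prompt: str, question: str) -> str:
    """Prepend a task-level optimized instruction to one BBH question."""
    return f"{optimized_prompt}\n\nQ: {question}\nA:"

# Using GReaTer's Object Counting prompt from Table 6:
query = build_query(
    "Use only addition. Add step by step. Finally, give the correct answer.",
    "I have a chair, two ovens, and three tables. How many objects do I have?",
)
print(query)
```

The same instruction is reused unchanged across all test questions for a task, which is what makes a single optimized prompt comparable across optimizers.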
