
Published in Findings of EMNLP 2023 (arXiv:2212.04037v2 [cs.CL])

Demystifying Prompts in Language Models via Perplexity Estimation


Hila Gonen1,2   Srini Iyer2   Terra Blevins1   Noah A. Smith1,3   Luke Zettlemoyer1,2
1 Paul G. Allen School of Computer Science & Engineering, University of Washington
2 Meta AI Research   3 Allen Institute for Artificial Intelligence
[email protected]
[email protected]
{blvns,nasmith,lsz}@cs.washington.edu

Abstract
Language models can be prompted to perform a wide variety of tasks with zero- and few-shot in-context learning. However, performance varies significantly with the choice of prompt, and we do not yet understand why this happens. In this paper, we analyze the factors that contribute to this variance and establish a new empirical hypothesis: the performance of a prompt is predicted by the extent to which the model is familiar with the language it contains. Over a wide range of tasks, we show that the lower the perplexity of the prompt, the better it is able to perform the task, when considering reasonable prompts that are related to it. As part of our analysis, we also devise a method to automatically extend a small seed set of manually written prompts by paraphrasing with GPT3 and backtranslation. This larger set allows us to verify that perplexity is a strong predictor of the success of a prompt and we show that the lowest perplexity prompts are consistently effective.

Figure 1: Accuracy vs. perplexity for the AG News dataset with OPT 175B. The x axis is in log scale. Each point stands for a different prompt.

1 Introduction

Language models can be prompted to perform a wide range of zero- and few-shot learning tasks (Brown et al., 2020; Schick and Schütze, 2020). However, there is significant variance in the performance of seemingly similar prompts (Chen et al., 2022): for AG News (Zhang et al., 2015), we find an over 30 point accuracy gap between different manually curated prompts (see Table 1) on OPT 175B (Zhang et al., 2022). Despite efforts to improve prompt engineering (Shin et al., 2020; Li and Liang, 2021; Gao et al., 2021), it is still challenging to develop high-quality prompts for new tasks, and little is known about why this phenomenon occurs.

We are interested in understanding what makes some prompts better than others, and using this understanding to create better prompts for given tasks and models. We hypothesize that the lower the perplexity of a prompt is, the better its performance on the task will be, when considering reasonable prompts that are related to the task. This is based on the intuition that the more frequently the prompt (or very similar phrases) appears in the training data, the more the model is familiar with it and is able to perform the described task. We refrain from using the training data directly as it is often unavailable, expensive to search due to its size, and hard to use for approximate matching of similar prompts. Instead, we focus on the perplexity of the prompt as a proxy for its occurrences in the data.

To enable more complete analysis, we automatically expand the set of manually created prompts for the task by paraphrasing, resulting in a much larger and diverse set of prompts. We focus on prompts in English that reasonably describe the task for two reasons: (a) our main motivation is to understand what lies under the variance of performance in this type of prompt; (b) we aim to devise a useful method for creating prompts that are consistently effective, that could be easily adopted and interpreted by future, potentially non-expert users.
We show empirically that our hypothesis holds across a diverse set of tasks (including classification and word prediction), models, and model sizes, providing us some insights about the underlying mechanism of prompting (see Figure 1). As a result, we devise a method, SPELL (Selecting Prompts by Estimating LM Likelihood), for creating prompts in an informed manner. We show that using SPELL to choose prompts results in less variability in performance as well as in accuracy gains (1.8 accuracy points with OPT and 2.3 accuracy points with Bloom on average). Importantly, our method does not require labels at all, only a small sample of inputs for the task.

Our contributions can be summarized as follows: (a) we formalize the notion that better familiarity of the model with the prompt correlates with better performance (Section 2); (b) we automatically elaborate a given set of seed prompts using paraphrasing (Section 3); (c) we establish experimentally the hypothesis that lower perplexity of the prompt correlates well with better performance (Section 5); (d) we devise a method to create a more consistent set of prompts, that also improve results even with no labels for the task (Section 7).

2 Why are prompts not all created equal?

Despite the popularity of prompting as a method for using language models (Shin et al., 2020; Li and Liang, 2021; Gao et al., 2021), the cause for the different behavior of various prompts remains unclear so far. Table 1 shows four example prompts for a news topic classification task (AG News) and their respective accuracies when used to prompt OPT 175B (Zhang et al., 2022). The accuracy gap between the different prompts is not trivial, and it is not possible to predict from the prompts alone.

Prompt | Accuracy
What is this piece of news regarding? | 40.9
What is this article about? | 52.4
What is the best way to describe this article? | 68.2
What is the most accurate label for this news article? | 71.2

Table 1: Example prompts for the task AG News (news classification) that vary considerably in accuracy.

We propose that the more frequently a prompt appears in some variation in the data, the better it works for the task. The intuition behind this is that a sequence that is more expected by the model is more likely to aid the model to extract the relevant information. However, this premise is hard to measure accurately: most language models use huge amounts of training data (e.g., OPT uses a corpus of roughly 180B tokens, and Bloom uses roughly 366B tokens), and in addition, this training data is not always publicly available (e.g., GPT3; Brown et al. 2020). Our initial attempts to estimate exact-match occurrences of prompts in the data resulted in very sparse counts, which led us to look for a softer formalization.1

Instead of considering the training data directly, we propose to focus on the perplexity of the prompt as a proxy for its occurrences in some form in the data – essentially indicating to what extent the model expects this prompt. This perplexity-based framing helps to avoid the challenge of exact match in the data, and takes into account variations of the prompt that the model is also exposed to and might be influenced by. In addition, it helps overcome the challenges mentioned above as it requires neither access to the pretraining data (which is not always publicly available for LMs) nor matching over huge amounts of text.

Hypothesis: lower perplexity correlates with better performance. We hypothesize that on average, lower-perplexity prompts perform better. We are interested in establishing this hypothesis by experimentally showing a significant negative correlation between the perplexity of the prompt and its performance on the task, across a diverse set of tasks and models.

We define the perplexity of the prompt as the perplexity of the full prompt sequence, including the input itself, and without the label, averaged over 1,000 examples (see Section 4 for details). The input is a part of the prompt in the case of the word prediction tasks by design (e.g., “The opposite of the word good is”). Inclusion of the task input as part of the prompt for classification tasks as well is intentional: we want to ground the prompt to the task (without the input, we are testing the hypothesis that lower perplexity prompts across all tasks work better on every task). The label is not considered a part of the prompt and is not taken into consideration when computing the prompt. In practice, this also results in a huge advantage of our method, SPELL (Section 7), which aims to find better prompts—it does not require any labels.

1 We experimented with the task of AG News (see Section 4.1), and looked for all of its prompts (using exact match) in the OPT training data. Indeed, only 9/108 of the prompts appear in the training data. Such sparse counts do not allow for any useful or reliable analysis of prompt behaviour.


For performance measures, we use the log-likelihood score assigned by the model to the correct label given that prompt. We choose this metric over accuracy as it gives a more fine-grained distinction between prompts and because accuracy can be unstable, as explained in more detail in Section 4. For classification tasks, we also report correlation with accuracy, which is the main evaluation metric for this type of task.
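To make this definition concrete, the following is a minimal sketch of the prompt-perplexity estimate, assuming a HuggingFace causal LM interface; the small GPT-2 checkpoint, the helper name prompt_perplexity, and the convention that the task input simply precedes the prompt text are illustrative assumptions, not the authors' released code.

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")    # small stand-in for OPT/Bloom
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prompt_perplexity(prompt, inputs):
    """Average perplexity of the instantiated prompt: input included, label excluded."""
    ppls = []
    for x in inputs:                                  # e.g., 1,000 task inputs
        text = f"{x} {prompt}"                        # classification: the input precedes the prompt
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss        # mean token-level cross-entropy
        ppls.append(math.exp(loss.item()))            # perplexity = exp(cross-entropy)
    return sum(ppls) / len(ppls)

For word prediction tasks, the input is already embedded in the prompt text by design, so the instantiated string would simply be the prompt itself.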
3 Automatic Expansion of Seed Prompts

We are interested in expanding our pool of prompts in order to: (a) have a more diverse set of prompts, making it more likely to find a better prompt for our task, and (b) support better analysis to validate our prompt quality hypothesis. In this section, we describe our method for automatically expanding a seed set of manually created prompts using paraphrasing.

Step 0: Creating a seed set of manually-written prompts We first write/collect a small set of human written prompts that describe the task. For classification tasks we assume that the input appears before the prompt, with no choices appearing as part of the prompt (to help in smooth paraphrasing of the prompt itself).

Step 1: Paraphrasing with GPT3 We use the text-davinci-002 version of GPT3 (Brown et al., 2020) to generate paraphrases for each of the manual prompts in our seed set. We prompt it with a meta-prompt for paraphrasing to generate variations of one of our seed prompts. An example of such a meta-prompt is: Write a paraphrase for the following sentence: <seed prompt> Paraphrase:. The 7 meta-prompts used in this step are listed in Table 2.

We choose GPT3 as our paraphrasing model because of its well-documented generation abilities. This is also to ensure that there is a separation between the model we use to create the prompts and the models we use to rank them (OPT and Bloom, see Section 4 for details), to avoid confounding the experimental setup.

Step 2: Paraphrasing using backtranslation Our second step takes as input the paraphrases from GPT3 (in addition to the seed set of prompts) and translates them into different languages and back into English to get additional prompt paraphrases (Wieting et al., 2017). We use a set of 8 languages available in the NLLB translation model (Costa-jussà et al., 2022) that are relatively high resource and close to English,2 to reduce the risk of noise. Since we aim to get about 100 prompts per task, we add 8 additional languages3 in the case where the basic 8 languages yielded too few alternatives. For word prediction tasks, we use the sequence of the created prompt up to the index of the label, not including the label, for example: The word “dog” in French is “. Depending on the task, we enforce the existence of specific words (e.g., the name of the language, and the source word, in word-level translation) or enforce the prompt to be a question.

Examples and Statistics Table 4 lists all 4 manually created prompts we use for the AG News task (news classification), alongside a few sampled prompts created automatically using our method. As was typically the case, we are able to get prompts that are rather different in phrasing and structure from those included in the seed set.

The statistics of the prompts in the manually created seed set (Step 0) as well as the prompts after Step 1 and Step 2 for each task (see Section 4.1 for details about the tasks) are detailed in Table 3.

2 Danish, German, Italian, French, Dutch, Portuguese, Swedish, Spanish.
3 Norwegian, Romanian, Catalan, Turkish, Ukrainian, Polish, Russian, Arabic.
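A minimal sketch of Steps 1 and 2 is given below, assuming access to a paraphrasing LM wrapped as paraphrase_fn and to an NLLB checkpoint through the HuggingFace translation pipeline; the meta-prompts shown, the pivot-language subset, the model name, and the required-word filter are simplified stand-ins for the full procedure described above.

from transformers import pipeline

META_PROMPTS = [
    "Write a paraphrase for the following sentence: {seed} Paraphrase:",
    "Paraphrase the following sentence: {seed} Paraphrase:",
]
PIVOT_LANGS = ["fra_Latn", "deu_Latn", "ita_Latn", "spa_Latn"]  # subset of the 8 pivot languages
NLLB = "facebook/nllb-200-distilled-600M"                       # assumed NLLB checkpoint

def expand_prompts(seed_prompts, paraphrase_fn, required_words=()):
    """Expand a seed set via LM paraphrasing (Step 1) and backtranslation (Step 2)."""
    candidates = set(seed_prompts)

    # Step 1: paraphrase every seed prompt with every meta-prompt.
    for seed in seed_prompts:
        for meta in META_PROMPTS:
            candidates.add(paraphrase_fn(meta.format(seed=seed)).strip())

    # Step 2: translate each candidate into a pivot language and back into English.
    for lang in PIVOT_LANGS:
        to_pivot = pipeline("translation", model=NLLB, src_lang="eng_Latn", tgt_lang=lang)
        to_eng = pipeline("translation", model=NLLB, src_lang=lang, tgt_lang="eng_Latn")
        for prompt in list(candidates):
            pivoted = to_pivot(prompt)[0]["translation_text"]
            candidates.add(to_eng(pivoted)[0]["translation_text"].strip())

    # Task-dependent filter: keep only prompts that still contain the required words.
    return [p for p in candidates if all(w.lower() in p.lower() for w in required_words)]

For word prediction tasks, each surviving prompt would additionally be truncated just before the label position, as described above.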
4 Experimental Setup

4.1 Models, Tasks and Datasets

We study four auto-regressive models: OPT (Zhang et al., 2022) of different sizes (1.3B, 30B, 175B parameters), all trained mainly on English,4 and Bloom (176B parameters; Luccioni et al. 2022), which is trained on 46 natural languages and 13 programming languages. We experiment with two types of tasks: word prediction tasks and classification tasks, as detailed below.

Word Prediction Tasks The first task in this category is word-level translation. Given a source word in English and a target language, we expect the model to predict the correct translation. For this task we use NorthEuraLex5 (Dellert et al., 2019), a lexical database providing translations of 1016 words into 107 languages. We experiment with 9 languages that use the Latin script. For Bloom, we use 5 additional languages that do not use the Latin script (since Bloom is multilingual). Note that only 5 of the languages we experiment with are officially covered by Bloom.6

4 As stated in the paper, the training corpora were previously collected or filtered to contain predominantly English text, but a small amount of non-English data is still present within the corpus via CommonCrawl.
5 http://northeuralex.org/
6 Basque, French, Portuguese, Spanish, and Arabic.


Meta prompts
Write a paraphrase for the following sentence: <seed-prompt> Paraphrase:
<seed-prompt> Paraphrase:
Write a likely paraphrase of the text: <seed-prompt> Paraphrase:
Write a sentence similar to the following one: <seed-prompt> Paraphrase:
Paraphrase the following sentence: <seed-prompt> Paraphrase:
Write a variation of this sentence: <seed-prompt>
How would you say the following sentence in a different way? <seed-prompt>

Table 2: Meta prompts used in Step 1 of our method for paraphrasing using GPT3.

Task | # Step 0 | # Step 1 | # Step 2
Word-Level Translation | 12 | 59 | 118
Antonyms | 12 | 85 | 176
GLUE Cola | 4 | 27 | 144
Newspop | 13 | 43 | 119
AG News | 4 | 23 | 108
IMDB | 10 | 45 | 178
DBpedia | 8 | 23 | 103
Emotion | 4 | 14 | 94
Tweet Offensive | 5 | 41 | 119

Table 3: Number of prompts for the different tasks: prompts after step 0 (creating prompts manually), prompts after step 1 (GPT3 paraphrasing), and prompts after step 2 (backtranslation).

We also consider antonym prediction where, given a word, the model is expected to predict its antonym. For this task, we use data from Kaggle,7 which is based on WordNet (Miller, 1995). We choose 1,000 word pairs at random.

Classification Tasks We choose classification tasks from Huggingface Datasets,8 with an attempt to have a set of diverse tasks that use relatively short inputs, with some prompts available in PromptSource (Bach et al., 2022):9 (a) GLUE Cola (grammaticality; Warstadt et al. 2018); (b) Newspop (news classification; Moniz and Torgo 2018); (c) AG News (news classification; Zhang et al. 2015); (d) IMDB (movie review classification; Maas et al. 2011); (e) DBpedia (topic classification; Lehmann et al. 2015); (f) Emotion (classification to emotions; Saravia et al. 2018); (g) Tweet Offensive (classification to offensive vs. not offensive tweets; Barbieri et al. 2020). We use 1,000 random examples from each dataset.

The full set of manual prompts is listed in Section A in the Appendix. In these tasks, the prompt follows the input, and at the end of each prompt we add the choices of classes (i.e., we provide the possible labels explicitly in the prompt by listing the possible answers as defined by the dataset itself): “Choices: X, Y, Z. Answer:”, as we find it helps in terms of accuracy. Defining the label space likely helps in our zero-shot setting because there are no previous demonstrations from which the model can learn the possible classes. Additionally, adding class options to the prompt helps to reduce the effect of the surface form competition (Holtzman et al., 2021). The option of generating the answer and comparing it with the gold label was not reasonable here, since we cannot expect the model to generate the exact label as the first choice often enough.

4.2 Implementation Details

In all experiments we evaluate zero-shot performance. To avoid noise when computing perplexity, we instantiate the prompts with 1,000 examples of the dataset, compute the perplexity of the prompt with each example, and calculate the average across all instantiated prompts.

To estimate the performance of the prompt, we look at two measures: (a) the language model score (log probability) of the correct label, averaged across 1,000 examples; (b) the accuracy on the task, computed over the 1,000 examples. To compute accuracy, for each example we score all classes and choose the highest ranking class as the prediction of the model. The score of a label of multiple tokens is defined by the sum of the token scores.

7 https://www.kaggle.com/datasets/duketemon/antonyms-wordnet
8 https://huggingface.co/docs/datasets/index
9 https://github.com/bigscience-workshop/promptsource
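As an illustration of this scoring scheme, the sketch below scores every class label by the sum of its token log-probabilities under a causal LM and picks the highest-scoring one; the GPT-2 stand-in and the helper names are assumptions for illustration, not the evaluation code used in the paper.

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")    # small stand-in for OPT/Bloom
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def label_score(prompt, label):
    """Sum of log-probabilities of the label tokens, conditioned on the instantiated prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    label_ids = tokenizer(" " + label, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, label_ids], dim=1)
    with torch.no_grad():
        log_probs = F.log_softmax(model(ids).logits, dim=-1)
    total = 0.0
    for i in range(label_ids.shape[1]):
        position = prompt_ids.shape[1] + i - 1       # logits at position t predict token t+1
        token_id = ids[0, prompt_ids.shape[1] + i]
        total += log_probs[0, position, token_id].item()
    return total

def predict(prompt, classes):
    """Choose the highest-ranking class as the model prediction."""
    return max(classes, key=lambda c: label_score(prompt, c))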


All Manually Created Prompts | Examples of Similar Automatically Created Prompts
What label best describes this news article? | What’s the most accurate label for this news article?
What is this piece of news regarding? | What does this piece of news concern?
Which newspaper section would this article likely appear in? | In what section of the newspaper could this article be published?
What topic is this news article about? | What category does this article fall into?

Table 4: Prompts for the task AG News (news classification): the manually created prompts and a sample of
automatically created prompts using our method.

Model Task Perplexity-score corr. Perplexity-acc corr. Avg Acc Acc 50%
Pearson Spearman Pearson Spearman
Antonyms **-0.41 **-0.53 – – – –
GLUE Cola -0.15 -0.14 -0.04 -0.02 47.7 57.1
Newspop *-0.24 **-0.26 *-0.20 -0.18 66.4 72.9
AG News **-0.63 **-0.68 **-0.77 **-0.81 57.5 68.7
OPT 175B
IMDB **0.35 **0.40 0.14 *0.20 86.2 91.0
DBpedia **-0.50 **-0.44 **-0.51 **-0.42 46.7 55.2
Emotion -0.14 -0.19 **-0.30 **-0.32 16.4 23.0
Tweet Offensive *-0.19 0.07 0.18 *0.23 51.3 55.8
Antonyms **-0.37 **-0.23 – – – –
GLUE Cola 0.07 0.11 **-0.25 **-0.26 55.5 65.6
Newspop **-0.50 **-0.42 **-0.59 **-0.51 78.9 87.8
AG News **-0.62 **-0.54 **-0.44 **-0.44 50.3 59.4
Bloom 176B
IMDB 0.04 0.09 -0.08 -0.14 89.3 92.2
DBpedia **-0.47 *-0.27 **-0.35 *-0.21 27.2 33.4
Emotion **-0.33 **-0.42 **-0.48 **-0.55 29.3 31.7
Tweet Offensive 0.14 *0.24 *-0.20 -0.03 41.6 46.2
Antonyms **-0.54 **-0.70 – – – –
GLUE Cola -0.05 0.03 -0.13 0.02 32.2 35.5
Newspop *-0.23 *-0.25 *-0.18 -0.12 60.3 66.6
OPT 30B AG News **-0.66 **-0.71 **-0.81 **-0.80 49.3 60.7
IMDB -0.06 *0.17 0.04 **0.22 81.6 86.1
DBpedia **-0.41 **-0.34 *-0.21 *-0.25 35.9 42.4
Emotion 0.00 -0.03 0.18 0.13 12.3 16.2
Tweet Offensive **-0.44 **-0.39 -0.11 -0.05 54.6 60.2
Antonyms **-0.45 **-0.53 – – – –
GLUE Cola **-0.39 **-0.36 -0.09 *-0.19 60.3 65.9
Newspop **0.33 *0.21 -0.07 -0.07 37.6 40.3
OPT 1.3B AG News **-0.33 **-0.29 **-0.56 **-0.49 31.9 37.6
IMDB -0.11 -0.07 **0.24 **0.22 86.0 89.1
DBpedia -0.16 -0.14 -0.02 -0.01 8.7 9.2
Emotion 0.08 0.08 **-0.29 **-0.30 7.0 9.1
Tweet Offensive **-0.42 **-0.35 **-0.50 **-0.38 58.6 62.6

Table 5: Correlation results for the different tasks, with OPT (different sizes) and Bloom. Correlations with p < 0.05
are marked with *. Correlations with p < 0.00625 (according to Bonferroni correction for multiple hypotheses) are
marked with **. Dark and light blue colored cells stand for negative correlations < −0.2 and > −0.2, respectively.
Dark and light orange colored cells stand for positive correlations > 0.2 and < 0.2, respectively. Average accuracy
across all prompts and average accuracy of best 50% prompts are also reported for reference (Avg Acc and Acc
50%, respectively).

For the word prediction tasks we only report scores, since accuracy in general is less stable, suffers more from the surface form competition (Holtzman et al., 2021), and is usually quite low for these tasks in our setting (the chances the model will generate an exact match of the label are low). Hence, the score of the correct label gives a better estimate of the actual performance of the model.


Lang | OPT 175B (Pearson, Spearman) | Bloom 176B (Pearson, Spearman)
ita | -0.44  -0.57 | -0.37  -0.63
spa | -0.47  -0.61 | -0.51  -0.66
cat | -0.47  -0.58 | -0.24  -0.31
fra | -0.48  -0.57 | -0.48  -0.64
deu | -0.44  -0.60 | -0.46  -0.65
fin | -0.44  -0.62 | -0.34  -0.56
por | -0.45  -0.62 | -0.46  -0.61
eus | -0.47  -0.61 | -0.45  -0.61
tur | -0.44  -0.62 | -0.33  -0.62
jpn | –  – | -0.33  -0.26
arb | –  – | -0.36  -0.47
rus | –  – | -0.54  -0.69
kor | –  – | -0.42  -0.58
ell | –  – | -0.40  -0.51

Table 6: Correlation results for word-level translation, with OPT 175B and Bloom 176B. All correlations are statistically significant also according to Bonferroni correction for multiple hypotheses for OPT (p < 0.0055). Same for Bloom (p < 0.00357), except for Catalan (Pearson) and Japanese (Spearman).

5 Results

Classification Tasks and Antonym Prediction Table 5 depicts the Pearson and Spearman correlation results on the classification tasks and the antonym task, with both OPT 175B and Bloom (two upper blocks). We see that most correlations are negative and statistically significant, as we expect. This validates our hypothesis and shows that in the majority of tasks we indeed get a strong correlation between low perplexity of the prompt and better performance on the task.10 For each task we also report the average accuracy.
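For reference, the correlation analysis reported in Table 5 can be reproduced along the following lines, assuming per-prompt perplexities and per-prompt performance values have already been collected; the helper name and the number of tests used for the Bonferroni correction are illustrative assumptions.

from scipy.stats import pearsonr, spearmanr

def perplexity_performance_correlation(perplexities, performances, n_tests=8):
    """Pearson/Spearman correlations with a Bonferroni-adjusted significance level."""
    alpha = 0.05 / n_tests                 # e.g., 0.05 / 8 = 0.00625 as in Table 5
    pr, pp = pearsonr(perplexities, performances)
    sr, sp = spearmanr(perplexities, performances)
    return {
        "pearson": (pr, pp < alpha),       # (correlation, significant after correction)
        "spearman": (sr, sp < alpha),
    }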
Word-Level Translation The results of the word-level translation task are reported in Table 6. Here the correlations are extremely consistent across all languages and across models, with statistical significance for all languages except for Catalan and Japanese (in Bloom).

Results across Different Model Sizes We repeat the same experiment with the OPT models of sizes 1.3B and 30B, to investigate whether these correlations are also consistent across model sizes or whether this is a phenomenon we should expect only in large language models. Table 5 (two lower blocks) shows these results for all classification tasks and antonym prediction. We do see that in general the trend appears to be the same in the smaller models as well; however, the correlations seem to be slightly weaker. We hypothesize that this might be due to the overall lower performance of these smaller models, making the performance results we use for correlation less stable and reliable. For word-level translation, however, all correlations with the 30B and 1.3B models are similar to those with the 175B model, and are all statistically significant (also after Bonferroni correction for multiple hypotheses).

6 Analysis

Next, we further explore the observed relationship between model perplexity and prompt performance. Despite the consistently high correlation between these two factors, the structure of this relationship varies across tasks (Section 6.1). Additionally, we find that the automatically added prompts are high-quality and not a significant source of noise (Section 6.2), and that the best prompts selected by our approach vary across models (Section 6.3).

6.1 Visualizing the Relationship between Perplexity and Performance

To visualize the correlations we get between the perplexity and the performance of the prompts across the different settings, we plot a few examples for different tasks and languages. Figures 1 and 2 show some of the results for selected tasks, as detailed in the captions. The negative trend of the correlation is clearly visible in all plots. Interestingly, the structure of the plots for word-level translation are very similar across all the language pairs, suggesting that prompts get consistent perplexity and performance across languages (possibly at different scales). Indeed, the intersection of the 10 lowest perplexity prompts between any two different languages is 8.6 and 8.4 on average (for OPT 175B and Bloom, respectively), which is extremely high. This is not very surprising since we know that the only differences between the prompts in the different languages are the names of the target languages (e.g., The word for “dog” in French is “). Additionally, the intersection of 10 prompts with the highest label score between any two different languages is 7 and 6.5 on average (for OPT 175B and Bloom, respectively).

10 Repeating the experiments with the length of the prompt instead of perplexity yields weak positive correlations, almost all of which are not statistically significant.


A notable finding that appears in the word-level translation plots is the clear separation between prompts that include or do not include quotation marks for the label (usually aligns with whether the prompt uses quotation marks for the source word) – three example prompts appear on the plot. Prompts with quotation marks for the words tend to have both lower perplexity and better performance, consistently. We further analyze the results for OPT 175B within clusters (with/without quotation marks). In the cluster with quotation marks, we get negative correlations (in the range of –0.28 to –0.38) that are statistically significant for almost all languages. The correlations within the other cluster are weaker and less significant (this is expected given the overall lower performance of that cluster).

Figure 2: Score of correct label vs. perplexity for the word-level translation task in French with OPT 175B. The x axis is in log scale. The blue points stand for prompts with quotation marks for the words, while the yellow points are of prompts without quotation marks.

6.2 Effect of Noisy Prompts

We expect our automatic method for expanding the set of prompts to also introduce some noise. Though our focus is on the lower perplexity prompts, since we want to benefit from this analysis and be able to devise a method for creating better prompts, we do want to make sure that this potential noise is not the cause for the strong correlations we get. In other words, one might claim that some noisy prompts have particularly high perplexity and also perform badly, thus, supporting our hypothesis in an undesirable and uncontrolled manner.

We turn to inspect the 10% highest perplexity prompts in the different tasks and find subjectively that they are not noisy, and are usually valid prompts for the tasks. The 5 highest perplexity prompts for the GLUE Cola task are listed in Table 7 as an example.

prompt | ppl
Is this example correct English usage? | 25.79
Is this example using English correctly? | 25.46
Is this example correct English? | 25.33
Is this the example in correct English? | 25.00
Is English in this example correct? | 24.90

Table 7: Example of the 5 highest perplexity prompts for GLUE Cola, using OPT 175B.

As a sanity check, we choose two tasks: word-level translation and AG News, manually filter out the noisy prompts, and compute the correlations again. The annotation is done by external annotators (NLP researchers) that were presented with the tasks and asked to label whether the prompt is reasonable to use for the task. The new correlations with OPT 175B are reported in Table 8. We find that all correlations remain strong and statistically significant when noise is manually removed from the analysis. We get the same trends with Bloom as well.

Task | Lang | Before filtering (Pearson, Spearman) | After filtering (Pearson, Spearman)
AG News | - | -0.63  -0.68 | -0.62  -0.54
WLT | ita | -0.44  -0.58 | -0.44  -0.57
WLT | spa | -0.47  -0.61 | -0.47  -0.61
WLT | cat | -0.45  -0.57 | -0.47  -0.58
WLT | fra | -0.47  -0.57 | -0.48  -0.57
WLT | deu | -0.43  -0.60 | -0.44  -0.60
WLT | fin | -0.41  -0.60 | -0.44  -0.62
WLT | por | -0.43  -0.61 | -0.45  -0.62
WLT | eus | -0.45  -0.60 | -0.47  -0.61
WLT | tur | -0.43  -0.61 | -0.44  -0.62

Table 8: Correlations before and after filtering out noisy prompts, with AG News and Word-Level Translation (WLT).

6.3 Best Performing Prompts

Table 9 lists the 5 lowest perplexity prompts for the task of antonym prediction, as an example. Similar lists for the rest of the tasks are listed in Section B in the Appendix.

prompt | ppl
The following two words are antonyms: “good” and “ | 10.24
The antonym of the word “good” is “ | 10.32
The word that has the opposite meaning of the word “good” is “ | 10.43
The word “good” is the antithesis of the word “ | 10.85
The word “good” is the opposite of the word “ | 11.15

Table 9: Lowest perplexity prompts for the antonym prediction task, using OPT 175B.


A closer look at the lowest perplexity prompts reveals that the intersection of 10 lowest perplexity prompts between OPT 175B and Bloom is 7.1 on average, across the classification tasks. When looking at the 10 highest accuracy prompts across models we get an average intersection of 3.1 across the classification tasks.

7 SPELL: Selecting Prompts by Estimating LM Likelihood

The primary contribution of this work is the analysis of the relationship between prompt perplexity and downstream task performance (Section 5). As one potential application of our findings, we also present a new method, SPELL, for generating and selecting consistently effective prompts.

Assuming a fixed computational budget for finding effective prompts for a given task, and that the search space might be quite large, we devise the following straightforward procedure:

1. Obtain a small set of manually created prompts for the task.
2. Expand the set of prompts with automatic paraphrasing using a LM (e.g., GPT3) and backtranslation (see Section 3).
3. Rank the list of prompts by perplexity (averaged on a representative sample of task inputs, e.g., 1,000).
4. Choose the k (e.g., 3) lowest perplexity prompts.
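The procedure fits in a few lines of code. The sketch below is a schematic composition of the helpers sketched earlier (prompt expansion and prompt-perplexity estimation), passed in here as function arguments; it is an illustration of the four steps, not the authors' implementation.

def spell(seed_prompts, task_inputs, expand_fn, perplexity_fn, k=3):
    """Select the k lowest-perplexity prompts for a task; no labels are required."""
    candidates = expand_fn(seed_prompts)              # Steps 1-2: paraphrasing + backtranslation
    sample = task_inputs[:1000]                       # representative sample of task inputs
    ranked = sorted(candidates,
                    key=lambda p: perplexity_fn(p, sample))   # Step 3: rank by average perplexity
    return ranked[:k]                                 # Step 4: keep the k lowest-perplexity prompts

Because the ranking only needs unlabeled task inputs, the whole selection step can run before any gold labels exist for the task.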
Using this algorithm, we show empirically that it is best to prioritize experimenting with the lowest perplexity prompts, as they are more stable (exhibit less variation in performance) and perform better than manual prompts on average. This method also does not require any labels for the task, and is applicable to any task, also by non-experts, given example inputs only.

7.1 Empirical Validation of SPELL

To show the effectiveness of our method, we report the results we get using SPELL across the different tasks. In Table 10 we report the average accuracy with the manual prompts compared to the average accuracy with the 3 lowest-perplexity prompts, for both OPT 175B and Bloom. Indeed, in most cases, the average accuracy using the 3 lowest perplexity prompts outperforms the average accuracy of the manual prompts, with an average of 1.8 accuracy points across tasks with OPT and 2.3 accuracy points with Bloom, demonstrating the effectiveness of our method.

Task | OPT (low-ppl, manual, ∆) | Bloom (low-ppl, manual, ∆)
GLUE Cola | 51.7  48.5  3.1 | 64.5  60.9  3.6
Newspop | 80.6  70.4  10.2 | 90.0  80.0  10.0
AG News | 68.4  61.9  6.5 | 51.0  63.5  -12.5
IMDB | 90.4  88.9  1.4 | 91.3  88.8  2.5
DBpedia | 46.0  51.7  -5.7 | 31.2  30.2  1.0
Emotion | 21.6  22.6  -1.1 | 35.8  32.1  3.6
Tweet Offensive | 48.4  50.6  -2.3 | 48.6  40.8  7.8

Table 10: The average accuracy with the manual prompts (manual) compared to the average accuracy with the 3 lowest-perplexity prompts (low-ppl), for both OPT 175B and Bloom, across tasks.

The variability in accuracy of the 3 lowest perplexity prompts is also much lower than that of the manually created prompts: with OPT 175B, the average standard deviation within the 3 lowest perplexity prompts (across tasks) is 5.07, vs. 6.86 for the manual prompts, and with Bloom the gap is much bigger, with an average of 2.6 for the 3 lowest perplexity prompts vs. 7.47 for the manual ones.11 This further shows that SPELL is more stable and reliable compared to using an arbitrary set of manually created prompts. SPELL sets the stage for further development in this direction, and serves as an initial indication of the benefits of involving perplexity estimation in the process of generating effective prompts.

8 Related Work

Relation between performance and training data Previous work looking directly into the relation between the training data and the performance is limited. Razeghi et al. (2022) study numeric deduction tasks, and examine the correlations between the model performance on specific test instances and the frequency of terms from those instances in the pretraining data. They find that the models are more accurate on instances whose terms are more prevalent in the training data. Additionally, Han and Tsvetkov (2022) propose a method to effectively identify a very small subset of pretraining data that directly supports the model in performing a specific task.

11 We also calculate the standard deviation when using the same amount of low-perplexity prompts as in the manual prompts set for each task and get averages of 6.32 and 3.78 for OPT 175B and Bloom, respectively.


Elazar et al. (2022) use causal inference to measure the effect of pretraining data statistics on factual knowledge performance, and Kandpal et al. (2022) show correlational and causal relationships between accuracy and relevant document count (from training data) for QA datasets.

Prompt tuning and analysis There is a very rich line of work trying to find prompts automatically. Shin et al. (2020) present an automated method to create discrete prompts for a diverse set of tasks, based on a gradient-guided search, and they demonstrate their method on masked LMs. Other work also focuses on discrete prompts, aiming to improve zero-shot performance (Gao et al., 2021; Le Scao and Rush, 2021; Deng et al., 2022; Shi et al., 2022), or trains continuous prompts (Li and Liang, 2021; Lester et al., 2021; Qin and Eisner, 2021).

On top of works that suggest a variety of methods for creating better prompts, some work also analyzes those prompts to try and get some insights about them: Khashabi et al. (2022a) find that model performance is highly sensitive to small changes in wordings and Khashabi et al. (2022b) point to a surprising disconnect between continuous and discrete prompts.

9 Conclusion

We investigate the phenomenon where some prompts perform better than others despite appearing similar to the human users of LMs. Specifically, we hypothesize that the perplexity of a prompt under a given LM is closely tied to its task performance. We test this theory on a large number of tasks and autoregressive LMs, and the resulting correlation study validates our hypothesis. Further analysis of this relationship demonstrates that the best prompts differ across models, highlighting the importance of model-specific analysis, and that the underlying structure of the relationship between perplexity and performance varies across tasks.

In light of these findings, we then propose a method, SPELL, to help users find well-performing prompts for new tasks. Empirical validation of the proposed procedure shows that SPELL generates effective prompts with low variability in performance, and produces small gains of 1.8 (2.3) accuracy points with OPT (Bloom) over manual prompts. We therefore conclude that SPELL provides a general and interpretable approach for applying LMs to new tasks while requiring minimal human effort, and no labels.

Limitations

Searching for human-readable prompts We limit our search space to human-readable prompts that are fluent and accurately describe the task at hand, as we are primarily motivated in understanding why some relevant prompts work better than others. We do this by using manually created prompts and their automatically created paraphrases. Our findings may not hold when the possible prompt space is expanded to include any token sequence; we leave this direction to future work.

Generality of our analysis and of the SPELL method We perform our analysis on and build our method around specific models, namely OPT and Bloom. Additionally, our study is limited to the specific tasks we experiment with and to English. It is possible that our analysis and SPELL method do not generalize to other pretrained models or tasks; however, we consider models of various sizes and from different sources, and a wide range of tasks to mitigate this risk.

Acknowledgements

We thank Alisa Liu and Orevaoghene Ahia for their help in annotating noisy prompts. We also thank the reviewers for their valuable comments on the paper.

References

Stephen H. Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-David, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Alan Fries, Maged S. Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Xiangru Tang, Mike Tian-Jian Jiang, and Alexander M. Rush. 2022. Promptsource: An integrated development environment and repository for natural language prompts.

Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa-Anke, and Leonardo Neves. 2020. TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. In Proceedings of Findings of EMNLP.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.


Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, and He He. 2022. On the relation between sensitivity and accuracy in in-context learning. arXiv e-prints, pages arXiv–2209.

Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Johannes Dellert, Thora Daneyko, Alla Münch, Alina Ladygina, Armin Buch, Natalie Clarius, Ilja Grigorjew, Mohamed Balabel, Hizniye Isabella Boga, Zalina Baysarova, et al. 2019. Northeuralex: a wide-coverage lexical database of northern eurasia. Language Resources and Evaluation, pages 1–29.

Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric P Xing, and Zhiting Hu. 2022. Rlprompt: Optimizing discrete text prompts with reinforcement learning. arXiv preprint arXiv:2205.12548.

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. 2022. Measuring causal effects of data statistics on language model's 'factual' predictions. arXiv preprint arXiv:2207.14251.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830.

Xiaochuang Han and Yulia Tsvetkov. 2022. Orca: Interpreting prompted language models via locating supporting data evidence in the ocean of pretraining data. arXiv preprint arXiv:2205.12600.

Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. 2021. Surface form competition: Why the highest probability answer isn't always right. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7038–7051.

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2022. Large language models struggle to learn long-tail knowledge. arXiv preprint arXiv:2211.08411.

Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. 2022a. Reframing instructional prompts to gptk's language. In Findings of the Association for Computational Linguistics: ACL 2022, pages 589–612.

Daniel Khashabi, Xinxi Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Sameer Singh, and Yejin Choi. 2022b. Prompt waywardness: The curious case of discretized interpretation of continuous prompts. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Teven Le Scao and Alexander Rush. 2021. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. Dbpedia – a large-scale, multilingual knowledge base extracted from wikipedia. Semantic web, 6(2):167–195.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597.

Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. 2022. Estimating the carbon footprint of bloom, a 176b parameter language model. arXiv preprint arXiv:2211.02001.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.

N. Moniz and L. Torgo. 2018. Multi-source social feedback of online news feeds. ArXiv, abs/1801.07055.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying lms with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212.

Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot reasoning. arXiv preprint arXiv:2202.07206.


Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu, and Yi-Shin Chen. 2018. CARER: Contextualized affect representations for emotion recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687–3697, Brussels, Belgium. Association for Computational Linguistics.
Timo Schick and Hinrich Schütze. 2020. It’s not just
size that matters: Small language models are also
few-shot learners. arXiv preprint arXiv:2009.07118.
Weijia Shi, Xiaochuang Han, Hila Gonen, Ari Holtzman,
Yulia Tsvetkov, and Luke Zettlemoyer. 2022. Toward
human readable prompt tuning: Kubrick’s the shining
is a good movie, and a good prompt too? arXiv
preprint arXiv:2212.10539.
Taylor Shin, Yasaman Razeghi, Robert L Logan IV,
Eric Wallace, and Sameer Singh. 2020. Autoprompt:
Eliciting knowledge from language models with
automatically generated prompts. arXiv preprint
arXiv:2010.15980.
Alex Warstadt, Amanpreet Singh, and Samuel R Bow-
man. 2018. Neural network acceptability judgments.
arXiv preprint arXiv:1805.12471.
John Wieting, Jonathan Mallinson, and Kevin Gimpel.
2017. Learning paraphrastic sentence embeddings
from back-translated bitext. In Proceedings of the
2017 Conference on Empirical Methods in Natural
Language Processing, pages 274–285.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel
Artetxe, Moya Chen, Shuohui Chen, Christopher De-
wan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022.
Opt: Open pre-trained transformer language models.
arXiv preprint arXiv:2205.01068.
Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015.
Character-level convolutional networks for text clas-
sification. In NIPS.

A Manually Created Prompts


Table 11 lists the manually created prompts we use
for the different tasks. We manually add, remove
and edit prompts for some of these tasks, to make
them fit for our setting. For example, the following
prompt for AG News, taken from Promptsource,
does not fit our setting: Would you recommend the
following article to a politician, an athlete, busi-
ness executive, or a scientist?

B Lowest Perplexity Prompts


Table 12 lists the 5 lowest perplexity prompts for
each task, using OPT 175B.


Antonyms
The antonym of the word “good” is “
The opposite meaning of the word “good” is “
“Good” is the opposite of “
“Good” is the negation of “
The following are opposites of each other: “good” and “
The word “good” contradicts the word “
The antonym of the word good is
The opposite meaning of the word good is
Good is the opposite of
Good is the negation of
The following are opposites of each other: good and
The word good contradicts the word

GLUE Cola
Does the this sentence make sense and use correct English?
Is this example grammatically correct and sensible?
Does this sentence make sense and is it grammatically correct?
Does this example use correct English?

Newspop
What is the article about?
What is this news about?
What is the topic of this news piece?
What does this article discuss?
What is the topic of this sentence?
What category does the article belong to?
Pick one category for this news piece.
Pick the category that fits the text.
The article refers to which category?
What topic does the article belong to?
What category fits this article?
What topic does this news piece belong to?
Choose the correct category for this article.

AG News
What label best describes this news article?
What is this piece of news regarding?
Which newspaper section would this article likely appear in?
What topic is this news article about?

IMDB
This movie review expresses what sentiment?
Did the reviewer find this movie good or bad?
Is this review positive or negative?
How does the viewer feel about the movie?
What sentiment does the writer express for the movie?
What sentiment is expressed for the movie?
What is the sentiment expressed in this text?
Did the reviewer enjoy the movie?
What is the sentiment expressed by the reviewer for the movie?
How does the reviewer feel about the movie?

DBpedia
What category does the paragraph belong to?
Pick one category for the text.
Pick the category that fits the text.
The text refers to which category?
What category does the title belong to?
What category fits this text?
What topic does this text belong to?
Choose the correct category for the text.

Emotion
What is the emotion expressed in this message?
What emotion does this message express?
How will you feel about the message?
What emotion does the writer express for the message?

Tweet Offensive
Is this tweet offensive?
Can the tweet be removed for being offensive?
Is the author’s tweet offensive?
Task: Identify if the tweet or text is offensive.
Is this an offensive tweet?

Word-Level Translation
The translation of the word “dog” to French is “
The translation of the word dog to French is
The word “dog” in French is “
“dog” (In French: “
Translate the word dog into French:
The translation of dog to French is
“dog” (French: “
The word dog in French is
Translate the word “dog” into French: “
dog (In French:
dog (French:
The translation of “dog” to French is “

Table 11: The set of manually created prompts for each task.


Antonyms
The following two words are antonyms: “good” and “ | 10.24
The antonym of the word “good” is “ | 10.32
The word that has the opposite meaning of the word “good” is “ | 10.43
The word “good” is the antithesis of the word “ | 10.85
The word “good” is the opposite of the word “ | 11.15

GLUE Cola
Is this an example of the proper use of the English language? | 11.63
Does the sentence make sense and does it follow the rules of grammar? | 11.76
Is this sentence an example of the correct use of the English language? | 12.10
Does this sentence make sense and is it grammatically correct? | 12.15
Is this sentence grammatically correct and does it make sense? | 12.68

Newspop
What is the main subject of the article? | 10.01
What is the main topic of the article? | 10.01
What is the subject matter of the article? | 10.17
What is the subject of the article? | 10.21
What is the main idea of this article? | 10.21

AG News
In what section of the newspaper would you expect to find this article? | 7.51
In which section of the newspaper would you expect to find this article? | 7.52
In which section of the newspaper would this article be most likely to appear? | 7.60
In what section of the newspaper do you expect to find this article? | 7.80
In what section of the newspaper would this article most likely appear? | 7.87

IMDB
What is the opinion of the review? Is it positive or negative? | 7.19
Is this a positive or negative review? | 7.31
What do you think of the movie? | 7.33
What do you think of the film? | 7.35
Is that a positive or a negative? | 7.35

DBpedia
What is the category to which the text refers? | 8.99
What is the subject of the text? | 9.15
What category does the title belong to? | 9.18
Which category does the text refer to? | 9.19
What is the subject of this text? | 9.20

Emotion
How do you feel when you hear this message? | 12.72
What is the writer’s emotional reaction to this news? | 13.18
What is the emotion expressed in this message? | 13.20
How does this message make you feel? | 13.32
How do you feel about this message? | 13.50

Tweet Offensive
If someone said this to you, would you be offended? | 13.00
If someone said that to you, would you be offended? | 13.10
Would you be offended if someone said that to you? | 13.73
Would it offend you if someone said that to you? | 14.79
If someone told you that, would you be offended? | 14.93

Word-Level Translation
The word for “dog” in French is “ | 7.73
The French word for “dog” is “ | 8.16
The French translation of the word “dog” is “ | 8.24
The translation of the word “dog” in French is “ | 8.35
The translation of the word “dog” into French is “ | 8.91

Table 12: The 5 lowest perplexity prompts for each task, using OPT 175B.
