Demystifying Prompts in Language Models Via Perplexity Estimation
Abstract
Language models can be prompted to perform a wide variety of tasks with zero- and few-shot in-context learning. …
…sizes, providing us some insights about the underlying mechanism of prompting (see Figure 1). As a result, we devise a method, SPELL (Selecting Prompts by Estimating LM Likelihood), for creating prompts in an informed manner. We show that using SPELL to choose prompts results in less variability in performance as well as in accuracy gains (1.8 accuracy points with OPT and 2.3 accuracy points with Bloom on average). Importantly, our method does not require labels at all, only a small sample of inputs for the task.

Our contributions can be summarized as follows: (a) we formalize the notion that better familiarity of the model with the prompt correlates with better performance (Section 2); (b) we automatically elaborate a given set of seed prompts using paraphrasing (Section 3); (c) we establish experimentally the hypothesis that lower perplexity of the prompt correlates well with better performance (Section 5); (d) we devise a method to create a more consistent set of prompts, which also improves results even with no labels for the task (Section 7).

2 Why are prompts not all created equal?

Despite the popularity of prompting as a method for using language models (Shin et al., 2020; Li and Liang, 2021; Gao et al., 2021), the cause for the different behavior of various prompts remains unclear so far. Table 1 shows four example prompts for a news topic classification task (AG News) and their respective accuracies when used to prompt OPT 175B (Zhang et al., 2022). The accuracy gap between the different prompts is not trivial, and it is not possible to predict it from the prompt text alone.

Prompt  Accuracy
What is this piece of news regarding?  40.9
What is this article about?  52.4
What is the best way to describe this article?  68.2
What is the most accurate label for this news article?  71.2

Table 1: Example prompts for the task AG News (news classification) that vary considerably in accuracy.

We propose that the more frequently a prompt appears in some variation in the data, the better it works for the task. The intuition behind this is that a sequence that is more expected by the model is more likely to aid the model in extracting the relevant information. However, this premise is hard to measure accurately: most language models use huge amounts of training data (e.g., OPT uses a corpus of roughly 180B tokens, and Bloom uses a corpus of roughly 366B tokens), and in addition, this training data is not always publicly available (e.g., GPT3; Brown et al. 2020). Our initial attempts to estimate exact-match occurrences of prompts in the data resulted in very sparse counts, which led us to look for a softer formalization.1

[1] We experimented with the task of AG News (see Section 4.1), and looked for all of its prompts (using exact match) in the OPT training data. Indeed, only 9/108 of the prompts appear in the training data. Such sparse counts do not allow for any useful or reliable analysis of prompt behaviour.

Instead of considering the training data directly, we propose to focus on the perplexity of the prompt as a proxy for its occurrences, in some form, in the data – essentially indicating to what extent the model expects this prompt. This perplexity-based framing helps to avoid the challenge of exact match in the data, and takes into account variations of the prompt that the model is also exposed to and might be influenced by. In addition, it helps overcome the challenges mentioned above, as it requires neither access to the pretraining data (which is not always publicly available for LMs) nor matching over huge amounts of text.

Hypothesis: lower perplexity correlates with better performance. We hypothesize that, on average, lower-perplexity prompts perform better. We are interested in establishing this hypothesis by experimentally showing a significant negative correlation between the perplexity of the prompt and its performance on the task, across a diverse set of tasks and models.

We define the perplexity of the prompt as the perplexity of the full prompt sequence, including the input itself and excluding the label, averaged over 1,000 examples (see Section 4 for details). The input is a part of the prompt in the case of the word prediction tasks by design (e.g., “The opposite of the word good is”). Inclusion of the task input as part of the prompt for classification tasks as well is intentional: we want to ground the prompt to the task (without the input, we would be testing the hypothesis that prompts with lower perplexity across all tasks work better on every task). The label is not considered a part of the prompt and is not taken into consideration when computing the prompt's perplexity. In practice, this also results in a huge advantage of our method, SPELL (Section 7), which aims to find better prompts—it does not require any labels.

For performance measures, we use the log-likelihood score assigned by the model to the correct label given that prompt. We choose this metric over accuracy as it gives a more fine-grained distinction between prompts, and because accuracy can be unstable, as explained in more detail in Section 4. For classification tasks, we also report correlation with accuracy, which is the main evaluation metric for this type of task.
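As a concrete illustration of this definition, the following Python sketch computes the average perplexity of a prompt over a sample of task inputs with a causal LM from the transformers library. It is a minimal sketch, not the authors' code: gpt2 is used only as a small stand-in for OPT/Bloom, and the prompt template with its {input} placeholder is our own illustrative convention (input before the prompt, no label).

    # Average perplexity of the full prompt sequence (template + task input, no label).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")                  # stand-in for OPT / Bloom
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def prompt_perplexity(prompt_template, task_inputs):
        """Perplexity of prompt_template instantiated with each input, averaged over the sample."""
        ppls = []
        for x in task_inputs:
            text = prompt_template.format(input=x)               # e.g. "{input} What is this article about?"
            ids = tok(text, return_tensors="pt").input_ids
            with torch.no_grad():
                loss = model(ids, labels=ids).loss               # mean token-level cross-entropy
            ppls.append(torch.exp(loss).item())
        return sum(ppls) / len(ppls)

In the paper the average is taken over 1,000 examples per task; the same function applies to word prediction tasks, where the input is part of the prompt by construction.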
3 Automatic Expansion of Seed Prompts

We are interested in expanding our pool of prompts in order to: (a) have a more diverse set of prompts, making it more likely to find a better prompt for our task, and (b) support better analysis to validate our prompt quality hypothesis. In this section, we describe our method for automatically expanding a seed set of manually created prompts using paraphrasing.

Step 0: Creating a seed set of manually-written prompts  We first write/collect a small set of human written prompts that describe the task. For classification tasks we assume that the input appears before the prompt, with no choices appearing as part of the prompt (to help in smooth paraphrasing of the prompt itself).

Step 1: Paraphrasing with GPT3  We use the text-davinci-002 version of GPT3 (Brown et al., 2020) to generate paraphrases for each of the manual prompts in our seed set. We prompt it with a meta-prompt for paraphrasing to generate variations of one of our seed prompts. An example of such a meta-prompt is: Write a paraphrase for the following sentence: <seed prompt> Paraphrase:. The 7 meta-prompts used in this step are listed in Table 2.

We choose GPT3 as our paraphrasing model because of its well-documented generation abilities. This is also to ensure that there is a separation between the model we use to create the prompts and the models we use to rank them (OPT and Bloom, see Section 4 for details), to avoid confounding the experimental setup.

Step 2: Paraphrasing using backtranslation  Our second step takes as input the paraphrases from GPT3 (in addition to the seed set of prompts) and translates them into different languages and back into English to get additional prompt paraphrases (Wieting et al., 2017). We use a set of 8 languages available in the NLLB translation model (Costa-jussà et al., 2022) that are relatively high resource and close to English,2 to reduce the risk of noise. Since we aim to get about 100 prompts per task, we add 8 additional languages3 in the case where the basic 8 languages yielded too few alternatives. For word prediction tasks, we use the sequence of the created prompt up to the index of the label, not including the label, for example: The word “dog” in French is “. Depending on the task, we enforce the existence of specific words (e.g., the name of the language, and the source word, in word-level translation) or enforce the prompt to be a question.

[2] Danish, German, Italian, French, Dutch, Portuguese, Swedish, Spanish.
[3] Norwegian, Romanian, Catalan, Turkish, Ukrainian, Polish, Russian, Arabic.

Examples and Statistics  Table 4 lists all 4 manually created prompts we use for the AG News task (news classification), alongside a few sampled prompts created automatically using our method. As was typically the case, we are able to get prompts that are rather different in phrasing and structure from those included in the seed set.

The statistics of the prompts in the manually created seed set (Step 0) as well as the prompts after Step 1 and Step 2 for each task (see Section 4.1 for details about the tasks) are detailed in Table 3.
Meta prompts
Write a paraphrase for the following sentence: <seed-prompt> Paraphrase:
<seed-prompt> Paraphrase:
Write a likely paraphrase of the text: <seed-prompt> Paraphrase:
Write a sentence similar to the following one: <seed-prompt> Paraphrase:
Paraphrase the following sentence: <seed-prompt> Paraphrase:
Write a variation of this sentence: <seed-prompt>
How would you say the following sentence in a different way? <seed-prompt>
Table 2: Meta prompts used in Step 1 of our method for paraphrasing using GPT3.
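To make the two expansion steps concrete, here is a minimal Python sketch of the pipeline described above. It is not the authors' code: the calls to the paraphrasing LM and to the translation model are left as injected functions (paraphrase_fn, backtranslate_fn), since the exact GPT3 and NLLB client code is not given in the paper, META_PROMPTS shows only a subset of the templates from Table 2, and keep_fn stands in for the task-specific filters mentioned in Step 2.

    # Sketch of Section 3 (Steps 1-2): expand a seed set of prompts by LM paraphrasing
    # and round-trip translation. Model calls are injected so any backend can be used.
    META_PROMPTS = [
        "Write a paraphrase for the following sentence: {seed} Paraphrase:",
        "{seed} Paraphrase:",
        "Paraphrase the following sentence: {seed} Paraphrase:",
    ]  # subset of the 7 meta prompts in Table 2

    PIVOT_LANGUAGES = ["dan", "deu", "ita", "fra", "nld", "por", "swe", "spa"]  # footnote 2

    def expand_prompts(seed_prompts, paraphrase_fn, backtranslate_fn, keep_fn=lambda p: True):
        """Return a deduplicated pool of prompt candidates built from the seed set.

        paraphrase_fn(text) -> str: completion of a paraphrasing meta-prompt (Step 1).
        backtranslate_fn(text, lang) -> str: English -> lang -> English round trip (Step 2).
        keep_fn(prompt) -> bool: task-specific filter, e.g. require certain words or a question.
        """
        pool = set(seed_prompts)
        # Step 1: LM paraphrasing with each meta-prompt.
        for seed in seed_prompts:
            for meta in META_PROMPTS:
                pool.add(paraphrase_fn(meta.format(seed=seed)).strip())
        # Step 2: backtranslate everything collected so far through each pivot language.
        for prompt in list(pool):
            for lang in PIVOT_LANGUAGES:
                pool.add(backtranslate_fn(prompt, lang).strip())
        return sorted(p for p in pool if keep_fn(p))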
Task  # Step 0  # Step 1  # Step 2
Word-Level Translation  12  59  118
Antonyms  12  85  176
GLUE Cola  4  27  144
Newspop  13  43  119
AG News  4  23  108
IMDB  10  45  178
DBpedia  8  23  103
Emotion  4  14  94
Tweet Offensive  5  41  119

Table 3: Number of prompts for the different tasks: prompts after step 0 (creating prompts manually), prompts after step 1 (GPT3 paraphrasing), and prompts after step 2 (backtranslation).

4 Experimental Setup

4.1 Models, Tasks and Datasets

We study four auto-regressive models: OPT (Zhang et al., 2022) of different sizes (1.3B, 30B, 175B parameters), all trained mainly on English,4 and Bloom (176B parameters; Luccioni et al. 2022), which is trained on 46 natural languages and 13 programming languages. We experiment with two types of tasks: word prediction tasks and classification tasks, as detailed below.

[4] As stated in the paper, the training corpora were previously collected or filtered to contain predominantly English text, but a small amount of non-English data is still present within the corpus via CommonCrawl.

Word Prediction Tasks  The first task in this category is word-level translation. Given a source word in English and a target language, we expect the model to predict the correct translation. For this task we use NorthEuraLex5 (Dellert et al., 2019), a lexical database providing translations of 1016 words into 107 languages. We experiment with 9 languages that use the Latin script. For Bloom, we use 5 additional languages that do not use the Latin script (since Bloom is multilingual). Note that only 5 of the languages we experiment with are officially covered by Bloom.6

[5] http://northeuralex.org/

We also consider antonym prediction where, given a word, the model is expected to predict its antonym. For this task, we use data from Kaggle,7 which is based on WordNet (Miller, 1995). We choose 1,000 word pairs at random.

Classification Tasks  …tions; Saravia et al. 2018); (g) Tweet Offensive (classification to offensive vs. not offensive tweets; Barbieri et al. 2020). We use 1,000 random examples from each dataset.

The full set of manual prompts is listed in Section A in the Appendix. In these tasks, the prompt follows the input, and at the end of each prompt we add the choices of classes, i.e., we provide the possible labels explicitly in the prompt by listing the possible answers as defined by the dataset itself: “Choices: X, Y, Z. Answer:”, as we find it helps in terms of accuracy. Defining the label space likely helps in our zero-shot setting because there are no previous demonstrations from which the model can learn the possible classes. Additionally, adding class options to the prompt helps to reduce the effect of the surface form competition (Holtzman et al., 2021). The option of generating the answer and comparing it with the gold label was not reasonable here, since we cannot expect the model to generate the exact label as the first choice often enough.
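As an illustration of this input-then-prompt-then-choices format, the snippet below assembles a zero-shot classification query. It is a hypothetical example: the article text and exact spacing are ours, not taken from the paper; only the AG News label names and the "Choices: … Answer:" suffix follow the description above.

    # Hypothetical zero-shot classification prompt in the format described above:
    # task input first, then the prompt, then the explicit label choices.
    choices = ["World", "Sports", "Business", "Sci/Tech"]               # AG News label names
    article = "Stocks rallied on Friday after a strong jobs report."    # toy input
    prompt = (
        f"{article} What is this article about? "
        f"Choices: {', '.join(choices)}. Answer:"
    )
    print(prompt)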
Table 4: Prompts for the task AG News (news classification): the manually created prompts and a sample of
automatically created prompts using our method.
Task  Perplexity-score corr. (Pearson / Spearman)  Perplexity-acc corr. (Pearson / Spearman)  Avg Acc  Acc 50%

OPT 175B
Antonyms  **-0.41  **-0.53  –  –  –  –
GLUE Cola  -0.15  -0.14  -0.04  -0.02  47.7  57.1
Newspop  *-0.24  **-0.26  *-0.20  -0.18  66.4  72.9
AG News  **-0.63  **-0.68  **-0.77  **-0.81  57.5  68.7
IMDB  **0.35  **0.40  0.14  *0.20  86.2  91.0
DBpedia  **-0.50  **-0.44  **-0.51  **-0.42  46.7  55.2
Emotion  -0.14  -0.19  **-0.30  **-0.32  16.4  23.0
Tweet Offensive  *-0.19  0.07  0.18  *0.23  51.3  55.8

Bloom 176B
Antonyms  **-0.37  **-0.23  –  –  –  –
GLUE Cola  0.07  0.11  **-0.25  **-0.26  55.5  65.6
Newspop  **-0.50  **-0.42  **-0.59  **-0.51  78.9  87.8
AG News  **-0.62  **-0.54  **-0.44  **-0.44  50.3  59.4
IMDB  0.04  0.09  -0.08  -0.14  89.3  92.2
DBpedia  **-0.47  *-0.27  **-0.35  *-0.21  27.2  33.4
Emotion  **-0.33  **-0.42  **-0.48  **-0.55  29.3  31.7
Tweet Offensive  0.14  *0.24  *-0.20  -0.03  41.6  46.2

OPT 30B
Antonyms  **-0.54  **-0.70  –  –  –  –
GLUE Cola  -0.05  0.03  -0.13  0.02  32.2  35.5
Newspop  *-0.23  *-0.25  *-0.18  -0.12  60.3  66.6
AG News  **-0.66  **-0.71  **-0.81  **-0.80  49.3  60.7
IMDB  -0.06  *0.17  0.04  **0.22  81.6  86.1
DBpedia  **-0.41  **-0.34  *-0.21  *-0.25  35.9  42.4
Emotion  0.00  -0.03  0.18  0.13  12.3  16.2
Tweet Offensive  **-0.44  **-0.39  -0.11  -0.05  54.6  60.2

OPT 1.3B
Antonyms  **-0.45  **-0.53  –  –  –  –
GLUE Cola  **-0.39  **-0.36  -0.09  *-0.19  60.3  65.9
Newspop  **0.33  *0.21  -0.07  -0.07  37.6  40.3
AG News  **-0.33  **-0.29  **-0.56  **-0.49  31.9  37.6
IMDB  -0.11  -0.07  **0.24  **0.22  86.0  89.1
DBpedia  -0.16  -0.14  -0.02  -0.01  8.7  9.2
Emotion  0.08  0.08  **-0.29  **-0.30  7.0  9.1
Tweet Offensive  **-0.42  **-0.35  **-0.50  **-0.38  58.6  62.6
Table 5: Correlation results for the different tasks, with OPT (different sizes) and Bloom. Correlations with p < 0.05
are marked with *. Correlations with p < 0.00625 (according to Bonferroni correction for multiple hypotheses) are
marked with **. Dark and light blue colored cells stand for negative correlations < −0.2 and > −0.2, respectively.
Dark and light orange colored cells stand for positive correlations > 0.2 and < 0.2, respectively. Average accuracy
across all prompts and average accuracy of best 50% prompts are also reported for reference (Avg Acc and Acc
50%, respectively).
… scores.

For the word prediction tasks we only report scores, since accuracy in general is less stable, suffers more from the surface form competition (Holtzman et al., 2021), and is usually quite low for these tasks in our setting (the chances the model will generate an exact match of the label are low). Hence, the score of the correct label gives a better estimate of the actual performance of the model.
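The sketch below shows one way to compute the two quantities that are correlated in the experiments: the log-likelihood score of the gold label given a prompt, and the Pearson/Spearman correlations (with a Bonferroni-style threshold) between per-prompt perplexities and per-prompt scores. It is a simplified illustration, again with gpt2 as a stand-in for OPT/Bloom, and the token-boundary handling for the label is deliberately naive.

    # Correlating prompt perplexity with the log-likelihood score of the correct label.
    import torch
    import torch.nn.functional as F
    from scipy.stats import pearsonr, spearmanr
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def label_score(prompt, label):
        """Sum of log-probabilities assigned to the gold label tokens, conditioned on the prompt.
        The label should start with a leading space so its tokenization stays stable."""
        prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
        full = tok(prompt + label, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full).logits
        log_probs = F.log_softmax(logits[0, :-1], dim=-1)   # row i predicts token i+1
        targets = full[0, 1:]
        rows = range(prompt_len - 1, full.shape[1] - 1)      # rows that predict the label tokens
        return sum(log_probs[i, targets[i]].item() for i in rows)

    def correlations(perplexities, scores, n_tasks=8):
        """Pearson/Spearman correlation between per-prompt perplexity and performance,
        with a Bonferroni-corrected significance threshold over n_tasks hypotheses."""
        pearson_r, pearson_p = pearsonr(perplexities, scores)
        spearman_r, spearman_p = spearmanr(perplexities, scores)
        alpha = 0.05 / n_tasks   # e.g. 0.00625 for 8 tasks, as in Table 5
        return pearson_r, spearman_r, pearson_p < alpha, spearman_p < alpha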
Lang  OPT 175B (Pearson / Spearman)  Bloom 176B (Pearson / Spearman)
ita  -0.44  -0.57  -0.37  -0.63
spa  -0.47  -0.61  -0.51  -0.66
cat  -0.47  -0.58  -0.24  -0.31
fra  -0.48  -0.57  -0.48  -0.64
deu  -0.44  -0.60  -0.46  -0.65
fin  -0.44  -0.62  -0.34  -0.56
por  -0.45  -0.62  -0.46  -0.61
eus  -0.47  -0.61  -0.45  -0.61
tur  -0.44  -0.62  -0.33  -0.62
jpn  –  –  -0.33  -0.26
arb  –  –  -0.36  -0.47
rus  –  –  -0.54  -0.69
kor  –  –  -0.42  -0.58
ell  –  –  -0.40  -0.51

Table 6: Correlation results for word-level translation, with OPT 175B and Bloom 176B. All correlations are statistically significant also according to Bonferroni correction for multiple hypotheses for OPT (p < 0.0055). Same for Bloom (p < 0.00357), except for Catalan (Pearson) and Japanese (Spearman).

5 Results

Classification Tasks and Antonym Prediction  Table 5 depicts the Pearson and Spearman correlation results on the classification tasks and the antonym task, with both OPT 175B and Bloom (two upper blocks). We see that most correlations are negative and statistically significant, as we expect. This validates our hypothesis and shows that in the majority of tasks we indeed get a strong correlation between low perplexity of the prompt and better performance on the task.10 For each task we also report the average accuracy.

[10] Repeating the experiments with the length of the prompt instead of perplexity yields weak positive correlations, almost all of which are not statistically significant.

Word-Level Translation  The results of the word-level translation task are reported in Table 6. Here the correlations are extremely consistent across all languages and across models, with statistical significance for all languages except for Catalan and Japanese (in Bloom).

Results across Different Model Sizes  We repeat the same experiment with the OPT models of sizes 1.3B and 30B, to investigate whether these correlations are also consistent across model sizes or whether this is a phenomenon we should expect only in large language models. Table 5 (two lower blocks) shows these results for all classification tasks and antonym prediction. We do see that in general the trend appears to be the same in the smaller models as well; however, the correlations seem to be slightly weaker. We hypothesize that this might be due to the overall lower performance of these smaller models, making the performance results we use for correlation less stable and reliable. For word-level translation, however, all correlations with the 30B and 1.3B models are similar to those with the 175B model, and are all statistically significant (also after Bonferroni correction for multiple hypotheses).

6 Analysis

Next, we further explore the observed relationship between model perplexity and prompt performance. Despite the consistently high correlation between these two factors, the structure of this relationship varies across tasks (Section 6.1). Additionally, we find that the automatically added prompts are high-quality and not a significant source of noise (Section 6.2), and that the best prompts selected by our approach vary across models (Section 6.3).

6.1 Visualizing the Relationship between Perplexity and Performance

To visualize the correlations we get between the perplexity and the performance of the prompts across the different settings, we plot a few examples for different tasks and languages. Figures 1 and 2 show some of the results for selected tasks, as detailed in the captions. The negative trend of the correlation is clearly visible in all plots. Interestingly, the structure of the plots for word-level translation is very similar across all the language pairs, suggesting that prompts get consistent perplexity and performance across languages (possibly at different scales). Indeed, the intersection of the 10 lowest perplexity prompts between any two different languages is 8.6 and 8.4 on average (for OPT 175B and Bloom, respectively), which is extremely high. This is not very surprising since we know that the only differences between the prompts in the different languages are the names of the target languages (e.g., The word for “dog” in French is “). Additionally, the intersection of the 10 prompts with the highest label score between any two different languages is 7 and 6.5 on average (for OPT 175B and Bloom, respectively).

A notable finding that appears in the word-level translation plots is the clear separation between prompts that include or do not include quotation
marks for the label (usually aligns with whether the prompt uses quotation marks for the source word) – three example prompts appear on the plot. Prompts with quotation marks for the words tend to have both lower perplexity and better performance, consistently. We further analyze the results for OPT 175B within clusters (with/without quotation marks). In the cluster with quotation marks, we get negative correlations (in the range of –0.28 to –0.38) that are statistically significant for almost all languages. The correlations within the other cluster are weaker and less significant (this is expected given the overall lower performance of that cluster).

Prompt  ppl
Is this example correct English usage?  25.79
Is this example using English correctly?  25.46
Is this example correct English?  25.33
Is this the example in correct English?  25.00
Is English in this example correct?  24.90

Table 7: Example of the 5 highest perplexity prompts for GLUE Cola, using OPT 175B.

Task  Lang  Before filtering (Pearson / Spearman)  After filtering (Pearson / Spearman)
AG News  -  -0.63  -0.68  -0.62  -0.54
WLT  ita  -0.44  -0.58  -0.44  -0.57
WLT  spa  -0.47  -0.61  -0.47  -0.61
WLT  cat  -0.45  -0.57  -0.47  -0.58
WLT  fra  -0.47  -0.57  -0.48  -0.57
WLT  deu  -0.43  -0.60  -0.44  -0.60
WLT  fin  -0.41  -0.60  -0.44  -0.62
WLT  por  -0.43  -0.61  -0.45  -0.62
WLT  eus  -0.45  -0.60  -0.47  -0.61
WLT  tur  -0.43  -0.61  -0.44  -0.62
…ity prompts between OPT 175B and Bloom is 7.1 on average, across the classification tasks. When looking at the 10 highest accuracy prompts across models we get an average intersection of 3.1 across the classification tasks.

Task  OPT (low-ppl / manual / ∆)  Bloom (low-ppl / manual / ∆)
GLUE Cola  51.7  48.5  3.1  64.5  60.9  3.6
Newspop  80.6  70.4  10.2  90.0  80.0  10.0
AG News  68.4  61.9  6.5  51.0  63.5  -12.5
IMDB  90.4  88.9  1.4  91.3  88.8  2.5
DBpedia  46.0  51.7  -5.7  31.2  30.2  1.0
Emotion  21.6  22.6  -1.1  35.8  32.1  3.6
Tweet Offensive  48.4  50.6  -2.3  48.6  40.8  7.8

Table 10: The average accuracy with the manual prompts (manual) compared to the average accuracy with the 3 lowest-perplexity prompts (low-ppl), for both OPT 175B and Bloom, across tasks.

7 SPELL: Selecting Prompts by Estimating LM Likelihood

The primary contribution of this work is the analysis of the relationship between prompt perplexity and downstream task performance (Section 5). As one potential application of our findings, we also present a new method, SPELL, for generating and selecting consistently effective prompts.

Assuming a fixed computational budget for finding effective prompts for a given task, and that the search space might be quite large, we devise the following straightforward procedure:

1. Obtain a small set of manually created prompts for the task.

2. Expand the set of prompts with automatic paraphrasing using an LM (e.g., GPT3) and backtranslation (see Section 3).

3. Rank the list of prompts by perplexity (averaged on a representative sample of task inputs, e.g., 1,000).

4. Choose the k (e.g., 3) lowest perplexity prompts.
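A minimal Python sketch of this procedure is given below. It assumes an expand_fn implementing the paraphrasing/backtranslation expansion of Section 3 and a ppl_fn returning the average perplexity of a prompt over a sample of task inputs (for example, the helpers sketched earlier); both are passed in as arguments rather than hard-coded, since the paper does not prescribe a particular implementation.

    def spell_select(seed_prompts, task_inputs, expand_fn, ppl_fn, k=3, sample_size=1000):
        """SPELL: expand the seed prompts, rank candidates by average perplexity on a
        label-free sample of task inputs, and keep the k lowest-perplexity prompts."""
        candidates = list(dict.fromkeys(expand_fn(seed_prompts)))  # Step 2: expand + deduplicate
        sample = task_inputs[:sample_size]                          # Step 3: representative inputs, no labels
        ranked = sorted(candidates, key=lambda p: ppl_fn(p, sample))
        return ranked[:k]                                           # Step 4: k lowest-perplexity prompts

Note that nothing in this loop touches gold labels, which is what makes the selection applicable when only unlabeled task inputs are available.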
Using this algorithm, we show empirically that it is best to prioritize experimenting with the lowest perplexity prompts, as they are more stable (exhibit less variation in performance) and perform better than manual prompts on average. This method also does not require any labels for the task, and is applicable to any task, including by non-experts, given example inputs only.

7.1 Empirical Validation of SPELL

To show the effectiveness of our method, we report the results we get using SPELL across the different tasks. In Table 10 we report the average accuracy with the manual prompts compared to the average accuracy with the 3 lowest-perplexity prompts, for both OPT 175B and Bloom. Indeed, in most cases, the average accuracy using the 3 lowest perplexity prompts outperforms the average accuracy of the manual prompts, with an average gain of 1.8 accuracy points across tasks with OPT and 2.3 accuracy points with Bloom, demonstrating the effectiveness of our method.

The variability in accuracy of the 3 lowest perplexity prompts is also much lower than that of the manually created prompts: with OPT 175B, the average standard deviation within the 3 lowest perplexity prompts (across tasks) is 5.07, vs. 6.86 for the manual prompts, and with Bloom the gap is much bigger, with an average of 2.6 for the 3 lowest perplexity prompts vs. 7.47 for the manual ones.11 This further shows that SPELL is more stable and reliable compared to using an arbitrary set of manually created prompts. SPELL sets the stage for further development in this direction, and serves as an initial indication of the benefits of involving perplexity estimation in the process of generating effective prompts.

[11] We also calculate the standard deviation when using the same amount of low-perplexity prompts as in the manual prompts set for each task and get averages of 6.32 and 3.78 for OPT 175B and Bloom, respectively.

8 Related Work

Relation between performance and training data  Previous work looking directly into the relation between the training data and the performance is limited. Razeghi et al. (2022) study numeric deduction tasks, and examine the correlations between the model performance on specific test instances and the frequency of terms from those instances in the pretraining data. They find that the models are more accurate on instances whose terms are more prevalent in the training data. Additionally, Han and Tsvetkov (2022) propose a method to effectively identify a very small subset of pretraining data that directly supports the model in performing a specific task. Elazar et al. (2022) use causal inference to measure the effect of pretraining data statistics on factual knowledge performance, and…
References

Yanda Chen, Chen Zhao, Zhou Yu, Kathleen McKeown, and He He. 2022. On the relation between sensitivity and accuracy in in-context learning. arXiv e-prints, arXiv–2209.

Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Johannes Dellert, Thora Daneyko, Alla Münch, Alina Ladygina, Armin Buch, Natalie Clarius, Ilja Grigorjew, Mohamed Balabel, Hizniye Isabella Boga, Zalina Baysarova, et al. 2019. NorthEuraLex: a wide-coverage lexical database of Northern Eurasia. Language Resources and Evaluation, pages 1–29.

Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric P. Xing, and Zhiting Hu. 2022. RLPrompt: Optimizing discrete text prompts with reinforcement learning. arXiv preprint arXiv:2205.12548.

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. 2022. Measuring causal effects of data statistics on language model's 'factual' predictions. arXiv preprint arXiv:2207.14251.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830.

Xiaochuang Han and Yulia Tsvetkov. 2022. ORCA: Interpreting prompted language models via locating supporting data evidence in the ocean of pretraining data. arXiv preprint arXiv:2205.12600.

Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. 2021. Surface form competition: Why the highest probability answer isn't always right. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7038–7051.

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2022. Large language models struggle to learn long-tail knowledge. arXiv preprint arXiv:2211.08411.

Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. 2022a. Reframing instructional prompts to GPTk's language. In Findings of the Association for Computational Linguistics: ACL 2022, pages 589–612.

Daniel Khashabi, Xinxi Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Sameer Singh, and Yejin Choi. 2022b. Prompt waywardness: The curious case of discretized interpretation of continuous prompts. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Teven Le Scao and Alexander Rush. 2021. How many data points is a prompt worth? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597.

Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. 2022. Estimating the carbon footprint of BLOOM, a 176B parameter language model. arXiv preprint arXiv:2211.02001.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

George A. Miller. 1995. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41.

N. Moniz and L. Torgo. 2018. Multi-source social feedback of online news feeds. arXiv, abs/1801.07055.

Guanghui Qin and Jason Eisner. 2021. Learning how to ask: Querying LMs with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5203–5212.

Yasaman Razeghi, Robert L. Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot reasoning. arXiv preprint arXiv:2202.07206.
Table 11: The set of manually created prompts for each task.
Table 12: The 5 lowest perplexity prompts for each task, using OPT 175B.