Abstract

In this paper, we study the impact of large language model prompting on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics cannot reliably evaluate GPT-3 summaries. Finally, we evaluate models on a setting beyond generic summarization, specifically keyword-based summarization, and show how dominant fine-tuning approaches compare to prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and prompt-based models across 4 standard summarization benchmarks, and (b) 1K human preference judgments comparing different systems for generic- and keyword-based summarization.[1]

[1] All data available at: https://tagoyal.github.io/zeroshot-news-annotations.html

[Figure 1: Examples of GPT-3 summaries. We can generate summaries following style constraints or queries included in the prompts, allowing us to emulate a range of existing fine-tuned systems. The pictured example pairs a news article with a keyword-constrained prompt, "Summarize the above article briefly focusing on Alina Habba.", and the corresponding GPT-3 summary about Alina Habba.]

1 Introduction

Fine-tuning pre-trained models on domain-specific datasets has been the leading paradigm in text summarization research in recent years (Lewis et al., 2020; Zhang et al., 2020; Raffel et al., 2020). These models generate high-quality summaries on standard benchmarks, but still require sizeable training datasets to adapt to new settings, e.g., summarizing data from a new source domain or producing a summary in a different style. The success of prompting large language models (GPT-3 (Brown et al., 2020), T0 (Sanh et al., 2022), PaLM (Chowdhery et al., 2022), etc.) provides an alternative approach, namely learning from natural language task instructions and/or a few demonstrative examples in the context without updating model parameters. While recent work (Zhao et al., 2021; Min et al., 2022; Ye and Durrett, 2022) has evaluated this paradigm across a number of tasks, it has only been studied for text summarization with unreliable automatic metrics (He et al., 2022b; Chowdhery et al., 2022; Ouyang et al., 2022) or in non-standard settings (Saunders et al., 2022).

In this paper, we conduct the first systematic study of the impact of prompt-based models on the text summarization research space, using an Instruct-tuned 175B GPT-3 model (text-davinci-002) (Brown et al., 2020; Ouyang et al., 2022) as a case study. Figure 1 shows that GPT-3 summaries are extremely high-quality and adaptable to different summarization settings. Starting from these observations, we aim to answer three main questions. First, how do prompt-based GPT-3 summaries compare to those obtained from
state-of-the-art fine-tuned summarization models (Zhang et al., 2020; Liu et al., 2022)? We compare these approaches using A/B testing on a new corpus of recent news articles, and find that our study participants overwhelmingly prefer GPT-3 summaries across two different "styles" with different prompts (three-sentence and single-sentence). Moreover, these summaries do not suffer from limitations due to low-quality training data that plague fine-tuned generic summarization models (Maynez et al., 2020; Goyal et al., 2022).

Dataset     | Avg. words (article) | Avg. words (summary) | % novel 1-grams | % novel 2-grams
CNN         | 760.5                | 45.7                 | 16.7            | 54.3
DailyMail   | 653.3                | 54.6                 | 17.0            | 53.8
XSum (BBC)  | 431.1                | 23.2                 | 35.7            | 82.4
Newsroom    | 658.6                | 26.7                 | 18.9            | 47.5

Table 1: Basic statistics of standard summarization datasets: CNN/DM (Hermann et al., 2015; Nallapati et al., 2016), XSum (Narayan et al., 2018), Newsroom (Grusky et al., 2018). These show large variance in their summary properties and fundamentally differ in their definition of the "gold" standard.
Second, are existing automatic metrics well-suited to evaluating prompt-based summaries? Recent work has shown that classic reference-based metrics such as ROUGE (Lin, 2004) and BERTScore (Zhang* et al., 2020) are unreliable when small improvements are reported (Peyrard, 2019; Fabbri et al., 2021); however, large differences, on the order of say 5 ROUGE points or greater, are considered to be correlated with human preferences (Bhandari et al., 2020; Deutsch et al., 2022). However, we find that the same is no longer true when evaluating GPT-3 summaries. These summaries score much lower on automatic metrics (7 ROUGE-L points on average) than all prior state-of-the-art models while comfortably outperforming them on human evaluation. Furthermore, we show that recent reference-free metrics, e.g. QA-based metrics (Fabbri et al., 2022; Durmus et al., 2020) and trained factuality models (Kryscinski et al., 2020; Goyal and Durrett, 2020), similarly fail to adapt to this shift away from the fine-tuning paradigm.

2 Models and Setup

2.1 Current Paradigms for Summarization

Recent zero- and few-shot prompting based models (Brown et al., 2020; Sanh et al., 2022) have shown impressive generalization capabilities on unseen tasks specified using prompts alone, without performing any gradient updates (Mishra et al., 2022). In this work, we want to compare their text summarization performance against the current state-of-the-art models. Figure 2 situates these paradigms, contrasting their training with inference-time usage.

[Figure 2: The space of models compared in this work: task-specific models fine-tuned on summarization datasets and trained for each dataset (e.g. PEGASUS, CTRLSum, BRIO), models instruction-tuned on multiple tasks (e.g. FLAN, InstructGPT / text-davinci-002), and zero-shot prompting of pre-trained LMs not trained on summarization datasets (e.g. PaLM, Turing-NLG), the latter being unavailable or less effective than prompting instruction-tuned counterparts.]
In this work, we compare the summarization performance of three models that are representative of this space of options:

1. OpenAI's text-davinci-002, a GPT-3 model (Brown et al., 2020) from the Instruct series (Ouyang et al., 2022). While we do not know the exact training details for this release of the model, the previous model in the series (text-davinci-001) was fine-tuned on a combination of prompts submitted to their API and labeler-written prompts spanning multiple tasks. These tasks include summarization, but not (to our knowledge) standard summarization datasets like CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) or XSum (Narayan et al., 2018). We choose the text-davinci-002 version for our experiments in order to benchmark the best available prompt-based model.[2] We refer to this approach as GPT3-D2.

2. BRIO (Liu et al., 2022), a fine-tuned summarization model that reports state-of-the-art results on both CNN/DM and XSum. We will use versions of this model fine-tuned on each of these two datasets.

3. T0 (Sanh et al., 2022), a prompt-based model fine-tuned on multiple tasks including standard summarization datasets. This provides a useful point of comparison between task-specific fine-tuned models (BRIO) and bigger instruction-tuned models (GPT3-D2).

[2] We did not observe obvious quality differences in generated summaries between text-davinci-001 and text-davinci-002. Examples are included in Appendix C.

2.2 Using GPT3-D2 for summarization

Fine-tuned models largely follow the "style" of reference summaries in their training data, and hence, generated summaries show large variance between datasets (see Table 1 for basic summary statistics of standard summarization datasets). To ensure fair comparison between these and GPT3-D2, we adapt the latter's prompt to align with dataset-specific styles.

Specifically, we follow prior work (Sanh et al., 2022) and use sentence-count length prompts to adapt to each dataset. Although these datasets also differ along other attributes, e.g. CNN/DM is lead-biased whereas XSum requires drawing inferences from a whole article, we do not attempt to control any other attributes of the summary. Figure 3 shows an example of different-length GPT3-D2 summaries for the same news article, using the following prompt format:

    Article: {{article}}
    Summarize the above article in N sentences.

[Figure 3: Illustration of length control using the task description / prompt for GPT3-D2. We found that the generated summaries followed the given sentence length constraint 98% of the time, allowing us to generate different length summaries emulating different datasets.]

We found that GPT3-D2 summaries faithfully follow the given length constraint in 98% of the test instances used in our human study data in Section 3.
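As a concrete illustration, the sentence-count prompting setup described above can be reproduced with a few lines of code. The sketch below assumes the legacy OpenAI completions client and illustrative decoding parameters; it is not the exact configuration used to generate the GPT3-D2 summaries studied here.

```python
# Minimal sketch of sentence-count prompting with text-davinci-002 (GPT3-D2).
# Assumes the legacy `openai` completions client; max_tokens and temperature
# are illustrative choices, not the paper's exact settings.
import openai
import nltk

nltk.download("punkt", quiet=True)

PROMPT_TEMPLATE = "Article: {article}\n\nSummarize the above article in {n} sentences."

def gpt3_d2_summarize(article: str, n_sentences: int) -> str:
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=PROMPT_TEMPLATE.format(article=article, n=n_sentences),
        max_tokens=256,
        temperature=0.0,
    )
    return response["choices"][0]["text"].strip()

def follows_length_constraint(summary: str, n_sentences: int) -> bool:
    # The sentence-count constraint is respected in roughly 98% of cases;
    # this check makes that measurable.
    return len(nltk.sent_tokenize(summary)) == n_sentences

# N = 3 emulates CNN/DM-style summaries and N = 1 emulates XSum-style ones.
```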
Given this setup, we first compare the summary quality of the three summarization models through a human annotation study (Section 3). Then, we evaluate the current suite of summarization metrics for prompt-based summarization (Section 4). Finally, in Section 5, we briefly discuss GPT3-D2 performance on summarization tasks beyond generic summarization and new challenges.

3 Human evaluation of GPT3-D2 summaries

Generated summaries of fine-tuned models (Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) emulate gold-standard summaries in their training datasets. In contrast, prompt-based GPT3-D2 models generate summaries based on how the given task description surfaces behavior learned during pre-training or instruction-tuning. In this section, we ask: how do these paradigms compare? Does learning from gold summaries lead to a better summarization model? To answer this, we conduct a human study to compare outputs of our 3 representative models and collect human preferences of quality.
[Figure 4: Examples of CNN-style and BBC/XSum-style summaries for the three systems. For CNN, we observe that models fine-tuned on the CNN/DM training set reflect its dataset bias: summaries are highly extractive, specific and lead-biased. On the other hand, GPT3-D2 summaries contain fewer specific details but cover more content.]

3.1 Experimental Setup

Datasets for fine-tuning We choose two standard fine-tuning datasets whose summaries differ along multiple dimensions such as length and abstractiveness:

1. CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) contains reference summaries that are approximately 3-4 sentences long. Summaries in this dataset are highly extractive and lead-biased.

2. XSum (Narayan et al., 2018) contains 1-sentence summaries of BBC news articles. In this dataset, reference summaries, and consequently generated summaries from fine-tuned models, are highly abstractive.

Datasets for evaluation Because GPT3-D2's pre-training and instruction-tuning datasets are unknown, it may have been trained on existing articles and summaries in the test splits of these standard benchmarks. We therefore run our human study on 100 recent articles from CNN[3] and BBC, collected between March 1, 2022 and June 31, 2022. We call these CNN-2022 and BBC-2022 respectively.

[3] Although the BRIO CNN/DM model also includes DailyMail data in its training, we do not use this news source in our study as it is now widely considered to be unreliable; e.g., according to the Media Bias / Fact Check site, DailyMail's factual reporting is rated "low": https://mediabiasfactcheck.com/daily-mail/.

Model details We use the publicly released BRIO-XSum and BRIO-CNN/DM models to generate summaries.[4] For T0, we use a prompt we selected from its prompt repository[5] for the CNN/DM and XSum datasets. Finally, to generate GPT3-D2 summaries, we set N = 3 for CNN and N = 1 for BBC in our standard sentence-count prompt template from Section 2.

[4] Models at: https://github.com/yixinL7/BRIO
[5] Repository with T0 prompts: https://github.com/bigscience-workshop/promptsource
Dataset / Model | #sents | #words/sent | % novel 1-grams | % novel 2-grams | #NEs per 100 words
CNN: BRIO       | 3.7    | 15.8        | 12.1            | 36.2            | 12.9
CNN: T0         | 2.7    | 14.9        | 16.4            | 52              | 12.8
CNN: GPT3-D2    | 2.9    | 23.4        | 16.3            | 40.7            | 10.5
BBC: BRIO       | 1.0    | 20.2        | 24.6            | 61.2            | 9.1
BBC: T0         | 1.0    | 20.0        | 26.3            | 66.7            | 9.3
BBC: GPT3-D2    | 1.0    | 27.7        | 16.4            | 42.3            | 8.5

Table 2: Statistics for generated summaries evaluated in the human study across all datasets and summarization systems. We observe that GPT3-D2 generated summaries nearly always follow the sentence length constraints in their prompts.

For a maximally fair comparison in this "realistic" setting, we take some additional steps to improve the output of BRIO-XSum. In order to automate dataset creation, XSum removes the first sentence from news articles to use as the gold summary for training, then treats the rest of the sentences as the article to summarize. This setup differs from the real-world usage of summarization systems, where the complete article is summarized. Due to this mismatch, BRIO-XSum often generates very low-quality outputs, e.g. "All images: Strule Shared Education Campus" in Figure 4, for around 30% of the articles. We manually identify these examples and first attempt to fix them by selecting a summary without such obvious failures from further down the beam (we use beam size = 10). However, if we cannot find a "better" summary, we remove the first sentence of the article and re-sample a new summary to align with its noisy training. This latter strategy often results in factually incorrect summary generations, as is well documented in prior research (Maynez et al., 2020; Goyal and Durrett, 2021).
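The beam-based fix described above can be approximated with standard beam search utilities. The sketch below uses a Hugging Face sequence-to-sequence interface; the checkpoint name and the heuristic for detecting obvious failures are assumptions rather than the exact procedure used in this study.

```python
# Sketch of picking a BRIO-XSum summary from further down the beam when the
# top candidate is degenerate. The checkpoint name and the failure heuristic
# are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "Yale-LILY/brio-xsum-cased"  # assumed public BRIO-XSum checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def looks_degenerate(summary: str) -> bool:
    # Placeholder check for boilerplate failures such as "All images: ...".
    return summary.startswith("All images:") or len(summary.split()) < 4

def summarize_with_beam_fallback(article: str, beam_size: int = 10) -> str:
    inputs = tokenizer(article, truncation=True, max_length=1024, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=beam_size,
        num_return_sequences=beam_size,  # keep the whole beam, best-first
        max_length=64,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    for candidate in candidates:  # walk down the beam
        if not looks_degenerate(candidate):
            return candidate
    return candidates[0]  # fall back to the top beam candidate
```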
Design of the human study We design an A/B test to collect preference annotations. For each given article, annotators are shown summaries from all three summarization systems (BRIO, T0 and GPT3-D2). They are then asked to select their most and least preferred summary or summaries. In addition to these multiple choice questions, we also ask for a free-text justification of both choices.

We make two design decisions for our human study: first, we do not provide annotators with specific definitions of summary quality to avoid introducing our own biases. It is also quite challenging to produce a unified definition of quality for the very different "styles" of summaries evaluated in this study. Instead, we ask them to rely on their own preferences based on summaries they would like to see if they were browsing the web, which we believe to be a representative scenario for non-expert consumers of news summaries. Detailed task instructions are included in Appendix F. Second, we allow multiple selections for both the best and worst summary questions to cater to scenarios in which different summarization systems output similar quality summaries without meaningful differences.

We hire crowd annotators through Prolific. For both CNN and BBC, we recruit 60 unique participants to annotate the 100 summaries in each dataset. Each annotator was asked to annotate 5 articles and each article was annotated by 3 annotators. Additionally, we use Prolific's demographic filters to restrict participation to USA (or UK) residents for CNN (or BBC). We anticipate that residents from these respective countries are better positioned to understand country-specific news events and evaluate their summaries. Participants were paid approximately $11/hr for their work.

3.2 Results

Differences between summarization systems Figure 4 shows examples of generated summaries from all three summarization systems for both CNN and BBC articles. For CNN, we observe that fine-tuned BRIO summaries tend to be highly extractive and generally include a high number of named entities (dates, percentages, names), reflecting the data it was trained on. In contrast, GPT3-D2 summaries are more abstractive and less specific, but provide a more exhaustive overview of the article content. Table 2 provides quantitative evidence of this; we use the percentage of novel n-grams to measure abstractiveness, and the number of named entities per 100 words to measure specificity.
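Both statistics are straightforward to compute. The sketch below shows one way to do so with spaCy; the tokenizer, the spaCy model, and the type-level n-gram counting are simplifying assumptions rather than the exact preprocessing behind Table 2.

```python
# Sketch of the two statistics reported in Table 2: percentage of novel
# n-grams (abstractiveness) and named entities per 100 words (specificity).
import spacy

nlp = spacy.load("en_core_web_sm")

def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_pct(article: str, summary: str, n: int) -> float:
    article_tokens = [t.text.lower() for t in nlp(article)]
    summary_tokens = [t.text.lower() for t in nlp(summary)]
    summary_ngrams = ngram_set(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngram_set(article_tokens, n)
    return 100.0 * len(novel) / len(summary_ngrams)

def entities_per_100_words(summary: str) -> float:
    doc = nlp(summary)
    n_words = sum(1 for t in doc if not t.is_punct and not t.is_space)
    return 100.0 * len(doc.ents) / max(n_words, 1)
```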
For BBC, we observe inverse trends, where BRIO and T0 are more abstractive compared to GPT3-D2. Again, this can be attributed to the XSum training data used to train both these prior models. For GPT3-D2 summaries, on the other hand, the level of abstractiveness does not differ between datasets. Finally, Table 2 shows that GPT3-D2 summaries tend to have longer sentences, and therefore a similar number of summary sentences often results in a longer summary for both datasets. We study the effect of this length difference on human preference judgments in Appendix B.

Which systems do humans prefer? Results of our human study are summarized in Table 3. We report the percentage of times a particular system is the most/least preferred model according to majority vote combining all three annotators' choices.[6]

[6] As we allow multiple system selections, note that more than one system could be the majority. However, this is rare after majority vote: only 2% of the articles in CNN and 7% in BBC have multiple best summaries.
Dataset | BRIO best / worst | T0 best / worst | GPT3-D2 best / worst
CNN     | 36 / 24           | 8 / 67          | 58 / 9
BBC     | 20 / 56           | 30 / 29         | 57 / 5

Table 3: Percentage of times a summarization system is selected as the best or worst according to majority vote (may be tied). Human annotators have a clear preference for GPT3-D2 for both CNN and BBC style summaries.

[Figure 5: Annotator vote distribution for best and worst summaries across all datasets and models. Although GPT3-D2 is the clear winner according to majority vote, this choice is unanimous for less than 30% of the articles. This demonstrates the inherent variance in different annotators' definitions of "best summary", especially when comparing high-quality summaries from strong models.]

Across both datasets and styles, we observe a clear preference for GPT3-D2 summaries compared to the other two models. In fact, in both scenarios, GPT3-D2 outperforms the next best model by at least 20 percentage points. This improvement is statistically significant according to a paired bootstrap test (CNN p-value = 2 x 10^-3, BBC p-value = 6 x 10^-4).
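A paired bootstrap over per-article preferences is one standard way to obtain such a p-value. The sketch below assumes one majority-vote winner label per article and is an illustration of the test rather than the authors' exact analysis script.

```python
# Sketch of a paired bootstrap test over per-article majority-vote winners,
# e.g. winners = ["gpt3-d2", "brio", "gpt3-d2", ...], one entry per article.
import random

def paired_bootstrap_pvalue(winners, system_a="gpt3-d2", system_b="brio",
                            n_resamples=10_000, seed=0):
    """Fraction of resamples in which system_a is NOT preferred more often."""
    rng = random.Random(seed)
    n = len(winners)
    losses = 0
    for _ in range(n_resamples):
        sample = [winners[rng.randrange(n)] for _ in range(n)]
        if sample.count(system_a) <= sample.count(system_b):
            losses += 1
    return losses / n_resamples
```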
Note that the next best model differs between the two datasets. For BBC, annotators prefer T0 summaries over BRIO. Annotator rationales often mentioned misleading or incorrect information as the primary reason for selecting BRIO as the worst summary, confirming the issues that have been observed with XSum-trained models (Maynez et al., 2020; Pagnoni et al., 2021; Goyal and Durrett, 2021). Although T0 also includes XSum training data, we hypothesize that its multi-task framework helps offset the noisy signal from XSum.

In contrast, annotators rate T0 as the worst summarization system for CNN. The most common rationales for this were shorter length and inclusion of irrelevant details, e.g. long quotes, while missing key points. Some annotators also commented that these T0 summaries were less coherent compared to the other models. Interestingly, we did not observe similar complaints for the single-sentence T0 summaries for BBC.

Do annotators agree with each other? To study this, we plot the distribution of annotator votes for each summarization system and dataset in Figure 5. Additionally, we report the inter-annotator agreement, measured using Krippendorff's alpha with MASI distance (Passonneau, 2006), to account for the multiple selections of best or worst summary allowed in our study design.
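Because annotators may select several systems as best or worst, agreement is computed over set-valued labels. The sketch below shows how Krippendorff's alpha with MASI distance can be computed with NLTK; the toy data layout is an assumption about the annotation format, not the study's actual records.

```python
# Sketch of inter-annotator agreement: Krippendorff's alpha with MASI
# distance over set-valued "best summary" selections, via NLTK.
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import masi_distance

# Each record: (annotator id, article id, frozenset of systems selected).
records = [
    ("a1", "article_1", frozenset({"gpt3-d2"})),
    ("a2", "article_1", frozenset({"gpt3-d2", "t0"})),
    ("a3", "article_1", frozenset({"gpt3-d2"})),
    ("a1", "article_2", frozenset({"brio"})),
    ("a2", "article_2", frozenset({"t0"})),
    ("a3", "article_2", frozenset({"brio"})),
]

task = AnnotationTask(data=records, distance=masi_distance)
print("Krippendorff's alpha:", task.alpha())
```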
The vote distribution shows that although more annotators prefer GPT3-D2 summaries, this choice is only unanimous, i.e. supported by all three annotators, for less than 30% of the annotated articles. Conversely, although BRIO (or T0) summaries are less preferred than GPT3-D2 for the CNN (or BBC) dataset on aggregate, they were voted as the best summary by at least one annotator for more than 60% of the articles. This demonstrates two things: first, when comparing summaries from two strong models, the choice is inherently ambiguous (similar observations in Clark et al. (2021)). Second, these results, and the diversity in the written rationales, show that there does not exist a universal definition of a "good" summary and that different summary properties appeal to different annotators. Regardless, the aggregate preference for GPT3-D2 is high enough across the board to give us confidence in its strength.

How do these results impact the field? Progress in text summarization research in the last five years has been enabled by the construction of large-scale text summarization datasets that involved scraping news articles and pairing them with any available summary-like data (Hermann et al., 2015; Narayan et al., 2018; Grusky et al., 2018). The CNN/DM dataset considers bullet points accompanying news articles as its summary. These "gold" standard summaries provided useful training signal to train impressive supervised models (Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) and hence, their quality or alignment with human preferences was largely ignored.

We found that, despite its popularity, XSum is largely unsuitable for fine-tuning models like BRIO for realistic summarization settings.
Dataset / Model      | ROUGE (1/2/L)      | METEOR | BLEU | BERTScore | MoverScore | QAEval EM | QAEval F1
CNN: PEGASUS         | 34.85/14.62/28.23  | .24    | 7.1  | .858      | .229       | .105      | .160
CNN: BRIO            | 38.49/17.08/31.44  | .31    | 6.6  | .864      | .261       | .137      | .21
CNN: T0              | 35.06/13.84/28.46  | .25    | 5.9  | .859      | .238       | .099      | .163
CNN: GPT3-D2         | 31.86/11.31/24.71  | .25    | 3.8  | .858      | .216       | .098      | .159
DailyMail: PEGASUS   | 45.77/23.00/36.65  | .33    | 12.2 | .865      | .308       | .159      | .229
DailyMail: BRIO      | 49.27/24.76/39.21  | .37    | 11.7 | .871      | .331       | .175      | .259
DailyMail: T0        | 42.97/19.04/33.95  | .28    | 8.9  | .863      | .290       | .21       | .184
DailyMail: GPT3-D2   | 38.68/14.24/28.08  | .26    | 6.6  | .859      | .248       | .01       | .159
XSum: PEGASUS        | 47.97/24.82/39.63  | .36    | 9.8  | .901      | .362       | .145      | .221
XSum: BRIO           | 49.66/25.97/41.04  | .39    | 10.6 | .901      | .372       | .139      | .224
XSum: T0             | 44.20/20.72/35.84  | .34    | 8.0  | .896      | .340       | .125      | .208
XSum: GPT3-D2        | 28.78/7.64/20.60   | .19    | 2.2  | .869      | .197       | .066      | .119
Newsroom: PEGASUS    | 39.21/27.73/35.68  | .39    | 1.4  | .873      |            | .182      | .253
Newsroom: BRIO       | -                  | -      | -    | -         | -          | -         | -
Newsroom: T0         | 25.64/19.49/21.41  | .20    | 0.4  | .849      | .145       | .080      | .125
Newsroom: GPT3-D2    | 27.44/10.67/22.18  | .2     | 0.5  | .859      | .159       | .089      | .142

Table 4: Performance of different summarization systems measured using reference-based automatic metrics. Across all datasets, we observe that automatic metrics report substantially worse results for GPT3-D2 summaries compared to fine-tuned models. This directly contradicts the human preference results from Section 3, demonstrating that these reference-based metrics cannot reliably compare the quality of prompt-based summaries against fine-tuned summaries.
Even though a CNN/DM-trained BRIO model performed better, the results of our human study question the continued utility of hill-climbing on this dataset, as it seems users may simply prefer a different style of summary altogether. In fact, this preference for GPT3-D2 is much larger than incremental improvements reported in other human evaluation settings, e.g. improvements on XSum on the GENIE leaderboard (Khashabi et al., 2022). Furthermore, as we will see in Section 5, the greater flexibility of GPT3-D2 compared to these systems makes it more suitable for news summarization tasks beyond generic summarization.

If a system designer collects a large-scale dataset of high-quality summaries that they wish to emulate, we believe a fine-tuned system may outperform GPT3-D2. However, better-trained models on datasets collected via "incidental" supervision are less likely to help.

4 Can current automatic metrics evaluate GPT3-D2 summaries?

Automatic metrics proposed for summarization evaluation can be broadly divided into two categories: (1) reference-based, which compare generated summaries against available gold summaries, and (2) reference-free, which only rely on the input document. Here, we compare their performance at evaluating prompt-based GPT3-D2 summaries.

Experimental Setup We evaluate automatic metrics using summaries from 4 different summarization datasets, listed in Table 1. For each dataset, we construct our evaluation sets by randomly sampling 500[7] articles from the standard test split.[8] We compare the same 3 summarization systems from Section 3 in our analysis. Additionally, we also report results using the fine-tuned PEGASUS model (Zhang et al., 2020), as BRIO fine-tuned models are not available for all datasets.

[7] This size is chosen to give sufficient statistical power (Card et al., 2020) while keeping costs for GPT3-D2 evaluation low to enable others to compare on this subset. We outline costs in Appendix D.
[8] Note that these standard datasets were released before 2020. Therefore, it is possible that some article-summary pairs in our test set overlap with GPT3-D2's training data. However, we do not observe a qualitative difference in GPT3-D2's performance on these older articles.

We publicly release this corpus of summarization outputs to standardize the test sets and support future research into GPT3-D2 based summarization. Link: https://tagoyal.github.io/zeroshot-news-annotations.html.

4.1 Reference-based metrics

Here, we study if the gold summaries of the standard datasets are useful for evaluation, especially when evaluating prompt-based summaries that are not trained to emulate the gold.
Dataset / Model      | SUPERT | BLANC | QuestEval | QAFactEval | FactCC | DAE   | SummaC
CNN: PEGASUS         | .5466  | .0605 | .7373     | 4.4071     | .3743  | .8223 | .1138
CNN: BRIO            | .5586  | .0802 | .7334     | 3.8332     | .1817  | .1577 | -.0532
CNN: T0              | .5330  | .0558 | .7799     | 3.7517     | .2012  | .7556 | -.0605
CNN: GPT3-D2         | .5560  | .0749 | .7249     | 3.6399     | .2428  | .6671 | -.0729
DailyMail: PEGASUS   | .6433  | .1137 | .7536     | 4.4677     | .5152  | .8497 | .2402
DailyMail: BRIO      | .6360  | .1217 | .7415     | 4.1362     | .3609  | .8118 | .0153
DailyMail: T0        | .5995  | .0889 | .7803     | 3.9827     | .2431  | .8043 | .0478
DailyMail: GPT3-D2   | .6118  | .0983 | .7461     | 3.8279     | .2697  | .6990 | .0365
XSum: PEGASUS        | .4439  | .0249 | .8233     | 2.0089     | .2465  | .3508 | -.2993
XSum: BRIO           | .4459  | .0230 | .8305     | 1.8626     | .2031  | .3040 | -.3292
XSum: T0             | .4538  | .0238 | .7957     | 2.0330     | .2209  | .332  | -.3037
XSum: GPT3-D2        | .5060  | .0594 | .8064     | 2.9492     | .3977  | .6372 | -.2626
Newsroom: PEGASUS    | .6286  | .1131 | .18       | 4.2120     | .7218  | .7956 | .2418
Newsroom: BRIO       | -      | -     | -         | -          | -      | -     | -
Newsroom: T0         | .5433  | .0640 | .511      | 3.5799     | .2828  | .7376 | .0261
Newsroom: GPT3-D2    | .5408  | .0599 | .7160     | 3.2336     | .3988  | .6564 | -.0729

Table 5: Performance of different summarization systems, as scored by automatic reference-free evaluation metrics from the summarization literature. Similar to reference-based metrics, these also generally fail to reliably produce the same system rankings as human preferences across datasets.
We benchmark the performance of 3 different families of summarization metrics: (1) overlap-based metrics, specifically ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005) and BLEU (Papineni et al., 2002); (2) similarity-based metrics, which compute similarity between embedding representations of generated and reference summaries; specifically, we report BERTScore (Zhang* et al., 2020) and MoverScore (Zhao et al., 2019); and (3) a QA-based metric, specifically QAEval (Deutsch et al., 2021). Although most QA-based metrics are reference-free (discussed in Section 4.2), QAEval uses the reference summaries to indicate saliency. We report both the exact match (EM) and F1 components of QAEval.
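For reference, the sketch below shows how two of these reference-based scores can be computed with commonly used packages; the package choices and settings are assumptions and not necessarily the configuration behind the numbers in Table 4.

```python
# Sketch of reference-based scoring for one (generated, reference) pair.
from rouge_score import rouge_scorer
from bert_score import score as bertscore

def reference_based_scores(generated: str, reference: str) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, generated)  # signature: score(target, prediction)
    precision, recall, f1 = bertscore([generated], [reference], lang="en")
    return {
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "bertscore_f1": f1.item(),
    }
```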
Results Table 4 outlines the results. It shows that the BRIO and PEGASUS models, fine-tuned to emulate the reference summaries, outperform GPT3-D2 summaries according to all reference-based automatic metrics. The difference in their assigned scores is very high, e.g. >7 ROUGE-L points between GPT3-D2 and BRIO. For comparison, these reported scores for GPT3-D2 are even lower than the trivial Lead-3 baseline reported in prior work (Fabbri et al., 2021; Grusky et al., 2018). This clearly demonstrates that current automatic reference-based metrics cannot be used to reliably measure summary quality under the prompting paradigm.

Amongst prompting-based models, we observe that T0 summaries report better metric scores than GPT3-D2 for all datasets except Newsroom. Interestingly, out of the four datasets evaluated here, Newsroom is the only one not used to train the T0 model. This further shows that access to dataset-specific reference summaries during training improves performance according to these metrics, rendering them unsuitable for evaluating prompt-based models.

4.2 Reference-free metrics

Next, we investigate whether current reference-free evaluation metrics reflect the human preference rankings between summarization systems, as observed in Section 3. Here, we study 2 categories of metrics: (1) quality metrics, specifically SUPERT (Gao et al., 2020), which evaluates generated summaries against automatically identified salient sentences in the input, and BLANC (Vasilyev et al., 2020), which evaluates summaries on language understanding tasks. We refer readers to the original papers for detailed explanations of these. (2) factuality metrics, which evaluate whether generated summaries contain incorrect information with respect to the source article. We report the performance of summarization systems using two QA-based metrics: QuestEval (Scialom et al., 2021) and QAFactEval (Fabbri et al., 2022). Additionally, we also benchmark entailment-based metrics: FactCC (Kryscinski et al., 2020), DAE (Goyal and Durrett, 2020, 2021) and SummaC (Laban et al., 2022).[9] These entailment-based models are designed for classification into factual or non-factual; therefore, we use P(factual | article, summary) to score generated summaries.

[9] Exact model versions and configurations used for these are outlined in Appendix A.
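The scoring rule for the entailment-based metrics can be illustrated with any off-the-shelf NLI classifier: the probability mass assigned to the entailment class serves as P(factual | article, summary). The MNLI checkpoint below is a generic stand-in, not the FactCC, DAE, or SummaC models actually benchmarked in Table 5.

```python
# Sketch of entailment-based factuality scoring with a generic NLI model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-large-mnli"  # stand-in; not the benchmarked metrics
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def p_factual(article: str, summary: str) -> float:
    inputs = tokenizer(article, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    label2id = {label.lower(): idx for label, idx in model.config.label2id.items()}
    return probs[label2id["entailment"]].item()
```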
Results Table 5 outlines the scores for each summarization system according to the above reference-free metrics. Ideally, we want the relative rankings of different systems according to these metrics to correspond to human preferences, i.e. GPT3-D2 > BRIO > T0 for CNN/DM[10] and GPT3-D2 > T0 > BRIO for XSum.[11]

[10] Although the human study in Section 3 is only run on CNN articles, the underlying fine-tuned model is the same for both CNN and DM. Therefore, we can reasonably expect it to display similar quality differences with respect to GPT3-D2.
[11] Note that while annotators were not explicitly asked to rate factuality, we instructed them to carefully check factuality and appropriately downvote non-factual summaries.

Overall, we observe that none of the reference-free metrics we evaluate follow these trends for both the CNN/DM and XSum datasets. In particular, we observe that GPT3-D2 summaries report low factuality scores (except on XSum) even though we rarely found any factual errors in our qualitative analysis of its generated summaries.

Interestingly, we noticed a roughly inverse relation to abstractiveness; summarization systems that generated more abstractive summaries (see Table 2) were generally scored lower by these automatic metrics. For instance, GPT3-D2 is scored lower than BRIO by both quality metrics for all datasets except XSum; the latter is the only dataset for which GPT3-D2 summaries are less abstractive. Such shortcomings of reference-free evaluation metrics due to spurious correlations have also been studied in prior work (Durmus et al., 2022). These issues become more exaggerated when the summarization systems being compared exhibit very different properties.
Discussion On the surface, the failure of reference-free metrics at evaluating GPT3-D2 summaries is more surprising than that of reference-based metrics, as the latter explicitly compare generated summaries with references that GPT3-D2 is not trained to imitate. Therefore, GPT3-D2 understandably scores lower than fine-tuned systems on reference-based metrics.

However, we note two different issues with reference-free metrics: (1) Some of these, e.g. FactCC and DAE, use reference summaries as positive examples to train the metric. Therefore, although "reference-free" at test time, they are still trained to reward the summary properties seen in the standard summarization benchmarks. (2) Even completely reference-free metrics, e.g. QuestEval and QAFactEval, have only been evaluated on reference-based benchmarks and fine-tuned models. Therefore, the choice of different components, such as which question answering or question generation models to use, has been dictated by the error space of prior fine-tuned models (Tang et al., 2023). These decisions also now need to be re-visited to incorporate GPT3-D2 evaluation; we leave this for future work.

5 Beyond Generic Summarization

Previously, we observed that GPT3-D2 models faithfully follow simple "style" instructions in the given prompts. This provides a promising direction to tackle other use cases in news summarization beyond the generic summarization task from Section 3.

Different users can have very different information needs from the same article, all of which cannot be satisfied with a single generic summary. Prior work has introduced several task formulations to address this gap, including keyword-focused (He et al., 2022a), query-focused (Baumel et al., 2014; He et al., 2022a), or aspect-focused summarization (Krishna and Srinivasan, 2018; Ahuja et al., 2022), amongst others. Here, we evaluate GPT3-D2 performance at two of these use cases.

In keyword-based summarization, the output summaries must succinctly summarize the input document focusing on a given keyword; keywords generally correspond to specific entities or events directly mentioned in the document. In contrast, the control units in aspect-based summarization are high-level topics that can be common across multiple similar types of documents. For example, for the input article in Figure 1, Donald Trump or Russian interference in the 2016 elections are keyword controls, whereas charges against the defendants is a higher-level aspect that can serve as the query for any news article discussing a lawsuit or investigation.

5.1 Qualitative Analysis

Baseline Model for comparison We use the recently proposed CTRLSum (He et al., 2022a), a fine-tuned BART model, as our baseline. It can be flexibly adapted for both keyword- and aspect-based settings by including a prompt as additional input to the encoder. We use the prompt template recommended in the original paper.[12]

[12] Trained model publicly released at: https://github.com/salesforce/ctrl-sum
[Figure 6: Comparison of keyword- and aspect-based summaries using the GPT3-D2 and CTRLSum models. The GPT3-D2 prompt is shown on the left with the corresponding keyword or aspect bolded. For keyword-based summarization, the GPT3-D2 summary presents appropriate context before the keyword-specific information. However, for aspect-based summarization, it does not always generate factually correct summaries, as shown in the first aspect example. We observe that CTRLSum performs poorly for both these settings.]
Control Units For the keyword-focused setting, we use named entities extracted from the input article as the control units. For aspect-focused summarization, we directly use the aspects introduced in the guided summarization task from TAC 2011.[13] It defined 5 broad categories of newswire articles, such as accidents and natural disasters, investigations and trials, etc., and multiple aspects for each category. For example, the "investigations and trials" category includes aspects such as "who is the defendant or under trial?", "who is investigating, prosecuting, judging?", and so on.

[13] https://tac.nist.gov/2011/Summarization/Guided-Summ.2011.guidelines.html

Qualitative Analysis Figure 6 shows examples of keyword- and aspect-focused summaries using GPT3-D2 and the baseline CTRLSum model. The keywords or aspects are highlighted in bold within the GPT3-D2 prompt displayed on the left.

In this example, representative of average GPT3-D2 quality, the keyword-focused GPT3-D2 summary first gives a brief overview of the article setting before providing keyword-relevant information. In contrast, the CTRLSum summary exhibits poor discourse structure and reads like a list of facts stapled together.

The figure also shows aspect-focused summaries for two aspects associated with the "investigations and trials" category most appropriate for the chosen article. We see mixed results here for GPT3-D2; it generates a factually incorrect summary for the first aspect, listing multiple people from the input article as defendants instead of only "Donald Trump". For the second aspect, it correctly maps the high-level concept "defendant" to "Donald Trump" in the input article and generates the correct answer to the input query: "The defendant's reaction to charges in the above article is denial of charges".

On the other hand, CTRLSum fails to generate aspect-focused summaries for both cases. We believe that it struggles to align high-level concepts and explicit entities in the article due to a lack of
such aspect-specific examples in its training data. Instead, it generates summaries focusing on lexically similar words, i.e. "defenders" for both cases.

Based on GPT3-D2's promising keyword-focused summarization capabilities observed above, we next conduct a human study to systematically compare it against the CTRLSum baseline. We leave further explorations of aspect-based summarization to future work, given the mixed to poor results for both models at this task.

[Figure 7: Distribution of annotator votes for the keyword-focused summarization task. Annotators prefer GPT3-D2 summaries over CTRLSum for approximately 70% of all article-keyword pairs, showing unanimous preference more than half the time.]

5.2 Human Study: Keyword-focused summarization

Task Setup Similar to Section 3, we design an A/B test to compare the two models. We use the same set of 100 CNN[14] articles as Section 3. We randomly extract 2 distinct named entities from each article. In the study interface, the annotator is shown the article-keyword pair and the GPT3-D2 and CTRLSum summaries corresponding to it. They are asked to select the summary that best summarizes the input article while focusing on the given keyword. Exact task instructions are included in Appendix F.

[14] We run this study using only CNN articles as the baseline CTRLSum model is trained on CNN/DM.

Again, we run this study using the Prolific platform. We recruit 60 participants to annotate the 100 articles; each article is annotated by 3 annotators, which includes annotations for 2 separate keywords. Each annotator evaluates 5 articles.
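The article-keyword pairs for this study can be constructed mechanically. The sketch below samples named entities with spaCy and wraps them in a keyword-focused prompt in the style of Figure 1; the spaCy model and sampling details are assumptions rather than the exact study pipeline.

```python
# Sketch of building article-keyword pairs and keyword-focused GPT-3 prompts.
import random
import spacy

nlp = spacy.load("en_core_web_sm")

def sample_keywords(article: str, k: int = 2, seed: int = 0):
    entities = sorted({ent.text for ent in nlp(article).ents})
    rng = random.Random(seed)
    rng.shuffle(entities)
    return entities[:k]  # two distinct named entities per article

def keyword_prompt(article: str, keyword: str) -> str:
    return (f"Article: {article}\n\n"
            f"Summarize the above article briefly focusing on {keyword}.")
```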
Results Figure 7 shows the distribution of annotator votes between the GPT3-D2 and CTRLSum models. Annotators show a clear preference for GPT3-D2. In fact, for nearly 70% of all article-keyword pairs, GPT3-D2 is preferred over CTRLSum by a majority of the annotators. The main rationales given for this choice were better contextualization of keyword-related information and better coherence in GPT3-D2 summaries.

Impact These results show that prompting GPT-3 models presents a promising alternative to fine-tuned models for such specialized summarization tasks that can be easily described using textual prompts. One of the major drawbacks of fine-tuned models is that they are constrained by what data is available and how it can be transformed to create new task-specific training data. CTRLSum relied on the SQuAD question answering dataset (Rajpurkar et al., 2016) because the required "queries" or "questions" were unavailable at scale for summaries in standard summarization datasets. In contrast, prompt-based models are not constrained by the availability of task-specific data and can flexibly adapt to new tasks. Future research should focus on further exploring these capabilities and possible improvements on currently "unsolved" tasks such as aspect-based or plan-based summarization.

6 Discussion and Related Work

In recent years, research in text summarization (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017; Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) has typically relied on comparisons with gold test sets for evaluation, possibly augmented with reference-free metrics for dimensions like factuality. This paper shows that all these metrics are completely ineffective at evaluating GPT-3 summaries. Although issues with these metrics, particularly low correlation with human judgments, have also been studied earlier (Fabbri et al., 2021; Deutsch and Roth, 2021), they are considered reliable when comparing systems in different score ranges (Peyrard, 2019; Deutsch et al., 2022). However, GPT-3 challenges these established practices and evaluation protocols, and poses an urgent need for better evaluation.

This brings us to manual evaluation, generally considered to be the gold standard for generation evaluation. The majority of summarization research now reports results from a human study in addition to automatic metrics, but there is a general lack of consensus on what dimensions to evaluate, task design, and other factors (Hardy et al., 2019). This presents difficulties in conducting reliable and reproducible comparisons between systems (Karpinska et al., 2021), another factor
tributing to the popularity of automatic metrics. paring different system generations. In our work,
Although recent efforts like GENIE (Khashabi et al., we chose a human evaluation workflow that directly
2022) have taken steps to standardize manual eval- asks annotators to compare systems, while other
uation protocols across systems, its annotation is prior work has opted for Likert-scale judgments
not universally affordable and the quality is not and/or evaluation along multiple quality dimen-
strictly monitored. We hope that future work ad- sions (Gehrmann et al., 2022). The latter strategy
dresses these challenges and democratizes human of evaluating different dimensions could surface
evaluations. more insights into which “style” properties of GPT-
The ultimate test of summarization systems is 3 summaries provide them an edge over fine-tuned
with actual users using the systems in practice. models; however, such analysis is outside the scope
Jones (2007) discusses the need to align task formu- of this paper. Our experiments comparing overall
lations with actual applications scenarios (“purpose quality reveal that current summarization datasets
factors™). However, the research in text summa- are not well-aligned with user preferences. We
rization until now has been constrained to certain leave more fine-grained analysis into these prefer-
problems or domains by the heavy dependence on ence judgments for future work.
large-scale training data: for example, producing a The experiments in this paper are run on English-
bullet-point summary of a news article has emerged language news summarization datasets as these
as standard due to availability of data from CNN, serve as common benchmarks in the summariza-
not because it is shown to be the best way to present tion literature. However, user rankings of system
information. outputs might be different when evaluating other
Now, the success of prompt-based models can domains, e.g., summaries of scientific text. While
allow realistic use-cases to drive research in a more we believe that automatic metrics would fail to eval-
top-down way. We already show that GPT3-D2 im- uate GPT-3 summaries on these domains also (gen-
proves upon prior keyword-focused summarization erated summaries would still look different from
systems that were trained on artificially adapted the reference summaries), users may prefer models
training data. In future research, we are inter- that are specifically fine-tuned on domain-specific
ested in tackling other real world use cases, such data for niche domains.
as update summarization and plan- or aspect-based Finally, we do not know exact datasets or tasks
summarization. Additionally, adapting GPT3-D2 used to train GPT3-D2. It is possible that its RLHF
to documents longer than the allowed context, or training (Ouyang et al., 2022) included summariza-
structured inputs such as tables, presents research tion examples, and therefore, preference judgments
challenges beyond the current capabilities of GPT-3 from human annotators for its different outputs.
and would be interesting to study." However, our arguments in this paper do not rely
on the specifics of the GPT3-D2 system, merely that
7 Conclusion such a system exists. If anything, the existence
In this work, we performed the first systematic of potentially better data underscores that further
study comparing prompt-based GPT-3 and fine- work should collect new data for summarization
model tuning, and our claims about metrics still
tuned models at the news summarization task. We
analyzed the impact of prompting on the summa- hold regardless of the details of how the GPT3-D2
rization field, including training paradigms and summaries were produced.
evaluation practices. Finally, to support further
research in this direction, we release a large corpus References
of generated summaries for multiple prompt-based
Ojas Ahuja, Jiacheng Xu, Akshay Gupta, Kevin
and fine-tuned models, as well as human preference
Horecka, and Greg Durrett. 2022. ASPECTNEWS:
judgments comparing these systems. Aspect-oriented summarization of news documents.
In Proceedings of the 60th Annual Meeting of the
8 Limitations Association for Computational Linguistics (Volume
1: Long Papers). pages 6494-6506.
In the text generation evaluation literature, there
does not exist a standardized task design for com- Satanjeev Banerjee and Alon Lavie. 2005. METEOR:
An automatic metric for mt evaluation with improved
"SWe very briefly discuss long document summarization correlation with human judgments. In Proceedings of
with GPT-3in Appendix E. the acl workshop on intrinsic and extrinsic evaluation
measures for machine translation and/or summariza- summarization evaluation metrics. In Proceedings of
tion, pages 65-72. the 2022 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Tal Baumel, Raphael Cohen, and Michael Elhadad. Human Language Technologies, pages 6038—6052,
2014. Query-chain focused summarization. In Pro- Seattle, United States. Association for Computational
ceedings of the 52nd Annual Meeting of the Associa- Linguistics.
tion for Computational Linguistics (Volume 1: Long
Papers), pages 913-922. Daniel Deutsch and Dan Roth. 2021. Understanding the
extent to which content quality metrics measure the
Manik Bhandari, Pranav Narayan Gour, Atabak Ash- information quality of summaries. In Proceedings
faq, and Pengfei Liu. 2020. Metrics also disagree of the 25th Conference on Computational Natural
in the low scoring range: Revisiting summarization Language Learning, pages 300-309.
evaluation metrics. In Proceedings of the 28th Inter-
national Conference on Computational Linguistics. Esin Durmus, He He, and Mona Diab. 2020. FEQA: A
pages 5702-5711. question answering evaluation framework for faith-
fulness assessment in abstractive summarization. In
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Proceedings of the 58th Annual Meeting of the Asso-
Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind ciation for Computational Linguistics, pages 5055—
Neelakantan, Pranay Shyam, Girish Sastry, Amanda 5070.
Askell, et al. 2020. Language models are few-shot
learners. Advances in neural information processing Esin Durmus, Faisal Ladhak, and Tatsunori B
systems, 33:1877-1901. Hashimoto. 2022. Spurious correlations in reference-
free evaluation of text generation. In Proceedings
Dallas Card, Peter Henderson, Urvashi Khandelwal, of the 60th Annual Meeting of the Association for
Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. Computational Linguistics (Volume 1: Long Papers),
With little power comes great responsibility. In pages 1443-1454.
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP), Alexander Fabbri, Chien-Sheng Wu, Wenhao Liu, and
pages 9263-9274. Caiming Xiong. 2022. QAFactEval: Improved QA-
based factual consistency evaluation for summariza-
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, tion. In Proceedings of the 2022 Conference of the
Maarten Bosma, Gaurav Mishra, Adam Roberts, North American Chapter of the Association for Com-
Paul Barham, Hyung Won Chung, Charles Sutton, putational Linguistics: Human Language Technolo-
Sebastian Gehrmann, et al. 2022. PaLM: Scaling gies, pages 2587-2601, Seattle, United States. Asso-
language modeling with pathways. arXiv preprint ciation for Computational Linguistics.
arXiv:2204.02311.
Alexander R Fabbri, Wojciech Kryscinski, Bryan Mc-
Elizabeth Clark, Tal August, Sofia Serrano, Nikita Cann, Caiming Xiong, Richard Socher, and Dragomir
Haduong, Suchin Gururangan, and Noah A Smith. Radev. 2021. SummEval: Re-evaluating summariza-
2021. All that’s *human’ is not gold: Evaluating tion evaluation. Transactions of the Association
for
human evaluation of generated text. In Proceedings Computational Linguistics, 9:391-409.
of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Yang Gao, Wei Zhao, and Steffen Eger. 2020. SUPERT:
Joint Conference on Natural Language Processing Towards new frontiers in unsupervised evaluation
(Volume I: Long Papers), pages 7282-7296. metrics for multi-document summarization. In Pro-
ceedings of the 58th Annual Meeting of the Asso-
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, ciation for Computational Linguistics, pages 1347—
Trung Bui, Seokhwan Kim, Walter Chang, and Nazli 1354.
Goharian. 2018. A discourse-aware attention model
for abstractive summarization of long documents. In Sebastian Gehrmann, Elizabeth Clark, and Thibault Sel-
Proceedings of the 2018 Conference of the North lam. 2022. Repairing the cracked foundation: A sur-
American Chapter of the Association for Computa- vey of obstacles in evaluation practices for generated
tional Linguistics: Human Language Technologies, text. arXiv preprint arXiv:2202.06935.
Volume 2 (Short Papers), pages 615621, New Or-
leans, Louisiana. Association for Computational Lin- Tanya Goyal and Greg Durrett. 2020. Evaluating factu-
guistics. ality in generation with dependency-level entailment.
In Findings of the Association for Computational
Daniel Deutsch, Tania Bedrax-Weiss, and Dan Roth. Linguistics: EMNLP 2020, pages 3592-3603.
2021. Towards question-answering as an automatic
metric for evaluating the content quality of a sum- Tanya Goyal and Greg Durrett. 2021. Annotating and
mary. Transactions of the Association for Computa- modeling fine-grained factuality in summarization.
tional Linguistics, 9:774-789. In Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computa-
Daniel Deutsch, Rotem Dror, and Dan Roth. 2022. Re- tional Linguistics: Human Language Technologies,
examining system-level correlations of automatic pages 1449-1462.
Tanya Goyal, Jiacheng Xu, Junyi Jessy Li, and Greg Durrett. 2022. Training dynamics for text summarization models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2061-2073.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708-719.

Hardy Hardy, Shashi Narayan, and Andreas Vlachos. 2019. HighRES: Highlight-based reference-less evaluation of summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3381-3392.

Junxian He, Wojciech Kryściński, Bryan McCann, Nazneen Rajani, and Caiming Xiong. 2022a. CTRLsum: Towards generic controllable text summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5879-5915, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Pengcheng He, Baolin Peng, Liyang Lu, Song Wang, Jie Mei, Yang Liu, Ruochen Xu, Hany Hassan Awadalla, Yu Shi, Chenguang Zhu, et al. 2022b. Z-Code++: A pre-trained language model optimized for abstractive summarization. arXiv preprint arXiv:2208.09770.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28.

Karen Spärck Jones. 2007. Automatic summarising: The state of the art. Information Processing & Management, 43(6):1449-1481.

Marzena Karpinska, Nader Akoury, and Mohit Iyyer. 2021. The perils of using Mechanical Turk to evaluate open-ended text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1265-1285.

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, and Daniel Weld. 2022. GENIE: Toward reproducible and standardized human evaluation for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11444-11458, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Kundan Krishna and Balaji Vasan Srinivasan. 2018. Generating topic-oriented summaries using neural attention. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1697-1705.

Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332-9346.

Wojciech Kryściński, Nazneen Fatema Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir R Radev. 2021. BookSum: A collection of datasets for long-form narrative summarization.

Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871-7880.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81.

Yixin Liu, Pengfei Liu, Dragomir Radev, and Graham Neubig. 2022. BRIO: Bringing order to abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2890-2903.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906-1919.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048-11064, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470-3487.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280-290.
Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797-1807.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812-4829.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.

Rebecca J Passonneau. 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06).

Maxime Peyrard. 2019. Studying summarization evaluation metrics in the appropriate scoring range. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5093-5100.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379-389.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations.

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802.

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. 2021. QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594-6604.

Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073-1083.

Liyan Tang, Tanya Goyal, Alexander R Fabbri, Philippe Laban, Jiacheng Xu, Semih Yavuz, Wojciech Kryściński, Justin F Rousseau, and Greg Durrett. 2023. Understanding factual errors in summarization: Errors, summarizers, datasets, error detectors. Association for Computational Linguistics.

Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 11-20.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Xi Ye and Greg Durrett. 2022. The unreliability of explanations in few-shot prompting for textual reasoning. In Advances in Neural Information Processing Systems.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328-11339. PMLR.

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Yusen Zhang, Ansong Ni, Ziming Mao, Chen Henry Wu, Chenguang Zhu, Budhaditya Deb, Ahmed Awadallah, Dragomir Radev, and Rui Zhang. 2022. SummN: A multi-stage summarization framework for long input dialogues and documents. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1592-1604.
Yusen Zhang, Ansong Ni, Tao Yu, Rui Zhang, Chenguang Zhu, Budhaditya Deb, Asli Celikyilmaz, Ahmed Hassan, and Dragomir Radev. 2021. An exploratory study on long dialogue summarization: What works and what's next. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4426-4433.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the International Conference on Machine Learning (ICML).

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563-578.

Yao Zhao, Mohammad Saleh, and Peter J Liu. 2020. SEAL: Segment-wise extractive-abstractive long-form text summarization. arXiv preprint arXiv:2006.10213.

Figure 8: Correlation between summary length and annotator score (computed as the number of "best summary" votes). For each example, we plot the difference in length (x-axis: Len(s*) - Len(s)) and annotator score (y-axis) between the GPT3-D2 summary and the next best system's summary.

3. SummaC: We use the SummaC-Conv model (model_name = 'vitc') and sentence-level granularity in our experiments; a usage sketch follows below.
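For illustration, this configuration can be instantiated with the publicly released summac package; the snippet below is a minimal sketch under that assumption. The constructor arguments (models, granularity, etc.) follow the package's documented example rather than any code accompanying this paper, and the document/summary strings are invented.

```python
# Minimal sketch (not code released with this paper): scoring one summary with
# SummaC-Conv using the "vitc" NLI backbone and sentence-level granularity.
# Assumes the public `summac` package (pip install summac) and its documented API.
from summac.model_summac import SummaCConv

model_conv = SummaCConv(
    models=["vitc"],          # NLI model trained on VitaminC
    bins="percentile",
    granularity="sentence",   # sentence-level document/summary units
    nli_labels="e",
    start_file="default",     # pretrained convolution weights
    agg="mean",
    device="cpu",             # switch to "cuda" if a GPU is available
)

# Hypothetical document/summary pair, for illustration only.
document = "The company announced on Friday that seller fees will rise from 5% to 6.5% starting Monday."
summary = "The company said seller fees will increase to 6.5% on Monday."

# score() takes parallel lists of documents and summaries and returns a dict;
# the "scores" entry holds one consistency score per (document, summary) pair.
result = model_conv.score([document], [summary])
print(result["scores"][0])
```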
F Task Instructions
Task instructions provided to crowd annotators for
the generic summarization task setting are shown in
Figure 14 and those for the keyword-based setting
are shown in Figure 15.
Figure 11: Examples of generated summaries for the CNN-2022 dataset using 3 different summarization systems.
Figure 12: Examples of generated summaries for the BBC-2022 dataset using 3 different summarization systems.
Figure 13: Examples of keyword-focused summaries for CNN articles from 2022. Each example shows the input article alongside keyword-focused summaries generated by CTRLSum and GPT3-D2 for the given keyword.
Figure 14: Screenshot of the task instructions for the generic summarization setting.

Figure 15: Screenshot of the task instructions for the keyword-based setting.