Abstract

In this paper, we study the impact of large language model prompting on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics cannot reliably evaluate GPT-3 summaries. Finally, we evaluate models on a setting beyond generic summarization, specifically keyword-based summarization, and show how dominant fine-tuning approaches compare to prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and prompt-based models across 4 standard summarization benchmarks, and (b) 1K human preference judgments comparing different systems for generic- and keyword-based summarization.[1]

[1] All data available at: https://tagoyal.github.io/zeroshot-news-annotations.html

[Figure 1: Examples of GPT-3 summaries. We can generate summaries following style constraints or queries included in the prompts, allowing us to emulate a range of existing fine-tuned systems. The pictured example pairs a news article with a keyword-constrained prompt, "Summarize the above article briefly focusing on Alina Habba.", and the corresponding GPT-3 summary about Alina Habba.]

1 Introduction

Fine-tuning pre-trained models on domain-specific datasets has been the leading paradigm in text summarization research in recent years (Lewis et al., 2020; Zhang et al., 2020; Raffel et al., 2020). These models generate high-quality summaries on standard benchmarks, but still require sizeable training datasets to adapt to new settings, e.g., summarizing data from a new source domain or producing a summary in a different style. The success of prompting large language models (GPT-3 (Brown et al., 2020), T0 (Sanh et al., 2022), PaLM (Chowdhery et al., 2022), etc.) provides an alternative approach, namely learning from natural language task instructions and/or a few demonstrative examples in the context without updating model parameters. While recent work (Zhao et al., 2021; Min et al., 2022; Ye and Durrett, 2022) has evaluated this paradigm across a number of tasks, it has only been studied for text summarization with unreliable automatic metrics (He et al., 2022b; Chowdhery et al., 2022; Ouyang et al., 2022) or in non-standard settings (Saunders et al., 2022).

In this paper, we conduct the first systematic study of the impact of prompt-based models on the text summarization research space, using an Instruct-tuned 175B GPT-3 model (text-davinci-002) (Brown et al., 2020; Ouyang et al., 2022) as a case study. Figure 1 shows that GPT-3 summaries are extremely high-quality and adaptable to different summarization settings. Starting from these observations, we aim to answer three main questions. First, how do prompt-based GPT-3 summaries compare to those obtained from
state-of-the-art fine-tuned summarization models (Zhang et al., 2020; Liu et al., 2022)? We compare these approaches using A/B testing on a new corpus of recent news articles, and find that our study participants overwhelmingly prefer GPT-3 summaries across two different "styles" with different prompts (three-sentence and single-sentence). Moreover, these summaries do not suffer from limitations due to low-quality training data that plague fine-tuned generic summarization models (Maynez et al., 2020; Goyal et al., 2022).

Dataset     | Avg. words (article) | Avg. words (summary) | % novel 1-grams | % novel 2-grams
CNN         | 760.5                | 45.7                 | 16.7            | 54.3
DailyMail   | 653.3                | 54.6                 | 17.0            | 53.8
XSum (BBC)  | 431.1                | 23.2                 | 35.7            | 82.4
Newsroom    | 658.6                | 26.7                 | 18.9            | 47.5

Table 1: Basic statistics of standard summarization datasets: CNN/DM (Hermann et al., 2015; Nallapati et al., 2016), XSum (Narayan et al., 2018), Newsroom (Grusky et al., 2018). These show large variance in their summary properties and fundamentally differ in their definition of the "gold" standard.
Second, are existing automatic metrics well-suited to evaluating prompt-based summaries? Recent work has shown that classic reference-based metrics such as ROUGE (Lin, 2004) and BERTScore (Zhang* et al., 2020) are unreliable when small improvements are reported (Peyrard, 2019; Fabbri et al., 2021); however, large differences, on the order of say 5 ROUGE points or greater, are considered to be correlated with human preferences (Bhandari et al., 2020; Deutsch et al., 2022). However, we find that the same is no longer true when evaluating GPT-3 summaries. These summaries score much lower on automatic metrics (7 ROUGE-L points on average) than all prior state-of-the-art models while comfortably outperforming them on human evaluation. Furthermore, we show that recent reference-free metrics, e.g. QA-based metrics (Fabbri et al., 2022; Durmus et al., 2020) and trained factuality models (Kryscinski et al., 2020; Goyal and Durrett, 2020), similarly fail to adapt to this shift away from the fine-tuning paradigm.

2 Models and Setup

2.1 Current Paradigms for Summarization

Recent zero- and few-shot prompting based models (Brown et al., 2020; Sanh et al., 2022) have shown impressive generalization capabilities on unseen tasks specified using prompts alone, without performing any gradient updates (Mishra et al., 2022). In this work, we want to compare their text summarization performance against the current state-of-the-art models. Figure 2 situates these paradigms, contrasting their training with inference-time usage.

[Figure 2: The space of models compared in this work: task-specific models fine-tuned on summarization datasets and trained for each dataset (e.g. PEGASUS, CTRLSum, BRIO), models instruction-tuned on multiple tasks (e.g. FLAN, InstructGPT / text-davinci-002), and zero-shot prompting of pre-trained LMs not trained on summarization datasets (e.g. PaLM, Turing-NLG), the latter being unavailable or less effective than prompting instruction-tuned counterparts.]
In this work, we compare the summarization performance of three models that are representative of this space of options:

1. OpenAI's text-davinci-002, a GPT-3 model (Brown et al., 2020) from the Instruct series (Ouyang et al., 2022). While we do not know the exact training details for this release of the model, the previous model in the series (text-davinci-001) was fine-tuned on a combination of prompts submitted to their API and labeler-written prompts spanning multiple tasks. These tasks include summarization, but not (to our knowledge) standard summarization datasets like CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) or XSum (Narayan et al., 2018). We choose the text-davinci-002 version for our experiments in order to benchmark the best available prompt-based model.[2] We refer to this approach as GPT3-D2.

2. BRIO (Liu et al., 2022), a fine-tuned summarization model that reports state-of-the-art results on both CNN/DM and XSum. We will use versions of this model fine-tuned on each of these two datasets.

3. T0 (Sanh et al., 2022), a prompt-based model fine-tuned on multiple tasks including standard summarization datasets. This provides a useful point of comparison between task-specific fine-tuned models (BRIO) and bigger instruction-tuned models (GPT3-D2).

[2] We did not observe obvious quality differences in generated summaries between text-davinci-001 and text-davinci-002. Examples are included in Appendix C.

2.2 Using GPT3-D2 for summarization

Fine-tuned models largely follow the "style" of reference summaries in their training data, and hence, generated summaries show large variance between datasets (see Table 1 for basic summary statistics of standard summarization datasets). To ensure fair comparison between these and GPT3-D2, we adapt the latter's prompt to align with dataset-specific styles.

Specifically, we follow prior work (Sanh et al., 2022) and use sentence-count length prompts to adapt to each dataset. Although these datasets also differ along other attributes, e.g. CNN/DM is lead-biased whereas XSum requires drawing inferences from a whole article, we do not attempt to control any other attributes of the summary. Figure 3 shows an example of different-length GPT3-D2 summaries for the same news article, using the following prompt format:

    Article: {{article}}
    Summarize the above article in N sentences.

[Figure 3: Illustration of length control using the task description / prompt for GPT3-D2. We found that the generated summaries followed the given sentence length constraint 98% of the time, allowing us to generate different length summaries emulating different datasets.]

We found that GPT3-D2 summaries faithfully follow the given length constraint in 98% of the test instances used in our human study data in Section 3.
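As a concrete illustration, the sentence-count prompting setup described above can be reproduced with a few lines of code. The sketch below assumes the legacy OpenAI completions client and illustrative decoding parameters; it is not the exact configuration used to generate the GPT3-D2 summaries studied here.

```python
# Minimal sketch of sentence-count prompting with text-davinci-002 (GPT3-D2).
# Assumes the legacy `openai` completions client; max_tokens and temperature
# are illustrative choices, not the paper's exact settings.
import openai
import nltk

nltk.download("punkt", quiet=True)

PROMPT_TEMPLATE = "Article: {article}\n\nSummarize the above article in {n} sentences."

def gpt3_d2_summarize(article: str, n_sentences: int) -> str:
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=PROMPT_TEMPLATE.format(article=article, n=n_sentences),
        max_tokens=256,
        temperature=0.0,
    )
    return response["choices"][0]["text"].strip()

def follows_length_constraint(summary: str, n_sentences: int) -> bool:
    # The sentence-count constraint is respected in roughly 98% of cases;
    # this check makes that measurable.
    return len(nltk.sent_tokenize(summary)) == n_sentences

# N = 3 emulates CNN/DM-style summaries and N = 1 emulates XSum-style ones.
```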
Given this setup, we first compare the summary quality of the three summarization models through a human annotation study (Section 3). Then, we evaluate the current suite of summarization metrics for prompt-based summarization (Section 4). Finally, in Section 5, we briefly discuss GPT3-D2 performance on summarization tasks beyond generic summarization and new challenges.

3 Human evaluation of GPT3-D2 summaries

Generated summaries of fine-tuned models (Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) emulate gold-standard summaries in their training datasets. In contrast, prompt-based GPT3-D2 models generate summaries based on how the given task description surfaces behavior learned during pre-training or instruction-tuning. In this section, we ask: how do these paradigms compare? Does learning from gold summaries lead to a better summarization model? To answer this, we conduct a human study to compare outputs of our 3 representative models and collect human preferences of quality.
[Figure 4: Examples of CNN-style and BBC/XSum-style summaries for the three systems. For CNN, we observe that models fine-tuned on the CNN/DM training set reflect its dataset bias: summaries are highly extractive, specific and lead-biased. On the other hand, GPT3-D2 summaries contain fewer specific details but cover more content.]

3.1 Experimental Setup

Datasets for fine-tuning We choose two standard fine-tuning datasets whose summaries differ along multiple dimensions such as length and abstractiveness:

1. CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) contains reference summaries that are approximately 3-4 sentences long. Summaries in this dataset are highly extractive and lead-biased.

2. XSum (Narayan et al., 2018) contains 1-sentence summaries of BBC news articles. In this dataset, reference summaries, and consequently generated summaries from fine-tuned models, are highly abstractive.

Datasets for evaluation Because GPT3-D2's pre-training and instruction-tuning datasets are unknown, it may have been trained on existing articles and summaries in the test splits of these standard benchmarks. We therefore run our human study on 100 recent articles from CNN[3] and BBC, collected between March 1, 2022 and June 31, 2022. We call these CNN-2022 and BBC-2022 respectively.

[3] Although the BRIO CNN/DM model also includes DailyMail data in its training, we do not use this news source in our study as it is now widely considered to be unreliable; e.g., according to the Media Bias / Fact Check site, DailyMail's factual reporting is rated "low": https://mediabiasfactcheck.com/daily-mail/.

Model details We use the publicly released BRIO-XSum and BRIO-CNN/DM models to generate summaries.[4] For T0, we use a prompt we selected from its prompt repository[5] for the CNN/DM and XSum datasets. Finally, to generate GPT3-D2 summaries, we set N = 3 for CNN and N = 1 for BBC in our standard sentence-count prompt template from Section 2.

[4] Models at: https://github.com/yixinL7/BRIO
[5] Repository with T0 prompts: https://github.com/bigscience-workshop/promptsource
Dataset / Model | #sents | #words/sent | % novel 1-grams | % novel 2-grams | #NEs per 100 words
CNN: BRIO       | 3.7    | 15.8        | 12.1            | 36.2            | 12.9
CNN: T0         | 2.7    | 14.9        | 16.4            | 52              | 12.8
CNN: GPT3-D2    | 2.9    | 23.4        | 16.3            | 40.7            | 10.5
BBC: BRIO       | 1.0    | 20.2        | 24.6            | 61.2            | 9.1
BBC: T0         | 1.0    | 20.0        | 26.3            | 66.7            | 9.3
BBC: GPT3-D2    | 1.0    | 27.7        | 16.4            | 42.3            | 8.5

Table 2: Statistics for generated summaries evaluated in the human study across all datasets and summarization systems. We observe that GPT3-D2 generated summaries nearly always follow the sentence length constraints in their prompts.

For a maximally fair comparison in this "realistic" setting, we take some additional steps to improve the output of BRIO-XSum. In order to automate dataset creation, XSum removes the first sentence from news articles to use as the gold summary for training, then treats the rest of the sentences as the article to summarize. This setup differs from the real-world usage of summarization systems, where the complete article is summarized. Due to this mismatch, BRIO-XSum often generates very low-quality outputs, e.g. "All images: Strule Shared Education Campus" in Figure 4, for around 30% of the articles. We manually identify these examples and first attempt to fix them by selecting a summary without such obvious failures from further down the beam (we use beam size = 10). However, if we cannot find a "better" summary, we remove the first sentence of the article and re-sample a new summary to align with its noisy training. This latter strategy often results in factually incorrect summary generations, as is well documented in prior research (Maynez et al., 2020; Goyal and Durrett, 2021).
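The beam-based fix described above can be approximated with standard beam search utilities. The sketch below uses a Hugging Face sequence-to-sequence interface; the checkpoint name and the heuristic for detecting obvious failures are assumptions rather than the exact procedure used in this study.

```python
# Sketch of picking a BRIO-XSum summary from further down the beam when the
# top candidate is degenerate. The checkpoint name and the failure heuristic
# are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "Yale-LILY/brio-xsum-cased"  # assumed public BRIO-XSum checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def looks_degenerate(summary: str) -> bool:
    # Placeholder check for boilerplate failures such as "All images: ...".
    return summary.startswith("All images:") or len(summary.split()) < 4

def summarize_with_beam_fallback(article: str, beam_size: int = 10) -> str:
    inputs = tokenizer(article, truncation=True, max_length=1024, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=beam_size,
        num_return_sequences=beam_size,  # keep the whole beam, best-first
        max_length=64,
    )
    candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    for candidate in candidates:  # walk down the beam
        if not looks_degenerate(candidate):
            return candidate
    return candidates[0]  # fall back to the top beam candidate
```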
Design of the human study We design an A/B test to collect preference annotations. For each given article, annotators are shown summaries from all three summarization systems (BRIO, T0 and GPT3-D2). They are then asked to select their most and least preferred summary or summaries. In addition to these multiple choice questions, we also ask for a free-text justification of both choices.

We make two design decisions for our human study: first, we do not provide annotators with specific definitions of summary quality to avoid introducing our own biases. It is also quite challenging to produce a unified definition of quality for the very different "styles" of summaries evaluated in this study. Instead, we ask them to rely on their own preferences based on summaries they would like to see if they were browsing the web, which we believe to be a representative scenario for non-expert consumers of news summaries. Detailed task instructions are included in Appendix F. Second, we allow multiple selections for both the best and worst summary questions to cater to scenarios in which different summarization systems output similar quality summaries without meaningful differences.

We hire crowd annotators through Prolific. For both CNN and BBC, we recruit 60 unique participants to annotate the 100 summaries in each dataset. Each annotator was asked to annotate 5 articles and each article was annotated by 3 annotators. Additionally, we use Prolific's demographic filters to restrict participation to USA (or UK) residents for CNN (or BBC). We anticipate that residents from these respective countries are better positioned to understand country-specific news events and evaluate their summaries. Participants were paid approximately $11/hr for their work.

3.2 Results

Differences between summarization systems Figure 4 shows examples of generated summaries from all three summarization systems for both CNN and BBC articles. For CNN, we observe that fine-tuned BRIO summaries tend to be highly extractive and generally include a high number of named entities (dates, percentages, names), reflecting the data it was trained on. In contrast, GPT3-D2 summaries are more abstractive and less specific, but provide a more exhaustive overview of the article content. Table 2 provides quantitative evidence of this; we use the percentage of novel n-grams to measure abstractiveness, and the number of named entities per 100 words to measure specificity.
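Both statistics are straightforward to compute. The sketch below shows one way to do so with spaCy; the tokenizer, the spaCy model, and the type-level n-gram counting are simplifying assumptions rather than the exact preprocessing behind Table 2.

```python
# Sketch of the two statistics reported in Table 2: percentage of novel
# n-grams (abstractiveness) and named entities per 100 words (specificity).
import spacy

nlp = spacy.load("en_core_web_sm")

def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_pct(article: str, summary: str, n: int) -> float:
    article_tokens = [t.text.lower() for t in nlp(article)]
    summary_tokens = [t.text.lower() for t in nlp(summary)]
    summary_ngrams = ngram_set(summary_tokens, n)
    if not summary_ngrams:
        return 0.0
    novel = summary_ngrams - ngram_set(article_tokens, n)
    return 100.0 * len(novel) / len(summary_ngrams)

def entities_per_100_words(summary: str) -> float:
    doc = nlp(summary)
    n_words = sum(1 for t in doc if not t.is_punct and not t.is_space)
    return 100.0 * len(doc.ents) / max(n_words, 1)
```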
For BBC, we observe inverse trends, where BRIO and T0 are more abstractive compared to GPT3-D2. Again, this can be attributed to the XSum training data used to train both these prior models. For GPT3-D2 summaries, on the other hand, the level of abstractiveness does not differ between datasets. Finally, Table 2 shows that GPT3-D2 summaries tend to have longer sentences, and therefore a similar number of summary sentences often results in a longer summary for both datasets. We study the effect of this length difference on human preference judgments in Appendix B.

Which systems do humans prefer? Results of our human study are summarized in Table 3. We report the percentage of times a particular system is the most/least preferred model according to majority vote combining all three annotators' choices.[6]

[6] As we allow multiple system selections, note that more than one system could be the majority. However, this is rare after majority vote: only 2% of the articles in CNN and 7% in BBC have multiple best summaries.
Dataset | BRIO best / worst | T0 best / worst | GPT3-D2 best / worst
CNN     | 36 / 24           | 8 / 67          | 58 / 9
BBC     | 20 / 56           | 30 / 29         | 57 / 5

Table 3: Percentage of times a summarization system is selected as the best or worst according to majority vote (may be tied). Human annotators have a clear preference for GPT3-D2 for both CNN and BBC style summaries.

[Figure 5: Annotator vote distribution for best and worst summaries across all datasets and models. Although GPT3-D2 is the clear winner according to majority vote, this choice is unanimous for less than 30% of the articles. This demonstrates the inherent variance in different annotators' definitions of "best summary", especially when comparing high-quality summaries from strong models.]

Across both datasets and styles, we observe a clear preference for GPT3-D2 summaries compared to the other two models. In fact, in both scenarios, GPT3-D2 outperforms the next best model by at least 20 percentage points. This improvement is statistically significant according to a paired bootstrap test (CNN p-value = 2 x 10^-3, BBC p-value = 6 x 10^-4).
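A paired bootstrap over per-article preferences is one standard way to obtain such a p-value. The sketch below assumes one majority-vote winner label per article and is an illustration of the test rather than the authors' exact analysis script.

```python
# Sketch of a paired bootstrap test over per-article majority-vote winners,
# e.g. winners = ["gpt3-d2", "brio", "gpt3-d2", ...], one entry per article.
import random

def paired_bootstrap_pvalue(winners, system_a="gpt3-d2", system_b="brio",
                            n_resamples=10_000, seed=0):
    """Fraction of resamples in which system_a is NOT preferred more often."""
    rng = random.Random(seed)
    n = len(winners)
    losses = 0
    for _ in range(n_resamples):
        sample = [winners[rng.randrange(n)] for _ in range(n)]
        if sample.count(system_a) <= sample.count(system_b):
            losses += 1
    return losses / n_resamples
```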
Note that the next best model differs between the two datasets. For BBC, annotators prefer T0 summaries over BRIO. Annotator rationales often mentioned misleading or incorrect information as the primary reason for selecting BRIO as the worst summary, confirming the issues that have been observed with XSum-trained models (Maynez et al., 2020; Pagnoni et al., 2021; Goyal and Durrett, 2021). Although T0 also includes XSum training data, we hypothesize that its multi-task framework helps offset the noisy signal from XSum.

In contrast, annotators rate T0 as the worst summarization system for CNN. The most common rationales for this were shorter length and inclusion of irrelevant details, e.g. long quotes, while missing key points. Some annotators also commented that these T0 summaries were less coherent compared to the other models. Interestingly, we did not observe similar complaints for the single-sentence T0 summaries for BBC.

Do annotators agree with each other? To study this, we plot the distribution of annotator votes for each summarization system and dataset in Figure 5. Additionally, we report the inter-annotator agreement, measured using Krippendorff's alpha with MASI distance (Passonneau, 2006), to account for the multiple selections of best or worst summary allowed in our study design.
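Because annotators may select several systems as best or worst, agreement is computed over set-valued labels. The sketch below shows how Krippendorff's alpha with MASI distance can be computed with NLTK; the toy data layout is an assumption about the annotation format, not the study's actual records.

```python
# Sketch of inter-annotator agreement: Krippendorff's alpha with MASI
# distance over set-valued "best summary" selections, via NLTK.
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import masi_distance

# Each record: (annotator id, article id, frozenset of systems selected).
records = [
    ("a1", "article_1", frozenset({"gpt3-d2"})),
    ("a2", "article_1", frozenset({"gpt3-d2", "t0"})),
    ("a3", "article_1", frozenset({"gpt3-d2"})),
    ("a1", "article_2", frozenset({"brio"})),
    ("a2", "article_2", frozenset({"t0"})),
    ("a3", "article_2", frozenset({"brio"})),
]

task = AnnotationTask(data=records, distance=masi_distance)
print("Krippendorff's alpha:", task.alpha())
```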
The vote distribution shows that although more annotators prefer GPT3-D2 summaries, this choice is only unanimous, i.e. supported by all three annotators, for less than 30% of the annotated articles. Conversely, although BRIO (or T0) summaries are less preferred than GPT3-D2 for the CNN (or BBC) dataset on aggregate, they were voted as the best summary by at least one annotator for more than 60% of the articles. This demonstrates two things: first, when comparing summaries from two strong models, the choice is inherently ambiguous (similar observations in Clark et al. (2021)). Second, these results, and the diversity in the written rationales, show that there does not exist a universal definition of a "good" summary and that different summary properties appeal to different annotators. Regardless, the aggregate preference for GPT3-D2 is high enough across the board to give us confidence in its strength.

How do these results impact the field? Progress in text summarization research in the last five years has been enabled by the construction of large-scale text summarization datasets that involved scraping news articles and pairing them with any available summary-like data (Hermann et al., 2015; Narayan et al., 2018; Grusky et al., 2018). The CNN/DM dataset considers bullet points accompanying news articles as its summary. These "gold" standard summaries provided useful training signal to train impressive supervised models (Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) and hence, their quality or alignment with human preferences was largely ignored.

We found that, despite its popularity, XSum is largely unsuitable for fine-tuning models like BRIO for realistic summarization settings.
Dataset / Model      | ROUGE (1/2/L)      | METEOR | BLEU | BERTScore | MoverScore | QAEval EM | QAEval F1
CNN: PEGASUS         | 34.85/14.62/28.23  | .24    | 7.1  | .858      | .229       | .105      | .160
CNN: BRIO            | 38.49/17.08/31.44  | .31    | 6.6  | .864      | .261       | .137      | .21
CNN: T0              | 35.06/13.84/28.46  | .25    | 5.9  | .859      | .238       | .099      | .163
CNN: GPT3-D2         | 31.86/11.31/24.71  | .25    | 3.8  | .858      | .216       | .098      | .159
DailyMail: PEGASUS   | 45.77/23.00/36.65  | .33    | 12.2 | .865      | .308       | .159      | .229
DailyMail: BRIO      | 49.27/24.76/39.21  | .37    | 11.7 | .871      | .331       | .175      | .259
DailyMail: T0        | 42.97/19.04/33.95  | .28    | 8.9  | .863      | .290       | .21       | .184
DailyMail: GPT3-D2   | 38.68/14.24/28.08  | .26    | 6.6  | .859      | .248       | .01       | .159
XSum: PEGASUS        | 47.97/24.82/39.63  | .36    | 9.8  | .901      | .362       | .145      | .221
XSum: BRIO           | 49.66/25.97/41.04  | .39    | 10.6 | .901      | .372       | .139      | .224
XSum: T0             | 44.20/20.72/35.84  | .34    | 8.0  | .896      | .340       | .125      | .208
XSum: GPT3-D2        | 28.78/7.64/20.60   | .19    | 2.2  | .869      | .197       | .066      | .119
Newsroom: PEGASUS    | 39.21/27.73/35.68  | .39    | 1.4  | .873      |            | .182      | .253
Newsroom: BRIO       | -                  | -      | -    | -         | -          | -         | -
Newsroom: T0         | 25.64/19.49/21.41  | .20    | 0.4  | .849      | .145       | .080      | .125
Newsroom: GPT3-D2    | 27.44/10.67/22.18  | .2     | 0.5  | .859      | .159       | .089      | .142

Table 4: Performance of different summarization systems measured using reference-based automatic metrics. Across all datasets, we observe that automatic metrics report substantially worse results for GPT3-D2 summaries compared to fine-tuned models. This directly contradicts the human preference results from Section 3, demonstrating that these reference-based metrics cannot reliably compare the quality of prompt-based summaries against fine-tuned summaries.
Even though a CNN/DM-trained BRIO model performed better, the results of our human study question the continued utility of hill-climbing on this dataset, as it seems users may simply prefer a different style of summary altogether. In fact, this preference for GPT3-D2 is much larger than incremental improvements reported in other human evaluation settings, e.g. improvements on XSum on the GENIE leaderboard (Khashabi et al., 2022). Furthermore, as we will see in Section 5, the greater flexibility of GPT3-D2 compared to these systems makes it more suitable for news summarization tasks beyond generic summarization.

If a system designer collects a large-scale dataset of high-quality summaries that they wish to emulate, we believe a fine-tuned system may outperform GPT3-D2. However, better-trained models on datasets collected via "incidental" supervision are less likely to help.

4 Can current automatic metrics evaluate GPT3-D2 summaries?

Automatic metrics proposed for summarization evaluation can be broadly divided into two categories: (1) reference-based, which compare generated summaries against available gold summaries, and (2) reference-free, which only rely on the input document. Here, we compare their performance at evaluating prompt-based GPT3-D2 summaries.

Experimental Setup We evaluate automatic metrics using summaries from 4 different summarization datasets, listed in Table 1. For each dataset, we construct our evaluation sets by randomly sampling 500[7] articles from the standard test split.[8] We compare the same 3 summarization systems from Section 3 in our analysis. Additionally, we also report results using the fine-tuned PEGASUS model (Zhang et al., 2020), as BRIO fine-tuned models are not available for all datasets.

[7] This size is chosen to give sufficient statistical power (Card et al., 2020) while keeping costs for GPT3-D2 evaluation low to enable others to compare on this subset. We outline costs in Appendix D.
[8] Note that these standard datasets were released before 2020. Therefore, it is possible that some article-summary pairs in our test set overlap with GPT3-D2's training data. However, we do not observe a qualitative difference in GPT3-D2's performance on these older articles.

We publicly release this corpus of summarization outputs to standardize the test sets and support future research into GPT3-D2 based summarization. Link: https://tagoyal.github.io/zeroshot-news-annotations.html.

4.1 Reference-based metrics

Here, we study if the gold summaries of the standard datasets are useful for evaluation, especially when evaluating prompt-based summaries that are not trained to emulate the gold.
Dataset / Model      | SUPERT | BLANC | QuestEval | QAFactEval | FactCC | DAE   | SummaC
CNN: PEGASUS         | .5466  | .0605 | .7373     | 4.4071     | .3743  | .8223 | .1138
CNN: BRIO            | .5586  | .0802 | .7334     | 3.8332     | .1817  | .1577 | -.0532
CNN: T0              | .5330  | .0558 | .7799     | 3.7517     | .2012  | .7556 | -.0605
CNN: GPT3-D2         | .5560  | .0749 | .7249     | 3.6399     | .2428  | .6671 | -.0729
DailyMail: PEGASUS   | .6433  | .1137 | .7536     | 4.4677     | .5152  | .8497 | .2402
DailyMail: BRIO      | .6360  | .1217 | .7415     | 4.1362     | .3609  | .8118 | .0153
DailyMail: T0        | .5995  | .0889 | .7803     | 3.9827     | .2431  | .8043 | .0478
DailyMail: GPT3-D2   | .6118  | .0983 | .7461     | 3.8279     | .2697  | .6990 | .0365
XSum: PEGASUS        | .4439  | .0249 | .8233     | 2.0089     | .2465  | .3508 | -.2993
XSum: BRIO           | .4459  | .0230 | .8305     | 1.8626     | .2031  | .3040 | -.3292
XSum: T0             | .4538  | .0238 | .7957     | 2.0330     | .2209  | .332  | -.3037
XSum: GPT3-D2        | .5060  | .0594 | .8064     | 2.9492     | .3977  | .6372 | -.2626
Newsroom: PEGASUS    | .6286  | .1131 | .18       | 4.2120     | .7218  | .7956 | .2418
Newsroom: BRIO       | -      | -     | -         | -          | -      | -     | -
Newsroom: T0         | .5433  | .0640 | .511      | 3.5799     | .2828  | .7376 | .0261
Newsroom: GPT3-D2    | .5408  | .0599 | .7160     | 3.2336     | .3988  | .6564 | -.0729

Table 5: Performance of different summarization systems, as scored by automatic reference-free evaluation metrics from the summarization literature. Similar to reference-based metrics, these also generally fail to reliably produce the same system rankings as human preferences across datasets.
We benchmark the performance of 3 different families of summarization metrics: (1) overlap-based metrics, specifically ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005) and BLEU (Papineni et al., 2002); (2) similarity-based metrics, which compute similarity between embedding representations of generated and reference summaries; specifically, we report BERTScore (Zhang* et al., 2020) and MoverScore (Zhao et al., 2019); and (3) a QA-based metric, specifically QAEval (Deutsch et al., 2021). Although most QA-based metrics are reference-free (discussed in Section 4.2), QAEval uses the reference summaries to indicate saliency. We report both the exact match (EM) and F1 components of QAEval.
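For reference, the sketch below shows how two of these reference-based scores can be computed with commonly used packages; the package choices and settings are assumptions and not necessarily the configuration behind the numbers in Table 4.

```python
# Sketch of reference-based scoring for one (generated, reference) pair.
from rouge_score import rouge_scorer
from bert_score import score as bertscore

def reference_based_scores(generated: str, reference: str) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, generated)  # signature: score(target, prediction)
    precision, recall, f1 = bertscore([generated], [reference], lang="en")
    return {
        "rouge1": rouge["rouge1"].fmeasure,
        "rouge2": rouge["rouge2"].fmeasure,
        "rougeL": rouge["rougeL"].fmeasure,
        "bertscore_f1": f1.item(),
    }
```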
Results Table 4 outlines the results. It shows that the BRIO and PEGASUS models, fine-tuned to emulate the reference summaries, outperform GPT3-D2 summaries according to all reference-based automatic metrics. The difference in their assigned scores is very high, e.g. >7 ROUGE-L points between GPT3-D2 and BRIO. For comparison, these reported scores for GPT3-D2 are even lower than the trivial Lead-3 baseline reported in prior work (Fabbri et al., 2021; Grusky et al., 2018). This clearly demonstrates that current automatic reference-based metrics cannot be used to reliably measure summary quality under the prompting paradigm.

Amongst prompting-based models, we observe that T0 summaries report better metric scores than GPT3-D2 for all datasets except Newsroom. Interestingly, out of the four datasets evaluated here, Newsroom is the only one not used to train the T0 model. This further shows that access to dataset-specific reference summaries during training improves performance according to these metrics, rendering them unsuitable for evaluating prompt-based models.

4.2 Reference-free metrics

Next, we investigate whether current reference-free evaluation metrics reflect the human preference rankings between summarization systems, as observed in Section 3. Here, we study 2 categories of metrics: (1) quality metrics, specifically SUPERT (Gao et al., 2020), which evaluates generated summaries against automatically identified salient sentences in the input, and BLANC (Vasilyev et al., 2020), which evaluates summaries on language understanding tasks. We refer readers to the original papers for detailed explanations of these. (2) factuality metrics, which evaluate whether generated summaries contain incorrect information with respect to the source article. We report the performance of summarization systems using two QA-based metrics: QuestEval (Scialom et al., 2021) and QAFactEval (Fabbri et al., 2022). Additionally, we also benchmark entailment-based metrics: FactCC (Kryscinski et al., 2020), DAE (Goyal and Durrett, 2020, 2021) and SummaC (Laban et al., 2022).[9] These entailment-based models are designed for classification into factual or non-factual; therefore, we use P(factual | article, summary) to score generated summaries.

[9] Exact model versions and configurations used for these are outlined in Appendix A.
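The scoring rule for the entailment-based metrics can be illustrated with any off-the-shelf NLI classifier: the probability mass assigned to the entailment class serves as P(factual | article, summary). The MNLI checkpoint below is a generic stand-in, not the FactCC, DAE, or SummaC models actually benchmarked in Table 5.

```python
# Sketch of entailment-based factuality scoring with a generic NLI model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-large-mnli"  # stand-in; not the benchmarked metrics
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def p_factual(article: str, summary: str) -> float:
    inputs = tokenizer(article, summary, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    label2id = {label.lower(): idx for label, idx in model.config.label2id.items()}
    return probs[label2id["entailment"]].item()
```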
Results Table 5 outlines the scores for each summarization system according to the above reference-free metrics. Ideally, we want the relative rankings of different systems according to these metrics to correspond to human preferences, i.e. GPT3-D2 > BRIO > T0 for CNN/DM[10] and GPT3-D2 > T0 > BRIO for XSum.[11]

[10] Although the human study in Section 3 is only run on CNN articles, the underlying fine-tuned model is the same for both CNN and DM. Therefore, we can reasonably expect it to display similar quality differences with respect to GPT3-D2.
[11] Note that while annotators were not explicitly asked to rate factuality, we instructed them to carefully check factuality and appropriately downvote non-factual summaries.

Overall, we observe that none of the reference-free metrics we evaluate follow these trends for both the CNN/DM and XSum datasets. In particular, we observe that GPT3-D2 summaries report low factuality scores (except on XSum) even though we rarely found any factual errors in our qualitative analysis of its generated summaries.

Interestingly, we noticed a roughly inverse relation to abstractiveness; summarization systems that generated more abstractive summaries (see Table 2) were generally scored lower by these automatic metrics. For instance, GPT3-D2 is scored lower than BRIO by both quality metrics for all datasets except XSum; the latter is the only dataset for which GPT3-D2 summaries are less abstractive. Such shortcomings of reference-free evaluation metrics due to spurious correlations have also been studied in prior work (Durmus et al., 2022). These issues become more exaggerated when the summarization systems being compared exhibit very different properties.
Discussion On the surface, the failure of reference-free metrics at evaluating GPT3-D2 summaries is more surprising than that of reference-based metrics, as the latter explicitly compare generated summaries with references that GPT3-D2 is not trained to imitate. Therefore, GPT3-D2 understandably scores lower than fine-tuned systems on reference-based metrics.

However, we note two different issues with reference-free metrics: (1) Some of these, e.g. FactCC and DAE, use reference summaries as positive examples to train the metric. Therefore, although "reference-free" at test time, they are still trained to reward the summary properties seen in the standard summarization benchmarks. (2) Even completely reference-free metrics, e.g. QuestEval and QAFactEval, have only been evaluated on reference-based benchmarks and fine-tuned models. Therefore, the choice of different components, such as which question answering or question generation models to use, has been dictated by the error space of prior fine-tuned models (Tang et al., 2023). These decisions also now need to be re-visited to incorporate GPT3-D2 evaluation; we leave this for future work.

5 Beyond Generic Summarization

Previously, we observed that GPT3-D2 models faithfully follow simple "style" instructions in the given prompts. This provides a promising direction to tackle other use cases in news summarization beyond the generic summarization task from Section 3.

Different users can have very different information needs from the same article, all of which cannot be satisfied with a single generic summary. Prior work has introduced several task formulations to address this gap, including keyword-focused (He et al., 2022a), query-focused (Baumel et al., 2014; He et al., 2022a), or aspect-focused summarization (Krishna and Srinivasan, 2018; Ahuja et al., 2022), amongst others. Here, we evaluate GPT3-D2 performance at two of these use cases.

In keyword-based summarization, the output summaries must succinctly summarize the input document focusing on a given keyword; keywords generally correspond to specific entities or events directly mentioned in the document. In contrast, the control units in aspect-based summarization are high-level topics that can be common across multiple similar types of documents. For example, for the input article in Figure 1, Donald Trump or Russian interference in the 2016 elections are keyword controls, whereas charges against the defendants is a higher-level aspect that can serve as the query for any news article discussing a lawsuit or investigation.

5.1 Qualitative Analysis

Baseline Model for comparison We use the recently proposed CTRLSum (He et al., 2022a), a fine-tuned BART model, as our baseline. It can be flexibly adapted for both keyword- and aspect-based settings by including a prompt as additional input to the encoder. We use the prompt template recommended in the original paper.[12]

[12] Trained model publicly released at: https://github.com/salesforce/ctrl-sum
[Figure 6: Comparison of keyword- and aspect-based summaries using the GPT3-D2 and CTRLSum models. The GPT3-D2 prompt is shown on the left with the corresponding keyword or aspect bolded. For keyword-based summarization, the GPT3-D2 summary presents appropriate context before the keyword-specific information. However, for aspect-based summarization, it does not always generate factually correct summaries, as shown in the first aspect example. We observe that CTRLSum performs poorly for both these settings.]
Control Units For the keyword-focused setting, we use named entities extracted from the input article as the control units. For aspect-focused summarization, we directly use the aspects introduced in the guided summarization task from TAC 2011.[13] It defined 5 broad categories of newswire articles, such as accidents and natural disasters, investigations and trials, etc., and multiple aspects for each category. For example, the "investigations and trials" category includes aspects such as "who is the defendant or under trial?", "who is investigating, prosecuting, judging?", and so on.

[13] https://tac.nist.gov/2011/Summarization/Guided-Summ.2011.guidelines.html

Qualitative Analysis Figure 6 shows examples of keyword- and aspect-focused summaries using GPT3-D2 and the baseline CTRLSum model. The keywords or aspects are highlighted in bold within the GPT3-D2 prompt displayed on the left.

In this example, representative of average GPT3-D2 quality, the keyword-focused GPT3-D2 summary first gives a brief overview of the article setting before providing keyword-relevant information. In contrast, the CTRLSum summary exhibits poor discourse structure and reads like a list of facts stapled together.

The figure also shows aspect-focused summaries for two aspects associated with the "investigations and trials" category most appropriate for the chosen article. We see mixed results here for GPT3-D2; it generates a factually incorrect summary for the first aspect, listing multiple people from the input article as defendants instead of only "Donald Trump". For the second aspect, it correctly maps the high-level concept "defendant" to "Donald Trump" in the input article and generates the correct answer to the input query: "The defendant's reaction to charges in the above article is denial of charges".

On the other hand, CTRLSum fails to generate aspect-focused summaries for both cases. We believe that it struggles to align high-level concepts and explicit entities in the article due to a lack of
such aspect-specific examples in its training data. Instead, it generates summaries focusing on lexically similar words, i.e. "defenders" for both cases.

Based on GPT3-D2's promising keyword-focused summarization capabilities observed above, we next conduct a human study to systematically compare it against the CTRLSum baseline. We leave further explorations of aspect-based summarization to future work, given the mixed to poor results for both models at this task.

[Figure 7: Distribution of annotator votes for the keyword-focused summarization task. Annotators prefer GPT3-D2 summaries over CTRLSum for approximately 70% of all article-keyword pairs, showing unanimous preference more than half the time.]

5.2 Human Study: Keyword-focused summarization

Task Setup Similar to Section 3, we design an A/B test to compare the two models. We use the same set of 100 CNN[14] articles as Section 3. We randomly extract 2 distinct named entities from each article. In the study interface, the annotator is shown the article-keyword pair and the GPT3-D2 and CTRLSum summaries corresponding to it. They are asked to select the summary that best summarizes the input article while focusing on the given keyword. Exact task instructions are included in Appendix F.

[14] We run this study using only CNN articles as the baseline CTRLSum model is trained on CNN/DM.

Again, we run this study using the Prolific platform. We recruit 60 participants to annotate the 100 articles; each article is annotated by 3 annotators, which includes annotations for 2 separate keywords. Each annotator evaluates 5 articles.
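The article-keyword pairs for this study can be constructed mechanically. The sketch below samples named entities with spaCy and wraps them in a keyword-focused prompt in the style of Figure 1; the spaCy model and sampling details are assumptions rather than the exact study pipeline.

```python
# Sketch of building article-keyword pairs and keyword-focused GPT-3 prompts.
import random
import spacy

nlp = spacy.load("en_core_web_sm")

def sample_keywords(article: str, k: int = 2, seed: int = 0):
    entities = sorted({ent.text for ent in nlp(article).ents})
    rng = random.Random(seed)
    rng.shuffle(entities)
    return entities[:k]  # two distinct named entities per article

def keyword_prompt(article: str, keyword: str) -> str:
    return (f"Article: {article}\n\n"
            f"Summarize the above article briefly focusing on {keyword}.")
```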
Results Figure 7 shows the distribution of annotator votes between the GPT3-D2 and CTRLSum models. Annotators show a clear preference for GPT3-D2. In fact, for nearly 70% of all article-keyword pairs, GPT3-D2 is preferred over CTRLSum by a majority of the annotators. The main rationales given for this choice were better contextualization of keyword-related information and better coherence in GPT3-D2 summaries.

Impact These results show that prompting GPT-3 models presents a promising alternative to fine-tuned models for such specialized summarization tasks that can be easily described using textual prompts. One of the major drawbacks of fine-tuned models is that they are constrained by what data is available and how it can be transformed to create new task-specific training data. CTRLSum relied on the SQuAD question answering dataset (Rajpurkar et al., 2016) because the required "queries" or "questions" were unavailable at scale for summaries in standard summarization datasets. In contrast, prompt-based models are not constrained by the availability of task-specific data and can flexibly adapt to new tasks. Future research should focus on further exploring these capabilities and possible improvements on currently "unsolved" tasks such as aspect-based or plan-based summarization.

6 Discussion and Related Work

In recent years, research in text summarization (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017; Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) has typically relied on comparisons with gold test sets for evaluation, possibly augmented with reference-free metrics for dimensions like factuality. This paper shows that all these metrics are completely ineffective at evaluating GPT-3 summaries. Although issues with these metrics, particularly low correlation with human judgments, have also been studied earlier (Fabbri et al., 2021; Deutsch and Roth, 2021), they are considered reliable when comparing systems in different score ranges (Peyrard, 2019; Deutsch et al., 2022). However, GPT-3 challenges these established practices and evaluation protocols, and poses an urgent need for better evaluation.

This brings us to manual evaluation, generally considered to be the gold standard for generation evaluation. The majority of summarization research now reports results from a human study in addition to automatic metrics, but there is a general lack of consensus on what dimensions to evaluate, task design, and other factors (Hardy et al., 2019). This presents difficulties in conducting reliable and reproducible comparisons between systems (Karpinska et al., 2021), another factor
tributing to the popularity of automatic metrics. paring different system generations. In our work,
Although recent efforts like GENIE (Khashabi et al., we chose a human evaluation workflow that directly
2022) have taken steps to standardize manual eval- asks annotators to compare systems, while other
uation protocols across systems, its annotation is prior work has opted for Likert-scale judgments
not universally affordable and the quality is not and/or evaluation along multiple quality dimen-
strictly monitored. We hope that future work ad- sions (Gehrmann et al., 2022). The latter strategy
dresses these challenges and democratizes human of evaluating different dimensions could surface
evaluations. more insights into which “style” properties of GPT-
The ultimate test of summarization systems is 3 summaries provide them an edge over fine-tuned
with actual users using the systems in practice. models; however, such analysis is outside the scope
Jones (2007) discusses the need to align task formu- of this paper. Our experiments comparing overall
lations with actual applications scenarios (“purpose quality reveal that current summarization datasets
factors™). However, the research in text summa- are not well-aligned with user preferences. We
rization until now has been constrained to certain leave more fine-grained analysis into these prefer-
problems or domains by the heavy dependence on ence judgments for future work.
large-scale training data: for example, producing a The experiments in this paper are run on English-
bullet-point summary of a news article has emerged language news summarization datasets as these
as standard due to availability of data from CNN, serve as common benchmarks in the summariza-
not because it is shown to be the best way to present tion literature. However, user rankings of system
information. outputs might be different when evaluating other
Now, the success of prompt-based models can domains, e.g., summaries of scientific text. While
allow realistic use-cases to drive research in a more we believe that automatic metrics would fail to eval-
top-down way. We already show that GPT3-D2 im- uate GPT-3 summaries on these domains also (gen-
proves upon prior keyword-focused summarization erated summaries would still look different from
systems that were trained on artificially adapted the reference summaries), users may prefer models
training data. In future research, we are inter- that are specifically fine-tuned on domain-specific
ested in tackling other real world use cases, such data for niche domains.
as update summarization and plan- or aspect-based Finally, we do not know exact datasets or tasks
summarization. Additionally, adapting GPT3-D2 used to train GPT3-D2. It is possible that its RLHF
to documents longer than the allowed context, or training (Ouyang et al., 2022) included summariza-
structured inputs such as tables, presents research tion examples, and therefore, preference judgments
challenges beyond the current capabilities of GPT-3 from human annotators for its different outputs.
and would be interesting to study." However, our arguments in this paper do not rely
on the specifics of the GPT3-D2 system, merely that
7 Conclusion such a system exists. If anything, the existence
In this work, we performed the first systematic of potentially better data underscores that further
study comparing prompt-based GPT-3 and fine- work should collect new data for summarization
model tuning, and our claims about metrics still
tuned models at the news summarization task. We
analyzed the impact of prompting on the summa- hold regardless of the details of how the GPT3-D2
rization field, including training paradigms and summaries were produced.
evaluation practices. Finally, to support further
research in this direction, we release a large corpus References
of generated summaries for multiple prompt-based
Ojas Ahuja, Jiacheng Xu, Akshay Gupta, Kevin
and fine-tuned models, as well as human preference
Horecka, and Greg Durrett. 2022. ASPECTNEWS:
judgments comparing these systems. Aspect-oriented summarization of news documents.
In Proceedings of the 60th Annual Meeting of the
8 Limitations Association for Computational Linguistics (Volume
1: Long Papers). pages 6494-6506.
In the text generation evaluation literature, there
does not exist a standardized task design for com- Satanjeev Banerjee and Alon Lavie. 2005. METEOR:
An automatic metric for mt evaluation with improved
"SWe very briefly discuss long document summarization correlation with human judgments. In Proceedings of
with GPT-3in Appendix E. the acl workshop on intrinsic and extrinsic evaluation
measures for machine translation and/or summariza- summarization evaluation metrics. In Proceedings of
tion, pages 65-72. the 2022 Conference of the North American Chap-
ter of the Association for Computational Linguistics:
Tal Baumel, Raphael Cohen, and Michael Elhadad. Human Language Technologies, pages 6038—6052,
2014. Query-chain focused summarization. In Pro- Seattle, United States. Association for Computational
ceedings of the 52nd Annual Meeting of the Associa- Linguistics.
tion for Computational Linguistics (Volume 1: Long
Papers), pages 913-922. Daniel Deutsch and Dan Roth. 2021. Understanding the
extent to which content quality metrics measure the
Manik Bhandari, Pranav Narayan Gour, Atabak Ash- information quality of summaries. In Proceedings
faq, and Pengfei Liu. 2020. Metrics also disagree of the 25th Conference on Computational Natural
in the low scoring range: Revisiting summarization Language Learning, pages 300-309.
evaluation metrics. In Proceedings of the 28th Inter-
national Conference on Computational Linguistics. Esin Durmus, He He, and Mona Diab. 2020. FEQA: A
pages 5702-5711. question answering evaluation framework for faith-
fulness assessment in abstractive summarization. In
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Proceedings of the 58th Annual Meeting of the Asso-
Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind ciation for Computational Linguistics, pages 5055—
Neelakantan, Pranay Shyam, Girish Sastry, Amanda 5070.
Askell, et al. 2020. Language models are few-shot
learners. Advances in neural information processing Esin Durmus, Faisal Ladhak, and Tatsunori B
systems, 33:1877-1901. Hashimoto. 2022. Spurious correlations in reference-
free evaluation of text generation. In Proceedings
Dallas Card, Peter Henderson, Urvashi Khandelwal, of the 60th Annual Meeting of the Association for
Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. Computational Linguistics (Volume 1: Long Papers),
With little power comes great responsibility. In pages 1443-1454.
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP), Alexander Fabbri, Chien-Sheng Wu, Wenhao Liu, and
pages 9263-9274. Caiming Xiong. 2022. QAFactEval: Improved QA-
based factual consistency evaluation for summariza-
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, tion. In Proceedings of the 2022 Conference of the
Maarten Bosma, Gaurav Mishra, Adam Roberts, North American Chapter of the Association for Com-
Paul Barham, Hyung Won Chung, Charles Sutton, putational Linguistics: Human Language Technolo-
Sebastian Gehrmann, et al. 2022. PaLM: Scaling gies, pages 2587-2601, Seattle, United States. Asso-
language modeling with pathways. arXiv preprint ciation for Computational Linguistics.
arXiv:2204.02311.
Alexander R Fabbri, Wojciech Kryscinski, Bryan Mc-
Elizabeth Clark, Tal August, Sofia Serrano, Nikita Cann, Caiming Xiong, Richard Socher, and Dragomir
Haduong, Suchin Gururangan, and Noah A Smith. Radev. 2021. SummEval: Re-evaluating summariza-
2021. All that’s *human’ is not gold: Evaluating tion evaluation. Transactions of the Association
for
human evaluation of generated text. In Proceedings Computational Linguistics, 9:391-409.
of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Yang Gao, Wei Zhao, and Steffen Eger. 2020. SUPERT:
Joint Conference on Natural Language Processing Towards new frontiers in unsupervised evaluation
(Volume I: Long Papers), pages 7282-7296. metrics for multi-document summarization. In Pro-
ceedings of the 58th Annual Meeting of the Asso-
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, ciation for Computational Linguistics, pages 1347—
Trung Bui, Seokhwan Kim, Walter Chang, and Nazli 1354.
Goharian. 2018. A discourse-aware attention model
for abstractive summarization of long documents. In Sebastian Gehrmann, Elizabeth Clark, and Thibault Sel-
Proceedings of the 2018 Conference of the North lam. 2022. Repairing the cracked foundation: A sur-
American Chapter of the Association for Computa- vey of obstacles in evaluation practices for generated
tional Linguistics: Human Language Technologies, text. arXiv preprint arXiv:2202.06935.
Volume 2 (Short Papers), pages 615621, New Or-
leans, Louisiana. Association for Computational Lin- Tanya Goyal and Greg Durrett. 2020. Evaluating factu-
guistics. ality in generation with dependency-level entailment.
In Findings of the Association for Computational
Daniel Deutsch, Tania Bedrax-Weiss, and Dan Roth. Linguistics: EMNLP 2020, pages 3592-3603.
2021. Towards question-answering as an automatic
metric for evaluating the content quality of a sum- Tanya Goyal and Greg Durrett. 2021. Annotating and
mary. Transactions of the Association for Computa- modeling fine-grained factuality in summarization.
tional Linguistics, 9:774-789. In Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computa-
Daniel Deutsch, Rotem Dror, and Dan Roth. 2022. Re- tional Linguistics: Human Language Technologies,
examining system-level correlations of automatic pages 1449-1462.
Tanya Goyal, Jiacheng Xu, Junyi Jessy Li, and Greg Durrett. 2022. Training dynamics for text summarization models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2061-2073.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708-719.

Hardy Hardy, Shashi Narayan, and Andreas Vlachos. 2019. HighRES: Highlight-based reference-less evaluation of summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3381-3392.

Junxian He, Wojciech Kryściński, Bryan McCann, Nazneen Rajani, and Caiming Xiong. 2022a. CTRLsum: Towards generic controllable text summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5879-5915, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Pengcheng He, Baolin Peng, Liyang Lu, Song Wang, Jie Mei, Yang Liu, Ruochen Xu, Hany Hassan Awadalla, Yu Shi, Chenguang Zhu, et al. 2022b. Z-Code++: A pre-trained language model optimized for abstractive summarization. arXiv preprint arXiv:2208.09770.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28.

Karen Spärck Jones. 2007. Automatic summarising: The state of the art. Information Processing & Management, 43(6):1449-1481.

Marzena Karpinska, Nader Akoury, and Mohit Iyyer. 2021. The perils of using Mechanical Turk to evaluate open-ended text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1265-1285.

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, and Daniel Weld. 2022. GENIE: Toward reproducible and standardized human evaluation for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11444-11458, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Kundan Krishna and Balaji Vasan Srinivasan. 2018. Generating topic-oriented summaries using neural attention. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1697-1705.

Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332-9346.

Wojciech Kryściński, Nazneen Fatema Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir R Radev. 2021. BookSum: A collection of datasets for long-form narrative summarization.

Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871-7880.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81.

Yixin Liu, Pengfei Liu, Dragomir Radev, and Graham Neubig. 2022. BRIO: Bringing order to abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2890-2903.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906-1919.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048-11064, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470-3487.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280-290.
Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797-1807.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812-4829.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318.

Rebecca J Passonneau. 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06).

Maxime Peyrard. 2019. Studying summarization evaluation metrics in the appropriate scoring range. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5093-5100.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379-389.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations.

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802.

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. 2021. QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594-6604.

Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073-1083.

Liyan Tang, Tanya Goyal, Alexander R Fabbri, Philippe Laban, Jiacheng Xu, Semih Yavuz, Wojciech Kryściński, Justin F Rousseau, and Greg Durrett. 2023. Understanding factual errors in summarization: Errors, summarizers, datasets, error detectors. Association for Computational Linguistics.

Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 11-20.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Xi Ye and Greg Durrett. 2022. The unreliability of explanations in few-shot prompting for textual reasoning. In Advances in Neural Information Processing Systems.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328-11339. PMLR.

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Yusen Zhang, Ansong Ni, Ziming Mao, Chen Henry Wu, Chenguang Zhu, Budhaditya Deb, Ahmed Awadallah, Dragomir Radev, and Rui Zhang. 2022. SummN: A multi-stage summarization framework for long input dialogues and documents. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1592-1604.
Yusen Zhang, Ansong Ni, Tao Yu, Rui Zhang, Chenguang Zhu, Budhaditya Deb, Asli Celikyilmaz, Ahmed Hassan, and Dragomir Radev. 2021. An exploratory study on long dialogue summarization: What works and what's next. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4426-4433.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the International Conference on Machine Learning (ICML).

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563-578.

Yao Zhao, Mohammad Saleh, and Peter J Liu. 2020. SEAL: Segment-wise extractive-abstractive long-form text summarization. arXiv preprint arXiv:2006.10213.

Figure 8: Correlation between summary length and annotator score (computed as the number of "best summary" votes). For each example, we plot the difference in length (x-axis: Len(s*) - Len(s)) and annotator score (y-axis) between the GPT3-D2 summary and the next best system's summary.

3. SummaC: We use the SummaC-Conv model (model_name = 'vitc') and sentence-level granularity in our experiments; a usage sketch follows below.
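For illustration, this configuration can be instantiated with the publicly released summac package; the snippet below is a minimal sketch under that assumption. The constructor arguments (models, granularity, etc.) follow the package's documented example rather than any code accompanying this paper, and the document/summary strings are invented.

```python
# Minimal sketch (not code released with this paper): scoring one summary with
# SummaC-Conv using the "vitc" NLI backbone and sentence-level granularity.
# Assumes the public `summac` package (pip install summac) and its documented API.
from summac.model_summac import SummaCConv

model_conv = SummaCConv(
    models=["vitc"],          # NLI model trained on VitaminC
    bins="percentile",
    granularity="sentence",   # sentence-level document/summary units
    nli_labels="e",
    start_file="default",     # pretrained convolution weights
    agg="mean",
    device="cpu",             # switch to "cuda" if a GPU is available
)

# Hypothetical document/summary pair, for illustration only.
document = "The company announced on Friday that seller fees will rise from 5% to 6.5% starting Monday."
summary = "The company said seller fees will increase to 6.5% on Monday."

# score() takes parallel lists of documents and summaries and returns a dict;
# the "scores" entry holds one consistency score per (document, summary) pair.
result = model_conv.score([document], [summary])
print(result["scores"][0])
```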
F Task Instructions
Task instructions provided to crowd annotators for
the generic summarization task setting are shown in
Figure 14 and those for the keyword-based setting
are shown in Figure 15.
Figure 11: Examples of generated summaries for the CNN-2022 dataset using 3 different summarization systems.
Figure 12: Examples of generated summaries for the BBC-2022 dataset using 3 different summarization systems.
Figure 13: Examples of keyword-focused summaries for CNN articles from 2022. Each example shows the input article alongside keyword-focused summaries generated by CTRLSum and GPT3-D2 for the given keyword.
Figure 14: Screenshot of the task instructions for the generic summarization setting.

Figure 15: Screenshot of the task instructions for the keyword-based setting.