Review
Abstractive vs. Extractive Summarization: An
Experimental Review
Nikolaos Giarelis 1, * , Charalampos Mastrokostas 2 and Nikos Karacapilidis 1
1 Industrial Management and Information Systems Lab, Department of Mechanical Engineering and
Aeronautics, University of Patras, 26504 Rio Patras, Greece; [email protected]
2 Department of Electrical and Computer Engineering, University of Patras, 26504 Rio Patras, Greece;
[email protected]
* Correspondence: [email protected]
Abstract: Text summarization is a subtask of natural language processing referring to the automatic
creation of a concise and fluent summary that captures the main ideas and topics from one or
multiple documents. Earlier literature surveys focus on extractive approaches, which rank the top-n
most important sentences in the input document and then combine them to form a summary. As
argued in the literature, the summaries of these approaches do not have the same lexical flow or
coherence as summaries that are manually produced by humans. Newer surveys elaborate on abstractive
approaches, which generate a summary with potentially new phrases and sentences compared to
the input document. Generally speaking, contrary to the extractive approaches, the abstractive ones
create summaries that are more similar to those produced by humans. However, these approaches
still lack the contextual representation needed to form fluent summaries. Recent advancements in
deep learning and pretrained language models led to the improvement of many natural language
processing tasks, including abstractive summarization. Overall, these surveys do not present a
comprehensive evaluation framework that assesses the aforementioned approaches. Taking the
above into account, the contribution of this survey is fourfold: (i) we provide a comprehensive survey
of the state-of-the-art approaches in text summarization; (ii) we conduct a comparative evaluation of
these approaches, using well-known datasets from the related literature, as well as popular evaluation
scores such as ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-LSUM, BLEU-1, BLEU-2 and SACREBLEU;
(iii) we report on insights gained on various aspects of the text summarization process, including
existing approaches, datasets and evaluation methods, and we outline a set of open issues and future
research directions; (iv) we upload the datasets and the code used in our experiments in a public
repository, aiming to increase the reproducibility of this work and facilitate future research in the field.
Keywords: text summarization; deep learning; language models; natural language processing;
abstractive summarization; extractive summarization; empirical research; literature review
Digital Library. We narrowed down the findings of our search by only taking into account
papers that have received more than 10 citations, or papers from a journal with an impact
factor greater than 2.0. A detailed list of the sources (journals, conferences, workshops,
repositories) of the 61 publications that satisfied these constraints and were analyzed in
this work appears in Appendix A.
The remainder of this paper is organized as follows. Background concepts and related
work regarding extractive and abstractive TS approaches are thoroughly reviewed in
Section 2. The evaluation of these approaches, as well as the datasets and metrics used,
are presented in Section 3. Finally, concluding remarks, open issues, and future research
directions are outlined in Section 4.
2. Related Work
Existing works have suggested various classification schemas of the TS approaches
proposed so far [2–4,20]. The most prominent of them is based on the technique that creates
the output summary [2]. According to it, the approaches can be divided into two major
categories, namely extractive and abstractive ones. This section analyzes the selected TS
approaches, which are divided into extractive (Section 2.1) and abstractive (Section 2.2).
The evaluation datasets and metrics used in the TS literature are discussed in Sections 2.3
and 2.4, respectively.
the relationships between them as edges. A connection between two sentences indicates
that there is similarity between them, measured as a function of their overlapping content.
After the graph is created, the PageRank centrality algorithm [32] is applied to rank each
sentence based on its connections to the other ones. Finally, the top-ranked sentences are
selected to form a summary of the input document. The number of extracted sentences can
be set as a user-defined parameter for the termination of the algorithm.
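To make the above procedure concrete, the following minimal sketch builds the sentence graph and ranks it with PageRank; it is an illustrative re-implementation (using networkx and a simple word-overlap similarity), not the original TextRank code.

```python
import math
import networkx as nx

def textrank_summary(sentences, top_n=3):
    """Illustrative TextRank-style extractive summarizer."""
    def overlap_similarity(s1, s2):
        # Shared words, normalized by sentence lengths (as in the TextRank paper).
        w1, w2 = set(s1.lower().split()), set(s2.lower().split())
        common = len(w1 & w2)
        if common == 0:
            return 0.0
        return common / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            weight = overlap_similarity(sentences[i], sentences[j])
            if weight > 0:
                graph.add_edge(i, j, weight=weight)

    ranks = nx.pagerank(graph, weight="weight")          # PageRank centrality per sentence
    top = sorted(ranks, key=ranks.get, reverse=True)[:top_n]
    return " ".join(sentences[i] for i in sorted(top))   # keep the original sentence order
```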
LexRank [33] is another graph-based algorithm that relies on PageRank. Its key differ-
ence is that each sentence is represented as a vector of the TF-IDF (term frequency—inverse
document frequency) scores of the words it contains, while the relationship between these
sentence vectors is measured using cosine similarity. A similarity matrix is created with
each sentence represented as a row and column, and the elements of the matrix are com-
puted as the cosine similarity score between the sentence vectors. Only similarities above
a given threshold are included. To rank the sentences, PageRank is applied. The number
of selected sentences can be set similarly to TextRank. Other graph-based approaches that
build on TextRank are TopicRank [34] and PositionRank [35]. TopicRank uses a topic-modelling
technique, which clusters sentences with similar topics and extracts the most important
sentences of each cluster. PositionRank considers both the distribution of term positions in a
text and the term frequencies in a biased PageRank, to rank sentences.
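A minimal sketch of the LexRank pipeline (TF-IDF sentence vectors, thresholded cosine-similarity matrix, PageRank) is shown below; it uses scikit-learn and networkx for illustration and is not the reference LexRank implementation.

```python
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank_summary(sentences, top_n=3, threshold=0.1):
    """Illustrative LexRank-style summarizer."""
    tfidf = TfidfVectorizer().fit_transform(sentences)   # one TF-IDF vector per sentence
    sim = cosine_similarity(tfidf)                        # sentence-by-sentence similarity matrix
    np.fill_diagonal(sim, 0.0)
    sim[sim < threshold] = 0.0                            # keep only similarities above the threshold
    ranks = nx.pagerank(nx.from_numpy_array(sim), weight="weight")
    top = sorted(ranks, key=ranks.get, reverse=True)[:top_n]
    return " ".join(sentences[i] for i in sorted(top))
```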
Many word embedding models have been developed since the introduction of the
pioneering Word2Vec model [36]. Their goal is to capture semantic information for textual
terms, thus increasing the accuracy of various NLP tasks. These embeddings are calculated
for each term, and their mean vector representation is the document embedding. Recent
advancements in deep learning allow the inference of sentence embeddings [37] from
pretrained language models, while achieving better accuracy than earlier models. In this
work, we utilize these sentence embeddings in an already proposed implementation of
LexRank (https://www.sbert.net/examples/applications/text-summarization/README.
html (accessed on 19 June 2023)). Our aim is to assess whether the introduction of sentence
embeddings in the similarity step of the graph-based approach of LexRank improves
the summarization accuracy of the base approach, when measured using the evaluation
framework proposed in Section 3. The idea of incorporating word embeddings in extractive
approaches is based on a number of works appearing in the literature [38–40].
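The sketch below shows the only step that changes in this embedding-based variant: the TF-IDF similarity matrix is replaced by cosine similarities between sentence embeddings. The model name is an illustrative choice, not necessarily the one used in our experiments.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model for illustration

def embedding_similarity_matrix(sentences):
    """Pairwise cosine similarities between sentence embeddings."""
    embeddings = model.encode(sentences, convert_to_tensor=True)
    sim = util.cos_sim(embeddings, embeddings).cpu().numpy()
    np.fill_diagonal(sim, 0.0)
    return sim  # can be fed to the same PageRank-based ranking used for LexRank above
```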
different parts of the input, according to their contextual significance. These are encoded
in hidden state layers when generating the output sequence. In addition, Transformer
models use multi-head attention, which means that attention is applied in parallel to
capture different patterns and relationships of the input data. Transformer uses the encoder-
decoder model, which encodes information into hidden layers and then decodes it to
generate output. These models are semisupervised, due to their unsupervised pretraining
on large datasets, followed by supervised finetuning. Approaches built on this model
achieve state-of-the-art performance on various text generation tasks, including abstractive
summarization. Recent surveys [20,21] discussed and evaluated the differences between
earlier abstractive approaches, including those that utilize deep learning models proposed
before the introduction of the Transformer architecture. It is stressed here that the work
presented in this article focuses on prominent pretrained language model approaches,
which rely on the Transformer model and are discussed below.
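For reference, the core attention operation that these models rely on can be written in a few lines of numpy; this is a didactic sketch of scaled dot-product attention, not an excerpt from any of the evaluated models.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed independently in every attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # relevance of every position to every other
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                               # context-weighted combination of the values
```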
T5 [49], which stands for text-to-text transfer transformer, is an approach that closely
follows the Transformer architecture. It provides a general framework which converts
multiple NLP tasks into sequential text-to-text ones. To address each task, it uses a task-
specific prefix before the given sequence in the input. The pretraining process comprises
both supervised and unsupervised training. The unsupervised objective of the approach
includes masking random spans of tokens with unique sentinel tokens. The “corrupted”
sentence is passed to the encoder, while the decoder learns to predict the dropped-out
tokens on the output layer. A follow up approach, namely mT5 [50], builds on T5 to provide
multilingual pretrained baseline models, which can be further finetuned to address diverse
downstream tasks in multiple natural languages.
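A minimal usage sketch of this text-to-text interface with the Hugging Face transformers library is given below; the checkpoint name and generation lengths are illustrative defaults rather than the exact settings of our experiments.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

document = "The quick brown fox jumped over the lazy dog. It then ran away into the forest."
inputs = tokenizer("summarize: " + document,          # task-specific prefix used by T5
                   return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, min_length=5, max_length=30, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```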
BART [51], which stands for bidirectional auto-regressive transformers, is a multitask
deep learning approach, with abstractive summarization among the tasks it supports. BART
utilizes a “denoising” autoencoder that learns the associations between a document and
its “corrupted” form using various textual transformations. These include random token
masking or deletion, text infilling, sentence permutation, and document rotation. This
autoencoder is implemented as a sequence-to-sequence model with a bidirectional encoder
and a left-to-right autoregressive decoder. For its pretraining, it optimizes a reconstruction
loss (cross-entropy) function, where the decoder generates tokens found in the original
document with higher probability.
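In practice, the finetuned BART checkpoints can be queried through the transformers summarization pipeline; a short sketch follows, with facebook/bart-large-cnn as one publicly available checkpoint finetuned on CNN/Daily Mail.

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
document = (
    "The city council met on Tuesday to discuss the new transport plan. "
    "Members voted to expand the bus network and to add cycling lanes downtown."
)
result = summarizer(document, min_length=10, max_length=40)
print(result[0]["summary_text"])   # abstractive summary of the input
```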
PEGASUS [9], which stands for pretraining with extracted gap-sentences for abstractive
summarization, is a deep learning approach pretrained solely for the downstream task of
abstractive summarization. It introduces a novel pretraining objective for Transformer-based
models, called gap sentences generation (GSG). This objective is specifically designed for
the task of abstractive text summarization, as it involves the masking of whole sentences,
rather than the smaller text spans used in previous attempts. By doing so, it creates a “gap” in
the input document, which the model is then trained to fill by considering the rest
of the sentences. Another key advantage of this approach is the selection of the masked
sentences by utilizing a technique that ranks sentences based on their importance in the
document rather than randomly, as suggested in earlier approaches.
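The gap-sentences idea can be illustrated with a small sketch: important sentences (approximated here by word overlap with the rest of the document) are masked in the input, and the concatenation of the masked sentences becomes the pretraining target. The scoring function is a simplification of the ROUGE-based selection used by PEGASUS.

```python
def gap_sentences_example(sentences, mask_ratio=0.3, mask_token="<mask_1>"):
    """Simplified gap-sentences generation (GSG): mask the most 'important' sentences."""
    def importance(idx):
        # Word overlap between one sentence and the rest of the document
        # (a rough stand-in for the ROUGE-based scoring used by PEGASUS).
        rest = set(" ".join(s for j, s in enumerate(sentences) if j != idx).lower().split())
        return len(set(sentences[idx].lower().split()) & rest)

    n_mask = max(1, int(len(sentences) * mask_ratio))
    masked = sorted(range(len(sentences)), key=importance, reverse=True)[:n_mask]
    source = " ".join(mask_token if i in masked else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked))
    return source, target   # model input and pretraining target
```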
Considering the rapidly increasing size and computational complexity of large pre-
trained models, as noted in [52], researchers were prompted to explore methods to compress
them into smaller versions that maintain high accuracy and faster inference in terms of
execution time. One such example is the work of [53], which proposes various compression
techniques, including: (i) direct knowledge distillation (KD), which transfers knowledge
from a large model, referred to as the “teacher” model, into a smaller
“distilled” model, referred to as the “student” model; (ii) pseudo-labels, which replace the
ground truth target documents of the student model with those generated by the teacher; and (iii)
shrink and finetune (SFT), which shrinks the teacher model to student size by copying a
subset of its layers and then finetunes the student model again. They also provide various
“distilled” pretrained model versions of large pretrained ones, produced by the BART and
PEGASUS approaches.
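For completeness, the direct knowledge distillation objective mentioned above can be sketched as follows: the student is trained to match the teacher's softened output distribution. This is the generic formulation of KD, not the exact training code of [53].

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KD objective: KL divergence between softened teacher and student distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2
```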
2.3. Datasets
This subsection reports the selected evaluation datasets and their characteristics (an
overview of them is given in Table 1). We opted for multiple datasets, aiming to test how
well the approaches generalize over different data.
Table 1. Overview of the datasets, their statistics, and their source. The “#” symbol is used as an abbreviation for the term “number”.

| Dataset | Size | Mean #Words — #Sentences (Text/Summary) | Link |
|---|---|---|---|
| CNN/Daily Mail | 312 k News Articles | 766/53 words — 29.74/3.72 sentences | https://github.com/abisee/cnn-dailymail (accessed on 19 June 2023); https://www.kaggle.com/datasets/gowrishankarp/newspaper-text-summarization-cnn-dailymail (accessed on 19 June 2023) |
| XSum | 227 k News Articles | 431.07/23.26 words — 19.77/1.00 sentence | https://github.com/EdinburghNLP/XSum (accessed on 19 June 2023) |
| XLSum (English) | 330 k News Articles | 460.42/22.07 words — 23.54/1.11 sentences | https://github.com/csebuetnlp/xl-sum/tree/master/seq2seq (accessed on 19 June 2023) |
| SAMSum | 16 k Chat Dialogues | 93.77/20.3 words — 19.26/3.07 sentences | https://arxiv.org/src/1911.12237v2/anc/corpus.7z (accessed on 19 June 2023) |
| Reddit TIFU (TL;DR) | 123 k Reddit Posts (42 k posts) | 444/23 words — 22/1.4 sentences | https://github.com/ctr4si/MMN (accessed on 19 June 2023) |
| BillSum (US) | 22 k Legislation Bills | 1686/243 words — 42/7.1 sentences | https://github.com/FiscalNote/BillSum (accessed on 19 June 2023) |
CNN/Daily Mail [54] is a dataset containing over 300,000 news articles from CNN
and the Daily Mail newspaper, written between 2007 and 2015. This dataset is distributed
in three major versions. The first one was made for the NLP task of question answering
and contains 313 k unique news articles and close to 1 M questions. The second version
was restructured for the task of TS; the data in this version are anonymized. The third
version provides a nonanonymized version of the data, where individuals’ names can
be found in the dataset. Each article is accompanied by a list of bullet point summaries,
which abstractively summarize a key aspect of the article. The CNN/Daily Mail dataset
has 3 splits: training (92%, 287,113 articles), validation (4.3% 13,368 articles), and test (3.7%,
11,490 articles).
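As an illustration, the nonanonymized version can be loaded with the Hugging Face datasets library as sketched below; the configuration name ("3.0.0") and field names follow that distribution and may differ from the copies linked in Table 1.

```python
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")         # nonanonymized version
print({split: len(cnn_dm[split]) for split in cnn_dm})   # train / validation / test sizes
example = cnn_dm["test"][0]
print(example["article"][:300])                           # beginning of the news article
print(example["highlights"])                              # reference bullet-point summary
```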
XSum (standing for eXtreme Summarization) is a dataset that provides over 220,000 BBC
news articles covering various topics [8]. Each article is accompanied by a one-sentence
summary written by a human expert, who for the most part was the original author of the
article. XSum has 3 splits: training (90%, 204,045 articles), validation (5%, 11,332 articles),
and test (5%, 11,334 articles).
SAMSum [55] is a dataset that contains more than 16,000 online chat conversations
written by linguists. These conversations cover diverse topics and formality styles including
emoticons, slang words, and even typographical errors. They are also annotated with short
third-person summaries, explaining the dialogue between different people.
Reddit TIFU [56] is a dataset consisting of 123,000 Reddit posts from the /r/tifu online
discussion forum. These posts are informal stories that include a short summary, which is
the title of the post, and a longer one, known as the “TL;DR” (too long; didn’t read) summary.
BillSum [57] is a dataset that deals with US Congressional (USC) and California (CA)
state bill summarization. This corpus contains legislation documents from five to twenty
thousand characters. In total, it contains 22,200 USC (18,949 train documents and 3269 test
documents) and 1200 CA state bills (1237 test documents), accompanied by summaries
written by human experts. The data were collected from the US Government Publishing Office’s
Govinfo service and the CA legislature’s website.
XLSum [58] is a multilingual dataset that covers 44 different languages and offers more
than 1 million pairs of BBC news articles and their summaries. Since our review focuses on
summarization approaches that were trained and finetuned in the English language, we
utilize the English part of the dataset which has 3 splits: training (93%, 306,522 articles),
validation (3.5%, 11,535 articles), and test (3.5%, 11,535 articles). The authors of XLSum
argue that they provide a more concise dataset, with fewer irrelevant summaries compared
to the CNN/Daily Mail and XSum datasets.
Regarding the statistics found in Table 1, we collected them from either the original
publications mentioned above or the summarization survey appearing in [59] for the case of
Reddit TIFU and BillSum. For the cases of SAMSum and XLSum, we ran our own analysis
since no statistics for these datasets were available in the literature.
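The statistics we computed ourselves follow the simple recipe sketched below (mean number of words and sentences per text and per summary); the tokenizer choice is an assumption made for the sketch and may differ from the exact analysis script.

```python
from nltk.tokenize import sent_tokenize, word_tokenize   # requires the NLTK "punkt" resource

def mean_length_stats(texts):
    """Mean number of words and sentences over a collection of texts."""
    word_counts = [len(word_tokenize(t)) for t in texts]
    sent_counts = [len(sent_tokenize(t)) for t in texts]
    return sum(word_counts) / len(texts), sum(sent_counts) / len(texts)

# Example: mean_length_stats(dialogues) and mean_length_stats(summaries) give the
# "Text/Summary" entries reported in Table 1 for SAMSum.
```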
3. Evaluation
This section reports the experimentations carried out to evaluate the approaches pre-
sented in Section 2. Specifically, Section 3.1 provides a comprehensive overview of the
hardware and software used in this research. Section 3.2 outlines the experimental setup.
It is noted that we evaluated the selected automatic summarization methods using the
Python programming language. Our experiments incorporate a wide variety of extrac-
tive and abstractive models, compared through the same evaluation framework across
heterogeneous datasets.
the original US test data from BillSum. To the best of our knowledge, there are no official
data splits for Reddit TIFU. Therefore, for our evaluation purposes we chose 5% of the
documents from the long-summary version of this dataset. Also, before converting and
splitting the data into the desired format for this dataset, we preprocessed them to remove
posts with missing summaries, since the inclusion of the longer summary is optional for
the users.
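A minimal sketch of this preprocessing and splitting step is given below; the field name tldr and the random seed are assumptions made for illustration.

```python
import random

def prepare_reddit_tifu(posts, test_fraction=0.05, seed=42):
    """Drop posts without a long ('TL;DR') summary, then hold out a test split."""
    posts = [p for p in posts if (p.get("tldr") or "").strip()]   # remove missing summaries
    random.Random(seed).shuffle(posts)
    cut = int(len(posts) * test_fraction)
    return posts[cut:], posts[:cut]                               # (train, test)
```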
Regarding the evaluation metrics, we used the following implementations: (i) for the
ROUGE metric, we used the implementation described in this link (https://huggingface.co/spaces/evaluate-metric/rouge (accessed on 19 June 2023)); (ii) for the BLEU metric, we
used the implementation described in this link (https://huggingface.co/spaces/evaluate-metric/bleu (accessed on 19 June 2023)); (iii) for the SACREBLEU metric, we used the
implementation described in this link (https://huggingface.co/spaces/evaluate-metric/sacrebleu (accessed on 19 June 2023)).
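These implementations are all exposed through the Hugging Face evaluate library; a small scoring sketch with dummy inputs follows.

```python
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
sacrebleu = evaluate.load("sacrebleu")

predictions = ["the cat sat on the mat"]
references = [["a cat was sitting on the mat"]]   # one list of references per prediction

print(rouge.compute(predictions=predictions, references=[r[0] for r in references]))
print(bleu.compute(predictions=predictions, references=references, max_order=2))   # BLEU-2
print(sacrebleu.compute(predictions=predictions, references=references))
```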
Considering the length of the generated summaries, the extractive approaches require
setting the desired number of output sentences. For the abstractive models, the output
length can be configured by tuning the sequence generation parameters (min_length,
max_length) of each model; the number of output sentences cannot be set directly, only
the number of generated tokens. The disadvantage of manually adjusting this setting is
that the model may abruptly end a sentence before the period character. For the abstractive
approaches, we kept the default hyperparameters set by the original authors.
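The two length controls can be contrasted with a short sketch: the extractive summarizer (here, the sumy TextRank implementation) takes a sentence count, whereas the abstractive pipeline only bounds the number of generated tokens. Parameter values are illustrative.

```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer          # requires the NLTK "punkt" resource
from sumy.summarizers.text_rank import TextRankSummarizer
from transformers import pipeline

document = (
    "The committee published its annual report on Monday. "
    "It highlights a sharp rise in renewable energy investments. "
    "Critics argue that the targets are still not ambitious enough."
)

# Extractive: the number of output sentences is set explicitly.
parser = PlaintextParser.from_string(document, Tokenizer("english"))
extractive = TextRankSummarizer()(parser.document, sentences_count=2)
print(" ".join(str(sentence) for sentence in extractive))

# Abstractive: only token-level bounds are available, so a sentence may be cut off.
abstractive = pipeline("summarization", model="facebook/bart-large-cnn")
print(abstractive(document, min_length=10, max_length=40)[0]["summary_text"])
```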
Table 3. Evaluation results on the CNN/Daily Mail dataset.

| Type | Model | R1 | R2 | RL | RLSum | B1 | B2 | SB |
|---|---|---|---|---|---|---|---|---|
| Extractive | PositionRank | 0.3341 | 0.1277 | 0.2130 | 0.2819 | 0.2507 | 0.1432 | 0.0805 |
| Extractive | LexRank | 0.3245 | 0.1146 | 0.2013 | 0.2665 | 0.2334 | 0.1252 | 0.0693 |
| Extractive | e-LexRank | 0.3211 | 0.1171 | 0.1983 | 0.2679 | 0.2323 | 0.1276 | 0.0715 |
| Extractive | TextRank (pyTextRank) | 0.3126 | 0.1118 | 0.1940 | 0.2615 | 0.2310 | 0.1255 | 0.0702 |
| Extractive | TextRank (sumy) | 0.2827 | 0.0956 | 0.1782 | 0.2339 | 0.1895 | 0.0994 | 0.0544 |
| Extractive | Luhn | 0.3091 | 0.1160 | 0.1981 | 0.2584 | 0.2149 | 0.1209 | 0.0684 |
| Extractive | TopicRank | 0.2987 | 0.1170 | 0.2022 | 0.2712 | 0.2383 | 0.1311 | 0.0728 |
| Extractive | LSA | 0.2933 | 0.0909 | 0.1816 | 0.2399 | 0.2161 | 0.1046 | 0.0593 |
| Abstractive | distilBART-CNN-12-6 | 0.4292 | 0.2077 | 0.2971 | 0.3646 | 0.3625 | 0.2384 | 0.1420 |
| Abstractive | BART-large-CNN | 0.4270 | 0.2058 | 0.2977 | 0.3647 | 0.3156 | 0.2052 | 0.1433 |
| Abstractive | PEGASUS-CNN_dailymail | 0.4186 | 0.2025 | 0.2983 | 0.3607 | 0.3259 | 0.2124 | 0.1312 |
| Abstractive | BART-large-CNN-SAMSum | 0.4203 | 0.1945 | 0.2904 | 0.3560 | 0.3081 | 0.1963 | 0.1303 |
| Abstractive | PEGASUS-large | 0.3300 | 0.1330 | 0.2294 | 0.2803 | 0.2341 | 0.1367 | 0.0807 |
| Abstractive | PEGASUS-multi_news | 0.2794 | 0.1042 | 0.1714 | 0.2233 | 0.1624 | 0.0890 | 0.0449 |
| Abstractive | BART-large-XSum | 0.2413 | 0.0737 | 0.1635 | 0.2099 | 0.0834 | 0.0385 | 0.0242 |
| Abstractive | distilBART-XSum-12-6 | 0.2197 | 0.0624 | 0.1518 | 0.1919 | 0.0630 | 0.0273 | 0.0177 |
| Abstractive | mT5-multilingual-XLSum | 0.2017 | 0.0545 | 0.1444 | 0.1789 | 0.0521 | 0.0218 | 0.1499 |
| Abstractive | PEGASUS-XSum | 0.1998 | 0.0672 | 0.1398 | 0.1749 | 0.0749 | 0.0381 | 0.0242 |
Table 4. Evaluation results on the XSum dataset.

| Type | Model | R1 | R2 | RL | RLSum | B1 | B2 | SB |
|---|---|---|---|---|---|---|---|---|
| Extractive | TopicRank | 0.1828 | 0.0258 | 0.1313 | 0.1313 | 0.1246 | 0.0266 | 0.0266 |
| Extractive | TextRank (sumy) | 0.1738 | 0.0248 | 0.1266 | 0.1266 | 0.1196 | 0.0271 | 0.0227 |
| Extractive | TextRank (pyTextRank) | 0.1646 | 0.0219 | 0.1197 | 0.1197 | 0.1125 | 0.0227 | 0.0248 |
| Extractive | Luhn | 0.1734 | 0.0246 | 0.1254 | 0.1254 | 0.1193 | 0.0265 | 0.0244 |
| Extractive | e-LexRank | 0.1725 | 0.0243 | 0.1256 | 0.1256 | 0.1147 | 0.0239 | 0.0259 |
| Extractive | LexRank | 0.1696 | 0.0226 | 0.1237 | 0.1237 | 0.1149 | 0.0238 | 0.0257 |
| Extractive | PositionRank | 0.1618 | 0.0184 | 0.1188 | 0.1188 | 0.1121 | 0.0200 | 0.0247 |
| Extractive | LSA | 0.1518 | 0.0165 | 0.1101 | 0.1101 | 0.1047 | 0.0165 | 0.0250 |
| Abstractive | PEGASUS-XSum | 0.4573 | 0.2405 | 0.3840 | 0.3840 | 0.3484 | 0.2374 | 0.1602 |
| Abstractive | BART-large-XSum | 0.4417 | 0.2194 | 0.3649 | 0.3649 | 0.3445 | 0.2250 | 0.1470 |
| Abstractive | distilBART-XSum-12-6 | 0.4397 | 0.2197 | 0.3665 | 0.3665 | 0.3350 | 0.2204 | 0.1449 |
| Abstractive | mT5-multilingual-XLSum | 0.3520 | 0.1386 | 0.2845 | 0.2845 | 0.2536 | 0.1394 | 0.0889 |
| Abstractive | BART-large-CNN-SAMSum | 0.2084 | 0.0435 | 0.1399 | 0.1399 | 0.1251 | 0.0429 | 0.0231 |
| Abstractive | distilBART-CNN-12-6 | 0.1985 | 0.0357 | 0.1307 | 0.1307 | 0.1122 | 0.0348 | 0.0199 |
| Abstractive | PEGASUS-CNN_dailymail | 0.1979 | 0.0356 | 0.1339 | 0.1339 | 0.1247 | 0.0368 | 0.0209 |
| Abstractive | BART-large-CNN | 0.1972 | 0.0339 | 0.1303 | 0.1303 | 0.1167 | 0.0340 | 0.0197 |
| Abstractive | PEGASUS-large | 0.1654 | 0.0266 | 0.1146 | 0.1146 | 0.0988 | 0.0252 | 0.0190 |
| Abstractive | PEGASUS-multi_news | 0.1578 | 0.0491 | 0.1114 | 0.1114 | 0.0812 | 0.0387 | 0.0173 |
Table 5. Evaluation results on the XLSum (English) dataset.

| Type | Model | R1 | R2 | RL | RLSum | B1 | B2 | SB |
|---|---|---|---|---|---|---|---|---|
| Extractive | TopicRank | 0.1900 | 0.0279 | 0.1354 | 0.1354 | 0.1291 | 0.0284 | 0.0271 |
| Extractive | e-LexRank | 0.1841 | 0.0277 | 0.1325 | 0.1325 | 0.1224 | 0.0269 | 0.0267 |
| Extractive | Luhn | 0.1812 | 0.0259 | 0.1293 | 0.1293 | 0.1225 | 0.0275 | 0.0240 |
| Extractive | TextRank (sumy) | 0.1802 | 0.0260 | 0.1304 | 0.1304 | 0.1222 | 0.0278 | 0.0232 |
| Extractive | TextRank (pyTextRank) | 0.1690 | 0.0230 | 0.1224 | 0.1224 | 0.1152 | 0.0229 | 0.0249 |
| Extractive | LexRank | 0.1778 | 0.0250 | 0.1290 | 0.1290 | 0.1207 | 0.0258 | 0.0265 |
| Extractive | PositionRank | 0.1627 | 0.0194 | 0.1188 | 0.1188 | 0.1118 | 0.0204 | 0.0246 |
| Extractive | LSA | 0.1526 | 0.0169 | 0.1101 | 0.1101 | 0.1057 | 0.0165 | 0.0249 |
| Abstractive | PEGASUS-XSum | 0.4343 | 0.2189 | 0.3617 | 0.3617 | 0.3269 | 0.2148 | 0.1452 |
| Abstractive | distilBART-XSum-12-6 | 0.4237 | 0.2048 | 0.3500 | 0.3500 | 0.3208 | 0.2041 | 0.1362 |
| Abstractive | BART-large-XSum | 0.4223 | 0.2009 | 0.3450 | 0.3450 | 0.3260 | 0.2055 | 0.1347 |
| Abstractive | mT5-multilingual-XLSum | 0.3623 | 0.1491 | 0.2927 | 0.2927 | 0.2586 | 0.1462 | 0.0946 |
| Abstractive | BART-large-CNN-SAMSum | 0.2097 | 0.0441 | 0.1406 | 0.1406 | 0.1239 | 0.0417 | 0.0227 |
| Abstractive | BART-large-CNN | 0.2011 | 0.0348 | 0.1322 | 0.1322 | 0.1170 | 0.0333 | 0.0199 |
| Abstractive | distilBART-CNN-12-6 | 0.1999 | 0.0355 | 0.1311 | 0.1311 | 0.1112 | 0.0335 | 0.0196 |
| Abstractive | PEGASUS-CNN_dailymail | 0.1966 | 0.0339 | 0.1318 | 0.1318 | 0.1215 | 0.0342 | 0.0197 |
| Abstractive | PEGASUS-large | 0.1694 | 0.0288 | 0.1158 | 0.1158 | 0.0970 | 0.0325 | 0.0177 |
| Abstractive | PEGASUS-multi_news | 0.1443 | 0.0422 | 0.1014 | 0.1014 | 0.0717 | 0.0261 | 0.0144 |
Table 6. Evaluation results on the Reddit TIFU dataset.

| Type | Model | R1 | R2 | RL | RLSum | B1 | B2 | SB |
|---|---|---|---|---|---|---|---|---|
| Extractive | TopicRank | 0.1770 | 0.0284 | 0.1290 | 0.1290 | 0.1195 | 0.0295 | 0.0270 |
| Extractive | Luhn | 0.1724 | 0.0284 | 0.1233 | 0.1233 | 0.1190 | 0.0311 | 0.0255 |
| Extractive | e-LexRank | 0.1703 | 0.0265 | 0.1264 | 0.1264 | 0.1111 | 0.0260 | 0.0268 |
| Extractive | TextRank (sumy) | 0.1689 | 0.0263 | 0.1215 | 0.1215 | 0.1164 | 0.0290 | 0.0239 |
| Extractive | TextRank (pyTextRank) | 0.1534 | 0.0228 | 0.1145 | 0.1145 | 0.1036 | 0.0237 | 0.0246 |
| Extractive | LexRank | 0.1673 | 0.0250 | 0.1223 | 0.1223 | 0.1131 | 0.0261 | 0.0262 |
| Extractive | LSA | 0.1474 | 0.0179 | 0.1095 | 0.1095 | 0.0994 | 0.0172 | 0.0248 |
| Extractive | PositionRank | 0.1304 | 0.0174 | 0.0990 | 0.0990 | 0.0864 | 0.0169 | 0.0229 |
| Abstractive | BART-large-CNN-SAMSum | 0.1834 | 0.0421 | 0.1305 | 0.1305 | 0.1065 | 0.0386 | 0.0228 |
| Abstractive | BART-large-XSum | 0.1676 | 0.0329 | 0.1274 | 0.1274 | 0.1088 | 0.0300 | 0.0315 |
| Abstractive | distilBART-XSum-12-6 | 0.1697 | 0.0314 | 0.1300 | 0.1300 | 0.1059 | 0.0283 | 0.0313 |
| Abstractive | distilBART-CNN-12-6 | 0.1657 | 0.0340 | 0.1143 | 0.1143 | 0.0924 | 0.0315 | 0.0184 |
| Abstractive | PEGASUS-large | 0.1617 | 0.0295 | 0.1133 | 0.1133 | 0.0993 | 0.0301 | 0.0203 |
| Abstractive | PEGASUS-CNN_dailymail | 0.1596 | 0.0326 | 0.1118 | 0.1118 | 0.0956 | 0.0316 | 0.0169 |
| Abstractive | BART-large-CNN | 0.1570 | 0.0328 | 0.1088 | 0.1088 | 0.0883 | 0.0304 | 0.0167 |
| Abstractive | PEGASUS-XSum | 0.1428 | 0.0228 | 0.1135 | 0.1135 | 0.0779 | 0.0167 | 0.0260 |
| Abstractive | PEGASUS-multi_news | 0.1063 | 0.0243 | 0.0743 | 0.0743 | 0.0524 | 0.0197 | 0.0095 |
Table 7. Evaluation results on the SAMSum dataset.

| Type | Model | R1 | R2 | RL | RLSum | B1 | B2 | SB |
|---|---|---|---|---|---|---|---|---|
| Extractive | e-LexRank | 0.2762 | 0.0769 | 0.2059 | 0.2059 | 0.1054 | 0.0525 | 0.0393 |
| Extractive | LexRank | 0.1654 | 0.0310 | 0.1136 | 0.1136 | 0.0979 | 0.0309 | 0.0183 |
| Extractive | TopicRank | 0.1613 | 0.0313 | 0.1112 | 0.1112 | 0.0958 | 0.0316 | 0.0180 |
| Extractive | LSA | 0.1622 | 0.0253 | 0.1101 | 0.1101 | 0.0987 | 0.0262 | 0.0173 |
| Extractive | TextRank (pyTextRank) | 0.1570 | 0.0287 | 0.1063 | 0.1063 | 0.0928 | 0.0282 | 0.0172 |
| Extractive | TextRank (sumy) | 0.1471 | 0.0296 | 0.1027 | 0.1027 | 0.0823 | 0.0282 | 0.0154 |
| Extractive | PositionRank | 0.1557 | 0.0272 | 0.1073 | 0.1073 | 0.0949 | 0.0270 | 0.0178 |
| Extractive | Luhn | 0.1508 | 0.0306 | 0.1049 | 0.1049 | 0.0850 | 0.0292 | 0.0161 |
| Abstractive | BART-large-CNN-SAMSum | 0.4004 | 0.2007 | 0.3112 | 0.3112 | 0.2603 | 0.1742 | 0.1165 |
| Abstractive | BART-large-CNN | 0.3041 | 0.1009 | 0.2272 | 0.2272 | 0.1487 | 0.0793 | 0.0508 |
| Abstractive | PEGASUS-CNN_dailymail | 0.2866 | 0.0833 | 0.2220 | 0.2220 | 0.1433 | 0.0697 | 0.0447 |
| Abstractive | distilBART-CNN-12-6 | 0.2884 | 0.0937 | 0.2149 | 0.2149 | 0.1350 | 0.0697 | 0.0466 |
| Abstractive | BART-large-XSum | 0.2572 | 0.0514 | 0.1878 | 0.1878 | 0.1542 | 0.0460 | 0.0431 |
| Abstractive | PEGASUS-large | 0.2589 | 0.0628 | 0.2021 | 0.2021 | 0.1008 | 0.0468 | 0.0429 |
| Abstractive | distilBART-XSum-12-6 | 0.2121 | 0.0328 | 0.1561 | 0.1561 | 0.1276 | 0.0275 | 0.0361 |
| Abstractive | mT5-multilingual-XLSum | 0.1787 | 0.0235 | 0.1359 | 0.1359 | 0.1082 | 0.0190 | 0.0313 |
| Abstractive | PEGASUS-XSum | 0.1428 | 0.0228 | 0.1135 | 0.1135 | 0.0779 | 0.0167 | 0.0260 |
| Abstractive | PEGASUS-multi_news | 0.1127 | 0.0200 | 0.0804 | 0.0804 | 0.0543 | 0.0149 | 0.0103 |
Table 8. Evaluation results on the BillSum dataset.

| Type | Model | R1 | R2 | RL | RLSum | B1 | B2 | SB |
|---|---|---|---|---|---|---|---|---|
| Extractive | LexRank | 0.3845 | 0.1888 | 0.2460 | 0.2460 | 0.2663 | 0.1815 | 0.1140 |
| Extractive | TextRank (pyTextRank) | 0.3638 | 0.1735 | 0.2168 | 0.2168 | 0.2477 | 0.1651 | 0.1029 |
| Extractive | TextRank (sumy) | 0.3503 | 0.1799 | 0.2320 | 0.2320 | 0.2437 | 0.1629 | 0.1033 |
| Extractive | PositionRank | 0.3601 | 0.1686 | 0.2153 | 0.2153 | 0.2436 | 0.1606 | 0.0996 |
| Extractive | e-LexRank | 0.3595 | 0.1750 | 0.2205 | 0.2205 | 0.2397 | 0.1620 | 0.1032 |
| Extractive | TopicRank | 0.3568 | 0.1763 | 0.2174 | 0.2174 | 0.2334 | 0.1662 | 0.1036 |
| Extractive | Luhn | 0.3521 | 0.1812 | 0.2342 | 0.2342 | 0.2349 | 0.1640 | 0.1041 |
| Extractive | LSA | 0.3480 | 0.1406 | 0.2115 | 0.2115 | 0.2288 | 0.1380 | 0.0846 |
| Abstractive | PEGASUS-large | 0.3568 | 0.1480 | 0.2268 | 0.2268 | 0.2312 | 0.1400 | 0.0901 |
| Abstractive | BART-large-CNN-SAMSum | 0.3186 | 0.1571 | 0.2301 | 0.2301 | 0.1333 | 0.0881 | 0.0562 |
| Abstractive | distilBART-CNN-12-6 | 0.3025 | 0.1465 | 0.2161 | 0.2161 | 0.1303 | 0.0836 | 0.0528 |
| Abstractive | PEGASUS-CNN_dailymail | 0.2962 | 0.1405 | 0.2116 | 0.2116 | 0.1311 | 0.0832 | 0.0543 |
| Abstractive | BART-large-CNN | 0.2954 | 0.1433 | 0.2146 | 0.2146 | 0.1168 | 0.0746 | 0.0495 |
| Abstractive | PEGASUS-multi_news | 0.2801 | 0.0721 | 0.1661 | 0.1661 | 0.1909 | 0.0864 | 0.0406 |
| Abstractive | mT5-multilingual-XLSum | 0.1486 | 0.0584 | 0.1155 | 0.1155 | 0.0232 | 0.0132 | 0.0081 |
| Abstractive | PEGASUS-XSum | 0.1440 | 0.0742 | 0.1171 | 0.1171 | 0.0264 | 0.0152 | 0.0129 |
| Abstractive | BART-large-XSum | 0.1327 | 0.0641 | 0.0988 | 0.0988 | 0.0159 | 0.0093 | 0.0070 |
| Abstractive | distilBART-XSum-12-6 | 0.1237 | 0.0540 | 0.0960 | 0.0960 | 0.0157 | 0.0091 | 0.0065 |
model has been finetuned on the multilingual XLSum dataset, whose English part contains
data similar to XSum [58]. Most abstractive models that are not finetuned on XSum still
had higher results than the extractive ones on most metrics, apart from PEGASUS-large
and PEGASUS-multi_news. Among the extractive approaches, TopicRank scored the best
on every metric except for B2, where TextRank (sumy) received the best score. Among the
two TextRank variations, the implementation from sumy performed better on this dataset.
e-LexRank performed slightly better than LexRank in both Tables 4 and 5; however, it was
not the top performing extractive approach.
As specified in Section 3.2.1, we conduct the experimental evaluation on the
English part of XLSum, which is highly similar to XSum. As shown in Tables 4 and 5, the
abstractive models that are finetuned on XSum have similar or even better performance
than the mT5-multilingual-XLSum model. We believe that the better performance of the
XSum models is achieved because they were trained solely on English corpora,
unlike mT5-multilingual-XLSum, which was trained on a multilingual corpus. Among
the extractive approaches, TopicRank scored the best across all the evaluation metrics.
Among the two TextRank variations, the implementation from sumy performed better across
all metrics.
As shown in Table 6, the abstractive models had comparable results with the extractive
ones on Reddit TIFU, with BART-large-CNN-SAMSum having the best score across all
evaluation metrics except B1 and SB where TopicRank and BART-large-XSum performed
better, respectively. Among the two TextRank variations, the implementation from sumy
performed better on this dataset. e-LexRank performed better than LexRank; however, it was
not the best performing extractive approach.
As shown in Table 7, most abstractive models outperformed the extractive ones in the
case of SAMSum. The BART-large-CNN-SAMSum model performed significantly better across all
evaluation metrics, due to its additional finetuning on this specific dataset. The rest of the
abstractive models had higher but comparable results to the extractive ones, apart from
PEGASUS-multi_news. All extractive approaches had comparable results, with e-LexRank
achieving the best performance across most metrics. Among the two TextRank variations,
the implementation from pyTextRank performed better across all metrics, with the exception
of R2.
As shown in Table 8, all abstractive models exhibited lower results than the extractive
ones in most metrics in the case of BillSum. LexRank scored the best across all metrics. Among
the abstractive approaches, PEGASUS-large had the best performance in R1, B1, B2, and SB,
while BART-large-CNN-SAMSum performed better in R2 and RL. The extractive approaches
had comparable results, with LexRank being the best model for this dataset. Among the
two TextRank variations, the implementation from pyTextRank performed better on R1, B1,
and B2, while the sumy implementation performed better on R2, RL, and SB. e-LexRank
performed worse than LexRank in Table 8.
implementation from sumy performed better on the datasets from which we extracted
only one-sentence summaries (XSum, XLSum and Reddit TIFU), while the version from
pyTextRank performed better on the rest of the chosen datasets. We also tested an
embeddings-based extractive approach, e-LexRank, which in most datasets did not
yield better results than the classical extractive approaches.
• Regarding our evaluation metrics, we note that the scores produced by applying BLEU
are similar to those produced by ROUGE. This leads us to recommend the BLEU metric
for the evaluation of summarization approaches, even if its original use concerned the
field of machine translation.
• One may observe that the scores of RL match those of RLSum in all datasets, apart
from the case of the CNN/Daily Mail dataset. For the datasets that have single-sentence
summaries, RLSum (summary-level) scores are equivalent to RL (sentence-level) scores.
For the other datasets, which might contain multi-sentence summaries, the scores also
match because the implementation of RLSum that we use splits sentences using the
newline delimiter (\n); when no newlines are present, RLSum reduces to RL (a small
example after this list illustrates this effect).
• The implication of the aforementioned experimental results is that abstractive
models do not all perform equally well, as also reported in [21]. Thus, there is a constant
need for researchers to discover better pretrained language model architectures that
generalize more easily and produce summaries closer to the human style of writing.
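The following small example (using the evaluate implementation of ROUGE mentioned in Section 3.2) illustrates the newline effect described in the RL/RLSum observation above: without newline-separated sentences, rougeLsum coincides with rougeL, and only newline-split summaries make the two diverge.

```python
import evaluate

rouge = evaluate.load("rouge")
pred = "The plan was approved on Monday. Funding will start next year."
ref = "The council approved the plan. The funding begins in the next fiscal year."

# No newlines: rougeLsum treats each summary as one sentence and equals rougeL.
print(rouge.compute(predictions=[pred], references=[ref]))

# Newline-separated sentences: rougeLsum now aggregates sentence-level LCS scores
# and can differ from rougeL, as observed on CNN/Daily Mail.
print(rouge.compute(predictions=[pred.replace(". ", ".\n")],
                    references=[ref.replace(". ", ".\n")]))
```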
Our work also revealed a list of open issues in TS, which require further attention;
these include:
• Abstractive models need to be retrained or finetuned anew each time documents from a
different language or domain are introduced. This could be solved
by creating more non-English datasets of various domains, and then training and
finetuning different versions of the models.
• Instead of highly accurate but specialized abstractive models, a generic type of abstrac-
tive models could emerge. These could be trained and finetuned on vast multilingual
corpora, achieving a degree of generalization that is not present in current models.
• Current abstractive approaches require a significant amount of training data [9,55]
and training time, even with specialized hardware. This could be solved through the
semisupervised nature of large language models (LLMs), e.g., GPT-3 [52], which, due
to their extremely large training corpora and number of parameters (on the billion scale), can
be finetuned for a specific language or domain through the utilization of a limited
number of examples.
• As mentioned in Section 2.4, BLEU can be used as an evaluation metric for the TS task.
As presented in Section 3.2.2, this metric produces a ranking for the approaches that
is similar to that produced by ROUGE. To the best of our knowledge, most research
works that evaluate summarization approaches utilize only the ROUGE metric.
Based on the above discussion and remarks, we propose the following list of future
work directions:
• Evaluate LLMs for the TS task, given their zero/few-shot learning capabilities, which
enable them to be finetuned for different languages or domains with a significantly
smaller number of examples.
• Finetune existing abstractive approaches in other languages and/or domains.
• Utilize different evaluation metrics that do not penalize the approaches that produce
abstractive summaries with synonymous terms (e.g., BERTScore [64], BLEURT [65], etc.);
a usage sketch is given below.
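As a pointer for this direction, the sketch below scores a paraphrased summary with the evaluate wrapper around BERTScore; unlike ROUGE or BLEU, the score does not collapse when synonymous terms are used.

```python
import evaluate

bertscore = evaluate.load("bertscore")
result = bertscore.compute(
    predictions=["the cabinet approved the new budget"],
    references=["the government passed the updated budget"],
    lang="en",
)
print(result["f1"])   # similarity in embedding space, tolerant of synonyms
```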
Author Contributions: Conceptualization, N.G. and N.K.; methodology, N.G. and N.K.; software,
N.G. and C.M.; investigation, N.G. and C.M.; resources, N.K.; data curation, C.M.; writing—original
draft preparation, N.G. and C.M.; writing—review and editing, N.G., C.M., and N.K.; visualization,
N.G. and C.M.; supervision, N.K.; project administration, N.K.; funding acquisition, N.K. All authors
have read and agreed to the published version of the manuscript.
Funding: The work presented in this paper is supported by the inPOINT project (https://inpoint-
project.eu/ (accessed on 4 April 2023)), which is cofinanced by the European Union and Greek
national funds through the operational program, Competitiveness, Entrepreneurship, and Innovation,
under the call RESEARCH—CREATE—INNOVATE (Project id: T2EDK-04389).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The employed datasets and the code repository are available at https://
drive.google.com/drive/folders/1UJ_L5ZYYm52CQuURixc7DHs2h6Yz50F0?usp=sharing (accessed
on 4 April 2023) and https://github.com/cmastrokostas/Automatic_Text_Summarization (accessed
on 4 April 2023).
Conflicts of Interest: The authors declare no conflict of interest.
Appendix A
Table A1. Sources of the publications selected in this review.
| Journal/Conference/Workshop/Repository | Publisher |
|---|---|
| Nature | Nature |
| IEEE Access | Institute of Electrical and Electronics Engineers (IEEE) |
| International Conference on Computer, Communication and Signal Processing (ICCCSP) | Institute of Electrical and Electronics Engineers (IEEE) |
| IEEE Region 10 Symposium (TENSYMP) | Institute of Electrical and Electronics Engineers (IEEE) |
| Expert Systems with Applications | Elsevier |
| Information Fusion | Elsevier |
| Information Processing & Management | Elsevier |
| Computer Speech & Language | Elsevier |
| AAAI Conference on Artificial Intelligence | Association for the Advancement of Artificial Intelligence (AAAI) |
| Journal of the American Society for Information Science and Technology (J. Am. Soc. Inf. Sci.) | Association for Computing Machinery (ACM) |
| Artificial Intelligence Review | Springer |
| European Conference on Advances in Information Retrieval (ECIR) | Springer |
| International Journal of Parallel Programming (Int J Parallel Prog) | Springer |
| International Conference on Language Resources and Evaluation (LREC) | Association for Computational Linguistics (ACL) |
| Conference on Empirical Methods in Natural Language Processing (EMNLP) | Association for Computational Linguistics (ACL) |
| Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) | Association for Computational Linguistics (ACL) |
| Annual Meeting of the Association for Computational Linguistics | Association for Computational Linguistics (ACL) |
| Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (ACL-IJCNLP) | Association for Computational Linguistics (ACL) |
| International Joint Conference on Natural Language Processing (IJCNLP) | Association for Computational Linguistics (ACL) |
| International Conference on Computational Linguistics: System Demonstrations (COLING) | Association for Computational Linguistics (ACL) |
| Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) | Association for Computational Linguistics (ACL) |
| Findings of the Association for Computational Linguistics (ACL-IJCNLP) | Association for Computational Linguistics (ACL) |
| Text Summarization Branches Out | Association for Computational Linguistics (ACL) |
| Conference on Machine Translation (WMT) | Association for Computational Linguistics (ACL) |
| Workshop on New Frontiers in Summarization | Association for Computational Linguistics (ACL) |
| International Conference on Machine Learning (Proceedings of Machine Learning Research—PMLR) | JMLR, Inc. and Microtome Publishing (United States) |
| Journal of Machine Learning Research (JMLR) | JMLR, Inc. and Microtome Publishing (United States) |
| Journal of Artificial Intelligence Research (J. Artif. Int. Res.) | AI Access Foundation, Inc. |
| Journal of Emerging Technologies in Web Intelligence (JEWI) | JEWI |
| Advances in Neural Information Processing Systems | Curran Associates, Inc. |
| Foundations and Trends® in Information Retrieval | Now Publishers |
| IBM Journal of Research and Development | IBM |
| arXiv preprints | arXiv.org |
References
1. Gupta, V.; Lehal, G.S. A Survey of Text Summarization Extractive Techniques. J. Emerg. Technol. Web Intell. 2010, 2, 258–268.
[CrossRef]
2. El-Kassas, W.S.; Salama, C.R.; Rafea, A.A.; Mohamed, H.K. Automatic Text Summarization: A Comprehensive Survey. Expert
Syst. Appl. 2021, 165, 113679. [CrossRef]
3. Bharti, S.K.; Babu, K.S. Automatic Keyword Extraction for Text Summarization: A Survey. arXiv 2017, arXiv:1704.03242.
4. Gambhir, M.; Gupta, V. Recent Automatic Text Summarization Techniques: A Survey. Artif. Intell. Rev. 2017, 47, 1–66. [CrossRef]
5. Yasunaga, M.; Kasai, J.; Zhang, R.; Fabbri, A.R.; Li, I.; Friedman, D.; Radev, D.R. ScisummNet: A Large Annotated Corpus and
Content-Impact Models for Scientific Paper Summarization with Citation Networks. Proc. AAAI Conf. Artif. Intell. 2019, 33,
7386–7393. [CrossRef]
6. An, C.; Zhong, M.; Chen, Y.; Wang, D.; Qiu, X.; Huang, X. Enhancing Scientific Papers Summarization with Citation Graph. Proc.
AAAI Conf. Artif. Intell. 2021, 35, 12498–12506. [CrossRef]
7. Hong, K.; Conroy, J.; Favre, B.; Kulesza, A.; Lin, H.; Nenkova, A. A Repository of State of the Art and Competitive Baseline
Summaries for Generic News Summarization. In Proceedings of the Ninth International Conference on Language Resources and
Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; European Language Resources Association (ELRA): Luxembourg;
pp. 1608–1616.
8. Narayan, S.; Cohen, S.B.; Lapata, M. Don’t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks
for Extreme Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels,
Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 1797–1807.
9. Zhang, J.; Zhao, Y.; Saleh, M.; Liu, P. PEGASUS: Pre-Training with Extracted Gap-Sentences for Abstractive Summarization. In
Proceedings of the 37th International Conference on Machine Learning, Virtual Event. 21 November 2020; PMLR. pp. 11328–11339.
10. Zhang, S.; Celikyilmaz, A.; Gao, J.; Bansal, M. EmailSum: Abstractive Email Thread Summarization. In Proceedings of the
59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing (Volume 1: Long Papers), Online. 1–6 August 2021; Association for Computational Linguistics: Stroudsburg,
PA, USA; pp. 6895–6909.
11. Polsley, S.; Jhunjhunwala, P.; Huang, R. CaseSummarizer: A System for Automated Summarization of Legal Texts. In Proceedings
of the COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, Osaka, Japan,
11–16 December 2016; The COLING 2016 Organizing Committee. pp. 258–262.
12. Kanapala, A.; Pal, S.; Pamula, R. Text Summarization from Legal Documents: A Survey. Artif. Intell. Rev. 2019, 51, 371–402.
[CrossRef]
13. Bhattacharya, P.; Hiware, K.; Rajgaria, S.; Pochhi, N.; Ghosh, K.; Ghosh, S. A Comparative Study of Summarization Algorithms
Applied to Legal Case Judgments. In Advances in Information Retrieval, Proceedings of the 41st European Conference on IR Research,
ECIR 2019, Cologne, Germany, 14–18 April 2019; Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D., Eds.; Springer
International Publishing: Cham, Switzerland, 2019; pp. 413–428.
14. Sun, S.; Luo, C.; Chen, J. A Review of Natural Language Processing Techniques for Opinion Mining Systems. Inf. Fusion 2017, 36,
10–25. [CrossRef]
15. Hu, Y.H.; Chen, Y.L.; Chou, H.L. Opinion Mining from Online Hotel Reviews—A Text Summarization Approach. Inf. Process.
Manag. 2017, 53, 436–449. [CrossRef]
16. Adamides, E.; Giarelis, N.; Kanakaris, N.; Karacapilidis, N.; Konstantinopoulos, K.; Siachos, I. Leveraging open innovation
practices through a novel ICT platform. In Human Centred Intelligent Systems, Proceedings of KES HCIS 2023 Conference. Smart
Innovation, Systems and Technologies, Rome, Italy, 14–16 June 2023; Springer: Rome, Italy, 2023; Volume 359.
17. Nenkova, A.; McKeown, K. Automatic Summarization. Found. Trends Inf. Retr. 2011, 5, 103–233. [CrossRef]
18. Saggion, H.; Poibeau, T. Automatic Text Summarization: Past, Present and Future. In Multi-Source, Multilingual Information
Extraction and Summarization; Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R., Eds.; Theory and Applications of Natural
Language Processing; Springer: Berlin/Heidelberg, Germany, 2013; pp. 3–21. ISBN 9783642285691.
19. Moratanch, N.; Chitrakala, S. A Survey on Extractive Text Summarization. In Proceedings of the 2017 International Conference
on Computer, Communication and Signal Processing (ICCCSP), Chennai, India, 10–11 January 2017; pp. 1–6.
20. Mridha, M.F.; Lima, A.A.; Nur, K.; Das, S.C.; Hasan, M.; Kabir, M.M. A Survey of Automatic Text Summarization: Progress,
Process and Challenges. IEEE Access 2021, 9, 156043–156070. [CrossRef]
21. Alomari, A.; Idris, N.; Sabri, A.Q.M.; Alsmadi, I. Deep Reinforcement and Transfer Learning for Abstractive Text Summarization:
A Review. Comput. Speech Lang. 2022, 71, 101276. [CrossRef]
22. Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out; Association for
Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81.
23. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings
of the 40th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics,
Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318.
24. Graham, Y. Re-Evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE. In Proceedings of the 2015
Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; Association for
Computational Linguistics: Stroudsburg, PA, USA; pp. 128–137.
25. Rieger, B.B. On Distributed Representation in Word Semantics; International Computer Science Institute: Berkeley, CA, USA, 1991.
26. Luhn, H.P. The Automatic Creation of Literature Abstracts. IBM J. Res. Dev. 1958, 2, 159–165. [CrossRef]
27. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by Latent Semantic Analysis. J. Am. Soc.
Inf. Sci. 1990, 41, 391–407. Available online: https://search.crossref.org/?q=Indexing+by+latent+semantic+analysis+Scott+
Deerwester&from_ui=yes (accessed on 30 May 2023). [CrossRef]
28. Gong, Y.; Liu, X. Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis. In Proceedings of the
24th annual international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, 1
September 2001; Association for Computing Machinery: New York, NY, USA; pp. 19–25.
29. Steinberger, J.; Jezek, K. Using Latent Semantic Analysis in Text Summarization and Summary Evaluation. Proc. ISIM 2004, 4, 8.
30. Yeh, J.Y.; Ke, H.R.; Yang, W.P.; Meng, I.H. Text Summarization Using a Trainable Summarizer and Latent Semantic Analysis. Inf.
Process. Manag. 2005, 41, 75–95. [CrossRef]
31. Mihalcea, R.; Tarau, P. TextRank: Bringing Order into Text. In Proceedings of the 2004 Conference on Empirical Methods in
Natural Language Processing, Barcelona, Spain, 25–26 July 2004; Association for Computational Linguistics: Stroudsburg, PA,
USA; pp. 404–411.
32. Page, L.; Brin, S.; Motwani, R.; Winograd, T. The Pagerank Citation Ranking: Bring Order to the Web; Technical Report; Stanford
University: Stanford, CA, USA, 1998.
33. Erkan, G.; Radev, D.R. LexRank: Graph-Based Lexical Centrality as Salience in Text Summarization. J. Artif. Int. Res. 2004, 22,
457–479. [CrossRef]
34. Bougouin, A.; Boudin, F.; Daille, B. TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction. In Proceedings of the Sixth
International Joint Conference on Natural Language Processing, Nagoya, Japan, 14–19 October 2013; Asian Federation of Natural
Language Processing: Singapore; pp. 543–551.
35. Florescu, C.; Caragea, C. PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver,
BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 1105–1115.
36. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv 2013,
arXiv:1301.3781.
37. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. In Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural
Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics:
Stroudsburg, PA, USA; pp. 3982–3992.
38. Chengzhang, X.; Dan, L. Chinese Text Summarization Algorithm Based on Word2vec. J. Phys. Conf. Ser. 2018, 976, 012006.
[CrossRef]
39. Haider, M.M.; Hossin, M.d.A.; Mahi, H.R.; Arif, H. Automatic Text Summarization Using Gensim Word2Vec and K-Means
Clustering Algorithm. In Proceedings of the 2020 IEEE Region 10 Symposium (TENSYMP), Dhaka, Bangladesh, 5–7 June 2020;
pp. 283–286.
40. Abdulateef, S.; Khan, N.A.; Chen, B.; Shang, X. Multidocument Arabic Text Summarization Based on Clustering and Word2Vec to
Reduce Redundancy. Information 2020, 11, 59. [CrossRef]
41. Ganesan, K.; Zhai, C.; Han, J. Opinosis: A Graph Based Approach to Abstractive Summarization of Highly Redundant Opinions.
In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, China, 23–27 August
2010; Coling 2010 Organizing Committee. pp. 340–348.
42. Genest, P.E.; Lapalme, G. Fully Abstractive Approach to Guided Summarization. In Proceedings of the 50th Annual Meeting
of the Association for Computational Linguistics (Volume 2: Short Papers), Jeju Island, Korea, 8–14 July 2012; Association for
Computational Linguistics: Stroudsburg, PA, USA; pp. 354–358.
43. Khan, A.; Salim, N.; Farman, H.; Khan, M.; Jan, B.; Ahmad, A.; Ahmed, I.; Paul, A. Abstractive Text Summarization Based on
Improved Semantic Graph Approach. Int. J. Parallel. Prog. 2018, 46, 992–1016. [CrossRef]
44. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [CrossRef] [PubMed]
45. Rekabdar, B.; Mousas, C.; Gupta, B. Generative Adversarial Network with Policy Gradient for Text Summarization. In Proceedings
of the 2019 IEEE 13th International Conference on Semantic Computing (ICSC), Newport Beach, CA, USA, 30 January–1 February
2019; pp. 204–207.
46. Yang, M.; Li, C.; Shen, Y.; Wu, Q.; Zhao, Z.; Chen, X. Hierarchical Human-Like Deep Neural Networks for Abstractive Text
Summarization. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2744–2757. [CrossRef] [PubMed]
47. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In
Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Curran
Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
48. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2016,
arXiv:1409.0473.
49. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer
Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67.
50. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. MT5: A Massively Multilingual
Pre-Trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies, Online. 8 June 2021; Association for Computational Linguistics:
Stroudsburg, PA, USA; pp. 483–498.
51. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising
Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the
58th Annual Meeting of the Association for Computational Linguistics, Online. 10 July 2020; Association for Computational
Linguistics: Stroudsburg, PA, USA; pp. 7871–7880.
52. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al.
Language Models Are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, Online.
6–12 December 2020; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901.
53. Shleifer, S.; Rush, A.M. Pre-Trained Summarization Distillation. arXiv 2020, arXiv:2010.13002.
54. Hermann, K.M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; Blunsom, P. Teaching Machines to Read
and Comprehend. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12
December 2015; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28.
55. Gliwa, B.; Mochol, I.; Biesek, M.; Wawer, A. SAMSum Corpus: A Human-Annotated Dialogue Dataset for Abstractive Sum-
marization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, Hong Kong, China, 4 November 2019;
Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 70–79.
56. Kim, B.; Kim, H.; Kim, G. Abstractive Summarization of Reddit Posts with Multi-Level Memory Networks. In Proceedings
of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; Association for Computational
Linguistics: Stroudsburg, PA, USA; pp. 2519–2531.
57. Kornilova, A.; Eidelman, V. BillSum: A Corpus for Automatic Summarization of US Legislation. In Proceedings of the 2nd
Workshop on New Frontiers in Summarization, Hong Kong, China, 4 November 2019; Association for Computational Linguistics:
Stroudsburg, PA, USA; pp. 48–56.
58. Hasan, T.; Bhattacharjee, A.; Islam, M.d.S.; Mubasshir, K.; Li, Y.F.; Kang, Y.B.; Rahman, M.S.; Shahriyar, R. XL-Sum: Large-Scale
Multilingual Abstractive Summarization for 44 Languages. In Proceedings of the Findings of the Association for Computational
Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; Association for Computational Linguistics: Stroudsburg, PA, USA;
pp. 4693–4703.
59. Koh, H.Y.; Ju, J.; Liu, M.; Pan, S. An Empirical Survey on Long Document Summarization: Datasets, Models, and Metrics. ACM
Comput. Surv. 2022, 55, 1–35. [CrossRef]
60. Post, M. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research
Papers, Brussels, Belgium, 31 October–1 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA;
pp. 186–191.
61. Nathan, P. PyTextRank, a Python Implementation of TextRank for Phrase Extraction and Summarization of Text Documents.
DerwenAI/Pytextrank: v3.1.1 release on PyPi | Zenodo. 2016. Available online: https://zenodo.org/record/4637885 (accessed
on 19 June 2023).
62. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers:
State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations, Online. 5 October 2020; Association for Computational Linguistics: Stroudsburg, PA, USA;
pp. 38–45.
63. Fabbri, A.; Li, I.; She, T.; Li, S.; Radev, D. Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive
Hierarchical Model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy,
28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 1074–1084.
64. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. arXiv 2020,
arXiv:1904.09675.
65. Sellam, T.; Das, D.; Parikh, A. BLEURT: Learning Robust Metrics for Text Generation. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg,
PA, USA; pp. 7881–7892.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.