Natural Language Processing for Automatic Text Summarization
Abeer K AL-Mashhadany
Department of Computer Science, College of Science, Al-Nahrain University, Baghdad, Iraq.
1 Introduction
Extractive summarization aims to identify the importance of each sentence in a text and produce a shorter version of the original that represents it accurately. On the other hand, abstraction-based text summarization uses linguistic methods for understanding and a deeper analysis of the text [4]. An abstractive summary consists of new sentences generated by paraphrasing or reformulating the extracted content [5]. Abstractive summaries usually resemble human-written ones because they tend to represent the content and the meaning of the original text more naturally. Although this type of summarization performs better, implementing it requires deep knowledge of deep-learning techniques [6]. The type of input used for summarization may also vary: the goal may be to obtain a summary from a single document or from several documents [7]. Single-document summarizers deal with one source text and generate a summary from it independently of other documents [8]. Multi-document summarization, on the other hand, can be viewed as an extension of single-document summarization: it condenses many documents on the same subject into a single summary. The multi-document task is more complex than summarizing a single document, even a lengthy one; the difficulty comes from handling a large set of thematically diverse documents [9]. A simple extractive baseline is sketched below.
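To make the extractive approach concrete, the following is a minimal sketch of an extractive summarizer that scores sentences by their aggregate TF-IDF weight, in the spirit of the TF-IDF method of [8]; the naive sentence splitting and the scoring rule are simplifying assumptions, not the exact method of any system surveyed here.

```python
# A minimal sketch of extractive summarization via TF-IDF sentence
# scoring (in the spirit of [8]); illustrative only, not the exact
# method of any paper surveyed in this review.
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(text: str, k: int = 2) -> str:
    # Naive sentence splitting; a real system would use a proper tokenizer.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) <= k:
        return text
    # Score each sentence by the sum of its TF-IDF term weights.
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    scores = np.asarray(tfidf.sum(axis=1)).ravel()
    # Keep the k highest-scoring sentences, restored to original order.
    top = sorted(np.argsort(scores)[-k:])
    return " ".join(sentences[i] for i in top)

doc = ("Extractive summarization selects important sentences. "
       "Abstractive summarization generates new sentences. "
       "Both approaches aim to shorten the original text. "
       "Evaluation is usually done with ROUGE metrics.")
print(extractive_summary(doc, k=2))
```

Scoring sentences by aggregate TF-IDF weight is one of the simplest importance measures; modern extractive systems replace it with learned sentence representations, but the select-and-reorder structure stays the same.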
The quality of the summarized text is evaluated both by human assessment and with automatic summarization metrics. ROUGE, or "Recall-Oriented Understudy for Gisting Evaluation," is the most popular family of metrics; it comes with a software package for evaluating automatic summarization and machine translation in natural language processing, and works by comparing an automatically produced summary against a reference or a set of references (human-produced) [10]. ROUGE-1 refers to the overlap of unigrams (single words) between the system and reference summaries, while ROUGE-2 refers to the bigram overlap between them. ROUGE-L is based on the Longest Common Subsequence (LCS): the longest common subsequence naturally captures sentence-level structural similarity and automatically identifies the longest in-sequence co-occurring n-grams [11]. In addition to these metrics, human evaluation is adopted, based on the linguistic correctness of the summary, the coherence of the text, and so on [12].
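As an illustration, the sketch below computes ROUGE-1, ROUGE-2, and ROUGE-L with Google's open-source rouge-score package (installable via pip install rouge-score); note that other ROUGE implementations can differ slightly in tokenization and stemming.

```python
# A minimal sketch of computing ROUGE-1/2/L with the open-source
# rouge-score package (pip install rouge-score); other implementations
# may differ slightly in tokenization and stemming.
from rouge_score import rouge_scorer

reference = "the cat was found under the bed"
candidate = "the cat was under the bed"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(reference, candidate)  # reference first, then candidate
for name, s in scores.items():
    print(f"{name}: P={s.precision:.3f} R={s.recall:.3f} F1={s.fmeasure:.3f}")
```

On this toy pair, ROUGE-2 drops faster than ROUGE-1 because bigram overlap is stricter than unigram overlap, which is exactly why the two are reported together.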
A dataset is the first step toward obtaining a well-trained model for a specific task in artificial intelligence [13]. For text summarization, it is necessary to survey the available datasets, their languages, the types of tasks they support, and the latest methods that yielded well-trained models. Below we review the most widely used datasets, together with a summary table of their key information:
3) CNN/Daily Mail
The "CNN/Daily Mail" dataset is used for summarizing text. Questions (with one of
the entities obscured) and associated portions from CNN and Daily Mail news articles
were manually constructed to train the algorithm to answer a fill-in-the-blank inquiry.
Authors have made available scripts to crawl, extract, and produce passage and question
pairs from various resources. The hands specify 11,487 test pairs, 13368 validation
pairs, and 286,817 training pairs. Documents in training average 766 words and 29.74
sentences in length, whereas their summaries only use 53 comments and 3.72 sentences
[18].in the latest studies in 2022 [19] using “Pegasus 2B + SLiC”, Y. Zhao and others
get the best result to text summarization task in this data set, Rouge-1= 47.97, Rouge-
2= 24.18, Rouge-L= 44.88.
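For reference, the snippet below loads CNN/Daily Mail with the Hugging Face datasets library (pip install datasets); the exact split sizes depend on the dataset version and may differ slightly from the figures above.

```python
# A minimal sketch of loading CNN/Daily Mail with the Hugging Face
# datasets library; split sizes can differ slightly from the figures
# cited above depending on the dataset version.
from datasets import load_dataset

cnn_dm = load_dataset("cnn_dailymail", "3.0.0")
print({split: len(ds) for split, ds in cnn_dm.items()})

sample = cnn_dm["train"][0]
print(sample["article"][:300])    # source news article (truncated preview)
print(sample["highlights"])       # reference summary (bullet highlights)
```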
5) Arxiv HEP-TH (citation graph)
The dataset that makes up the "Arxiv HEP-TH" citation graph contains 27,770 publications, connected by a total of 352,807 edges. A directed edge connects two nodes in the network if and only if paper i cites paper j. The graph gives no indication of papers outside the dataset being referenced by, or citing, documents inside it. Specifically, the data covers articles published between January 1993 and April 2003 (124 months) [22]. A transformer with top-down and bottom-up inference was applied to summarize text on this dataset in [21], reaching ROUGE-1 = 50.95, ROUGE-2 = 21.93, and ROUGE-L = 45.61.
6) XSum (Extreme Summarization)
Extreme Summarization (XSum) is a dataset for evaluating abstractive single-document summarization methods. The objective is to write a novel one-sentence summary that captures the article's subject. A one-sentence summary is provided for each of the 226,711 news stories in the collection, which comprises BBC articles from 2010 to 2017 covering many different topics (e.g., "News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment, and Arts"). The official training, validation, and test sets consist of 204,045 (90%), 11,332 (5%), and 11,334 (5%) documents, respectively [23]. In [19], researchers working on this dataset introduced sequence likelihood calibration (SLiC). With SLiC, decoding heuristics become unnecessary, and the quality of decoding candidates improves dramatically regardless of the decoding method. The approach exceeds SOTA results on various generation tasks, including abstractive summarization, question generation, abstractive question answering, and data-to-text generation, even with small models. Pegasus 2B + SLiC achieved ROUGE-1 = 49.77, ROUGE-2 = 27.09, and ROUGE-L = 42.08.
8) OrangeSum
OrangeSum is an extreme summarization dataset focused on summarizing a single document. It includes two tasks, title generation and abstract generation. The title and abstract tasks have average ground-truth summary lengths of 11.42 and 32.12 words, respectively, while the source documents average 315 and 350 words. Creating a French-language counterpart of the XSum dataset was the impetus behind OrangeSum; to do well on OrangeSum, models need to be more abstractive than on the historical CNN, Daily Mail, and NY Times datasets. OrangeSum was built by extracting article titles and abstracts from the Orange Actu website. The scraped pages span almost a decade, from February 2011 to September 2020, and fall under five broad headings: France, world, politics, automotive, and society. The society category is itself divided into eight subcategories: health, environment, people, culture, media, high-tech, unusual ("insolite" in French), and miscellaneous [26]. The 2021 paper [27] introduces "BARThez", the first large-scale pre-trained seq2seq model for French. Thanks to its BART foundation, "BARThez" is particularly well suited to generative tasks, and it is highly competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT. The authors also continued pre-training a multilingual BART on BARThez' corpus and show that the resulting model, mBARThez, significantly improves BARThez' generative performance.
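As a usage illustration, the sketch below runs French abstractive summarization with a BARThez checkpoint fine-tuned on OrangeSum via the Hugging Face transformers pipeline; the checkpoint name moussaKam/barthez-orangesum-abstract is an assumption based on the public model hub, not something specified in the paper.

```python
# A minimal sketch of French abstractive summarization with BARThez;
# the checkpoint name "moussaKam/barthez-orangesum-abstract" is an
# assumption based on the public Hugging Face Hub and may change.
from transformers import pipeline

summarizer = pipeline("summarization",
                      model="moussaKam/barthez-orangesum-abstract")
article = ("Le gouvernement a annoncé mardi un nouveau plan pour "
           "soutenir les entreprises touchées par la crise, avec des "
           "aides directes et des reports de charges.")
print(summarizer(article, max_length=32, min_length=8)[0]["summary_text"])
```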
and the writing styles represented in these pieces are highly diverse [32]. Using state-of-the-art NLP, [33] produces abstractive summaries of "recorded instructional films ranging from gardening and cooking to software configuration and sports". The model is pre-trained on a few large cross-domain datasets of written and spoken English using transfer learning. ROUGE is used for evaluation; on this dataset, the paper achieved its best results using BertSum: ROUGE-1 = 35.91, ROUGE-2 = 13.9, ROUGE-L = 34.82.
12) Urdu News Dataset
This collection includes over a million news articles from the fields of business and economics, science and technology, entertainment, and sports. The dataset is useful for numerous Urdu NLP applications, since its four distinct categories were carefully selected to eliminate ambiguity. For many NLP and machine/deep-learning tasks, including "text processing, classification, summarization, named entity recognition, topic modeling, and text generation", this dataset ("A Large-Scale News Dataset for Urdu Text Processing", created in 2021) is the only one currently available in the Urdu language; so far, no text-summarization studies have been conducted on it [34].
13) Bengali News Articles ("IndicNLP")
For the past several decades, natural language processing (NLP) has been applied extensively to Western languages, especially English. Language-processing research on their eastern counterparts, particularly the languages of the Indian subcontinent, needs to grow. Western languages have access to a wealth of dictionaries, WordNets, and related resources. This dataset, a collection of Bengali news items that has been cleaned and comes with train and test sets for comparing classification and summarization models, contains 14k news items. It can be applied to classification and language-modeling problems [35]. Although there has been a substantial amount of important work on abstractive summarization in English, only a few works address Bengali abstractive news summarization. Paper [36] describes a seq2seq Long Short-Term Memory (LSTM) network built on the encoder-decoder architecture. The proposed system uses an attention-based model to handle long sequences of words. The summaries were evaluated subjectively and statistically, and the results were compared with other published results. The attention mechanism demonstrated a considerable improvement in human assessment ratings over the state of the art. A sketch of this kind of architecture appears below.
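To illustrate the kind of architecture described in [36], the following is a minimal PyTorch sketch of a seq2seq LSTM encoder-decoder with dot-product attention; the dimensions, attention form, and training details are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal PyTorch sketch of a seq2seq LSTM encoder-decoder with
# dot-product attention, in the spirit of the architecture in [36];
# all dimensions and details are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqAttn(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim + hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        enc_out, state = self.encoder(self.embed(src_ids))   # (B, S, H)
        dec_emb = self.embed(tgt_ids)                        # (B, T, E)
        logits = []
        for t in range(dec_emb.size(1)):
            query = state[0][-1].unsqueeze(1)                # (B, 1, H)
            # Dot-product attention over all encoder states.
            attn = F.softmax(torch.bmm(query, enc_out.transpose(1, 2)), dim=-1)
            context = torch.bmm(attn, enc_out)               # (B, 1, H)
            step_in = torch.cat([dec_emb[:, t:t+1], context], dim=-1)
            dec_out, state = self.decoder(step_in, state)
            logits.append(self.out(dec_out))
        return torch.cat(logits, dim=1)                      # (B, T, V)

model = Seq2SeqAttn(vocab_size=8000)
src = torch.randint(0, 8000, (2, 30))   # toy article token ids
tgt = torch.randint(0, 8000, (2, 10))   # toy summary token ids
print(model(src, tgt).shape)            # torch.Size([2, 10, 8000])
```

At each decoding step the attention weights re-score all encoder states, which is what lets the model cope with long input sequences better than a plain encoder-decoder.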
accurate summaries that benefit society. The model achieves an F-score of 58% and a ROUGE score of 67.5%. Pandas, NumPy, scikit-learn, and other libraries are used; LSTM, word embeddings, and seq2seq are used to train the model.
number of samples and the difficulty of training them, which is why researchers often
resort to pre-training and fine-tuning methods.
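As an illustration of the pre-training and fine-tuning recipe mentioned above, here is a minimal sketch that fine-tunes a small pretrained seq2seq model on a slice of CNN/Daily Mail with the Hugging Face transformers trainer; the model choice (t5-small), the hyperparameters, and the data slice are illustrative assumptions only.

```python
# A minimal sketch of fine-tuning a pretrained seq2seq model for
# summarization; t5-small, the data slice, and all hyperparameters
# are illustrative assumptions, not a recipe from any paper surveyed.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
train_raw = load_dataset("cnn_dailymail", "3.0.0",
                         split="train[:1000]")  # small demo slice

def preprocess(batch):
    # T5 expects a task prefix; targets are tokenized as labels.
    enc = tok(["summarize: " + a for a in batch["article"]],
              max_length=512, truncation=True)
    enc["labels"] = tok(text_target=batch["highlights"],
                        max_length=64, truncation=True)["input_ids"]
    return enc

train = train_raw.map(preprocess, batched=True,
                      remove_columns=train_raw.column_names)
args = Seq2SeqTrainingArguments(output_dir="ft-sum",
                                per_device_train_batch_size=4,
                                num_train_epochs=1,
                                logging_steps=50)
Seq2SeqTrainer(model=model, args=args, train_dataset=train,
               data_collator=DataCollatorForSeq2Seq(tok, model=model)).train()
```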
19) SumArabic
SumArabic is a dataset for abstractive text summarization in Arabic. The content comes from two Arabic news websites: emaratalyoum.com and www.almamlakatv.com. The data is divided into four sets: training, testing, validation, and out-of-domain, with 75,817 training, 4,121 validation, 4,174 test, and 652 out-of-domain examples, for a total of 84,764 [47]. This dataset is one of the newest additions of 2022 for the Arabic language, and no academic studies have been conducted on it yet.
20) TalkSumm
TalkSumm is a dataset containing 1,705 automatically generated summaries of scientific papers from ACL, NAACL, EMNLP, SIGDIAL (2015-2018), and ICML (2017-2018). The dataset contains titles, URLs, and the corresponding summaries [48]. The accompanying study presents a novel method for automatically producing summaries of scientific publications based on recordings of talks at scientific conferences, arguing that such presentations provide a coherent and concise account of a paper's content and can serve as the foundation for effective summaries. The authors compiled a collection of paper summaries for 1,716 publications and their accompanying videos. A model trained on this dataset outperforms models trained on a manually constructed dataset of summaries. Furthermore, human experts confirmed the quality of these summaries [49].
Table 1. Basic information about the datasets used for text summarization: language, content, input document (single or multiple), summary type, and the best research and methods that achieved the highest scores on the summarization metrics.
| Dataset | Language | Content | Input document | Summary type | Best method | ROUGE results |
|---|---|---|---|---|---|---|
| 5. Arxiv HEP-TH (citation graph) [22] | English | Scientific papers | Single document | — | B. Pang et al. [21] \ top-down/bottom-up transformer | ROUGE-1=50.95, ROUGE-2=21.93, ROUGE-L=45.61 |
| 6. XSum [23] | English | News ("Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, and Arts") | Single document | Extreme/Abstractive | Y. Zhao et al. [19] \ Pegasus 2B + SLiC | ROUGE-1=49.77, ROUGE-2=27.09, ROUGE-L=42.08 |
| 7. MentSum ("Mental Health Summarization Dataset") [24] | English | Mental health | Single document | Extreme/Abstractive | S. Sotudeh et al. [25] \ BART | ROUGE-1=29.13, ROUGE-2=7.98, ROUGE-L=20.27 |
| 8. OrangeSum [26] | French | Society | Single document | Abstractive | M. K. Eddine et al. [27] \ BARThez | ROUGE-1=15.49, ROUGE-2=5.82, ROUGE-L=13.05 |
| 9. BookSum [28] | English | Literary works | Single document | Abstractive | W. Xiong et al. [29] \ BART-LS | ROUGE-1=38.5, ROUGE-2=10.3, ROUGE-L=36.4 |
| 10. arXiv Summarization Dataset [30] | English | Scientific documents | Single document | Extractive | Liang et al. [31] \ Facet-Aware Modeling | ROUGE-1=40.92, ROUGE-2=13.75, ROUGE-L=35.56 |
3 Conclusion
This review concludes that text-summarization datasets are concentrated in the news and scientific domains, especially for the abstractive type. The available datasets lack topical diversity, with literary, artistic, and some scientific fields in particular remaining underrepresented. A large available dataset can be used to pre-train a model, which can then be fine-tuned on a small emerging dataset to obtain good results. After the significant development in text-summarization techniques, obtaining a good summary has become the starting point for other NLP tasks.
4 References
[1] Suleiman, Dima, and Arafat Awajan. "Deep learning based abstractive text summarization:
approaches, datasets, evaluation measures, and challenges." Mathematical problems in en-
gineering 2020 (2020).
[2] Munot, Nikita, and Sharvari S. Govilkar. "Comparative study of text summarization meth-
ods." International Journal of Computer Applications 102, no. 12, 2014.
[3] V. Gupta and G. S. Lehal, "A survey of text summarization extractive techniques," Journal
of emerging technologies in web intelligence, vol. 2, no. 3, pp. 258-268, 2010.
[4] Sinha, Aakash, Abhishek Yadav, and Akshay Gahlot. "Extractive text summarization using
neural networks." arXiv preprint arXiv: 1802.10137, 2018.
[5] W.-T. Hsu, C.-K. Lin, M.-Y. Lee, K. Min, J. Tang, and M. Sun, "A unified model for ex-
tractive and abstractive summarization using inconsistency loss," arXiv preprint
arXiv:1805.06266, 2018.
[6] Nallapati, Ramesh, Bowen Zhou, Caglar Gulcehre, and Bing Xiang. "Abstractive text sum-
marization using sequence-to-sequence rnns and beyond." arXiv preprint arXiv:1602.06023,
2016.
[7] C.-Y. Lin and E. Hovy, "From single to multi-document summarization," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 457-464.
[8] Christian, Hans, Mikhael Pramodana Agus, and Derwin Suhartono. "Single document auto-
matic text summarization using term frequency-inverse document frequency (TF-IDF)."
ComTech: Computer, Mathematics and Engineering Applications 7, no. 4 (2016): 285-294.
[9] Mutlu, Begum, Ebru A. Sezer, and M. Ali Akcayol. "Multi-document extractive text sum-
marization: A comparative assessment on features." Knowledge-Based Systems 183 (2019):
104848.
[10] Lin, Chin-Yew. "Rouge: A package for automatic evaluation of summaries." In Text sum-
marization branches out, pp. 74-81. 2004.
[11] Ozsoy, Makbule Gulcin, Ferda Nur Alpaslan, and Ilyas Cicekli. "Text summarization using
latent semantic analysis." Journal of Information Science 37, no. 4 (2011): 405-417.
[12] Gong, Yihong, and Xin Liu. "Generic text summarization using relevance measure and la-
tent semantic analysis." In Proceedings of the 24th annual international ACM SIGIR con-
ference on Research and development in information retrieval, pp. 19-25. 2001.
[13] Wissner-Gross, A., "Datasets over algorithms," Edge.com. Retrieved 8 January 2016.
[14] Demian Gholipour Ghalandari; Chris Hokamp; Nghia The Pham; John Glover; Georgiana
Ifrim, WCEP Dataset [Dataset]. https://paperswithcode.com/dataset/wcep
[15] W. Xiao, I. Beltagy, G. Carenini, and A. Cohan, "Primer: Pyramid-based masked sentence pre-training for multi-document summarization," arXiv preprint arXiv:2110.08499, 2021.
[16] DUC 2004 Dataset [Dataset]. https://paperswithcode.com/dataset/duc-2004
[17] S. Shen, Y. Zhao, Z. Liu, and M. Sun, "Neural headline generation with sentence-wise op-
timization," arXiv preprint arXiv:1604.01904, 2016.
[18] Ramesh Nallapati; Bo-Wen Zhou; Cicero Nogueira dos santos; Caglar Gulcehre; Bing
Xiang, CNN/Daily Mail Dataset [Dataset]. https://paperswithcode.com/dataset/cnn-daily-
mail-1.
[19] Y. Zhao, M. Khalman, R. Joshi, S. Narayan, M. Saleh, and P. J. Liu, "Calibrating Sequence Likelihood Improves Conditional Language Generation," arXiv preprint arXiv:2210.00045, 2022.
[20] datadiscovery.nlm.nih.gov (2021). PubMed [Dataset]. https://healthdata.gov/dataset/Pub-
Med/h5mw-dwr6 .
[21] B. Pang, E. Nijkamp, W. Kryściński, S. Savarese, Y. Zhou, and C. Xiong, "Long Document
Summarization with Top-down and Bottom-up Inference," arXiv preprint
arXiv:2203.07586, 2022.
[22] ArXiv Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv.
[23] Shashi Narayan; Shay B. Cohen; Mirella Lapata, XSum Dataset [Dataset]. https://pa-
perswithcode.com/dataset/xsum
[24] Sajad Sotudeh; Nazli Goharian; Zachary Young (2022). Mental Health Summarization
(MentSum) [Dataset]. https://ir.cs.georgetown.edu/resources/mentsum.html.
[25] S. Sotudeh, N. Goharian, and Z. Young, "MentSum: A Resource for Exploring Summariza-
tion of Mental Health Online Posts," arXiv preprint arXiv:2206.00856, 2022.
[26] Moussa Kamal Eddine; Antoine J. -P. Tixier; Michalis Vazirgiannis, OrangeSum Dataset
[Dataset]. https://paperswithcode.com/dataset/orangesum.
[27] M. K. Eddine, A. J.-P. Tixier, and M. Vazirgiannis, "Barthez: a skilled pretrained french
sequence-to-sequence model," arXiv preprint arXiv:2010.12321, 2020.
[28] Wojciech Kryściński; Nazneen Rajani; Divyansh Agarwal; Caiming Xiong; Dragomir
Radev, BookSum Dataset [Dataset]. https://paperswithcode.com/dataset/booksum .
[29] W. Xiong, A. Gupta, S. Toshniwal, Y. Mehdad, and W.-t. Yih, "Adapting Pretrained Text-to-Text Models for Long Text Sequences," arXiv preprint arXiv:2209.10052, 2022.
[30] Arman Cohan; Franck Dernoncourt; Doo Soon Kim; Trung Bui; Seokhwan Kim; Walter Chang; Nazli Goharian (2021). arXiv Summarization Dataset [Dataset]. https://paperswithcode.com/dataset/arxiv-summarization-dataset.
[31] Liang, Xinnian, Shuangzhi Wu, Mu Li, and Zhoujun Li. "Improving unsupervised extractive
summarization with facet-aware modeling." The Association for Computational Linguistics
Findings: ACL-IJCNLP 2021, pp. 1685-1697. 2021.
[32] Mahnaz Koupaee; William Yang Wang (2021). WikiHow Dataset [Dataset]. https://pa-
perswithcode.com/dataset/wikihow
[33] Savelieva, Alexandra, Bryan Au-Yeung, and Vasanth Ramani. "Abstractive summarization
of spoken and written instructions with BERT." arXiv preprint arXiv:2008.09676 (2020).
[34] Saurabh Shahane (2021). Urdu News Dataset [Dataset].
https://www.kaggle.com/saurabhshahane/urdu-news-dataset
[35] PrithwirajSust (2020). Bengali News Summarization Dataset [Dataset].
https://www.kaggle.com/prithwirajsust/bengali-news-summarization-dataset
[36] Bhattacharjee, Prithwiraj, Avi Mallick, and Saiful Islam. "Bengali abstractive news sum-
marization (BANS): a neural attention approach." In Proceedings of International Confer-
ence on Trends in Computational and Cognitive Engineering, pp. 41-51. Springer, Singa-
pore, 2021.
[37] Gaurav (2020). Hindi Text Short Summarization Corpus [Dataset].
https://www.kaggle.com/disisbig/hindi-text-short-summarization-corpus
[38] Shah, Aashil, Devam Zanzmera, and Kevan Mehta. "Deep Learning based Automatic Hindi
Text Summarization." In 2022 6th International Conference on Computing Methodologies
and Communication (ICCMC), pp. 1455-1461. IEEE, 2022.
[39] Abdelkader Rhouati (2020). Arabic News articles from Aljazeera.net [Dataset].
https://www.kaggle.com/arhouati/arabic-news-articles-from-aljazeeranet .
[40] Suleiman, Dima, and Arafat Awajan. "Multilayer encoder and single-layer decoder for ab-
stractive Arabic text summarization." Knowledge-Based Systems 237 (2022): 107791.
[41] Allen Institute for AI (2020). cord-19 [Dataset]. https://allenai.org/data/cord-19 .
[42] Bishop, Jennifer, Qianqian Xie, and Sophia Ananiadou. "GenCompareSum: a hybrid unsu-
pervised summarization method using salience." In Proceedings of the 21st Workshop on
Biomedical Language Processing, pp. 220-240. 2022.
[43] Aaditya Raj (2022). Scientific Document Summarization (SciTLDR-A) [Dataset].
https://www.kaggle.com/datasets/adityawithdoublea/scitldra .
[44] Cachola, Isabel, Kyle Lo, Arman Cohan, and Daniel S. Weld. "TLDR: Extreme summari-
zation of scientific documents." arXiv preprint arXiv:2004.15011, 2020.
[45] Michihiro Yasunaga; Jungo Kasai; Rui Zhang; Alexander R. Fabbri; Irene Li; Dan Fried-
man; Dragomir R. Radev (2021). ScisummNet Dataset [Dataset]. https://paperswith-
code.com/dataset/scisummnet .
[46] Park, Jong Won. "Continual bert: Continual learning for adaptive extractive summarization
of covid-19 literature." arXiv preprint arXiv:2007.03405, 2020.
[47] Mohammad Bani Almarjeh (2022). SumArabic [Dataset].
http://doi.org/10.17632/7kr75c9h24.1 .
[48] Guy Lev; Michal Shmueli-Scheuer; Jonathan Herzig; Achiya Jerbi; David Konopnicki,
TalkSumm Dataset [Dataset]. https://paperswithcode.com/dataset/talksumm .
[49] Lev, Guy, Michal Shmueli-Scheuer, Jonathan Herzig, Achiya Jerbi, and David Konopnicki.
"Talksumm: A dataset and scalable annotation method for scientific paper summarization
based on conference talks." arXiv preprint arXiv:1906.01351 , 2019.
Article submitted 23 October 2022. Published as resubmitted by the authors 31 December 2022.