The Design of Automatic Summarization of 1fb33ee8
1. INTRODUCTION
Long documents are tiring to read, and readers often need only the key
information. Text summarization addresses this by condensing the information
contained in a text. Done manually, however, summarizing takes a long time
and costs a lot of money, so automatic text summarization is needed. Automatic text
summarization is the automated generation, by software, of a text that
conveys the significant information in the source text. The summarized
result is no longer than the source text [1]. Automatic text summarization is not a new
research topic; it has been studied since Luhn's work in 1958 [2]. There have also
been many studies of automatic text summarization in Indonesian, most of which
produce summaries using an extractive approach. Extractive
summarization has been implemented with various machine learning methods, such as Support
Vector Machine [3], Relevance Vector Machine [3], and Naïve Bayes [4]. Other studies use optimization
methods [5][6], neural networks [7], the Vector Space Model [8], the Restricted Boltzmann
Machine [9], and many more [10][11][12].
Jurnal Teknologi Informasi dan Pendidikan
Volume 15, No. 1, Maret 2022
https://doi.org/10.24036/tip.v15i1
An abstractive approach was taken in Kasyfi Ivanedra's research [13]. In that
research, the input to the system is a single paragraph. News
documents, however, usually contain more than one paragraph. Therefore, in this study, a model is
designed to process documents with more than one paragraph. This study adopts the
research of Wang [14], in which summarization uses a hybrid approach and the input
is Chinese news text. The first stage in Wang's research is the
extractive stage, which uses the PageRank method, an unsupervised approach to
summarization. Core sentences are selected based on the relationship between the
text title and each sentence and the relationship between sentences. However, in
Hadyan's research [15], which performed extractive summarization of Indonesian text
using PageRank, the F-measure obtained was not very good. In this study, therefore,
the extractive stage is carried out using the Support Vector Machine method, which
research [3] shows can produce fairly good accuracy.
Meanwhile, the abstractive process is designed using a seq2seq RNN, as in Wang's
research, with the addition of the pointer mechanism. The pointer mechanism is
considered able to solve the out-of-vocabulary (OOV) problem, which usually
arises when the features used are words. This has also been shown in Silpi's
research: although the results were still not good, with the pointer mechanism the ROUGE-1,
ROUGE-2, and ROUGE-L values increased by 18%, 34%, and 20% compared with
the standard attention mechanism model [16].
This study aims to recommend a design for automatic text summarization in
Indonesian. So that the summarized text reads well, an automatic text
summarization system is designed using a hybrid approach. The design
is based on the results of a literature study, and the outcome of this study takes
the form of recommendations for automatic text summarization designs.
2. RESEARCH METHOD
This study discusses design recommendations for building an automatic text
summarization system for long texts. The design of this system is based on the results of
previous research. This research consists of several stages, namely:
3.1. Sentence Splitting
This stage is commonly carried out when summarizing text with an extractive approach
because the output of extractive summarization is a set of sentences. This process is
present in almost every extractive summarization study. This stage breaks the paragraph into
sentences, where the delimiter is the period.
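The sentence-splitting stage can be sketched as follows. This is a minimal illustration rather than the design's actual implementation: the function name split_sentences and the sample text are hypothetical, and the rule "a period followed by whitespace ends a sentence" is an assumption that will wrongly split abbreviations such as "Dr. ".

```python
import re

def split_sentences(paragraph: str) -> list[str]:
    """Split a paragraph into sentences, using the period as the delimiter."""
    # Split at whitespace that follows a period. The period stays attached to
    # its sentence, and decimals such as "3.5" are left intact because the
    # period there is not followed by whitespace.
    parts = re.split(r"(?<=\.)\s+", paragraph.strip())
    return [part for part in parts if part]

# Hypothetical sample paragraph.
sentences = split_sentences(
    "Harga beras naik 3.5 persen. Pemerintah menyiapkan operasi pasar."
)
```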
3.2. Tokenizing
This stage breaks each sentence into individual words (tokens), so that the
later stages can operate on words rather than on whole sentences.
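As an illustration of the tokenizing stage, the sketch below splits a sentence into tokens. Using whitespace as the token delimiter is an assumption, and the function name tokenize is hypothetical. Punctuation remains attached to the tokens at this point.

```python
def tokenize(sentence: str) -> list[str]:
    """Break a sentence into tokens, using whitespace as the delimiter."""
    return sentence.split()

# Hypothetical sample sentence; the trailing period stays on the last token.
tokens = tokenize("Pemerintah menyiapkan operasi pasar.")
```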
3.3. Coreference Resolution
This stage replaces pronouns for people and things with the entities they refer to.
Extractive summarization selects the sentences considered to be the key sentences of a
document as summary sentences without changing them. If coreference resolution is not
performed first, a selected sentence that contains a pronoun for a person or thing
will confuse the reader.
3.4. Filtering
This stage removes punctuation marks. The result of this stage is words and
numbers only.
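A minimal sketch of the filtering stage is given below, assuming the input is already a list of tokens. The function name filter_tokens is hypothetical. Only letters and digits are kept, so a token made up entirely of punctuation disappears.

```python
def filter_tokens(tokens: list[str]) -> list[str]:
    """Remove punctuation, keeping only letters and digits."""
    # Keep alphanumeric characters only, then drop tokens that became empty.
    cleaned = ("".join(ch for ch in tok if ch.isalnum()) for tok in tokens)
    return [tok for tok in cleaned if tok]

# Hypothetical sample tokens.
filtered = filter_tokens(["Harga", "naik", "35", "persen", ",", "kata", "menteri."])
```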
3.5. Case Folding
This stage makes the letter case uniform, converting the text either to all
uppercase or to all lowercase.
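The case-uniformity step is simple enough to show in a few lines. The sketch below assumes lowercase as the target case, although the text allows either choice, and the function name case_fold is hypothetical.

```python
def case_fold(tokens: list[str]) -> list[str]:
    """Convert every token to lowercase so that 'Harga' and 'harga' match."""
    return [tok.lower() for tok in tokens]

# Hypothetical sample tokens.
folded = case_fold(["Harga", "Beras", "NAIK"])
```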
3.6. Stopword Removal
This stage eliminates words that are considered unrelated to the topic. In this study, Tala's
stoplist is used [18].
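The stopword step can be sketched as a set lookup. The STOPLIST below is an illustrative stand-in of a few common Indonesian function words and is not the actual content of Tala's stoplist. The function name remove_stopwords is also hypothetical.

```python
# Illustrative stand-in only: a handful of common Indonesian function words,
# NOT the actual contents of Tala's stoplist.
STOPLIST = {"yang", "di", "dan", "itu", "untuk"}

def remove_stopwords(tokens: list[str], stoplist: set[str] = STOPLIST) -> list[str]:
    """Drop every token that appears in the stoplist."""
    return [tok for tok in tokens if tok not in stoplist]

# Hypothetical sample tokens (already lowercased and filtered).
content_words = remove_stopwords(
    ["pemerintah", "menyiapkan", "operasi", "pasar", "untuk", "beras"]
)
```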
3.7. Feature Extraction
This stage assigns a weight to each candidate summary sentence. A frequently used
feature is TF-IDF [17][18]. Other studies also use
features such as sentence length, sentence position, the title feature, the sentence-to-sentence feature,
negative keywords, and the connection between sentences [3][19][20].
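To make the TF-IDF weighting concrete, the sketch below scores each sentence by the sum of the TF-IDF weights of its terms. Several choices here are assumptions rather than the cited studies' exact formulas: each sentence is treated as a "document" when computing IDF, IDF is log(n / df), and the function name tfidf_sentence_scores is hypothetical.

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences: list[list[str]]) -> list[float]:
    """Score each tokenized sentence by the sum of its terms' TF-IDF weights."""
    n = len(sentences)
    # Document frequency: the number of sentences that contain each term.
    df = Counter(term for sent in sentences for term in set(sent))
    scores = []
    for sent in sentences:
        tf = Counter(sent)  # raw term frequency within the sentence
        scores.append(sum(tf[t] * math.log(n / df[t]) for t in tf))
    return scores

# Hypothetical preprocessed sentences (tokenized, filtered, lowercased).
scores = tfidf_sentence_scores([
    ["harga", "beras", "naik"],
    ["harga", "beras", "turun"],
    ["pemerintah", "operasi", "pasar"],
])
```

In this toy input the third sentence scores highest because none of its terms occurs in any other sentence.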
The next stage is summary generation. The cleaned document is passed to the
extractive summarization stage. This stage
aims to shorten the time needed to produce the abstractive summary. Its output is the
sentences considered important in the document. Sentences are selected using a
supervised method, the Support Vector Machine. The method is first trained on a
training dataset, from which it builds a hyperplane that separates summary
sentences from non-summary sentences. If the input sentence is a summary sentence, the method
produces a positive value; if it is not, the method produces a negative value. This
method is deemed sufficient to produce a summary comparable to those of other methods.
Given input data in the form of a document consisting of n sentences,
D = {S_1, S_2, S_3, …, S_n}, the output of this
method is the set of selected sentences, for example D_1 = {S_1, S_2, …, S_k}, where k < n.
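The selection rule can be illustrated with the decision function of a linear SVM, f(x) = w · x + b, where a sentence is kept when f(x) is positive. The sketch below assumes a linear kernel and an already trained model. The weights, bias, and two-dimensional feature vectors are hypothetical values chosen for illustration, not results of any training run.

```python
def svm_decision(features: list[float], weights: list[float], bias: float) -> float:
    """Signed score w . x + b of a linear SVM for one sentence."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def select_sentences(sentences, feature_vectors, weights, bias):
    """Keep the sentences that fall on the positive side of the hyperplane."""
    return [
        sent
        for sent, x in zip(sentences, feature_vectors)
        if svm_decision(x, weights, bias) > 0
    ]

# Hypothetical trained parameters and sentence features.
weights, bias = [2.0, 1.0], -1.5
D = ["S_1", "S_2", "S_3"]
X = [[0.9, 0.8], [0.2, 0.1], [0.5, 0.7]]

D_1 = select_sentences(D, X, weights, bias)  # here k = 2 and n = 3
```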
The next stage is to form the summary sentences. This stage is designed using the
seq2seq RNN method, as in Wang's study [14]. Wang's research shows that with
this method, the summary results are better than those of the usual RNN. At this stage, the
sentences produced by the extractive stage are broken down into words. The words are then
converted into vectors using word embedding. Then, through the RNN encoding process, a
summary sentence is formed. So, with the extractive summarization result
D_1 = {S_1, S_2, …, S_k} as input, the RNN method produces
D_2 = {w_1, w_2, …, w_m}, where m is a number of
words smaller than the number of words in D_1. This process can be seen
in Figure 1. The difference between Kasyfi's and Wang's RNN is that Kasyfi's research
used a development of the RNN, namely LSTM, whereas Wang used the seq2seq
RNN method. To the RNN method, a pointer mechanism is added, which helps
reduce interference from out-of-vocabulary (OOV) words, which were a problem in
Kasyfi's research.
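How a pointer mechanism deals with OOV words can be sketched with the final word distribution of a pointer-generator style decoder step. This is an assumption about the formulation, which may differ from the exact variant used in [16]: the generation probability p_gen mixes the decoder's vocabulary distribution with the attention weights over the source tokens, so a source word outside the vocabulary can still be copied. All numbers and the function name final_distribution are hypothetical.

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generation and copying: p_gen * P_vocab(w) + (1 - p_gen) * attention on w."""
    # Probability mass for generating each in-vocabulary word.
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Probability mass for copying each source token, including OOV words.
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final

# Hypothetical decoder step: "bulog" is not in the vocabulary.
vocab_dist = {"harga": 0.6, "naik": 0.3, "<unk>": 0.1}
source_tokens = ["harga", "bulog", "naik"]
attention = [0.2, 0.7, 0.1]

dist = final_distribution(0.5, vocab_dist, attention, source_tokens)
```

Here "bulog" receives probability only through the copy term, which is what lets the decoder emit it instead of the unknown-word token.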
4. CONCLUSION
Based on the results and discussion, this design can handle long
documents, whose extractive output can then be passed on to abstractive summarization.
In this research, the input data is so far only a single document; in the future, a
system could be designed that handles multi-document input.
REFERENCES
[1] J.-M. Torres-Moreno, Automatic text summarization. John Wiley & Sons, 2014.
[2] H. P. Luhn, “The automatic creation of literature abstracts,” IBM J. Res. Dev., vol. 2,
no. 2, pp. 159–165, 1958.
[3] E. Rainarli and K. E. Dewi, “Relevance Vector Machine for Summarization,” in IOP
Conference Series: Materials Science and Engineering, 2018, vol. 407, no. 1, p. 12075.
[4] S. Raharjo and E. Winarko, “Klasterisasi, klasifikasi dan peringkasan teks berbahasa
indonesia,” Pros. KOMMIT, 2014.
[5] A. N. Ammar and S. Suyanto, “Peringkasan Teks Ekstraktif Menggunakan Binary
Firefly Algorithm,” Indones. J. Comput., vol. 5, no. 2, pp. 31–42, 2020.
[6] Z. Zulkifli, A. T. Wibowo, and G. Septiana, “Pembobotan Fitur Ekstraksi Pada
Peringkasan Teks Bahasa Indonesia Menggunakan Algoritma Genetika,”
eProceedings Eng., vol. 2, no. 2, 2015.
[7] M. N. Rachmatullah and A. Primanita, “IMPLEMENTASI JARINGAN SYARAF
TIRUAN PADA SISTEM PERINGKASAN TEKS OTOMATIS MENGGUNAKAN
EKSTRAKSI CIRI.”
[8] C. Slamet, A. R. Atmadja, D. S. Maylawati, R. S. Lestari, W. Darmalaksana, and M.
A. Ramdhani, “Automated text summarization for Indonesian article using vector
space model,” in IOP Conference Series: Materials Science and Engineering, 2018, vol.
288, no. 1, p. 12037.
[9] R. Widiastutik, J. Santoso, and others, “Peringkasan Teks Ekstraktif pada Dokumen
Tunggal Menggunakan Metode Restricted Boltzmann Machine,” J. Intell. Syst.
Comput., vol. 1, no. 2, pp. 58–64, 2019.
[10] R. Indrianto, M. A. Fauzi, and L. Muflikhah, “Peringkasan Teks Otomatis Pada
Artikel Berita Kesehatan Menggunakan K-Nearest Neighbor Berbasis Fitur
Statistik,” J. Pengemb. Teknol. Inf. dan Ilmu Komput., vol. 1, no. 11, pp. 1198–1203,
2017.
[11] M. Mustaqhfiri, Z. Abidin, and R. Kusumawati, “Peringkasan teks otomatis berita
berbahasa Indonesia menggunakan metode Maximum Marginal Relevance,” Matics,
2011.
[12] C. Fang, D. Mu, Z. Deng, and Z. Wu, “Word-sentence co-ranking for automatic
extractive text summarization,” Expert Syst. Appl., vol. 72, pp. 189–195, 2017.
[13] K. Ivanedra and M. Mustikasari, “Implementasi Metode Recurrent Neural Network
Pada Text Summarization Dengan Teknik Abstraktif,” J. Teknol. Inf. dan Ilmu
Komput., vol. 6, no. 4, pp. 377–382, 2019.
[14] S. Wang, X. Zhao, B. Li, B. Ge, and D. Tang, “Integrating extractive and abstractive
models for long text summarization,” in 2017 IEEE International Congress on Big Data
(BigData Congress), 2017, pp. 305–312.
[15] F. Hadyan, M. A. Bijaksana, and others, “Comparison of Document Index Graph
Using TextRank and HITS Weighting Method in Automatic Text Summarization,”
in Journal of Physics: Conference Series, 2017, vol. 801, no. 1, p. 12076.
[16] A. S. Alpiani and S. Suyanto, “Pointer Generator dan Coverage Weighting untuk
Memperbaiki Peringkasan Abstraktif,” Indones. J. Comput., vol. 4, no. 2, pp. 169–176,
2019.
[17] N. G., Indriati Indriati, and Ratih Dewi, “Peringkasan Teks Otomatis Secara
Ekstraktif Pada Artikel Berita Kesehatan Berbahasa Indonesia Dengan
Menggunakan Metode Latent Semantic Analysis,” J. Pengemb. Teknol. Inf. dan Ilmu
Komput., vol. 2, no. 9, pp. 2821–2828, 2018.
[18] A. Najibullah and W. Mingyan, “Otomatisasi peringkasan dokumen sebagai
pendukung sistem manajemen surat,” … Ilm. Teknol. Sist. Inf., 2015.
[19] B. Zaman and E. Winarko, “Analisis Fitur Kalimat untuk Peringkas Teks Otomatis
pada Bahasa Indonesia,” IJCCS (Indonesian J. Comput. Cybern. Syst.), vol. 5, no. 2,
2011.
[20] M. A. Fattah and F. Ren, “GA, MR, FFNN, PNN and GMM based models for
automatic text summarization,” Comput. Speech & Lang., vol. 23, no. 1, pp. 126–144,
2009.