FL21D166 Final-Defence
BANGLA NEWS CLASSIFICATION AND TEXT SUMMARIZATION BASED ON MULTI-TYPE TEXT
BY
MD. ASADUZZAMAN
ID: 191-15-12907
MD. RASEL
ID: 191-15-12729
This Report Presented in Partial Fulfillment of the Requirements for the Degree of
Bachelor of Science in Computer Science and Engineering
Supervised By
Co-Supervised By
This research project, titled “Bangla news classification and text summarization based on Multi-
Type Text”, submitted by Md. Asaduzzaman, ID: 191-15-12907, Md. Rasel, ID: 191-15-12729 and
Nazam Been Saddi, ID: 191-15-12470 to the Department of Computer Science and Engineering,
Daffodil International University has been accepted as satisfactory for the partial fulfilment of the
requirements for the degree of B.Sc. in Computer Science and Engineering and approved as to its style
and contents. The presentation has been held on 02/02/2023.
BOARD OF EXAMINERS
______________________ Chairman
Dr. Touhid Bhuiyan
Professor and Head
Department of Computer Science and Engineering
Faculty of Science & Information Technology
Daffodil International University
_______________________
Dr. Sheak Rashed Haider Noori Internal Examiner
Professor and Associate Head
Department of Computer Science and Engineering
Faculty of Science & Information Technology
Daffodil International University
_______________________
Md. Sazzadur Ahamed Internal Examiner
Assistant Professor
Department of Computer Science and Engineering
Faculty of Science & Information Technology
Daffodil International University
_______________________
Dr. Md. Sazzadur Rahman External Examiner
Assistant Professor
Institute of Information Technology
Jahangirnagar University
We hereby declare that this project has been done by us under the supervision of Sharun Akter
Khushbu, Lecturer, Department of Computer Science and Engineering, Daffodil
International University. We also declare that neither this project nor any part of this project has
been submitted elsewhere for the award of any degree or diploma.
Supervised by:
Co-Supervised by:
Submitted by:
(Md. Asaduzzaman)
ID: 191-15-12907
Department of CSE
Daffodil International University
(Md. Rasel)
ID: 191-15-12729
Department of CSE
Daffodil International University
First, we express our heartiest thanks and gratefulness to almighty God, whose divine blessing
made it possible for us to complete the final year project successfully.
We are really grateful to, and wish to express our profound indebtedness to, Sharun Akter Khushbu,
Lecturer, Department of CSE, Daffodil International University, Dhaka. Our supervisor's deep
knowledge and keen interest in the field of machine learning helped us to carry out this project.
Our supervisor's endless patience, scholarly guidance, continual encouragement, constant and
energetic supervision, constructive criticism, valuable advice, and careful reading and correcting
of many inferior drafts at every stage have made it possible to complete this project.
We would like to express our heartiest gratitude to Professor Dr. Touhid Bhuiyan, Department
of CSE, for his kind help in finishing our project, and also to the other faculty members and the
staff of the CSE department of Daffodil International University.
We would like to thank all of our course mates at Daffodil International University who took part
in discussions while completing the course work.
Finally, we must acknowledge with due respect the constant support and patience of our parents.
CONTENTS
Board of examiners
Declaration
Acknowledgements
Abstract
List of Abbreviations
CHAPTER 1: INTRODUCTION
1.1 Introduction
1.2 Objective
1.3 Motivation
2.1 Introduction
2.2 Literature review
2.3 Summary of the research
2.4 Challenges
4.1 Introduction
APPENDIX
REFERENCES
FIGURES
TABLES
Table 1: Software and Tools
Bangla is not only a language; it is an emotion. We fought to achieve our mother language.
A large amount of data or information has become a part of our regular life, and those data are
mainly digital. Nowadays we face issues such as having to read a large number of paragraphs to
understand something, whether in articles, reviews, or newspapers. When we look at a newspaper,
we see a huge amount of text; if we want to understand the main news, we have to read the full
article, which takes a lot of time.
With the help of the LSTM and sequence-to-sequence algorithms, we present a summarization
approach that can save time and make material or news easy to understand. Through Natural
Language Processing (NLP) we can understand a document well enough to generate its summary.
Text summarization is a process in which a large amount of text is submitted and, with the help
of machine learning (ML), compressed into an understandable short form. Text summarization is
not new in machine learning: in English it is used widely, but in Bangla it is very difficult to
summarize text, mainly because the Bangla language has a large alphabet and many grammatical
rules.
Text summarization can be carried out in two ways: extractive and abstractive. Extractive text
summarization is more accurate for handling massive amounts of data and producing an accurate
summary; it is done by finding the most used or common words, shrinking the text, and presenting
newly assembled sentences. Abstractive text summarization, on the other hand, interprets the data,
improves it, and generates new sentences [1][2]. In other words, extractive summarization selects
and combines important sentences or phrases from the original text to create a summary, while
abstractive summarization generates new sentences and phrases that capture the meaning of the
original text. Extractive summarization is generally easier to implement than abstractive
summarization because it reuses existing sentences rather than generating new language.
The sequence-to-sequence model is a class of models that mainly uses Recurrent Neural Networks
(RNNs) to solve complex language problems such as text summarization. It has gained a lot of
popularity in the last few years, and it provides good-quality summarization because of the
improved techniques proposed on top of the sequence-to-sequence model. Most of these techniques
fall into three different categories: parameter inference, network structure, and decoding [4].
Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) that is capable
of learning long-term dependencies in sequential data. RNNs are a kind of neural network
particularly well suited to processing sequential data, such as time series, natural language text,
and speech. LSTMs learn long-term dependencies by employing a special type of memory cell
that can store information over long periods of time and selectively retain or forget information
as required. This enables LSTMs to model dependencies that are much longer than those captured
by traditional RNNs, which can only learn dependencies over a few time steps [9]. LSTMs have
been applied to a wide range of tasks, including language translation, language modeling, speech
recognition, and more. They are also widely used in natural language processing (NLP) tasks,
such as language generation, machine translation, and text classification. One disadvantage of
LSTMs is that they can be computationally expensive to train, particularly for very large datasets.
Additionally, LSTMs can be difficult to optimize, as the learning process can be unstable. Despite
these challenges, LSTMs have proven to be very powerful and effective at learning long-term
dependencies in sequential data, and they continue to be an important tool in the field of deep
learning [8].
Alongside summarization, we also classify the news dataset: the process of grouping news articles
published in the Bangla language into preset classes such as economics, sports, entertainment,
international, and science and technology. This is accomplished with machine learning algorithms
and natural language processing techniques, which examine the content of the articles and group
them according to keywords, subjects, and topics. The classification of Bangla news aims to
improve the user experience for Bangla-speaking audiences and make it simpler for readers to
find relevant information.
1.2 Objective
The objective of Bangla text summarization is to create a summary of a longer Bangla text
document or passage that retains the most important information and main points from the
original document. The goal is to provide a condensed version of the text that is easier to
read and understand, while still conveying the most important information. Text
summarization can be useful for a variety of applications, such as quickly understanding
the main points of a document, creating a summary for a report or presentation, or reducing
the amount of time needed to read and comprehend a large text.
First, we have to understand the main purpose of text summarization. We live in a period in which
we deal with a large amount of data daily, from blogs, newspapers, and many other sources. Text
summarization is the method of generating a crisp and fluent outline of a long text document. The
goal is to form a summary that retains the most important information from the original text while
reducing its length and removing redundant or irrelevant content. There are many motivations for
using text summarization, including:
Time-saving: When we read a newspaper article, we have to read the whole article to understand
the news, which takes a lot of time. Text summarization lets us understand it in a short amount of
time.
Information overload: With the abundance of information available on the web, it is overwhelming
to sift through all of it. Summaries help people quickly get the main points from a large quantity
of text.
Improved comprehension: A well-written summary helps people grasp the main points of a text
by presenting them in a clear and crisp manner.
Text reuse: Summaries can be used as a starting point for creating new content, such as articles or
reports.
Translation: Summarization can be used to produce shorter versions of texts that are easier to
translate into different languages.
Sentiment analysis: Summarization can be used to quickly grasp the sentiment of a text, such as
whether it is positive or negative.
Bengali is among the most beautiful languages in the world, and it has a fascinating past: it is the
only language in the world for which individuals gave their lives in order to preserve it. There are
many cutting-edge tools and technologies available in the contemporary world for linguistic
study; however, the approaches and methods used to study the Bengali language are less developed
than those used for other languages. We should contribute to our language for this reason.
Our main goal was to put the required groundwork in place in this field, which we accomplished,
so that developers can then build tools for users. Recent studies have looked at text summarization
in Bengali, and numerous analyses have been carried out in the past to condense Bangla texts. We
are attempting to build an automated process to accomplish our goal: the machine is conditioned
by an automated system, so in order to understand our proposed model, one must train the machine.
The objective of our study is to generate abstractive text summaries using our developed model
while maintaining the method's remarkable efficiency and, while training the model, to strive for
accuracy and reduce the total loss.
There are a total of five chapters in our report. Chapter 1 gives an overview of the whole project,
with several sections: 1.1- Introduction, 1.2- Objective, 1.3- Motivation, 1.4- Reasons behind the
study, 1.5- Research Questions, 1.6- Expected Outcome, 1.7- Report layout. The sections in the
second chapter are 2.1- Introduction, 2.2- Literature review, 2.3- Research summary, 2.4-
Challenges. The research method, including its subsections, is covered in Chapter 3: 3.1-
Introduction, 3.2- Research subject and intermediary, 3.3- Data collection procedure, 3.4- Data
fetching and data pre-processing, 3.5- Arithmetical analysis, 3.6- Executional requirements. The
experiments and results are covered in the fourth chapter: 4.1- Introduction, 4.2- Implementational
results, 4.3- Descriptive Analysis, 4.4- Moral. The fifth chapter discusses the subsections 5.1-
Summary of the study, 5.2- Conclusion, 5.3- Recommendations, 5.4- Further study. Each chapter
concludes with material that supported our research.
BACKGROUND STUDIES
2.1 Introduction
Text summarization is the process of shortening a large document or passage into a short form.
Making the document more readable, summarized, and comprehensible is the main theme of text
summarization. Shortening a long text by hand takes considerable time and cost, and after reading
a full document we are sometimes still unable to catch its main theme. Text summarization makes
this easy for us and performs the task in a more efficient way.
We collected our data from different sources such as the internet, various news channel websites,
Kaggle, etc. Getting and saving the information is a multipart process, and at the same time we
need space for storing it, but text summarization reduces the problem. There are two techniques,
extractive and abstractive, to deal with the problem of text summarization; we use the abstractive
way to summarize.
In the Bangla language there are many grammatical rules, which is why it cannot be analyzed like
other languages [1]. This is the main reason to choose this approach to Bangla text summarization.
The primary benefit of deep learning is that it allows computers to learn useful representations
directly from the data.
Extractive text summarization is performed in three phases: analyzing the text, scoring the
sentences, and assembling the high-scoring sentences into a summary, as sketched below. Creating
an intermediate representation is the major step for the document: the text is split into paragraphs,
then sentences, then tokens, followed by stop-word removal [14]. The second step is to determine
which sentences are important for the document and to relate the information to different topics
with the help of sentence scoring; the scoring should also measure whether a sentence is
understandable. The previous steps pass their outputs to the last step, which combines the selected
sources and generates a summary [2] [17] [21].
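A minimal sketch of this three-phase extractive pipeline is shown below. The sentence splitter and the tiny stop-word set are illustrative assumptions of ours; a real Bangla system would use a much larger stop-word list and a more careful tokenizer.

```python
# Minimal sketch: frequency-based extractive summarization in three phases.
from collections import Counter

STOP_WORDS = {"এবং", "ও", "এই"}  # placeholder set; a real system needs a full list

def extractive_summary(text, n_sentences=2):
    # Phase 1: analyze the text: split into sentences (on the Bangla dari) and tokens.
    sentences = [s.strip() for s in text.split("।") if s.strip()]
    words = [w for s in sentences for w in s.split() if w not in STOP_WORDS]
    freq = Counter(words)

    # Phase 2: score each sentence by the frequency of its content words.
    def score(sentence):
        return sum(freq[w] for w in sentence.split() if w not in STOP_WORDS)

    # Phase 3: keep the highest-scoring sentences, preserving original order.
    ranked = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return "। ".join(s for s in sentences if s in ranked) + "।"
```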
Topic representation is the simplest approach: it derives an intermediate representation of the text
to be summarized and then judges the importance of the content based on this representation [15].
Some of the best summarization methods hinge on this kind of representation, and this class of
approaches exhibits wide variation in sophistication and representational power [3]. Here, fifteen
(15) sentence-scoring methods are explained, and every one of them is implemented [1].
The input gate controls the flow of data into the LSTM cell, permitting it to selectively choose
which elements of the input sequence to remember. The output gate controls the flow of data out
of the cell, permitting it to selectively choose which elements to output. By controlling these gates,
LSTMs can learn long-term dependencies and generate output sequences over an extended period.
The gates in an LSTM are composed of a sigmoid layer and a pointwise multiplication operation.
The sigmoid layer outputs a number between 0 and 1, which is then multiplied by the previous
state or the output from the previous layer [8]. This multiplication operation controls how much
information is passed through the gates. The forget gate controls how much of the previous state
is forgotten, while the input gate controls how much of the new input is added to the state [11] [13].
This allows the LSTM to find long-range dependencies and make predictions supported by
contextual information from previous time steps. Overall, LSTMs are a strong tool for modeling
sequential data and are widely used in a range of natural language processing (NLP) and speech
recognition tasks [9].
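The gate behaviour described above matches the standard LSTM formulation from the literature; for reference, the usual equations are written out here in our own notation (they are not reproduced from the original report):

```latex
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)            % forget gate
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)            % input gate
\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)     % candidate cell state
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   % updated cell state
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)            % output gate
h_t = o_t \odot \tanh(C_t)                        % new hidden state
```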
TensorFlow provides a comprehensive set of tools for building and deploying machine learning
models. It includes a library of algorithms, pre-trained models, and tools for training, evaluation,
and visualization.
Our team considered the abstractive summarization of Bengali text when carrying out this
research. We created this model using deep learning, and to train it we used our own dataset,
collected from Kaggle and online newspapers. The techniques that follow cover summarizing each
Bengali text. The dataset contains two attributes: one is a usable summary and the other is the
description. The dataset comprises a total of 21,969 records along with their related summaries.
The foundation for building a deep learning model is text preprocessing: during this stage the text
is split, Bengali abbreviations are expanded, and stop words are removed. After preprocessing, we
counted the vocabulary across the entire dataset. Deep learning models rely heavily on word
embedding; in the required vocabulary file, W2V provides a quantitative value for each word.
Pre-trained W2V files are required for the Bengali text that is now online.
Relying on the attention model, we constructed a sequence-to-sequence model. This model uses
a bi-directional LSTM cell with an encoder and decoder. Word vectors are the encoder's input,
and word vectors relating to those words are the decoder's outputs. Special tokens, such as PAD,
UNK, and EOS, are required to frame the sequence. More than five hours were spent training the
model, after which the machine itself produced meaningful summaries.
We also classify the dataset's news, categorizing articles written in the Bangla language into
predefined categories such as economics, sports, entertainment, international, and science and
technology.
2.4 Challenges
There is no organized dataset in Bengali: the data exist, yet they are dispersed and disorganized.
The biggest obstacle for this research, therefore, is data collection. A suitable dataset may have
been used before; the newspaper dataset is one such example, but it cannot be reused in other
research projects, so this study needs a brand-new dataset. The removal of stop words from the
text is also a problem: English has a built-in library that can be used to eliminate stop words, but
there is none for Bengali, making this a significant difficulty.
Generating the summary is another challenging task after gathering the dataset. Furthermore,
dealing with Bengali text is a constant challenge. To construct the text for the model's input during
the processing stage, some raw coding is necessary; for example, each punctuation mark requires
its Unicode value in order to be removed from the text with raw code. Training on the large dataset
is also time-consuming, because it takes much time and more memory space. Our dataset's texts
are long and the sequences are also long, so producing the summary is challenging and the
processing time is huge.
RESEARCH METHODOLOGY
3.1 Introduction
We describe our entire study process in this part. Every research project is different from the
perspective of the methods used to solve it, and the methodology includes every strategy used in
the project. This methodology section includes a brief summary of each component as well as a
discussion of the applied models.
Providing readers with a thorough description of the process is vital to improving efficiency.
Understanding the work is aided by mathematical equations, a graphical representation of the
model, and a narrative. For continued research, a clear justification of the technique is necessary;
the entire piece forms a framework. Our methodology section includes a detailed discussion of
all the key phases, and the main section's subheadings help readers understand the model's
overview and intended use.
The web is gigantic and constantly refreshed. The number of Bangla news websites has grown
rapidly in the computer age, and each one has its own unique design and framework for
organizing news. This variety of organization and grouping cannot always meet the needs of
every user, and it is a difficult task to remove this variability and classify news items according
to user expectations. In this thesis, we give a technique for helping users find reports that are
relevant to a specific category. We use Naive Bayes, SVM, Random Forest, Gradient Boosting,
AdaBoost, and KNN classifiers to classify the items in Bangla news datasets, as sketched below.
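A minimal scikit-learn sketch of this six-way comparison follows. The TF-IDF features, the default hyperparameters, and the load_news_dataset loader are our assumptions; the report does not specify the exact implementation.

```python
# Sketch: compare the six classifiers named above on TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              AdaBoostClassifier)
from sklearn.neighbors import KNeighborsClassifier

texts, labels = load_news_dataset()  # hypothetical loader for the Bangla news data

# TF-IDF turns each article into a weighted bag-of-words vector.
X = TfidfVectorizer(max_features=20000).fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "KNN": KNeighborsClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))  # accuracy on the held-out 20%
```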
The title of our research topic is "Bangla news classification and text summarization based on
Multi-Type Text". It is a significant field of research in Bengali NLP. This research study includes
a brief explanation of the theoretical and conceptual approach used to create an abstractive text
summary in Bengali. For a deep learning model, a strong PC with a GPU and other tools is
important.
NLTK
For Bangla news classification and text summarization, the data collection procedure is very hard.
There is no existing dataset of the kind we would like to use, so for our work we collected our data
from Kaggle and other online newspapers and then created the dataset we needed.
Data preprocessing is an effective way to clean our own dataset, because without a good dataset
we cannot achieve the desired output. Our dataset contains different kinds of news data, such as
sports news, economic and financial news, political news, and news of other content. Obtaining
Bengali information has several drawbacks, such as the way Bengali content is organized. Our
dataset includes text content that has been sorted and summarized.
Before models are built, processing the data is a significant task, and preprocessing requires a
variety of steps. Data preprocessing for Bangla is really challenging. The initial step is to eliminate
extraneous words and spaces. The abbreviations we expand are listed here:
ড. ডক্টর
বি. বিস্টার
ডা. ডাক্তার
ইঞ্জি: ইঞ্জিবিযার
ররঞ্জি: ররঞ্জিশেেি
িু. িুহাম্মদ
রিা: রিাহাম্মদ
ররা: ররাপাইশ ার
বহবি. বহসািবিজ্ঞাি
১লা: পযলা
২রা: রদাসরা
৪ঠা: র ৌঠা
১৬ই. রষাশলাই
২১শে. একুশে
We have deleted less important words and the Bengali numbers that we do not need. Following
all these procedures, the text has been cleaned, as the sketch below illustrates.
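A minimal sketch of the cleaning steps described above follows. The abbreviation map is truncated to one legible entry from the list, and the stop-word set is a placeholder of ours; the real pipeline uses far larger versions of both.

```python
# Sketch: Bangla text cleaning (abbreviation expansion, punctuation/digit removal).
import re

ABBREVIATIONS = {"ড.": "ডক্টর"}      # expanded from the table above; truncated here
STOP_WORDS = {"এবং", "ও", "এই"}      # placeholder set; Bangla has no built-in list
REMOVE_CLASS = "[" + re.escape(",।!?;:'\"()[]") + "০১২৩৪৫৬৭৮৯A-Za-z]"

def clean_text(text):
    # Expand abbreviations before any token-level filtering.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Drop punctuation, Bengali digits, and stray English letters.
    text = re.sub(REMOVE_CLASS, " ", text)
    # Remove stop words and collapse extra whitespace.
    return " ".join(w for w in text.split() if w not in STOP_WORDS)
```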
Because we use a huge dataset, there are a large number of words. We have built a large vocabulary
so that every word is connected to another, and a short summary is generated from a long text.
The model owns a lexical set, and a vocabulary set is necessary to draw comparisons between the
description and the summary, which is why we must count the vocabulary. Considering the word
count, "রিবস" has been used 809 times. The pre-trained "bn w2v model" has been used as the
vector file.
Text pre-processing is a must for better accuracy; after processing, the text is much cleaner and
more usable. In the pre-processing steps we removed the stop words, extra spaces, English letters,
punctuation, abbreviations, Bengali digits, etc. Some text samples of pre-processing are listed in
Table 3 below.
ইংল্যান্ড টেস্ট দলল্ ব্যােসম্যালের সংকে েতু ে েয়। ইংল্যান্ড টেস্ট দলল্ ব্যােসম্যালের সংকে েতু ে েয়
ভারলতর বব্পলে ল্র্ডস টেলস্ট হালরর পর েতু ে কলর ভারলতর বব্পলে ল্র্ডস টেলস্ট হালরর পর েতু ে
এই আলল্াচো সাম্লে আসলে। বব্লেষ কলর কলর এই আলল্াচো সাম্লে আসলে বব্লেষ কলর
ইংল্যালন্ডর দুই ওলপোর রবর ব্ােসড ও র্ম্ বসব্বল্লক ইংল্যালন্ডর দুই ওলপোর রবর ব্ােসড ও র্ম্ বসব্বল্লক
বেলয় হলে তু ম্ুল্ সম্ালল্াচো। রাে করা টেে ভুলল্ই বেলয় হলে তু ম্ুল্ সম্ালল্াচো রাে করা টেে ভুলল্ই
টেলেে দুজে।দুই ওলপোলরর োো ব্যর্তায়
ড ব্যাটেং টেলেে দুজে দুই ওলপোলরর োো ব্যর্তায়
ড ব্যাটেং
অর্ডালরর একম্াত্র ফলম্ ডর্াকা ব্যােসম্যাে টজা রুে অর্ডালরর একম্াত্র ফলম্ ডর্াকা ব্যােসম্যাে টজা রুে
আলেে চালপ। ইংল্যান্ড দলল্র এই ব্যাটেং সম্সযার আলেে চালপ ইংল্যান্ড দলল্র এই ব্যাটেং সম্সযার
সম্াধাে কী হলত পালর, টসো জাোলল্ে সম্াধাে কী হলত পালর টসো জাোলল্ে পাবকস্তালের
পাবকস্তালের সালব্ক উইলকেবকপার-ব্যােসম্যাে সালব্ক উইলকেবকপার ব্যােসম্যাে কাম্রাে
কাম্রাে আকম্ল্। বতবে তরুণ ব্যােসম্যাে জযাক আকম্ল্ বতবে তরুণ ব্যােসম্যাে জযাক ক্রবল্লক
ক্রবল্লক একাদলে টদখলত চাে।বেলজর ইউটেউব্ . . একাদলে টদখলত চাে বেলজর ইউটেউব্ চযালেলল্ .
We have 21,969 records, but in the sample dataset we use 1,072 of them. All the data are separated
into two fields: Description and Summary.
Description Summary
আফোবেস্তালের েম্তা চলল্ টেলে তালল্ব্ােলদর হালত। েম্তার আফোবেস্তালের েম্তা চলল্ টেলে
পাল্াব্দলল্র পর আফোেলদর জীব্ে অবেশ্চয়তায়। টদেটের তালল্ব্ােলদর হালত েম্তার
বব্বভন্ন টখল্ার কী হলব্, প্রশ্ন উলেলে টসটে বেলয়ও। রেীদ খাে তাাঁর পাল্াব্দলল্র পর আফোেলদর
পবরব্ার বেলয় উৎকণ্ঠা জাবেলয়লেে। তালল্ব্ােলদর পে টর্লক জীব্ে অবেশ্চয়তায় টদেটের বব্বভন্ন
অব্েয আলেই আশ্বস্ত করা হলয়বেল্, তালদর েম্তা দখল্ টখল্ার কী হলব্ প্রশ্ন উলেলে টসটে
ক্রক্রলকলের ওপর প্রভাব্ টফল্লব্ ো’ ব্রং তারাই টদেটেলত ক্রক্রলকে বেলয়ও রেীদ খাে তাাঁর পবরব্ার বেলয়
এলেলে ম্লে কবরলয় ব্ল্া হলয়বেল্, ব্রং উন্নবতই হলব্ উৎকণ্ঠা জাবেলয়লেে
ক্রক্রলকলের।আপাতত টদেটের ক্রক্রলকলের ওপর টতম্ে প্রভাব্
পলেবে সরাসবর। েতু ে ব্যাটেং টকাচ বহলসলব্ সালব্ক শ্রীল্ঙ্কা
ওলপোর আবভস্কা গুণাব্ধলেলক
ড ব্যাটেং টকাচ বহলসলব্ বেলয়ােও
বদলয়লে আফোবেস্তাে ক্রক্রলকে টব্ার্ড। আজ কাব্ুলল্ শুরু হলয়লে
জাতীয় দলল্র অেুেীল্েও।েুইোলর গুোব্ধলের
ড বেলয়ালের ট াষণা
আেুষ্ঠাবেকভালব্ টদওয়ার পর অব্েয পলর ম্ুলে টফল্া হলয়লে
েুইেো। তলব্ ক্রক্রকব্াজ ব্ল্লে, পাবকস্তাে বসবরলজর জেয বেলয়াে
টদওয়া হলয়লে গুোব্ধলেলক।ড
The dataset's summary and description inputs are both independent. Commonly, a result (summary)
is shorter than a description (text). Let W be the exact word count of the dataset's input sequence,
the description (text), let the vocabulary size be V, and let the input sequence be x1, x2, ..., xW.
The output sequence y1, y2, ..., yS, where S < W, is generated from the input sequence. This means
the summary sequence is never significantly longer than the text report; each element of the
sequence is a word.
The importance of a word does not rely only on frequency but also on word similarity. Therefore,
we have to compute the entire vocabulary from the filtered description (text) and summary.
Following the vocabulary count, we look up word occurrences; for example, the sampled word
"রিবস" occurred 809 times. One significant and usable pre-trained model was discovered and used
for our research purposes: to improve our model, we used a pre-trained Bengali W2V model
downloaded from the internet. There are various pre-trained word-embedding models in other
languages, but the Bengali language only has a small number of word-embedding files, most of
which are inappropriate for research. We set the vocabulary size by keeping the words that occur
25 or more times in our dataset, as sketched below. The model was otherwise trained from scratch
rather than applying a pre-trained word-embedding model.
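A sketch of the vocabulary count and frequency threshold follows, assuming descriptions and summaries are lists of cleaned Bangla strings; the variable names are ours.

```python
# Sketch: build the vocabulary, keeping only words seen 25 or more times.
from collections import Counter

counts = Counter()
for text in descriptions + summaries:   # count over both fields of the dataset
    counts.update(text.split())

MIN_COUNT = 25
word_to_id = {"<PAD>": 0, "<UNK>": 1, "<GO>": 2, "<EOS>": 3}
for word, freq in counts.items():
    if freq >= MIN_COUNT:               # rarer words will map to <UNK>
        word_to_id[word] = len(word_to_id)

id_to_word = {i: w for w, i in word_to_id.items()}
```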
The RNN Encoder and Decoder are a combination of two Recurrent Neural Networks that work
as an encoder-decoder pair. In the encoding phase, the encoder processes the input sequence and
generates the encoder state, which summarizes the information in the input sequence. The decoder
then generates the output sequence by using the encoder states.
The encoder is fed one token of a news text at a time. Each word first goes through an embedding
layer that converts the word into a distributed representation. This distributed representation is
then combined, using a multi-layer neural network, with the hidden state produced after feeding
in the previous word (or all 0's for the first word of the text). Although GRU has an advantage in
training time, our model uses LSTM rather than GRU because LSTM is easier to tune and has
stronger theoretical guarantees.
Decoder
Based on the sequence-to-sequence paradigm, our design divides the decoder into two distinct
modes: a copy mode and a generation mode. The words in our input sequence have clear
definitions when the decoder predicts the word sequence, with "I" representing the input sequence,
"H" the hidden state, and "V" the input vector. The hidden states are computed using this formula:
h_t = f(h_{t-1}, x_t) .................... (2)
and
c = q({h_1, ..., h_{T_x}}) .................... (3)
where f and q are non-linear functions and c is the context vector. Specifically, by using the input
sequence X, we may determine the probabilistic translations for the decoder. For the bi-directional
encoder, let h_t^forward be the forward hidden state and h_t^backward the backward hidden state;
the combined hidden state is thus
h_t = [h_t^forward ; h_t^backward] .................... (8)
[Figure: encoder-decoder model architecture]
Seq2seq models have mainly been applied to several natural language processing (NLP)
applications, such as headline generation, machine translation, text summarization, and speech
recognition. A sequence-to-sequence model is created in which a bi-directional LSTM cell is used
along with an encoder and a decoder. LSTM is the most suitable and well-known RNN technique
in deep learning, and it solves text-related problems effectively. The RNN is made up of LSTM
cells, which function similarly to short-term memory, and these cells are used in both the encoder
and the decoder. The relevant text sequence is used to create the result after the encoder has passed
over the input texts. A token is used to mark the beginning and end of each sequence. A compact
sketch of this architecture follows.
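The Keras sketch below reconstructs the bi-directional LSTM encoder-decoder described here. The vocabulary and embedding sizes are assumptions; the 256-unit RNN follows the hyperparameters reported in Chapter 4, and the exact layer wiring is our reconstruction rather than the report's code.

```python
# Sketch: bi-directional LSTM encoder with an LSTM decoder for summarization.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB, EMB, UNITS = 20000, 300, 256  # VOCAB and EMB are assumed values

# Encoder: embeds the description and reads it in both directions.
enc_in = layers.Input(shape=(None,), name="description")
enc_emb = layers.Embedding(VOCAB, EMB)(enc_in)
enc_out, fh, fc, bh, bc = layers.Bidirectional(
    layers.LSTM(UNITS, return_state=True))(enc_emb)
state_h = layers.Concatenate()([fh, bh])  # h = [forward h ; backward h], eq. (8)
state_c = layers.Concatenate()([fc, bc])

# Decoder: starts from the encoder states and emits the summary tokens.
dec_in = layers.Input(shape=(None,), name="summary_shifted")  # begins with <GO>
dec_emb = layers.Embedding(VOCAB, EMB)(dec_in)
dec_out, _, _ = layers.LSTM(
    2 * UNITS, return_sequences=True, return_state=True)(
        dec_emb, initial_state=[state_h, state_c])
probs = layers.Dense(VOCAB, activation="softmax")(dec_out)

model = tf.keras.Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```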
In this paper, we have used a few special tokens such as <PAD>, <EOS>, <GO>, and <UNK>.
These special tokens help in handling the sequences in the encoder and decoder. The <GO> token
starts the output sequence in the decoder. In the data preprocessing stage, we added <UNK> and
replaced out-of-vocabulary words with it; thus the special token <UNK> means an unknown token,
and whenever an unknown word appears in a sequence it is replaced with <UNK> in the text.
Before training, we added <GO> and <EOS> to the data containing the words to be used for
sequence transduction.
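To make the token handling concrete, here is a minimal sketch of how a summary could be framed with these tokens; the function name, the max_len value, and the word_to_id mapping (built earlier) are illustrative assumptions, not code from the report.

```python
# Sketch: apply <GO>/<EOS>/<UNK>/<PAD> to one summary sequence.
def encode_summary(summary, word_to_id, max_len=50):
    # Replace out-of-vocabulary words with <UNK>, then frame the sequence
    # with <GO> ... <EOS> so the decoder knows where to start and stop.
    ids = [word_to_id.get(w, word_to_id["<UNK>"]) for w in summary.split()]
    ids = [word_to_id["<GO>"]] + ids[: max_len - 2] + [word_to_id["<EOS>"]]
    # Pad on the right so every sequence in a batch has the same length.
    ids += [word_to_id["<PAD>"]] * (max_len - len(ids))
    return ids
```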
We have chosen various layers for embeddings with the default pre-trained BERT model. The
N x E matrix required for clustering is produced by using the [CLS] layer of BERT, where N is the
number of sentences and E is the embedding dimension. However, the output of the [CLS] layer
does not always yield the best embedding representation for sentences. The basic BERT
implementation for the newspaper summarization service makes use of Hugging Face's
pytorch-pretrained-BERT library.
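A sketch of extracting the N x E matrix of [CLS] embeddings with the current Hugging Face library (the successor of pytorch-pretrained-BERT) follows; the multilingual checkpoint is our assumption, as any BERT model covering Bangla would serve.

```python
# Sketch: [CLS] sentence embeddings for clustering.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def cls_embeddings(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # Token 0 of the last hidden layer is the [CLS] vector: an N x E matrix.
    return out.last_hidden_state[:, 0, :]
```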
3.6.f GPT-2
The goal of text summarization is to produce a clear, succinct summary while preserving the core
information and message [36]. The most important information in a document must be included
in a high-quality summary, which should also be shorter than the original. Although many
different strategies have been put forward for the automatic summary generation task [37], the
NLP community still finds it to be a difficult task [38]. Based on its exceptional capacity for
language representation, GPT-2 has achieved amazing performance. The OpenAI GPT-2 model
uses only the transformer's decoder stack.
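The sketch below illustrates GPT-2's decoder-only summarization via the "TL;DR:" prompt from the original GPT-2 paper. Note that the public gpt2 checkpoint is English-only, so this shows the mechanism rather than a working Bangla summarizer.

```python
# Sketch: zero-shot summarization with GPT-2 and a "TL;DR:" prompt.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = "..."  # document to summarize (placeholder)
inputs = tokenizer(article + "\nTL;DR:", return_tensors="pt")
ids = model.generate(**inputs, max_new_tokens=60, num_beams=5,
                     no_repeat_ngram_size=3)
# Decode only the tokens generated after the prompt.
summary = tokenizer.decode(ids[0][inputs["input_ids"].shape[1]:],
                           skip_special_tokens=True)
```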
3.6.g XLNET
Text summarization demands a more global grasp of the words, sentences, and passages of the
entire article than other downstream NLP tasks. In this paper, we present a model under the generic
sequence-to-sequence pattern to utilize the full information of entire articles, with a layered
Transformer decoder and a recurrent XLNet encoder. Every part can acquire more contextual
information than is contained in the current sequence alone, because the entire article is encoded
piece by piece until its last part. This produces both the representation and the memory, whose
content is information from the previous time step. To achieve this, we first divide the entire
article into a number of identically sized pieces according to the maximum length; the GPU's
capabilities, particularly its memory, place a limit on this length.
We fully utilized XLNet's memory mechanism during the training phase, feeding the segments
into the model one at a time to create a hidden state and memory. The model's settings are as
follows: label smoothing with a factor of 0.015 and dropout with a rate of 0.01 are both employed.
There are 12 heads, 768 hidden units, and a feedforward layer in each decoder. Beam search with
size 5 is employed for generation. We did not take any measures to prevent repetition or
out-of-vocabulary problems.
Naive Bayes
D. Lewis initially suggested and used the Naive Bayes (NB) probabilistic classification technique
for the task of text categorization [7]. In a probabilistic system, it is grounded in Bayes' theorem:
to estimate the probability of categories given a text, the principal idea is to use the joint
probabilities of words and classes. The model makes the naive assumption that words are
independent. Since the Naive Bayes classifier does not use word combinations as predictors, its
computation is far more efficient than the exponential complexity of non-naive Bayes procedures,
thanks to the simplicity of the word-independence assumption. When we use the Naive Bayes
classifier on the text classification problem, we use the following condition:
P(class | doc) = P(class) * P(doc | class) / P(doc) .................... (1)
where P(class | doc) is the probability that a given document D belongs to a given class C. P(doc)
is the prior probability of a document; since P(doc) is a constant divisor in every computation, we
can ignore it. P(class) is the prior probability of a class (or category), which we can compute as
the number of documents in the category divided by the number of documents across all
categories. P(doc | class) is the likelihood of a document given the class; since documents can be
represented as sets of words, it can be written as:
P(doc | class) = Π_i P(word_i | class) .................... (2)
Joachims applied the support vector machine, a distinctive classification technique, to sorting
texts. Based on the structural risk minimization principle, it is a strong, supervised learning
method.
Random Forest
This technique generates a large number of decision trees that work together; decision trees are
the foundation of the method. The pre-processing stage defines the nodes of the decision trees that
make up the random forest [7]. After many trees have been grown, the best feature is chosen from
a randomly selected subset of features [8], [22]. Each decision tree is produced using the decision
tree technique, and together these trees make up the random forest, which is used to classify new
items from an input vector. Every decision tree that is built is used for classification: each tree
casts a vote for a class, and the random forest picks the classification that has the most votes
among all the trees in the forest.
Step 3: Choose the number NTree of trees you want to build, then repeat Steps 1 and 2.
The primary objective of this technique (KNN) is to keep comparable items near each other (see
Fig. 3). The model uses a dataset's feature vectors and class labels [9]. KNN keeps a database of
all instances and uses a similarity metric to classify new examples. To represent text in K-nearest
neighbors, a spatial vector with the notation S = S(T1, W1; T2, W2; ...; Tn, Wn) is used. Using the
training texts, whose classes are already determined, the texts with the highest similarity to a new
text are chosen, and the class is then decided by the K neighbors. First, select the neighbor number
K. Then consider the K nearest neighbors of the new data point in terms of Euclidean distance.
Count the number of data points among the K nearest neighbors belonging to each class. After
counting, assign the new data point to the class with the most neighbors.
Gradient Boosting
AdaBoost Algorithm
The "helping" group of calculations incorporates "adaptive boosting" [1]. This sort of
student zeros in on additional, mistakenly grouped examples during preparation, changes
the example conveyance, and rehashes this cycle until the feeble classifier has gone through
a specific number of preparation stages, so all in all, learning is finished. Finish the circle,
then. AdaBoost has been altered to such an extent that all preparation tests start with a
4.1 Introduction
A hard problem in the field of NLP is the summarization of long documents: people find it
considerably challenging to pull a thorough summary out of a text themselves, while the machine
does its best to produce one according to its capabilities. The computer needs to be trained to
learn, using the data model, after pre-processing. The model has a backend engine for each
training run; we used TensorFlow to complete the task in this experiment. The initial values are
nearly fixed; examples include the epoch count, keep probability, RNN size, batch size, learning
rate, number of layers, etc. The amount of time needed to train the data has been reduced. The
optimizer used here for model adjustment is "Adam". A higher-configuration PC is needed to
make data training easier; finally, we trained the model via Google Colab, which works
considerably faster and cuts down on time.
Parameter Value
Batch size 64
Number of layers 3
We all know very well that it is not possible for a model to give 100%, but the machine produces
output that is almost real. After creating the model function, we have to train our model; we fit
the model with the present and next words. We used 80 percent of the data for training and 20
percent for testing. We have 21,969 records, but we imported 1,072 into the Google notebook, so
933 records were used for training and another 140 for testing. We have used the TensorFlow
sequence-to-sequence model. When training stops, the machine is able to create its own summary.
We take epochs=100, batch size=64, RNN size=256, learning rate=0.008, and keep
probability=0.75, and we used the Adam optimizer, which adapts the learning rate of each
parameter; for faster convergence a vanilla gradient descent optimizer can be used. Training the
model for around 12 hours gives better accuracy, with a loss of 0.031. Our model accuracy is
97.69%. A file called "model.ckpt" has been saved from the model so that the output can be
checked. We then created a TensorFlow session in order to reload the previously saved graph.
After that, the description (text) and summarized data frames were defined at random to ensure
that the summary was accurate. The configuration is sketched below.
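Assuming the Keras-style model from Chapter 3 and hypothetical train_*/test_* arrays for the 933/140 split, the reported hyperparameters would look like this:

```python
# Sketch: training configuration reported above (variable names are assumed).
import tensorflow as tf

EPOCHS, BATCH_SIZE, RNN_SIZE = 100, 64, 256
LEARNING_RATE, KEEP_PROB = 0.008, 0.75   # keep_prob = 1 - dropout rate

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# 933 training records; 140 held out for testing.
history = model.fit([train_desc, train_summary_in], train_summary_out,
                    batch_size=BATCH_SIZE, epochs=EPOCHS)
model.evaluate([test_desc, test_summary_in], test_summary_out)
model.save_weights("model.ckpt")  # checkpoint so output can be checked later
```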
Original Description: তখি িাত্রই পৃথ্বী ের তাণ্ডি রেশক রক্ষা রপশযশে শ্রীলঙ্কা। ওশপি করশত
রিশি ৯ া ার রিশর িাত্র ২৪ িশল ৪৩ কশর দশলর রাি পাাঁ ওভাশরই ৫৭-
হাাঁপ রেশে রিাঁশ বেল শ্রীলঙ্কা। হযশতা রভশিবেল, বকেুক্ষশণর িিয তাণ্ডশির
হাত রেশক রক্ষা পাওযা রিল।অিি া যবদ রভশিও োশকি, কী ভুলই িা
িাবসশত
ন যাাঁর ওযািশড অবভশষক হশযশে িতকাল। বকন্তু বড বসলভার
Input text তখি িাত্রই পৃথ্বী ের তাণ্ডি রেশক রক্ষা রপশযশে শ্রীলঙ্কা ওশপি করশত
(Description): রিশি া ার রিশর িাত্র িশল কশর দশলর রাি পাাঁ ওভাশরই রত বিশয
বিশযবেশলি তরুণ এই ওশপিার <UNK> ফািাশদার
ন হাশত কযা িাবিশয
ধিিয বড বসলভা যখি েশক রেবসিংরুশি রফরাশলি হযশতা এক ু হাাঁপ
রেশে রিাঁশ বেল শ্রীলঙ্কা হযশতা রভশিবেল বকেুক্ষশণর িিয তাণ্ডশির হাত
রেশক রক্ষা পাওযা রিল অিি া যবদ রভশিও োশকি কী ভুলই িা
িযা সিযাি এখি িািা রিল ওযািশডশত বিশির রেি িশল েক্কা িারার এই
Response summary: শুধু রেি িশল েক্কা িারাই িয তি ভারতীয ঞ্জিশক ার বহশসশি ওযািশড
অবভশষশক হাফশসঞ্চুবর করার কীবতনও িশেশেি ইোি বকষাি িতকাল
Accuracy 0.96
We applied Multinomial Naive Bayes to our dataset. After implementing this algorithm, we
categorized the outcomes into four parameters. It is seen from the table that an accuracy of 0.96
is acquired. For example, the precision acquired for the economics class is 1.0, recall is 1.0, and
F1-score is 1.0. Likewise, for entertainment, the values obtained are 1.0 precision, 1.0 recall, and
0.95 F1-score. For international, a precision of 0.94 is acquired, with recall 1.0 and F1-score 0.97.
For the sports class, the precision acquired is 0.90, recall 1.0, and F1-score 0.95. For the science
and technology class, a precision of 1.0 is acquired, with recall 0.75 and F1-score 0.86.
Accuracy 0.96
The accuracy acquired by the RF classifier is 0.96. For example, the precision acquired for the
economics class is 1.0, recall is 1.0, and F1-score is 1.0. Likewise, for entertainment, the values
Accuracy 0.94
The accuracy acquired by the SVM classifier is 0.94. For example, the precision acquired for the
economics class is 1.0, recall is 1.0, and F1-score is 1.0. Likewise, for entertainment, the values
obtained are 1.0 precision, 0.91 recall, and 0.95 F1-score. For international, a precision of 0.89 is
acquired, with recall 1.0 and F1-score 0.94. For the sports class, the precision acquired is 0.90,
recall 1.0, and F1-score 0.95. For the science and technology class, a precision of 1.0 is acquired,
with recall 0.77 and F1-score 0.86.
Accuracy 0.96
Accuracy 0.48
The accuracy acquired by the Adaboost classifier is 0.48. For example, the precision acquired for
the economics class is 0.18, recall is 0.67, and F1-score is 0.29. Likewise, for entertainment, the
values obtained are 1.0 precision, 0.64 recall, and 0.78 F1-score. For international, a precision of
0.67 is acquired, with recall 0.25 and F1-score 0.36. For the sports class, the precision acquired is
0.90, recall 1.0, and F1-score 0.95. For the science and technology class, a precision of 0.1 is
acquired, with recall 0.1 and F1-score 0.1.
Accuracy 0.48
The accuracy acquired by the KNN classifier is 0.48. For example, the precision acquired for the
economics class is 0.18, recall is 0.67, and F1-score is 0.29. Likewise, for entertainment, the
values obtained are 1.0 precision, 0.64 recall, and 0.78 F1-score. For international, a precision of
0.67 is acquired, with recall 0.25 and F1-score 0.36. For the sports class, the precision acquired is
0.90, recall 1.0, and F1-score 0.95. For the science and technology class, a precision of 0.1 is
acquired, with recall 0.1 and F1-score 0.1.
Algorithms Accuracy
Adaboost 0.476
SVM 0.88
KNN 0.96
After analyzing all the algorithms, we have come to know which algorithm is best for our dataset.
We applied six different algorithms: Naive Bayes, SVM, Random Forest, Gradient Boosting,
AdaBoost, and KNN. After comparing these algorithms, we found that Gradient Boosting and
Naive Bayes delivered the best results, with an accuracy of 0.96. The per-class figures above can
be produced as sketched below.
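Assuming clf is one of the fitted classifiers from the earlier comparison sketch, scikit-learn's classification_report yields per-class precision, recall, and F1 tables of the kind reported above:

```python
# Sketch: reproduce the per-class metric tables for one fitted classifier.
from sklearn.metrics import classification_report, accuracy_score

y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision / recall / F1 per class
```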
For text summarization, we have built a model for Bengali. Our model produces good output for
various scenarios. We designed it so that we may reduce the loss function; the learning model is
used for reducing the errors, and loss-function reduction is very important for sequence data.
During data training we added the loss function, and when training ended we computed the loss.
Early in training our model produces a high loss; after some epochs the losses decrease at a rapid
rate. The learning rate is 0.008. As stated, we divided our data into training and test sets, giving
933 records for training and 140 for testing.
4.4 Moral
Throughout this chapter we have discussed how we built our model, how it generates the
summarization, and the desired outputs. We have tried to explain everything in detail.
Nowadays, everything has become modern and smart, and time is a crucial element; depending
on technology, our lives become faster. In daily life, people are fond of reading newspapers, news
articles, short stories, online articles, etc. When they get enough time, they usually spend it
reading those articles or books, but they do not always have enough time to read a full article,
news item, or story. Then they wish they had a short summary of the topic. This text
summarization will ease their reading: they get the short summary instantly, which is amazing to
them. All classes of people, such as businessmen, students, news readers, and journalists, get the
benefits of this work.
A student is always searching for knowledge of new things, which is why students spend most of
their time reading: their daily readings as well as novels, articles, storybooks, and newspapers.
Sometimes they try to get the main theme within a short time. Suppose a student has to go out for
an exam or some emergency but does not have time to read the full passage, article, or text; in
that case, text summarization helps in an efficient way with a meaningful short summary. As they
get the desired summary within a short time, they benefit in many ways. Sometimes, while
reading an online article or other text, they feel bored and anxious. On Facebook, we sometimes
see long articles on important matters but avoid them due to short time. Now students do not need
to skip anything just because it is long: they can easily summarize it and get the result.
The collection and use of data raise numerous ethical and moral concerns about its use, sharing,
and protection. It also raises issues of privacy ethics, particularly the appropriateness of publicly
sharing some information on these sites. In this research paper, we look at the moral, social, and
ethical issues that arise from the sharing and use of this data. People's privacy has been
compromised as a result of the unethical use of text, which also has an impact on information
security and physical security.
This project is basically based on natural language processing. In our project, we have made a
model using deep learning for Bangla abstractive text summarization, and this model has given
tremendous output for text summarization. From beginning to end we have tried our heart and
soul to complete this project, and it took 7 months. We have carried out the project in several
steps, and the project's complete overview and step-by-step instructions are provided below.
[Table: project steps]
5.2 Conclusion
In recent years we have received large amounts of data, and we have to understand that data
quickly; that is why we need text summarization, to condense a large amount of data so it can be
understood in a short time. Both abstractive and extractive text summarization can be used.
Abstractive summarization can produce a more coherent summary, but it is more difficult to
achieve due to the complexity of natural language processing and the need to understand the
meaning of the text. For better understanding, the summarization system underscores the needs
of the developers designing it. As discussed earlier, extractive summarization selects and
combines important sentences or phrases from the original text, which makes it accurate on
massive amounts of data, while abstractive summarization interprets the data and generates new
sentences and phrases that capture the meaning of the original text.
With the help of the LSTM and sequence-to-sequence algorithms, we have presented a
summarization approach that saves time and makes material or news easy to understand, using
Natural Language Processing (NLP) and machine learning to compress a document into an
understandable short form. While text summarization is widely used for English, it remains
difficult for Bangla because of the language's large alphabet and many grammatical rules, and our
work is a step toward closing that gap.
The model has a couple of limitations as well. Since every study project evolves continuously,
we have a methodology for how we will improve from now into the foreseeable future. By using
this procedure, we were able to further improve precision while also becoming more organized.
We would like to add more models to improve the quality of the output and make this work more
transparent. Although the amount of data in our evaluation is limited, we want to investigate with
much larger data in the future, because the model is capable of dealing with a great deal of data
and giving accurate summaries. After we finish our study, we want to create a web-based,
convenient application that uses artificial intelligence; this application will give Bengali text
summaries. The application would also be able to act as an extra layer of safety by using natural
language processing (NLP) methods to identify potentially malicious activity or content in the
Bengali text that is input. As a part of the research, we will use supervised and unsupervised
learning models to build and assess our application's performance. This exploration will evaluate
the performance of both supervised and unsupervised machine learning models, helping us
understand which one is best for our Bengali text summaries.
[1] Abujar, S., Hasan, M., Shahin, M.S.I. and Hossain, S.A., 2017, July. A heuristic
approach of text summarization for Bengali documentation. In 2017 8th International
Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp.
1-8). IEEE.
[2] Ferreira, Rafael, et al. "Assessing sentence scoring techniques for extractive text
summarization." Expert Systems with Applications 40 (2013): 5755-5764.
[3] Nenkova, Ani, and Kathleen McKeown. "A survey of text summarization techniques."
In Mining text data, pp. 43-76. Springer, Boston, MA, 2012.
[4] Shi, Tian, Yaser Keneshloo, Naren Ramakrishnan, and Chandan K. Reddy. "Neural
abstractive text summarization with sequence-to-sequence models." ACM Transactions on
Data Science 2, no. 1 (2021): 1-37.
[5] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521, no.
7553 (2015): 436-444.
[6] Yousefi-Azar, Mahmood, and Len Hamey. "Text summarization using unsupervised
deep learning." Expert Systems with Applications 68 (2017): 93-105.
[7] Abujar, S., Masum, A.K.M., Chowdhury, S.M.H., Hasan, M. and Hossain, S.A., 2019,
July. Bengali text generation using bi-directional RNN. In 2019 10th International
Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp.
1-5). IEEE.
[8] Song, Shengli, Haitao Huang, and Tongxiao Ruan. "Abstractive text summarization
using LSTM-CNN based deep learning." Multimedia Tools and Applications 78, no. 1
(2019): 857-875.
[9] Tomer, Minakshi, and Manoj Kumar. "Improving text summarization using ensembled
approach based on fuzzy with LSTM." Arabian Journal for Science and Engineering 45,
no. 12 (2020): 10743-10754.
[10] Masum, A.K.M., Abujar, S., Talukder, M.A.I., Rabby, A.S.A. and Hossain, S.A.,
2019, July. Abstractive method of text summarization with sequence to sequence RNNs.
In 2019 10th international conference on computing, communication and networking
technologies (ICCCNT) (pp. 1-5). IEEE.