Text Summarisation and Document Understanding Report
ABSTRACT
TABLE OF CONTENTS
Chapters
1. INTRODUCTION
1.1 Introduction
1.3 Objectives
2. LITERATURE SURVEY
3. SYSTEM DEVELOPMENT
3.1 NLP
4. PERFORMANCE ANALYSIS
4.1 Approaches to Sentence Extraction
4.1.1 Frequency-based approach
4.1.2 Feature-based approach
4.3 Training Dataset
5. CONCLUSION
5.1 Conclusion
6. REFERENCES
Chapter-1
INTRODUCTION
1.1 Introduction
With the growing amount of information, it has become difficult to find concise information. It is therefore important to build a system that can summarize the way a human does. Automatic text summarization, with the help of Natural Language Processing, is a tool that produces summaries of a given document. Text summarization methods are divided into two categories: the extractive and the abstractive approach. The extractive approach selects the most relevant and distinctive sentences, paragraphs and so on to create a shorter version of the original document. The sentences are scored and chosen based on statistical features of the sentences. In the extractive method, we choose a subset of the given phrases or sentences to form the summary. Extractive summarization systems rely on two techniques, extraction and prediction, which involve identifying the particular sentences that are essential to the overall understanding of the document. The other approach, abstractive text summarization, involves generating entirely new statements to capture the meaning of the original document. This approach is more difficult, but it is also the approach used by humans.
Newer approaches, such as machine learning techniques from closely related fields like text mining and information retrieval, have been used to support automatic text summarization.
Besides Fully Automated Summarizers (FAS), there are methods that assist users in producing summaries (MAHS = Machine-Aided Human Summarization), for instance by highlighting candidate passages to be included in the summary, and there are systems that depend on post-processing by a human (HAMS = Human-Aided Machine Summarization).
There are two types of extractive summarization tasks, depending on what the summarization application focuses on. One is generic summarization, which concentrates on obtaining a general summary or abstract of the document (whether records, news stories and so on). The other is query-relevant summarization, sometimes called query-based summarization, which summarizes content specific to the query. Summarization systems can produce both query-relevant text summaries and generic machine-generated summaries depending on what the user needs.
Likewise, summarization methods try to find subsets of objects that contain the information of the whole set. This is also known as the core set. These algorithms model qualities such as coverage, diversity, informativeness and representativeness of the summary. Query-based summarization techniques additionally model the relevance of the summary to the query. Some techniques and algorithms that naturally model summarization problems are TextRank and PageRank, submodular set functions, determinantal point processes, maximal marginal relevance (MMR) and so on.
In the present era, where an enormous amount of information is available on the Web, it is most important to provide an improved mechanism to find information quickly. It is extremely hard for people to manually produce summaries of large text documents, so there is the problem of searching for relevant documents among those available and finding the important information within them. Automatic text summarization is therefore the need of the hour. Text summarization is the process of identifying the most important, meaningful information in a document or set of related documents and compressing it into a shorter version while preserving its meaning.
1.3 Objectives
The objective of the project is to understand the concepts of natural language processing and to create a tool for text summarization. Interest in automatic summarization is growing rapidly because it removes the need for manual effort. The project concentrates on creating a tool which automatically summarizes a document.
1.4 Methodologies
To obtain automatic text summarization, there are two major techniques: abstraction-based text summarization and extraction-based text summarization. Extractive summaries highlight the sentences that are relevant in the input source document. The summary is generated by concatenating the selected sentences in their order of appearance. For every sentence a decision is made whether or not that particular sentence will be included in the summary. For example, search engines typically use extractive summary generation methods to produce summaries of web pages. Many kinds of logical and mathematical formulations have been used to create summaries. Text regions are scored and the ones with the highest scores are taken into consideration. In extraction, only important sentences are selected, which makes this approach easier to implement.
There are three main obstacles in the extractive approach. The first is the ranking problem, which involves ranking the textual units. The second is the selection problem, which involves choosing a subset of the ranked units, and the third is coherence, i.e. knowing how to select units that together form an understandable summary. Many algorithms are used to solve the ranking problem. The other two obstacles, selection and coherence, are addressed to improve diversity, minimize redundancy and pick out the lines that are important. Each sentence is scored and arranged in decreasing order of score. Selecting a subset of sentences that forms a coherent summary is not a trivial problem, and doing it well also reduces redundancy. Once the list is ordered, the first sentence is the most important one and forms the start of the summary. In the next step, the sentence with the highest score is picked from the top half of the list, and the process is repeated until the length limit is reached and a relevant summary is generated.
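As a rough illustration of this extractive pipeline (scoring sentences, ranking them, then emitting the top-ranked ones in their original order), a minimal sketch in Python using NLTK could look like the following. The frequency-based scoring here is an assumed simplification for illustration, not the exact scoring used later in the project.

```python
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download("punkt"); nltk.download("stopwords")  # first run only

def extractive_summary(text, num_sentences=3):
    """Score sentences by normalized word frequency and keep the top ones."""
    sentences = sent_tokenize(text)
    stop_words = set(stopwords.words("english"))

    # Ranking step: word frequencies over the whole document
    words = [w.lower() for w in word_tokenize(text)
             if w.isalnum() and w.lower() not in stop_words]
    freq = Counter(words)

    # Sentence score = sum of its word frequencies, normalized by length
    scores = {}
    for i, sent in enumerate(sentences):
        sent_words = [w.lower() for w in word_tokenize(sent) if w.isalnum()]
        if sent_words:
            scores[i] = sum(freq.get(w, 0) for w in sent_words) / len(sent_words)

    # Selection step: take the best-scoring sentences, keep original order
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:num_sentences])
    return " ".join(sentences[i] for i in top)
```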
Abstraction Based Summarization
People generally produce abstractive summaries. After reading a text, a person understands the topic and writes a short summary in their own way, creating their own sentences without losing any essential information. However, it is difficult for a machine to create abstractive summaries. It can therefore be said that the goal of abstraction-based summarization is to create a summary using natural language processing techniques that generate new sentences which are grammatically correct. Abstractive summary generation is harder than the extractive technique because it needs a semantic understanding of the text to be fed into the natural language system. Sentence fusion, the key problem here, gives rise to inconsistency in the generated summary, as it is not yet a well-developed field.
1.5 Organization
Chapter-2
LITERATURE SURVEY
Chapter-3
SYSTEM DEVELOPMENT
3.1 NLP
NLP is the process of enabling computers to understand and produce human language. Applications of NLP techniques include text extraction, machine translation and voice agents like Alexa and Siri. NLP is one of the fields that has benefited from modern approaches in machine learning, especially from deep learning techniques. The natural language processing work in this project uses the Natural Language Toolkit (NLTK), the main platform for building Python programs that work with human language data. It is easy to use, providing interfaces to more than 40 corpora and lexical resources, together with functions for splitting paragraphs into sentences, splitting sentences into words, reducing words to their original form, tagging, parsing and semantic reasoning, alongside industrial-strength natural language processing libraries and an active discussion forum. NLTK offers a large set of tools and supports the whole natural language processing pipeline: it helps split paragraphs into sentences, split sentences into words, recognize the syntactic roles of those words and mark the main topics, and in doing so it helps the machine work out what matters in the content.
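A short sketch of this NLTK pipeline (sentence splitting, word tokenization and part-of-speech tagging) is shown below; the example text is illustrative only.

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")  # first run only

text = ("Automatic text summarization condenses a document. "
        "It keeps the most important sentences.")

sentences = sent_tokenize(text)                    # split paragraph into sentences
tokens = [word_tokenize(s) for s in sentences]     # split sentences into words
tagged = [nltk.pos_tag(t) for t in tokens]         # syntactic role of each word

print(sentences)
print(tagged[0][:4])   # e.g. [('Automatic', 'JJ'), ('text', 'NN'), ...]
```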
3.2 Lesk Algorithm
WordNet is a semantically organized electronic database of nouns, verbs, adjectives and adverbs. Words with similar meanings are grouped together into synonym sets (synsets). The algorithm works with the words that belong to at least one synset; these are known as WordNet words. Synsets are interlinked by semantic and lexical relations. WordNet links not only word forms but also word senses. It groups together English words and provides short definitions. It is accessible to human users via a web browser and is used in automatic text analysis and artificial intelligence applications. It excludes prepositions and other function words and includes nouns, verbs, adjectives and adverbs, and all synsets are connected by semantic relations.
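NLTK ships a simplified Lesk implementation built on WordNet, which can be used as a minimal illustration of the word sense disambiguation step described in this section. The sentence below is only an example.

```python
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

# nltk.download("punkt"); nltk.download("wordnet")  # first run only

sentence = "I went to the bank to deposit my money"
context = word_tokenize(sentence)

# Simplified Lesk: pick the WordNet sense of "bank" whose gloss
# overlaps most with the surrounding context words.
sense = lesk(context, "bank", pos="n")
if sense is not None:
    print(sense, "-", sense.definition())
```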
Step 3: Summarization. This is the last stage of automatic summarization, in which the final output of the previous stages is assembled and reviewed once all the sentences have been ranked. First, the list of sentences with their weights is selected and arranged in decreasing order of weight. A number of sentences is then picked according to the requested summary ratio. The selected sentences are finally recomposed in the order in which they occur in the input document, relying only on the denotative information contained in the sentences themselves, so the resulting extract is free of conversational language.
The project focuses on extractive summarization, which involves selecting the most important sentences from the
original text rather than generating new sentences (as done in abstractive summarization).
The selected sentences form a concise summary while retaining the main ideas and concepts of the document.
Bidirectional Encoder Representations from Transformers (BERT), a state-of-the-art deep learning model, is used
for sentence embedding.
BERT captures the contextual meaning of words in sentences by analyzing the text bidirectionally, i.e., both left-to-right and right-to-left.
The BERT model encodes sentences into dense vectors (embeddings) that represent their semantic meanings,
enabling the system to identify the most important sentences within a document.
This approach significantly improves the system's ability to capture relationships between sentences, even in long
documents.
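Since the report mentions TensorFlow Hub for obtaining BERT, a minimal sentence-embedding sketch might look like the following. The specific hub handles are an assumption (one of several published English BERT models), not necessarily the exact ones used in the project.

```python
import tensorflow as tf
import tensorflow_hub as hub

# Assumed model handles: a BERT preprocessing model and the matching encoder.
preprocess = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

sentences = tf.constant(["Text summarization condenses documents.",
                         "BERT produces contextual sentence embeddings."])

outputs = encoder(preprocess(sentences))
embeddings = outputs["pooled_output"]   # shape (2, 768): one dense vector per sentence
```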
The system calculates the cosine similarity between pairs of sentences to determine how similar they are.
The cosine similarity score is a measure of the angle between the sentence vectors obtained from BERT
embeddings. Sentences with higher similarity scores are likely to contain similar information.
This step helps in building a similarity matrix that serves as the foundation for ranking the importance of each
sentence.
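The similarity matrix itself can be built with scikit-learn's cosine similarity, as the report's library list suggests. A sketch, assuming `embeddings` is the sentence-vector array from the previous step:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# embeddings: (num_sentences, 768) array of BERT sentence vectors (assumed from above)
sim_matrix = cosine_similarity(np.asarray(embeddings))

# sim_matrix[i, j] is close to 1.0 when sentences i and j carry similar information
np.fill_diagonal(sim_matrix, 0.0)   # ignore self-similarity when ranking later
```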
A graph-based model is used to represent the sentences as nodes and their relationships (similarity scores) as
edges between them.
The PageRank algorithm, originally developed for ranking web pages, is adapted to rank sentences. Sentences that
are highly interconnected with other important sentences in the document are given higher importance scores.
This algorithm ensures that the most relevant sentences are selected for the final summary based on their relative
importance in the context of the entire document.
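One common way to realize this graph-plus-PageRank step is with networkx; the sketch below assumes `sim_matrix` and a Python list `sentences` from the earlier steps, and an illustrative summary length of three sentences.

```python
import networkx as nx

# Nodes are sentences; edge weights are the cosine-similarity scores
graph = nx.from_numpy_array(sim_matrix)
scores = nx.pagerank(graph)              # {sentence_index: importance score}

# Pick the top-k sentences and restore their original order for the summary
k = 3
top = sorted(sorted(scores, key=scores.get, reverse=True)[:k])
summary = " ".join(sentences[i] for i in top)
```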
Before summarization, the system performs extensive preprocessing on the input text:
Tokenization: The document is split into sentences.
Stop-word Removal: Common but uninformative words (like "and," "the") are removed to focus on the core
content.
Lemmatization: Words are reduced to their base forms (e.g., "running" becomes "run"), which helps in
standardizing the data.
Punctuation Removal: Unnecessary punctuation is removed to clean the text.
These preprocessing steps ensure that the summarization model works on a clean and standardized dataset,
improving accuracy.
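A compact sketch of these four preprocessing steps with NLTK is given below; it is illustrative and may differ in detail from the project's exact pipeline.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

def preprocess(document):
    """Tokenize into sentences, then clean each one for the ranking model."""
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    cleaned = []
    for sentence in sent_tokenize(document):                  # 1. tokenization
        sentence = re.sub(r"[^\w\s]", " ", sentence.lower())  # 2. punctuation removal
        words = [lemmatizer.lemmatize(w)                      # 3. lemmatization
                 for w in word_tokenize(sentence)
                 if w not in stop_words]                       # 4. stop-word removal
        cleaned.append(" ".join(words))
    return cleaned
```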
The project includes a flowchart-based visualization of the summarization workflow, helping users understand how
the text is processed, ranked, and summarized.
Key steps in the workflow, including preprocessing, feature extraction, sentence ranking, and final summary
generation, are illustrated in detail to explain the system's functionality.
The system's performance is evaluated using the ROUGE metric (Recall-Oriented Understudy for Gisting
Evaluation), which is a standard evaluation measure for summarization tasks.
ROUGE compares the generated summary to reference summaries (created by humans) and measures the overlap
between them in terms of precision, recall, and F-score.
This metric ensures that the system produces summaries that are not only concise but also close to human-level
summarization quality.
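The report does not pin down a ROUGE implementation; the `rouge-score` package is one commonly used option and is assumed in the sketch below, with illustrative reference and generated summaries.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The system extracts the most important sentences from the document."
generated = "The most important sentences are extracted from the document."

scores = scorer.score(reference, generated)
for name, result in scores.items():
    print(name, f"P={result.precision:.2f} R={result.recall:.2f} F={result.fmeasure:.2f}")
```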
The system is designed to handle large volumes of textual data efficiently, making it suitable for a wide range of
applications like news summarization, academic research, and social media content.
The architecture can be easily scaled to process multiple documents or even large datasets, such as news article
databases or research archives.
While the project focuses on summarization in English, the methodology (especially BERT, which has multilingual
variants) can be extended to support other languages, including those that require specialized handling, such as
Marathi and Punjabi.
This flexibility opens up possibilities for future enhancements to support multilingual summarization.
The summarization system has potential applications in accessibility, particularly for users with visual impairments.
By combining the summarization system with text-to-speech technology, it can offer audio summaries for visually
impaired users, making textual content more accessible.
Application in Various Domains:
The summarization system is designed to be domain-independent, meaning it can be applied to various types of
text documents:
News articles for generating quick summaries of daily news.
Research papers for assisting scholars in quickly grasping the core ideas of lengthy academic publications.
Educational content for condensing large textbooks or course materials into digestible pieces of information.
The project uses Python, a highly popular language for machine learning and NLP, along with libraries like NLTK (for
text processing), Sklearn (for cosine similarity), TensorFlow Hub (for downloading BERT models), and
NumPy/Pandas (for dataset handling).
This setup makes the project modular, easily maintainable, and extensible, allowing for future improvements.
For training and evaluation, the system integrates datasets like news summary datasets from Kaggle, which
provide a wide variety of real-world text documents to test the summarization model.
The model is designed to work with various types of datasets, making it adaptable to multiple sources of text data.
The summarization system is specifically designed to address the information overload problem that exists in fields
such as journalism, research, and social media.
By condensing large texts into essential summaries, it helps users quickly extract the most relevant information,
saving time and enhancing productivity.
Advantages
1. Time Efficiency:
The project enables users to quickly obtain key information from large texts without having to read
through entire documents. This is particularly beneficial in time-sensitive environments like newsrooms,
research, and businesses, where decisions often rely on fast access to information.
By condensing long documents into concise summaries, users can review multiple pieces of content in a
fraction of the time.
2. Improved Document Understanding:
The system enhances document comprehension by focusing on the most relevant sentences. This
ensures that even in cases where users skim through content, they get a clear understanding of the
document's main ideas.
The use of BERT embeddings ensures that the summaries maintain the context and meaning of the
original document, leading to more accurate document understanding.
3. Handling Information Overload:
With the increasing volume of data available online and in industries like journalism, academia, and
social media, this system provides a practical solution to information overload.
The summarization tool reduces large blocks of text into digestible summaries, allowing users to filter out
irrelevant or redundant information more effectively.
4. Enhanced Productivity:
The project improves productivity for professionals who need to review large amounts of text daily, such
as journalists, researchers, analysts, and students.
By offering summaries that retain essential information, users can focus on the key insights without
getting bogged down by excessive detail, improving decision-making processes.
5. Accessibility for Diverse Users:
The system can be integrated with text-to-speech (TTS) technologies, making it highly beneficial for
visually impaired individuals by converting summarized text into audio.
Summarization also benefits elderly users or those with cognitive impairments who might struggle with
reading long documents.
6. Cross-Domain Application:
The summarization system can be applied across different domains, such as:
News aggregation: Summarizing multiple news articles into key points.
Academic research: Summarizing long research papers and journal articles.
Legal documents: Condensing legal documents into key points for faster comprehension.
Educational content: Summarizing textbooks or course materials for students.
This flexibility makes the system useful for a wide variety of professional fields and industries.
7. Improved Decision-Making:
By providing concise summaries of important documents, the system helps users make informed
decisions quickly. This is especially useful in fields like business intelligence, where quick access to critical
data can impact strategic decisions.
8. Better Information Retention:
Summarizing large documents allows users to focus on and retain the most important information,
enhancing overall learning and comprehension. Summaries also act as reference points for future
retrieval, allowing users to revisit key ideas without re-reading entire documents.
9. Scalability for Large Datasets:
The system is designed to handle large volumes of data, making it suitable for large datasets such as
news databases, research archives, and corporate document repositories.
This makes the system scalable for enterprise use, where organizations deal with extensive content that
requires summarization.
10. Higher Summarization Accuracy:
The use of BERT embeddings and PageRank ensures that the system identifies the most important
sentences accurately, leading to high-quality summaries.
By incorporating deep contextual understanding through BERT and ranking sentences based on cosine
similarity and PageRank, the system produces more reliable and coherent summaries than traditional
methods.
11. Customization and Personalization:
The project can be adapted to specific user needs or industries, allowing for customized summarization
based on the type of content (e.g., news, scientific papers, or business reports).
By tailoring the system to specific domains, it becomes possible to enhance the relevance and usefulness
of the generated summaries for particular user groups.
12. Better Evaluation Metrics:
The system uses ROUGE scoring to evaluate the quality of the generated summaries, ensuring that the
summaries produced are close to human-generated summaries in terms of relevance and accuracy.
This helps maintain a consistent standard of summarization quality and allows for continuous
improvement of the system.
13. Resource Optimization:
The project saves resources (time and effort) that would otherwise be spent on manually reading and
summarizing large volumes of text.
Organizations, researchers, and media outlets can reduce the need for human resources dedicated to
manual summarization, optimizing operational efficiency.
14. Multilingual and Multidisciplinary Potential:
While the current system is focused on English, its methodology can be expanded to support multilingual
text summarization, making it valuable for non-English content in various global markets.
The approach can also be fine-tuned for domain-specific summaries, such as medical, legal, or technical
documents, enhancing its utility across multiple disciplines.
15. Adaptable for AI Applications:
The summarization techniques used in this project (such as BERT and PageRank) can be integrated into
broader AI systems for tasks like chatbot development, question answering systems, and document
classification.
This adaptability makes the system a core component for future AI-powered applications focused on text
analysis and understanding.
16. Cost-Effective:
Automating the summarization process can significantly reduce the costs associated with manual
content summarization, such as hiring human summarizers or outsourcing the task.
Over time, this cost-saving potential is especially beneficial for companies or institutions that manage a
high volume of textual data.
The proposed approach uses machine learning and deep learning concepts. The flow chart for this approach is shown in the accompanying figure.
3.8 Platform Used
3.8.1 Windows 10
Windows 10 is Microsoft's operating system for PCs, tablets, embedded devices and more. Microsoft released Windows 10 as the follow-up to Windows 8, and stated that Windows 10 would be continuously updated rather than replaced by a separate successor release.
Microsoft announced Windows 10 in September 2014 and started the Windows Insider programme at that time. Windows 10 was released to the general public in July 2015. Users then found Windows 10 friendlier than Windows 8 because it returned to a more conventional interface, echoing the desktop-oriented layout of Windows 7.
The Windows 10 Anniversary Update, which came out in August 2016, made some modifications to the taskbar and Start Menu. It also introduced browser extensions in Edge and gave users access to Cortana on the lock screen. In April 2017, Microsoft released the Windows 10 Creators Update, which made Windows Hello's facial recognition technology faster and enabled users to save tabs in Microsoft Edge to view later.
The Windows 10 Fall Creators Update appeared in October 2017, adding Windows Defender Exploit Guard to protect against zero-day attacks. The update also enabled users and IT to put applications running in the background into an energy-efficient mode to preserve battery life and improve performance.
3.9 Python
Chapter-4
PERFORMANCE ANALYSIS
4.1 Approaches to Sentence Extraction
4.1.1 Frequency-based approach
In the earliest work on text summarization, which pioneered the field, it was assumed that the important words in a document are repeated many times compared with the other words in the document. The importance of sentences in the document is therefore modelled using word frequency. Since then, many summarization systems have used frequency-based approaches for sentence extraction. Two techniques that use frequency as the primary measure in text summarization are word probability and term frequency-inverse document frequency (TF-IDF).
The simplest way of using frequency is to count the raw frequency of a word, i.e. simply counting every occurrence of the word in the document. However, such counts are greatly affected by document length. One way to adjust for document length is to compute the word probability. Equation 1 gives the probability of a particular word, which, following standard practice, is its count divided by the total number of words in the input:

P(w) = n(w) / N    (1)

where n(w) is the number of occurrences of word w and N is the total number of words in the document.
Findings from studies of human-written summaries show that people tend to use word frequency to determine the key topics of a document. An example of a summarization framework that exploits word probability to create summaries is SumBasic. The SumBasic system first computes the word probabilities from the input document. For each sentence Sj it then computes the sentence weight as a function of the probabilities of its words, and the best-scoring sentence is picked on the basis of that weight.
In TF-IDF, the word frequency is combined with a factor in which the total number of documents in the corpus is divided by the number of documents that contain the word. Based on Equations 3 and 4, the TF-IDF weight of word i in document j is calculated as the standard product

tfidf(i, j) = tf(i, j) × log(N / df(i))

where tf(i, j) is the frequency of word i in document j, N is the number of documents in the corpus and df(i) is the number of documents containing word i.
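A minimal sketch of a SumBasic-style scorer following Equation 1 is shown below. It is illustrative, including the redundancy update in which the probabilities of already covered words are squared; the project's exact implementation may differ.

```python
from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize

def sumbasic_summary(document, num_sentences=3):
    """SumBasic-style summarizer: pick sentences by average word probability,
    then down-weight the words already covered to reduce redundancy."""
    sentences = sent_tokenize(document)
    words_per_sentence = [[w.lower() for w in word_tokenize(s) if w.isalnum()]
                          for s in sentences]

    all_words = [w for ws in words_per_sentence for w in ws]
    prob = {w: c / len(all_words) for w, c in Counter(all_words).items()}  # Eq. (1)

    chosen = []
    while len(chosen) < min(num_sentences, len(sentences)):
        # Sentence weight = average probability of its words
        weights = [sum(prob[w] for w in ws) / len(ws) if ws else 0.0
                   for ws in words_per_sentence]
        best = max((i for i in range(len(sentences)) if i not in chosen),
                   key=lambda i: weights[i])
        chosen.append(best)
        for w in words_per_sentence[best]:        # redundancy update
            prob[w] = prob[w] ** 2

    return " ".join(sentences[i] for i in sorted(chosen))
```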
Sentence Position
The opening sentences of a document usually describe its main information.
Title/Headline Word
Title words appearing in a sentence suggest that the sentence contains important information.
Term Weight
Words that occur frequently within the document are used to determine the importance of a sentence.
Sentence Length
Very short sentences contain little information, while very long sentences are not suitable for representing the summary.
The figure illustrates the basic model of a feature-based summarizer. A score is computed for each feature and the scores are combined for sentence scoring. Before the sentences are scored, the features are assigned weights that determine their level of importance. In this case, feature weighting is applied to determine the weight associated with each feature, and the sentence score is then computed as the linear combination of each feature score multiplied by its corresponding weight:

Score(S) = w1·f1(S) + w2·f2(S) + ... + wn·fn(S)

where fi(S) is the value of feature i for sentence S and wi is its weight.
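The sketch below illustrates such a linear combination over the four features named above (position, title word, term weight, sentence length). The feature definitions and weights are assumptions chosen for illustration, not the weights used in the cited work.

```python
from nltk.tokenize import sent_tokenize, word_tokenize

def feature_scores(document, title, weights=(0.3, 0.3, 0.2, 0.2)):
    """Score(S) = sum of weight_i * feature_i(S) over four illustrative features."""
    sentences = sent_tokenize(document)
    title_words = {w.lower() for w in word_tokenize(title)}
    doc_words = [w.lower() for w in word_tokenize(document) if w.isalnum()]
    freq = {w: doc_words.count(w) for w in set(doc_words)}
    max_freq = max(freq.values()) if freq else 1
    max_len = max(len(word_tokenize(s)) for s in sentences)

    scores = []
    for i, sent in enumerate(sentences):
        words = [w.lower() for w in word_tokenize(sent) if w.isalnum()]
        f_position = 1.0 - i / len(sentences)                        # earlier is better
        f_title = (len(title_words & set(words)) / len(title_words)
                   if title_words else 0.0)
        f_term = (sum(freq.get(w, 0) for w in words) /
                  (len(words) * max_freq)) if words else 0.0
        f_length = len(words) / max_len                              # penalize very short
        features = (f_position, f_title, f_term, f_length)
        scores.append(sum(w * f for w, f in zip(weights, features)))  # Score(S)
    return scores
```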
Binwahlan et al. (2009) proposed a text summarization model based on Particle Swarm Optimization to determine the feature weights. The researchers used a genetic algorithm to approximate the best weight combination for multi-document summarization. Other evolutionary algorithms have also been used to tune the relevance of feature weights. An analysis of the effect of different feature combinations was carried out by Hariharan, where it was found that better results were obtained by combining the term frequency weight with position and node weight.
In this project we use the concept of deep learning for an abstractive summarizer based on a food review dataset. Before developing the model, let us understand the concept of deep learning. The basic structure of a neural network with its hidden layers is shown in the following figure.
Neural Networks (NN) are also used for Natural Language Processing (NLP), including summarizers. Neural networks are effective in solving almost any machine learning classification problem. Important parameters in defining the architecture of a neural network are the total number of hidden layers, the number of hidden units in each layer, the activation function for each node, the error threshold for the data, the type of interconnections, and so on. Neural networks can capture very complex characteristics of the data without any significant manual effort, as opposed to classical machine learning systems. Deep learning uses deep neural networks to learn good representations of the input data, which can then be used to perform specific tasks.
Recurrent Neural Networks (RNNs) were introduced in the 1980s but have become very popular recently thanks to the increase in computational power from GPUs. They are useful for sequential data because a neuron can use its internal memory to maintain information about past inputs. This matters in the case of language: "I had washed my house" is very different from "I had my house washed". The network thereby gains a deeper understanding of the given statement. An RNN contains loops through which information is passed between neurons while reading the input.
Here xt is the input, A is a block of the RNN, and ht is the output. Words from the given sentences, or even individual characters from a string, are fed in as xt, and the network produces ht. The ht is used as the output and is compared with the test data, so that the error rate can be determined. After this comparison with the test output, the back-propagation technique is used. Backpropagation Through Time (BPTT) works back through the network and adjusts the weights depending on the error rate. An RNN can therefore use context from the start of the sentence so that the prediction is correct.
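To make the recurrence concrete, the sketch below shows a vanilla RNN forward pass in NumPy, where the hidden state carries context from earlier inputs forward. The dimensions and random inputs are illustrative; a real abstractive summarizer would use an LSTM/GRU encoder-decoder rather than this toy cell.

```python
import numpy as np

# Vanilla-RNN forward pass: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
input_dim, hidden_dim, seq_len = 8, 16, 5
rng = np.random.default_rng(0)

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

x_seq = rng.normal(size=(seq_len, input_dim))   # e.g. word embeddings x_t
h = np.zeros(hidden_dim)                        # internal memory (initial state)

for x_t in x_seq:
    h = np.tanh(W_xh @ x_t + W_hh @ h + b)      # state carries past context forward

print(h.shape)   # (16,) final hidden state summarizing the sequence
```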
Chapter-5
CONCLUSION
As the internet grows at a very fast rate, the amount of data and information is also increasing, and it is becoming difficult for humans to summarize such large amounts of data. There is therefore a need for automatic text summarization. So far, we have read multiple papers on text summarization, natural language processing and the Lesk algorithm. There are several automatic text summarizers with great capabilities that give good results. We have learned the basics of the extractive and abstractive methods of automatic text summarization and tried to implement the extractive one. We have built a basic automatic text summarizer in Python using the NLTK library, and it works on small documents. We used the extractive approach to perform text summarization.
REFERENCES
[2] Neelima Bhatia, Arunima Jaiswal, "Automatic Text Summarization: Single and Multiple Summarizations", International Journal of Computer Applications.
[3] Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, Krys Kochut, "Text Summarization Techniques: A Brief Survey", (IJACSA) International Journal of Advanced Computer Science and Applications.
[4] Pankaj Gupta, Ritu Tiwari and Nirmal Robert, "Sentiment Analysis and Text Summarization of Online Reviews: A Survey", International Conference on Communication and Signal Processing, August 2013.