www.ijecs.in
International Journal Of Engineering And Computer Science
Volume 9 Issue 05 May 2020, Page No. 25039-25046
ISSN: 2319-7242 DOI: 10.18535/ijecs/v9i05.4488

Legal Document Summarization Using NLP and ML Techniques

Rahul C Kore*, Prachi Ray, Priyanka Lade, Amit Nerurkar
Vidyalankar Institute of Technology, Mumbai University, Mumbai, India
Abstract:
Reading legal documents is tedious and sometimes requires domain knowledge related to the document. It is hard to read a full legal document without missing its key sentences. With the increasing number of legal documents, it would be convenient to obtain the essential information from a document without having to go through the whole of it. The purpose of this study is to make a large legal document understandable within a short duration of time. Summarization gives flexibility and convenience to the reader. Using vector representations of words, text ranking algorithms and similarity techniques, this study shows a way to extract the highest ranked sentences. Summarization produces a result that covers the most vital information of the document in a concise manner. The paper proposes how different natural language processing concepts can be used to produce the desired result and spare readers from going through the whole complex document. This study presents the steps required to achieve this aim and elaborates the algorithms used at each step of the process.

Keywords: Natural Language Processing, Word Embeddings, Page Rank Algorithm, Text Rank Algorithm.
1. Introduction
Document summarization is one of those applications of natural language processing that is going to have a great impact on everyone's lives. Nowadays, few readers have the time to go through an entire document just to understand its purpose. Human summarization is the process of taking a document, understanding it, interpreting it and finally generating a new document as a summary, but this can be a time consuming process. Hence comes the concept of automatic text summarization. Automatic summarization is the process of shortening a large document computationally to create a summary that represents the most important and relevant information within the original document. There are two general approaches to automatic summarization: extractive summarization and abstractive summarization.

In extractive summarization, sentences are extracted from the original document and are not modified in any way. Abstractive summarization constructs an internal semantic representation of the original sentences and then uses this representation to obtain a summary that is closer to how a human being might express it. Abstractive summarization is computationally much more complex and challenging than extractive summarization; it requires both natural language processing and a deep understanding of the domain of the original document. Most of the existing methods use statistical measures such as frequency of occurrence and inverse document frequency, or linguistic information such as term distribution and sentence position, to extract the most relevant sentences from the document.

However, these methods ignore the relationships between different granularities of information, such as the relationships between sentences. Hence the proposed system takes into consideration the similarities between sentences before calculating the rank of each individual sentence in the document. Much research is ongoing in the field of document summarization because text summarization becomes a different and unique problem for each domain.
2. Background and Related Work
This part of the paper illustrates work carried out by others in areas relevant to our research. The sub-sections below cover the most important key areas of our study.

2.1 Semantic Similarity Measures
Text similarity measures [2] have various applications, including automatic text summarization, relevance feedback classification, automatic evaluation of machine translation and determining text coherence. Several approaches are used to calculate the similarity measure, based on statistical methods, vector representations of the words in the given document, string or corpus based approaches, and hybrid similarity measures. Some applications like TF-IDF use inverse document frequency to weight term frequencies, which does not take into consideration the context surrounding a term in the text. The mapping in such methods is done simply using counts and probabilistic measures.

Our study proposes a method in which we first convert all the words present inside the document into their equivalent vector form by using an appropriate word embedding model that takes into consideration the semantic value of each word, so that each word is located in an n-dimensional (n = 10 or 20) space. All the words with similar semantics are placed closer to each other in this space. After converting the words into vector form, we use a vector similarity measure to calculate their likeness. In this way, we also consider the semantics of all the terms present in the document, which gives a better result.

2.2 Keyword Extraction
The smallest unit that expresses the core meaning of a document is known as a keyword [3]. Extracting several keywords from a document to summarize its theme helps users to quickly understand whether the document is of interest to them or not. Keyword extraction can be classified into two categories: supervised and unsupervised. In the supervised category there are two formulations, the two-class classification problem and the multi-class classification problem. In the unsupervised category there are three different methods: word frequency, model based and graph based methods. Supervised methods are those which require human intervention, whereas unsupervised methods work without human intervention and extract keywords directly from the information in the text, which greatly improves efficiency. In unsupervised methods, the word graph model treats the document as a network composed of words and uses the theory of PageRank link analysis to calculate the importance of words. Similarity and co-occurrence frequency between words are used as the weights for extracting keywords, and Word2vec is used to calculate the degree of closeness between words. Word2vec makes use of deep learning to map each word into a vector of k dimensions.

2.3 Graph Based Ranking Algorithm
Using a graph based ranking algorithm [1], we can find the importance of a vertex within a graph based on information drawn from the graph structure. In this section, a graph based algorithm, HITS, which was previously found to be useful on a large range of documents, is presented. This graph based ranking algorithm (HITS) can be used for undirected or weighted graphs.

HITS (Hyperlink-Induced Topic Search) [1] is a link analysis algorithm that ranks web nodes/pages. It estimates two types of values for a page: hubs and authorities. Authority estimates the value of the content of the page, whereas hub estimates the value of its links to other pages. HITS is an iterative algorithm based on the linkage of documents, like PageRank. The algorithm performs a series of iterations, each of which consists of the following two steps:
• Authority update
• Hub update
The above two scores for a node are calculated using the following algorithm:
• Start.
• Initialize the hub score and authority score of each node to 1.
• Update the authority score of each node.
• Update the hub score of each node.
• Normalize the hub scores by dividing each hub score by the square root of the sum of squares of all the hub scores.
• Normalize the authority scores by dividing each authority score by the square root of the sum of squares of all the authority scores.
• Go to step 3 and repeat if necessary.
The formulas for calculating the authority and hub scores are:

HITS_A(V_i) = Σ_{V_j ∈ In(V_i)} HITS_H(V_j)

HITS_H(V_i) = Σ_{V_j ∈ Out(V_i)} HITS_A(V_j)
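To make the HITS procedure above concrete, the following is a minimal Python sketch of the hub/authority iteration. The adjacency-matrix representation, the fixed iteration count and the small worked example are our own illustrative choices, not details taken from the cited work.

import numpy as np

def hits(adjacency, iterations=20):
    # adjacency[i][j] = 1 if node i links to node j, else 0.
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    hub = np.ones(n)        # initialize hub scores to 1
    authority = np.ones(n)  # initialize authority scores to 1
    for _ in range(iterations):
        # Authority update: sum of the hub scores of the nodes pointing to each node.
        authority = A.T @ hub
        # Hub update: sum of the authority scores of the nodes each node points to.
        hub = A @ authority
        # Normalize each score vector by the square root of the sum of its squares.
        authority /= np.linalg.norm(authority)
        hub /= np.linalg.norm(hub)
    return hub, authority

# Example: page 0 links to pages 1 and 2, and page 1 links to page 2.
hub_scores, authority_scores = hits([[0, 1, 1], [0, 0, 1], [0, 0, 0]])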
3. Proposed System
As stated earlier, the proposed system focuses on generating an extractive summary of a given document using different natural language processing techniques such as word embeddings, similarity measures and a ranking algorithm.

Before getting into the specifics of the proposed system, let us understand the overall flow of the system, which is given in the figure below.

Figure 1: Flow of the system

The overall flow of the system can be explained with the help of the following steps:

1) Initially we concatenate all the text present in the document.

2) Then we split the text into individual sentences. This can be done using the tokenizer of the Natural Language Toolkit (NLTK) package of Python.

3) Then we remove all the punctuation, numbers and special characters from the individual sentences. This can be achieved with the help of regular expressions and Python packages.

4) Then all letters are converted into lower case. This is done so that character case sensitivity does not cause any problems in later steps.

5) Then we remove all the stop words from the sentences, because stop words do not contribute any meaningful context to the sentences and would only waste processing time in the next step of vector conversion. After all the above steps we get clean sentences which are free from stop words and all other unwanted punctuation, numbers and special characters.

6) Now we fetch the vectors of the constituent words of each sentence and take the mean/average of those vectors to obtain one consolidated vector per sentence in the document. This step is done using the word embedding model known as Law2Vec [8]. After this step we have a vector representation of every sentence, which is carried forward to the later steps.

7) Now we create an empty similarity matrix of size n x n, where n is the number of sentences present in the document.

8) Now we calculate the cosine similarity for all pairs of sentences in the document using the vector representations of the sentences, not the original sentences. Right after calculating the similarities between the sentences using their vector forms, we insert them into the similarity matrix created in the previous step.

9) After all the above steps, we take this similarity matrix and apply a ranking algorithm to it. In our case, we use the PageRank algorithm to calculate the rank of every individual sentence. After ranking all the sentences we can display the top ranked sentences from the document. A sketch of the pre-processing and vectorization steps (1 to 6) is given below; the similarity and ranking steps (7 to 9) are illustrated at the end of Section 4.
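The following is a minimal Python sketch of steps 1 to 6 above, assuming NLTK for sentence tokenization and stop words and a Law2Vec embedding file loaded with gensim. The file name Law2Vec.200d.txt, the 200-dimensional vectors and the helper names are illustrative assumptions rather than fixed details of the original implementation.

import re
import numpy as np
import nltk
from nltk.corpus import stopwords
from gensim.models import KeyedVectors

nltk.download("punkt")
nltk.download("stopwords")

# Assumption: Law2Vec vectors stored in word2vec text format at this path.
law2vec = KeyedVectors.load_word2vec_format("Law2Vec.200d.txt", binary=False)
stop_words = set(stopwords.words("english"))

def preprocess(document_text):
    # Steps 1-2: treat the document as one block of text and split it into sentences.
    sentences = nltk.sent_tokenize(document_text)
    cleaned = []
    for sentence in sentences:
        # Step 3: remove punctuation, numbers and special characters.
        letters_only = re.sub(r"[^a-zA-Z ]", " ", sentence)
        # Step 4: convert to lower case; step 5: remove stop words.
        tokens = [w for w in letters_only.lower().split() if w not in stop_words]
        cleaned.append(tokens)
    return sentences, cleaned

def sentence_vectors(cleaned_sentences, dim=200):
    # Step 6: average the word vectors of each cleaned sentence.
    vectors = []
    for tokens in cleaned_sentences:
        word_vecs = [law2vec[w] for w in tokens if w in law2vec]
        vectors.append(np.mean(word_vecs, axis=0) if word_vecs else np.zeros(dim))
    return np.array(vectors)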
4. Overview of the System
In this section, we focus in more detail on the concepts used in the proposed system, such as tokenization, stop word elimination, word embeddings, the similarity measure and the ranking algorithm.

4.1 Dataset
The dataset was downloaded from the UCI Machine Learning Repository [10]. It contains Australian legal cases from the Federal Court of Australia (FCA), amounting to almost 4000 legal cases.

4.2 Tokenization
Given a defined document unit, tokenization is basically the task of chopping the sentences up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. One can think of words as tokens in a sentence, and of sentences as tokens in a paragraph.

Below is an example of tokenization:
Input: Friends Romans Countrymen lend me your ears;
Output: "Friends", "Romans", "Countrymen", "lend", "me", "your", "ears"
The tokens above are more often referred to as terms or sometimes words.

We use the sentence tokenizer of the Natural Language Toolkit (NLTK) package of Python, which is already trained and therefore knows very well at which characters and punctuation marks the end and the beginning of a sentence should be marked.

4.3 Stop Words Elimination
This section discusses why stop word elimination is an important step in natural language processing. One of the important forms of pre-processing in natural language processing is to filter out the useless data present in the document. In natural language processing, the useless words that we filter out in the pre-processing step are referred to as stop words. A stop word is a commonly used word such as "the", "a", "an", "in", etc., which is routinely ignored by applications using natural language processing, and by search engines for that matter.

We do not want stop words to waste space in the database or increase processing time in our application, so it is better to eliminate such words. We can remove stop words easily, as the Natural Language Toolkit package in Python has lists of stop words stored for 16 different languages. We just need to download the corpus and start eliminating the stop words from all the sentences in the document.

4.4 Word Embedding
In natural language processing, when working with text, the first thing we must do is come up with a strategy to convert strings to numbers, that is, to vectorize the text before feeding it to any model. Many techniques for vectorizing text came before word embeddings, but none was as good as word embeddings [9]. Two such earlier techniques are as follows:

1) One-hot encodings [9]
Here we "one-hot" encode each word present in the vocabulary. To represent a word we create a zero vector with length equal to the vocabulary size and place a one at the index corresponding to that word. This approach is inefficient: if there are 1,000 words in the vocabulary, then to one-hot encode each word we have to create vectors in which almost all the elements are zero.

2) Encoding each word with a unique number [9]
In this approach we encode each word with a unique integer. Instead of a sparse vector we now have a dense representation, but this method has two downsides. The first is that the integer encoding is arbitrary and does not capture any relationship between the words, and the second is that an integer encoding can be challenging for a model to interpret.
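As a toy illustration of the difference between these two encodings and a learned embedding, consider the sketch below. The five-word vocabulary, the 4-dimensional embedding size and the random embedding values are made up purely for illustration.

import numpy as np

vocabulary = ["legal", "court", "case", "judge", "appeal"]   # toy vocabulary
word_to_id = {w: i for i, w in enumerate(vocabulary)}

# 1) One-hot encoding: a sparse vector that is almost entirely zeros.
one_hot = np.zeros(len(vocabulary))
one_hot[word_to_id["court"]] = 1              # [0, 1, 0, 0, 0]

# 2) Unique-number encoding: compact, but the integer itself carries no meaning.
integer_id = word_to_id["court"]              # 1

# 3) Word embedding: a dense vector of floats; in practice these values are
#    learned (for example by Word2vec or Law2Vec), here they are placeholders.
embedding_matrix = np.random.rand(len(vocabulary), 4)
dense_vector = embedding_matrix[word_to_id["court"]]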
Hence, word embeddings [9] are the most efficient option: they give us a dense representation in which similar words have a similar encoding, and, most importantly, we do not have to specify the encoding by hand. Here an embedding is a dense vector of floating point values, and instead of specifying these values manually we train them as parameters. It is common to see 8-dimensional word embeddings, and there are also word embeddings with 1024 dimensions when we work with very large datasets. A higher-dimensional embedding takes more data to learn but captures more fine grained relationships between words. The graphical representation of word embeddings can be visualized as follows:

Figure 2: Graphical representation of word embedding

Thus we use the Law2Vec [8] word embedding model, which was developed at the Department of Informatics of the University of Athens. This model contains millions of legal words, already trained and ready to use.

4.5 Similarity Measure
Similarity measures [2] are used to calculate the similarity between documents, or between different sentences within a document. A similarity measure defines how much alike two objects are. It has various applications in natural language processing, such as automatic text summarization, and it also has applications in computer vision. It can be used in many real world applications: one important application in the business world is to use similarity techniques to match resumes with job descriptions, which saves a considerable amount of time for recruiters, and another is to use a similarity measure to segment customers for marketing campaigns using a clustering algorithm, which also relies on similarity measures. There are many similarity metrics that can be used, such as the Euclidean metric, Jaccard similarity, cosine similarity, etc. We focus on cosine similarity, as it is what we have used in our proposed system.

Cosine similarity [2] is the measure of similarity between two non-zero vectors obtained by calculating the cosine of the angle between them. It measures how similar two documents are irrespective of their sizes, and we use it to measure how similar two sentences in our document are. The formula for cosine similarity is given below:

cos(A, B) = (A · B) / (‖A‖ ‖B‖)

Figure 3: Cosine similarity formula

4.6 Page Rank
PageRank [1] is the algorithm used by the Google search engine to rank web pages in its search results. It was named after Larry Page, one of the founders of Google. It is a way of measuring the importance of web pages, and the results are shown to users accordingly. The main component used in calculating the rank of a web page is the number of links to that page: by counting the number and quality of links to a page, the algorithm estimates how important the website is.

The following figure shows a graph of web pages A, B, C, D having certain links to each other.

Figure 4: Graph of 4 web pages

Generally, the PageRank value for any page u can be expressed as

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

This means that the PageRank value of a web page u depends on the PageRank value of each web page v in the set B_u (the set of all web pages that link to page u), divided by the number L(v) of links going out of web page v. The PageRank algorithm therefore outputs a probability distribution representing the likelihood that a person clicking on links from one web page will arrive at another. The PageRank computation requires several iterations to compute the score of each web page.
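To make Sections 4.5 and 4.6 concrete, the following minimal sketch covers steps 7 to 9 of the proposed flow: it builds the n x n cosine similarity matrix over the sentence vectors and applies PageRank to the resulting graph. The use of networkx for PageRank, the helper names and the five-sentence summary length are illustrative assumptions, and the sketch is meant to be combined with the pre-processing example from Section 3.

import numpy as np
import networkx as nx

def summarize(sentences, sentence_vecs, top_n=5):
    n = len(sentences)
    # Step 7: empty n x n similarity matrix.
    similarity = np.zeros((n, n))
    # Step 8: cosine similarity between the vectors of every pair of sentences.
    for i in range(n):
        for j in range(n):
            if i != j:
                denom = np.linalg.norm(sentence_vecs[i]) * np.linalg.norm(sentence_vecs[j])
                if denom > 0:
                    similarity[i][j] = np.dot(sentence_vecs[i], sentence_vecs[j]) / denom
    # Step 9: treat the matrix as a weighted graph and rank the sentences with PageRank.
    graph = nx.from_numpy_array(similarity)
    scores = nx.pagerank(graph)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Return the top ranked sentences in their original document order.
    return [sentences[i] for i in sorted(ranked[:top_n])]

# Example usage together with the earlier sketch:
#   original, cleaned = preprocess(document_text)
#   summary = summarize(original, sentence_vectors(cleaned))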
Now that we have understood the PageRank algorithm, we can dive into the TextRank algorithm [6]. There are certain similarities between the two algorithms, which are listed below:

1) We use sentences in place of web pages.
2) The similarity between two sentences is used as the web page transition probability.
3) As in PageRank, the similarity scores are stored in a square matrix.

The TextRank scores of the sentences can be visualized as follows:

Figure 5: Text Rank score visualization

5. Conclusion and Future Scope
In this paper, we have introduced the basic idea of the TextRank algorithm, which is based on the PageRank algorithm, to calculate the rank of the individual sentences in a document. We started by tokenizing the sentences and removing unwanted punctuation, numbers and other special characters from them. We then eliminated stop words to obtain clean sentences from the original sentences. Next we calculated the equivalent vector representation of every word in the cleaned sentences and took the mean of these vectors to obtain the vector representation of each sentence. These vector representations were then used to calculate the similarity between sentences and were fed into the PageRank algorithm, which gives the score/rank of every sentence; the top ranked sentences form the summary of the document.

Currently the proposed system focuses only on extractive summarization and not on abstractive summarization, but it can be extended to implement abstractive summarization. The sentences that rank highest could then be passed to an abstractive summarizer, and the sentences given to the user could be simplified further than they are now. Since the processing time and complexity of abstractive summarization are much higher than those of extractive summarization, we were not able to dive into abstractive summarization; with more processing power, however, the system can be extended in this direction.

6. References
[1.] Khushboo S Thakkar, R. V. Dharaskar, M. B. Chandak, "Graph-Based Algorithms for Text Summarization", Third International Conference on Emerging Trends in Engineering and Technology, 2010.
[2.] Keet Sugathadasa, Buddhi Ayesha, Nisansa de Silva, Amal Shehan Perera, Vindula Jayawardana, Dimuthu Lakmal, Madhavi Perera, "Legal Document Retrieval using Document Vector Embeddings and Deep Learning", Computing Conference, London, UK, 2018.
[3.] Yujun Wen, Hui Yuan, Pengzhou Zhang, "Research on Keyword Extraction Based on Word2Vec Weighted TextRank", 2nd IEEE International Conference on Computer and Communications, 2016.
[4.] Md. Nizam Uddin, Shakil Akter Khan, "A Study on Text Summarization Techniques and Implement Few of Them for Bangla Language", 1-4244-1551-9/07, IEEE, 2007.
[5.] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, "Efficient Estimation of Word Representations in Vector Space", In Proceedings of Workshop at ICLR, 2013.
[6.] Mihalcea R, Tarau P, "TextRank: Bringing Order into Texts", In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 2004, pp. 404-411.
[7.] K. Sugathadasa, B. Ayesha, N. de Silva, A. S. Perera, V. Jayawardana, D. Lakmal, and M. Perera, "Synergistic union of word2vec and lexicon for domain specific semantic similarity", University of London International Programmes, 2017.
[8.] "Law2Vec: Legal Word Embeddings by Ilias Chalkidis", Available: https://archive.org/details/Law2Vec
[9.] "Word Embeddings tutorial by TensorFlow Core", Available: https://www.tensorflow.org/tutorials/text/word_embeddings?source=post_page
[10.] "Legal Case Reports Dataset by UCI Machine Learning Repository", Available: https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports