NLP-Driven Summarization of Local Language Texts
Supervisor: Asst Prof Dr. Yasir Jan
1 OVERVIEW
2 EVOLUTION OF SUMMARIZATION
3 URDU SUMMARIZATION
4 EXPERIMENTAL RESULTS
5 FUTURE WORK
6 CONCLUSION
7 QUERIES
The increasing influx of data has created a major problem: Information Overload.
NLP is a subfield of linguistics and computer science, with the addition of artificial intelligence:
• Natural Language Understanding (NLU) – understanding and extracting meaningful insights from natural languages
• Natural Language Generation (NLG) – generating content similar to human language for desired tasks
NLP tasks: Part of Speech Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis, Question Answering, Language Modelling, Machine Translation, Automatic Summarization, Natural Language Inference (NLI), Semantic Textual Similarity, Speech Recognition, Speaker Recognition, Document Classification
Automatic Summarization is the process of extracting only the meaningful information from a text, reducing its length while preserving the information contained in it.
Summarization can be categorized according to various criteria.
Early statistical and rule-based approaches to summarization include the use of Term Frequency (TF) [3] and Inverse Document Frequency (IDF) [4].
These methods were enhanced with additional features (position, cue words, headlines) [5], clustering [6] and probability-based models [7].
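A minimal sketch of the TF-IDF scoring idea (scikit-learn assumed; splitting sentences on "." is a simplification, and this is an illustration rather than any of the cited methods): each sentence is scored by the sum of its TF-IDF term weights and the top-scoring sentences are extracted.

```python
# Minimal TF-IDF sentence-scoring sketch: each sentence is treated as a
# "document" so IDF down-weights terms that occur in most sentences.
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_summary(text, num_sentences=2):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tfidf = TfidfVectorizer().fit_transform(sentences)   # (sentences x terms)
    scores = tfidf.sum(axis=1).A1                        # sum of term weights per sentence
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return ". ".join(sentences[i] for i in sorted(top)) + "."   # keep original order
```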
Popular graph-based approaches include TextRank [8] and LexRank [9], both derived from Google’s PageRank algorithm.
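A minimal TextRank-style sketch (networkx and scikit-learn assumed; TF-IDF cosine similarity stands in for the similarity measures used in the original papers): sentences are nodes, similarities are edge weights, and PageRank produces the sentence ranking.

```python
# TextRank-style extractive ranking: PageRank over a sentence-similarity graph.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, num_sentences=2):
    vectors = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(vectors)        # sentence-to-sentence similarity
    graph = nx.from_numpy_array(similarity)        # weighted, undirected sentence graph
    scores = nx.pagerank(graph, weight="weight")   # PageRank-derived importance
    ranked = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(ranked)]  # keep original order
```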
Typical pipeline: Preprocessing → Model → Summary

Preprocessing
• Stop Words Removal
• Stemming / Lemmatization
• Unwanted Characters Removal (URLs etc)
• Sentence / Phrase / Word Splitting

Model
• Binary Classifiers
• Probability-based Regression (HMM)
• Dimensionality Reduction (LSA, LDA)
• Graph-based (TextRank, LexRank, ADGs)
• Clustering

Summary
• Sentence Scoring based Extraction
• Phrase Extraction
• Sentence Compression / Fusion
• Abstractive Generation (Limited)

Training Data → Training → Output
Example: Latent Dirichlet Allocation (LDA) based Topic Detection → TF-IDF of Topic Terms in the Sentences of the Document(s) → Clustering of Sentences based on occurrence of Topic Terms → Selection of Sentences nearest to Cluster Centroids
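A simplified sketch of this topic-driven pipeline (scikit-learn assumed): LDA supplies topic terms, sentences are scored by how many topic terms they contain, and one sentence per topic is extracted (the clustering/centroid step is reduced to a per-topic argmax for brevity).

```python
# Topic-driven extractive selection: LDA topic terms -> best sentence per topic.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topic_summary(sentences, num_topics=2, terms_per_topic=5):
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=num_topics, random_state=0).fit(counts)
    vocab = vectorizer.get_feature_names_out()
    picked = set()
    for topic in lda.components_:                         # term weights for one topic
        top_terms = {vocab[i] for i in topic.argsort()[-terms_per_topic:]}
        # score each sentence by how many of this topic's top terms it contains
        scores = [sum(w in top_terms for w in s.lower().split()) for s in sentences]
        picked.add(max(range(len(sentences)), key=lambda i: scores[i]))
    return [sentences[i] for i in sorted(picked)]
```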
• Prior to deep learning, summarization was achieved through word-based models, treating text as a bag of words or scaling such models to sequences (without contextual information)
• With deep-learning-based Sequence-to-Sequence models [17] [18] [19] [20] [21] (Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN)), syntactic and semantic analysis was enriched with context
Syntax / Lexical – Words & Structure – POS Tagging
Semantics – Meanings of Words – Similarity
Context – Dependencies & Relationships between Words in a Sequence – Seq2Seq Models

Context example:
• He went to the bank for depositing his savings
• He went to the bank of a river for a walk
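A minimal sketch of what "context" buys here (Hugging Face transformers assumed; bert-base-uncased is used purely for illustration): a contextual model assigns the word "bank" different vectors in the two example sentences, which a bag-of-words representation cannot do.

```python
# Contextual embeddings: the same word gets different vectors in different contexts.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]            # (tokens, dim)
    idx = enc.input_ids[0].tolist().index(tokenizer.convert_tokens_to_ids("bank"))
    return hidden[idx]

v1 = bank_vector("He went to the bank for depositing his savings")
v2 = bank_vector("He went to the bank of a river for a walk")
print(torch.cosine_similarity(v1, v2, dim=0))   # noticeably below 1.0
```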
Document(s) / Sentences → Recurrent Neural Networks (RNNs, LSTMs, GRUs) → Summary

Pre-training Techniques
• Word Masking
• Sentence / Phrase Masking
• De-noising / Corrupting
• Transfer learning was enabled through Language Models (LMs) and their re-use for various downstream tasks (summarization, Q/A, inference etc)
• BERT (Bidirectional Encoder Representations from Transformers) [23] was proposed with pre-training on masked language over a large un-labelled corpus (self-supervised learning), based on the Transformer architecture [22]
• Transfer learning (now ubiquitous in NLP) not only provided the benefit of generalized LMs adaptable to various downstream tasks but also cross-lingual usage, e.g. mBERT (Multilingual BERT trained on 104 languages); a minimal re-use sketch follows this list
• The pre-training objective of LMs was extended to cross-lingual training [23]
• It suffers from the disadvantage of under-representation of low-resource languages [25]
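A minimal sketch of this re-use pattern (Hugging Face transformers assumed; the sentence-classification head and its two labels are only an illustrative choice, not the thesis method): the pre-trained multilingual LM is loaded and a task head is attached for fine-tuning on a downstream task.

```python
# Transfer learning: re-use a pre-trained multilingual LM for a downstream task.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased",   # mBERT, pre-trained on ~104 languages
    num_labels=2,                     # e.g. keep / drop a sentence in an extractive summary
)
# The model can now be fine-tuned on labelled task data (Trainer or a custom loop);
# only the small task head starts from scratch, the LM weights are re-used.
```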
Statistical / Rule-based
• Sentence Scoring / Weighting
• Sentence Fusion / Compression
• TF/IDF, Features
• Clustering
• Probability
• Graphs
→ Lexical + Syntax & Semantics

Machine Learning
• Sentence Scoring / Weighting
• Sentence Fusion / Compression
• Binary Classifiers
• Probability, Entropy, HMM
• Dimensionality Reduction (LSA, LDA, SVM)
→ Lexical + Syntax & Semantics

Deep Learning
• Sentence Scoring / Weighting / Fusion / Compression
• Natural Language Generation
• Convolutional Neural Networks (CNN)
• Recurrent Neural Networks (RNN, LSTM, GRUs)
• Transformers
• Pre-Trained LMs using Transformer Architecture
→ Lexical + Syntax & Semantics + Context
Proceedings of the Tenth International Conference on Language Resources and Evaluation, European Language Resources Association (ELRA), 2016
Pros:
• Simple method using TF and probability, with the addition of a position feature
Cons:
• Small dataset size; only 50 records
• Non-ML-based statistical method
• Unrealistically high evaluation scores compared to a rich-resource language like English (SOTA ROUGE F score is below 0.50), mainly due to the small dataset
Information Processing & Management Journal (Volume 57, Issue 6, November 2020)
• Dataset
o Additional human-written extractive summaries added to the previous dataset of 50 records, i.e. the Urdu Summary Corpus
• Methodology
o Sentence Weight algorithm using weighted TF/IDF; non-ML-based statistical method
o ML-based embedding model for learning vocabulary on 600 articles, later used in the Sentence Weight algorithm
• Evaluation – ROUGE-1 F Score
Dataset Sentence Wt Wt TF VSM TextRank Distributional Semantic Model
Abstractive 0.36 0.37 0.37 0.39 0.35
Extractive 0.80 0.76 0.62 0.77 0.57
Pros:
• Comparison of various statistical methods
• Local weights and globally learned weights
• Graph-based method also used, i.e. TextRank
Cons:
• Small dataset size; only 50 records
• Non-ML-based statistical methods
• Unrealistically high evaluation scores compared to a rich-resource language like English (SOTA is below 0.50), mainly due to the small dataset
• ROUGE score for abstractive summaries has limitations (newly generated words)
Mohammad Ali Jinnah University International Conference on Computing (MAJICC), 2021
• Dataset
o 15 articles and summaries (not publicly available)
• Methodology
o Comparison of existing statistical methods
o Only extractive methods are compared
• Evaluation – ROUGE-1 F Score
Reduction KL Divergence Sum Basics Edmundson TextRank LSA Luhn LexRank
0.71 0.40 0.31 0.49 0.58 0.62 0.62 0.49
Pros:
• Comparison of various statistical methods
• Graph-based methods also used, i.e. TextRank & LexRank
Cons:
• Small dataset size; only 15 records
• Non-ML-based statistical methods
• Unrealistically high evaluation scores compared to a rich-resource language like English (SOTA is below 0.50), due to the extremely small dataset
• No implementation details and no publicly available dataset
• Urdu, despite being a popular language (the 10th most spoken [1]) and our national language, has little to no research available in the fields of summarization and NLP
• Only one dataset [23] is available for Urdu summarization; it has only 50 records, making it unsuitable for training any machine-learning-based algorithm
• There is no research on Urdu summarization using ML (deep learning)
• Urdu summarization models are required that can be utilized in real time for summarizing content ranging from social media networks and news to books and lectures; the same is required as a baseline for future research
Identified gaps: no research on ML (deep learning) based Urdu summarization models; non-availability of Urdu summarization datasets; non-availability of Urdu summarization models
• A methodological framework for efficiently utilizing deep-learning-based pre-trained language models, trained on multiple languages, for summarization of a low-resource language in low-resource settings.
• A baseline low-resource summarization model, fine-tuned on the newly created dataset, with evaluation scores competitive with high-resource languages like English (using multiple evaluations: ROUGE, BERTScore, Human).
• Only one dataset, the “Urdu Summary Corpus” of 50 records, is publicly available
• Creation of a large set of human-written summaries is:
o Resource Intensive
o Time Consuming
• Although mT5-based models have no theoretical limit on input length, memory consumption grows rapidly with longer inputs (self-attention scales quadratically with sequence length)
• Articles longer than 512 tokens were truncated to 512 tokens:
o The recall measure of ROUGE-1 between the summary and each paragraph of the article was used to rank the paragraphs
o Paragraphs with low recall were removed until the article length came down to 512 tokens
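A minimal sketch of this truncation strategy (rouge_score package assumed; a whitespace token count stands in for the model tokenizer): paragraphs are ranked by ROUGE-1 recall against the reference summary and the least relevant ones are dropped until the article fits.

```python
# Drop the lowest-ROUGE-1-recall paragraphs until the article fits into 512 tokens.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"])

def truncate_article(paragraphs, summary, max_tokens=512):
    # recall of the summary's unigrams covered by each paragraph
    recalls = [scorer.score(summary, p)["rouge1"].recall for p in paragraphs]
    kept = list(range(len(paragraphs)))

    def n_tokens(indices):
        return sum(len(paragraphs[i].split()) for i in indices)

    while n_tokens(kept) > max_tokens and len(kept) > 1:
        kept.remove(min(kept, key=lambda i: recalls[i]))   # least relevant paragraph first
    return "\n".join(paragraphs[i] for i in sorted(kept))
```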
[Table: dataset attributes – article length, summary length and compression % – before and after preprocessing]
Extractive pipeline: the sentences (S1 … SN) of an article are embedded with pre-trained language models (*PLMs trained on NLU for LS), clustered with K-Means, and the cosine similarity between each cluster centroid (C1 … CK) and the sentences (S) in that cluster is computed; the sentence with the highest similarity to its centroid goes into the extractive summary (a minimal sketch is given below).
*PLM – Pre-trained Language Model
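A minimal sketch of this extractive pipeline (sentence-transformers and scikit-learn assumed as a convenient stand-in for the mBERT/MuRIL encoders discussed later): sentences are embedded, clustered with K-Means, and the sentence closest to each cluster centroid is selected.

```python
# PLM-embedding extractive summarization: embed -> K-Means -> centroid-nearest sentence.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def extractive_summary(sentences, num_clusters=3):
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # multilingual PLM
    embeddings = encoder.encode(sentences)                                  # (S, dim)
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=0).fit(embeddings)
    picked = []
    for c in range(num_clusters):
        members = np.where(km.labels_ == c)[0]
        sims = cosine_similarity(embeddings[members], km.cluster_centers_[c:c + 1]).ravel()
        picked.append(members[sims.argmax()])        # sentence closest to the centroid
    return [sentences[i] for i in sorted(picked)]
```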
Abstractive pipeline: Article → mT5 / LS-T5 (language-specific T5, loading only the selected language’s vocabulary from the multilingual model) → Summary
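A minimal sketch of abstractive fine-tuning with mT5 (Hugging Face transformers assumed; dataset handling, batching and generation settings are simplified, so this is an illustration rather than the exact training setup): the article text is the input, the reference summary is the target.

```python
# Abstractive summarization: fine-tune mT5 on (article, summary) pairs.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

def training_step(article, summary, optimizer):
    inputs = tokenizer(article, truncation=True, max_length=512, return_tensors="pt")
    labels = tokenizer(summary, truncation=True, max_length=128, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss     # teacher-forced seq2seq loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# After fine-tuning, generation is a standard beam search, e.g.:
# summary_ids = model.generate(**inputs, max_length=128, num_beams=4)
```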
• ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [22]
o Calculated using word overlaps between the reference and the generated summary (ROUGE-1 = unigram overlap), reported as Recall, Precision and F-Score
o Cons: abstractive summarization produces out-of-vocabulary words not present in the source text, so a summary with a low ROUGE score may still be more accurate and of better quality
• BERTScore
o Compares contextual embeddings of the reference and generated summary (Recall, Precision, F-Score)
o Cons: dependent on pre-trained models for calculating embeddings (a minimal evaluation sketch is given below)
• Truncated versions of the datasets (using the recall-based ranking) showed better results than the original versions, which were being truncated automatically after 512 tokens by BERT-based models
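A minimal sketch of the two automatic evaluations (rouge_score and bert_score packages assumed; the placeholder strings are not real data): ROUGE-1 measures unigram overlap, while BERTScore compares contextual embeddings and can therefore credit paraphrases that ROUGE misses.

```python
# Automatic evaluation: n-gram overlap (ROUGE-1) vs embedding similarity (BERTScore).
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "..."   # human-written summary (placeholder)
candidate = "..."   # model-generated summary (placeholder)

rouge1 = rouge_scorer.RougeScorer(["rouge1"]).score(reference, candidate)["rouge1"]
print(rouge1.precision, rouge1.recall, rouge1.fmeasure)

# BERTScore uses a multilingual pre-trained model under the hood for Urdu
P, R, F1 = bert_score([candidate], [reference], lang="ur")
print(float(P.mean()), float(R.mean()), float(F1.mean()))
```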
• mBERT: trained on 104 languages; base – 110M parameters, approx. 681 MB
• MuRIL: trained on 17 Indian languages; base – 236M parameters, approx. 909 MB; large – approx. 1.89 GB; its training data also contained translated and transliterated documents for cross-lingual training
• MuRIL shows only a minor difference in evaluation score from mBERT, relative to the difference in model size
Extractive Summarization
• Due to the absence of models for low-resource languages, multilingual models are mostly used
• Loading only the monolingual vocabulary in a multilingual model, since most of the parameters lie in the embedding layer
o Reduction of size by up to 48%: mBERT from 681 MB to 354 MB
o Geotrend-BERT-base gives results equivalent to mBERT-base at almost half its original size

Abstractive Summarization
• Memory utilization during training was reduced by loading only the monolingual vocabulary (Urdu, comprising ~40k tokens) into the multilingual model (mT5-base with 250k tokens; 101 languages), similar to Geotrend/BERT (a minimal sketch of this idea is given below)
• Reduction of size by up to 44.78%: mT5-base of ~2.17 GB reduced to ~1.04 GB (Urdu T5; urT5)
• The urT5 model gives results equivalent to the mT5 model while being 55% of its size
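A minimal sketch of the vocabulary-trimming idea (Hugging Face transformers assumed; handling of the tied LM head, token-id remapping and the reduced tokenizer is omitted, so this only illustrates the embedding-matrix reduction): keep only the embedding rows that the target-language corpus actually uses.

```python
# Shrink a multilingual model's embedding matrix to the tokens used by one language.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")

def trim_embeddings(model, corpus):
    keep = set()
    for text in corpus:                                  # corpus: iterable of target-language strings
        keep.update(tokenizer(text).input_ids)
    keep = sorted(keep)
    old = model.get_input_embeddings().weight.data       # (~250k, dim) for mT5-base
    new = torch.nn.Embedding(len(keep), old.size(1))
    new.weight.data = old[keep].clone()                  # only the rows that are actually used
    model.set_input_embeddings(new)                      # tokenizer / LM-head remapping not shown
    return keep                                          # position in 'keep' gives the new token id
```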
Abstractive Summarization
• Zero-shot – poor results, as the model was pre-trained on NLU and NLG tasks (not on summarization)
• 50% of the dataset – ROUGE-1 F Score lower by 0.89; a larger number of training examples increases capability
• 50% of the training epochs – 2.5 epochs instead of 5 shows only a minor difference (0.11 ROUGE-1 F Score); more training epochs do not necessarily mean a better model
Extractive Summarization
• Summaries with the top evaluation scores were found to have the maximum number of terms extracted from the articles, increasing term overlap and hence the automated evaluation scores

Abstractive Summarization
• Training was carried out on the joint dataset; testing was carried out on the joint dataset as well as on the separate subsets
• BBC Urdu evaluation was high, though comparatively lower than for extractive summarization
• The lower score reflects the generative capability of T5-based models (extractive: 48.7 vs abstractive: 46.3)
• Considering the lack of research in low-resource summarization:
o A dataset has been created from publicly available sources of Urdu news (the process can easily be replicated for other languages)
o ROUGE-1 scores of 47.21, 45.14 and 38.81 are claimed by PEGASUS [70], BART [62] and BERTSumExtAbs [64] respectively; however, these cannot be compared in a true sense, as results are dataset dependent
• Dataset. Creation of quality datasets for training and evaluation, including multi-domain datasets (news, books, reviews etc) and cross-lingual parallel datasets (the same text in multiple languages)
• Low Resource. A modular approach towards models that can work in low-resource settings for specific tasks (similar to the technique used in creating urT5 from mT5)
• Complex Language. Generic models able to understand complex language such as idioms and sarcasm (also verified on low-resource languages like Urdu)