The document discusses the challenges of information overload and the role of Natural Language Processing (NLP) in automatic summarization, particularly focusing on Urdu language texts. It highlights the lack of research and datasets for Urdu summarization and proposes a framework for creating a large dataset and utilizing deep learning models for effective summarization. The study aims to develop a competitive summarization model for low-resource languages, leveraging pre-trained multilingual models and addressing the limitations of existing approaches.

NLP-Driven Summarization of Local Language Texts:

A Pre-trained Model Approach


Subayyal Sheikh (2130-6003)

Supervisor
Asst Prof Dr. Yasir Jan

1 OVERVIEW

2 EVOLUTION OF SUMMARIZATION

3 URDU SUMMARIZATION

4 EXPERIMENTAL RESULTS

5 FUTURE WORK

6 CONCLUSION

7 QUERIES
• Increasing influx of data has created one of the biggest problems: Information Overload

• Popularity of social media and news platforms: the content being created is overwhelming to users

• Biggest challenge: sifting through this content and extracting meaningful information

• Information Extraction / Retrieval (IE / IR) and Natural Language Processing (NLP) offer ways to address this challenge
 NLP is a subfield of linguistics & computer science with the addition of artificial intelligence
• Natural Language Understanding (NLU) - understanding and extracting meaningful insights from natural languages
• Natural Language Generation (NLG) - generating content similar to human language for desired tasks

 Common NLP tasks: Part of Speech Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis, Question Answering, Language Modelling, Machine Translation, Automatic Summarization, Natural Language Inference (NLI), Semantic Textual Similarity, Speech Recognition, Speaker Recognition, Document Classification
 Automatic Summarization is the process of extracting only the meaningful information from a text, reducing its length while preserving the information contained in it
 Summarization can be categorized on various criteria

 Early statistical and rule-based approaches to summarization include the use of Term Frequency (TF) [3] and Inverse Document Frequency (IDF) [4]
 These methods were enhanced by using additional features (position, cue words, headlines) [5], clustering [6] and probability-based models [7]
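As a rough illustration of this family of methods (a sketch only, not the exact algorithms of [3]-[7]), sentences can be scored by the TF-IDF weights of their terms and the top-ranked ones extracted:

```python
# Minimal sketch of TF-IDF based extractive scoring (illustrative only).
# Sentences are scored by the mean TF-IDF weight of their terms and the
# top-k are returned in their original document order.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def tfidf_summary(sentences, k=3):
    vec = TfidfVectorizer()
    X = vec.fit_transform(sentences)             # one row of TF-IDF weights per sentence
    scores = np.asarray(X.mean(axis=1)).ravel()  # mean weight as a crude sentence score
    top = sorted(np.argsort(scores)[-k:])        # keep original sentence order
    return [sentences[i] for i in top]

doc = [
    "Information overload is a growing problem.",
    "Automatic summarization condenses text while preserving meaning.",
    "Early systems scored sentences with term frequency and IDF.",
    "Position and cue words were later added as features.",
]
print(tfidf_summary(doc, k=2))
```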

 Popular graph-based approaches include TextRank [8] and LexRank [9], derived from Google’s PageRank algorithm

(Figure: TextRank & LexRank – conceptual depiction of ranking sentences using similarity measures)
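A minimal TextRank-style sketch (conceptual only, not the exact published algorithms): sentences become graph nodes, edge weights are pairwise similarities, and PageRank supplies the ranking:

```python
# TextRank-style ranking: build a sentence similarity graph and rank the
# sentences with PageRank (conceptual illustration of [8]/[9]).
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, k=2):
    X = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(X)                 # sentence-to-sentence similarity matrix
    np.fill_diagonal(sim, 0.0)                 # ignore self-similarity
    graph = nx.from_numpy_array(sim)           # weighted undirected graph
    ranks = nx.pagerank(graph, weight="weight")
    top = sorted(sorted(ranks, key=ranks.get, reverse=True)[:k])
    return [sentences[i] for i in top]
```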
Preprocessing (Training Data):
• Stop Words Removal
• Stemming / Lemmatization
• Unwanted Characters Removal (URLs etc)
• Sentence / Phrase / Word Splitting

Model (Training):
• Binary Classifiers
• Probability Based Regression (HMM)
• Dimensionality Reduction (LSA, LDA)
• Graph based (TextRank, LexRank, ADGs)
• Clustering

Summary (Output):
• Sentence Scoring based Extraction
• Phrase Extraction
• Sentence Compression / Fusion
• Abstractive Generation (Limited)

Example pipeline: Latent Dirichlet Allocation (LDA) based Topic Detection → TF-IDF (Topic Terms) in Sentences of Document(s) → Clustering of Sentences based on occurrence of Topic Terms → Selection of Sentences nearer to Cluster Centroids
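The depicted pipeline can be sketched roughly as follows (parameter choices are illustrative; the slides do not prescribe this exact code):

```python
# Rough sketch of the example pipeline: LDA topic detection, clustering of
# sentences over their topic distributions, then picking the sentence closest
# to each cluster centroid.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
import numpy as np

def lda_cluster_summary(sentences, n_topics=3, n_clusters=3):
    counts = CountVectorizer().fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    topic_dist = lda.fit_transform(counts)        # per-sentence topic distribution
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(topic_dist)
    summary = []
    for c in range(n_clusters):                   # nearest sentence to each centroid
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(topic_dist[idx] - km.cluster_centers_[c], axis=1)
        summary.append(sentences[idx[dists.argmin()]])
    return summary
```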
• Prior to deep learning, summarization was achieved through word-based models, treating text as a bag of words or scaling such models to sequences (without contextual information)
• With deep learning based sequence-to-sequence models [17] [18] [19] [20] [21] (Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN)), syntactic and semantic analysis was enriched with context

Syntax / Lexical – Words & Structure – e.g. POS Tagging
Semantics – Meanings of Words – e.g. Similarity
Context – Dependencies & Relationships between words in a Sequence – e.g. Seq2Seq Models

Context example:
• He went to the bank for depositing his savings
• He went to the bank of a river for a walk
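A small sketch of what "context" buys: a contextual encoder (here plain BERT via the HuggingFace transformers library, used as an assumed stand-in for any contextual model) assigns different vectors to the same word "bank" in the two sentences above:

```python
# Demo of context-dependence: the token "bank" receives different contextual
# vectors in the two example sentences, so their cosine similarity is < 1.0.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

def word_vector(sentence, word):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]      # (seq_len, hidden_dim)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)]                   # vector of the first match

v1 = word_vector("He went to the bank for depositing his savings", "bank")
v2 = word_vector("He went to the bank of a river for a walk", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))           # noticeably below 1.0
```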
Document(s) / Sentences → Recurrent Neural Networks (RNNs, LSTMs, GRUs) → Summary

• Comprises an RNN based Encoder-Decoder
• Sequential models
• Sequential attention mechanisms cater for dependencies (context) between tokens
Parallelization instead of a sequential attention mechanism

• Seq2Seq models lacked parallelization, creating bottlenecks
• The Transformer, based on the attention mechanism, allowed parallelization to remove this bottleneck
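At the core of the Transformer is scaled dot-product attention, in which all positions attend to each other in one matrix operation rather than step by step; a minimal sketch:

```python
# Minimal scaled dot-product attention (the core Transformer operation):
# every position attends to every other position in a single matrix product,
# which is what makes parallelization possible (no sequential recurrence).
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)              # attention distribution
    return weights @ v                                    # context-mixed representations

x = torch.randn(2, 5, 16)                    # toy batch: 2 sequences, 5 tokens, 16 dims
out = scaled_dot_product_attention(x, x, x)  # self-attention
print(out.shape)                             # torch.Size([2, 5, 16])
```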
Pre-training Objectives
• NLU
• NLG

Pre-training Techniques
• Word Masking
• Sentence / Phrase Masking
• De-noising / Corrupting

Transfer Learning – Language Models
• Transfer learning was enabled through Language Models (LMs) and their re-use for various downstream tasks (summarization, Q/A, inference etc)
• BERT (Bidirectional Encoder Representations from Transformers) [23] was proposed with pre-training on a masked language modelling objective over a large unlabelled corpus (self-supervised learning), based on the Transformer architecture [22]
• Transfer learning (now ubiquitous in NLP) not only provided generalized LMs adaptable to various downstream tasks but also cross-lingual usage benefits; e.g. mBERT (Multilingual BERT, trained on 104 languages)
• The pre-training objective of LMs was extended to cross-lingual training [23]
• Suffers from the disadvantage of under-representation of low-resource languages [25]
STATISTICAL (Lexical + Syntax & Semantics)
• Sentence Scoring / Weighting
• Sentence Fusion / Compression
• TF/IDF, Features
• Clustering
• Probability
• Graphs

MACHINE LEARNING (Lexical + Syntax & Semantics)
• Sentence Scoring / Weighting
• Sentence Fusion / Compression
• Binary Classifiers
• Probability, Entropy, HMM
• Dimensionality Reduction (LSA, LDA, SVM)

DEEP LEARNING (Lexical + Syntax & Semantics + Context)
• Sentence Scoring / Weighting / Fusion / Compression
• Natural Language Generation
• Convolutional Neural Networks (CNN)
• Recurrent Neural Networks (RNN, LSTM, GRUs)
• Transformers
• Pre-Trained LMs using Transformer Architecture
Proceedings of the Tenth International Conference on Language Resources and Evaluation,
European Language Resources Association (ELRA), 2016

• Creation of the Urdu Summary Corpus
• No summarization; only the dataset and its preprocessing
• Dataset
o 50 x records (article and summary)
o Manually written articles & summaries for 8 x different categories

Pros
 1st Urdu dataset for summarization
 Human-generated text
 Preprocessing (POS tagger, stemmer / lemmatizer)
 Publicly available with preprocessing code

Cons
 Small dataset size; 50 x records only
 Not usable for training of ML algorithms
 Claimed as abstractive, however favourable for extractive summarization (paraphrasing of key phrases)
Proceedings of the Seventeenth Mexican International Conference on Artificial Intelligence (MICAI), 2018
• Dataset
o Urdu Summary Corpus; 50 x human-written articles and summaries
• Methodology
o Sentence weight algorithm using word probability; non-ML statistical method
o Additional position weights also allocated
• Evaluation – Rouge-1 F Score
o Claimed score of 0.59

Pros
 Simple method using TF and probability with the addition of a position feature

Cons
 Small dataset size; 50 x records only
 Non-ML statistical method
 Unrealistically high evaluation scores compared to a rich-resource language like English (SOTA Rouge F score is below 0.50), mainly due to the small dataset
Information Processing & Management Journal (Volume 57, Issue 6, November 2020)
• Dataset
o Additional human-written extractive summaries added to the previous dataset of 50 x records, i.e. the Urdu Summary Corpus
• Methodology
o Sentence weight algorithm using weighted TF/IDF; non-ML statistical method
o ML-based embedding model for learning vocabulary on 600 articles, later used in the sentence weight algorithm
• Evaluation – Rouge-1 F Score

Dataset      | Sentence Wt | Wt TF | VSM  | TextRank | Distributional Semantic Model
Abstractive  | 0.36        | 0.37  | 0.37 | 0.39     | 0.35
Extractive   | 0.80        | 0.76  | 0.62 | 0.77     | 0.57

Pros
 Comparison of various statistical methods
 Local weights and globally learned weights
 Graph-based method also used, i.e. TextRank

Cons
 Small dataset size; 50 x records only
 Non-ML statistical methods
 Unrealistically high evaluation scores compared to a rich-resource language like English (SOTA is below 0.50), mainly due to the small dataset
 Rouge score for abstractive summaries has limitations (newly generated words)
Mohammad Ali Jinnah University International Conference on Computing (MAJICC), 2021
• Dataset
o 15 x articles and summaries (not publicly available)
• Methodology
o Comparison of existing statistical methods
o Only extractive methods are compared
• Evaluation – Rouge-1 F Score

Method: Reduction | KL Divergence | Sum Basics | Edmundson | TextRank | LSA  | Luhn | LexRank
Score:  0.71      | 0.40          | 0.31       | 0.49      | 0.58     | 0.62 | 0.62 | 0.49

Pros
 Comparison of various statistical methods
 Graph-based methods also used, i.e. TextRank & LexRank

Cons
 Small dataset size; 15 x records only
 Non-ML statistical methods
 Unrealistically high evaluation scores compared to a rich-resource language like English (SOTA is below 0.50), due to the extremely small dataset
 No details of implementation, nor a publicly available dataset
• Despite Urdu being a popular language, i.e. the 10th most spoken language [1], and our national language, there is little to no research available in the fields of summarization and NLP
• Only one dataset [23] is available for Urdu summarization, which has only 50 records, making it unsuitable for training any machine learning based algorithm
• No research on Urdu summarization using ML (Deep Learning)
• Urdu summarization models are required that can be utilized in real time for summarizing content from social media networks and news to books and lectures; the same is required as a baseline for future research

Problem areas: No research on Urdu summarization based on ML (Deep Learning) models | Non-availability of Urdu summarization datasets | Non-availability of Urdu summarization models
• A methodological framework for efficiently utilizing deep learning based pre-trained language models, trained on multiple languages, for summarization of a low-resource language in low-resource settings.

• Creation of a summarization dataset in the Urdu language from a publicly available source, which can be replicated for other low-resource languages. The news domain is chosen because of its availability in multiple languages. The created dataset is the first and largest summarization dataset of its kind for Urdu.

• A baseline low-resource summarization model fine-tuned on the newly created dataset, with evaluation scores competitive with high-resource languages like English (using multiple evaluations: ROUGE, BERTScore, Human).
• Only 1 x “Urdu Summary Corpus” dataset of 50 records is publicly available
• Creation of large sets of human-written summaries is:
o Resource intensive
o Time consuming

• News domain; a suitable choice
- Publicly available - Easily collectable - Multilingual - Authenticity - Variety of sources

• BBC Urdu1 was selected for creation of our own dataset
o It has 2-3 line human-written summaries available along with articles
o Dataset size – BBC Urdu: 1.3k records of article / summary pairs

1 BBC Urdu - https://www.bbc.com/urdu


• Pre-processing & Cleaning
o Only text-based articles were included
o Links / URLs were removed (e.g. links to associated articles)
o Picture captions were also removed
o Compression ratio was calculated using tokenized lengths (using spaCy's word-based tokenizer); records with a ratio of more than 50% were removed (a sketch follows below)

(Figure: examples of URL removal and picture caption removal)
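A rough sketch of the cleaning and filtering steps above (the actual preprocessing scripts are not shown in the slides; spaCy's blank multi-language pipeline is assumed here purely for tokenization):

```python
# Sketch of the described cleaning: strip URLs, compute a word-based
# compression ratio with a spaCy tokenizer, and drop records whose summary
# is more than 50% of the article length.
import re
import spacy

nlp = spacy.blank("xx")                      # blank multi-language tokenizer
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean(text):
    return URL_RE.sub("", text).strip()

def compression_ratio(article, summary):
    a, s = len(nlp(clean(article))), len(nlp(clean(summary)))
    return 100.0 * s / a if a else 100.0

def keep_record(article, summary, max_ratio=50.0):
    return compression_ratio(article, summary) <= max_ratio
```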


• mBERT based models are limited to processing input text of up to 512 tokens
• Articles longer than 512 tokens are truncated automatically

• Although mT5 based models have no theoretical limitation on input length, memory consumption increases rapidly with longer inputs

• To cater for the limitation of mBERT and considering the memory consumption of mT5:
o Articles longer than 512 tokens were truncated to 512 tokens
o The Rouge-1 recall measure between the summary and each article paragraph was used to rank the paragraphs of an article
o Low-recall paragraphs were removed until the length of the article came down to 512 tokens (a sketch follows below)
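The recall-based truncation can be sketched as follows; a simple whitespace-token unigram recall stands in for Rouge-1 recall to keep the example language-agnostic:

```python
# Sketch of recall-based truncation: rank an article's paragraphs by the
# unigram recall of the reference summary against each paragraph, and drop
# the lowest-ranked paragraphs until the article fits within 512 tokens.
from collections import Counter

def unigram_recall(summary, paragraph):
    s, p = Counter(summary.split()), Counter(paragraph.split())
    overlap = sum(min(c, p[w]) for w, c in s.items())
    return overlap / max(sum(s.values()), 1)

def truncate_by_recall(paragraphs, summary, max_tokens=512):
    kept = list(paragraphs)
    order = sorted(kept, key=lambda p: unigram_recall(summary, p))  # lowest recall first
    while sum(len(p.split()) for p in kept) > max_tokens and order:
        kept.remove(order.pop(0))                 # drop the least relevant paragraph
    return [p for p in paragraphs if p in kept]   # preserve original order
```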
Dataset statistics (token lengths), before vs after preprocessing:

              Before Preprocessing                After Preprocessing
Attribute     Article   Summary  Compression %    Article  Summary  Compression %
Count         1391                                1379
Mean          1451.21   46.32    4.02             482.13   46.59    9.75
Std           710.07    13.49    4.27             32.60    13.07    3.04
Min           35.0      0.0      0.0              85       13       3.33
25%           993.5     37       2.51             473      38       7.76
50%           1360      45       3.39             487      45       9.37
75%           1750      55       4.54             500      55       11.43
Max           7621      105      100              512      105      36.17


• Various word-based embedding models trained on the Urdu language are available
• Only 3 pre-trained multilingual language models are trained on the Urdu language:
o mBERT and MuRIL (trained on NLU)  Extractive Summarization
o mT5 (small, large, XL, XXL; trained on NLU & NLG tasks)  Abstractive Summarization
EXTRACTIVE SUMMARIZATION

Pipeline: Article sentences (S1, S2, …, SN) → sentence embeddings from a pre-trained language model (PLM trained on NLU for the low-resource language) → K-Means clustering → cosine similarity of cluster centroids (C) with the sentences (S) in each cluster → extractive summary (the sentence with the highest similarity per cluster)

*PLM – Pretrained Language Model
ABSTRACTIVE SUMMARIZATION

Pipeline: Multilingual T5 (mT5) → LST5, by loading only the selected low-resource language (LS) vocabulary into the embedding layer of the multilingual model → fine-tuning LST5 on the custom dataset → LST5 (fine-tuned) generates an abstractive summary from the LS article

Evaluation  ROUGE Score, BERT Score, Human Evaluation

*LS – Low-Resource Language; mT5  LST5: loading only selected vocabulary in the multilingual model
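A minimal fine-tuning sketch for the abstractive model using the HuggingFace transformers library; the model name, hyperparameters and dataset variables are illustrative assumptions, not the thesis configuration:

```python
# Fine-tuning an mT5-style encoder-decoder on (article, summary) pairs with
# the standard HuggingFace seq2seq setup. `train_ds` / `eval_ds` are assumed
# to be datasets with "article" and "summary" columns mapped by `preprocess`.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def preprocess(example):
    inputs = tok(example["article"], max_length=512, truncation=True)
    labels = tok(example["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

collator = DataCollatorForSeq2Seq(tok, model=model)   # pads inputs and labels per batch
args = Seq2SeqTrainingArguments(output_dir="urdu-t5-sum",
                                per_device_train_batch_size=4,
                                num_train_epochs=5,
                                predict_with_generate=True)
# trainer = Seq2SeqTrainer(model=model, args=args, data_collator=collator,
#                          train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```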
• ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [22]
o Calculated using word overlaps between the reference and the generated summary (Rouge-1 = unigram overlap), reported as recall, precision and F-score

Pros
 Most common evaluation metric
 Recently, semantic-score based methods have evolved from it

Cons
 Abstractive summarization produces out-of-vocabulary words not present in the source text
 A summary with a low Rouge score may be more accurate and of better quality

 Hence, semantic scores like BERT Score are also used
 Human evaluation of sample summaries (generated vs reference)
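ROUGE can be computed with the rouge-score package, for example; note that its default tokenizer targets Latin-script text, so evaluation on Urdu may require a language-appropriate tokenizer:

```python
# ROUGE evaluation sketch: ROUGE-1 counts unigram overlap between a generated
# summary and its reference; precision, recall and F1 are reported.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score("the cat sat on the mat",          # reference summary
                      "a cat was sitting on the mat")    # generated summary
print(scores["rouge1"].precision, scores["rouge1"].recall, scores["rouge1"].fmeasure)
```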
• BERT Score (Bidirectional Encoder Representations from Transformers Score) [22]
o Calculated using the semantic similarity of contextual embeddings between the reference and the generated summary, reported as recall, precision and F-score

Pros
 Recently evolved method
 Overcomes the shortcoming of exact term matching / overlapping

Cons
 Dependent on pre-trained models for calculating embeddings
 Accuracy of BERT Score depends on the accuracy of the pre-trained embeddings used for calculating similarity

 Hence, human evaluation of sample summaries (generated vs reference) is also used
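BERTScore can be computed with the bert-score package; for a non-English language such as Urdu the package falls back to a multilingual encoder by default (an assumption about its default model mapping):

```python
# BERTScore sketch: generated and reference summaries are compared through
# the cosine similarity of contextual embeddings rather than exact overlap.
from bert_score import score

cands = ["a cat was sitting on the mat"]   # generated summaries
refs = ["the cat sat on the mat"]          # reference summaries
P, R, F1 = score(cands, refs, lang="ur")   # non-English lang -> multilingual BERT by default
print(float(P.mean()), float(R.mean()), float(F1.mean()))
```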
Extractive Summarization

 The recall-truncated version of the dataset showed better results compared to the original version, which was being truncated automatically after 512 tokens by BERT based models

Truncated versions of the dataset showed better results


Extractive Summarization

 mBERT: trained over 104 languages; base - 110M parameters, approx. ~681 MB
 MuRIL: trained over 17 Indian languages; base - 236M parameters, approx. ~909 MB; large approx. ~1.89 GB
o Training data also contained translated and transliterated documents for cross-lingual training
MuRIL shows only a minor difference in evaluation score from mBERT, despite the difference in model size
Extractive Summarization

• Due to the absence of models for low-resource languages, mostly multilingual models are used
• Loading only the monolingual vocabulary into a multilingual model, since most of the parameters lie in the embedding layer (sketch of the idea below)
 Reduction in size of up to 48%; mBERT from 681 MB to 354 MB
Geotrend-BERT-base gives results equivalent to mBERT-base while being almost half its original size
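The idea behind this size reduction can be sketched as follows: keep only the embedding rows for tokens that actually occur in target-language text (the released Geotrend models implement this properly; `urdu_corpus` below is a hypothetical placeholder for an iterable of Urdu texts):

```python
# Concept sketch of vocabulary reduction: most multilingual-model parameters
# sit in the embedding matrix, so keeping only the rows for tokens seen in
# target-language text shrinks the model substantially.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def used_token_ids(texts, tokenizer):
    keep = set(tokenizer.all_special_ids)              # always keep special tokens
    for t in texts:
        keep.update(tokenizer(t)["input_ids"])
    return sorted(keep)

def shrink_embeddings(model, token_ids):
    old = model.get_input_embeddings().weight.data      # (vocab_size, hidden)
    new = torch.nn.Embedding(len(token_ids), old.size(1))
    new.weight.data = old[token_ids].clone()            # copy only the kept rows
    model.set_input_embeddings(new)
    return model  # note: tokenizer ids must be remapped to the new row order

# ids = used_token_ids(urdu_corpus, tok)   # urdu_corpus: placeholder corpus
# small_model = shrink_embeddings(model, ids)
```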
Abstractive Summarization

Urdu vocabulary collected from:
 Our own summarization dataset
 1M Urdu News Dataset (DOI: 10.17632/834vsxnb99.3)

• Memory utilization during training was improved by loading only the monolingual vocabulary (Urdu, comprising ~40k tokens) into a multilingual model (mT5-base, having 250k tokens across 101 languages), similar to Geotrend/BERT

• Reduction in size of up to 44.78%; mT5-base of ~2.17 GB reduced to ~1.04 GB (Urdu T5; urT5)

The urT5 model gives results equivalent to the mT5 model while being about 55% of its size
Abstractive Summarization

 Zero-shot – poor results, as the model was pre-trained on NLU and NLG tasks (not on summarization)
 50% dataset – 0.89 lower Rouge-1 F Score; a larger number of training examples increases capability
 50% training epochs – 2.5 epochs instead of 5 shows a minor improvement (0.11 Rouge-1 F Score)
 More training epochs do not necessarily mean a more effective model
Extractive Summarization

• Summaries with the top evaluation scores were found to have the maximum number of terms extracted from the articles
o Increasing term overlap and hence automated evaluation scores
Abstractive Summarization

 Training was carried out on the joint dataset, however testing was carried out on the joint as well as separate subsets
 BBC Urdu evaluation was high, however comparatively lower than Extractive Summarization
 The lower score indicates the generative capability of T5 based models (Extractive: 48.7 vs Abstractive: 46.3)
• Considering the lack of research in low-resource summarization:

o A dataset has been created from a publicly available source of Urdu news (which can easily be replicated for other languages)

o Utilizing multilingual pre-trained models, a framework was adopted for monolingual use in a low-resource environment (the urT5 model has about half the size while retaining high results)

o Evaluation results are competitive with the high-resource language English

 Rouge-1 scores of 47.21, 45.14 and 38.81 are claimed by PEGASUS [70], BART [62] and BERTSumExtAbs [64] respectively; however, they cannot be compared in a true sense, as results are dataset dependent
• Dataset. Creation of quality datasets for training and evaluation, including multi-domain datasets (news, books, reviews etc) and cross-lingual parallel datasets (the same text in multiple languages)

• Low Resource. A modular approach towards models which can work in low-resource settings for specific tasks (similar to the technique used in creating urT5 from mT5)

• Complex Language. Generic models able to understand complex language such as idioms and sarcasm (also verified on low-resource languages like Urdu)

• Evaluation Metrics. New automated evaluation metrics overcoming the shortfalls of existing metrics, as highlighted.
