The document discusses the challenges of information overload and the role of Natural Language Processing (NLP) in automatic summarization, particularly focusing on Urdu language texts. It highlights the lack of research and datasets for Urdu summarization and proposes a framework for creating a large dataset and utilizing deep learning models for effective summarization. The study aims to develop a competitive summarization model for low-resource languages, leveraging pre-trained multilingual models and addressing the limitations of existing approaches.

NLP-Driven Summarization of Local Language Texts:

A Pre-trained Model Approach


Subayyal Sheikh (2130-6003)

Supervisor
Asst Prof Dr. Yasir Jan

1 OVERVIEW

2 EVOLUTION OF SUMMARIZATION

3 URDU SUMMARIZATION

4 EXPERIMENTAL RESULTS

5 FUTURE WORK

6 CONCLUSION

7 QUERIES
• Increasing influx of data has created one of the biggest problems: Information Overload

• Popularity of social media and news platforms: the content being created is overwhelming to users

• Biggest challenge: sifting through this content and extracting meaningful information

• Information Extraction / Retrieval (IE / IR) and Natural Language Processing (NLP) offer ways to address this challenge
 NLP is a subfield of linguistics & computer science with the addition of artificial intelligence
• Natural Language Understanding (NLU) - understanding and extracting meaningful insights from natural languages
• Natural Language Generation (NLG) - generating content similar to human language for desired tasks

 Common NLP tasks: Part of Speech Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis, Question Answering, Language Modelling, Machine Translation, Automatic Summarization, Natural Language Inference (NLI), Semantic Textual Similarity, Speech Recognition, Speaker Recognition, Document Classification
 Automatic Summarization is the process of extracting only the meaningful information from a text, reducing its length while preserving the information contained in it
 Summarization can be categorized on various criteria

 Early statistical and rule-based approaches to summarization include the use of Term Frequency (TF) [3] and Inverse Document Frequency (IDF) [4]
 These methods were enhanced by using additional features (position, cue words, headlines) [5], clustering [6] and probability-based models [7]
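As a rough illustration of this family of methods (a sketch only, not the exact algorithms of [3]-[7]), sentences can be scored by the TF-IDF weights of their terms and the top-ranked ones extracted:

```python
# Minimal sketch of TF-IDF based extractive scoring (illustrative only).
# Sentences are scored by the mean TF-IDF weight of their terms and the
# top-k are returned in their original document order.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def tfidf_summary(sentences, k=3):
    vec = TfidfVectorizer()
    X = vec.fit_transform(sentences)             # one row of TF-IDF weights per sentence
    scores = np.asarray(X.mean(axis=1)).ravel()  # mean weight as a crude sentence score
    top = sorted(np.argsort(scores)[-k:])        # keep original sentence order
    return [sentences[i] for i in top]

doc = [
    "Information overload is a growing problem.",
    "Automatic summarization condenses text while preserving meaning.",
    "Early systems scored sentences with term frequency and IDF.",
    "Position and cue words were later added as features.",
]
print(tfidf_summary(doc, k=2))
```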

 Popular graph-based approaches include TextRank [8] and LexRank [9], derived from Google’s PageRank algorithm

(Figure: TextRank & LexRank – conceptual depiction of ranking sentences using similarity measures)
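A minimal TextRank-style sketch (conceptual only, not the exact published algorithms): sentences become graph nodes, edge weights are pairwise similarities, and PageRank supplies the ranking:

```python
# TextRank-style ranking: build a sentence similarity graph and rank the
# sentences with PageRank (conceptual illustration of [8]/[9]).
import networkx as nx
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, k=2):
    X = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(X)                 # sentence-to-sentence similarity matrix
    np.fill_diagonal(sim, 0.0)                 # ignore self-similarity
    graph = nx.from_numpy_array(sim)           # weighted undirected graph
    ranks = nx.pagerank(graph, weight="weight")
    top = sorted(sorted(ranks, key=ranks.get, reverse=True)[:k])
    return [sentences[i] for i in top]
```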
Preprocessing (Training Data):
• Stop Words Removal
• Stemming / Lemmatization
• Unwanted Characters Removal (URLs etc)
• Sentence / Phrase / Word Splitting

Model (Training):
• Binary Classifiers
• Probability Based Regression (HMM)
• Dimensionality Reduction (LSA, LDA)
• Graph based (TextRank, LexRank, ADGs)
• Clustering

Summary (Output):
• Sentence Scoring based Extraction
• Phrase Extraction
• Sentence Compression / Fusion
• Abstractive Generation (Limited)

Example pipeline: Latent Dirichlet Allocation (LDA) based Topic Detection → TF-IDF (Topic Terms) in Sentences of Document(s) → Clustering of Sentences based on occurrence of Topic Terms → Selection of Sentences nearer to Cluster Centroids
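The depicted pipeline can be sketched roughly as follows (parameter choices are illustrative; the slides do not prescribe this exact code):

```python
# Rough sketch of the example pipeline: LDA topic detection, clustering of
# sentences over their topic distributions, then picking the sentence closest
# to each cluster centroid.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
import numpy as np

def lda_cluster_summary(sentences, n_topics=3, n_clusters=3):
    counts = CountVectorizer().fit_transform(sentences)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    topic_dist = lda.fit_transform(counts)        # per-sentence topic distribution
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(topic_dist)
    summary = []
    for c in range(n_clusters):                   # nearest sentence to each centroid
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(topic_dist[idx] - km.cluster_centers_[c], axis=1)
        summary.append(sentences[idx[dists.argmin()]])
    return summary
```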
• Prior to deep learning, summarization was achieved through word-based models, treating text as a bag of words or scaling such models to sequences (without contextual information)
• With deep learning based sequence-to-sequence models [17] [18] [19] [20] [21] (Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN)), syntactic and semantic analysis was enriched with context

Syntax / Lexical – Words & Structure – e.g. POS Tagging
Semantics – Meanings of Words – e.g. Similarity
Context – Dependencies & Relationships between words in a Sequence – e.g. Seq2Seq Models

Context example:
• He went to the bank for depositing his savings
• He went to the bank of a river for a walk
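A small sketch of what "context" buys: a contextual encoder (here plain BERT via the HuggingFace transformers library, used as an assumed stand-in for any contextual model) assigns different vectors to the same word "bank" in the two sentences above:

```python
# Demo of context-dependence: the token "bank" receives different contextual
# vectors in the two example sentences, so their cosine similarity is < 1.0.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")

def word_vector(sentence, word):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]      # (seq_len, hidden_dim)
    tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)]                   # vector of the first match

v1 = word_vector("He went to the bank for depositing his savings", "bank")
v2 = word_vector("He went to the bank of a river for a walk", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))           # noticeably below 1.0
```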
Document(s) / Sentences → Recurrent Neural Networks (RNNs, LSTMs, GRUs) → Summary

• Comprises an RNN based Encoder-Decoder
• Sequential models
• Sequential attention mechanisms cater for dependencies (context) between tokens
Parallelization instead of a sequential attention mechanism

• Seq2Seq models lacked parallelization, creating bottlenecks
• The Transformer, based on the attention mechanism, allowed parallelization to remove this bottleneck
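At the core of the Transformer is scaled dot-product attention, in which all positions attend to each other in one matrix operation rather than step by step; a minimal sketch:

```python
# Minimal scaled dot-product attention (the core Transformer operation):
# every position attends to every other position in a single matrix product,
# which is what makes parallelization possible (no sequential recurrence).
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)              # attention distribution
    return weights @ v                                    # context-mixed representations

x = torch.randn(2, 5, 16)                    # toy batch: 2 sequences, 5 tokens, 16 dims
out = scaled_dot_product_attention(x, x, x)  # self-attention
print(out.shape)                             # torch.Size([2, 5, 16])
```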
Pre-training Objectives
• NLU
• NLG

Pre-training Techniques
• Word Masking
• Sentence / Phrase Masking
• De-noising / Corrupting

Transfer Learning – Language Models
• Transfer learning was enabled through Language Models (LMs) and their re-use for various downstream tasks (summarization, Q/A, inference etc)
• BERT (Bidirectional Encoder Representations from Transformers) [23] was proposed with pre-training on a masked language modelling objective over a large unlabelled corpus (self-supervised learning), based on the Transformer architecture [22]
• Transfer learning (now ubiquitous in NLP) not only provided generalized LMs adaptable to various downstream tasks but also cross-lingual usage benefits; e.g. mBERT (Multilingual BERT, trained on 104 languages)
• The pre-training objective of LMs was extended to cross-lingual training [23]
• Suffers from the disadvantage of under-representation of low-resource languages [25]
STATISTICAL (Lexical + Syntax & Semantics)
• Sentence Scoring / Weighting
• Sentence Fusion / Compression
• TF/IDF, Features
• Clustering
• Probability
• Graphs

MACHINE LEARNING (Lexical + Syntax & Semantics)
• Sentence Scoring / Weighting
• Sentence Fusion / Compression
• Binary Classifiers
• Probability, Entropy, HMM
• Dimensionality Reduction (LSA, LDA, SVM)

DEEP LEARNING (Lexical + Syntax & Semantics + Context)
• Sentence Scoring / Weighting / Fusion / Compression
• Natural Language Generation
• Convolutional Neural Networks (CNN)
• Recurrent Neural Networks (RNN, LSTM, GRUs)
• Transformers
• Pre-Trained LMs using Transformer Architecture
Proceedings of the Tenth International Conference on Language Resources and Evaluation,
European Language Resources Association (ELRA), 2016

• Creation of the Urdu Summary Corpus
• No summarization; only the dataset and its preprocessing
• Dataset
o 50 x records (article and summary)
o Manually written articles & summaries for 8 x different categories

Pros
 1st Urdu dataset for summarization
 Human-generated text
 Preprocessing (POS tagger, stemmer / lemmatizer)
 Publicly available with preprocessing code

Cons
 Small dataset size; 50 x records only
 Not usable for training of ML algorithms
 Claimed as abstractive, however favourable for extractive summarization (paraphrasing of key phrases)
Proceedings of the Seventeenth Mexican International Conference on Artificial Intelligence (MICAI), 2018
• Dataset
o Urdu Summary Corpus; 50 x human-written articles and summaries
• Methodology
o Sentence weight algorithm using word probability; non-ML statistical method
o Additional position weights also allocated
• Evaluation – Rouge-1 F Score
o Claimed score of 0.59

Pros
 Simple method using TF and probability with the addition of a position feature

Cons
 Small dataset size; 50 x records only
 Non-ML statistical method
 Unrealistically high evaluation scores compared to a rich-resource language like English (SOTA Rouge F score is below 0.50), mainly due to the small dataset
Information Processing & Management Journal (Volume 57, Issue 6, November 2020)
• Dataset
o Additional human-written extractive summaries added to the previous dataset of 50 x records, i.e. the Urdu Summary Corpus
• Methodology
o Sentence weight algorithm using weighted TF/IDF; non-ML statistical method
o ML-based embedding model for learning vocabulary on 600 articles, later used in the sentence weight algorithm
• Evaluation – Rouge-1 F Score

Dataset      | Sentence Wt | Wt TF | VSM  | TextRank | Distributional Semantic Model
Abstractive  | 0.36        | 0.37  | 0.37 | 0.39     | 0.35
Extractive   | 0.80        | 0.76  | 0.62 | 0.77     | 0.57

Pros
 Comparison of various statistical methods
 Local weights and globally learned weights
 Graph-based method also used, i.e. TextRank

Cons
 Small dataset size; 50 x records only
 Non-ML statistical methods
 Unrealistically high evaluation scores compared to a rich-resource language like English (SOTA is below 0.50), mainly due to the small dataset
 Rouge score for abstractive summaries has limitations (newly generated words)
Mohammad Ali Jinnah University International Conference on Computing (MAJICC), 2021
• Dataset
o 15 x articles and summaries (not publicly available)
• Methodology
o Comparison of existing statistical methods
o Only extractive methods are compared
• Evaluation – Rouge-1 F Score

Method: Reduction | KL Divergence | Sum Basics | Edmundson | TextRank | LSA  | Luhn | LexRank
Score:  0.71      | 0.40          | 0.31       | 0.49      | 0.58     | 0.62 | 0.62 | 0.49

Pros
 Comparison of various statistical methods
 Graph-based methods also used, i.e. TextRank & LexRank

Cons
 Small dataset size; 15 x records only
 Non-ML statistical methods
 Unrealistically high evaluation scores compared to a rich-resource language like English (SOTA is below 0.50), due to the extremely small dataset
 No details of implementation, nor a publicly available dataset
• Despite Urdu being a popular language, i.e. the 10th most spoken language [1], and our national language, there is little to no research available in the fields of summarization and NLP
• Only one dataset [23] is available for Urdu summarization, which has only 50 records, making it unsuitable for training any machine learning based algorithm
• No research on Urdu summarization using ML (Deep Learning)
• Urdu summarization models are required that can be utilized in real time for summarizing content from social media networks and news to books and lectures; the same is required as a baseline for future research

Problem areas: No research on Urdu summarization based on ML (Deep Learning) models | Non-availability of Urdu summarization datasets | Non-availability of Urdu summarization models
• A methodological framework for efficiently utilizing deep learning based pre-trained language models, trained on multiple languages, for summarization of a low-resource language in low-resource settings.

• Creation of a summarization dataset in the Urdu language from a publicly available source, which can be replicated for other low-resource languages. The news domain is chosen because of its availability in multiple languages. The created dataset is the first and largest summarization dataset of its kind for Urdu.

• A baseline low-resource summarization model fine-tuned on the newly created dataset, with evaluation scores competitive with high-resource languages like English (using multiple evaluations: ROUGE, BERTScore, Human).
• Only 1 x “Urdu Summary Corpus” dataset of 50 records is publicly available
• Creation of large sets of human-written summaries is:
o Resource intensive
o Time consuming

• News domain; a suitable choice
- Publicly available - Easily collectable - Multilingual - Authenticity - Variety of sources

• BBC Urdu1 was selected for creation of our own dataset
o It has 2-3 line human-written summaries available along with articles
o Dataset size – BBC Urdu: 1.3k records of article / summary pairs

1 BBC Urdu - https://www.bbc.com/urdu


• Pre-processing & Cleaning
o Only text-based articles were included
o Links / URLs were removed (e.g. links to associated articles)
o Picture captions were also removed
o Compression ratio was calculated using tokenized lengths (using spaCy's word-based tokenizer); records with a ratio of more than 50% were removed (a sketch follows below)

(Figure: examples of URL removal and picture caption removal)
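A rough sketch of the cleaning and filtering steps above (the actual preprocessing scripts are not shown in the slides; spaCy's blank multi-language pipeline is assumed here purely for tokenization):

```python
# Sketch of the described cleaning: strip URLs, compute a word-based
# compression ratio with a spaCy tokenizer, and drop records whose summary
# is more than 50% of the article length.
import re
import spacy

nlp = spacy.blank("xx")                      # blank multi-language tokenizer
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean(text):
    return URL_RE.sub("", text).strip()

def compression_ratio(article, summary):
    a, s = len(nlp(clean(article))), len(nlp(clean(summary)))
    return 100.0 * s / a if a else 100.0

def keep_record(article, summary, max_ratio=50.0):
    return compression_ratio(article, summary) <= max_ratio
```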


• mBERT based models are limited to processing input text of up to 512 tokens
• Articles longer than 512 tokens are truncated automatically

• Although mT5 based models have no theoretical limitation on input length, memory consumption increases rapidly with longer inputs

• To cater for the limitation of mBERT and considering the memory consumption of mT5:
o Articles longer than 512 tokens were truncated to 512 tokens
o The Rouge-1 recall measure between the summary and each article paragraph was used to rank the paragraphs of an article
o Low-recall paragraphs were removed until the length of the article came down to 512 tokens (a sketch follows below)
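The recall-based truncation can be sketched as follows; a simple whitespace-token unigram recall stands in for Rouge-1 recall to keep the example language-agnostic:

```python
# Sketch of recall-based truncation: rank an article's paragraphs by the
# unigram recall of the reference summary against each paragraph, and drop
# the lowest-ranked paragraphs until the article fits within 512 tokens.
from collections import Counter

def unigram_recall(summary, paragraph):
    s, p = Counter(summary.split()), Counter(paragraph.split())
    overlap = sum(min(c, p[w]) for w, c in s.items())
    return overlap / max(sum(s.values()), 1)

def truncate_by_recall(paragraphs, summary, max_tokens=512):
    kept = list(paragraphs)
    order = sorted(kept, key=lambda p: unigram_recall(summary, p))  # lowest recall first
    while sum(len(p.split()) for p in kept) > max_tokens and order:
        kept.remove(order.pop(0))                 # drop the least relevant paragraph
    return [p for p in paragraphs if p in kept]   # preserve original order
```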
Dataset statistics (token lengths), before vs after preprocessing:

              Before Preprocessing                After Preprocessing
Attribute     Article   Summary  Compression %    Article  Summary  Compression %
Count         1391                                1379
Mean          1451.21   46.32    4.02             482.13   46.59    9.75
Std           710.07    13.49    4.27             32.60    13.07    3.04
Min           35.0      0.0      0.0              85       13       3.33
25%           993.5     37       2.51             473      38       7.76
50%           1360      45       3.39             487      45       9.37
75%           1750      55       4.54             500      55       11.43
Max           7621      105      100              512      105      36.17


• Various word-based embedding models trained on the Urdu language are available
• Only 3 pre-trained multilingual language models are trained on the Urdu language:
o mBERT and MuRIL (trained on NLU)  Extractive Summarization
o mT5 (small, large, XL, XXL; trained on NLU & NLG tasks)  Abstractive Summarization
EXTRACTIVE SUMMARIZATION

Pipeline: Article sentences (S1, S2, …, SN) → sentence embeddings from a pre-trained language model (PLM trained on NLU for the low-resource language) → K-Means clustering → cosine similarity of cluster centroids (C) with the sentences (S) in each cluster → extractive summary (the sentence with the highest similarity per cluster)

*PLM – Pretrained Language Model
ABSTRACTIVE SUMMARIZATION

Pipeline: Multilingual T5 (mT5) → LST5, by loading only the selected low-resource language (LS) vocabulary into the embedding layer of the multilingual model → fine-tuning LST5 on the custom dataset → LST5 (fine-tuned) generates an abstractive summary from the LS article

Evaluation  ROUGE Score, BERT Score, Human Evaluation

*LS – Low-Resource Language; mT5  LST5: loading only selected vocabulary in the multilingual model
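A minimal fine-tuning sketch for the abstractive model using the HuggingFace transformers library; the model name, hyperparameters and dataset variables are illustrative assumptions, not the thesis configuration:

```python
# Fine-tuning an mT5-style encoder-decoder on (article, summary) pairs with
# the standard HuggingFace seq2seq setup. `train_ds` / `eval_ds` are assumed
# to be datasets with "article" and "summary" columns mapped by `preprocess`.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def preprocess(example):
    inputs = tok(example["article"], max_length=512, truncation=True)
    labels = tok(example["summary"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

collator = DataCollatorForSeq2Seq(tok, model=model)   # pads inputs and labels per batch
args = Seq2SeqTrainingArguments(output_dir="urdu-t5-sum",
                                per_device_train_batch_size=4,
                                num_train_epochs=5,
                                predict_with_generate=True)
# trainer = Seq2SeqTrainer(model=model, args=args, data_collator=collator,
#                          train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```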
• ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [22]
o Calculated using word overlaps between the reference and the generated summary (Rouge-1 = unigram overlap), reported as recall, precision and F-score

Pros
 Most common evaluation metric
 Recently, semantic-score based methods have evolved from it

Cons
 Abstractive summarization produces out-of-vocabulary words not present in the source text
 A summary with a low Rouge score may be more accurate and of better quality

 Hence, semantic scores like BERT Score are also used
 Human evaluation of sample summaries (generated vs reference)
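ROUGE can be computed with the rouge-score package, for example; note that its default tokenizer targets Latin-script text, so evaluation on Urdu may require a language-appropriate tokenizer:

```python
# ROUGE evaluation sketch: ROUGE-1 counts unigram overlap between a generated
# summary and its reference; precision, recall and F1 are reported.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)
scores = scorer.score("the cat sat on the mat",          # reference summary
                      "a cat was sitting on the mat")    # generated summary
print(scores["rouge1"].precision, scores["rouge1"].recall, scores["rouge1"].fmeasure)
```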
• BERT Score (Bidirectional Encoder Representations from Transformers Score) [22]
o Calculated using the semantic similarity of contextual embeddings between the reference and the generated summary, reported as recall, precision and F-score

Pros
 Recently evolved method
 Overcomes the shortcoming of exact term matching / overlapping

Cons
 Dependent on pre-trained models for calculating embeddings
 Accuracy of BERT Score depends on the accuracy of the pre-trained embeddings used for calculating similarity

 Hence, human evaluation of sample summaries (generated vs reference) is also used
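BERTScore can be computed with the bert-score package; for a non-English language such as Urdu the package falls back to a multilingual encoder by default (an assumption about its default model mapping):

```python
# BERTScore sketch: generated and reference summaries are compared through
# the cosine similarity of contextual embeddings rather than exact overlap.
from bert_score import score

cands = ["a cat was sitting on the mat"]   # generated summaries
refs = ["the cat sat on the mat"]          # reference summaries
P, R, F1 = score(cands, refs, lang="ur")   # non-English lang -> multilingual BERT by default
print(float(P.mean()), float(R.mean()), float(F1.mean()))
```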
Extractive Summarization

 The recall-truncated version of the dataset showed better results compared to the original version, which was being truncated automatically after 512 tokens by BERT based models

Truncated versions of the dataset showed better results


Extractive Summarization

 mBERT: trained over 104 languages; base - 110M parameters, approx. ~681 MB
 MuRIL: trained over 17 Indian languages; base - 236M parameters, approx. ~909 MB; large approx. ~1.89 GB
o Training data also contained translated and transliterated documents for cross-lingual training
MuRIL shows only a minor difference in evaluation score from mBERT, despite the difference in model size
Extractive Summarization

• Due to the absence of models for low-resource languages, mostly multilingual models are used
• Loading only the monolingual vocabulary into a multilingual model, since most of the parameters lie in the embedding layer (sketch of the idea below)
 Reduction in size of up to 48%; mBERT from 681 MB to 354 MB
Geotrend-BERT-base gives results equivalent to mBERT-base while being almost half its original size
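The idea behind this size reduction can be sketched as follows: keep only the embedding rows for tokens that actually occur in target-language text (the released Geotrend models implement this properly; `urdu_corpus` below is a hypothetical placeholder for an iterable of Urdu texts):

```python
# Concept sketch of vocabulary reduction: most multilingual-model parameters
# sit in the embedding matrix, so keeping only the rows for tokens seen in
# target-language text shrinks the model substantially.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def used_token_ids(texts, tokenizer):
    keep = set(tokenizer.all_special_ids)              # always keep special tokens
    for t in texts:
        keep.update(tokenizer(t)["input_ids"])
    return sorted(keep)

def shrink_embeddings(model, token_ids):
    old = model.get_input_embeddings().weight.data      # (vocab_size, hidden)
    new = torch.nn.Embedding(len(token_ids), old.size(1))
    new.weight.data = old[token_ids].clone()            # copy only the kept rows
    model.set_input_embeddings(new)
    return model  # note: tokenizer ids must be remapped to the new row order

# ids = used_token_ids(urdu_corpus, tok)   # urdu_corpus: placeholder corpus
# small_model = shrink_embeddings(model, ids)
```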
Abstractive Summarization

Urdu vocabulary collected from:
 Our own summarization dataset
 1M Urdu News Dataset (DOI: 10.17632/834vsxnb99.3)

• Memory utilization during training was improved by loading only the monolingual vocabulary (Urdu, comprising ~40k tokens) into a multilingual model (mT5-base, having 250k tokens across 101 languages), similar to Geotrend/BERT

• Reduction in size of up to 44.78%; mT5-base of ~2.17 GB reduced to ~1.04 GB (Urdu T5; urT5)

The urT5 model gives results equivalent to the mT5 model while being about 55% of its size
Abstractive Summarization

 Zero-shot – poor results, as the model was pre-trained on NLU and NLG tasks (not on summarization)
 50% dataset – 0.89 lower Rouge-1 F Score; a larger number of training examples increases capability
 50% training epochs – 2.5 epochs instead of 5 shows a minor improvement (0.11 Rouge-1 F Score)
 More training epochs do not necessarily mean a more effective model
Extractive Summarization

• Summaries with the top evaluation scores were found to have the maximum number of terms extracted from the articles
o Increasing term overlap and hence automated evaluation scores
Abstractive Summarization

 Training was carried out on the joint dataset, however testing was carried out on the joint as well as separate subsets
 BBC Urdu evaluation was high, however comparatively lower than Extractive Summarization
 The lower score indicates the generative capability of T5 based models (Extractive: 48.7 vs Abstractive: 46.3)
• Considering the lack of research in low-resource summarization:

o A dataset has been created from a publicly available source of Urdu news (which can easily be replicated for other languages)

o Utilizing multilingual pre-trained models, a framework was adopted for monolingual use in a low-resource environment (the urT5 model has about half the size while retaining high results)

o Evaluation results are competitive with the high-resource language English

 Rouge-1 scores of 47.21, 45.14 and 38.81 are claimed by PEGASUS [70], BART [62] and BERTSumExtAbs [64] respectively; however, they cannot be compared in a true sense, as results are dataset dependent
• Dataset. Creation of quality datasets for training and evaluation, including multi-domain datasets (news, books, reviews etc) and cross-lingual parallel datasets (the same text in multiple languages)

• Low Resource. A modular approach towards models which can work in low-resource settings for specific tasks (similar to the technique used in creating urT5 from mT5)

• Complex Language. Generic models able to understand complex language such as idioms and sarcasm (also verified on low-resource languages like Urdu)

• Evaluation Metrics. New automated evaluation metrics overcoming the shortfalls of existing metrics, as highlighted.
