
Tribhuvan University

Institute of Science and Technology

A Dissertation Proposal on
NEPALI LEGAL DOCUMENT SUMMARIZATION USING
MULTILINGUAL LARGE LANGUAGE MODELS

Submitted by:
Arpan Sapkota
Reg. No. 3-2-368-168-2016

Submitted to:
School of Mathematical Sciences
Kirtipur, Kathmandu, Nepal

A dissertation proposal submitted in partial fulfillment of the requirements for the
degree of Master in Data Science.

December 2024
CHAPTER 1: INTRODUCTION

1.1 Background
The legal domain in Nepal encompasses a vast array of documents, including statutes,
regulations, and case laws. These documents are often lengthy and complex, making it
challenging for legal professionals, researchers, and the public to quickly grasp their
essential content. This dissertation aims to address this challenge by developing a system
for analyzing Nepali legal documents using multilingual large language models (LLMs).
By leveraging models such as mBART, mT5, and XLM-R, which have been trained on
multiple languages including Nepali, the goal is to create concise and accurate summaries
that enhance the accessibility and comprehension of legal information (Liu et al., 2019;
Tang et al., 2020).
In recent years, the growth of legal documents has necessitated the development of
advanced tools to manage and analyze these texts efficiently. Natural Language Processing
(NLP) techniques have shown significant potential in extracting and condensing critical
information from large datasets (Gupta & Lehal, 2010). This proposal outlines a
dissertation project aimed at leveraging NLP techniques to analyze legal texts from the
Nepal Kanun Patrika (https://nkp.gov.np/). Legal texts are a fundamental resource in the
legal domain, providing crucial information about laws, regulations, and judicial decisions.
The Nepal Kanun Patrika serves as a key repository of such legal texts in Nepal. This
dissertation aims to utilize pre-trained LLMs to enhance the analysis of these texts. The
goal is to make a significant contribution to the legal domain of Nepal by improving the
accessibility of important legal information. The sheer volume and complexity of legal
texts can overwhelm users, making it difficult to retrieve relevant information promptly
(Chalkidis et al., 2020). Traditional methods of manual analysis are time-consuming and
prone to errors. Therefore, there is a pressing need for a system that can condense these text
documents while retaining their core information, extracting key details and generating
concise summaries to improve usability and understanding.
Given the linguistic nuances of Nepali, developing such a system requires advanced
multilingual models capable of understanding and processing the language effectively. This
research will explore various methodologies for summarizing Nepali legal texts using state-
of-the-art LLMs while addressing the unique challenges posed by low-resource languages
(Chaudhary et al., 2020; Conneau et al., 2020). By focusing on these aspects, this
dissertation aims to contribute valuable insights into the field of legal text analysis in
Nepal.

1.1.1 Transformer Architecture

Figure 1.1: Standard Transformer Architecture (Wikipedia contributors, 2024, January 7)

The Transformer architecture relies on self-attention mechanisms that allow the model to
weigh the importance of different words in a sentence relative to one another. This
capability is particularly beneficial for summarization tasks, as it enables the model to
focus on relevant parts of the text while generating coherent summaries. The architecture's
ability to handle long-range dependencies makes it especially suitable for processing
complex legal texts where context is crucial. This architecture enables parallel processing
of data, which significantly enhances training efficiency and performance compared to
previous sequential models such as RNNs (Recurrent Neural Networks) (Vaswani et al., 2017).
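
To make the self-attention computation concrete, the following is a minimal sketch of scaled dot-product self-attention in Python (PyTorch). The projection matrices are random stand-ins for the learned parameters of a real Transformer, and the toy dimensions are chosen purely for illustration:

import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """x: (seq_len, d_model) token embeddings for one sequence."""
    d_model = x.size(-1)
    # Random stand-ins for the learned projections W_q, W_k, W_v.
    w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every token attends to every other token in one step, which is how
    # the architecture captures long-range dependencies in long legal texts.
    scores = q @ k.T / d_model ** 0.5      # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)    # attention distribution per token
    return weights @ v                     # context-aware token representations

x = torch.randn(6, 16)                     # 6 tokens, 16-dimensional embeddings
print(self_attention(x).shape)             # torch.Size([6, 16])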

1.1.2 mBART (Multilingual BART)


mBART is a sequence-to-sequence model that applies the Transformer architecture for
denoising pre-training across multiple languages. It combines the capabilities of both
BART (Bidirectional and Auto-Regressive Transformers) and multilingual processing.
mBART is particularly effective for tasks requiring text generation, such as summarization,
due to its ability to leverage context from both directions of the input text (Tang et al.,
2020).
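
As a concrete illustration, the sketch below loads a publicly available mBART-50 checkpoint through the Hugging Face transformers library and generates a Nepali summary. The checkpoint name, language code, and generation settings are assumptions made for illustration; in this research the model would first be fine-tuned on legal data (see Section 3.1.3):

from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "facebook/mbart-large-50"  # assumed checkpoint; covers Nepali (ne_NP)
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="ne_NP")
model = MBartForConditionalGeneration.from_pretrained(model_name)

document = "..."  # placeholder for a Nepali legal document
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["ne_NP"],  # decode in Nepali
    num_beams=4,
    max_length=150,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))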

1.1.3 XLM-R (Cross-lingual Language Model)


XLM-R is a multilingual Transformer encoder based on RoBERTa, a robustly optimized
variant of BERT, that processes 100 languages. It is trained on large-scale multilingual data and demonstrates state-
of-the-art performance on various NLP benchmarks. XLM-R's self-attention mechanism
allows it to capture contextual relationships across different languages, making it suitable
for tasks in low-resource languages like Nepali (Liu et al., 2019).
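
The sketch below shows one way XLM-R could be used in this work, namely extracting contextual sentence representations, e.g. for scoring candidate sentences in an extractive pipeline. Mean pooling over token states is a common but not mandatory design choice:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

sentence = "नेपालको संविधान"  # "The Constitution of Nepal"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
embedding = hidden.mean(dim=1)                   # mean-pooled sentence vector
print(embedding.shape)                           # torch.Size([1, 768])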

1.1.4 mT5 (Multilingual T5)


mT5 extends the capabilities of the T5 (Text-to-Text Transfer Transformer) framework to
101 languages, utilizing the Transformer architecture for various NLP tasks framed as text-
to-text problems. This versatility allows mT5 to perform tasks such as summarization,
translation, and question answering in multiple languages by treating all tasks uniformly as
text generation problems (Xue et al., 2020).
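
The text-to-text framing can be sketched as follows. The "summarize:" prefix mirrors the T5 convention and is an assumption here; the base mT5 checkpoint is pretrained without supervised tasks and would need the fine-tuning of Section 3.1.3 before producing useful summaries:

from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# Every task is a string-to-string mapping; summarization is just one prefix.
text = "summarize: " + "..."   # placeholder Nepali legal text
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
ids = model.generate(**inputs, num_beams=4, max_length=128)
print(tokenizer.decode(ids[0], skip_special_tokens=True))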

The connection between the Transformer model and its derivatives mBART, XLM-R, and
mT5 lies in their shared architecture and design principles. Each model leverages the self-
attention mechanism of Transformers to enhance performance across multilingual NLP
tasks, including legal document summarization in low-resource languages like Nepali. By
building on this foundational architecture, these models can effectively process complex
legal texts and generate concise summaries that retain essential information.

1.2 Problem Statement
The legal domain in Nepal is overloaded with extensive textual data, which often makes it
challenging to retrieve relevant information promptly. Traditional methods of manual
analysis are time-consuming and prone to errors. This dissertation aims to address these
challenges by applying NLP techniques to Nepali legal texts, thereby enhancing
information retrieval and decision-making processes.
The sheer volume and complexity of legal texts in Nepali make it challenging for legal
professionals, researchers, and the public to efficiently access and interpret this
information. There is a need for a system that can condense these documents while
retaining their core information, extracting key details and generating concise summaries
to improve usability and understanding. However, given the linguistic variations of the
Nepali language, developing such a system requires advanced multilingual models capable
of understanding and processing the language effectively.

1.3 Objectives
1. To develop a summarization model for Nepali legal documents using multilingual
Large Language Models.
2. To evaluate the effectiveness of these multilingual models in producing accurate
and concise summaries of Nepali legal texts.
3. To contribute new methodologies and insights to the field of legal text analysis in
the Nepali language.

1.4 Rationale of the Study


The rationale for this study stems from the pressing need to improve access to legal
information in Nepal, where a vast corpus of legal documents exists, including statutes,
regulations, and case laws. These documents are often lengthy and complex, making it
difficult for legal professionals, researchers, and the general public to quickly grasp their
essential content. Traditional methods of manual analysis are time-consuming and prone to
errors, which can hinder effective decision-making and legal processes. Therefore, there is
a significant demand for advanced tools that can efficiently manage and analyze these legal
texts.
This study aims to leverage multilingual large language models (LLMs) such as mBART,
mT5, and XLM-R to develop a summarization system specifically tailored for Nepali legal
documents. These models have shown promise in handling natural language processing
(NLP) tasks across various languages, including those with limited resources like Nepali
(Liu et al., 2019; Tang et al., 2020). By utilizing these advanced models, the research seeks
to create concise and accurate summaries that enhance the accessibility and comprehension
of legal information.
Moreover, the application of NLP techniques in low-resource languages presents unique
challenges due to the limited availability of annotated data and resources. Existing studies
have highlighted the importance of cross-lingual supervision and unsupervised learning
approaches to address these challenges (Chaudhary et al., 2020; Conneau et al., 2020). This
dissertation will explore these methodologies while focusing on the specific linguistic
nuances of Nepali, ensuring that the summarization system is both effective and relevant.
In addition to improving access to legal information, this study aims to contribute new
methodologies and insights to the field of legal text analysis in the Nepali language. By
developing a summarization model capable of producing informative summaries of legal
documents, the research will not only benefit legal professionals but also empower
ordinary citizens by making legal texts more understandable. Ultimately, this study seeks to
bridge the gap between complex legal language and practical understanding, facilitating
better engagement with the legal system in Nepal.

CHAPTER 2: PRELIMINARY LITERATURE REVIEW

The development of a summarization model for Nepali legal documents necessitates a
comprehensive understanding of several key areas, including multilingual large language
models (LLMs), legal document summarization, and the challenges associated with natural
language processing (NLP) tasks in low-resource languages like Nepali. This literature
review addresses these topics, providing a solid foundation for the proposed research.

2.1 Multilingual Large Language Models


Multilingual large language models, such as XLM-R, mBART, and mT5, have
demonstrated significant promise in handling NLP tasks across various languages,
including those with limited resources like Nepali. Liu et al. (2019) introduced XLM-R, a
robust multilingual version of BERT that is trained on 100 languages. XLM-R has shown
state-of-the-art performance on multiple multilingual benchmarks, making it a strong
candidate for tasks involving low-resource languages. Similarly, mBART, a multilingual
sequence-to-sequence model, has proven effective for text generation tasks, including
summarization. Tang et al. (2020) discuss mBART’s ability to perform denoising pre-
training across multiple languages, which enhances its suitability for summarization tasks
in multilingual contexts. Additionally, mT5, a massively multilingual version of T5,
extends the capabilities of text-to-text transformers to 101 languages, offering a versatile
tool for various summarization tasks (Xue et al., 2020).

2.2 Legal Document Summarization


Legal document summarization is a specialized area of NLP that involves distilling
complex legal texts into concise summaries while retaining critical information. Chalkidis
et al. (2020) developed Legal-BERT, a version of BERT fine-tuned on legal texts
specifically for tasks such as legal text classification and summarization. Their work
highlights the potential of adapting LLMs to the legal domain to improve the accuracy and
relevance of generated summaries. Moreover, Zhong et al. (2020) introduced an innovative
approach for summarizing court opinions iteratively. This method focuses on generating
summaries that are not only concise but also informative, ensuring that critical legal details
are preserved during the summarization process.

2.3 Text Summarization Techniques


Text summarization aims to create concise and coherent summaries from larger texts using
various techniques such as extractive and abstractive summarization. Extractive
summarization involves selecting key sentences from the original text to create a summary,
while abstractive summarization generates new sentences that convey the main ideas of the
text (Gupta & Lehal, 2010; Nenkova & McKeown, 2012). Both techniques have been
employed in various domains, including legal texts. Integrating summarization with named
entity recognition (NER) can further enhance insights into legal documents by identifying
key entities and their relationships within the text.
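
To ground the distinction, a minimal extractive baseline can be sketched as TF-IDF sentence scoring with scikit-learn. This is an illustrative baseline, not the proposed LLM-based approach, and for Nepali a language-aware tokenizer would likely replace the default; abstractive summarization, by contrast, requires a generative model such as mBART or mT5 described above:

from sklearn.feature_extraction.text import TfidfVectorizer

def extractive_summary(sentences: list[str], k: int = 3) -> list[str]:
    """Pick the k sentences with the highest mean TF-IDF weight."""
    tfidf = TfidfVectorizer().fit_transform(sentences)  # (n_sentences, vocab)
    scores = tfidf.mean(axis=1).A.ravel()               # mean weight per sentence
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]          # preserve original order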
Recent advancements in LLMs have led to significant improvements in both extractive and
abstractive summarization tasks. For instance, models like BERT and GPT-3 have
demonstrated substantial capabilities in generating high-quality summaries across different
domains (Chalkidis et al., 2020). Their adaptation to domain-specific tasks like legal text
summarization is an emerging area of research. Exploring these LLMs for Nepali legal
texts can yield promising results and contribute to the overall understanding of how NLP
can be effectively utilized in low-resource language contexts.

2.4 Challenges in Low-Resource Languages


The application of NLP techniques in low-resource languages like Nepali presents unique
challenges due to the limited availability of annotated data and resources. Chaudhary et al.
(2020) examined strategies for improving multilingual neural machine translation in low-
resource settings and emphasized the importance of cross-lingual supervision to enhance
model performance. Furthermore, Conneau et al. (2020) discussed the development of
unsupervised cross-lingual models that do not rely on parallel data, which is particularly
relevant for low-resource languages where such data may be scarce.

CHAPTER 3: METHODOLOGY

3.1 System Block Diagram

Figure 3.1: System Block Diagram

3.1.1 Data Collection


The first step in this research involves compiling a comprehensive dataset of Nepali legal
documents. This dataset will be scraped from the Nepal Kanun Patrika
(https://nkp.gov.np/), a reputable legal repository that provides essential information
about laws, regulations, and judicial
decisions. The aim is to gather a diverse range of documents, including statutes,
regulations, and case law, to ensure a robust dataset for training and evaluation.
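
A hedged sketch of this collection step is given below using requests and BeautifulSoup. The URL pattern and page structure are assumptions made for illustration; the actual layout of nkp.gov.np must be inspected first, and scraping should respect the site's terms of use and request rate limits:

import time

import requests
from bs4 import BeautifulSoup

def fetch_decision(decision_id: int) -> str:
    # Hypothetical URL pattern; the real one must be confirmed on nkp.gov.np.
    url = f"https://nkp.gov.np/full_detail/{decision_id}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator="\n", strip=True)  # plain text of the page

for decision_id in range(7100, 7105):                 # small illustrative batch
    with open(f"nkp_{decision_id}.txt", "w", encoding="utf-8") as f:
        f.write(fetch_decision(decision_id))
    time.sleep(1)                                     # be polite to the server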

3.1.2 Preprocessing
Once the data is collected, it will undergo a thorough preprocessing phase to prepare it for
input into multilingual large language models (LLMs). This preprocessing will include
several steps:
1. Tokenization: Breaking down the text into smaller units (tokens) to facilitate
analysis.
2. Normalization: Standardizing the text format to reduce variability and improve
model performance.
3. Removal of Stop Words: Eliminating common words that do not contribute
significant meaning to the text, thereby enhancing the focus on more informative
content.
These preprocessing steps are essential to produce a clean dataset suitable for training
LLMs effectively; a minimal sketch follows.
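
In the sketch below, the stop-word list is a tiny illustrative placeholder (a curated Nepali list would be used in practice), and subword tokenization for the LLMs themselves is handled later by each model's own tokenizer:

import re

NEPALI_STOPWORDS = {"र", "छ", "को", "मा", "हो"}  # placeholder, not exhaustive

def preprocess(text: str) -> list[str]:
    text = re.sub(r"\s+", " ", text).strip()            # normalization: collapse whitespace
    tokens = re.findall(r"[\u0900-\u097F]+|\w+", text)  # tokenization: Devanagari or word runs
    return [t for t in tokens if t not in NEPALI_STOPWORDS]  # stop-word removal

print(preprocess("नेपालको संविधान र  कानुन"))  # ['नेपालको', 'संविधान', 'कानुन']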

3.1.3 Model Implementation


In this phase, the focus will be on fine-tuning selected multilingual LLMs, such as
mBART, mT5, and XLM-R, on the preprocessed dataset. The fine-tuning process will
involve adapting these models specifically for summarization tasks related to Nepali legal
documents. The approach will primarily employ sequence-to-sequence models with
attention mechanisms, which are known for their effectiveness in generating coherent
summaries from input texts; attention allows the model to focus on the relevant parts of
the text while generating summaries.
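
A sketch of this setup with the Hugging Face Seq2SeqTrainer is shown below. The checkpoint, dataset fields, and hyperparameters are placeholders to be settled during experimentation, and the single-example dataset only stands in for the preprocessed Nepal Kanun Patrika corpus:

from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/mt5-small"                 # one of the candidate models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Stand-in corpus; real (document, summary) pairs come from the NKP dataset.
data = Dataset.from_dict({"document": ["..."], "summary": ["..."]})

def tokenize(batch):
    enc = tokenizer(batch["document"], max_length=1024, truncation=True)
    enc["labels"] = tokenizer(text_target=batch["summary"],
                              max_length=150, truncation=True)["input_ids"]
    return enc

train_set = data.map(tokenize, batched=True, remove_columns=data.column_names)
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="nepali-legal-summarizer",
                                  learning_rate=3e-5, num_train_epochs=3,
                                  per_device_train_batch_size=4),
    train_dataset=train_set,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()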

3.1.4 Evaluation
The performance of the summarization models will be assessed using standard evaluation
metrics. Key metrics include:
1. ROUGE Scores: These scores will evaluate the overlap between generated
summaries and reference summaries.
2. Precision, Recall, and F1-Score: These metrics will provide insights into the
accuracy and completeness of the summaries produced by the models.
3. BLEU Scores: Although primarily used for machine translation, BLEU scores can
also help assess the quality of generated summaries by comparing them with
reference summaries.
Comparative analysis across different multilingual LLMs will be conducted to identify
which model performs best in summarizing Nepali legal texts.
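
A sketch of computing these metrics with the Hugging Face evaluate library follows. The prediction and reference strings are placeholders, and since the standard ROUGE implementation is English-oriented, its tokenizer behaviour on Devanagari text would need to be verified or a multilingual variant substituted:

import evaluate

predictions = ["..."]  # model-generated summaries (placeholders)
references = ["..."]   # human-written reference summaries (placeholders)

rouge = evaluate.load("rouge")    # reports F-measures for ROUGE-1/2/L
bleu = evaluate.load("sacrebleu")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references]))  # one reference per prediction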

3.1.5 Analysis and Interpretation
The final phase involves analyzing and interpreting the results obtained from the evaluation
process. This analysis will focus on determining the effectiveness of the multilingual
models in generating accurate and concise summaries of Nepali legal documents.
Additionally, any challenges or limitations encountered during model training and
evaluation will be documented. Insights gained from this analysis will inform future
research directions and practical applications of LLMs in the legal domain.

This methodology aims to develop a functional summarization model capable of producing


concise and informative summaries of Nepali legal documents while addressing the unique
challenges associated with low-resource languages.

CHAPTER 4: EXPECTED OUTCOMES & WORKING
SCHEDULE

4.1 Expected Output


The expected outcomes of this research include:
1. A summarization model that generates concise and informative summaries of
Nepali legal documents.
2. An evaluation report detailing the performance of different multilingual LLMs in
summarizing Nepali texts, highlighting their strengths and limitations.
3. Recommendations for future research and practical applications of multilingual
LLMs in both the legal domain and other low-resource language contexts.

Figure 4.1: Expected Output of the Summarization Model (sample case: निर्णय नं. ७१२६ - बन्दीप्रत्यक्षीकरण, i.e., Decision No. 7126 - Habeas Corpus)

4.2 Working Schedule
The working schedule for the dissertation project is structured to ensure a systematic
approach to completing each phase of the research. The timeline is divided into distinct
tasks, each with a specified duration to facilitate efficient progress. The Gantt chart below
outlines the working schedule:

Figure 4.2: Working Schedule Gantt Chart

REFERENCES

1. Chaudhary, V., Tang, Y., Guzmán, F., Chaudhary, S., & Koehn, P. (2020). Low-resource multilingual neural machine translation with cross-lingual supervision. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main
2. Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I. (2020). LEGAL-BERT: The muppets straight out of law school. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/2020.emnlp-main
3. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., et al. (2020). Unsupervised cross-lingual representation learning at scale. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main
4. Gupta, V., & Lehal, G. S. (2010). A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence, 2(3), 258-268. https://doi.org/10.3995/jetwi.v2i3.108
5. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., et al. (2019). XLM-R: A heavily multilingual pre-trained language model. arXiv preprint arXiv:1911.02116. https://arxiv.org/abs/1911.02116
6. Nenkova, A., & McKeown, K. (2012). Automatic summarization: A survey of the state of the art and future directions. Journal of Artificial Intelligence Research, 38(1), 1-50.
7. Tang, Y., Lu, L., Dyer, C., Goyal, N., Fenton, J., & Bhosale, S. (2020). mBART: Multilingual denoising pre-training for neural machine translation. arXiv preprint arXiv:2001.08210. https://arxiv.org/abs/2001.08210
8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
9. Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., et al. (2020). mT5: A massively multilingual pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934. https://arxiv.org/abs/2010.11934
10. Zhong, Z., Chen, Q., & Zhang, X. (2020). Iterative summarization for court opinions using a novel approach. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.acl-main.pdf
