A Dissertation Proposal on
NEPALI LEGAL DOCUMENT SUMMARIZATION USING
MULTILINGUAL LARGE LANGUAGE MODELS
Submitted by:
Arpan Sapkota
Reg. No. 3-2-368-168-2016
Submitted to:
School of Mathematical Sciences
Kirtipur, Kathmandu, Nepal
CHAPTER 1: INTRODUCTION
1.1 Background
The legal domain in Nepal encompasses a vast array of documents, including statutes,
regulations, and case laws. These documents are often lengthy and complex, making it
challenging for legal professionals, researchers, and the public to quickly grasp their
essential content. This dissertation aims to address this challenge by developing a system
for analyzing Nepali legal documents using multilingual large language models (LLMs).
Leveraging models such as mBART, mT5, and XLM-R, which are pre-trained on many
languages including Nepali, the system aims to produce concise and accurate summaries
that enhance the accessibility and comprehension of legal information (Liu et al., 2019;
Tang et al., 2020).
In recent years, the growth of legal documents has necessitated the development of
advanced tools to manage and analyze these texts efficiently. Natural Language Processing
(NLP) techniques have shown significant potential in extracting and condensing critical
information from large datasets (Gupta & Lehal, 2010). This proposal outlines a
dissertation project aimed at leveraging NLP techniques to analyze legal texts from the
Nepal Kanun Patrika (https://nkp.gov.np/). Legal texts are a fundamental resource in the
legal domain, providing crucial information about laws, regulations, and judicial decisions.
The Nepal Kanun Patrika serves as a key repository of such legal texts in Nepal. This
dissertation aims to utilize pre-trained LLMs to enhance the analysis of these texts. The
goal is to make a significant contribution to the legal domain of Nepal by improving the
accessibility of important legal information. The sheer volume and complexity of legal
texts can overwhelm users, making it difficult to retrieve relevant information promptly
(Chalkidis et al., 2020). Traditional methods of manual analysis are time-consuming and
prone to errors. Therefore, there is a pressing need for a system that can condense these text
documents while retaining their core information, extracting key details and generating
concise summaries to improve usability and understanding.
Given the linguistic nuances of Nepali, developing such a system requires advanced
multilingual models capable of understanding and processing the language effectively. This
research will explore various methodologies for summarizing Nepali legal texts using state-
of-the-art LLMs while addressing the unique challenges posed by low-resource languages
(Chaudhary et al., 2020; Conneau et al., 2020). By focusing on these aspects, this
dissertation aims to contribute valuable insights into the field of legal text analysis in
Nepal.
1.1.1 Transformer Architecture
The Transformer architecture relies on self-attention mechanisms that allow the model to
weigh the importance of different words in a sentence relative to one another. This
capability is particularly beneficial for summarization tasks, as it enables the model to
focus on relevant parts of the text while generating coherent summaries. The architecture's
ability to handle long-range dependencies makes it especially suitable for processing
complex legal texts where context is crucial. This architecture enables parallel processing
3
of data, which significantly enhances training efficiency and performance compared to
previous sequential models like RNNs (Recurrent Neural Networks) (Vaswani et al. 2017).
The connection between the Transformer model and its derivatives mBART, XLM-R, and
mT5 lies in their shared architecture and design principles. Each model leverages the self-
attention mechanism of Transformers to enhance performance across multilingual NLP
tasks, including legal document summarization in low-resource languages like Nepali. By
building on this foundational architecture, these models can effectively process complex
legal texts and generate concise summaries that retain essential information.
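To make the self-attention mechanism described above concrete, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It is a toy version only: it sets queries, keys, and values directly to the input embeddings, whereas real Transformers apply learned query/key/value projections and use multiple attention heads. The embedding values are invented for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Toy scaled dot-product self-attention with Q = K = V = X."""
    d_k = len(X[0])
    out = []
    for q in X:
        # Score each key against this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in X]
        weights = softmax(scores)  # attention distribution over all tokens
        # Output = attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d_k)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-d token embeddings
print(self_attention(X))
```

Because every output vector is a convex combination of the inputs, each token's representation mixes in information from every other token, which is what lets the model weigh distant context when summarizing long legal passages.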
1.2 Problem Statement
The legal domain in Nepal is overloaded with extensive textual data, often making it
challenging to retrieve relevant information promptly. Traditional methods of manual
analysis are time-consuming and prone to errors. This dissertation aims to address these
challenges by applying NLP techniques to the Nepali legal texts, thereby enhancing
information retrieval and decision-making processes.
The sheer volume and complexity of Nepali legal texts make it challenging for legal
professionals, researchers, and the public to efficiently access and interpret this
information. There is a need for a system that can condense these documents while
retaining their core information, extracting key details and generating concise summaries
to improve usability and understanding. However, given the linguistic variations of the
Nepali language, developing such a system requires advanced multilingual models capable
of understanding and processing the language effectively.
1.3 Objectives
1. To develop a summarization model for Nepali legal documents using multilingual
Large Language Models.
2. To evaluate the effectiveness of these multilingual models in producing accurate
and concise summaries of Nepali legal texts.
3. To contribute new methodologies and insights to the field of legal text analysis in
the Nepali language.
CHAPTER 2: PRELIMINARY LITERATURE REVIEW
Summarization approaches are broadly extractive or abstractive: extractive summarization
selects salient sentences directly from the source text, while abstractive summarization
generates new sentences that convey the main ideas of the
text (Gupta & Lehal, 2010; Nenkova & McKeown, 2012). Both techniques have been
employed in various domains, including legal texts. Integrating summarization with named
entity recognition (NER) can further enhance insights into legal documents by identifying
key entities and their relationships within the text.
Recent advancements in LLMs have led to significant improvements in both extractive and
abstractive summarization tasks. For instance, models like BERT and GPT-3 have
demonstrated substantial capabilities in generating high-quality summaries across different
domains (Chalkidis et al., 2020). Their adaptation to domain-specific tasks like legal text
summarization is an emerging area of research. Exploring these LLMs for Nepali legal
texts can yield promising results and contribute to the overall understanding of how NLP
can be effectively utilized in low-resource language contexts.
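To make the extractive/abstractive distinction concrete, a minimal frequency-based extractive summarizer can be sketched as follows. This is a simplified classical baseline, not the neural approach the dissertation proposes; the example text and sentence count are invented for illustration.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Score each sentence by the corpus frequency of its words; keep the top n in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    # (score, original position, sentence) triples
    scored = [(sum(freq[w] for w in re.findall(r"\w+", s.lower())), i, s)
              for i, s in enumerate(sentences)]
    # Take the n highest-scoring sentences, then restore document order.
    top = sorted(sorted(scored, reverse=True)[:n_sentences], key=lambda t: t[1])
    return " ".join(s for _, _, s in top)

text = ("The court reviewed the petition. The petition challenged the regulation. "
        "Lunch was served at noon. The court upheld the regulation.")
print(extractive_summary(text))
```

An abstractive model, by contrast, would be free to produce a sentence that appears nowhere in the source, such as a paraphrase of the court's decision.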
CHAPTER 3: METHODOLOGY
3.1.2 Preprocessing
Once the data is collected, it will undergo a thorough preprocessing phase to prepare it for
input into multilingual large language models (LLMs). This preprocessing will include
several steps:
1. Tokenization: Breaking down the text into smaller units (tokens) to facilitate
analysis.
2. Normalization: Standardizing the text format to reduce variability and improve
model performance.
3. Removal of Stop Words: Eliminating common words that do not contribute
significant meaning to the text, thereby enhancing the focus on more informative
content.
These preprocessing steps are essential for producing a clean dataset suitable for
training LLMs effectively.
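The three preprocessing steps above can be sketched in plain Python as follows. This is an illustrative classical pipeline only: the stop-word list is a small invented sample (a real system would use a curated Nepali list), and the LLMs themselves apply their own subword tokenizers on top of such cleaning.

```python
import unicodedata

# Illustrative stop-word sample -- a real system would use a curated Nepali list.
STOP_WORDS = {"र", "छ", "हो", "यो", "त्यो"}
PUNCT = "।॥.,!?;:"

def preprocess(text):
    """Normalize, tokenize, and remove stop words from Nepali text."""
    text = unicodedata.normalize("NFC", text)        # 2. Unicode normalization
    tokens = [t.strip(PUNCT) for t in text.split()]  # 1. whitespace tokenization
    return [t for t in tokens if t and t not in STOP_WORDS]  # 3. stop-word removal

print(preprocess("अदालतले फैसला गरेको छ ।"))
```

NFC normalization matters for Devanagari because visually identical strings can be encoded with different combining-character sequences, which would otherwise fragment the model's vocabulary.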
3.1.4 Evaluation
The performance of the summarization models will be assessed using standard evaluation
metrics. Key metrics include:
1. ROUGE Scores: These scores will evaluate the overlap between generated
summaries and reference summaries.
2. Precision, Recall, and F1-Score: These metrics will provide insights into the
accuracy and completeness of the summaries produced by the models.
3. BLEU Scores: Although primarily used for machine translation, BLEU scores can
also help assess the quality of generated summaries by comparing them with
reference summaries.
Comparative analysis across different multilingual LLMs will be conducted to identify
which model performs best in summarizing Nepali legal texts.
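The relationship between the metrics listed above can be illustrated with a from-scratch ROUGE-1 computation (unigram overlap). This is a simplified sketch; the actual evaluation would rely on an established implementation, and the example sentences are invented.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """ROUGE-1 precision, recall, and F1 from clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # each n-gram match clipped to its ref count
    precision = overlap / max(sum(cand.values()), 1)  # matched / generated length
    recall = overlap / max(sum(ref.values()), 1)      # matched / reference length
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = rouge_1("the court upheld the law",
                  "the court upheld the regulation")
print(round(p, 2), round(r, 2), round(f, 2))
```

ROUGE-2 and ROUGE-L follow the same pattern over bigrams and longest common subsequences, respectively, while BLEU reverses the emphasis toward precision with a brevity penalty.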
3.1.5 Analysis and Interpretation
The final phase involves analyzing and interpreting the results obtained from the evaluation
process. This analysis will focus on determining the effectiveness of the multilingual
models in generating accurate and concise summaries of Nepali legal documents.
Additionally, any challenges or limitations encountered during model training and
evaluation will be documented. Insights gained from this analysis will inform future
research directions and practical applications of LLMs in the legal domain.
CHAPTER 4: EXPECTED OUTCOMES & WORKING SCHEDULE
4.2 Working Schedule
The working schedule for the dissertation project is structured to ensure a systematic
approach to completing each phase of the research. The timeline is divided into distinct
tasks, each with a specified duration to facilitate efficient progress. The Gantt chart
below outlines the working schedule:
REFERENCES
1. Chaudhary, V., Tang, Y., Guzmán, F., Chaudhary, S., & Koehn, P. (2020). Low-
resource multilingual neural machine translation with cross-lingual supervision.
Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics. https://doi.org/10.18653/v1/2020.acl-main
2. Chalkidis, I., Fergadiotis, M., Malakasiotis, P., Aletras, N., & Androutsopoulos, I.
(2020). Legal-BERT: The muppets straight out of law school. Proceedings of the
2020 Conference on Empirical Methods in Natural Language Processing.
https://doi.org/10.18653/v1/2020.emnlp-main
3. Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F.,
et al. (2020). Unsupervised cross-lingual representation learning at scale.
Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics. https://doi.org/10.18653/v1/2020.acl-main
4. Gupta, V., & Lehal, G. S. (2010). A survey of text summarization extractive
techniques. Journal of Emerging Technologies in Web Intelligence, 2(3), 258-268.
https://doi.org/10.3995/jetwi.v2i3.108
5. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., et al. (2019). RoBERTa: A
robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
https://arxiv.org/abs/1907.11692
6. Nenkova, A., & McKeown, K. (2012). Automatic Summarization: A Survey of the
State of the Art and Future Directions. Journal of Artificial Intelligence Research,
38(1), 1-50.
7. Tang, Y., Lu, L., Dyer, C., Goyal, N., Fenton, J., & Bhosale, S. (2020). mBART:
Multilingual denoising pre-training for neural machine translation. arXiv preprint
arXiv:2001.08210. https://arxiv.org/abs/2001.08210
8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural
Information Processing Systems, 30.
9. Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., et al.
(2020). mT5: A massively multilingual pre-trained text-to-text transformer. arXiv
preprint arXiv:2010.11934. https://arxiv.org/abs/2010.11934
10. Zhong, Z., Chen, Q., & Zhang, X. (2020). Iterative summarization for court opinions
using a novel approach. Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics. https://www.aclweb.org/anthology/2020.acl-main.pdf