
Utrecht University

Graduate School of Natural Sciences


Business Informatics

Thesis submitted for the degree

Master of Science

On Automatic Summarization of
Dutch Legal Cases

by

Daniël Prijs

Supervisor: M.P. Schraagen


Co-Supervisor: F.J. Bex

July, 2022
Acknowledgments

I would like to express my sincere thanks to Marijn Schraagen. He dedicated his time and efforts to patiently guide me along the way and gave constructive feedback in all stages of the project.
I would also like to thank my girlfriend Victoria for supporting me from the very beginning of this project and for proofreading this document, my friend Randell for his peer-presence, which greatly improved my own motivation and led to some interesting conversations about both of our theses, and finally my family for constantly inquiring about the progress of my thesis, which kept me determined to finish it.
Finally, the TPU Research Cloud program by Google deserves praise, for
without this initiative I would have lacked the computational resources to conduct
the main parts of the experiments in this thesis.

Abstract

As is true for many other domains, the legal domain saw an increase in digitization
over the last decade. In the Netherlands, this is reflected in the usage of the
European Case Law Identifier encoding to freely and openly publish Dutch legal
cases. Currently, only 5% of all Dutch legal cases are published this way. The aim
is to bring this percentage up to 75% in the coming years.
There is a need for published cases to contain a summary highlighting the
contents of a case. Such summaries would make it much easier to search for
relevant cases. Approximately 460 thousand cases that are currently published
contain a case text and a case summary. Writing summaries for cases is a time-consuming and non-trivial task. Therefore, we studied the feasibility of using
automatic summarization to automate this process for Dutch legal cases.
As a first step, we collected and preprocessed all published legal cases into a
single dataset. This Rechtspraak dataset consists of 100,201 case-summary pairs suitable for automatic summarization. This dataset was then analyzed using a framework that was recently proposed for this goal.
Subsequently, an experiment was designed to train and evaluate a BART
model on the dataset. This is a sequence-to-sequence model using a transformer architecture. To this end, two systems were considered. In one case, the full
dataset was used to fine-tune the BART model. In the other case, the dataset was
first clustered into six subsets, after which a separate BART model was fine-tuned
for each cluster. This technique of prior-clustering had not been explored before in the
field of automatic summarization. The obtained models were evaluated in two
phases. First, the common ROUGE metrics were computed. Second, a recently
proposed protocol for human evaluation of automatically generated summaries
was followed to evaluate forty cases and accompanying summaries.
The results of this evaluation showed that the automatically generated summaries are of a slightly worse quality than the reference summaries. For most metrics, however, the difference is only small. Only with respect to the relevance of the generated summaries is there more room for improvement.
In comparison with the full dataset model, clustering has a moderately negative
effect on the quality of the generated summaries and therefore is not recommended.
On the whole, automatic summarization techniques show promising results
when applied to Dutch legal cases. We argue that they can readily be applied to
new case texts if human summarization of these case texts is not feasible for any
reason.

Contents

Acknowledgments

Abstract

Contents

1 Introduction
1.1 Automatic Summarization of Legal Cases
1.2 Clustering to Improve Summarization
1.3 Contributions

2 Deep Learning; Its Surface
2.1 Machine Learning
2.2 Training
2.3 Natural Language Processing
2.3.1 Language Models

3 Related Work
3.1 Text Summarization Using Deep Learning
3.1.1 Evaluation of Generated Summaries
3.1.2 Extractive Summarization
3.1.3 Abstractive Summarization
3.2 Summarization of Legal Documents
3.3 Clustering
3.3.1 k-means clustering
3.3.2 Mixture Models
3.3.3 Clustering and Summarization

4 Methods
4.1 Characteristics of the Dataset (RQ1)
4.1.1 Collection of the Data
4.1.2 Preparation of the Data
4.2 Method of Evaluation (RQ2)
4.2.1 Human Evaluation
4.3 Experimental Setup (RQ3 and RQ4)
4.3.1 Obtaining a Dutch Language Model
4.3.2 Architecture of the Clustering Component
4.3.3 Architecture of the Summarization Component
4.3.4 Details on Implementation

5 Results
5.1 Analysis of the Rechtspraak dataset
5.2 Pretraining of BART
5.3 Finding a suitable Clustering Model
5.4 Training the Summarization Models
5.4.1 Dataset Splits
5.4.2 Fine-tuning Losses
5.5 Evaluation of the Summarization Models
5.5.1 Summary Generation
5.5.2 Automatic Evaluation Using ROUGE
5.5.3 Human Evaluation

6 Discussion
6.1 Automatically Generated Summaries
6.1.1 Quality of the Reference Summaries
6.1.2 Improving Generated Summaries
6.1.3 Extractive or Abstractive Summaries?
6.1.4 Incorporation of Domain-Dependent Features
6.1.5 Generation Time of the Results
6.1.6 Summary Generation Configuration
6.2 Improving the Described Method
6.2.1 Longer Pretraining of Base Model
6.2.2 Human Evaluation of Generated Summaries
6.2.3 Association Dataset Metrics and Model Performance
6.3 Architectural Considerations
6.3.1 Limitations of Transformer Architecture
6.3.2 Gaussian Mixture or k-means?

7 Conclusion

References

A The Data Collection Process
A.1 Collection of the External Dataset
A.2 Constructing the Raw Dataset
A.2.1 Storing the Dataset

B Examples of Generated Summaries


1. Introduction

One of the consequences of the Information Age is the explosion of the amount
of information that is being digitized. This digitization is accompanied by some
well-known challenges: storing information costs resources; digitizing, if not done automatically, requires human effort; privacy has to be guaranteed; and access has to be safe and reliable.
A less apparent problem is that, in the case of text documents, the number of documents becomes too large for a single person to consider exhaustively when searching for information. Search engines are an example of a field that is already tackling this problem. Here, because people cannot read every website when they search using some keywords, we are presented with short snippets that concisely summarize the specific web page.
In this thesis, we will consider the Dutch legal domain, where the increasing
digitization of legal cases demands solutions that allow users to easily navigate
the published documents.
Publication of the Dutch legal cases is done by Raad voor de Rechtspraak at
rechtspraak.nl. Until July 2003, if users of rechtspraak.nl wanted to know whether
a case was relevant for them, they had to open the document and either scan or read the full case before they could assess whether the case indeed was
relevant. Because users felt this was a shortcoming of rechtspraak.nl, summaries
were created for new Dutch legal cases (recht.nl, 2003). Since then, some of the
new cases have been given summaries that are shown as snippets to quickly inform
the user about the main contents of the case. This need of users to be able to
quickly assess a case highlights the relevance of having Dutch legal cases that are
enriched with supplementary summaries.
However, the current summarization process has a number of limitations. The
most obvious limitation is the need for human labour in constructing a summary
for each new case. This might be the reason why only a small portion of the
published Dutch legal cases are supplemented by a summary. This is especially
true for cases before July 2003, as only a small part of these cases were retroactively
summarized. Furthermore, even for cases that are accompanied by a summary,
this summary often only consists of a few keywords or a short sentence.
Recently, Raad voor de Rechtspraak announced that they aim to publish even
more Dutch legal cases (Naves, 2021). Currently, only 5% of all Dutch cases are
published online; their aim is to increase this to 75%. Looking at the previous decade alone, this would mean that an additional 2.2 million cases would be added for that decade. This further stresses the need for sound summaries so that
relatively little time is lost searching for relevant cases.


1.1 Automatic Summarization of Legal Cases


In this thesis we explore the feasibility of automatically generating summaries for
Dutch legal cases using a deep learning approach. Mainly due to advancements
in hardware, the domains of deep learning and natural language processing saw
rapid developments. Considering that deep learning techniques outperform more traditional techniques, as will be discussed in depth in chapter 2, deep learning indeed seems to be an appealing source of possible solutions.
The purpose of summarization is to reduce a source text to some new text that
consists of the most relevant information of this source text (Gambhir & Gupta,
2017). We will describe an approach where a summarization model automates
this process. This model will be trained on existing case-summary pairs and
it will be evaluated by comparing the quality of generated summaries with real
summaries. The main problem here is that evaluation of the generated summaries
is hard: how can we measure whether summary A does a better job of summarizing the text than summary B?
Allahyari et al. (2017) identified the following difficulties of automatic evaluation of summaries:

• “It is fundamental to decide and specify the most important parts of the original text to preserve.”

• “Evaluators have to automatically identify these pieces of important information in the candidate summary, since this information can be represented using disparate expressions.”

• “The readability of the summary in terms of grammaticality and coherence has to be evaluated.”

Therefore, how do we measure the correctness of a candidate summary? We shall discuss this question in depth for the Rechtspraak dataset. In our experiments we will follow this discussion for our own evaluation.

1.2 Clustering to Improve Summarization


In our search for a system that can summarize Dutch legal cases, we put emphasis on utilizing data clustering as a first step in this system. Clustering is a technique that is used in many different domains. The main objective is to create different segments or groups of a dataset depending on features describing the dataset. Its popularity is partially due to it being a quick and cheap approach to learning something about the data. Clustering usually is unsupervised, meaning that we do not need labeled data.
We expected that clustering the cases would make it easier for the models to find
patterns within each cluster. However, by the end of this thesis, it will become
clear that clustering achieves the opposite of what we expected.

1.3 Contributions
We aimed to find a system that can be used to automatically generate summaries
of Dutch legal cases. This system should promote the ease of searching through
the large body of published Dutch legal cases.
The project is structured so as to answer four research questions. First, a
concise analysis of the Rechtspraak dataset is required to inform further decisions
relating to system components and modeling approaches. Therefore, we start
with answering the following question: what are the key differences between
available benchmark datasets and the Rechtspraak dataset used in this
project? (RQ1)
Second, time will be dedicated to choosing a proper evaluation method. As
was stated before, evaluation of automatically generated summaries is not straightforward. For this reason, we study both quantitative and qualitative methods of
evaluation and answer the question: how can generated summaries of Dutch
legal cases be evaluated accurately? (RQ2)
We will experiment with multiple models to find the impact of clustering the
data before summarization is done. We hypothesized that clustering will lead to
improved summaries, which will be measured by evaluation methods found in
RQ2. Our third research question therefore is: what is the effect of training
automatic summarization models on clustered data? (RQ3)
To finalize this thesis, we will uncover the strengths and weaknesses of our automatic summarization system and will answer the question: what are the biggest
challenges when automatic summarization techniques are applied to
Dutch legal cases? (RQ4)
Besides these theoretical contributions, we also provide instructions1 on how
to generate the Rechtspraak dataset, which consists of 100K Dutch legal cases
and summaries. The legal cases from this dataset are already available online, but
only as individual XML files. We provide instructions on how to parse these files
and collect them into a single summarization dataset. The obtained dataset is large and freely available from the source with few restrictions, making it
suitable to be utilized as a benchmark dataset. To the best of our knowledge, no
summarization benchmark dataset exists for the Dutch language. Instructions to
generate the complete Rechtspraak dataset, including cases that are not viable
for summarization (e.g. due to missing components), are also provided. This
dataset consists of 3 million cases.

1 See https://github.com/prijsdf/dutch-legal-summarization
2. Deep Learning; Its Surface

It is beneficial to have an understanding of some fundamental concepts related to Deep Learning before summarization-specific concepts and techniques are discussed. This will allow for a more guided exploration of the available literature and the rapid developments in the field. At the end of the chapter, we will be equipped to compare the more advanced components and architectures that are directly relevant to this thesis.
In this chapter, we will present a concise overview of the base components found
in Deep Learning architectures. In section 2.1, machine learning is introduced.
Next, in section 2.2, training of machine learning architectures is discussed. Finally,
in section 2.3, we will provide a background on the use of Deep Learning for
Natural Language Processing.

2.1 Machine Learning


Machine Learning (ML) is one of the domains within the broad domain of Artificial
Intelligence (AI). Within this domain of ML, we find techniques and approaches
that allow algorithms to automatically learn patterns from observed data. These
algorithms can be used to aid decision making for real-world problems. For
example, we might have an algorithm that automatically identifies fraudulent
credit card transactions by comparing metadata associated with each transaction.
However, we do not need ML techniques for these tasks; we could also use rule-
based systems where we use human-created rules to automatically flag transactions
based on the metadata. A transaction could automatically be flagged, for example,
if it describes an unusually large sum of money and was processed in a country that the card was not used in before. However, ML has proved not only to be more effective for many different tasks, but also to require less human labour.
Within ML we find a subgroup of techniques that contain architectures that
loosely resemble the human brain, namely neural networks. Neural networks are
constructed by linking and chaining artificial neurons. These neurons are nodes
that produce an output by performing a simple operation on an input. By linking
many neurons, a complicated network is created that is shown to be effective in
learning many types of patterns in different types of data.
Neurons are grouped in layers. In a simple neural network we have one input
layer, one hidden layer and one output layer. Generally, to account for more
complex structures in the data, more hidden layers are needed. When a network
has two or more hidden layers, it is called a deep neural network. Deep learning
is used as a term to describe these networks. All neural networks that we will
discuss in later sections are deep neural networks.
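To make this structure concrete, the following is a minimal sketch of such a deep neural network (two hidden layers), assuming PyTorch; the layer sizes are arbitrary and purely illustrative.

import torch
import torch.nn as nn

class SmallDeepNet(nn.Module):
    # A toy deep neural network: one input layer, two hidden layers, one output layer.
    def __init__(self, n_inputs: int = 16, n_hidden: int = 32, n_outputs: int = 2):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_inputs, n_hidden),   # input layer -> first hidden layer
            nn.ReLU(),                       # each neuron applies a simple non-linear operation
            nn.Linear(n_hidden, n_hidden),   # a second hidden layer makes the network "deep"
            nn.ReLU(),
            nn.Linear(n_hidden, n_outputs),  # output layer
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)

model = SmallDeepNet()
print(model(torch.randn(4, 16)).shape)  # torch.Size([4, 2])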


Neural networks, especially deep ones, are more difficult to understand than
non-neural models. Often, neural networks are described as black boxes because
of the seemingly mysterious way in which they learn patterns. This inability to
explain the model’s reasoning is one of the main criticisms of neural networks and
sometimes prevents them from being applied to real world problems. Furthermore,
neural networks can become computationally expensive to train. This is because
of the large number of trainable parameters that are associated with a neural
network. BERT, a model that is mentioned in section 2.3.1, for example, has
around 110 million parameters that all need to be trained. In turn, this requires a
lot of training data to be able to effectively learn patterns from.
A meaningful classification of AI techniques can be made with respect to the ability of a
technique to learn from data. This ability to learn is one of the essential aspects of
any Machine Learning approach; it leads to a system that is able to learn patterns
from the data in a way that enables it to be generalized to also find patterns in
new, unseen data. We can divide approaches into three groups (Goodfellow et al.,
2016): rule-based, traditional (or classical) and neural (or representational).
In figure 2.1 the classes are shown. Rule-based systems have no learning
component; they rely on manually crafted rules that are matched with the data.
Traditional approaches, on the other hand, contain components that learn to map
features of the data to certain outputs. Currently, machine learning approaches
are mainly neural: not only are output mappings learned from features, but the features themselves are also learned.

Figure 2.1: Artificial Intelligence became increasingly less human-dependent. Rule-based approaches use no element to learn from data. Traditional approaches allow for learning from features of the data. Neural approaches not only allow for learning from features, but also learning of the features themselves. The dashed boxes indicate processing steps dictated by learning. The diagram is adapted from Goodfellow et al. (2016, Figure 1.5).

2.2 Training
Integral to many ML models is a training phase. In their book, Goodfellow et al.
(2016) use the term representation learning to denote this phase for neural
models and DL models. This name makes sense: during the training phase the
model builds a representation of the data it is being trained on.
For many models the training phase is supervised, meaning that the model is
given some knowledge about each data item in order to have it train conditionally
on this knowledge. In further discussion this prior knowledge will be referred to
as the true label of a data item. In our case this means that a reference summary
is provided together with its full case text. The model, then, tries to derive the
true label from the data item and hereby generates a candidate summary as the
predicted label for this data item. After generating this prediction, the model uses the true label to update its representation, depending on how closely the predicted label approximated the true label. This updating process is known as backpropagation.
To compute the proximity of the prediction to the real label, a loss function is
needed. Which loss function should be chosen is dependent on the architecture
of the model and the task at hand. See section 7.4 of the book by Jurafsky and
Martin (2020) for more details on both backpropagation and loss functions.
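To make this concrete, the sketch below performs a single supervised training step: predict, compare the prediction with the true label through a loss function, and update the parameters through backpropagation. It assumes PyTorch; the model, the random data and the loss choice are purely illustrative.

import torch
import torch.nn as nn

model = nn.Linear(8, 3)                            # a toy model mapping 8 features to 3 classes
loss_fn = nn.CrossEntropyLoss()                    # the loss choice depends on model and task
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(16, 8)                             # a batch of 16 data items
y_true = torch.randint(0, 3, (16,))                # their true labels

y_pred = model(x)                                  # predicted labels (logits)
loss = loss_fn(y_pred, y_true)                     # proximity of the prediction to the true label
optimizer.zero_grad()
loss.backward()                                    # backpropagation: compute gradients
optimizer.step()                                   # update the model's representation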
The other two paradigms of training are unsupervised and reinforcement.
In unsupervised learning the model is only supplied with the data items; there are
no true labels associated with these items. Therefore, unsupervised learning often
is restricted to finding discriminating features within the dataset. A common
example of such a system is a clustering model that is able to partition the data
into a number of clusters, each containing items that are similar to each other while
differing from items in the other clusters. In the method we propose, clustering
also is included as the first component. Therefore, we will discuss clustering more
in-depth in section 3.3.
Reinforcement Learning (RL) also differs from the supervised item-label approach. Here, the problem is framed by picturing an agent that is interacting
with its environment. Each action this agent performs changes the environment.
The environment, in turn, emits this change back to the agent, which ’learns’ a little about the action it performed. Ideally, the agent will increasingly learn how to behave effectively in its environment.

2.3 Natural Language Processing


Many solutions in Machine Learning are aimed at solving problems that occur
when we try to utilize computers to process texts. The study of this interaction
between computer and human language is that of Natural Language Processing
(NLP).
The classification shown in figure 2.1 also applies to NLP approaches. Now,
rule-based systems consist of handcrafted rules that, ideally, match sought-for
features of texts. For example, a naive rule-based approach to Named Entity
Recognition - identifying references to information units (e.g. names of people or organizations) in unstructured text (Nadeau & Sekine, 2007) - would be to extract every word or phrase that contains a capital letter. The limitations of a rule-based approach are serious: success relies on domain expertise; rules either tend to yield inaccurate results or must become increasingly complex to account for variations of features; and, for these reasons, constructing a rule-based system is likely to be time-demanding and prone to error.
Next, statistical approaches look at aggregated characteristics of words or
sequences in documents. A widely applied statistical approach is Term Fre-
quency–Inverse Document Frequency (TF-IDF). TF-IDF computes a score for
the frequency of a word in a document compared to the frequency of the word in
the complete corpus (Salton & Buckley, 1988). In other words, if used to match
documents to user queries, TF-IDF would yield the document that has the highest
frequency of the query words relative to the total frequency of the query words in
the corpus (Ramos, 2003).
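To illustrate this scoring, the sketch below computes TF-IDF values for a toy corpus. Note that several weighting variants exist; the one used here (relative term frequency times the logarithm of the inverse document frequency) is only one common choice, and the example sentences are invented.

import math
from collections import Counter

corpus = [
    "the court dismissed the appeal",
    "the appeal was upheld by the court",
    "the suspect was sentenced to prison",
]
docs = [doc.split() for doc in corpus]
N = len(docs)
df = Counter(term for doc in docs for term in set(doc))  # document frequency of each term

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)   # frequency of the term within this document
    idf = math.log(N / df[term])      # rarity of the term across the corpus
    return tf * idf

for term in ("the", "court", "dismissed"):
    print(term, round(tf_idf(term, docs[0]), 3))
# "the" occurs in every document and scores 0; the rarer "dismissed" scores highest.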
Finally, neural networks increasingly are the focus of NLP research (Jurafsky
& Martin, 2020, Chapter 7); this is mainly because of their sheer superiority in
learning generalizable patterns in comparison with more traditional techniques.
Many of the approaches to text summarization mentioned in later parts of this
thesis consist of one or multiple neural networks; especially those that are discussed
in section 3.1.3.

2.3.1 Language Models


When it is the aim to generalize or learn from textual information as opposed
to numerical information, problems get more challenging. This is mainly due to
the representation of the information: comparison of two words, for example, is
more ambiguous than the comparison of two numbers. Added to this is that a
language consists of many different words that can be used in practically unlimited
combinations.
Contemporary work relies on Language Models (LM) to model this representation of natural language. Essentially, an LM tells us how probable it is that
any word follows from a previous sequence of words (e.g. see Hiemstra (1998) for
an early work). This knowledge helps in solving tasks that require working with
natural language. For example, a summarization model can use an LM at its base
to inform the model of what words to use when generating candidate summaries.
A pioneering contribution that underpins many contemporary LMs is that of
Bengio et al. (2003). In their work, they propose an LM that uses distributed
representations (or word vectors, or word embeddings) for words. In doing so,
this LM achieves greater generalization than legacy methods. When two words
share similar contexts (i.e. neighbouring words) they are seen as similar. Now, if
the LM encounters an unknown sequence, it can use this notion of word similarity
to predict a sequence that is semantically accurate.
This idea of distributed representations has extensively been studied in the
last decade. Notable works are the Word2vec model (Mikolov et al., 2013), which
uses a more effective architecture to achieve larger representations that are trained
in less time, and GloVe (Pennington et al., 2014), which uses a statistical, and
Deep Learning; Its Surface 8

therefore faster, approach to derive the word vectors. GloVe uses probe words to
compare words of interest by looking at how frequently each of the words is found
together with the probe word. The ratio between the occurrence of word one
given the probe word and the occurrence of word two given the probe word tells
something about how each of the words relates to the probe word. Because there
is no supervised learning required to derive these ratios - they can be read from
the input documents - learning a GloVe model is faster than learning a Word2vec
model.
More recently, BERT was introduced by Devlin et al. (2019). BERT significantly improves upon the previously mentioned models by using a multi-layer
bidirectional Transformer (Vaswani et al., 2017) to derive word embeddings. Here,
word embeddings are no longer statically dependent on the word itself, but instead
also encode information about the word's context. To train BERT on incorporating
word context, masked-language modelling was used as a new training objective.
Here, BERT is shown a sequence of tokens of which a percentage (15% in the
original paper) is hidden or masked. Now, BERT is tasked with reconstructing
the original sequence by filling in these masked words.
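The sketch below illustrates this masked-language-modelling behaviour, assuming the Hugging Face transformers library; the model name and example sentence are illustrative only, and a Dutch model could be substituted.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("The judge [MASK] the appeal."):
    # Each prediction fills the masked position with a probable token.
    print(prediction["token_str"], round(prediction["score"], 3))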
It is also important to note that BERT uses WordPiece to derive tokens from
the corpus. This is a tokenization approach where not words, but subwords are
the main block of information. Using subwords has two main advantages. First,
uncommon words that otherwise would not fit in the vocabulary can be broken
down into pieces that may be processed by the model. Second, the model will be
exposed to the root of words and therefore might encode similar words in similar
ways. For example, the model might interpret the words ’prison’ and ’imprisoned’
as similar, instead of seeing them as completely different words.
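The sketch below shows such subword tokenization in practice, again assuming the Hugging Face transformers library; the exact splits depend on the vocabulary of the chosen model, so the outputs are indicative only.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ("prison", "imprisonment", "unimaginatively"):
    # Uncommon words are broken into subword pieces (continuation pieces start with "##").
    print(word, tokenizer.tokenize(word))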
Many works proposed adaptations of BERT. Notably, Y. Liu et al. (2019) found that BERT was undertrained and could be trained more extensively. With RoBERTa, a superior model is proposed that relies on a more robust pretraining approach. Both BERT and RoBERTa were pretrained using English data. A well-performing, multilingual version of BERT also exists (Wu & Dredze, 2019), but it is
outperformed by monolingual models. For Dutch, BERTje (de Vries et al., 2019)
was proposed as the counterpart of BERT, whereas RobBERT (Delobelle et al.,
2020) was proposed for RoBERTa.
More recently, we saw the advent of sequence-to-sequence models that use
the ideas from BERT. Here, the framework is specifically tasked with outputting
sequences, rather than single probabilities, such that a model is obtained that can
be applied to tasks that require the generation of sequences such as automatic
summarization. An example of these models is BART (Lewis et al., 2020). BART uses a more extensive set of pretraining objectives in comparison with BERT. It uses the following objectives (a brief code sketch of these noising transforms follows the list):

Token masking Similar to BERT, a percentage of tokens in the text are masked at random and the model has to reconstruct the original text.
Sentence permutation The text is split up into sentences (based on full stops)
and then these sentences are shuffled. The model has to reconstruct the text
again.

Document rotation A new start token is picked at random and the document
is rotated such that it starts with this new token. Again, the model has to
reconstruct the original text.

Token deletion Tokens are deleted from the text. Here the model has to pick
the positions of the deleted tokens. Notice that this differs from simply
masking the tokens.

Text infilling Similar to masking, but here random spans of texts are replaced
by a single mask token. The spans mostly have a length of 0 to 9 tokens.
Spans of zero length can also be replaced, which is equal to simply inserting
a mask token into the text. Notice that a span is always replaced by one
mask token, regardless of the length of the span.
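As noted above, the following sketch illustrates these noising transforms on a toy text, in plain Python. The parameters (masking rate, span length) are illustrative and do not reproduce the exact sampling of Lewis et al. (2020), who, for instance, draw span lengths from a Poisson distribution.

import random

random.seed(0)
MASK = "<mask>"

def token_masking(tokens, rate=0.15):
    # Replace a fraction of tokens with a mask token; the model must restore them.
    return [MASK if random.random() < rate else t for t in tokens]

def token_deletion(tokens, rate=0.15):
    # Delete tokens; the model must also decide where tokens are missing.
    return [t for t in tokens if random.random() >= rate]

def sentence_permutation(text):
    # Split on full stops and shuffle the sentences.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

def document_rotation(tokens):
    # Rotate the document so that it starts at a randomly chosen token.
    start = random.randrange(len(tokens))
    return tokens[start:] + tokens[:start]

def text_infilling(tokens, span_length=3):
    # Replace a whole span by a single mask token, regardless of the span length.
    start = random.randrange(len(tokens) - span_length)
    return tokens[:start] + [MASK] + tokens[start + span_length:]

text = "the court dismissed the appeal. the suspect was sentenced."
tokens = text.split()
print(token_masking(tokens))
print(token_deletion(tokens))
print(sentence_permutation(text))
print(document_rotation(tokens))
print(text_infilling(tokens))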
3. Related Work

In the previous chapter, relevant components of Deep Learning frameworks were introduced. In this chapter we will discuss more advanced techniques that shaped the state-of-the-art of text summarization.
This chapter is structured in three parts. First, in section 3.1, cornerstone
approaches to Text Summarization are discussed. This discussion starts off broadly
with a categorization of common approaches and methods, but will eventually be
narrowed down to the field of Abstractive Summarization (AS). The models that
we used in our experiment also belong to the group of abstractive summarization
methods. Second, in section 3.2, current research on summarization of legal
documents will be discussed. Third, in section 3.3, a concise overview of relevant
clustering techniques will be given.

3.1 Text Summarization Using Deep Learning


As was highlighted in the previous chapter, deep learning currently dominates
many tasks in natural language processing. To this, the task of text summarization
is no exception. There are, however, also less complex ways of approaching text
summarization. Only with the advance of research on neural networks and com-
puters becoming increasingly more powerful, did the field of text summarization
start to become dominated by deep learning.
In essence, most summarization approaches either belong to the group of
extractive summarization or to the group of abstractive summarization (Huang
et al., 2020). In extractive summarization sequences of words are identified and
extracted from the source text. An extractive algorithm is mainly tasked with
judging the relevance of sequences in the text and subsequently combining these
sequences into a summary. Thus, the result will be a summary that consists of a
chain of sequences, each of which is found in the source text.
Abstractive summarization, on the other hand, entails the more difficult process
of abstracting information from the source text. Here, summaries are generated
word-for-word based on the model's understanding of the source text.
For a summarization model to create an accurate summary of a source text,
the model must have a good representation of the source text. To obtain this, it
first needs to have some sort of general representation of the language the text
is written in. That is, it needs to have the capacities of a language model (see
section 2.3.1 for information about Language Models). So, before we can train an
abstractive model to summarize a text, we first need to train a language model.
Training of a language model requires a sizeable dataset containing a large
variety of texts and often is only feasible with dedicated hardware. Unsurprisingly,
early related work in automatic summarization was more heavily focused on extractive approaches rather than abstractive approaches. Fortunately, language models can be generalized easily to a wide range of texts. Therefore, since language
models are increasingly published as open-source, more researchers are able to
utilize language models for their research.
In section 3.1.2 we will give an overview of some extractive summarization
approaches and in section 3.1.3 a more extensive overview of abstractive approaches
is given. First, in the next section, we will discuss techniques for evaluating
generated summaries.

3.1.1 Evaluation of Generated Summaries


There exist multiple metrics to compare summarization systems of which some
are frequently reported on. In general, these metrics can be divided in two
groups: automatic evaluation metrics and human evaluation metrics. Automatic
evaluation metrics are easy to compute by comparing generated summaries with
reference summaries. Human evaluation metrics, on the other hand, are metrics
that require either prior-labeling of summaries, e.g. the Pyramid method (Nenkova
& Passonneau, 2004), or the direct human-assessment of generated summaries,
e.g. Grusky et al. (2018).
Regarding automatic evaluation, there really only is a single group of measures
that is commonly used and that is ROUGE (C.-Y. Lin, 2004). ROUGE is a
family of measures that computes the overlap between a generated summary and
a reference summary. Overlap is measured in n-grams of words. The n-grams can
be of any size, but most commonly the 1-gram and 2-gram scores are reported
together with the longest common sub-sequence. These are respectively called the
ROUGE-1, ROUGE-2 and ROUGE-L scores. Each of these metrics is commonly
measured in one or more of three ways. The precision can be measured, which is
the number of matching n-grams between both summaries divided by the number
of n-grams in the generated summary. Instead, recall can be measured by dividing
the number of matching n-grams by the number of n-grams in the reference
summary. Finally, the F-score can be computed using both precision and recall.
As an example, the ROUGE-2 recall metric for the full dataset is computed as
follows:
\[
\mathrm{ROUGE\text{-}2} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\mathit{matched\text{-}2\text{-}grams}(S_i^{*}, S_i)|}{|\mathit{2\text{-}grams}(S_i)|}
\]
where N is the number of cases in the dataset, S_i is the reference summary, S_i^* is the generated summary and |matched-2-grams(S_i^*, S_i)| is the number of 2-gram matches between the generated summary and the reference summary.
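The sketch below computes this ROUGE-2 recall with literal 2-gram matching on a toy set of summary pairs; in practice an existing package such as rouge-score would typically be used, and the example pairs are invented.

from collections import Counter

def bigrams(text):
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

def rouge2_recall(generated, reference):
    gen, ref = bigrams(generated), bigrams(reference)
    matched = sum((gen & ref).values())         # clipped 2-gram matches
    return matched / max(sum(ref.values()), 1)  # divided by the 2-grams in the reference

# Averaging over a (toy) set of generated/reference summary pairs:
pairs = [
    ("the appeal was dismissed by the court", "the court dismissed the appeal"),
    ("the suspect received a prison sentence", "the suspect was sentenced to prison"),
]
score = sum(rouge2_recall(g, r) for g, r in pairs) / len(pairs)
print(round(score, 3))  # 0.35 for these two toy pairs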
Despite its popularity, ROUGE is frequently criticized for its inflexibility. This
inflexibility stems from the literal matching of n-grams. For example, if a reference
summary would consist of the simple phrase The thief stole the money and the
generated summary would have been The criminal took some cash, then we would
want this summary to have a good evaluation. However, the ROUGE-1 metric
would tell us that only the first word, the, was correctly generated and would consider the rest of the summary as completely incorrect. In the case of ROUGE-2,
even the complete summary is evaluated as incorrect.
There are more recent metrics that try to mitigate this problem. BERTScore,
for example, uses BERT-generated contextual embeddings of words, instead of
the words themselves, when comparing the generated summary with the reference
summary (T. Zhang et al., 2019). This means that more information about
the words is taken into consideration when two words are compared. In this
case, our example sentence, despite its synonyms, might get a positive evaluation.
Unfortunately, the authors did not include the task of automatic summarization
in their experiment. For the tasks machine translation and image captioning,
however, BERTScore proved to correlate better with human judgments than the
standard metrics for these tasks, such as BLEU.
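A minimal usage sketch, assuming the bert-score package; the candidate and reference are the example sentences from the ROUGE discussion above.

from bert_score import score

candidates = ["The criminal took some cash"]
references = ["The thief stole the money"]

# Returns precision, recall and F1 tensors based on contextual embeddings.
precision, recall, f1 = score(candidates, references, lang="en")
print(f1.mean().item())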
Human evaluation metrics are less standardized than automatic evaluation
metrics. Many papers do report some kind of human evaluation results besides
reporting ROUGE scores. However, these evaluations take many different shapes.
In some cases domain experts are tasked with the evaluation, in other cases less
costly approaches are taken, such as usage of Amazon Mechanical Turk (e.g. J.
Zhang et al. (2020)). There are proposals for standardizing this process. In this
thesis project we will consider such an approach, namely the protocol proposed
by Grusky et al. (2018). Specifically, this protocol formulates four questions, each
measuring a different dimension of the generated summary. The authors included
two semantic dimensions and two syntactic dimensions. Three of these dimensions
were initially used by Tan et al. (2017) and one was introduced by Paulus et al.
(2017). The protocol by Grusky et al. (2018) combines these four dimensions
and formulates a question for each of them. In section 4.2.1, we will discuss this
protocol in detail.
Despite all these metrics, there is no perfect metric for evaluating automatically
generated summaries: how summaries best can be evaluated remains an open
question (Saggion & Poibeau, 2013). This is the direct result of the ambiguity of
the task in general. There is no clear and consistent way of saying that a summary
A is better than a summary B. For example, it could be that summary A does a
better job at covering the main points of the source text, whereas summary B
contains fewer factual errors in the facts that it covers from the source text.

3.1.2 Extractive Summarization


An example of an extractive approach is TextRank (Mihalcea & Tarau, 2004). The
authors applied the graph-oriented algorithm that underlies Pagerank (Page et al.,
1999) to summarization (and keyword extraction). This algorithm measures the
importance of a unit of interest - sentences in the case of TextRank - dependent
on its relation with other units of interest and the importance of these units. The
authors of TextRank use similarity of two sentences as a measure of relatedness.
If two sentences share more words, their relatedness has a higher weight, which is
represented by a weighted edge in the directed graph. This graph is recursively
traversed to compute the importance of a vertex (i.e. sentence) depending on
other vertices’ edges to it, the weight of these edges, and the importance of these vertices.
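The sketch below shows a TextRank-style extractive summarizer, assuming networkx for the graph and PageRank computation. Sentence similarity is approximated here by raw word overlap, whereas the original work normalizes the overlap by the sentence lengths.

import itertools
import networkx as nx

def summarize(text, n_sentences=2):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i, j in itertools.combinations(range(len(sentences)), 2):
        overlap = len(set(sentences[i].lower().split()) & set(sentences[j].lower().split()))
        if overlap:
            graph.add_edge(i, j, weight=overlap)    # weighted edge between related sentences
    scores = nx.pagerank(graph, weight="weight")    # importance of each sentence vertex
    top = sorted(scores, key=scores.get, reverse=True)[:n_sentences]
    return ". ".join(sentences[i] for i in sorted(top)) + "."

text = ("the court dismissed the appeal. the appeal concerned a tax dispute. "
        "the suspect was sentenced to prison.")
print(summarize(text))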
The strength of TextRank is not only that it is a simple and intuitive algorithm, but also that applying it requires nothing more than the text itself, making it an unsupervised approach. The results of TextRank were competitive with the state-of-the-art (Mihalcea & Tarau, 2004), which mainly consisted of supervised approaches. These practical benefits and the relatively strong performance mean that TextRank is still used as a baseline for comparison
of new approaches (e.g. S. Zhang et al., 2021).
Another approach commonly used as a baseline (e.g. Zhong et al., 2020) is Lead-
3. It was introduced as such by Nallapati et al. (2017). Lead-3 might be the simplest
common approach in summarization literature: to derive a summary, it picks the
three leading sentences from the source text. The reason for its introduction, and
the popularity that followed, is its effectiveness on some benchmark datasets. The
dataset that is most often reported on, CNN/Daily Mail, consists of news articles.
It seems that for this type of text relevant information relatively often is to be
found in the beginning of the text, which explains the good performance of this
baseline.
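Lead-3 is trivial to implement; a minimal sketch with naive sentence splitting on full stops:

def lead_3(text):
    # The summary simply consists of the first three sentences of the source text.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:3]) + "."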
With the increase in popularity of deep learning, researchers also started to
apply neural models to help with extractive summarization tasks. To this end,
texts are first transformed into sequences of embeddings (see section 2.3.1) that
can be used as input to the neural models. This process can be applied to the
text at different levels; e.g. at the word-level or sentence-level.
Nallapati et al. (2017) used two bi-directional RNNs to transform texts into
embeddings. The first RNN generated word-embeddings. The second RNN
used these word-embeddings to generate sentence embeddings. The sentence
embeddings are then fed to a binary classifier that classifies each sentence as either
belonging or not belonging to the summary.
Since 2019, the use of transformer-like language models gained momentum.
BERT was used in BERTSUM (Y. Liu & Lapata, 2019) to obtain sentence
representations. In the extractive model multiple transformers are applied to these
sentence representations, capturing latent document-level features that
are used to extract relevant sentences.
Zhong et al. (2020) propose MatchSum, which can be seen as an extension
of BERTSUM. The model uses BERTSUM to score sentences on saliency. Next,
candidate summaries are generated using all combinations of the most salient
sentences. Then, two BERT models are used to obtain embeddings for each
candidate summary and the source text, which are used to compute a similarity
score between the two. Finally, the candidate summary that is most similar to
the source text is chosen as the final summary.

3.1.3 Abstractive Summarization


Research in abstractive summarization gained traction after the work of Rush et al.
(2015). The authors were the first to successfully apply a neural model to solve
a summarization task. To do this they used an attention-based encoder-decoder
model. Although this model improved over chosen baselines (mostly rule-based extractive systems), results were still limited. The main flaw is that generated
sentences have an incorrect word order and therefore contain syntactical mistakes.
The authors also transformed the Gigaword dataset to create one of the earlier
large summarization benchmarks.
Quite soon after, CNN/Daily Mail (Nallapati et al., 2016) was proposed as a
new benchmark dataset. The dataset contains bullet-point, and therefore multi-
sentence, summaries of news articles of CNN and Daily Mail and prevailed as one
of the main benchmarks for the task of summarization.
Besides introducing this benchmark, the authors also proposed a few novel
model components aimed at solving limitations of earlier models. First, a hierarchical attention component was described that is active at both the word-
level and the sentence-level. The goal of this component is to not only have the
model attend to a word based on the perceived importance of the word, but also
based on the importance of the sentence the word is found in. Second, more emphasis
was put on identifying key-words in the text. This was achieved by supplying
the input words with TF-IDF scores and part-of-speech tags. Third, pointer
functionality was presented to allow the model to include words in the summary
directly from the source document. This is beneficial in case a word is important
in the source text, but lacking from the model’s vocabulary. As the vocabulary is
the set of words that the model can recognize and produce, the model lacks the
ability to generate out-of-vocabulary words unless a procedure such as pointing is
included.
The main flaw indicated by the authors was the repetition of phrases in the
generated summary. Intra-attention (or self-attention) is recommended as a means
to account for this repetition.
The intra-attention component was studied in the following year by Paulus
et al. (2017). They combined the encoder-decoder RNN with this component.
Furthermore, to minimize ’exposure bias’ of the model, training was partially
shaped as a Reinforcement Learning problem, instead of the usual Supervised
Learning problem. Three intra-attention models are compared: one solely using
RL, one solely using Supervised Learning, and one hybrid. The authors show that
the RL model quantitatively (measured by ROUGE-1) performs best. However,
qualitatively (measured by readability) the model performs worst: the hybrid
model performs best in this regard. The authors conclude that both models
performed better than the state-of-the-art and that, in some cases, the quantitative
measure alone can be deceptive in measuring model performance.
Another work that successfully applied Reinforcement Learning to text sum-
marization was Chen and Bansal (2018). Here, reinforcement was used to extract
suitable sentences from source documents, which then were rewritten. This hybrid
approach was extended by Xiao et al. (2020), who made rewriting optional.
In Al-Sabahi et al. (2018), state-of-the-art improvements are reported using
intra-attention with extra input at each time step. This input consists of a
weighted average of each of the previous states of the model. The authors chose
to include this input to enable the model to more easily attend to earlier states.
Another approach to countering repetition within generated summaries is the
usage of coverage models (See et al., 2017). Here an extra learnable parameter is added to the attention mechanism. This parameter serves as a memory of tokens the mechanism previously has focused on. Besides the parameter, a coverage term was also added to the loss function. Ultimately, these measures should
steer the model away from the generation of repetitive sequences. Where See
et al. (2017) focused on extending the generation component of the model, J. Lin
et al. (2018) added the notion of global encoding: this encoding aims to refine the
model’s representation of the source text. The authors aimed to tackle repetition
in generated summaries.
J. Zhang et al. (2020) studied pretraining objectives for a transformer architecture for their effect on downstream tasks such as summarization. The authors
propose Gap Sentences Generation (GSG), a pretraining objective that uses an
’artificial’ summarization set-up. For a large corpus, summaries were generated
from texts by selecting the top-important sentences from the source text, with importance measured by ROUGE1-F1. To account for the abstractive nature of the
task, some of these sentences were masked in the generated summary. Using this
pretraining set-up led to new state-of-the-art results on the main summarization
benchmarks, except for Gigaword.
Recently, adaptations of the typical transformer were proposed (Beltagy et
al., 2020; Zaheer et al., 2020). These adaptations reduced the computational
complexity of the attention mechanism, effectively allowing for a longer sequence
length to be considered as model input. These models showed promising results on
a variety of NLP tasks, including abstractive summarization. Notably, considerable
improvements over previous models were achieved for datasets that deal with
longer source documents, such as BigPatent.

3.2 Summarization of Legal Documents


Recently, law has become the focus of an increasing number of automatic summarization studies. In this section, these studies will be discussed. The approaches taken in
these papers sometimes build upon frameworks that were discussed in section 3.1.
Often, however, the approaches describe rule-based systems specifically designed to
address the task of interest to the papers’ authors. These systems are highly adapted
to one dataset and therefore not easily generalizable to other legal datasets. Only
if a significant or relevant finding related to such a system was reported will we
mention the approach. For a more extensive overview of these and other legal
summarization systems, please see the work of Bhattacharya et al. (2019).
In 2005, sentence extraction was applied to 188 judgments from 2001
until 2003 from the UK House of Lords (Hachey & Grover, 2005). The authors
annotated the judgments’ sentences with a rhetorical role and a relevance score
before using Naive Bayes and Maximum Entropy to rank and extract relevant
sentences, i.e. those sentences that are most informative, from judgments.
Another relatively early work where rhetorical roles were used stems from 2006.
Here, sentences in judgments of the high court of Kerala (India) were annotated
with rhetorical roles, after which a conditional random fields model was used to
extract key sentences and combine these to form a summary (Saravanan et al.,
2006). In total, seven rhetorical roles were used to annotate sentences. Examples
are ’identifying the case’, ’arguments (analysis)’, and ’final decision (disposal)’.
Both of these two approaches used rhetorical roles to perform extractive
summarization of judgments. One downside to these approaches is the need for
human annotation of cases. Furthermore, as was implied by the authors of the
Kerala judgments paper, different law datasets might require different sets of
rhetorical roles. In our experiment, we will work with a framework that only
considers the source texts and true summaries and not any secondary information.
There are two main reasons for this. First, by not making our framework depend on dataset-specific information, the framework will be easier to generalize to
other domains and datasets. Second, as secondary information often needs to be
supplied on sentence-level (e.g. see above two approaches) instead of document-
level, time-costly manual labelling needs to happen for each of the dataset’s cases.
Not only does this require the necessary expertise; it also means that either only a
very small dataset can be used, or that an unreasonable amount of time is required
to label each of the cases. Furthermore, as deep learning models were studied in
our experiments, a small dataset would have been an immediate drawback of the
proposed method.
Polsley et al. (2016) propose CaseSummarizer, a tool combined with a web interface to automatically summarize legal judgments. Using TF-IDF and domain-
dependent features, such as the number of entities in the text, sentences are
ranked. Then, the most important sentences are combined into a summary that is
customizable by the user. The system was evaluated using 3890 legal cases from
the Federal Court of Australia. Human evaluation of the system showed that
it outperforms other summarization systems. These other systems were mostly
dataset-agnostic (thus non-legal), which might explain the difference. Summaries
that were made by experts still outperformed CaseSummarizer (and other systems)
by a large margin. Here, the experts also used extractive summarization to create
a summary. Thus, the authors highlighted, sentence extraction could be a viable
method of legal case summarization, albeit that current systems are lacking.
C.-L. Liu and Chen (2019) studied a highly similar dataset to ours. The
authors studied judgements of the Supreme Court of Taiwan. These judgements
were sometimes published with a summary, comparable to the Dutch published
cases. The authors chose to treat the problem as an extractive summarization
problem due to many summaries containing statements that were directly selected
from the judgement text.
Kornilova and Eidelman (2019) introduced the BillSum dataset. It consists
of 22,218 US congressional bills. The text is semi-structured. Of the available
common benchmarks, BillSum might best resemble the Rechtspraak dataset that
is used in this thesis project.
Finally, Luijtgaarden (2019) applied the reinforcement learning approach by
Chen and Bansal (2018), which was discussed in the previous section, to the
Rechtspraak dataset. The author found that the model cut off sentences too
early, leading to grammatical errors in the summaries. Five generated summaries
were evaluated on relevance and readability by two law students. In general, the
students preferred the reference summaries to the generated summaries. However,
in cases where a reference summary consisted only of keywords, the students preferred the generated summary.

3.3 Clustering
In our study of Dutch legal cases, we tested our hypothesis that it is beneficial to
cluster the data before abstractive summarization is applied. Clustering simply
is the partitioning of data into groups. The general aim, as is ours, is to obtain
clusters that contain similar data items while items from different clusters should
be dissimilar.

3.3.1 k-means clustering


A common approach is k-means. The general algorithm follows these steps (a minimal code sketch follows the list):
1. Choose the number of clusters, k, that the data will be partitioned into;
2. Randomly assign each cluster center (i.e. centroid) a location in the space
that contains the data. A common approach is to assign the centroids to
the locations of random data-points from the dataset;
3. For each data-point, compute its Euclidean distance to each of the centroids
and assign the data-point to the closest centroid;
4. Update the centroids of each cluster with the mean of all data points
belonging to that cluster;
5. Repeat steps 3 and 4 until no further centroid updates occur.
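A minimal sketch of these steps, assuming NumPy and a feature matrix X with one row per data point; the toy data consists of two well-separated blobs.

import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize the centroids at the locations of random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign every data point to its closest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        # Step 5: stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = k_means(X, k=2)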
The limitations of k-means clustering are well-studied (Xu & Wunsch, 2005). One
limitation of k-means is the need to manually choose the number of clusters,
k. Unless prior knowledge about the data is present, there is no clear-cut way
of choosing k. Another limitation is the need for an initial assignment of the
centroids, as not every initialization yields the same converged solution. Recently,
an adaptation of k-means, U-k-means (U for Unsupervised), was proposed (Sinaga
& Yang, 2020) to deal specifically with both these limitations. Whereas the standard
k-means objective minimizes the total Euclidean distance between each point and
its centroid, U-k-means extends this objective to also incorporate entropy.

3.3.2 Mixture Models


Another model that is commonly used for clustering is the mixture model. Here,
the data is assumed to contain n latent components, each following a particular
distribution, where n has to be chosen manually. Often, Gaussian mixture models
are used, where the components are assumed to follow normal distributions. As
opposed to k-means clustering, mixture models consider the covariance of the
components when clustering the data. Interestingly, to initialize a mixture model,
the data is sometimes first clustered using k-means.
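A comparable sketch with scikit-learn's Gaussian mixture implementation is shown below; the data is again hypothetical, and the covariance_type argument corresponds to the covariance structures that are compared later in section 5.3.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    X = np.random.rand(1000, 5)   # hypothetical case features

    # 'full' gives every component its own covariance matrix; 'tied' and
    # 'diag' are the alternatives considered in section 5.3.
    gmm = GaussianMixture(n_components=6, covariance_type="full", random_state=42)
    labels = gmm.fit_predict(X)

    # Unlike k-means, the mixture model also yields soft assignments:
    # the posterior probability of each component for every data-point.
    posteriors = gmm.predict_proba(X)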
3.3.3 Clustering and Summarization


In this project, a large part of our effort is dedicated to studying the impact of
clustering on subsequent automatic summarization.
Other works have considered clustering in the context of summarization frame-
works for legal data. An early example is the work of Uyttendaele et al. (1998). As
a first step, sentences were labelled with one of a number of segment types. Then,
clustering was applied to make a suitable grouping of the sentences according to
their segment types. In Abuobieda et al. (2013), greater attention was given to
similarity measures used for sentence clustering. Yet another sentence clustering
approach was proposed by Ferreira et al. (2014). Here, clustering was applied to
a feature graph representing the document sentences.
It is apparent that clustering for summarization has been explored before.
However, in these works clustering is applied to sentences of a document before
extracting individual sentences from the obtained clusters. Thus, clustering is
used as a specific step in the summarization system. This fundamentally differs
from our approach, where clustering is done a priori to yield subsets of the dataset,
each containing legal cases that are more similar to each other in comparison with
other cases in the source dataset. How this similarity is measured is explained in
section 4.3.2.
4. Methods

In this thesis project our main aim was to explore the feasibility of automatic
summarization of Dutch legal cases. In this section, the tools and techniques that
were used are introduced. As we will see, the method combines established results
from the literature to tackle the summarization problem.
In short, two frameworks were compared. The first framework is the standard
framework, in which the full dataset was used to fine-tune a single summarization
model. In the second framework the dataset was first clustered into six clusters
before a separate summarization model was fine-tuned for each of these clusters.
We hypothesized that this two-phase approach leads to an improved quality of
the generated summaries.
In section 4.1, this chapter starts off with the introduction of a set of metrics
that we used to explore our dataset and compare it to benchmark summarization
datasets. This section will also describe how the data was collected and prepared.
Then, in section 4.2, our two-sided evaluation approach is presented. Finally,
in section 4.3, the experimental setup is discussed. In this section we will go
over the technical aspects of both the clustering framework and the standard
summarization framework.

4.1 Characteristics of the Dataset (RQ1)


Before we designed our models, we required a good understanding of the dataset.
To this end, we compared our dataset to common benchmark datasets using the
set of metrics introduced by Bommasani and Cardie (2020).
Bommasani and Cardie (2020) motivated this proposed set of metrics by
arguing that the quality of summarization benchmark datasets is less guaranteed
in comparison with other NLP tasks. They identified the main cause of this quality
problem to stem from the increasing size of benchmark datasets that is required
for modern architectures. Obtaining verified gold standard summaries for such
datasets is no longer feasible, due to this human task being too resource-consuming.
Before, this process was for example carried out by groups such as the Document
Understanding Conference (DUC).
Following Bommasani and Cardie (2020), we computed two groups of features:
statistical features and complex features. The complex features have values in the
interval [0, 1]. An overview of the features is shown in table 4.1. In the next
paragraphs we will discuss each feature and show how they are computed.
The statistical features are straightforward. They describe certain counts
of the dataset such as the average number of words in the case texts and case
summaries. These features served as input to some of the complex features.


Table 4.1: The set of features proposed in (Bommasani & Cardie, 2020) to
compare summarization datasets. We will use these features to describe our
dataset. Furthermore, an adapted subset of these features was used in the
clustering framework that will be discussed in section 4.3.2.
Metric Description
Dw text length in words
Ds text length in sentences
Sw summary length in words
Ss summary length in sentences
CMP w word compression
CMP s sentence compression
TS topic similarity
ABS abstractivity
RED redundancy
SC semantic coherence

The first two complex features are compression scores. Word compression is
the inverse ratio between the length in number of words of a summary and the
length in number of words of its case description. The dataset word compression
score is obtained by averaging over all cases’ scores:
\mathrm{CMP}_w = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{S_w^i}{D_w^i}

where N is the number of cases. Sentence compression measures the same ratio,
but now length is measured as the number of sentences. Again, to obtain the
dataset sentence compression score, the individual scores are averaged:
\mathrm{CMP}_s = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{S_s^i}{D_s^i}

For both measures, higher scores indicate that, on average, a case text needs more
compression to obtain its summary.
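A small sketch of how these two compression scores could be computed is given below; it assumes the word and sentence counts per case are already available, and the field names are hypothetical.

    def compression_scores(cases):
        """cases: list of dicts with word and sentence counts for text and summary."""
        n = len(cases)
        cmp_w = 1 - sum(c["summary_words"] / c["text_words"] for c in cases) / n
        cmp_s = 1 - sum(c["summary_sents"] / c["text_sents"] for c in cases) / n
        return cmp_w, cmp_s

    # Two hypothetical cases.
    cases = [
        {"text_words": 800, "text_sents": 40, "summary_words": 50, "summary_sents": 3},
        {"text_words": 500, "text_sents": 25, "summary_words": 40, "summary_sents": 2},
    ]
    print(compression_scores(cases))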
Next, we have the topic similarity of the case text and the summary text.
This metric uses the concept of Latent Dirichlet allocation (LDA) (Blei et al.,
2003) to generate topic representations of each text. The topic model required for
this step is generated from all case texts in the dataset. To compute the topic
similarity, first, a topic distribution is generated for both the case text and the
case summary. Then, the Jensen-Shannon distance between these two distributions is
computed. Finally, we obtain the dataset topic similarity score by first subtracting
each individual score from 1 and then taking the average of the obtained scores:
\mathrm{TS} = 1 - \frac{1}{N} \sum_{i=1}^{N} \mathrm{JS}(\theta_{D_i \mid M}, \theta_{S_i \mid M})
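The sketch below illustrates this computation for a single case-summary pair with gensim and scipy; the LDA model, its dictionary and the tokenized texts are assumed to exist already, so this is only an approximation of the actual pipeline.

    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def topic_similarity(lda, dictionary, text_tokens, summary_tokens):
        """1 minus the Jensen-Shannon distance between the two LDA topic distributions."""
        def dense_topics(tokens):
            bow = dictionary.doc2bow(tokens)
            vec = np.zeros(lda.num_topics)
            for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
                vec[topic_id] = prob
            return vec

        theta_d = dense_topics(text_tokens)
        theta_s = dense_topics(summary_tokens)
        # base=2 keeps the distance, and hence the similarity, in [0, 1].
        return 1.0 - jensenshannon(theta_d, theta_s, base=2)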

The fourth complex feature is abstractivity. This is a score based on the


concept of fragments introduced by Grusky et al. (2018). A fragment is a shared
sequence between a summary and its case text. Now, to compute the abstractivity
of a dataset, the aggregate length of all fragments is divided by the length of the
summary. Finally, this value is inverted by subtracting it from 1:
\mathrm{ABS} = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \frac{\sum_{f \in F(D_i, S_i)} |f|}{S_w^i} \right)

where f is a fragment in the set of fragments F(Di , Si ) for a document Di and


summary Si , and |f | is the length in number of words of fragment f . Alternatively,
this metric could also be explained as the averaged ROUGE-1 score of all case-
summary pairs.
Redundancy is a measure that is only dependent on the case summaries. It is
measured in three steps. First, between every pair of sentences within a summary
the ROUGE-L score is measured; this score is based on the length of the longest
common subsequence of tokens shared by the two sentences. Second, the average is taken of the derived
ROUGE-L scores for the given summary. Third, to obtain the redundancy score
for the dataset, the average is taken of the scores for all summaries. These steps
are summed up as follows:
\mathrm{RED} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|Z|} \sum_{(x, y) \in Z \times Z} \text{ROUGE-L}(x, y)

where Z are the sentences in the summary.


Finally, semantic coherence is measured by computing the average probabil-
ity that a language model predicts for each successive sentence in a summary.
Bommasani and Cardie (2020) used the standard BERT model to compute this
probability. However, as this model was trained using only English texts, we had
to choose a different language model. To this end, we selected the multilingual
BERT, or mBERT, model. To obtain the dataset semantic coherence score, we
have to take the average of the scores for all summaries:
\mathrm{SC} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|Z| - 1} \sum_{j=2}^{|Z|} \mathrm{mBERT}(Z_{j-1}, Z_j)

where Z are the sentences in the summary.
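One way to obtain such a probability is through the next-sentence-prediction head of multilingual BERT, as sketched below. This is our reading of the mBERT(·, ·) term; the checkpoint name is the standard Hugging Face multilingual BERT model.

    import torch
    from transformers import BertTokenizer, BertForNextSentencePrediction

    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = BertForNextSentencePrediction.from_pretrained("bert-base-multilingual-cased")
    model.eval()

    def next_sentence_prob(sent_a, sent_b):
        inputs = tokenizer(sent_a, sent_b, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        # Index 0 of the NSP head corresponds to "sent_b follows sent_a".
        return torch.softmax(logits, dim=-1)[0, 0].item()

    def semantic_coherence(sentences):
        probs = [next_sentence_prob(a, b) for a, b in zip(sentences, sentences[1:])]
        return sum(probs) / len(probs)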


We compared the Rechtspraak dataset with CNN/DailyMail and Newsroom,
two news article datasets, and PubMed, a dataset consisting of scientific papers
with abstracts. Of these three datasets, CNN/DailyMail is the one most reported on
in the literature. The Newsroom dataset is relatively young (published in 2018) but
is starting to be reported on more frequently too. PubMed is also consistently
reported on, as scientific papers differ greatly from news articles.

4.1.1 Collection of the Data


The dataset was collected from the website of Open Data van de Rechtspraak.
This external dataset consisted of one folder for each year in which cases took place.
Each of these year folders was split into 12 month folders, each containing the
case files belonging to that month. A case file describes a single case and has the
XML format. In total, the external dataset contained 3.0 million case files. This
includes all cases that were published up until 21-4-2022.
A pipeline was created to collect, parse and store all XML files from the
external dataset. This yielded the raw dataset; having a size of 1.94GB. For a
more detailed overview of the collection and parsing, see appendix A. Note that
no preprocessing has happened yet: this section and appendix A merely describe
the collection and parsing of the dataset.

4.1.2 Preparation of the Data


After collecting the data, the dataset was further processed. To account for the
input constraint of BART, which is 1024 tokens, and the scarcity of resources,
we chose to create the final Rechtspraak dataset from all cases that contain at
most 1024 words. Only cases that contain both a summary and a case text were
included. Furthermore, only cases with summaries consisting of at least 10 tokens
were included. These constraints led to a final dataset consisting of 100201 legal
cases.
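A sketch of these selection criteria, applied to a list of parsed case records, could look as follows; the field names are hypothetical, and whitespace-separated words are used as a rough proxy for tokens.

    def keep_case(case, max_text_words=1024, min_summary_tokens=10):
        """Selection criteria used to build the final Rechtspraak dataset (sketch)."""
        text, summary = case.get("text"), case.get("summary")
        if not text or not summary:                    # require both text and summary
            return False
        if len(text.split()) > max_text_words:         # respect BART's input constraint
            return False
        if len(summary.split()) < min_summary_tokens:  # drop very short summaries
            return False
        return True

    # raw_cases would be the parsed XML records; a toy example is used here.
    raw_cases = [{"text": "tekst " * 500, "summary": "samenvatting " * 12}]
    final_dataset = [c for c in raw_cases if keep_case(c)]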
The texts contained no further irregularities of importance, as the quality of
the case texts and summaries was already sufficient in the initial publication of
the data at the website of Raad voor de Rechtspraak. For example, there were no
missing values, as these were automatically removed by the length constraints
described above. The data was already published in a semi-structured
manner. Furthermore, the data was published and collected from a single source,
meaning that no data joining of any kind was required.
Depending on the part of the experiment, this final Rechtspraak dataset was
processed and used in different ways. In the case of feature computation, both
for the set of descriptive features (see table 4.1) and the set of clustering features
(see section 4.3.2), the texts were tokenized on word level and sentence level.
In the case of the summarization models, the data was tokenized on the
subword level. This was achieved using a custom tokenizer which we trained on a
large Dutch corpus. Details about this process are given in section 4.3.1.
From the dataset, three splits were created with the following percentages of
cases: 70% for the train split, 20% for the dev split and 10% for the test split. It
is important to note that these splits were only generated after we learned the
clustering (details are given in section 4.3.2) on the complete dataset and assigned
a cluster/class to each of the cases. This was done to be able to split the dataset
in a stratified manner, where the distribution of the classes in each of the splits
was constrained to be equal to the distribution of the classes in the full dataset.
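The stratified splitting can be achieved with scikit-learn, as in the sketch below; cases and cluster_labels are assumed to be aligned lists (one entry per case), and the 70/20/10 proportions follow the description above.

    from sklearn.model_selection import train_test_split

    # First split off the 70% train portion, stratified on the cluster labels.
    train_cases, rest_cases, train_y, rest_y = train_test_split(
        cases, cluster_labels, train_size=0.70, stratify=cluster_labels, random_state=42
    )

    # Split the remaining 30% into dev (20% of the total) and test (10% of the total).
    dev_cases, test_cases, dev_y, test_y = train_test_split(
        rest_cases, rest_y, train_size=2/3, stratify=rest_y, random_state=42
    )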

4.2 Method of Evaluation (RQ2)


Evaluation of the generated summaries consisted of two parts. First, as a quanti-
tative measure, we computed ROUGE scores for each of the trained models. We
chose to measure F-scores of ROUGE-1, ROUGE-2 and ROUGE-L, as this set
of scores is most commonly reported in the literature. Please see section 3.1.1 for a
reminder of how these metrics are computed.
We compare the performance of our system with the results reported by
Luijtgaarden (2019) as this work uses the same dataset. Furthermore, we will
compare our results with the ROUGE scores obtained by Lewis et al. (2020) after
applying BART to the CNN/Daily Mail dataset.
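As an illustration, the F-scores can be computed per case-summary pair with the rouge-score package; this is one possible implementation and not necessarily the exact one used to produce the numbers in chapter 5.

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

    def rouge_f_scores(reference, candidate):
        scores = scorer.score(reference, candidate)
        return {name: s.fmeasure for name, s in scores.items()}

    # Two short, hypothetical Dutch summaries.
    print(rouge_f_scores("de rechtbank wijst het verzoek af",
                         "het verzoek wordt door de rechtbank afgewezen"))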

4.2.1 Human Evaluation


Besides the automatic evaluation we also performed a human evaluation of the
model-generated summaries using the protocol proposed by Grusky et al. (2018).
This protocol was introduced as a dataset-agnostic benchmark for qualitative
evaluation of automatic summarization systems.
The authors constructed a four-dimensional template for qualitative evalua-
tion, consisting of dimensions described by Tan et al. (2017) and Paulus et al.
(2017). These dimensions are informativeness, relevance, fluency, and coherence.
Informativeness and relevance were included as semantic measures, while fluency
and coherence were included as syntactic measures. Each of these dimensions is
represented by one question, which is answered by the evaluator according to a
5-point Likert scale. An overview of the dimensions and corresponding questions
is shown in table 4.2.

Table 4.2: Template for qualitative evaluation as proposed in (Grusky et al.,


2018). Each dimension is measured using a single question. Answers are given
using a 5-point Likert scale. The set of questions is answered for a specific
article-summary pair.

Dimension Question
Informativeness How well does the summary capture the key points of the
article?
Relevance Are the details provided by the summary consistent with
details in the article?
Fluency Are the individual sentences of the summary well-written and
grammatical?
Coherence Do phrases and sentences of the summary fit together and
make sense collectively?
In their case study, the authors had 60 news articles evaluated. Their
experiment compared seven systems. Therefore, each article was bundled with
seven candidate summaries and the true summary. Each of these bundles was
subsequently evaluated by three unique evaluators. This means that for each of the
news articles all associated summaries were evaluated three times.
For our experiment, it was infeasible to find enough evaluators to evaluate a
meaningful number of cases. This is because our dataset mainly contains long
texts (as opposed to short news articles). Moreover, the domain-specific
nature of the texts, which demands considerable attention and effort from the reader,
made it challenging to find suitable evaluators. For these reasons, we chose
to deviate from Grusky et al. (2018) in this respect and had the summaries
evaluated only by the author of this thesis.

Evaluation Setup
In our setup, a random sample of forty cases was taken from the test set. Each of
these cases was presented to the evaluator together with its true summary, the
summary generated by the full summarization model and the summary generated
by the cluster summarization model; the characteristics of these two models will
be explained in detail in section 4.3. Thus, there were three summaries associated
with each case. To minimize bias, at evaluation time the label of each summary
was hidden from the evaluator and the order of the summaries was randomized.
Evaluation was done with the help of a web application that served a case
text together with its true summary and both generated summaries. This setup
is shown in figure 4.1. The summaries were shown in a random order, to minimize
the chance that the evaluator could tell which summary belonged to which system.
When evaluating a case, first the summaries were read, then fluency and
coherence were answered. Only after that, the case text was read, after which
the summaries were reread and informativeness and relevance were scored. This
approach was taken to score the non-content metrics of the summary with a
minimum amount of bias. For example, if the case had been read first, mistakes
such as factual inconsistencies in the summary might have biased the judgement of
its overall style, even though factuality has nothing to do with the overall style of
the summary. This way of approaching the evaluation process is also more in line
with how the envisaged summarization system would be used by the end user.
That is, the end user would first read the summary and only then, if deemed
relevant, read the case text.

4.3 Experimental Setup (RQ3 and RQ4)


As was stated before, we compared two frameworks. The first framework consists
of a single summarization model. In later parts of the document, we might refer
to this model as the full model or the full model framework. The second
framework consists of six summarization models, each fine-tuned on a different
subset of the dataset. These subsets were derived using k-means clustering.
Effectively, we considered these cluster-specific summarization models as a single
model in order to compare it with the full model. This single cluster model
is simply referred to as the cluster model or the cluster framework. To
obtain the results of the cluster framework, the weighted average was taken of the
individual cluster models' results.

Figure 4.1: Human evaluation setup. The case text was presented together
with the true summary and a generated summary for each of the summarization
systems. The summaries were presented in a random order. At the top of the
page the ECLI of the case is printed; this is the case identifier.
The cluster framework consists of two main components: a clustering component
and a summarization component. In figure 4.2 an overview is given of the
interaction between these components. In the following sections, both components
will be discussed in more detail. In the case of the full model framework, the
dataset is directly used as input to the summarization component. So, for the full
model only a single summarization model is fine-tuned.
We have two separate components that need to be accounted for. First, we
have the clustering component, which is only used in the clustering framework.
Here, the goal is to find a clustering of the data such that cases are most similar
within clusters, and least similar between clusters. Details of this component
are given in section 4.3.2. Second, there is the summarization component. This
component is required in both frameworks. This component consists of either
one or multiple summarization models, depending on the framework. We will use
a custom pretrained BART model as a starting point for these summarization
models. Details on how we pretrained this model, are given in section 4.3.1. The
pretrained model is fine-tuned on the Rechtspraak dataset; either on the full
dataset or on one of the cluster subsets. Specifics of this component and the
fine-tuning process are given in section 4.3.3. Details on how the components
were implemented and what hardware was used are concisely presented in section
4.3.4.

Figure 4.2: The cluster framework that was compared in this project. First,
a set of features is derived for each case in the dataset. Then, these features
are used to cluster all cases into n clusters. Finally, for each of these clusters a
separate pre-trained BART model is fine-tuned only using the cases belonging to
that cluster.

4.3.1 Obtaining a Dutch Language Model


As was discussed in section 2.3.1, language models model a representation of
natural language. Therefore, for our summarization component, a language model
was also required to obtain competitive results.
In many cases, the state-of-the-art models that are reported in the literature
are initially only evaluated on and distributed for the English language. This
was also true for BERT and BART. Since then, multilingual versions of these
models have appeared, such as mBART, as well as models that were later trained
on a specific language other than English. In our case, we did not succeed in
finding a suitable language model for the Dutch language, as the available candidates
either did not have a sequence-to-sequence architecture (e.g. BERTje and RobBERT)
or were too large to be used with our setup (e.g. mBART). Please see the introduction
of chapter 5 for specifics of our setup.
Due to these problems, we saw the need to pretrain BART from scratch on
a Dutch corpus. To this end, we used the Dutch split1 of the Colossal Clean
Crawled Corpus or C4 dataset (Raffel et al., 2020). As the name suggests, this
dataset consists of numerous crawled web page texts in many languages. Only
texts without obscenities and of certain lengths are included. In
total, the C4 dataset contains approximately 64 million Dutch documents. To
pretrain the BART model, we used a subset of 6 million of these documents2 .
1 This Dutch split of the dataset was accessed at https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned
2 The 'tiny' subset found at https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned
As language models, including BART, have differing tokenization procedures
(e.g. the tokenization algorithm and special token characters), it was required
to train a BART-compatible Dutch tokenizer following the BART tokenization
configuration. This tokenizer was trained on the previously described subset of 6
million documents. A vocabulary size of 25000 tokens was used.
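A sketch of how such a tokenizer can be trained with the tokenizers library is given below; the corpus path is a placeholder, and the special tokens follow the usual byte-level BPE convention of BART/RoBERTa, which we assume here.

    from tokenizers import ByteLevelBPETokenizer

    # Train a byte-level BPE tokenizer, the scheme BART uses, on the Dutch corpus.
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=["mc4_nl_subset.txt"],   # placeholder path to the 6 million documents
        vocab_size=25000,
        min_frequency=2,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )
    tokenizer.save_model("bart-nl-tokenizer")   # writes vocab.json and merges.txt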

4.3.2 Architecture of the Clustering Component


The clustering component is responsible for grouping the case data into a number
of, ideally, homogeneous groups. We hypothesized that clustering the data made
it easier for the summarization models to learn the patterns within clusters, as
opposed to being required to learn patterns for all data together, because the
clusters contained more homogeneous groups of data.
For example, we can see from the distribution of summary and case text lengths
in figure 4.3 that the distribution of word and sentence lengths differs
substantially between cases. Our assumption is that the relevant information
within these cases is also distributed in different ways.

Figure 4.3: Histograms showing the distribution of the cases’ summary and
description lengths, both in the number of words and the number of sentences.
For each of the distributions the right tail (top 1%) is omitted to filter outliers.
The number of bins is equal to the number of unique lengths, so all values are
included.

Furthermore, case text length and summary length are only two features of a
case. There are more interesting features, such as the topic(s) of a text, along
which two cases might also differ. We expect that clustering the data into more
homogeneous groups with respect to the combination of these features leads to an
improved ability of downstream models to correctly learn relevant dependencies
between tokens in summaries and descriptions.
An important source of variation in these features might stem from the likelihood
that not every case in our dataset is summarized by the same person, due to the
large number of summarized cases. Thus, the vocabulary used and the overall style
of two summaries might very well differ.

Input of the Clustering Component


We based the clustering on features derived from the base metrics as proposed by
Bommasani and Cardie (2020). These base metrics were also used in our dataset
comparison; see section 4.1 for an overview.
There were two main constraints that we adhered to when selecting features to
include in the clustering model.
First, there are features that are specific to this dataset, e.g. jurisdiction and
year of publication, that one may suspect to also be relevant when discriminating
between the case-summary pairs. Incorporating these features comes at the cost
of yielding an end-framework that is less easy to generalize to other datasets.
Therefore, we only included dataset-agnostic features.
Second, some of the metrics were not applicable to the clustering component,
because of their dependency on the reference summary. For example, the word
compression metric is computed as the ratio between the text length in words and
the summary length in words. Obviously, we don’t have access to the summary
if we are to generate it. Thus, all metrics that we used for clustering could only
depend on the case texts. By the same reasoning, some of the other metrics were
only applicable after slight adaptations.
In total we identified five features. These features are listed in table 4.3.
Table 4.3: The set of features used to cluster the data. The features are derived
from the metrics proposed in (Bommasani & Cardie, 2020). The base features are
shown in table 4.1.

Feature Description
Dw text length in words
Ds text length in sentences
TC topic class
RED redundancy
SC semantic coherence

Again, D denotes the case text. The first two features are simple lengths of
the case text. The first complex feature is the topic class. Here, we used the LDA
model that was trained to compute the topic similarity score in section 4.1.
The case text was fed into this model, after which it returned a distribution of
the latent topics in the text. From this distribution the topic with the highest
probability was picked as the topic of the text. In total there were five topics, each
denoted by a simple integer value. Finding the topic class can be summarized by:

\mathrm{TC} = \arg\max(\theta_{D \mid M})
where θD | M is the distribution of topics that was computed by the LDA model
for case text D.
Next, we have redundancy. Here, we steer away from Bommasani and Cardie
(2020) and simply obtain the feature by computing the ratio between the number
of unique tokens and the total number of tokens:
\mathrm{RED} = \frac{D_{w\text{-unique}}}{D_w}
Finally, the feature semantic coherence is computed in the same way as before
(see section 4.1). However, due to the case texts consisting of many sentences,
we chose to only include the first ten sentences to keep computation time within
limits. Thus, semantic coherence is computed as follows:
\mathrm{SC} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{9} \sum_{j=2}^{10} \mathrm{mBERT}(Z_{j-1}, Z_j)

where Z are the sentences in the case text.
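Putting the above together, the clustering features of a single case could be computed along the following lines; the tokenized words and sentences, the trained LDA model and its dictionary are assumed to be available, and semantic_coherence refers to a function such as the mBERT-based sketch in section 4.1.

    import numpy as np

    def clustering_features(text_words, text_sents, lda, dictionary):
        """Return the five clustering features for one case text (sketch)."""
        d_w = len(text_words)    # text length in words
        d_s = len(text_sents)    # text length in sentences

        # Topic class: the most probable latent topic according to the LDA model.
        bow = dictionary.doc2bow(text_words)
        topics = lda.get_document_topics(bow, minimum_probability=0.0)
        tc = max(topics, key=lambda t: t[1])[0]

        # Redundancy: ratio of unique tokens to total tokens in the case text.
        red = len(set(text_words)) / d_w

        # Semantic coherence over the first ten sentences.
        sc = semantic_coherence(text_sents[:10])

        return np.array([d_w, d_s, tc, red, sc])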

4.3.3 Architecture of the Summarization Component


The objective of the summarization component was to find summarization models
that are able to accurately generate summaries for cases from the test set. In section
4.2 an overview was provided of how we evaluated the generated summaries.
We used the pretrained BART model (section 4.3.1) as the base of each of
the summarization models. This means that we started with an already trained
model and only fine-tuned it on the Rechtspraak dataset.
For each of the clusters obtained by the clustering component, a separate
BART model was fine-tuned. These models only had access to the data contained
in the specific cluster. When testing the model, to summarize a new case, the new
case was first assigned to one of the clusters, before a summary was generated
with the corresponding fine-tuned summarization model.
Nothing was changed regarding the configuration of the pretrained model.
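A condensed sketch of this fine-tuning step with the Hugging Face Seq2SeqTrainer is shown below; the model path, dataset variables and hyperparameters are illustrative only and do not reproduce the exact (TPU-based) training configuration described in section 4.3.4.

    from transformers import (BartForConditionalGeneration, BartTokenizerFast,
                              DataCollatorForSeq2Seq, Seq2SeqTrainer,
                              Seq2SeqTrainingArguments)

    # Start from the Dutch BART model pretrained in section 4.3.1 (placeholder path).
    tokenizer = BartTokenizerFast.from_pretrained("bart-nl-base")
    model = BartForConditionalGeneration.from_pretrained("bart-nl-base")

    def preprocess(batch):
        inputs = tokenizer(batch["text"], max_length=1024, truncation=True)
        labels = tokenizer(text_target=batch["summary"], max_length=150, truncation=True)
        inputs["labels"] = labels["input_ids"]
        return inputs

    # raw_train and raw_dev are assumed to be datasets.Dataset objects for one cluster
    # (or for the full dataset, in the case of the full model).
    train_ds = raw_train.map(preprocess, batched=True)
    dev_ds = raw_dev.map(preprocess, batched=True)

    args = Seq2SeqTrainingArguments(
        output_dir="bart-nl-rechtspraak",
        per_device_train_batch_size=8,
        num_train_epochs=10,
    )
    trainer = Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=dev_ds,
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()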

Loss Function
A loss function is used to evaluate the predictions of a model during training.
Each candidate summary that is generated is compared to the true summary.
The outcome of this comparison, the loss, is fed back into the model to steer its
learning in the envisaged direction.
For our summarization models we used the cross-entropy loss. For each of the
steps in the generation process of a summary, the cross-entropy loss compares the
predicted probability of a token with the true probability of that token. The loss
is defined as:
\mathrm{loss} = -\sum_{i=1}^{\text{output size}} y_i \cdot \log \hat{y}_i
where yi is the true probability of the token, ŷi is the predicted probability of the
token and output size refers to the size of the output, which corresponds to the
size of the vocabulary.
Conveniently, this formula can be simplified to:

\mathrm{loss} = -y_{\text{output}} \cdot \log \hat{y}_{\text{output}}

where youtput and ŷoutput refer to the true probability of the true token and the
predicted probability of the true token respectively.
This simplification follows from the fact that only one token is the true token
at a specific time step. The true token has probability 1 whereas each of the other
tokens has probability 0, meaning that all terms, other than the term of the true
token, will be zero.
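To make this concrete, the small numeric example below (plain NumPy, purely illustrative) shows that for a single generation step over a toy vocabulary of five tokens, the full sum and the simplified form give the same value.

    import numpy as np

    # Predicted probability distribution over a toy vocabulary of 5 tokens.
    y_hat = np.array([0.05, 0.10, 0.70, 0.10, 0.05])

    # One-hot true distribution: token 2 is the correct next token.
    y = np.array([0.0, 0.0, 1.0, 0.0, 0.0])

    full_loss = -np.sum(y * np.log(y_hat))   # the general formula
    simplified = -np.log(y_hat[2])           # only the true token contributes

    print(full_loss, simplified)             # both are approximately 0.357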

4.3.4 Details on Implementation


Most of the implementation was done using Python.
To compute the features of the cases, we used the following libraries. The
implementation of the k-means models and Gaussian mixture models was done
with scikit-learn. The Dutch sentencizer and tokenizer of the natural language
library spacy were used to split the documents into sentences and word tokens.
We used gensim to learn an LDA model from the dataset and used this model to
compute topic distributions for the cases.
For all deep-learning related tasks, the main library we depended on was the
transformers library by Huggingface. To load and serve the dataset we used the
datasets library. To be able to train our own language model, we first had to
train a Dutch tokenizer for BART. This was done with the library tokenizers.
Both these libraries are also developed and maintained by Huggingface. This part
of the experiment was run on a v3-8 Tensor Processing Unit (TPU), for which
access was obtained via TPU Research Cloud by Google. The TPU was necessary
for both pretraining and fine-tuning BART, not only because of its significantly
higher speed in comparison with our local setup, but also because the BART
model was too large to fit in the memory of the local GPU.
Finally, numpy, pandas, and matplotlib were used for most of the other
tasks, such as reading, handling, storing, and plotting the data.
5. Results

In this chapter we will present the results of the method discussed earlier. In
section 5.1 we start by listing the dataset features that were computed using
the framework from Bommasani and Cardie (2020). Then, in section 5.2, the
pretraining process of BART is discussed. In section 5.3 we show how we
clustered the data. In section 5.4 the main training phase is discussed. In
this phase the pretrained BART model was fine-tuned to obtain summarization
models. Finally, in section 5.5, the summarization models are evaluated and
compared. This section provides an extensive description of the human evaluation
that was performed.

5.1 Analysis of the Rechtspraak dataset


Before we move on to the other results, we provide a comparison of the Rechtspraak
dataset with other summarization datasets. To this end, we followed the approach
introduced by Bommasani and Cardie (2020). Later in the experiment, to cluster
the cases, we also used a couple of these metrics, albeit in an adjusted form. These
clustering features will be discussed in section 5.3.
To compute the semantic coherence metric we had to deviate slightly from
the approach of Bommasani and Cardie (2020). To compute semantic coherence
they used the BERT model. As this model is trained on only English texts, it was
not possible to use it for our dataset. For this reason, we used the multilingual
version of BERT1 (see section 2.3.1). This model uses the same configuration
and hyperparameters as BERT, but instead is trained on 104 languages including
Dutch. The model was initialized using the Huggingface transformers library.
In table 5.1 the computed features of our dataset are shown. Also shown are
the metrics of the CNN/DailyMail, Newsroom and PubMed datasets.
As we can see, with 100K cases, our dataset is smaller than the news datasets,
and larger than the PubMed dataset. This is partially due to the constraint of
having at most 1024 tokens in the case text. Without this constraint, the dataset
would have consisted of 460K documents. In the discussion section 6.3.1, we will
discuss this problem in more depth.
Judging from the compression scores, our dataset has relatively many words in
the summary in comparison with the case text. It is hard to judge the implications
of this characteristic. However, we suspect that it is beneficial as the model needs
to compress less information and therefore can include more parts of the case text
without having to make a strict selection.
1 https://huggingface.co/bert-base-multilingual-cased


Table 5.1: Characteristics of the Rechtspraak dataset. The values for the other
datasets are adopted from (Bommasani & Cardie, 2020). In table 4.1 a description
is given of each of the metrics.

Metric Rechtspraak CNN/DailyMail Newsroom PubMed
Total cases 100K 287K 995K 21K
Dw 666 717 677 2394
Ds 34 50 40 270
Sw 48 31 26 95
Ss 2.90 3.52 1.75 10.00
CMP w 0.742 0.909 0.910 0.870
CMP s 0.838 0.838 0.890 0.874
TS 0.775 0.634 0.539 0.774
ABS 0.135 0.135 0.191 0.122
RED 0.049 0.157 0.037 0.170
SC 0.534 0.964 0.981 0.990

For the topic similarity metric the Rechtspraak dataset scores higher than
the news datasets and comparable to the PubMed dataset. This means that the
topics found in the source text are more similar to the topics from the summary.
Again, this is beneficial as there is less implicit understanding of the case text
required to construct the summary. For example, if topic similarity were to be
very low, the model could not simply reiterate the main points from the source text,
but would first have to recast these points into a story that is in line with the topics
to be expected in the summary.
The Rechtspraak dataset also deviates with respect to its redundancy in
comparison with two of the three other datasets. Here, a lower score indicates
that there is little overlap between sentences in the summary. It is not clear how
this impacted the summarization models.
Abstractivity, on the other hand, is roughly equal for each dataset. This means
that for each dataset, on average, any summary contains approximately the same
ratio of unseen tokens in comparison with the tokens found in the case text.
Finally, most remarkable is the semantic coherence score of the Rechtspraak
dataset. As was discussed earlier, semantic coherence measures how probable it
is that a sentence B follows a sentence A. A low score is understandable if we
consider that it is common for Rechtspraak summaries to consist of key sentences
that are only loosely connected to each other. In section 5.5.3, summaries are
shown that illustrate this characteristic. See for example the reference summary
in table 5.7, which consists of two sentences that barely have a direct relationship.
As we work with sequence-to-sequence summarization models, the summary is
generated token after token, during which the previously generated tokens are
also taken into consideration. Therefore, if semantic coherence is low, this makes
it more challenging for the models to generate good summaries. The summary
model now has to learn more word connections that are foreign to the case texts
and also to the initial pretraining corpus.

5.2 Pretraining of BART


For our base language model we decided to train BART from scratch on a corpus
consisting of Dutch texts. The precise model configuration of the original BART
model was used (see the methods for details). The losses of the model during this
process are shown in figure 5.1.
We trained 1 epoch for a total of 500000 steps. With a batch size of 8, this
means that the model was trained on a total of 4 million examples. The total
training time was 5 days and 23 hours. The training loss was first logged on step 1
and then every 2500 steps.
Validation was done using a held-out set of 16000 examples. One round of
validating took approximately 12 minutes. Therefore, to not spend an excessive
amount of time on validation, we chose to validate every 25000 training steps.
On step 25000, the first validation loss was computed and logged.
As is seen in the figure, both the train and validation loss graphs show a similar
development. This almost identical pattern between both graphs is in line with
what can be expected when pretraining a language model like BART.
Unexpected, however, is the drop in loss reduction w.r.t. steps that occurs
after approximately 5000 steps. We see that the loss starts decreasing more
substantially again at step 35000 and starts following a more traditional pattern
again at step 82500. We did not find the cause of this stagnation-like phase of
the training process. Because the model seems to be learning fine from step
82500 onwards, we decided not to investigate this anomaly further.
The pretrained model that was obtained forms the basis for further experiments.
We will obtain each of the summarization models by fine-tuning this base model
using the Dutch legal cases dataset. As we will see in the next sections, fine-tuning
is relatively fast in comparison with pretraining. Also, with an intensively
trained language model like this, fine-tuning can be done with a significantly
smaller dataset (100 thousand documents, versus 4 million documents).
These two advantages of transfer learning mean that the training time of a model
can be relatively low, albeit that in this thesis we had to pretrain the language
model ourselves.

5.3 Finding a suitable Clustering Model


We compared two clustering approaches to cluster the Rechtspraak dataset. Both
approaches are discussed in detail in section 3.3.
First, we trained a k-means model for different numbers of clusters. Precisely,
we computed the distortion score for 1 ≤ k ≤ 12. These distortion scores are
shown in figure 5.2. We see that the elbow value is at k = 3.
Second, in a similar manner, we tried multiple configurations to find a suitable
Gaussian mixture model. Here we considered 1 ≤ n ≤ 9, where n is the number
of components in the model.
Figure 5.1: Pretraining losses of BART Model. The training and validation
losses measured during the training phase of the BART language model are shown.
The batch size is 8. The training loss was logged every 2500 steps and on step 1,
whereas the validation loss was computed and logged every 25000 steps with the
first measured at step 25000. Three guide lines are plotted that indicate steps
where the decrease in loss shows an unexpected pattern.

Furthermore, three different covariance types were considered: tied, where all
components share a general covariance; diagonal, where all components have a
unique diagonal covariance; and full, where all components have a unique full
covariance. The different configurations are plotted in figure
5.3. Here we see that the BIC score is lowest for a 9-component mixture model
with the full covariance type.
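Both model-selection sweeps can be reproduced with a few lines of scikit-learn, as sketched below; X is assumed to hold the clustering features of all cases, and note that scikit-learn labels the diagonal covariance type "diag".

    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    # Distortion (inertia) per k, as used for the elbow plot in figure 5.2.
    distortions = {
        k: KMeans(n_clusters=k, random_state=42).fit(X).inertia_
        for k in range(1, 13)
    }

    # BIC score per (n, covariance type) combination, as in figure 5.3.
    bic_scores = {
        (n, cov): GaussianMixture(n_components=n, covariance_type=cov,
                                  random_state=42).fit(X).bic(X)
        for n in range(1, 10)
        for cov in ("tied", "diag", "full")
    }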
To choose between both approaches, we clustered the Rechtspraak dataset
with each approach. In the case of the k-means model we deviated slightly from
figure 5.2 and chose k = 6. The elbow plot mainly indicates at what point the
improvement in distortion score might be deemed only as relatively marginal. If
we had had a very small dataset, it would probably have been best to stick with the k = 3
model. However, as we had sufficiently many cases for fine-tuning a summarization
model, we chose k = 6. This also gave us a bit more insight into how performance of
the trained summarization models related to the size of the clusters. The choice of
six is somewhat arbitrary as we could also have chosen seven clusters, for example.
The main consideration here was to not choose too large of a value for k, as then
we might have risked obtaining clusters containing too few cases.
In line with the above, we chose the number of components, n, to be 6 for
the Gaussian mixture model too. In figure 5.3, we see that there is a large gap
between the BIC scores for n = 5 and n = 6, but only a relatively small decrease
between n = 6 and n = 9. This difference is still quite large if we consider the
scale of the figure. However, we argue that 6 components was still preferred, as it
led to a fairer comparison between the Gaussian mixture model and k-means.
Figure 5.2: Elbow plot showing the distortion loss for 1 ≤ k ≤ 12.

The result of these clusterings is shown in table 5.2. Here, we see the distribu-
tion of cases over the 6 classes in case of the k-means model and in case of the
Gaussian mixture model. The main difference between the clusterings of the two
approaches is how evenly the cases are distributed. For the Gaussian mixture
model, over half of the cases ended up in the first cluster, whereas for the k-means
model there is a more even distribution.
As we do not know the minimum number of cases required for a BART model
to be fine-tuned effectively, we strove to avoid obtaining clusters with too few cases
and therefore chose to continue with the k-means model for the rest of the experiment.

Table 5.2: The number of cases per cluster for the chosen k-means and Gaussian
mixture models.

Model 1 2 3 4 5 6
k-means 23584 21875 19156 15413 13504 6669
Gaussian mixture 52520 14895 13457 8955 8161 2213

5.4 Training the Summarization Models


Having obtained the pretrained model from section 5.2, the next step was to
fine-tune the different models to get to our final summarization models. In this
section, the training phases of the different models are highlighted. In section 5.5
we will evaluate the models.
Figure 5.3: BIC scores of different Gaussian mixture models. Here, 1 ≤ n ≤ 9


was used, where n is the number of components. For each configuration three
different covariance types were considered.

5.4.1 Dataset Splits


To train the models we first split the dataset into a training, validation and test
split. These consist of 70%, 20% and 10% of the total cases, respectively.
Splitting was done in a stratified manner: the distribution of the case-assigned
cluster (see section 5.3) was used as a constraint on the splitting process. This
stratification is necessary to be able to create cluster-specific data subsets that
contain the same ratio of cases in each of the splits. In table 5.3 the number of
cases per data split is shown. Note that the total number of cases for each of the
cluster models corresponds to the number of cases per cluster as shown in section
5.3.

Table 5.3: The number of cases contained in each of the dataset splits. Effectively,
seven datasets were used in this study: one containing all cases, and, for each
cluster, one dataset containing the subset of cases belonging to that cluster.
Creation of the datasets was done in a stratified manner.

Dataset Train Validation Test Total


Full model 70140 20140 9921 100201
Cluster model 0 16509 4740 2335 23584
Cluster model 1 9453 2714 1337 13504
Cluster model 2 13409 3850 1897 19156
Cluster model 3 4668 1341 660 6669
Cluster model 4 10789 3098 1526 15413
Cluster model 5 15312 4397 2166 21875
5.4.2 Fine-tuning Losses


In the figures that follow, the training and validation cross-entropy losses of the
summarization models are shown. For each of the figures, the training loss is
shown on the left, whereas the validation loss is shown on the right. The loss
curves of the full model, which is the model that uses all data, and the loss curves
of the combined clusters model have a solid line, whereas the individual cluster
models have dashed lines.
The model that uses all data was fine-tuned first; this is the ’full model’.
Second, each of the individual cluster models was fine-tuned. In figure 5.4, the
training and validation loss of the full model is plotted together with the weighted
average training and validation loss of all the cluster models combined; we will
call this model the ’cluster model’.

Figure 5.4: Fine-tuning losses of the full model and the cluster framework.
The training and validation losses measured during the fine-tuning phase
of the full model and the weighted average losses of the cluster models. The batch
size is 8. The training loss was logged every epoch and on step 1, whereas the
validation loss was computed and logged every epoch.

The training loss was measured after every epoch and on the very first step.
The validation loss was only measured after each epoch, from which it follows that
the validation curves start at the first epoch. Furthermore, the y-axis is zoomed in
to only show the cross-entropy loss from 0 to 1 as this better shows the differences
between the two models. In reality, the training loss at step 1 was approximately
27 for both of the models. Thus, if we were to directly use the pretrained BART
model without first fine-tuning it, this would be the approximate loss that would
be obtained. From this, it also follows that most of the learning happens in the
very first epoch. In subsequent epochs the loss only marginally improves.
When we compare the curves of the two different models, we notice that both
curves follow the same pattern. The cluster model, however, yields a slightly higher
loss than the full model. Therefore, purely judging from this figure, we could
state that clustering has a negative effect on the quality of the summarization
framework. However, without considering the outputs of the model, this would
be a rushed conclusion. In section 5.5, we will evaluate the outputs of the models
to find whether this statement holds.
Another way of viewing the quality of the cluster model is by looking at its
individual components. In figure 5.5, we show the same figure as before, but now
the individual cluster models are shown. We see that there is some variation in
model performance. Intuitively, one may think that a smaller dataset negatively
impacts a model's performance. However, seeing that the largest of the cluster
models, cluster model 0, performs worst, whereas cluster model 4, which was
trained on an average-sized subset of the data, performs best, this hypothesis
becomes less fitting.

Figure 5.5: Epoch view of the Fine-tuning losses of the full model and
the individual cluster models. The training and validation losses measured
during the fine-tuning phase of the full model and the individual cluster models.

Figure 5.5 gives a somewhat distorted view of the training process. That is
because each of the models was trained for a different number of steps. This
follows from the differing sizes of the data splits. Each epoch the model passes over
the complete train split, where each step uses one batch of 8 cases. Therefore, a
better-suited view of the training process would be a plot of the loss per training
step. This view is shown in figure 5.6. Here, the exact same data is plotted.
Training for fewer steps also means that the model took less time to train. From
the nature of the experiment it follows that the total number of training steps of the
cluster models combined is roughly equal to the number of training steps of
the main model. The training time of both variants thus was also roughly equal.
We were curious to see whether a cluster model would improve if we had it
train for more steps. To this end we had cluster model 0 train for approximately
the same number of training steps as the full model. This corresponded to 43
epochs instead of the prior 10 epochs. In figure 5.7, the results of this 43-epoch
model are shown.

Figure 5.6: Step view of the fine-tuning losses of the full model and
the individual cluster models. The training and validation losses measured
during the fine-tuning phase of the full model and the individual cluster models.
This is the same data as shown in figure 5.5, but here the loss is plotted against
the training steps instead of epochs.


We see that quite soon after the initial 10 epochs, the validation loss starts
to increase while the training loss keeps decreasing. This is a clear indication
of overfitting. In a practical sense, this approach to the experiment would also
be problematic, because it would mean that a six times longer training time is
needed for the cluster framework in comparison with the standard model.

5.5 Evaluation of the Summarization Models


This section describes the main evaluation of the summarization systems. First,
summary generation is discussed in section 5.5.1. Then, the ROUGE scores are
presented in section 5.5.2. Finally, in section 5.5.3, the results of the human
evaluation are discussed.

5.5.1 Summary Generation


Generating summaries largely followed the model’s standard or training con-
figuration. It was not clear from the Transformers library and corresponding
documentation what output lengths were given as constraints during training
and the default settings for the minimum output length and maximum output
length produced unreliable summaries. Therefore, after some small experiments,
we chose to constrain the minimum output length to 40 tokens and the maximum
output length to 150 tokens. Note that one token does not have to correspond to
one word. As subword tokenization is used, it is likely that the number of tokens
is always larger than the number of words.

Figure 5.7: Fine-tuning losses of the full model and the cluster model.
The training and validation losses measured during the fine-tuning phase of the
full model and cluster model 0. Here, instead of training for the standard 10
epochs, cluster model 0 is trained for 43 epochs which translates to approximately
the same number of training steps as for the full model.
Generating the results took 25 hours and 40 minutes for the test set of the full
model and approximately the same time for the test sets of the cluster models
combined. In total there were 9921 cases in the test set of the full model, and an
equal total number of cases in the test sets of the cluster models.
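The generation call with these length constraints is sketched below in a PyTorch-style snippet; the tokenizer, the fine-tuned model and the input case text are assumed to be available, and the beam-search settings are illustrative, as the remaining decoding parameters are not listed here.

    # tokenizer, model and case_text are assumed to exist (see section 4.3).
    inputs = tokenizer(case_text, max_length=1024, truncation=True, return_tensors="pt")

    summary_ids = model.generate(
        inputs["input_ids"],
        min_length=40,         # minimum number of generated tokens
        max_length=150,        # maximum number of generated tokens
        num_beams=4,           # illustrative beam-search setting
        early_stopping=True,
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)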

5.5.2 Automatic Evaluation Using ROUGE


The automatic evaluation of the generated summaries consisted of computing the
ROUGE-1, ROUGE-2 and ROUGE-L F-scores. These scores were introduced in
section 3.1.1.
The computed ROUGE scores are shown in table 5.4. In the table, the scores
for both main models are shown, as well as the scores for the individual cluster
classes. Furthermore, the ROUGE scores reported by Luijtgaarden (2019) and
Lewis et al. (2020) are also included.
If we compare the full model and the cluster model, we see that the full model
performs slightly better than the cluster model on each of the three ROUGE
metrics. Furthermore, judging from the per-class scores, we see
that only for class 3 does the cluster model outperform the full model, albeit with a
very small difference. For all other classes, the full model outperforms the cluster
model.
Our BART model produced summaries with significantly higher ROUGE scores
than the reinforcement learning model that was used by Luijtgaarden (2019) on
the same dataset.


Furthermore, we obtained higher scores than BART on CNN/Daily Mail as
reported by Lewis et al. (2020). It is hard to make a good comparison, as the
CNN/Daily Mail dataset is in English whereas the Rechtspraak dataset is in Dutch.
However, the better performance might be explained by the differences between
the datasets as shown in table 5.1. Most importantly, the topic similarity score
is higher for the Rechtspraak dataset, meaning that the summaries are more
similar to the texts. We suspect that this makes it easier for the model to
learn patterns between the texts and summaries. Remarkable, however, is that
the semantic coherence score is much lower for the Rechtspraak dataset. This
indicates that consecutive sentences in the dataset’s summaries are only loosely
connected in comparison with the CNN/Daily Mail dataset. Especially for a
sequence-to-sequence model, where the summary is generated word-for-word, we
would expect a low score for this characteristic to be harmful for the model’s
performance.

Table 5.4: Results of the automatic evaluation of both summarization models.


We report the ROUGE-1, ROUGE-2 and ROUGE-L F-scores. Furthermore, we
compare the scores with Luijtgaarden (2019), who used a reinforcement learning
approach, and BART on the CNN/Daily Mail benchmark.

Model Dataset R-1 R-2 R-L


BART (Lewis et al., 2020) CNN/Daily mail 44.16 21.28 40.90
Luijtgaarden (2019) Rechtspraak 37.24 16.20 34.07
Full model Rechtspraak 46.52 33.74 44.88
Cluster model Rechtspraak 43.69 31.25 41.99
Class 0 full Rechtspraak 42.45 28.70 40.48
Class 0 cluster Rechtspraak 39.74 26.41 37.73
Class 1 full Rechtspraak 48.41 36.80 47.40
Class 1 cluster Rechtspraak 46.74 35.42 45.60
Class 2 full Rechtspraak 47.60 34.46 45.83
Class 2 cluster Rechtspraak 43.82 31.20 41.98
Class 3 full Rechtspraak 52.40 42.93 52.06
Class 3 cluster Rechtspraak 52.57 43.15 52.15
Class 4 full Rechtspraak 50.19 37.92 48.57
Class 4 cluster Rechtspraak 46.56 34.35 44.84
Class 5 full Rechtspraak 44.40 30.93 42.47
Class 5 cluster Rechtspraak 41.21 28.13 39.26
5.5.3 Human Evaluation


For the human evaluation, 40 cases were randomly sampled in a stratified manner
from the test set. By stratified sampling we maintain the cluster distributions
that were also present in the cluster datasets; this should make for the fairest of
comparisons in this regard.
During evaluation there were two main scenarios that made it possible to
sometimes deduce which system a summary did or did not belong to. First, a
few summaries had a mid-sentence ending. This happened for roughly 5 cases. It
is very unlikely that a true summary contains such an ending and therefore this
summary had to belong to either one of the summarization systems. Mid-sentence
ending happens when the model reaches the maximum number of summary tokens
and is forced to stop generating. The second scenario happened when one of the
summaries only consisted of unconnected short phrases that almost reduced the
summary to a set of key words. This is a common characteristic of the dataset's
summaries and happens with relatively many cases, thus hinting to the evaluator
that the summary is the true summary. By filtering out cases with summaries of
fewer than 10 tokens, we partially excluded these cases, but sometimes a summary
still had this style, albeit that it contained more phrases. This type of summary
was present for roughly 10 cases. In both these scenarios, the evaluator tried to
give as unbiased an evaluation as possible. In the case of a mid-sentence ending, the
respective summary was penalized on fluency and/or coherence.

Aggregate Results
The results of the human evaluation are shown in table 5.5. This table shows the
average score of each of the four metrics for the true/reference summaries and for
the summaries that were generated by each of the models. Furthermore, it shows
the average scores for each metric per cluster class for each of the systems. This
second view enables us to compare the performance of the cluster model versus
the full model and the reference summaries for a specific cluster.
The table shows us that, overall, the true summaries are more informative,
more relevant, more fluent, and more coherent. This is in line with the expectations
as the summarization models are trained on these true summaries and, as the
models are not perfect, underperformance is to be expected. It is interesting,
however, that with respect to fluency and coherence, both summarization models
perform comparatively well. This is further highlighted by the standard deviations
of these scores of the reference summaries overlapping with the average scores of
the summarization models. Nevertheless, the generated summaries underperform
with respect to informativeness and relevance. The metrics still have average
scores between 3.5 and 4, but in comparison with the other metrics there clearly is
a larger gap between the real summaries and the generated summaries. Following
these observations, we can state that, in comparison with the true summaries, the
summarization models do a relatively good job at producing fluent and coherent
summaries, while struggling more with keeping the summaries informative and
relevant.

Table 5.5: The human evaluation results. For each metric the average score
and standard deviation are shown. Best scores per group are highlighted in bold.
Inf. denotes Informativeness.

Model # Inf. Relevance Fluency Coherence


True summaries 40 4.13±1.04 4.80±0.61 4.75±0.67 4.45±0.81
Full model 40 3.58±1.24 4.03±1.19 4.45±0.90 4.10±1.08
Cluster model 40 3.60±1.23 3.90±1.23 4.30±0.97 4.15±1.10
Class 0 true 9 4.44±0.73 4.78±0.44 5.00±0.00 4.44±0.73
Class 0 full 9 3.11±1.36 4.11±1.17 3.89±1.27 4.00±1.22
Class 0 cluster 9 3.56±1.13 4.00±1.32 4.22±1.09 3.89±1.27
Class 1 true 5 3.80±1.30 4.40±1.34 5.00±0.00 3.40±0.89
Class 1 full 5 4.00±1.22 4.20±1.10 4.40±0.89 4.20±1.30
Class 1 cluster 5 3.80±1.10 3.80±1.30 4.00±1.00 3.80±1.64
Class 2 true 8 3.38±1.30 4.75±0.71 5.00±0.00 4.38±0.92
Class 2 full 8 3.00±1.60 4.00±1.20 4.50±0.76 3.12±0.99
Class 2 cluster 8 3.12±1.46 4.50±1.07 4.50±0.93 4.25±1.16
Class 3 true 3 4.33±0.58 4.67±0.58 5.00±0.00 5.00±0.00
Class 3 full 3 4.33±0.58 3.00±1.00 5.00±0.00 5.00±0.00
Class 3 cluster 3 4.00±0.00 2.33±0.58 3.33±1.53 3.33±1.15
Class 4 true 6 4.50±0.84 5.00±0.00 4.67±0.82 4.83±0.41
Class 4 full 6 4.50±0.55 4.67±0.82 4.67±0.82 4.50±0.84
Class 4 cluster 6 4.00±0.89 3.67±1.51 4.50±0.84 4.50±0.84
Class 5 true 9 4.33±1.00 5.00±0.00 4.11±1.05 4.67±0.71
Class 5 full 9 3.44±0.88 3.78±1.48 4.67±0.71 4.44±0.73
Class 5 cluster 9 3.56±1.24 4.00±1.12 4.56±0.73 4.56±0.53

Now, if we compare both summarization models, we see that the full data model and the cluster framework score roughly equally on all four dimensions. To get a more detailed view of the specific scores that were given per summary type, figure 5.8 is provided. This figure shows the frequency of each score (1 to 5) that was given for each metric for each summary type.

Figure 5.8: Evaluation score frequencies. Human evaluation score frequencies of each metric for the true summaries and both of the models. The legend shows the Likert scores corresponding to each color. From top to bottom, the four y-axis labels are: informativeness, relevance, fluency, and coherence.

As we see in the figure, the true summaries perform best. This corresponds to what we saw in table 5.5. There is one summary, however, that scored a 1 on informativeness. This is highly unusual, as the summaries are manually written, making it reasonable to expect that at least informativeness and relevance, the two metrics that relate to the factual content of a summary, are of reasonable quality. In this specific instance, the case describes a person's appeal against a previous verdict. The summary only listed the previous verdict, without mentioning that the appeal was well founded and therefore overruled the previous verdict. The cluster framework summary was a bit more extensive, as it also mentioned the appeal, but instead of mentioning that the appeal was well founded, it mentioned that it was unfounded; this summary got a score of 2. Only the full model summary, which got 4 points, contained both relevant parts, including the remark that the appeal was well founded.
For the true summaries, informativeness shows the most variation. This stems mainly from the fact that the true summaries are relatively often incomplete and only list a few characteristics; the summary described in the previous paragraph is an example of this. On the other hand, relevance, which also measures the content of the summary, scores better. This is because the facts that were mentioned in the summaries were almost always relevant to the case, which comes as little surprise as they were derived by humans. The two summarization models have more trouble keeping the summary relevant, as is shown by the variation in the score frequencies.

Overall, we see that the summarization models have similar frequencies for each of the scores. This also corresponds to the standard deviations that were listed for both models in table 5.5. The 1-scores for informativeness were given because, respectively, the summaries were written as a single question, contained a long paragraph of text that terminated mid-sentence, or simply contained irrelevant information.

General Observations
During the evaluation process, some other particularities were identified. In this section, we briefly go over some of these.
First of all, an important behavior to note is that the summarization models had trouble distinguishing whether information belongs to the current case (e.g. the appeal case) or to the case that is referred to in the current case. A synopsis of this referred case is often provided in the current case to give the reader some background. The problem is that the models sometimes concluded a summary by providing the verdict of the referred case, which was only given as background, instead of the verdict of the current case. In cases where an appeal leads to a different verdict, this results in the summary containing the wrong conclusion.
It is not uncommon for true summaries to be extractive by design. These summaries consist of one or more sentences that are found literally in the source text. As a result, the models are also inclined to create summaries that contain phrases taken from the source text.

Another thing that became visible from evaluating multiple cases is that the texts always had some structure: the different sections were marked with a heading describing the section's content.

We also noticed that a large portion of the cases are not initial cases but appeal cases. This might be the result of filtering out all longer cases (of length >1024 words), as initial cases might be longer on average because more information has to be covered.
Lastly, the summarization models sometimes used anonymized terms in the generated summaries. These terms are found in the source texts, but not in the reference summaries. Anonymized terms are always surrounded by square brackets and replace real names, company names, and places of residence. For example, [adres] could have been used instead of a real address. In the cases where a summarization model used an anonymized term in the summary, the summary itself was usually not of worse quality. Nevertheless, as this behavior is not in accordance with the reference summaries, the informativeness score was reduced by 1 if a summary contained such terms.

Case Specific Observations


Before moving on to the discussion of the results in chapter 6, we briefly highlight some individual cases that give insight into how each of the models compares to the true summaries. In this section, only two cases are shown. In Appendix B, two more are listed and discussed in the same manner.
To start off, three summaries are shown in table 5.6 (the complete case text can be found at https://uitspraken.rechtspraak.nl/inziendocument?id=ECLI:NL:CRVB:2012:BW3333). Each of these summaries was rated with a score of 5 on each metric. Still, besides the first sentence, each summary differs substantially from the others. This first sentence contains the judgment from the case text. As this behavior, where all three summaries start with the same sentence or phrase, is somewhat common in the evaluated cases, we suspect that there is a strong pattern in the dataset between the case text and the true summary's first sentence. The body of each summary consists of some key points from the case text. Here, each summary focuses on slightly different information from the source text, but in all cases the resulting summary is of high quality.

Table 5.6: Summaries belonging to case ECLI:NL:CRVB:2012:BW3333.
Reference summary inf.:5, rel.:5, flu.:5, coh.:5
Weigering toekenning Wet WIA-uitkering. Het bestreden besluit berust op een
voldoende deugdelijke medische grondslag. De informatie van de huisarts en de
psycholoog brengt niet mee dat appellante meer beperkingen heeft dan door
de verzekeringsarts is vastgesteld. Het vaststellen van - uit ziekte of gebrek
voortvloeiende - beperkingen voor het verrichten van arbeid aan de hand van
de medische situatie van appellante moet in beginsel tot de deskundigheid van
de verzekeringsartsen worden gerekend. Appellante is terecht in staat geacht
de geselecteerde functies te verrichten. Niet gebleken is dat de belasting in de
functies de belastbaarheid van appellante overschrijdt.
Full model summary inf.:5, rel.:5, flu.:5, coh.:5
Weigering WIA-uitkering toe te kennen. Minder dan 35% arbeidsongeschikt.
Voldoende medische en arbeidskundige grondslag. Geen aanleiding te twijfelen
aan de conclusie van de verzekeringsartsen dat appellante in staat is eenvoudig,
routinematig werk te doen.
Cluster model summary inf.:5, rel.:5, flu.:5, coh.:5
Weigering WIA-uitkering toe te kennen. De Raad verenigt zich met de overweg-
ingen van de rechtbank en maakt deze tot de zijne. Voorts is van belang dat
het vaststellen van - uit ziekte of gebrek voortvloeiende - beperkingen voor het
verrichten van arbeid aan de hand van de medische situatie van appellante in
beginsel tot de deskundigheid van de verzekeringsartsen moet worden gerekend.
Voorts geldt dat de informatie van de psycholoog niet op één lijn kan worden
gesteld met de bevindingen van een arts.

A case that highlights some of the flaws of the summarization models is listed in table 5.7 (the complete case text can be found at https://uitspraken.rechtspraak.nl/inziendocument?id=ECLI:NL:RBSGR:2010:BM2760). The true summary is concise and informative. On each of the metrics it scores 5 points. The full model summary is severely lacking. In fact, it is one of the worst summaries present in the human evaluation sample. The summary shows multiple problems, which is also reflected in the scores that were given. The summary opens with a hard-to-read sentence stating the judgment of the case. More importantly, the rest of the summary focuses on a very specific section of the case text. Also, there are anonymized terms in the case summary. For these reasons, informativeness was scored as 1. The metric relevance answers the question "Are the details provided by the summary consistent with details in the article?". Because the details all were in accordance with the case text, relevance still got 5 points. Finally, fluency and coherence scored low due to the summary terminating mid-sentence and the general difficulty of reading and understanding the summary. The cluster summary, on the other hand, scores a bit better. It starts by correctly mentioning that the case is about an 'extradition request', which already tells a lot about the nature of the case. Next, the judgment of the case is presented. Unlike in the full model summary, the judgment now makes more sense because we are aware of the case topic. In the body of the summary, some less relevant information from the case text is described. Here, a non-existent word is included, 'geatz', which might be a peculiar artifact of the subword tokenization. The summary ends with a conclusion that is relevant, but not very informative.

Table 5.7: Summaries belonging to case ECLI:NL:RBSGR:2010:BM2760.
Reference summary inf.:5, rel.:5, flu.:5, coh.:5
Kort geding; eindvonnis na tussenvonnis (LJN BM1152). Verbod op uitlev-
ering naar Marokko toegewezen, aangezien aannemelijk is dat Nederland
(straf)rechtsmacht heeft over de feiten waarvoor de uitlevering is gevraagd.
Artt. 6 EUV, 2 Sr.
Full model summary inf.:1, rel.:5, flu.:2, coh.:3
Afwijzing van het verzoek om terug te komen van hetgeen is overwogen in
het tussenvonnis van 7 april 2010. Uit de overgelegde stukken van het uitlev-
eringsverzoek en het Transito-onderzoek volgt dat de transporten waarvoor
uitlevering is gevraagd zijn (mede)gepleegd door [O.] en/of [P.]. Blijkens het
uitlevering wordt aan eiser terzake verweten dat hij de O.] en [P.] heeft gevraagd
de transport te doen en dat hij instructies heeft gegeven. Dat eiser zich mogelijk
schuldig heeft gemaakt aan uitlokkingshandelingen wordt ondersteund door de
door de OvJ en BJ2167 afgelegde verklaringen die zijn weergegeven onder 1.7
van het tussenarrest. Beiden
Cluster model summary inf.:3, rel.:3, flu.:4, coh.:5
Uitleveringsverzoek. De voorzieningenrechter ziet geen aanleiding om terug te
komen van hetgeen is overwogen in het tussenvonnis van 7 april 2010. Uit de
overgelegde stukken van het uitleveringsonderzoek volgt dat de transporten waar-
voor uitlevering van eiser is gevraagd zijn (mede)geatz door de landsadvocaat,
onvoldoende was om dit te weerleggen. Om die reden heeft de voorzieningen-
rechter gedaagde in de gelegenheid gesteld zijn standpunt met betrekking tot de
rechtsmacht van Nederland over de feiten waarvoor de uitlevering is gevraagd
nader te onderbouwen.
6. Discussion

We studied the automatic summarization of Dutch legal cases. There were constraints during the experiment that restricted the exploration of other configurations and techniques. In this chapter, we will discuss these and other limitations and will indicate possible directions for future work.

6.1 Automatically Generated Summaries


In this first section, we will discuss some limitations of the summary generation
process and how this process depends on reference summaries that are of a sufficient
quality.

6.1.1 Quality of the Reference Summaries


In projects such as ours, the summaries that are generated always depend on the reference summaries that are part of the initial dataset. It is debatable, however, whether these summaries are sufficiently accurate in describing a case. The summaries differ widely in length and structure: many are only a few sentences long. Naturally, it is hard to fit every essential component of a case into so few sentences. For this reason, we could state that the dataset is already flawed from the beginning, possibly leading to less informative generated summaries. This is something that we also identified during the human evaluation of the summaries. For example, in table B.2 we showed a very concise reference summary that only contained two sentences. In the end, this might not be what the end-users expect of the system. Instead, they might require that the summaries are more extensive and contain multiple facts from the source text.

We think that there are interesting ways in which this problem can be dealt with. First, in the early phase of the experiment, time and effort could be invested in making a strict sub-selection of the case-summary pairs, only including those cases that are accompanied by a summary of high quality. However, such an approach not only requires expensive human labelling, but it also decreases the size of the dataset and might therefore limit the models to be trained when they learn a representation of the data. Second, the possibilities of transfer learning for automatic summarization could be explored. Here, models that work well on a comparable dataset with strong reference summaries can be considered for summarizing the dataset at hand. There are different degrees of transfer learning that could be used. In the extreme case, we would assume that all reference summaries in the Rechtspraak dataset are flawed, completely disregard these summaries, and simply apply the chosen model to the dataset. To the best of our knowledge, this path of automatic summarization has not been explored before. The downside of this extreme approach is that automatic evaluation of the summaries can no longer be done with ROUGE, as ROUGE compares the generated summaries to the reference summaries. This would significantly impact our ability to compare the performance of the model with other models, as ROUGE is often used to this end. Rather, we would rely more on human evaluation, which itself is expensive. A more moderate approach to transfer learning could also be considered, in which we would still partially train using the reference summaries. However, we think that this approach would not be sensible, as ROUGE is biased towards generated summaries that are identical to the reference summaries, meaning that any deviation from the standard training setup would already lead to a decrease in the ROUGE score. In both the extreme and the moderate approach, other automatic evaluation measures seem more appropriate. For example, if BERTScore is used (see section 3.1.1), some of the dependence on the reference summaries is already removed.

6.1.2 Improving Generated Summaries


In section 5.5.3, we described some of the flaws of the generated summaries. Some cases contained references to other case texts. This could cause the summarization model to include this information in the summary of the case at hand, making it seem as if the facts of the referred case belong to this case. This problem could be addressed by preprocessing the texts and removing these references. Due to the large size of the dataset, this process must be automated.

The inclusion of anonymized terms in the summaries was another flaw of some summaries. Postprocessing can be considered to solve this problem. Another option is to add an extra term to the loss function that penalizes the model during training if it uses anonymized terms in the generated summary.
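As a sketch of this idea, and assuming (as in the examples above) that anonymized placeholders are always enclosed in square brackets, the offending terms can be detected with a simple regular expression; the penalty weighting itself is left open here.

    import re

    # Matches bracketed placeholders such as [adres], [appellante] or [O.]
    ANON_PATTERN = re.compile(r"\[[^\[\]]+\]")

    def anonymized_terms(summary: str) -> list:
        """Return all anonymized placeholders found in a generated summary."""
        return ANON_PATTERN.findall(summary)

    terms = anonymized_terms("Weigering WIA-uitkering. Eiser woont op [adres].")
    print(terms)            # ['[adres]']
    penalty = len(terms)    # could be scaled and added as an extra term to the training loss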

6.1.3 Extractive or Abstractive Summaries?


Abstractive summarization is often explained as being capable of generating words or phrases that are unseen in the source text (hence 'abstractive'). However, as we illustrated in the results (chapter 5), this certainly does not have to be the case. Whether or not the model will really generate abstracted summaries, for example by using synonyms, mainly depends on the reference summaries and on the loss function used during training. If the reference summary itself is extractive and mostly consists of phrases literally found in the source text, then, depending on the loss function, one should expect generated summaries to follow this pattern as well. If cross-entropy is chosen as the loss function, as we did in our experiment (see section 4.3.3), the model is incentivized to generate a summary that is as similar to the true summary as possible. This follows from the nature of the cross-entropy loss, where the loss decreases as the probability assigned to the true summary token increases.
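For reference, the standard token-level formulation of this objective (the usual sequence-to-sequence training loss, not a formula taken from the thesis code) is

    \mathcal{L}_{CE}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x),

where x denotes the case text and y_1, ..., y_T the tokens of the reference summary. The loss is minimized when the model assigns probability one to every reference token, so a largely extractive reference summary directly rewards copying phrases from the source text.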

6.1.4 Incorporation of Domain-Dependent Features


For each case, many domain-dependent features were present in the raw dataset (e.g. the date of the case, the jurisdiction, etc.). We chose not to include these features in the processed dataset to keep our method as generalizable to other summarization datasets as possible. However, it would be interesting to see whether including these features when clustering (with or without the text-generated features) would yield different results for the Rechtspraak dataset. The main downside of exploring this path is that, in the case of meta-features, the model becomes more domain-dependent and therefore less generalizable to other datasets. For features that can be extracted from the case-summary pairs themselves, this does not hold, and they can therefore be included without impacting generalizability.

6.1.5 Generation Time of the Results


As reported in section 5.5.1, generating the summaries took 25 hours and 40 minutes for 9,921 cases. The complete dataset as published by Raad voor de Rechtspraak contains approximately 3M cases, of which some 600K contain both a case text and a summary. If we were to summarize each of these 600K cases, almost 65 days would be needed to generate a summary for every case text. If we also consider the plans of Raad voor de Rechtspraak to publish 75% of all Dutch cases, instead of the current 5%, this time would increase even further.

Fortunately, there might be ways to reduce the time it takes to generate the summaries. In this thesis project we utilized a TPU for all training and fine-tuning, which greatly accelerated these phases of the experiment. However, we found that inference with the Hugging Face transformers models was not directly supported on the TPU, meaning that inference happened on a CPU. We are not sure whether there is an easy way to utilize the TPU for this process. At the same time, the option of inference on a GPU can be explored. However, as the 2-day period to generate all results was not critical in this project, we did not look into either of these options.
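The 65-day figure follows from a straightforward extrapolation of the measured throughput, assuming generation time scales linearly with the number of cases:

    test_cases = 9921
    test_hours = 25 + 40 / 60                           # 25 hours and 40 minutes
    seconds_per_case = test_hours * 3600 / test_cases   # roughly 9.3 seconds per case on CPU

    total_cases = 600_000                               # cases with both a text and a summary
    total_days = total_cases * seconds_per_case / (3600 * 24)
    print(f"{seconds_per_case:.1f} s/case -> {total_days:.0f} days")   # ~65 days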

6.1.6 Summary Generation Configuration


It was not always clear which configuration was used by the model during training and inference. The original BART paper is not entirely clear in this regard, and the standard configuration of the model as implemented by Hugging Face in the transformers library also shows slight differences. For example, the BART paper mentions a beam search size of 5 for generation tasks, while the base configuration in transformers uses a beam size of 4. Furthermore, the BART paper does not clearly report the input length constraints.

During this thesis project, we found that these parameters are quite important for the model when generating a summary. The possibilities regarding the configuration have not been fully explored and tested. Parameters such as the number of beams, max_length, min_length, and the length penalty might all influence the results of the experiments. We had no resources to test these parameters and chose to take a common configuration.
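For illustration, the sketch below shows how such parameters are passed to a Hugging Face model at inference time; the checkpoint path and the concrete values are placeholders and do not represent the exact configuration used in the experiments.

    from transformers import BartForConditionalGeneration, BartTokenizerFast

    checkpoint = "path/to/dutch-bart-finetuned"    # hypothetical local checkpoint
    tokenizer = BartTokenizerFast.from_pretrained(checkpoint)
    model = BartForConditionalGeneration.from_pretrained(checkpoint)

    case_text = "..."                              # a (shortened) Dutch case text
    inputs = tokenizer(case_text, truncation=True, max_length=1024, return_tensors="pt")

    summary_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,          # transformers' base configuration; the BART paper mentions 5
        max_length=256,       # hard cap on summary tokens (the cause of mid-sentence endings)
        min_length=40,
        length_penalty=2.0,   # values above 1.0 favor longer summaries
        early_stopping=True,
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)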

6.2 Improving the Described Method


In the experiment, there were certain constraints and weaknesses that, if addressed, could lead to more reliable results. We will discuss these constraints and weaknesses in this section.

6.2.1 Longer Pretraining of Base Model


We worked with a BART model that was pretrained on approximately 6 million documents of the Dutch split of the C4 dataset. The tokenizer was also trained on this same data. However, the complete Dutch split contains 64 million documents. Looking at figure 5.2, we see that despite the smaller subset, the model's cross-entropy loss had already started to converge. Pretraining on more documents is therefore likely to yield only a slight further decrease in the model's loss. However, if resources allow it, we still think that this is worth doing.

6.2.2 Human Evaluation of Generated Summaries


The generated summaries were extensively evaluated, but this was done by the author of this thesis. We realize that the setup of this qualitative evaluation was not robust enough to draw strong conclusions from. Therefore, this part of the evaluation partially served to supplement the quantitative evaluation and to provide its results with extra context. Furthermore, we showed examples to highlight strengths and weaknesses of the models with regard to the human evaluation metrics in table 4.2.

Currently, as shown in table 5.5, the summaries are evaluated positively, with average scores hovering around 4 out of 5 points. It is unclear whether this is the result of the summaries indeed being of such quality, or of the evaluator being inclined to give higher scores by nature. To avoid such bias and obtain a more robust evaluation, it would have been better to have access to a pool of evaluators, preferably from the legal domain. For example, if each case and its summary were evaluated by three evaluators, as was done by Grusky et al. (2018), a better estimate of the quality of the generated summaries would be obtained.
However, due to the problems we indicated in section 4.2.1, human evaluation of generated summaries is costly. This is especially true for a highly specific dataset like Rechtspraak. Therefore, another possible path is to remove as much subjectivity from the evaluation as possible, in order to enable the researchers to evaluate the generated summaries themselves. To this end, a rigid evaluation protocol should be used, in which the components that should be present in the generated summaries are identified clearly. Furthermore, by making the evaluation metrics countable (e.g. the number of grammatical errors, the number of identified components missing, etc.), part of the need for interpretation by the evaluator is taken away.

6.2.3 Association Dataset Metrics and Model Performance


A thorough analysis could be conducted to measure how the metrics of Bommasani and Cardie (2020) are associated with the performance of the model. This can, for example, be done through a more extensive error analysis, in which the results are grouped according to metrics of interest to see how the model performed for a specific group.

A more model-driven approach would be to perform ablation studies. In this case, subsets of the dataset should be created, where each subset is selected on a range of values for a certain metric. For example, for semantic coherence, we could create a subset that includes summaries for which this score is at least 0.8 and another subset for which this score is at most 0.4. This way, the impact of a certain score range can be measured. Care must be taken when creating these subsets, as they should probably be of approximately equal size for a fair comparison. Furthermore, ablation studies will require a lot of extra training.
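A minimal sketch of how such subsets could be constructed is given below, assuming a dataframe with one row per case-summary pair and a precomputed "semantic_coherence" column (the column name is an assumption):

    import pandas as pd

    def coherence_subsets(df: pd.DataFrame, low: float = 0.4, high: float = 0.8,
                          seed: int = 42):
        high_subset = df[df["semantic_coherence"] >= high]
        low_subset = df[df["semantic_coherence"] <= low]
        # Downsample to equal size so the comparison is not confounded by subset size.
        n = min(len(high_subset), len(low_subset))
        return (high_subset.sample(n, random_state=seed),
                low_subset.sample(n, random_state=seed))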

6.3 Architectural Considerations


In this section, we will discuss limitations of the thesis that stem from some of
the technical components used in the method.

6.3.1 Limitations of Transformer Architecture


For BART, and many other transformer-based models, only a maximum input length of 1024 tokens is allowed. For this reason, we included only shorter case texts in the dataset that was considered in this thesis. There are promising new models, however, that relax this 1024-token constraint and allow for up to 8 times more tokens as input (Beltagy et al., 2020; Zaheer et al., 2020). We recommend considering such models when the task is to automatically summarize Dutch legal cases, as the average case length far exceeds the 1024-token constraint.

One can also consider dividing the problem into smaller subproblems so that the full case text can be included instead of only the first 1024 tokens. For example, the case text might be split up into sections of at most 1024 tokens, after which a separate summary is generated for each section. Subsequently, the section summaries need to be combined into a single summary. The main challenge of this approach is to generate a summary for a specific section while only a reference summary for the complete text is available. If this proves infeasible during training, it might still be insightful to experiment with this approach to generate summaries for the test set.
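A rough sketch of this splitting idea is shown below; it assumes a generic Hugging Face tokenizer/model pair and simply concatenates the partial summaries, which is exactly the unresolved combination step mentioned above.

    def summarize_long_case(case_text, tokenizer, model, chunk_size=1024):
        """Summarize each chunk of at most chunk_size tokens and join the partial summaries."""
        token_ids = tokenizer(case_text, add_special_tokens=False)["input_ids"]
        partial_summaries = []
        for start in range(0, len(token_ids), chunk_size):
            chunk = tokenizer.decode(token_ids[start:start + chunk_size])
            inputs = tokenizer(chunk, truncation=True, max_length=chunk_size,
                               return_tensors="pt")
            output_ids = model.generate(inputs["input_ids"], num_beams=4, max_length=128)
            partial_summaries.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
        # Naive combination; producing one coherent summary from the parts is the open problem.
        return " ".join(partial_summaries)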
Another possibility is to first use an extractive summarization approach to sample a subset of the sentences of the case text and concatenate these sentences to obtain a compressed case text.

6.3.2 Gaussian Mixture or k-means?


In section 5.3, we chose k-means over Gaussian mixture because it distributed the cases more evenly over the clusters. However, it would be interesting to see how summarization models trained on the Gaussian mixture clusters compare to the models trained on the k-means clusters.

Also, we assumed that the mixture models might lead to clusters that are too small and therefore do not contain enough data for the summarization models to learn from. Future work could focus on exploring the lower bound on the amount of data required for legal summarization fine-tuning, to test whether this assumption was accurate.
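A minimal sketch of such a comparison with scikit-learn is shown below; the random matrix stands in for the real per-case feature matrix, and six clusters are used to mirror the setup in this thesis.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 8))            # placeholder for the real feature matrix

    kmeans_labels = KMeans(n_clusters=6, random_state=0, n_init=10).fit_predict(X)
    gmm_labels = GaussianMixture(n_components=6, random_state=0).fit_predict(X)

    print(np.bincount(kmeans_labels))         # k-means: typically fairly even cluster sizes
    print(np.bincount(gmm_labels))            # mixture model: possibly (too) small clusters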
7. Conclusion

In this thesis project, we implemented and evaluated summarization models with the aim of automatically summarizing Dutch legal cases. We constructed a dataset from publicly available cases and compared it to benchmark summarization datasets. We tested clustering as a prior step to automatic summarization and compared its results to a configuration that does not use clustering. Finally, we evaluated the generated summaries of both systems using the well-known ROUGE metrics and a more recent human evaluation protocol.
To conclude this thesis we will now answer the research questions that guided
the project.

RQ1: What are the key differences between available benchmark datasets
and the Rechtspraak dataset used in this project?
To answer this research question, we computed the set of features that was
introduced by Bommasani and Cardie (2020) in their paper on automatic sum-
marization datasets. This set of features was then compared with three common
benchmark datasets.
From this comparison, it followed that there were three clear differences
between the Rechtspraak dataset and the common benchmark datasets. First,
relative to the length of case texts, the length of the summaries was large. Second,
there was less redundancy within the case summaries. Third, the most distinctive
feature of the Rechtspraak dataset is the loose relationship between consecutive
sentences in the case summaries.
We did not have the resources to perform ablation studies to specifically
measure how these differences impacted the performance of the summarization
models. In the discussion (chapter 6) we identified this as possible future work.

RQ2: How can generated summaries of Dutch legal cases accurately be evaluated?
We studied previous work on automatic summarization to answer this question
and found that evaluation in the majority of these studies consisted of computing
ROUGE scores by comparing generated summaries with reference summaries. In
many cases this automatic evaluation was complemented with human evaluation.
There is an abundance of critique on ROUGE, because it allows only little flexibility in the generated summaries. However, due to the lack of better alternatives, it is still the dominant, if not the only, widespread automatic evaluation metric. Other promising metrics exist, such as BERTScore, but these metrics have not been widely adopted yet.


Lastly, there exist protocols to perform human evaluation of automatically generated summaries in a more systematic manner. One of these protocols, proposed by Grusky et al. (2018), was used in this work to shape the greater part of the evaluation of the summaries generated by the summarization models.

RQ3: What is the effect of training automatic summarization models on clustered data?
An analysis of clustering as a prior step to automatic summarization is a novel
contribution of this thesis. We hypothesized that prior-clustering leads to more
homogeneous clusters of cases, making it more feasible for a summarization model
to be trained on such a cluster in comparison with training it on the full dataset.
Contrary to our expectations, the cluster framework yielded ROUGE scores
that were 6.63% lower on average in comparison with the full model. The human
evaluation also showed that the cluster summaries were of a slightly worse quality
than the full model summaries. Therefore, we can conclude that in the current
setup, clustering is harmful to the quality of the generated summaries.
There are unexplored areas of interest, however. For example, Gaussian
mixture models have not been tested extensively. Also, the dataset contains many
unused features, such as the jurisdiction a case belongs to, that were not used in
this project. These are considerations for future work.

RQ4: What are the biggest challenges when automatic summarization techniques are applied to Dutch legal cases?
We trained and evaluated Dutch BART models on the Rechtspraak dataset. To
evaluate the summarization models, we first computed ROUGE scores and then
performed a human evaluation of 40 cases and summaries.
The human evaluation showed that generated summaries were almost as
fluent and coherent as the reference summaries. However, the informativeness
and relevance of the generated summaries were lacking in comparison with the
reference summaries. Therefore, the biggest challenges of automatic summarization
of Dutch legal cases are capturing the key points of the case text and keeping the
information in the summary consistent with the information in the case text.
All things considered, automatic summarization techniques show promising
results when applied to Dutch legal cases. Human-created summaries are still
preferred over automatically generated summaries, but if resources are scarce
and slight inaccuracies are acceptable, automatically generated summaries could
already be used to inform end-users about the contents of Dutch legal cases.
References

Abuobieda, A., Salim, N., Kumar, Y. J., & Osman, A. H. (2013). An improved evolutionary algorithm for extractive text summarization. In A. Selamat, N. T. Nguyen, & H. Haron (Eds.), Intelligent information and database systems (pp. 78–89). Springer Berlin Heidelberg.

Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). Text summarization techniques: A brief survey. International Journal of Advanced Computer Science and Applications, 8(10). https://doi.org/10.14569/IJACSA.2017.081052

Al-Sabahi, K., Zuping, Z., & Nadher, M. (2018). A hierarchical structured self-attentive model for extractive document summarization (HSSAS). IEEE Access, PP, 1–1. https://doi.org/10.1109/ACCESS.2018.2829199

Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document transformer. CoRR, abs/2004.05150. https://arxiv.org/abs/2004.05150

Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. The Journal of Machine Learning Research, 3, 1137–1155.

Bhattacharya, P., Hiware, K., Rajgaria, S., Pochhi, N., Ghosh, K., & Ghosh, S. (2019). A comparative study of summarization algorithms applied to legal case judgments. In L. Azzopardi, B. Stein, N. Fuhr, P. Mayr, C. Hauff, & D. Hiemstra (Eds.), Advances in information retrieval (pp. 413–428). Springer International Publishing.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.

Bommasani, R., & Cardie, C. (2020). Intrinsic evaluation of summarization datasets. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 8075–8096. https://doi.org/10.18653/v1/2020.emnlp-main.649

Chen, Y.-C., & Bansal, M. (2018). Fast abstractive summarization with reinforce-selected sentence rewriting. CoRR, abs/1805.11080. http://arxiv.org/abs/1805.11080

Delobelle, P., Winters, T., & Berendt, B. (2020). RobBERT: A Dutch RoBERTa-based language model. Findings of the Association for Computational Linguistics: EMNLP 2020, 3255–3265. https://doi.org/10.18653/v1/2020.findings-emnlp.292

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. https://doi.org/10.18653/v1/N19-1423

de Vries, W., van Cranenburgh, A., Bisazza, A., Caselli, T., van Noord, G., & Nissim, M. (2019). BERTje: A Dutch BERT model. CoRR, abs/1912.09582. http://arxiv.org/abs/1912.09582

Ferreira, R., de Souza Cabral, L., Freitas, F., Lins, R. D., de França Silva, G., Simske, S. J., & Favaro, L. (2014). A multi-document summarization system based on statistics and linguistic treatment. Expert Systems with Applications, 41(13), 5780–5787. https://doi.org/10.1016/j.eswa.2014.03.023

Gambhir, M., & Gupta, V. (2017). Recent automatic text summarization techniques: A survey. Artificial Intelligence Review, 47(1), 1–66. https://doi.org/10.1007/s10462-016-9475-9

Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). MIT Press, Cambridge.

Grusky, M., Naaman, M., & Artzi, Y. (2018). Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 708–719.

Hachey, B., & Grover, C. (2005). Automatic legal text summarisation: Experiments with summary structuring. Proceedings of the 10th International Conference on Artificial Intelligence and Law, 75–84. https://doi.org/10.1145/1165485.1165498

Hiemstra, D. (1998). A linguistically motivated probabilistic model of information retrieval. Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, 569–584.

Huang, D., Cui, L., Yang, S., Bao, G., Wang, K., Xie, J., & Zhang, Y. (2020). What have we achieved on text summarization? CoRR, abs/2010.04529. https://arxiv.org/abs/2010.04529

Jurafsky, D., & Martin, J. H. (2020). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition. Pearson.

Kornilova, A., & Eidelman, V. (2019). BillSum: A corpus for automatic summarization of US legislation. arXiv preprint arXiv:1910.00523.

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–7880. https://doi.org/10.18653/v1/2020.acl-main.703

Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text Summarization Branches Out, 74–81. https://aclanthology.org/W04-1013

Lin, J., Sun, X., Ma, S., & Su, Q. (2018). Global encoding for abstractive summarization. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 163–169.

Liu, C.-L., & Chen, K.-C. (2019). Extracting the gist of Chinese judgments of the supreme court. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law, 73–82. https://doi.org/10.1145/3322640.3326715

Liu, Y., & Lapata, M. (2019). Text summarization with pretrained encoders. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3730–3740. https://doi.org/10.18653/v1/D19-1387

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692. http://arxiv.org/abs/1907.11692

Luijtgaarden, N. (2019). Automatic summarization of legal text (Master's thesis).

Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing order into text. Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 404–411. https://aclanthology.org/W04-3252

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Y. Bengio & Y. LeCun (Eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings. http://arxiv.org/abs/1301.3781

Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3–26.

Nallapati, R., Xiang, B., & Zhou, B. (2016). Sequence-to-sequence RNNs for text summarization. CoRR, abs/1602.06023. http://arxiv.org/abs/1602.06023

Nallapati, R., Zhai, F., & Zhou, B. (2017). SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. Thirty-First AAAI Conference on Artificial Intelligence.

Naves, H. (2021). Baas rechtspraak: 'Misschien hebben we te weinig gesproken over de uitvoerbaarheid van bepaalde wetgeving'. https://www.nrc.nl/nieuws/2021/05/30/baas-rechtspraak-de-rechtsstaat-staat-weer-op-de-agenda-a4045430

Nenkova, A., & Passonneau, R. (2004). Evaluating content selection in summarization: The pyramid method. Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, 145–152. https://aclanthology.org/N04-1019

Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web (Technical Report No. 1999-66). Stanford InfoLab. http://ilpubs.stanford.edu:8090/422/

Paulus, R., Xiong, C., & Socher, R. (2017). A deep reinforced model for abstractive summarization. CoRR, abs/1705.04304. http://arxiv.org/abs/1705.04304

Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. EMNLP, 14, 1532–1543. https://doi.org/10.3115/v1/D14-1162

Polsley, S., Jhunjhunwala, P., & Huang, R. (2016). CaseSummarizer: A system for automated summarization of legal texts. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 258–262. https://aclanthology.org/C16-2054

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1–67. http://jmlr.org/papers/v21/20-074.html

Ramos, J. (2003). Using TF-IDF to determine word relevance in document queries. Proceedings of the First Instructional Conference on Machine Learning, 242(1), 29–48.

recht.nl. (2003). Uitspraken Rechtspraak.nl voorzien van inhoudsindicatie [Online; accessed 17-March-2021].

Rush, A. M., Chopra, S., & Weston, J. (2015). A neural attention model for abstractive sentence summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 379–389.

Saggion, H., & Poibeau, T. (2013). Automatic text summarization: Past, present and future. In T. Poibeau, H. Saggion, J. Piskorski, & R. Yangarber (Eds.), Multi-source, multilingual information extraction and summarization (pp. 3–21). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-28569-1_1

Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513–523.

Saravanan, M., Ravindran, B., & Raman, S. (2006). Improving legal document summarization using graphical models. Proceedings of the 2006 Conference on Legal Knowledge and Information Systems: JURIX 2006: The Nineteenth Annual Conference, 51–60.

See, A., Liu, P. J., & Manning, C. D. (2017). Get to the point: Summarization with pointer-generator networks. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1073–1083.

Sinaga, K. P., & Yang, M.-S. (2020). Unsupervised k-means clustering algorithm. IEEE Access, 8, 80716–80727. https://doi.org/10.1109/ACCESS.2020.2988796

Tan, J., Wan, X., & Xiao, J. (2017). Abstractive document summarization with a graph-based attentional neural model. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1171–1181. https://doi.org/10.18653/v1/P17-1108

Uyttendaele, C., Moens, M.-F., & Dumortier, J. (1998). Salomon: Automatic abstracting of legal cases for effective access to court decisions. Artificial Intelligence and Law, 6(1), 59–79.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 5998–6008.

Wu, S., & Dredze, M. (2019). Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 833–844. https://doi.org/10.18653/v1/D19-1077

Xiao, L., Wang, L., He, H., & Jin, Y. (2020). Copy or rewrite: Hybrid summarization with hierarchical reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 9306–9313. https://doi.org/10.1609/aaai.v34i05.6470

Xu, R., & Wunsch, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16, 645–678. https://doi.org/10.1109/TNN.2005.845141

Zaheer, M., Guruganesh, G., Dubey, A., Ainslie, J., Alberti, C., Ontañón, S., Pham, P., Ravula, A., Wang, Q., Yang, L., & Ahmed, A. (2020). Big Bird: Transformers for longer sequences. CoRR, abs/2007.14062. https://arxiv.org/abs/2007.14062

Zhang, J., Zhao, Y., Saleh, M., & Liu, P. (2020). PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. International Conference on Machine Learning, 11328–11339.

Zhang, S., Celikyilmaz, A., Gao, J., & Bansal, M. (2021). EmailSum: Abstractive email thread summarization. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 6895–6909.

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2019). BERTScore: Evaluating text generation with BERT. https://doi.org/10.48550/ARXIV.1904.09675

Zhong, M., Liu, P., Chen, Y., Wang, D., Qiu, X., & Huang, X.-J. (2020). Extractive summarization as text matching. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 6197–6208.
A. The Data Collection Process

In this appendix, we will discuss how the Rechtspraak dataset was collected and
parsed. First, we describe the external dataset, which is the dataset as it was
published by Raad voor de Rechtspraak. Second, we show how the external dataset
files were parsed and stored as a single dataset.

A.1 Collection of the External Dataset


In this research, we collected data from Open Data van de Rechtspraak and constructed a dataset from these data. The data are provided by the Raad voor de rechtspraak (https://www.rechtspraak.nl/Uitspraken/paginas/open-data.aspx) and published on their website under an open data policy. The data are indexed using the European Case Law Identifier (ECLI) standard (https://www.rechtspraak.nl/Uitspraken/Paginas/ECLI.aspx) and stored in the XML format. Each XML file consists of three parts: a description of the file in RDF format, a summary of the case, and the full case text.

The complete collection of XML files containing all cases for each of the years from 1911 to 2021 was downloaded. This collection contains a folder for each of the years, with the exception of 1912, which has no published cases. Each year folder contains a zip file for each month of the year. These zip files contain individual XML files; one file per legal case. The total number of files, and therefore cases, is 3,031,135. The total size of the collected zip file is 5.44GB. An example path of a legal case of September 1997 is:

    cases.zip/1997/199709.zip/ECLI_NL_CBB_1997_ZG0135.xml

Now we are left with a complete dataset of all published legal cases of the Raad voor de rechtspraak. In the remainder of this text, we will refer to this dataset as the external dataset. The next task was to extract the relevant information from the external set and store it efficiently. To this end, a raw dataset was constructed from the external dataset.

A.2 Constructing the Raw Dataset


The raw dataset was kept immutable. Constructing the dataset consisted of the following steps (a code sketch of this pipeline is given after the list):

• Iterate over each month’s archive for each year



• Parse each XML file within the month archive, yielding meta information of
the case, the full text of the case and the summary of the case

• Write the extracted parts to a pandas DataFrame for the year’s month
archive

• Store the DataFrame as a parquet file

• After parsing each archive, the individual parquet files (one for each month
for each year) were combined into four parquet files containing the complete
dataset
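A minimal code sketch of these steps is shown below; the file paths, XML tag names, and column names are assumptions, and the real parser extracts more metadata per case.

    import zipfile
    import xml.etree.ElementTree as ET
    import pandas as pd

    def element_text(root, tag):
        """Whitespace-normalized text of the first matching element, ignoring namespaces."""
        elem = root.find(f".//{{*}}{tag}")
        return " ".join("".join(elem.itertext()).split()) if elem is not None else ""

    def parse_month_archive(month_zip_path):
        rows = []
        with zipfile.ZipFile(month_zip_path) as archive:
            for name in archive.namelist():
                if name.endswith(".xml"):
                    root = ET.fromstring(archive.read(name))
                    rows.append({
                        "ecli": name[:-4].replace("_", ":"),
                        "summary": element_text(root, "inhoudsindicatie"),
                        "full_text": element_text(root, "uitspraak"),
                    })
        return pd.DataFrame(rows)

    # One parquet partition per month archive, written with fast snappy compression.
    parse_month_archive("cases/1997/199709.zip").to_parquet(
        "raw/199709.parquet", compression="snappy", engine="pyarrow")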

A.2.1 Storing the Dataset


To write and store the raw dataset, three file formats were considered: Apache Parquet, Feather, and CSV. For Parquet, two modes of compression were compared: snappy, which is the default for the pandas to_parquet method, and brotli. For both modes the pyarrow engine was used. To choose between the four resulting options, a small comparison experiment was run for the cases of the year 2020. The results of this experiment are shown in table A.1. Arguably, of these formats, CSV is the most common in practice. However, when comparing the results, we find that each of the three alternatives performs significantly better with respect to file size. In the case of Parquet combined with brotli compression, this is most visible: the file has only 19% of the size of the CSV file. Unfortunately, this comes at the cost of considerably slowing down the process of saving the dataset. In this regard, Parquet with snappy compression performs best.
The decrease in speed should not be underestimated: there are a total of 110 year archives in the external dataset, each containing 12 month archives. Therefore, 1320 partitions of the dataset have to be written to disk. On a side note, however, earlier years contain only a fraction of the number of cases of recent years, and the cases in these years often lack a case text and/or summary. Consequently, both the average file size and the average writing speed for month archives of 2020 are not representative of the other years.
We chose the best of both worlds when constructing and storing the dataset. First, to promote speed, each of the temporary partitions (i.e. each month of each year) was stored as a Parquet file using snappy compression. Then, after parsing all case files, the parquet files were collectively read into memory, combined, and stored as a single parquet file using brotli compression.
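A minimal sketch of this two-step storage strategy (paths are placeholders):

    import glob
    import pandas as pd

    # Step 1 wrote one snappy-compressed parquet file per month archive (fast to write);
    # step 2 combines all partitions into a single brotli-compressed file (small on disk).
    partitions = [pd.read_parquet(path) for path in sorted(glob.glob("raw/*.parquet"))]
    full_dataset = pd.concat(partitions, ignore_index=True)
    full_dataset.to_parquet("rechtspraak_raw.parquet", compression="brotli", engine="pyarrow")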

Table A.1: Comparison of different file formats for storing the OpenRechtspraak
dataset. The comparison was made using the legal cases from 2020. For each
month, the parsed files/cases were written to an individual file. Fth. denotes
Feather and Snpy denotes Snappy. Best averages are in bold.

Month Files Time to Write (seconds) File Size (MBs)


CSV Fth. Snpy Brotli CSV Fth. Snpy Brotli
1 11474 1.32 0.25 0.27 4.48 57.00 23.92 24.86 10.93
2 12179 1.49 0.30 0.29 5.43 65.45 26.86 28.26 12.24
3 11656 1.67 0.35 0.34 5.83 74.68 30.70 32.36 13.81
4 9313 1.46 0.28 0.28 5.32 64.79 26.76 28.17 12.24
5 7214 1.09 0.19 0.22 4.07 48.97 20.58 21.59 9.46
6 9686 1.46 0.29 0.29 5.28 65.29 27.09 28.52 12.30
7 12631 1.78 0.39 0.35 6.15 79.40 32.27 34.24 14.60
8 9425 1.18 0.20 0.22 3.99 52.71 21.19 22.45 9.60
9 11357 1.62 0.34 0.32 6.37 72.40 29.84 31.39 13.32
10 11457 1.62 0.32 0.30 5.54 71.25 29.17 30.76 13.24
11 11424 1.65 0.34 0.31 6.19 72.52 29.87 31.52 13.35
12 13381 1.81 0.39 0.36 6.98 81.40 33.24 35.15 14.92
Avg. 10933 1.51 0.30 0.30 5.47 67.16 27.62 29.11 12.50
B. Examples of Generated Summaries

In this appendix, the summaries of two cases are shown and discussed. This appendix is supplementary to results section 5.5.3.

Case ECLI:NL:CRVB:2015:1630 shows three summaries that each were rated 5/5 on every metric. These summaries are listed in table B.1 (the complete case text can be found at https://uitspraken.rechtspraak.nl/inziendocument?id=ECLI:NL:CRVB:2015:1630). Each summary starts off the same, but in the middle of the first sentence they begin to offer slightly different views on the case. For the summarization models, there clearly is some variation in how the summaries are written and in which parts of the case text are included. Interestingly, the reference summary almost seems to be a composite of the two model summaries. Despite being more concise, however, the model summaries did not fail to include the most important information from the case.

Table B.2 (the complete case text can be found at https://uitspraken.rechtspraak.nl/inziendocument?id=ECLI:NL:CRVB:2005:AU5952) shows a case that highlights some of the problems of the summarization models. Here, the full model summary scores worst because of a lack of informativeness. Furthermore, the summary consists of a single convoluted question. This behavior, where the summary is phrased as a question, was present in one other generated summary. The cluster model summary scores better, but its second sentence contains grammatical peculiarities.


Table B.1: Summaries belonging to case ECLI:NL:CRVB:2015:1630.
Reference summary inf.:5, rel.:5, flu.:5, coh.:5
De rechtbank heeft terecht geoordeeld dat artikel 11 van het Besluit Dagloon-
regels werknemersverzekeringen op appellante van toepassing was. Dat betekent
dat voor de vaststelling van het WAZO-dagloon niet een referteperiode van een
jaar gold, maar een referteperiode van een maand, in dit geval de maand april
2013. Het Uwv heeft het WAZO-dagloon dan ook juist berekend. De verwijzing
van appellante naar artikel 3:13, derde lid, van de WAZO gaat niet op, omdat
dat artikellid eerst met ingang van 1 juni 2013 in werking is getreden en niet
geldt voor WAZO-uitkeringen waarvan de ingangsdatum vóór 1 juni 2013 ligt.
Full model summary inf.:5, rel.:5, flu.:5, coh.:5
De rechtbank heeft terecht geoordeeld dat artikel 11 van het Besluit Dagloon-
regels werknemersverzekeringen (Besluit) voor de hoogte van de WAZO-uitkering
het s.v.-loon dat appellante in de maand april 2013 heeft genoten bepalend is,
omdat dat het laatste aangiftetijdvak is dat voorafging aan het aangiftenetijdvak
waarin de arbeidsongeschiktheid is ingetreden.
Cluster model summary inf.:5, rel.:5, flu.:5, coh.:5
De rechtbank heeft terecht geoordeeld dat artikel 11 van het Besluit op appellante
van toepassing was. Dat betekent dat voor de vaststelling van het WAZO-
dagloon niet een referteperiode van een jaar gold. De verwijzing van appellante
naar artikel 3:13, derde lid, van de WAZO gaat niet op, omdat dat artikellid
eerst met ingang van 1 juni 2013 in werking is getreden en niet geldt voor
WAZO.

Table B.2: Summaries belonging to case ECLI:NL:CRVB:2005:AU5952.
Reference summary inf.:3, rel.:5, flu.:5, coh.:5
Berekening van het WW-dagloon. Appellante ontvangt, naast een WW-
uitkering, een WAO-uitkering, welke uitkering met toepassing van artikel 21b
van de WAO gebaseerd is op een vervolgdagloon.
Full model summary inf.:1, rel.:3, flu.:5, coh.:2
Is bij de berekening van het WW-dagloon van betrokkene in overeenstemming
met ’s Raads uitspraak van 15 januari 1998 het loongerelateerde WAO-dag
loonregels IWS als uitgangspunt gehanteerd in de Aantekeningen van de “grijze
Kluwer” bij artikel 14 van de Dagloonregels IW?
Cluster model summary inf.:3, rel.:4, flu.:3, coh.:4
Herziening WW-dagloon. Betrokkene is in overeenstemming met ’s Raads uit-
spraak van 15 januari 1998 het loongerelateerde WAO-uitkering als uitgangspunt
gehanteerd. Geen sprake van een totaal uitkeringsniveau van 70% van het WAO.
