Master of Science
On Automatic Summarization of
Dutch Legal Cases
by
Daniël Prijs
July, 2022
Acknowledgments
Abstract
As is true for many other domains, the legal domain saw an increase in digitization
over the last decade. In the Netherlands, this is reflected in the usage of the
European Case Law Identifier encoding to freely and openly publish Dutch legal
cases. Currently, only 5% of all Dutch legal cases is published this way. The aim
is to bring this percentage up to 75% in the coming years.
There is a need for published cases to contain a summary highlighting the
contents of a case. Such summaries would make it much easier to search for
relevant cases. Approximately 460 thousand cases that are currently published
contain a case text and a case summary. Writing summaries for cases is a time-consuming and non-trivial task. Therefore, we studied the feasibility of using
automatic summarization to automate this process for Dutch legal cases.
As a first step, we collected and preprocessed all published legal cases into a
single dataset. This Rechtspraak dataset consists of 100,201 case-summary pairs suitable for automatic summarization. This dataset was then analyzed using a framework that was recently proposed for this purpose.
Subsequently, an experiment was designed to train and evaluate a BART
model on the dataset. This is a sequence-to-sequence model based on the transformer architecture. To this end, two systems were considered. In one case, the full
dataset was used to fine-tune the BART model. In the other case, the dataset was
first clustered into six subsets, after which a separate BART model was fine-tuned
for each cluster. This technique of prior clustering had not been explored before in the field of automatic summarization. The obtained models were evaluated in two
phases. First, the common ROUGE metrics were computed. Second, a recently
proposed protocol for human evaluation of automatically generated summaries
was followed to evaluate forty cases and accompanying summaries.
The results of this evaluation showed that the automatically generated summaries are of slightly worse quality than the reference summaries. For most metrics, however, the difference is small. Only with respect to the relevance of the generated summaries is there more room for improvement.
In comparison with the full dataset model, clustering has a moderately negative
effect on the quality of the generated summaries and therefore is not recommended.
On the whole, automatic summarization techniques show promising results
when applied to Dutch legal cases. We argue that they can readily be applied to
new case texts if human summarization of these case texts is not feasible for any
reason.
Contents
Acknowledgments ii
Abstract iii
Contents iv
1 Introduction 1
1.1 Automatic Summarization of Legal Cases . . . . . . . . . . . . . . 2
1.2 Clustering to Improve Summarization . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Deep Learning; Its Surface 4
3 Related Work 10
3.1 Text Summarization Using Deep Learning . . . . . . . . . . . . . 10
3.1.1 Evaluation of Generated Summaries . . . . . . . . . . . . . 11
3.1.2 Extractive Summarization . . . . . . . . . . . . . . . . . . 12
3.1.3 Abstractive Summarization . . . . . . . . . . . . . . . . . 13
3.2 Summarization of Legal Documents . . . . . . . . . . . . . . . . . 15
3.3 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.1 k-means clustering . . . . . . . . . . . . . . . . . . . . . . 17
3.3.2 Mixture Models . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.3 Clustering and Summarization . . . . . . . . . . . . . . . . 18
4 Methods 19
4.1 Characteristics of the Dataset (RQ1) . . . . . . . . . . . . . . . . 19
4.1.1 Collection of the Data . . . . . . . . . . . . . . . . . . . . 22
4.1.2 Preparation of the Data . . . . . . . . . . . . . . . . . . . 22
4.2 Method of Evaluation (RQ2) . . . . . . . . . . . . . . . . . . . . . 23
4.2.1 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Experimental Setup (RQ3 and RQ4) . . . . . . . . . . . . . . . . 24
4.3.1 Obtaining a Dutch Language Model . . . . . . . . . . . . . 26
4.3.2 Architecture of the Clustering Component . . . . . . . . . 27
4.3.3 Architecture of the Summarization Component . . . . . . 29
5 Results 31
5.1 Analysis of the Rechtspraak dataset . . . . . . . . . . . . . . . . . 31
5.2 Pretraining of BART . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.3 Finding a suitable Clustering Model . . . . . . . . . . . . . . . . . 33
5.4 Training the Summarization Models . . . . . . . . . . . . . . . . . 35
5.4.1 Dataset Splits . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.4.2 Fine-tuning Losses . . . . . . . . . . . . . . . . . . . . . . 37
5.5 Evaluation of the Summarization Models . . . . . . . . . . . . . . 39
5.5.1 Summary Generation . . . . . . . . . . . . . . . . . . . . . 39
5.5.2 Automatic Evaluation Using ROUGE . . . . . . . . . . . . 40
5.5.3 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . 42
6 Discussion 48
6.1 Automatically Generated Summaries . . . . . . . . . . . . . . . . 48
6.1.1 Quality of the Reference Summaries . . . . . . . . . . . . . 48
6.1.2 Improving Generated Summaries . . . . . . . . . . . . . . 49
6.1.3 Extractive or Abstractive Summaries? . . . . . . . . . . . 49
6.1.4 Incorporation of Domain-Dependent Features . . . . . . . 50
6.1.5 Generation Time of the Results . . . . . . . . . . . . . . . 50
6.1.6 Summary Generation Configuration . . . . . . . . . . . . . 50
6.2 Improving the Described Method . . . . . . . . . . . . . . . . . . 51
6.2.1 Longer Pretraining of Base Model . . . . . . . . . . . . . . 51
6.2.2 Human Evaluation of Generated Summaries . . . . . . . . 51
6.2.3 Association Dataset Metrics and Model Performance . . . 52
6.3 Architectural Considerations . . . . . . . . . . . . . . . . . . . . . 52
6.3.1 Limitations of Transformer Architecture . . . . . . . . . . 52
6.3.2 Gaussian Mixture or k-means? . . . . . . . . . . . . . . . . 53
7 Conclusion 54
References 56
One of the consequences of the Information Age is the explosion of the amount
of information that is being digitized. This digitization is accompanied by some
well-known challenges: storing information costs resources; digitizing, if not done
automatically, requires human effort; privacy has to be guaranteed; and access has to be safe and reliable.
A less apparent problem is that, in the case of text documents, the collection becomes too large for a single person to consider exhaustively when searching for information. Search engines are an example of a field that is already tackling this problem: because people cannot read every website that matches their keywords, they are presented with short snippets that concisely summarize each web page.
In this thesis, we will consider the Dutch legal domain, where the increasing
digitization of legal cases demands solutions that allow users to easily navigate
the published documents.
Publication of the Dutch legal cases is done by Raad voor de Rechtspraak at
rechtspraak.nl. Until July 2003, if users of rechtspraak.nl wanted to know whether
a case was relevant for them, they had to open the document and had to either
scan or read the full case before they could assess whether the case indeed was
relevant. Because users felt this was a shortcoming of rechtspraak.nl, summaries
were created for new Dutch legal cases (recht.nl, 2003). Since then, some of the
new cases have been given summaries that are shown as snippets to quickly inform
the user about the main contents of the case. This need of users to be able to quickly assess a case highlights the relevance of having Dutch legal cases that are enriched with supplementary summaries.
However, the current summarization process has a number of limitations. The
most obvious limitation is the need for human labour in constructing a summary
for each new case. This might explain why only a small portion of the published Dutch legal cases is supplemented by a summary. This is especially true for cases from before July 2003, as only a small portion of these cases was retroactively summarized. Furthermore, even for cases that are accompanied by a summary,
this summary often only consists of a few keywords or a short sentence.
Recently, Raad voor de Rechtspraak announced that they aim to publish even
more Dutch legal cases (Naves, 2021). Currently, only 5% of all Dutch cases are
published online; their aim is to increase this to 75%. For the previous decade alone, this would mean that an additional 2.2 million cases would be added. This further stresses the need for sound summaries so that
relatively little time is lost searching for relevant cases.
• “It is fundamental to decide and specify the most important parts of the
original text to preserve.”
1.3 Contributions
We aimed to find a system that can be used to automatically generate summaries
of Dutch legal cases. This system should promote the ease of searching through
the large body of published Dutch legal cases.
The project is structured so as to answer four research questions. First, a
concise analysis of the Rechtspraak dataset is required to inform further decisions
relating to system components and modeling approaches. Therefore, we start
with answering the following question: what are the key differences between
available benchmark datasets and the Rechtspraak dataset used in this
project? (RQ1)
Second, time will be dedicated to choosing a proper evaluation method. As
was stated before, evaluation of automatically generated summaries is not straightforward. For this reason, we study both quantitative and qualitative methods of
evaluation and answer the question: how can generated summaries of Dutch
legal cases be evaluated accurately? (RQ2)
We will experiment with multiple models to find the impact of clustering the data before summarization is done. We hypothesize that clustering leads to improved summaries, which will be measured using the evaluation methods found in RQ2. Our third research question therefore is: what is the effect of training
automatic summarization models on clustered data? (RQ3)
To finalize this thesis, we will uncover the strengths and weaknesses of our automatic summarization system and will answer the question: what are the biggest
challenges when automatic summarization techniques are applied to
Dutch legal cases? (RQ4)
Besides these theoretical contributions, we also provide instructions1 on how
to generate the Rechtspraak dataset, which consists of 100K Dutch legal cases
and summaries. The legal cases from this dataset are already available online, but
only as individual XML files. We provide instructions on how to parse these files
and collect them into a single summarization dataset. The obtained dataset is large and freely available from the source with few restrictions, making it suitable for use as a benchmark dataset. To the best of our knowledge, no
summarization benchmark dataset exists for the Dutch language. Instructions to
generate the complete Rechtspraak dataset, including cases that are not viable
for summarization (e.g. due to missing components), are also provided. This
dataset consists of 3 million cases.
1 See https://github.com/prijsdf/dutch-legal-summarization
2. Deep Learning; Its Surface
Neural networks, especially deep ones, are more difficult to understand than
non-neural models. Often, neural networks are described as black boxes because
of the seemingly mysterious way in which they learn patterns. This inability to
explain the model’s reasoning is one of the main criticisms of neural networks and
sometimes prevents them from being applied to real-world problems. Furthermore, neural networks can become computationally expensive to train. This is because
of the large number of trainable parameters that are associated with a neural
network. BERT, a model that is mentioned in section 2.3.1, for example, has
around 110 million parameters that all need to be trained. In turn, this requires a large amount of training data from which the model can effectively learn patterns.
A meaningful classification of AI techniques can be made with respect to the ability of a technique to learn from data. This ability to learn is one of the essential aspects of any machine learning approach; it leads to a system that learns patterns from the data in a way that generalizes to finding patterns in
new, unseen data. We can divide approaches into three groups (Goodfellow et al.,
2016): rule-based, traditional (or classical) and neural (or representational).
In figure 2.1 the classes are shown. Rule-based systems have no learning
component; they rely on manually crafted rules that are matched with the data.
Traditional approaches, on the other hand, contain components that learn to map
features of the data to certain outputs. Currently, machine learning approaches
are mainly neural: not only are output mappings learned from features, but the features themselves are also learned.
2.2 Training
Integral to many ML models is a training phase. In their book, Goodfellow et al.
(2016) use the term representation learning to denote this phase for neural
models and DL models. This name makes sense: during the training phase the
model builds a representation of the data it is being trained on.
For many models the training phase is supervised, meaning that the model is given some knowledge about each data item in order to train conditionally on this knowledge. In the following discussion this prior knowledge will be referred to as the true label of a data item. In our case this means that a reference summary is provided together with its full case text. The model then tries to derive the true label from the data item and thereby generates a candidate summary as the predicted label for this data item. After generating this prediction, the model uses the true label to update its representation, depending on how closely the predicted label approximated the true label. This updating process is known as backpropagation.
To compute the proximity of the prediction to the real label, a loss function is
needed. Which loss function should be chosen is dependent on the architecture
of the model and the task at hand. See section 7.4 of the book by Jurafsky and
Martin (2020) for more details on both backpropagation and loss functions.
The other two paradigms of training are unsupervised and reinforcement.
In unsupervised learning the model is only supplied with the data items; there are
no true labels associated with these items. Therefore, unsupervised learning often
is restricted to finding discriminating features within the dataset. A common
example of such a system is a clustering model that partitions the data into a number of clusters, each containing items that are similar to each other while differing from items in the other clusters. In the method we propose, clustering
also is included as the first component. Therefore, we will discuss clustering more
in-depth in section 3.3.
Reinforcement Learning (RL) also differs from the supervised item-label ap-
proach. Here, the problem is framed by picturing an agent that is interacting
with its environment. Each action this agent performs changes the environment.
The environment, in turn, signals this change back to the agent, which ’learns’ a little from its performed action. Ideally, the agent will increasingly learn how to behave effectively in its environment.
therefore faster, approach to derive the word vectors. GloVe uses probe words to
compare words of interest by looking at how frequently each of the words is found
together with the probe word. The ratio between the occurrence of word one
given the probe word and the occurrence of word two given the probe word tells
something about how each of the words relates to the probe word. Because there
is no supervised learning required to derive these ratios - they can be read from
the input documents - learning a GloVe model is faster than learning a Word2vec
model.
More recently, BERT was introduced by Devlin et al. (2019). BERT signif-
icantly improves upon the previously mentioned models by using a multi-layer
bidirectional Transformer (Vaswani et al., 2017) to derive word embeddings. Here,
word embeddings are no longer statically dependent on the word itself, but instead also encode information about the word’s context. To train BERT on incorporating
word context, masked-language modelling was used as a new training objective.
Here, BERT is shown a sequence of tokens of which a percentage (15% in the
original paper) is hidden or masked. Now, BERT is tasked with reconstructing
the original sequence by filling in these masked words.
It is also important to note that BERT uses WordPiece to derive tokens from the corpus. This is a tokenization approach where not words, but subwords are the main unit of information. Using subwords has two main advantages. First,
uncommon words that otherwise would not fit in the vocabulary can be broken
down into pieces that may be processed by the model. Second, the model will be
exposed to the root of words and therefore might encode similar words in similar
ways. For example, the model might interpret the words ’prison’ and ’imprisoned’
as similar, instead of seeing them as completely different words.
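To make this concrete, the snippet below shows a minimal, illustrative subword tokenization with the Hugging Face transformers library. The English BERT checkpoint and the example words are assumptions chosen for illustration only; the exact subword splits depend on the vocabulary of the tokenizer that is used.

```python
# Illustrative subword tokenization with a BERT-style WordPiece tokenizer.
# The checkpoint name is an assumption; the thesis itself concerns Dutch models.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# A common word typically stays intact, while a rarer or derived word may be
# broken into subword pieces (the exact split depends on the vocabulary).
print(tokenizer.tokenize("prison"))
print(tokenizer.tokenize("imprisoned"))
```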
Many works proposed adaptations of BERT. Notably, Y. Liu et al. (2019) found
that BERT under-fit and could be trained more extensively. With RoBERTa,
a superior model is proposed that relies on a more robust pretraining approach.
Both BERT and RoBERTa were pretrained using English data. A well-performing, multilingual version of BERT also exists (Wu & Dredze, 2019), but it is
outperformed by monolingual models. For Dutch, BERTje (de Vries et al., 2019)
was proposed as the counterpart of BERT, whereas RobBERT (Delobelle et al.,
2020) was proposed for RoBERTa.
More recently, we saw the advent of sequence-to-sequence models that use
the ideas from BERT. Here, the framework is specifically tasked with outputting
sequences, rather than single probabilities, such that a model is obtained that can
be applied to tasks that require the generation of sequences such as automatic
summarization. An example of these models is BART (Lewis et al., 2020). BART uses a more extensive set of pretraining objectives than BERT. It uses the following objectives (a small sketch of two of them follows the list):
Token masking Similar to BERT, a percentage of the tokens in the text are masked at random and the model has to reconstruct the original text.
Sentence permutation The text is split-up in sentences (based on full stops)
and then these sentences are shuffled. The model has to reconstruct the text
again.
Document rotation A new start token is picked at random and the document
is rotated such that it starts with this new token. Again, the model has to
reconstruct the original text.
Token deletion Tokens are deleted from the text. Here the model has to pick
the positions of the deleted tokens. Notice that this differs from simply
masking the tokens.
Text infilling Similar to masking, but here random spans of texts are replaced
by a single mask token. The spans mostly have a length of 0 to 9 tokens.
Spans of zero length can also be replaced, which is equal to simply inserting
a mask token into the text. Notice that a span is always replaced by one
mask token, regardless of the length of the span.
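As an illustration, the sketch below implements simplified versions of two of these objectives, token masking and sentence permutation, in plain Python. It is a rough approximation under stated assumptions: the real BART implementation operates on subword tokens and, for text infilling, samples span lengths from a Poisson distribution.

```python
# A minimal, illustrative sketch of two of BART's noising objectives.
import random

def mask_tokens(tokens, mask_ratio=0.15, mask_token="<mask>"):
    """Randomly replace a fraction of the tokens with a mask token."""
    noisy = list(tokens)
    n_mask = max(1, int(len(noisy) * mask_ratio))
    for i in random.sample(range(len(noisy)), n_mask):
        noisy[i] = mask_token
    return noisy

def permute_sentences(text):
    """Split the text on full stops and shuffle the resulting sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

print(mask_tokens("de rechtbank verklaart het beroep gegrond".split()))
print(permute_sentences("De rechtbank is bevoegd. Het beroep is gegrond. De uitspraak wordt vernietigd."))
```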
3. Related Work
consider the rest of the summary as completely incorrect. In the case of ROUGE-2,
even the complete summary is evaluated as incorrect.
There are more recent metrics that try to mitigate this problem. BERTScore,
for example, uses BERT-generated contextual embeddings of words, instead of
the words themselves, when comparing the generated summary with the reference
summary (T. Zhang et al., 2019). This means that more information about
the words is taken into consideration when two words are compared. In this
case, our example sentence, despite its synonyms, might get a positive evaluation.
Unfortunately, the authors did not include the task of automatic summarization
in their experiment. For the tasks of machine translation and image captioning,
however, BERTScore proved to correlate better with human judgments than the
standard metrics for these tasks, such as BLEU.
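The standard ROUGE metrics mentioned above can be computed in only a few lines of code. The sketch below uses the rouge-score package; this is an assumption made for illustration, not necessarily the implementation used elsewhere in this thesis, and the example sentences are made up.

```python
# Hedged example: ROUGE F-scores of a candidate against a reference summary,
# assuming the `rouge-score` package is installed (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "the court declares the appeal well founded"
candidate = "the court finds the appeal well founded"

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, candidate).items():
    print(name, round(score.fmeasure, 3))
```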
Human evaluation metrics are less standardized than automatic evaluation
metrics. Many papers do report some kind of human evaluation results besides
reporting ROUGE scores. However, these evaluations take many different shapes.
In some cases domain experts are tasked with the evaluation; in other cases less costly approaches are taken, such as the use of Amazon Mechanical Turk (e.g. J. Zhang et al., 2020). There are proposals for standardizing this process. In this
thesis project we will consider such an approach, namely the protocol proposed
by Grusky et al. (2018). Specifically, this protocol formulates four questions, each
measuring a different dimension of the generated summary. The authors included
two semantic dimensions and two syntactic dimensions. Three of these dimensions
were initially used by Tan et al. (2017) and one was introduced by Paulus et al.
(2017). The protocol by Grusky et al. (2018) combines these four dimensions
and formulates a question for each of them. In section 4.2.1, we will discuss this
protocol in detail.
Despite all these metrics, there is no perfect metric for evaluating automatically
generated summaries: how summaries best can be evaluated remains an open
question (Saggion & Poibeau, 2013). This is the direct result of the ambiguity of
the task in general. There is no clear and consistent way of saying that a summary
A is better than a summary B. For example, it could be that summary A does a
better job at covering the main points of the source text, whereas summary B
contains fewer factual errors in the facts that it covers from the source text.
vertices.
The strength of TextRank is not only that it is a simple and intuitive algorithm, but also that applying it requires nothing more than the text itself, making it an unsupervised approach. The results of TextRank were competitive with the state-of-the-art (Mihalcea & Tarau, 2004), which mainly consisted of supervised approaches. These practical benefits and the relatively strong performance mean that TextRank is still used as a baseline for comparison with new approaches (e.g. S. Zhang et al., 2021).
Another approach commonly used as a baseline (e.g. Zhong et al., 2020) is Lead-
3. It was introduced as such by Nallapati et al. (2017). Lead-3 might be the simplest
common approach in summarization literature: to derive a summary, it picks the
three leading sentences from the source text. The reason for its introduction, and
the popularity that followed, is its effectiveness on some benchmark datasets. The
dataset that is most often reported on, CNN/Daily Mail, consists of news articles.
It seems that for this type of text relevant information is relatively often found at the beginning of the text, which explains the good performance of this
baseline.
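Because Lead-3 is so simple, a short sketch suffices to define it completely. The sentence splitting below is a naive assumption; in practice a proper sentence tokenizer would be used.

```python
# A minimal Lead-3 baseline: the summary is the first three sentences of the text.
def lead_3(text: str, n: int = 3) -> str:
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:n]) + "." if sentences else ""

print(lead_3("First sentence. Second one. Third one. Fourth one."))
```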
With the increase in popularity of deep learning, researchers also started to
apply neural models to help with extractive summarization tasks. To this end,
texts are first transformed into sequences of embeddings (see section 2.3.1) that
can be used as input to the neural models. This process can be applied to the
text at different levels; e.g. at the word-level or sentence-level.
Nallapati et al. (2017) used two bi-directional RNNs to transform texts into
embeddings. The first RNN generated word-embeddings. The second RNN
used these word-embeddings to generate sentence embeddings. The sentence
embeddings are then fed to a binary classifier that classifies each sentence as either
belonging or not belonging to the summary.
Since 2019, the use of transformer-like language models gained momentum.
BERT was used in BERTSUM (Y. Liu & Lapata, 2019) to obtain sentence
representations. In the extractive model multiple transformers are applied to these
sentence representations, capturing latent document-level features that
are used to extract relevant sentences.
Zhong et al. (2020) propose MatchSum, which can be seen as an extension
of BERTSUM. The model uses BERTSUM to score sentences on saliency. Next,
candidate summaries are generated using all combinations of the most salient
sentences. Then, two BERT models are used to obtain embeddings for each
candidate summary and the source text, which are used to compute a similarity
score between the two. Finally, the candidate summary that is most similar to
the source text is chosen as the final summary.
extractive systems), results were still limited. The main flaw is that generated
sentences have an incorrect word order and therefore contain syntactical mistakes.
The authors also transformed the Gigaword dataset to create one of the earlier
large-sized summarization benchmarks.
Quite soon after, CNN/Daily Mail (Nallapati et al., 2016) was proposed as a
new benchmark dataset. The dataset contains bullet-point, and therefore multi-sentence, summaries of news articles from CNN and Daily Mail, and it has prevailed as one of the main benchmarks for the task of summarization.
Besides introducing this benchmark, the authors also proposed a few novel
model components aimed at solving limitations of earlier models. First, a hierarchical attention component was described that is active at both the word level and the sentence level. The goal of this component is to have the model attend to a word not only based on the perceived importance of the word, but also based on the importance of the sentence the word is found in. Second, more emphasis
was put on identifying key-words in the text. This was achieved by supplying
the input words with TF-IDF scores and part-of-speech tags. Third, pointer
functionality was presented to allow the model to include words in the summary
directly from the source document. This is beneficial in case a word is important
in the source text, but lacking from the model’s vocabulary. As the vocabulary is
the set of words that the model can recognize and produce, the model lacks the
ability to generate out-of-vocabulary words unless a procedure such as pointing is
included.
The main flaw indicated by the authors was the repetition of phrases in the
generated summary. Intra-attention (or self-attention) is recommended as a means
to account for this repetition.
The intra-attention component was studied in the following year by Paulus
et al. (2017). They combined the encoder-decoder RNN with this component.
Furthermore, to minimize ’exposure bias’ of the model, training was partially
shaped as a Reinforcement Learning problem, instead of the usual Supervised
Learning problem. Three intra-attention models are compared: one solely using RL, one solely using supervised learning, and one hybrid. The authors show that
the RL model quantitatively (measured by ROUGE-1) performs best. However,
qualitatively (measured by readability) the model performs worst: the hybrid
model performs best in this regard. The authors conclude that both models performed better than the state-of-the-art and that, in some cases, the quantitative measure alone can be deceptive in measuring model performance.
Another work that successfully applied Reinforcement Learning to text sum-
marization was Chen and Bansal (2018). Here, reinforcement was used to extract
suitable sentences from source documents, which then were rewritten. This hybrid
approach was extended by Xiao et al. (2020), who made rewriting optional.
In Al-Sabahi et al. (2018), state-of-the-art improvements are reported using
intra-attention with extra input at each time step. This input consists of a
weighted average of each of the previous states of the model. The authors chose
to include this input to enable the model to more easily attend to earlier states.
Another approach to countering repetition within generated summaries is the
usage of coverage models (See et al., 2017). Here an extra learnable parameter
2006). In total, seven rhetorical roles were used to annotate sentences. Examples
are ’identifying the case’, ’arguments (analysis)’, and ’final decision (disposal)’.
Both of these approaches used rhetorical roles to perform extractive
summarization of judgments. One downside to these approaches is the need for
human annotation of cases. Furthermore, as was implied by the authors of the
Kerala judgments paper, different law datasets might require different sets of
rhetorical roles. In our experiment, we will work with a framework that only
considers the source texts and true summaries and not any secondary information.
There are two main reasons for this. First, by not making our framework depend on dataset-specific information, the framework will be easier to generalize to other domains and datasets. Second, as secondary information often needs to be supplied at the sentence level (see, e.g., the two approaches above) instead of the document level, time-costly manual labelling is needed for each of the dataset’s cases.
Not only does this require the necessary expertise; it also means that either only a
very small dataset can be used, or that an unreasonable amount of time is required
to label each of the cases. Furthermore, as deep learning models were studied in
our experiments, a small dataset would have been an immediate drawback of the
proposed method.
Polsley et al. (2016) propose CaseSummarizer, a tool combined with a web interface to automatically summarize legal judgments. Using tf-idf and domain-dependent features, such as the number of entities in the text, sentences are ranked. Then, the most important sentences are combined into a summary that is
customizable by the user. The system was evaluated using 3890 legal cases from
the Federal Court of Australia. Human evaluation of the system showed that
it outperforms other summarization systems. These other systems were mostly
dataset-agnostic (thus non-legal), which might explain the difference. Summaries
that were made by experts still outperformed CaseSummarizer (and other systems)
by a large margin. Here, the experts also used extractive summarization to create a summary. Thus, the authors highlighted, sentence extraction could be a viable method of legal case summarization, albeit one for which current systems are still lacking.
C.-L. Liu and Chen (2019) studied a dataset highly similar to ours. The authors studied judgments of the Supreme Court of Taiwan. These judgments were sometimes published with a summary, comparable to the Dutch published
cases. The authors chose to treat the problem as an extractive summarization
problem due to many summaries containing statements that were directly selected
from the judgement text.
Kornilova and Eidelman (2019) introduced the BillSum dataset. It consists
of 22,218 US congressional bills. The text is semi-structured. Of the available
common benchmarks BillSum might best resemble the Rechtspraak dataset that
is used in this thesis project.
Finally, Luijtgaarden (2019) applied the reinforcement learning approach by
Chen and Bansal (2018), which was discussed in the previous section, to the
Rechtspraak dataset. The author found that the model cut off sentences too early, leading to grammatical errors in the summaries. Five generated summaries
were evaluated on relevance and readability by two law students. In general, the
students preferred the reference summaries to the generated summaries. However,
in cases where a reference summary consisted only of keywords, the students preferred the generated summary.
3.3 Clustering
In our study of Dutch legal cases, we tested our hypothesis that it is beneficial to
cluster the data before abstractive summarization is applied. Clustering is simply the partitioning of data into groups. The general aim, and ours as well, is to obtain clusters that contain similar data items, while items from different clusters should be dissimilar.
4. Methods
In this thesis project our main aim was to explore the feasibility of automatic summarization of Dutch legal cases. In this chapter, the tools and techniques that
were used are introduced. As we will see, the method combines established results
from the literature to tackle the summarization problem.
In short, two frameworks were compared. The first framework is the standard
framework, in which the full dataset was used to fine-tune a single summarization
model. In the second framework the dataset was first clustered into six clusters
before a separate summarization model was fine-tuned for each of these clusters.
We hypothesized that this two-phase approach leads to an improved quality of
the generated summaries.
In section 4.1, this chapter starts off with the introduction of a set of metrics
that we used to explore our dataset and compare it to benchmark summarization
datasets. This section will also describe how the data was collected and prepared.
Then, in section 4.2, our two-sided evaluation approach is presented. Finally,
in section 4.3, the experimental setup is discussed. In this section we will go
over the technical aspects of both the clustering framework and the standard
summarization framework.
Table 4.1: The set of features proposed in (Bommasani & Cardie, 2020) to
compare summarization datasets. We will use these features to describe our
dataset. Furthermore, an adapted subset of these features was used in the
clustering framework that will be discussed in section 4.3.2.
Metric Description
Dw text length in words
Ds text length in sentences
Sw summary length in words
Ss summary length in sentences
CMP w word compression
CMP s sentence compression
TS topic similarity
ABS abstractivity
RED redundancy
SC semantic coherence
The first two complex features are compression scores. Word compression is
the inverse ratio between the length in number of words of a summary and the
length in number of words of its case description. The dataset word compression
score is obtained by averaging over all cases’ scores:
\[
\mathrm{CMP}_w = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{S_w^i}{D_w^i}
\]
where N is the number of cases. Sentence compression measures the same ratio,
but now length is measured as the number of sentences. Again, to obtain the
dataset sentence compression score, the individual scores are averaged:
\[
\mathrm{CMP}_s = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{S_s^i}{D_s^i}
\]
For both measures, higher scores indicate that, on average, a case text needs more
compression to obtain its summary.
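A small sketch of how these dataset-level compression scores can be computed is given below; the per-pair length counts are assumed to be available already, and the numbers are illustrative only.

```python
# Dataset-level compression scores as defined above; `pairs` holds
# (case length, summary length) tuples in the same unit (words or sentences).
def compression(pairs):
    return 1 - sum(s / d for d, s in pairs) / len(pairs)

word_pairs = [(666, 48), (1200, 60), (300, 25)]   # illustrative values
sentence_pairs = [(34, 3), (60, 4), (20, 2)]
print(round(compression(word_pairs), 3))          # CMP_w
print(round(compression(sentence_pairs), 3))      # CMP_s
```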
Next, we have the topic similarity of the case text and the summary text.
This metric uses the concept of Latent Dirichlet allocation (LDA) (Blei et al.,
2003) to generate topic representations of each text. The topic model required for
this step is generated from all case texts in the dataset. To compute the topic
similarity, first, a topic distribution is generated for both the case text and the
case summary. Then, the Jensen-Shannon distance between these two distributions is
computed. Finally, we obtain the dataset topic similarity score by first subtracting
each individual score from 1 and then taking the average of the obtained scores:
\[
\mathrm{TS} = 1 - \frac{1}{N} \sum_{i=1}^{N} \mathrm{JS}\left(\theta_{D_i \mid M}, \theta_{S_i \mid M}\right)
\]
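A hedged sketch of this computation is shown below, assuming a gensim LDA model and dictionary trained on the case texts; the Jensen-Shannon distance comes from SciPy. This illustrates the formula above rather than reproducing the exact implementation used in this project.

```python
# Illustrative topic similarity computation for a list of (case, summary) pairs.
import numpy as np
from scipy.spatial.distance import jensenshannon

def topic_distribution(lda_model, dictionary, text, num_topics):
    """Dense topic distribution of a whitespace-tokenized text."""
    bow = dictionary.doc2bow(text.lower().split())
    dist = np.zeros(num_topics)
    for topic_id, prob in lda_model.get_document_topics(bow, minimum_probability=0.0):
        dist[topic_id] = prob
    return dist

def topic_similarity(lda_model, dictionary, cases, summaries, num_topics):
    """TS = 1 - mean Jensen-Shannon distance between case and summary topics."""
    distances = [
        jensenshannon(
            topic_distribution(lda_model, dictionary, case, num_topics),
            topic_distribution(lda_model, dictionary, summary, num_topics),
        )
        for case, summary in zip(cases, summaries)
    ]
    return 1 - float(np.mean(distances))
```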
in a stratified manner, where the distribution of the classes in each of the splits
was constrained to be equal to the distribution of the classes in the full dataset.
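A minimal sketch of such a stratified split, assuming scikit-learn, is given below; the case identifiers and cluster labels are placeholders.

```python
# Stratified train/validation/test split that preserves the cluster distribution.
from sklearn.model_selection import train_test_split

cases = list(range(1000))                # placeholder case identifiers
cluster_labels = [i % 6 for i in cases]  # placeholder cluster classes

train, rest, y_train, y_rest = train_test_split(
    cases, cluster_labels, test_size=0.2, stratify=cluster_labels, random_state=42
)
validation, test, _, _ = train_test_split(
    rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)
```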
Dimension Question
Informativeness How well does the summary capture the key points of the article?
Relevance Are the details provided by the summary consistent with details in the article?
Fluency Are the individual sentences of the summary well-written and grammatical?
Coherence Do phrases and sentences of the summary fit together and make sense collectively?
In their case study, the authors had 60 news articles evaluated. Their experiment compared seven systems. Therefore, each article was bundled with seven candidate summaries and the true summary. Each of these bundles was subsequently evaluated by three unique evaluators, meaning that for each of the news articles all associated summaries were evaluated three times.
For our experiment, it was infeasible to find enough evaluators to evaluate a
meaningful number of cases. This is because our dataset mainly contains long texts (as opposed to short news articles). Also, because of the domain-specific
nature of the texts, which demand sufficient attention and effort to be read, it
proved to be challenging to find suitable evaluators. For these reasons, we chose
to deviate from Grusky et al. (2018) in this respect and had the summaries only
evaluated by the author of this thesis.
Evaluation Setup
In our setup, a random sample of forty cases was taken from the test set. Each of
these cases was presented to the evaluator together with its true summary, the
summary generated by the full summarization model and the summary generated
by the cluster summarization model; the characteristics of these two models will
be explained in detail in section 4.3. Thus, there were three summaries associated
with each case. To minimize bias, at evaluation time the label of each summary
was hidden from the evaluator and the order of the summaries was randomized.
Evaluation was done with the help of a web application that served a case
text together with its true summary and both generated summaries. This setup
is shown in figure 4.1. The summaries were shown in a random order, to minimize the chance that the evaluator could tell which summary belongs to which system.
When evaluating a case, first the summaries were read, then fluency and coherence were scored. Only after that, the case text was read, after which the summaries were reread and informativeness and relevance were scored. This approach was taken to score the non-content metrics of the summary with a minimum amount of bias. For example, if the case had been read first, mistakes such as factual inconsistencies of the summary might have biased the judgment of the summary’s overall style, even though factuality has nothing to do with the overall style of the summary. This way of approaching the evaluation
process also is more in line with how the envisaged summarization system would
be used by the end user. That is, the end user would first read the summary and
only then, if deemed relevant, read the case text.
Figure 4.1: Human evaluation setup. The case text was presented together
with the true summary and a generated summary for each of the summarization
systems. The summaries were presented in a random order. At the top of the
page the ECLI of the case is printed; this is the case identifier.
subset of the dataset. These subsets were derived using k-means clustering.
Effectively, we considered these cluster-specific summarization models as a single
model in order to compare it with the full model. This single cluster model
is simply referred to as the cluster model or the cluster framework. To
obtain the results of the cluster framework, the weighted average was taken of the
individual cluster model’s results.
The cluster framework consists of two main components: a clustering component and a summarization component. In figure 4.2 an overview is given of the
interaction between these components. In the following sections, both components
will be discussed in more detail. In the case of the full model framework, the
dataset is directly used as input to the summarization component. So, for the full
model only a single summarization model is fine-tuned.
We have two separate components that need to be accounted for. First, we
have the clustering component, which is only used in the clustering framework.
Here, the goal is to find a clustering of the data such that cases are most similar
within clusters, and least similar between clusters. Details of this component
are given in section 4.3.2. Second, there is the summarization component. This
component is required in both frameworks. This component consists of either
one or multiple summarization models, depending on the framework. We will use
a custom pretrained BART model as a starting point for these summarization
models. Details on how we pretrained this model are given in section 4.3.1. The pretrained model is fine-tuned on the Rechtspraak dataset, either on the full
dataset or on one of the cluster subsets. Specifics of this component and the
fine-tuning process are given in section 4.3.3. Details on how the components
Figure 4.2: The cluster framework that was compared in this project. First,
a set of features is derived for each case in the dataset. Then, these features
are used to cluster all cases into n clusters. Finally, for each of these clusters a
separate pre-trained BART model is fine-tuned only using the cases belonging to
that cluster.
were implemented and what hardware was used are concisely presented in section 4.3.4.
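To give an impression of the summarization component, the sketch below fine-tunes a pretrained BART checkpoint with the Hugging Face transformers library. The checkpoint path, dataset variables, and training arguments are placeholders or assumptions; the exact configuration used in this project is described in section 4.3.3, and the available argument names depend on the installed transformers version.

```python
# Hedged fine-tuning sketch; paths and dataset objects are placeholders.
from transformers import (
    BartForConditionalGeneration,
    BartTokenizerFast,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = BartForConditionalGeneration.from_pretrained("path/to/pretrained-dutch-bart")
tokenizer = BartTokenizerFast.from_pretrained("path/to/pretrained-dutch-bart")

train_split = ...       # placeholder: tokenized (case, summary) pairs of one cluster or the full dataset
validation_split = ...  # placeholder: the corresponding validation split

args = Seq2SeqTrainingArguments(
    output_dir="finetuned-bart",
    per_device_train_batch_size=8,  # batch size reported with the loss plots in chapter 5
    num_train_epochs=10,            # the standard number of fine-tuning epochs in section 5.4
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_split,
    eval_dataset=validation_split,
)
trainer.train()
```

In the cluster framework, this fine-tuning step would be repeated once per cluster, each time using that cluster's train and validation splits.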
Figure 4.3: Histograms showing the distribution of the cases’ summary and
description lengths, both in the number of words and the number of sentences.
For each of the distributions the right tail (top 1%) is omitted to filter outliers.
The number of bins is equal to the number of unique lengths, so all values are
included.
Furthermore, case text length and summary length are only two features of a
case. There are more interesting features, such as the topic(s) of a text, along which two cases might differ. We expect that clustering the data into more
homogeneous groups with respect to the combination of these features leads to an
improved ability of downstream models to correctly learn relevant dependencies
between tokens in summaries and descriptions.
An important source of variation in these features might stem from the likelihood that, given the large number of summarized cases, not every case in our dataset was summarized by the same person. Thus, the vocabulary and overall style of two summaries may very well differ.
Feature Description
Dw text length in words
Ds text length in sentences
TC topic class
RED redundancy
SC semantic coherence
Again, D denotes the case text. The first two features are simple lengths of
the case text. The first complex feature is the topic class. Here, the LDA model
was used that was trained to compute the topic similarity score in section 4.1.
The case text was fed into this model, after which it returned a distribution of
the latent topics in the text. From this distribution the topic with the highest
probability was picked as the topic of the text. In total there were five topics, each
denoted by a simple integer value. Finding the topic class can be summarized by:
\[
\mathrm{TC} = \arg\max\left(\theta_{D \mid M}\right)
\]
where $\theta_{D \mid M}$ is the distribution of topics that was computed by the LDA model for case text $D$.
Next, we have redundancy. Here, we steer away from Bommasani and Cardie
(2020) and simply obtain the feature by computing the ratio between the number
of unique tokens and the total number of tokens:
\[
\mathrm{RED} = \frac{D_{w\text{-unique}}}{D_w}
\]
Finally, the feature semantic coherence is computed in the same way as before
(see section 4.1). However, due to the case texts consisting of many sentences,
we chose to only include the first ten sentences to keep computation time within
limits. Thus, semantic coherence is computed as follows:
\[
\mathrm{SC} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{9} \sum_{j=2}^{10} \mathrm{mBERT}\left(Z_{j-1}, Z_j\right)
\]
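Putting the five features together, a per-case feature vector for the clustering component could be assembled as sketched below. The topic-class and coherence helpers are passed in as functions because they depend on external models (the LDA model and mBERT respectively); both function names are hypothetical and used for illustration only.

```python
# Hedged sketch of the per-case feature vector used for clustering.
import numpy as np

def case_features(case_text, topic_class_fn, coherence_fn):
    words = case_text.split()
    sentences = [s for s in case_text.split(".") if s.strip()]

    d_w = len(words)                    # text length in words
    d_s = len(sentences)                # text length in sentences
    t_c = topic_class_fn(case_text)     # index of the most probable LDA topic
    red = len(set(words)) / d_w         # ratio of unique tokens to all tokens
    s_c = coherence_fn(sentences[:10])  # semantic coherence of the first ten sentences

    return np.array([d_w, d_s, t_c, red, s_c])
```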
Loss Function
A loss function is used to evaluate the predictions of a model during training.
Each candidate summary that is generated is compared to the true summary.
The outcome of this comparison, the loss, is fed back into the model to steer its
learning in the envisaged direction.
For our summarization models we used the cross-entropy loss. For each of the
steps in the generation process of a summary, the cross-entropy loss compares the
predicted probability of a token with the true probability of that token. The loss
is defined as:
\[
\mathrm{loss} = -\sum_{i=1}^{\text{output size}} y_i \cdot \log \hat{y}_i
\]
where $y_i$ is the true probability of the token, $\hat{y}_i$ is the predicted probability of the token, and output size refers to the size of the output, which corresponds to the size of the vocabulary.
Conveniently, this formula can be simplified to:
\[
\mathrm{loss} = -y_{\text{output}} \cdot \log \hat{y}_{\text{output}}
\]
where $y_{\text{output}}$ and $\hat{y}_{\text{output}}$ refer to the true probability of the true token and the predicted probability of the true token, respectively.
This simplification follows from the fact that only one token is the true token
at a specific time step. The true token has probability 1 whereas each of the other
tokens has probability 0, meaning that all terms, other than the term of the true
token, will be zero.
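The small numerical example below, using PyTorch, illustrates this simplification: for a one-hot true distribution, the cross-entropy loss reduces to the negative log-probability assigned to the true token. The numbers are made up.

```python
# Cross-entropy loss for one generation step over a tiny vocabulary.
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, 0.1, -1.0, 0.3]])  # model scores for one step
true_token = torch.tensor([0])                        # index of the true token

loss = F.cross_entropy(logits, true_token)
manual = -torch.log_softmax(logits, dim=-1)[0, true_token]
print(float(loss), float(manual))  # both print the same value
```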
5. Results
In this chapter we will present the results of the method discussed earlier. In
section 5.1 we start with listing the dataset features that were computed using
the framework from Bommasani and Cardie (2020). Then, in section 5.2, the
pretraining process of BART will be handled. In section 5.3 we show how we
obtained clustered data. In section 5.4 the main training phase is discussed. In
this phase the pretrained BART model was fine-tuned to obtain summarization
models. Finally in section 5.5, the summarization models are evaluated and
compared. This section provides an extensive description of the human evaluation
that was performed.
Table 5.1: Characteristics of the Rechtspraak dataset. The values for the other
datasets are adopted from (Bommasani & Cardie, 2020). In table 4.1 a description
is given of each of the metrics.
Metric Rechtspraak CNN/DailyMail Newsroom PubMed
Total cases 100K 287K 995K 21K
Dw 666 717 677 2394
Ds 34 50 40 270
Sw 48 31 26 95
Ss 2.90 3.52 1.75 10.00
CMP w 0.742 0.909 0.910 0.870
CMP s 0.838 0.838 0.890 0.874
TS 0.775 0.634 0.539 0.774
ABS 0.135 0.135 0.191 0.122
RED 0.049 0.157 0.037 0.170
SC 0.534 0.964 0.981 0.990
For the topic similarity metric the Rechtspraak dataset scores higher than
the news datasets and comparable to the PubMed dataset. This means that the
topics found in the source text are more similar to the topics from the summary.
Again, this is beneficial as there is less implicit understanding of the case text
required to construct the summary. For example, if topic similarity were very low, the model could not simply reiterate the main points from the source text, but would first have to recast these points into a story that is in line with the topics expected in the summary.
The Rechtspraak dataset also deviates with respect to its redundancy in
comparison with two of the three other datasets. Here, a lower score indicates
that there is little overlap between sentences in the summary. It is not clear how
this impacted the summarization models.
Abstractivity, on the other hand, is roughly equal for each dataset. This means
that for each dataset, on average, any summary contains approximately the same
ratio of unseen tokens in comparison with the tokens found in the case text.
Finally, most remarkable is the semantic coherence score of the Rechtspraak
dataset. As was discussed earlier, semantic coherence measures how probable it
is that a sentence B follows a sentence A. A low score is understandable if we
consider that it is common for Rechtspraak summaries to consist of key sentences
that are only loosely connected to each other. In section 5.5.3, summaries are
shown that illustrate this characteristic. See for example the reference summary
in table 5.7, which consists of two sentences that barely have a direct relationship.
As we work with sequence-to-sequence summarization models, the summary is generated token by token, during which the previously generated tokens are also taken into consideration. Therefore, if semantic coherence is low, this makes
it more challenging for the models to generate good summaries. The summary
model now has to learn more word connections that are foreign to the case texts
and also to the initial pretraining corpus.
Figure 5.1: Pretraining losses of BART Model. The training and validation
losses measured during the training phase of the BART language model are shown.
The batch size is 8. The training loss was logged every 2500 steps and on step 1,
whereas the validation loss was computed and logged every 25000 steps with the
first measured at step 25000. Three guide lines are plotted that indicate steps
where the decrease in loss shows an unexpected pattern.
Figure 5.2: Elbow plot showing the distortion loss for 1 ≤ k ≤ 12.
The result of these clusterings is shown in table 5.2. Here, we see the distribu-
tion of cases over the 6 classes in case of the k-means model and in case of the
Gaussian mixture model. The main difference between the clusterings of the two
approaches is how evenly the cases are distributed. For the Gaussian mixture
model, over half of the cases ended up in the first cluster, whereas for the k-means
model there is a more even distribution.
As we do not know the minimum number of cases required for the BART transformers to be fine-tuned effectively, we strived to avoid obtaining clusters with too few cases and therefore chose to continue with the k-means model for the rest of the experiment.
Table 5.2: The number of cases per cluster for the chosen k-means and Gaussian
mixture models.
Model 1 2 3 4 5 6
k-means 23584 21875 19156 15413 13504 6669
Gaussian mixture 52520 14895 13457 8955 8161 2213
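For illustration, the sketch below fits both clustering candidates with scikit-learn and prints the resulting cluster sizes; the feature matrix is a random placeholder standing in for the standardized per-case features of section 4.3.2.

```python
# Hedged sketch: k-means versus Gaussian mixture clustering with k = 6.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
features = StandardScaler().fit_transform(rng.normal(size=(1000, 5)))  # placeholder data

kmeans_labels = KMeans(n_clusters=6, n_init=10, random_state=42).fit_predict(features)
gmm_labels = GaussianMixture(n_components=6, random_state=42).fit_predict(features)

print(np.bincount(kmeans_labels))  # cases per k-means cluster
print(np.bincount(gmm_labels))     # cases per Gaussian mixture component
```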
Table 5.3: The number of cases contained in each of the dataset splits. Effectively, seven datasets were used in this study: one containing all cases, and one per cluster, containing only the subset of cases belonging to that cluster. Creation of the datasets was done in a stratified manner.
Figure 5.4: Fine-tuning losses of the full model and the cluster framework. The training and validation losses measured during the fine-tuning phase
of the full model and the weighted average losses of the cluster models. The batch
size is 8. The training loss was logged every epoch and on step 1, whereas the
validation loss was computed and logged every epoch.
The training loss was measured after every epoch and on the very first step.
The validation loss was only measured after each epoch, which is why the validation curves start at the first epoch. Furthermore, the y-axis is zoomed in
to only show the cross-entropy loss from 0 to 1 as this better shows the differences
between the two models. In reality, the training loss at step 1 was approximately
27 for both of the models. Thus, if we were to directly use the pretrained BART
model without first fine-tuning it, this would be the approximate loss that would
be obtained. From this, it also follows that most of the learning happens in the
very first epoch. In subsequent epochs the loss only marginally improves.
When we compare the curves of the two different models, we notice that both
curves follow the same pattern. The cluster model, however, yields a slightly higher
loss than the full model. Therefore, purely judging from this figure, we could
state that clustering has a negative effect on the quality of the summarization
framework. However, without considering the outputs of the model, this would
be a rushed conclusion. In section 5.5, we will evaluate the outputs of the models
to find whether this statement holds.
Another way of viewing the quality of the cluster model, is by looking at its
individual components. In figure 5.5, we show the same figure as before, but now
the individual cluster models are shown. We see that there is a bit of variation in
model performance. Intuitively, one may think that a smaller dataset negatively impacts a model’s performance. However, seeing that the largest of the cluster models, cluster model 0, performs worst, whereas cluster model 4, which was trained on an average-sized subset of the data, performs best, this hypothesis becomes less fitting.
Figure 5.5: Epoch view of the Fine-tuning losses of the full model and
the individual cluster models. The training and validation losses measured
during the fine-tuning phase of the full model and the individual cluster models.
Figure 5.5 gives a somewhat distorted view of the training process. That is
because each of the models was trained for a different number of steps. This
follows from the differing sizes of the data splits. Each epoch the model passes over
the complete train split, where each step uses one batch of 8 cases. Therefore, a better-suited view of the training process is a plot of the loss per training step. This view is shown in figure 5.6, in which the exact same data is plotted.
Training for fewer steps also means that the model took less time to train. It follows from the nature of the experiment that the total number of training steps of the cluster models combined is roughly equal to the number of training steps of the full model. The training time of both variants was thus also roughly equal.
We were curious to see whether a cluster model would improve if we had it
train for more steps. To this end, we trained cluster model 0 for approximately the same number of training steps as the full model. This corresponded to 43
epochs instead of the prior 10 epochs. In figure 5.7, the results of this 43-epoch
Figure 5.6: Step view of the Fine-tuning losses of the full model and
the individual cluster models. The training and validation losses measured
during the fine-tuning phase of the full model and the individual cluster models.
This is the same data as shown in figure 5.5, but here the loss is plotted against
the training steps instead of epochs.
Figure 5.7: Fine-tuning losses of the full model and the cluster model.
The training and validation losses measured during the fine-tuning phase of the
full model and cluster model 0. Here, instead of training for the standard 10
epochs, cluster model 0 is trained for 43 epochs which translates to approximately
the same number of training steps as for the full model.
one word. As subword tokenization is used, it is likely that the number of tokens
is always larger than the number of words.
Generating the results took 25 hours and 40 minutes for the test set of the full
model and approximately the same time for the test sets of the cluster models
combined. In total there were 9921 cases in the test set of the full model, and an
equal total number of cases in the test sets of the cluster models.
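For completeness, the sketch below shows how a summary could be generated for a single case with a fine-tuned model. The checkpoint path, the placeholder case text, and the generation parameters are assumptions; the actual generation configuration is discussed in section 6.1.6.

```python
# Hedged generation sketch for one case text.
from transformers import BartForConditionalGeneration, BartTokenizerFast

model = BartForConditionalGeneration.from_pretrained("path/to/finetuned-bart")  # placeholder path
tokenizer = BartTokenizerFast.from_pretrained("path/to/finetuned-bart")

case_text = "..."  # placeholder: one case text from the test set

inputs = tokenizer(case_text, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,       # illustrative beam search setting
    max_length=256,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```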
Aggregate Results
The results of the human evaluation are shown in table 5.5. This table shows the
average score of each of the four metrics for the true/reference summaries and for
the summaries that were generated by each of the models. Furthermore, it shows
the average scores for each metric per cluster class for each of the systems. This
second view enables us to compare the performance of the cluster model versus
the full model and the reference summaries for a specific cluster.
The table shows us that, overall, the true summaries are more informative,
more relevant, more fluent, and more coherent. This is in line with the expectations
as the summarization models are trained on these true summaries and, as the
models are not perfect, underperformance is to be expected. It is interesting
however that with respect to fluency and coherence, both summarization models
perform comparatively well. This is further highlighted by the standard deviations
of these scores of the reference summaries overlapping with the average scores of
the summarization models. Nevertheless, the generated summaries underperform
with respect to informativeness and relevance. The metrics still have average
scores between 3.5 and 4, but in comparison with the other metrics there clearly is
a larger gap between the real summaries and the generated summaries. Following
these observations, we can state that, in comparison with the true summaries, the
summarization models do a relatively good job at producing fluent and coherent
summaries, while struggling more with keeping the summaries informative and
relevant.
Table 5.5: The human evaluation results. For each metric the average score
and standard deviation are shown. Best scores per group are highlighted in bold.
Inf. denotes Informativeness.
Now, if we compare both summarization models, we see that both the full data
model and the cluster framework score roughly equally on all four dimensions.
To get a more detailed view of the specific scores that were given per summary
type, figure 5.8 is provided. This figure shows the frequency of each score (1 to 5)
that was given for each metric for each summary type.
As we see in the figure, the true summaries perform best. This corresponds
to what we saw in table 5.5. There is one true summary, however, that was scored a 1 on informativeness. This is highly unusual, as the summaries are manually written, which makes it reasonable to expect that at least informativeness and relevance, the two metrics that relate to the factual content of a summary, are of reasonable quality. In this specific instance, the case describes an appeal against a previous verdict. The true summary only listed the previous verdict, without mentioning that the appeal was well founded and therefore overruled the previous verdict. The
cluster framework summary was a bit more extensive as it also mentioned the
appeal, but instead of mentioning that the appeal was well founded, it mentioned
that it was unfounded; this summary got a score of 2. Only the full model
summary, which got 4 points, contained both relevant parts, including the remark
that the appeal was well founded.
For the true summaries, informativeness shows the most variation. This stems mainly from the fact that the true summaries are relatively often incomplete and only list a few characteristics of the case; the summary described in the previous paragraph is an example of this. Relevance, on the other hand, which also measures the content of the summary, scores better. The facts that were mentioned in the true summaries were almost always relevant to the case, which comes as little surprise given that these summaries were written by humans. The two summarization models have more trouble keeping their summaries relevant, as shown by the variation in the score frequencies.
Overall, we see that the two summarization models have similar frequencies for each of the scores. This also corresponds to the standard deviations that were listed for both models in table 5.5. The scores of 1 for informativeness were given because the summaries in question were, respectively, written as a single question, contained a long paragraph of text that terminated mid-sentence, and contained plainly irrelevant information.
General Observations
During the evaluation process, some other particularities were identified. In this section, we briefly go over these.
First of all, an important behavior to note is that the summarization models had trouble distinguishing whether information belongs to the current case (e.g., the appeal case) or to a case that is referred to in the current case. A synopsis of such a referred case is often provided in the current case text to give the reader some background. The problem is that the models sometimes concluded a summary with the verdict of the referred case, which was only given as background, instead of the verdict of the current case. In cases where an appeal leads to a different verdict, this results in the summary containing the wrong conclusion.
It is not uncommon for true summaries to be extractive by design: they consist of one or more sentences that are taken verbatim from the source text. As a result, the models are also inclined to generate summaries that contain phrases found literally in the source text.
Another thing that was visible from evaluating multiple cases is that the texts always had some structure: the different sections were marked with a heading describing their content.
We also noticed that a large portion of the cases are not initial cases but appeal cases. This might be a result of filtering out all longer cases (of length >1024 words), as initial cases might be longer by default because more information has to be covered.
Lastly, the summarization models sometimes used anonymized terms in the generated summaries. These terms occur in the source texts, but are not used in the reference summaries. The anonymized terms are always surrounded by square brackets and replace real names, company names, and places of residence. For example, [adres] could have been used instead of a real address. In the cases where a summarization model used an anonymized term in the summary, the summary itself was usually not of worse quality. Nevertheless, as this behavior is not in accordance with the reference summaries, the informativeness score was reduced by 1 if a summary contained such terms.
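Because these placeholders always appear between square brackets, summaries containing them can be flagged automatically with a simple pattern match. The check below is a hypothetical sketch; in our evaluation this penalty was applied manually.

import re

# Matches bracketed anonymization placeholders such as [adres] or [naam 1].
ANONYMIZED = re.compile(r"\[[^\[\]]+\]")

def informativeness_penalty(summary: str) -> int:
    """Return 1 if the summary contains an anonymized placeholder, else 0."""
    return 1 if ANONYMIZED.search(summary) else 0

example = "De rechtbank veroordeelt [verdachte] tot betaling aan [naam 1]."
print(informativeness_penalty(example))  # -> 1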
A case that highlights some of the flaws of the summarization models is listed in table 5.7 (available at https://uitspraken.rechtspraak.nl/inziendocument?id=ECLI:NL:RBSGR:2010:BM2760). The true summary is concise and informative. On each of the metrics
it scores 5 points. The full model summary is severely lacking. In fact, it is one
of the worst summaries that was present in the human evaluation sample. The
summary shows multiple problems, which is also reflected in the scores that were
given. The summary opens with a hard-to-read sentence stating the judgment of
the case. More importantly, the rest of the summary focuses on a very specific section of the case text. Also, there are anonymized terms in the summary. For
these reasons, informativeness was scored as 1. The metric relevance answers the
question "Are the details provided by the summary consistent with details in the
article?". Because the details all were in accordance with the case text, relevance
still got 5 points. Finally, fluency and coherence scored low due to the summary terminating mid-sentence and the general difficulty of reading and understanding
the summary. The cluster summary, on the other hand, scores a bit better. It
started off by correctly mentioning that the case is about an ’extradition request’,
which is information that already tells a lot about the nature of the case. Next,
the judgment of the case is presented. Unlike in the full model summary, the
judgment now makes a bit more sense because we are aware of the case topic.
In the body of the summary, some less relevant information from the case text
is described. Here, a non-existent word, ’geatz’, is included, which might be a peculiar artifact stemming from the use of subword tokenization. The summary
ends with a conclusion that is relevant, but not that informative.
Discussion
The number of beams, max_length, min_length, and the length penalty all might influence the results of the experiments. We did not have the resources to tune these parameters and chose a common configuration.
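For concreteness, the snippet below shows what such a common generation configuration could look like, assuming the Hugging Face Transformers implementation of BART. The checkpoint name and the specific parameter values are illustrative and not necessarily those used in the experiments.

from transformers import BartForConditionalGeneration, BartTokenizer

# Hypothetical checkpoint; a fine-tuned Dutch legal BART model would be loaded here.
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")

case_text = "..."  # a (truncated) case text of at most 1024 tokens
inputs = tokenizer(case_text, return_tensors="pt", truncation=True, max_length=1024)

# A common beam-search configuration; these are the parameters discussed above
# as potentially influencing the results.
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,
    max_length=256,
    min_length=40,
    length_penalty=2.0,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))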
RQ1: What are the key differences between available benchmark datasets
and the Rechtspraak dataset used in this project?
To answer this research question, we computed the set of features that was introduced by Bommasani and Cardie (2020) in their work on the intrinsic evaluation of summarization datasets. The feature values of the Rechtspraak dataset were then compared with those of three common benchmark datasets.
From this comparison, it followed that there were three clear differences
between the Rechtspraak dataset and the common benchmark datasets. First,
relative to the length of case texts, the length of the summaries was large. Second,
there was less redundancy within the case summaries. Third, the most distinctive
feature of the Rechtspraak dataset is the loose relationship between consecutive
sentences in the case summaries.
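To illustrate the kind of dataset-level features involved, the sketch below computes two simple statistics in the spirit of this framework: the word-level compression between case texts and summaries, and a crude redundancy proxy based on pairwise sentence overlap. These are simplified stand-ins for the actual feature definitions of Bommasani and Cardie (2020), shown only to make the comparison concrete.

from statistics import mean

def compression_ratio(text: str, summary: str) -> float:
    """Word-level compression: how much shorter the summary is than the text."""
    return len(summary.split()) / max(len(text.split()), 1)

def redundancy(summary: str) -> float:
    """Crude redundancy proxy: average pairwise word overlap between sentences."""
    sentences = [s.split() for s in summary.split(".") if s.strip()]
    if len(sentences) < 2:
        return 0.0
    overlaps = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            a, b = set(sentences[i]), set(sentences[j])
            overlaps.append(len(a & b) / max(len(a | b), 1))
    return mean(overlaps)

# Hypothetical mini-corpus of (case text, summary) pairs.
pairs = [("De rechtbank overweegt het volgende. Het beroep is gegrond.",
          "Beroep gegrond. De rechtbank vernietigt het besluit.")]
print(mean(compression_ratio(t, s) for t, s in pairs))
print(mean(redundancy(s) for _, s in pairs))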
We did not have the resources to perform ablation studies to specifically
measure how these differences impacted the performance of the summarization
models. In the discussion (chapter 6) we identified this as possible future work.
References
Abuobieda, A., Salim, N., Kumar, Y. J., & Osman, A. H. (2013). An improved
evolutionary algorithm for extractive text summarization. In A. Selamat,
N. T. Nguyen, & H. Haron (Eds.), Intelligent information and database
systems (pp. 78–89). Springer Berlin Heidelberg.
Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B.,
& Kochut, K. (2017). Text summarization techniques: A brief survey.
International Journal of Advanced Computer Science and Applications,
8 (10). https://doi.org/10.14569/IJACSA.2017.081052
Al-Sabahi, K., Zuping, Z., & Nadher, M. (2018). A hierarchical structured self-
attentive model for extractive document summarization (HSSAS). IEEE
Access, PP, 1–1. https://doi.org/10.1109/ACCESS.2018.2829199
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The long-document
transformer. CoRR, abs/2004.05150. https://arxiv.org/abs/2004.05150
Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic
language model. The journal of machine learning research, 3, 1137–1155.
Bhattacharya, P., Hiware, K., Rajgaria, S., Pochhi, N., Ghosh, K., & Ghosh, S.
(2019). A comparative study of summarization algorithms applied to legal
case judgments. In L. Azzopardi, B. Stein, N. Fuhr, P. Mayr, C. Hauff,
& D. Hiemstra (Eds.), Advances in information retrieval (pp. 413–428).
Springer International Publishing.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.
Bommasani, R., & Cardie, C. (2020). Intrinsic evaluation of summarization
datasets. Proceedings of the 2020 Conference on Empirical Methods in
Natural Language Processing (EMNLP), 8075–8096. https://doi.org/10.18653/v1/2020.emnlp-main.649
Chen, Y.-C., & Bansal, M. (2018). Fast abstractive summarization with reinforce-
selected sentence rewriting. CoRR, abs/1805.11080. http://arxiv.org/abs/1805.11080
Delobelle, P., Winters, T., & Berendt, B. (2020). RobBERT: A Dutch RoBERTa-
based language model. Findings of the Association for Computational
Linguistics: EMNLP 2020, 3255–3265. https://doi.org/10.18653/v1/2020.findings-emnlp.292
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of
deep bidirectional transformers for language understanding. Proceedings of
the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), 4171–4186. https://doi.org/10.18653/v1/N19-1423
de Vries, W., van Cranenburgh, A., Bisazza, A., Caselli, T., van Noord, G., &
Nissim, M. (2019). BERTje: A Dutch BERT model. CoRR, abs/1912.09582.
http://arxiv.org/abs/1912.09582
Ferreira, R., de Souza Cabral, L., Freitas, F., Lins, R. D., de França Silva, G.,
Simske, S. J., & Favaro, L. (2014). A multi-document summarization
system based on statistics and linguistic treatment. Expert Systems with
Applications, 41 (13), 5780–5787. https://doi.org/10.1016/j.eswa.2014.03.023
Gambhir, M., & Gupta, V. (2017). Recent automatic text summarization tech-
niques: A survey. Artificial Intelligence Review, 47 (1), 1–66. https://doi.org/10.1007/s10462-016-9475-9
Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning
(Vol. 1). MIT press Cambridge.
Grusky, M., Naaman, M., & Artzi, Y. (2018). Newsroom: A dataset of 1.3 mil-
lion summaries with diverse extractive strategies. Proceedings of the 2018
Conference of the North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies, Volume 1 (Long
Papers), 708–719.
Hachey, B., & Grover, C. (2005). Automatic legal text summarisation: Experiments
with summary structuring. Proceedings of the 10th International Conference
on Artificial Intelligence and Law, 75–84. https://doi.org/10.1145/1165485.1165498
Hiemstra, D. (1998). A linguistically motivated probabilistic model of information
retrieval. Proceedings of the Second European Conference on Research and
Advanced Technology for Digital Libraries, 569–584.
Huang, D., Cui, L., Yang, S., Bao, G., Wang, K., Xie, J., & Zhang, Y. (2020).
What have we achieved on text summarization? CoRR, abs/2010.04529.
https://arxiv.org/abs/2010.04529
Jurafsky, D., & Martin, J. H. (2020). Speech and language processing: An introduc-
tion to natural language processing, computational linguistics, and speech
recognition. Pearson.
Kornilova, A., & Eidelman, V. (2019). BillSum: A corpus for automatic summarization of US legislation. arXiv preprint arXiv:1910.00523.
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoy-
anov, V., & Zettlemoyer, L. (2020). BART: Denoising sequence-to-sequence
pre-training for natural language generation, translation, and compre-
hension. Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, 7871–7880. https://doi.org/10.18653/v1/2020.acl-main.703
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. Text
Summarization Branches Out, 74–81. https://aclanthology.org/W04-1013
Lin, J., Sun, X., Ma, S., & Su, Q. (2018). Global encoding for abstractive sum-
marization. Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), 163–169.
Liu, C.-L., & Chen, K.-C. (2019). Extracting the gist of Chinese judgments of the Supreme Court. Proceedings of the Seventeenth International Conference on Artificial Intelligence and Law.
The Data Collection Process
In this appendix, we will discuss how the Rechtspraak dataset was collected and
parsed. First, we describe the external dataset, which is the dataset as it was
published by Raad voor de Rechtspraak. Second, we show how the external dataset
files were parsed and stored as a single dataset.
Now, we are left with a complete dataset of all published legal cases of Raad voor de Rechtspraak. In the remainder of this appendix, we will refer to this dataset as the external dataset. The next task was to extract the relevant information from the external dataset and store it efficiently. To this end, a raw dataset was constructed from the external dataset as follows:
• Parse each XML file within each month archive, yielding the meta information of the case, the full text of the case, and the summary of the case
• Write the extracted parts to a pandas DataFrame and store it as a parquet file for that month
• After parsing all archives, combine the individual parquet files (one for each month of each year) into four parquet files containing the complete dataset (a sketch of this pipeline is given below)
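The sketch below illustrates this parsing pipeline. The XML tag names, directory layout, and file names are assumptions made for illustration and do not necessarily match the actual Rechtspraak XML schema or the code used for this thesis.

from pathlib import Path
import xml.etree.ElementTree as ET
import pandas as pd

def parse_case(xml_path: Path) -> dict:
    """Extract the case identifier, full text, and summary from one XML file.

    The tag names ('identifier', 'uitspraak', 'inhoudsindicatie') are
    assumptions for illustration purposes only.
    """
    root = ET.parse(xml_path).getroot()

    def text_of(tag: str) -> str:
        node = root.find(f".//{tag}")
        return "".join(node.itertext()).strip() if node is not None else ""

    return {
        "ecli": text_of("identifier"),
        "case_text": text_of("uitspraak"),
        "summary": text_of("inhoudsindicatie"),
    }

def parse_month_archive(month_dir: Path, out_file: Path) -> None:
    """Parse all XML files of one month and store them as a parquet file."""
    records = [parse_case(p) for p in sorted(month_dir.glob("*.xml"))]
    pd.DataFrame(records).to_parquet(out_file, compression="snappy")

# Hypothetical layout: one directory per month, e.g. data/2020/01/*.xml
# parse_month_archive(Path("data/2020/01"), Path("parsed/2020-01.parquet"))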
Table A.1: Comparison of different file formats for storing the OpenRechtspraak
dataset. The comparison was made using the legal cases from 2020. For each
month, the parsed files/cases were written to an individual file. Fth. denotes
Feather and Snpy denotes Snappy. Best averages are in bold.
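A comparison such as the one in table A.1 can be produced by timing how long it takes to write the same DataFrame in each candidate format. The snippet below is a minimal sketch of such a benchmark; the file names and the set of formats and compressions are illustrative.

import time
import pandas as pd

def time_write(df: pd.DataFrame, writer, path: str) -> float:
    """Return the wall-clock time in seconds needed to write df with writer."""
    start = time.perf_counter()
    writer(df, path)
    return time.perf_counter() - start

# Hypothetical month of parsed cases, as produced by the pipeline sketched above.
df = pd.DataFrame({"ecli": ["ECLI:NL:X:2020:1"], "case_text": ["..."], "summary": ["..."]})

results = {
    "parquet-snappy": time_write(df, lambda d, p: d.to_parquet(p, compression="snappy"), "cases.snappy.parquet"),
    "parquet-gzip": time_write(df, lambda d, p: d.to_parquet(p, compression="gzip"), "cases.gzip.parquet"),
    "feather": time_write(df, lambda d, p: d.to_feather(p), "cases.feather"),
    "csv": time_write(df, lambda d, p: d.to_csv(p, index=False), "cases.csv"),
}
print(results)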
Examples of Generated Summaries
In this appendix, two cases’ summaries are shown and discussed. This appendix is supplementary to results section 5.5.3.
Case ECLI:NL:CRVB:2015:1630 shows three summaries that each were rated 5/5 on each metric. These summaries are listed in table B.1. Each summary
starts off the same, but in the middle of the first sentence they start offering
slightly different views on the case. For the summarization models, there clearly is
some variation in how the summaries are written and what parts of the case text
are included. Interestingly, the reference summary almost seems to be a composite of the two model summaries. However, despite being more concise, the generated summaries did not fail to include the most important information from the case.
In table B.2, a case is shown that highlights some of the problems of the summarization models. Here, the full model summary scores worst because of a lack of informativeness. Furthermore, that summary consists of a single convoluted question; this behavior, where the summary is phrased as a question, was present in one other generated summary. The cluster model summary scores better, but its second sentence contains grammatical peculiarities.
Table B.1 case: https://uitspraken.rechtspraak.nl/inziendocument?id=ECLI:NL:CRVB:2015:1630
Table B.2 case: https://uitspraken.rechtspraak.nl/inziendocument?id=ECLI:NL:CRVB:2005:AU5952