Effectively Leveraging BERT for Legal Document Classification
Nut Limsopatham
Microsoft Corporation
Redmond, WA
[email protected]
Figure 1: An example of fine-tuning a BERT model on a classification task.

We fine-tune the models on the violation prediction and court overruling prediction tasks. We provide detailed information about these tasks in Section 4.2.

RQ2: How can BERT-based models be adapted to effectively deal with long documents in legal text classification?

For RQ2, we discuss the performance of several BERT variants (including truncating long documents from the front or from the back), as well as hierarchical BERT models (Pappagari et al., 2019) that learn to combine the output vectors of BERT using different strategies, such as max pooling and mean pooling (Krizhevsky et al., 2012), before applying a classification layer.

4 Experimental Setup

In Section 3, we discussed the two main research questions to be investigated in this paper. In this section, we discuss the hyper-parameters of our models (Section 4.1). Then, we provide the details of the two legal text classification datasets (Section 4.2) and the variants of the BERT models (Section 4.3) used in the experiments.

4.1 Hyper-parameters

We use the transformers library (https://huggingface.co/transformers/) to develop and train the BERT models in our experiments. For all experiments, we fine-tune the models using the AdamW optimizer (Loshchilov and Hutter, 2017) with a learning rate of 5e-5 and a linear learning-rate scheduler. We fine-tune each model for 5 epochs, as our preliminary results showed that 5 epochs yielded the most effective performance for most of the models used.

4.2 Datasets

4.2.1 ECHR Violation (Multi-Label) Dataset

The dataset contains 11k cases from the European Convention on Human Rights public database (Chalkidis et al., 2021). Each case contains a list of paragraphs representing the facts of the case. The task is to predict which of the human rights articles of the Convention are violated (if any) in a given case. There are 40 target labels, corresponding to ECHR articles (Chalkidis et al., 2021).

Table 1 provides statistical information about the ECHR Violation (Multi-Label) dataset. In particular, the dataset is split into three folds (training, development and testing) containing 9,000, 1,000 and 1,000 data points (cases), respectively. On average, a case contains between 1,619 and 1,926 tokens, which is more than the 512 tokens supported by BERT.

This is a multi-label classification task, where we follow Chalkidis et al. (2021) and evaluate classification performance in terms of the micro-F1 score.

Table 1: Statistics of the ECHR Violation (Multi-Label) dataset.

Fold         # Cases  Max # Words  Min # Words  Avg. # Words  Max # Labels  Min # Labels  Avg. # Labels
Training     9,000    35,426       69           1,619.24      10            0             1.8
Development  1,000    14,493       84           1,784.03      7             0             1.7
Testing      1,000    15,919       101          1,925.73      6             1             1.7

4.2.2 Overruling Task Dataset

This dataset is composed of 2,400 data points, which are legal statements that are either overruled or not overruled by the same or a higher-ranked court (Sulea et al., 2017; Zheng et al., 2021).

We show the statistics of the Overruling Task Dataset in Table 2. The average and the maximum number of tokens within a statement (i.e. a case) are 21.94 and 204, respectively. Therefore, the BERT model can directly support this dataset without any alteration.

Following Zheng et al. (2021), the task is modeled as binary classification, and we conduct 10-fold cross-validation on the dataset. Finally, we report the average F1-score across the 10 folds together with its standard deviation (Zheng et al., 2021).

4.3 Model Variants

Next, we discuss the variants of adapting *Model, which is a pre-trained BERT-based model from Table 3, to deal with long documents in the experiments. The methods used are as follows:
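The first family of variants truncates long inputs to BERT's 512-token window, keeping either the front or the back of the document. A minimal sketch on an already-tokenized document; `truncate_tokens` is an illustrative helper, not code from the paper:

```python
def truncate_tokens(tokens, max_len=512, from_front=False):
    """Keep at most max_len - 2 tokens, reserving room for [CLS] and [SEP].

    from_front=False keeps the beginning of the document (truncates the back);
    from_front=True keeps the end of the document (truncates the front).
    """
    budget = max_len - 2  # reserve two positions for the special tokens
    if len(tokens) <= budget:
        kept = tokens
    elif from_front:
        kept = tokens[-budget:]  # drop tokens from the front
    else:
        kept = tokens[:budget]   # drop tokens from the back
    return ["[CLS]"] + kept + ["[SEP]"]
```

In practice, transformers tokenizers truncate from the back when `truncation=True` is set, so the front-truncation variant is the one that typically needs explicit slicing like this.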
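The hierarchical variants (Pappagari et al., 2019) instead split a long document into BERT-sized chunks, encode each chunk, and combine the per-chunk output vectors by max or mean pooling before the classification layer. A self-contained sketch with a pluggable encoder, where a real model's per-chunk [CLS] vectors would replace the stub; the function names are illustrative, not from the paper:

```python
from typing import Callable, List

def chunk(tokens: List[str], size: int = 510) -> List[List[str]]:
    """Split a long token sequence into consecutive BERT-sized chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def pool(vectors: List[List[float]], mode: str = "max") -> List[float]:
    """Combine per-chunk output vectors into a single document vector."""
    if mode == "max":
        return [max(dim) for dim in zip(*vectors)]   # element-wise maximum
    if mode == "mean":
        return [sum(dim) / len(dim) for dim in zip(*vectors)]  # element-wise mean
    raise ValueError(f"unknown pooling mode: {mode}")

def encode_document(tokens: List[str],
                    encoder: Callable[[List[str]], List[float]],
                    mode: str = "max") -> List[float]:
    """Encode each chunk with a BERT-like encoder, then pool the results."""
    return pool([encoder(c) for c in chunk(tokens)], mode=mode)
```

The pooled vector is then fed to the classification layer, so the document length is no longer bounded by the encoder's 512-token limit.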
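Returning to the evaluation protocol of Section 4.2, the two reported metrics can be sketched in a few lines: micro-F1 over multi-label predictions (ECHR task) and the fold-averaged F1 with its standard deviation (Overruling task). Using the population standard deviation is an assumption here; the paper does not specify which variant Zheng et al. (2021) report:

```python
import math

def micro_f1(gold, pred):
    """Micro-averaged F1 over multi-label examples, given as sets of labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def mean_and_std(scores):
    """Average score across folds with its (population) standard deviation."""
    m = sum(scores) / len(scores)
    var = sum((s - m) ** 2 for s in scores) / len(scores)
    return m, math.sqrt(var)
```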