
Effectively Leveraging BERT for Legal Document Classification

Nut Limsopatham
Microsoft Corporation
Redmond, WA
[email protected]

Abstract

Bidirectional Encoder Representations from Transformers (BERT) has achieved state-of-the-art performance on several text classification tasks, such as GLUE and sentiment analysis. Recent work in the legal domain has started to use BERT on tasks such as legal judgement prediction and violation prediction. A common practice when using BERT is to fine-tune a pre-trained model on a target task and truncate the input texts to the size of the BERT input (e.g. at most 512 tokens). However, due to the unique characteristics of legal documents, it is not clear how to effectively adapt BERT to the legal domain. In this work, we investigate how to deal with long documents, and how important it is to pre-train on documents from the same domain as the target task. We conduct experiments on two recent datasets: the ECHR Violation Dataset and the Overruling Task Dataset, which are multi-label and binary classification tasks, respectively. Importantly, the average number of tokens in a document from the ECHR Violation Dataset is more than 1,600, while the documents in the Overruling Task Dataset are shorter (the maximum number of tokens is 204). We thoroughly compare several techniques for adapting BERT to long documents, and compare different models pre-trained on the legal and other domains. Our experimental results show that we need to explicitly adapt BERT to handle long documents, as truncation leads to less effective performance. We also found that pre-training on documents that are similar to the target task would result in more effective performance in several scenarios.

1 Introduction

Recent advances in deep learning have contributed to the effective performance of several natural language processing (NLP) tasks on legal text documents, such as violation prediction (Chalkidis et al., 2020), overruling prediction (Zheng et al., 2021), legal judgement prediction (Chalkidis et al., 2019), legal information extraction (Chalkidis et al., 2018), and court opinion generation (Ye et al., 2018).

Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019) has gained attention from the NLP community due to its effectiveness on several NLP tasks (Chalkidis et al., 2019, 2020; Zheng et al., 2021). Importantly, the effectiveness of BERT is mainly due to its transfer learning ability, which leverages semantic and syntactic knowledge from pre-training on a large unlabeled corpus (Devlin et al., 2019; Chalkidis et al., 2020; Zheng et al., 2021). However, Chalkidis et al. (2019) reported that BERT could not effectively handle long documents in the European Court of Human Rights (ECHR) dataset. In addition, pre-training BERT is costly: we need access to specialised hardware to pre-train BERT on large corpora (Devlin et al., 2019; Liu et al., 2019; Zheng et al., 2021).

In this work, we investigate how to effectively adapt BERT to handle long documents, and how important pre-training on in-domain documents is. Specifically, we focus on two legal document prediction tasks: the ECHR Violation Dataset (Chalkidis et al., 2021) and the Overruling Task Dataset (Zheng et al., 2021). The ECHR Violation Dataset provides a multi-label classification task: given a list of facts described in free text, the task is to identify which articles of the European Convention were violated. The Overruling Task is a binary classification task to predict whether a legal statement will later be overruled by the same or a higher-ranking court (Zheng et al., 2021). We discuss the two tasks in more detail in Section 4.2.

The main contributions of this paper are three-fold:

1. We investigate how to effectively adapt BERT to deal with long documents (i.e. documents containing more than 512 tokens).

2. We analyse the impacts of pre-training on different types of documents, especially in-domain documents, on the performance of a fine-tuned BERT model.
3. We thoroughly evaluate approaches to adapt BERT to long documents, as well as pre-trained models, to identify best practices for using BERT in legal document classification tasks.

The remainder of the paper is organised as follows. Section 2 discusses related work and positions our work in the literature. Section 3 describes the two research questions we aim to answer in this paper and how we answer them. Sections 4 and 5 discuss our experimental setup and results. Section 6 provides further insight from the experimental results and answers the two research questions. Finally, we provide concluding remarks in Section 7.

2 Related Work

Legal documents, such as EU & UK legislation, European Court of Human Rights (ECHR) cases, and Case Holdings On Legal Decisions (CaseHOLD), are normally written in a descriptive language in a non-structured text format, and have unique characteristics that differ from those of other domains. In order to advance legal NLP research, several tasks and datasets have been developed, including violation prediction on the ECHR dataset (Chalkidis et al., 2020), court overruling (Zheng et al., 2021), legal docket classification (Nallapati and Manning, 2008) and court view generation (Ye et al., 2018). In this work, we focus on text classification, which is a main research area of legal NLP.

Bidirectional Encoder Representations from Transformers (BERT) is a language representation model that is optimised during pre-training by self-supervised training, using masked language model prediction and next sentence prediction as a joint training objective (Devlin et al., 2019). As shown in Figure 1, the BERT model architecture is built upon the multi-layer bidirectional Transformer encoder of Vaswani et al. (2017), where the number of input tokens is limited to 512. Pre-training BERT enables effective transfer learning from a large dataset before fine-tuning the model on a specific task (Devlin et al., 2019; Vaswani et al., 2017). Importantly, Devlin et al. (2019) used this transfer learning method to achieve state-of-the-art performance on several NLP datasets, such as GLUE (Wang et al., 2018), SQuAD (Rajpurkar et al., 2016), Concept Normalisation (Limsopatham and Collier, 2015, 2016) and Novel Named Entity Recognition (Derczynski et al., 2017). In particular, when fine-tuning BERT, we normally add a classification layer (either SoftMax or Sigmoid) on the C (or [CLS]) representation in the BERT output layer, in order to compute the prediction probabilities, as in Figure 1.
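To make this concrete, the following is a minimal sketch (not code from the paper) of this setup: a pre-trained encoder with a linear classification layer over the [CLS] representation. The checkpoint name and number of labels are illustrative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# A minimal sketch of the setup in Figure 1: a pre-trained BERT encoder with
# a classification layer on top of the [CLS] representation. The checkpoint
# and `num_labels` are placeholders, not the paper's exact configuration.
class BertClassifier(torch.nn.Module):
    def __init__(self, checkpoint="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.classifier = torch.nn.Linear(self.encoder.config.hidden_size,
                                          num_labels)

    def forward(self, input_ids, attention_mask):
        output = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask)
        cls = output.last_hidden_state[:, 0]  # [CLS] token representation
        return self.classifier(cls)           # logits; SoftMax (single-label)
                                              # or Sigmoid (multi-label) on top

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["The court overruled the prior holding."],
                  truncation=True, max_length=512, return_tensors="pt")
logits = BertClassifier()(batch["input_ids"], batch["attention_mask"])
```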
In the legal domain, Zheng et al. (2021) found that pre-training BERT on legal documents, before fine-tuning on particular tasks, leads to better performance than pre-training BERT on general documents. However, Chalkidis et al. (2019) found that BERT did not perform well on the violation prediction task, due to the length of the documents, which are mostly longer than 512 tokens. They dealt with the long legal documents by using a hierarchical BERT technique (Chalkidis et al., 2019). Different from previous work, we investigate the effectiveness of variants of pre-trained BERT-based models, and compare several methods of handling long legal documents in legal text classification.

Several attempts (Beltagy et al., 2020; Zaheer et al., 2020; Pappagari et al., 2019) have been made to enable BERT-like models to work on documents with more than 512 tokens. For example, Beltagy et al. (2020) and Zaheer et al. (2020) used several different attention-mechanism techniques, such as global attention and sliding-window attention, to enable learning on longer sequences of tokens. Pappagari et al. (2019) investigated different approaches that apply BERT to sequential chunks of text in a document, before aggregating the features using techniques such as max pooling and mean pooling. In this work, we adapt these techniques to learn how to effectively use BERT on long legal documents.

3 Research Questions

This section discusses the research questions we aim to answer in this paper.

RQ1 For legal text classification, does pre-training on in-domain documents lead to more effective performance than pre-training on general documents?

To answer the first research question, we compare the performance of variants of BERT-based models that are pre-trained on general documents or on different types of legal documents. Examples of the models are BERT (Devlin et al., 2019) and LEGAL-BERT (Chalkidis et al., 2020). The complete list of models is described in Section 4.3.
We fine-tune the models on the violation prediction and court overruling prediction tasks. We provide detailed information about the tasks in Section 4.2.

Figure 1: An example of fine-tuning a BERT model on a classification task.

RQ2 How can BERT-based models be adapted to effectively deal with long documents in legal text classification?

For RQ2, we discuss the performance of several BERT variants (including truncating long documents from the front or from the back), as well as hierarchical BERT models (Pappagari et al., 2019) that learn to combine the output vectors of BERT using different strategies, such as max pooling (Krizhevsky et al., 2012) and mean pooling (Krizhevsky et al., 2012), before applying a classification layer.

4 Experimental Setup

In Section 3, we discussed the two main research questions to be investigated in this paper. In this section, we discuss the hyper-parameters of our models in Section 4.1. Then, we provide the details of the two legal text classification datasets (Section 4.2) and the variants of the BERT models (Section 4.3) used in the experiments.

4.1 Hyper-parameters

We use the transformers library (https://huggingface.co/transformers/) to develop and train BERT models in our experiments. For all experiments, we fine-tune the models using the AdamW optimizer (Loshchilov and Hutter, 2017), a learning rate of 5e-5 and a linear learning-rate scheduler. We use a batch size of 16 and fine-tune the models on individual tasks for 5 epochs; our preliminary results showed that 5 epochs resulted in the most effective performance for most of the models used.
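As an illustration, the following is a minimal sketch of a fine-tuning loop consistent with these settings (AdamW, learning rate 5e-5, linear schedule, batch size 16, 5 epochs). The names `model` and `train_loader` are assumed to be defined elsewhere, e.g. a classifier as in the sketch in Section 2; this is not the paper's actual code.

```python
import torch
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# A sketch of the fine-tuning loop under the stated hyper-parameters.
# `model` and `train_loader` (batches of 16 tokenized examples) are assumed.
EPOCHS, LR = 5, 5e-5
optimizer = AdamW(model.parameters(), lr=LR)
num_steps = EPOCHS * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_steps)
# For the multi-label ECHR task this would be BCEWithLogitsLoss instead.
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    for batch in train_loader:
        optimizer.zero_grad()
        logits = model(batch["input_ids"], batch["attention_mask"])
        loss = loss_fn(logits, batch["labels"])
        loss.backward()
        optimizer.step()
        scheduler.step()  # linear learning-rate decay per step
```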
4.2 Datasets

4.2.1 ECHR Violation (Multi-Label) Dataset

The dataset contains 11k cases from the European Convention of Human Rights public database (Chalkidis et al., 2021). Each case contains a list of paragraphs representing the facts of the case. The task is to predict which of the human rights articles of the Convention are violated (if any) in a given case. The target labels are 40 ECHR articles (Chalkidis et al., 2021).

Table 1 provides statistics of the ECHR Violation (Multi-Label) dataset. In particular, the dataset is separated into 3 folds: training, development and testing, with 9,000, 1,000 and 1,000 data points (cases), respectively. On average, the number of tokens within a case is between 1,619 and 1,926, which is more than the 512 tokens supported by BERT.

This is a multi-label classification task, where we follow Chalkidis et al. (2021) and evaluate the classification performance in terms of micro-F1 score.
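For reference, the micro-F1 score for this multi-label setting can be computed as in the following sketch. The arrays are toy values with 4 labels instead of 40, purely for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score

# Each row is one case; each column is one ECHR article (4 shown for brevity).
# The values below are illustrative, not taken from the dataset.
y_true = np.array([[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 0, 0]])

# Micro-F1 pools true/false positives over all (case, label) pairs
# before computing precision and recall.
print(f1_score(y_true, y_pred, average="micro"))
```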
2021).
4.2.2 Overruling Task Dataset

This dataset is composed of 2,400 data points, which are legal statements that are either overruled or not overruled by the same or a higher-ranked court (Sulea et al., 2017; Zheng et al., 2021).

We show the statistics of the Overruling Task Dataset in Table 2. The average and maximum number of tokens within a statement (i.e. case) are 21.94 and 204, respectively. Therefore, the BERT model should directly support this dataset without any alteration.

Following Zheng et al. (2021), the task is modeled as binary classification, where we conduct 10-fold cross-validation on the dataset. Finally, we report the average F1-score across the 10 folds, along with the standard deviation (Zheng et al., 2021).
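A sketch of this evaluation protocol, under the assumption of a hypothetical `train_and_eval_f1` helper that fine-tunes a fresh model and returns its F1-score on the held-out fold, might look as follows:

```python
import numpy as np
from sklearn.model_selection import KFold

# Train on 9 folds, evaluate F1 on the held-out fold, then report the mean
# and standard deviation. `train_and_eval_f1` is a hypothetical helper;
# `labels` is assumed to be a NumPy array aligned with `texts`.
def cross_validate(texts, labels, train_and_eval_f1, n_splits=10):
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(texts):
        scores.append(train_and_eval_f1(
            [texts[i] for i in train_idx], labels[train_idx],
            [texts[i] for i in test_idx], labels[test_idx]))
    return np.mean(scores), np.std(scores)
```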
| Fold | # Cases | Max # Words | Min # Words | Avg. # Words | Max # Labels | Min # Labels | Avg. # Labels |
|---|---|---|---|---|---|---|---|
| Training | 9,000 | 35,426 | 69 | 1,619.24 | 10 | 0 | 1.8 |
| Development | 1,000 | 14,493 | 84 | 1,784.03 | 7 | 0 | 1.7 |
| Testing | 1,000 | 15,919 | 101 | 1,925.73 | 6 | 1 | 1.7 |

Table 1: Statistics of the ECHR Violation (Multi-Label) dataset.

| Statistic | Value |
|---|---|
| # Cases | 2,400 |
| Max # Words | 204 |
| Min # Words | 1 |
| Avg. # Words | 21.94 |
| Ratio of Negative:Positive Labels | 1:1.03 |

Table 2: Statistics of the Overruling Task Dataset.

4.3 Model Variants

Next, we discuss the variants of adapting *Model, where *Model is a pre-trained BERT-based model from Table 3, to deal with long documents in the experiments. The methods used are as follows:

• RR-*Model – Remove tokens at the rear of the input text if the length is more than 512, and fine-tune the model on each classification task (similar to vanilla BERT (Devlin et al., 2019)).

• RF-*Model – Remove tokens at the front of the input text if the length is more than 512, and fine-tune the model on each classification task.

• MeanPool-*Model – Apply the model to every chunk of n tokens, before using a mean function to average the features over the same dimensions of the output vector representations of the chunks. Then, use a classification layer for each classification task. In this work, we set n = 200 (see the code sketch after this list).

• MaxPool-*Model – Apply the model to every chunk of n tokens, before using a max function to select features from each dimension, based on the highest scores among the same dimensions of the output vector representations of the chunks, as the final vector representation. Then, use a classification layer for each classification task.
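The following is a minimal sketch of the MeanPool-*/MaxPool-* variants, under one reading of the description above: each chunk of n = 200 tokens is encoded separately, the per-chunk [CLS] vectors are pooled element-wise, and the pooled vector is classified. This is an illustration, not the paper's implementation.

```python
import torch
from transformers import AutoModel

# A sketch of chunk-then-pool classification for one long document:
# split into chunks of n tokens, encode each chunk, pool element-wise
# over chunks (mean or max), then apply a classification layer.
class ChunkPoolClassifier(torch.nn.Module):
    def __init__(self, checkpoint="bert-base-uncased", num_labels=40,
                 chunk_size=200, pool="mean"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.classifier = torch.nn.Linear(self.encoder.config.hidden_size,
                                          num_labels)
        self.chunk_size, self.pool = chunk_size, pool

    def forward(self, input_ids):                      # (1, seq_len), one document
        chunks = input_ids[0].split(self.chunk_size)   # chunks of n tokens
        cls_vectors = [self.encoder(c.unsqueeze(0)).last_hidden_state[:, 0]
                       for c in chunks]                # one [CLS] vector per chunk
        stacked = torch.cat(cls_vectors, dim=0)        # (num_chunks, hidden)
        pooled = (stacked.mean(dim=0) if self.pool == "mean"
                  else stacked.max(dim=0).values)      # element-wise over chunks
        return self.classifier(pooled)                 # logits for the labels
```

In practice, the chunks would also carry attention masks and [CLS]/[SEP] markers; these are omitted here for brevity.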
In addition, we include two other baselines that use different attention techniques, in order to cope with documents longer than 512 tokens (a loading sketch is given after the list):

• BigBird – Fine-tuning the BigBird model from Zaheer et al. (2020), which was pre-trained using English language corpora, such as BookCorpus and the English portion of CommonCrawl News, on each classification task. BigBird is a variant of BERT that uses several attention techniques, such as random attention, window attention and global attention, so that it can deal with documents longer than 512 tokens.

• LongFormer – Fine-tuning the LongFormer model from Beltagy et al. (2020), which was pre-trained using BookCorpus and English Wikipedia, on each classification task. LongFormer is a variant of BERT that uses several attention techniques, such as sliding-window attention, dilated sliding-window attention, and global attention, so that it can handle documents longer than 512 tokens.
Law-BERT. This provides an insight that if the
with document longer than 512 tokens:
in-domain documents (or documents similar to the
• BigBird – Fine-tuning the BigBird from Za- task) are limited, pre-training the model on a large
heer et al. (2020), which was pre-trained using corpus could also lead to an effective performance.
English language corpora, such as BookCor- Next, as shown in Table 4, when comparing
pus and English portion of the CommonCrawl the performance of RR-* and RF-* in Table 4,
News, on each classification task. BigBird we found that the micro F-1 scores of RR-* (e.g.
is a variance of BERT that uses several at- 0.6466 for BERT) is worse than those of the corre-
tention techniques, such as, random attention, sponding RF-* (e.g. 0.6803 for BERT). This shows
213
| Model | Description |
|---|---|
| BERT | The BERT (bert-base-uncased) from Devlin et al. (2019), pre-trained using BookCorpus and English Wikipedia. |
| ECHR-Legal-BERT | The BERT (bert-base-uncased) from Chalkidis et al. (2020), pre-trained using legal documents including the ECHR dataset. |
| Harvard-Law-BERT | The BERT (bert-base-uncased) from Zheng et al. (2021), pre-trained using the entire Harvard Law case corpus. |
| RoBERTa | The RoBERTa (roberta-base) from Liu et al. (2019), pre-trained using English language corpora, such as BookCorpus and the English portion of CommonCrawl News. RoBERTa is a variant of BERT that trains only to optimize the dynamic masked language model objective. |

Table 3: Pre-trained BERT-based models used in the experiments.

5.1 ECHR Violation Dataset

Table 4 reports the performance, in terms of micro-F1 score, of the different approaches to dealing with long legal documents.

| Approach | Micro-F1 |
|---|---|
| RR-BERT | 0.6466 |
| RR-ECHR-Legal-BERT | 0.6699 |
| RR-Harvard-Law-BERT | 0.6590 |
| RR-RoBERTa | 0.6656 |
| RF-BERT | 0.6803 |
| RF-ECHR-Legal-BERT | 0.7090 |
| RF-Harvard-Law-BERT | 0.6896 |
| RF-RoBERTa | 0.6925 |
| MeanPool-BERT | 0.7075 |
| MeanPool-ECHR-Legal-BERT | 0.7196 |
| MeanPool-Harvard-Law-BERT | 0.7009 |
| MeanPool-RoBERTa | 0.6949 |
| MaxPool-BERT | 0.7110 |
| MaxPool-ECHR-Legal-BERT | 0.7213 |
| MaxPool-Harvard-Law-BERT | 0.7010 |
| MaxPool-RoBERTa | 0.7000 |
| BigBird | 0.7308 |
| LongFormer | 0.7238 |

Table 4: Comparison of performance on the ECHR Violation Dataset.

First, when comparing the performance of the different BERT pre-trained models, we found that *-ECHR-Legal-BERT outperformed the other pre-trained models across all of the methods used for adapting BERT to deal with long documents. This finding supports that pre-training BERT on documents that are more similar to the task leads to better performance. Note that ECHR-Legal-BERT was pre-trained using documents from the ECHR Violation Dataset, as mentioned in Table 3. Moreover, we observed that *-RoBERTa performs comparably to *-Harvard-Law-BERT. This provides an insight that if in-domain documents (or documents similar to the task) are limited, pre-training the model on a large corpus can also lead to effective performance.

Next, when comparing the performance of RR-* and RF-* in Table 4, we found that the micro-F1 scores of RR-* (e.g. 0.6466 for BERT) are worse than those of the corresponding RF-* (e.g. 0.6803 for BERT). This shows that, for the ECHR Violation Dataset, the back sections of the cases are more important than the front sections. Importantly, removing text at the back of the input, as suggested by Devlin et al. (2019), can lead to poor performance. In addition, the best approach, MaxPool-ECHR-Legal-BERT, achieved a 0.7213 micro-F1 score, which was significantly better than any of the RR-* and RF-* results, supporting that truncation worsened the classification performance.

Finally, we observed that BigBird and LongFormer (micro-F1 scores of 0.7308 and 0.7238, respectively) outperformed the other baselines that adapted BERT to deal with longer documents. This supports that BigBird and LongFormer, which were explicitly designed to deal with long documents using different variants of attention techniques, can lead to better performance than aggregating results from applying BERT on multiple chunks of text.

5.2 Overruling Task Dataset

In this section, we discuss the performance on the Overruling Task Dataset. As discussed in Section 4.2.2, the documents in this dataset are shorter than 512 tokens. Therefore, we can use BERT directly, without any changes.

Table 5 reports the performance in terms of F1-score averaged across 10-fold cross-validation, along with the standard deviation (STD). From Table 5, we observed that Harvard-Law-BERT and ECHR-Legal-BERT achieved the best and second-best performance (0.9756 and 0.9725, respectively). This supports the impact of pre-training on in-domain documents. Meanwhile, RoBERTa achieved the third-best performance (0.9683 F1-score), demonstrating that if no in-domain documents are available, pre-training on a large corpus can also be effective. These results are in line with the findings in Section 5.1.

On the other hand, BigBird and LongFormer (0.9570 and 0.9569, respectively) performed marginally worse than the other approaches. This could be due to the fact that BigBird and LongFormer are explicitly modelled to deal with long documents.
Specifically, for shorter documents, allowing the multi-head attention to freely attend to any token would lead to more effective performance than restricting it to particular sliding windows or specific areas (e.g. global attention or randomized attention).

| Approach | Mean F1 ± STD |
|---|---|
| BERT | 0.9656 ± 0.010 |
| ECHR-Legal-BERT | 0.9725 ± 0.005 |
| Harvard-Law-BERT | 0.9756 ± 0.010 |
| RoBERTa | 0.9683 ± 0.010 |
| BigBird | 0.9570 ± 0.010 |
| LongFormer | 0.9569 ± 0.009 |

Table 5: Comparison of performance, in terms of F1-score, of different BERT pre-trainings on the Overruling Task Dataset.

6 Discussions

In this section, we provide further discussion of the experimental results from Section 5, in order to answer the research questions posed in Section 3.

RQ1 For legal text classification, does pre-training on in-domain documents lead to more effective performance than pre-training on general documents?

Yes. Based on the experiments on both datasets, the models pre-trained on documents in the legal domain (ECHR-Legal-BERT and Harvard-Law-BERT) achieved the highest performance, as shown in Tables 4 and 5, respectively. In addition, as discussed in Sections 5.1 and 5.2, the competitive performance of RoBERTa on both datasets supports that pre-training on a large dataset can be a good option if in-domain data cannot be obtained.

RQ2 How can BERT-based models be adapted to effectively deal with long documents in legal text classification?

From the performance of RR-* and RF-* in Section 5.1, we found that truncating long documents (at either end) lessened the classification performance, due to the loss of data. From the experimental results reported in Table 4, BigBird and LongFormer (even though not pre-trained on in-domain documents) outperformed the other approaches that adapted BERT to deal with long documents. This highlights the importance of explicitly handling long documents when designing the model architecture. Next, both MaxPool-* and MeanPool-* achieved F1 performance markedly better than the truncation approaches. Therefore, it is most desirable to use BigBird or LongFormer, which were explicitly designed to deal with long legal documents. An alternative, but less effective, method is to apply BERT to chunks of n tokens, use an appropriate function (e.g. max or mean) to aggregate the vector representations across all the chunks, and then apply a classification layer, as described in Section 4.3.

7 Conclusions

We have discussed the challenges of using BERT for text classification in the legal domain, and posed two research questions regarding the pre-training documents and how to cope with long documents. To answer the research questions, we conducted experiments on the ECHR Violation Dataset and the Overruling Task Dataset. Our experimental results showed that models pre-trained on a domain similar to the task enhanced performance. In addition, the experiments on the ECHR Violation Dataset supported that truncating or discarding parts of a document resulted in poor performance. Importantly, BigBird and LongFormer, which explicitly handle long documents using different attention techniques, achieved the best performance on long legal document classification. Alternatively, applying BERT to chunks of text before aggregating the vector representations across all of the chunks using an appropriate function (e.g. max or mean) can achieve a reasonable result.

References

Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural legal judgment prediction in English. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4317–4323.

Ilias Chalkidis, Ion Androutsopoulos, and Achilleas Michos. 2018. Obligation and prohibition extraction using hierarchical RNNs. arXiv preprint arXiv:1805.03871.
Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: "Preparing the muppets for court". In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 2898–2904.

Ilias Chalkidis, Manos Fergadiotis, Dimitrios Tsarapatsanis, Nikolaos Aletras, Ion Androutsopoulos, and Prodromos Malakasiotis. 2021. Paragraph-level rationale extraction through regularization: A case study on European Court of Human Rights cases. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, Mexico City, Mexico. Association for Computational Linguistics.

Leon Derczynski, Eric Nichols, Marieke van Erp, and Nut Limsopatham. 2017. Results of the WNUT2017 shared task on novel and emerging entity recognition. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 140–147.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105.

Nut Limsopatham and Nigel Collier. 2015. Adapting phrase-based machine translation to normalise medical terms in social media messages. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1675–1680.

Nut Limsopatham and Nigel Collier. 2016. Normalising medical concepts in social media texts by learning semantic representation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1014–1023.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Ramesh Nallapati and Christopher D Manning. 2008. Legal docket classification: Where machine learning stumbles. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 438–446.

Raghavendra Pappagari, Piotr Zelasko, Jesús Villalba, Yishay Carmiel, and Najim Dehak. 2019. Hierarchical transformers for long document classification. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 838–844. IEEE.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Octavia-Maria Sulea, Marcos Zampieri, Shervin Malmasi, Mihaela Vela, Liviu P Dinu, and Josef Van Genabith. 2017. Exploring the use of text classification in the legal domain. arXiv preprint arXiv:1710.09306.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.

Hai Ye, Xin Jiang, Zhunchen Luo, and Wenhan Chao. 2018. Interpretable charge predictions for criminal cases: Learning to generate court views from fact descriptions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1854–1864.

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big Bird: Transformers for longer sequences. In NeurIPS.

Lucia Zheng, Neel Guha, Brandon R Anderson, Peter Henderson, and Daniel E Ho. 2021. When does pretraining help? Assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, pages 159–168.