
Long Document Summarization in a Low Resource Setting
using Pretrained Language Models

Ahsaas Bajaj* (University of Massachusetts Amherst)
Pavitra Dangati* (University of Massachusetts Amherst)
Kalpesh Krishna (University of Massachusetts Amherst)
Pradhiksha Ashok Kumar (University of Massachusetts Amherst)
Rheeya Uppaal (Goldman Sachs)
Bradford Windsor (Goldman Sachs)
Eliot Brenner (Goldman Sachs)
Dominic Dotterrer (Goldman Sachs)
Rajarshi Das (University of Massachusetts Amherst)
Andrew McCallum (University of Massachusetts Amherst)

* Equal Contribution

arXiv:2103.00751v1 [cs.CL] 1 Mar 2021
Abstract

Abstractive summarization is the task of compressing a long document into a coherent short document while retaining salient information. Modern abstractive summarization methods are based on deep neural networks which often require large training datasets. Since collecting summarization datasets is an expensive and time-consuming task, practical industrial settings are usually low-resource. In this paper, we study a challenging low-resource setting of summarizing long legal briefs with an average source document length of 4268 words and only 120 available (document, summary) pairs. To account for data scarcity, we used a modern pretrained abstractive summarizer BART (Lewis et al., 2020), which only achieves 17.9 ROUGE-L as it struggles with long documents. We thus attempt to compress these long documents by identifying salient sentences in the source which best ground the summary, using a novel algorithm based on GPT-2 (Radford et al., 2019) language model perplexity scores, that operates within the low resource regime. On feeding the compressed documents to BART, we observe a 6.0 ROUGE-L improvement. Our method also beats several competitive salience detection baselines. Furthermore, the identified salient sentences tend to agree with an independent human labeling by domain experts.

1 Introduction and Related Work

Text summarization is the task of generating a smaller coherent version of a document preserving key information. Typical abstractive summarization algorithms use seq2seq models with attention (Chopra et al., 2016), copy mechanisms (Gu et al., 2016), content selection (Cheng and Lapata, 2016), pointer-generator methods (See et al., 2017) and reinforcement learning (Wu and Hu, 2018). These methods perform well on high resource summarization datasets with small documents such as CNN/DailyMail (Nallapati et al., 2016), Gigaword (Rush et al., 2015), etc. However, summarization over long documents with thousands of tokens is a more practically relevant problem. Existing solutions focus on leveraging document structure (Cohan et al., 2018) or do mixed model summarization involving compression or selection followed by abstractive summarization (Liu et al., 2018; Gehrmann et al., 2018). However, these methods require large amounts of training data. Low resource settings are common in real world applications, as curating domain specific datasets, especially over long documents and on a large scale, is both expensive and time consuming.

[Figure 1: Our method for the long document summarization task in a low resource setting. The Extraction Model generates a compressed document D' by identifying salient sentences. It is trained by computing a salience score for each training set source sentence. The pretrained abstractive summarizer takes the compressed document as input.]

A human summarizing a long document would first understand the text, then highlight the important information, and finally paraphrase it to generate a summary. Building on this intuition, we present a low-resource long document summarization algorithm (Section 2) operating in 3 steps: (1) ground sentences of every training set summary into its source, identifying salient sentences; (2) train a salience classifier on this data, and use it to compress the source document during test time; (3) feed the compressed document to a state-of-the-art abstractive summarizer pretrained on a related domain to generate a coherent and fluent summary.

To tackle data scarcity, we use pretrained language models in all three steps, which show strong generalization (Devlin et al., 2019) and are sample efficient (Yogatama et al., 2019). Notably, our step (1) uses a novel method based on GPT-2 perplexity (Radford et al., 2019) to ground sentences.

Unlike prior work tackling data scarcity in summarization (Parida and Motlicek, 2019; Magooda and Litman, 2020), our method needs no synthetic data augmentation. Moreover, we study a significantly more resource constrained setting: a complex legal briefs dataset (Section 2) with only 120 available (document, summary) pairs and an average of 4.3K tokens per document; Parida and Motlicek (2019) assume access to 90,000 pairs with a maximum of 0.4K source document tokens, and Magooda and Litman (2020) use 370 pairs with 0.2K source document tokens.

Despite this challenging setup, our method beats an abstractor-only approach by 6 ROUGE-L points, and also beats several competitive salience detection baselines (Section 3). Interestingly, the identified salient sentences show agreement with an independent human labeling by domain experts, further validating the efficacy of our approach.

2 Dataset and Approach

To mimic the real world scenario of summarization over long domain-specific documents, we curate 120 document-summary pairs from publicly available Amicus Briefs [1], thus simulating the legal domain [2]. As shown in Table 1, our dataset is significantly smaller than the popular CNN/Daily Mail benchmark (Nallapati et al., 2016) and has significantly longer documents and summaries.

Dataset | #(S, T) | Avg. |S| | Avg. |T|
CNN/DM | 312084 | 781 | 56
Amicus | 120 | 4268 | 485

Table 1: A comparison between the Amicus legal briefs dataset and the popular CNN/Daily Mail benchmark. Amicus has far fewer document-summary pairs #(S, T), with more document tokens (Avg. |S|) and summary tokens (Avg. |T|) on average.

To tackle this low resource setting, we use the state-of-the-art abstractive summarizer BART (Lewis et al., 2020), pretrained on a dataset from a related domain (CNN/DM). Since BART was trained on short documents, it truncates documents longer than 1024 subwords. Hence, instead of feeding the whole source document as input to BART, we feed salient sentences extracted using a salience classifier. Our salience classification dataset is built using a novel method which grounds summary sentences to sentences in the source with language model perplexity scores. Our approach (Figure 1) resembles the extract-then-abstract paradigm popular in prior work (Gehrmann et al., 2018; Liu et al., 2018; Subramanian et al., 2019; Chen and Bansal, 2018).

[1] https://publichealthlawcenter.org/amicus-briefs
[2] The source contains detailed arguments that the court should consider for a case; the target summarizes them.
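For concreteness, the overall extract-then-abstract pipeline can be sketched as below. This is a minimal illustration, not the authors' released code: the callables passed in stand for the components detailed in the rest of this section (sentence splitting, the salience classifier, and the BART abstractor).

```python
# High-level sketch of the 3-step approach. The callables are placeholders
# for the concrete components described in the rest of Section 2.
from typing import Callable, Iterable, List

def extract_then_abstract(documents: Iterable[str],
                          split_sentences: Callable[[str], List[str]],
                          is_salient: Callable[[str], bool],
                          abstractor: Callable[[str], str]) -> List[str]:
    summaries = []
    for doc in documents:
        # Keep salient sentences in their original order ("Extraction Stage"),
        # forming a compressed document D'.
        compressed = " ".join(s for s in split_sentences(doc) if is_salient(s))
        # Feed the compressed document to the pretrained abstractive summarizer.
        summaries.append(abstractor(compressed))
    return summaries
```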
Extraction Stage: To extract the most important content from the source document required to generate the summary, we pose content selection as a binary classification task, labeling every sentence in the source document as salient or non-salient. Sentences classified as salient are concatenated in the order of occurrence in the source document [3] to generate a compressed "extractive summary", which is then fed to the abstractive summarizer. In addition to identifying important information, the salience classifier is able to remove repetitive boilerplate text which is common in technical documents but often irrelevant to the actual content.
[3] Maintaining the order of sentences ensures the logical flow of information is not disrupted.

Training Data for Salience Classification: Since we do not have sentence-level training data for the classifier, we construct it by grounding sentences of the ground truth summary to sentences in the source document. Consider a source document S consisting of m sentences s_1, ..., s_m and a target summary T consisting of n sentences t_1, ..., t_n, where m >> n. We compute the salience score for every source sentence s_i in S as (1/n) Σ_{j=1}^{n} f(s_i, t_j), where f(s, t) is a measure of how much source sentence s grounds target sentence t. Following this, we sort the sentences in the source document by salience score. The highest scoring 3n sentences are chosen as salient sentences and the lowest scoring 3n as non-salient sentences [4]. We construct our dataset for salience classification by running this algorithm for every (S, T) pair in the training dataset. To ensure generalization with limited training data, we incorporate transfer learning and build our classifier by finetuning BERT-base (Devlin et al., 2019) using transformers (Wolf et al., 2019). More details on training are provided in Appendix A.2.
[4] 3n is a tuned hyperparameter. Whenever m < 6n, we sort the sentences according to the salience score and assign salient to the top half and non-salient to the bottom half.

Choice of f(s, t): To measure how much a source sentence s grounds a target sentence t, we measure the perplexity of t conditioned on s, using the pretrained language model GPT-2 large (Radford et al., 2019). More formally, we concatenate s and t as [s; t], feed it as input to GPT-2 large, and calculate perplexity over the tokens of t. Here, a lower perplexity corresponds to a higher f(s, t) score. We find that this measure correlates with entailment and outperforms other choices of f(s, t) like n-gram overlap, sentence embedding similarity, and entailment classifiers (Section 3.3).

Abstraction Stage: Having compressed the source document using our extractor, we use a black-box pretrained abstractive summarizer trained on a related domain. In this work, we make use of the state-of-the-art model (i.e. BART), which is based on pretrained language models. Pretraining on CNN/DM helps BART generalize to unseen but related domains like legal briefs [5].
[5] Details on our BART setup are provided in Appendix A.3.

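A minimal sketch of the grounding step described above follows, assuming the HuggingFace transformers and torch packages. f(s, t) is realized as the (negative) perplexity of t conditioned on s under GPT-2 (the paper uses GPT-2 large), and the top/bottom 3n labeling mirrors the procedure in this section; variable and function names are illustrative, not the authors' released code.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def conditional_perplexity(s: str, t: str) -> float:
    """Perplexity of target sentence t given source sentence s (input [s; t])."""
    s_ids = tokenizer.encode(s)
    t_ids = tokenizer.encode(" " + t)
    input_ids = torch.tensor([s_ids + t_ids])
    # Mask the source tokens so the loss (mean NLL) is computed over t only.
    labels = input_ids.clone()
    labels[0, : len(s_ids)] = -100
    loss = model(input_ids, labels=labels).loss
    return math.exp(loss.item())

def salience_labels(source_sents, summary_sents):
    """Label the top-3n source sentences salient and the bottom-3n non-salient."""
    n = len(summary_sents)
    scores = []
    for s in source_sents:
        # Lower perplexity => s grounds the summary better => higher salience.
        avg_ppl = sum(conditional_perplexity(s, t) for t in summary_sents) / n
        scores.append(-avg_ppl)
    ranked = sorted(range(len(source_sents)), key=lambda i: scores[i], reverse=True)
    # Assumes m >= 6n; the paper falls back to a half/half split otherwise
    # (see footnote [4] above).
    salient = set(ranked[: 3 * n])
    non_salient = set(ranked[-3 * n :])
    return salient, non_salient
```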
3 Experiments

3.1 Evaluating the extractor

To evaluate our proposed extractor, we first check whether our salience classifier generalizes to a held-out test set [6]. Indeed, it achieves a classification accuracy of 73.66%, and qualitative analysis of the classifications confirms its ability to identify boilerplate sentences as non-salient. Our classifier compresses source documents by 61% on average [7]. Next, we evaluate the quality of the extracted salient sentences by checking the extent to which they overlap in information with the gold test set summaries, measuring ROUGE-1/2 recall scores. As shown in Table 2, our extractor outperforms a random selection of the same number of sentences and is comparable to the upper-bound recall performance achieved by feeding in the whole source document. Finally, to measure the extent to which our salience classifier matches human judgement, domain experts identified 8-10 salient sentences in four test documents with more than 200 sentences each on request. Despite their scarcity, our salience classifier recovers 64.7% of the marked sentences, confirming correlation with human judgments.
[6] Classifier data statistics at the salient/non-salient sentence level: Train=5363, Dev=1870, Test=2070.
[7] Note that the classifier score can be thresholded to obtain more or less compression depending on domain and end-task.
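The ROUGE recall comparison in Table 2 can be reproduced in spirit with the rouge-score package; the sketch below assumes the extracted sentences and the gold summary are plain strings, and the package choice is an assumption rather than the authors' exact evaluation code.

```python
# Sketch: ROUGE-1/2 recall of the gold summary against an extracted document.
# Assumes: pip install rouge-score
from rouge_score import rouge_scorer

def rouge_recall(extracted_text: str, gold_summary: str):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)
    # score(target, prediction): recall measures how much of the gold summary's
    # content is covered by the extracted sentences.
    scores = scorer.score(gold_summary, extracted_text)
    return {name: score.recall for name, score in scores.items()}

# Example usage:
# print(rouge_recall(" ".join(salient_sentences), gold_summary))
```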
Source | R-1 (Recall) | R-2 (Recall)
Whole Document | 87.75 | 50.67
Random Extractor | 78.66 | 38.53
Proposed Extractor | 81.78 | 43.96

Table 2: ROUGE-1/2 (R-1/2) recall scores of the gold summary with respect to the "Source" document. Our saliency-driven extractor performs better than a random selection of the same number of sentences and is close to the upper-bound recall performance achieved by feeding in the whole source document.

Extractor | Abstractor | R-1 | R-2 | R-L
NE | BART | 40.17 | 13.36 | 17.95
Random | BART | 41.96 | 13.30 | 17.91
TextRank | BART | 42.63 | 13.09 | 17.93
Bottom-up | BART | 42.41 | 14.50 | 20.76
Ours | BART | 44.97 | 15.37 | 23.95
NE | f.t. BART | 43.47 | 16.30 | 19.35
Random | f.t. BART | 44.63 | 15.11 | 18.57
TextRank | f.t. BART | 45.10 | 15.51 | 18.74
Bottom-up | f.t. BART | 44.89 | 17.26 | 23.40
Ours | f.t. BART | 47.07 | 17.64 | 24.40

Table 3: Comparison of our method on the Amicus dataset with strong baselines. Our method outperforms all baselines in both Abstractor settings: (1) a pretrained CNN/DM BART; (2) the pretrained CNN/DM BART finetuned on the Amicus dataset (f.t. BART).

[Figure 2: Perplexity distribution of the hypothesis given the premise for each of the three classes sampled from the MultiNLI dataset. Entailment pairs tend to have lower perplexity, validating our choice of f(s, t).]

Choice of f(s, t) | R-1 | R-2 | R-L
Entailment (using RoBERTa) | 43.66 | 16.95 | 23.24
Similarity (using BERT) | 44.67 | 16.69 | 23.81
BLEU (using nltk) | 43.95 | 17.38 | 23.69
Perplexity (using GPT-2) | 47.07 | 17.64 | 24.40

Table 4: Results of our extract-then-abstract pipeline (after finetuning BART) by varying f(s, t). Our choice of GPT-2 perplexity performs better than the 3 alternatives.

3.2 Evaluating the entire pipeline

We evaluate the entire pipeline by measuring the quality of the abstractive summaries obtained by feeding the extractive summary to BART. We study two abstractor settings: (1) treating BART as a black-box with no modification; (2) finetuning BART on the training and validation splits of the Amicus dataset [8]. We present results on the Amicus test set. We compare our model against several competitive baselines: (1) NE: no extraction; (2) Random: a random selection of the same number of sentences as our extractive summary; (3) TextRank (Mihalcea and Tarau, 2004; Liu et al., 2018): an unsupervised graph-based approach to rank text chunks within a document; (4) Bottom-up summarizer (Gehrmann et al., 2018): a strong extract-then-abstract baseline where content selection is posed as a word-level sequence tagging problem. Similar to our setting, their content selector also uses large pretrained models (ELMo, Peters et al., 2018), which we finetune on our training set.
[8] The training and validation splits together comprise 96 documents. The test split was not used.

As seen in Table 3, we observe a 4.8 / 6 ROUGE-1/L improvement when compared to the no extractor baseline (NE), and a 2.3 / 3.2 R-1/L improvement over the strongest extractor baseline (per metric), confirming the effectiveness of our method. In addition, finetuning the CNN/DM pretrained BART on 96 Amicus documents helps in domain adaptation and boosts the ROUGE scores of both the baselines and our method (f.t. BART). Specifically, we observe a 2.1 / 0.5 R-1/L boost in performance and outperform the best baseline (per metric) by 2.0 / 1.0 R-1/L points. Our model's improvements are statistically significant (p-value < 0.06), except that when comparing our extractor + f.t. BART with Bottom-up + f.t. BART, the p-value is 0.16 due to the small test set. Refer to Appendix A.3 for a qualitative analysis of our proposed model's generations.

3.3 Validating the choice of f(s, t)

In Section 2 we used GPT-2 perplexity scores to measure the extent to which a source sentence grounds a target sentence. To motivate this choice, we measure its correlation with existing entailment datasets. We randomly sample 5000 sentences from each class of the MultiNLI dataset (Williams et al., 2018) and compute the perplexity of the hypothesis with the premise as context. As seen in Figure 2, entailment pairs tend to have the lowest perplexity. This motivates our choice of f(s, t), since hypothesis sentences are best grounded in premise sentences for entailment pairs [9].
[9] We hypothesize contradiction sentences have slightly lower perplexity than neutral due to more word overlap.
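A sketch of this MultiNLI check, assuming the HuggingFace datasets package and reusing the conditional_perplexity helper sketched in Section 2; the choice of split and the sampling details are assumptions, since the paper only states that 5000 sentences were sampled per class.

```python
# Sketch: compare premise-conditioned perplexity of the hypothesis across
# MNLI classes (0 = entailment, 1 = neutral, 2 = contradiction).
# conditional_perplexity(s, t) is the GPT-2 helper from the Section 2 sketch.
import random
from collections import defaultdict
from datasets import load_dataset

mnli = load_dataset("multi_nli", split="train")  # split choice is an assumption

by_class = defaultdict(list)
for ex in mnli:
    by_class[ex["label"]].append(ex)

random.seed(0)
for label, name in [(0, "entailment"), (1, "neutral"), (2, "contradiction")]:
    sample = random.sample(by_class[label], k=min(5000, len(by_class[label])))
    ppls = [conditional_perplexity(ex["premise"], ex["hypothesis"]) for ex in sample]
    print(name, sum(ppls) / len(ppls))
```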
To further validate the merit of GPT-2 perplexity, we conduct ablations using alternatives for f(s, t): (1) the entailment score from a RoBERTa based MNLI classifier (Liu et al., 2019); (2) cosine similarity of averaged embeddings from the final layer of BERT (Devlin et al., 2019); (3) BLEU scores (Papineni et al., 2002). We present ROUGE scores using our whole extract-then-abstract pipeline with different choices of f(s, t) in Table 4. We note that perplexity performs the best, 2.4 ROUGE-1 better than the best alternative and 3.41 ROUGE-1 better than entailment [10].
[10] We hypothesize that RoBERTa overfits on the MNLI dataset, which also has known biases (Gururangan et al., 2018).

4 Conclusion

We tackle an important real-world problem of summarizing long domain-specific documents with very little training data. We propose an extract-then-abstract pipeline which uses GPT-2 perplexity and a BERT classifier to estimate sentence salience. This sufficiently compresses a document, allowing us to use a pretrained model (BART) to generate coherent & fluent summaries.
References

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 484–494, Berlin, Germany. Association for Computational Linguistics.
Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, San Diego, California. Association for Computational Linguistics.
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics.
Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of Empirical Methods in Natural Language Processing.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1631–1640, Berlin, Germany. Association for Computational Linguistics.
Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers).
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the Association for Computational Linguistics.
Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating wikipedia by summarizing long sequences.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Ahmed Magooda and Diane Litman. 2020. Abstractive summarization for low resource data using domain transfer and data synthesis.
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain. Association for Computational Linguistics.
Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
Shantipriya Parida and Petr Motlicek. 2019. Abstract text summarization: A low resource challenge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5994–5998, Hong Kong, China. Association for Computational Linguistics.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
Sandeep Subramanian, Raymond Li, Jonathan Pilault, and Christopher Pal. 2019. On extractive and abstractive neural document summarization with transformer language models.
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Conference of the North American Chapter of the Association for Computational Linguistics.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Yuxiang Wu and Baotian Hu. 2018. Learning to extract coherent summary via deep reinforcement learning.
Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomas Kocisky, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, et al. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.
A Appendix

A.1 Data pre-processing

In this section, the various pre-processing steps performed on the data at different stages are explained.

Extracting (document, summary) pairs: The 120 pairs of Amicus Briefs were scraped from their website [11]. The Summary of Arguments section of each brief was extracted as the target summary, and the main content, excluding the title page, table of contents, acknowledgements, appendix, etc., was extracted as the document/source.
[11] https://publichealthlawcenter.org/amicus-briefs

Sentence pre-processing: Sentences from the (document, summary) files were split using the spaCy [12] sentence splitter. Furthermore, the sentences were each processed to remove special characters using regex rules. If a sentence contained fewer than 5 words, it was pruned out from the computation of f(s, t) to reduce the number of pairs considered.
[12] https://pypi.org/project/spacy/

A.2 Sentence Saliency Classifier

Training Details: Our classifier uses the BERT sequence classification configuration [13] from transformers (Wolf et al., 2019), which is a pretrained BERT-base model with an initially untrained classification head on the [CLS] feature vector. This model is then finetuned for 5 epochs using the training data, which consists of 5363 sentences in the Amicus dataset (with an equal distribution between the two classes). We use a train / dev / test split of 60% / 20% / 20%. The training configuration of the classifier is as follows: learning rate = 2e-5, max grad norm = 1.0, num training steps = 1000, num warmup steps = 100, warmup proportion = 0.1, optimizer = Adam, scheduler = linear with warmup.
[13] https://huggingface.co/transformers/model_doc/bert.html#transformers.BertForSequenceClassification

Alternate methods to choose +/- samples: The aggregate scoring method mentioned in Section 2 was one choice for picking salient and non-salient samples for each document. The aggregate method compresses the source by 61% on average. The other methods experimented with were:

• Top k - Bottom k: For every t_j in T, we picked the top-k scoring source sentences as positive samples and the bottom-k sentences as negative samples, ensuring that {positive} ∩ {negative} = ∅. Using this technique, the classifier achieves an accuracy of nearly 1, as can be seen from Table 5. On qualitative analysis, we identified that there is a clear distinction between the positive and the negative examples. E.g., sentences such as "This document is prepared by XYZ" would be picked as non-salient, and the classifier is able to achieve high accuracy. This could however be used to train a classifier to identify boilerplate sentences across documents. This method compresses the source document by 63% on average.

• Random negative sampling: Salient examples were chosen for a document as per the above method. For the non-salient examples, we randomly sampled from the rest of the document. This allows the classifier to learn about sentences that are difficult to classify as positive or negative. Hence, the accuracy of the classifier is lower than with the other two methods, as can be seen from Table 5. This method compresses the source document by 70% on average.

Compute time and resources: Execution time for the different choices of f(s, t) over all 120 pairs:

• Perplexity using GPT-2: executes within 15 hrs using 2 GPUs
• Entailment score using RoBERTa: executes within 22 hrs using 2 GPUs
• Cosine similarity using BERT [CLS] embeddings: executes within 3 hrs on a single GPU
• BLEU score using nltk: executes within 15 min on a single GPU

These scores need to be generated only once and can be reused for various experiments. The sampling methods used to choose salient and non-salient sentences for each document take less than a minute to run.
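A minimal sketch of the salience-classifier finetuning described under Training Details above, using the transformers and torch packages. Only the listed hyperparameters come from the paper; the data loading, tokenization choices, and helper names are illustrative assumptions.

```python
# Sketch: finetune BERT-base as a binary salience classifier (Appendix A.2).
# Assumes train_texts / train_labels hold the labeled sentences from Section 2.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          get_linear_schedule_with_warmup)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

def make_loader(texts, labels, batch_size=32):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=128,
                    return_tensors="pt")
    ds = TensorDataset(enc["input_ids"], enc["attention_mask"],
                       torch.tensor(labels))
    return DataLoader(ds, batch_size=batch_size, shuffle=True)

def finetune(train_texts, train_labels, epochs=5):
    loader = make_loader(train_texts, train_labels)
    # Hyperparameters from Appendix A.2: lr 2e-5, 1000 training steps,
    # 100 warmup steps, gradient clipping at 1.0, linear warmup schedule
    # (AdamW is used here; the paper lists Adam).
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=100,
                                                num_training_steps=1000)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, labels in loader:
            out = model(input_ids=input_ids, attention_mask=attention_mask,
                        labels=labels)
            out.loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
    return model
```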
Sampling Method: Aggregate scoring for each source sentence
  BLEU 0.7813 | Perplexity 0.7366 | Entailment 0.6569 | Similarity 0.8391
Sampling Method: Top k-Bottom k source sentences for each summary sentence
  BLEU 0.9997 | Perplexity 0.9915 | Entailment 0.9973 | Similarity 1
Sampling Method: Top k for each summary sentence and random negative sampling from the remaining document
  BLEU 0.5784 | Perplexity 0.655 | Entailment 0.5611 | Similarity 0.6233

Table 5: The accuracy on the held-out set of Amicus for different classifiers trained on data prepared using different choices of f(s,t) and sampling methods. Here, k=3.

Analysis: (a) Table 5 shows the classifier accuracies for combinations of f(s,t) and sampling methods. We observe that for the aggregate sampling method, although the perplexity based classifier does not have the highest accuracy, the pipeline where f(s, t) is the perplexity score gives the best result (ROUGE) amongst the ablation experiments (Table 4). Classifier accuracy is determined on automated labelling based on the saliency score, rather than true labels, hence the best classifier does not imply the best summarization. (b) Table 6 shows examples of using perplexity as f(s,t) to see how the summary grounds the source. The table shows three summary sentences and the corresponding source sentences that had the lowest perplexity scores. We can see that the summary either has a similar meaning to or logically follows the source. (c) Table 7 has three examples each of salient and non-salient sentences inferred by the classifier trained on data prepared as described in Section 2. The third sentence in the non-salient list is an example of boilerplate content detected that is present across documents.

A.3 Abstractive Summarizer: BART

BART is a seq2seq model based on a denoising pretraining objective which is supposed to generalize better on various natural language understanding tasks, abstractive summarization being one of them. For the abstractive stage of our proposed approach, we use the bart.large.cnn variant, which is essentially the BART-large model (with 12 encoder and decoder layers and 400 million parameters) finetuned for the CNN/DM summarization task. We use the pre-computed weights available here [14]. Using BART's text generation script, we set the length penalty (lenpen) to 2.0 and the minimum length (min len) to 500 words in order to encourage BART to produce longer outputs, which is more suitable to our dataset. We also use a beam size of 4 and a no-repeat ngram size of 3.
[14] https://github.com/pytorch/fairseq/tree/master/examples/bart

Finetuning: We use the train and dev splits of the Amicus dataset (96 source-target pairs) and finetune BART for the summarization task starting from its CNN/DM finetuned checkpoint. First, we pre-process the dataset as per the guidelines in the official code [15]. We finetune for 500 epochs with a learning rate of 3e-5 and stop early if the validation loss does not decrease for 50 epochs. Other parameters are as follows: total num updates = 20000, warmup updates = 500, update freq = 4, optimizer = Adam with weight decay of 0.01. The rest of the parameters were kept at their defaults in the official script. Results (Precision, Recall, F1) on the test set of Amicus using the existing BART model and finetuned BART are shown in Table 8.
[15] https://github.com/pytorch/fairseq/blob/master/examples/bart/README.summarization.md

Table 9 shows an example of a target summary and the summary generated by our model (Section 2) for one sample source document. We can see that the summary generated by our model is fluent and has a coherent flow of information.
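The paper drives generation through fairseq's BART scripts; the sketch below uses the HuggingFace port of the same CNN/DM checkpoint (facebook/bart-large-cnn) with the decoding settings listed above. Treating the 500-word minimum as a token-level min_length is an approximation, and the whole snippet is an illustration rather than the authors' exact setup.

```python
# Sketch: abstractive stage with a CNN/DM-pretrained BART checkpoint.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

def summarize(compressed_document: str) -> str:
    inputs = tokenizer(compressed_document, max_length=1024,
                       truncation=True, return_tensors="pt")
    # Decoding settings from Appendix A.3: beam size 4, length penalty 2.0,
    # no-repeat ngram size 3, and a large minimum length to force long outputs.
    summary_ids = model.generate(inputs["input_ids"],
                                 num_beams=4,
                                 length_penalty=2.0,
                                 no_repeat_ngram_size=3,
                                 min_length=500,
                                 max_length=1024)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```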
1. Summary sentence: "In the immigration context, this jurisprudence has prompted the Court to reject the notion that the so-called entry fiction is of constitutional significance."
   Source sentence: "Prior to Knauff and Mezei, the distinction between noncitizens who had entered the United States and those who remained outside it had not been elevated to a bright-line constitutional rule, and entry had never been completely determinative of the fact or extent of protection under the Due Process Clause."

2. Summary sentence: "It has accordingly authorized such detention only in limited circumstances pursuant to a carefully defined scheme."
   Source sentence: "The Court's substantive due process jurisprudence also recognizes that an individual may be subjected to regulatory detention only in narrow circumstances under a carefully drawn scheme."

3. Summary sentence: "With respect to substantive due process, this Court has increasingly recognized the punitive consequences of indefinite regulatory detention."
   Source sentence: "Thus, the Court has substantially restricted the availability and duration of regulatory confinement in the — years since it decided Mezei. In Zadvydas, this Court established that its substantive due process jurisprudence provided the appropriate framework for evaluating the administrative detention of noncitizens pending removal from the United States."

Table 6: Using GPT-2 perplexity as f(s,t), here are three sentences from the summary, each with the corresponding source sentence that had the lowest perplexity. The summary either has a similar meaning to or logically follows the source.

Salient sentences:
• "The same time, the Court has long been skeptical of the military's authority to try individuals other than active service personnel."
• "On the basis of this revised test, the Court of Appeals refused to apply the exceptional circumstances exception to Al-Nashiri's petition."
• "Consonant with that tradition, this Court should review the Court of Appeals' decision to confirm that exceptional delay before trial remains of central concern on habeas review and is indeed one of the very dangers the writ of habeas corpus was designed to avoid."

Non-salient sentences:
• "A government predicated on checks and balances serves not only to make Government accountable but also to secure individual liberty."
• "At present, the Rules for Courts-Martial require that the accused be brought to trial within 120 days after the earlier of preferral of charges or confinement."
• "Respectfully submitted, May 31, 2017 LINDA A. KLEIN Counsel of Record AMERICAN BAR ASSOCIATION 321 North Clark Street Chicago ..."

Table 7: This table shows sentences classified as salient and non-salient from one Amicus source document. The last sentence in the non-salient list is an example of boilerplate content present across documents. The classifier is trained on data chosen by the aggregate score of source sentences, where f(s,t) is GPT-2 perplexity.

Metric BART Ours + BART f.t. BART Ours + f.t. BART


Recall 40.87 47.46 46.90 56.04
ROUGE-1 Precision 47.21 49.97 48.68 46.16
F-1 40.17 44.97 43.47 47.07
Recall 13.76 16.54 17.84 21.50
ROUGE-2 Precision 15.46 17.04 17.84 17.10
F-1 13.36 15.37 16.30 17.64
Recall 18.34 25.58 21.30 29.62
ROUGE-L Precision 21.04 26.27 21.35 23.47
F-1 17.95 23.95 19.35 24.40

Table 8: Overall pipeline results by adding our extractor (f(s,t) as GPT-2 perplexity + Classifier) to BART and
finetuned BART (f.t. BART), including the precision and recall values for each metric.
Target summary:
This Court's determination of whether due process under the New Hampshire Constitution requires court-appointed counsel for indigent parent-defendants, in order to protect their fundamental right to parent, requires the balancing of three factors: (1) the private interest at stake, (2) the risk of error and the value of procedural safeguards, and (3) the state's interest. See In re Shelby R., 148 N.H. 237, 240 (2002) (citing In re Richard A., 146 N.H. 295, 298 (2001)). Because there is no dispute that the fundamental right to parent is at stake in abuse and neglect proceedings, the ABA focuses its discussion on the second and third factors of the three factor test. As to the second, so-called "risk of error" factor, the ABA's conclusion, after years of investigation and analysis, is that the absence of counsel for indigent parent-defendants in abuse and neglect proceedings results in a significant risk of an erroneous determination. This is especially true where the opposing party is the State. As to the third, state's interest factor, the ABA's investigation shows that the interests of both the parent and the state are best served where indigent parent-defendants are represented. The ABA respectfully suggests that the evidence and analysis relevant to these two factors is so compelling in most, if not all, abuse and neglect proceedings involving indigent parent-defendants, that a case-by-case balancing of the factors should be rejected in favor of a rule requiring the appointment of counsel for indigent parent-defendants in all such proceedings. The evidence and analysis supporting the ABA's policy includes the fact that a substantial majority of states have recognized an unqualified right to counsel for indigent parent-defendants in child custody proceedings. Similarly, other industrial democracies provide indigent parent-defendants with such a right to counsel. The ABA respectfully submits that this Court should require no less as a matter of due process under the New Hampshire Constitution. Although In re Shelby R. resulted in a plurality ruling, the Court was not split on the question of whether or not a natural parent's role in the family is a fundamental liberty interest protected by the State Constitution. See In re Shelby R., 148 N.H. at 244 (dissenting opinion).

Generated summary (our extractor + f.t. BART):
Hampshire constitution requires this court to determine whether indigentparents have a legally protected interest.
Most indigent parent - defendants are incapable of performingthe advocacy functions required in abuse and
neglectproceedings. Most unrepresented parents cannot perform the advocacy functions - - including investigating
facts , making an orderly factual presentation , and cross - examining witnesses - - that are required. The intense,
emotionally charged backdrop against which custody decisionsare often made further exacerbates the inherent
disadvantages faced by unrepresented indigent parents. The need for counsel for the indigentparent - defendant is
especially great where the opposing party is the state. The court must weighthree factors : ( 1) the private interests
that will be affected. ( 2) the risk of erroneousdeprivation of the liberty interest through the procedures used and the
value , if any, ofadditional or substitute procedural safeguards. ( 3) the state ’ s interest , including the function
involved and fiscal and administrative burdens that additional or substituteprocedural requirements would entail id
at 240 ; see also in re father , 155 n . h . 93 , 95 ( 2007 ) . this court has previously concluded as to the first factor
that adversary child custody proceedings implicate a fundamental liberty interest - - the right to parent in this case,
the central question thus becomes whether that right is sufficiently protected. The conclusion that counsel must be
provided is so compelling in most , if not all cases , that a case - by - case balancing of the factors should be rejected
in favor of a rule requiring the appointment of counsel for lowincome parent - defendant in all such proceedings to be
constitutionally acceptable. The state is not the only adversary finding the only meaningful right to be heard when her
adversary is not represented by counsel is not spaled against the traditional weapons of the state, such as the state’s
attorney general. The courts must also weigh the public interest in the child custody case, including the function
involved and the cost of additional or substitute safeguards, as well as the cost to the state of the additional or substituted
safeguards. The risk of an erroneous deprivation of the findamentalright to parent only increases the only increase in
the risk that the state will find the child is not heard when the state is the adversary. The public interest is only
increased by the fact that the child will not be heard by the state when the parent is represented by a lawyer.
The high level of complexity of child custody cases makes it difficult for the court to make a fair and just decision.

Table 9: The table shows the comparison of summaries where the top summary is the target summary and the
bottom summary is the one generated by our extractor and f.t BART. As we can see, the summary is coherent and
has fluent information flow.
