Extractive Summarization as Text Matching
Ming Zhong∗, Pengfei Liu∗, Yiran Chen, Danqing Wang, Xipeng Qiu†, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{mzhong18,pfliu14,yrchen19,dqwang18,xpqiu,xjhuang}@fudan.edu.cn
Abstract
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems. Instead of following the commonly used framework of extracting sentences individually and modeling the relationship between sentences, we formulate the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries (extracted from the original text) are matched in a semantic space.
Table 1: Datasets overview. SDS represents single-document summarization and MDS represents multi-document summarization. The Doc. and Sum. columns give the average length of the documents and summaries in the test set, respectively. # Ext denotes the number of sentences to be extracted for each dataset.
Table 3: Results on the CNN/DM test set. Models marked with ∗ use the large version of BERT. BERTEXT† adds an additional Pointer Network compared with the other BERTEXT variants in this table.

When 0.005 < γ2 < 0.05, these hyperparameters have little effect on performance; otherwise, they cause performance degradation. We use the validation set to save the three best checkpoints during training and record the performance of the best checkpoints on the test set. Importantly, all the experimental results listed in this paper are the average of three runs. To obtain a Siamese-BERT model on CNN/DM, we use 8 Tesla V100-16G GPUs for about 30 hours of training. For all datasets, we remove samples with an empty document or summary and truncate each document to 512 tokens; therefore, ORACLE in this paper is calculated on the truncated datasets. Details of the candidate summaries for the different datasets can be found in Table 2.
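As a rough illustration of this preprocessing, the sketch below drops samples with an empty document or summary and truncates each document to 512 tokens. It is not the authors' released code; the tokenizer choice and the field names are assumptions made for the example.

```python
from transformers import BertTokenizerFast

# Any BERT-compatible tokenizer works for illustration; BERT-base is assumed here.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def preprocess(samples, max_tokens=512):
    """Drop empty samples and truncate each document to `max_tokens` tokens."""
    cleaned = []
    for sample in samples:  # hypothetical fields: "document" and "summary"
        doc = sample.get("document", "").strip()
        summary = sample.get("summary", "").strip()
        if not doc or not summary:
            continue  # remove samples with an empty document or summary
        ids = tokenizer.encode(doc, truncation=True, max_length=max_tokens)
        cleaned.append({"document_ids": ids, "summary": summary})
    return cleaned
```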
5.3 Experimental Results

Results on CNN/DM  As shown in Table 3, we list strong baselines with different learning approaches. The first section contains LEAD, ORACLE and MATCH-ORACLE⁴. Because we prune documents before matching, MATCH-ORACLE is relatively low.

We can see from the second section that although RL can score the entire summary, it does not lead to much performance improvement. This is probably because it still relies on sentence-level summarizers such as Pointer Networks or sequence labeling models, which select sentences one by one, rather than distinguishing the semantics of different summaries as a whole. Trigram Blocking is a simple yet effective heuristic on CNN/DM, even better than all redundancy removal methods based on neural models.

Compared with these models, our proposed MATCHSUM outperforms all competitors by a large margin. For example, it beats BERTEXT by 1.51 ROUGE-1 points when using BERT-base as the encoder. Additionally, even compared with the baseline that uses a BERT-large pre-trained encoder, our MATCHSUM (BERT-base) still performs better. Furthermore, when we change the encoder to RoBERTa-base (Liu et al., 2019), the performance can be further improved; we think this is because RoBERTa introduced 63 million English news articles during pretraining. The superior performance on this dataset demonstrates the effectiveness of our proposed matching framework.

⁴ LEAD and ORACLE are common baselines in the summarization task. The former extracts the first several sentences of a document as the summary; the latter is the ground truth used to train extractive models. MATCH-ORACLE is the ground truth used to train MATCHSUM.

Results on Datasets with Short Summaries  Reddit and XSum have been heavily evaluated by abstractive summarizers due to their short summaries. Here, we evaluate our model on these two datasets to investigate whether MATCHSUM can still improve over other typical extractive models when dealing with summaries containing fewer sentences.
Model                    |    WikiHow          |     PubMed          |   Multi-News
                         | R-1    R-2    R-L   | R-1    R-2    R-L   | R-1    R-2    R-L
LEAD                     | 24.97   5.83  23.24 | 37.58  12.22  33.44 | 43.08  14.27  38.97
ORACLE                   | 35.59  12.98  32.68 | 45.12  20.33  40.19 | 49.06  21.54  44.27
MATCH-ORACLE             | 35.22  10.55  32.87 | 42.21  15.42  37.67 | 47.45  17.41  43.14
BERTEXT                  | 30.31   8.71  28.24 | 41.05  14.88  36.57 | 45.80  16.42  41.53
 + 3gram-Blocking        | 30.37   8.45  28.28 | 38.81  13.62  34.52 | 44.94  15.47  40.63
 + 4gram-Blocking        | 30.40   8.67  28.32 | 40.29  14.37  35.88 | 45.86  16.23  41.57
MATCHSUM (BERT-base)     | 31.85   8.98  29.58 | 41.21  14.91  36.75 | 46.20  16.51  41.89

Table 5: Results on the test sets of WikiHow, PubMed and Multi-News. MATCHSUM beats the state-of-the-art BERT model with Ngram Blocking on all of these datasets from different domains.
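The R-1, R-2 and R-L columns report ROUGE-1, ROUGE-2 and ROUGE-L scores. As a generic illustration only (the paper does not state which ROUGE implementation or settings it uses), such scores can be computed with the rouge-score package:

```python
from rouge_score import rouge_scorer

# Generic ROUGE computation; the paper's exact ROUGE settings are not specified here.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "the cat sat on the mat ."
prediction = "a cat was sitting on the mat ."
scores = scorer.score(reference, prediction)  # signature: score(target, prediction)

for name, score in scores.items():
    print(f"{name}: F1 = {score.fmeasure:.4f}")
```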
When taking just one sentence to match the original document, MATCHSUM degenerates into a re-ranking of sentences. Table 4 illustrates that even this degenerate case still brings a small improvement (compared to BERTEXT (Num = 1), 0.88 ∆R-1 on Reddit and 0.82 ∆R-1 on XSum). However, when the number of sentences increases to two and summary-level semantics need to be taken into account, MATCHSUM obtains a more remarkable improvement (compared to BERTEXT (Num = 2), 1.04 ∆R-1 on Reddit and 1.62 ∆R-1 on XSum).

In addition, our model maps each candidate summary as a whole into the semantic space, so it can flexibly choose any number of sentences, while most other methods can only extract a fixed number of sentences. From Table 4, we can see that this advantage leads to further performance improvement.
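A minimal sketch of this summary-level matching idea follows. It scores each whole candidate summary by the cosine similarity between its encoding and the document encoding, using the [CLS] vectors of an off-the-shelf BERT; the encoder sharing, pooling and candidate generation details here are assumptions for illustration, not the exact MATCHSUM implementation.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(text):
    """Encode text and use its [CLS] vector as the semantic representation."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0]  # shape: (1, hidden_size)

def best_candidate(document, candidates):
    """Return the candidate summary closest to the document in the semantic space."""
    doc_vec = embed(document)
    scores = [torch.cosine_similarity(doc_vec, embed(c)).item() for c in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```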
Results on Datasets with Long Summaries  When the summary is relatively long, summary-level matching becomes more complicated and harder to learn. We aim to compare the difference between Trigram Blocking and our model when dealing with long summaries.

Table 5 shows that although Trigram Blocking works well on CNN/DM, it does not always maintain a stable improvement. Ngram Blocking has little effect on WikiHow and Multi-News, and it causes a large performance drop on PubMed. We think the reason is that Ngram Blocking cannot really understand the semantics of sentences or summaries; it merely restricts a multi-word entity to appearing only once, which is obviously unsuitable for the scientific domain, where entities often need to appear multiple times.
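For reference, Trigram Blocking is usually implemented as a greedy filter along the lines of the sketch below (a generic n-gram blocking routine, not the exact code behind the BERTEXT baselines): a sentence is skipped if it shares an n-gram with the summary selected so far, which is exactly the constraint that hurts domains where entities legitimately repeat.

```python
def ngram_block(ranked_sentences, num_extract=3, n=3):
    """Greedily pick top-ranked sentences, skipping any sentence that repeats
    an n-gram (default: trigram) already present in the selected summary."""
    def ngrams(text, n):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    selected, seen = [], set()
    for sent in ranked_sentences:      # sentences sorted by model score, best first
        grams = ngrams(sent, n)
        if grams & seen:               # overlaps an already chosen sentence -> block
            continue
        selected.append(sent)
        seen |= grams
        if len(selected) == num_extract:
            break
    return selected
```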
On the contrary, our proposed method does not have these strong constraints, but instead aligns the original document with the summary in the semantic space. Experimental results show that our model is robust across all domains; on WikiHow in particular, MATCHSUM beats the state-of-the-art BERT model by 1.54 ROUGE-1.

5.4 Analysis

In the following, our analysis is driven by two questions:
1) Are the benefits of MATCHSUM consistent with the properties of the datasets analyzed in Section 3?
2) Why does our model achieve different performance gains on different datasets?

Dataset Splitting Testing  Typically, we choose the three datasets (XSum, CNN/DM and WikiHow) with the largest performance gains for this experiment. We split each test set into five parts of roughly equal size according to z described in Section 3.2, and then experiment on each subset.

Figure 4 shows that the performance gap between MATCHSUM and BERTEXT is always the smallest when the best-summary is not a pearl-summary (z = 1). This phenomenon is in line with our understanding: in these samples, the ability of the summary-level extractor to discover pearl-summaries brings no advantage.

As z increases, the performance gap generally tends to increase. Specifically, the benefit of MATCHSUM on CNN/DM is highly consistent with the appearance of pearl-summaries. It brings an improvement of only 0.49 in the subset with the smallest z, but this rises sharply to 1.57 when z reaches its maximum value. WikiHow is similar to CNN/DM: when the best-summary consists entirely of the highest-scoring sentences, the performance gap is obviously smaller than in other samples. XSum
[Figure 4: three panels plotting ∆R (y-axis) against z from small to large (x-axis, subsets 1-5).]
Figure 4: Dataset splitting experiment. We split the test sets into five parts according to z described in Section 3.2. The x-axis from left to right indicates the subsets of the test set with z from small to large, and the y-axis represents the ROUGE improvement of MATCHSUM over BERTEXT on each subset.
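A minimal sketch of the splitting procedure behind Figure 4, assuming each test sample already carries its z value from Section 3.2 (the helper below is hypothetical, not the authors' evaluation script): sort the samples by z, cut them into five roughly equal bins, and then compare the two systems within each bin.

```python
def split_by_z(samples, num_bins=5):
    """Sort test samples by z and split them into `num_bins` roughly equal parts.
    Each sample is a dict assumed to contain a 'z' field (rank of the best-summary)."""
    ordered = sorted(samples, key=lambda s: s["z"])
    bin_size, remainder = divmod(len(ordered), num_bins)
    bins, start = [], 0
    for i in range(num_bins):
        end = start + bin_size + (1 if i < remainder else 0)
        bins.append(ordered[start:end])
        start = end
    return bins  # bins[0]: smallest z, bins[-1]: largest z
```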
[Figure: ψ(D) on the y-axis, with tick values from 0.4 to 0.7.]
BERTEXT on dataset D. Moreover, compared with the inherent gap between sentence-level and summary-level extractors, we define the ratio that