
Extractive Summarization as Text Matching

Ming Zhong∗, Pengfei Liu∗, Yiran Chen, Danqing Wang, Xipeng Qiu†, Xuanjing Huang
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University
School of Computer Science, Fudan University
825 Zhangheng Road, Shanghai, China
{mzhong18,pfliu14,yrchen19,dqwang18,xpqiu,xjhuang}@fudan.edu.cn

∗ These two authors contributed equally.
† Corresponding author.

arXiv:2004.08795v1 [cs.CL] 19 Apr 2020

Abstract

This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems. Instead of following the commonly used framework of extracting sentences individually and modeling the relationship between sentences, we formulate the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries (extracted from the original text) will be matched in a semantic space. Notably, this paradigm shift to a semantic matching framework is well-grounded in our comprehensive analysis of the inherent gap between sentence-level and summary-level extractors based on the properties of the datasets. Besides, even when instantiating the framework with a simple form of matching model, we have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1). Experiments on the other five datasets also show the effectiveness of the matching framework. We believe the power of this matching-based summarization framework has not been fully exploited. To encourage more instantiations in the future, we have released our code, processed datasets, and generated summaries at https://github.com/maszhongming/MatchSum.

Figure 1: MatchSum framework. We match the contextual representations of the document with the gold summary and candidate summaries (extracted from the document). Intuitively, better candidate summaries should be semantically closer to the document, while the gold summary should be the closest.

1 Introduction

The task of automatic text summarization aims to compress a textual document to a shorter highlight while keeping the salient information of the original text. In this paper, we focus on extractive summarization since it usually generates semantically and grammatically correct sentences (Dong et al., 2018; Nallapati et al., 2017) and computes faster.

Currently, most neural extractive summarization systems score and extract sentences (or smaller semantic units (Xu et al., 2019)) one by one from the original text, model the relationship between the sentences, and then select several sentences to form a summary. Cheng and Lapata (2016); Nallapati et al. (2017) formulate the extractive summarization task as a sequence labeling problem and solve it with an encoder-decoder framework. These models make independent binary decisions for each sentence, resulting in high redundancy. A natural way to address this problem is to introduce an auto-regressive decoder (Chen and Bansal, 2018; Jadhav and Rajan, 2018; Zhou et al., 2018), allowing the scoring operations of different sentences to influence each other. Trigram Blocking (Paulus et al., 2017; Liu and Lapata, 2019), a more popular method recently, has the same motivation. At the stage of selecting sentences to form a summary, it skips any sentence that has trigram overlap with the previously selected sentences. Surprisingly, this simple method of removing duplication brings a remarkable performance improvement on CNN/DailyMail.
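Trigram Blocking itself is easy to reproduce. The following is a minimal sketch of the heuristic as described above, not the authors' released code; the tokenized, score-ranked sentence list is an assumed input.

```python
def trigrams(tokens):
    # Set of word trigrams in a token sequence.
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def trigram_blocking(ranked_sentences, num_select=3):
    """Greedily pick sentences in score order, skipping any sentence whose
    trigrams overlap with the already selected ones.

    ranked_sentences: list of token lists, sorted by model score (descending).
    """
    selected, seen = [], set()
    for sent in ranked_sentences:
        tri = trigrams(sent)
        if tri & seen:          # trigram overlap with a previous selection
            continue
        selected.append(sent)
        seen |= tri
        if len(selected) == num_select:
            break
    return selected
```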
The above systems that model the relationship between sentences are essentially sentence-level extractors and do not consider the semantics of the entire summary. This makes them more inclined to select highly generalized sentences while ignoring the coupling of multiple sentences. Narayan et al. (2018b); Bae et al. (2019) utilize reinforcement learning (RL) to achieve summary-level scoring, but are still limited to the architecture of sentence-level summarizers.

To better understand the advantages and limitations of sentence-level and summary-level approaches, we conduct an analysis on six benchmark datasets (in Section 3) to explore the characteristics of these two methods. We find that there is indeed an inherent gap between the two approaches across these datasets, which motivates us to propose the following summary-level method.

In this paper, we propose a novel summary-level framework (MatchSum, Figure 1) and conceptualize extractive summarization as a semantic text matching problem. The principal idea is that a good summary should be more semantically similar as a whole to the source document than the unqualified summaries. Semantic text matching is an important research problem that estimates the semantic similarity between a source and a target text fragment; it has been applied in many fields, such as information retrieval (Mitra et al., 2017), question answering (Yih et al., 2013; Severyn and Moschitti, 2015), natural language inference (Wang and Jiang, 2016; Wang et al., 2017) and so on. One of the most conventional approaches to semantic text matching is to learn a vector representation for each text fragment and then apply typical similarity metrics to compute the matching scores.

Specific to extractive summarization, we propose a Siamese-BERT architecture to compute the similarity between the source document and the candidate summary. Siamese-BERT leverages the pre-trained BERT (Devlin et al., 2019) in a Siamese network structure (Bromley et al., 1994; Hoffer and Ailon, 2015; Reimers and Gurevych, 2019) to derive semantically meaningful text embeddings that can be compared using cosine similarity. A good summary has the highest similarity among a set of candidate summaries.

We evaluate the proposed matching framework and perform significance testing on a range of benchmark datasets. Our model outperforms strong baselines significantly in all cases and improves the state-of-the-art extractive result on CNN/DailyMail. Besides, we design experiments to observe the gains brought by our framework.

We summarize our contributions as follows:

1) Instead of scoring and extracting sentences one by one to form a summary, we formulate extractive summarization as a semantic text matching problem and propose a novel summary-level framework. Our approach bypasses the difficulty of summary-level optimization by contrastive learning, that is, a good summary should be more semantically similar to the source document than the unqualified summaries.

2) We conduct an analysis to investigate whether extractive models must do summary-level extraction based on the properties of the dataset, and attempt to quantify the inherent gap between sentence-level and summary-level methods.

3) Our proposed framework achieves superior performance compared with strong baselines on six benchmark datasets. Notably, we obtain a state-of-the-art extractive result on CNN/DailyMail (44.41 in ROUGE-1) by only using the base version of BERT. Moreover, we seek to observe where the performance gain of our model comes from.

2 Related Work

2.1 Extractive Summarization

Recent research on extractive summarization spans a large range of approaches. These works usually instantiate their encoder-decoder framework by choosing an RNN (Zhou et al., 2018), Transformer (Zhong et al., 2019b; Wang et al., 2019) or GNN (Wang et al., 2020) as the encoder, with non-auto-regressive (Narayan et al., 2018b; Arumae and Liu, 2018) or auto-regressive decoders (Jadhav and Rajan, 2018; Liu and Lapata, 2019). Despite their effectiveness, these models are essentially sentence-level extractors whose individual scoring process favors the highest-scoring sentences, which are probably not the optimal ones to form a summary1.

The application of RL provides a means of summary-level scoring and brings improvement (Narayan et al., 2018b; Bae et al., 2019). However, these efforts are still limited to auto-regressive or non-auto-regressive architectures. Besides, among non-neural approaches, the Integer Linear Programming (ILP) method can also be used for summary-level scoring (Wan et al., 2015).

1 We will quantify this phenomenon in Section 3.
In addition, there is some work that solves extractive summarization from a semantic perspective before this paper, such as concept coverage (Gillick and Favre, 2009), reconstruction (Miao and Blunsom, 2016) and maximizing semantic volume (Yogatama et al., 2015).

2.2 Two-stage Summarization

Recent studies (Alyguliyev, 2009; Galanis and Androutsopoulos, 2010; Zhang et al., 2019a) have attempted to build two-stage document summarization systems. Specific to extractive summarization, the first stage is usually to extract some fragments of the original text, and the second stage is to select or modify on the basis of these fragments.

Chen and Bansal (2018) and Bae et al. (2019) follow a hybrid extract-then-rewrite architecture, with policy-based RL to bridge the two networks together. Lebanoff et al. (2019); Xu and Durrett (2019); Mendes et al. (2019) focus on the extract-then-compress learning paradigm, namely compressive summarization, which first trains an extractor for content selection. Our model can be viewed as an extract-then-match framework, which also employs a sentence extractor to prune unnecessary information.

3 Sentence-Level or Summary-Level? A Dataset-dependent Analysis

Although previous work has pointed out the weakness of sentence-level extractors, there is no systematic analysis of the following questions: 1) For extractive summarization, is the summary-level extractor better than the sentence-level extractor? 2) Given a dataset, which extractor should we choose based on the characteristics of the data, and what is the inherent gap between these two extractors?

In this section, we investigate the gap between sentence-level and summary-level methods on six benchmark datasets, which can instruct us in searching for an effective learning framework. It is worth noting that the sentence-level extractor we use here does not include a redundancy removal process, so that we can estimate the effect of the summary-level extractor on redundancy elimination. Notably, the analysis method presented in this section to estimate the theoretical effectiveness is general and can be applied to any summary-level approach.

3.1 Definition

We refer to D = {s_1, ..., s_n} as a single document consisting of n sentences, and C = {s'_1, ..., s'_k | s'_i ∈ D} as a candidate summary including k (k ≤ n) sentences extracted from the document. Given a document D with its gold summary C*, we measure a candidate summary C by calculating the ROUGE (Lin and Hovy, 2003) value between C and C* at two levels:

1) Sentence-Level Score:

    g_sen(C) = (1 / |C|) Σ_{s ∈ C} R(s, C*),    (1)

where s is a sentence in C and |C| represents the number of sentences. R(·) denotes the average ROUGE score2. Thus, g_sen(C) indicates the average overlap between each sentence in C and the gold summary C*.

2) Summary-Level Score:

    g_sum(C) = R(C, C*),    (2)

where g_sum(C) considers the sentences in C as a whole and then calculates the ROUGE score against the gold summary C*.

Pearl-Summary We define the pearl-summary to be a summary that has a lower sentence-level score but a higher summary-level score.

Definition 1 A candidate summary C is defined as a pearl-summary if there exists another candidate summary C' that satisfies the inequality: g_sen(C') > g_sen(C) while g_sum(C') < g_sum(C). Clearly, if a candidate summary is a pearl-summary, it is challenging for sentence-level summarizers to extract it.

Best-Summary The best-summary refers to the summary with the highest summary-level score among all the candidate summaries.

Definition 2 A summary Ĉ is defined as the best-summary when it satisfies: Ĉ = argmax_{C ∈ C} g_sum(C), where C denotes all the candidate summaries of the document.

3.2 Ranking of Best-Summary

For each document, we sort all candidate summaries3 in descending order based on the sentence-level score, and then define z as the rank index of the best-summary Ĉ.

2 Here we use the mean F1 of ROUGE-1, ROUGE-2 and ROUGE-L.
3 We use an approximate method here: take #Ext (see Table 1) of the ten highest-scoring sentences to form candidate summaries.
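The two scores and the rank z follow directly from these definitions. Below is a minimal sketch assuming a rouge(hyp, ref) function that returns the mean F1 of ROUGE-1/2/L (for example built on a package such as rouge-score); it illustrates Eq. (1)-(2) and Definition 2, and is not the released analysis code.

```python
def sentence_level_score(candidate_sents, gold_summary, rouge):
    # g_sen(C): average ROUGE of each sentence in C against the gold summary.
    return sum(rouge(s, gold_summary) for s in candidate_sents) / len(candidate_sents)

def summary_level_score(candidate_sents, gold_summary, rouge):
    # g_sum(C): ROUGE of the whole candidate against the gold summary.
    return rouge(" ".join(candidate_sents), gold_summary)

def best_summary_rank(candidates, gold_summary, rouge):
    """Rank index z of the best-summary when candidates are sorted by g_sen.

    candidates: list of candidate summaries, each a list of sentences.
    Returns z (1-based); z > 1 means the best-summary is a pearl-summary.
    """
    scored = [(sentence_level_score(c, gold_summary, rouge),
               summary_level_score(c, gold_summary, rouge)) for c in candidates]
    by_sen = sorted(range(len(scored)), key=lambda i: scored[i][0], reverse=True)
    best = max(range(len(scored)), key=lambda i: scored[i][1])
    return by_sen.index(best) + 1
```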
                                        # Pairs                         # Tokens
Datasets     Source             Type    Train     Valid    Test         Doc.    Sum.     # Ext
Reddit       Social Media       SDS     41,675    645      645          482.2   28.0     2
XSum         News               SDS     203,028   11,273   11,332       430.2   23.3     2
CNN/DM       News               SDS     287,084   13,367   11,489       766.1   58.2     3
WikiHow      Knowledge Base     SDS     168,126   6,000    6,000        580.8   62.6     4
PubMed       Scientific Paper   SDS     83,233    4,946    5,025        444.0   209.5    6
Multi-News   News               MDS     44,972    5,622    5,622        487.3   262.0    9

Table 1: Datasets overview. SDS represents single-document summarization and MDS represents multi-document summarization. The values in Doc. and Sum. indicate the average length (in tokens) of documents and summaries in the test set, respectively. # Ext denotes the number of sentences that should be extracted for each dataset.

Since the appearance of pearl-summaries brings challenges to sentence-level extractors, we attempt to investigate the proportion of pearl-summaries on six benchmark datasets. A detailed description of these datasets is given in Table 1.

Figure 2: Distribution of z (%) on six datasets: (a) Reddit, (b) XSum, (c) CNN/DM, (d) WikiHow, (e) PubMed, (f) Multi-News. Because the number of candidate summaries for each document is different (short texts may have relatively few candidates), we use z / number of candidate summaries as the X-axis. The Y-axis represents the proportion of best-summaries with this rank in the test set.

Intuitively, 1) if z = 1 (Ĉ comes first), the best-summary is composed of the sentences with the highest scores; 2) if z > 1, the best-summary is a pearl-summary. And as z increases (Ĉ gets a lower ranking), we can find more candidate summaries whose sentence-level score is higher than that of the best-summary, which leads to learning difficulty for sentence-level extractors.

As demonstrated in Figure 2, we can observe that for all datasets, most of the best-summaries are not made up of the highest-scoring sentences. Specifically, for CNN/DM, only 18.9% of best-summaries are not pearl-summaries, indicating that sentence-level extractors easily fall into a local optimum and miss better candidate summaries.

Different from CNN/DM, PubMed is the most suitable for sentence-level summarizers, because most of its best-summaries are not pearl-summaries. Additionally, it is challenging to achieve good performance on WikiHow and Multi-News without a summary-level learning process, as these two datasets are the most evenly distributed; that is, the appearance of pearl-summaries makes the selection of the best-summary more complicated.

In conclusion, the proportion of pearl-summaries among all best-summaries is a property that characterizes a dataset and will affect our choice of summarization extractor.

3.3 Inherent Gap between Sentence-Level and Summary-Level Extractors

The above analysis has explicated that the summary-level method is better than the sentence-level method because it can pick out pearl-summaries, but how much improvement can it bring on a given dataset?

Based on the definitions in Eq. (1) and (2), we can characterize the upper bounds of the sentence-level and summary-level summarization systems for a document D as:
    α_sen(D) = max_{C ∈ C_D} g_sen(C),    (3)
    α_sum(D) = max_{C ∈ C_D} g_sum(C),    (4)

where C_D is the set of candidate summaries extracted from D.

Then, we quantify the potential gain for a document D by calculating the difference between α_sen(D) and α_sum(D):

    Δ(D) = α_sum(D) − α_sen(D).    (5)

Finally, a dataset-level potential gain can be obtained as:

    Δ(𝒟) = (1 / |𝒟|) Σ_{D ∈ 𝒟} Δ(D),    (6)

where 𝒟 represents a specific dataset and |𝒟| is the number of documents in this dataset.

Figure 3: Δ(𝒟) for different datasets (Reddit, XSum, CNN/DM, WikiHow, PubMed, Multi-News).

We can see from Figure 3 that the performance gain of the summary-level method varies with the dataset and reaches a maximum of 4.7 on CNN/DM. From Figure 3 and Table 1, we can find that the performance gain is related to the length of the reference summary of each dataset. In the case of short summaries (Reddit and XSum), the perfect identification of pearl-summaries does not lead to much improvement. Similarly, the multiple sentences in a long summary (PubMed and Multi-News) already have a large degree of semantic overlap, making the improvement of the summary-level method relatively small. But for a medium-length summary (CNN/DM and WikiHow, about 60 words), the summary-level learning process is rewarding. We will discuss this performance gain with specific models in Section 5.4.
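Eq. (3)-(6) reduce to a few lines of code once the candidates and a ROUGE function are available; the sketch below reuses the scoring helpers from the sketch in Section 3.1 and is only an illustration of the analysis, not the released script.

```python
def document_gain(candidates, gold_summary, rouge):
    # Delta(D) = alpha_sum(D) - alpha_sen(D) for one document.
    alpha_sen = max(sentence_level_score(c, gold_summary, rouge) for c in candidates)
    alpha_sum = max(summary_level_score(c, gold_summary, rouge) for c in candidates)
    return alpha_sum - alpha_sen

def dataset_gain(dataset, rouge):
    # Average Delta(D) over all (candidates, gold summary) pairs in a dataset.
    gains = [document_gain(cands, gold, rouge) for cands, gold in dataset]
    return sum(gains) / len(gains)
```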
4 Summarization as Matching

The above quantitative analysis suggests that, for most of the datasets, sentence-level extractors are inherently unaware of pearl-summaries, so obtaining the best-summary is difficult. To better utilize these characteristics of the data, we propose a summary-level framework that can score and extract a summary directly.

Specifically, we formulate the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries (extracted from the original text) will be matched in a semantic space. The following section details how we instantiate our proposed matching summarization framework with a simple Siamese-based architecture.

4.1 Siamese-BERT

Inspired by the Siamese network structure (Bromley et al., 1994), we construct a Siamese-BERT architecture to match the document D and the candidate summary C. Our Siamese-BERT consists of two BERTs with tied weights and a cosine-similarity layer during the inference phase.

Unlike the modified BERT used in (Liu, 2019; Bae et al., 2019), we directly use the original BERT to derive semantically meaningful embeddings from the document D and the candidate summary C, since we do not need sentence-level representations. Thus, we use the vector of the '[CLS]' token from the top BERT layer as the representation of a document or summary. Let r_D and r_C denote the embeddings of the document D and the candidate summary C. Their similarity score is measured by f(D, C) = cosine(r_D, r_C).
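A minimal sketch of this scoring step using the Hugging Face transformers library is shown below. The model name, the truncation length, and the use of inference mode are assumptions for illustration; the pooling of the '[CLS]' vector and the cosine similarity follow the description above, but this is not the exact released configuration.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")   # shared weights for both inputs

def cls_embedding(text, max_length=512):
    # Encode a document or a candidate summary and return its '[CLS]' vector.
    inputs = tokenizer(text, truncation=True, max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]        # [CLS] token from the top layer

def match_score(document, candidate):
    # f(D, C) = cosine(r_D, r_C)
    r_d, r_c = cls_embedding(document), cls_embedding(candidate)
    return torch.cosine_similarity(r_d, r_c).item()
```

Because the two inputs are encoded by the same module, the "two BERTs with tied weights" collapse to a single encoder applied twice, which is the usual way a Siamese structure is implemented.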
In order to fine-tune Siamese-BERT, we use a margin-based triplet loss to update the weights. Intuitively, the gold summary C* should be semantically closest to the source document, which is the first principle our loss should follow:

    L1 = max(0, f(D, C) − f(D, C*) + γ1),    (7)

where C is a candidate summary in D and γ1 is a margin value. Besides, we also design a pairwise margin loss for all the candidate summaries. We sort all candidate summaries in descending order of ROUGE scores with the gold summary. Naturally, a candidate pair with a larger ranking gap should have a larger margin, which is the second principle used to design our loss function:

    L2 = max(0, f(D, C_j) − f(D, C_i) + (j − i) ∗ γ2)    (i < j),    (8)

where C_i represents the candidate summary ranked i and γ2 is a hyperparameter used to distinguish between good and bad candidate summaries. Finally, our margin-based triplet loss can be written as:

    L = L1 + L2.    (9)

The basic idea is to let the gold summary have the highest matching score and, at the same time, to let a better candidate summary obtain a higher score than an unqualified candidate summary. Figure 1 illustrates this idea.
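One way to compute Eq. (7)-(9) for a single document is sketched below, assuming the candidate scores are the cosine similarities produced by the Siamese-BERT above and are already sorted by ROUGE against the gold summary; the batching and the sum reduction over candidates are assumptions rather than the exact released training code.

```python
import torch

def matchsum_loss(candidate_scores, gold_score, gamma1=0.0, gamma2=0.01):
    """Margin-based triplet loss of Eq. (7)-(9) for one document.

    candidate_scores: 1-D tensor of f(D, C_i), sorted so that index 0 is the
                      candidate with the highest ROUGE against the gold summary.
    gold_score: scalar tensor f(D, C*).
    """
    zero = torch.zeros(1, device=candidate_scores.device)
    # L1: every candidate should score below the gold summary by at least gamma1.
    l1 = torch.max(zero, candidate_scores - gold_score + gamma1).sum()
    # L2: a pair ranked (i, j) with i < j should be separated by (j - i) * gamma2.
    n = candidate_scores.size(0)
    l2 = zero.squeeze()
    for i in range(n):
        for j in range(i + 1, n):
            l2 = l2 + torch.max(zero, candidate_scores[j] - candidate_scores[i]
                                + (j - i) * gamma2).squeeze()
    return l1 + l2
```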
In the inference phase, we formulate extractive summarization as the task of searching for the best summary among all the candidates C extracted from the document D:

    Ĉ = argmax_{C ∈ C} f(D, C).    (10)

4.2 Candidates Pruning

Curse of Combination The matching idea is intuitive, but it suffers from combinatorial explosion problems. For example, how should we determine the size of the candidate summary set, and should we score all possible candidates? To alleviate these difficulties, we propose a simple candidate pruning strategy.

Concretely, we introduce a content selection module to pre-select salient sentences. The module learns to assign each sentence a salience score and prunes sentences irrelevant to the current document, resulting in a pruned document D' = {s'_1, ..., s'_ext | s'_i ∈ D}.

Similar to much previous work on two-stage summarization, our content selection module is a parameterized neural network. In this paper, we use BertSum (Liu and Lapata, 2019) without trigram blocking (we call it BertExt) to score each sentence. Then, we use a simple rule to obtain the candidates: we generate all combinations of sel sentences from the pruned document and reorder the sentences according to their original positions in the document to form candidate summaries. Therefore, we have a total of C(ext, sel) candidate summaries, i.e., the number of sel-sentence combinations of the ext retained sentences.
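The pruning-and-matching pipeline can be sketched as follows; the content selector's salience scores and the match_score function from the Section 4.1 sketch are assumed to be available, and the particular ext/sel values would follow Table 2.

```python
from itertools import combinations

def build_candidates(doc_sentences, salience_scores, ext=5, sel=(2, 3)):
    """Prune the document to its ext most salient sentences, then enumerate
    all sel-sentence combinations in original document order."""
    keep = sorted(sorted(range(len(doc_sentences)),
                         key=lambda i: salience_scores[i], reverse=True)[:ext])
    pruned = [doc_sentences[i] for i in keep]
    candidates = []
    for k in sel:
        for combo in combinations(range(len(pruned)), k):
            candidates.append(" ".join(pruned[i] for i in combo))
    return candidates

def select_summary(document, candidates, match_score):
    # Eq. (10): pick the candidate closest to the document in the semantic space.
    return max(candidates, key=lambda c: match_score(document, c))
```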
5 Experiment

5.1 Datasets

In order to verify the effectiveness of our framework and obtain more convincing explanations, we perform experiments on six divergent mainstream datasets, described below.

         Reddit    XSum    CNN/DM    Wiki       PubMed    M-News
Ext      5         5       5         5          7         10
Sel      1, 2      1, 2    2, 3      3, 4, 5    6         9
Size     15        15      20        16         7         9

Table 2: Details of the candidate summaries for the different datasets. Ext denotes the number of sentences kept after we prune the original document, Sel denotes the number of sentences used to form a candidate summary, and Size is the number of final candidate summaries.

CNN/DailyMail (Hermann et al., 2015) is a commonly used summarization dataset modified by Nallapati et al. (2016), which contains news articles and associated highlights as summaries. In this paper, we use the non-anonymized version.

PubMed (Cohan et al., 2018) is collected from scientific papers and thus consists of long documents. We modify this dataset by using the introduction section as the document and the abstract section as the corresponding summary.

WikiHow (Koupaee and Wang, 2018) is a diverse dataset extracted from an online knowledge base. Articles in it span a wide range of topics.

XSum (Narayan et al., 2018a) is a one-sentence summary dataset built to answer the question "What is the article about?". All summaries are professionally written, typically by the authors of the documents in this dataset.

Multi-News (Fabbri et al., 2019) is a multi-document news summarization dataset with relatively long summaries. We use the truncated version and concatenate the source documents into a single input in all experiments.

Reddit (Kim et al., 2019) is a highly abstractive dataset collected from a social media platform. We only use the TIFU-long version of Reddit, which regards the body text of a post as the document and the TL;DR as the summary.

5.2 Implementation Details

We use the base version of BERT to implement our models in all experiments. The Adam optimizer (Kingma and Ba, 2014) with warm-up is used, and our learning rate schedule follows Vaswani et al. (2017):

    lr = 2e−3 · min(step^−0.5, step · wm^−1.5),    (11)

where each step uses a batch size of 32 and wm denotes 10,000 warm-up steps. We choose γ1 = 0 and γ2 = 0.01. When γ1 < 0.05 and 0.005 < γ2 < 0.05 they have little effect on performance; otherwise they cause performance degradation. We use the validation set to save the three best checkpoints during training, and record the performance of the best checkpoint on the test set. Importantly, all the experimental results listed in this paper are the average of three runs. To obtain a Siamese-BERT model on CNN/DM, we use 8 Tesla V100 16G GPUs for about 30 hours of training.

For all datasets, we remove samples with an empty document or summary and truncate documents to 512 tokens; therefore, ORACLE in this paper is calculated on the truncated datasets. Details of the candidate summaries for the different datasets can be found in Table 2.
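The schedule of Eq. (11) is easy to reproduce. A sketch using PyTorch's LambdaLR is shown below; setting the optimizer's base learning rate to 1 so that the lambda returns the absolute rate, and the stand-in module, are assumptions rather than the released training script.

```python
import torch

def matchsum_lr(step, warmup=10_000):
    # lr = 2e-3 * min(step^-0.5, step * warmup^-1.5), Eq. (11)
    step = max(step, 1)                       # avoid division by zero at step 0
    return 2e-3 * min(step ** -0.5, step * warmup ** -1.5)

# Example wiring: base lr of 1.0 so the lambda value is used directly,
# with scheduler.step() called once per optimization step.
model = torch.nn.Linear(768, 768)             # stand-in for the Siamese-BERT parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=matchsum_lr)
```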
5.3 Experimental Results

Results on CNN/DM As shown in Table 3, we list strong baselines with different learning approaches. The first section contains LEAD, ORACLE and MATCH-ORACLE4. Because we prune documents before matching, MATCH-ORACLE is relatively low.

Model                                     R-1      R-2      R-L
LEAD                                      40.43    17.62    36.67
ORACLE                                    52.59    31.23    48.87
MATCH-ORACLE                              51.08    26.94    47.22
BanditSum (Dong et al., 2018)             41.50    18.70    37.60
NeuSum (Zhou et al., 2018)                41.59    19.01    37.98
JECS (Xu and Durrett, 2019)               41.70    18.50    37.90
HiBert (Zhang et al., 2019b)              42.37    19.95    38.83
PNBert (Zhong et al., 2019a)              42.39    19.51    38.69
PNBert + RL                               42.69    19.60    38.85
BertExt† (Bae et al., 2019)               42.29    19.38    38.63
BertExt† + RL                             42.76    19.87    39.11
BertExt (Liu, 2019)                       42.57    19.96    39.04
BertExt + Tri-Blocking                    43.23    20.22    39.60
BertSum* (Liu and Lapata, 2019)           43.85    20.34    39.90
BertExt (Ours)                            42.73    20.13    39.20
BertExt + Tri-Blocking (Ours)             43.18    20.16    39.56
MatchSum (BERT-base)                      44.22    20.62    40.38
MatchSum (RoBERTa-base)                   44.41    20.86    40.55

Table 3: Results on the CNN/DM test set. The model with * uses the large version of BERT. BertExt† adds an additional Pointer Network compared to the other BertExt models in this table.

4 LEAD and ORACLE are common baselines in the summarization task. The former extracts the first several sentences of a document as the summary; the latter is the ground truth used to train extractive models. MATCH-ORACLE is the ground truth used to train MatchSum.

We can see from the second section that, although RL can score the entire summary, it does not lead to much performance improvement. This is probably because it still relies on sentence-level summarizers such as Pointer Networks or sequence labeling models, which select sentences one by one rather than distinguishing the semantics of different summaries as a whole. Trigram Blocking is a simple yet effective heuristic on CNN/DM, even better than all redundancy removal methods based on neural models.

Compared with these models, our proposed MatchSum outperforms all competitors by a large margin. For example, it beats BertExt by 1.51 ROUGE-1 when using BERT-base as the encoder. Additionally, even compared with the baseline using a BERT-large pre-trained encoder, our MatchSum (BERT-base) still performs better. Furthermore, when we change the encoder to RoBERTa-base (Liu et al., 2019), the performance can be further improved. We think the improvement here comes from the 63 million English news articles that RoBERTa introduced during pretraining. The superior performance on this dataset demonstrates the effectiveness of our proposed matching framework.

Results on Datasets with Short Summaries Reddit and XSum have been heavily evaluated by abstractive summarizers due to their short summaries. Here, we evaluate our model on these two datasets to investigate whether MatchSum can achieve improvement when dealing with summaries containing fewer sentences compared with other typical extractive settings.

Model                       R-1      R-2     R-L
Reddit
BertExt (Num = 1)           21.99    5.21    16.99
BertExt (Num = 2)           23.86    5.85    19.11
MatchSum (Sel = 1)          22.87    5.15    17.40
MatchSum (Sel = 2)          24.90    5.91    20.03
MatchSum (Sel = 1, 2)       25.09    6.17    20.13
XSum
BertExt (Num = 1)           22.53    4.36    16.23
BertExt (Num = 2)           22.86    4.48    17.16
MatchSum (Sel = 1)          23.35    4.46    16.71
MatchSum (Sel = 2)          24.48    4.58    18.31
MatchSum (Sel = 1, 2)       24.86    4.66    18.41

Table 4: Results on the test sets of Reddit and XSum. Num indicates how many sentences BertExt extracts as a summary, and Sel indicates the number of sentences we choose to form a candidate summary.
                           WikiHow                  PubMed                   Multi-News
Model                      R-1     R-2     R-L      R-1     R-2     R-L      R-1     R-2     R-L
LEAD                       24.97   5.83    23.24    37.58   12.22   33.44    43.08   14.27   38.97
ORACLE                     35.59   12.98   32.68    45.12   20.33   40.19    49.06   21.54   44.27
MATCH-ORACLE               35.22   10.55   32.87    42.21   15.42   37.67    47.45   17.41   43.14
BertExt                    30.31   8.71    28.24    41.05   14.88   36.57    45.80   16.42   41.53
 + 3gram-Blocking          30.37   8.45    28.28    38.81   13.62   34.52    44.94   15.47   40.63
 + 4gram-Blocking          30.40   8.67    28.32    40.29   14.37   35.88    45.86   16.23   41.57
MatchSum (BERT-base)       31.85   8.98    29.58    41.21   14.91   36.75    46.20   16.51   41.89

Table 5: Results on the test sets of WikiHow, PubMed and Multi-News. MatchSum beats the state-of-the-art BERT model with N-gram Blocking on all of these different-domain datasets.

When taking just one sentence to match the original document, MatchSum degenerates into a re-ranking of sentences. Table 4 illustrates that this degenerate form still brings a small improvement (compared to BertExt (Num = 1), +0.88 R-1 on Reddit, +0.82 R-1 on XSum). However, when the number of sentences increases to two and summary-level semantics need to be taken into account, MatchSum obtains a more remarkable improvement (compared to BertExt (Num = 2), +1.04 R-1 on Reddit, +1.62 R-1 on XSum).

In addition, our model maps a candidate summary as a whole into the semantic space, so it can flexibly choose any number of sentences, while most other methods can only extract a fixed number of sentences. From Table 4, we can see that this advantage leads to further performance improvement.

Results on Datasets with Long Summaries When the summary is relatively long, summary-level matching becomes more complicated and harder to learn. We aim to compare the difference between Trigram Blocking and our model when dealing with long summaries.

Table 5 shows that although Trigram Blocking works well on CNN/DM, it does not always maintain a stable improvement. N-gram Blocking has little effect on WikiHow and Multi-News, and it causes a large performance drop on PubMed. We think the reason is that N-gram Blocking cannot really understand the semantics of sentences or summaries; it just restricts entities made up of many words to appearing only once, which is obviously not suitable for the scientific domain, where entities may often appear multiple times.

On the contrary, our proposed method does not have these strong constraints but aligns the original document with the summary in the semantic space. Experimental results show that our model is robust across all domains; in particular, on WikiHow, MatchSum beats the state-of-the-art BERT model by 1.54 ROUGE-1.

5.4 Analysis

In the following, our analysis is driven by two questions:

1) Are the benefits of MatchSum consistent with the properties of the datasets analyzed in Section 3?

2) Why does our model achieve different performance gains on diverse datasets?

Dataset Splitting Testing Typically, we choose the three datasets (XSum, CNN/DM and WikiHow) with the largest performance gains for this experiment. We split each test set into five parts of roughly equal size according to z described in Section 3.2, and then experiment on each subset.

Figure 4 shows that the performance gap between MatchSum and BertExt is always the smallest when the best-summary is not a pearl-summary (z = 1). This phenomenon is in line with our understanding: in these samples, the ability of the summary-level extractor to discover pearl-summaries does not bring advantages.

As z increases, the performance gap generally tends to increase. Specifically, the benefit of MatchSum on CNN/DM is highly consistent with the appearance of pearl-summaries. It brings an improvement of only 0.49 in the subset with the smallest z, but this rises sharply to 1.57 when z reaches its maximum value. WikiHow is similar to CNN/DM: when the best-summary consists entirely of the highest-scoring sentences, the performance gap is obviously smaller than in other samples.
XSum is slightly different: although the trend remains the same, our model does not perform well in the samples with the largest z, which needs further improvement and exploration.

Figure 4: Dataset splitting experiment on (a) XSum, (b) CNN/DM and (c) WikiHow. We split the test sets into five parts according to z described in Section 3.2. The X-axis from left to right indicates the subsets of the test set with the value of z from small to large, and the Y-axis represents the ROUGE improvement of MatchSum over BertExt on each subset.

From the above comparison, we can see that the performance improvement of MatchSum is concentrated in the samples with more pearl-summaries, which illustrates that our semantics-based summary-level model can capture sentences that are not particularly good when viewed individually, thereby forming a better summary.

Comparison Across Datasets Intuitively, the improvements brought by the MatchSum framework should be associated with the inherent gaps presented in Section 3.3. To better understand their relation, we introduce Δ(D)* as follows:

    Δ(D)* = g_sum(C_MS) − g_sum(C_BE),    (12)

    Δ(𝒟)* = (1 / |𝒟|) Σ_{D ∈ 𝒟} Δ(D)*,    (13)

where C_MS and C_BE represent the candidate summaries selected by MatchSum and BertExt for the document D, respectively. Therefore, Δ(𝒟)* indicates the improvement of MatchSum over BertExt on dataset 𝒟. Moreover, comparing it with the inherent gap between sentence-level and summary-level extractors, we define the ratio that MatchSum can learn on dataset 𝒟 as:

    ψ(𝒟) = Δ(𝒟)* / Δ(𝒟),    (14)

where Δ(𝒟) is the inherent gap between sentence-level and summary-level extractors.
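Given the summary-level ROUGE of the summaries chosen by the two systems and the inherent gap Δ(𝒟) from Section 3.3, Eq. (12)-(14) reduce to simple arithmetic; the sketch below assumes those per-document scores have already been computed and is only an illustration.

```python
def learned_ratio(matchsum_scores, bertext_scores, inherent_gap):
    """psi(D) = Delta(D)* / Delta(D) for a dataset.

    matchsum_scores / bertext_scores: per-document g_sum of the summary each
    system selected; inherent_gap: the dataset-level Delta(D) from Eq. (6).
    """
    gains = [ms - be for ms, be in zip(matchsum_scores, bertext_scores)]
    delta_star = sum(gains) / len(gains)      # Eq. (12)-(13)
    return delta_star / inherent_gap          # Eq. (14)
```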
Figure 5: ψ(𝒟) for different datasets. Reddit is excluded because it has too few samples in the test set.

It is clear from Figure 5 that the value of ψ(𝒟) depends on z (see Figure 2) and the length of the gold summary (see Table 1). As the gold summaries get longer, the upper bound of summary-level approaches becomes more difficult for our model to reach. MatchSum achieves a ψ(𝒟) of 0.64 on XSum (23.3-word summaries); however, ψ(𝒟) is less than 0.2 on PubMed and Multi-News, whose summary lengths exceed 200 words. From another perspective, when the summary lengths are similar, our model performs better on datasets with more pearl-summaries. For instance, z is evenly distributed in Multi-News (see Figure 2), so a higher ψ(𝒟) (0.18) can be obtained than on PubMed (0.09), which has the fewest pearl-summaries.

A better understanding of the dataset gives us a clear awareness of the strengths and limitations of our framework, and we also hope that the above analysis can provide useful clues for future research on extractive summarization.

6 Conclusion

We formulate the extractive summarization task as a semantic text matching problem and propose a novel summary-level framework to match the source document and candidate summaries in the semantic space. We conduct an analysis to show how our model better fits the characteristics of the data. Experimental results show that MatchSum
outperforms the current state-of-the-art extractive model on six benchmark datasets, which demonstrates the effectiveness of our method. We believe the power of this matching-based summarization framework has not been fully exploited. In the future, more forms of matching models can be explored to instantiate the proposed framework.

Acknowledgment

We would like to thank the anonymous reviewers for their valuable comments. This work is supported by the National Key Research and Development Program of China (No. 2018YFC0831103), National Natural Science Foundation of China (No. U1936214 and 61672162), Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01) and ZJLab.

References

RM Alyguliyev. 2009. The two-stage unsupervised approach to multidocument summarization. Automatic Control and Computer Sciences, 43(5):276.

Kristjan Arumae and Fei Liu. 2018. Reinforced extractive summarization with question-focused rewards. In Proceedings of ACL 2018, Student Research Workshop, pages 105–111.

Sanghwan Bae, Taeuk Kim, Jihoon Kim, and Sang-goo Lee. 2019. Summary level training of sentence rewriting for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 10–20.

Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1994. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–686.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 484–494.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Yue Dong, Yikang Shen, Eric Crawford, Herke van Hoof, and Jackie Chi Kit Cheung. 2018. BanditSum: Extractive summarization as a contextual bandit. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3739–3748.

Alexander Richard Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R. Radev. 2019. Multi-News: A large-scale multi-document summarization dataset and abstractive hierarchical model. In ACL (1), pages 1074–1084. Association for Computational Linguistics.

Dimitrios Galanis and Ion Androutsopoulos. 2010. An extractive supervised two-stage method for sentence compression. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 885–893.

Dan Gillick and Benoit Favre. 2009. A scalable global model for summarization. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, pages 10–18.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1684–1692.

Elad Hoffer and Nir Ailon. 2015. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer.

Aishwarya Jadhav and Vaibhav Rajan. 2018. Extractive summarization with SWAP-NET: Sentences and words from alternating pointer networks. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 142–151.

Byeongchang Kim, Hyunwoo Kim, and Gunhee Kim. 2019. Abstractive summarization of Reddit posts with multi-level memory networks. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2519–2531.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Mahnaz Koupaee and William Yang Wang. 2018. WikiHow: A large scale text summarization dataset. arXiv preprint arXiv:1810.09305.

Logan Lebanoff, Kaiqiang Song, Franck Dernoncourt, Doo Soon Kim, Seokhwan Kim, Walter Chang, and Fei Liu. 2019. Scoring sentence singletons and pairs for abstractive summarization. arXiv preprint arXiv:1906.00077.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 150–157.

Yang Liu. 2019. Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3721–3731.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Alfonso Mendes, Shashi Narayan, Sebastião Miranda, Zita Marinho, André F. T. Martins, and Shay B. Cohen. 2019. Jointly extracting and compressing documents with summary state representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3955–3966.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 319–328.

Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, pages 1291–1299.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. CoNLL 2016, page 280.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018a. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018b. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3973–3983.

Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 373–382.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Xiaojun Wan, Ziqiang Cao, Furu Wei, Sujian Li, and Ming Zhou. 2015. Multi-document summarization via discriminative summary reranking. arXiv preprint arXiv:1507.02062.

Danqing Wang, Pengfei Liu, Yining Zheng, Xipeng Qiu, and Xuanjing Huang. 2020. Heterogeneous graph neural networks for extractive document summarization. In Proceedings of the 58th Conference of the Association for Computational Linguistics.

Danqing Wang, Pengfei Liu, Ming Zhong, Jie Fu, Xipeng Qiu, and Xuanjing Huang. 2019. Exploring domain shift in extractive text summarization. arXiv preprint arXiv:1908.11664.

Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1442–1451.

Zhiguo Wang, Wael Hamza, and Radu Florian. 2017. Bilateral multi-perspective matching for natural language sentences. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4144–4150. AAAI Press.
Jiacheng Xu and Greg Durrett. 2019. Neural extractive text summarization with syntactic compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China. Association for Computational Linguistics.

Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Discourse-aware neural extractive model for text summarization. arXiv preprint arXiv:1910.14142.

Wen-tau Yih, Ming-Wei Chang, Christopher Meek, and Andrzej Pastusiak. 2013. Question answering using enhanced lexical semantic models. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1744–1753.

Dani Yogatama, Fei Liu, and Noah A. Smith. 2015. Extractive summarization by maximizing semantic volume. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1961–1966.

Haoyu Zhang, Yeyun Gong, Yu Yan, Nan Duan, Jianjun Xu, Ji Wang, Ming Gong, and Ming Zhou. 2019a. Pretraining-based natural language generation for text summarization. arXiv preprint arXiv:1902.09243.

Xingxing Zhang, Furu Wei, and Ming Zhou. 2019b. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In ACL.

Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, and Xuanjing Huang. 2019a. Searching for effective neural extractive summarization: What works and what's next. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 1049–1058.

Ming Zhong, Danqing Wang, Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2019b. A closer look at data bias in neural extractive summarization models. EMNLP-IJCNLP 2019, page 80.

Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018. Neural document summarization by jointly learning to score and select sentences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–663.
