
TextRank enhanced Topic Model for Query focussed Text Summarization
Nadeem Akhtar, M M Sufyan Beg, Hira Javed
Department of Computer Engineering, Zakir Husain College of Engineering & Technology,
Aligarh Muslim University, Aligarh, UP, India-202001
[email protected], [email protected], [email protected]

Abstract—Topic model based query focused text summarization methods can meet the user's needs because they are able to find subtopics and their correlations. Topic model based summarization methods do not score sentences directly; they generate topic distributions, which are then used to score the sentences. Graph based methods rank sentences using syntactic and semantic information. In this work, a topic model based summarization method, namely the two-tiered topic model, is combined with the graph based TextRank method. The combined method, called the TextRank enhanced Two-Tiered topic model, uses the important sentences obtained from TextRank in the generative process of the two-tiered model to extract better summary sentences. The proposed method outperforms other topic model based methods on ROUGE metrics evaluated on the DUC 2005 dataset.

Keywords—TextRank, two-tiered topic model, text summarization, topic model based text summarization

I. INTRODUCTION

A query focused extractive text summarizer [1][2] generates coherent, non-redundant sentences that represent the general theme of the documents and are related to the user's query. Several different approaches have focused on finding coherent sentences. Graph based approaches [3-6] organize sentences into a graph based on some sentence similarity criterion, which is used to extract coherent sentences, represented by sentence subgraphs, using graph metrics. Discourse based approaches use linguistic methods based on rhetorical structure theory (RST) [7] to identify coherent spans of text. Unsupervised machine learning based approaches [8][9] build clusters of coherent sentences to extract theme oriented sentences. Bayesian topic model based approaches [10-13] find hidden topics and their correlations to extract topic based coherent sentences.

Topic model based summarization methods can distinguish general topics from specific topics by using a hierarchy of topics, where topics at higher levels of the hierarchy represent general topics and lower levels represent specific topics [14]. They can also extract coherent topics from the topic hierarchy, which are used to generate coherent sentences for summarization. Topic model based summarization methods usually compare the sentence distribution with a summary distribution to find salient summary sentences. For example, the method used in KL-SUM [10], and in several other topic based methods, minimizes the Kullback-Leibler divergence between the empirical unigram distribution of the candidate summary and the observed unigram distribution of the document collection to find the best scoring sentences. Topic model based summarization methods find summary topic distributions and do not score the sentences directly for summarization. They therefore cannot exploit several useful kinds of information for scoring sentences, for example word position, sentence length and the presence of nouns, as used in surface based approaches, or sentence similarity scores, as used in graph based approaches. This information cannot be used directly in the probabilistic framework of topic models, although some of it has been used in earlier work, e.g., sentence location represented through a Dirichlet hyper-parameter. In this paper, we enhance a probabilistic topic model based summarization method by enriching it with information from a graph based approach. We enhance the Two-Tiered topic model [13] using sentence scores obtained from TextRank [3].

Topic models do not usually score sentences directly; rather, they generate topic distributions which are used as summary distributions to score summary sentences. The Two-tiered topic model (TTM) [13] directly scores sentences for query focused summarization by counting the number of times a sentence is sampled in its generative process. This direct sentence scoring ability of TTM makes it possible to use other information within the probabilistic framework of TTM. We use sentence scores obtained from TextRank to find the set of important sentences related to a sentence, from which a sentence is sampled for each query word in the TTM generative process. TextRank [3] is a graph based summarization method which executes the PageRank algorithm on a sentence graph to score sentences for summary generation. Enriching TTM with TextRank sentence scores allows TTM to sample only the important sentences for the query words, which results in significantly better ROUGE [15] scores for summarization.

The remainder of the paper is organized as follows. The next section describes related work. Section III presents the proposed method. Section IV presents experimental settings, results and discussion. Section V concludes and discusses future work.

II. RELATED WORK

Earlier research in extractive multi-document summarization has focused on relevant sentence selection based on lexical and semantic relationships. In this section, topic model based and graph based summarization methods and their important features are reviewed.

A. Hierarchical Topics for generic summarization

Topic models which generate a topic hierarchy can identify general and specific topics. Topics at higher levels in the hierarchy represent general topics and are more suitable for summary generation. These models associate with each word a hierarchy of topics instead of a single flat topic. They use a topic structure similar to the Pachinko Allocation Model (PAM) [14], in which a super-topic at some level may be associated with several sub-topics at the next level. Super-topics represent more general topics compared to sub-topics. The Two-tiered topic model (TTM) [13], the Enriched TTM (ETTM) [13] and Latent Dirichlet Co-clustering [16] use two levels of topics to capture general concepts in the dataset. TTM and ETTM use sentences as metavariables, selecting the sentence for the words and then selecting a hierarchy of high and low level topics. This allows the sentences related to high level topics to be extracted directly. The hierarchical structure between super-topic and sub-topic allows constrained relationships between them to be modeled. Another topic model which captures correlation between topics is the correlated topic model [17]. Topic hierarchies may also be generated from a flat topic model with hierarchy generation tools [18].

B. Generic and specific summaries

Depending on the topics extracted at different levels, generic and specific summaries can be generated. HIERSUM [10] can generate a general summary for the entire document collection and both general and specific summaries for the individual documents. In [11], the authors use an entity-aspect model to cluster sentences and words into aspects. The clustered sentences are then used to generate aspect specific patterns for an entity summary template generation task. The event-aspect model [27] generates aspect oriented summaries of events by also generating a corpus wide topic distribution besides generating topic distributions for document sets, documents and sentences as in HIERSUM. It uses an extended LexRank [4] method for scoring sentences for summary generation.

C. Leveraging document structure

Words in the document collection have different affinities according to their positions in the collection. Words in different segments, e.g. phrases, sentences, paragraphs, documents and document clusters, have different co-occurrence relationships. Latent Dirichlet co-clustering (LDCC) [16] considers each document as a collection of meaningful single topic segments, i.e. paragraphs. The words in a segment have different lower level topics. LDCC has a hierarchy of topics, but the level of granularity for super-topics and sub-topics is different: super-topics are assigned to segments and sub-topics are assigned to the words of the segments. This is in contrast to TTM, wherein both the super-topic and the sub-topic are assigned to individual words. SenLDA [19] considers the grouping of words into sentences in its generative process and shows that it provides better results on a classification task and better perplexity with fast convergence. TopicSum [10] and HIERSUM [10] find topics called content distributions at three levels: sentence, document and document set. TopicSum finds a single general content distribution for the entire document set, whereas HIERSUM finds multiple content distributions for the document set: one general content and multiple specific content distributions.

D. Graph based approaches

TextRank [3] and LexRank [4] use sentence graphs to find the important sentences for summary generation. TextRank runs the TextRank (PageRank based) algorithm on the sentence graph to find sentence scores, while LexRank uses the concept of graph centrality to find sentence salience. In [5], the authors present an approach to produce graph-model extractive summaries of texts, meeting the target domain exigencies and treating the term repetition problem. In [6], an entity relationship graph is used together with word frequency based keyword identification to select sentences for summaries.

III. TEXTRANK ENHANCED TWO-TIERED TOPIC MODEL

The Two-Tiered topic model is inspired by the pachinko allocation model (PAM) [14], a hierarchical topic model which captures arbitrary, nested and sparse correlations between topics using a directed acyclic graph. TTM uses a model structure similar to a four level PAM with two differences. First, instead of having one fixed root topic at the first level, TTM samples a sentence metavariable uniformly from the set of sentences containing the current word. Second, since TTM is query focused, not all the words of a document are sampled through the topic hierarchy.

The proposed TextRank enhanced Two-Tiered Topic Model (TReTTM) uses the same model structure as the Two-Tiered Model with one difference: TReTTM samples the sentence metavariable for each query word from a set of sentences found using a sentence graph. In TTM, the sentence metavariable is sampled uniformly from the set of sentences containing the query word to be sampled. A query word is a word which is contained in the query text or is synonymous to any query word; the synonymy relationship is found using the WordNet dictionary. TTM includes in the summary only those sentences which contain at least one sampled query word. The sentences containing the query word are related to the query but may not be the best sentences for summarization, for two reasons. First, not all the words contained in the query are important for query summarization. For example, including sentences in the summary that contain the words 'include', 'identify' and 'involve', which appear in the query text of cluster D301, does not necessarily improve summarization performance. A sentence which does not contain important words besides these query words should not be sampled if summarization performance is to improve. Second, the sentences which contain the sampled query word may not be the important sentences for summarization. There may be other sentences which do not contain the sampled query word but are related to the sentence containing the sampled query word and are important for summarization.

To resolve both of the above mentioned problems, TReTTM modifies the sentence sampling procedure of TTM. Instead of sampling a sentence for the query word uniformly from the set of sentences containing the sampled query word, TReTTM samples a sentence uniformly from the set of important sentences related to the sentence containing the sampled query word.
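For illustration, the WordNet based query word identification described above might look as follows. This is a minimal sketch using NLTK's WordNet interface; the function and variable names are ours and it is not the authors' implementation.

```python
from nltk.corpus import wordnet as wn

def query_words(query_text, vocabulary):
    """A vocabulary word counts as a query word if it occurs in the query text
    or is a WordNet synonym of a word occurring in the query text."""
    base = set(w.lower() for w in query_text.split())
    synonyms = set()
    for w in base:
        for synset in wn.synsets(w):
            synonyms.update(l.lower().replace('_', ' ') for l in synset.lemma_names())
    expanded = base | synonyms
    return set(w for w in vocabulary if w.lower() in expanded)
```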

Sampling from the set of important sentences related to the sampled query word ensures that an important sentence related to the sentence containing the sampled query word (which may itself contain the sampled query word) will always be sampled. Thus the scores of the important sentences related to the query words will be high. To find the important sentences related to a sentence, TReTTM uses the TextRank algorithm as used in the TextRank summarization system.

TextRank uses a sentence graph to find the similarities among all the sentences. The most important sentences are those which are most similar to the other sentences. The compound similarity of a sentence is represented by its TextRank score, obtained by executing the TextRank algorithm on the sentence graph. Sentences having high TextRank scores are the most important sentences and are included in the summary. The similarity function used in this work is the number of common words in the two sentences normalized by their lengths.

To find the important sentences for a sentence s, a score is assigned to every other sentence s'. The score is the TextRank score of the sentence multiplied by its similarity to s, where the normalized number of common words is used as the similarity score:

score_s(s') = TR(s') \cdot sim(s, s')    (1)

sim(s, s') = C_{s,s'} / (|s| + |s'|)    (2)

where TR(s') is the TextRank score of s', C_{s,s'} is the number of common words in sentences s and s', and |s| and |s'| are their lengths.

All the sentences are ranked according to their sentence scores, and the top 10 sentences are selected as the important related sentences for a sentence.
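The related-sentence selection just described can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes tokenized sentences, uses the common-word similarity of equation (2) as the reading above, and uses the networkx PageRank implementation for the TextRank scores.

```python
import networkx as nx

def common_word_sim(a, b):
    """Eq. (2): common words normalized by the sentence lengths (one possible reading)."""
    return len(set(a) & set(b)) / (len(a) + len(b))

def textrank_scores(sentences):
    """Build the sentence graph with similarity-weighted edges and run PageRank."""
    g = nx.Graph()
    g.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = common_word_sim(sentences[i], sentences[j])
            if w > 0:
                g.add_edge(i, j, weight=w)
    return nx.pagerank(g, weight="weight")   # {sentence index: TextRank score}

def related_sentences(s_idx, sentences, tr, k=10):
    """Eq. (1): rank every other sentence by TR(s') * sim(s, s') and keep the top k."""
    scores = {j: tr[j] * common_word_sim(sentences[s_idx], sentences[j])
              for j in range(len(sentences)) if j != s_idx}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# sentences: list of tokenized sentences, e.g. [["the", "court", ...], ...]
# tr = textrank_scores(sentences); related = related_sentences(0, sentences, tr)
```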

A. Generative Process of TReTTM

The generative process of TReTTM is the same as that of TTM, except that the set of sentences is obtained from the TextRank algorithm. The generative process is explained below. The plate diagram is shown in figure 1; the hyper-parameters are not shown in the figure.

Fig. 1. Plate Notation for TReTTM.

For each word wij of sentence si of document d, a variable xij is drawn from a binomial distribution. If xij is 0, the word is not related to the query and wij is sampled from a background distribution Θ. If xij is sampled as 1, the word is query related and wij is sampled through the three level hierarchy. The set S of the top 10 important sentences related to sentence si is found using TextRank according to the above mentioned procedure. The variable yij is sampled uniformly from these top 10 sentences. Each sentence y is associated with a multinomial distribution Θy over K1 high level topics, and a high level topic h is sampled from Θy. Each high level topic h is associated with a multinomial distribution Θh over K2 low level topics, and a low level topic t is sampled from Θh. Each low level topic t has a multinomial distribution φt over the W vocabulary words. Finally, word wij is sampled from φt.
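The walkthrough above can be made concrete with the following sketch of a single generative draw. It is illustrative only: parameter names are ours, and the fixed Bernoulli prior on xij stands in for whatever prior the model actually places on the query/background switch.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_word(related, theta_y, theta_h, phi, theta_bg, p_query=0.5):
    """Draw one word position.
    related:    top-10 TextRank-related sentence indices for the current sentence
    theta_y[y]: distribution over K1 high level topics for sentence y
    theta_h[h]: distribution over K2 low level topics for high level topic h
    phi[t]:     distribution over W vocabulary words for low level topic t
    theta_bg:   background word distribution (placeholder prior p_query on x)."""
    x = rng.random() < p_query                        # x_ij: query related or not
    if not x:
        return rng.choice(len(theta_bg), p=theta_bg)  # background word
    y = rng.choice(related)                           # sentence metavariable, uniform over top 10
    h = rng.choice(len(theta_y[y]), p=theta_y[y])     # high level topic from Θ_y
    t = rng.choice(len(theta_h[h]), p=theta_h[h])     # low level topic from Θ_h
    return rng.choice(len(phi[t]), p=phi[t])          # word from φ_t
```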

For parameter estimation, Gibbs sampling [20] is used. The sampling distributions for query related and unrelated words are shown in equations (3) and (4):

P(x_{ij}=1, y_{ij}=y, h_{ij}=h, t_{ij}=t \mid w_{ij}=w, \cdot) \propto \frac{1}{|S_i|} \cdot \frac{n_{y,h} + \alpha_h}{n_y + \sum_{h'} \alpha_{h'}} \cdot \frac{n_{h,t} + \alpha_t}{n_h + \sum_{t'} \alpha_{t'}} \cdot \frac{n_{t,w} + \beta_w}{n_t + \sum_{w'} \beta_{w'}}    (3)

P(x_{ij}=0 \mid w_{ij}=w, \cdot) \propto \frac{n_{B,w} + \eta}{n_B + W\eta}    (4)

αh, αt and βw are the Dirichlet hyper-parameters for the multinomial distributions Θy, Θh and φt respectively. All the counts in equations (3) and (4) are the same as defined in [13]. The score of a sentence is the number of times that sentence is sampled for a sampled query word, normalized by the length of the sentence to favor sentences with several related words.

The Enriched Two-Tiered Topic model (ETTM) is an extension of TTM which also allows a word to be sampled from a high level topic. An extra random variable is sampled from a binomial distribution associated with each high and low level topic pair to decide whether the word is to be sampled from the high level or the low level topic. ETTM samples general words from the high level topics and context specific words from the low level topics. The general words sampled from high level topics are more suitable for summary generation; for this reason, ETTM results are better than TTM results. The sentence score in ETTM is found by multiplying the TTM sentence score with the average probability of the high level topics of the sentence words.

ETTM is also enhanced by obtaining important sentences from TextRank in the same way as in TReTTM. The enhanced ETTM is termed the TextRank Enhanced Enriched Two-Tiered Topic model (TReETTM).
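The sentence scoring described above can be sketched as follows. The data structures are illustrative (sample counts collected from the Gibbs chain, per-word high level topic probabilities per sentence); this is not the authors' code.

```python
def ttm_sentence_scores(sample_counts, sentence_lengths):
    """TTM/TReTTM: times a sentence was sampled for a query word, normalized by its length."""
    return {s: c / sentence_lengths[s] for s, c in sample_counts.items()}

def ettm_sentence_scores(sample_counts, sentence_lengths, high_topic_prob):
    """ETTM/TReETTM: the TTM score multiplied by the average high level topic
    probability of the sentence's words (high_topic_prob[s] holds those per-word values)."""
    base = ttm_sentence_scores(sample_counts, sentence_lengths)
    return {s: base[s] * (sum(high_topic_prob[s]) / len(high_topic_prob[s])) for s in base}

# Example: sample_counts = {0: 12, 3: 7}; sentence_lengths = {0: 24, 3: 18}
```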

IV. EXPERIMENTAL RESULTS

This section describes the evaluation methodology and experimental results. The proposed method is compared with LDA [21], TTM, ETTM, SenLDA [19], LDCC [16], NCBsum-A [25] and MDSBG [26].

LDA is the base topic model upon which all other topic models are designed. NCBsum-A exploits novelty, coverage and balance requirements with respect to a topic to rank sentences. MDSBG uses a bipartite graph based on an entity grid to rank sentences using the HITS algorithm; it then uses an optimization step to generate a topic oriented, non-redundant summary. The SenLDA and LDCC topic models are chosen for comparison because they use a hierarchy of topics and also utilize the inherent sentence based structure of the documents. The summary generation process for LDA, SenLDA and LDCC follows the work in [22]. The conditional probability distributions of all sentences given the topics (super-topics in LDCC) are found. The word distributions given super-topics are obtained from the word distributions given sub-topics. The topic distribution given the query is also obtained. A topic is generated from this distribution and a sentence is generated from the generated topic's sentence distribution. This process is repeated until the desired summary length is obtained; in the DUC 2005 task, the desired summary length is 250 words.

A. Dataset

For comparison among the discussed summarization methods, the standard DUC2005 task [23] dataset is used for the evaluation of query focused summarization. It contains 50 document clusters, each having 25 to 50 documents. Each document cluster has a query statement which describes the details of the summary to be obtained. Each document cluster has four or nine reference summaries written by human experts.

B. Experiment settings

Each document cluster is treated as a separate document collection and each model is trained on it separately. For each model, the Gibbs sampling chain is run for 1000 iterations with the first 250 iterations as burn-in. The hyper-parameters α in LDCC are estimated using the moment-matching method described in [24]. The other Dirichlet hyper-parameters are assumed fixed. Hyper-parameters are not tuned for the proposed algorithm, unlike the work in [13], because the goal of this work is to ascertain that TextRank enhances the two-tiered topic model. The hyper-parameter β is fixed to 0.01 and the hyper-parameter α is fixed to 1/(number of super-topics). In all the experiments, the numbers of high level and low level topics are kept at 5 and 10 respectively.
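Collected as a small configuration, the settings above read as follows (the constant names are ours; only the values come from this section).

```python
NUM_HIGH_TOPICS = 5            # super-topics
NUM_LOW_TOPICS = 10            # sub-topics
GIBBS_ITERATIONS = 1000
BURN_IN = 250
BETA = 0.01                    # word Dirichlet hyper-parameter
ALPHA = 1.0 / NUM_HIGH_TOPICS  # topic Dirichlet hyper-parameter, 1 / number of super-topics
```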
C. Evaluation Criterion

The DUC2005 task is to generate a query focused summary of at most 250 words for each document cluster. The standard DUC evaluation metric ROUGE [15] is used, which measures recall over n-gram statistics of a system generated summary against a set of human generated summaries. Rouge-1 (recall against unigrams) and Rouge-2 (recall against bigrams) results with stop-words are reported.
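As a stripped-down illustration of what ROUGE-n recall measures (the reported scores come from the official ROUGE package [15], not from this sketch):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(system_tokens, reference_tokens_list, n):
    """Overlapping n-grams between the system summary and the reference
    summaries, divided by the total number of n-grams in the references."""
    sys_ngrams = ngrams(system_tokens, n)
    overlap, total = 0, 0
    for ref in reference_tokens_list:
        ref_ngrams = ngrams(ref, n)
        overlap += sum(min(c, sys_ngrams[g]) for g, c in ref_ngrams.items())
        total += sum(ref_ngrams.values())
    return overlap / total if total else 0.0
```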
D. Results and Discussion

Experiments with different numbers of topics are performed. The best results are obtained when the numbers of super-topics and sub-topics are 5 and 10 respectively.

Rouge-1 and Rouge-2 results are shown in Table I. For MDSBG and NCBsum-A, only those results provided in the respective papers are shown. The average recall, precision and F-measure over several runs are reported. Both TReTTM and TReETTM outperform TTM and ETTM significantly on both the Rouge-1 and Rouge-2 metrics. They also outperform the sentence based topic models SenLDA and LDCC.

TReTTM and TReETTM are able to include important sentences which are query related. Analysis shows that both TTM and ETTM include sentences which are query related but either not included in the human generated summaries or not important for summary generation.

TReTTM and TReETTM do not improve over the state of the art MDSBG and NCBsum-A. MDSBG uses a bipartite graph based on an entity grid to include semantic information for sentence ranking, whereas the proposed method uses only common words for sentence ranking. In future work, we may enhance the proposed method by considering semantic information in the sentence graph. NCBsum-A uses a novelty detection method to find topic aware, balanced and novel sentences. Both MDSBG and NCBsum-A are not topic model based summarization methods. TReTTM and TReETTM have the advantage of a topic hierarchy that allows finer summary details to be extracted from the lower level topics in the hierarchy.

TABLE I. ROUGE-1 AND ROUGE-2 RESULTS

ROUGE-1 ROUGE-2
Recall Precision F-measure Recall Precision F-measure
LDA 0.3293 0.3141 0.3208 0.0457 0.0431 0.0443
SenLDA 0.3352 0.3231 0.3281 0.0523 0.04945 0.0508
LDCC 0.3580 0.3394 0.3481 0.0705 0.0658 0.0680
TTM 0.3515 0.3381 0.3445 0.0587 0.0564 0.0575
ETTM 0.3575 0.3319 0.3423 0.0618 0.0576 0.0593
TReTTM 0.3684 0.3457 0.3565 0.0748 0.0690 0.0717
TReETTM 0.3743 0.3510 0.3620 0.0762 0.0699 0.0729
MDSBG - - - 0.0797 - -
NCBsum-A 0.3909 - - 0.0792 - -

V. CONCLUSION AND FUTURE WORK

In this paper, two different text summarization approaches are combined to take advantage of both topic model based and graph based approaches for query focused summarization. Topic model based approaches are able to identify general topics and their correlations for query focused summarization, but they fall short of producing high ROUGE scores. In this paper, TTM and ETTM are combined with the graph based TextRank method to achieve better ROUGE scores. The combined methods TReTTM and TReETTM outperform both TTM and ETTM on the Rouge-1 and Rouge-2 evaluations. They also outperform the sentence based LDCC and SenLDA summarization methods.

For future work, different forms of sentence graphs may be used to find the sentence TextRank scores. TReTTM uses the simple sentence scoring used by TTM, i.e. counting the number of times a sentence is sampled for a query word. The scoring method may be enhanced to include both topical information and sentence features.

REFERENCES

[1] H. Daumé III and D. Marcu, "Bayesian query-focused summarization," Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2006.
[2] J. Tang, L. Yao and D. Chen, "Multi-topic based query-oriented summarization," Proceedings of the 2009 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2009.
[3] R. Mihalcea and P. Tarau, "TextRank: Bringing order into text," Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 2004.
[4] G. Erkan and D. R. Radev, "LexRank: Graph-based lexical centrality as salience in text summarization," Journal of Artificial Intelligence Research 22 (2004): 457-479.
[5] V. Woloszyn et al., "Modeling, comprehending and summarizing textual content by graphs," arXiv preprint arXiv:1807.00303 (2018).
[6] A. Sakhadeo and N. Srivastava, "Effective extractive summarization using frequency-filtered entity relationship graphs," arXiv preprint arXiv:1810.10419 (2018).
[7] W. Mann and S. Thompson, "Rhetorical structure theory: Toward a functional theory of text organization," Text 8 (1988): 243-281.
[8] M. Fattah, "A hybrid machine learning model for multi-document summarization," (2014): 592-600.
[9] L. Yang, X. Cai, Y. Zhang and P. Shi, "Enhancing sentence-level clustering with ranking-based clustering framework for theme-based summarization," Information Sciences 260 (2014): 37-50.
[10] A. Haghighi and L. Vanderwende, "Exploring content models for multi-document summarization," Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2009.
[11] P. Li, J. Jiang and Y. Wang, "Generating templates of entity summaries with an entity-aspect model and pattern mining," Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2010.
[12] N. Akhtar and M. M. Sufyan Beg, "User graph topic model," Journal of Intelligent & Fuzzy Systems, vol. 36, no. 3, pp. 2229-2240, 2019.
[13] A. Celikyilmaz and D. Hakkani-Tür, "Discovery of topically coherent sentences for extractive summarization," Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Association for Computational Linguistics, 2011.
[14] W. Li and A. McCallum, "Pachinko allocation: DAG-structured mixture models of topic correlations," Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006.
[15] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," Text Summarization Branches Out (2004).
[16] M. M. Shafiei and E. E. Milios, "Latent Dirichlet co-clustering," Data Mining, 2006. ICDM'06. Sixth International Conference on, IEEE, 2006.
[17] J. D. Lafferty and D. M. Blei, "Correlated topic models," Advances in Neural Information Processing Systems, 2006.
[18] N. Akhtar, H. Javed and T. Ahmad, "Hierarchical summarization of text documents using topic modeling and formal concept analysis," Data Management, Analytics and Innovation, Springer, Singapore, 2019, 21-33.
[19] G. Balikas, M.-R. Amini and M. Clausel, "On a topic model for sentences," Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2016.
[20] T. Griffiths, "Gibbs sampling in the generative model of latent Dirichlet allocation," (2002).
[21] D. Blei, A. Y. Ng and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research 3 (2003): 993-1022.
[22] R. Arora and B. Ravindran, "Latent Dirichlet allocation based multi-document summarization," Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, ACM, 2008.

[23] DUC 2005 Summarization Task, https://duc.nist.gov/duc2005/tasks.html
[24] T. Minka, "Estimating a Dirichlet distribution," (2000).
[25] X. Li et al., "Exploiting novelty, coverage and balance for topic-focused multi-document summarization," Proceedings of the 19th ACM International Conference on Information and Knowledge Management, ACM, 2010.
[26] D. Parveen and M. Strube, "Multi-document summarization using bipartite graphs," Proceedings of TextGraphs-9: The Workshop on Graph-based Methods for Natural Language Processing, 2014.
[27] P. Li et al., "Generating aspect-oriented multi-document summarization with event-aspect model," Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2011.

