
Ranking Documents using Similarity-based PageRanks

Shota HATAKENAKA Takao MIURA


Dept. of Elect. and Elect. Engineering, HOSEI University
Kajinocho 3-7-2, Koganei, Tokyo, Japan
E-mail: [email protected] [email protected]

Abstract

In this investigation we propose a new ranking algorithm for documents based on popularity. Considering similarity as virtual links, we introduce structural aspects among documents. Then, by means of authoritative importance or popularity, we obtain a ranking scheme based on the PageRank algorithm. We discuss experimental results to examine the effectiveness and some applications.

Keywords: Document Ranking, PageRank, Similarity, TF*IDF

978-1-4577-0253-2/11/$26.00 ©2011 IEEE

1 Motivation and Background

Very often we see many electronic papers on Web sites. The clear advantage is that we can inquire into, go through, and navigate these documents easily and quickly. It would be impossible to cope with the information explosion without the relevant technologies. Typical examples are the Web, wikis, blogs and Twitter (micro-blogging), which we can access through the internet. On the other hand, we face an inherent issue: we have mountains of information, but which item is the one we need? Traditionally, many investigations have been proposed about information retrieval (IR) [?]. Given a query in terms of query words, we explore all the documents that match the words (called term matching) or that match the words very often (term frequency). Putting attention on term importance, we examine documents with inverse document frequency, which says that rare words are more important than frequent words. But are the retrieved documents really what we want?

We wonder whether queries truly describe relevance to our interests. Some investigations have discussed features of documents (i.e., attributes for relevance to a query), such as feature selection and relevance feedback approaches (like the Rocchio algorithm).

A ranking approach provides us with an ordering of documents based on some scores. But usually people go through only the top 3 or 4 documents in the ranking list, and the literature discusses how to have documents ranked highly but says nothing about what the rank really means. Highly ranked documents are often called important, though we should define important documents in advance. Whenever we say "important", we look for documents that capture some subjects of interest related to our queries.

To obtain important documents, we apply some ranking algorithm. The higher documents get ranked, the more important they are considered. A ranking algorithm is a way to determine the order of the results of document retrieval. To do that, we combine an IR algorithm based on the notion of importance with several parameters. We propose a new approach to obtain the degree of importance of documents by analyzing similarity among them. To do that, we extend the PageRank algorithm for Web pages [?, ?] to reflect implicit relationships among documents. Unlike Web pages, there exist no explicit links among documents, so we introduce virtual links by means of document similarity. We also examine our algorithm experimentally to evaluate its effectiveness.

Several web page ranking algorithms have been
proposed based on various parameters to find out their advantages and limitations for Web pages [?]. Among others, two graph-based page ranking algorithms have received much attention, i.e., Google PageRank and Hypertext Induced Topic Selection (HITS), which are used widely, successfully and traditionally. These algorithms give equal weights to all links of page relationships for ranking. The algorithms take forward/backward links as input as well as content information (such as tags and anchors). But topic drift and efficiency problems arise, and the analysis comes at indexing time rather than at query time.

This work is organized as follows. We introduce a model for document description and what ranking means in section 2. In section 3, we discuss a new ranking algorithm for general documents based on PageRank, and we show some experimental results in section 4. We conclude our investigation in section 5.

2 Document Description and Ranking

2.1 A Vector Space Model

A vector space model provides us with a powerful and flexible vehicle to describe documents. Basically, in this model, every document is described by means of a multi-set of terms, called a Bag of Words, although each document is a sequence of words so that each word w has its position i, denoted by wi. Identical words may appear many times but at different positions. In this model, however, we keep no positions. For example, no distinction is made between "a dog bites John." and "John bites a dog."

The model is advantageous enough to capture index words (terms). Both words "John" and "dog" are included, while trivial terms such as "a" and "." can be removed. The former words are called content words. Each document d can be described by means of a vector d = <v1, ..., vn> where i = 1, ..., n corresponds to a content word wi and vi means a term weight. Binary frequency (BF) means 1 (presence) or 0 (no presence), and term frequency (TF) means how many times wi appears in d. The notion of inverse document frequency (IDF) is N/m, where N means the total number of documents and m documents contain this word. By a large IDF of a word, we may identify rare documents.

Information Retrieval (IR) can be described easily within this framework. A term matching query can be characterized by BF, and a term frequency query by TF or TF*IDF.

More interesting is similarity. Unlike database queries, we can retrieve similar documents as well as exact answers. The similarity of answer documents provides us with a ranking: the more similarity documents have, the better answers we get. Many kinds of similarities have been proposed so far. Typical examples are cosine similarity, Jaccard similarity and other distances. For example, given a query q in terms of a vector, we examine all the documents di by (di ∗ q)/(|di| |q|), where ∗ means an inner product and | | a norm value. This is just the cosine of the angle between the two vectors, the so-called cosine similarity:

cos(di, q) = (di ∗ q) / (|di| |q|)

In this investigation, we examine only nouns, excluding pronouns and proper nouns, as content words.

2.2 Calculating Importance

How can we obtain important documents? Hopefully we believe that highly ranked documents are important, so we apply some ranking algorithm. Let us note that we do not mean the results returned by IR queries but important documents inspired by queries.

Recent algorithms for ranking have been proposed based on probabilities or heuristic weights [?]. Parametric approaches assume some probability distribution, while heuristics depend on subjective interpretation. Here we stick to analyzing all the documents and extracting discriminative aspects of importance.

In this investigation, we examine documents which receive strong support from other documents. These
aspects are described by the words "authoritative" and "distributive". In the Web world [?], the word "important" has been defined to capture three aspects: subjective (theme-oriented), authoritative and distributive. An authoritative document means that the document contains many topics of interest, and many other "important" documents find very interesting contents (themes) in the whole or a part of this document. By a distributive document, we mean a document that shares a variety of common topics with many other "important" documents, though each of them refers to only parts of the topics.

The two kinds of documents differ from each other, since an authoritative document is considered popular by other important documents so that it inherently carries the contents of topics, while a distributive document contains contents shared, deeply or not, with many other important documents. We expect (and assume) that both authoritative and distributive documents are important.

3 Ranking Documents by PageRank

3.1 PageRank on Web

The PageRank algorithm has been proposed to calculate degrees of importance of Web pages using the link structure among them and to give a ranking to the pages [?]. The degree is sometimes called a PageRank value. In the algorithm, a link from a page X to Y is regarded as a support to Y. The page X is called a supporter of Y. The PageRank value of Y comes from the number of the supports as well as the PageRank values of the supporters.

Let us define a PageRank value formally. Given a Web page Y with a degree p, we assume that there are n links to Y from X1, ..., Xn and each link from Xi has a score ai, and that there are m links from Y to Z1, ..., Zm and the score of the link to Zj is cj. Then we have the equations a1 + ... + an = p and c1 = ... = cm = p/m. That is, assuming p0 means the total sum of input scores to Y and p1 the total sum of output scores from Y, we define p under the condition p = p0 = p1. Then, giving the conditions over all the Web pages and links, we calculate all the ai, cj and p by solving the difference equations.

Considering each Web page as a node and each link as a directed edge, we can give an adjacency matrix A = ((aij)) where aij = 1 if there exists an edge from j to i. Let B = ((bij)) be the matrix with

bij = aij / Σk akj

Note that Σk akj means the total number of links from j, and Σk bkj = 1. In other words, each column b1j, ..., bNj gives the weights over all the nodes for node j. B is called a transition probability matrix because a page j has a link to i with probability bij. The PageRank values should be stable in the sense that, once all the degrees are given consistently, they remain unchanged by transitions. We can obtain such degrees v by solving an eigenvector equation B · v = λv, where we have λ = 1 and v contains a list of the degrees. In fact, let v = (v1, ..., vN) be the degree vector for PageRank. Then we must have the equation which is exactly the same as the equation of input scores for the PageRank condition:

vi = Σj bij vj,   i = 1, ..., N

The PageRank algorithm is easy and simple to understand in terms of linear algebra. However, the algorithm has several practical deficiencies which come from the iteration procedure used to obtain PageRank values [?]. If there is a node which has no outgoing arc, its input score simply accumulates. Such a node is called a RankSink, at which input scores can't be distributed, though this is due to the calculation procedure. Similarly, if there exists a loop in the Web page link structure, hoarding of PageRank value happens. In fact, the input score simply never goes outside and increases again. Note this is also due to the calculation procedure.

From the viewpoint of stochastic processes, if we consider the matrix B as a transition probability, neither situation is ergodic in the sense of a Markov chain: some transitions never arise eventually. That is, there exist Web pages X, Y where users never move from X to Y. In the Web world, it is well known that those who navigate several Web pages randomly are called random surfers. The stochastic process can't be non-ergodic.
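As an illustrative sketch (not from the paper), the stationary condition vi = Σj bij vj can be solved by repeatedly multiplying a score vector by B, i.e., by power iteration. The 4-page link graph below is hypothetical; it is strongly connected, so the iteration converges here even without a teleport term.

```python
# Sketch only: solving v_i = sum_j b_ij * v_j by power iteration
# on a small, made-up link graph.

def pagerank(out_links, n, iters=100):
    """out_links maps page j to the pages i it links to (edges j -> i)."""
    v = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        nxt = [0.0] * n
        for j, targets in out_links.items():
            share = v[j] / len(targets)  # b_ij = a_ij / sum_k a_kj
            for i in targets:
                nxt[i] += share
        v = nxt
    return v

# Toy graph: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0,3 ; 3 -> 0
scores = pagerank({0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}, 4)
print([round(s, 3) for s in scores])  # pages 0 and 2 share the top score
```

Each iteration redistributes every page's current score evenly over its outgoing links, which is exactly one multiplication by the column-normalized matrix B.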

We introduce the notion of a teleport parameter α, which plays a randomizing role in behavior selection to avoid these deficiencies:

B' = (1 − α)B + α/N

Note that N means the total number of Web pages, so a uniform term α/N is added to every transition, and that α means the teleport ratio of random surfers. In the following, we discuss the teleport modification B'.

3.2 Ranking Documents by PageRank

Let us describe a new algorithm to give a document ranking. Our idea is to obtain the importance of documents by applying the PageRank algorithm to a collection of documents. Since there exist no explicit links among documents, we can't apply the algorithm in a straightforward manner. Here we put our attention on similarity relationships and regard some of them as links so that we may apply the algorithm.

First of all, we describe documents as vectors over index terms. Then we define an extended adjacency matrix A = ((aij)) where aij = cos(di, dj) for two documents di, dj, i ≠ j, and aii = 0. Note that 0.0 ≤ aij ≤ 1.0 and A is symmetric, i.e., aij = aji and A = A^T. The relationship between two documents is bidirectional, so forward/backward link analysis improves the process.

In the PageRank algorithm, the matrix A contains binary values and is normalized into B after that. Remember that we distribute the degree p to direct children uniformly, c1 = ... = cm = p/m, where there are m arcs to direct children. Compared to the original PageRank algorithm, the matrix A is normalized into B = ((bij)) as

bij = aij / Σk aik

Then bij plays the role of a weight on the arc (i to j): the more similar two documents are, the more weight they receive.

We put a threshold ρ on the weights: bij = 0.0 if bij ≤ ρ.

Finally, if bij = 0.0, we replace the entry by 1/N, similar to the PageRank algorithm, to avoid non-ergodic transitions, and we make the teleport modification:

B' = (1 − α)B + α/N

Using B' as the transition matrix, we calculate the eigenvector for the eigenvalue 1.0. Note B' is symmetric, which means all the eigenvalues are real and we can obtain the vectors efficiently by means of, say, the Jacobi algorithm.

4 Experiments

4.1 Preliminaries

We show some experimental results to examine our algorithm using 10,000 news articles extracted randomly from Mainichi 2009, January to June. We discuss only the first paragraph of each article, since the first paragraph very often contains the major scenario. We apply morphological analysis in advance and extract only nouns as content words.

To evaluate the results, we apply cosine similarity and count the articles that appear both in the results of our PageRank algorithm and in the results over all the articles. In the latter case, for each article, we calculate all the cosine similarities with all the articles and take the sum of these values. Remember that our results come from authoritative and distributive articles; such articles are similar to all the articles and similar to the similar articles.

Given two ranking results r1, r2, we define Sim(r1, r2) as |A1 ∩ A2|/k, where A1 means the articles in r1, A2 those in r2, and k the number of articles in r1 ∪ r2. Clearly, the larger Sim(r1, r2) is, the better the results r1 agree with r2.

4.2 Results

We show all the results in tables 1, 2 and 3. Table 1 contains the top 10 PageRank ranking, where we show the rank, article ID, PageRank value and headline of the articles. We also show the top 10 results by similarity in table 2, where we show the rank, article ID, similarity sum value and headline of the articles. Table 3 shows Sim values of the top 10, 50, 100, ..., 1000 articles. For example, "top 100" means that 92.0% of the top 100 articles of PageRank and of Similarity are identical.
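The pipeline of sections 3.2 and 4.1 can be sketched as below. This is a hypothetical illustration, not the authors' implementation: the toy term vectors and the values of ρ and α are made up, and isolated documents are handled by leaving their row to the teleport term alone.

```python
# Hedged sketch of similarity-based document ranking (section 3.2):
# cosine similarities become weighted virtual links, small weights are
# cut by a threshold rho, and a teleport term keeps the chain ergodic.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_pagerank(docs, rho=0.05, alpha=0.15, iters=200):
    n = len(docs)
    # Extended adjacency matrix: a_ij = cos(d_i, d_j), a_ii = 0.
    a = [[cosine(docs[i], docs[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    b = []
    for i in range(n):
        s = sum(a[i]) or 1.0                        # isolated doc: keep zeros
        row = [w / s for w in a[i]]                 # b_ij = a_ij / sum_k a_ik
        row = [w if w > rho else 0.0 for w in row]  # threshold rho
        b.append([(1 - alpha) * w + alpha / n for w in row])  # teleport
    # Power iteration toward the eigenvector for eigenvalue 1.
    v = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(b[j][i] * v[j] for j in range(n)) for i in range(n)]
        s = sum(v)
        v = [x / s for x in v]
    return v

# Made-up toy term vectors: documents 0 and 1 are near-duplicates,
# document 2 shares no terms with them.
docs = [[3, 1, 0], [2, 1, 0], [0, 0, 5]]
ranks = similarity_pagerank(docs)
```

On this toy input, the two mutually similar documents support each other through their virtual links and outrank the outlier, which receives only the teleport mass.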

Rank ArticleID PageRank Headline
1 3074 0.03287732 Baseball:WBC, Japan takes a practice game with Chicago Cubs
2 2244 0.03139896 Cherry blossom forecast of flowering time on March 20
3 8745 0.03073055 Professional troublemaker at stockholders’ meetings Arrested –
Demanding profits
4 7428 0.0307029 SARS – infected confirmed in Shizuoka
5 9379 0.02942945 Plenary session of House of Representatives on 21st Afternoon –
Demands Aso resignation Settled
6 5848 0.02902681 Golden Week - Nationwide Sunny
7 8642 0.0288975 Bus and Truck Toll Half-priced for 8 days – during Lantern festival
8 4921 0.02887041 FLU infected – 52 police freshmen in Tochigi
9 9180 0.02865223 Rainy Season ends in southern Kyushu
10 5635* 0.02849303 Prime Minister Aso visits China and Europe

Table 1: Ranking by PageRank

Rank ArticleID Similarity Sum Headline


1 3074 1436.3933 Baseball:WBC, Japan takes a practice game with Chicago Cubs
2 2244 1379.4339 Cherry blossom forecast of flowering time on March 20
3 8745 1348.871501 Professional troublemaker at stockholders’ meetings Arrested –
Demanding profits
4 7428 1344.996201 SARS – infected confirmed in Shizuoka
5 4921 1270.10094 FLU infected – 52 police freshmen in Tochigi
6 9379 1258.451663 Plenary session of House of Representatives on 21st Afternoon –
Demands Aso resignation Settled
7 5848 1248.997552 Golden Week - Nationwide Sunny
8 8642 1244.844031 Bus and Truck Toll Half-priced for 8 days – during Lantern festival
9 9180 1242.542849 Rainy Season ends in southern Kyushu
10 7674* 1238.969157 SUMO Wrestling – Summer Tour scheduled

Table 2: Ranking by Similarity

TOP    Sim (%)
10     90.0
50     90.0
100    92.0
200    92.5
300    91.3
400    92.8
500    91.4
600    93.5
700    93.9
800    94.6
900    93.3
1000   93.9

Table 3: Sim Values in PageRank

4.3 Discussion

Let us discuss what the results mean. As shown in table 3, more than 90% of the articles obtained by PageRank and by Similarity agree in all the cases, and the two approaches look nearly identical.

In table 4, we show the difference in the Top 10 results. Two exceptional articles, 5635 and 7674, appear at the 10th rank by PageRank and by Similarity respectively, where article 5635 appears at the 14th rank by Similarity and article 7674 at the 11th rank by PageRank. There is little distinction between the two. However, in the Top 100 results, this is not true. Let us show the difference in table 5. There exists one article (6377) which PageRank puts roughly 30 ranks higher than Similarity does, and one article (8262) which Similarity puts roughly 30 ranks higher than PageRank does.

ID     PageRank  Similarity  Difference
5635   10        14          -4
7674   11        10          1

Table 4: Exceptional Articles in Top 10 PageRank

Let us go into more detail on ID 6377. By our PageRank algorithm, article ID 6377 has the 96th PageRank value but only the 128th position by Similarity. The headline of ID 6377 is about a new kind of FLU (called SARS).
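The Sim figures in table 3 can be reproduced for the top 10 with a short sketch, assuming k is taken as the length of the compared lists (which matches the reported 90.0%); the IDs below are copied from tables 1 and 2.

```python
# Sketch of the Sim comparison behind Table 3, assuming k is the
# length of the compared top-k lists.

def sim(r1, r2, k):
    """Fraction of article IDs shared by the top-k of two rankings."""
    return len(set(r1[:k]) & set(r2[:k])) / k

# Top-10 article IDs from Tables 1 and 2.
by_pagerank   = [3074, 2244, 8745, 7428, 9379, 5848, 8642, 4921, 9180, 5635]
by_similarity = [3074, 2244, 8745, 7428, 4921, 9379, 5848, 8642, 9180, 7674]
print(sim(by_pagerank, by_similarity, 10))  # -> 0.9, i.e., 90.0%
```

The two lists share nine of ten IDs and differ only in 5635 versus 7674, the exceptional articles of table 4.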

ID     PageRank  Similarity  Difference
1669   93        117         -24
2357   99        110         -11
2766   104       88          16
2705   84        103         -19
2749   90        107         -17
3767   109       89          20
6359   106       93          13
6377   96        128         -32
7066   119       97          22
7411   102       74          28
7657   122       96          26
8262   125       95          30
9179   92        109         -17
9866   103       91          12

Table 5: Exceptional Articles in Top 100 PageRank

(ID 6377 : 1st Paragraph) JFA (Japan Football Association) announced officially that they had canceled the away matches (18th to 27th) of the Japan national women's team and told this decision to both the USA and Canada associations, taking account of a new kind of FLU. Although a penalty arises, the president Mr. Inukai asked the committee for the cancellation. They had scheduled 2 friendly games with the US in Texas (20th) and Utah (23rd) as well as a friendly game with Canada in Toronto (25th). No definite answer has come yet.
(Headline) A new kind of FLU : Cancelling Games with USA and Canada

ID 7428 also talks about a new kind of FLU.

(ID 7428 : 1st Paragraph) They confirmed infected individuals (one male and 2 females) of a new kind of FLU in Shizuoka, Chiba and Tokushima. A male patient (20yo) from Kanagawa stayed at some training facility and has no recent experience of going abroad. He visited the facility on Mar 18, moved to Tokyo on the 30th, came back on the 31st, and was taken ill on June 1.
(Headline) SARS -- infected confirmed in Shizuoka

In our corpus, the 4th article ID 7428 and the 8th article ID 4921 contain "new" and "FLU", which ID 6377 contains too, and there exist 226 articles about SARS that are linked to ID 6377. All of these bring the higher PageRank to ID 6377.

However, the three words "penalty", "Mr. Inukai" and "Utah" appear only in this article, and no word of higher TF value appears. That is why we got a lower cosine similarity. The 4th article, ID 7428, on the other hand, contains SARS-related words as well as many words of higher TF values (1st, 3rd, 5th, 8th and 9th). This is the reason the article gets both higher PageRank and Similarity ranks.

5 Conclusion

In this investigation, we proposed a new algorithm for document ranking based on the PageRank algorithm. By considering document similarity as adjacency among virtual links, we have extended the algorithm to capture weights on links. We have examined a news corpus and shown the effectiveness in obtaining degrees of importance of articles.

References

[1] Brin, S. and Page, L.: The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, vol. 30, pp. 107-117, 1998

[2] Haveliwala, T.H.: Topic-sensitive PageRank, proc. World Wide Web (WWW) Conf., 2002

[3] Langville, A.N. and Meyer, C.D.: Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, June 2006

[4] Manning, C.D., Raghavan, P. and Schutze, H.: Introduction to Information Retrieval, Cambridge University Press, 2008

[5] Sharma, D.K. and Sharma, A.K.: A Comparative Analysis of Web Page Ranking Algorithms, International Journal on Computer Science and Engineering, Vol. 02-08, 2010, pp. 2670-2676

