
Ranking Documents using Similarity-based PageRanks

Shota HATAKENAKA Takao MIURA


Dept. of Elect. and Elect. Engineering, HOSEI University
Kajinocho 3-7-2, Koganei, Tokyo, Japan
E-mail: [email protected] [email protected]

Abstract

In this investigation we propose a new ranking algorithm for documents based on popularity. Considering similarity as virtual links, we introduce structural aspects among documents. Then, by means of authoritative importance or popularity, we obtain a ranking scheme based on the PageRank algorithm. We discuss experimental results to examine the effectiveness and some applications.

Keywords: Document Ranking, PageRank, Similarity, TF*IDF

978-1-4577-0253-2/11/$26.00 ©2011 IEEE

1 Motivation and Background

Very often we see many electronic papers on Web sites. The clear advantage is that we can inquire into, go through, and navigate these documents easily and quickly. It would be impossible to cope with the information explosion without the relevant technologies. Typical examples are the Web, wikis, blogs and Twitter (micro-blogging), which we can access through the internet. On the other hand, we face an inherent issue: we have mountains of information, but which item is the one we need? Traditionally, many investigations have been proposed about information retrieval (IR) [?]. Given a query in terms of query words, we explore all the documents that match the words (called term matching) or that match the words very often (term frequency). Putting attention on term importance, we examine documents with inverse document frequency, which says that rare words are more important than frequent words. But are the retrieved documents really what we want?

We wonder whether queries truly describe relevance to our interests. Some investigations have discussed features of documents (i.e., attributes for relevance to a query), such as feature selection and relevance feedback approaches (like the Rocchio algorithm).

A ranking approach provides us with an ordering of documents based on some scores. But usually people go through only the top 3 or 4 documents in the ranking list, and the literature discusses how to have documents ranked highly but says nothing about what the rank really means. Highly ranked documents are often called important, though we should define important documents in advance. Whenever we say "important", we look for documents that capture some subjects of interest related to our queries.

To obtain important documents, we apply some ranking algorithm. The higher documents get ranked, the more important they are considered. A ranking algorithm is a way to determine the order of the results of document retrieval. To do that, we combine an IR algorithm based on the notion of importance with several parameters. We propose a new approach to obtain the degree of importance of documents by analyzing similarity among them. To do that, we extend the PageRank algorithm for Web pages [?, ?] to reflect implicit relationships among documents. Unlike Web pages, there exist no explicit links among documents, so we introduce virtual links by means of document similarity. We also examine our algorithm experimentally to evaluate its effectiveness.

Several web page ranking algorithms have been
proposed based on various parameters to find out their advantages and limitations for Web pages [?]. Among others, two graph-based page ranking algorithms have received much attention, i.e., Google PageRank and Hypertext Induced Topic Selection (HITS), which are used widely, successfully and traditionally. These algorithms give equal weights to all links of page relationships for ranking. The algorithms take forward/backward links as input as well as content information (such as tags and anchors). But topic drift and efficiency problems arise, and the analysis comes at indexing time rather than at query time.

This work is organized as follows. We introduce a model for document description and what ranking means in section 2. In section 3, we discuss a new ranking algorithm for general documents based on PageRank, and we show some experimental results in section 4. We conclude our investigation in section 5.

2 Document Description and Ranking

2.1 A Vector Space Model

A vector space model provides us with a powerful and flexible vehicle to describe documents. Basically, in this model, every document is described by means of a multi-set of terms, called a Bag of Words, although each document is a sequence of words so that each word w has its position i, denoted by wi. Identical words may appear many times but at different positions. In this model, however, we keep no positions. For example, no distinction is made between "a dog bites John." and "John bites a dog."

The model is advantageous enough to capture index words (terms). Both words "John" and "dog" are included, while trivial terms such as "a" and "." can be removed. The former words are called content words. Each document d can be described by means of a vector d = <v1, ..., vn> where i = 1, ..., n corresponds to a content word wi and vi means a term weight. Binary frequency (BF) means 1 (presence) or 0 (no presence), and term frequency (TF) means how many times wi appears in d. The notion of inverse document frequency (IDF) is N/m, where N means the total number of documents and m documents contain this word. By a large IDF of a word, we may identify rare documents.

Information Retrieval (IR) can be described easily within this framework. A term matching query can be characterized by BF, and a term frequency query by TF or TF*IDF.

More interesting is similarity. Unlike database queries, we can retrieve similar documents as well as exact answers. The similarity of answer documents provides us with a ranking: the more similarity documents have, the better answers we get. Many kinds of similarities have been proposed so far. Typical examples are cosine similarity, Jaccard similarity and other distances. For example, given a query q in terms of a vector, we examine all the documents di by (di ∗ q)/(|di| |q|), where ∗ means an inner product and | | a norm value. This is just the cosine of the angle between the two vectors, the so-called cosine similarity:

cos(di, q) = (di ∗ q) / (|di| |q|)

In this investigation, we examine only nouns, excluding pronouns and proper nouns, as content words.

2.2 Calculating Importance

How can we obtain important documents? Hopefully we believe that highly ranked documents are important, so we apply some ranking algorithm. Let us note that we do not mean the results returned by IR queries but important documents inspired by queries.

Recent algorithms for ranking have been proposed based on probabilities or heuristic weights [?]. Parametric approaches assume some probability distribution, while heuristics depend on subjective interpretation. Here we stick to analyzing all the documents and extracting discriminative aspects of importance.

In this investigation, we examine documents which receive strong support from other documents. These
aspects are described by the words "authoritative" and "distributive". In the Web world [?], the word "important" has been defined to capture three aspects: subjective (theme-oriented), authoritative and distributive. An authoritative document means that the document contains many topics of interest, and many other "important" documents find very interesting contents (themes) in the whole or a part of this document. By a distributive document, we mean a document that shares a variety of common topics with many other "important" documents, though each of them refers to only parts of the topics.

The two kinds of documents differ from each other, since an authoritative document is considered popular by other important documents so that it inherently carries the contents of topics, while a distributive document contains contents shared, deeply or not, with many other important documents. We expect (and assume) that both authoritative and distributive documents are important.

3 Ranking Documents by PageRank

3.1 PageRank on Web

The PageRank algorithm has been proposed to calculate degrees of importance of Web pages using the link structure among them and to give a ranking to the pages [?]. The degree is sometimes called a PageRank value. In the algorithm, a link from a page X to Y is regarded as a support to Y. The page X is called a supporter of Y. The PageRank value of Y comes from the number of the supports as well as the PageRank values of the supporters.

Let us define a PageRank value formally. Given a Web page Y with a degree p, we assume that there are n links to Y from X1, ..., Xn and each link from Xi has a score ai, and that there are m links from Y to Z1, ..., Zm and the score of the link to Zj is cj. Then we have the equations a1 + ... + an = p and c1 = ... = cm = p/m. That is, assuming p0 means the total sum of input scores to Y and p1 the total sum of output scores from Y, we define p under the condition p = p0 = p1. Then, giving the conditions over all the Web pages and links, we calculate all the ai, cj and p by solving the difference equations.

Considering each Web page as a node and each link as a directed edge, we can give an adjacency matrix A = ((aij)) where aij = 1 if there exists an edge from j to i. Let B = ((bij)) be the matrix with

bij = aij / Σk akj

Note that Σk akj means the total number of links from j, and Σk bkj = 1. In other words, each column b1j, ..., bNj gives the weights over all the nodes for node j. B is called a transition probability matrix because a page j has a link to i with probability bij. The PageRank values should be stable in the sense that, once all the degrees are given consistently, they remain unchanged by transitions. We can obtain such degrees v by solving an eigenvector equation B · v = λv, where we have λ = 1 and v contains a list of the degrees. In fact, let v = (v1, ..., vN) be the degree vector for PageRank. Then we must have the equation which is exactly the same as the equation of input scores for the PageRank condition:

vi = Σj bij vj,   i = 1, ..., N

The PageRank algorithm is easy and simple to understand in terms of linear algebra. However, the algorithm has several practical deficiencies which come from the iteration procedure used to obtain PageRank values [?]. If there is a node which has no outgoing arc, its input score simply accumulates. Such a node is called a RankSink, at which input scores can't be distributed, though this is due to the calculation procedure. Similarly, if there exists a loop in the Web page link structure, hoarding of PageRank value happens. In fact, the input score simply never goes outside and increases again. Note this is also due to the calculation procedure.

From the viewpoint of stochastic processes, if we consider the matrix B as a transition probability, neither situation is ergodic in the sense of a Markov chain: some transitions never arise eventually. That is, there exist Web pages X, Y where users never move from X to Y. In the Web world, it is well known that those who navigate several Web pages randomly are called random surfers. The stochastic process can't be non-ergodic.
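As an illustrative sketch (not from the paper), the stationary condition vi = Σj bij vj can be solved by repeatedly multiplying a score vector by B, i.e., by power iteration. The 4-page link graph below is hypothetical; it is strongly connected, so the iteration converges here even without a teleport term.

```python
# Sketch only: solving v_i = sum_j b_ij * v_j by power iteration
# on a small, made-up link graph.

def pagerank(out_links, n, iters=100):
    """out_links maps page j to the pages i it links to (edges j -> i)."""
    v = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        nxt = [0.0] * n
        for j, targets in out_links.items():
            share = v[j] / len(targets)  # b_ij = a_ij / sum_k a_kj
            for i in targets:
                nxt[i] += share
        v = nxt
    return v

# Toy graph: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0,3 ; 3 -> 0
scores = pagerank({0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}, 4)
print([round(s, 3) for s in scores])  # pages 0 and 2 share the top score
```

Each iteration redistributes every page's current score evenly over its outgoing links, which is exactly one multiplication by the column-normalized matrix B.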

We introduce the notion of a teleport parameter α, which plays a randomizing role in behavior selection to avoid these deficiencies:

B' = (1 − α)B + α/N

Note that N means the total number of Web pages, so a uniform term α/N is added to every transition, and that α means the teleport ratio of random surfers. In the following, we discuss the teleport modification B'.

3.2 Ranking Documents by PageRank

Let us describe a new algorithm to give a document ranking. Our idea is to obtain the importance of documents by applying the PageRank algorithm to a collection of documents. Since there exist no explicit links among documents, we can't apply the algorithm in a straightforward manner. Here we put our attention on similarity relationships and regard some of them as links so that we may apply the algorithm.

First of all, we describe documents as vectors over index terms. Then we define an extended adjacency matrix A = ((aij)) where aij = cos(di, dj) for two documents di, dj, i ≠ j, and aii = 0. Note that 0.0 ≤ aij ≤ 1.0 and A is symmetric, i.e., aij = aji and A = A^T. The relationship between two documents is bidirectional, so forward/backward link analysis improves the process.

In the PageRank algorithm, the matrix A contains binary values and is normalized into B after that. Remember that we distribute the degree p to direct children uniformly, c1 = ... = cm = p/m, where there are m arcs to direct children. Compared to the original PageRank algorithm, the matrix A is normalized into B = ((bij)) as

bij = aij / Σk aik

Then bij plays the role of a weight on the arc (i to j): the more similar two documents are, the more weight they receive.

We put a threshold ρ on the weights: bij = 0.0 if bij ≤ ρ.

Finally, if bij = 0.0, we replace the entry by 1/N, similar to the PageRank algorithm, to avoid non-ergodic transitions, and we make the teleport modification:

B' = (1 − α)B + α/N

Using B' as the transition matrix, we calculate the eigenvector for the eigenvalue 1.0. Note B' is symmetric, which means all the eigenvalues are real and we can obtain the vectors efficiently by means of, say, the Jacobi algorithm.

4 Experiments

4.1 Preliminaries

We show some experimental results to examine our algorithm using 10,000 news articles extracted randomly from Mainichi 2009, January to June. We discuss only the first paragraph of each article, since the first paragraph very often contains the major scenario. We apply morphological analysis in advance and extract only nouns as content words.

To evaluate the results, we apply cosine similarity and count the articles that appear both in the results of our PageRank algorithm and in the results over all the articles. In the latter case, for each article, we calculate all the cosine similarities with all the articles and take the sum of these values. Remember that our results come from authoritative and distributive articles; such articles are similar to all the articles and similar to the similar articles.

Given two ranking results r1, r2, we define Sim(r1, r2) as |A1 ∩ A2|/k, where A1 means the articles in r1, A2 those in r2, and k the number of articles in r1 ∪ r2. Clearly, the larger Sim(r1, r2) is, the better the results r1 agree with r2.

4.2 Results

We show all the results in tables 1, 2 and 3. Table 1 contains the top 10 PageRank ranking, where we show the rank, article ID, PageRank value and headline of the articles. We also show the top 10 results by similarity in table 2, where we show the rank, article ID, similarity sum value and headline of the articles. Table 3 shows Sim values of the top 10, 50, 100, ..., 1000 articles. For example, "top 100" means that 92.0% of the top 100 articles of PageRank and of Similarity are identical.
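The pipeline of sections 3.2 and 4.1 can be sketched as below. This is a hypothetical illustration, not the authors' implementation: the toy term vectors and the values of ρ and α are made up, and isolated documents are handled by leaving their row to the teleport term alone.

```python
# Hedged sketch of similarity-based document ranking (section 3.2):
# cosine similarities become weighted virtual links, small weights are
# cut by a threshold rho, and a teleport term keeps the chain ergodic.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_pagerank(docs, rho=0.05, alpha=0.15, iters=200):
    n = len(docs)
    # Extended adjacency matrix: a_ij = cos(d_i, d_j), a_ii = 0.
    a = [[cosine(docs[i], docs[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    b = []
    for i in range(n):
        s = sum(a[i]) or 1.0                        # isolated doc: keep zeros
        row = [w / s for w in a[i]]                 # b_ij = a_ij / sum_k a_ik
        row = [w if w > rho else 0.0 for w in row]  # threshold rho
        b.append([(1 - alpha) * w + alpha / n for w in row])  # teleport
    # Power iteration toward the eigenvector for eigenvalue 1.
    v = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(b[j][i] * v[j] for j in range(n)) for i in range(n)]
        s = sum(v)
        v = [x / s for x in v]
    return v

# Made-up toy term vectors: documents 0 and 1 are near-duplicates,
# document 2 shares no terms with them.
docs = [[3, 1, 0], [2, 1, 0], [0, 0, 5]]
ranks = similarity_pagerank(docs)
```

On this toy input, the two mutually similar documents support each other through their virtual links and outrank the outlier, which receives only the teleport mass.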

Rank ArticleID PageRank Headline
1 3074 0.03287732 Baseball:WBC, Japan takes a practice game with Chicago Cubs
2 2244 0.03139896 Cherry blossom forecast of flowering time on March 20
3 8745 0.03073055 Professional troublemaker at stockholders’ meetings Arrested –
Demanding profits
4 7428 0.0307029 SARS – infected confirmed in Shizuoka
5 9379 0.02942945 Plenary session of House of Representatives on 21st Afternoon –
Demands Aso resignation Settled
6 5848 0.02902681 Golden Week - Nationwide Sunny
7 8642 0.0288975 Bus and Truck Toll Half-priced for 8 days – during Lantern festival
8 4921 0.02887041 FLU infected – 52 police freshmen in Tochigi
9 9180 0.02865223 Rainy Season ends in southern Kyushu
10 5635* 0.02849303 Prime Minister Aso visits China and Europe

Table 1: Ranking by PageRank

Rank ArticleID Similarity Sum Headline


1 3074 1436.3933 Baseball:WBC, Japan takes a practice game with Chicago Cubs
2 2244 1379.4339 Cherry blossom forecast of flowering time on March 20
3 8745 1348.871501 Professional troublemaker at stockholders’ meetings Arrested –
Demanding profits
4 7428 1344.996201 SARS – infected confirmed in Shizuoka
5 4921 1270.10094 FLU infected – 52 police freshmen in Tochigi
6 9379 1258.451663 Plenary session of House of Representatives on 21st Afternoon –
Demands Aso resignation Settled
7 5848 1248.997552 Golden Week - Nationwide Sunny
8 8642 1244.844031 Bus and Truck Toll Half-priced for 8 days – during Lantern festival
9 9180 1242.542849 Rainy Season ends in southern Kyushu
10 7674* 1238.969157 SUMO Wrestling – Summer Tour scheduled

Table 2: Ranking by Similarity

TOP    Sim (%)
10     90.0
50     90.0
100    92.0
200    92.5
300    91.3
400    92.8
500    91.4
600    93.5
700    93.9
800    94.6
900    93.3
1000   93.9

Table 3: Sim Values in PageRank

4.3 Discussion

Let us discuss what the results mean. As shown in table 3, more than 90% of the articles obtained by PageRank and by Similarity agree in all the cases, and the two approaches look nearly identical.

In table 4, we show the difference in the Top 10 results. Two exceptional articles, 5635 and 7674, appear at the 10th rank by PageRank and by Similarity respectively, where article 5635 appears at the 14th rank by Similarity and article 7674 at the 11th rank by PageRank. There is little distinction between the two. However, in the Top 100 results, this is not true. Let us show the difference in table 5. There exists one article (6377) which PageRank puts roughly 30 ranks higher than Similarity does, and one article (8262) which Similarity puts roughly 30 ranks higher than PageRank does.

ID     PageRank  Similarity  Difference
5635   10        14          -4
7674   11        10          1

Table 4: Exceptional Articles in Top 10 PageRank

Let us go into more detail on ID 6377. By our PageRank algorithm, article ID 6377 has the 96th PageRank value but only the 128th position by Similarity. The headline of ID 6377 is about a new kind of FLU (called SARS).
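The Sim figures in table 3 can be reproduced for the top 10 with a short sketch, assuming k is taken as the length of the compared lists (which matches the reported 90.0%); the IDs below are copied from tables 1 and 2.

```python
# Sketch of the Sim comparison behind Table 3, assuming k is the
# length of the compared top-k lists.

def sim(r1, r2, k):
    """Fraction of article IDs shared by the top-k of two rankings."""
    return len(set(r1[:k]) & set(r2[:k])) / k

# Top-10 article IDs from Tables 1 and 2.
by_pagerank   = [3074, 2244, 8745, 7428, 9379, 5848, 8642, 4921, 9180, 5635]
by_similarity = [3074, 2244, 8745, 7428, 4921, 9379, 5848, 8642, 9180, 7674]
print(sim(by_pagerank, by_similarity, 10))  # -> 0.9, i.e., 90.0%
```

The two lists share nine of ten IDs and differ only in 5635 versus 7674, the exceptional articles of table 4.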

ID     PageRank  Similarity  Difference
1669   93        117         -24
2357   99        110         -11
2766   104       88          16
2705   84        103         -19
2749   90        107         -17
3767   109       89          20
6359   106       93          13
6377   96        128         -32
7066   119       97          22
7411   102       74          28
7657   122       96          26
8262   125       95          30
9179   92        109         -17
9866   103       91          12

Table 5: Exceptional Articles in Top 100 PageRank

(ID 6377 : 1st Paragraph) JFA (Japan Football Association) announced officially that they had canceled the away matches (18th to 27th) of the Japan national women's team and told this decision to both the USA and Canada associations, taking account of a new kind of FLU. Although a penalty arises, the president Mr. Inukai asked the committee for the cancellation. They had scheduled 2 friendly games with the US in Texas (20th) and Utah (23rd) as well as a friendly game with Canada in Toronto (25th). No definite answer has come yet.
(Headline) A new kind of FLU : Cancelling Games with USA and Canada

ID 7428 also talks about a new kind of FLU.

(ID 7428 : 1st Paragraph) They confirmed infected individuals (one male and 2 females) of a new kind of FLU in Shizuoka, Chiba and Tokushima. A male patient (20yo) from Kanagawa stayed at some training facility and has no recent experience of going abroad. He visited the facility on Mar 18, moved to Tokyo on the 30th, came back on the 31st, and was taken ill on June 1.
(Headline) SARS -- infected confirmed in Shizuoka

In our corpus, the 4th article ID 7428 and the 8th article ID 4921 contain "new" and "FLU", which ID 6377 contains too, and there exist 226 articles about SARS that are linked to ID 6377. All of these bring the higher PageRank to ID 6377.

However, the three words "penalty", "Mr. Inukai" and "Utah" appear only in this article, and no word of higher TF value appears. That is why we got a lower cosine similarity. The 4th article, ID 7428, on the other hand, contains SARS-related words as well as many words of higher TF values (1st, 3rd, 5th, 8th and 9th). This is the reason the article gets both higher PageRank and Similarity ranks.

5 Conclusion

In this investigation, we proposed a new algorithm for document ranking based on the PageRank algorithm. By considering document similarity as adjacency among virtual links, we have extended the algorithm to capture weights on links. We have examined a news corpus and shown the effectiveness in obtaining degrees of importance of articles.

References

[1] Brin, S. and Page, L.: The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, vol. 30, pp. 107-117, 1998

[2] Haveliwala, T.H.: Topic-sensitive PageRank, proc. World Wide Web (WWW) Conf., 2002

[3] Langville, A.N. and Meyer, C.D.: Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, June 2006

[4] Manning, C.D., Raghavan, P. and Schutze, H.: Introduction to Information Retrieval, Cambridge University Press, 2008

[5] Sharma, D.K. and Sharma, A.K.: A Comparative Analysis of Web Page Ranking Algorithms, International Journal on Computer Science and Engineering, Vol. 02-08, 2010, pp. 2670-2676

