Hatakenaka 2011
aspects are defined by the words "authoritative" and "distributive". In the Web world [?], the word "important" has been defined to capture three aspects: subjective (theme-oriented), authoritative and distributive. An authoritative document is one that contains many topics of interest and in which many other "important" documents find interesting contents (themes), in whole or in part. A distributive document is one that shares a variety of common topics with many other "important" documents, although each of them refers to only part of the topics.

The two kinds of documents differ from each other: an authoritative document is considered popular by other important documents, so it carries the contents of topics inherently, while a distributive document contains contents shared with many other important documents, whether deeply or not. We expect (and assume) that both authoritative and distributive documents are important.

3 Ranking Documents by PageRank

3.1 PageRank on Web

The PageRank algorithm has been proposed to calculate degrees of importance of Web pages from the link structure among them and to rank the pages [?]. The degree is sometimes called a PageRank value. In the algorithm, a link from a page X to a page Y is regarded as a support for Y, and X is called a supporter of Y. The PageRank value of Y comes from the number of supports as well as the PageRank values of the supporters.

Let us define a PageRank value formally. Given a Web page Y with a degree $p$, we assume that there are $n$ links to Y from $X_1, ..., X_n$, that the link from $X_i$ has a score $a_i$, that there are $m$ links from Y to $Z_1, ..., Z_m$, and that the link to $Z_j$ has a score $c_j$. Then we have the equations $a_1 + ... + a_n = p$ and $c_1 = ... = c_m = p/m$. That is, letting $p_0$ be the total sum of input scores to Y and $p_1$ the total sum of output scores from Y, we define $p$ under the condition $p = p_0 = p_1$. Imposing these conditions over all the Web pages and links, we calculate all the $a_i$, $c_j$ and $p$ by solving the difference equations.

Considering each Web page as a node and each link as a directed edge, we can give an adjacency matrix $A = ((a_{ij}))$ where $a_{ij} = 1$ if there exists an edge from $j$ to $i$ and $a_{ij} = 0$ otherwise. Let $B = ((b_{ij}))$ be the matrix defined by

$$b_{ij} = a_{ij} / \sum_k a_{kj}$$

Note that $\sum_k a_{kj}$ is the total number of links from $j$, and that $\sum_k b_{kj} = 1$. Each column $b_{1j}, ..., b_{Nj}$ thus gives the weights that node $j$ assigns over all the nodes. $B$ is called a transition probability matrix because page $j$ leads to page $i$ with probability $b_{ij}$. The PageRank values should be stable in the sense that, once all the degrees are given consistently, they remain unchanged by transitions. We can obtain such degrees $v$ by solving the eigenvector equation $B \cdot v = \lambda v$ with $\lambda = 1$, where $v$ contains the list of degrees. In fact, let $v = (v_1, ..., v_N)$ be the degree vector for PageRank. Then we must have the following equation, which is exactly the equation of input scores in the PageRank condition:

$$v_i = \sum_j b_{ij} v_j, \quad i = 1, ..., N$$

The PageRank algorithm is simple to understand in terms of linear algebra. In practice, however, the algorithm has several deficiencies that come from the iterative procedure used to obtain PageRank values [?]. If there is a node with no outgoing arc, its input score simply accumulates. Such a node is called a RankSink: it cannot distribute its input scores, although this is an effect of the calculation procedure. Similarly, if there exists a loop in the Web page link structure, PageRank values are hoarded: the input score never leaves the loop and keeps increasing. This too is an effect of the calculation procedure.

From the viewpoint of stochastic processes, if we consider the matrix $B$ as a transition probability, neither situation is ergodic as a Markov chain: some transitions never arise. That is, there exist Web pages X and Y such that users never move from X to Y. In the Web world, those who navigate Web pages randomly are called random surfers, and the stochastic process that models them must be ergodic.
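As an illustration of the eigenvector condition $v_i = \sum_j b_{ij} v_j$, the following minimal sketch builds $B$ for a small graph and iterates $v \leftarrow B v$. The three-page graph is hypothetical and was chosen to be strongly connected and aperiodic, so that plain iteration converges without any correction; using power iteration as the solver is likewise an assumption of this sketch, not a statement about how the values were computed in the paper.

```python
# Hypothetical 3-page Web graph (not from the paper):
# edges[j] lists the pages that page j links to.
edges = {0: [1, 2], 1: [2], 2: [0]}
n = len(edges)

# Adjacency matrix A with a[i][j] = 1 if there is an edge from j to i.
a = [[1.0 if i in edges[j] else 0.0 for j in range(n)] for i in range(n)]

# Column normalisation: b[i][j] = a[i][j] / sum_k a[k][j],
# so every column of B sums to 1.
out_degree = [sum(a[k][j] for k in range(n)) for j in range(n)]
b = [[a[i][j] / out_degree[j] for j in range(n)] for i in range(n)]

# Power iteration: start from the uniform vector and apply v <- B v.
v = [1.0 / n] * n
for _ in range(200):
    v = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]

print([round(x, 4) for x in v])  # approximately [0.4, 0.2, 0.4]
```

Page 1 receives only half of page 0's score, which is why its value is half that of the other two pages. If a page had no outgoing link, `out_degree` would contain a zero and the normalisation would fail, which is the RankSink situation described in the text.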
We introduce a teleport parameter $\alpha$, which plays a randomizing role in behavior selection, to avoid these deficiencies:

$$B' = (1 - \alpha) B + \alpha / N$$

Note that $N$ is the total number of Web pages, that $1/N$ is added to every transition, and that $\alpha$ is the teleport ratio of random surfer users. In the following, we discuss the teleport modification $B'$.

3.2 Ranking Documents by PageRank

Let us describe a new algorithm for document ranking. Our idea is to obtain the importance of documents by applying the PageRank algorithm to a collection of documents. Since there exist no explicit links among documents, we can't apply the algorithm in a straightforward manner. Here we pay attention to the similarity relationship among documents and consider some of these relationships as links, so that we may apply the algorithm.

First of all, we describe documents as vectors over index terms. Then we define an extended adjacency matrix $A = ((a_{ij}))$ where $a_{ij} = \cos(d_i, d_j)$ for two documents $d_i, d_j$ with $i \neq j$, and $a_{ii} = 0$. Note that $0.0 \leq a_{ij} \leq 1.0$ and that $A$ is symmetric, i.e., $a_{ij} = a_{ji}$ and $A = A^T$. The relationship between two documents is bidirectional, so forward/backward link analysis improves the process.

In the PageRank algorithm, the matrix $A$ contains binary values and is then normalized into $B$. Recall that we distribute the degree $p$ uniformly to the direct children, $c_1 = ... = c_m = p/m$, where there are $m$ arcs to direct children. In contrast to the original PageRank algorithm, here the matrix $A$ is normalized into $B = ((b_{ij}))$ as

$$b_{ij} = a_{ij} / \sum_k a_{ik}$$

Then $b_{ij}$ plays the role of a weight on the arc from $i$ to $j$: the more similar two documents are, the more weight they receive.

We put a threshold $\rho$ on the weights: $b_{ij} = 0.0$ if $b_{ij} \leq \rho$.

Finally, if $b_{ij} = 0.0$, we replace the entry by $1/N$, as in the PageRank algorithm, to avoid non-ergodic transitions, and we make the teleport modification:

$$B' = (1 - \alpha) B + \alpha / N$$

Using $B'$ as the transition matrix, we calculate the eigenvector for the eigenvalue 1.0. Note that $B'$ is symmetric, which means that all the eigenvalues are real and that we can obtain the vectors efficiently by means of, say, the Jacobi algorithm.

4 Experiments

4.1 Preliminaries

We show some experimental results to examine our algorithm, using 10,000 news articles extracted randomly from Mainichi articles of January to June 2009. We consider only the first paragraph of each article, since the first paragraph very often contains the major scenario. We apply morphological analysis in advance and extract only nouns as content words.

To evaluate the results, we apply cosine similarity and count the articles that appear both in the results of our PageRank algorithm and in the results of a similarity ranking over all the articles. In the latter case, for each article, we calculate the cosine similarities with all the other articles and take the sum of these values. Recall that our results come from authoritative and distributive articles: these articles are similar to all the articles and similar to the similar articles.

Given two ranking results $r_1$, $r_2$, we define $Sim(r_1, r_2)$ as $|A_1 \cap A_2| / k$, where $A_1$ is the set of articles in $r_1$, $A_2$ the set of articles in $r_2$, and $k$ the number of articles in $r_1 \cup r_2$. Clearly, the higher $Sim(r_1, r_2)$ is, the better the results $r_1$ are with respect to $r_2$.

4.2 Results

We show all the results in Tables 1, 2 and 3. Table 1 contains the top 10 of the PageRank ranking, with the rank, article ID, PageRank value and headline of each article. Table 2 shows the top 10 results by similarity, with the rank, article ID, similarity sum value and headline of each article. Table 3 shows the Sim values for the top 10, 50, 100, ..., 1000 articles. For example, the "top 100" entry means that 92.0% of the top 100 articles by PageRank and by Similarity are identical.
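The procedure of Section 3.2 and the comparison of Section 4.1 can be sketched end to end as follows. This is a minimal illustration under several assumptions that are not in the paper: the four term-count vectors are hypothetical, the values of `rho` and `alpha` are arbitrary, each row is renormalised after zero entries are replaced by $1/N$ so that it remains a probability distribution, the stationary vector is obtained by power iteration on the row-normalised matrix rather than by the Jacobi algorithm, and Sim is read literally as the size of the intersection divided by the size of the union.

```python
import math

def cosine(u, v):
    """Cosine similarity of two term-count vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_matrix(docs):
    """Extended adjacency matrix: a[i][j] = cos(d_i, d_j) for i != j, a[i][i] = 0."""
    n = len(docs)
    return [[0.0 if i == j else cosine(docs[i], docs[j]) for j in range(n)]
            for i in range(n)]

def transition_matrix(a, rho, alpha):
    """Row-normalise, threshold, fill zeros with 1/N, then apply teleport."""
    n = len(a)
    rows = []
    for row in a:
        total = sum(row)
        w = [x / total if total else 0.0 for x in row]  # b_ij = a_ij / sum_k a_ik
        w = [x if x > rho else 0.0 for x in w]          # threshold rho
        w = [x if x > 0.0 else 1.0 / n for x in w]      # zero entries become 1/N
        s = sum(w)
        w = [x / s for x in w]                          # assumption: renormalise row
        rows.append([(1 - alpha) * x + alpha / n for x in w])
    return rows

def stationary(b, iters=200):
    """Power iteration v <- v B for a row-stochastic matrix B."""
    n = len(b)
    v = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(v[i] * b[i][j] for i in range(n)) for j in range(n)]
    return v

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def sim(r1, r2):
    """Overlap of two rankings: |A1 & A2| / |A1 | A2|."""
    a1, a2 = set(r1), set(r2)
    return len(a1 & a2) / len(a1 | a2)

# Hypothetical corpus: document 0 shares terms with 1 and 2, document 3 is isolated.
docs = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
a = similarity_matrix(docs)
pagerank = stationary(transition_matrix(a, rho=0.1, alpha=0.15))
similarity_sum = [sum(row) for row in a]

print(top_k(pagerank, 1), top_k(similarity_sum, 1))
print(sim(top_k(pagerank, 1), top_k(similarity_sum, 1)))
```

In this toy corpus both rankings place document 0 first, so the top-1 overlap is 1.0. On a real collection the two rankings would be compared at several cut-offs, as is done for Table 3.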
Rank  ArticleID  PageRank    Headline
1     3074       0.03287732  Baseball:WBC, Japan takes a practice game with Chicago Cubs
2     2244       0.03139896  Cherry blossom forecast of flowering time on March 20
3     8745       0.03073055  Professional troublemaker at stockholders' meetings Arrested – Demanding profits
4     7428       0.0307029   SARS – infected confirmed in Shizuoka
5     9379       0.02942945  Plenary session of House of Representatives on 21st Afternoon – Demands Aso resignation Settled
6     5848       0.02902681  Golden Week - Nationwide Sunny
7     8642       0.0288975   Bus and Truck Toll Half-priced for 8 days – during Lantern festival
8     4921       0.02887041  FLU infected – 52 police freshmen in Tochigi
9     9180       0.02865223  Rainy Season ends in southern Kyushu
10    5635*      0.02849303  Prime Minister Aso visits China and Europe
ID    PageRank  Similarity  Difference
1669  93        117         -24
2357  99        110         -11
2766  104       88          16
2705  84        103         -19
2749  90        107         -17
3767  109       89          20
6359  106       93          13
6377  96        128         -32
7066  119       97          22
7411  102       74          28
7657  122       96          26
8262  125       95          30
9179  92        109         -17
9866  103       91          12

Table 5: Exceptional Articles in Top 100 PageRank

(ID 6377 : 1st Paragraph) JFA (Japan Football Association) announced that they had officially canceled away matches (18th to 27th) by the Japan national women's team and told this decision to both the USA and Canada associations, taking account of a new kind of FLU. Although there happens a penalty, the president Mr.Inukai asked the committee of the cancellation. They scheduled 2 friendly games with US in Texas (20-th) and Utah (23-th) as well as a friendly game with Canada in Toronto (25-th). No definite answer has come yet.
(Headline) A new kind of FLU : Cancelling Games with USA and Canada

ID 7428 also reports on a new kind of FLU.

(ID 7428 : 1st Paragraph) They confirmed infected individuals (one male and 2 females) of a new kind of FLU in Shizuoka, Chiba and Tokushima. A male patient (20yo) from Kanagawa stayed at some training facility and has no recent experience of going abroad. He visited the facility on Mar 18, moved to Tokyo on 30th and came back on 31st, taken ill on June 1.
(Headline) SARS -- infected confirmed in Shizuoka

In our corpus, the 4-th article, ID 7428, and the 8-th article, ID 4928, contain the words new and FLU, as does ID 6377, and there exist 226 articles about SARS that are linked to ID 6377. All of these bring a higher rank to ID 6377.

However, the 3 words penalty, Mr.Inukai and Utah appear only in this article, and it contains no word with a higher TF value. That is why it gets a lower cosine similarity. The 4-th article, ID 7428, on the other hand, contains SARS-related words as well as many words with higher TF values (1st, 3rd, 5th, 8th and 9th). This is the reason why the article gets high ranks in both PageRank and Similarity.

5 Conclusion

In this investigation, we proposed a new algorithm for document ranking based on the PageRank algorithm. By considering document similarity as adjacency among virtual links, we have extended the algorithm to capture weights on links. We have examined a news corpus and shown the effectiveness of the algorithm in obtaining degrees of importance of articles.

References

[1] Brin, S. and Page, L.: The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, vol. 30, pp. 107-117, 1998

[2] Haveliwala, T.H.: Topic-sensitive PageRank, proc. World Wide Web (WWW) Conf., 2002

[3] Langville, A.N. and Meyer, C.D.: Google's PageRank and Beyond: The Science of Search Engine Rankings, Princeton University Press, June 2006

[4] Manning, C.D., Raghavan, P. and Schütze, H.: Introduction to Information Retrieval, Cambridge University Press, 2008

[5] Sharma, D.K. and Sharma, A.K.: A Comparative Analysis of Web Page Ranking Algorithms, International Journal on Computer Science and Engineering, Vol. 02-08, 2010, pp. 2670-2676