HITS Algorithm
PageRank:
Link voting:
If page P with importance x has n out‐links, each link gets x/n votes
Page R’s importance is the sum of the votes on its in‐links
Complications: Spider traps, Dead‐ends
At each step, random surfer has two options:
With probability $\beta$, follow a link at random
With probability $1-\beta$, jump to some page uniformly at random
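The random-surfer rule above can be sketched as a short power iteration. This is a minimal sketch rather than a definitive implementation: the function name `pagerank`, the three-page example graph, the iteration count, and the handling of dead ends (their rank is spread uniformly over all pages) are assumptions made for illustration. The parameter `beta` is the probability of following a link.

```python
def pagerank(out_links, beta=0.85, iterations=100):
    """Power iteration for the random-surfer model on a small graph.

    out_links maps each page to the list of pages it links to; every page
    that is linked to must also appear as a key.
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random jump: every page receives (1 - beta) / n.
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for p in pages:
            targets = out_links[p]
            if targets:
                # Link voting: p splits beta * rank[p] evenly over its out-links.
                share = beta * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dead end (assumption): spread its rank uniformly over all pages.
                for q in pages:
                    new_rank[q] += beta * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical three-page graph, used only to exercise the function.
graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

Because the random jump gives every page a nonzero share at each step, a spider trap cannot absorb all of the rank, and the scores keep summing to 1.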
PageRank as a tool to measure the “importance” of Web pages
Technique:
Get as many links from accessible pages as possible to target page t
Construct “link farm” to get PageRank multiplier effect
[Figure: link farm. Pages are grouped as inaccessible, accessible, and own; the own pages are the target page t and millions of farm pages (1, 2, ...).]
2/12/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 20
For $\beta = 0.85$, $1/(1-\beta^2) = 3.6$
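The arithmetic behind the quoted value can be checked directly. The sketch below only evaluates the expression $1/(1-\beta^2)$ for a few values of $\beta$. The function name `farm_multiplier` and the extra values of `beta` are assumptions for illustration; only the case 0.85 comes from the text.

```python
def farm_multiplier(beta):
    """Return 1 / (1 - beta**2) for a link-following probability beta."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must satisfy 0 <= beta < 1")
    return 1.0 / (1.0 - beta ** 2)


# Only beta = 0.85 appears in the text; the other values show the trend.
for beta in (0.5, 0.8, 0.85, 0.9):
    print(f"beta = {beta:.2f}  multiplier = {farm_multiplier(beta):.2f}")
```

The multiplier grows quickly as `beta` approaches 1, because less rank leaves the farm through random jumps at each step.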
Trust splitting:
The larger the number of out‐links from a page, the less scrutiny the page author gives each out‐link
Trust is “split” across out‐links
PageRank:
Pick the top k pages by PageRank
Theory is that you can’t get a bad page’s rank really high
Complementary view:
What fraction of a page’s PageRank comes
from “spam” pages?
Then:
$r^-(p) = r(p) - r^+(p)$
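A small sketch of this subtraction, together with the fraction asked about above. It assumes that r(p) is the ordinary PageRank of p and that r+(p) is the part of it attributed to trusted (non-spam) pages; the text does not spell out either definition. The function name `spam_mass` and the numbers are made up for illustration.

```python
def spam_mass(r, r_plus):
    """Return (r_minus, fraction) for one page.

    r      : ordinary PageRank r(p)
    r_plus : r+(p), assumed here to be the part of r(p) that comes
             from trusted (non-spam) pages
    """
    if r <= 0:
        raise ValueError("r must be positive")
    r_minus = r - r_plus            # r-(p) = r(p) - r+(p)
    return r_minus, r_minus / r     # fraction of PageRank coming from spam


# Made-up scores for a single page.
r_minus, fraction = spam_mass(r=0.004, r_plus=0.001)
print(r_minus, fraction)
```

A page whose fraction is close to 1 gets nearly all of its PageRank from pages outside the trusted set.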
[Figure: graph linking conferences (SDM, AAAI, NIPS, …) to authors (R. Ramakrishnan, M. Jordan, …)]
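Each page i has 2 scores:
Authority score: $a_i$
Hub score: $h_i$
HITS algorithm:
Initialize: $a_i = 1$, $h_i = 1$
Then keep iterating:
Authority: $a_i = \sum_{j \to i} h_j$
Hub: $h_i = \sum_{i \to j} a_j$
Normalize: rescale $a$ and $h$ to a fixed scale (for example $\sum_i a_i = 1$, $\sum_i h_i = 1$)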
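The iteration can be sketched on an adjacency list as follows. This is an illustrative sketch under stated assumptions, not the course's reference code: the function name `hits`, the tiny example graph, the fixed iteration count, and the choice to normalize each score vector to sum to 1 are all assumptions. The hub update here uses the freshly computed authority scores.

```python
def hits(out_links, iterations=50):
    """Iterate the authority and hub updates on a small directed graph.

    out_links maps each page to the list of pages it links to; every page
    that is linked to must also appear as a key, and the graph must have
    at least one link.
    """
    pages = sorted(out_links)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: a_i = sum of h_j over pages j that link to i.
        new_auth = {p: 0.0 for p in pages}
        for j in pages:
            for i in out_links[j]:
                new_auth[i] += hub[j]
        # Hub: h_i = sum of a_j over pages j that i links to.
        new_hub = {p: sum(new_auth[j] for j in out_links[p]) for p in pages}
        # Normalize (assumption: scale each vector so that it sums to 1).
        a_total = sum(new_auth.values())
        h_total = sum(new_hub.values())
        auth = {p: v / a_total for p, v in new_auth.items()}
        hub = {p: v / h_total for p, v in new_hub.items()}
    return auth, hub


# Hypothetical graph: p1 and p2 both point to p3.
auth, hub = hits({"p1": ["p3"], "p2": ["p3"], "p3": []})
print(auth)
print(hub)
```

In this toy graph p3 ends up with all of the authority score, while p1 and p2 share the hub score equally.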
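So: $h = A\,a$
And likewise: $a = A^T h$
Example with pages Yahoo, Amazon, M’soft (rows and columns in that order):
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$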
a(yahoo) = 1 1 1 1 ... 1
a(amazon) = 1 1 4/5 0.75 . . . 0.732
a(m’soft) = 1 1 1 1 ... 1
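The authority values listed above can be reproduced with a short matrix-form iteration. In this sketch the page order (Yahoo, Amazon, M'soft), the first row of `A`, and the normalization that scales the largest entry to 1 are assumptions, chosen because they reproduce the listed values 1, 4/5, 0.75, and the limit 0.732 for Amazon. The helper names are hypothetical.

```python
# Adjacency matrix: A[i][j] = 1 if page i links to page j.
A = [
    [1, 1, 1],  # Yahoo  -> Yahoo, Amazon, M'soft (assumed first row)
    [1, 0, 1],  # Amazon -> Yahoo, M'soft
    [0, 1, 0],  # M'soft -> Amazon
]


def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def transpose(m):
    """Return the transpose of m as a list of rows."""
    return [list(col) for col in zip(*m)]


def scale_to_max(v):
    """Rescale v so that its largest entry equals 1."""
    top = max(v)
    return [x / top for x in v]


def hits_matrix(a_matrix, iterations=50):
    """Alternate a = A^T h and h = A a, starting from all-ones vectors."""
    n = len(a_matrix)
    a_t = transpose(a_matrix)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        a = scale_to_max(mat_vec(a_t, h))       # a = A^T h
        h = scale_to_max(mat_vec(a_matrix, a))  # h = A a
    return a, h


a, h = hits_matrix(A)
print([round(x, 3) for x in a])
print([round(x, 3) for x in h])
```

Calling `hits_matrix(A, 2)` and `hits_matrix(A, 3)` gives Amazon authority scores of 4/5 and 0.75, matching the intermediate columns listed above.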