Web Crawlers & Hyperlink Analysis
Albert Sutojo
CS267 Fall 2005
Instructor: Dr. T.Y. Lin
Agenda
Web Crawlers
Hyperlink Analysis
HITS
PageRank
Web Crawlers
Definitions
Research on crawlers
1999 : Mercator
A. Heydon and M. Najork. Mercator: A scalable, extensible
web crawler. World Wide Web, 2(4):219-229, 1999.
Research on crawlers
2002 : UbiCrawler
P. Boldi, B. Codenotti, M. Santini, and S. Vigna. Ubicrawler:
A scalable fully distributed web crawler. In Proceedings of
the 8th Australian World Wide Web Conference, July 2002.
Research on crawlers
2005 : DynaBot
Daniel Rocco, James Caverlee, Ling Liu, Terence Critchlow.
Posters: Exploiting the Deep Web with DynaBot : Matching,
Probing, and Ranking. Special interest tracks and posters of the
14th international conference on World Wide Web, May 2005
2006 : ?
Single Crawler
[Figure: flowchart of the crawling loop. Initialize URL list with starting URLs; check termination ([done] ends the crawl, [not done] continues the loop); pick URL from URL list; pass [URL] on; parse page.]
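The loop in the flowchart can be sketched in a few lines. This is a minimal illustration, not the crawler from the slides: the in-memory `PAGES` dictionary, the `fetch` helper, and the `max_pages` limit are assumptions made so the example runs without network access, and the fetch and add-links steps are filled in as the usual counterparts of "pick URL" and "parse page".

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML body (stands in for real HTTP fetches).
PAGES = {
    "http://example.org/": '<a href="http://example.org/a">a</a> '
                           '<a href="http://example.org/b">b</a>',
    "http://example.org/a": '<a href="http://example.org/b">b</a>',
    "http://example.org/b": '<a href="http://example.org/">home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Assumed fetcher: returns the page body, or None if the URL is unknown."""
    return PAGES.get(url)


def crawl(seeds, max_pages=100):
    url_list = deque(seeds)            # initialize URL list with starting URLs
    seen = set(seeds)                  # URLs ever placed on the list
    visited = []
    while url_list and len(visited) < max_pages:   # termination check
        url = url_list.popleft()       # pick URL from URL list (FIFO order)
        page = fetch(url)
        if page is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(page)              # parse page
        for link in parser.links:      # add newly discovered URLs to the list
            if link not in seen:
                seen.add(link)
                url_list.append(link)
    return visited


order = crawl(["http://example.org/"])
print(order)
```

Because the URL list is used as a FIFO queue, pages are visited in breadth-first order; swapping the queue for a priority queue would give a different crawl order without changing the loop.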
Multithreaded Crawler
[Figure: multithreaded crawler. Several threads share one URL list. Each thread repeats: get URL, unlock URL list, fetch page, parse page, add URL; a thread stops at "end".]
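The per-thread steps in the figure can be sketched as follows. This is only an illustration under stated assumptions: the `LINKS` dictionary and `fetch_and_parse` stand in for real fetching and parsing, and the `pending` counter is an assumed way to decide when every thread may stop.

```python
import threading
import time

# Hypothetical link graph standing in for fetch + parse: URL -> outgoing URLs.
LINKS = {
    "u0": ["u1", "u2"],
    "u1": ["u3"],
    "u2": ["u3", "u4"],
    "u3": [],
    "u4": ["u0"],
}


def fetch_and_parse(url):
    """Assumed stand-in for the 'fetch page' and 'parse page' steps."""
    return LINKS.get(url, [])


def crawl_threaded(seeds, num_threads=3):
    url_list = list(seeds)       # shared URL list
    seen = set(seeds)            # URLs ever added to the list
    visited = []                 # URLs processed so far
    pending = [0]                # URLs taken from the list but not yet processed
    lock = threading.Lock()      # guards all shared state above

    def worker():
        while True:
            with lock:                       # lock URL list
                if url_list:
                    url = url_list.pop(0)    # get URL
                    pending[0] += 1
                elif pending[0] == 0:
                    return                   # end: nothing queued, nothing in progress
                else:
                    url = None               # another thread may still add URLs
            # the lock is released here (unlock URL list)
            if url is None:
                time.sleep(0.001)
                continue
            links = fetch_and_parse(url)     # done outside the lock
            with lock:
                visited.append(url)
                for link in links:           # add URL
                    if link not in seen:
                        seen.add(link)
                        url_list.append(link)
                pending[0] -= 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return visited


visited = crawl_threaded(["u0"])
print(sorted(visited))
```

Holding the lock only while touching the URL list keeps the slow fetch and parse steps running in parallel across threads.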
Parallel Crawler
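[Figure: parallel crawler architecture. Several crawling processes (C-procs) are connected to one another through local connections or through the Internet; the diagram also shows the collected pages and the queues of URLs to visit.]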
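The figure does not say how URLs are divided among the C-procs. One possible scheme, shown here purely as an assumption, is to hash the host name so that every URL of a site goes to the same process. The function name `assign_process` and the process count are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_process(url, num_procs):
    """Map a URL to a crawling process index based on a hash of its host name."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as a stable integer.
    return int.from_bytes(digest[:4], "big") % num_procs


urls = [
    "http://example.org/index.html",
    "http://example.org/about.html",
    "http://example.com/",
]
assignment = {u: assign_process(u, 4) for u in urls}
print(assignment)
```

With this choice, two pages of the same site are never fetched by different processes, which also makes it easier to limit the load placed on a single host.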
Breadth-first search
Robot Protocol
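As a small illustration of the robot protocol, a crawler can check a site's robots.txt rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the crawler name `ExampleCrawler`, and the URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())   # parse the rules without a network request

AGENT = "ExampleCrawler"            # assumed crawler name
allowed = rp.can_fetch(AGENT, "http://example.org/index.html")
blocked = rp.can_fetch(AGENT, "http://example.org/private/data.html")
delay = rp.crawl_delay(AGENT)       # seconds to wait between requests, if given
print(allowed, blocked, delay)
```

In a real crawler the rules would be downloaded once per host (for example with `set_url` and `read`) and cached, so that each fetch only costs a local lookup.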
[Figure: general search engine architecture. Components shown: WWW, Crawler(s), Indexer Module, Collection Analysis Module, Indexes (Text, Structure, Utility), Query Engine, Ranking, Results.]
Crawlers
The Indexer
Issues on crawlers
1. General architecture
2. What pages should the crawler download?
3. How should the crawler refresh pages?
4. How should the load on the visited web sites be minimized?
5. How should the crawling process be parallelized?
Connectivity-based analysis
Hyperlink Analysis
Most popular methods :
HITS (1998)
(Hypertext Induced Topic Search)
By Jon Kleinberg
PageRank (1998)
By Lawrence Page & Sergey Brin, Google's founders
HITS
Inverted index (term: documents containing it)
freedom (term 1): doc 3, doc 117, doc 3999
...
registration (term 10): doc 15, doc 3, doc 101, doc 19, doc 1199, doc 280
faculty (term 11): doc 56, doc 94, doc 31, doc 3
...
graduation (term m): doc 223
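To show how such an index is used at query time, the sketch below looks up the documents for a set of query terms. The query ("registration", "faculty") and the function name `relevant_docs` are assumptions for illustration; in HITS the resulting set would then be expanded with pages that link to or from it.

```python
# Inverted index from the slide: term -> documents containing the term.
inverted_index = {
    "freedom": [3, 117, 3999],
    "registration": [15, 3, 101, 19, 1199, 280],
    "faculty": [56, 94, 31, 3],
    "graduation": [223],
}


def relevant_docs(query_terms, index):
    """Return every document that contains at least one of the query terms."""
    docs = set()
    for term in query_terms:
        docs.update(index.get(term, []))   # unknown terms contribute nothing
    return sorted(docs)


root_set = relevant_docs(["registration", "faculty"], inverted_index)
print(root_set)
```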
HITS
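[Figure: neighborhood graph built from the documents that match the query. Node labels shown: 3, 15, 19, 31, 56, 94, 101, 673, 1199. Indegree and outdegree are marked on the graph.]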
HITS
Authority
Hub
HITS computation
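With $x_i$ the authority score and $y_i$ the hub score of page $i$, and $E$ the set of links ($e_{ij}$ is a link from page $i$ to page $j$):

$$x_i^{(k)} = \sum_{j \,:\, e_{ji} \in E} y_j^{(k-1)}$$

$$y_i^{(k)} = \sum_{j \,:\, e_{ij} \in E} x_j^{(k)}$$

for $k = 1, 2, 3, \ldots$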
HITS computation
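[Figure: link graph on four pages d1, d2, d3, d4]

Adjacency matrix L (entry in row di, column dj is 1 if di links to dj):

          d1  d2  d3  d4
     d1 [  0   1   1   0 ]
L =  d2 [  1   0   1   0 ]
     d3 [  0   1   0   1 ]
     d4 [  0   1   0   0 ]

$$x^{(k)} = L^T y^{(k-1)} \quad \text{and} \quad y^{(k)} = L\, x^{(k)}$$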
HITS computation
1. Initialize $y^{(0)} = e$, where $e$ is a column vector of all ones
2. Until convergence do
   $x^{(k)} = L^T y^{(k-1)}$
   $y^{(k)} = L\, x^{(k)}$
   $k = k + 1$
   Normalize $x^{(k)}$ and $y^{(k)}$

Substituting one update into the other:
$x^{(k)} = L^T L\, x^{(k-1)}$
$y^{(k)} = L L^T y^{(k-1)}$

$L^T L$ = authority matrix
$L L^T$ = hub matrix

Computing the authority vector x and the hub vector y can be viewed as finding the dominant right-hand eigenvectors of $L^T L$ and $L L^T$.
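The iteration above can be sketched in plain Python. This is a minimal illustration under stated assumptions: scores are normalized to sum to 1, convergence is tested with an assumed tolerance `tol`, and the 4-page adjacency matrix is a small example chosen for the demonstration.

```python
def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def normalize(v):
    """Scale v so its entries sum to 1 (left unchanged if the sum is 0)."""
    total = sum(v)
    return [a / total for a in v] if total else v


def hits(L, tol=1e-12, max_iter=1000):
    Lt = transpose(L)
    n = len(L)
    y = [1.0] * n                                # y(0) = e, the all-ones vector
    x = [0.0] * n
    for _ in range(max_iter):
        new_x = normalize(mat_vec(Lt, y))        # x(k) = L^T y(k-1)
        new_y = normalize(mat_vec(L, new_x))     # y(k) = L x(k)
        change = max(abs(a - b) for a, b in zip(new_x + new_y, x + y))
        x, y = new_x, new_y
        if change < tol:                         # stop once the scores settle
            break
    return x, y


# Illustrative 4-page graph: L[i][j] = 1 if page i links to page j.
L = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
]
authority, hub = hits(L)
print([round(a, 4) for a in authority])
print([round(h, 4) for h in hub])
```

In this graph the second page has the most in-links and ends up with the highest authority score, while the first page, which points to the two best authorities, ends up as the strongest hub.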
HITS Example
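[Figure: link graph on pages 1, 2, 3, 5, 6, and 10]

Adjacency matrix L (rows and columns ordered 1, 2, 3, 5, 6, 10):

           1   2   3   5   6  10
      1 [  0   0   1   0   1   0 ]
      2 [  1   0   0   0   0   0 ]
L =   3 [  0   0   0   0   1   0 ]
      5 [  0   0   0   0   0   0 ]
      6 [  0   0   1   1   0   0 ]
     10 [  0   0   0   0   1   0 ]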
HITS Example
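Authority matrix (rows and columns ordered 1, 2, 3, 5, 6, 10):

              1   2   3   5   6  10
         1 [  1   0   0   0   0   0 ]
         2 [  0   0   0   0   0   0 ]
L^T L =  3 [  0   0   2   1   1   0 ]
         5 [  0   0   1   1   0   0 ]
         6 [  0   0   1   0   3   0 ]
        10 [  0   0   0   0   0   0 ]

Hub matrix (rows and columns ordered 1, 2, 3, 5, 6, 10):

              1   2   3   5   6  10
         1 [  2   0   1   0   1   1 ]
         2 [  0   1   0   0   0   0 ]
L L^T =  3 [  1   0   1   0   0   1 ]
         5 [  0   0   0   0   0   0 ]
         6 [  1   0   0   0   2   0 ]
        10 [  1   0   1   0   0   1 ]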
HITS Example
The normalized principal eigenvectors, giving the authority scores x and the hub scores y (pages ordered 1, 2, 3, 5, 6, 10), are:
x^T = (0  0  .3660  .1340  .5  0)
y^T = (.3660  0  .2113  0  .2113  .2113)
Authority Ranking = (6 3 5 1 2 10)
Hub Ranking = (1 3 6 10 2 5)
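The scores above can be reproduced with the summation form of the HITS updates. The link structure in `links` is an assumption (1 links to 3 and 6, 2 to 1, 3 to 6, 6 to 3 and 5, 10 to 6), chosen because it yields the authority scores and rankings stated on the slide; the fixed count of 100 iterations is also an assumption.

```python
# Assumed link structure (page -> pages it links to), consistent with the
# authority scores and rankings reported above.
links = {1: [3, 6], 2: [1], 3: [6], 5: [], 6: [3, 5], 10: [6]}
pages = sorted(links)

auth = {p: 1.0 for p in pages}
hub = {p: 1.0 for p in pages}
for _ in range(100):
    # Authority of page i: sum of the hub scores of the pages linking to i.
    auth = {i: sum(hub[j] for j in pages if i in links[j]) for i in pages}
    # Hub of page i: sum of the authority scores of the pages i links to.
    hub = {i: sum(auth[j] for j in links[i]) for i in pages}
    # Normalize both score vectors so that each sums to 1.
    a_sum, h_sum = sum(auth.values()), sum(hub.values())
    auth = {p: v / a_sum for p, v in auth.items()}
    hub = {p: v / h_sum for p, v in hub.items()}

authority_ranking = sorted(pages, key=lambda p: -auth[p])
print([round(auth[p], 4) for p in pages])
print([round(hub[p], 4) for p in pages])
print(authority_ranking[:3])
```

Pages 1, 2, and 10 tie at an authority score of essentially zero, and pages 3, 6, and 10 tie as hubs, so their relative order within a ranking is arbitrary.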
Strengths
Dual rankings
Weaknesses
Query-dependence
PageRank
$$PR(A) = (1 - d) + d \left[ \frac{PR(t_1)}{C(t_1)} + \frac{PR(t_2)}{C(t_2)} + \cdots + \frac{PR(t_n)}{C(t_n)} \right]$$
Where:
PR(A) = PageRank of page A
d = damping factor, usually set to 0.85
t1, t2, t3, ..., tn = pages that link to page A
C( ) = the number of outlinks of a page
In a simpler way:
PR(A) = 0.15 + 0.85 x [a share of the PageRank of every page that links to A]
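The formula can be turned into a short iterative sketch. The function names, the synchronous update (all pages updated from the previous round's values), the starting value of 1 for every page, and the three-page graph are assumptions made for illustration.

```python
def pagerank_step(links, pr, d=0.85):
    """One synchronous update of PR(A) = (1 - d) + d * sum(PR(t) / C(t))."""
    new_pr = {}
    for page in links:
        # Sum the shares from every page t that links to this page;
        # C(t) is the number of outlinks of t.
        incoming = sum(pr[t] / len(links[t]) for t in links if page in links[t])
        new_pr[page] = (1 - d) + d * incoming
    return new_pr


def pagerank(links, iterations=100, d=0.85):
    pr = {page: 1.0 for page in links}   # assumed start: every page has PR = 1
    for _ in range(iterations):
        pr = pagerank_step(links, pr, d)
    return pr


# Hypothetical three-page graph: A links to B; B and C have no outlinks.
links = {"A": ["B"], "B": [], "C": []}
first = pagerank_step(links, {p: 1.0 for p in links})
print({p: round(v, 4) for p, v in first.items()})
```

After one update, A and C (which nobody links to) drop to 1 - d = 0.15, while B receives the full share of A's PageRank.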
PageRank Example
[Figure: pages A, B, and C, each with initial PR = 1]
PR(A) = 0.15
PR(B) = 1
PR(C) = 0.15
Page B's PageRank is higher because page A has voted for page B.
PageRank Example
[Figure: link graph over pages A, B, and C, each with initial PR = 1]
PR(A) = 1.85
PR(B) = 0.575
PR(C) = 0.575
After 100 iterations :
PR(A) = 1.459459
PR(B) = 0.7702703
PR(C) = 0.7702703
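The converged values can be checked by hand. The derivation below assumes that in the lost figure A links to B and C, and that B and C each link back to A; this structure is inferred from the first-iteration values 1.85, 0.575, 0.575 and is not stated in the text. With d = 0.85, the fixed point of the PageRank formula satisfies:

```latex
\begin{aligned}
PR(A) &= 0.15 + 0.85\,\bigl(PR(B) + PR(C)\bigr) \\
PR(B) &= PR(C) = 0.15 + 0.85\,\frac{PR(A)}{2} = 0.15 + 0.425\,PR(A) \\
PR(A) &= 0.15 + 1.7\,\bigl(0.15 + 0.425\,PR(A)\bigr) = 0.405 + 0.7225\,PR(A) \\
PR(A) &= \frac{0.405}{0.2775} \approx 1.459459, \qquad
PR(B) = PR(C) = 0.15 + 0.425 \times 1.459459 \approx 0.770270
\end{aligned}
```

These agree with the values reported after 100 iterations.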
PageRank Example
[Figure: link graph over pages A, B, and C, each with initial PR = 1]
PR(A) = 1.425
PR(B) = 1
PR(C) = 0.575
After 100 iterations :
PR(A) = 1.298245
PR(B) = 0.999999
PR(C) = 0.7017543
Dangling Links
[Figure: link graph over pages A, B, and C, each with initial PR = 1, illustrating dangling links]
PageRank Demo
http://homer.informatics.indiana.edu/cgi-bin/pagerank/cleanup.cgi
PageRank Implementation
Inverted index (term: documents containing it)
freedom (term 1): doc 3, doc 117, doc 3999
...
registration (term 10): doc 101, doc 87, doc 1199
faculty (term 11): doc 280, doc 85
...
graduation (term m): doc 223
PageRank Implementation
Weaknesses
Strengths
Query-independence
Faster retrieval
HITS vs PageRank
HITS                           PageRank
Connectivity-based analysis    Connectivity-based analysis
Query-dependence               Query-independence
Relevance                      Importance
Q&A
Thank you