Introduction to Information Retrieval
Block B
Lecture 12: Link Analysis (Chapter 21)
[Figure: co-author graph]
Facebook Contacts
[Figure: graph of Facebook contacts]
Document Enrichment
Hyperlinks and anchor text (Sec. 21.1)
[Figure: Page A links to Page B via a hyperlink carrying anchor text. Example: the page "Joe's computer hardware links" (listing Sun, HP, IBM) links to IBM's page "Big Blue today announced record profits for the quarter".]
Assumption 2
A hyperlink between pages denotes author-perceived relevance (a quality signal).
This motivates link popularity as a ranking signal.
Query processing
First retrieve all relevant pages, i.e., pages meeting the text query requirements.
More nuanced:
use link popularity as a measure of static (not query-related) goodness,
combined with the text match score (query-related), as in the sketch below.
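A minimal sketch, not from the lecture, of what such a combination could look like in Python; the blend weight lam, the scoring values, and the page names are illustrative assumptions.

# Minimal sketch (illustrative, not the lecture's method): blend a
# query-dependent text-match score with a static link-popularity score.
def net_score(text_score: float, popularity: float, lam: float = 0.7) -> float:
    # lam weights the query-related part; (1 - lam) weights the static part.
    return lam * text_score + (1 - lam) * popularity

# Rank candidate pages that already satisfy the text query (toy values).
candidates = {"pageA": (0.82, 0.10), "pageB": (0.55, 0.90)}  # (text score, popularity)
ranking = sorted(candidates, key=lambda p: net_score(*candidates[p]), reverse=True)
print(ranking)  # highest blended score first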
PageRank scoring (Sec. 21.2)
Imagine a browser doing a random walk on web pages:
start at a random page;
at each step, follow one of the outgoing links of the current page uniformly at random (with three out-links, each is taken with probability 1/3).
Search by topic
For a topic (query) t, the query results form the initial set. At each step the surfer either jumps to a random page of the initial set or follows a random link from the current page, as sketched below.
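A minimal simulation sketch, with illustrative assumptions: graph maps each page to its out-links, initial_set holds the query-result pages for topic t, and d is the damping factor defined on the next slide.

import random

def random_walk(graph, initial_set, d=0.15, steps=100_000):
    visits = {page: 0 for page in graph}
    page = random.choice(initial_set)          # start at a random page of the initial set
    for _ in range(steps):
        if random.random() < d or not graph[page]:
            page = random.choice(initial_set)  # jump to a random page of the initial set
        else:
            page = random.choice(graph[page])  # follow a random outgoing link
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(random_walk(toy_graph, initial_set=["A", "B"]))  # visit frequencies approximate PageRank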
Formal derivation
Random walk
Random walk over the web graph:
with probability d > 0: jump to a random page from the initial set;
with probability 1-d > 0: follow a random link from the current page.
The damping factor d is independent of the current page.
Random jump
Random jump to a page on topic t. Indicator:
$\chi_{p,t} = 1$ if page p is about t, and $\chi_{p,t} = 0$ otherwise.
Number of pages about t:
$N_t = \sum_p \chi_{p,t}$
Probability:
$\Pr(p \mid q, t, \mathrm{jump}) = \frac{\chi_{p,t}}{N_t}$, i.e. 0 if p is not about topic t, and $1/N_t$ otherwise.
Random link
Indicator:
$\chi_{q \to p} = 1$ if there is an edge from q to p, and $\chi_{q \to p} = 0$ otherwise.
Out-degree of node q:
$O(q) = \sum_p \chi_{q \to p}$
Probability:
$\Pr(p \mid q, t, \mathrm{link}) = \frac{\chi_{q \to p}}{O(q)}$
Probability transition
Leading to:
$\Pr(p \mid q, t) = d \, \frac{\chi_{p,t}}{N_t} + (1-d) \, \frac{\chi_{q \to p}}{O(q)}$
Probability check
Lemma: the probabilities sum to 1.
$\sum_p \Pr(p \mid q, t) = \sum_p \left( d \, \frac{\chi_{p,t}}{N_t} + (1-d) \, \frac{\chi_{q \to p}}{O(q)} \right) = \frac{d}{N_t} \sum_p \chi_{p,t} + \frac{1-d}{O(q)} \sum_p \chi_{q \to p} = d + (1-d) = 1$
Transition Matrix
Probability as entry in matrix U:
$U_{p,q} = \Pr(q \mid p, t) = \Pr(p \to q \mid t)$, the transition from p to q.
Example matrix (3 nodes): row p answers "where do we go from p", column q answers "how can we get to node q":
$U = \begin{pmatrix} \Pr(1 \to 1) & \Pr(1 \to 2) & \Pr(1 \to 3) \\ \Pr(2 \to 1) & \Pr(2 \to 2) & \Pr(2 \to 3) \\ \Pr(3 \to 1) & \Pr(3 \to 2) & \Pr(3 \to 3) \end{pmatrix}$
A sketch of building U from the formula above follows.
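A minimal numpy sketch, under the assumption that the graph is small enough for a dense matrix; adj and on_topic encode the indicators $\chi_{p \to q}$ and $\chi_{q,t}$ from the previous slides.

import numpy as np

# Minimal sketch (illustrative): build U with U[p, q] = Pr(q | p, t)
# from Pr(q | p, t) = d * chi_{q,t} / N_t + (1 - d) * chi_{p->q} / O(p).
def transition_matrix(adj: np.ndarray, on_topic: np.ndarray, d: float = 0.15) -> np.ndarray:
    n_t = on_topic.sum()                       # N_t: number of pages about topic t
    out_deg = adj.sum(axis=1, keepdims=True)   # O(p): out-degree of each page
    follow = np.divide(adj, out_deg,
                       out=np.zeros_like(adj, dtype=float), where=out_deg > 0)
    return d * (on_topic / n_t) + (1 - d) * follow

adj = np.array([[0, 1, 1], [0, 0, 1], [1, 0, 0]], dtype=float)  # chi_{p->q}
on_topic = np.array([1.0, 1.0, 0.0])                            # chi_{q,t}
U = transition_matrix(adj, on_topic)
print(U)
print(U.sum(axis=1))  # each row of U sums to 1 (for pages with out-links)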
Double steps
Going in 2 steps from y to x, via intermediate nodes a1, ..., a5:
$U[y,a_1]\,U[a_1,x] + U[y,a_2]\,U[a_2,x] + U[y,a_3]\,U[a_3,x] + U[y,a_4]\,U[a_4,x] + U[y,a_5]\,U[a_5,x]$
That is: multiply row y with column x.
2-step paths
Making 2 steps:
$(U^2)_{y,x} = \Pr(y \to x \text{ in 2 steps}) = \sum_z \Pr(y \to z \to x) = \sum_z U_{y,z}\,U_{z,x}$
so $U^2 = U \cdot U$.
n-step paths
Making n steps:
$\Pr(y \xrightarrow{n} x) = \sum_z \Pr(y \xrightarrow{n-1} z \to x) = \sum_z \Pr(y \xrightarrow{n-1} z)\,\Pr(z \to x)$ (independence)
so the n-step probabilities are the entries of $U^n$ (see the numeric check below).
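A small numeric check with numpy; the matrix values are arbitrary illustrative probabilities.

import numpy as np

# Numeric check (toy values): 2-step probabilities come from U @ U,
# and n-step probabilities from the n-th matrix power of U.
U = np.array([[0.0, 0.5, 0.5],
              [0.3, 0.0, 0.7],
              [1.0, 0.0, 0.0]])

U2 = U @ U                                        # Pr(y -> x in 2 steps)
assert np.isclose(U2[0, 2], sum(U[0, z] * U[z, 2] for z in range(3)))
print(np.linalg.matrix_power(U, 5))               # Pr(y -> x in 5 steps)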
Distribution
Let v be a row vector giving the probability of being at each page:
$v = (v_1, v_2, v_3)$
Stationary distribution
Initial distribution v; consider the situation after n steps:
$v_n = v\,U^n = (v\,U^{n-1})\,U = v_{n-1}\,U$
Stationary distribution (if it exists):
$P = \lim_{n \to \infty} v_n$
Then $P = P\,U$. The result is independent of the initial distribution v! (See the numeric check below.)
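The toy check below (illustrative values; U has all entries positive so the limit exists) shows that $v\,U^n$ approaches the same P for two different starting distributions.

import numpy as np

# Toy check: two different initial distributions converge to the same P with P = P U.
U = np.array([[0.05, 0.50, 0.45],
              [0.30, 0.05, 0.65],
              [0.90, 0.05, 0.05]])

Un = np.linalg.matrix_power(U, 50)
p1 = np.array([1.0, 0.0, 0.0]) @ Un    # start in page 1
p2 = np.array([1/3, 1/3, 1/3]) @ Un    # start uniformly
print(p1, p2)                          # (almost) identical
print(np.allclose(p1, p1 @ U))         # stationarity: P = P U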
Teleporting
The web is full of dead ends, and the random walk can get stuck in them.
At a dead end, jump to a random web page; from any page, with probability d we also jump to a random web page.
Consequence: the walk can no longer get stuck locally.
$PR(p) = d + (1-d) \sum_{q \to p} \frac{1}{O(q)}\, PR(q)$
Pagerank computation
Compute the eigenvector:
$P = P\,U$
Alternative: iterative approximation (power iteration):
start with some v;
repeat
  w = v; v = vU;
until |v - w| small enough
Speedup by repeated squaring, with $X_0 = U$, $v_0 = v$, $X_{i+1} = X_i^2$, $v_{i+1} = v_i\,X_{i+1}$:
start with some v; X = U;
repeat
  w = v; X = X * X; v = vX;
until |v - w| small enough
so that $X_i = U^{2^i}$ and $v_i = v\,U^{\sum_{j=1}^{i} 2^j} = v\,U^{2(2^i - 1)}$.
Problem: U is big data!
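A minimal sketch of both schemes in Python/numpy, assuming U fits in memory as a dense row-stochastic matrix; at web scale this is exactly what the "big data" problem above rules out, motivating the MapReduce formulation that follows.

import numpy as np

def pagerank_power(U, v, eps=1e-10):
    # Plain iteration: v <- v U until the change is small enough.
    while True:
        w, v = v, v @ U
        if np.abs(v - w).sum() < eps:
            return v

def pagerank_squaring(U, v, eps=1e-10):
    # Speedup: square X each round, so v_i = v U^(2(2^i - 1)).
    X = U.copy()
    while True:
        w = v
        X = X @ X
        v = v @ X
        if np.abs(v - w).sum() < eps:
            return v

U = np.array([[0.05, 0.50, 0.45],
              [0.30, 0.05, 0.65],
              [0.90, 0.05, 0.05]])     # toy row-stochastic matrix
v0 = np.full(3, 1 / 3)
print(pagerank_power(U, v0))
print(pagerank_squaring(U, v0))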
MapReduce Computation
Situation of node p
[Figure: node p has in-links from q1, q2, ..., qm (handled in the reduce step) and out-links to r1, r2, ..., rn (handled in the map step).]
[Figure: data flow for node p. Map reads ⟨p : (PRi, [r1, r2, ..., rn])⟩ and emits ⟨r : ΔPR⟩, with ΔPR = PRi(p)/O(p), for every out-neighbour r, plus the structure record itself. Reduce collects ⟨p : [ΔPR1, ..., ΔPRk, (PRi, [r1, ..., rn])]⟩ and applies $PR_{i+1}(p) = d + (1-d) \sum_{q \to p} \frac{1}{O(q)}\, PR_i(q)$.]
Map step
Update $PR_{i+1}(p) = d + (1-d) \sum_{q \to p} \frac{1}{O(q)}\, PR_i(q)$ for each p.
Input: ⟨p, (PRi, [r1, r2, ..., rn])⟩ for each node p
Result: ⟨r, ΔPR⟩ for each node r such that p → r, and ⟨p, (PRi, [r1, r2, ..., rn])⟩
map:
1. along each edge, send the PR-contribution to the end-node;
2. send the structure to the begin-node.
Reduce step
Update $PR_{i+1}(p) = d + (1-d) \sum_{q \to p} \frac{1}{O(q)}\, PR_i(q)$ for each p.
Input: ⟨p, [ΔPR1, ..., ΔPRk, (PRi, [r1, r2, ..., rn])]⟩ for each node p
Output: ⟨p, (PRi+1, [r1, r2, ..., rn])⟩ for each node p
A minimal Python sketch of one such round follows.
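A pure-Python sketch of one round; the record layout, the damping-factor value D, and the toy graph are illustrative assumptions, and the local group-by stands in for the MapReduce shuffle.

from collections import defaultdict

D = 0.15  # damping factor d (assumed value)

def map_step(p, pr, out_links):
    # 1. along each edge, send the PR-contribution PR_i(p) / O(p) to the end-node
    for r in out_links:
        yield r, pr / len(out_links)
    # 2. send the structure record to the begin-node
    yield p, (pr, out_links)

def reduce_step(p, values):
    # Combine contributions: PR_{i+1}(p) = d + (1 - d) * sum of incoming delta-PRs.
    total, out_links = 0.0, []
    for v in values:
        if isinstance(v, tuple):
            out_links = v[1]       # the structure record (PR_i, out-links)
        else:
            total += v             # a delta-PR contribution
    return p, (D + (1 - D) * total, out_links)

# One iteration over a toy graph; the dict stands in for the distributed store.
records = {"A": (1.0, ["B", "C"]), "B": (1.0, ["C"]), "C": (1.0, ["A"])}
grouped = defaultdict(list)                      # "shuffle": group map output by key
for p, (pr, out) in records.items():
    for key, value in map_step(p, pr, out):
        grouped[key].append(value)
records = dict(reduce_step(p, vals) for p, vals in grouped.items())
print(records)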
How to stop
Stop when PRi and PRi+1 are sufficiently similar:
$d(PR_i, PR_{i+1}) = \sum_p \left( PR_i(p) - PR_{i+1}(p) \right)^2$
and end the iteration once $d(PR_i, PR_{i+1})$ is small enough.
The hope (Sec. 21.3)
[Figure: hub pages (e.g., Alice's and Bob's link lists) point to authority pages for "mobile telecom companies" (e.g., AT&T, ITIM, O2).]
Visualization (Sec. 21.3)
[Figure: the root set of query results, contained in the larger base set.]
Iterative update (Sec. 21.3)
Repeat the following updates, for all x:
$h(x) \leftarrow \sum_{x \to y} a(y)$
$a(x) \leftarrow \sum_{y \to x} h(y)$
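A minimal numpy sketch of these updates on an assumed toy base-set adjacency matrix A (A[x, y] = 1 iff x → y); the per-round normalization, which the slide does not show, keeps the scores bounded.

import numpy as np

# Toy base-set graph: A[x, y] = 1 iff page x links to page y.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)

h = np.ones(4)  # hub scores
a = np.ones(4)  # authority scores
for _ in range(50):
    h = A @ a                    # h(x) <- sum of a(y) over edges x -> y
    a = A.T @ h                  # a(x) <- sum of h(y) over edges y -> x
    h /= np.linalg.norm(h)       # normalize so scores stay bounded
    a /= np.linalg.norm(a)

print("hub scores:", h)
print("authority scores:", a)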
End of lecture
EXERCISES