12 Link Analysis

This document provides an introduction to link analysis and PageRank for information retrieval. It discusses two assumptions: 1) anchor text describes the target page and 2) hyperlinks denote author-perceived relevance. PageRank scores pages based on the concept of a random web surfer following links, with pages linked from many pages receiving a higher score over time. The text describes modeling this as a random walk over the link graph and calculating the stationary probabilities to determine importance.

Introduction to Information Retrieval

Block B
Lecture 12: Link analysis (Chapter 21)

Where to find information?

- Information may be obtained from:
  - the contents of the documents
  - the description of the contents (meta-data): semantic web
  - user judgments (obtained from user behavior, implicit or explicit): user profiling
  - reputation among authors: Pagerank

24 November 2015, IR – block C2 - Web Retrieval


Reputation among authors

- Related to the graph structure between documents, e.g., the co-author graph.
Facebook Contacts
[figure]

Document Enrichment
[figure]
The Web as a Directed Graph (Sec. 21.1)

<a href="http://www.cs.ru.nl">Information Science</a>

A hyperlink with anchor text, pointing from Page A to Page B.

- Assumption 1: the text in the anchor of the hyperlink describes the target page (textual context).
- Assumption 2: a hyperlink between pages denotes author-perceived relevance (quality signal).
Assumption 1: Indexing anchor text (Sec. 21.1.1)

- When indexing a document D, include anchor text (A1, A2, A3, …) from links pointing to D.
- Example: D = www.ibm.com is pointed to by:
  - A1: "Armonk, NY-based computer giant IBM announced today"
  - A2: "Big Blue today announced record profits for the quarter"
  - A3: "IBM" (on Joe's computer hardware links page, next to Sun and HP)
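As a sketch of how Assumption 1 can be applied when building an index (not from the lecture; the page names, link data, and the `build_index` helper are invented for illustration):

```python
from collections import defaultdict

def build_index(pages, links):
    """pages: {url: body_text}; links: [(source, target, anchor_text)]."""
    index = defaultdict(set)  # term -> set of urls
    for url, body in pages.items():
        for term in body.lower().split():
            index[term].add(url)
    # Assumption 1: anchor text describes the *target* page,
    # so anchor terms are indexed under the target url.
    for _src, target, anchor in links:
        for term in anchor.lower().split():
            index[term].add(target)
    return index

pages = {"www.ibm.com": "welcome to ibm"}
links = [("joes-hw.example", "www.ibm.com", "Big Blue announced record profits")]
idx = build_index(pages, links)
print(sorted(idx["blue"]))  # www.ibm.com is findable via its anchor text
```

Note that "blue" never occurs in the target page's own text; it reaches the index only through the anchor.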
Approach based on Assumption 2

- Assumption 2: a hyperlink between pages denotes author-perceived relevance (quality signal).
- This leads to: link popularity.
Query processing

- First retrieve all relevant pages, i.e., those meeting the text query requirements.
- Order these by their link popularity.
- More nuanced: use link popularity as a measure of static (query-independent) goodness, combined with the text match score (query-dependent).
Assumption 2: Link popularity

- Ideas:
  - count the number of incoming links;
  - normalization: a link from a page with many outgoing links is less relevant;
  - a link from a high-reputation page is a higher-quality signal.
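The first two ideas can be made concrete on a toy graph (a sketch; the graph and variable names are invented):

```python
from collections import defaultdict

# Toy link graph: (source, target) pairs, invented for illustration.
links = [("a", "c"), ("b", "c"), ("b", "d")]

in_links = defaultdict(list)
out_degree = defaultdict(int)
for src, dst in links:
    in_links[dst].append(src)
    out_degree[src] += 1

# Idea 1: raw number of incoming links.
popularity = {p: len(qs) for p, qs in in_links.items()}

# Idea 2: normalize by the out-degree of the linking page;
# a link from a page with many outgoing links counts for less.
normalized = {p: sum(1 / out_degree[q] for q in qs) for p, qs in in_links.items()}

print(popularity)   # c has 2 incoming links, d has 1
print(normalized)   # b spreads its vote over its 2 outgoing links
```

The third idea, weighting a link by the reputation of the linking page, is what the recursive Pagerank definition below adds.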
Pagerank scoring (Sec. 21.2)

- Imagine a browser doing a random walk on web pages:
  - start at a random page;
  - at each step, go out of the current page along one of the links on that page, equiprobably (e.g., probability 1/3 each for a page with three outgoing links).
- "In the steady state" each page has a long-term visit rate: use this as the page's score.
- This score is the Pagerank.
Search by topic

- Topic t (the query): the query results form the initial set.
- The random surfer either jumps to a random page of the initial set, or follows a random link from the current page.
Formal derivation

Random walk
- Random walk over the internet graph:
  - with probability d > 0: jump to a random page from the initial set;
  - with probability 1-d > 0: follow a random link from the current page.
- The damping factor d is independent of the current page.
- Reputation R[p,t] of page p on topic t:
  - Pr(p|t): the probability that a random surfer searching for pages on topic t eventually visits page p.


Random jump
- Random jump to a page on topic t.
- Indicator: \delta_{p,t} = 1 if page p is about t, 0 otherwise.
- Number of pages about t: N_t = \sum_p \delta_{p,t}
- Probability:
  Pr(p | q, t, jump) = \delta_{p,t} / N_t
  (= 0 if p is not about topic t, = 1/N_t otherwise)
Random link
- Indicator: \delta_{q \to p} = 1 if there is an edge from q to p, 0 otherwise.
- Out-degree of node q: O(q) = \sum_p \delta_{q \to p}
- Probability:
  Pr(p | q, t, link) = \delta_{q \to p} / O(q)
Probability transition
- Leading to:
  Pr(p | q, t) = Pr(p | q, t, jump) · Pr(jump) + Pr(p | q, t, link) · Pr(link)
  with Pr(jump) = d and Pr(link) = 1 - d, so:
  Pr(p | q, t) = d · \delta_{p,t} / N_t + (1 - d) · \delta_{q \to p} / O(q)
Probability check
- Lemma: the probabilities sum to 1:
  \sum_p Pr(p | q, t)
    = \sum_p [ d · \delta_{p,t} / N_t + (1 - d) · \delta_{q \to p} / O(q) ]
    = (d / N_t) · \sum_p \delta_{p,t} + ((1 - d) / O(q)) · \sum_p \delta_{q \to p}
    = d + (1 - d)
    = 1
Transformation to linear algebra for fast computation

Transition Matrix
- Probability as an entry in matrix U:
  U_{p,q} = Pr(q | p, t) = Pr(p \to q | t)   (transition from p to q)
- Example matrix (row p: where do we go from node p; column q: how can we get into node q):
  U = | Pr(1→1)  Pr(1→2)  Pr(1→3) |
      | Pr(2→1)  Pr(2→2)  Pr(2→3) |
      | Pr(3→1)  Pr(3→2)  Pr(3→3) |


Double steps
- Going in 2 steps from y to x, via intermediate nodes a1, …, a5:
  Pr(y → a_i) = U[y, a_i] and Pr(a_i → x) = U[a_i, x]
- Summing over all intermediate nodes:
  Pr(y → x in 2 steps) = U[y,a1]·U[a1,x] + U[y,a2]·U[a2,x] + … + U[y,a5]·U[a5,x]
- This is exactly multiplying row y of U with column x of U.
2-step paths
- Making 2 steps:
  U^2[y,x] = Pr(y → x in 2 steps)
           = \sum_z Pr(y → z → x)
           = \sum_z Pr(y → z) · Pr(z → x)   (independence of the random walk)
           = \sum_z U[y,z] · U[z,x]
- And thus the 2-step transition matrix: U^2 = U · U
n-step paths
- Making n steps:
  Pr(y →^n x) = \sum_z Pr(y →^{n-1} z) · Pr(z → x)   (independence)
- And thus the n-step transition matrix:
  U^n = U^{n-1} · U = U · … · U (n factors)
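The 2-step and n-step identities can be checked directly with plain Python lists (a sketch; the matrix U is made up, but row-stochastic as a transition matrix must be):

```python
# Checking that U^2[y][x] = sum_z U[y][z]*U[z][x], i.e. matrix
# multiplication computes the two-step path probabilities.
def matmul(A, B):
    n = len(A)
    return [[sum(A[y][z] * B[z][x] for z in range(n)) for x in range(n)]
            for y in range(n)]

# A small row-stochastic transition matrix (invented for illustration).
U = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]

U2 = matmul(U, U)     # 2-step transition matrix
U3 = matmul(U2, U)    # 3-step: U^3 = U^2 * U

# Each row of U^n is still a probability distribution.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in U2 + U3)
print(U2[0][0])  # Pr(node 0 -> node 0 in exactly 2 steps) = 0.75
```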


Distribution
- Let v = (v1, v2, v3) be the row vector describing the initial probabilities of being in any node, so Pr(1) = v1, Pr(2) = v2, Pr(3) = v3.
- Then the probability of being in node 3 after one step:
  Pr(1)·Pr(1→3) + Pr(2)·Pr(2→3) + Pr(3)·Pr(3→3)
    = v1·U[1,3] + v2·U[2,3] + v3·U[3,3]
    = \sum_p v_p · U[p,3] = (vU)_3
- So if row vector v describes the initial node distribution,
- then vU describes the node distribution after 1 step.
Stationary distribution
- Initial distribution v; consider the situation after n steps:
  v_n = v U^n = (v U^{n-1}) U = v_{n-1} U
- Stationary distribution (if it exists):
  P = lim_{n→∞} v_n
- Then P = P U, since:
  P U = (lim_{n→∞} v_n) U = lim_{n→∞} (v_n U) = lim_{n→∞} v_{n+1} = P
- The result is independent of the initial distribution v!
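Power iteration illustrates both properties, P = PU and independence of the start vector (a sketch; U is an invented 3-node transition matrix):

```python
# Repeatedly apply v <- vU until the vector stops changing; the limit P
# satisfies P = PU regardless of the start vector, as derived above.
def step(v, U):
    n = len(v)
    return [sum(v[p] * U[p][q] for p in range(n)) for q in range(n)]

U = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]

def stationary(v, U, eps=1e-12):
    while True:
        w = step(v, U)
        if max(abs(a - b) for a, b in zip(v, w)) < eps:
            return w
        v = w

P1 = stationary([1.0, 0.0, 0.0], U)       # start in node 0
P2 = stationary([1/3, 1/3, 1/3], U)       # start uniformly
assert all(abs(a - b) < 1e-9 for a, b in zip(P1, P2))          # same limit
assert all(abs(p - s) < 1e-9 for p, s in zip(P1, step(P1, U)))  # P = PU
print([round(x, 4) for x in P1])
```

For this U the chain is irreducible and aperiodic, so the limit exists; on the real web graph that is exactly what teleporting (next slide) guarantees.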


Teleporting (Sec. 21.2)
- The web is full of dead ends, and a random walk can get stuck in them.
- At a dead end, jump to a random web page.
- More generally, with probability d we jump to a random web page at every step.
- Consequence:
  - the walk can no longer get stuck locally;
  - there is a long-term rate at which any page is visited.
Back to the Pagerank
- The Pagerank is not topic-directed, so t = ∅ ('any topic'): \delta_{p,∅} = 1 for every page, and N_∅ = N, the total number of pages.
- The transition probability equals:
  U_{p,q} = d · \delta_{p,∅} / N + (1 - d) · \delta_{q \to p} / O(q)
          = d · 1/N + (1 - d) · \delta_{q \to p} / O(q)
The resulting formula
- The random walk formula thus is:
  P_p = d · (1/N) + (1 - d) · \sum_{q \to p} (1/O(q)) · P_q
- For the Google Pagerank, the factor 1/N is left out, resulting in:
  PR(p) = d + (1 - d) · \sum_{q \to p} PR(q) / O(q)


Pagerank computation
- Compute the eigenvector: P = P U
- Alternative: iterative approximation:
    start with some v;
    repeat
      w = v; v = vU;
    until |v - w| small enough
- Speedup by repeated squaring:
    start with some v; X = U;
    repeat
      w = v; X = X * X; v = vX;
    until |v - w| small enough
  With X_0 = U and v_0 = v, each round computes X_{i+1} = X_i^2 and v_{i+1} = v_i · X_{i+1}, so X_i = U^{2^i} and
  v_i = v · U^{2^1 + 2^2 + … + 2^i} = v · U^{2(2^i - 1)}
- Problem: U is big data!
MapReduce Computation

The iterative algorithm
- PR(p) = d + (1 - d) · \sum_{q \to p} PR(q) / O(q)
- Initial state: PR_0(p) = 1 for each p
- Update: PR_{i+1}(p) = d + (1 - d) · \sum_{q \to p} PR_i(q) / O(q) for each p
- Lemma: lim_{i→∞} PR_i(p) = PR(p) for each p
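Before distributing the computation, the iterative algorithm can be sketched in plain Python, using the lecture's convention PR(p) = d + (1-d)·Σ PR(q)/O(q) with no 1/N factor (the toy graph and function names are invented):

```python
def pagerank(out_links, d=0.15, eps=1e-10):
    pr = {p: 1.0 for p in out_links}          # PR_0(p) = 1 for each p
    while True:
        nxt = {p: d for p in out_links}       # the d term of the update
        for q, targets in out_links.items():
            share = (1 - d) * pr[q] / len(targets)   # (1-d) * PR_i(q)/O(q)
            for r in targets:
                nxt[r] += share
        if max(abs(nxt[p] - pr[p]) for p in pr) < eps:
            return nxt
        pr = nxt

# Invented toy graph: every node has at least one outgoing link.
out_links = {"p1": ["p2", "p3"], "p2": ["p3"], "p3": ["p1"]}
pr = pagerank(out_links)
print({p: round(v, 4) for p, v in sorted(pr.items())})
```

p3, which receives links from both other pages, ends up with the highest score.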
Situation of node p
[figure: node p with incoming links from q1, …, qm (handled in the reduce step) and outgoing links to r1, …, rn (handled in the map step)]
Node p represented as a key:value pair
- nodeid : (current PR value, list of outgoing links)
- Initialization round:
  - Map: p : Content → p : (1.0, [r1, …, rk]), extracting the links from Content
  - Reduce: identity, yielding p : (PR_0, [r1, …, rk])
- Update round, starting from p : (PR_i, [r1, r2, …, rn]):
  - Map: emit r : ΔPR = PR_i(p) / O(p) for all r such that p → r, plus the structure p : (PR_i, [r1, r2, …, rn])
  - Reduce: collect p : [ΔPR_1, …, ΔPR_k, (PR_i, [r1, r2, …, rn])] and combine with
    PR_{i+1}(p) = d + (1 - d) · \sum_{q \to p} ΔPR, yielding p : (PR_{i+1}, [r1, r2, …, rn])

Initial state: PR_0(p) = 1 for each p
- Input: <URL, Content> tuples
- Output: <p, (1.0, [r1, r2, …, rn])> tuple for each p

  def mapper(URL, Content):
      output(URL, pair(1.0, Links(URL)))

  def reducer(key, values):
      output(key, values)
Update: PR_{i+1}(p) = d + (1 - d) · \sum_{q \to p} PR_i(q) / O(q) for each p
- Input: <p, (PR_i, [r1, r2, …, rn])> for each node p
- Result: <r, ΔPR> for each node r such that p → r
- The map step:
  1. along each edge, send the PR-contribution to the end node;
  2. send the structure to the begin node.

  def mapper(node, values):
      n = size(values.Nodes)
      foreach r in values.Nodes:
          output(r, values.PR / n)
      output(node, values)
Update (reduce step)
- Input: <p, [ΔPR_1, …, ΔPR_k, (PR_i, [r1, r2, …, rn])]> for each node p
- Output: <p, (PR_{i+1}, [r1, r2, …, rn])> for each node p

  def reducer(node, results):
      output(node, pair(d + (1 - d) * sum(results.dPRs), results.links))
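The map and reduce steps above can be simulated in-memory (a sketch: a dict stands in for the shuffle phase, and the toy graph is invented; this is not a real MapReduce framework):

```python
# Simulating one MapReduce update round of the Pagerank iteration.
d = 0.15

def mapper(node, value):
    pr, links = value
    for r in links:                      # 1. send PR-contribution along each edge
        yield (r, pr / len(links))
    yield (node, (pr, links))            # 2. send the node structure to itself

def reducer(node, values):
    contributions = [v for v in values if isinstance(v, float)]
    pr, links = next(v for v in values if isinstance(v, tuple))
    return node, (d + (1 - d) * sum(contributions), links)

def update_round(state):
    groups = {}                          # stands in for the shuffle phase
    for node, value in state.items():
        for k, v in mapper(node, value):
            groups.setdefault(k, []).append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

# Invented toy graph, PR_0(p) = 1 for each p.
state = {"p1": (1.0, ["p2", "p3"]), "p2": (1.0, ["p3"]), "p3": (1.0, ["p1"])}
state = update_round(state)
print({p: round(v[0], 4) for p, v in state.items()})
```

Iterating `update_round` until the scores stop changing gives the same fixed point as the sequential algorithm.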
How to stop
- Measure the distance between PR_i and PR_{i+1}:
  d(PR_i, PR_{i+1}) = \sum_p (PR_i(p) - PR_{i+1}(p))^2
- Stop when this distance is sufficiently small:
  d(PR_i, PR_{i+1}) < ε
Hubs and Authorities (Sec. 21.3)

- A good hub page for a topic points to many authoritative pages for that topic.
- A good authority page for a topic is pointed to by many good hubs for that topic.
- Circular definition: we will turn this into an iterative computation.
The hope (Sec. 21.3)
[figure: hubs (Alice, Bob) pointing to authorities (AT&T, ITIM, O2): mobile telecom companies]
Visualization (Sec. 21.3)
[figure: a small root set expanded into a larger base set]
Iterative update (Sec. 21.3)
- Repeat the following updates, for all x:
  h(x) = \sum_{x \to y} a(y)   (the hub score of x sums the authority scores of the pages x points to)
  a(x) = \sum_{y \to x} h(y)   (the authority score of x sums the hub scores of the pages pointing to x)
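These updates can be sketched directly (an illustration, not the lecture's code; the edge list is invented, and scores are normalized each round so the iteration converges, as in the HITS algorithm):

```python
# Hub/authority updates: h(x) = sum_{x->y} a(y), a(x) = sum_{y->x} h(y).
edges = [("hub1", "auth1"), ("hub1", "auth2"), ("hub2", "auth1")]
nodes = {n for e in edges for n in e}

h = {n: 1.0 for n in nodes}
a = {n: 1.0 for n in nodes}

def normalize(scores):
    s = sum(v * v for v in scores.values()) ** 0.5   # unit length
    return {n: v / s for n, v in scores.items()}

for _ in range(50):
    a = normalize({x: sum(h[y] for (y, z) in edges if z == x) for x in nodes})
    h = normalize({x: sum(a[y] for (z, y) in edges if z == x) for x in nodes})

# hub1 points to both authorities, so it is the best hub;
# auth1 is pointed to by both hubs, so it is the best authority.
assert h["hub1"] > h["hub2"] and a["auth1"] > a["auth2"]
print(round(h["hub1"], 3), round(a["auth1"], 3))
```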
Rewrite in matrix form (Sec. 21.3)
- h = A a
- a = A^T h
- Substituting:
  h = A A^T h
  a = A^T A a
- Thus: h is a (principal) right eigenvector of A A^T, and a is a (principal) right eigenvector of A^T A (with eigenvalue 1 if the scores are suitably normalized).
- Relation between a and h?
  - A^T h = A^T (A A^T h) = (A^T A)(A^T h), so A^T h = a
  - A a = A (A^T A a) = (A A^T)(A a), so A a = h
End of lecture

EXERCISES
- Somebody decides to construct a set of web pages p1, …, pn for some n > 1, where p1 has a high pagerank. The graph is constructed in the following way:
  - Page p1 has incoming links from all other pages p2, …, pn.
  - There is a link from p1 to each pi (2 ≤ i ≤ n).
  - There are no other links outgoing from or incoming to p1, …, pn.
- Compute the pagerank of each of the pages p1, …, pn.
  (use http://www.webworkshop.net/pagerank_calculator.php to get an idea)
