12 Link Analysis

This document provides an introduction to link analysis and PageRank for information retrieval. It discusses two assumptions: 1) anchor text describes the target page and 2) hyperlinks denote author-perceived relevance. PageRank scores pages based on the concept of a random web surfer following links, with pages linked from many pages receiving a higher score over time. The text describes modeling this as a random walk over the link graph and calculating the stationary probabilities to determine importance.

Introduction to Information Retrieval

Block B
Lecture 12: Link analysis (Chapter 21)

Where to find information?

- Information may be obtained from:
  - the contents of the documents
  - the description of the contents (meta-data): semantic web
  - user judgments (obtained from user behavior, implicit or explicit): user profiling
  - reputation among authors: Pagerank

24 November 2015, IR – block C2 - Web Retrieval


Reputation among authors

- Related to the graph structure between documents, e.g., the co-author graph.
Facebook Contacts
[figure]

Document Enrichment
[figure]
The Web as a Directed Graph (Sec. 21.1)

<a href="http://www.cs.ru.nl">Information Science</a>

A hyperlink with anchor text, pointing from Page A to Page B.

- Assumption 1: the text in the anchor of the hyperlink describes the target page (textual context).
- Assumption 2: a hyperlink between pages denotes author-perceived relevance (quality signal).
Assumption 1: Indexing anchor text (Sec. 21.1.1)

- When indexing a document D, include anchor text (A1, A2, A3, …) from links pointing to D.
- Example: D = www.ibm.com is pointed to by:
  - A1: "Armonk, NY-based computer giant IBM announced today"
  - A2: "Big Blue today announced record profits for the quarter"
  - A3: "IBM" (on Joe's computer hardware links page, next to Sun and HP)
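As a sketch of how Assumption 1 can be applied when building an index (not from the lecture; the page names, link data, and the `build_index` helper are invented for illustration):

```python
from collections import defaultdict

def build_index(pages, links):
    """pages: {url: body_text}; links: [(source, target, anchor_text)]."""
    index = defaultdict(set)  # term -> set of urls
    for url, body in pages.items():
        for term in body.lower().split():
            index[term].add(url)
    # Assumption 1: anchor text describes the *target* page,
    # so anchor terms are indexed under the target url.
    for _src, target, anchor in links:
        for term in anchor.lower().split():
            index[term].add(target)
    return index

pages = {"www.ibm.com": "welcome to ibm"}
links = [("joes-hw.example", "www.ibm.com", "Big Blue announced record profits")]
idx = build_index(pages, links)
print(sorted(idx["blue"]))  # www.ibm.com is findable via its anchor text
```

Note that "blue" never occurs in the target page's own text; it reaches the index only through the anchor.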
Approach based on Assumption 2

- Assumption 2: a hyperlink between pages denotes author-perceived relevance (quality signal).
- This leads to: link popularity.
Query processing

- First retrieve all relevant pages, i.e., those meeting the text query requirements.
- Order these by their link popularity.
- More nuanced: use link popularity as a measure of static (query-independent) goodness, combined with the text match score (query-dependent).
Assumption 2: Link popularity

- Ideas:
  - count the number of incoming links;
  - normalization: a link from a page with many outgoing links is less relevant;
  - a link from a high-reputation page is a higher-quality signal.
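The first two ideas can be made concrete on a toy graph (a sketch; the graph and variable names are invented):

```python
from collections import defaultdict

# Toy link graph: (source, target) pairs, invented for illustration.
links = [("a", "c"), ("b", "c"), ("b", "d")]

in_links = defaultdict(list)
out_degree = defaultdict(int)
for src, dst in links:
    in_links[dst].append(src)
    out_degree[src] += 1

# Idea 1: raw number of incoming links.
popularity = {p: len(qs) for p, qs in in_links.items()}

# Idea 2: normalize by the out-degree of the linking page;
# a link from a page with many outgoing links counts for less.
normalized = {p: sum(1 / out_degree[q] for q in qs) for p, qs in in_links.items()}

print(popularity)   # c has 2 incoming links, d has 1
print(normalized)   # b spreads its vote over its 2 outgoing links
```

The third idea, weighting a link by the reputation of the linking page, is what the recursive Pagerank definition below adds.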
Pagerank scoring (Sec. 21.2)

- Imagine a browser doing a random walk on web pages:
  - start at a random page;
  - at each step, go out of the current page along one of the links on that page, equiprobably (e.g., probability 1/3 each for a page with three outgoing links).
- "In the steady state" each page has a long-term visit rate: use this as the page's score.
- This score is the Pagerank.
Search by topic

- Topic t (the query): the query results form the initial set.
- The random surfer either jumps to a random page of the initial set, or follows a random link from the current page.
Formal derivation

Random walk
- Random walk over the internet graph:
  - with probability d > 0: jump to a random page from the initial set;
  - with probability 1-d > 0: follow a random link from the current page.
- The damping factor d is independent of the current page.
- Reputation R[p,t] of page p on topic t:
  - Pr(p|t): the probability that a random surfer searching for pages on topic t eventually visits page p.


Random jump
- Random jump to a page on topic t.
- Indicator: \delta_{p,t} = 1 if page p is about t, 0 otherwise.
- Number of pages about t: N_t = \sum_p \delta_{p,t}
- Probability:
  Pr(p | q, t, jump) = \delta_{p,t} / N_t
  (= 0 if p is not about topic t, = 1/N_t otherwise)
Random link
- Indicator: \delta_{q \to p} = 1 if there is an edge from q to p, 0 otherwise.
- Out-degree of node q: O(q) = \sum_p \delta_{q \to p}
- Probability:
  Pr(p | q, t, link) = \delta_{q \to p} / O(q)
Probability transition
- Leading to:
  Pr(p | q, t) = Pr(p | q, t, jump) · Pr(jump) + Pr(p | q, t, link) · Pr(link)
  with Pr(jump) = d and Pr(link) = 1 - d, so:
  Pr(p | q, t) = d · \delta_{p,t} / N_t + (1 - d) · \delta_{q \to p} / O(q)
Probability check
- Lemma: the probabilities sum to 1:
  \sum_p Pr(p | q, t)
    = \sum_p [ d · \delta_{p,t} / N_t + (1 - d) · \delta_{q \to p} / O(q) ]
    = (d / N_t) · \sum_p \delta_{p,t} + ((1 - d) / O(q)) · \sum_p \delta_{q \to p}
    = d + (1 - d)
    = 1
Transformation to linear algebra for fast computation

Transition Matrix
- Probability as an entry in matrix U:
  U_{p,q} = Pr(q | p, t) = Pr(p \to q | t)   (transition from p to q)
- Example matrix (row p: where do we go from node p; column q: how can we get into node q):
  U = | Pr(1→1)  Pr(1→2)  Pr(1→3) |
      | Pr(2→1)  Pr(2→2)  Pr(2→3) |
      | Pr(3→1)  Pr(3→2)  Pr(3→3) |


Double steps
- Going in 2 steps from y to x, via intermediate nodes a1, …, a5:
  Pr(y → a_i) = U[y, a_i] and Pr(a_i → x) = U[a_i, x]
- Summing over all intermediate nodes:
  Pr(y → x in 2 steps) = U[y,a1]·U[a1,x] + U[y,a2]·U[a2,x] + … + U[y,a5]·U[a5,x]
- This is exactly multiplying row y of U with column x of U.
2-step paths
- Making 2 steps:
  U^2[y,x] = Pr(y → x in 2 steps)
           = \sum_z Pr(y → z → x)
           = \sum_z Pr(y → z) · Pr(z → x)   (independence of the random walk)
           = \sum_z U[y,z] · U[z,x]
- And thus the 2-step transition matrix: U^2 = U · U
n-step paths
- Making n steps:
  Pr(y →^n x) = \sum_z Pr(y →^{n-1} z) · Pr(z → x)   (independence)
- And thus the n-step transition matrix:
  U^n = U^{n-1} · U = U · … · U (n factors)
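The 2-step and n-step identities can be checked directly with plain Python lists (a sketch; the matrix U is made up, but row-stochastic as a transition matrix must be):

```python
# Checking that U^2[y][x] = sum_z U[y][z]*U[z][x], i.e. matrix
# multiplication computes the two-step path probabilities.
def matmul(A, B):
    n = len(A)
    return [[sum(A[y][z] * B[z][x] for z in range(n)) for x in range(n)]
            for y in range(n)]

# A small row-stochastic transition matrix (invented for illustration).
U = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]

U2 = matmul(U, U)     # 2-step transition matrix
U3 = matmul(U2, U)    # 3-step: U^3 = U^2 * U

# Each row of U^n is still a probability distribution.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in U2 + U3)
print(U2[0][0])  # Pr(node 0 -> node 0 in exactly 2 steps) = 0.75
```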


Distribution
- Let v = (v1, v2, v3) be the row vector describing the initial probabilities of being in any node, so Pr(1) = v1, Pr(2) = v2, Pr(3) = v3.
- Then the probability of being in node 3 after one step:
  Pr(1)·Pr(1→3) + Pr(2)·Pr(2→3) + Pr(3)·Pr(3→3)
    = v1·U[1,3] + v2·U[2,3] + v3·U[3,3]
    = \sum_p v_p · U[p,3] = (vU)_3
- So if row vector v describes the initial node distribution,
- then vU describes the node distribution after 1 step.
Stationary distribution
- Initial distribution v; consider the situation after n steps:
  v_n = v U^n = (v U^{n-1}) U = v_{n-1} U
- Stationary distribution (if it exists):
  P = lim_{n→∞} v_n
- Then P = P U, since:
  P U = (lim_{n→∞} v_n) U = lim_{n→∞} (v_n U) = lim_{n→∞} v_{n+1} = P
- The result is independent of the initial distribution v!
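Power iteration illustrates both properties, P = PU and independence of the start vector (a sketch; U is an invented 3-node transition matrix):

```python
# Repeatedly apply v <- vU until the vector stops changing; the limit P
# satisfies P = PU regardless of the start vector, as derived above.
def step(v, U):
    n = len(v)
    return [sum(v[p] * U[p][q] for p in range(n)) for q in range(n)]

U = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.5, 0.0]]

def stationary(v, U, eps=1e-12):
    while True:
        w = step(v, U)
        if max(abs(a - b) for a, b in zip(v, w)) < eps:
            return w
        v = w

P1 = stationary([1.0, 0.0, 0.0], U)       # start in node 0
P2 = stationary([1/3, 1/3, 1/3], U)       # start uniformly
assert all(abs(a - b) < 1e-9 for a, b in zip(P1, P2))          # same limit
assert all(abs(p - s) < 1e-9 for p, s in zip(P1, step(P1, U)))  # P = PU
print([round(x, 4) for x in P1])
```

For this U the chain is irreducible and aperiodic, so the limit exists; on the real web graph that is exactly what teleporting (next slide) guarantees.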


Teleporting (Sec. 21.2)
- The web is full of dead ends, and a random walk can get stuck in them.
- At a dead end, jump to a random web page.
- More generally, with probability d we jump to a random web page at every step.
- Consequence:
  - the walk can no longer get stuck locally;
  - there is a long-term rate at which any page is visited.
Back to the Pagerank
- The Pagerank is not topic-directed, so t = ∅ ('any topic'): \delta_{p,∅} = 1 for every page, and N_∅ = N, the total number of pages.
- The transition probability equals:
  U_{p,q} = d · \delta_{p,∅} / N + (1 - d) · \delta_{q \to p} / O(q)
          = d · 1/N + (1 - d) · \delta_{q \to p} / O(q)
The resulting formula
- The random walk formula thus is:
  P_p = d · (1/N) + (1 - d) · \sum_{q \to p} (1/O(q)) · P_q
- For the Google Pagerank, the factor 1/N is left out, resulting in:
  PR(p) = d + (1 - d) · \sum_{q \to p} PR(q) / O(q)


Pagerank computation
- Compute the eigenvector: P = P U
- Alternative: iterative approximation:
    start with some v;
    repeat
      w = v; v = vU;
    until |v - w| small enough
- Speedup by repeated squaring:
    start with some v; X = U;
    repeat
      w = v; X = X * X; v = vX;
    until |v - w| small enough
  With X_0 = U and v_0 = v, each round computes X_{i+1} = X_i^2 and v_{i+1} = v_i · X_{i+1}, so X_i = U^{2^i} and
  v_i = v · U^{2^1 + 2^2 + … + 2^i} = v · U^{2(2^i - 1)}
- Problem: U is big data!
MapReduce Computation

The iterative algorithm
- PR(p) = d + (1 - d) · \sum_{q \to p} PR(q) / O(q)
- Initial state: PR_0(p) = 1 for each p
- Update: PR_{i+1}(p) = d + (1 - d) · \sum_{q \to p} PR_i(q) / O(q) for each p
- Lemma: lim_{i→∞} PR_i(p) = PR(p) for each p
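Before distributing the computation, the iterative algorithm can be sketched in plain Python, using the lecture's convention PR(p) = d + (1-d)·Σ PR(q)/O(q) with no 1/N factor (the toy graph and function names are invented):

```python
def pagerank(out_links, d=0.15, eps=1e-10):
    pr = {p: 1.0 for p in out_links}          # PR_0(p) = 1 for each p
    while True:
        nxt = {p: d for p in out_links}       # the d term of the update
        for q, targets in out_links.items():
            share = (1 - d) * pr[q] / len(targets)   # (1-d) * PR_i(q)/O(q)
            for r in targets:
                nxt[r] += share
        if max(abs(nxt[p] - pr[p]) for p in pr) < eps:
            return nxt
        pr = nxt

# Invented toy graph: every node has at least one outgoing link.
out_links = {"p1": ["p2", "p3"], "p2": ["p3"], "p3": ["p1"]}
pr = pagerank(out_links)
print({p: round(v, 4) for p, v in sorted(pr.items())})
```

p3, which receives links from both other pages, ends up with the highest score.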
Situation of node p
[figure: node p with incoming links from q1, …, qm (handled in the reduce step) and outgoing links to r1, …, rn (handled in the map step)]
Node p represented as a key:value pair
- nodeid : (current PR value, list of outgoing links)
- Initialization round:
  - Map: p : Content → p : (1.0, [r1, …, rk]), extracting the links from Content
  - Reduce: identity, yielding p : (PR_0, [r1, …, rk])
- Update round, starting from p : (PR_i, [r1, r2, …, rn]):
  - Map: emit r : ΔPR = PR_i(p) / O(p) for all r such that p → r, plus the structure p : (PR_i, [r1, r2, …, rn])
  - Reduce: collect p : [ΔPR_1, …, ΔPR_k, (PR_i, [r1, r2, …, rn])] and combine with
    PR_{i+1}(p) = d + (1 - d) · \sum_{q \to p} ΔPR, yielding p : (PR_{i+1}, [r1, r2, …, rn])

Initial state: PR_0(p) = 1 for each p
- Input: <URL, Content> tuples
- Output: <p, (1.0, [r1, r2, …, rn])> tuple for each p

  def mapper(URL, Content):
      output(URL, pair(1.0, Links(URL)))

  def reducer(key, values):
      output(key, values)
Update: PR_{i+1}(p) = d + (1 - d) · \sum_{q \to p} PR_i(q) / O(q) for each p
- Input: <p, (PR_i, [r1, r2, …, rn])> for each node p
- Result: <r, ΔPR> for each node r such that p → r
- The map step:
  1. along each edge, send the PR-contribution to the end node;
  2. send the structure to the begin node.

  def mapper(node, values):
      n = size(values.Nodes)
      foreach r in values.Nodes:
          output(r, values.PR / n)
      output(node, values)
Update (reduce step)
- Input: <p, [ΔPR_1, …, ΔPR_k, (PR_i, [r1, r2, …, rn])]> for each node p
- Output: <p, (PR_{i+1}, [r1, r2, …, rn])> for each node p

  def reducer(node, results):
      output(node, pair(d + (1 - d) * sum(results.dPRs), results.links))
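The map and reduce steps above can be simulated in-memory (a sketch: a dict stands in for the shuffle phase, and the toy graph is invented; this is not a real MapReduce framework):

```python
# Simulating one MapReduce update round of the Pagerank iteration.
d = 0.15

def mapper(node, value):
    pr, links = value
    for r in links:                      # 1. send PR-contribution along each edge
        yield (r, pr / len(links))
    yield (node, (pr, links))            # 2. send the node structure to itself

def reducer(node, values):
    contributions = [v for v in values if isinstance(v, float)]
    pr, links = next(v for v in values if isinstance(v, tuple))
    return node, (d + (1 - d) * sum(contributions), links)

def update_round(state):
    groups = {}                          # stands in for the shuffle phase
    for node, value in state.items():
        for k, v in mapper(node, value):
            groups.setdefault(k, []).append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

# Invented toy graph, PR_0(p) = 1 for each p.
state = {"p1": (1.0, ["p2", "p3"]), "p2": (1.0, ["p3"]), "p3": (1.0, ["p1"])}
state = update_round(state)
print({p: round(v[0], 4) for p, v in state.items()})
```

Iterating `update_round` until the scores stop changing gives the same fixed point as the sequential algorithm.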
How to stop
- Measure the distance between PR_i and PR_{i+1}:
  d(PR_i, PR_{i+1}) = \sum_p (PR_i(p) - PR_{i+1}(p))^2
- Stop when this distance is sufficiently small:
  d(PR_i, PR_{i+1}) < ε
Hubs and Authorities (Sec. 21.3)

- A good hub page for a topic points to many authoritative pages for that topic.
- A good authority page for a topic is pointed to by many good hubs for that topic.
- Circular definition: we will turn this into an iterative computation.
The hope (Sec. 21.3)
[figure: hubs (Alice, Bob) pointing to authorities (AT&T, ITIM, O2): mobile telecom companies]
Visualization (Sec. 21.3)
[figure: a small root set expanded into a larger base set]
Iterative update (Sec. 21.3)
- Repeat the following updates, for all x:
  h(x) = \sum_{x \to y} a(y)   (the hub score of x sums the authority scores of the pages x points to)
  a(x) = \sum_{y \to x} h(y)   (the authority score of x sums the hub scores of the pages pointing to x)
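These updates can be sketched directly (an illustration, not the lecture's code; the edge list is invented, and scores are normalized each round so the iteration converges, as in the HITS algorithm):

```python
# Hub/authority updates: h(x) = sum_{x->y} a(y), a(x) = sum_{y->x} h(y).
edges = [("hub1", "auth1"), ("hub1", "auth2"), ("hub2", "auth1")]
nodes = {n for e in edges for n in e}

h = {n: 1.0 for n in nodes}
a = {n: 1.0 for n in nodes}

def normalize(scores):
    s = sum(v * v for v in scores.values()) ** 0.5   # unit length
    return {n: v / s for n, v in scores.items()}

for _ in range(50):
    a = normalize({x: sum(h[y] for (y, z) in edges if z == x) for x in nodes})
    h = normalize({x: sum(a[y] for (z, y) in edges if z == x) for x in nodes})

# hub1 points to both authorities, so it is the best hub;
# auth1 is pointed to by both hubs, so it is the best authority.
assert h["hub1"] > h["hub2"] and a["auth1"] > a["auth2"]
print(round(h["hub1"], 3), round(a["auth1"], 3))
```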
Rewrite in matrix form (Sec. 21.3)
- h = A a
- a = A^T h
- Substituting:
  h = A A^T h
  a = A^T A a
- Thus: h is a (principal) right eigenvector of A A^T, and a is a (principal) right eigenvector of A^T A (with eigenvalue 1 if the scores are suitably normalized).
- Relation between a and h?
  - A^T h = A^T (A A^T h) = (A^T A)(A^T h), so A^T h = a
  - A a = A (A^T A a) = (A A^T)(A a), so A a = h
End of lecture

EXERCISES
- Somebody decides to construct a set of web pages p1, …, pn for some n > 1, where p1 has a high pagerank. The graph is constructed in the following way:
  - Page p1 has incoming links from all other pages p2, …, pn.
  - There is a link from p1 to each pi (2 ≤ i ≤ n).
  - There are no other links outgoing from or incoming to p1, …, pn.
- Compute the pagerank of each of the pages p1, …, pn.
  (use http://www.webworkshop.net/pagerank_calculator.php to get an idea)
