lecture16-linkanalysis.pptx
Introduction to
Information Retrieval
Link analysis
Example 1: Good/Bad/Unknown
▪ The Good, The Bad and The Unknown
▪ Good nodes won’t point to Bad nodes
▪ All other combinations plausible
[Figure: the same graph shown over four slides; nodes start as Good, Bad, or unknown (?), and the unknown labels are filled in step by step until only Good and Bad remain.]
Example 2:
In-links to pages – unusual patterns ☺
Spammers violating power laws!
Introduction to Information Retrieval Sec. 21.1
[Figure: a hyperlink from Page A to Page B, labeled with its anchor text]
Introduction to Information Retrieval Sec. 21.1.1
Anchor Text
WWW Worm - McBryan [Mcbr94]
www.ibm.com
Connectivity servers
Getting at all that link information inexpensively
Introduction to Information Retrieval Sec. 20.4
Connectivity Server
▪ Support for fast queries on the web graph
▪ Which URLs point to a given URL?
▪ Which URLs does a given URL point to?
▪ Stores mappings in memory from URL to outlinks and from URL to inlinks
▪ Applications
▪ Link analysis
▪ Web graph analysis
▪ Connectivity, crawl optimization
▪ Crawl control
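The two queries above can be pictured with a toy in-memory structure. This is only a sketch: the class name `ConnectivityServer`, its method names, and the example URLs are hypothetical, and a real connectivity server would use compressed integer adjacency lists rather than Python sets of strings.

```python
from collections import defaultdict


class ConnectivityServer:
    """Toy in-memory web graph: URL -> out-links and URL -> in-links."""

    def __init__(self):
        self.outlinks = defaultdict(set)
        self.inlinks = defaultdict(set)

    def add_link(self, source_url, target_url):
        # Record the hyperlink in both directions so either query is a lookup.
        self.outlinks[source_url].add(target_url)
        self.inlinks[target_url].add(source_url)

    def points_to(self, url):
        """Which URLs does `url` point to?"""
        return sorted(self.outlinks.get(url, ()))

    def pointed_to_by(self, url):
        """Which URLs point to `url`?"""
        return sorted(self.inlinks.get(url, ()))


# Hypothetical example links.
server = ConnectivityServer()
server.add_link("a.example/index", "b.example/home")
server.add_link("c.example/blog", "b.example/home")
server.add_link("b.example/home", "a.example/index")
```

Storing both directions doubles the memory but makes the in-link query as cheap as the out-link query, which is the trade-off the slide describes.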
Introduction to Information Retrieval Sec. 20.4
Adjacency lists
▪ The set of neighbors of a node
▪ Assume each URL represented by an integer
▪ E.g., for a 4 billion page web, need 32 bits per node (2^32 ≈ 4.3 billion) … and now there are definitely > 4B pages
▪ Naively, this demands 64 bits to represent each hyperlink (a 32-bit source ID plus a 32-bit target ID)
▪ Boldi/Vigna get down to an average of ~3 bits/link
▪ Further work achieves 2 bits/link
Introduction to Information Retrieval Sec. 20.4
Boldi/Vigna
▪ Each URL in the lexicographic ordering has an adjacency list
▪ Main idea: due to templates, the adjacency list of a node is similar to that of one of the 7 preceding URLs in the lexicographic ordering … or else it is encoded anew
▪ Express the adjacency list in terms of one of these
▪ E.g., consider these adjacency lists
▪ 1, 2, 4, 8, 16, 32, 64
▪ 1, 4, 9, 16, 25, 36, 49, 64
▪ 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
▪ 1, 4, 8, 16, 25, 36, 49, 64
▪ Encode the fourth list relative to the list two rows back: (–2), remove 9, add 8
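The reference idea can be sketched in a few lines. This is a simplified illustration, not the actual Boldi/Vigna format (the real scheme is more elaborate): each list is stored either anew or as an offset to one of the 7 preceding lists plus the elements to remove and add. The function names and the tuple layout are assumptions made for this example.

```python
def encode_lists(adj_lists, window=7):
    """Encode each list anew or as (offset, removed, added) against a recent list."""
    encoded = []
    for i, current in enumerate(adj_lists):
        cur = set(current)
        best, best_cost = ("new", sorted(cur)), len(cur)
        for back in range(1, min(window, i) + 1):
            ref = set(adj_lists[i - back])
            removed = sorted(ref - cur)
            added = sorted(cur - ref)
            cost = len(removed) + len(added)   # crude cost: number of edits
            if cost < best_cost:
                best, best_cost = ("ref", -back, removed, added), cost
        encoded.append(best)
    return encoded


def decode_lists(encoded):
    """Invert encode_lists by replaying the edits on the referenced list."""
    decoded = []
    for entry in encoded:
        if entry[0] == "new":
            decoded.append(list(entry[1]))
        else:
            _, offset, removed, added = entry
            ref = set(decoded[len(decoded) + offset])
            decoded.append(sorted((ref - set(removed)) | set(added)))
    return decoded


lists = [
    [1, 2, 4, 8, 16, 32, 64],
    [1, 4, 9, 16, 25, 36, 49, 64],
    [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144],
    [1, 4, 8, 16, 25, 36, 49, 64],
]
encoded = encode_lists(lists)
decoded = decode_lists(encoded)
```

For the four lists from the slide, the last entry comes out as a reference two rows back with 9 removed and 8 added, matching the slide's example.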
Introduction to Information Retrieval Sec. 20.4
Gap encodings
▪ Given a sorted list of integers x, y, z, …, represent by x, y-x, z-y, …
▪ Compress each integer using a code
▪ γ code: number of bits = 1 + 2⌊lg x⌋
▪ δ code: …
▪ Information theoretic bound: 1 + ⌊lg x⌋ bits
▪ ζ code: works well for integers from a power law [Boldi, Vigna: Data Compression Conf. 2004]
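A small sketch of gap encoding followed by the γ code may help make the bit counts concrete. The helper names are hypothetical, and the sketch assumes node IDs are positive integers in strictly increasing order so that every gap is at least 1 (the γ code has no codeword for 0).

```python
def to_gaps(sorted_ids):
    """Keep the first integer, then store each difference from its predecessor."""
    if not sorted_ids:
        return []
    return [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]


def from_gaps(gaps):
    """Rebuild the original list by taking running sums of the gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def gamma_encode(x):
    """Elias gamma code of x >= 1: floor(lg x) zeros, then x in binary."""
    if x < 1:
        raise ValueError("gamma code is defined for positive integers only")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits):
    """Decode a concatenation of gamma codewords back into integers."""
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return values


neighbors = [1, 4, 8, 16, 25, 36, 49, 64]
gaps = to_gaps(neighbors)
bitstream = "".join(gamma_encode(g) for g in gaps)
```

Each codeword has length 1 + 2⌊lg x⌋; for example, 9 is written as three zeros followed by 1001, seven bits in total.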
Introduction to Information Retrieval Sec. 20.4
Main advantages of BV
▪ Depends only on locality in a canonical ordering
▪ Lexicographic ordering works well for the web
▪ Adjacency queries can be answered very efficiently
▪ To fetch out-neighbors, trace back the chain of prototypes
▪ This chain is typically short in practice (since similarity is mostly intra-host)
▪ Can also explicitly limit the length of the chain during encoding
▪ Easy to implement one-pass algorithm
Citation Analysis
▪ Citation frequency
▪ Bibliographic coupling frequency
▪ Articles that co-cite the same articles are related
▪ Citation indexing
▪ Who is this author cited by? (Garfield 1972)
▪ Pagerank preview: Pinski and Narin ’70s
▪ Asked: which journals are authoritative?
Introduction to Information Retrieval Sec. 21.2
Pagerank scoring
▪ Imagine a user doing a random walk on web pages:
▪ Start at a random page
▪ At each step, go out of the current page along one of the links on that page, equiprobably
[Figure: a page with three out-links, each followed with probability 1/3]
▪ “In the long run” each page has a long-term visit rate; use this as the page’s score
Introduction to Information Retrieval Sec. 21.2
Teleporting
▪ At a dead end, jump to a random web page.
▪ At any non-dead end, with probability 10%, jump to a random web page.
▪ With remaining probability (90%), go out on a random link.
▪ 10% is a parameter, the “teleportation” probability.
▪ Simulates a web user going somewhere else
▪ Solves linear algebra problems…
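The teleporting walk can be simulated directly to estimate visit rates. The sketch below is a Monte Carlo illustration only, not how PageRank is computed in practice; the function name, the toy graph, the step count, and the seed are all assumptions made for this example, and every link target is assumed to appear as a key of the graph.

```python
import random


def simulate_visit_rates(outlinks, steps=100_000, teleport=0.10, seed=0):
    """Estimate long-term visit rates by running the teleporting random walk."""
    rng = random.Random(seed)
    pages = sorted(outlinks)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)                # start at a random page
    for _ in range(steps):
        visits[current] += 1
        links = outlinks[current]
        if not links or rng.random() < teleport:
            current = rng.choice(pages)        # dead end, or teleport
        else:
            current = rng.choice(links)        # follow a random out-link
    return {page: count / steps for page, count in visits.items()}


# Hypothetical four-page web; "D" is a dead end with no in-links.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
rates = simulate_visit_rates(toy_web)
```

Page "D" is reachable only by teleporting, so its estimated rate is small but not zero, which is the point of teleportation: no page is unreachable and the walk never gets stuck.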
Introduction to Information Retrieval Sec. 21.2
Result of teleporting
▪ Now cannot get stuck locally.
▪ There is a long-term rate at which any page is visited (not obvious, will show this).
▪ How do we compute this visit rate?
Introduction to Information Retrieval Sec. 21.2.1
Markov chains
▪ A Markov chain consists of n states, plus an n×n transition probability matrix P.
▪ At each step, we are in one of the states.
▪ For 1 ≤ i,j ≤ n, the matrix entry Pij tells us the probability of j being the next state, given we are currently in state i.
▪ Pii > 0 is OK.
[Figure: an arrow from state i to state j, labeled Pij]
Introduction to Information Retrieval Sec. 21.2.1
Markov chains
▪ Clearly, for all i, $\sum_{j=1}^{n} P_{ij} = 1$.
▪ Markov chains are abstractions of random walks.
▪ Exercise: represent the teleporting random walk from
3 slides ago as a Markov chain, for this case:
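The graph for the exercise is not reproduced in these notes, so the sketch below builds the transition matrix for a hypothetical three-page graph instead. It assumes the teleporting rules stated earlier (dead ends jump uniformly; otherwise 10% teleport and 90% spread equally over the out-links); the function name and the graph are illustrative only.

```python
def transition_matrix(outlinks, teleport=0.10):
    """Build the row-stochastic matrix P of the teleporting random walk."""
    pages = sorted(outlinks)
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    P = [[0.0] * n for _ in range(n)]
    for page, links in outlinks.items():
        i = index[page]
        if not links:
            # Dead end: jump to a random page, uniformly.
            P[i] = [1.0 / n] * n
            continue
        for j in range(n):
            P[i][j] = teleport / n                              # teleport share
        for target in links:
            P[i][index[target]] += (1.0 - teleport) / len(links)  # link share
    return pages, P


# Hypothetical graph: A -> B, A -> C, B -> C, and C is a dead end.
pages, P = transition_matrix({"A": ["B", "C"], "B": ["C"], "C": []})
```

Every row of the resulting P sums to 1, as required of a transition probability matrix.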
Introduction to Information Retrieval Sec. 21.2.1
Probability vectors
▪ A probability (row) vector x = (x1, … xn) tells us where the walk is at any point.
▪ E.g., (0 0 0 … 1 … 0 0 0), with the single 1 in position i of n, means we’re in state i.
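One reason for the row-vector convention: multiplying x by P gives the distribution one step later. That is a standard Markov chain fact rather than something stated in the lines above, and the 3-state matrix below is a made-up example.

```python
def step(x, P):
    """Return the row vector x P: the distribution after one more step."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical 3-state chain; every row of P sums to 1.
P = [
    [0.0, 0.5, 0.5],
    [0.25, 0.25, 0.5],
    [1.0, 0.0, 0.0],
]

x0 = [0.0, 1.0, 0.0]   # the walk is in state 2 with certainty
x1 = step(x0, P)       # one step later: exactly row 2 of P
x2 = step(x1, P)       # two steps later
```

Starting from a vector with a single 1 in position i, one step simply reads off row i of P; after that the vector is a mixture of rows, weighted by where the walk might be.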
Introduction to Information Retrieval Sec. 21.3
The hope
[Figure: hub pages on one side linking to authority pages on the other]
High-level scheme
▪ Extract from the web a base set of pages that could be good hubs or authorities.
▪ From these, identify a small set of top hub and authority pages; → iterative algorithm.
Introduction to Information Retrieval Sec. 21.3
Base set
▪ Given text query (say browser), use a text index to get all pages containing browser.
▪ Call this the root set of pages.
▪ Add in any page that either
▪ points to a page in the root set, or
▪ is pointed to by a page in the root set.
▪ Call this the base set.
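The root-to-base expansion can be sketched as follows, assuming the link graph is available as a dictionary from each page to the pages it links to. The function name and the page names are hypothetical; in practice the root set would come from a text index and the links from a connectivity server.

```python
def build_base_set(root_set, outlinks):
    """Expand a root set with pages that link to it or are linked from it."""
    root = set(root_set)
    base = set(root)
    for page, targets in outlinks.items():
        if page in root:
            base.update(targets)        # pages pointed to by a root page
        if root.intersection(targets):
            base.add(page)              # pages that point to a root page
    return base


# Hypothetical link graph; "p1" and "p2" stand for the pages matching the query.
links = {
    "hub": ["p1", "p2"],
    "p1": ["vendor"],
    "p2": [],
    "vendor": [],
    "unrelated": ["elsewhere"],
    "elsewhere": [],
}
base = build_base_set({"p1", "p2"}, links)
```

Here "hub" joins the base set because it points into the root set and "vendor" joins because a root page points to it, while "unrelated" and "elsewhere" stay out.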
Introduction to Information Retrieval Sec. 21.3
Visualization
[Figure: the root set drawn inside the larger base set]
Iterative update
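▪ Repeat the following updates, for all x (x → y means page x links to page y):
$h(x) \leftarrow \sum_{x \to y} a(y)$
$a(x) \leftarrow \sum_{y \to x} h(y)$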
Introduction to Information Retrieval Sec. 21.3
Scaling
▪ To prevent the h() and a() values from getting too big, can scale down after each iteration.
▪ Scaling factor doesn’t really matter:
▪ we only care about the relative values of the scores.
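A minimal sketch of the iterative update with scaling, under a few assumptions not fixed by the slides: all scores start at 1, the hub update sums the authority scores of the pages a page points to, the authority update sums the hub scores of the pages pointing to it, and scaling divides by the largest score. The function name, the fixed iteration count, and the toy graph are illustrative.

```python
def hits(outlinks, iterations=50):
    """Iterate hub/authority updates, rescaling after every iteration."""
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # h(x): sum of a(y) over the pages y that x points to.
        new_hub = {p: sum(auth[y] for y in outlinks.get(p, ())) for p in pages}
        # a(x): sum of h(y) over the pages y that point to x.
        new_auth = {p: 0.0 for p in pages}
        for y, targets in outlinks.items():
            for x in targets:
                new_auth[x] += hub[y]
        # Scale so the largest score is 1; only relative values matter.
        hub_max = max(new_hub.values()) or 1.0
        auth_max = max(new_auth.values()) or 1.0
        hub = {p: v / hub_max for p, v in new_hub.items()}
        auth = {p: v / auth_max for p, v in new_auth.items()}
    return hub, auth


# Hypothetical graph: two hub-like pages pointing at two authority-like pages.
hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []})
```

Dividing by the maximum is one arbitrary choice of scaling factor; any positive factor gives the same ranking, which is why the slide says the factor doesn't really matter.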
Introduction to Information Retrieval Sec. 21.3
Proof of convergence
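Adjacency matrix A for a three-page example (the entry in row i, column j is 1 if page i links to page j, else 0):
      1  2  3
  1   0  1  0
  2   1  1  1
  3   1  0  0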
Introduction to Information Retrieval Sec. 21.3
Hub/authority vectors
▪ View the hub scores h() and the authority scores a() as vectors with n components.
▪ Recall the iterative updates
$h(x) \leftarrow \sum_{x \to y} a(y)$
$a(x) \leftarrow \sum_{y \to x} h(y)$
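In vector form the two updates are commonly written as h ← A a and a ← Aᵀ h, where A is the adjacency matrix. The sketch below assumes the convention A[i][j] = 1 when page i links to page j, uses a three-page matrix as example data, and scales by the largest entry after each step; the helper names are hypothetical.

```python
# Assumed convention: A[i][j] = 1 if page i+1 links to page j+1.
A = [
    [0, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
]


def mat_vec(M, v):
    """Multiply matrix M by column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def scale(v):
    """Divide by the largest entry; only relative values matter."""
    top = max(v) or 1.0
    return [x / top for x in v]


h = [1.0, 1.0, 1.0]
a = [1.0, 1.0, 1.0]
for _ in range(50):
    # h <- A a and a <- A^T h, both computed from the previous iteration.
    h, a = scale(mat_vec(A, a)), scale(mat_vec(transpose(A), h))
```

With this matrix, page 2 ends up as the top hub (it links to all three pages), while pages 1 and 2 tie as top authorities.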
Introduction to Information Retrieval Sec. 21.3
Issues
▪ Topic Drift
▪ Off-topic pages can cause off-topic “authorities” to be returned
▪ E.g., the neighborhood graph can be about a “super topic”
▪ Mutually Reinforcing Affiliates
▪ Affiliated pages/sites can boost each other’s scores
▪ Linkage between affiliated pages is not a useful signal
Resources
▪ IIR Chap 21
▪ Kleinberg, Jon (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM. 46 (5): 604–632.
▪ http://www2004.org/proceedings/docs/1p309.pdf
▪ http://www2004.org/proceedings/docs/1p595.pdf
▪ http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
▪ http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html
▪ The WebGraph framework I: Compression techniques (Boldi et al. 2004)