lecture16-linkanalysis.pptx

Uploaded by

cddsingh13

Introduction to Information Retrieval

Link analysis

Today’s lecture – hypertext and links

▪ We look beyond the content of documents
▪ We begin to look at the hyperlinks between them
▪ Address questions like
▪ Do the links represent a conferral of authority to some
pages? Is this useful for ranking?
▪ How likely is it that a page pointed to by the CERN home
page is about high energy physics?
▪ Big application areas
▪ The Web
▪ Email
▪ Social networks

Links are everywhere

▪ Powerful sources of authenticity and authority
▪ Mail spam – which email accounts are spammers?
▪ Host quality – which hosts are “bad”?
▪ Phone call logs
▪ The Good, The Bad and The Unknown
[Figure: graph with one Good node, one Bad node, and several unknown (?) nodes]

Example 1: Good/Bad/Unknown
▪ The Good, The Bad and The Unknown
▪ Good nodes won’t point to Bad nodes
▪ All other combinations plausible

[Figure: graph with one Good node, one Bad node, and several unknown (?) nodes]

Simple iterative logic

▪ Good nodes won’t point to Bad nodes
▪ If you point to a Bad node, you’re Bad
▪ If a Good node points to you, you’re Good
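
The rules above can be sketched as a small fixed-point loop. This is only an illustration: the function name `propagate_labels`, the edge-list representation, and the five-node graph are assumptions, not part of the lecture. A node that already has a label keeps it, which is one simple way to handle conflicting evidence.

```python
def propagate_labels(edges, good, bad):
    """Apply the two rules until no label changes.

    edges: list of (source, target) pairs, meaning source links to target.
    good, bad: sets of nodes whose labels are known at the start.
    """
    good, bad = set(good), set(bad)
    changed = True
    while changed:
        changed = False
        for src, dst in edges:
            # Rule: if you point to a Bad node, you're Bad.
            if dst in bad and src not in bad and src not in good:
                bad.add(src)
                changed = True
            # Rule: if a Good node points to you, you're Good.
            if src in good and dst not in good and dst not in bad:
                good.add(dst)
                changed = True
    return good, bad


# Hypothetical graph: "g" is known Good, "b" is known Bad, the rest unknown.
edges = [("g", "u1"), ("u1", "u2"), ("u3", "b")]
good, bad = propagate_labels(edges, good={"g"}, bad={"b"})
```

Here `u1` and `u2` end up Good (reached from `g`) and `u3` ends up Bad (it points to `b`).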

[Figure, shown in three steps: applying the rules repeatedly labels each unknown (?) node until only Good and Bad nodes remain]

Sometimes need probabilistic analogs – e.g., mail spam

Example 2: In-links to pages – unusual patterns ☺

[Figure: plot of in-links to pages, with an annotation marking spammers violating power laws]

Many other examples of link analysis

▪ Social networks are a rich source of grouping
behavior
▪ E.g., Shoppers’ affinity – Goel+Goldstein 2010
▪ Consumers whose friends spend a lot, spend a lot
themselves
▪ http://www.cs.cornell.edu/home/kleinber/networks-book/
▪ See cs224w


Our primary interest in this course

▪ Link analysis additions to IR functionality thus far
based purely on text
▪ Scoring and ranking
▪ Link-based clustering – topical structure from links
▪ Links as features in classification – documents that link to
one another are likely to be on the same subject
▪ Crawling
▪ Based on the links seen, where do we crawl next?

Introduction to Information Retrieval Sec. 21.1

The Web as a Directed Graph

[Figure: Page A links to Page B through a hyperlink; the anchor is the link text on Page A]

Hypothesis 1: A hyperlink between pages denotes a conferral of authority (quality signal)

Hypothesis 2: The text in the anchor of a hyperlink on page A describes the target page B

Assumption 1: reputed sites


Assumption 2: annotation of target

Introduction to Information Retrieval Sec. 21.1.1

Anchor Text
WWW Worm - McBryan [Mcbr94]

▪ For the query ibm, how do we distinguish between:

▪ IBM’s home page (mostly graphical)
▪ IBM’s copyright page (high term freq. for ‘ibm’)
▪ Rival’s spam page (arbitrarily high term freq.)

[Figure: anchor texts “ibm”, “ibm.com”, and “IBM home page” all pointing to www.ibm.com]
▪ A million pieces of anchor text with “ibm” send a strong signal
Introduction to Information Retrieval Sec. 21.1.1

Indexing anchor text

▪ When indexing a document D, include (with some
weight) anchor text (and perhaps nearby
surrounding text) from links pointing to D.
[Figure: text on pages that link to www.ibm.com]
▪ “Armonk, NY-based computer giant IBM announced today”
▪ “More information about IBM products can be found here”
▪ “Joe’s computer hardware links”: Sun, HP, IBM
▪ “Big Blue today announced record profits for the quarter”
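
A minimal sketch of the idea, assuming a toy term-weight index: anchor terms are credited to the target document D rather than to the page they appear on. The function name, the data layout, and the default `anchor_weight` of 2.0 are all illustrative assumptions.

```python
from collections import defaultdict


def index_with_anchor_text(pages, links, anchor_weight=2.0):
    """Return {doc: {term: weight}} with anchor text credited to the link target.

    pages: {doc_id: body text}
    links: list of (source_doc, target_doc, anchor_text)
    anchor_weight: assumed weight per anchor term (illustrative value)
    """
    index = defaultdict(lambda: defaultdict(float))
    # Body terms count once per occurrence.
    for doc, body in pages.items():
        for term in body.lower().split():
            index[doc][term] += 1.0
    # Anchor terms are added to the page the link points to.
    for _source, target, anchor in links:
        for term in anchor.lower().split():
            index[target][term] += anchor_weight
    return index


pages = {"www.ibm.com": "welcome", "joe": "computer hardware links"}
links = [("joe", "www.ibm.com", "IBM"), ("news", "www.ibm.com", "Big Blue")]
index = index_with_anchor_text(pages, links)
```

With this toy data, www.ibm.com becomes findable under "ibm", "big", and "blue" even though none of those words occur in its own body text.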
Introduction to Information Retrieval Sec. 21.1.1

Indexing anchor text

▪ Can sometimes have unexpected effects, e.g., spam,
miserable failure
▪ Can score anchor text with weight depending on the
authority of the anchor page’s website
▪ E.g., if we were to assume that content from cnn.com or
yahoo.com is authoritative, then trust (more) the anchor
text from them
▪ Increase the weight of off-site anchors (non-nepotistic
scoring)
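
One way to sketch non-nepotistic scoring is to compare the hosts of the linking page and the target page. The function name and both weight values below are illustrative assumptions, not values from the lecture.

```python
from urllib.parse import urlsplit


def anchor_weight(source_url, target_url, on_site=0.2, off_site=1.0):
    """Give anchors from the target's own host a lower weight than off-site anchors."""
    same_host = urlsplit(source_url).hostname == urlsplit(target_url).hostname
    return on_site if same_host else off_site


w_internal = anchor_weight("http://www.ibm.com/about", "http://www.ibm.com/")
w_external = anchor_weight("http://cnn.com/tech", "http://www.ibm.com/")
```

A weight tied to the authority of the linking site could be multiplied in at the same point.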

Connectivity servers
Getting at all that link information
inexpensively

Introduction to Information Retrieval Sec. 20.4

Connectivity Server
▪ Support for fast queries on the web graph
▪ Which URLs point to a given URL?
▪ Which URLs does a given URL point to?
▪ Stores mappings in memory from
▪ URL to outlinks, URL to inlinks
▪ Applications
▪ Link analysis
▪ Web graph analysis
▪ Connectivity, crawl optimization
▪ Crawl control
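
The two query types can be illustrated with a toy in-memory store that keeps both mappings. This is only a sketch of the interface: the class and method names are assumptions, and a real connectivity server would use integer IDs and compressed adjacency lists rather than Python sets of strings.

```python
from collections import defaultdict


class ConnectivityServer:
    """Toy in-memory store of URL -> out-links and URL -> in-links."""

    def __init__(self):
        self.outlinks = defaultdict(set)
        self.inlinks = defaultdict(set)

    def add_link(self, source, target):
        self.outlinks[source].add(target)
        self.inlinks[target].add(source)

    def points_to(self, url):
        """Which URLs does the given URL point to?"""
        return sorted(self.outlinks.get(url, ()))

    def pointed_to_by(self, url):
        """Which URLs point to the given URL?"""
        return sorted(self.inlinks.get(url, ()))


server = ConnectivityServer()
server.add_link("a", "b")
server.add_link("c", "b")
server.add_link("a", "c")
```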
Introduction to Information Retrieval Sec. 20.4

Boldi and Vigna 2004

▪ http://www2004.org/proceedings/docs/1p595.pdf
▪ WebGraph – a set of algorithms and a Java
implementation
▪ Fundamental goal – maintain node adjacency lists
in memory
▪ For this, compressing the adjacency lists is the
critical component
Introduction to Information Retrieval Sec. 20.4

Adjacency lists
▪ The set of neighbors of a node
▪ Assume each URL represented by an integer
▪ E.g., for a 4 billion page web, need 32 bits per
node … and now there are definitely > 4B pages
▪ Naively, this demands 64 bits to represent each
hyperlink
▪ Boldi/Vigna get down to an average of ~3
bits/link
▪ Further work achieves 2 bits/link
Introduction to Information Retrieval Sec. 20.4

Adjacency list compression

▪ Properties exploited in compression:
▪ Similarity (between lists)
▪ Locality (many links from a page go to
“nearby” pages)
▪ Use gap encoding in sorted lists
▪ Distribution of gap values
Introduction to Information Retrieval Sec. 20.4

Main ideas of Boldi/Vigna

▪ Consider lexicographically ordered list of all URLs,
e.g.,
▪ www.stanford.edu/alchemy
▪ www.stanford.edu/biology
▪ www.stanford.edu/biology/plant
▪ www.stanford.edu/biology/plant/copyright
▪ www.stanford.edu/biology/plant/people
▪ www.stanford.edu/chemistry
Introduction to Information Retrieval Sec. 20.4

Boldi/Vigna
▪ Each of these URLs has an adjacency list
▪ Main idea: due to templates, the adjacency list of a
node is similar to one of the 7 preceding URLs in
the lexicographic ordering … or else encoded anew
▪ Express adjacency list in terms of one of these
▪ E.g., consider these adjacency lists
▪ 1, 2, 4, 8, 16, 32, 64
▪ 1, 4, 9, 16, 25, 36, 49, 64
▪ 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
▪ 1, 4, 8, 16, 25, 36, 49, 64
▪ Encode the fourth list relative to the list two positions before it: (–2), remove 9, add 8
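
The example can be checked with a simplified sketch of reference encoding. It stores plain "remove" and "add" lists against a chosen reference list; the function names are assumptions, and the actual Boldi/Vigna format is more compact than this.

```python
def encode_against(reference, target):
    """Describe target as edits to reference: (items to remove, items to add)."""
    ref, tgt = set(reference), set(target)
    return sorted(ref - tgt), sorted(tgt - ref)


def decode_from(reference, removed, added):
    """Rebuild the target list from the reference list and the edits."""
    return sorted((set(reference) - set(removed)) | set(added))


# The four adjacency lists from the slide.
lists = [
    [1, 2, 4, 8, 16, 32, 64],
    [1, 4, 9, 16, 25, 36, 49, 64],
    [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144],
    [1, 4, 8, 16, 25, 36, 49, 64],
]
# Encode the fourth list against the list two positions back (offset -2).
removed, added = encode_against(lists[1], lists[3])
rebuilt = decode_from(lists[1], removed, added)
```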
Introduction to Information Retrieval Sec. 20.4

Gap encodings
▪ Given a sorted list of integers x, y, z, …, represent by x, y-x, z-y, …
▪ Compress each integer using a code
▪ γ code: Number of bits = 1 + 2⌊lg x⌋
▪ δ code: …
▪ Information theoretic bound: 1 + ⌊lg x⌋ bits
▪ ζ code: Works well for integers from a
power law [Boldi, Vigna: Data Compression Conf. 2004]
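
A short sketch of gap encoding followed by a γ code. It uses one common bit convention for the γ code (leading zeros, then the binary digits); other presentations use a different but equally long layout. The helper names are assumptions.

```python
def to_gaps(sorted_ids):
    """x, y, z, ... -> x, y-x, z-y, ..."""
    return [b - a for a, b in zip([0] + sorted_ids[:-1], sorted_ids)]


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def gamma_code(x):
    """Gamma code of a positive integer as a bit string of 1 + 2*floor(lg x) bits."""
    if x < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    binary = bin(x)[2:]  # floor(lg x) + 1 bits
    return "0" * (len(binary) - 1) + binary


ids = [1, 4, 8, 16, 25, 36, 49, 64]
gaps = to_gaps(ids)
total_bits = sum(len(gamma_code(g)) for g in gaps)
```

For this list the gaps are 1, 3, 4, 8, 9, 11, 13, 15, which take 44 bits in total under the γ code, compared with 8 × 32 bits for raw 32-bit integers.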
Introduction to Information Retrieval Sec. 20.4

Main advantages of BV
▪ Depends only on locality in a canonical ordering
▪ Lexicographic ordering works well for the web
▪ Adjacency queries can be answered very
efficiently
▪ To fetch out-neighbors, trace back the chain of
prototypes
▪ This chain is typically short in practice (since similarity
is mostly intra-host)
▪ Can also explicitly limit the length of the chain during
encoding
▪ Easy to implement one-pass algorithm
Introduction to Information Retrieval

Link analysis: Pagerank


Citation Analysis
▪ Citation frequency
▪ Bibliographic coupling frequency
▪ Articles that co-cite the same articles are related
▪ Citation indexing
▪ Who is this author cited by? (Garfield 1972)
▪ Pagerank preview: Pinski and Narin (1976)
▪ Asked: which journals are authoritative?

The web isn’t scholarly citation

▪ Millions of participants, each with self interests
▪ Spamming is widespread
▪ Once search engines began to use links for ranking
(roughly 1998), link spam grew
▪ You can join a link farm – a group of websites that heavily
link to one another

Introduction to Information Retrieval Sec. 21.2

Pagerank scoring
▪ Imagine a user doing a random walk on web pages:
▪ Start at a random page
▪ At each step, go out of the current page along one of
the links on that page, equiprobably
[Figure: a page with three out-links, each followed with probability 1/3]
▪ “In the long run” each page has a long-term visit rate
– use this as the page’s score
▪ Variant: rather than equiprobable, use text and link
information to have probability of following a link:
intelligent surfer [Richardson and Domingos 2001]
Introduction to Information Retrieval Sec. 21.2

Not quite enough

▪ The web is full of dead-ends.
▪ Random walk can get stuck in dead-ends.
▪ Makes no sense to talk about long-term visit rates.

Introduction to Information Retrieval Sec. 21.2

Teleporting
▪ At a dead end, jump to a random web page.
▪ At any non-dead end, with probability 10%,
jump to a random web page.
▪ With remaining probability (90%), go out on
a random link.
▪ 10% - a parameter.
▪ “Teleportation” probability
▪ Simulates a web user going somewhere else
▪ Solves linear algebra problems…
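
The teleporting rules translate directly into a row-stochastic transition matrix. The sketch below assumes pages are numbered 0 to n-1 and that out-links are given as a dictionary; the function name and the three-page graph are illustrative.

```python
def transition_matrix(outlinks, n, teleport=0.1):
    """Build the transition matrix of the random walk with teleporting.

    outlinks: {page: list of pages it links to}, pages numbered 0..n-1.
    teleport: probability of jumping to a random page from a non-dead end.
    """
    P = []
    for i in range(n):
        targets = outlinks.get(i, [])
        if not targets:
            # Dead end: jump to a page chosen uniformly at random.
            P.append([1.0 / n] * n)
            continue
        # Non-dead end: teleport mass spread over all pages ...
        row = [teleport / n] * n
        # ... and the remaining mass split equally over the out-links.
        for j in targets:
            row[j] += (1.0 - teleport) / len(targets)
        P.append(row)
    return P


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, and page 2 is a dead end.
P = transition_matrix({0: [1, 2], 1: [2]}, n=3)
```

Every row sums to 1, and the dead-end row is uniform.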
Introduction to Information Retrieval Sec. 21.2

Result of teleporting
▪ Now cannot get stuck locally.
▪ There is a long-term rate at which any
page is visited (not obvious, will show
this).
▪ How do we compute this visit rate?
Introduction to Information Retrieval Sec. 21.2.1

Markov chains
▪ A Markov chain consists of n states, plus an n×n
transition probability matrix P.
▪ At each step, we are in one of the states.
▪ For 1 ≤ i,j ≤ n, the matrix entry Pij tells us the
probability of j being the next state, given we are
currently in state i.
▪ Pii > 0 is OK.
[Figure: transition from state i to state j, labeled Pij]
Introduction to Information Retrieval Sec. 21.2.1

Markov chains
▪ Clearly, for all i, $\sum_{j=1}^{n} P_{ij} = 1$.
▪ Markov chains are abstractions of random walks.
▪ Exercise: represent the teleporting random walk from
3 slides ago as a Markov chain, for this case:
[Figure: example graph for the exercise]
Introduction to Information Retrieval Sec. 21.2.1

Ergodic Markov chains

▪ For any ergodic Markov chain, there is a
unique long-term visit rate for each state.
▪ Steady-state probability distribution.
▪ Over a long time-period, we visit each state in
proportion to this rate.
▪ It doesn’t matter where we start.

▪ Ergodic: no periodic patterns
▪ Teleportation ensures ergodicity
[Figure: example of a chain that is not ergodic]
Introduction to Information Retrieval Sec. 21.2.1

Probability vectors
▪ A probability (row) vector x = (x1, … xn) tells us
where the walk is at any point.
▪ E.g., (0 0 0 … 1 … 0 0 0), with the single 1 in position i of n, means we’re in state i.
▪ More generally, the vector x = (x1, … xn) means the walk is in state i with probability xi.
Introduction to Information Retrieval Sec. 21.2.1

Change in probability vector

▪ If the probability vector is x = (x1, … xn) at this
step, what is it at the next step?
▪ Recall that row i of the transition prob. matrix P
tells us where we go next from state i.
▪ So from x, our next state is distributed as xP
▪ The one after that is xP^2, then xP^3, etc.
▪ (Where) Does this converge?
▪ Running this and finding out is “the power method”
▪ It’s actually the method of choice, done with sparse P
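
A minimal dense sketch of the power method: start from the uniform distribution and repeat x ← xP until x stops changing. The function name, tolerance, and the two-state example matrix are assumptions; a production version would use a sparse representation of P.

```python
def power_method(P, tol=1e-10, max_iter=1000):
    """Iterate x <- xP from the uniform distribution until x stops changing."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        # Row vector times matrix: entry j sums x[i] * P[i][j] over i.
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical two-state chain.
P = [[0.9, 0.1],
     [0.5, 0.5]]
steady = power_method(P)
```

For this chain the iteration settles at (5/6, 1/6).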
Introduction to Information Retrieval Sec. 21.2.2

How do we compute this vector?

▪ Let a = (a1, … an) denote the row vector of
steady-state probabilities.
▪ If our current position is described by a, then the
next step is distributed as aP.
▪ But a is the steady state, so a=aP.
▪ Solving this matrix equation gives us a.
▪ So a is the (left) eigenvector for P.
▪ Corresponds to the “principal” eigenvector of P with the
largest eigenvalue. (See: Perron-Frobenius theorem.)
▪ Transition probability matrices always have largest
eigenvalue 1.
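
As a small worked instance of solving a = aP, consider a hypothetical two-state chain (not from the lecture) with switching probabilities p and q, both assumed to lie strictly between 0 and 1:

```latex
P = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix}, \qquad
a = (a_1, a_2), \quad a_1 + a_2 = 1.

a = aP \;\Longrightarrow\; a_1 = (1-p)\,a_1 + q\,a_2
  \;\Longrightarrow\; p\,a_1 = q\,a_2
  \;\Longrightarrow\; a = \left( \frac{q}{p+q},\; \frac{p}{p+q} \right).
```

With p = 0.1 and q = 0.5 this gives a = (5/6, 1/6).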

Example: Mini web graph


Example: Fixing sinks and teleporting


Example: Doing power iteration


Link analysis: HITS

Kleinberg (1999)

Introduction to Information Retrieval Sec. 21.3

Hyperlink-Induced Topic Search (HITS)

▪ In response to a query, instead of an ordered list of
pages each meeting the query, find two sets of
inter-related pages:
▪ Hub pages are good lists of links on a subject
▪ e.g., “Bob’s list of cancer-related links.”
▪ Authority pages occur recurrently on good hubs for the
subject
▪ Best suited for “broad topic” queries rather than for
page-finding queries
▪ Gets at a broader slice of common opinion
Introduction to Information Retrieval Sec. 21.3

Hubs and Authorities

▪ Thus, a good hub page for a topic points to
many authoritative pages for that topic.
▪ A good authority page for a topic is pointed
to by many good hubs for that topic.
▪ Circular definition – will turn this into an
iterative computation.
Introduction to Information Retrieval Sec. 21.3

The hope

[Figure: Hubs on one side pointing to Authorities on the other; example topic: mobile telecom companies]

Introduction to Information Retrieval Sec. 21.3

High-level scheme
▪ Extract from the web a base set of
pages that could be good hubs or
authorities.
▪ From these, identify a small set of top
hub and authority pages;
→ iterative algorithm.
Introduction to Information Retrieval Sec. 21.3

Base set
▪ Given text query (say browser), use a text
index to get all pages containing browser.
▪ Call this the root set of pages.
▪ Add in any page that either
▪ points to a page in the root set, or
▪ is pointed to by a page in the root set.
▪ Call this the base set.
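
The base-set construction can be sketched as follows, assuming the root set has already been retrieved from a text index and that out-link and in-link mappings are available as dictionaries. All names and the toy graph are illustrative.

```python
def base_set(root, outlinks, inlinks):
    """Root set plus every page that links to it or is linked from it."""
    base = set(root)
    for page in root:
        base.update(outlinks.get(page, ()))  # pages pointed to by a root page
        base.update(inlinks.get(page, ()))   # pages that point to a root page
    return base


# Hypothetical link data: r1 -> a, x -> r2, a -> far.
outlinks = {"r1": ["a"], "x": ["r2"], "a": ["far"]}
inlinks = {"a": ["r1"], "r2": ["x"], "far": ["a"]}
base = base_set({"r1", "r2"}, outlinks, inlinks)
```

The page "far" is two links away from the root set, so it is not added.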
Introduction to Information Retrieval Sec. 21.3

Visualization

[Figure: the root set shown as a region inside the larger base set]

Get in-links (and out-links) from a connectivity server

Introduction to Information Retrieval Sec. 21.3

Distilling hubs and authorities

▪ Compute, for each page x in the base set, a hub
score h(x) and an authority score a(x).
▪ Initialize: for all x, h(x) ← 1; a(x) ← 1
▪ Iteratively update all h(x), a(x) (the key step)
▪ After iterations
▪ output pages with highest h() scores as top hubs
▪ output pages with highest a() scores as top authorities
Introduction to Information Retrieval Sec. 21.3

Iterative update
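▪ Repeat the following updates, for all x:
▪ $h(x) \leftarrow \sum_{x \mapsto y} a(y)$ (sum over the pages y that x points to)
▪ $a(x) \leftarrow \sum_{y \mapsto x} h(y)$ (sum over the pages y that point to x)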
Introduction to Information Retrieval Sec. 21.3

Scaling
▪ To prevent the h() and a() values from
getting too big, can scale down after each
iteration.
▪ Scaling factor doesn’t really matter:
▪ we only care about the relative values of
the scores.
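
A sketch of the iterative computation with scaling after each iteration. It scales by the sum of the scores, and the authority update uses the freshly computed hub scores; both are choices made for this illustration, and the function name and toy graph are assumptions.

```python
def hits(edges, iterations=5):
    """Iterative hub/authority updates, scaled so each score vector sums to 1.

    edges: list of (x, y) pairs meaning page x links to page y.
    """
    pages = sorted({p for edge in edges for p in edge})
    h = dict.fromkeys(pages, 1.0)
    a = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # h(x) <- sum of a(y) over the pages y that x points to
        new_h = dict.fromkeys(pages, 0.0)
        for x, y in edges:
            new_h[x] += a[y]
        # a(y) <- sum of h(x) over the pages x that point to y
        new_a = dict.fromkeys(pages, 0.0)
        for x, y in edges:
            new_a[y] += new_h[x]
        # Scale down; only the relative values matter.
        h_total = sum(new_h.values()) or 1.0
        a_total = sum(new_a.values()) or 1.0
        h = {p: v / h_total for p, v in new_h.items()}
        a = {p: v / a_total for p, v in new_a.items()}
    return h, a


# Hypothetical base set: two hub-like pages and two authority-like pages.
edges = [("hub1", "auth1"), ("hub1", "auth2"), ("hub2", "auth1")]
h, a = hits(edges)
```

In this toy graph hub1 gets the highest hub score and auth1 the highest authority score.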
Introduction to Information Retrieval Sec. 21.3

How many iterations?

▪ Claim: relative values of scores will converge after
a few iterations:
▪ in fact, suitably scaled, h() and a() scores settle
into a steady state!
▪ proof of this comes later.
▪ In practice, ~5 iterations get you close to stability.
Introduction to Information Retrieval Sec. 21.3

Proof of convergence

▪ n×n adjacency matrix A:

▪ each of the n pages in the base set has a row and
column in the matrix.
▪ Entry Aij = 1 if page i links to page j, else = 0.

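[Figure: example graph on pages 1, 2, 3 and its adjacency matrix]
$A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 0 & 0 \end{pmatrix}$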
Introduction to Information Retrieval Sec. 21.3

Hub/authority vectors
▪ View the hub scores h() and the authority scores a()
as vectors with n components.
▪ Recall the iterative updates: $h(x) \leftarrow \sum_{x \mapsto y} a(y)$ and $a(x) \leftarrow \sum_{y \mapsto x} h(y)$
Introduction to Information Retrieval Sec. 21.3

Rewrite in matrix form

▪ $h = Aa$.
▪ $a = A^{T}h$. (Recall $A^{T}$ is the transpose of A.)
Substituting, $h = AA^{T}h$ and $a = A^{T}Aa$.
Thus, h is an eigenvector of $AA^{T}$ and a is an eigenvector of $A^{T}A$.
Further, our algorithm is a particular, known algorithm for
computing eigenvectors: again, the power iteration method.
Guaranteed to converge.
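
The matrix form can be checked numerically. The sketch below runs the updates h = Aa and a = Aᵀh on a 3×3 adjacency matrix (the example matrix as best it can be read from the slide, so treat the entries as an assumption) and then tests that the result is an eigenvector of AᵀA. Function and variable names are illustrative.

```python
def authority_vector(A, iterations=50):
    """Repeat h = A a, a = A^T h, rescaling a to sum to 1 each round."""
    n = len(A)
    a = [1.0] * n
    for _ in range(iterations):
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = A a
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]  # a = A^T h
        total = sum(a)
        a = [v / total for v in a]
    return a


A = [[0, 1, 0],
     [1, 1, 1],
     [1, 0, 0]]
n = len(A)
a = authority_vector(A)

# Apply A^T A once more: an eigenvector is only stretched by a constant factor.
h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
Ma = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
ratios = [Ma[j] / a[j] for j in range(n)]
```

All three ratios agree, and their common value is the largest eigenvalue of AᵀA, which for this matrix is 2 + √3.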

Example authorities found … in 1999

▪ (java) Authorities
▪ .328 http://www.gamelan.com/ Gamelan
▪ .251 http://java.sun.com/ JavaSoft Home Page
▪ .190 http://www.digitalfocus.com/... The Java Developer: How Do I ...
▪ .190 http://lightyear.ncsa.uiuc.edu/~srp/java/javabooks.html
▪ .183 http://sunsite.unc.edu/javafaq/javafaq.html comp.lang.java FAQ
▪ (censorship) Authorities
▪ .378 http://www.eff.org/ EFFweb - The Electronic Frontier Foundation
▪ .344 http://www.eff.org/blueribbon.html The Blue Ribbon Campaign for Online Free Speech
▪ .238 http://www.cdt.org/ The Center for Democracy and Technology
▪ .235 http://www.vtw.org/ Voters Telecommunications Watch
▪ .218 http://www.aclu.org/ ACLU: American Civil Liberties Union
Introduction to Information Retrieval Sec. 21.3

Issues
▪ Topic Drift
▪ Off-topic pages can cause off-topic “authorities” to
be returned
▪ E.g., the neighborhood graph can be about a “super
topic”
▪ Mutually Reinforcing Affiliates
▪ Affiliated pages/sites can boost each other’s
scores
▪ Linkage between affiliated pages is not a useful signal

Resources
▪ IIR Chap 21
▪ Kleinberg, Jon (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM. 46 (5): 604–632.
▪ http://www2004.org/proceedings/docs/1p309.pdf
▪ http://www2004.org/proceedings/docs/1p595.pdf
▪ http://www2003.org/cdrom/papers/refereed/p270/kamvar-270-xhtml/index.html
▪ http://www2003.org/cdrom/papers/refereed/p641/xhtml/p641-mccurley.html
▪ The WebGraph framework I: Compression techniques (Boldi et al. 2004)
