Data Mining
Advanced Topics in Data Mining
Link Based Ranking
Graph-based Representation
• Directed / Undirected
• Weighted / Unweighted
• Graph as an adjacency matrix
• Degree of a node
• In-degree / Out-degree
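The representation above can be sketched in a few lines of Python. This is a minimal illustration on a made-up four-node directed, unweighted graph; the node names, the edge list and the variable names are assumptions introduced here, not part of the slides.

```python
# Toy directed graph: each pair (u, v) is an edge u -> v (hypothetical data).
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

index = {name: i for i, name in enumerate(nodes)}
n = len(nodes)

# adjacency[i][j] == 1 means there is an edge from nodes[i] to nodes[j].
adjacency = [[0] * n for _ in range(n)]
for u, v in edges:
    adjacency[index[u]][index[v]] = 1

# Out-degree is a row sum, in-degree is a column sum.
out_degree = {name: sum(adjacency[index[name]]) for name in nodes}
in_degree = {name: sum(row[index[name]] for row in adjacency) for name in nodes}

print(out_degree)
print(in_degree)
```

For a weighted graph the entry would hold the edge weight instead of 1, and for an undirected graph the matrix would be symmetric, so in-degree and out-degree coincide.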
Ranking
Teams and Player Ranking
Student Ranking
Web Pages Ranking
Expert Ranking
Scholars and Academic Entities Ranking
My interest
Think of anything and you can Rank it
Content
Link
Introduction – From Content to ?
• Early search engines focused on content
• They compared the content similarity of the query and the indexed pages.
• They used information retrieval methods: cosine similarity, TF-IDF, ...
• From 1996, it became clear that content similarity alone
was no longer sufficient.
• The number of pages grew rapidly in the mid-to-late 1990s.
• This growth was exponential [Internet Statistics]
• How to pick and rank the top 10-40 pages to show to the user?
• Issues
• Content similarity is easily spammed.
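To see why content similarity alone is easy to spam, here is a minimal sketch using cosine similarity over raw term counts. It is a simplification (no IDF weighting, whitespace tokenisation), and the query and the two page texts are invented toy data.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the raw term-frequency vectors of two texts."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and pages, for illustration only.
query = "cheap flights"
honest_page = "compare cheap flights and hotels from many airlines"
stuffed_page = "cheap flights cheap flights cheap flights cheap flights buy now"

honest_score = cosine_similarity(query, honest_page)
stuffed_score = cosine_similarity(query, stuffed_page)
print(honest_score, stuffed_score)
```

The page that simply repeats the query terms scores higher than the page with genuine content, which is the kind of manipulation that link-based ranking tries to resist.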
Links
• Starting around 1996, researchers began to work on the problem by turning to hyperlinks.
• Beyond their content, web pages are connected through hyperlinks,
• which carry important information.
Ranking Algorithm
• HITS (Hyperlink-Induced Topic Search)
e.g. Alta Vista
• Developed by Jon Kleinberg.
Short Introduction
• HITS (Hyperlink-Induced Topic Search)
• A link-based ranking algorithm for information retrieval, like PageRank
• Tries to find key pages for specific web communities.
• HITS focuses on finding
• Authorities
• Hubs
Introduction HITS
• Authority
• A page with many in-links.
• The page may have good or authoritative content on a topic.
• Hub
• A page with many out-links.
• The page serves as an organizer of information on a topic.
• The key idea
• A good hub points to many good authorities.
• A good authority is pointed to by many good hubs.
Authorities and Hubs
• Let Ai be the authority score for page i,
• let Hi be the hub score for page i.
• Calculation Procedure
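• Initialize both scores to 1 for every page,
• then iterate the two update rules below until convergence is
achieved (the numbers settle down):
Hub and Authority calculation
• Authority of a page = sum of the hub scores of the pages that link to it, e.g. a(1) = h(2) + h(3) + h(4) when pages 2, 3 and 4 link to page 1
• Hub of a page = sum of the authority scores of the pages it links to, e.g. h(1) = a(5) + a(6) + a(7) when page 1 links to pages 5, 6 and 7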
Authorities and Hubs example
• Links: a → b, c, d; b → a, c, d; c → d (d has no out-links)
• Initially Ha = Hb = Hc = Hd = 1
• Normalise = divide each score by the sum of the scores; values are computed from unrounded scores and shown to three decimals
1. Aa = Hb = 1; Ab = Ha = 1; Ac = Ha + Hb = 2; Ad = Ha + Hb + Hc = 3
Normalise: Aa = 0.143; Ab = 0.143; Ac = 0.286; Ad = 0.429
Ha = Ab + Ac + Ad = 0.857; Hb = Aa + Ac + Ad = 0.857; Hc = Ad = 0.429; Hd = 0
Normalise: Ha = 0.4; Hb = 0.4; Hc = 0.2; Hd = 0
2. Aa = Hb = 0.4; Ab = Ha = 0.4; Ac = Ha + Hb = 0.8; Ad = Ha + Hb + Hc = 1
Normalise: Aa = 0.154; Ab = 0.154; Ac = 0.308; Ad = 0.385
Ha = Ab + Ac + Ad = 0.846; Hb = Aa + Ac + Ad = 0.846; Hc = Ad = 0.385; Hd = 0
Normalise: Ha = 0.407; Hb = 0.407; Hc = 0.185; Hd = 0
3. Aa = Hb = 0.407; Ab = Ha = 0.407; Ac = Ha + Hb = 0.815; Ad = Ha + Hb + Hc = 1
Normalise: Aa = 0.155; Ab = 0.155; Ac = 0.310; Ad = 0.380
Ha = Ab + Ac + Ad = 0.845; Hb = Aa + Ac + Ad = 0.845; Hc = Ad = 0.380; Hd = 0
Normalise: Ha = 0.408; Hb = 0.408; Hc = 0.184; Hd = 0
4. Aa = Hb = 0.408; Ab = Ha = 0.408; Ac = Ha + Hb = 0.816; Ad = Ha + Hb + Hc = 1
Normalise: Aa = 0.155; Ab = 0.155; Ac = 0.310; Ad = 0.380
Ha = Ab + Ac + Ad = 0.845; Hb = Aa + Ac + Ad = 0.845; Hc = Ad = 0.380; Hd = 0
Normalise: Ha = 0.408; Hb = 0.408; Hc = 0.184; Hd = 0
• The scores no longer change at three decimals between rounds 3 and 4: convergence.
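The hand calculation above can be checked with a short script. This is a minimal sketch, assuming the link structure implied by the update equations in the example (a links to b, c, d; b links to a, c, d; c links to d) and normalisation by dividing by the sum. The names `hits`, `normalise` and `out_links` are illustrative choices, not part of the slides.

```python
def normalise(scores):
    """Divide every score by the sum so that the scores add up to 1."""
    total = sum(scores.values())
    if total == 0:
        return dict(scores)
    return {page: value / total for page, value in scores.items()}


def hits(out_links, rounds):
    """Run a fixed number of HITS rounds.

    out_links maps every page to the list of pages it links to; every
    linked page is assumed to appear as a key as well.
    Returns the (authority, hub) score dictionaries.
    """
    pages = list(out_links)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(rounds):
        # Authority update: add up the hub scores of the pages linking in.
        auth = {page: 0.0 for page in pages}
        for source, targets in out_links.items():
            for target in targets:
                auth[target] += hub[source]
        auth = normalise(auth)
        # Hub update: add up the authority scores of the pages linked to.
        hub = normalise(
            {page: sum(auth[t] for t in out_links[page]) for page in pages}
        )
    return auth, hub


example = {"a": ["b", "c", "d"], "b": ["a", "c", "d"], "c": ["d"], "d": []}
auth, hub = hits(example, rounds=4)
for page in example:
    print(page, round(auth[page], 3), round(hub[page], 3))
```

Running it with `rounds` set to 1, 2, 3 and 4 in turn prints the normalised scores for each round, which can be compared line by line with the worked example. A fixed round count is used here for simplicity; a convergence test on the change between rounds would be the natural extension.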
Exercise (Optional)
(Compute hub and authority scores for the following graph
until convergence)
The HITS algorithm: Formal way
• Given a broad search query, q, HITS collects a set of
pages as follows:
• It sends the query q to a search engine.
• It then collects the t (usually t = 200) highest-ranked
pages. This set is called the root set W.
• It then grows W by including any page pointed to
by a page in W and any page that points to a page
in W. This gives a larger set S, called the base set.
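The expansion from root set to base set can be sketched as follows. This is an illustration only: the function name `build_base_set`, the dictionary-of-out-links layout and the toy page names are assumptions, and the search-engine step that produces W is replaced by a hand-picked root set.

```python
def build_base_set(root_set, out_links):
    """Expand a root set W into a base set S.

    S contains W, every page that a page in W points to, and every page
    that points to a page in W. out_links maps a page to the pages it
    links to.
    """
    root = set(root_set)
    base = set(root)
    for page, targets in out_links.items():
        if page in root:
            base.update(targets)  # pages pointed to by a page in W
        if any(target in root for target in targets):
            base.add(page)  # pages that point to a page in W
    return base


# Hypothetical link graph; p5 and p6 are unrelated to the root set.
web = {
    "p1": ["p2", "p3"],
    "p2": ["p4"],
    "p3": [],
    "p4": ["p1"],
    "p5": ["p6"],
    "p6": [],
}
base = build_base_set(["p1"], web)
print(sorted(base))
```

Here p2 and p3 enter S because p1 points to them, p4 enters because it points to p1, and p5 and p6 stay out because they have no link to or from the root set.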
The link graph G
• HITS works on the pages in S, and assigns every page in S an authority score
and a hub score.
• Let the number of pages in S be n.
• We again use G = (V, E) to denote the hyperlink graph of S.
• We use L to denote the adjacency matrix of the graph.
The HITS algorithm
• Let the authority score of page i be a(i), and the hub score of page i be
h(i).
• The mutually reinforcing relationship of the two scores is represented as
follows:
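A standard way to write this relationship, using the graph G = (V, E) and the adjacency matrix L defined above, is sketched below. The convention that L_ij = 1 when page i links to page j, and the column-vector notation a and h for the two sets of scores, are assumptions made here.

```latex
\[
a(i) = \sum_{(j,\,i) \in E} h(j), \qquad h(i) = \sum_{(i,\,j) \in E} a(j)
\]
\[
L_{ij} =
\begin{cases}
1 & \text{if } (i, j) \in E \\
0 & \text{otherwise}
\end{cases}
\qquad \Longrightarrow \qquad
\mathbf{a} = L^{T}\mathbf{h}, \qquad \mathbf{h} = L\,\mathbf{a}
\]
```

Each round applies both updates and then normalises a and h, so the scores stay bounded while their relative order settles.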
The HITS algorithm
How is HITS used
HITS is search-query dependent.
When the user issues a search query,
HITS first expands the list of relevant pages returned by a search engine and
then produces two rankings of the expanded set of pages: an authority ranking
and a hub ranking.
Which of the two rankings should be considered?
How is HITS used
Finding communities in the Web
Community detection: an important research domain
Applications in all sorts of research domains
Web
Marketing
Social Issues
TASKS [Basics before Implementation] (Optional)
First, think of a scenario where the basic idea of HITS applies
Recall the basic idea
Incoming links are as important as outgoing links
The graph should be well connected, not sparse
You need to define and find any one scenario
Entities [Vertices]
Relationships [Edges]
What type of relationship?
Why incoming?
Why outgoing?
Draw it as a sample graph
Define the sets of vertices and edges
Define the hubs and authorities
Issues of HITS
What is the main flaw in HITS?
Out-links are not as important as in-links
What is a possible solution?
How can in-links be considered more important?
Any questions?