Web Clustering

The document discusses incremental clustering algorithms. Previous clustering algorithms processed all data points at essentially the same time ("batch" mode), but some applications need to cluster a stream of incoming documents. Incremental clustering algorithms address this by maintaining clusters as new points arrive: each new point is either assigned to an existing cluster, or it starts a new cluster while two existing clusters are merged. The document presents the doubling algorithm and the clique partition algorithm as two instantiations of this model and illustrates how they incrementally process points and merge clusters.
Incremental Clustering

 Previous clustering algorithms worked in "batch" mode: they processed all points at essentially the same time.
 Some IR applications cluster an incoming document stream (e.g., topic tracking).
 For these applications, we need incremental clustering algorithms.
Incremental Clustering Issues
 How to be efficient? Should all documents be cached?
 How to handle or support concept drift?
 How to reduce sensitivity to ordering?
 Goals:
 minimize the maximum cluster diameter
 minimize the number of clusters given a fixed diameter
Incremental Clustering Model
[Charikar et al. 1997]
 Extension to hierarchical agglomerative clustering (HAC) as follows:
 Incremental Clustering: "for an update sequence of n points in M, maintain a collection of k clusters such that as each one is presented, either it is assigned to one of the current k clusters or it starts off a new cluster while two existing clusters are merged into one."
 Maintains a HAC for the points added up to the current time.
M. Charikar, C. Chekuri, T. Feder, R. Motwani. "Incremental Clustering and Dynamic Information Retrieval", Proc. 29th Annual ACM Symposium on Theory of Computing, 1997.
Doubling Algorithm (a = b = 2)
1. Assign the first k+1 points to k+1 clusters, with each point as a centroid; d_1 = distance between the closest two points.
2. While more points remain:
   1. d_{t+1} = b · d_t
   2. Merge clusters until every old cluster is contained in some new cluster:
      1. Pick an arbitrary cluster; merge into it all clusters whose centers are within d_{t+1}.
      2. Remove the selected clusters from the old clusters.
      3. Calculate the centroid of the new cluster.
   3. Update clusters while the number of clusters <= k:
      1. Assign each new point to the closest cluster if it is within a · d_{t+1} of its center; otherwise create a new cluster.
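Below is a minimal sketch of the procedure above, assuming 2-D points with Euclidean distance and the slide's parameters a = b = 2; the function names, the list-based stream handling, and the sample points are illustrative, not from the source.

```python
import math
from itertools import combinations

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def doubling_cluster(points, k, a=2, b=2):
    """Doubling-algorithm sketch: maintain at most k clusters as points stream in.
    Each cluster is represented as [center, members]."""
    # Step 1: the first k+1 points become singleton clusters;
    # d_1 is the distance between the closest two of them.
    clusters = [[p, [p]] for p in points[:k + 1]]
    d = min(dist(p, q) for p, q in combinations(points[:k + 1], 2))
    stream, i = points[k + 1:], 0
    while i < len(stream) or len(clusters) > k:
        d = b * d                                   # step 2.1: d_{t+1} = b * d_t
        # Step 2.2: merge clusters whose centers lie within d of an arbitrary seed.
        merged, remaining = [], clusters[:]
        while remaining:
            seed = remaining.pop(0)
            near, far = [], []
            for c in remaining:
                (near if dist(seed[0], c[0]) <= d else far).append(c)
            remaining = far
            members = seed[1] + [m for c in near for m in c[1]]
            cx = sum(p[0] for p in members) / len(members)   # new centroid (step 2.2.3)
            cy = sum(p[1] for p in members) / len(members)
            merged.append([(cx, cy), members])
        clusters = merged
        # Step 2.3: keep adding points while there are at most k clusters.
        while i < len(stream) and len(clusters) <= k:
            p = stream[i]
            nearest = min(clusters, key=lambda c: dist(p, c[0]))
            if dist(p, nearest[0]) <= a * d:
                nearest[1].append(p)                # join the closest cluster
            else:
                clusters.append([p, [p]])           # start a new cluster
            i += 1
    return clusters

# Hypothetical usage with made-up 2-D points and k = 3.
pts = [(5, 15), (10, 12), (20, 16), (25, 22), (30, 30), (40, 45), (42, 40), (8, 30)]
for center, members in doubling_cluster(pts, k=3):
    print(center, len(members))
```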
Example: Plot -- Incremental
[Figure: scatter plot of the 16 example points, labeled 1-16, on a 0-50 by 0-50 grid.]
Example: Doubling Merge (d_2 = 24.08)
[Figure: the 16-point plot after the merge step; an X marks a merged cluster's center.]
Example: Doubling Update (d_2 = 24.08)
[Figure: the 16-point plot during the update phase; X marks cluster centers.]
Example: Doubling Update (d_2 = 24.08)
[Figure: the 16-point plot at a later update step; X marks cluster centers.]
Example: Doubling Update (d_2 = 24.08)
[Figure: the 16-point plot at a later update step; X marks cluster centers.]
Example: Doubling Solution
[Figure: the 16-point plot showing the final clustering.]
Clique Partition Background
 A clique in G = (V, E) is a subset V' of V such that every two vertices in V' are joined by an edge in E.
 A clique partition for G is a partition of V into disjoint subsets V_1 … V_k such that, for 1 <= i <= k, the subgraph induced by V_i is a complete graph.
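A small illustrative example (not from the slides): in the graph with V = {1, 2, 3, 4} and E = {1-2, 1-3, 2-3, 3-4}, the subsets V_1 = {1, 2, 3} and V_2 = {4} form a clique partition, since {1, 2, 3} induces a complete graph and a single vertex is trivially a clique; no partition into a single clique exists here because edges 1-4 and 2-4 are missing.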
Clique Partition Algorithm
1. Assign the first k+1 points to k+1 clusters, with each point as a centroid; d_1 = distance between the closest two points.
2. While more points remain:
   1. d_{t+1} = 2 · d_t
   2. Merge clusters:
      1. Compute a minimum clique partition of the d_{t+1} threshold graph on the cluster centers.
      2. Merge the clusters in each clique.
      3. In each new cluster, arbitrarily designate one of the existing centers as the center of the new cluster.
   3. Update clusters while the number of clusters <= k:
      1. Assign each new point to a cluster if it is within d_{t+1} of that cluster's center or of one of its sub-clusters' centers; otherwise create a new cluster.
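Below is a minimal sketch of this procedure, again assuming 2-D points with Euclidean distance. Computing a minimum clique partition is NP-hard in general, so the sketch substitutes a simple greedy clique-partition heuristic; all function names are illustrative rather than from the source, and the update step only checks the current cluster centers, not sub-cluster centers.

```python
import math
from itertools import combinations

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def greedy_clique_partition(centers, threshold):
    """Partition cluster centers into cliques of the threshold graph
    (an edge joins two centers at distance <= threshold).
    Minimum clique partition is NP-hard, so this greedy pass is a stand-in."""
    unassigned = list(range(len(centers)))
    cliques = []
    while unassigned:
        clique = [unassigned.pop(0)]
        for j in unassigned[:]:
            # j joins only if it is within threshold of every current member.
            if all(dist(centers[j], centers[i]) <= threshold for i in clique):
                clique.append(j)
                unassigned.remove(j)
        cliques.append(clique)
    return cliques

def clique_partition_cluster(points, k):
    """Incremental clique-partition clustering sketch with a doubling threshold."""
    clusters = [[p, [p]] for p in points[:k + 1]]             # [center, members]
    d = min(dist(p, q) for p, q in combinations(points[:k + 1], 2))
    stream, i = points[k + 1:], 0
    while i < len(stream) or len(clusters) > k:
        d = 2 * d                                             # d_{t+1} = 2 * d_t
        # Merge phase: one new cluster per clique of the threshold graph.
        centers = [c[0] for c in clusters]
        merged = []
        for clique in greedy_clique_partition(centers, d):
            members = [m for idx in clique for m in clusters[idx][1]]
            # Arbitrarily keep one existing center as the new cluster's center.
            merged.append([clusters[clique[0]][0], members])
        clusters = merged
        # Update phase: add points while there are at most k clusters.
        while i < len(stream) and len(clusters) <= k:
            p = stream[i]
            nearest = min(clusters, key=lambda c: dist(p, c[0]))
            if dist(p, nearest[0]) <= d:
                nearest[1].append(p)
            else:
                clusters.append([p, [p]])
            i += 1
    return clusters
```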
Example: CP: Merge (d_1 = 12.04)
[Figure: the 16-point plot after the clique-partition merge step.]
Example: CP: Update (d_2 = 24.08)
[Figure: the 16-point plot during the clique-partition update phase.]
Web Document Clustering Applications
 Organizing search engine retrieval results
 Meta-search engine that hierarchically clusters results: Vivisimo
 Meta-search engine that graphically displays clusters of results: Kartoo
 Detecting redundancy (e.g., mirror sites or moved or re-formatted documents)
 User interest profiles (aka filtering)
Vivisimo: Result Organization
Kartoo: Visual Clustering
Detecting Mirrors/Subsumed Web Documents
 Resemblance assesses the similarity between two documents:
   r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
 Containment assesses the degree to which A is contained in B:
   c(A, B) = |S(A) ∩ S(B)| / |S(A)|
A.Z. Broder, S.C. Glassman, M.S. Manasse, G. Zweig, “Syntactic Clustering
of the Web”, Proceedings of WWW6, 1997.
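As a small hypothetical illustration (the numbers are not from the slides): suppose S(A) contains 10 shingles, S(B) contains 8, and the two sets share 6. Then |S(A) ∪ S(B)| = 10 + 8 - 6 = 12, so r(A, B) = 6/12 = 0.5 and c(A, B) = 6/10 = 0.6; the documents resemble each other moderately, and most of A's shingles also appear in B.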
Computing R and C
 S(D, w), the shingle set, is the set of all unique contiguous subsequences of length w in document D.
 S(D) is S(D, w) for a fixed size w.
 To reduce storage and computation, we can sample the shingles of each doc:
   The first s: MIN_s(W)
   Every mth: MOD_m(W)
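Below is a minimal sketch of shingling, the two sampling schemes, and the exact r and c computed from full shingle sets; it assumes word-level shingles and a hash function standing in for a random permutation of shingle values. The helper names and the default parameters (w = 10, m = 25, s = 50, echoing values mentioned later in the slides) are illustrative.

```python
import hashlib

def shingles(text, w=10):
    """S(D, w): all unique contiguous word subsequences of length w in the document."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def h(shingle):
    """Stand-in for the random permutation: map a shingle to a 64-bit integer."""
    return int.from_bytes(hashlib.sha1(shingle.encode()).digest()[:8], "big")

def min_s(shingle_set, s=50):
    """MIN_s(W): the s smallest shingle values under the permutation."""
    return set(sorted(h(x) for x in shingle_set)[:s])

def mod_m(shingle_set, m=25):
    """MOD_m(W): the shingle values divisible by m under the permutation."""
    return {v for v in (h(x) for x in shingle_set) if v % m == 0}

def resemblance(a, b, w=10):
    """Exact r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)| on full shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def containment(a, b, w=10):
    """Exact c(A, B) = |S(A) ∩ S(B)| / |S(A)|."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0
```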
Estimating R & C from a Portion of a Document
Keep a sketch of each document D, consisting of F(D) and/or V(D):
   π : U → U is a random permutation
   F(A) = MIN_s(π(S(A)))
   V(A) = MOD_m(π(S(A)))
   r(A, B) ≈ |MIN_s(F(A) ∪ F(B)) ∩ F(A) ∩ F(B)| / |MIN_s(F(A) ∪ F(B))|
   r(A, B) ≈ |V(A) ∩ V(B)| / |V(A) ∪ V(B)|
   c(A, B) ≈ |V(A) ∩ V(B)| / |V(A)|
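A minimal sketch of these estimates, assuming F(A), F(B), V(A), V(B) are sets of integer shingle values produced as in the previous sketch; the function names are illustrative.

```python
def estimate_r_from_F(fa, fb, s=50):
    """r(A,B) ≈ |MIN_s(F(A) ∪ F(B)) ∩ F(A) ∩ F(B)| / |MIN_s(F(A) ∪ F(B))|."""
    union_smallest = set(sorted(fa | fb)[:s])          # MIN_s(F(A) ∪ F(B))
    return len(union_smallest & fa & fb) / len(union_smallest) if union_smallest else 0.0

def estimate_r_from_V(va, vb):
    """r(A,B) ≈ |V(A) ∩ V(B)| / |V(A) ∪ V(B)|."""
    return len(va & vb) / len(va | vb) if va | vb else 0.0

def estimate_c_from_V(va, vb):
    """c(A,B) ≈ |V(A) ∩ V(B)| / |V(A)|."""
    return len(va & vb) / len(va) if va else 0.0
```

For example, estimate_r_from_F(min_s(shingles(a)), min_s(shingles(b))) approximates r(A, B) while storing only the fixed-size sketches rather than the full shingle sets.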
Web Clustering with R & C
 w = 10, m = 25, s = 50?, threshold = 0.5
 Pre-process documents:
   1. For each doc, calculate a sketch.
   2. Sort pairs of <shingle, docid>, removing lexically-equivalent and shingle-equivalent docs.
   3. Compute the list of doc pairs with the number of shared shingles, ignoring very common shingles.
   4. Generate clusters:
      1. If r(A, B) > threshold, then add a link A <-> B.
      2. Produce connected components using union-find.
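A minimal sketch of the final clustering step, assuming the estimated resemblance for each candidate document pair has already been computed (for example with the sketches above); the union-find helper and the data layout are illustrative.

```python
def cluster_by_resemblance(pair_resemblance, n_docs, threshold=0.5):
    """pair_resemblance: dict mapping (doc_i, doc_j) -> estimated resemblance.
    Links pairs above the threshold and returns connected components via union-find."""
    parent = list(range(n_docs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for (i, j), r in pair_resemblance.items():
        if r > threshold:                   # add link i <-> j
            union(i, j)

    clusters = {}
    for doc in range(n_docs):
        clusters.setdefault(find(doc), []).append(doc)
    return list(clusters.values())
```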
Web Clustering Results 1997
 30M web pages, 150 GBytes
 600M shingles
 3.6M clusters of 12.3M docs
 2.1M clusters of 5.3M identical docs
 Took 10.5 CPU days to compute
Web Applications of Resemblance Clusters
 Find URL similar to …
   relies on fixed threshold and requires URLs to have been processed
 WWW Lost and Found
   requires keeping some historical sketch info
 Remove similar docs from search results