Lecture 10
Overview of Clustering
Uses of Clustering
Pattern Recognition
Image Processing
cluster images based on their visual content
Bio-informatics
WWW and IR
document classification
cluster Weblog data to discover groups of similar access patterns
Uses of Clustering
Understanding
Discovered clusters, with industry group where given:
Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, ...
Cluster 4: Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP (industry group: Oil-UP)
Summarization
Reduce the size of large data sets
Applications
Recommendation systems
Search Personalization
[Figure: Clustering precipitation in Australia]
Outliers
Outliers are objects that do not belong to any
cluster or form clusters of very small cardinality
[Figure: a cluster and its outliers]
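The definition of outliers above can be sketched in a few lines of Python. This is only an illustration: the function name `small_cluster_outliers`, the threshold `min_size`, and the use of `None` as the "no cluster" label are assumptions, not part of the lecture.

```python
from collections import Counter

def small_cluster_outliers(labels, min_size=2, noise_label=None):
    """Return the indices of objects treated as outliers.

    An object is flagged if it belongs to no cluster (its label equals
    noise_label) or if its cluster has fewer than min_size members.
    """
    # Count how many objects each real cluster contains.
    sizes = Counter(label for label in labels if label != noise_label)
    return [
        idx
        for idx, label in enumerate(labels)
        if label == noise_label or sizes[label] < min_size
    ]

# Clusters 0 and 1 are large, cluster 2 has one member, the last object is unassigned.
labels = [0, 0, 0, 1, 1, 1, 1, 2, None]
print(small_cluster_outliers(labels, min_size=2))  # [7, 8]
```

What counts as "very small cardinality" depends on the application, so `min_size` is left as a parameter.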
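The “classic” data input: a matrix with one row per tuple/object
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$
Dissimilarity or distance matrix (one mode): objects by objects
$$
\begin{bmatrix}
0      \\
d(2,1) & 0      \\
d(3,1) & d(3,2) & 0 \\
\vdots & \vdots & \vdots & \ddots \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$
Assuming symmetric distance: d(i,j) = d(j,i)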
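As a rough sketch, the lower-triangular distance matrix can be built from the data matrix as follows. The helper names `euclidean` and `dissimilarity_matrix` are illustrative, and Euclidean distance is only one possible choice for d. Indices start at 0 here, while the slide counts objects from 1.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_matrix(data, dist=euclidean):
    """Return the lower triangle of the distance matrix.

    Row i holds d(i, 0), ..., d(i, i). The upper triangle is omitted
    because d is assumed symmetric: d(i, j) = d(j, i).
    """
    n = len(data)
    return [[dist(data[i], data[j]) for j in range(i + 1)] for i in range(n)]

# Data matrix: three objects (rows), two variables (columns).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
for row in dissimilarity_matrix(data):
    print(row)
# [0.0]
# [5.0, 0.0]
# [10.0, 5.0, 0.0]
```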
Similarity and Distance
For many different problems we need to quantify how close two objects
are.
Examples:
For an item bought by a customer, find other similar items
Group together the customers of a site so that similar customers are shown
the same ad.
Group together web documents so that you can separate the ones that talk
about politics and the ones that talk about sports.
Find all the near-duplicate mirrored web documents.
Find credit card transactions that are very different from previous transactions.
To solve these problems we need a definition of similarity, or distance.
The definition depends on the type of data that we have
Similarity
Binary variables
e.g., gender (M/F), has_cancer(T/F)
Ordinal variables
e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Ratio-scaled variables
population growth (1,10,100,1000,...)
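Minkowski distance:
$$
d(i,j) = \left( |x_{i1} - x_{j1}|^p + |x_{i2} - x_{j2}|^p + \cdots + |x_{in} - x_{jn}|^p \right)^{1/p}
$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{in})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jn})$ are two n-dimensional data objects, and p is a positive integer.
For p = 2 this is the Euclidean distance:
$$
d(i,j) = \sqrt{ |x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{in} - x_{jn}|^2 }
$$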
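A minimal sketch of this distance in Python, assuming the positive integer p is the exponent of the usual Minkowski form (p = 2 gives the squared-difference formula above). The function name `minkowski` is illustrative.

```python
def minkowski(i, j, p=2):
    """Distance between two n-dimensional objects i and j.

    Sums |x_ik - x_jk|^p over all coordinates and takes the p-th root.
    p = 2 is the Euclidean distance, p = 1 the sum of absolute differences.
    """
    if len(i) != len(j):
        raise ValueError("objects must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be a positive integer")
    return sum(abs(a - b) ** p for a, b in zip(i, j)) ** (1.0 / p)

print(minkowski((0, 0), (3, 4), p=2))  # 5.0
print(minkowski((0, 0), (3, 4), p=1))  # 7.0
```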
Properties
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j)
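The four properties can be checked numerically on a small sample of points. The sketch below is only a spot check on the given points, not a proof. The helper names are illustrative, and squared Euclidean distance is included purely as an assumed example of a function that fails the last property.

```python
import itertools
import math

def euclidean(i, j):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def squared_euclidean(i, j):
    return sum((a - b) ** 2 for a, b in zip(i, j))

def satisfies_distance_properties(points, d, tol=1e-9):
    """Check the four properties on every triple (i, j, k) drawn from points."""
    for i, j, k in itertools.product(points, repeat=3):
        if d(i, j) < 0:                          # non-negativity
            return False
        if abs(d(i, i)) > tol:                   # d(i,i) = 0
            return False
        if abs(d(i, j) - d(j, i)) > tol:         # symmetry
            return False
        if d(i, j) > d(i, k) + d(k, j) + tol:    # triangle inequality
            return False
    return True

points = [(0.0,), (1.0,), (2.0,)]
print(satisfies_distance_properties(points, euclidean))          # True
print(satisfies_distance_properties(points, squared_euclidean))  # False
```

The squared version fails because the squared distance from 0 to 2 is 4, which exceeds 1 + 1 via the midpoint.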
Also one can use weighted distance:
$$
d(i,j) = \sqrt{ w_1 |x_{i1} - x_{j1}|^2 + w_2 |x_{i2} - x_{j2}|^2 + \cdots + w_n |x_{in} - x_{jn}|^2 }
$$
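A short sketch of the weighted distance, with one weight per dimension. The name `weighted_euclidean` and the requirement that weights be non-negative are assumptions made for this example.

```python
import math

def weighted_euclidean(i, j, w):
    """Weighted Euclidean distance: sqrt(sum_k w_k * |x_ik - x_jk|^2)."""
    if not (len(i) == len(j) == len(w)):
        raise ValueError("i, j and w must have the same length")
    if any(wk < 0 for wk in w):
        raise ValueError("weights are assumed to be non-negative")
    return math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, i, j)))

# Equal weights of 1 give the plain Euclidean distance.
print(weighted_euclidean((0, 0), (3, 4), (1, 1)))  # 5.0
# A weight of 0 on the second dimension ignores it entirely.
print(weighted_euclidean((0, 0), (3, 4), (4, 0)))  # 6.0
```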
Jaccard Similarity
The Jaccard similarity (Jaccard coefficient) of two sets C1, C2
is the size of their intersection divided by the size of their
union:
JSim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|
Example: 3 elements in the intersection and 8 in the union give
Jaccard similarity = 3/8
Extreme behavior:
JSim(X,Y) = 1 iff X = Y
JSim(X,Y) = 0 iff X and Y have no elements in common
JSim is symmetric
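The Jaccard similarity maps directly onto Python's built-in set operations. In the sketch below the function name `jaccard_similarity` and the example sets are made up for illustration. They are chosen so that 3 elements are shared and 8 appear in the union. Returning 1.0 for two empty sets is an assumed convention, consistent with JSim(X,Y) = 1 when X = Y.

```python
def jaccard_similarity(c1, c2):
    """|C1 ∩ C2| / |C1 ∪ C2| for two collections treated as sets."""
    c1, c2 = set(c1), set(c2)
    union = c1 | c2
    if not union:
        # Assumed convention: two empty sets are identical, so similarity is 1.
        return 1.0
    return len(c1 & c2) / len(union)

c1 = {1, 2, 3, 4, 5}
c2 = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(c1, c2))          # 0.375, i.e. 3/8
print(jaccard_similarity(c1, c1))          # 1.0 (identical sets)
print(jaccard_similarity({1, 2}, {3, 4}))  # 0.0 (nothing in common)
```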
Jaccard Similarity between sets
Jack: 101000
Mary: 101010
Jim: 110000
Simple match distance:
$$
d(i,j) = \frac{\text{number of non-common bit positions}}{\text{total number of bits}}
$$
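Both measures can be tried on the three bit strings for Jack, Mary and Jim. The sketch below treats each string as the set of positions holding a 1 for the Jaccard similarity, and counts differing positions for the simple match distance. The function names are illustrative, and returning 1.0 when neither string has any 1s is an assumed convention.

```python
def simple_match_distance(a, b):
    """Fraction of bit positions where a and b differ."""
    if len(a) != len(b) or not a:
        raise ValueError("bit strings must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jaccard_bits(a, b):
    """Jaccard similarity of the sets of positions that hold a 1."""
    both = sum(x == "1" and y == "1" for x, y in zip(a, b))
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))
    # Assumed convention: two all-zero strings describe equal (empty) sets.
    return both / either if either else 1.0

people = {"Jack": "101000", "Mary": "101010", "Jim": "110000"}
pairs = [("Jack", "Mary"), ("Jack", "Jim"), ("Mary", "Jim")]
for p, q in pairs:
    d = simple_match_distance(people[p], people[q])
    s = jaccard_bits(people[p], people[q])
    print(f"{p}-{q}: simple match distance = {d:.3f}, Jaccard similarity = {s:.3f}")
# Jack-Mary: simple match distance = 0.167, Jaccard similarity = 0.667
# Jack-Jim: simple match distance = 0.333, Jaccard similarity = 0.333
# Mary-Jim: simple match distance = 0.500, Jaccard similarity = 0.250
```

Positions where both strings hold a 0 count as agreement in the simple match distance but are ignored by the Jaccard similarity.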