Lect 4
Lecture outline
Distance/Similarity between data objects
Data objects as geometric data points
Clustering problems and algorithms
K-means
K-median
K-center
What is clustering?
A grouping of data objects such that the objects within a
group are similar (or related) to one another and different
from (or unrelated to) the objects in other groups
Intra-cluster distances are minimized
Inter-cluster distances are maximized
Outliers
Outliers are objects that do not belong to
any cluster or form clusters of very small
cardinality
In some applications we are interested in
discovering outliers, not clusters (outlier analysis)
Why do we cluster?
Clustering: given a collection of data objects, group them so that the objects are
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Applications of clustering?
Image Processing
Cluster images based on their visual content
Web
Cluster groups of users based on their access
patterns on webpages
Cluster webpages based on their content
Bioinformatics
Cluster similar proteins together (similarity with respect to
chemical structure and/or functionality, etc.)
Many more
Observations to cluster
Real-value attributes/variables
e.g., salary, height
Binary attributes
e.g., gender (M/F), has_cancer(T/F)
Ordinal/Ranked attributes
e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
Observations to cluster
Usually data objects consist of a set of
attributes (also known as dimensions)
J. Smith, 20, 200K
If all d dimensions are real-valued then we
can visualize each data point as a point in a
d-dimensional space
If all d dimensions are binary then we can
think of each data point as a binary vector
Distance functions
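The distance d(x, y) between two objects x and y
is a metric if
d(x, y) ≥ 0 (non-negativity)
d(x, y) = 0 if and only if x = y (identity)
d(x, y) = d(y, x) (symmetry)
d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)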
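Data Structures
Data matrix (n tuples/objects as rows, d attributes/dimensions as columns)
\[
\begin{bmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1d} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{id} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{nd}
\end{bmatrix}
\]
Distance matrix (objects by objects)
\[
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
\]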
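To make the two data structures concrete, here is a minimal Python sketch that builds the lower-triangular distance matrix from a data matrix. The function name `distance_matrix`, the toy matrix `X`, and the choice of Euclidean distance (`math.dist`, Python 3.8+) are assumptions made for illustration; any other distance function can be passed in.

```python
import math

def distance_matrix(X, dist=math.dist):
    """Lower-triangular distance matrix of the rows (objects) of X.

    Row i holds d(i, 0), ..., d(i, i-1) followed by the diagonal 0.
    """
    n = len(X)
    return [
        [dist(X[i], X[j]) if j < i else 0.0 for j in range(i + 1)]
        for i in range(n)
    ]

# Hypothetical data matrix: n = 3 objects, d = 2 real-valued attributes.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(X)
for row in D:
    print(row)
```

Only the lower triangle is stored because a symmetric distance satisfies d(i, j) = d(j, i) and d(i, i) = 0.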
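L_p distance between two d-dimensional points x and y
\[
L_p(x, y) = \left( |x_1 - y_1|^p + |x_2 - y_2|^p + \cdots + |x_d - y_d|^p \right)^{1/p}
          = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}
\]
L_1 distance (the case p = 1)
\[
L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_d - y_d|
          = \sum_{i=1}^{d} |x_i - y_i|
\]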
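The L_p family can be sketched in a few lines of Python. The function name `lp_distance` and the sample points are illustrative assumptions, not part of the lecture.

```python
def lp_distance(x, y, p=2):
    """L_p distance between two equal-length sequences, for p >= 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be at least 1 for L_p to be a metric")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Hypothetical 2-dimensional points.
x, y = (1.0, 2.0), (4.0, 6.0)
l1 = lp_distance(x, y, p=1)  # |1 - 4| + |2 - 6|
l2 = lp_distance(x, y, p=2)  # sqrt(3^2 + 4^2)
print(l1, l2)
```

With p = 1 this is the L_1 (Manhattan) distance, and with p = 2 it is the Euclidean distance.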
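K-means objective: choose k centers c_1, ..., c_k and a partition of the points into clusters C_1, ..., C_k such that
\[
\mathrm{Cost}(C) = \sum_{i=1}^{k} \sum_{x \in C_i} L_2^2(x, c_i)
\]
is minimized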
Some special cases: k = 1, k = n
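The two special cases can be checked numerically with a small Python sketch of the k-means cost. The helper names `kmeans_cost` and `mean` and the four sample points are assumptions made for illustration.

```python
def kmeans_cost(clusters, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(x, c))
        for cluster, c in zip(clusters, centers)
        for x in cluster
    )

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# Hypothetical points on a line, written as 2-dimensional tuples.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

# k = 1: a single cluster whose center is the mean of all points.
cost_k1 = kmeans_cost([points], [mean(points)])

# k = n: every point is its own center, so the cost is zero.
cost_kn = kmeans_cost([[p] for p in points], points)

# k = 2: the natural split into a left pair and a right pair.
left, right = points[:2], points[2:]
cost_k2 = kmeans_cost([left, right], [mean(left), mean(right)])
print(cost_k1, cost_k2, cost_kn)
```

For k = 1 the mean is the best single center under squared Euclidean distance, and for k = n the cost drops to zero, so the cost alone cannot be used to pick k.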
[Figure: three scatter plots titled "Original Points", "Optimal Clustering", and "Sub-optimal Clustering"]
Discussion of the k-means algorithm
Finds a local optimum
Often converges quickly (but not always)
The choice of initial points can have a large influence
Clusters of different densities
Clusters of different sizes
Multiple runs (with different initial points) help
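The points above can be illustrated with a minimal sketch of the standard Lloyd iteration (assign each point to its nearest center, then move each center to the mean of its cluster), repeated from several random initializations and keeping the cheapest result. This is an assumed textbook formulation; the names `lloyd` and `best_of`, the iteration cap, and the sample points are hypothetical.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def lloyd(points, k, rng, max_iter=100):
    """One k-means run from k randomly chosen initial points; returns (cost, centers)."""
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda i: sq_dist(x, centers[i]))
            clusters[j].append(x)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # no change: a local optimum was reached
            break
        centers = new_centers
    cost = sum(min(sq_dist(x, c) for c in centers) for x in points)
    return cost, centers

def best_of(points, k, runs=10, seed=0):
    """Run lloyd several times and keep the lowest-cost result."""
    rng = random.Random(seed)
    return min((lloyd(points, k, rng) for _ in range(runs)), key=lambda r: r[0])

# Hypothetical data: two well-separated groups of three points (tuples).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cost, centers = best_of(points, k=2)
print(cost, centers)
```

Each individual run only reaches a local optimum, which is why the sketch keeps the best of several runs rather than trusting a single initialization.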
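K-median objective: choose k centers c_1, ..., c_k and a partition of the points into clusters C_1, ..., C_k such that
\[
\mathrm{Cost}(C) = \sum_{i=1}^{k} \sum_{x \in C_i} L_1(x, c_i)
\]
is minimized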
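As a small illustration of the L_1 cost, the sketch below compares the coordinate-wise median and the mean as the center of a single cluster. It assumes the center may be any point in space (some formulations restrict centers to data points); the helper names and the sample cluster are hypothetical.

```python
from statistics import median

def l1(x, y):
    """L_1 (Manhattan) distance between two points."""
    return sum(abs(a - b) for a, b in zip(x, y))

def kmedian_cost(clusters, centers):
    """Sum of L_1 distances from each point to its cluster center."""
    return sum(l1(x, c) for cluster, c in zip(clusters, centers) for x in cluster)

def coordinate_median(points):
    """Coordinate-wise median of a non-empty list of points."""
    return tuple(median(coords) for coords in zip(*points))

# Hypothetical cluster containing one far-away point.
cluster = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
med = coordinate_median(cluster)
avg = tuple(sum(c) / len(cluster) for c in zip(*cluster))

cost_med = kmedian_cost([cluster], [med])
cost_avg = kmedian_cost([cluster], [avg])
print(med, cost_med, cost_avg)
```

The median center gives the lower L_1 cost here because the far-away point pulls the mean toward it, while the median stays with the majority of the points.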