Cluster Analysis
[Figure: intra-cluster distances are minimized, inter-cluster distances are maximized]
Applications of Cluster Analysis
Grouping stocks with similar price fluctuations, e.g., for browsing:
Cluster 1: Applied-Matl-DOWN, Bay-Network-DOWN, 3-COM-DOWN, …
Cluster 2 (Technology2-DOWN): Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
Cluster 3 (Financial-DOWN): Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN
Image segmentation
– Goal: Break up the image into meaningful or perceptually similar regions
Summarization
– Reduce the size of large data sets
[Figure: clustering precipitation in Australia]
Notion of a Cluster can be Ambiguous
Clustering results are crucially dependent on the measure of similarity (or distance) between the “points” to be clustered
Measure the Quality of Clustering
Quality of clustering:
– There is usually a separate “quality” function that measures the “goodness” of a cluster.
– It is hard to define “similar enough” or “good enough”; the answer is typically highly subjective.
Considerations for Cluster Analysis
Partitioning criteria
– Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
Separation of clusters
– Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
Similarity measure
– Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity)
Clustering space (partial versus complete)
– Full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)
Heterogeneous versus homogeneous
– Clusters of widely different sizes, shapes, and densities
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or Conceptual
Described by an Objective Function
Types of Clusters: Well-Separated
Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every point in the cluster than to any point not in the cluster.
[Figure: 3 well-separated clusters]
Types of Clusters: Center-Based
Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of its cluster than to the center of any other cluster
– The center of a cluster is often a centroid
[Figure: 4 center-based clusters]
Types of Clusters: Density-Based
Density-based
– A cluster is a dense region of points, separated from other regions of high density by low-density regions.
– Used when the clusters are irregular or intertwined, and when noise and outliers are present.
[Figure: 6 density-based clusters]
Types of Clusters: Conceptual Clusters
– Finds clusters that share some common property or represent a particular concept.
[Figure: 2 overlapping circles]
Characteristics of the Input Data Are Important
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
– Nominal attribute: distinctness
– Ordinal attribute: distinctness & order
– Interval attribute: distinctness, order & meaningful differences (addition/subtraction)
– Ratio attribute: all 4 properties
Types of Attributes

Attribute Type | Description | Examples | Operations
Nominal | The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠) | zip codes, employee ID numbers, eye color, sex: {male, female} | mode, entropy, contingency correlation, χ² test
Interval | For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, −) | calendar dates, temperature in Celsius or Fahrenheit | mean, standard deviation, Pearson's correlation, t and F tests
Ratio | For ratio variables, both differences and ratios are meaningful. (*, /) | temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current | geometric mean, harmonic mean, percent variation
Similarity
– Numerical measure of how alike two data objects are.
– Is higher when objects are more alike.
– Often falls in the range [0,1]
Dissimilarity
– Numerical measure of how different two data objects are
– Lower when objects are more alike
– Minimum dissimilarity is often 0
– Upper limit varies
Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes
Manhattan Distance

dist(p, q) = \sum_{k=1}^{n} |p_k - q_k|

Euclidean Distance

dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Manhattan Distance

Actual points in 2D:
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Manhattan distance is the sum of the absolute values of the differences of the coordinates.

Distance between p1 and p2: d = |0 - 2| + |2 - 0| = 2 + 2 = 4
Distance between p1 and p3: d = |0 - 3| + |2 - 1| = 3 + 1 = 4
Distance between p1 and p4: d = |0 - 5| + |2 - 1| = 5 + 1 = 6

L1 distance matrix:
L1   p1  p2  p3  p4
p1   0   4   4   6
p2   4   0   2   4
p3   4   2   0   2
p4   6   4   2   0
Euclidean Distance

Actual points in 2D:
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance matrix:
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
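As a quick check of the two matrices above, here is a minimal sketch (assuming NumPy and SciPy are available) that recomputes them from the point table:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Point coordinates taken from the tables above.
points = np.array([[0, 2],   # p1
                   [2, 0],   # p2
                   [3, 1],   # p3
                   [5, 1]])  # p4

l1 = cdist(points, points, metric="cityblock")   # Manhattan (L1) distance
l2 = cdist(points, points, metric="euclidean")   # Euclidean (L2) distance

print(np.round(l1, 3))  # row p1 -> [0, 4, 4, 6]
print(np.round(l2, 3))  # row p1 -> [0, 2.828, 3.162, 5.099]
```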
Clustering Algorithms
Hierarchical clustering
Density-based clustering
Partitional Clustering
Divide data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Typical methods: k-means, k-medoids, CLARANS
[Figures: a traditional hierarchical clustering of points p1–p4 with its traditional dendrogram, and a non-traditional hierarchical clustering with its non-traditional dendrogram]
K-Means: Partitioning approach
– Iterative Step 1: Assign data points to the closest cluster center
– Iterative Step 2: Change each cluster center to the average of its assigned points
– Repeat until assignments no longer change (see the code sketch below)
[Figure: K-means clustering example, iterations 1–3, alternating the two steps]
Complexity is O(n * K * I)
– n = number of points
– K = number of clusters
– I = number of iterations
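To make the two iterative steps concrete, here is a minimal NumPy sketch of the algorithm (the function name `kmeans`, the random initialisation, and the `n_iters` cap are illustrative assumptions, not from the slides); each pass costs O(n * K), matching the bound above:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centers with k distinct random data points (a common heuristic).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its closest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the average of its assigned points
        # (keep the old center if a cluster ends up empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # assignments have stabilised
            break
        centers = new_centers
    return labels, centers
```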
Two different K-means Clusterings
[Figure: a set of original points and two different K-means clusterings of them]
Importance of Choosing Initial Centroids
[Figure: K-means converging over iterations 1–6 from one choice of initial centroids]
Importance of Choosing Initial Centroids …
[Figure: the first iterations of K-means from a different choice of initial centroids]
Importance of Choosing Initial Centroids …
[Figure: iterations 1–5 of K-means from another choice of initial centroids]
Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE): for each point, the error is its distance to the nearest cluster center.
– Given two clusterings, we can choose the one with the smallest error
– One easy way to reduce SSE is to increase K, the number of clusters
– But a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K
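Written out (a standard formulation; here m_i denotes the centroid of cluster C_i):

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)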
Solutions to Initial Centroids Problem
Multiple runs
– Helps, but probability is not on your side
Sample and use hierarchical clustering to determine initial
centroids
Select more than k initial centroids and then select among these initial centroids
– Select the most widely separated (see the sketch after this list)
Postprocessing
Bisecting K-means
– Not as susceptible to initialization issues
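A minimal sketch of the “most widely separated” selection as a farthest-first traversal (the function name and the choice of the first candidate are illustrative assumptions):

```python
import numpy as np

def widely_separated_centroids(candidates, k):
    # Farthest-first traversal: start from an arbitrary candidate, then
    # repeatedly pick the candidate farthest from all centroids chosen so far.
    chosen = [candidates[0]]
    for _ in range(k - 1):
        # Distance from every candidate to its nearest already-chosen centroid.
        d = np.linalg.norm(candidates[:, None, :] - np.array(chosen)[None, :, :],
                           axis=2).min(axis=1)
        chosen.append(candidates[d.argmax()])  # the most separated candidate
    return np.array(chosen)
```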
Pre-processing and Post-processing
Pre-processing
– Normalize the data
– Eliminate outliers
Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low SSE
Limitations of K-means
K-means has difficulty when clusters differ widely in size or density, when clusters have non-globular shapes, and when the data contains outliers.
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Locates regions of high density that are separated by regions of low density.
In the center-based approach, the density of a point is the number of points within a specified radius, Eps, of that point (see the sketch below).
A cluster is defined as a maximal set of density-connected points.
[Figure: core and border points, with Eps = 1 cm and MinPts = 5]
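To illustrate the center-based notion of density, a small NumPy sketch (illustrative, not from the slides) that counts the Eps-neighbourhood of every point:

```python
import numpy as np

def eps_neighbourhood_counts(X, eps):
    # Center-based density: for each point, count the points within
    # radius Eps of it (the count DBSCAN compares against MinPts).
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return (pairwise <= eps).sum(axis=1)  # each point counts itself
```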
DBSCAN Algorithm
Time Complexity
– O(N × time to find the points in an Eps-neighbourhood), where N is the number of points
– Worst case O(N²)
– Index structures such as KD-trees allow efficient retrieval of all points within a given distance of a specified point, bringing the total down to O(N log N)
Space Complexity
– O(N)
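For reference, a minimal usage sketch with scikit-learn's DBSCAN implementation (assuming scikit-learn is available; eps and min_samples correspond to the Eps and MinPts parameters above, and the toy data is illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(250, 2)                   # toy 2-D data
db = DBSCAN(eps=0.1, min_samples=5).fit(X)   # eps = Eps, min_samples = MinPts
labels = db.labels_                          # cluster ids; -1 marks noise
print("clusters:", len(set(labels) - {-1}))
print("noise points:", int((labels == -1).sum()))
```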
DBSCAN: Core, Border and Noise Points
– A core point has at least MinPts points within distance Eps of it
– A border point is not a core point, but falls within the Eps-neighbourhood of a core point
– A noise point is any point that is neither a core point nor a border point
• Resistant to noise
• Can handle clusters of different shapes and sizes
DBSCAN: Sensitive to Parameters