Lecture 9 Clustering
LECTURE 8
Clustering
Intra-cluster distances are minimized
Inter-cluster distances are maximized
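As a rough illustration of these two quantities, the sketch below compares the mean pairwise distance within clusters to the mean pairwise distance between clusters. The helper names (`mean_pairwise`, `intra_inter`) and the toy data are assumptions made for this example, and averaging pairwise Euclidean distances is only one of several ways to summarize cluster quality.

```python
from itertools import combinations
from math import dist


def mean_pairwise(pairs):
    """Average Euclidean distance over a list of point pairs."""
    return sum(dist(a, b) for a, b in pairs) / len(pairs)


def intra_inter(clusters):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    # Pairs of points that share a cluster.
    intra = [(a, b) for c in clusters for a, b in combinations(c, 2)]
    # Pairs of points taken from two different clusters.
    inter = [(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1 for b in c2]
    return mean_pairwise(intra), mean_pairwise(inter)


# Hypothetical toy data: the same six points grouped sensibly and badly.
good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
bad = [[(0, 0), (10, 10), (1, 0)], [(0, 1), (10, 11), (11, 10)]]

good_intra, good_inter = intra_inter(good)
bad_intra, bad_inter = intra_inter(bad)
print("good grouping:", good_intra, good_inter)
print("bad grouping: ", bad_intra, bad_inter)
```

The sensible grouping has the smaller intra-cluster average and the larger inter-cluster average, which is the pattern a good clustering aims for.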
Applications of Cluster Analysis
• Understanding
Discovered Clusters (Industry Group):
Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, ...
Cluster 4: Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP (Oil-UP)
• Summarization
• Reduce the size of large data sets
Clustering precipitation in Australia
Early applications of cluster analysis
• John Snow, London 1854
Notion of a Cluster can be Ambiguous
3 well-separated clusters
Types of Clusters: Center-Based
• Center-based
• A cluster is a set of objects such that each object in the cluster is closer (more similar) to the “center” of its own cluster than to the center of any other cluster
• The center of a cluster is often a centroid, the point that minimizes the sum of squared distances to all the points in the cluster (their mean), or a medoid, the most “representative” actual point of the cluster
4 center-based clusters
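The difference between a centroid and a medoid can be sketched in a few lines. The function names and the sample cluster below are assumptions for illustration. The medoid is taken here as the member with the smallest total Euclidean distance to the other members, which is one common definition.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean; it need not coincide with any data point."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """The member of the cluster with the smallest total distance to the rest."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))


# Hypothetical cluster with one far-away point pulling the mean outward.
cluster = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (1.0, 1.0), (10.0, 10.0)]

c = centroid(cluster)
m = medoid(cluster)
print("centroid:", c)
print("medoid:  ", m)
```

The centroid is dragged toward the distant point and is not itself a member of the cluster, while the medoid is always one of the original points.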
Types of Clusters: Contiguity-Based
• Contiguous Cluster (Nearest neighbor or Transitive)
• A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
8 contiguous clusters
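One way to make the transitive idea concrete is to link any two points that lie within some distance threshold and take the connected components of the resulting graph as clusters. This is a simplified sketch rather than the exact definition above: the threshold `eps` and the function name `contiguous_clusters` are assumptions introduced for the example.

```python
from math import dist


def contiguous_clusters(points, eps):
    """Group point indices into connected components of the 'within eps' graph."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        # Start a new component from any unvisited point.
        stack = [unvisited.pop()]
        component = []
        while stack:
            i = stack.pop()
            component.append(i)
            # Pull in every unvisited point that is close to point i.
            neighbors = {j for j in unvisited
                         if dist(points[i], points[j]) <= eps}
            unvisited -= neighbors
            stack.extend(neighbors)
        clusters.append(sorted(component))
    return sorted(clusters)


# Hypothetical data: a chain of four points and a separate pair.
chain = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0)]
groups = contiguous_clusters(chain, eps=1.5)
print(groups)
```

Points 0 and 3 end up in the same cluster even though they are 3 units apart, because they are linked through their neighbors.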
Types of Clusters: Density-Based
• Density-based
• A cluster is a dense region of points that is separated from other high-density regions by regions of low density.
• Used when the clusters are irregular or intertwined, and when noise and outliers are present.
6 density-based clusters
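A minimal way to express "dense region" in code is to count how many points fall within a fixed radius of each point. The sketch below only flags dense points and does not build clusters. The parameters `eps` and `min_pts` and the function name are assumptions for this example, loosely in the spirit of density-based methods such as DBSCAN.

```python
from math import dist


def dense_points(points, eps, min_pts):
    """Indices of points with at least min_pts points (itself included) within eps."""
    return [i for i, p in enumerate(points)
            if sum(dist(p, q) <= eps for q in points) >= min_pts]


# Hypothetical data: a small blob near the origin plus one isolated outlier.
data = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
dense = dense_points(data, eps=1.5, min_pts=3)
print(dense)
```

The isolated point has no neighbors within the radius, so it is left out as noise.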
Types of Clusters: Conceptual Clusters
• Shared Property or Conceptual Clusters
• Finds clusters that share some common property or represent a particular concept.
2 Overlapping Circles
Types of Clusters: Objective Function
• Clustering as an optimization problem
• Finds clusters that minimize or maximize an objective function.
• Exhaustive approach: enumerate all possible ways of dividing the points into clusters and evaluate the ‘goodness’ of each potential set of clusters using the given objective function. This is infeasible for all but tiny data sets (the problem is NP-hard).
• Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
• A variation of the global objective function approach is to fit the data to a parameterized model.
• The parameters for the model are determined from the data, and they determine the clustering
• E.g., mixture models assume that the data is a ‘mixture’ of a number of statistical distributions.
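To see why exhaustive enumeration does not scale, the sketch below tries every assignment of n points to k clusters and keeps the one with the smallest objective value. The sum of squared errors (SSE) is used as the objective, which is an assumption for this example, as are the function names. With k**n candidate labelings, this is only usable on a handful of points.

```python
from itertools import product


def sse(points, labels, k):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            return float("inf")  # reject labelings that leave a cluster empty
        center = [sum(x) / len(members) for x in zip(*members)]
        total += sum(sum((a - b) ** 2 for a, b in zip(p, center))
                     for p in members)
    return total


def best_partition(points, k):
    """Brute force over all k**n labelings; feasible only for tiny inputs."""
    return min(product(range(k), repeat=len(points)),
               key=lambda labels: sse(points, labels, k))


# Hypothetical data: two obvious pairs of points.
pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
best = best_partition(pts, k=2)
print(best, sse(pts, best, 2))
```

Four points and two clusters already require 16 evaluations, and the count grows exponentially with the number of points.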
Clustering Algorithms
• K-means and its variants
• Hierarchical clustering
• DBSCAN
K-MEANS
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The objective is to minimize the sum of distances of the points to their respective centroids (most commonly squared Euclidean distances, i.e., the sum of squared errors)
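The description above can be turned into a short iterative procedure: assign each point to its closest centroid, recompute each centroid as the mean of its points, and repeat until nothing changes. The sketch below is a minimal version under stated assumptions: Euclidean distance, caller-supplied initial centroids, and an empty cluster simply keeping its previous centroid. The function name and toy data are illustrative only.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Basic K-means: alternate assignment and centroid update steps."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point goes to the closest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids did not move
        centroids = new_centroids
    return labels, centroids


# Hypothetical data: two well-separated groups, K = 2.
pts = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
labels, centers = kmeans(pts, centroids=[(0, 0), (9, 8)])
print(labels)
print(centers)
```

The number of clusters K is implied by the number of initial centroids passed in, which keeps the choice of initialization explicit.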
K-means Clustering
[Figure: scatter plots in the x-y plane, including a panel titled "Original Points"]
Importance of Choosing Initial Centroids
[Figure: scatter plots in the x-y plane showing successive K-means iterations, up to Iteration 5]
Dealing with Initialization
• Do multiple runs and select the clustering with the smallest error
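A sketch of this restart strategy is given below. It assumes that the error is the sum of squared errors (SSE) and that each run starts from K data points chosen uniformly at random. The names `best_of_runs`, `runs`, and `seed` are introduced for the example. A compact K-means routine is included so the block runs on its own.

```python
import random
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Compact K-means: assign to closest centroid, then recompute means."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = [min(range(len(centroids)),
                      key=lambda c: dist(p, centroids[c]))
                  for p in points]
        new = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new.append(tuple(sum(x) / len(members) for x in zip(*members))
                       if members else centroids[c])
        if new == centroids:
            break
        centroids = new
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances of points to their assigned centroid."""
    return sum(dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def best_of_runs(points, k, runs=10, seed=0):
    """Run K-means from several random initializations; keep the lowest SSE."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        init = rng.sample(points, k)  # k distinct data points as initial centroids
        labels, centroids = kmeans(points, init)
        results.append((sse(points, labels, centroids), labels, centroids))
    best = min(results, key=lambda r: r[0])
    return best, [r[0] for r in results]


# Hypothetical data: three small, well-separated groups.
base = [(0, 0), (0, 1), (1, 0)]
pts = [(x + dx, y + dy) for dx, dy in [(0, 0), (20, 0), (0, 20)] for x, y in base]

(best_error, best_labels, best_centers), all_errors = best_of_runs(pts, k=3)
print("errors per run:", all_errors)
print("best error:", best_error)
```

Printing the per-run errors makes it visible when different initializations converge to different local optima, which is the reason for restarting in the first place.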