02 - Clustering
Clustering
by:
Hossam El Din Hassan Abd El Munim
حسام الدين حسن عبد المنعم
Computer & Systems Engineering Dept.,
Ain Shams University,
1 El-Sarayat Street, Abbassia, Cairo 11517
Today
1. Parametric Approach
▪ assume a parametric distribution of the data
▪ estimate the parameters of this distribution
▪ much “harder” than the supervised case
2. Non-Parametric Approach
▪ group the data into clusters; each cluster (hopefully) says something about the categories (classes) present in the data
Why Unsupervised Learning?
▪ Unsupervised learning is harder
▪ How do we know if results are meaningful? No answer
labels are available.
▪ Let the expert look at the results (external evaluation)
▪ Define an objective function on clustering (internal evaluation)
▪ We nevertheless need it because
1. Labeling large datasets is very costly (speech recognition)
▪ sometimes can label only a few examples by hand
2. May have no idea what/how many classes there are (data
mining)
3. May want to use clustering to gain some insight into the
structure of the data before designing a classifier
▪ Clustering as data description
Clustering
▪ Seek “natural” clusters in the data
[Figure: the same data set grouped into 3 clusters or 2 clusters?]
▪ Possible approaches
1. fix the number of clusters to k
2. find the best clustering according to the criterion
function (number of clusters may vary)
Proximity Measures
▪ A good proximity measure is VERY application-dependent
▪ Clusters should be invariant under the transformations “natural” to the problem
▪ For example, for object recognition we may want invariance to rotation
[Figure: digits 9 and 6 at distance 0 under rotation invariance]
Distance (dissimilarity) Measures
▪ Euclidean distance
  d(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{ik} - x_{jk})^2 \right)^{1/2}
  ▪ translation invariant
▪ City block (Manhattan) distance
  d(x_i, x_j) = \sum_{k=1}^{d} | x_{ik} - x_{jk} |
  ▪ approximation to the Euclidean distance, cheaper to compute
▪ Chebyshev distance
  d(x_i, x_j) = \max_{1 \le k \le d} | x_{ik} - x_{jk} |
▪ Correlation coefficient
  ▪ popular in image processing
  s(x_i, x_j) = \frac{\sum_{k=1}^{d} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\left[ \sum_{k=1}^{d} (x_{ik} - \bar{x}_i)^2 \, \sum_{k=1}^{d} (x_{jk} - \bar{x}_j)^2 \right]^{1/2}}
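For concreteness, here is a minimal sketch of these four measures in Python (assuming xi and xj are NumPy vectors of equal length d; the function names are illustrative, not from the slides):

```python
import numpy as np

def euclidean(xi, xj):
    # (sum_k (x_ik - x_jk)^2)^(1/2); translation invariant
    return np.sqrt(np.sum((xi - xj) ** 2))

def manhattan(xi, xj):
    # sum_k |x_ik - x_jk|; cheaper approximation to Euclidean
    return np.sum(np.abs(xi - xj))

def chebyshev(xi, xj):
    # max_k |x_ik - x_jk|
    return np.max(np.abs(xi - xj))

def correlation(xi, xj):
    # normalized correlation of the mean-centered vectors
    ci, cj = xi - xi.mean(), xj - xj.mean()
    return np.sum(ci * cj) / np.sqrt(np.sum(ci ** 2) * np.sum(cj ** 2))
```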
Feature Scale
Simplest Clustering Algorithm
• Two Issues
1. How to measure similarity between samples?
2. How to evaluate a partitioning?
• If distance is a good measure of dissimilarity, the distance between samples in the same cluster must be smaller than the distance between samples in different clusters
• Two samples belong to the same cluster if the distance between them is less than a threshold d_0.
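A minimal sketch of this rule, assuming Euclidean distance and treating a cluster as a group of samples linked by chains of distances below d_0; the union-find bookkeeping and the function name are illustrative choices, not part of the slides:

```python
import numpy as np

def threshold_clusters(X, d0):
    """Group rows of X: samples closer than d0 are linked into one cluster."""
    n = len(X)
    parent = list(range(n))              # union-find over sample indices

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(X[i] - X[j]) < d0:
                parent[find(i)] = find(j)   # link samples i and j

    return [find(i) for i in range(n)]       # cluster label per sample
```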
Scaling Axis
Criterion Functions for Clustering
▪ Have samples x1,…,xn
▪ Suppose partitioned samples into c subsets D1,…,Dc
[Figure: samples partitioned into subsets D_1, D_2, D_3]
▪ There are approximately c^n / c! distinct partitions
▪ Can define a criterion function J(D1,…,Dc) which
measures the quality of a partitioning D1,…,Dc
▪ Then clustering becomes a well-defined problem
▪ the optimal clustering is the partition which optimizes the
criterion function
SSE Criterion Function
▪ Let n_i be the number of samples in D_i, and define the mean of the samples in D_i:
  m_i = \frac{1}{n_i} \sum_{x \in D_i} x
▪ The sum-of-squared-errors (SSE) criterion is
  J_e = \sum_{i=1}^{c} \sum_{x \in D_i} \| x - m_i \|^2
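The SSE criterion translates directly to code (a sketch; X is the n-by-d sample array and labels assigns each sample to one of the c clusters):

```python
import numpy as np

def sse_criterion(X, labels):
    """J_e = sum over clusters of squared distances to the cluster mean."""
    labels = np.asarray(labels)
    J = 0.0
    for c in np.unique(labels):
        Dc = X[labels == c]           # samples in cluster c
        mc = Dc.mean(axis=0)          # cluster mean m_c
        J += np.sum((Dc - mc) ** 2)   # sum_x ||x - m_c||^2
    return J
```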
▪ The k-means algorithm (locally) minimizes J_e:
1. Initialize
   ▪ pick k cluster centers arbitrarily
   ▪ assign each example to the closest center
2. Compute the sample mean of each cluster
3. Reassign each sample to the closest mean; if any assignment changed, go to step 2
▪ It is very efficient
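A compact sketch of this procedure, assuming the initial centers are k distinct random samples; the loop alternates assignment and mean computation until the means stop moving:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # pick k cluster centers arbitrarily (here: k distinct random samples)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # assign each example to the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute the sample mean of each cluster (keep old center if empty)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # converged: means stopped moving
            break
        centers = new_centers
    return labels, centers
```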
• Scatter criteria
• Scatter matrices used in multiple discriminant analysis,
i.e., the within-scatter matrix SW and the between-
scatter matrix SB
ST = SB +SW
• Note:
• ST does not depend on partitioning
• In contrast, SB and SW depend on partitioning
• Two approaches:
• minimize the within-cluster scatter
• maximize the between-cluster scatter
• Definitions (m_i and n_i are the mean and size of cluster D_i; m is the total mean):
  S_W = \sum_{i=1}^{c} \sum_{x \in D_i} (x - m_i)(x - m_i)^t
  S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^t
  m = \frac{1}{n} \sum_{x \in D} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i
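A sketch of the scatter matrices in code, following the definitions above; it also checks the identity S_T = S_B + S_W numerically:

```python
import numpy as np

def scatter_matrices(X, labels):
    labels = np.asarray(labels)
    m = X.mean(axis=0)                               # total mean
    d = X.shape[1]
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c in np.unique(labels):
        Dc = X[labels == c]
        mc = Dc.mean(axis=0)
        SW += (Dc - mc).T @ (Dc - mc)                # within-cluster scatter
        SB += len(Dc) * np.outer(mc - m, mc - m)     # between-cluster scatter
    ST = (X - m).T @ (X - m)                         # total scatter
    assert np.allclose(ST, SW + SB)                  # S_T = S_B + S_W
    return SW, SB, ST
```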
Iterative optimization
• Clustering is a discrete optimization problem
• A finite data set has a finite number of partitions
• What is the cost of exhaustive search?
  Approximately c^n / c! for c clusters. Not a good idea (see the calculation below)
• Iterative improvement instead: moving a sample \hat{x} from cluster D_i to D_j reduces J_e if
  \frac{n_j}{n_j + 1} \| \hat{x} - m_j \|^2 < \frac{n_i}{n_i - 1} \| \hat{x} - m_i \|^2
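To see why exhaustive search is hopeless, the approximate count c^n / c! mentioned above can be evaluated directly (a toy calculation with illustrative values of n and c):

```python
from math import factorial

n, c = 100, 5
print(c ** n // factorial(c))   # about 6.6e67 partitions of 100 samples into 5 clusters
```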
Hierarchical Clustering
• Many times, clusters are not disjoint, but a cluster
may have subclusters, in turn having sub-
subclusters, etc.
• Consider a sequence of partitions of the n
samples into c clusters
• The first is a partition into n clusters, each one containing exactly one sample
• The second is a partition into n-1 clusters, the third into
n-2, and so on, until the n-th in which there is only one
cluster containing all of the samples
• At the level k in the sequence, c = n-k+1.
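A generic sketch of this bottom-up (agglomerative) procedure: every sample starts in its own cluster and the two closest clusters are merged at each level; cluster_dist can be any of the inter-cluster distances defined below (the function names are illustrative):

```python
import numpy as np

def agglomerative(X, c, cluster_dist):
    """Merge the two closest clusters until only c clusters remain."""
    clusters = [[i] for i in range(len(X))]          # level 1: n singleton clusters
    while len(clusters) > c:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_dist(X[clusters[a]], X[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]      # merge the two closest clusters
        del clusters[b]
    return clusters
```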
• Distances between clusters D_i and D_j:
  d_min(D_i, D_j) = \min_{x \in D_i, \, x' \in D_j} \| x - x' \|
  d_max(D_i, D_j) = \max_{x \in D_i, \, x' \in D_j} \| x - x' \|
  d_avg(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \| x - x' \|
  d_mean(D_i, D_j) = \| m_i - m_j \|
• All of these measures behave quite similarly if the clusters are hyperspherical and well separated
• The computational complexity is O(c n^2 d^2), n >> c
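The same inter-cluster distances written as small Python functions, compatible with the agglomerative sketch above (Di and Dj are arrays holding the samples of the two clusters):

```python
import numpy as np

def d_min(Di, Dj):   # nearest pair of samples
    return min(np.linalg.norm(x - y) for x in Di for y in Dj)

def d_max(Di, Dj):   # farthest pair of samples
    return max(np.linalg.norm(x - y) for x in Di for y in Dj)

def d_avg(Di, Dj):   # average over all cross-cluster pairs
    return sum(np.linalg.norm(x - y) for x in Di for y in Dj) / (len(Di) * len(Dj))

def d_mean(Di, Dj):  # distance between the cluster means
    return np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))
```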
• Nearest-neighbor algorithm (single linkage)
  • d_min is used
  • Using d_min as the distance measure, agglomerative clustering generates a minimal spanning tree
• Farthest-neighbor algorithm (complete linkage)
  • d_max is used
  • When two clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters
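With the pieces above, the two variants differ only in which inter-cluster distance is plugged into the agglomerative sketch (an illustrative usage, assuming X is the sample array and 3 clusters are requested):

```python
clusters_single   = agglomerative(X, c=3, cluster_dist=d_min)   # nearest-neighbor / single linkage
clusters_complete = agglomerative(X, c=3, cluster_dist=d_max)   # farthest-neighbor / complete linkage
```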