
CSE464 & CSE468: Image Processing and Pattern Recognition

Clustering

by:
Hossam El Din Hassan Abd El Munim
‫حسام الدين حسن عبد المنعم‬
Computer & Systems Engineering Dept.,
Ain Shams University,
1 El-Sarayat Street, Abbassia, Cairo 11517

Today

▪ New Topic: Unsupervised Learning


▪ Supervised vs. unsupervised learning
▪ Unsupervised learning
▪ Nonparametric unsupervised learning = clustering
▪ Proximity Measures
▪ Criterion Functions
▪ Flat Clustering
▪ k-means
▪ Hierarchical Clustering
▪ Divisive
▪ Agglomerative
Supervised vs. Unsupervised Learning
▪ Up to now we considered the supervised learning scenario, where we are given
  1. samples x1,…, xn
  2. class labels for all samples x1,…, xn
▪ This is also called learning with a teacher, since the correct answer (the true class) is provided

▪ In the next few lectures we consider the unsupervised learning scenario, where we are only given
  1. samples x1,…, xn
▪ This is also called learning without a teacher, since the correct answer is not provided
▪ we do not split the data into training and test sets
Unsupervised Learning
▪ Data is not labeled

1. Parametric Approach (a lot is known about the data: "easier")
   ▪ assume a parametric distribution of the data
   ▪ estimate the parameters of this distribution
   ▪ much "harder" than the supervised case
2. Nonparametric Approach (little is known about the data: "harder")
   ▪ group the data into clusters; each cluster (hopefully) says something about the categories (classes) present in the data
Why Unsupervised Learning?
▪ Unsupervised learning is harder
  ▪ How do we know if the results are meaningful? No answer labels are available.
    ▪ Let the expert look at the results (external evaluation)
    ▪ Define an objective function on the clustering (internal evaluation)
▪ We nevertheless need it because
  1. Labeling large datasets is very costly (speech recognition)
     ▪ sometimes we can label only a few examples by hand
  2. We may have no idea what/how many classes there are (data mining)
  3. We may want to use clustering to gain some insight into the structure of the data before designing a classifier
     ▪ Clustering as data description
Clustering
▪ Seek "natural" clusters in the data

▪ What is a good clustering?
  ▪ internal (within-cluster) distances should be small
  ▪ external (between-cluster) distances should be large
▪ Clustering is a way to discover new categories (classes)
What we Need for Clustering
1. Proximity measure, either
   ▪ similarity measure s(xi, xk): large if xi, xk are similar
   ▪ dissimilarity (or distance) measure d(xi, xk): small if xi, xk are similar
   [figure: dissimilar points have large d, small s; similar points have large s, small d]

2. Criterion function to evaluate a clustering
   [figure: examples of a good clustering and a bad clustering]

3. Algorithm to compute the clustering
   ▪ For example, by optimizing the criterion function
How Many Clusters?
[figure: the same data set interpreted as 3 clusters or 2 clusters]

▪ Possible approaches
  1. fix the number of clusters to k
  2. find the best clustering according to the criterion function (the number of clusters may vary)
Proximity Measures
▪ a good proximity measure is VERY application dependent
  ▪ Clusters should be invariant under the transformations "natural" to the problem
  ▪ For example, for object recognition, we should have invariance to rotation (a rotated object is at distance 0 from the original)
  ▪ For character recognition, no invariance to rotation (a rotated "9" becomes a "6")
Distance (Dissimilarity) Measures
▪ Euclidean distance
  $$d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} \left( x_i^{(k)} - x_j^{(k)} \right)^2}$$
  ▪ translation invariant

▪ Manhattan (city block) distance
  $$d(x_i, x_j) = \sum_{k=1}^{d} \left| x_i^{(k)} - x_j^{(k)} \right|$$
  ▪ approximation to Euclidean distance, cheaper to compute

▪ Chebyshev distance
  $$d(x_i, x_j) = \max_{1 \le k \le d} \left| x_i^{(k)} - x_j^{(k)} \right|$$
  ▪ approximation to Euclidean distance, cheapest to compute
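As a quick illustration (not from the slides), a minimal NumPy sketch of the three distances:

```python
import numpy as np

def euclidean(xi, xj):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    return np.sqrt(np.sum((xi - xj) ** 2))

def manhattan(xi, xj):
    """Manhattan (city block) distance: sum of absolute coordinate differences."""
    return np.sum(np.abs(xi - xj))

def chebyshev(xi, xj):
    """Chebyshev distance: largest absolute coordinate difference."""
    return np.max(np.abs(xi - xj))

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 0.0, 3.0])
print(euclidean(xi, xj), manhattan(xi, xj), chebyshev(xi, xj))  # ~3.606, 5.0, 3.0
```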
Similarity Measures
▪ Cosine similarity:
  $$s(x_i, x_j) = \frac{x_i^T x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}$$
  ▪ the smaller the angle, the larger the similarity
  ▪ scale invariant measure
  ▪ popular in text retrieval

▪ Correlation coefficient
  ▪ popular in image processing
  $$s(x_i, x_j) = \frac{\sum_{k=1}^{d} \left( x_i^{(k)} - \bar{x}_i \right) \left( x_j^{(k)} - \bar{x}_j \right)}{\left[ \sum_{k=1}^{d} \left( x_i^{(k)} - \bar{x}_i \right)^2 \sum_{k=1}^{d} \left( x_j^{(k)} - \bar{x}_j \right)^2 \right]^{1/2}}$$
Feature Scale
Simplest Clustering Algorithm
• Two issues
  1. How to measure similarity between samples?
  2. How to evaluate a partitioning?
• If distance is a good measure of dissimilarity, the distance between samples in the same cluster must be smaller than the distance between samples in different clusters
• Two samples belong to the same cluster if the distance between them is less than a threshold d0
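A minimal sketch of this threshold rule, assuming we interpret it as taking the connected components of the graph whose edges join samples closer than d0 (the function name and the Euclidean distance choice are illustrative assumptions):

```python
import numpy as np

def threshold_clustering(X, d0):
    """Group samples into connected components of the graph whose edges
    join pairs of samples whose Euclidean distance is below d0."""
    n = len(X)
    labels = -np.ones(n, dtype=int)        # -1 means "not yet assigned"
    current = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = current
        stack = [i]
        # traverse the "distance < d0" graph from sample i (flood fill)
        while stack:
            p = stack.pop()
            dists = np.linalg.norm(X - X[p], axis=1)
            neighbors = np.where((dists < d0) & (labels == -1))[0]
            labels[neighbors] = current
            stack.extend(neighbors.tolist())
        current += 1
    return labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(threshold_clustering(X, d0=1.0))  # [0 0 1 1]
```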
Scaling Axis
Criterion Functions for Clustering
▪ Have samples x1,…,xn
▪ Suppose we partition the samples into c subsets D1,…,Dc
  [figure: a data set partitioned into clusters D1, D2, D3]
▪ There are approximately c^n / c! distinct partitions
▪ Can define a criterion function J(D1,…,Dc) which measures the quality of a partitioning D1,…,Dc
▪ Then the clustering problem is a well-defined problem
  ▪ the optimal clustering is the partition which optimizes the criterion function
SSE Criterion Function
▪ Let ni be the number of samples in Di, and define the mean of the samples in Di:
  $$\mu_i = \frac{1}{n_i} \sum_{x \in D_i} x$$
▪ Then the sum-of-squared-errors criterion function (to minimize) is:
  $$J_{SSE} = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - \mu_i \rVert^2$$
▪ Note that the number of clusters, c, is fixed
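A short sketch of this criterion in code (illustrative; it assumes a NumPy array X of samples and one integer cluster label per sample):

```python
import numpy as np

def sse(X, labels):
    """Sum-of-squared-errors criterion: squared distance of every sample
    to the mean of its own cluster, summed over all clusters."""
    total = 0.0
    for c in np.unique(labels):
        cluster = X[labels == c]
        mu = cluster.mean(axis=0)              # cluster mean
        total += np.sum((cluster - mu) ** 2)   # squared errors within the cluster
    return total
```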


SSE Criterion Function
$$J_{SSE} = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - \mu_i \rVert^2$$

▪ The SSE criterion is appropriate when the data forms compact clouds that are relatively well separated

▪ The SSE criterion favors equally sized clusters, and may not be appropriate when the "natural" groupings have very different sizes
  [figure: with clusters of very different sizes, the natural grouping has large JSSE while an "unnatural" balanced split has small JSSE]
K-means Clustering
▪ We now consider an example of an iterative optimization algorithm for the special case of the JSSE objective function
  $$J_{SSE} = \sum_{i=1}^{k} \sum_{x \in D_i} \lVert x - \mu_i \rVert^2$$
  ▪ for a different objective function, we need a different optimization algorithm, of course

▪ Fix the number of clusters to k (c = k)

▪ k-means is probably the most famous clustering algorithm
  ▪ it has a smart way of moving from the current partitioning to the next one
K-means Clustering (k = 3)
[figure: k-means iterations on a 2-D data set with k = 3; crosses mark the cluster centers]

1. Initialize
   ▪ pick k cluster centers arbitrarily
   ▪ assign each example to the closest center

2. Compute the sample mean of each cluster

3. Reassign all samples to the closest mean

4. If the clusters changed at step 3, go to step 2
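A compact NumPy sketch of these four steps (an illustrative implementation, not the course's own code; the initialization here simply picks k random samples as the initial centers):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means: alternately assign samples to the nearest center and
    recompute each center as the mean of its cluster, until membership is stable."""
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct samples as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # assign every sample to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # step 4: stop if cluster membership did not change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # steps 2-3: recompute each center as the sample mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# usage: three Gaussian blobs, k = 3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ([0, 0], [3, 3], [0, 3])])
labels, centers = kmeans(X, k=3)
```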


K-means Clustering (Means Derivation)
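The derivation itself is not carried over in the extracted text; as a minimal sketch of the standard argument, for fixed cluster assignments the center that minimizes the contribution of cluster Di to JSSE is its sample mean:

$$\frac{\partial}{\partial \mu_i} \sum_{x \in D_i} \lVert x - \mu_i \rVert^2 = -2 \sum_{x \in D_i} (x - \mu_i) = 0 \;\;\Longrightarrow\;\; \mu_i = \frac{1}{n_i} \sum_{x \in D_i} x$$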
K-means Clustering
▪ In practice, k-means clustering usually performs well

▪ It is very efficient

▪ Its solution can be used as a starting point for other clustering algorithms

▪ There are still hundreds of papers on variants and improvements of k-means clustering every year
Clustering Criteria Again

• Scatter criteria
  • Scatter matrices used in multiple discriminant analysis, i.e., the within-cluster scatter matrix SW and the between-cluster scatter matrix SB
    ST = SB + SW
• Note:
  • ST does not depend on the partitioning
  • In contrast, SB and SW depend on the partitioning
• Two approaches:
  • minimize the within-cluster scatter
  • maximize the between-cluster scatter


• The trace (sum of diagonal elements) is the simplest scalar measure of the scatter matrix:
  $$\mathrm{tr}[S_W] = \sum_{i=1}^{c} \mathrm{tr}[S_i] = \sum_{i=1}^{c} \sum_{x \in D_i} \lVert x - m_i \rVert^2 = J_e$$
  • proportional to the sum of the variances in the coordinate directions
  • This is the sum-of-squared-error criterion, JSSE.


• As tr[ST] = tr[SW] + tr[SB] and tr[ST] is independent of the partitioning, no new results can be derived by minimizing tr[SB]

• However, seeking to minimize the within-cluster criterion JSSE = tr[SW] is equivalent to maximizing the between-cluster criterion
  $$\mathrm{tr}[S_B] = \sum_{i=1}^{c} n_i \lVert m_i - m \rVert^2$$
  where m is the total mean vector:
  $$m = \frac{1}{n} \sum_{x \in D} x = \frac{1}{n} \sum_{i=1}^{c} n_i m_i$$
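As an illustration of these identities (not part of the slides), a small NumPy check that tr[ST] = tr[SW] + tr[SB] holds for any partitioning:

```python
import numpy as np

def scatter_matrices(X, labels):
    """Within-cluster (SW), between-cluster (SB) and total (ST) scatter matrices."""
    m = X.mean(axis=0)                              # total mean vector
    d = X.shape[1]
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        SW += (Xc - mc).T @ (Xc - mc)               # sum of (x - mi)(x - mi)^T
        SB += len(Xc) * np.outer(mc - m, mc - m)    # ni (mi - m)(mi - m)^T
    ST = (X - m).T @ (X - m)
    return SW, SB, ST

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
labels = rng.integers(0, 3, size=100)               # an arbitrary partitioning
SW, SB, ST = scatter_matrices(X, labels)
print(np.allclose(ST, SW + SB))                     # True: ST = SW + SB
print(np.trace(SW))                                 # equals Je for this partition
```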


Iterative Optimization
• Clustering → a discrete optimization problem
• A finite data set → a finite number of partitions
• What is the cost of exhaustive search?
  • approximately c^n / c! partitions for c clusters; for c = 5 and n = 100 this is already on the order of 10^67, so exhaustive search is not a good idea

• Typically iterative optimization is used:
  • start from a reasonable initial partition
  • redistribute samples so as to minimize the criterion function
  • this guarantees only local, not global, optimization
• Consider an iterative procedure to minimize the sum-of-squared-error criterion Je
  $$J_e = \sum_{i=1}^{c} J_i \quad \text{where} \quad J_i = \sum_{x \in D_i} \lVert x - m_i \rVert^2$$
  where Ji is the effective error per cluster.

• Moving a sample x̂ from cluster Di to Dj changes the errors in the two clusters by:
  $$J_j^* = J_j + \frac{n_j}{n_j + 1} \lVert \hat{x} - m_j \rVert^2 \qquad\qquad J_i^* = J_i - \frac{n_i}{n_i - 1} \lVert \hat{x} - m_i \rVert^2$$


• Hence, the transfer is advantageous if the decrease in Ji is larger than the increase in Jj:
  $$\frac{n_i}{n_i - 1} \lVert \hat{x} - m_i \rVert^2 > \frac{n_j}{n_j + 1} \lVert \hat{x} - m_j \rVert^2$$
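A hedged sketch of this single-sample update rule (assuming Euclidean distance and that cluster means and sizes are kept up to date; the helper name try_move and the array layout are illustrative, not from the text):

```python
import numpy as np

def try_move(x_hat, i, j, means, counts):
    """Move sample x_hat from cluster i to cluster j if the decrease in Ji
    exceeds the increase in Jj; update means and counts incrementally.
    Returns True if the move was made."""
    if i == j or counts[i] <= 1:
        return False
    decrease = counts[i] / (counts[i] - 1) * np.sum((x_hat - means[i]) ** 2)
    increase = counts[j] / (counts[j] + 1) * np.sum((x_hat - means[j]) ** 2)
    if decrease <= increase:
        return False
    # incremental mean updates: remove x_hat from cluster i, add it to cluster j
    means[i] = (counts[i] * means[i] - x_hat) / (counts[i] - 1)
    means[j] = (counts[j] * means[j] + x_hat) / (counts[j] + 1)
    counts[i] -= 1
    counts[j] += 1
    return True
```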


• Algorithm 3 is the sequential version of the k-means algorithm
  • Algorithm 3 updates each time a sample is reclassified
  • k-means waits until all n samples have been reclassified before updating

• Algorithm 3 can get trapped in local minima
  • the result depends on the order of the samples
  • However, it is at least a stepwise optimal procedure, and it can easily be modified to apply to problems in which samples are acquired sequentially and clustering must be done on-line.

• Algorithm 2 is the fuzzy k-means clustering.


Selecting the Number of Clusters (Elbow Method)
[figure: criterion value (e.g., JSSE) plotted against the number of clusters k; the "elbow" of the curve suggests a good k]
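Since this slide is figure-only in the extraction, here is a brief sketch of how such an elbow curve could be produced, reusing the kmeans() and sse() sketches from earlier in these notes (illustrative only):

```python
import numpy as np
import matplotlib.pyplot as plt

# assumes the kmeans() and sse() functions sketched earlier in these notes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.4, size=(60, 2))
               for loc in ([0, 0], [4, 0], [2, 3])])   # 3 "natural" clusters

ks = range(1, 9)
errors = []
for k in ks:
    labels, _ = kmeans(X, k)
    errors.append(sse(X, labels))       # JSSE decreases as k grows

plt.plot(list(ks), errors, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("J_SSE")
plt.title("Elbow method: look for the bend in the curve")
plt.show()
```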

Hierarchical Clustering
• Many times, clusters are not disjoint; a cluster may have subclusters, which in turn have sub-subclusters, etc.
• Consider a sequence of partitions of the n samples into c clusters
  • The first is a partition into n clusters, each containing exactly one sample
  • The second is a partition into n-1 clusters, the third into n-2, and so on, until the n-th, in which there is only one cluster containing all of the samples
  • At level k in the sequence, c = n-k+1
• Given any two samples x and x', they will be grouped together at some level, and once they are grouped at level k, they remain grouped for all higher levels
• Hierarchical clustering → a tree representation called a dendrogram


• Are the groupings natural or forced? Check the similarity values
  • Evenly distributed similarity → no justification for grouping

• Another representation is based on sets, e.g., Venn diagrams


• Hierarchical clustering can be divided into agglomerative and divisive approaches.

• Agglomerative (bottom up, clumping): start with n singleton clusters and form the sequence by merging clusters

• Divisive (top down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters


Agglomerative Hierarchical Clustering

• The procedure terminates when the specified number of clusters has been obtained, and returns the clusters as sets of points, rather than a mean or a representative vector for each cluster
• At any level, the distance between the nearest clusters can provide the dissimilarity value for that level
• To find the nearest clusters, one can use
  $$d_{min}(D_i, D_j) = \min_{x \in D_i,\, x' \in D_j} \lVert x - x' \rVert$$
  $$d_{max}(D_i, D_j) = \max_{x \in D_i,\, x' \in D_j} \lVert x - x' \rVert$$
  $$d_{avg}(D_i, D_j) = \frac{1}{n_i n_j} \sum_{x \in D_i} \sum_{x' \in D_j} \lVert x - x' \rVert$$
  $$d_{mean}(D_i, D_j) = \lVert m_i - m_j \rVert$$
  which behave quite similarly if the clusters are hyperspherical and well separated.
• The computational complexity is O(cn^2 d^2), n >> c
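The four cluster-to-cluster distances are easy to state in code; a minimal NumPy sketch (illustrative, assuming Di and Dj are arrays of samples from two clusters):

```python
import numpy as np

def pairwise(Di, Dj):
    """All Euclidean distances between samples of cluster Di and cluster Dj."""
    return np.linalg.norm(Di[:, None, :] - Dj[None, :, :], axis=2)

def d_min(Di, Dj):  return pairwise(Di, Dj).min()    # used by single linkage
def d_max(Di, Dj):  return pairwise(Di, Dj).max()    # used by complete linkage
def d_avg(Di, Dj):  return pairwise(Di, Dj).mean()   # used by average linkage
def d_mean(Di, Dj): return np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))
```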
Nearest-Neighbor Algorithm (Single Linkage)
• dmin is used
• The use of dmin as the distance measure in agglomerative clustering:
  [figure: example of single-linkage merging]


Farthest-Neighbor Algorithm (Complete Linkage)
• dmax is used
• When two clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters

• All procedures involving minima or maxima are sensitive to outliers. The use of dmean or davg is a natural compromise.
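For completeness, a small usage sketch with SciPy's hierarchical clustering routines, which implement single, complete, and average linkage directly (library usage shown for illustration; it is not part of the lecture):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2))
               for loc in ([0, 0], [3, 3], [0, 3])])

# 'single' corresponds to d_min, 'complete' to d_max, 'average' to d_avg
Z = linkage(X, method="complete")

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
dendrogram(Z, no_labels=True)                     # the tree (dendrogram) view
plt.show()
```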
Applications of Clustering
▪ Image segmentation
  ▪ Find interesting "objects" in images to focus attention on

[figure from: Image Segmentation by Nested Cuts, O. Veksler, CVPR 2000]


Applications of Clustering
▪ Image Database Organization
▪ for efficient search
Applications of Clustering
▪ Data Mining
▪ Technology watch
  ▪ The Derwent database contains all patents filed worldwide in the last 10 years
  ▪ Searching by keywords leads to thousands of documents
  ▪ Find clusters in the database to see if there are any emerging technologies and what the competition is up to
▪ Marketing
▪ Customer database
▪ Find clusters of customers and tailor marketing
schemes to them
Applications of Clustering
▪ Profiling Web Users
▪ Use web access logs to generate a feature vector for
each user
▪ Cluster users based on their feature vectors
▪ Identify common goals for users
▪ Shopping
▪ Job Seekers
▪ Product Seekers
▪ Tutorials Seekers
▪ Can use clustering results to improve web content and design
Summary
▪ Clustering (nonparametric unsupervised learning)
is useful for discovering inherent structure in data
▪ Clustering is immensely useful in different fields
▪ Clustering comes naturally to humans (in up to 3
dimensions), but not so to computers
▪ It is very easy to design a clustering algorithm, but it is very hard to say whether it does anything good
▪ General-purpose clustering does not exist; for best results, clustering should be tuned to the application at hand
