Clustering
Clustering is the task of dividing the data points into a number of groups such that points in the same group are more similar to each other than to points in other groups. In other words, it is the grouping of objects on the basis of their similarity and dissimilarity.
Why Clustering?
Clustering is important because it reveals the intrinsic grouping in unlabeled data. There is no single criterion for a good clustering; it depends on the user and on which criteria satisfy their need. For instance, we could be interested in finding representatives of homogeneous groups (data reduction), finding "natural clusters" and describing their unknown properties ("natural" data types), finding useful and suitable groupings ("useful" data classes), or finding unusual data objects (outlier detection). Every clustering algorithm must make some assumptions about what constitutes the similarity of points, and different assumptions yield different, equally valid clusterings.
Clustering Methods:
Density-Based Methods: These methods treat clusters as dense regions of the space, separated from regions of lower density. They offer good accuracy and the ability to merge two clusters. Examples: DBSCAN (Density-Based Spatial Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the Clustering Structure), etc.
Hierarchical Methods: The clusters formed by these methods have a tree-like structure based on a hierarchy: new clusters are formed from previously formed ones. They fall into two categories:
Agglomerative (bottom-up approach)
Divisive (top-down approach)
Partitioning Methods: These methods partition the objects into k clusters, with each partition forming one cluster. They optimize an objective criterion, a similarity function in which distance is the major parameter. Examples: K-means, CLARANS (Clustering Large Applications based upon Randomized Search), etc.
Grid-based Methods: In these methods, the data space is divided into a finite number of cells that form a grid-like structure. All clustering operations performed on these grids are fast and independent of the number of data objects. A short sketch contrasting these families follows.
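As a rough illustration of these families, the sketch below runs a partitioning method (K-means), a density-based method (DBSCAN), and a bottom-up hierarchical method on the same toy data. It assumes Python with NumPy and scikit-learn installed; the data and the parameter values (eps, min_samples) are made up for demonstration.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

# Toy 2-D data: two loose blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2))])

# Partitioning method: K-means with k = 2.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Density-based method: DBSCAN; eps and min_samples control what
# counts as a dense region, and the label -1 marks noise points.
db = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)

# Hierarchical method: agglomerative (bottom-up) clustering.
ag = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print(set(km), set(db), set(ag))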
Clustering Algorithms
K-means clustering algorithm
It is one of the simplest unsupervised learning algorithms for solving the clustering problem. The K-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean, which serves as the prototype of the cluster.
Step1: Choose the number of classes k and select k initial class centers (at random or by averaging the data).
Step2: Compute the distance between each point Xi and each class center.
Step3: For each point, find the class center at minimum distance.
Step4: Each point x is assigned to its nearest class based on the minimum distance.
Step5: For each class, recompute its center by finding the mean of the class according to the following equation:
Z_j(n+1) = (1/N_j) * Σ_{i=1..N_j} X_i

where Z_j(n+1) is the new mean of class j, N_j is the number of points in class j, and X_i are the points belonging to class j.
Step6: Compare the old class centers with the new centers. If there is no change in the centers, the algorithm stops; otherwise, repeat from Step 2.
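A minimal from-scratch sketch of these six steps, assuming Python with NumPy (the function name kmeans and its arguments are my own choices, not from the notes). Ties in distance are broken toward the first center here, whereas the worked example below breaks its one tie toward the second; on this data both choices converge to the same final centers.

import numpy as np

def kmeans(points, k, centers, max_iter=100):
    # K-means for 1-D data, following Steps 1-6 above.
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # Steps 2-4: distance from every point to every center,
        # then assign each point to its nearest class.
        dists = np.abs(points[:, None] - centers[None, :])
        labels = dists.argmin(axis=1)
        # Step 5: recompute each center as the mean of its class
        # (assumes no class becomes empty, which holds here).
        new_centers = np.array([points[labels == j].mean()
                                for j in range(k)])
        # Step 6: stop when the centers no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels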
Example:
Suppose we want to group the visitors to a website using just their age (a one-dimensional space):
n = 19
15, 15, 16, 19, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65
Initial centers (chosen at random or by averaging):
k = 2, c1 = 16, c2 = 22
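Before tracing the iterations by hand, the first pass can be checked with a few lines of Python/NumPy (a sketch using the same tie-breaking rule as the tables below, i.e. ties go to cluster 2):

import numpy as np

ages = np.array([15, 15, 16, 19, 19, 20, 20, 21, 22, 28,
                 35, 40, 41, 42, 43, 44, 60, 61, 65])
c1, c2 = 16.0, 22.0
d1, d2 = np.abs(ages - c1), np.abs(ages - c2)
nearest = np.where(d1 < d2, 1, 2)           # ties go to cluster 2
print(round(ages[nearest == 1].mean(), 2))  # 15.33
print(round(ages[nearest == 2].mean(), 2))  # 36.25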
Iteration 1 (c1 = 16, c2 = 22):

xi    Distance to c1    Distance to c2    Nearest cluster
15          1                 7                 1
15          1                 7                 1
16          0                 6                 1
19          3                 3                 2
19          3                 3                 2
20          4                 2                 2
20          4                 2                 2
21          5                 1                 2
22          6                 0                 2
28         12                 6                 2
35         19                13                 2
40         24                18                 2
41         25                19                 2
42         26                20                 2
43         27                21                 2
44         28                22                 2
60         44                38                 2
61         45                39                 2
65         49                43                 2

The points at age 19 are equidistant from both centers; the tie is broken toward cluster 2. New centers (Z): c1 = mean(15, 15, 16) = 15.33; c2 = mean of the remaining 16 points = 36.25.
Iteration 2 (c1 = 15.33, c2 = 36.25):

xi    Distance to c1    Distance to c2    Nearest cluster
15         0.33             21.25               1
15         0.33             21.25               1
16         0.67             20.25               1
19         3.67             17.25               1
19         3.67             17.25               1
20         4.67             16.25               1
20         4.67             16.25               1
21         5.67             15.25               1
22         6.67             14.25               1
28        12.67              8.25               2
35        19.67              1.25               2
40        24.67              3.75               2
41        25.67              4.75               2
42        26.67              5.75               2
43        27.67              6.75               2
44        28.67              7.75               2
60        44.67             23.75               2
61        45.67             24.75               2
65        49.67             28.75               2

New centers (Z): c1 = mean(15, ..., 22) = 18.56; c2 = mean(28, ..., 65) = 45.9.

Iteration 3 (c1 = 18.56, c2 = 45.9):

xi    Distance to c1    Distance to c2    Nearest cluster
15         3.56             30.9                1
15         3.56             30.9                1
16         2.56             29.9                1
19         0.44             26.9                1
19         0.44             26.9                1
20         1.44             25.9                1
20         1.44             25.9                1
21         2.44             24.9                1
22         3.44             23.9                1
28         9.44             17.9                1
35        16.44             10.9                2
40        21.44              5.9                2
41        22.44              4.9                2
42        23.44              3.9                2
43        24.44              2.9                2
44        25.44              1.9                2
60        41.44             14.1                2
61        42.44             15.1                2
65        46.44             19.1                2

New centers (Z): c1 = mean(15, ..., 28) = 19.5; c2 = mean(35, ..., 65) = 47.89.
Iteration 4 (c1 = 19.5, c2 = 47.89):

xi    Distance to c1    Distance to c2    Nearest cluster
15         4.50             32.89               1
15         4.50             32.89               1
16         3.50             31.89               1
19         0.50             28.89               1
19         0.50             28.89               1
20         0.50             27.89               1
20         0.50             27.89               1
21         1.50             26.89               1
22         2.50             25.89               1
28         8.50             19.89               1
35        15.50             12.89               2
40        20.50              7.89               2
41        21.50              6.89               2
42        22.50              5.89               2
43        23.50              4.89               2
44        24.50              3.89               2
60        40.50             12.11               2
61        41.50             13.11               2
65        45.50             17.11               2

New centers (Z): c1 = 19.50, c2 = 47.89. The centers are unchanged, so by Step 6 the algorithm stops. The final clusters are {15, 15, 16, 19, 19, 20, 20, 21, 22, 28} and {35, 40, 41, 42, 43, 44, 60, 61, 65}.
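Running the kmeans sketch from earlier on the same ages array (reusing the hypothetical helper and array defined above) reproduces these converged centers:

centers, labels = kmeans(ages, k=2, centers=[16.0, 22.0])
print(np.round(centers, 2))   # [19.5  47.89]
print(np.bincount(labels))    # [10  9] points per cluster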
Homework:
Suppose you have the following data