Clustering - Part 1
Contents
4. Clustering
Introduction to Unsupervised Learning
Introduction to Clustering
Clustering Applications
Partitioning-based Clustering
Hierarchical-based Clustering
Density-based Clustering
Evaluation Methods for Clustering
Types of ML
Supervised and Unsupervised
Cluster Analysis
Cluster: a collection of data objects.
Clustering: the process of partitioning a set of data objects (or observations) into subsets.
Each subset is a cluster, such that:
objects in a cluster are similar to one another, and
dissimilar to objects in other clusters.
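Similarity here is defined through a distance measure; for numeric attributes, Euclidean distance is the usual choice. A minimal sketch (the function name is ours):

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# A small distance means two objects are similar (candidates for the same
# cluster); a large distance means they are dissimilar.
print(euclidean((0.0, 0.0), (3.0, 4.0)))  # → 5.0
```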
Applications of Clustering
Requirements for cluster analysis
Scalability: clustering on only a sample of a given large data set may lead to biased results; therefore, highly scalable clustering algorithms are needed.
Ability to deal with different types of attributes: a clustering algorithm should be able to cluster numeric, nominal, ordinal, and other types of data.
Ability to deal with noisy data: clustering algorithms can be sensitive to noise (outliers, missing values, unknown or erroneous data) and may produce poor-quality clusters; therefore, we need clustering methods that are robust to noise.
Interpretability: the clustering results should be interpretable, comprehensible, and usable.
High dimensionality: the clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
Categories of Clustering Methods
Clustering Methods: Partitioning
Clustering Methods: K-Means
Steps:
1. Arbitrarily choose k objects from D as the initial cluster centroids.
2. For each object in D:
compute the distance between the current object and each of the k cluster centroids;
assign the current object to the cluster whose centroid is closest.
3. Compute the mean of each cluster (the "cluster center"); these become the new cluster centroids.
4. Repeat steps 2-3 until the convergence criterion is satisfied.
5. Stop.
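The steps above can be sketched in Python (a minimal illustration; the `init` parameter for fixing the initial centroids is our addition, for reproducibility):

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(D, k, init=None, max_iter=100, seed=0):
    # Step 1: arbitrarily choose k objects from D as the initial centroids
    # (or take them from `init` for a reproducible run).
    centroids = list(init) if init is not None else random.Random(seed).sample(D, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster whose centroid is closest.
        clusters = [[] for _ in range(k)]
        for obj in D:
            i = min(range(k), key=lambda j: euclidean(obj, centroids[j]))
            clusters[i].append(obj)
        # Step 3: the mean of each cluster becomes the new centroid
        # (an empty cluster keeps its old centroid).
        new = [tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        # Step 4: repeat steps 2-3 until the centroids stop changing.
        if new == centroids:
            break  # Step 5: stop.
        centroids = new
    return centroids, clusters
```

For example, on two well-separated groups, `k_means([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2, init=[(0, 0), (10, 10)])` converges to centroids near (0.33, 0.33) and (10.33, 10.33).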
k-Means Algorithm
Note:
1) Objects are defined in terms of a set of attributes A = {A1, A2, ..., Am}, where each Ai is of continuous data type.

Table 1: 16 objects with two attributes A1 and A2.

A1    A2
6.8   12.6
0.8   9.8
1.2   11.6
2.8   9.6
3.8   9.9
4.4   6.5
4.8   1.1
6.0   19.9
6.2   18.5
7.6   17.4
7.8   12.2
6.6   7.7
8.2   4.5
8.4   6.9
9.0   3.4
9.6   11.1

Fig 1: Plot of the data in Table 1 (A2 vs. A1).
k-Means Algorithm
Step 1: We arbitrarily choose three objects as the three initial cluster centers.
(Plot of the data in Table 1, with the three chosen centers marked.)
k-Means Algorithm
Step 2: Each object is assigned to a cluster based on the cluster center to which it is the nearest.
Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3, respectively.
(Plot of the data in Table 1, with the three cluster centers.)
k-Means Algorithm
Step 2: Each object is assigned to a cluster based on the cluster center to which it is the nearest.
Let d1, d2 and d3 denote the distance from an object to c1, c2 and c3, respectively.

Initial centroids (chosen randomly):

Centroid   A1    A2
c1         3.8   9.9
c2         7.8   12.2
c3         6.2   18.5
k-Means Algorithm
Step 2: Each object is assigned to a cluster based on the cluster center to which it is the nearest.
Let d1, d2 and d3 denote the Euclidean distance from an object to c1, c2 and c3, respectively.
The distance calculations are shown in the table below.

A1    A2     d1    d2    d3
6.8   12.6   4.0   1.1   5.9
0.8   9.8    3.0   7.4   10.2
1.2   11.6   3.1   6.6   8.5
2.8   9.6    1.0   5.6   9.5
3.8   9.9    0.0   4.6   8.9
4.4   6.5    3.5   6.6   12.1
4.8   1.1    8.9   11.5  17.5
6.0   19.9   10.2  7.9   1.4
6.2   18.5   8.9   6.5   0.0
7.6   17.4   8.4   5.2   1.8
7.8   12.2   4.6   0.0   6.5
6.6   7.7    3.6   4.7   10.8
8.2   4.5    7.0   7.7   14.1
8.4   6.9    5.5   5.3   11.8
9.0   3.4    8.3   8.9   15.4
9.6   11.1   5.9   2.1   8.1
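Any entry of this table can be reproduced directly; for example, the distances for the first object (6.8, 12.6), a quick check using Python's `math.dist`:

```python
import math

# Initial centroids c1, c2, c3 from the table above.
centroids = {"d1": (3.8, 9.9), "d2": (7.8, 12.2), "d3": (6.2, 18.5)}
obj = (6.8, 12.6)  # first object of Table 1

for name, c in centroids.items():
    print(name, "=", round(math.dist(obj, c), 1))
# → d1 = 4.0, d2 = 1.1, d3 = 5.9, matching the first row of the table
```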
k-Means Algorithm
Step 2: Each object is assigned to a cluster based on the cluster center to which it is the nearest.
Assignment of each object to its centroid is shown in the right-most column, and the resulting clustering is shown in the figure below.

A1    A2     d1    d2    d3    cluster
6.8   12.6   4.0   1.1   5.9   2
0.8   9.8    3.0   7.4   10.2  1
1.2   11.6   3.1   6.6   8.5   1
2.8   9.6    1.0   5.6   9.5   1
3.8   9.9    0.0   4.6   8.9   1
4.4   6.5    3.5   6.6   12.1  1
4.8   1.1    8.9   11.5  17.5  1
6.0   19.9   10.2  7.9   1.4   3
6.2   18.5   8.9   6.5   0.0   3
7.6   17.4   8.4   5.2   1.8   3
7.8   12.2   4.6   0.0   6.5   2
6.6   7.7    3.6   4.7   10.8  1
8.2   4.5    7.0   7.7   14.1  1
8.4   6.9    5.5   5.3   11.8  2
9.0   3.4    8.3   8.9   15.4  1
9.6   11.1   5.9   2.1   8.1   2
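The whole cluster column can be reproduced by taking, for each object, the index of its nearest centroid (a sketch over the Table 1 data and the initial centroids):

```python
import math

# The 16 objects of Table 1, in table order.
data = [(6.8, 12.6), (0.8, 9.8), (1.2, 11.6), (2.8, 9.6), (3.8, 9.9),
        (4.4, 6.5), (4.8, 1.1), (6.0, 19.9), (6.2, 18.5), (7.6, 17.4),
        (7.8, 12.2), (6.6, 7.7), (8.2, 4.5), (8.4, 6.9), (9.0, 3.4),
        (9.6, 11.1)]
centroids = [(3.8, 9.9), (7.8, 12.2), (6.2, 18.5)]  # c1, c2, c3

def nearest(obj):
    """1-based index of the closest centroid."""
    dists = [math.dist(obj, c) for c in centroids]
    return dists.index(min(dists)) + 1

labels = [nearest(obj) for obj in data]
print(labels)  # → [2, 1, 1, 1, 1, 1, 1, 3, 3, 3, 2, 1, 1, 2, 1, 2]
```

The printed labels match the cluster column of the table row by row.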
k-Means Algorithm
Step 3: Update the cluster centers.
The mean value of each cluster is recalculated based on the current objects in the cluster.
Using the new cluster centers, the objects are redistributed to the clusters based on which cluster center is the nearest.
k-Means Algorithm
Step 3 (continued): The recalculated cluster centers are:

New Centroid   A1    A2
c1             4.6   7.1
c2             8.2   10.7
c3             6.6   18.6
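These values follow directly from averaging each cluster's members under the Step 2 assignment (a quick check; the exact means are rounded to one decimal on the slide):

```python
# (A1, A2, cluster) triples from the Step 2 assignment table.
rows = [(6.8, 12.6, 2), (0.8, 9.8, 1), (1.2, 11.6, 1), (2.8, 9.6, 1),
        (3.8, 9.9, 1), (4.4, 6.5, 1), (4.8, 1.1, 1), (6.0, 19.9, 3),
        (6.2, 18.5, 3), (7.6, 17.4, 3), (7.8, 12.2, 2), (6.6, 7.7, 1),
        (8.2, 4.5, 1), (8.4, 6.9, 2), (9.0, 3.4, 1), (9.6, 11.1, 2)]

# New centroid of each cluster = component-wise mean of its members.
new_centroids = {}
for k in (1, 2, 3):
    members = [(a1, a2) for a1, a2, c in rows if c == k]
    new_centroids[k] = tuple(sum(v) / len(members) for v in zip(*members))

for k, (a1, a2) in sorted(new_centroids.items()):
    print(f"c{k}: ({a1:.2f}, {a2:.2f})")
# c1 ≈ (4.62, 7.12), c2 ≈ (8.15, 10.70), c3 ≈ (6.60, 18.60),
# i.e. (4.6, 7.1), (8.2, 10.7), (6.6, 18.6) after rounding to one decimal.
```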
k-Means Algorithm
Step 3 (continued): We next reassign the 16 objects to the three clusters by determining which centroid is closest to each one. This gives the revised set of clusters shown in Fig 4.
Note that point p moves from cluster C2 to cluster C1.
k-Means Algorithm
• The newly obtained centroids after second iteration are given in the table below.
Note that the centroid c3 remains unchanged, where c2 and c1 changed a little.
• With respect to newly obtained cluster centres, 16 points are reassigned again.
These are the same clusters as before. Hence, their centroids also remain
unchanged.
• Considering this as the termination criteria, the k-means algorithm stops here.
Hence, the final cluster in Fig 5 is same as Fig 4.
Cluster centres after second iteration Fig 5: Cluster after Second iteration
32
Thank You