Clustering
1 k-means
1. Randomly pick k centroids.
2. Repeat until the cluster assignments stop changing:
(a) For each example, find the nearest centroid and assign the example to its cluster.
(b) For each cluster, determine the new centroid: the average of all points in the
cluster.
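A minimal sketch of these steps in Python (the helper names, the Euclidean metric, and the convergence test are choices made for this sketch, not prescribed above):

import random

def dist(a, b):
    # Euclidean distance between two points of equal dimension.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean(cluster):
    # Component-wise average of a non-empty list of points.
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def kmeans(points, k, max_iterations=100):
    # 1. Randomly pick k centroids from among the data points.
    centroids = random.sample(points, k)
    for _ in range(max_iterations):
        # (a) Assign each example to the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(centroids[i], p))
            clusters[nearest].append(p)
        # (b) Move each centroid to the average of its cluster's points;
        # keep the old centroid if the cluster is empty.
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break  # assignments have stabilised
        centroids = new_centroids
    return centroids, clusters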
2 Hierarchical agglomerative clustering
1. Start with each point in its own cluster.
2. Identify the two nearest clusters according to an accepted metric and join them. When
determining the distance between two clusters, we can, for example, compare their
centroids.
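A sketch of the merge loop, reusing dist and mean from the k-means sketch and comparing centroids as suggested above (the target_k parameter and the returned merge list are this sketch's own additions):

def agglomerative(points, target_k=1):
    # 1. Start with each point in its own cluster.
    clusters = [[p] for p in points]
    merges = []  # record of joins, enough to draw a dendrogram
    while len(clusters) > target_k:
        # 2. Find the two nearest clusters, measured centroid to centroid.
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: dist(mean(clusters[ij[0]]),
                                              mean(clusters[ij[1]])))
        merges.append((clusters[i], clusters[j]))
        # Join them and drop the absorbed cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges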
Questions
Question 1.
Given the following data points:
• A(1, 1),
• B(3, 5),
• C(4, 3),
• D(4, 5).
Starting with centroids c1 (2, 4) and c2 (5, 4), cluster them using k-means.
[An empty coordinate grid, both axes from 0 to 8, is provided for sketching the clusters.]
Question 2.
Cluster the following data points using k-means, starting with the centroids c1 (0, 1, 0, 1),
c2 (0, 2, 1, 0):
• A(0, 1, 0, 1),
• B(0, 0, 2, 0),
• D(0, 2, 1, 0).
Question 3.
Cluster the following data points using hierarchical agglomerative clustering. Draw the
dendrogram.
• A(0, 0),
• B(1, 2),
• C(2, 1),
• D(5, 1),
• E(6, 0),
• F(6, 3).
[Plot of the points A–F on a grid with both axes from 0 to 6.]
Mini-project: k-means
Implement the k-means algorithm. Cluster the iris dataset in iris.data (during clustering,
ignore the decision attribute).
The program should additionally:
• After every iteration: print the sum (or average) of the distances of all points from
their centroids. This value should decrease with every iteration.
• Optional: print measures of cluster homogeneity, e.g. the percentage of each iris class
in a cluster, or its entropy.
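A possible skeleton for the data loading and reporting, reusing the kmeans sketch above; it assumes the usual UCI layout of iris.data (four numeric attributes per row, class label last):

import csv
from collections import Counter

def load_iris(path="iris.data"):
    with open(path) as f:
        rows = [r for r in csv.reader(f) if r]
    # Ignore the decision attribute during clustering; keep it for reporting.
    points = [tuple(float(x) for x in r[:4]) for r in rows]
    labels = [r[4] for r in rows]
    return points, labels

points, labels = load_iris()
centroids, clusters = kmeans(points, k=3)

# Sum of the distances of all points from their centroids; printing this
# inside the kmeans loop shows it decreasing with every iteration.
total = sum(dist(centroids[i], p)
            for i, c in enumerate(clusters) for p in c)
print(f"total distance to centroids: {total:.4f}")

# Optional homogeneity report: class percentages within each cluster
# (the point-to-label dict assumes duplicate rows share a class, which
# holds for the iris data).
label_of = dict(zip(points, labels))
for i, c in enumerate(clusters):
    counts = Counter(label_of[p] for p in c)
    shares = {cls: round(100 * n / len(c)) for cls, n in counts.items()}
    print(f"cluster {i}: {shares}")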