Clustering in Machine Learning
CLUSTERING
Learning objectives
After careful study of this chapter, we should be able to understand:
The clustering technique
K-Means algorithms
Hierarchical clustering algorithms
5.1 INTRODUCTION
In this chapter we introduce a simple unsupervised learning technique called clustering. A
supervised learning algorithm needs training data {(x1, y1), (x2, y2), …, (xn, yn)} together with
the class labels {y1, y2, …, yn}, whereas an unsupervised learning algorithm does not require
class labels.
An unsupervised machine learning algorithm, e.g. a clustering algorithm, needs only the training
data {x1, x2, …, xn} (without the class labels) to reveal the structure of the data.
K-Means is the most popular non-overlapping (partitional) clustering algorithm, whereas
hierarchical clustering produces nested, and hence overlapping, clusters.
Each cluster thus describes, in terms of the data collected, the class to which its members
belong. Items in a cluster are similar in some way to each other and dissimilar to those in
other clusters. Cluster analysis divides data into groups (clusters) that are meaningful and
useful. It is the study of techniques for automatically finding hidden structure when there
is no prior knowledge (hence unsupervised) about which elements belong to which groups
(clusters).
There are many applications of cluster analysis, for example: (a) business: in customer
segmentation, customers are grouped into small segments for marketing; (b) medicine: to
identify different groups of diseases; (c) information retrieval: retrieval of similar types of
information from huge databases; (d) document clustering: searching for and grouping similar
types of text document; (e) social networking: to identify groups that interact more with each
other; (f) software engineering; (g) recommendation systems; (h) k-anonymity: for data hiding;
(i) bioinformatics; (j) astronomy: to discover the hidden structure of galaxies; (k) engineering
fields such as civil, mechanical, chemical, etc.
Figure 5.1: Different cluster formations for the same training data
In practice, little prior information is often available about the structure of the data, and the
decision maker must make as few assumptions about the data as possible. A given training
data set can be divided into different clusters (Figure 5.1); there are various techniques for
choosing the best suited value of k, the number of clusters, which we will discuss later.
For example, let A = {1, 3, 6} and B = {2, 3, 7}; the Euclidean distance between A and B is:
$\mathrm{dist}(A, B) = \sqrt{(1-2)^2 + (3-3)^2 + (6-7)^2} = 1.4142$
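This value can be checked with a small illustrative snippet, assuming NumPy and SciPy are available:

import numpy as np
from scipy.spatial import distance

A, B = np.array([1, 3, 6]), np.array([2, 3, 7])
print(distance.euclidean(A, B))        # 1.4142...
print(np.sqrt(np.sum((A - B)**2)))     # the same value computed directly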
Advantages of the K-Means algorithm
The K-Means algorithm is simple, scalable, robust and widely used.
The time complexity of the K-Means algorithm is O(nkt), where n is the number of data objects,
k is the number of clusters and t is the number of iterations.
5.3.1 Cost function
The cost function is $J(C, c) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{dist}(x_i, c_j)$, where n is the total number of training data,
$x_i$, $1 \le i \le n$, is the ith training data, $c_j$, $1 \le j \le K$, is the cluster center nearest to the ith training
data, and K is the total number of clusters. The objective is to minimize the cost value.
Example: Let the data be D = {(1,1), (1,2), (2,2), (1,3), (10,10), (10,11), (11,12), (20,20),
(20,21), (21,24)}, and let the three cluster centers be c1 = (1.25, 2), c2 = (10.33, 11) and
c3 = (20.33, 21.67). The cost value is calculated as follows:

      xi         nearest center cj     dist(xi, cj)
X1    (1,1)      c1 (1.25, 2)          1.03
X2    (1,2)      c1 (1.25, 2)          0.25
X3    (2,2)      c1 (1.25, 2)          0.75
X4    (1,3)      c1 (1.25, 2)          1.03
X5    (10,10)    c2 (10.33, 11)        1.05
X6    (10,11)    c2 (10.33, 11)        0.33
X7    (11,12)    c2 (10.33, 11)        1.20
X8    (20,20)    c3 (20.33, 21.67)     1.70
X9    (20,21)    c3 (20.33, 21.67)     0.75
X10   (21,24)    c3 (20.33, 21.67)     2.42
Table: The cost value calculation

The cost value is therefore J = (1.03 + 0.25 + 0.75 + 1.03 + 1.05 + 0.33 + 1.20 + 1.70 + 0.75 + 2.42)/10 ≈ 1.05.
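The table and the cost value can be reproduced with a few lines of NumPy (an illustrative sketch, not the book's listing):

import numpy as np

X = np.array([[1,1],[1,2],[2,2],[1,3],[10,10],[10,11],[11,12],[20,20],[20,21],[21,24]], dtype=float)
centers = np.array([[1.25, 2], [10.33, 11], [20.33, 21.67]])

# distance from every point to every center, keeping only the nearest one
d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
nearest = d.min(axis=1)
print(np.round(nearest, 2))    # 1.03 0.25 0.75 1.03 1.05 0.33 1.2  1.7  0.75 2.42
print(nearest.mean())          # the cost J, about 1.05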
To see how K-Means arrives at these centers, first randomly select three data points (from the
training data) as initial cluster centers. Assume the initial cluster centers are c1 = (1, 2),
c2 = (11, 12) and c3 = (20, 20).
Next, assign each data point to its nearest cluster center. For example, to assign a cluster to
the data point d1 = (1, 1), we compute its Euclidean (or Manhattan) distance from all the
(initial) cluster centers c1 = (1, 2), c2 = (11, 12) and c3 = (20, 20).
      xi         c1 (1, 2)   c2 (11, 12)   c3 (20, 20)   assigned cluster
X1    (1,1)      1.0         14.87         26.87         K1
X2    (1,2)      0.0         14.14         26.17         K1
X3    (2,2)      1.0         13.45         25.46         K1
X4    (1,3)      1.0         13.45         25.50         K1
X5    (10,10)    12.04       2.24          14.14         K2
X6    (10,11)    12.73       1.41          13.45         K2
X7    (11,12)    14.14       0.0           12.04         K2
X8    (20,20)    26.17       12.04         0.0           K3
X9    (20,21)    26.87       12.73         1.0           K3
X10   (21,24)    29.73       15.62         4.12          K3
Table: K-Means first iteration (distances from the initial cluster centers)
Since dist(d1, c1) = 1 is the minimum distance, the data point (1, 1) is assigned to cluster K1
(whose center is c1). In this way we assign clusters to the rest of the data.
Data points (1, 1), (1, 2), (2, 2) and (1, 3) belong to cluster center c1 = (1, 2), forming cluster
K1. Data points (10, 10), (10, 11) and (11, 12) belong to cluster center c2 = (11, 12), forming
cluster K2. Data points (20, 20), (20, 21) and (21, 24) belong to cluster center c3 = (20, 20),
forming cluster K3.
Next, calculate new cluster centers from the clusters K1, K2 and K3. The new center of cluster
K1 is the mean of (1,1), (1,2), (2,2) and (1,3), i.e. (1.25, 2) (updated value of c1); the new
center of cluster K2 is the mean of (10,10), (10,11) and (11,12), i.e. (10.33, 11) (updated value
of c2); and the new center of cluster K3 is the mean of (20,20), (20,21) and (21,24), i.e.
(20.33, 21.67) (updated value of c3).
Next, reassign all the data points to the nearest of the updated cluster centers c1 = (1.25, 2),
c2 = (10.33, 11) and c3 = (20.33, 21.67).

      xi         c1 (1.25, 2)   c2 (10.33, 11)   c3 (20.33, 21.67)   assigned cluster
X1    (1,1)      1.03           13.68            28.30               K1
X2    (1,2)      0.25           12.96            27.58               K1
X3    (2,2)      0.75           12.26            26.89               K1
X4    (1,3)      1.03           12.29            26.87               K1
X5    (10,10)    11.86          1.05             15.59               K2
X6    (10,11)    12.55          0.33             14.85               K2
X7    (11,12)    13.97          1.20             13.44               K2
X8    (20,20)    25.99          13.21            1.70                K3
X9    (20,21)    26.69          13.91            0.75                K3
X10   (21,24)    29.56          16.82            2.42                K3
Table: K-Means second iteration (distances from the updated cluster centers)
Data points (1,1), (1,2), (2,2) and (1,3) again belong to cluster center c1 = (1.25, 2), forming
cluster K1. Data points (10,10), (10,11) and (11,12) belong to cluster center c2 = (10.33, 11),
forming cluster K2. Data points (20,20), (20,21) and (21,24) belong to cluster center
c3 = (20.33, 21.67), forming cluster K3.
Next, calculate the updated cluster centers from the clusters K1, K2 and K3. The new center of
K1 is the mean of (1,1), (1,2), (2,2) and (1,3), i.e. (1.25, 2); the new center of K2 is the mean
of (10,10), (10,11) and (11,12), i.e. (10.33, 11); and the new center of K3 is the mean of
(20,20), (20,21) and (21,24), i.e. (20.33, 21.67). The cluster centers are therefore unchanged.
The above process continues until the cluster centers do not change between successive
iterations, or until a fixed number of iterations (say 30) is reached. In this example the
cluster centers do not change any further, hence the final clustering is as follows:
Data points 1, 2, 3 and 4 are in the 1st cluster, data points 5, 6 and 7 are in the 3rd cluster,
and data points 8, 9 and 10 are in the 2nd cluster (after executing K-Means using Python, see
below; the cluster numbering depends on the random initialization).
#K-Means algorithm
import numpy as np
#from matplotlib import pyplot as plt
from scipy.spatial import distance
k = 3
#Training data [x, y]
X = np.array([[1,1],[1,2],[2,2],[1,3],[10,10],[10,11],[11,12],[20,20],[20,21],[21,24]])
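The listing above stops after loading the training data. A minimal sketch of one possible completion of the K-Means loop (my own illustration using only NumPy, not the book's original listing) is:

import numpy as np

X = np.array([[1,1],[1,2],[2,2],[1,3],[10,10],[10,11],[11,12],[20,20],[20,21],[21,24]], dtype=float)
k = 3
rng = np.random.default_rng(0)
centers = X[rng.choice(len(X), size=k, replace=False)]    # random initial centers

for _ in range(30):                                       # at most 30 iterations
    # assignment step: index of the nearest center for every data point
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # update step: each center becomes the mean of the points assigned to it
    new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])
    if np.allclose(new_centers, centers):                 # stop when the centers do not change
        break
    centers = new_centers

print("labels :", labels)
print("centers:\n", centers)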
Example: Let the two-dimensional input data (10 points) be D = {(1,1), (1,2), (2,2), (1,3),
(10,10), (10,11), (11,12), (20,20), (20,21), (21,24)}. The cost values for different numbers of
clusters k are calculated; since the initialization is random, the minimum cost value obtained
for each k is noted. Next, in the plot of the number of clusters versus the cost, an "elbow" is
seen at k = 3. Therefore, we conclude that the optimal number of clusters for the given data
is 3.
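A small sketch of this elbow procedure (my own illustration, assuming scikit-learn is available and using its inertia_ value, the sum of squared distances to the nearest center, as the cost):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[1,1],[1,2],[2,2],[1,3],[10,10],[10,11],[11,12],[20,20],[20,21],[21,24]])

costs = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    costs.append(km.inertia_)          # cost for this value of k

plt.plot(range(1, 7), costs, 'o-')
plt.xlabel('number of clusters k')
plt.ylabel('cost')
plt.show()                             # the bend (elbow) of the curve appears near k = 3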
There are two basic approaches for generating a hierarchical clustering. (a) Agglomerative
hierarchical clustering algorithms begin with each data object as an individual cluster. At
each step the two most similar clusters are merged, so after each merge the total number of
clusters decreases by one. These steps are repeated until the desired number of clusters is
obtained. (b) Divisive clustering algorithms start with one all-inclusive cluster and, at each
step, split a cluster until only singleton clusters of individual points remain. In this case, we
need to decide which cluster to split at each step and how to do the splitting.
Figure 5.3: Dendrogram and nested structure of the hierarchical clustering
Single link, complete link, group average link, centroid and medoid are the most well-known
agglomerative techniques. The basic agglomerative hierarchical clustering is presented in the
following algorithm (a code sketch follows the steps):
Step 1: compute the proximity (distance) matrix; initially, every data point is its own cluster
Step 2: merge the two closest clusters
Step 3: update the proximity matrix to reflect the proximity between the new cluster and the remaining clusters
Step 4: repeat Steps 2 and 3 until only one cluster (or the desired number of clusters) remains
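A minimal sketch of this procedure with single link on a tiny data set (my own illustration, not the book's listing; the proximities are simply re-evaluated on every pass instead of maintaining a matrix):

import numpy as np

X = [np.array(p, dtype=float) for p in [(1, 1), (1, 2), (10, 10), (10, 11), (20, 20)]]
clusters = [[i] for i in range(len(X))]           # Step 1: every point starts as its own cluster

def single_link(a, b):
    # smallest pairwise distance between two clusters (single link / MIN)
    return min(np.linalg.norm(X[i] - X[j]) for i in a for j in b)

while len(clusters) > 3:                          # repeat until the desired number of clusters
    # Step 2: find the pair of clusters with the smallest single-link distance
    pairs = [(p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))]
    p, q = min(pairs, key=lambda pq: single_link(clusters[pq[0]], clusters[pq[1]]))
    # Step 3: merge them
    clusters[p] = clusters[p] + clusters[q]
    del clusters[q]

print(clusters)                                   # e.g. [[0, 1], [2, 3], [4]]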
Single link (MIN): the proximity is the smallest distance between a data point in one cluster and a
data point in the other, i.e. $\mathrm{dis}(C_i, C_j) = \min(\mathrm{dis}(o_{il}, o_{jm}))$ [5.2], where
$o_{il} \in C_i$, $o_{jm} \in C_j$, $1 \le l \le n_{C_i}$, $1 \le m \le n_{C_j}$, and $n_{C_i}$, $n_{C_j}$
are the numbers of data points in the ith and jth clusters respectively. The single link technique is
good at handling non-elliptical shapes, but is sensitive to noise and outliers (figure 5.4.a).
Complete link (MAX): the proximity is the largest distance between a data point in one cluster and a
data point in the other cluster, i.e. $\mathrm{dis}(C_i, C_j) = \max(\mathrm{dis}(o_{il}, o_{jm}))$ [5.3], where
$o_{il} \in C_i$, $o_{jm} \in C_j$, $1 \le l \le n_{C_i}$, $1 \le m \le n_{C_j}$, and $n_{C_i}$, $n_{C_j}$
are the numbers of data points in the ith and jth clusters respectively. Complete link distance is less
susceptible to noise and outliers, but it can break large clusters and it favors globular shapes (figure 5.4.b).
Average link distance: the cluster proximity is defined to be the average pairwise distance of all
pairs of data points from the two clusters, i.e.
$\mathrm{dis}(C_i, C_j) = \frac{1}{n_{C_i} \times n_{C_j}} \sum_{l=1}^{n_{C_i}} \sum_{m=1}^{n_{C_j}} \mathrm{dis}(o_{il}, o_{jm})$ [5.4],
where $n_{C_i}$ and $n_{C_j}$ are the numbers of data points in the ith and jth clusters respectively (figure 5.4.c).
Centroid: if clusters have representative centroids, then the proximity is defined as the distance
between the centroids, i.e. $\mathrm{dis}(C_i, C_j) = \mathrm{dis}(R_i, R_j)$ [5.5], where $R_i$ and $R_j$
are the centroids of the clusters $C_i$ and $C_j$ respectively (figure 5.4.d).
Medoid: the proximity is the distance between the medoids of the clusters, i.e.
$\mathrm{dis}(C_i, C_j) = \mathrm{dis}(M_i, M_j)$ [5.6], where $M_i$ and $M_j$ are the medoids of the
clusters $C_i$ and $C_j$ respectively (figure 5.4.e).
Figure 5.4: Proximity between clusters
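For illustration, the different proximity measures can be compared on two small clusters (my own sketch, assuming SciPy's cdist is available):

import numpy as np
from scipy.spatial.distance import cdist

Ci = np.array([[1, 1], [1, 2], [2, 2]])        # cluster C_i
Cj = np.array([[10, 10], [10, 11]])            # cluster C_j

D = cdist(Ci, Cj)                              # all pairwise distances between the two clusters
print('single link   (MIN):', D.min())
print('complete link (MAX):', D.max())
print('average link       :', D.mean())
print('centroid distance  :', np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0)))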
Example 5.3: Let the data be D = {(1,1), (1,2), (2,2), (1,3), (10,10), (10,11), (11,12), (20,20),
(20,21), (21,24)}. Using Python, apply the hierarchical clustering algorithm with single link.
Solution:
#Hierarchical clustering
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
#Training data [x,y]
X = np.array([[1,1], [1,2], [2,2], [1,3], [10,10], [10,11], [11,12], [20,20], [20,21], [21,24]])
Z = linkage(X, method='single')   # single-link (MIN) agglomerative clustering
dendrogram(Z)                     # plot the dendrogram
plt.show()
# DBSCAN
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
# Load data in X
X = np.array([[1,2],[1,3],[1,4],[2,4],[2,5],[3,5],
              [3,6],[4,6],[5,5],[5,6],[6,4],[6,5],
              [6,6],[7,4],[7,5],[7,6],[4,2],[4,3],
              [5,1],[5,2],[6,1],[6,2],[7,1],[7,2],
              [8,1],[8,2],[9,1],[9,2],[9,3],[9,4],
              [9,5],[10,1],[10,2],[10,3],[10,4],[10,5]])
db = DBSCAN(eps=1, min_samples=2).fit(X)
print(db.labels_)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
print(core_samples_mask)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# The source listing omits the set-up of the plotting loop here; the next few lines
# are a minimal reconstruction consistent with the printed output below.
unique_labels = set(labels)
colors = ['y', 'b', 'g', 'r']
print(labels)
print(colors)
for k, col in zip(unique_labels, colors):
    class_member_mask = (labels == k)
    # core points of cluster k
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=6)
    # non-core (border) points of cluster k
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=6)
plt.show()
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
['y', 'b', 'g', 'r']
Let the data be X = {x1, x2, …, xn}, and assume each data point is drawn from a normal
distribution. The likelihood of the data is
$P(X \mid \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$.
Since the value of $P(X \mid \theta)$ is very small, we take the logarithm of both sides:
$\log P(X \mid \theta) = -\sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, n \log(2\pi) - n \log(\sigma)$.
In EM we select the values of $\mu$ and $\sigma$ for which $\log P(X \mid \theta)$ is maximum (this is
called the maximum likelihood principle).
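The log-likelihood formula can be checked numerically (an illustrative sketch with made-up one-dimensional data and an assumed unit variance):

import numpy as np

x = np.array([1.0, 2.0, 1.0, 3.0])            # made-up data
mu, sigma = x.mean(), 1.0                     # assumed parameters for the check
n = len(x)

logL = -np.sum((x - mu)**2) / (2 * sigma**2) - 0.5 * n * np.log(2 * np.pi) - n * np.log(sigma)
print(logL)
# the same value obtained by summing the individual Gaussian log-densities
print(np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)))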
Likelihood function: the probability of the observed data, viewed as a function of the parameters of
the distribution, is called the likelihood function; by maximizing it we can estimate the parameters
of the distribution.
EM clustering algorithm:
In the EM clustering algorithm the parameters are initially guessed, and then the probability that
each data point belongs to each distribution is calculated. Next, a new estimate of the parameters is
calculated. If the parameters do not change, the process stops; otherwise the parameters are
re-estimated [Figure 5.x].
Step 1: randomly select the initial parameters $(\mu_j, \sigma_j) \in \theta_j$, $1 \le j \le k$, and the initial weight value $w_j$
Step 2: while 1:
(Assignment or expectation step)
Step 3:   for i in range(n):
Step 4:     for j in range(k):
Step 5:       calculate Prob(jth distribution | $x_i, \theta_j$)
(Maximization step)
Step 6:   using Prob(jth distribution | $x_i, \theta_j$), find the new estimate of the parameters that maximizes the expected likelihood
Step 7:   if (parameters do not change):
Step 8:     break
Step 1: random selection of the initial set of parameters. Let the number of clusters be k = 2; the
parameters are $\theta_1 = \{\mu_1, \sigma_1\}$ and $\theta_2 = \{\mu_2, \sigma_2\}$, n is the number of
data points and w = 1/k. The probability of a data point under the jth distribution is
$\mathrm{prob}(x_i \mid \mu_j) = \frac{1}{\sqrt{2\pi}\,\sigma_j} e^{-\frac{(x_i - \mu_j)^2}{2\sigma_j^2}}$
(computed as a product over the features of $x_i$).
Example:
Let X = {[1, 2], [2, 3], [1, 3], [9, 5], [9, 6], [10, 5]} and k = 2, so w = 1/2 = 0.5. Let the initial
means be [2, 3] and [10, 5], and assume that σ = 0.471 for both distributions and both features.
Expectation step
For j = 1 (the first distribution, mean [2, 3]):
$\mathrm{prob}(x_1 \mid \mu_1) = \frac{1}{2\pi\sigma^2} e^{-\frac{(1-2)^2 + (2-3)^2}{2\sigma^2}} \approx 0.0079085$
For j = 2 (the second distribution, mean [10, 5]):
$\mathrm{prob}(x_1 \mid \mu_2) = \frac{1}{2\pi\sigma^2} e^{-\frac{(1-10)^2 + (2-5)^2}{2\sigma^2}} \approx 5.7546 \times 10^{-89}$
$\mathrm{prob}(1\text{st distribution} \mid x_1, \theta) = \frac{w \cdot \mathrm{prob}(x_1 \mid \mu_1)}{w \cdot \mathrm{prob}(x_1 \mid \mu_1) + w \cdot \mathrm{prob}(x_1 \mid \mu_2)} = \frac{0.5 \times 0.0079085}{0.5 \times 0.0079085 + 0.5 \times 5.7546 \times 10^{-89}} \approx 1$ $(= w_{11})$
$\mathrm{prob}(2\text{nd distribution} \mid x_1, \theta) = \frac{w \cdot \mathrm{prob}(x_1 \mid \mu_2)}{w \cdot \mathrm{prob}(x_1 \mid \mu_1) + w \cdot \mathrm{prob}(x_1 \mid \mu_2)} \approx 0$ $(= w_{12})$
Proceeding in the same way for the remaining data points gives the responsibility (weight) matrix,
one row per data point and one column per distribution:

      w_i1   w_i2
x1    1      0
x2    1      0
x3    1      0
x4    0      1
x5    0      1
x6    0      1
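These numbers can be verified with a few lines of NumPy (an illustrative check using the example's σ = 0.471 and initial means [2, 3] and [10, 5]):

import numpy as np

x1 = np.array([1.0, 2.0])
means = np.array([[2.0, 3.0], [10.0, 5.0]])   # the example's initial means
sigma, w = 0.471, 0.5

def prob(x, mu):
    # product of the per-feature Gaussian densities
    return np.prod(np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma))

p1, p2 = prob(x1, means[0]), prob(x1, means[1])
print(p1, p2)                                 # about 7.9e-03 and 5.8e-89
print(w * p1 / (w * p1 + w * p2))             # responsibility w11, effectively 1.0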
Maximization step
The new mean of the jth distribution is the responsibility-weighted mean of the data,
$\mu_j = \frac{\sum_{i=1}^{n} w_{ij} x_i}{\sum_{i=1}^{n} w_{ij}}$. With the responsibilities above, the
updated means are the means of {[1, 2], [2, 3], [1, 3]} and {[9, 5], [9, 6], [10, 5]}, i.e. [1.33, 2.67]
and [9.33, 5.33] respectively.
# EM clustering algorithm
import numpy as np

np.set_printoptions(suppress=True)

X = np.array([
    [1, 2],
    [2, 3],
    [1, 3],
    [9, 5],
    [9, 6],
    [10, 5]], dtype=float)
k = 2

# The number of data points and the dimension (computed before appending the extra column)
n = len(X)
d = len(X[0])
addZeros = np.zeros((n, 1))
X = np.append(X, addZeros, axis=1)
print("EM clustering algorithm: \n")
print("The training data: \n", X)
print("\nTotal number of data: ", n)
print("Total number of features: ", d)
print("Total number of Clusters: ", k)

# Initial parameters, taken from the worked example above:
# equal weights, initial means [2, 3] and [10, 5], fixed sigma = 0.471
w = 1 / k
sigma = 0.471
cT = 1 / (np.sqrt(2 * np.pi) * sigma)   # normalising constant of the Gaussian pdf
cT2 = 2 * np.square(sigma)
meanC = np.array([[2.0, 3.0], [10.0, 5.0]])
weight = np.zeros((n, k))               # responsibilities w_ij

# Expectation step: responsibility of each distribution for each data point
for i in range(n):
    sumP = 0
    for j in range(k):
        logL = 1
        for p in range(d):
            logL *= cT * np.exp(-(np.square(X[i, p] - meanC[j, p]) / cT2))
        sumP += w * logL
    for j in range(k):
        logL = 1
        for p in range(d):
            logL *= cT * np.exp(-(np.square(X[i, p] - meanC[j, p]) / cT2))
        weight[i, j] = w * logL / sumP
print(weight)

sumPDF = []
for j in range(k):
    sumPDF = np.append(sumPDF, np.sum(weight[:, j]))
print("sumPDF", sumPDF)

# Maximization step: each mean becomes the responsibility-weighted mean of the data
for j in range(k):
    for p in range(d):
        meanSum = 0
        for i in range(n):
            meanSum += X[i, p] * weight[i, j] / sumPDF[j]
        meanC[j, p] = meanSum
print(meanC)
EM clustering algorithm:

[[1. 0.]
 [1. 0.]
 [1. 0.]
 [0. 1.]
 [0. 1.]
 [0. 1.]]
sumPDF [3. 3.]
[[1.33333333 2.66666667]
 [9.33333333 5.33333333]]