
5.

CLUSTERING
Learning objectives
After careful study of this chapter, we should be able to understand:
• The clustering technique
• The K-Means algorithm
• Hierarchical clustering algorithms

5.1 INTRODUCTION
In this chapter we introduce a simple unsupervised learning technique called clustering.
A supervised learning algorithm needs training data {(x1, y1), (x2, y2), …, (xn, yn)}, i.e., the
data together with their class labels {y1, y2, …, yn}, whereas an unsupervised learning
algorithm does not require class labels.

An unsupervised machine learning algorithm, e.g., a clustering algorithm, needs only the
training data {x1, x2, …, xn} (without class labels) to reveal the structure of the data.

K-Means is the most popular partitional (non-overlapping) clustering algorithm, whereas
hierarchical clustering produces a nested (hierarchical) set of clusters.

5.2 WHAT IS CLUSTER ANALYSIS?


Cluster analysis is an unsupervised learning technique for organizing data into
meaningful groups (clusters), maximizing the similarity of data within each group
while maximizing the dissimilarity between groups.

Each cluster thus describes, in terms of the data collected, the class to which its members
belong. Items in a cluster are similar to each other and dissimilar to items in other
clusters. Cluster analysis divides data into groups (clusters) that are meaningful and
useful. It is the study of techniques for automatically finding hidden structure, i.e., there
is no prior knowledge (hence unsupervised) about which elements belong to which groups
(clusters).

There are many applications of cluster analysis, such as: (a) business: customer segmentation,
where customers are grouped into small segments for marketing; (b) medicine: identifying
different groups of diseases; (c) information retrieval: retrieving similar types of information
from huge databases; (d) document clustering: searching and grouping similar text documents;
(e) social networking: identifying groups of users who interact frequently with each other;
(f) software engineering; (g) recommendation systems; (h) k-anonymity: for data hiding;
(i) bioinformatics; (j) astronomy: discovering the hidden structure of galaxies; (k) engineering
fields such as civil, mechanical, chemical, etc.

Examples of clustering are depicted in Figure 5.1.

Figure 5.1: Different cluster formations for the same training data: (a) input data pattern, (b) k = 3 clusters, (c) k = 4 clusters, (d) k = 4 clusters (another instance), (e) k = 5 clusters

In practice there is often little prior information available about the structure of the data,
and the decision maker must make as few assumptions about the data as possible. A given
training dataset can be divided into different clusters (Figure 5.1); various techniques for
choosing the best-suited value of k, the number of clusters, are discussed later.

Types of clustering algorithm


Different types of clustering algorithms are as follows: (a) partitional clustering, (b)
hierarchical clustering, (c) density-based clustering, (d) grid-based clustering, (e) model-
based clustering. In the following sections we discuss the K-Means (partitional) and the
hierarchical clustering algorithms.

5.3 K-MEANS ALGORITHM (PARTITIONAL CLUSTERING)


K-Means clustering is a partition-based clustering method in which items are moved among
sets of clusters until the desired partition is reached. The data is partitioned such that the sum
of intra-cluster distances is minimized. K-Means is a simple and widely used clustering
algorithm. The algorithm attempts to determine k partitions that minimize a square-error
function (a distance-based cost). It works well when the clusters are compact, roughly
spherical, and well separated from one another. The K-Means algorithm attempts to find
cluster centers such that the sum of the squared distances of each object to its nearest
cluster center is minimized. Usually the distance is chosen as the Euclidean distance.

The Euclidean distance between two data points A and B is:

dist(A, B) = √((a1 − b1)² + (a2 − b2)² + (a3 − b3)² + ... + (an − bn)²)

where ai, bi, i = 1, ..., n are the attribute values of the two data points A and B.

For example, if A = {1, 3, 6} and B = {2, 3, 7}, the distance between A and B is:
dist(A, B) = √((1 − 2)² + (3 − 3)² + (6 − 7)²) = √2 = 1.4142
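As a quick check, the same distance can be computed with SciPy (a minimal sketch; scipy.spatial.distance.euclidean is also used in the K-Means program later in this section):

# Verify the Euclidean distance example above
from scipy.spatial import distance

A = [1, 3, 6]
B = [2, 3, 7]
print(distance.euclidean(A, B))   # 1.4142135623730951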
Advantages of the K-Means algorithm
The K-Means algorithm is simple, scalable, robust, and widely used.

Drawbacks of the K-Means algorithm


The drawbacks of the K-Means algorithm are as follows:
(1) The user must specify the number of clusters, which is very difficult to determine in most
cases.
(2) It requires selection of suitable initial cluster centers, which is again error-prone. Since the
structure of the clusters depends on the initial cluster centers, a poor choice may result in
inefficient clustering.
(3) The K-Means algorithm is very sensitive to noise. Moreover, it is not suitable for
discovering clusters with non-convex shapes or clusters of very different sizes.

The steps of the K-Means algorithm are presented as follows:


Algorithm 5.1: The K-Means algorithm
Input: Data (D), Number of clusters (k)
Output: Clusters (K)
Step 1: randomly select k data points as the initial cluster centers;
Step 2: (re)assign each data point to its nearest cluster center;
Step 3: (re)calculate the mean of each cluster to get the new cluster centers;
Step 4: repeat steps 2 and 3 until the cluster centers do not change (or for a fixed number of iterations);

The time complexity of the K-Means algorithm is O(nkt), where n is the number of data
objects, k is the number of clusters and t is the number of iterations.

5.3.1 Cost function

The cost function is

J(C, c) = (1/n) Σ_{i=1}^{n} dist(xi, cj)

where n is the total number of training data, xi, 1 ≤ i ≤ n, is the ith training data, cj, 1 ≤ j ≤ K, is the cluster center nearest to the ith training data, and K is the total number of clusters. The objective is to minimize the cost value.

Example: Let the data be D = {(1,1), (1,2), (2,2), (1,3), (10,10), (10,11), (11,12), (20,20),
(20,21), (21,24)}, and let the three cluster centers be c1 = (1.25, 2), c2 = (10.33, 11) and
c3 = (20.33, 21.67). The cost value is calculated as follows:

                 x1     x2     x3     x4     x5      x6      x7      x8      x9      x10
Data (xi)        (1,1)  (1,2)  (2,2)  (1,3)  (10,10) (10,11) (11,12) (20,20) (20,21) (21,24)
Nearest cj       c1     c1     c1     c1     c2      c2      c2      c3      c3      c3
dist(xi, cj)     1.03   0.25   0.75   1.03   1.05    0.33    1.20    1.70    0.75    2.43
Table: The cost value calculation

The cost value = ((1.03 + 0.25 + 0.75 + 1.03) + (1.05 + 0.33 + 1.20) + (1.70 + 0.75 + 2.43)) / 10 = 1.052
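The cost value above can be reproduced with a few lines of NumPy/SciPy (a minimal sketch, assuming the data and the three centers stated in the example):

# Reproduce the cost value of the table above
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1,1],[1,2],[2,2],[1,3],[10,10],[10,11],[11,12],
              [20,20],[20,21],[21,24]], dtype=float)
C = np.array([[1.25, 2], [10.33, 11], [20.33, 21.67]])

# distance of every point to its nearest center, averaged over all points
cost = np.mean(np.min(cdist(X, C), axis=1))
print(cost)   # approximately 1.052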

Explanation of K-Means algorithm


Let the ten two-dimensional input data be D = {(1,1), (1,2), (2,2), (1,3), (10,10), (10,11),
(11,12), (20,20), (20,21), (21,24)}, and let the desired number of clusters be k = 3.

First, randomly select three data points (from the training data) as the initial cluster centers.
Assume the initial cluster centers are c1 = (1, 2), c2 = (11, 12), c3 = (20, 20).

Next, assign each data point to its nearest cluster center. For example, to assign a cluster to
the data point d1 = (1, 1), we compute the Euclidean (or Manhattan) distance from d1 to each
of the (initial) cluster centers c1 = (1, 2), c2 = (11, 12) and c3 = (20, 20).

              x1     x2     x3     x4     x5      x6      x7      x8      x9      x10
              (1,1)  (1,2)  (2,2)  (1,3)  (10,10) (10,11) (11,12) (20,20) (20,21) (21,24)
c1 (1, 2)     1.0    0.0    1.0    1.0    12.04   12.73   14.14   26.17   26.87   29.73
c2 (11, 12)   14.87  14.14  13.45  13.45  2.24    1.41    0.0     12.04   12.73   15.62
c3 (20, 20)   26.87  26.17  25.46  25.5   14.14   13.45   12.04   0.0     1.0     4.12
Nearest       c1     c1     c1     c1     c2      c2      c2      c3      c3      c3
Table 5.1: First K-Means iteration (distances to the initial cluster centers)

dist(d1, c1) = √((1 − 1)² + (1 − 2)²) = 1, dist(d1, c2) = √((1 − 11)² + (1 − 12)²) = 14.87, dist(d1, c3) = √((1 − 20)² + (1 − 20)²) = 26.87

Since dist(d1, c1) = 1 is the minimum distance, the data point (1, 1) belongs to cluster K1 (with
cluster center c1). In the same way we assign clusters to the rest of the data.
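For illustration, the whole assignment step can be written in vectorized form with SciPy's cdist (this is only a sketch; the complete loop-based K-Means program is given later in this section):

# Vectorized sketch of the assignment step
import numpy as np
from scipy.spatial.distance import cdist

X = np.array([[1,1],[1,2],[2,2],[1,3],[10,10],[10,11],[11,12],
              [20,20],[20,21],[21,24]], dtype=float)
C = np.array([[1,2],[11,12],[20,20]], dtype=float)   # initial centers c1, c2, c3

D = cdist(X, C)                # n x k matrix of Euclidean distances (Table 5.1)
labels = np.argmin(D, axis=1)  # index of the nearest center for each data point
print(labels)                  # [0 0 0 0 1 1 1 2 2 2]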

Data points (1, 1), (1, 2), (2, 2) and (1, 3) belong to cluster center c1 = (1, 2), forming cluster
K1. Data points (10, 10), (10, 11) and (11, 12) belong to cluster center c2 = (11, 12), forming
cluster K2. Data points (20, 20), (20, 21) and (21, 24) belong to cluster center c3 = (20, 20),
forming cluster K3.

Next, calculate the new cluster centers from the clusters K1, K2 and K3. The new center of
cluster K1 is the mean of (1, 1), (1, 2), (2, 2) and (1, 3), i.e., (1.25, 2) (updated value of c1);
the new center of cluster K2 is the mean of (10, 10), (10, 11) and (11, 12), i.e., (10.33, 11)
(updated value of c2); and the new center of cluster K3 is the mean of (20, 20), (20, 21) and
(21, 24), i.e., (20.33, 21.67) (updated value of c3).

Next, reassign all the data points to the nearest of the updated cluster centers c1 = (1.25, 2),
c2 = (10.33, 11) and c3 = (20.33, 21.67).

                   x1     x2     x3     x4     x5      x6      x7      x8      x9      x10
                   (1,1)  (1,2)  (2,2)  (1,3)  (10,10) (10,11) (11,12) (20,20) (20,21) (21,24)
c1 (1.25, 2)       1.03   0.25   0.75   1.03   11.86   12.55   13.97   25.99   26.69   29.56
c2 (10.33, 11)     13.68  12.96  12.26  12.29  1.05    0.33    1.20    13.21   13.91   16.82
c3 (20.33, 21.67)  28.30  27.58  26.89  26.87  15.59   14.85   13.44   1.70    0.75    2.42
Nearest            c1     c1     c1     c1     c2      c2      c2      c3      c3      c3

Data points (1, 1), (1, 2), (2, 2) and (1, 3) again belong to cluster center c1 = (1.25, 2), forming
cluster K1. Data points (10, 10), (10, 11) and (11, 12) belong to cluster center c2 = (10.33, 11),
forming cluster K2. Data points (20, 20), (20, 21) and (21, 24) belong to cluster center
c3 = (20.33, 21.67), forming cluster K3.

Next, the updated cluster centers are calculated from clusters K1, K2 and K3. The new center
of K1 is the mean of (1, 1), (1, 2), (2, 2) and (1, 3), i.e., (1.25, 2); the new center of K2 is the
mean of (10, 10), (10, 11) and (11, 12), i.e., (10.33, 11); and the new center of K3 is the mean
of (20, 20), (20, 21) and (21, 24), i.e., (20.33, 21.67). The centers are unchanged from the
previous iteration.

The above process continues until no cluster center changes between successive iterations,
or for a fixed number of iterations (say 30). In this example the cluster centers do not change
further, hence the final clustering is as follows:

Data points 1, 2, 3 and 4 form one cluster, data points 5, 6 and 7 form a second cluster, and
data points 8, 9 and 10 form a third cluster (the cluster numbers printed by the Python
program below depend on the random initialization).

#K-Means algorithm
import numpy as np
from scipy.spatial import distance

k = 3
#Training data [x, y]
X = np.array([[1,1],[1,2],[2,2],[1,3],[10,10],[10,11],[11,12],[20,20],[20,21],[21,24]], dtype=float)

# Number of data points and dimension
n = len(X)
d = len(X[0])

# Append a column that will hold the assigned cluster number
addZeros = np.zeros((n, 1))
X = np.append(X, addZeros, axis=1)

print("The K Means algorithm: \n")
print("The training data: \n", X)
print("Total number of data: ", n)
print("Total number of features: ", d)
print("Total number of Clusters: ", k)

# Random selection of k distinct data points as initial cluster centers
while True:
    ranIndex = np.random.randint(low=0, high=n, size=k)
    if len(np.unique(ranIndex)) == k:   # reject duplicate indices
        break
C = X[ranIndex].copy()
print("\n The initial cluster centers: \n", C[:, 0:d])
print("\n")

# Main iteration starts
for it in range(10):                    # fixed number of iterations
    # Assignment step: attach each data point to its nearest center
    for i in range(n):
        minDist = float("inf")
        for j in range(k):
            dist = distance.euclidean(C[j, 0:d], X[i, 0:d])
            if dist < minDist:          # capture the minimum distance
                minDist = dist
                X[i, d] = j             # store the cluster number

    # Update step: recompute each center as the mean of its members
    for j in range(k):
        members = X[X[:, d] == j]
        if len(members) > 0:
            C[j, 0:d] = np.mean(members[:, 0:d], axis=0)
        C[j, d] = j                     # keep the cluster label in the last column

# Calculate cost value: average distance of every point to its own center
cost = 0.0
for i in range(n):
    tempC = C[C[:, d] == X[i, d]]
    cost += distance.euclidean(X[i, 0:d], tempC[0, 0:d])
cost = cost / n

print("The Final cluster centers: \n", C)
print("\n The data with cluster number: \n", X)
print("\n The cost is: ", cost)
# End of cost value calculation

The K Means algorithm:


Total number of data: 10
Total number of features: 2
Total number of Clusters: 3

The data with cluster number:


[[ 1. 1. 2.]
[ 1. 2. 2.]
[ 2. 2. 2.]
[ 1. 3. 2.]
[10. 10. 0.]
[10. 11. 0.]
[11. 12. 0.]
[20. 20. 1.]
[20. 21. 1.]
[21. 24. 1.]]

The cost is: 1.0522561584810652


5.3.2 Suitable number of clusters

We have already mentioned that any number of clusters can be formed from a given dataset.
To find the optimal number of clusters, the cost value can be used: the cost is plotted against
the number of clusters, and from the plot we can estimate the number of clusters present in
the dataset.

Example: Let the ten two-dimensional input data be D = {(1,1), (1,2), (2,2), (1,3), (10,10),
(10,11), (11,12), (20,20), (20,21), (21,24)}. The cost values for different numbers of clusters
k are calculated and, since the initialization is random, the minimum cost value observed for
each k is noted. In the plot of the number of clusters versus cost, an "elbow" appears at k = 3.
Therefore, we conclude that the optimal number of clusters for the given data is 3.

Figure 5.2: Elbow method to identify a suitable number of clusters
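A minimal elbow-method sketch is shown below. As an assumption, it uses scikit-learn's KMeans and its inertia_ attribute (the within-cluster sum of squared distances) rather than the mean-distance cost of Section 5.3.1; only the shape of the curve matters for locating the elbow.

# Elbow method: cost versus number of clusters
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[1,1],[1,2],[2,2],[1,3],[10,10],[10,11],[11,12],
              [20,20],[20,21],[21,24]])

ks = range(1, 7)
costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, costs, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Cost (inertia)')
plt.show()   # the "elbow" appears at k = 3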

5.3.3 Local optima and the K-Means algorithm


Let the data be (1, 1), (1, 4), (3, 2), (1, 3), (4, 4), (10, 10), (10, 11), (12, 5), (11, 12), (20, 20),
(20, 21), (21, 24) and let the number of clusters be 3. The following figures (Figure 5.3)
indicate various possible clusterings.

Figure 5.3: Possible clusters of the given dataset when k = 3


In Figure 5.3.A the data is plotted. In Figures 5.3.B to 5.3.D various clusterings are formed in
which the cost values are high and the cluster centers are trapped in local minima. In Figure
5.3.E the cost is minimum (the global optimum) and the appropriate clusters are shown.
Therefore, in the K-Means algorithm the initial cluster centers are selected randomly and the
algorithm is run many times for the given number of clusters k in order to identify the global
minimum.

5.3.4 The criteria to choose the k value


The number of clusters K can be identified automatically, e.g., using the "elbow method"
(Section 5.3.2), or manually as per the requirement. To select the number of clusters manually,
the domain expert must apply domain knowledge. For example, in the field of management,
suppose a dataset contains the purchase history of customers and we need to divide the
customers into three groups, such as platinum, diamond and gold customers, based on the
frequency and value of their purchases. In this situation the value of K should be 3 (Figure 5.4).
Similarly, if we need to partition the customers into five different groups, e.g., platinum,
diamond, gold, silver and copper customer groups, then the value of K should be 5 (Figure
5.4). Therefore, the value of K must be chosen based on the requirement.

Figure 5.4: The criteria to choose K value

5.4 HIERARCHICAL CLUSTERING


Hierarchical clustering constructs a hierarchical series of nested clusters which can be
graphically represented by a tree called a "dendrogram". Due to its nested structure, it is
effective and gives better structural information. By cutting the dendrogram at some level,
we can obtain a specified number of clusters.

There are two basic approaches for generating a hierarchical clustering. (a) Agglomerative
hierarchical clustering algorithms begin with each data object as an individual cluster. At
each step the two most similar clusters are merged, so after each merge the total number of
clusters decreases by one. These steps are repeated until the desired number of clusters
is obtained. (b) Divisive clustering algorithms start with one all-inclusive cluster and, at each
step, split a cluster until only singleton clusters of individual points remain. In this case we
need to decide which cluster to split at each step and how to do the splitting.
Figure 5.5: Dendrogram and nested structure of hierarchical clustering

Single link, complete link, group average link, centroid and medoid techniques are the most
well-known agglomerative techniques. The basic agglomerative hierarchical clustering is
presented in the following algorithm:

Algorithm 5.2: Basic agglomerative hierarchical clustering


Input: Dataset (D);
Output: Cluster (C);
Step 1: compute the proximity matrix;
Step 2: repeat
Step 3: merge the two closest clusters;
Step 4: update the proximity matrix;
Step 5: until converged; // e.g., until k clusters remain

Defining proximity between clusters


The computation of the proximity between two clusters is the key operation of agglomerative
clustering algorithms. Different agglomerative clustering algorithms use different proximity
measures. Given clusters Ci and Cj, there are several ways to calculate the proximity between
them.

Single link (MIN): the proximity is the smallest distance between a data point in one cluster
and a data point in the other, i.e.,

dis(Ci, Cj) = min dis(o_il, o_jm)   [5.2]

where o_il ∈ Ci, o_jm ∈ Cj, 1 ≤ l ≤ n_i, 1 ≤ m ≤ n_j, and n_i, n_j are the numbers of data
points in clusters Ci and Cj respectively. The single link technique is good at handling
non-elliptical shapes, but is sensitive to noise and outliers (Figure 5.6.a).

Complete link (MAX): the proximity is the largest distance between a data point in one cluster
and a data point in the other, i.e.,

dis(Ci, Cj) = max dis(o_il, o_jm)   [5.3]

where o_il ∈ Ci, o_jm ∈ Cj, 1 ≤ l ≤ n_i, 1 ≤ m ≤ n_j, and n_i, n_j are the numbers of data
points in clusters Ci and Cj respectively. Complete link distance is less susceptible to noise
and outliers, but it can break large clusters and it favors globular shapes (Figure 5.6.b).

Average link distance: the cluster proximity is defined as the average pairwise distance over
all pairs of data points from different clusters, i.e.,

dis(Ci, Cj) = (1 / (n_i × n_j)) Σ_{l=1}^{n_i} Σ_{m=1}^{n_j} dis(o_il, o_jm)   [5.4]

where n_i and n_j are the numbers of data points in clusters Ci and Cj respectively
(Figure 5.6.c).

Centroid: if the clusters have representative centroids, then the centroid distance is defined
as the distance between the centroids, i.e., dis(Ci, Cj) = dis(Ri, Rj) [5.5], where Ri and Rj are
the centroids of clusters Ci and Cj respectively (Figure 5.6.d).

Medoid: the proximity is the distance between the medoids of the clusters, i.e.,
dis(Ci, Cj) = dis(Mi, Mj) [5.6], where Mi and Mj are the medoids of clusters Ci and Cj
respectively (Figure 5.6.e).

Figure 5.6: Proximity between clusters: (a) single link distance, (b) complete link distance, (c) average link distance, (d) centroid distance, (e) medoid distance
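The proximity measures above can be illustrated with a small sketch on two example clusters (hypothetical values; only the pairwise Euclidean distances from SciPy are assumed):

# Single, complete, average link and centroid distances between two clusters
import numpy as np
from scipy.spatial.distance import cdist

Ci = np.array([[1, 1], [1, 2], [2, 2]])      # cluster C_i
Cj = np.array([[10, 10], [10, 11]])          # cluster C_j

pairwise = cdist(Ci, Cj)                     # all pairwise distances dis(o_il, o_jm)
print("Single link  (MIN):", pairwise.min())
print("Complete link(MAX):", pairwise.max())
print("Average link      :", pairwise.mean())
print("Centroid distance :", np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0)))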
Example 5.3: Let the data be D = {(1,1), (1,2), (2,2), (1,3), (10,10), (10,11), (11,12), (20,20),
(20,21), (21,24)}. Apply the hierarchical clustering algorithm with single link distance using Python.

Solution:
#Hierarchical clustering
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

#Training data [x, y]
X = np.array([[1,1], [1,2], [2,2], [1,3], [10,10], [10,11], [11,12], [20,20], [20,21], [21,24]])

# Print the number of data and dimension
n = len(X)
d = len(X[0])
print("The hierarchical clustering algorithm: \n")
print("The training data: \n", X)
print("Total number of data: ", n)
print("Total number of features: ", d)

# Scatter plot of the training data with point labels
labels = list(range(1, 11))
plt.scatter(X[:,0], X[:,1], label='True Position')
for lab, x, y in zip(labels, X[:, 0], X[:, 1]):
    plt.annotate(lab, xy=(x, y), xytext=(-3, 3), textcoords='offset points', ha='right', va='bottom')
plt.show()

linkedData = linkage(X, 'single')  # Apply single link distance

# Draw the dendrogram
labelList = list(range(1, 11))
dendrogram(linkedData, orientation='top', labels=labelList, distance_sort='descending',
           show_leaf_counts=True)
plt.show()

The hierarchical clustering algorithm:


The training data:
[[ 1 1] [ 1 2] [ 2 2] [ 1 3] [10 10] [10 11] [11 12] [20 20] [20 21] [21 24]]
Total number of data: 10
Total number of features: 2
The available linkage methods are:

linkedData = linkage(X, 'single')    # Apply single link distance
linkedData = linkage(X, 'complete')  # Apply complete link distance
linkedData = linkage(X, 'average')   # Apply average link distance
linkedData = linkage(X, 'weighted')  # Apply weighted link distance
linkedData = linkage(X, 'centroid')  # Apply centroid link distance
linkedData = linkage(X, 'median')    # Apply median link distance
linkedData = linkage(X, 'ward')      # Apply Ward distance
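As mentioned in Section 5.4, the dendrogram can be cut to obtain a fixed number of clusters. A minimal sketch using SciPy's fcluster on the same data (the exact cluster labels depend on the ordering chosen by SciPy):

# Cut the single-link dendrogram into 3 clusters
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1,1],[1,2],[2,2],[1,3],[10,10],[10,11],[11,12],
              [20,20],[20,21],[21,24]])

linkedData = linkage(X, 'single')
clusters = fcluster(linkedData, t=3, criterion='maxclust')  # ask for 3 clusters
print(clusters)   # e.g., [1 1 1 1 2 2 2 3 3 3]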

5.5 SHAPE-BASED CLUSTERING (DBSCAN)

DBSCAN is a density-based clustering algorithm that groups points lying in dense regions
(controlled by the neighbourhood radius eps and the minimum number of neighbours
min_samples) and labels points in sparse regions as noise (cluster label -1). The following
Python program applies DBSCAN to a small two-dimensional dataset.

# DBSCAN
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

# Load data in X
X = np.array([[1,2],[1,3],[1,4],[2,4],[2,5],[3,5],
              [3,6],[4,6],[5,5],[5,6],[6,4],[6,5],
              [6,6],[7,4],[7,5],[7,6],[4,2],[4,3],
              [5,1],[5,2],[6,1],[6,2],[7,1],[7,2],
              [8,1],[8,2],[9,1],[9,2],[9,3],[9,4],
              [9,5],[10,1],[10,2],[10,3],[10,4],[10,5]])

# Fit DBSCAN with radius eps = 1 and at least 2 points per neighbourhood
db = DBSCAN(eps=1, min_samples=2).fit(X)
labels = db.labels_
print(labels)

# Mark the core samples
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

# Number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

# Plot result; black is reserved for noise
unique_labels = set(labels)
colors = ['y', 'b', 'g', 'r']
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = 'k'  # Black used for noise

    class_member_mask = (labels == k)

    # Core points (larger markers) and border points (smaller markers)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=6)

plt.title('number of clusters: %d' % n_clusters_)
plt.show()

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

5.6 EXPECTATION MAXIMIZATION (EM) FOR SOFT CLUSTERING


Suppose the data come from k distributions (i.e., there are k groups). In Figure 5.X there are
three distributions (i.e., 3 clusters) with associated statistical parameters μi and σi, 1 ≤ i ≤ 3.
These parameters (μi and σi) are not known, and our aim is to identify which parameters
belong to which cluster. Here the cluster assignment is not crisp, i.e., for each data point we
assign a weight that describes its degree of belongingness to each group. For example, for a
data point that lies inside distribution D1, the weight for D1 is much higher than the weights
for the other two distributions. Our main goal is to identify the means μi, marked (+) for each
cluster in Figure 5.X.

Figure 5.X: Dataset for EM soft clustering

Let the data be X = {x1, x2, …, xn}. Assuming a normal distribution, the likelihood of the data is

P(X | θ) = Π_{i=1}^{n} (1 / (√(2π) σ)) e^(−(xi − μ)² / (2σ²))

Since the value of P(X | θ) is very small, we take the logarithm of both sides:

log P(X | θ) = −Σ_{i=1}^{n} (xi − μ)² / (2σ²) − 0.5 n log(2π) − n log(σ)

In EM we select the values of μi and σi so that log P(X | θ) is maximum (this is the maximum
likelihood principle).

Likelihood function: given a set of data and the probability of the data (regarded as the
likelihood function), we can estimate the parameters of the distribution.

Figure 5.X: Likelihood function

EM clustering algorithm:
In the EM clustering algorithm the parameters are initially guessed, and then the probability
of belongingness of each data point to each distribution is calculated (expectation step).
Next, new estimates of the parameters are calculated (maximization step). If the parameters
do not change, the process stops; otherwise the two steps are repeated (Figure 5.X).

Figure 5.X: EM clustering

Basic steps of the EM clustering algorithm:


Input: data xi ∈ X, 1 ≤ i ≤ n, where n is the number of data points; k, the number of clusters
Output: clusters Cj and the associated parameters θj, 1 ≤ j ≤ k

Step 1: randomly select the initial parameters (μj, σj) ∈ θj, 1 ≤ j ≤ k, and the initial weight values wj
Step 2: while true:
(Assignment or expectation step)
Step 3:     for i in range(n):
Step 4:         for j in range(k):
Step 5:             calculate Prob(jth distribution | xi, θj)

(Maximization step)
Step 6:     using Prob(jth distribution | xi, θj), find the new estimates of the parameters that
maximize the expected likelihood
Step 7:     if the parameters do not change:
Step 8:         break

Detailed steps of the EM clustering algorithm

Step 1: randomly select the initial set of parameters. Let the number of clusters be k = 2, with
parameters θ1 = {μ1, σ1} and θ2 = {μ2, σ2}, n the number of data points, and w = 1/k.

Steps 2 to 5 (expectation step): calculate, for all i, j, 1 ≤ i ≤ n, 1 ≤ j ≤ k,

prob(jth distribution | xi, θj) = wj prob(xi | θj) / Σ_{j=1}^{k} wj prob(xi | θj)

where prob(xi | θj) = (1 / (√(2π) σj)) e^(−(xi − μj)² / (2σj²))

Step 6 (maximization step): for all j, 1 ≤ j ≤ k,

μj = Σ_{i=1}^{n} xi prob(jth distribution | xi, θj) / Σ_{i=1}^{n} prob(jth distribution | xi, θj)

Steps 7 to 8: repeat steps 2 to 6 until the parameters do not change.
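The expectation step above can be sketched compactly with scipy.stats.norm (an illustration only; it assumes independent features sharing a single σ, exactly as in the worked example that follows):

# Expectation step: responsibilities of each distribution for each data point
import numpy as np
from scipy.stats import norm

X = np.array([[1,2],[2,3],[1,3],[9,5],[9,6],[10,5]], dtype=float)
means = np.array([[2,3],[10,5]], dtype=float)   # initial mu_1, mu_2
sigma, k = 0.471, 2
w = np.full(k, 1.0/k)                           # equal initial weights

# prob(x_i | theta_j): product of per-feature normal densities
pdf = np.array([[np.prod(norm.pdf(x, loc=means[j], scale=sigma)) for j in range(k)]
                for x in X])
resp = (w * pdf) / (w * pdf).sum(axis=1, keepdims=True)   # responsibilities
print(np.round(resp, 3))   # first three rows ~ [1, 0], last three ~ [0, 1]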

Example:

Let X = {[1, 2], [2, 3], [1, 3], [9, 5], [9, 6], [10, 5]} and k = 2, so w = 1/2 = 0.5. Let the initial
means be μ1 = [2, 3] and μ2 = [10, 5], and assume σ = 0.471 in all cases.

Expectation step

For the first data point x1 = [1, 2]:

If j = 1 then
prob(x1 | μ1) = prob(x11 | μ11) × prob(x12 | μ12)
= (1 / (√(2π) × 0.471)) e^(−(1 − 2)² / (2 × 0.471²)) × (1 / (√(2π) × 0.471)) e^(−(2 − 3)² / (2 × 0.471²))
= 0.007908511357356719

If j = 2 then
prob(x1 | μ2) = prob(x11 | μ21) × prob(x12 | μ22)
= (1 / (√(2π) × 0.471)) e^(−(1 − 10)² / (2 × 0.471²)) × (1 / (√(2π) × 0.471)) e^(−(2 − 5)² / (2 × 0.471²))
= 5.7546306462643696e-89

prob(1st distribution | x1, θ1) = w × prob(x1 | μ1) / (w × prob(x1 | μ1) + w × prob(x1 | μ2))
= 0.5 × 0.007908511357356719 / (0.5 × 0.007908511357356719 + 0.5 × 5.7546306462643696e-89) = 1 (w11)

prob(2nd distribution | x1, θ2) = w × prob(x1 | μ2) / (w × prob(x1 | μ1) + w × prob(x1 | μ2))
= 0.5 × 5.7546306462643696e-89 / (0.5 × 0.007908511357356719 + 0.5 × 5.7546306462643696e-89) = 0 (w12)

After processing all six data points in the same way, the weight matrix is:

1 0
1 0
1 0
0 1
0 1
0 1

Maximization step

μ11 = Σ_i xi1 prob(1st distribution | xi, θ1) / Σ_i prob(1st distribution | xi, θ1)
= (1×1 + 1×2 + 1×1) / (1 + 1 + 1) = 1.33

μ12 = Σ_i xi2 prob(1st distribution | xi, θ1) / Σ_i prob(1st distribution | xi, θ1)
= (1×2 + 1×3 + 1×3) / (1 + 1 + 1) = 2.67

Similarly, using the weights of the second distribution:

μ21 = (1×9 + 1×9 + 1×10) / (1 + 1 + 1) = 9.33
μ22 = (1×5 + 1×6 + 1×5) / (1 + 1 + 1) = 5.33

The updated cluster centre μ1 = [μ11, μ12] = [1.33, 2.67]

The updated cluster centre μ2 = [μ21, μ22] = [9.33, 5.33]

# EM clustering algorithm
import numpy as np
np.set_printoptions(suppress=True)

X = np.array([
    [1, 2],
    [2, 3],
    [1, 3],
    [9, 5],
    [9, 6],
    [10, 5]], dtype=float)
k = 2

# Print the number of data and dimension
n = len(X)
d = len(X[0])

# Append a column that will hold the assigned cluster number
addZeros = np.zeros((n, 1))
X = np.append(X, addZeros, axis=1)
print("EM clustering algorithm: \n")
print("The training data: \n", X)
print("\nTotal number of data: ", n)
print("Total number of features: ", d)
print("Total number of Clusters: ", k)

# Initial cluster centers (fixed here to match the worked example above)
meanC = np.array([[2, 3], [10, 5]], dtype=float)
print("\n The initial cluster centers: \n", meanC[:, 0:d])
print("\n")

sigma = 0.471
cT = 1 / (np.sqrt(2 * np.pi) * sigma)   # normalizing constant of the Gaussian
cT2 = 2 * np.square(sigma)              # 2 * sigma^2

# Expectation step: weight[i, j] = prob(jth distribution | x_i, theta_j)
weight = np.zeros((n, k))
w = 1 / k
for i in range(n):
    sumP = 0
    for j in range(k):
        logL = 1
        for p in range(d):
            logL *= cT * np.exp(-(np.square(X[i, p] - meanC[j, p]) / cT2))
        sumP += w * logL
    for j in range(k):
        logL = 1
        for p in range(d):
            logL *= cT * np.exp(-(np.square(X[i, p] - meanC[j, p]) / cT2))
        weight[i, j] = w * logL / sumP

print(weight)

# Sum of the weights of each distribution
sumPDF = []
for j in range(k):
    sumPDF = np.append(sumPDF, np.sum(weight[:, j]))
print("sumPDF", sumPDF)

# Maximization step: weighted mean of each feature for each distribution
for j in range(k):
    for p in range(d):
        meanSum = 0
        for i in range(n):
            meanSum += X[i, p] * weight[i, j] / sumPDF[j]
        meanC[j, p] = meanSum

print(meanC)

# Identify the cluster number of each data point
for i in range(n):
    X[i, d] = np.argmax(weight[i])

print("\nThe data with cluster number: \n", X)

EM clustering algorithm:

The training data:


[[ 1. 2. 0.]
[ 2. 3. 0.]
[ 1. 3. 0.]
[ 9. 5. 0.]
[ 9. 6. 0.]
[10. 5. 0.]]

Total number of data: 6


Total number of features: 2
Total number of Clusters: 2

The initial cluster centers:


[[ 2.  3.]
 [10.  5.]]

[[1. 0.]
[1. 0.]
[1. 0.]
[0. 1.]
[0. 1.]
[0. 1.]]
sumPDF [3. 3.]
[[1.33333333 2.66666667]
 [9.33333333 5.33333333]]

The data with cluster number:


[[ 1. 2. 0.]
[ 2. 3. 0.]
[ 1. 3. 0.]
[ 9. 5. 1.]
[ 9. 6. 1.]
[10. 5. 1.]]
