ML Unit III
Clustering
• Unsupervised learning
• Requires data, but no labels
• Detect patterns, e.g.:
  Group emails or search results
  Customer shopping patterns
  Regions of images
Clustering
• Clustering is a technique in machine learning used to group similar data
points together in an unsupervised manner.
• In clustering, the goal is to partition a set of data points into subsets or
clusters based on the similarity of their attributes or features.
• The clusters are formed such that the data points within a cluster are more
similar to each other than to those in other clusters.
Example: Let's understand the clustering technique with a real-world example of a shopping
mall: when we visit a mall, we can observe that items with similar usage are grouped
together. T-shirts are grouped in one section and trousers in another; similarly, in the
vegetable section, apples, bananas, mangoes, etc., are kept in separate groups so that we
can easily find what we need. The clustering technique works in the same way.
▪ In general, clustering is a grouping of objects such that the objects in a group (cluster) are
similar (or related) to one another and different from (or unrelated to) the objects
in other groups.
▪ A good clustering keeps intra-cluster distances small (points within a cluster are close
together) and inter-cluster distances large (clusters are well separated).
Why Clustering
• Clustering is important because it reveals the intrinsic grouping present in unlabelled data.
• There is no single criterion for a good clustering; it depends on the user and on the
criteria that satisfy their need.
• For instance, we could be interested in finding representatives for homogeneous groups
(data reduction), in finding "natural clusters" and describing their unknown properties
("natural" data types), or in finding useful and suitable groupings ("useful" data classes).
• A clustering algorithm must make some assumptions about what constitutes the similarity
of points, and each assumption leads to different, equally valid clusters.
Clustering Applications
• Customer segmentation: dividing a company's customers into groups that reflect similarity
among the customers in each group.
• Fraud detection: using techniques such as K-Means clustering, one can identify patterns of
unusual activity; detecting an outlier may indicate that a fraud event has taken place.
• Document grouping
• Image segmentation
• Anomaly detection
Clustering Methods
• Partitioning Methods: These methods partition the objects into k clusters, and each
partition forms one cluster. They optimize an objective criterion (a similarity function),
typically based on distance. Example: K-Means.
• Hierarchical Methods: The clusters formed by these methods have a tree-type structure
based on the hierarchy. They are divided into two categories:
  • Agglomerative (bottom-up approach)
  • Divisive (top-down approach)
• Density-Based Methods: These methods consider clusters to be dense regions of the space
having some similarities, separated from lower-density regions. These methods have good
accuracy and the ability to merge two clusters.
Example: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Dunn Index: Performance Measure
• The Dunn Index is used to identify dense and well-separated clusters.
• It is the ratio between the minimum inter-cluster distance and the maximum intra-cluster
distance:
  Dunn Index = min over i ≠ j of d(i, j) / max over k of d'(k)
• Here d(i, j) is the inter-cluster distance between clusters i and j (the numerator takes the
minimum over all pairs of clusters), and d'(k) is the intra-cluster distance (diameter) of
cluster k (the denominator takes the maximum over all clusters).
• Algorithms that create clusters with a high Dunn index are more desirable, since such
clusters are more compact and better separated from each other. A small computational
sketch is given below.
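• A minimal computational sketch (not from the original slides) of the Dunn index for a data
matrix and cluster labels, assuming Euclidean distances; the helper name dunn_index is our own:

# Illustrative sketch: Dunn index from data X and cluster labels (assumes Euclidean distance).
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Maximum intra-cluster distance (cluster "diameter")
    max_intra = max(cdist(c, c).max() for c in clusters)
    # Minimum inter-cluster distance over all pairs of clusters
    min_inter = min(cdist(clusters[i], clusters[j]).min()
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters)))
    return min_inter / max_intra

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
labels = np.array([0, 0, 1, 1])
print(dunn_index(X, labels))   # higher values indicate compact, well-separated clusters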
K-Means Clustering
• K-Means clustering is an unsupervised iterative clustering technique.
• It partitions the given data set into k predefined distinct clusters.
• A cluster is defined as a collection of data points exhibiting certain similarities
K-Means Clustering
It partitions the data set such that:
• Each data point belongs to the cluster with the nearest mean.
K-Means Clustering
K-Means Clustering Algorithm-
Step-01: Choose the number of clusters K.
Step-02: Randomly select K data points as the initial cluster centers.
Step-03: Calculate the distance between each data point and each cluster center.
Step-04:
• Assign each data point to some cluster.
• A data point is assigned to the cluster whose center is nearest to that data point.
Step-05:
• Re-compute the centers of the newly formed clusters.
• The center of a cluster is computed by taking the mean of all the data points contained in that
cluster.
Step-06:
Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping
criteria is met:
• The centers of the newly formed clusters do not change
• Data points remain in the same clusters
• The maximum number of iterations is reached
K-Means Clustering
• More detailed version of the algorithm:
• Input: A set of n data points X = {x1, x2, ..., xn}, and a number k of clusters
to form.
• Output: A set of k cluster centroids C = {c1, c2, ..., ck} and a set of k
clusters S = {S1, S2, ..., Sk}.
• Randomly select k data points as the initial centroids C = {c1, c2, ..., ck}.
• Repeat until convergence:
• a. For each data point xi, find the nearest centroid cj using Euclidean
distance.
• b. Assign xi to the cluster with centroid cj.
• c. Update the centroid cj by calculating the mean of all data points
assigned to it.
• Return the set of k centroids C and the set of k clusters S. (A NumPy sketch of this
procedure is given below.)
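• A compact NumPy sketch of the procedure listed above (our own illustration, not code from
the slides); it assumes numeric data and Euclidean distance, and omits empty-cluster handling:

# Illustrative NumPy sketch of the K-Means procedure described above.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly select k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assign each point to the cluster with the nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
centroids, labels = kmeans(X, k=2)
print(centroids, labels)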
Advantages
Advantages of K-means clustering algorithm
• Relatively easy to understand and implement.
Disadvantages
Disadvantages of K-means clustering algorithm
• Choosing K manually and being dependent on the initial values
Elbow Method
K Means Clustering Using the Elbow Method
• In the Elbow method, we vary the number of clusters (K), for example from 1 to 10.
• For each value of K, we calculate the WCSS (Within-Cluster Sum of Squares).
• WCSS is the sum of the squared distances between each point and the centroid of its cluster.
• When we plot WCSS against the K value, the plot looks like an elbow.
• As the number of clusters increases, the WCSS value decreases.
• The WCSS value is largest when K = 1. When we analyze the graph, we can see that it changes
rapidly at some point, creating an elbow shape.
• From this point onwards, the graph moves almost parallel to the X-axis. The K value
corresponding to this point is the optimal value of K, i.e., the optimal number of clusters.
A short code sketch is given below.
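• A short scikit-learn sketch of the Elbow method described above (illustrative; it assumes X
is a 2-D NumPy array of features, and uses the inertia_ attribute of KMeans, which is the WCSS):

# Sketch of the Elbow method with scikit-learn (X is assumed to be the feature matrix).
from sklearn.cluster import KMeans
import matplotlib.pyplot as mtp

wcss = []
for k in range(1, 11):                      # vary K from 1 to 10
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)                # inertia_ = within-cluster sum of squares
mtp.plot(range(1, 11), wcss, marker='o')
mtp.title('The Elbow Method')
mtp.xlabel('Number of clusters (K)')
mtp.ylabel('WCSS')
mtp.show()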
Variations
▪ K-medoids: Similar problem definition as in K-means, but the centroid of the cluster
is defined to be one of the points in the cluster (the medoid).
▪ A medoid can be defined as a point in the cluster, whose dissimilarities with all the
other points in the cluster are minimum.
Hierarchical Clustering
Introduction
• Hierarchical clustering is another unsupervised machine learning algorithm used to group
unlabeled datasets into clusters; it is also known as Hierarchical Cluster Analysis (HCA).
• In this algorithm, we develop the hierarchy of clusters in the form of a tree, and this
tree-shaped structure is known as the dendrogram.
• A dendrogram is a diagram that shows the hierarchical relationship between objects.
• It is most commonly created as an output of hierarchical clustering.
• The main use of a dendrogram is to work out the best way to allocate objects to clusters.
• Hierarchical clustering algorithms group similar objects into groups called clusters.
Hierarchical Clustering: Introduction
• There are two types of hierarchical clustering algorithms:
• Agglomerative — Bottom up approach. Start with many small clusters and merge
them together to create bigger clusters.
▪ Start with the points as individual clusters
▪ At each step, merge the closest pair of clusters until only one cluster (or k clusters) left
• Divisive — Top down approach. Start with a single cluster, then break it up into
smaller clusters.
  ▪ Start with one, all-inclusive cluster
  ▪ At each step, split a cluster until each cluster contains a single point (or there are k clusters)
Divisive Clustering
• The divisive clustering algorithm is a top-down clustering approach: initially, all the
points in the dataset belong to one cluster, and splits are performed recursively as one
moves down the hierarchy.
• Steps of Divisive Clustering:
• Initially, all points in the dataset belong to one single cluster.
• Partition the cluster into the two least similar clusters.
• Proceed recursively to form new clusters until the desired number of clusters is obtained.
Agglomerative Clustering
▪ Produces a set of nested clusters organized as a
hierarchical tree
▪ Can be visualized as a dendrogram
▪ A tree like diagram that records the sequences of merge or splits
Strengths of Hierarchical Clustering
▪ Do not have to assume any particular number of clusters
▪ Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
Agglomerative Clustering
• Also known as bottom-up approach or Hierarchical Agglomerative
Clustering (HAC).
• This clustering algorithm does not require us to pre-specify the number of
clusters.
• Bottom-up algorithms treat each data point as a singleton cluster at the outset and then
successively agglomerate pairs of clusters until all clusters have been merged into a single
cluster that contains all the data.
• In other words, this algorithm considers each data point as a single cluster at the
beginning and then starts combining the closest pairs of clusters.
• It does this until all the clusters are merged into a single cluster that contains the
entire dataset.
Measure for the distance between two clusters
• Single Linkage: the distance between two clusters is defined as the minimum distance
between any pair of points, one from each cluster.
• Complete Linkage: the distance between two clusters is defined as the maximum distance
between any pair of points, one from each cluster.
• Average Linkage: the linkage method in which the distances between every pair of points
(one point from each cluster) are added up and divided by the total number of pairs, giving
the average distance between the two clusters. It is also one of the most popular linkage
methods.
• Centroid Linkage: the linkage method in which the distance between the centroids of the two
clusters is calculated.
Cluster Distance Measures
Example: Given a data set of five objects characterised by a single feature, assume that
there are two clusters: C1: {a, b} and C2: {c, d, e}.

Feature values:  a = 1, b = 2, c = 4, d = 5, e = 6

1. Calculate the distance matrix.

       a   b   c   d   e
   a   0   1   3   4   5
   b   1   0   2   3   4
   c   3   2   0   1   2
   d   4   3   1   0   1
   e   5   4   2   1   0

2. Calculate the three cluster distances between C1 and C2.

Single link:
  dist(C1, C2) = min{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)}
               = min{3, 4, 5, 2, 3, 4} = 2

Complete link:
  dist(C1, C2) = max{d(a,c), d(a,d), d(a,e), d(b,c), d(b,d), d(b,e)}
               = max{3, 4, 5, 2, 3, 4} = 5

Average link:
  dist(C1, C2) = [d(a,c) + d(a,d) + d(a,e) + d(b,c) + d(b,d) + d(b,e)] / 6
               = (3 + 4 + 5 + 2 + 3 + 4) / 6 = 21 / 6 = 3.5

(These values can be checked with the short code snippet below.)
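• A short scipy check of the three values above (illustrative; the feature values are treated
as 1-D points):

# Verifying the single, complete and average link distances between C1 and C2.
import numpy as np
from scipy.spatial.distance import cdist

C1 = np.array([[1], [2]])          # a, b
C2 = np.array([[4], [5], [6]])     # c, d, e
D = cdist(C1, C2)                  # all pairwise distances between the two clusters
print(D.min())    # single link   -> 2.0
print(D.max())    # complete link -> 5.0
print(D.mean())   # average link  -> 3.5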
Hierarchical Clustering: Problems and
Limitations
▪ Computational complexity in time and space
Customer Dataset

CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)  Cluster
    1          1     19         15                    39                3
    2          1     21         15                    81                4
    3          0     20         16                     6                3
    4          0     23         16                    77                4
    5          0     31         17                    40                3
    6          0     22         17                    76                4
    7          0     35         18                     6                3
    8          0     23         18                    94                4
    9          1     64         19                     3                3
   10          0     30         19                    72                4

Rows: 200 and Columns: 5
Implementation
• Step 1: Data Pre-processing:
• Importing the libraries
# Importing the libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
• The above lines of code import the libraries needed for specific tasks: numpy for
mathematical operations, matplotlib for drawing graphs or scatter plots, and pandas for
importing the dataset.
Implementation
• Importing the dataset
• # Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')
• Extracting the matrix of features
• Here we will extract only the matrix of features, as we don't have any further information
about a dependent variable.
x = dataset.iloc[:, [3, 4]].values
• Here we have extracted only columns 3 and 4 because we will use a 2D plot to see the
clusters. So we are considering the Annual Income and Spending Score as the matrix of
features.
Implementation
• Step-2: Finding the optimal number of clusters using the Dendrogram
• Now we will find the optimal number of clusters using the Dendrogram
for our model. For this, we are going to use scipy library as it provides a
function that will directly return the dendrogram for our code.
• #Finding the optimal number of clusters using the dendrogram
import scipy.cluster.hierarchy as shc
dendro = shc.dendrogram(shc.linkage(x, method="ward"))
mtp.title("Dendrogram Plot")
mtp.ylabel("Euclidean Distances")
mtp.xlabel("Customers")
mtp.show()
Implementation
• In the above lines of code, we have imported the hierarchy module
of scipy library.
• This module provides a method shc.dendrogram(), which takes the output of linkage() as a
parameter. The linkage function defines the distance between two clusters, so here we have
passed x (the matrix of features) and method="ward", a popular linkage method for
hierarchical clustering.
• Ward linkage minimizes the sum of squared differences within all clusters.
Implementation
• Output: Executing the above lines of code produces the dendrogram plot.
Implementation
• Using this dendrogram, we will now determine the optimal number of clusters for our model.
For this, we look for the longest vertical distance that can be drawn without crossing any
horizontal bar.
Implementation
• Among the vertical distances that do not cut any horizontal bar, the 4th distance appears
to be the largest, so according to this, the number of clusters will be 5 (the vertical
lines in this range).
Implementation
• Step-3: The hierarchical clustering model
• As we know the required optimal number of clusters, we can now
train our model.
• #training the hierarchical model on dataset
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
y_pred = hc.fit_predict(x)
Implementation
• In the code, we have imported the AgglomerativeClustering class of cluster
module of scikit learn library.
• Then we have created the object of this class named as hc.
• The AgglomerativeClustering class takes the following parameters:
• n_clusters=5: It defines the number of clusters, and we have taken here 5
because it is the optimal number of clusters.
• affinity='euclidean': the metric used to compute the linkage (note: in recent scikit-learn
versions this parameter has been renamed to metric, and with linkage='ward' it must be
Euclidean).
• linkage='ward': the linkage criterion; here we have used the "ward" linkage, the popular
linkage method that we already used for creating the dendrogram.
• In the last line, we created the variable y_pred by fitting the model. fit_predict not only
trains the model but also returns the cluster to which each data point belongs.
Implementation
• Step-4: Visualizing the clusters
• As we have trained our model successfully, now we can visualize the clusters
corresponding to the dataset.
• #visualizing the clusters
mtp.scatter(x[y_pred == 0, 0], x[y_pred == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')
mtp.scatter(x[y_pred == 1, 0], x[y_pred == 1, 1], s = 100, c = 'green', label = 'Cluster 2')
mtp.scatter(x[y_pred == 2, 0], x[y_pred == 2, 1], s = 100, c = 'red', label = 'Cluster 3')
mtp.scatter(x[y_pred == 3, 0], x[y_pred == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')
mtp.scatter(x[y_pred == 4, 0], x[y_pred == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
Implementation
• Output: Executing the above lines of code produces a scatter plot of the five customer
clusters.
Density based Clustering
• Why do we need a Density-Based clustering algorithm like DBSCAN when we already have
K-means clustering?
• K-Means clustering may cluster loosely related observations together and every observation
becomes a part of some cluster eventually, even if the observations are scattered far away in the
vector space.
• Since clusters depend on the mean value of cluster elements, each data point plays a role in
forming the clusters and a slight change in data points might affect the clustering outcome.
• This problem is greatly reduced in DBSCAN due to the way clusters are formed.
• Another challenge with K-Means is that you need to specify the number of clusters ("k") in
order to use it, and much of the time we won't know a reasonable value of k a priori.
• What's nice about DBSCAN is that you don't have to specify the number of clusters to use it.
• All you need is a function to calculate the distance between values and some guidance on
what amount of distance is considered "close".
• DBSCAN also produces more reasonable results than K-Means across a variety of different
distributions.
▪ Partitioning methods (K-means) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters.
• In other words, they are suitable only for compact and well-separated clusters.
Moreover, they are also severely affected by the presence of noise and outliers
in the data.
The DBSCAN algorithm requires two parameters:
1. eps: It defines the neighborhood around a data point, i.e., if the distance between two
points is lower than or equal to 'eps' then they are considered neighbors. If the eps value
is chosen too small, a large part of the data will be considered outliers. If it is chosen
very large, clusters will merge and the majority of the data points will end up in the same
cluster.
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger
the dataset, the larger the value of MinPts that should be chosen. As a general rule, MinPts
can be derived from the number of dimensions D in the dataset as MinPts >= D + 1, and it
should be at least 3.
▪ DBSCAN is a Density-Based Clustering algorithm
▪ Reminder: In density based clustering we partition points into dense
regions separated by not-so-dense regions.
▪ Important Questions:
▪ How do we measure density?
▪ What is a dense region?
▪ DBSCAN:
▪ Density at point p: number of points within a circle of radius Eps
▪ Dense Region: A circle of radius Eps that contains at least MinPts
points
▪ In this algorithm, we have 3 types of data points.
Core Point: A point is a core point if it has more than MinPts points
within eps.
Border Point: A point which has fewer than MinPts within eps but it is
in the neighborhood of a core point.
Noise or outlier: A point which is not a core point or border point.
▪ Characterization of points
▪ A point is a core point if it has more than a specified number
of points (MinPts) within Eps
▪ These points belong in a dense region and are at the interior of a
cluster
▪ Label points as core, border and noise
▪ Eliminate noise points
▪ For every core point p that has not been
assigned to a cluster
▪ Create a new cluster with the point p and all the
points that are density-connected to p.
▪ Assign border points to the cluster of
the closest core point.
DBSCAN algorithm can be abstracted in the
following steps:
• The algorithm proceeds by arbitrarily picking a point in the dataset (until all points
have been visited).
• If there are at least 'MinPts' points within a radius of 'eps' of the point, then we
consider all these points to be part of the same cluster.
• The clusters are then expanded by recursively repeating the neighborhood calculation for
each neighboring point.
DBSCAN algorithm can be abstracted in the
following steps:
• Find all the neighbor points within eps and identify the core points, i.e., the points
that have more than MinPts neighbors.
• For each core point, if it is not already assigned to a cluster, create a new cluster.
• Recursively find all its density-connected points and assign them to the same cluster as
the core point.
  Two points a and b are said to be density connected if there exists a point c which has a
sufficient number of points in its neighborhood and both a and b can be reached from c
through points within the eps distance. This is a chaining process: if b is a neighbor of c,
c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is
connected to a.
• Iterate through the remaining unvisited points in the dataset; points that do not belong
to any cluster are noise. A small scikit-learn example is given after these steps.
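• A small scikit-learn example of the algorithm described above (illustrative, not from the
slides); the synthetic dataset and parameter values are assumptions chosen for demonstration:

# Illustrative DBSCAN example on a synthetic "two moons" dataset.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)   # two crescent-shaped clusters
db = DBSCAN(eps=0.2, min_samples=5)        # eps and MinPts as described above
labels = db.fit_predict(X)                 # label -1 marks noise/outlier points
print(np.unique(labels))                   # typically two clusters (0, 1), plus -1 if noise is found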
▪ Idea is that for points in a cluster, their kth nearest neighbors are
at roughly the same distance
▪ Noise points have the kth nearest neighbor at farther distance
▪ So, plot sorted distance of every point to its kth nearest neighbor
▪ Find the distance d where there is a “knee” in the curve
▪ Eps = d, MinPts = k
[Figure: sorted k-distance plot; with MinPts = 4 (k = 4), the knee suggests Eps ≈ 7-10 for the
example dataset.]
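• A sketch of how such a k-distance plot can be produced with scikit-learn (illustrative;
assumes X is the data matrix and k = MinPts):

# Sorted k-th nearest-neighbor distance plot used to pick Eps.
import numpy as np
import matplotlib.pyplot as mtp
from sklearn.neighbors import NearestNeighbors

k = 4                                                # MinPts
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1 because each point is its own nearest neighbor
distances, _ = nbrs.kneighbors(X)
kth_dist = np.sort(distances[:, -1])                 # distance of every point to its k-th nearest neighbor
mtp.plot(kth_dist)
mtp.xlabel('Points sorted by distance')
mtp.ylabel('Distance to k-th nearest neighbor')
mtp.show()                                           # read Eps off at the "knee" of this curve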
When DBSCAN works well:
• Resistant to noise
• Can handle clusters of different shapes and sizes
When DBSCAN does not work well (e.g., MinPts = 4 with Eps = 9.75 or Eps = 9.92 on the same data):
• Varying densities
• High-dimensional data
Other Clustering Algorithms
▪ PAM, CLARANS: solutions for the k-medoids problem.
▪ BIRCH: constructs a hierarchical tree that acts as a summary of the data, and then clusters
the leaves.
▪ MST: clustering using the Minimum Spanning Tree.
▪ ROCK: clusters categorical data by neighbor and link analysis.
▪ LIMBO, COOLCAT: cluster categorical data using information-theoretic tools.
▪ CURE: a hierarchical algorithm that uses a different representation of the cluster.
▪ CHAMELEON: a hierarchical algorithm that uses closeness and interconnectivity for merging.
K-Medoids
• K-Medoids is an unsupervised method for clustering unlabelled data. It is an improved
version of the K-Means algorithm, mainly designed to deal with K-Means' sensitivity to
outliers.
• Compared to other partitioning algorithms, the algorithm is simple, fast, and easy to
implement.
• A medoid can be defined as a point in the cluster whose dissimilarity to all the other
points in the cluster is minimum.
• The dissimilarity of the medoid (Ci) and an object (Pi) is calculated using E = |Pi - Ci|.
• Medoid: a point in the cluster from which the sum of distances to the other data points is
minimal.
• Instead of the centroids used as reference points in K-Means, the K-Medoids algorithm takes
a medoid as the reference point.
• The partitioning is carried out such that:
  • Each cluster has at least one object
  • Each object belongs to exactly one cluster
K-Medoids: Algorithm
• Given the value of k and unlabelled data:
• Choose k random points from the data and assign these k points to k clusters. These are the
initial medoids.
• For all the remaining data points, calculate the distance from each medoid and assign each
point to the cluster with the nearest medoid.
• Calculate the total cost (the sum of the distances of all the data points from their
medoids).
• Select a random point as a new medoid and swap it with a previous medoid, then repeat
steps 2 and 3.
• If the total cost with the new medoid is less than that with the previous medoid, make the
new medoid permanent and repeat step 4.
• If the total cost with the new medoid is greater than the cost with the previous medoid,
undo the swap and repeat step 4.
• The repetitions continue until the medoids (and the resulting assignment of data points) no
longer change. A simplified code sketch is given after this list.
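• A simplified NumPy sketch of the swap-based procedure above (our own illustration, not code
from the slides); it uses Manhattan distance and random candidate swaps:

# Simplified sketch of swap-based K-Medoids with Manhattan distance.
import numpy as np

def total_cost(X, medoid_idx):
    # Sum of Manhattan distances from every point to its nearest medoid
    d = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

def k_medoids(X, k, n_swaps=200, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # initial random medoids
    cost = total_cost(X, medoids)
    for _ in range(n_swaps):
        m = rng.integers(k)                    # medoid position to replace
        candidate = int(rng.integers(len(X)))  # random candidate point
        if candidate in medoids:
            continue
        trial = medoids.copy()
        trial[m] = candidate
        trial_cost = total_cost(X, trial)
        if trial_cost < cost:                  # keep the swap only if the total cost decreases
            medoids, cost = trial, trial_cost
    labels = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2).argmin(axis=1)
    return np.array(medoids), labels, cost

X = np.array([[9, 6], [10, 4], [4, 4], [5, 8], [3, 8], [2, 5],
              [8, 5], [4, 6], [8, 4], [9, 3]], dtype=float)
medoids, labels, cost = k_medoids(X, k=2)
print(X[medoids], cost)   # typically settles at medoids (8,4) and (4,6) with cost 19, as in the example below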
K-Medoids: Example
• Let's suppose we have the data given below, and we want to divide the data into two
clusters, i.e., k = 2.

S. No.    X    Y
  1       9    6
  2      10    4
  3       4    4
  4       5    8
  5       3    8
  6       2    5
  7       8    5
  8       4    6
  9       8    4
 10       9    3
K-Medoids: Example
• For step 1, let's pick two random medoids (as our k = 2). So we pick M1(8,4) and M2(4,6) as
our initial medoids.
• Let's calculate the distance of each data point from both medoids.

S. No.    X    Y    Distance from M1(8,4)    Distance from M2(4,6)
  1       9    6             3                         5
  2      10    4             2                         8
  3       4    4             4                         2
  4       5    8             7                         3
  5       3    8             9                         3
  6       2    5             7                         3
  7       8    5             1                         5
  8       4    6             -                         -
  9       8    4             -                         -
 10       9    3             2                         8

(The distances used here are Manhattan distances, |Δx| + |Δy|.)
K-Medoids: Example
• Each point is assigned to the cluster of the medoid whose distance to it is smaller.
• Points (1, 2, 7, 10) are assigned to M1(8,4) and points (3, 4, 5, 6) are assigned to M2(4,6).
• Therefore,
• Cost = (3 + 2 + 1 + 2) + (2 + 3 + 3 + 3)
•      = 8 + 11
•      = 19
• Now let us randomly select another medoid and swap it in. Let us check M1 = (8,5).
• The new medoids are M1(8,5) and M2(4,6).
K-Medoids: Example
• Therefore, the distance of each point from M1(8,5) and M2(4,6) is calculated as follows:

S. No.    X    Y    Distance from M1(8,5)    Distance from M2(4,6)
  1       9    6             2                         5
  2      10    4             3                         8
  3       4    4             5                         2
  4       5    8             6                         3
  5       3    8             8                         3
  6       2    5             6                         3
  7       8    5             -                         5
  8       4    6             -                         -
  9       8    4             1                         -
 10       9    3             3                         8
K-Medoids: Example
• With the new medoids, points 1, 2, 9 and 10 are nearest to M1(8,5), and points 3, 4, 5 and 6
are nearest to M2(4,6).
• New cost = (2 + 3 + 1 + 3) + (2 + 3 + 3 + 3) = 9 + 11 = 20.
• Since the new cost (20) is greater than the previous cost (19), the swap is undone.
• Our final medoids are therefore M1(8,4) and M2(4,6), and the two clusters are formed with
these medoids.
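• The two costs can be reproduced with a few lines of NumPy (illustrative):

# Quick check of the two total costs in this example (Manhattan distance).
import numpy as np
pts = np.array([[9, 6], [10, 4], [4, 4], [5, 8], [3, 8], [2, 5], [8, 5], [4, 6], [8, 4], [9, 3]])

def cost(medoids):
    d = np.abs(pts[:, None, :] - np.array(medoids)[None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

print(cost([[8, 4], [4, 6]]))   # 19 -> original medoids
print(cost([[8, 5], [4, 6]]))   # 20 -> after the swap, so the swap is undone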
K-Medoids: Example
• In the resulting plot, the orange dots represent the first cluster, the blue dots represent
the second cluster, and the triangles represent the medoids of the clusters.
Advantages and Disadvantages of using
K-Medoids:
• Advantages
• Deals with noise and outlier data effectively
• Easily implementable and simple to understand
• Faster compared to other partitioning algorithms
• Disadvantages:
• Not suitable for Clustering arbitrarily shaped groups of data points.
• As the initial medoids are chosen randomly, the results might vary based on
the choice in different runs.
Spectral Clustering
• Spectral clustering is an increasingly popular clustering algorithm that has performed
better than many traditional clustering algorithms in many cases.
• It treats each data point as a graph node and thus transforms the clustering problem into a
graph-partitioning problem.
• Spectral clustering is a technique with roots in graph theory, where the approach is used
to identify communities of nodes in a graph based on the edges connecting them. The method is
flexible and allows us to cluster non-graph data as well.
• Spectral clustering uses information from the eigenvalues (spectrum) of special matrices
built from the graph or the data set.
How to do Spectral Clustering?
• The three major steps involved in spectral clustering are:
constructing a similarity graph, projecting data onto a
lower-dimensional space, and clustering the data.
• Given a set of points S in a higher-dimensional space, it can be
elaborated as follows:
• 1. Form a distance matrix.
• 2. Transform the distance matrix into an affinity matrix A.
• 3. Compute the degree matrix D and the Laplacian matrix L = D - A.
• 4. Find the eigenvalues and eigenvectors of L.
• 5. Form a matrix whose columns are the eigenvectors corresponding to the k smallest
eigenvalues of L (equivalently, the k largest eigenvalues when a normalized affinity matrix
is used instead).
• 6. Normalize the rows of this matrix.
• 7. Cluster the data points, now represented as rows of this matrix, in the k-dimensional
space, e.g. with K-Means. A rough code sketch is given below.
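• A rough NumPy/scikit-learn sketch of these steps (illustrative; the RBF affinity, the gamma
value, and the use of K-Means in the final step are assumptions):

# Rough sketch of spectral clustering following the steps above.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, gamma=1.0):
    dist = cdist(X, X)                          # 1. distance matrix
    A = np.exp(-gamma * dist ** 2)              # 2. affinity matrix (RBF kernel)
    np.fill_diagonal(A, 0)
    D = np.diag(A.sum(axis=1))                  # 3. degree matrix and Laplacian L = D - A
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)        # 4. eigenvalues/eigenvectors of L (ascending order)
    U = eigvecs[:, :k]                          # 5. eigenvectors of the k smallest eigenvalues
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)   # 6. normalize the rows
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)   # 7. cluster in k-dim space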
Pros and Cons of Spectral Clustering
• Spectral clustering helps us overcome two major problems in clustering: the shape of the
clusters and the dependence on cluster centroids.
• The K-Means algorithm generally assumes that clusters are spherical or round, i.e., lie
within some radius of the cluster centroid.
• In K-Means, many iterations are required to determine the cluster centroids; in spectral
clustering, the clusters are not required to follow a fixed shape or pattern.
• Points that are far away but connected belong to the same cluster, while points that are
close to each other may belong to different clusters if they are not connected. This implies
that the algorithm can be effective for data of different shapes and sizes. The comparison
sketch below illustrates this.
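• A short comparison sketch on a non-convex "two moons" dataset (illustrative; the dataset and
parameters are assumptions): K-Means tends to cut each crescent in half, while spectral
clustering with a nearest-neighbors affinity graph usually recovers the two shapes.

# Illustrative comparison of K-Means and spectral clustering on non-spherical clusters.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, SpectralClustering

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                               n_neighbors=10, random_state=0).fit_predict(X)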
Spectral Clustering: Applications
• Spectral clustering has applications in many areas, including image segmentation,
educational data mining, speech separation, spectral clustering of protein sequences, and
text-image segmentation.
• Though spectral clustering is a technique based on graph theory, the approach is used to
identify communities of vertices in a graph based on the edges connecting them.
• This method is flexible and allows us to cluster non-graph data as well, either with or
without the original data.
THANK YOU