Clustering
What will we learn in this
Session (Objective)
1 Introduction to Unsupervised Learning
2 K-Means Clustering
3 Hierarchical Clustering
4 DBSCAN Clustering
5 Sum-of-Squares
Outline
Sum-of-Squares
K-Means Clustering
Hierarchical Clustering
DBSCAN Clustering
Hands-On
Unsupervised
Learning
Introduction
What is Unsupervised
Learning?
In unsupervised learning, only input data is provided in the dataset.
There are no labelled outputs to aim for. But it may be surprising to
know that it is still possible to find many interesting and complex
patterns hidden within data without any labels. The goal is to
capture interesting structure / information.
What is Clustering?
1. Customer Segmentation
2. Spam Email Identification
3. Fraud / Criminal Activity Identification
The Challenge of Unsupervised
Learning
To determine the sum of squares, the distance between each data point and
the line of best fit is squared and then summed up. The line of best fit will
minimize this value.
Key Takeaways
● The sum of squares measures the deviation of data points away from the
mean value.
● A higher sum-of-squares result indicates a large degree of variability
within the data set, while a lower result indicates that the data does not
vary considerably from the mean value.
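In symbols, for observations $x_1, \dots, x_n$ with mean $\bar{x}$, this is the standard total sum of squares:

$$SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$$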
Distance Function
Key Takeaways
● Euclidean Distance
● Manhattan Distance
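As a quick sketch of both metrics in Python (assuming NumPy; the sample points are made up for illustration):

import numpy as np

def euclidean_distance(a, b):
    # sqrt of the sum of squared coordinate differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan_distance(a, b):
    # sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0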
K-Means Clustering
How it works
(Step-by-step illustration of the K-Means iterations.)
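K-Means iterates as follows: pick k initial centroids, assign every point to its nearest centroid, recompute each centroid as the mean of its cluster, and repeat until the assignments stop changing. A minimal sketch using scikit-learn's KMeans (the toy data is made up; the hands-on notebook may use a different dataset):

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two loose groups of points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]], dtype=float)

# k must be chosen up front; n_init restarts with different initial centroids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index for each point
print(kmeans.cluster_centers_)  # final centroids
print(kmeans.inertia_)          # within-cluster sum of squares (WCSS)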
Elbow Method
(Illustration: the within-cluster sum of squares plotted against the number of clusters k.)
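A minimal sketch of the elbow method (assuming scikit-learn and matplotlib; the synthetic data is only for illustration): fit K-Means for a range of k values, record the WCSS (inertia) for each, and look for the "elbow" where adding more clusters stops paying off.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # WCSS for this value of k

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Within-cluster sum of squares")
plt.title("Elbow Method")
plt.show()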
Disadvantages of K-Means
• Choosing k manually
• Being dependent on initial values
• Clustering outliers
• Scaling with number of dimensions
Hierarchical
Clustering
The number of clusters is not predetermined
We are merging (or adding) clusters at each step, right? Hence, this type of
clustering is also known as agglomerative (additive) hierarchical clustering.
How it works (Divisive)
Divisive hierarchical clustering works in the opposite way. Instead of starting with n
clusters (in case of n observations), we start with a single cluster and assign all the
points to that cluster.
So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong to
the same cluster at the beginning:
How it works (Divisive)
Now, at each iteration, we split the farthest point in the cluster and repeat
this process until each cluster only contains a single point:
We are splitting (or dividing) the clusters at each step, hence the name
divisive hierarchical clustering. Agglomerative Clustering is widely used in the
industry.
How it works
We merge the most similar points or clusters in hierarchical clustering. Now
the question is – how do we decide which points are similar and which are
not? It’s one of the most important questions in clustering!
Here’s one way to calculate similarity – Take the distance between the
centroids of these clusters. The points having the least distance are referred
to as similar points and we can merge them. We can refer to this as a
distance-based algorithm as well (since we are calculating the distances
between the clusters).
Euclidean Distance
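For two points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$, the Euclidean distance is:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$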
How it works
Step 1: First, we assign all the points to an individual cluster:
Different colors here represent different clusters. You can see that we have 5
different clusters for the 5 points in our data.
How it works
Step 2: Next, we look at the smallest distance in the proximity matrix and merge the
points with the smallest distance. We then update the proximity matrix. Here, the
smallest distance is 3, so we merge points 1 and 2:
So, we will first look at the minimum distance in the proximity matrix and then merge the closest pair of clusters. We will
get the merged clusters as shown below after repeating these steps:
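As a sketch of these merge steps in code (assuming SciPy; the five values are chosen so that the smallest gap is 3, matching the example above):

import numpy as np
from scipy.cluster.hierarchy import linkage

# Five 1-D observations; points 1 and 2 (10 and 7) are the closest pair
X = np.array([[10.0], [7.0], [28.0], [20.0], [35.0]])

# Agglomerative clustering: repeatedly merge the two closest clusters
Z = linkage(X, method="ward")
print(Z)  # each row: the two clusters merged, their distance, and the new cluster size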
How it works
The dendrogram lets us clearly visualize the steps of hierarchical clustering: the taller the vertical line in the
dendrogram, the greater the distance between the clusters it joins.
How it works
Now, we can set a threshold distance and draw a horizontal line (Generally, we try to set the
threshold in such a way that it cuts the tallest vertical line). Let’s set this threshold as 12 and draw
a horizontal line:
The number of clusters will be the number of vertical lines which are being intersected by the line
drawn using the threshold.
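A sketch of drawing the dendrogram and cutting it at a distance threshold (assuming SciPy and matplotlib; the data and the threshold of 12 are only illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[10.0], [7.0], [28.0], [20.0], [35.0]])
Z = linkage(X, method="ward")

dendrogram(Z)
plt.axhline(y=12, color="r", linestyle="--")  # the horizontal threshold line
plt.ylabel("Distance")
plt.show()

# Every merge above the threshold is cut; the groups left below it are the clusters
labels = fcluster(Z, t=12, criterion="distance")
print(labels)  # cluster id for each of the 5 points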
Linkage
● Complete Linkage: the distance between two clusters is the longest distance between two points in each cluster.
● Single Linkage: the distance between two clusters is the shortest distance between two points in each cluster.
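In symbols, for clusters $A$ and $B$ and a point-to-point distance $d$:

$$d_{\text{single}}(A, B) = \min_{a \in A,\, b \in B} d(a, b) \qquad d_{\text{complete}}(A, B) = \max_{a \in A,\, b \in B} d(a, b)$$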
DBSCAN Clustering
2. If there are at least 'MinPts' points within a radius of 'ε' of the point,
then we consider all these points to be part of the same cluster.
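A minimal sketch with scikit-learn's DBSCAN, where eps corresponds to ε and min_samples to MinPts (the dataset and parameter values are only illustrative):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape K-Means struggles with but DBSCAN handles well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids; -1 marks points treated as noise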
Disadvantages
- If the dataset has data points that form clusters of varying density, DBSCAN
fails to cluster them well: the clustering depends on the ϵ and MinPts
parameters, which cannot be chosen separately for each cluster.
- If the data and features are not well understood by a domain expert, setting
ϵ and MinPts can be tricky and may require comparing several iterations with
different values of ϵ and MinPts.
Let's go to the
Notebook