
Unsupervised Learning: Clustering
What will we learn in this
Session (Objective)
1 Introduction to Unsupervised Learning

2 K-Means Clustering

3 Hierarchical Clustering

4 DBSCAN Clustering

5 Sum-of-Squares
Outline

Introduction to Unsupervised Learning

Sum-of-Squares
K-Means Clustering

Hierarchical Clustering

DBSCAN Clustering

Hands-On
Unsupervised
Learning
Introduction
What is Unsupervised
Learning?
In unsupervised learning, only input data is provided in the dataset.
There are no labelled outputs to aim for. But it may be surprising to
know that it is still possible to find many interesting and complex
patterns hidden within data without any labels. The goal is to
capture interesting structure / information.
What is Clustering?

Clustering is the task of dividing data points into a number of groups such that points in the same group are more similar to one another than to points in other groups. The aim is to segregate groups with similar traits and assign them into clusters.
Difference Between
Clustering and Classification
The main difference between classification and clustering is that classification is a supervised learning technique in which predefined labels are assigned to instances based on their properties. Clustering, on the contrary, is used in unsupervised learning, where similar instances are grouped together based on their features or properties.
Application in Real-World
Problems

1. Customer Segmentation
2. Spam Email Identification
3. Fraud / Criminal Activity Identification
The Challenge of Unsupervised
Learning

1. The problem tends to be more subjective, and there is no simple goal for the analysis.
2. Unsupervised learning is often performed as part of an exploratory data analysis.
3. In unsupervised learning, there is no way to check our result because we don’t know the true answer.
Sum of Squares
Definition
The sum of squares is the sum of the squared variations, where a variation is the spread between an individual value and the mean.

To determine the sum of squares, the distance between each data point and the line of best fit is squared and then summed up. The line of best fit will minimize this value.
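
In symbols (a standard formulation consistent with the definition above), for n observations x_1, ..., x_n with mean \bar{x}:

SS = \sum_{i=1}^{n} (x_i - \bar{x})^2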
Key Takeaways

● The sum of squares measures the deviation of data points away from the
mean value.
● A higher sum-of-squares result indicates a large degree of variability
within the data set, while a lower result indicates that the data does not
vary considerably from the mean value.
Distance Function
Key Takeaways

● Euclidean Distance

● Manhattan Distance
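
For two points p and q with n coordinates, these distances are commonly written as:

d_Euclidean(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
d_Manhattan(p, q) = \sum_{i=1}^{n} |p_i - q_i|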
K-Means Clustering
How it works
(This section's slides illustrate the K-Means iterations with figures: choose k initial centroids, assign each point to its nearest centroid, recompute each centroid as the mean of its assigned points, and repeat until the assignments stop changing.)
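
As a rough, minimal sketch of the same loop (assuming NumPy, Euclidean distance, and that no cluster ever becomes empty; this is illustrative, not the notebook's implementation):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Illustrative K-Means sketch.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids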
Elbow Method
(This section's slides illustrate the elbow method with figures: plot the within-cluster sum of squares against the number of clusters k, and pick the k at the "elbow", where adding more clusters stops reducing the sum of squares substantially.)
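
A sketch of the elbow method (assuming scikit-learn and matplotlib; the exact plot in the slides may differ):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_max=10):
    # Fit K-Means for k = 1..k_max and record the inertia
    # (within-cluster sum of squared distances) for each k.
    ks = range(1, k_max + 1)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]
    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Within-cluster sum of squares (inertia)")
    plt.show()   # the "elbow" in this curve suggests a reasonable k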
Advantages of K-Means

• Relatively simple to implement
• Scales to large data sets
• Guarantees convergence
• Easily adapts to new examples
Disadvantages of K-Means

• Choosing k manually
• Being dependent on initial values
• Clustering outliers
• Scaling with number of dimensions
Hierarchical
Clustering
The number of clusters is not predetermined

There are two ways: Bottom up, or Top Down

● Agglomerative - Bottom-up approach. Start with many small clusters and merge them together to create bigger clusters.
● Divisive - Top-down approach. Start with a single cluster and then break it up into smaller clusters.
How it works
(Agglomerative)
We assign each point to an individual cluster in this technique. Suppose
there are 4 data points. We will assign each of these points to a cluster
and hence will have 4 clusters in the beginning:
How it works
(Agglomerative)
Then, at each iteration, we merge the closest pair of clusters and repeat
this step until only a single cluster is left:

We are merging (or adding) the clusters at each step, right? Hence, this
type of clustering is also known as additive hierarchical clustering.
How it works (Divisive)

Divisive hierarchical clustering works in the opposite way. Instead of starting with n
clusters (in case of n observations), we start with a single cluster and assign all the
points to that cluster.

So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong to
the same cluster at the beginning:
How it works (Divisive)
Now, at each iteration, we split off the point farthest from the rest of the cluster and repeat this process until each cluster contains only a single point:

We are splitting (or dividing) the clusters at each step, hence the name
divisive hierarchical clustering. Agglomerative Clustering is widely used in the
industry.
How it works
We merge the most similar points or clusters in hierarchical clustering. Now
the question is – how do we decide which points are similar and which are
not? It’s one of the most important questions in clustering!

Here’s one way to calculate similarity – Take the distance between the
centroids of these clusters. The points having the least distance are referred
to as similar points and we can merge them. We can refer to this as a
distance-based algorithm as well (since we are calculating the distances
between the clusters).

In hierarchical clustering, we have a concept called a proximity matrix.


How it works
Create the proximity matrix using the Euclidean distance between every pair of points:
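
A minimal sketch of building such a proximity matrix (assuming SciPy; the example values are illustrative, not the ones from the slide):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[7.0], [10.0], [28.0], [20.0], [35.0]])   # 5 illustrative 1-D points
proximity = squareform(pdist(X, metric="euclidean"))    # symmetric n x n distance matrix
print(proximity)    # entry (i, j) is the Euclidean distance between points i and j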
How it works
Step 1: First, we assign all the points to an individual cluster:

Different colors here represent different clusters. You can see that we have 5
different clusters for the 5 points in our data.
How it works
Step 2: Next, we will look at the smallest distance in the proximity matrix and merge the points with the smallest distance. We then update the proximity matrix. Here, the smallest distance is 3, and hence we will merge points 1 and 2.

Let’s look at the updated clusters and accordingly update the proximity matrix:
How it works
Step 3: We will repeat step 2 until only a single cluster is left.

So, we will first look at the minimum distance in the proximity matrix and then merge the closest pair of clusters. We will
get the merged clusters as shown below after repeating these steps:
How it works

To get the number of clusters for hierarchical clustering, we make use of an


awesome concept called a Dendrogram.

A dendrogram is a tree-like diagram that records the sequences of merges or


splits.
How it works

We can clearly visualize the steps of hierarchical clustering. The greater the height of the vertical lines in the dendrogram, the greater the distance between those clusters.
How it works
Now, we can set a threshold distance and draw a horizontal line (Generally, we try to set the
threshold in such a way that it cuts the tallest vertical line). Let’s set this threshold as 12 and draw
a horizontal line:

The number of clusters will be the number of vertical lines which are being intersected by the line
drawn using the threshold.
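
A sketch of drawing the dendrogram and cutting it at a threshold (assuming SciPy and matplotlib, with X being a numeric data array as in the earlier sketch; the threshold of 12 follows the slide):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

Z = linkage(X, method="ward")                       # agglomerative merge history
dendrogram(Z)
plt.axhline(y=12, color="r", linestyle="--")        # horizontal threshold line
plt.show()

labels = fcluster(Z, t=12, criterion="distance")    # cluster labels below the threshold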
Linkage

● Single Linkage
The distance between two clusters is the shortest distance between two points in each cluster.

● Complete Linkage
The distance between two clusters is the longest distance between two points in each cluster.
Linkage

● Average Linkage
The distance between clusters is the average distance between two points in each cluster.

● Ward Linkage
The distance between clusters is the sum of squared differences within all clusters.
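
For reference, scikit-learn exposes these choices through the linkage parameter of AgglomerativeClustering (the use of scikit-learn here is an assumption; the notebook may do this differently):

from sklearn.cluster import AgglomerativeClustering

# linkage can be "ward", "complete", "average", or "single"
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)   # X: (n_samples, n_features) array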
Advantages of Hierarchical Clustering

No assumption of a particular number of clusters (unlike k-means)

Disadvantages of Hierarchical Clustering

Too slow for large data sets
DBSCAN
Density-Based Spatial Clustering of
Applications with Noise (DBSCAN)

The main concept of DBSCAN algorithm is to locate regions of high density


that are separated from one another by regions of low density. So, how do we
measure density of a region ?
Components
How it works

1. The algorithm proceeds by arbitrarily picking a point in the dataset (until all points have been visited).

2. If there are at least ‘minPts’ points within a radius of ‘ε’ of that point, we consider all of these points to be part of the same cluster.

3. The clusters are then expanded by recursively repeating the neighborhood calculation for each neighboring point.
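
A minimal DBSCAN sketch with scikit-learn (assumed here; eps corresponds to ε, min_samples to minPts, and the values shown are placeholders):

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)        # DBSCAN is distance-based, so scale features
db = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)
labels = db.labels_                                 # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)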
Parameter Estimation

● minPts must be at least 3. However, larger values are usually better for data sets with noise and will yield more significant clusters. As a rule of thumb, minPts = 2·dim can be used, but it may be necessary to choose larger values for very large data, for noisy data, or for data that contains many duplicates.
● ε: if ε is chosen much too small, a large part of the data will not be clustered; whereas for a too-high value of ε, clusters will merge and the majority of objects will end up in the same cluster. In general, small values of ε are preferable, and as a rule of thumb, only a small fraction of points should be within this distance of each other.
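
One common heuristic for picking ε (not shown in the slides, so treat it as an assumption) is the k-distance plot: sort each point's distance to its minPts-th nearest neighbour and look for the "knee":

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_pts = 5
nbrs = NearestNeighbors(n_neighbors=min_pts).fit(X_scaled)
distances, _ = nbrs.kneighbors(X_scaled)            # each row includes the point itself at distance 0
k_distances = np.sort(distances[:, -1])             # distance to the farthest of the min_pts neighbours
plt.plot(k_distances)
plt.ylabel("k-distance")
plt.show()                                          # the "knee" of this curve suggests a value for ε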
Advantages
- Is great at separating clusters of high density versus clusters of low
density within a given dataset
- Is great at handling outliers within the dataset

Disadvantages
- If the dataset contains clusters of varying density, DBSCAN fails to cluster the data points well, since the clustering depends on the ε and minPts parameters, which cannot be chosen separately for each cluster.
- If the data and features are not well understood by a domain expert, setting ε and minPts can be tricky and may require comparing several iterations with different values of ε and minPts.
Let’s go to the Notebook
