
Unsupervised Learning: Clustering
What will we learn in this
Session (Objective)
1 Introduction to Unsupervised Learning

2 K-Means Clustering

3 Hierarchical Clustering

4 DBSCAN Clustering

5 Sum-of-Squares
Outline

Introduction to Unsupervised Learning

Sum-of-Squares
K-Means Clustering

Hierarchical Clustering

DBSCAN Clustering

Hands-On
Unsupervised
Learning
Introduction
What is Unsupervised
Learning?
In unsupervised learning, only input data is provided in the dataset.
There are no labelled outputs to aim for. But it may be surprising to
know that it is still possible to find many interesting and complex
patterns hidden within data without any labels. The goal is to
capture interesting structure / information.
What is Clustering?

Clustering is the task of dividing data points into a number of groups such that points in the same group are more similar to one another than to points in other groups. The aim is to segregate groups with similar traits and assign them into clusters.
Difference Between
Clustering and Classification
The main difference between classification and clustering is that classification is a supervised learning technique in which predefined labels are assigned to instances based on their properties. Clustering, on the contrary, is used in unsupervised learning, where similar instances are grouped together based on their features or properties.
Application in Real-World
Problems

1. Customer Segmentation
2. Spam Email Identification
3. Fraud / Criminal Activity Identification
The Challenge of Unsupervised
Learning

1. The problem tends to be more subjective, and there is no simple goal for the analysis.
2. Unsupervised learning is often performed as part of an exploratory data analysis.
3. In unsupervised learning, there is no way to check our result because we don’t know the true answer.
Sum of Squares
Definition
The sum of squares is the sum of the squared variations, where a variation is the spread between an individual value and the mean.

To determine the sum of squares, the distance between each data point and the line of best fit is squared and then summed up. The line of best fit will minimize this value.
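
In symbols (a standard formulation consistent with the definition above), for n observations x_1, ..., x_n with mean \bar{x}:

SS = \sum_{i=1}^{n} (x_i - \bar{x})^2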
Key Takeaways

● The sum of squares measures the deviation of data points away from the
mean value.
● A higher sum-of-squares result indicates a large degree of variability
within the data set, while a lower result indicates that the data does not
vary considerably from the mean value.
Distance Function
Key Takeaways

● Euclidean Distance

● Manhattan Distance
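
For two points p and q with n coordinates, these distances are commonly written as:

d_Euclidean(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
d_Manhattan(p, q) = \sum_{i=1}^{n} |p_i - q_i|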
K-Means Clustering
How it works
(This section's slides illustrate the K-Means iterations with figures: choose k initial centroids, assign each point to its nearest centroid, recompute each centroid as the mean of its assigned points, and repeat until the assignments stop changing.)
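
As a rough, minimal sketch of the same loop (assuming NumPy, Euclidean distance, and that no cluster ever becomes empty; this is illustrative, not the notebook's implementation):

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    # Illustrative K-Means sketch.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iters):
        # Assignment step: each point joins the cluster of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving: converged
        centroids = new_centroids
    return labels, centroids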
Elbow Method
(This section's slides illustrate the elbow method with figures: plot the within-cluster sum of squares against the number of clusters k, and pick the k at the "elbow", where adding more clusters stops reducing the sum of squares substantially.)
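
A sketch of the elbow method (assuming scikit-learn and matplotlib; the exact plot in the slides may differ):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_elbow(X, k_max=10):
    # Fit K-Means for k = 1..k_max and record the inertia
    # (within-cluster sum of squared distances) for each k.
    ks = range(1, k_max + 1)
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]
    plt.plot(list(ks), inertias, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Within-cluster sum of squares (inertia)")
    plt.show()   # the "elbow" in this curve suggests a reasonable k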
Advantages of K-Means

• Relatively simple to implement
• Scales to large data sets
• Guarantees convergence
• Easily adapts to new examples
Disadvantages of K-Means

• Choosing k manually
• Being dependent on initial values
• Clustering outliers
• Scaling with number of dimensions
Hierarchical
Clustering
The number of clusters is not predetermined

There are two ways: Bottom up, or Top Down

● Agglomerative - Bottom-up approach. Start with many small clusters and merge them together to create bigger clusters.
● Divisive - Top-down approach. Start with a single cluster and then break it up into smaller clusters.
How it works
(Agglomerative)
We assign each point to an individual cluster in this technique. Suppose
there are 4 data points. We will assign each of these points to a cluster
and hence will have 4 clusters in the beginning:
How it works
(Agglomerative)
Then, at each iteration, we merge the closest pair of clusters and repeat
this step until only a single cluster is left:

We are merging (or adding) the clusters at each step, right? Hence, this
type of clustering is also known as additive hierarchical clustering.
How it works (Divisive)

Divisive hierarchical clustering works in the opposite way. Instead of starting with n
clusters (in case of n observations), we start with a single cluster and assign all the
points to that cluster.

So, it doesn’t matter if we have 10 or 1000 data points. All these points will belong to
the same cluster at the beginning:
How it works (Divisive)
Now, at each iteration, we split off the point farthest from the rest of the cluster and repeat this process until each cluster contains only a single point:

We are splitting (or dividing) the clusters at each step, hence the name
divisive hierarchical clustering. Agglomerative Clustering is widely used in the
industry.
How it works
We merge the most similar points or clusters in hierarchical clustering. Now
the question is – how do we decide which points are similar and which are
not? It’s one of the most important questions in clustering!

Here’s one way to calculate similarity – Take the distance between the
centroids of these clusters. The points having the least distance are referred
to as similar points and we can merge them. We can refer to this as a
distance-based algorithm as well (since we are calculating the distances
between the clusters).

In hierarchical clustering, we have a concept called a proximity matrix.


How it works
Create the proximity matrix using the Euclidean distance between every pair of points:
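
A minimal sketch of building such a proximity matrix (assuming SciPy; the example values are illustrative, not the ones from the slide):

import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[7.0], [10.0], [28.0], [20.0], [35.0]])   # 5 illustrative 1-D points
proximity = squareform(pdist(X, metric="euclidean"))    # symmetric n x n distance matrix
print(proximity)    # entry (i, j) is the Euclidean distance between points i and j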
How it works
Step 1: First, we assign all the points to an individual cluster:

Different colors here represent different clusters. You can see that we have 5
different clusters for the 5 points in our data.
How it works
Step 2: Next, we will look at the smallest distance in the proximity matrix and merge the points with the smallest distance. We then update the proximity matrix. Here, the smallest distance is 3, and hence we will merge points 1 and 2.

Let’s look at the updated clusters and accordingly update the proximity matrix:
How it works
Step 3: We will repeat step 2 until only a single cluster is left.

So, we will first look at the minimum distance in the proximity matrix and then merge the closest pair of clusters. We will
get the merged clusters as shown below after repeating these steps:
How it works

To get the number of clusters for hierarchical clustering, we make use of an


awesome concept called a Dendrogram.

A dendrogram is a tree-like diagram that records the sequences of merges or


splits.
How it works

We can clearly visualize the steps of hierarchical clustering. The greater the height of the vertical lines in the dendrogram, the greater the distance between those clusters.
How it works
Now, we can set a threshold distance and draw a horizontal line (Generally, we try to set the
threshold in such a way that it cuts the tallest vertical line). Let’s set this threshold as 12 and draw
a horizontal line:

The number of clusters will be the number of vertical lines which are being intersected by the line
drawn using the threshold.
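
A sketch of drawing the dendrogram and cutting it at a threshold (assuming SciPy and matplotlib, with X being a numeric data array as in the earlier sketch; the threshold of 12 follows the slide):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

Z = linkage(X, method="ward")                       # agglomerative merge history
dendrogram(Z)
plt.axhline(y=12, color="r", linestyle="--")        # horizontal threshold line
plt.show()

labels = fcluster(Z, t=12, criterion="distance")    # cluster labels below the threshold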
Linkage

● Single Linkage
The distance between two clusters is the shortest distance between two points in each cluster.

● Complete Linkage
The distance between two clusters is the longest distance between two points in each cluster.
Linkage

● Average Linkage
The distance between clusters is the average distance between two points in each cluster.

● Ward Linkage
The distance between clusters is the sum of squared differences within all clusters.
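
For reference, scikit-learn exposes these choices through the linkage parameter of AgglomerativeClustering (the use of scikit-learn here is an assumption; the notebook may do this differently):

from sklearn.cluster import AgglomerativeClustering

# linkage can be "ward", "complete", "average", or "single"
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)   # X: (n_samples, n_features) array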
Advantages of Hierarchical Clustering

No assumption of a particular number of clusters (unlike k-means)

Disadvantages of Hierarchical Clustering

Too slow for large data sets
DBSCAN
Density-Based Spatial Clustering of
Applications with Noise (DBSCAN)

The main concept of DBSCAN algorithm is to locate regions of high density


that are separated from one another by regions of low density. So, how do we
measure density of a region ?
Components
How it works

1. The algorithm proceeds by arbitrarily picking a point in the dataset (until all points have been visited).

2. If there are at least ‘minPts’ points within a radius of ‘ε’ of that point, we consider all of these points to be part of the same cluster.

3. The clusters are then expanded by recursively repeating the neighborhood calculation for each neighboring point.
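
A minimal DBSCAN sketch with scikit-learn (assumed here; eps corresponds to ε, min_samples to minPts, and the values shown are placeholders):

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)        # DBSCAN is distance-based, so scale features
db = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)
labels = db.labels_                                 # label -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)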
Parameter Estimation

● minPts must be at least 3. However, larger values are usually better for data sets with noise and will yield more significant clusters. As a rule of thumb, minPts = 2·dim can be used, but it may be necessary to choose larger values for very large data, for noisy data, or for data that contains many duplicates.
● ε: if ε is chosen much too small, a large part of the data will not be clustered; whereas for a too-high value of ε, clusters will merge and the majority of objects will end up in the same cluster. In general, small values of ε are preferable, and as a rule of thumb, only a small fraction of points should be within this distance of each other.
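
One common heuristic for picking ε (not shown in the slides, so treat it as an assumption) is the k-distance plot: sort each point's distance to its minPts-th nearest neighbour and look for the "knee":

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_pts = 5
nbrs = NearestNeighbors(n_neighbors=min_pts).fit(X_scaled)
distances, _ = nbrs.kneighbors(X_scaled)            # each row includes the point itself at distance 0
k_distances = np.sort(distances[:, -1])             # distance to the farthest of the min_pts neighbours
plt.plot(k_distances)
plt.ylabel("k-distance")
plt.show()                                          # the "knee" of this curve suggests a value for ε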
Advantages
- Is great at separating clusters of high density versus clusters of low
density within a given dataset
- Is great at handling outliers within the dataset

Disadvantages
- If the dataset contains clusters of varying density, DBSCAN fails to cluster the data points well, since the clustering depends on the ε and minPts parameters, which cannot be chosen separately for each cluster.
- If the data and features are not well understood by a domain expert, setting ε and minPts can be tricky and may require comparing several iterations with different values of ε and minPts.
Let’s go to the Notebook
