
Introduction to Machine Learning

Clustering

Waqar Aziz

Department of Electrical Engineering and Technology


Govt. College University Faisalabad

*Lecture notes adapted from a UK Data Service workshop


Outline

• What is clustering?
• Why bother with it?
• Types of clustering algorithms
• K-Means
• Hierarchical clustering
Recap

Supervised learning
• Input data is labelled
• Data is classified based on the training dataset
• Divided into Regression and Classification
• Used for prediction
• Algorithms include: decision trees, logistic regression, support vector machines
• A known number of classes

Unsupervised learning
• Input data is unlabelled
• Assigns properties to the given data in order to classify it
• Divided into Clustering and Association
• Used for analysis
• Algorithms include: k-means clustering, hierarchical clustering, the apriori algorithm
• An unknown number of classes
Recap (Cont’d)

Supervised learning: used for prediction

Dps  Sepal length (cm)  Petal length (cm)  Petal width (cm)  Species
A    3.5                1.4                0.2               Iris-Versicolour
B    3.2                5.7                2.3               Iris-Setosa
C    3.2                5.9                2.3               Iris-Setosa
D    2.9                4.7                1.4               Iris-Virginica
E    3.7                1.5                0.4               Iris-Versicolour
F    3.1                5.5                2.2               ?

Unsupervised learning: used for analysis

Dps  Sepal length (cm)  Petal length (cm)  Petal width (cm)
A    3.5                1.4                0.2
B    3.2                5.7                2.3
C    3.2                5.9                2.3
D    2.9                4.7                1.4
E    3.7                1.5                0.4

What is clustering?

“Clustering is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different.”

(Müller and Guido 2017)

Dps  Sepal length (cm)  Petal length (cm)  Petal width (cm)  Cluster
A    3.5                1.4                0.2               1
B    3.2                5.7                2.3               2
C    3.2                5.9                2.3               2
Why bother with it?

• It provides more information on the structure of the data → patterns
• It can help identify problems in the data, such as outliers
• It can be used to compress data
Other use cases

• Customer recommendation systems: “People who bought Harry Potter and the Philosopher’s Stone also bought The Hunger Games…”
• Grouping DNA sequences of different strains of HIV into families of genetically similar viruses
• Identifying fake news by clustering the words used in articles: certain words may appear more often in sensationalized click-bait articles
• And the more frivolous and fun side projects…


What is a cluster?

“There is no universal definition of what a cluster is: it really depends on the context, and different algorithms will capture different kinds of clusters.”

(Géron, 2019)
Types of clustering algorithms

• Centroid-based
• Density-based
• Distribution-based
• Hierarchical clustering

How do I know which type of algorithm is right for me?

EXPLORE YOUR DATA
K-Means clustering

• We want to separate our data points into k clusters
• First, we initialise the algorithm with k random points (our centroids)
• Then, we assign each data point to its nearest initialisation point, using the Euclidean distance
• Once each data point is assigned, we relocate the initialisation point to the mean of the data points that were assigned to it
• Repeat the highlighted steps until the assignment of data points to centroids remains unchanged
Introducing pseudocode…

[Slide shows the K-Means steps side by side in pseudo-English and as Python code.]
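As a rough stand-in for that slide, here is a minimal sketch of the K-Means loop described above in Python/NumPy. It assumes the data is a 2-D NumPy array X (one row per data point) and that k has already been chosen; the function and variable names are illustrative, not taken from the original slide.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise with k random data points as centroids (Forgy's method)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid using Euclidean distance
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Relocate each centroid to the mean of the points assigned to it
        # (for simplicity, this sketch assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids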
Initialisation – how do we select our centroids?

• Forgy’s method: choose k random data points from the dataset
• Random Partition method: randomly assign each data point to a cluster, then calculate the mean of each cluster to get the initial centroids
• K-means++: the first centroid is a random data point, but each remaining centroid is chosen to favour points with a large squared distance from the centroids already picked → the centroids are spread out evenly (a sketch follows below)
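A possible sketch of K-means++-style seeding, again assuming a 2-D NumPy array X. The probability-weighted choice shown here is the standard K-means++ formulation rather than anything reproduced from the slide.

import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Favour points far from the existing centroids (probability proportional to squared distance)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)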
How do we determine the number of clusters we want?

Sepal length (cm)  Petal length (cm)  Petal width (cm)
3.5                1.4                0.2
3.2                5.7                2.3
3.2                5.9                2.3
2.9                4.7                1.4
3.7                1.5                0.4

K = ?

Elbow plot

[Plot of SSE against the k value: the curve drops steeply at first and then flattens at the “elbow”.]

• Each time we increase the number of clusters → the SSE decreases
• Goal: select a small value of k that still has a low SSE
• The elbow represents where we start to have diminishing returns by increasing k (a code sketch for producing such a plot follows below)
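A minimal sketch of computing an elbow plot with scikit-learn, assuming the data is in a 2-D array X. The sum of squared errors (SSE) is exposed by the fitted model as inertia_.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=10):
    sse = []
    for k in range(1, k_max + 1):
        # inertia_ is the sum of squared distances of points to their centroid (SSE)
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse.append(model.inertia_)
    plt.plot(range(1, k_max + 1), sse, marker="o")
    plt.xlabel("k value")
    plt.ylabel("SSE")
    plt.show()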
What are the strengths?

• Easy to understand and implement
• Fast
• Scalable
What are the limitations?

• Choosing k manually – it’s a hassle! (the elbow method helps, but it is still a manual choice)
• It is dependent on initial values: it is necessary to run the algorithm several times to avoid suboptimal solutions, since it converges to a local minimum (a bad centroid initialisation leads to a suboptimal solution)
• Not good at clustering data of varying sizes, densities, or nonspherical shapes

[Figures: a bad centroid initialisation producing a suboptimal solution; clusters that differ in density, direction, and shape.]


Hierarchical clustering

“Hierarchical clustering algorithms […] approach the problem of clustering by developing a binary tree-based data structure called the dendrogram. Once the dendrogram is constructed, one can automatically choose the right number of clusters by splitting the tree at different levels to obtain different clustering solutions for the same dataset without rerunning the clustering algorithm again.”

(Reddy and Vinzamuri, 2015)


How do I read a dendrogram?

[Figure: a scatter plot of points A–E alongside the corresponding dendrogram. Branches join at the height where two clusters are merged; moving down the tree corresponds to increasing similarity.]
What are the 2 main approaches to hierarchical clustering?

1) Agglomerative (bottom-up): start with every data point as its own cluster and repeatedly merge the closest clusters:
   A, B, C, D, E → AB, C, D, E → AB, C, DE → ABC, DE → ABCDE

2) Divisive (top-down): start with a single cluster containing all the points and repeatedly split it:
   ABCDE → ABC, DE → AB, C, DE → A, B, C, D, E
Which clusters should be combined, or split?

1) Measure of distance – some measure of similarity

• Hierarchical clustering is proximity-based
• Affects the shape of the clusters
• Used to build the distance matrix
• The default is Euclidean distance, but other measures exist: correlation-based, Levenshtein distance, etc.

Example: for p = (3, 2) and q = (4, 1), the Euclidean distance is ED ≈ 1.414214.
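A quick check of that example in Python, using the point coordinates from the table above:

import numpy as np

p = np.array([3, 2])
q = np.array([4, 1])
# Euclidean distance: sqrt((3 - 4)**2 + (2 - 1)**2) = sqrt(2)
print(np.linalg.norm(p - q))  # 1.4142135623730951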
Which clusters should be combined, or split?

2) Linkage criterion – different ways to link clusters based on distance

• A means of determining whether certain clusters should be merged
• The default is complete-linkage
• Other commonly used linkage criteria: single-linkage, average-linkage
• Used to update the distance matrix and merge clusters
Agglomerative hierarchical clustering: Using complete-linkage
Step by step…

1) Load in dataset

Dps  Sepal length (cm)  Petal length (cm)
A    1                  1
B    1                  0
C    0                  2
D    2                  4
E    3                  5

[Scatter plot of the five points, with sepal length on the x-axis and petal length on the y-axis.]
Step by step…

2) Build distance matrix and identify smallest distance

     A    B    C    D    E
A    0    1    1.4  3.2  4.5
B    1    0    2.2  4.1  5.4
C    1.4  2.2  0    2.8  4.2
D    3.2  4.1  2.8  0    1.4
E    4.5  5.4  4.2  1.4  0
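As a sketch of how this distance matrix could be computed in Python (using the five points from step 1; the slide rounds the values to one decimal place):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# The five data points from step 1: (sepal length, petal length)
points = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]])

# Pairwise Euclidean distances, arranged as a symmetric 5 x 5 matrix
dist_matrix = squareform(pdist(points, metric="euclidean"))
print(np.round(dist_matrix, 1))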


Step by step…
3) Perform merge and update distance matrix

The smallest distance is d(A,B) = 1, so A and B are merged. Updated distance matrix:

     AB   C    D    E
AB   0
C    2.2  0
D    4.1  2.8  0
E    5.4  4.2  1.4  0

d[(A,B), C] = max {d(A,C), d(B,C)} = max {1.4, 2.2} = 2.2
d[(A,B), D] = max {d(A,D), d(B,D)} = max {3.2, 4.1} = 4.1
d[(A,B), E] = max {d(A,E), d(B,E)} = max {4.5, 5.4} = 5.4
Step by step…
Continue merging and updating the distance matrix…

Next, D and E are merged (smallest remaining distance, 1.4):

     AB   DE   C
AB   0
DE   5.4  0
C    2.2  4.2  0

d[(A,B), (D,E)] = max {d((A,B),D), d((A,B),E)} = max {4.1, 5.4} = 5.4
d[C, (D,E)] = max {d(C,D), d(C,E)} = max {2.8, 4.2} = 4.2

Then C is merged with AB (distance 2.2):

     ABC  DE
ABC  0
DE   5.4  0

d[(A,B,C), (D,E)] = max {d((A,B),(D,E)), d(C,(D,E))} = max {5.4, 4.2} = 5.4

Finally, ABC and DE are merged at distance 5.4, leaving a single cluster.
RESULT

• Dendrogram: the y-axis denotes when in the agglomerative algorithm two clusters get merged
• The y-axis also shows how far apart the merged clusters are → pay attention to the length of the branches
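For reference, a sketch of the same complete-linkage clustering and its dendrogram in Python with SciPy, reusing the five points from step 1 (labels and styling are illustrative):

import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

points = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]])

# Agglomerative clustering with the complete-linkage criterion
Z = linkage(points, method="complete", metric="euclidean")

# The dendrogram's y-axis shows the distance at which each pair of clusters was merged
dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.ylabel("Distance")
plt.show()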
What are the strengths?

• Easy to understand and implement
• Most appealing output
• Can handle non-convex clusters
• No need to specify the number of clusters (k)!
What are the limitations?

• Mathematically simple… but computationally expensive!
• Hard to visualize results with a large dataset
• Heavily driven by heuristics and arbitrary decisions
• The algorithm can’t undo a previous step

K-Means vs Hierarchical clustering

Time complexity
• K-Means: O(n)
• Hierarchical clustering: O(n²)

Hyperparameter tuning
• K-Means: must specify the number of clusters (k) and retrain the model for each k
• Hierarchical clustering: no need to specify a k value; the tree can be split wherever desired

Data structure
• K-Means: better performance when dealing with convex clusters
• Hierarchical clustering: generates better results when dealing with non-convex clusters

Types/variations
• K-Means: many variations (e.g., K-median, K-medoid)
• Hierarchical clustering: two approaches, Agglomerative and Divisive

Result robustness
• K-Means: the result may be different on different runs
• Hierarchical clustering: the same parameters generate the same result every time
