
Introduction to Machine Learning

Clustering

Waqar Aziz

Department of Electrical Engineering and Technology


Govt. College University Faisalabad

*Lecture notes adapted from a UK Data Service workshop


Outline

• What is clustering?
• Why bother with it?
• Types of clustering algorithms
• K-Means
• Hierarchical clustering
Recap

Supervised learning
• Input data is labelled
• Data is classified based on the training dataset
• Divided into Regression and Classification
• Used for prediction
• Algorithms include: decision trees, logistic regression, support vector machines
• A known number of classes

Unsupervised learning
• Input data is unlabelled
• Assigns properties to the given data in order to classify it
• Divided into Clustering and Association
• Used for analysis
• Algorithms include: k-means clustering, hierarchical clustering, the apriori algorithm
• An unknown number of classes
Recap (Cont’d)

Supervised learning: used for prediction

Dps  Sepal length (cm)  Petal length (cm)  Petal width (cm)  Species
A    3.5                1.4                0.2               Iris-Versicolour
B    3.2                5.7                2.3               Iris-Setosa
C    3.2                5.9                2.3               Iris-Setosa
D    2.9                4.7                1.4               Iris-Virginica
E    3.7                1.5                0.4               Iris-Versicolour
F    3.1                5.5                2.2               ?

Unsupervised learning: used for analysis

Dps  Sepal length (cm)  Petal length (cm)  Petal width (cm)
A    3.5                1.4                0.2
B    3.2                5.7                2.3
C    3.2                5.9                2.3
D    2.9                4.7                1.4
E    3.7                1.5                0.4

What is clustering?

“Clustering is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in such a way that points within a single cluster are very similar and points in different clusters are different.”

(Müller and Guido 2017)

Dps  Sepal length (cm)  Petal length (cm)  Petal width (cm)  Cluster
A    3.5                1.4                0.2               1
B    3.2                5.7                2.3               2
C    3.2                5.9                2.3               2
Why bother with it?

• It provides more information on the structure of the data → patterns
• It can help identify problems in the data, such as outliers
• It can be used to compress data
Other use cases

• Customer recommendation systems: “People who bought Harry Potter and the Philosopher’s Stone also bought The Hunger Games…”
• Grouping DNA sequences of different strains of HIV into families of genetically similar viruses
• Identifying fake news by clustering the words used in articles: certain words may appear more often in sensationalized click-bait articles
• And the more frivolous and fun side projects…


What is a cluster?

“There is no universal definition of what a cluster is: it really depends on the context, and different algorithms will capture different kinds of clusters.”

(Géron, 2019)
Types of clustering algorithms

• Centroid-based
• Density-based
• Distribution-based
• Hierarchical clustering

How do I know which type of algorithm is right for me?

EXPLORE YOUR DATA
K-Means clustering

• We want to separate our data points into k clusters
• First, we initialise the algorithm with k random points (our centroids)
• Then, we assign each data point to its nearest initialisation point, using the Euclidean distance
• Once each data point is assigned, we relocate the initialisation point to the mean of the data points that were assigned to it
• Repeat the highlighted steps until the assignment of data points to centroids remains unchanged
Introducing pseudocode…

[Slide shows the K-Means steps side by side in pseudo-English and as Python code.]
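As a rough stand-in for that slide, here is a minimal sketch of the K-Means loop described above in Python/NumPy. It assumes the data is a 2-D NumPy array X (one row per data point) and that k has already been chosen; the function and variable names are illustrative, not taken from the original slide.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise with k random data points as centroids (Forgy's method)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid using Euclidean distance
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Relocate each centroid to the mean of the points assigned to it
        # (for simplicity, this sketch assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop once the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids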
Initialisation – how do we select our centroids?

• Forgy’s method: choose k random data points from the dataset
• Random Partition method: randomly assign each data point to a cluster, then calculate the mean of each cluster to get the initial centroids
• K-means++: the first centroid is a random data point, but each remaining centroid is chosen to favour points with a large squared distance from the centroids already picked → the centroids are spread out evenly (a sketch follows below)
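A possible sketch of K-means++-style seeding, again assuming a 2-D NumPy array X. The probability-weighted choice shown here is the standard K-means++ formulation rather than anything reproduced from the slide.

import numpy as np

def kmeans_plus_plus_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # Favour points far from the existing centroids (probability proportional to squared distance)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)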
How do we determine the number of clusters we want?

Sepal length (cm)  Petal length (cm)  Petal width (cm)
3.5                1.4                0.2
3.2                5.7                2.3
3.2                5.9                2.3
2.9                4.7                1.4
3.7                1.5                0.4

K = ?

Elbow plot

[Plot of SSE against the k value: the curve drops steeply at first and then flattens at the “elbow”.]

• Each time we increase the number of clusters → the SSE decreases
• Goal: select a small value of k that still has a low SSE
• The elbow represents where we start to have diminishing returns by increasing k (a code sketch for producing such a plot follows below)
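A minimal sketch of computing an elbow plot with scikit-learn, assuming the data is in a 2-D array X. The sum of squared errors (SSE) is exposed by the fitted model as inertia_.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=10):
    sse = []
    for k in range(1, k_max + 1):
        # inertia_ is the sum of squared distances of points to their centroid (SSE)
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse.append(model.inertia_)
    plt.plot(range(1, k_max + 1), sse, marker="o")
    plt.xlabel("k value")
    plt.ylabel("SSE")
    plt.show()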
What are the strengths?

• Easy to understand and implement
• Fast
• Scalable
What are the limitations?

• Choosing k manually – it’s a hassle! (the elbow method helps, but it is still a manual choice)
• It is dependent on initial values: it is necessary to run the algorithm several times to avoid suboptimal solutions, since it converges to a local minimum (a bad centroid initialisation leads to a suboptimal solution)
• Not good at clustering data of varying sizes, densities, or nonspherical shapes

[Figures: a bad centroid initialisation producing a suboptimal solution; clusters that differ in density, direction, and shape.]


Hierarchical clustering

“Hierarchical clustering algorithms […] approach the problem of clustering by developing a binary tree-based data structure called the dendrogram. Once the dendrogram is constructed, one can automatically choose the right number of clusters by splitting the tree at different levels to obtain different clustering solutions for the same dataset without rerunning the clustering algorithm again.”

(Reddy and Vinzamuri, 2015)


How do I read a dendrogram?

[Figure: a scatter plot of points A–E alongside the corresponding dendrogram. Branches join at the height where two clusters are merged; moving down the tree corresponds to increasing similarity.]
What are the 2 main approaches to hierarchical clustering?

1) Agglomerative (bottom-up): start with every data point as its own cluster and repeatedly merge the closest clusters:
   A, B, C, D, E → AB, C, D, E → AB, C, DE → ABC, DE → ABCDE

2) Divisive (top-down): start with a single cluster containing all the points and repeatedly split it:
   ABCDE → ABC, DE → AB, C, DE → A, B, C, D, E
Which clusters should be combined, or split?

1) Measure of distance – some measure of similarity

• Hierarchical clustering is proximity-based
• Affects the shape of the clusters
• Used to build the distance matrix
• The default is Euclidean distance, but other measures exist: correlation-based, Levenshtein distance, etc.

Example: for p = (3, 2) and q = (4, 1), the Euclidean distance is ED ≈ 1.414214.
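A quick check of that example in Python, using the point coordinates from the table above:

import numpy as np

p = np.array([3, 2])
q = np.array([4, 1])
# Euclidean distance: sqrt((3 - 4)**2 + (2 - 1)**2) = sqrt(2)
print(np.linalg.norm(p - q))  # 1.4142135623730951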
Which clusters should be combined, or split?

2) Linkage criterion – different ways to link clusters based on distance

• A means of determining whether certain clusters should be merged
• The default is complete-linkage
• Other commonly used linkage criteria: single-linkage, average-linkage
• Used to update the distance matrix and merge clusters
Agglomerative hierarchical clustering: Using complete-linkage
Step by step…

1) Load in dataset

Dps  Sepal length (cm)  Petal length (cm)
A    1                  1
B    1                  0
C    0                  2
D    2                  4
E    3                  5

[Scatter plot of the five points, with sepal length on the x-axis and petal length on the y-axis.]
Step by step…

2) Build distance matrix and identify smallest distance

     A    B    C    D    E
A    0    1    1.4  3.2  4.5
B    1    0    2.2  4.1  5.4
C    1.4  2.2  0    2.8  4.2
D    3.2  4.1  2.8  0    1.4
E    4.5  5.4  4.2  1.4  0
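As a sketch of how this distance matrix could be computed in Python (using the five points from step 1; the slide rounds the values to one decimal place):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# The five data points from step 1: (sepal length, petal length)
points = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]])

# Pairwise Euclidean distances, arranged as a symmetric 5 x 5 matrix
dist_matrix = squareform(pdist(points, metric="euclidean"))
print(np.round(dist_matrix, 1))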


Step by step…
3) Perform merge and update distance matrix

The smallest distance is d(A,B) = 1, so A and B are merged. Updated distance matrix:

     AB   C    D    E
AB   0
C    2.2  0
D    4.1  2.8  0
E    5.4  4.2  1.4  0

d[(A,B), C] = max {d(A,C), d(B,C)} = max {1.4, 2.2} = 2.2
d[(A,B), D] = max {d(A,D), d(B,D)} = max {3.2, 4.1} = 4.1
d[(A,B), E] = max {d(A,E), d(B,E)} = max {4.5, 5.4} = 5.4
Step by step…
Continue merging and updating the distance matrix…

Next, D and E are merged (smallest remaining distance, 1.4):

     AB   DE   C
AB   0
DE   5.4  0
C    2.2  4.2  0

d[(A,B), (D,E)] = max {d((A,B),D), d((A,B),E)} = max {4.1, 5.4} = 5.4
d[C, (D,E)] = max {d(C,D), d(C,E)} = max {2.8, 4.2} = 4.2

Then C is merged with AB (distance 2.2):

     ABC  DE
ABC  0
DE   5.4  0

d[(A,B,C), (D,E)] = max {d((A,B),(D,E)), d(C,(D,E))} = max {5.4, 4.2} = 5.4

Finally, ABC and DE are merged at distance 5.4, leaving a single cluster.
RESULT

• Dendrogram: the y-axis denotes when in the agglomerative algorithm two clusters get merged
• The y-axis also shows how far apart the merged clusters are → pay attention to the length of the branches
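For reference, a sketch of the same complete-linkage clustering and its dendrogram in Python with SciPy, reusing the five points from step 1 (labels and styling are illustrative):

import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

points = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]])

# Agglomerative clustering with the complete-linkage criterion
Z = linkage(points, method="complete", metric="euclidean")

# The dendrogram's y-axis shows the distance at which each pair of clusters was merged
dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.ylabel("Distance")
plt.show()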
What are the strengths?

• Easy to understand and implement
• Most appealing output
• Can handle non-convex clusters
• No need to specify the number of clusters (k)!
What are the limitations?

• Mathematically simple… but computationally expensive!
• Hard to visualize results with a large dataset
• Heavily driven by heuristics and arbitrary decisions
• The algorithm can’t undo a previous step

K-Means vs Hierarchical clustering

Time complexity
• K-Means: O(n)
• Hierarchical clustering: O(n²)

Hyperparameter tuning
• K-Means: must specify the number of clusters (k) and retrain the model for each k
• Hierarchical clustering: no need to specify a k value; the tree can be split wherever desired

Data structure
• K-Means: better performance when dealing with convex clusters
• Hierarchical clustering: generates better results when dealing with non-convex clusters

Types/variations
• K-Means: many variations (e.g., K-median, K-medoid)
• Hierarchical clustering: two approaches, Agglomerative and Divisive

Result robustness
• K-Means: the result may be different on different runs
• Hierarchical clustering: the same parameters generate the same result every time
