Week 10 Lecture – Introduction to Clustering
CMP4294 – INTRODUCTION TO AI
DR MARIAM ADEDOYIN-OLOWE
[Diagram: Data Analysis Techniques branch into Predictive Analytics (supervised learning) and Descriptive Analytics (unsupervised learning).]
Clustering
• Unsupervised learning: the training data does not include desired outputs.
• Partition the data into clusters based on their similarity.
Given a cloud of data points, we want to understand its structure:
• Labeling is expensive
• Gain insight into the structure of the data
• Find prototypes in the data
Goal of Clustering
• Given a set of data points, each described by a set of attributes, find clusters such that:
  – Inter-cluster similarity is minimized
  – Intra-cluster similarity is maximized
[Figure: scatter plot of points over features F1 and F2 forming two well-separated groups.]
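To make these two criteria concrete, here is a minimal sketch (the points and the helper mean_pairwise_dist are invented for illustration, not from the slides) comparing the average distance within one cluster to the average distance between two clusters:

```python
import numpy as np

a = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]])   # cluster A
b = np.array([[5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])   # cluster B

def mean_pairwise_dist(x, y):
    # Average Euclidean distance between every point in x and every point in y
    # (the intra-cluster case includes zero self-distances; fine for illustration).
    return np.mean(np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1))

print("intra-cluster (A vs A):", mean_pairwise_dist(a, a))   # small: high similarity
print("inter-cluster (A vs B):", mean_pairwise_dist(a, b))   # large: low similarity
```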
Clustering is subjective
• Similarity is hard to define, but “we know it when we see it”.
Clustering Algorithms
• Flat algorithms
– Usually start with a random (partial) partitioning
– Refine it iteratively
• k-means clustering
• (Model based clustering)
• Hierarchical algorithms
– Bottom-up, agglomerative
– (Top-down, divisive)
Partitioning Algorithms
http://www.richardafolabi.com/blog/data-analysis/understanding-clustering-for-machine-learning.
html
http://www.richardafolabi.com/blog/data-analysis/understanding-clustering-for-machine-learning.html
http://www.richardafolabi.com/blog/data-analysis/understanding-clustering-for-machine-learning.html
Types of Clustering
1. Partitional Clustering
   [Figure: data points divided into flat clusters C1–C5.]
2. Hierarchical Clustering
   [Figure: dendrogram merging clusters C1–C5 into a nested tree.]
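As a hedged illustration of the bottom-up (agglomerative) approach, the sketch below uses SciPy on a handful of invented 2-D points; the choice of Ward linkage and the cut into three flat clusters are assumptions for the example, not something the slide specifies:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[1.0, 1.0], [1.1, 0.9], [5.0, 5.0], [5.2, 4.9], [9.0, 1.0]])

# Bottom-up: start with every point as its own cluster, repeatedly merge the
# closest pair; the merge history Z encodes the dendrogram.
Z = linkage(points, method="ward")

# Cut the tree to obtain 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```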
Output of a Clustering Session
• Instance assignment: each instance is assigned to a cluster
(group), or in some methods, some instances are considered
outliers (instances that do not belong to any cluster).
• Cluster statistics:
• Centroids: the centre of each cluster (the average of each
feature of all instances that belong to the same cluster).
• Size: number of instances that belong to the cluster.
• Variations: the variance or standard deviation of the instances that
belong to each cluster.
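A minimal sketch of these statistics (the data points and label assignments are invented for illustration), computed with NumPy from the instance assignments:

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.2, 1.8], [5.0, 5.0], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])          # instance assignment

for c in np.unique(labels):
    members = X[labels == c]
    print(f"cluster {c}:")
    print("  centroid:", members.mean(axis=0))   # per-feature average
    print("  size:    ", len(members))           # number of instances
    print("  std:     ", members.std(axis=0))    # spread of each feature
```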
k-Means
https://www.youtube.com/watch?v=4R8nWDh-wA0
k-means Clustering
https://en.wikipedia.org/wiki/K-means_clustering
Example: Assigning Clusters
[Figure sequence: three slides showing k-means on a 2-D point cloud; each data point (marked x) is assigned to its nearest centroid, and the centroids are then updated.]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
k-means Algorithm
1. Clusters the data into k groups, where k is predefined.
2. Select k points at random as cluster centres.
3. Assign objects to their closest cluster centre according to the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 2, 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
https://www.saedsayad.com/clustering_kmeans.htm
Getting the k right
• k < total number of points
• Try different k, looking at the change in the average distance to centroid as k increases
• The average falls rapidly until the right k, then changes little
$\mathrm{WSS}(C) = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(x, c_i)^2$
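As a hedged sketch of this elbow heuristic, the code below computes WSS for several values of k using scikit-learn's KMeans, whose inertia_ attribute is the sum of squared distances of samples to their closest centroid; the three-blob data set is invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs of 30 points each, so the "right" k is 3.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [4, 4], [8, 0])])

for k in range(1, 7):
    wss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(wss, 1))   # WSS drops sharply up to k = 3, then levels off
```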
k-means: Strengths and Weaknesses
• Strengths
  • Simple and easy to implement
  • Quite efficient
• Weaknesses
  • Need to specify the value of k, but we may not know what the value should be beforehand
  • Sensitive to the choice of the initial k centroids: the result can be non-deterministic
  • Sensitive to noise
• Initialization
  • Initial centroids are often chosen randomly.
  • Clusters produced vary from one run to another (see the sketch below).
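In practice this initialization sensitivity is often mitigated with k-means++ seeding plus several random restarts, keeping the run with the lowest WSS; a small scikit-learn usage sketch (the data is invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.1, 0.9], [5, 5], [5.2, 4.8], [9, 1], [9.1, 1.2]], dtype=float)

# k-means++ spreads the initial centroids out; n_init=10 reruns the algorithm
# from 10 seedings and keeps the best; a fixed random_state makes it reproducible.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
print(km.fit_predict(X))
```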
Based on Prof Mohamed Gaber slides, BCU
Summary