03 Clustering
Machine Learning Tasks
                  Supervised            Unsupervised
Discrete Data     Classification        Clustering
                  (predict a label)     (group similar items)
Continuous Data   Regression            Dimensionality Reduction
                  (predict a quantity)  (reduce n. of variables)
Agenda
• Introduction to Clustering
• Clustering Algorithms
• Centroid-based
• Optional: Connectivity-based
• Optional: Density-based
• Evaluation
Introduction to Clustering
Clustering
• Customer Segmentation
• Fraud Detection
• Classification:
- supervised
- requires a set of labelled training samples
• Clustering:
- unsupervised
- learns without labels
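To make the distinction concrete, here is a minimal sketch contrasting the two settings with scikit-learn; the tiny dataset and parameter values are invented for illustration:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
    y = np.array([0, 0, 1, 1])  # labels: only available in the supervised setting

    # Classification: trained on (X, y), predicts a label for a new item
    clf = LogisticRegression().fit(X, y)
    print(clf.predict([[1.2, 1.9]]))   # -> a known class label

    # Clustering: trained on X alone, groups similar items
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.labels_)                  # -> arbitrary group ids, no labels needed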
Classification Example
• Training: items with labels
• New, unseen item: ?
• Prediction: assign item to class
Clustering Example
• Training: no labels
• Prediction: group items
More Definitions
• Clustering is also known as “unsupervised classification”: items are grouped into classes, but the classes are not given in advance
Flat vs Hierarchical
• Flat approach: a single partition of the items into clusters
• Hierarchical approach: a nested hierarchy of clusters, built either
  - Bottom-up (agglomerative), or
  - Top-down (divisive)
• Cluster assignment: hard (each item belongs to exactly one cluster) or soft
• Centroid-based clustering: K-Means
• Connectivity-based clustering: Hierarchical Agglomerative
• Density-based clustering: DBSCAN
Centroid-based Clustering
• Output:
- A partition of the input set into K clusters
• Assumption:
- Input items are real-valued vectors
- Notion of distance / similarity
K-Means Algorithm
1. Pick K initial centroids (e.g. at random)
2. Assign each item to its closest centroid
3. Update centroids as the mean of their assigned items
4. Repeat steps 2-3 until convergence
K-Means Example (K = 2)
1. Pick initial centroids
2. Assign items to closest centroid
3. Update centroids
4. Repeat:
   - assign items to centroids
   - update centroids
   ... until convergence!
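This loop fits in a few lines of NumPy. A minimal sketch of the algorithm (function name and defaults are my own, not from the lecture):

    import numpy as np

    def kmeans(X, k, n_iters=100, seed=0):
        # Lloyd's algorithm: alternate assignment and update steps
        rng = np.random.default_rng(seed)
        # 1. pick K initial centroids at random among the items
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # 2. assign each item to its closest centroid (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. update each centroid to the mean of its assigned items
            #    (keep the old centroid if a cluster ends up empty)
            new_centroids = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            # 4. stop when the centroids no longer move
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

Calling kmeans(X, k=2) on 2-D data reproduces the behaviour of the example above.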
Assignment Step
• Each item x is assigned to its closest centroid, i.e. the μ minimizing d(x, μ)
• The resulting sum of squared distances to the centroids is a.k.a. the distortion
• If K increases, the distortion decreases
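Since the distortion always decreases as K grows, a common heuristic is to compute it for several values of K and look for an "elbow". A quick sketch with scikit-learn, whose inertia_ attribute is exactly this distortion (the toy data is invented):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))  # toy data

    for k in range(1, 9):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        print(k, round(km.inertia_, 2))  # distortion shrinks as K grows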
Notebook Intermezzo:
Clustering - KMeans
Connectivity-based Clustering
• a.k.a. Hierarchical Clustering
Steps:
{1}, {2}, {3}, {4}, {5}, {6}, {7}
{1, 2}, {3}, {4}, {5}, {6}, {7}
{1, 2}, {3, 4}, {5}, {6}, {7}
{1, 2}, {3, 4, 5}, {6}, {7}
{1, 2, 6}, {3, 4, 5}, {7}
{1, 2, 6, 7}, {3, 4, 5}
{1, 2, 3, 4, 5, 6, 7}
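Such a merge sequence can be computed with SciPy. A sketch on made-up 2-D coordinates for the seven points (since the coordinates are invented, the exact merge order may differ from the figure):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # invented coordinates for points 1..7
    X = np.array([[0.0, 0.0], [0.2, 0.1],   # 1, 2
                  [2.0, 2.0], [2.1, 2.2],   # 3, 4
                  [2.3, 1.9],               # 5
                  [0.5, 0.4], [0.9, 0.0]])  # 6, 7

    # each row of Z records one merge: (cluster a, cluster b, distance, size)
    Z = linkage(X, method='single')
    print(Z)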
Dendrogram
[Dendrogram of the merge sequence above; leaves ordered 1 2 6 7 5 3 4]
Dendrogram Example on the Iris dataset
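A sketch of how such a dendrogram can be produced (the plotting details are my assumptions, not the lecture's exact code):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from scipy.cluster.hierarchy import linkage, dendrogram

    X = load_iris().data
    Z = linkage(X, method='average')
    dendrogram(Z, no_labels=True)  # 150 leaves, labels omitted for readability
    plt.ylabel('distance')
    plt.show()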
From Dendrogram to Clusters
• Choice of distance: cut the dendrogram at a given height, e.g. 0.7

Single Linkage
Distance 0.7
• For each point x there is a point y in its cluster where d(x, y) ≤ 0.7
Complete Linkage
Distance 0.7
• For each point x, all points y in its cluster satisfy d(x, y) ≤ 0.7
Average Linkage
Distance 0.7
• Cut interpretation: there isn't a good one!
Linkage Issues
• Single linkage suffers from chaining: only one pair of points needs to be close in order to merge two groups, i.e. clusters can be spread out and not very compact
• No silver bullet: each linkage criterion has its own failure modes
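Cutting the dendrogram at a distance threshold is a one-liner in SciPy. A sketch on invented toy data, using the 0.7 threshold from the slides:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))  # toy data

    Z = linkage(X, method='complete')
    # cut the dendrogram at height 0.7: every merge above that distance is undone
    labels = fcluster(Z, t=0.7, criterion='distance')
    print(labels)  # one cluster id per input point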
Notebook Intermezzo:
Clustering - Hierarchical
Density-based Clustering
• Clusters are regions of the data space with higher density, separated by lower-density regions
• Core point: a point p whose ε-neighbourhood N(p) contains at least minPts points
• Density-reachable points:
  - q is directly reachable from p if it is in N(p) and p is a core point
  - q is reachable from p if p is a core point and there is a path of directly reachable core points between them
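A sketch of these concepts with scikit-learn's DBSCAN (the data and parameter values are invented): core points are exposed via core_sample_indices_, and points reachable from no core point get the label -1, i.e. outliers.

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    # two dense blobs plus one isolated point
    X = np.vstack([rng.normal(0, 0.1, size=(10, 2)),
                   rng.normal(3, 0.1, size=(10, 2)),
                   [[10.0, 10.0]]])

    db = DBSCAN(eps=0.5, min_samples=2).fit(X)
    print(db.core_sample_indices_)  # indices of the core points
    print(db.labels_)               # cluster ids; -1 marks outliers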
DBSCAN Concepts
[Example with 8 points, minPts = 2 and radius ε]
- 1 is reachable from 6
- 3 and 5 are directly reachable from 4
- 7 is an outlier
DBSCAN Algorithm
• Find the ε-neighbourhood of all points
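This first step is a radius query. A sketch using scikit-learn's NearestNeighbors (data and values are illustrative):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))
    eps = 0.5

    # ε-neighbourhood of every point: all points within distance eps
    nn = NearestNeighbors(radius=eps).fit(X)
    distances, neighbours = nn.radius_neighbors(X)
    print(neighbours[0])  # indices of points in N(x_0), including x_0 itself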