ML Lecture06 Unsupervised Learning
M.Sc. in DSBA
Lecture 6
Unsupervised learning: clustering
Joseph Boyd
Unsupervised learning (clustering)
Supervised vs. Unsupervised Learning
What is Cluster Analysis?
• Typical applications
– As a stand-alone tool to get further insights about the data
– As a preprocessing step for other algorithms
Any Natural Grouping?
Clustering is subjective
Goals of Clustering
⚫ Group objects that are similar into clusters: classes that are unknown beforehand
Applications of Clustering
Cluster Centroids and Medoids
Cluster Evaluation
⚫ Clustering is unsupervised
⚫ There is no ground truth. How do we evaluate the quality of a clustering algorithm?
⚫ Based on the shape of the clusters:
− Points within the same cluster should be nearby/similar, and points far from each other should belong to different clusters
⚫ Based on the stability of the clusters:
− We should get the same results if we remove some data points, add noise, etc.
⚫ Based on domain knowledge (ground truth):
− The clusters should “make sense”
Major Clustering Approaches
Fahad et al. A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis, 2014
Hierarchical vs. Partitional
(Figure: hierarchical vs. partitional clustering.)
k-means Algorithm – The Idea
1. Initialize 𝑘 centers (e.g., pick 𝑘 examples at random)
2. Iterate:
– Assign each example to the closest center
– Recalculate centers as the mean of the points in a cluster
(A minimal NumPy sketch of this loop is given below.)
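A minimal, illustrative NumPy implementation of the loop above (not the lecture's reference code; the convergence test and naming are my own choices):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means. X: (n, p) data matrix; returns centers (k, p) and labels (n,)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize centers by picking k examples at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2a. Assign each example to the closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. Recalculate each center as the mean of the points in its cluster
        #     (assumes no cluster goes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # no change: done
            break
        centers = new_centers
    return centers, labels
```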
k-means: An Example (figure sequence)
– Initialize centers randomly
– Assign points to nearest center
– Readjust centers
– Repeat assignment and readjustment until nothing changes: done
How to Select k?
Elbow rule: plot the within-cluster sum of squares (inertia) against 𝑘 and pick the 𝑘 at the “elbow”, where adding more clusters stops giving a large improvement (a code sketch follows below).
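A small sketch of the elbow rule with scikit-learn's KMeans; the dataset and the range of k are placeholders:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # toy data

ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

# Look for the "elbow": the k after which inertia decreases only slowly.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```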
Summary of k-means
• Advantages
– Computational time is linear: 𝒪(𝑛𝑝𝑘𝑡) for 𝑛 points in 𝑝 dimensions, 𝑘 clusters and 𝑡 iterations (each iteration computes 𝑘𝑛 distances)
– Easily implemented
– Amenable to stochastic optimisation: mini-batch k-means
• Drawbacks
– Need to select 𝑘 (user-defined parameter)
– Sensitivity to noise and outliers
– Non-deterministic (stochastic): different solutions with each initialisation
– The clusters are forced to have “spherical” (convex) shapes
Example
Source: https://pafnuty.wordpress.com/
Improving k-means: k-means++
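The slide's figure is not reproduced here. As a reminder, k-means++ seeding chooses the first center uniformly at random and each subsequent center with probability proportional to the squared distance to the nearest center already chosen. A minimal NumPy sketch of this seeding step (illustrative, not the lecture's code):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: returns k initial centers drawn from the rows of X."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # first center: uniform at random
    for _ in range(k - 1):
        # squared distance of every point to its nearest already-chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()                    # sample proportionally to D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```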
scikit-learn
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-
auto-examples-cluster-plot-kmeans-digits-py
32
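A minimal usage sketch of sklearn.cluster.KMeans; the toy data and parameter values are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)        # cluster index for each sample
print(km.cluster_centers_)        # one center per cluster
print(km.inertia_)                # within-cluster sum of squares
```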
Gaussian Mixture Models
Gaussian mixture models – overview
$p(\mathbf{x}_n;\, \boldsymbol{\theta}) = \sum_{k=1}^{K} \mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot \pi_k$
Marginal likelihood (2/2)
$p(\mathbf{x}_n;\, \boldsymbol{\theta}) = \sum_{k=1}^{K} \mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot \pi_k$
(Figure: a two-component mixture, components $k=1$ and $k=2$.)
– Point where $p(x \mid z=1)$ low, $p(z=1)$ high; $p(x \mid z=2)$ high, $p(z=2)$ low $\Rightarrow p(x)$ high
– Point where $p(x \mid z=1)$ low, $p(z=1)$ high; $p(x \mid z=2)$ low, $p(z=2)$ low $\Rightarrow p(x)$ low
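A small numerical illustration (1-D, two components with made-up parameters) of how the marginal density combines component likelihoods and mixing weights:

```python
import numpy as np
from scipy.stats import norm

# Two-component 1-D mixture with illustrative parameters
pis = np.array([0.7, 0.3])             # p(z = k)
mus, sigmas = np.array([0.0, 5.0]), np.array([1.0, 1.0])

def marginal(x):
    # p(x) = sum_k p(x | z = k) * p(z = k)
    return sum(pi * norm.pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas))

print(marginal(4.8))   # close to component 2: p(x | z=2) high, so p(x) relatively high
print(marginal(10.0))  # far from both components: p(x) low
```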
Posterior distribution (1/2)
$p(\mathbf{z}_n = k \mid \mathbf{x}_n;\, \boldsymbol{\theta}) = \frac{p(\mathbf{x}_n \mid \mathbf{z}_n = k;\, \boldsymbol{\theta}) \cdot p(\mathbf{z}_n = k;\, \boldsymbol{\theta})}{p(\mathbf{x}_n;\, \boldsymbol{\theta})} = \frac{\mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot \pi_k}{\sum_{j=1}^{K} \mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \cdot \pi_j}$
(numerator: likelihood × prior; denominator: marginal likelihood)
by which we may assign $\mathbf{x}_n$ to the cluster $k$ maximising the posterior.
Posterior distribution (2/2)
$p(\mathbf{z}_n = k \mid \mathbf{x}_n;\, \boldsymbol{\theta}) = \frac{\mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot \pi_k}{\sum_{j=1}^{K} \mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j) \cdot \pi_j}$
(Figure: posterior responsibilities for components $k=1$ and $k=2$.)
Maximum likelihood estimation
$\max_{\boldsymbol{\theta}} \; p(\mathbf{X};\, \boldsymbol{\theta}) = \max_{\boldsymbol{\theta}} \prod_{n=1}^{N} p(\mathbf{x}_n;\, \boldsymbol{\theta}) \;\Leftrightarrow\; \max_{\boldsymbol{\theta}} \sum_{n=1}^{N} \log p(\mathbf{x}_n;\, \boldsymbol{\theta}) = \max_{\boldsymbol{\theta}} \sum_{n=1}^{N} \log \sum_{k=1}^{K} \mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \cdot \pi_k$
Expectation maximisation: E step
$p_{nk}^{(i)} := p(\mathbf{z}_n = k \mid \mathbf{x}_n;\, \boldsymbol{\theta}^{(i)}) = \frac{\mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_k^{(i)}, \boldsymbol{\Sigma}_k^{(i)}) \cdot \pi_k^{(i)}}{\sum_{j=1}^{K} \mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_j^{(i)}, \boldsymbol{\Sigma}_j^{(i)}) \cdot \pi_j^{(i)}}$
Expectation maximisation: M step (1)
Expectation maximisation: M step (2)
$\boldsymbol{\mu}_k^{(i+1)} = \frac{\sum_{n=1}^{N} p_{nk}^{(i)} \, \mathbf{x}_n}{\sum_{n=1}^{N} p_{nk}^{(i)}}$

$\boldsymbol{\Sigma}_k^{(i+1)} = \frac{\sum_{n=1}^{N} p_{nk}^{(i)} \, (\mathbf{x}_n - \boldsymbol{\mu}_k^{(i+1)})(\mathbf{x}_n - \boldsymbol{\mu}_k^{(i+1)})^{T}}{\sum_{n=1}^{N} p_{nk}^{(i)}}$

$\pi_k^{(i+1)} = \frac{1}{N} \sum_{n=1}^{N} p_{nk}^{(i)}$
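A compact NumPy/SciPy sketch of one EM pass for a GMM (the E step responsibilities followed by these M step updates); the function name, initialisation and stopping criteria are my own simplifications:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    """One EM iteration. X: (N, D); pis: (K,); mus: (K, D); Sigmas: (K, D, D)."""
    N, K = len(X), len(pis)
    # E step: responsibilities p_nk = p(z_n = k | x_n; theta)
    p = np.array([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k]) for k in range(K)]).T
    p /= p.sum(axis=1, keepdims=True)
    # M step: weighted means, covariances and mixing weights
    Nk = p.sum(axis=0)                         # effective cluster sizes
    mus_new = (p.T @ X) / Nk[:, None]
    Sigmas_new = np.array([
        ((X - mus_new[k]).T * p[:, k]) @ (X - mus_new[k]) / Nk[k]
        for k in range(K)
    ])
    pis_new = Nk / N
    return pis_new, mus_new, Sigmas_new
```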
Expectation maximisation (iteration sequence shown as figures)
Connection to k-means
Recall that in the M step we optimise

$\boldsymbol{\theta}_k^{(i+1)} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \sum_{n=1}^{N} p_{nk}^{(i)} \left[ \log \mathcal{N}(\mathbf{x}_n;\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) + \log \pi_k \right]$

Suppose constant $\boldsymbol{\Sigma}_k = \mathbf{I}_{D\times D}$ and $\pi_k = 1/K$, and “one-hot” $p_{nk}^{(i)}$:

$\boldsymbol{\theta}_k^{(i+1)} = \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \sum_{\mathbf{x}\,:\,p_k^{(i)}(\mathbf{x})=1} \log \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}_k, \mathbf{I}_{D\times D}) = \underset{\boldsymbol{\theta}}{\operatorname{argmin}} \sum_{\mathbf{x}\,:\,p_k^{(i)}(\mathbf{x})=1} \tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^{T}(\mathbf{x} - \boldsymbol{\mu}_k)$

whose solution is the cluster mean,

$\boldsymbol{\mu}_k^{(i+1)} = \frac{1}{|\{\mathbf{x} : p_k^{(i)}(\mathbf{x})=1\}|} \sum_{\mathbf{x}\,:\,p_k^{(i)}(\mathbf{x})=1} \mathbf{x}$

i.e. exactly the k-means centre update.
GMM properties
scikit-learn
https://scikit-learn.org/stable/modules/mixture.html#mixture
https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_pdf.html
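A minimal usage sketch of sklearn.mixture.GaussianMixture; the toy data and hyperparameters are illustrative:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)
hard_labels = gmm.predict(X)        # argmax of the posterior p(z = k | x)
soft_labels = gmm.predict_proba(X)  # the posterior responsibilities themselves
print(gmm.means_, gmm.weights_)     # fitted component means and mixing weights
```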
Spectral clustering
Clustering Structure in Graphs
(Figure: data points in the plane and the similarity graph constructed on them.)
[von Luxburg ‘07], [Shi and Malik ‘00], [Ng, Jordan, Weiss ’02]
Adjacency Matrix
• Adjacency matrix 𝑾:
- 𝑛 × 𝑛 matrix, where 𝑛 = |𝑉| is the number of nodes
- 𝑠(𝑣𝑖 , 𝑣𝑗 ) = 𝑤𝑖𝑗 ≥ 0 for similarity function 𝑠 (non-zero if 𝐸𝑖𝑗 exists)
- Simplest case: 𝑤𝑖𝑗 binary
(Figure: the 6-node example graph.)
      1  2  3  4  5  6
  1 [ 0  1  1  0  1  0 ]
  2 [ 1  0  1  0  0  0 ]
  3 [ 1  1  0  1  0  0 ]
  4 [ 0  0  1  0  1  1 ]
  5 [ 1  0  0  1  0  1 ]
  6 [ 0  0  0  1  1  0 ]
• Important properties
– Symmetric matrix
– Eigenvectors are real and orthogonal
Degree Matrix
• Degree matrix 𝑫:
– 𝑛 × 𝑛 diagonal matrix
– 𝑑𝑖𝑖 = σ𝑗 𝑤𝑖𝑗 “degree” of node 𝑣𝑖
      1  2  3  4  5  6
  1 [ 3  0  0  0  0  0 ]
  2 [ 0  2  0  0  0  0 ]
  3 [ 0  0  3  0  0  0 ]
  4 [ 0  0  0  3  0  0 ]
  5 [ 0  0  0  0  3  0 ]
  6 [ 0  0  0  0  0  2 ]
Laplacian Matrix
• Laplacian matrix 𝑳 = 𝑫 − 𝑾:
– 𝑛 × 𝑛 symmetric matrix
– Note the row sums: $\sum_j l_{ij} = 0 \;\; \forall i \;\Rightarrow\; [1, \ldots, 1]^T$ is an eigenvector with eigenvalue 0.
      1  2  3  4  5  6
  1 [ 3 -1 -1  0 -1  0 ]
  2 [-1  2 -1  0  0  0 ]
  3 [-1 -1  3 -1  0  0 ]
  4 [ 0  0 -1  3 -1 -1 ]
  5 [-1  0  0 -1  3 -1 ]
  6 [ 0  0  0 -1 -1  2 ]
• Important properties
– Eigenvalues are non-negative real numbers
– Eigenvectors are real and orthogonal
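A small NumPy check of these definitions on the 6-node example graph above (the matrices are exactly those shown on the slides):

```python
import numpy as np

# Adjacency matrix W of the 6-node example graph
W = np.array([[0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)   # symmetric => real eigenvalues, orthogonal eigenvectors
print(np.round(eigvals, 3))            # non-negative, smallest is 0
print(np.allclose(L @ np.ones(6), 0))  # [1,...,1]^T is an eigenvector with eigenvalue 0
```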
Bi-partitioning (1/2)
• Undirected graph 𝐺 = (𝑉, 𝐸)
• Bi-partitioning task (k = 2):
– Divide nodes into two disjoint groups 𝐴, 𝐵
(Figure: the example graph split into groups A = {1, 2, 3} and B = {4, 5, 6}.)
Questions:
• How can we define a good partition of 𝐺?
• How can we efficiently identify such a partition?
Bi-partitioning (2/2)
(Figure: the example graph again, with a candidate partition into groups A and B.)
Graph Cuts
$\mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$

Two partitions, A and B.
(Figure: on the example graph with A = {1, 2, 3} and B = {4, 5, 6}, cut(A, B) = 2.)
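A quick check of this definition on the example graph; W is the adjacency matrix from the earlier slide and the partition is the one in the figure:

```python
import numpy as np

W = np.array([[0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

A = [0, 1, 2]                 # nodes 1, 2, 3 (0-indexed)
B = [3, 4, 5]                 # nodes 4, 5, 6

cut = W[np.ix_(A, B)].sum()   # sum of edge weights crossing the partition
print(cut)                    # 2.0 for this example
```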
Graph Cut Criterion for Clustering
• Criterion: Minimum-cut
– Minimize the weight of connections between groups
$\underset{A,B}{\operatorname{argmin}} \; \mathrm{cut}(A, B)$

(Figure: the minimum cut can differ from the “optimal” cut — a degenerate case.)
Problem:
• Not a satisfactory partition – it often isolates nodes
• Does not consider internal cluster connectivity
Ratio Cut
$\mathrm{ratio\text{-}cut}(A, B) = \frac{\mathrm{cut}(A, B)}{|A|} + \frac{\mathrm{cut}(A, B)}{|B|}$

– Normalises by the sizes |A| and |B| of the two groups
– Internal group connectivity is still not taken into account
Normalized Cut
$\mathrm{normalized\text{-}cut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(A)} + \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(B)}$

where vol(𝐴) is the total (weighted) degree of the nodes in 𝐴.
• How do we efficiently find a good partition?
– Computing the optimal cut is NP-hard
[Shi and Malik ‘97]
Ratio Cut vs. Normalized Cut (1/2)
(Figure: red marks the min-cut.)
$\mathrm{ratio\text{-}cut}(A, B) = \frac{\mathrm{cut}(A, B)}{|A|} + \frac{\mathrm{cut}(A, B)}{|B|}$
$\mathrm{normalized\text{-}cut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(A)} + \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(B)}$
Recall the:
• Adjacency matrix 𝑊
• Degree matrix 𝐷
• Laplacian matrix 𝐿 = 𝐷 − 𝑊
From Graph Cuts to Spectral Partitioning (1/2)
$C = \mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$

– The goal is to minimize the cut.
– Encode the partition with an indicator vector:
$x_i = \begin{cases} 1, & v_i \in A \\ 0, & v_i \in B \end{cases}$
• Then we have:
$(x_i - x_j)^2 = \begin{cases} 0, & i, j \text{ in the same group} \\ 1, & i, j \text{ in different groups} \end{cases}$
From Graph Cuts to Spectral Partitioning (2/2)
$C = \mathbf{x}^T L \mathbf{x}, \quad \mathbf{x} \in \{0, 1\}^{|V|}$
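The step from the previous slide to this identity, written out (a short derivation using only the definitions above; the ½ cancels the double counting of each crossing pair in the sum over ordered pairs):

$\mathbf{x}^T L \mathbf{x} = \mathbf{x}^T D \mathbf{x} - \mathbf{x}^T W \mathbf{x} = \sum_i d_{ii} x_i^2 - \sum_{i,j} w_{ij} x_i x_j = \tfrac{1}{2} \sum_{i,j} w_{ij} (x_i - x_j)^2 = \sum_{i \in A,\, j \in B} w_{ij} = \mathrm{cut}(A, B) = C$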
Graph Cut Minimization
$\mathbf{x}^T L \mathbf{x} \ge 0, \quad \forall \mathbf{x} \in \mathbb{R}^{|V|}$
Properties of the Laplacian (2/2)
It can be shown that the multiplicity of the eigenvalue 0 is equal to the number of connected components of 𝐺.
- How to find $\mathbf{h}_2$? Recall the constraints:
▪ $\mathbf{h}_2 \in \{\alpha, \beta\}^{|V|}$   ($\mathbf{x}^T L \mathbf{x} = 0 \Rightarrow$ constant across each component)
▪ $\mathbf{h}_1^T \mathbf{h}_2 = 0$ (orthogonality)
▪ $\|\mathbf{h}_2\|_2 = 1$ (unit magnitude)
- It can be shown that:
$\mathbf{h}_2[i] = \begin{cases} \sqrt{|B| / (|A| \cdot |V|)}, & v_i \in A \\ -\sqrt{|A| / (|B| \cdot |V|)}, & v_i \in B \end{cases}$
Spectral Graph Bisection
$\hat{\mathbf{x}} = \underset{\mathbf{x} \in \mathbb{R}^{|V|}}{\operatorname{argmin}} \; \mathbf{x}^T L \mathbf{x}, \quad \text{s.t. } \mathbf{1}^T \mathbf{x} = 0, \;\; \|\mathbf{x}\|_2 = 1$

• If there are two connected components, the objective equals 0 and the solution is binarized.
• If not, we obtain something approximate. How to obtain the binary cluster labels?
$s_i = \operatorname{sign}(\mathbf{h}_2[i]) = \begin{cases} 1, & \mathbf{h}_2[i] \ge 0 \\ 0, & \mathbf{h}_2[i] < 0 \end{cases}$
Spectral Bisection Algorithm (1/2)
(1) Pre-processing:
– Build the Laplacian matrix 𝐿 of the graph (the 6 × 6 matrix shown earlier).
(2) Decomposition:
– Compute the eigenvalues and eigenvectors of 𝐿.
– Map each vertex to its corresponding component of ℎ2, the eigenvector of the second-smallest eigenvalue λ2:
      node:  1    2    3    4    5    6
      ℎ2:   0.3  0.6  0.3 -0.3 -0.3 -0.6
How do we now find the clusters?
Spectral Bisection Algorithm (2/2)
(3) Grouping:
– Assign nodes to one of the two clusters based on the sign of the corresponding component of ℎ2
Split at 0:
Cluster A: positive components, Cluster B: negative components
      node:  1    2    3    4    5    6
      ℎ2:   0.3  0.6  0.3 -0.3 -0.3 -0.6
⇒ A = {1, 2, 3}, B = {4, 5, 6}
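A compact NumPy sketch of the three steps on the example graph (the eigenvector's sign may come out flipped, which only swaps the cluster labels):

```python
import numpy as np

# (1) Pre-processing: adjacency and Laplacian of the 6-node example graph
W = np.array([[0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [1, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W

# (2) Decomposition: eigenvector of the second-smallest eigenvalue (the Fiedler vector)
eigvals, eigvecs = np.linalg.eigh(L)   # eigh returns eigenvalues in ascending order
h2 = eigvecs[:, 1]

# (3) Grouping: split at 0 on the sign of h2
labels = (h2 >= 0).astype(int)
print(np.round(h2, 2))   # roughly proportional to [0.3, 0.6, 0.3, -0.3, -0.3, -0.6] (up to sign)
print(labels)            # two groups: {1, 2, 3} and {4, 5, 6}
```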
Example: Spectral Partitioning (1/3)
(Figure: value of ℎ2 plotted against rank in ℎ2.)
Example: Spectral Partitioning (2/3)
(Figure: components of ℎ2 — value of ℎ2 plotted against rank in ℎ2.)
k-Way Spectral Clustering
– Stack the first k eigenvectors as the columns of
$H = [\,\mathbf{h}_1 \; \mathbf{h}_2 \; \ldots \; \mathbf{h}_k\,] \in \mathbb{R}^{n \times k}$
[von Luxburg ‘07], [Shi and Malik ‘00], [Ng, Jordan, Weiss ‘02]
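A sketch of the usual k-way recipe: form H from the first k eigenvectors of L and run k-means on its rows. This is the unnormalised variant; the cited papers mostly use normalised Laplacians and/or row-normalised H, so treat this as one reasonable simplification:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k, seed=0):
    """Unnormalised spectral clustering: W is an (n, n) similarity/adjacency matrix."""
    L = np.diag(W.sum(axis=1)) - W                 # graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L)           # ascending eigenvalues
    H = eigvecs[:, :k]                             # H = [h_1 ... h_k], shape (n, k)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(H)
```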
How to Select k?
• Eigengap
– The difference between two consecutive eigenvalues
• The most stable clustering is generally given by the value of k that maximizes the eigengap
• In general, pick the k that maximizes $\Delta_k = |\lambda_{k+1} - \lambda_k|$
(Figure: eigenvalue plot with λ1 and λ2 marked; the largest gap is $|\lambda_2 - \lambda_1|$, so choose k = 2.)
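A small sketch of the eigengap heuristic applied to the Laplacian spectrum of a similarity matrix W. Conventions vary (Laplacian vs. affinity spectrum, ascending vs. descending order), so this is one reasonable variant rather than the lecture's exact recipe:

```python
import numpy as np

def choose_k_by_eigengap(W, k_max=10):
    """Pick k as the index of the largest gap between consecutive Laplacian eigenvalues."""
    L = np.diag(W.sum(axis=1)) - W
    eigvals = np.linalg.eigvalsh(L)[:k_max + 1]    # ascending order
    gaps = np.abs(np.diff(eigvals))                # |lambda_{k+1} - lambda_k|
    return int(np.argmax(gaps)) + 1                # +1: gaps[0] corresponds to k = 1
```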
Spectral Clustering vs. k-Means
• 2-dimensional points
• Find k=3 clusters
http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering
http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
Comparison (scikit-learn)
Thank You! ☺
DiscoverGreece.com