
Foundations of Machine Learning

M.Sc. in DSBA

Lecture 6
Unsupervised learning: clustering

Joseph Boyd

Thursday, November 18, 2022


Acknowledgements

• The lecture is partially based on material by


– Fragkiskos Malliaros (CentraleSupélec)
– Richard Zemel, Raquel Urtasun and Sanja Fidler (University of Toronto)
– Chloé-Agathe Azencott (Mines ParisTech)
– Julian McAuley (UC San Diego)
– Dimitris Papailiopoulos (UW-Madison)
– Jure Leskovec, Anand Rajaraman, Jeff Ullman (Stanford Univ.)
• http://www.mmds.org
– Panagiotis Tsaparas (UOI)
– Evimaria Terzi (Boston University)
– Andrew Ng (Stanford University)
– Nina Balcan and Matt Gormley (CMU)
– Ricardo Gutierrez-Osuna (Texas A&M Univ.)
– M. Pawan Kumar
– Tan, Steinbach, Kumar
• Introduction to Data Mining

Thank you!
2
Unsupervised learning
(clustering)

3
Supervised vs. Unsupervised Learning

Supervised learning

• We have labeled examples
• Given those examples, learn a model that can generalize to unseen examples
• Key tasks:
  – Classification
  – Regression

Unsupervised learning

• The data is unlabeled
• Given the data, learn a model that identifies structure in the data (and generalizes to new data)
• Key tasks:
  – Clustering
  – Dimensionality reduction (unsupervised)

4
What is Cluster Analysis?

• Cluster: a collection of data objects


– Similar (or related) to one another within the same group
– Dissimilar (or unrelated) to objects in other groups

• Cluster analysis (or clustering)


– Finding similarities between data according to the characteristics of
the data and …
– … grouping similar data objects into clusters

• Typical applications
– As a stand-alone tool to get further insights about the data
– As a preprocessing step for other algorithms

5
Any Natural Grouping?

Clustering is subjective

(Figure: the same set of characters grouped in four different ways: as Simpson’s family, school employees, females, and males.)


6
Slide by Eamonn Keogh, UCR
What is a Good Clustering?

• Good clusters have:


– High intra-cluster similarity: cohesive within clusters
– Low inter-cluster similarity: distinctive between clusters

• The quality of a clustering method depends on


– The similarity measure used by the method
– Its ability to discover some or all of the hidden patterns

Recall the distance and similarity measures covered in kNN classification.

7
Goals of Clustering

⚫ Group objects that are similar into clusters: classes that are
unknown beforehand

8
Applications of Clustering

⚫ Understand general characteristics of the data


⚫ Visualize the data
⚫ Infer some properties of a data point based on how it relates to
other data points
⚫ Examples
− Find subtypes of diseases
− Visualize protein families
− Find categories among images
− Find patterns in financial transactions
− Detect communities in social networks
− Find users with similar interests (e.g., Netflix, Amazon)

10
Cluster Centroids and Medoids

⚫ Centroid: mean of the points in the cluster

⚫ Medoid: point in the cluster that is closest to the centroid

11
Cluster Evaluation

⚫ Clustering is unsupervised
⚫ There is no ground truth. How do we evaluate the quality of
a clustering algorithm?
⚫ Based on the shape of the clusters:
− Points within the same cluster should be nearby/similar and points
far from each other should belong to different clusters
⚫ Based on the stability of the clusters:
− We should get the same results if we remove some data points, add
noise, etc.
⚫ Based on domain knowledge (ground truth)
− The clusters should “make sense”

12
Major Clustering Approaches

Fahad et al. A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis, 2014

13
Hierarchical vs. Partitional

Hierarchical
Partitional

Slide by Eamonn Keogh, UCR 14


k-means clustering

15
k-means Algorithm – The Idea

Most well-known and popular clustering algorithm:

1. Start with 𝑘 random cluster centers

2. Iterate:
– Assign each example to closest center
– Recalculate centers as the mean of the points in a cluster

16
k-means: An Example

17
k-means: Initialize Centers Randomly

18
k-means: Assign Points to Nearest Center

19
k-means: Readjust Centers

20
k-means: Assign Points to Nearest Center

21
k-means: Readjust Centers

22
k-means: Assign Points to Nearest Center

23
k-means: Readjust Centers

24
k-means: Assign Points to Nearest Center

No changes: Done

Test demo at: http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html


25
k-means Clustering – Objective Function

⚫ Minimize the intra-cluster variance


⚫ Within-cluster sum of squares
For a cluster $C$:

$$\mathrm{Var}_{\mathrm{in}}(C) = \frac{1}{|C|} \sum_{x \in C} \| x - \mu_C \|^2$$

For all clusters:

$$V = \sum_{k=1}^{K} \frac{1}{|C_k|} \sum_{x \in C_k} \| x - \mu_{C_k} \|^2$$

⚫ For each cluster, the points assigned to it are those that are closer to its centroid than to any other centroid
26
Lloyd’s Algorithm for k-means

⚫ The k-means objective cannot be optimized exactly in general (it is an NP-hard problem)

⚫ We adopt a greedy strategy (Lloyd’s Algorithm)
− Randomly partition the data into 𝑘 clusters and iterate:
➢ Compute the centroid of each cluster
➢ Assign each point to the cluster of the closest centroid

27
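As an illustration (not the lecture’s reference code), here is a minimal NumPy sketch of Lloyd’s algorithm; the function name `lloyd_kmeans`, the convergence test, and the random seeding are illustrative assumptions.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: X is an (n, p) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stable: converged
        centroids = new_centroids
    return labels, centroids
```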
How to Select k?

Elbow rule

28
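A small sketch of the elbow rule, assuming scikit-learn’s `KMeans` and synthetic `make_blobs` data purely for illustration: plot the within-cluster sum of squares (`inertia_`) against k and pick the value where the curve bends.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data just for illustration
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# inertia_ is the within-cluster sum of squares of the fitted model
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 11)]

plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("within-cluster sum of squares")
plt.title("Elbow rule: pick k where the curve bends")
plt.show()
```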
Summary of k-means

• Advantages
– Computational time is linear: $\mathcal{O}(npkt)$ for n points in p dimensions, k clusters, and t iterations (each iteration computes kn distances)
– Easily implemented
– Amenable to stochastic optimisation (mini-batch k-means)

• Drawbacks
– Need to select k (user-defined parameter)
– Sensitivity to noise and outliers
– Non-deterministic (stochastic)
• Different solutions with each initialization
– The clusters are forced to have “spherical” (convex) shapes

29
Example

Source: https://pafnuty.wordpress.com/
30
Improving k-means: k-means++

• The quality of the solution depends on the initialization


• Rationale behind random initialization
– Choosing a random assignment may lead the algorithm to a good
local minimum
• Another approach: k-means++ [Arthur and Vassilvitskii ‘07]
1. Select a random point and declare it centroid 𝑐1
2. For all remaining data points $x_j$, compute the distance $d(x_j, c_1)$
3. Select a random point with probability proportional to $d(x_j, c_1)^2$ and set it as 𝑐2
• This will give a point far enough from 𝑐1
4. Repeat steps 2 and 3, using the distance to the nearest centroid chosen so far, until all 𝑘 centroids are selected (a sketch follows below)

31
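A minimal sketch of the k-means++ seeding described above, assuming a NumPy array `X` of shape (n, p); after the first centre, distances are taken to the nearest centre chosen so far.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: each new centre is drawn with probability
    proportional to its squared distance to the nearest centre chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]          # step 1: first centre at random
    for _ in range(k - 1):
        # squared distance from every point to its nearest existing centre
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                                   axis=2) ** 2, axis=1)
        probs = d2 / d2.sum()                    # step 3: proportional to d(x, c)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```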
scikit-learn

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py

32
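A short usage sketch of `sklearn.cluster.KMeans` (illustrative data and parameter values, not part of the lecture):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# init="k-means++" is the default seeding; n_init restarts keep only the best run
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)

print(km.labels_[:10])        # cluster index of the first 10 points
print(km.cluster_centers_)    # learned centroids
print(km.inertia_)            # within-cluster sum of squares
```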
Gaussian Mixture Models

33
Gaussian mixture models – overview

GMMs build upon the basic ideas of k-means:

• Soft clustering: for each datapoint $x_n \in \mathbb{R}^D$, model cluster membership through a latent random variable $z_n$, with membership probabilities satisfying $\sum_{k=1}^{K} p(z_n = k) = 1$

• Model clusters as elliptical Gaussians: fit $\theta = \{\mu, \Sigma, \pi\}$ (means, covariances, prior probabilities)

• Fit with iterative Expectation-Maximisation (EM)


34
Marginal likelihood (1/2)

The marginal likelihood on 𝒙𝒏 ,

$$p(x_n; \theta) = \sum_{k=1}^{K} p(x_n \mid z_n = k; \theta)\, p(z_n = k; \theta) = \sum_{k=1}^{K} \mathcal{N}(x_n; \mu_k, \Sigma_k) \cdot \pi_k$$

i.e. a weighted sum of Gaussians: each datapoint belongs in part to all clusters.

35
Marginal likelihood (2/2)
$$p(x_n; \theta) = \sum_{k=1}^{K} \mathcal{N}(x_n; \mu_k, \Sigma_k) \cdot \pi_k$$

(Figure: two Gaussian components, $k = 1$ and $k = 2$.)

• If $p(x \mid z = 1)$ is low but $p(z = 1)$ is high, and $p(x \mid z = 2)$ is high while $p(z = 2)$ is low, then $p(x)$ is high.
• If $p(x \mid z = 1)$ is low with $p(z = 1)$ high, and $p(x \mid z = 2)$ is low with $p(z = 2)$ low, then $p(x)$ is low.
36
Posterior distribution (1/2)

By Bayes’ rule, the posterior distribution,

$$p(z_n = k \mid x_n; \theta) = \frac{\overbrace{p(x_n \mid z_n = k; \theta)}^{\text{likelihood}} \cdot \overbrace{p(z_n = k; \theta)}^{\text{prior}}}{\underbrace{p(x_n; \theta)}_{\text{marginal likelihood}}} = \frac{\mathcal{N}(x_n; \mu_k, \Sigma_k) \cdot \pi_k}{\sum_{j=1}^{K} \mathcal{N}(x_n; \mu_j, \Sigma_j) \cdot \pi_j}$$

by which we may assign $x_n$ to the cluster $k$ maximising the posterior.

37
Posterior distribution (2/2)

$$p(z_n = k \mid x_n; \theta) = \frac{\mathcal{N}(x_n; \mu_k, \Sigma_k) \cdot \pi_k}{\sum_{j=1}^{K} \mathcal{N}(x_n; \mu_j, \Sigma_j) \cdot \pi_j}$$

(Figure: two components, $k = 1$ and $k = 2$; in one region $p(z = 1 \mid x) < p(z = 2 \mid x)$, in the other $p(z = 1 \mid x) > p(z = 2 \mid x)$.)

38
Maximum likelihood estimation

GMM parameters 𝜽 can be estimated with


maximum likelihood over all data 𝑿,
$$\max_{\theta} p(X; \theta) = \max_{\theta} \prod_{n=1}^{N} p(x_n; \theta)$$

or equivalently, taking logs (which makes optimisation easier),

$$\max_{\theta} \sum_{n=1}^{N} \log p(x_n; \theta) = \max_{\theta} \sum_{n=1}^{N} \log \sum_{k=1}^{K} \mathcal{N}(x_n; \mu_k, \Sigma_k) \cdot \pi_k$$
39
Expectation maximisation

EM is an iterative algorithm for maximising the


likelihood in two alternating steps. For iteration 𝑖:

• Expectation (E) step: compute the posterior probabilities $p_{nk}^{(i)}$ for all data $x_n$ and all $K$ clusters.

• Maximisation (M) step: maximise the likelihood with respect to the model parameters $\theta^{(i)} = \{\mu^{(i)}, \Sigma^{(i)}, \pi^{(i)}\}$.

Repeat until convergence of likelihood.


40
Expectation maximisation: E step

At iteration 𝑖 compute the posteriors, defined as,

$$p_{nk}^{(i)} := p(z_n = k \mid x_n; \theta^{(i)}) = \frac{\mathcal{N}(x_n; \mu_k^{(i)}, \Sigma_k^{(i)}) \cdot \pi_k^{(i)}}{\sum_{j=1}^{K} \mathcal{N}(x_n; \mu_j^{(i)}, \Sigma_j^{(i)}) \cdot \pi_j^{(i)}}$$

In other words, cluster assignment

41
Expectation maximisation: M step (1)

The optimal model parameters maximise the likelihood,


$$\theta^{*} = \operatorname*{argmax}_{\theta} \sum_{n=1}^{N} \log \sum_{k=1}^{K} \mathcal{N}(x_n; \mu_k, \Sigma_k) \cdot \pi_k$$

Applying Jensen’s inequality, we obtain a lower bound,

$$\log \sum_{k=1}^{K} \mathcal{N}(x_n; \mu_k, \Sigma_k) \cdot \pi_k \;\geq\; \sum_{k=1}^{K} p_{nk}^{(i)} \cdot \log \frac{\mathcal{N}(x_n; \mu_k, \Sigma_k) \cdot \pi_k}{p_{nk}^{(i)}}$$

Proceed by maximising the lower bound,

$$\theta_k^{(i+1)} = \operatorname*{argmax}_{\theta} \sum_{n=1}^{N} p_{nk}^{(i)} \left[ \log \mathcal{N}(x_n; \mu_k, \Sigma_k) + \log \pi_k \right]$$

42
Expectation maximisation: M step (2)

This can be solved analytically,

$$\mu_k^{(i+1)} = \frac{\sum_{n=1}^{N} p_{nk}^{(i)}\, x_n}{\sum_{n=1}^{N} p_{nk}^{(i)}}$$

$$\Sigma_k^{(i+1)} = \frac{\sum_{n=1}^{N} p_{nk}^{(i)}\, (x_n - \mu_k^{(i+1)})(x_n - \mu_k^{(i+1)})^T}{\sum_{n=1}^{N} p_{nk}^{(i)}}$$

$$\pi_k^{(i+1)} = \frac{1}{N} \sum_{n=1}^{N} p_{nk}^{(i)}$$

In other words, cluster description


43
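Below is a compact NumPy/SciPy sketch of the full EM loop following the E-step and M-step formulas above. It is illustrative only: the initialisation, fixed iteration count, and the small ridge added to the covariances for numerical stability are assumptions, not part of the lecture.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=50, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, K, replace=False)]                 # initialise means from data
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: posterior responsibilities p_nk
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                                for k in range(K)])
        p = dens / dens.sum(axis=1, keepdims=True)
        # M step: closed-form updates for means, covariances, priors
        Nk = p.sum(axis=0)
        mu = (p.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (p[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        pi = Nk / n
    return mu, sigma, pi, p
```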
Expectation maximisation

(Slides 44–52: figures illustrating successive EM iterations; images not reproduced here.)

44–52
Connection to k-means

Recall we optimise,

$$\theta_k^{(i+1)} = \operatorname*{argmax}_{\theta} \sum_{n=1}^{N} p_{nk}^{(i)} \left[ \log \mathcal{N}(x_n; \mu_k, \Sigma_k) + \log \pi_k \right]$$

Suppose constant $\Sigma_k^{(i)} = I_{D \times D}$ and $\pi_k^{(i)} = 1/K$, and “one-hot” $p_{nk}^{(i)}$:

$$\theta_k^{(i+1)} = \operatorname*{argmax}_{\theta} \sum_{x \,:\, p_k^{(i)}(x) = 1} \log \mathcal{N}(x; \mu_k, I_{D \times D})$$

$$= \operatorname*{argmin}_{\theta} \frac{1}{2} \sum_{x \,:\, p_k^{(i)}(x) = 1} (x - \mu_k)^T (x - \mu_k)$$

$$= \frac{1}{|\{x : p_k^{(i)}(x) = 1\}|} \sum_{x \,:\, p_k^{(i)}(x) = 1} x$$

i.e. under these restrictions the M-step update is exactly the k-means centroid update.

53
GMM properties

- Parametric? Yes, the number of parameters in $\theta$ grows with $k$ and $d$, not with $n$

- Generative? Yes, the model specifies $p(x \mid z)$ and $p(z)$, and the posterior follows from Bayes’ rule

- Identifiable? No, the clusters can be permuted ($k!$ equivalent models)

- Convex? No, the negative log-likelihood is not a convex function

54
scikit-learn

https://scikit-learn.org/stable/modules/mixture.html#mixture

https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_pdf.html

55
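A short usage sketch of `sklearn.mixture.GaussianMixture` on illustrative synthetic data; `predict_proba` returns the posterior responsibilities discussed above:

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard_labels = gmm.predict(X)        # cluster of highest posterior probability
soft_labels = gmm.predict_proba(X)  # posterior p(z_n = k | x_n) for each point
print(gmm.means_, gmm.weights_)     # fitted means and mixing proportions
```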
Spectral clustering

56
Clustering Structure in Graphs

How to discover the clustering structure?


57
Clustering Non-Graph Data
• Apply graph clustering algorithms on data with no inherent graph structure
(e.g., points in a d-dimensional Euclidean space)
• How?
1. Construct a similarity graph based on the topological relationships and
distances between data points (e.g., kNN graph)
2. Then, the problem of clustering the set of data points is transformed to a
graph clustering problem
(Figure: a set of 2-D data points and the similarity graph built from them, e.g. a kNN graph with k = 10.)
[von Luxburg ‘07], [Shi and Malik ‘00], [Ng, Jordan, Weiss ’02] 58
Adjacency Matrix

• Adjacency matrix 𝑾:
- 𝑛 × 𝑛 matrix, where 𝑛 = |𝑉| is the number of nodes
- 𝑠(𝑣𝑖 , 𝑣𝑗 ) = 𝑤𝑖𝑗 ≥ 0 for similarity function 𝑠 (non-zero if 𝐸𝑖𝑗 exists)
- Simplest case: 𝑤𝑖𝑗 binary
(Example graph on nodes 1–6, with edges 1–2, 1–3, 1–5, 2–3, 3–4, 4–5, 4–6, 5–6.)

        1  2  3  4  5  6
    1 [ 0  1  1  0  1  0 ]
    2 [ 1  0  1  0  0  0 ]
    3 [ 1  1  0  1  0  0 ]
    4 [ 0  0  1  0  1  1 ]
    5 [ 1  0  0  1  0  1 ]
    6 [ 0  0  0  1  1  0 ]
• Important properties
– Symmetric matrix
– Eigenvectors are real and orthogonal
59
Degree Matrix

• Degree matrix 𝑫:
– 𝑛 × 𝑛 diagonal matrix
– 𝑑𝑖𝑖 = σ𝑗 𝑤𝑖𝑗 “degree” of node 𝑣𝑖

        1  2  3  4  5  6
    1 [ 3  0  0  0  0  0 ]
    2 [ 0  2  0  0  0  0 ]
    3 [ 0  0  3  0  0  0 ]
    4 [ 0  0  0  3  0  0 ]
    5 [ 0  0  0  0  3  0 ]
    6 [ 0  0  0  0  0  2 ]

60
Laplacian Matrix

• Laplacian matrix 𝑳 = 𝑫 − 𝑾:
– 𝑛 × 𝑛 symmetric matrix
– Note the row sums: $\sum_j l_{ij} = 0 \;\forall i \;\Rightarrow\; [1, \dots, 1]^T$ is an eigenvector with eigenvalue 0

        1   2   3   4   5   6
    1 [ 3  -1  -1   0  -1   0 ]
    2 [-1   2  -1   0   0   0 ]
    3 [-1  -1   3  -1   0   0 ]
    4 [ 0   0  -1   3  -1  -1 ]
    5 [-1   0   0  -1   3  -1 ]
    6 [ 0   0   0  -1  -1   2 ]
• Important properties
– Eigenvalues are non-negative real numbers
– Eigenvectors are real and orthogonal
61
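A small NumPy sketch (for illustration) that builds W, D, and L for the 6-node example graph of these slides and checks the stated properties (zero row sums, non-negative real eigenvalues):

```python
import numpy as np

# Edges of the 6-node example graph used in the slides (1-indexed as in the slides)
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]

n = 6
W = np.zeros((n, n))
for i, j in edges:
    W[i - 1, j - 1] = W[j - 1, i - 1] = 1   # symmetric, binary weights

D = np.diag(W.sum(axis=1))                  # degree matrix
L = D - W                                   # (unnormalized) Laplacian

print(L.sum(axis=1))                        # every row sums to 0
print(np.linalg.eigvalsh(L))                # eigenvalues are real and >= 0
```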
Bi-partitioning (1/2)
• Undirected graph 𝐺 = (𝑉, 𝐸)
• Bi-partitioning task (k = 2):
– Divide the nodes into two disjoint groups 𝐴, 𝐵

(Figure: the example graph split into groups A = {1, 2, 3} and B = {4, 5, 6}.)

Questions:
• How can we define a good partition of 𝐺?
• How can we efficiently identify such a partition?

62
Bi-partitioning (2/2)

• What makes a good partition?


– Maximize the number of within-group connections
– Minimize the number of between-group connections

(Figure: the example graph and a partition into groups A and B.)

63
Graph Cuts

• Express partitioning objectives as a function of the edge cut of


the partition
• Cut: Set of edges across two groups:

$$\mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$$

(Figure: two partitions A = {1, 2, 3} and B = {4, 5, 6} of the example graph, with cut(A, B) = 2.)
64
Graph Cut Criterion for Clustering

• Criterion: Minimum-cut
– Minimize the weight of connections between groups:

$$\operatorname*{argmin}_{A, B}\; \mathrm{cut}(A, B)$$

(Figure: an “optimal” cut versus a degenerate minimum cut that isolates a single node.)

Problem
• Often not a satisfactory partition: it tends to isolate single nodes
• Does not consider internal cluster connectivity

65
Ratio Cut

Solution: normalize the cut by the sizes of the groups,

$$\mathrm{ratio\text{-}cut}(A, B) = \frac{\mathrm{cut}(A, B)}{|A|} + \frac{\mathrm{cut}(A, B)}{|B|}$$

where $|A|$ and $|B|$ are the sizes of A and B. Internal group connectivity is still not taken into account.
66
Normalized Cut

• Criterion: Normalized cut


– Connectivity between groups relative to the density of each group

$$\mathrm{normalized\text{-}cut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(A)} + \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(B)}$$

– $\mathrm{vol}(A)$: total weighted degree of the nodes in $A$, i.e., $\sum_{i \in A} d_{ii}$

• Why use this criterion?
– It produces more balanced partitions
• How do we efficiently find a good partition?
– Computing the optimal cut is NP-hard
[Shi and Malik ‘97] 67
Ratio Cut vs. Normalized Cut (1/2)

(Figure: a graph with two candidate cuts; the red cut is the minimum cut, the green cut is a more balanced alternative.)

$$\mathrm{ratio\text{-}cut}(A, B) = \frac{\mathrm{cut}(A, B)}{|A|} + \frac{\mathrm{cut}(A, B)}{|B|}$$

$$\mathrm{normalized\text{-}cut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(A)} + \frac{\mathrm{cut}(A, B)}{\mathrm{vol}(B)}$$

• Ratio-Cut(Red) = 1/1 + 1/8 = 1.125
• Ratio-Cut(Green) = 2/5 + 2/4 = 0.9 (lower value is better)

• Normalized-Cut(Red) = 1/1 + 1/26 = 1.03
• Normalized-Cut(Green) = 2/12 + 2/16 = 0.29

Normalized cut favours Green even more strongly, because it accounts for the density (volume) of each group.

68
Graph Cuts

The ratio and normalized cut criteria


can be reformulated using matrices

The minimum cut problem (ratio, normalized)


can be solved using spectral techniques

Recall the:
• Adjacency matrix 𝑊
• Degree matrix 𝐷
• Laplacian matrix 𝐿 = 𝐷 − 𝑊

69
From Graph Cuts to Spectral Partitioning (1/2)

• For simplicity, consider the objective of the unnormalised cut:


– Recall that the cut 𝐶 is the total weight of the edges between groups 𝐴 and 𝐵,

$$C = \mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij}$$

– The goal is to minimize the cut

• Represent partition membership by a vector $x \in \{0, 1\}^{|V|}$:

$$x_i = \begin{cases} 1, & v_i \in A \\ 0, & v_i \in B \end{cases}$$

• Then we have

$$(x_i - x_j)^2 = \begin{cases} 0, & i, j \text{ in the same group} \\ 1, & i, j \text{ in different groups} \end{cases}$$

70
From Graph Cuts to Spectral Partitioning (2/2)

• Then, we can express the cut size in terms of the $x_i$:

$$C = \mathrm{cut}(A, B) = \sum_{i \in A,\, j \in B} w_{ij} = \frac{1}{2} \sum_{i, j \in V} w_{ij} (x_i - x_j)^2$$

$$= \frac{1}{2} \sum_{i \in V} \sum_{j \in V} \left( w_{ij} x_i^2 - 2 w_{ij} x_i x_j + w_{ij} x_j^2 \right) = \sum_{i \in V} d_{ii} x_i^2 - \sum_{i \in V} \sum_{j \in V} w_{ij} x_i x_j = x^T (D - W)\, x$$

The cut can therefore be expressed in terms of the Laplacian matrix L:

$$C = x^T L x, \qquad x \in \{0, 1\}^{|V|}$$
71
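As a quick numeric check of this identity (an illustrative sketch, not part of the slides), take the example graph and the partition A = {1, 2, 3}, B = {4, 5, 6} used earlier:

```python
import numpy as np

# Same 6-node example graph as before
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i - 1, j - 1] = W[j - 1, i - 1] = 1
L = np.diag(W.sum(axis=1)) - W

# Indicator vector for the partition A = {1, 2, 3}, B = {4, 5, 6}
x = np.array([1, 1, 1, 0, 0, 0])

print(x @ L @ x)   # 2.0: the two edges (1,5) and (3,4) cross the cut
```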
Graph Cut Minimization

• The minimum cut criterion for graph bisection is

$$C = x^T L x, \quad x \in \{0, 1\}^{|V|}, \qquad \hat{x} = \operatorname*{argmin}_{x \in \{0, 1\}^{|V|}} x^T L x$$

• The binary constraints $x \in \{0, 1\}^{|V|}$ make the optimization problem NP-complete
– Idea: relax the binary constraints to real ones:

$$x \in \mathbb{R}^{|V|}, \quad \|x\|_2 = 1, \quad \mathbf{1}^T x = \sum_{i \in V} x_i = 0$$

– With $\lambda_2$ the second smallest eigenvalue of $L$ and $h_2$ the corresponding eigenvector, the minimum of the relaxed problem is $C(\hat{x}) = \hat{x}^T L \hat{x} = h_2^T L h_2 \propto \lambda_2$

• $h_1 = [1, \dots, 1]^T$ is an eigenvector of $L$ ($\lambda_1 = 0$) and a trivial solution to $C$
• If $G$ has two connected components, $h_2$ gives the minimum cut
• Otherwise, $h_2$ is usually a good approximation ($\lambda_2$ is small)
72
Properties of the Laplacian (1/2)

The Laplacian matrix is positive semi-definite:

$$x^T L x \geq 0, \quad \forall x \in \mathbb{R}^{|V|}$$

Spectrum: all eigenvalues of $L$ are real and non-negative

Spectrum and connectivity:
– The smallest eigenvalue $\lambda_1$ of $L$ is zero, since $L \mathbf{1} = 0 \cdot \mathbf{1}$
– If the second smallest eigenvalue $\lambda_2 \neq 0$, then $G$ is connected (a single connected component)
– If $L$ has $m$ zero eigenvalues, $G$ has $m$ connected components

73
Properties of the Laplacian (2/2)

Recall the Laplacian row sums, $\sum_j l_{ij} = 0 \;\forall i$, implying $h_1 = [1, \dots, 1]^T$ is an (unnormalized) eigenvector with eigenvalue 0.
- This is a trivial solution to the graph bisection problem: $h_1^T L h_1 = \lambda_1 = 0$.

It can be shown that the multiplicity of the eigenvalue 0 equals the number of connected components of $G$.
- How to find $h_2$? Recall the constraints:
  ▪ $h_2 \in \{\alpha, \beta\}^{|V|}$ ($x^T L x = 0 \Rightarrow$ constant within each component)
  ▪ $h_1^T h_2 = 0$ (orthogonality)
  ▪ $\|h_2\|_2 = 1$ (unit magnitude)
- It can be shown that, for a graph with two components A and B,

$$h_2[i] = \begin{cases} \sqrt{|B| / (|A| \cdot |V|)}, & v_i \in A \\ -\sqrt{|A| / (|B| \cdot |V|)}, & v_i \in B \end{cases}$$

74
Spectral Graph Bisection

• Spectral graph bisection therefore solves a relaxed mincut problem:

$$\hat{x} = \operatorname*{argmin}_{x \in \mathbb{R}^{|V|}} x^T L x, \quad \text{s.t. } \mathbf{1}^T x = 0, \; \|x\|_2 = 1$$

• If there are two connected components, the objective equals 0 and the solution is already binary (piecewise constant).

• If not, we obtain an approximate solution. How do we obtain binary cluster labels?

$$s_i = \operatorname{sign}(h_2[i]) = \begin{cases} 1, & h_2[i] \geq 0 \\ 0, & h_2[i] < 0 \end{cases}$$

Spectral graph bisection algorithm

1. Compute the Laplacian matrix $L$ with entries $L_{ij} = D_{ij} - W_{ij}$
2. Find the eigenvector $h_2$ of $L$ corresponding to the second smallest eigenvalue
3. The cluster membership of node $i$ is $s_i = \operatorname{sign}(h_2[i])$

75
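A minimal sketch of the bisection algorithm above, applied to the example graph; the helper name `spectral_bisection` is illustrative:

```python
import numpy as np

def spectral_bisection(W):
    """Relaxed min-cut bisection: split nodes by the sign of the second
    smallest eigenvector (the Fiedler vector) of L = D - W."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    h2 = eigvecs[:, 1]                       # second smallest eigenvector
    return (h2 >= 0).astype(int)             # 1 -> one cluster, 0 -> the other

# Example graph from the slides
edges = [(1, 2), (1, 3), (1, 5), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
W = np.zeros((6, 6))
for i, j in edges:
    W[i - 1, j - 1] = W[j - 1, i - 1] = 1

print(spectral_bisection(W))   # separates {1, 2, 3} from {4, 5, 6} (up to sign)
```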
Spectral Bisection Algorithm (1/2)

(1) Pre-processing:
– Build the Laplacian matrix 𝐿 of the graph

        1   2   3   4   5   6
    1 [ 3  -1  -1   0  -1   0 ]
    2 [-1   2  -1   0   0   0 ]
    3 [-1  -1   3  -1   0   0 ]
    4 [ 0   0  -1   3  -1  -1 ]
    5 [-1   0   0  -1   3  -1 ]
    6 [ 0   0   0  -1  -1   2 ]

(2) Decomposition:
– Find the eigenvalues λ and eigenvectors 𝐻 of the matrix 𝐿

    λ = [0.0, 1.0, 3.0, 3.0, 4.0, 5.0]

        [ 1   0.3  -0.5  -0.2  -0.4  -0.5 ]
        [ 1   0.6   0.4  -0.4   0.4   0.0 ]
    H = [ 1   0.3   0.1   0.6  -0.4   0.5 ]
        [ 1  -0.3   0.1   0.6   0.4  -0.5 ]
        [ 1  -0.3  -0.5  -0.2   0.4   0.5 ]
        [ 1  -0.6   0.4  -0.4  -0.4   0.0 ]

– Map vertices to the corresponding components of ℎ2, the eigenvector associated with λ2:

    node:  1     2     3     4     5     6
    ℎ2:    0.3   0.6   0.3  -0.3  -0.3  -0.6

How do we now find the clusters?
76
Spectral Bisection Algorithm (2/2)

(3) Grouping:
– Assign nodes to one of the two clusters based on the sign of the corresponding component of ℎ2

Split at 0: positive entries form cluster A, negative entries form cluster B.

    Cluster A: nodes 1, 2, 3 (ℎ2 values 0.3, 0.6, 0.3)
    Cluster B: nodes 4, 5, 6 (ℎ2 values -0.3, -0.3, -0.6)

77
Example: Spectral Partitioning (1/3)

(Figure: value of ℎ2 plotted against rank in ℎ2.)
78
Example: Spectral Partitioning (2/3)

(Figure: components of ℎ2 plotted against rank in ℎ2.)
79
k-Way Spectral Clustering

• How do we partition a graph into k clusters?

• Two basic approaches


– Recursive bi-partitioning [Hagen et al., ’92]
• Recursively apply the bi-partitioning algorithm in a hierarchical
divisive manner
• Disadvantages: inefficient, unstable
– Cluster multiple eigenvectors [Shi-Malik, ’00]
• Build a reduced space from multiple eigenvectors
• Commonly used and preferable approach

80
k-Way Spectral Clustering

• Input: Graph 𝐺 = (𝑉, 𝐸) and parameter 𝑘


• Output: Clusters 𝐶1 , 𝐶2 , … , 𝐶𝑘 (i.e., cluster assignments of each node)

1. Let 𝑊 be the adjacency matrix of the graph
2. Compute the Laplacian matrix 𝐿 = 𝐷 − 𝑊 (or a normalized variant)
3. Compute the first 𝑘 eigenvectors of 𝐿:

$$H = [\, h_1 \; h_2 \; \dots \; h_k \,] \in \mathbb{R}^{n \times k}$$

For 𝑖 = 1, … , 𝑛 let $y_i \in \mathbb{R}^k$ be the 𝑖th row of 𝐻

4. Apply 𝑘-means to the points $(y_i)_{i = 1, \dots, n}$ (i.e., the rows of 𝐻) to find the clusters 𝐶1, 𝐶2, …, 𝐶𝑘 (a sketch follows below)

[von Luxburg ‘07], [Shi and Malik ‘00], [Ng, Jordan, Weiss ‘02]
82
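A minimal sketch of the (unnormalized) k-way algorithm above, using NumPy for the eigendecomposition and scikit-learn’s KMeans for the final step; the helper name is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def k_way_spectral(W, k, seed=0):
    """Unnormalized k-way spectral clustering: embed nodes with the first k
    eigenvectors of L = D - W, then run k-means on the rows of the embedding."""
    D = np.diag(W.sum(axis=1))
    L = D - W
    _, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    H = eigvecs[:, :k]                       # n x k matrix of the first k eigenvectors
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(H)
```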
k-Way Spectral Clustering

• Consider the first k eigenvectors of 𝐿 = 𝐷 − 𝑊 as the columns


of a matrix
– Run k -means on the rows of this matrix

(Diagram: the Laplacian matrix 𝐿, its first k eigenvectors stacked as the columns of an n × k matrix, and k-means applied to the rows of that matrix; each row represents a node.)

83
How to Select k?

• Eigengap
– The difference between two consecutive eigenvalues
• Most stable clustering is generally given by the value k that
maximizes eigengap

In general, pick the k that maximizes the eigengap

$$\Delta_k = |\lambda_{k+1} - \lambda_k|$$

(Figure: eigenvalues plotted against their index; the largest gap is $|\lambda_2 - \lambda_1|$, so choose k = 2.)

84
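A small sketch of the eigengap heuristic, assuming the (unnormalized) Laplacian L as input; the helper name and the `k_max` cap are illustrative:

```python
import numpy as np

def choose_k_by_eigengap(L, k_max=10):
    """Pick k maximising the gap |lambda_{k+1} - lambda_k| between the
    smallest eigenvalues of the Laplacian."""
    eigvals = np.sort(np.linalg.eigvalsh(L))[:k_max + 1]
    gaps = np.diff(eigvals)                  # gaps[i] = lambda_{i+2} - lambda_{i+1}
    return int(np.argmax(gaps)) + 1          # +1 converts the index to a cluster count
```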
Spectral Clustering vs. k-Means

• 2-dimensional points
• Find k=3 clusters

(Figure: k-means result vs. spectral clustering result on the same dataset.)

85


scikit-learn

http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering

http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html

86
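A short usage sketch of `sklearn.cluster.SpectralClustering` on the classic two-moons data, where plain k-means struggles (illustrative parameter values):

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

# Non-convex clusters where plain k-means struggles
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, assign_labels="kmeans", random_state=0)
labels = sc.fit_predict(X)
print(labels[:20])
```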
Comparison (scikit-learn)

87
Thank You! ☺

DiscoverGreece.com 88
