Lecture 4 Slides
k-means clustering
K-nearest neighbours
Netflix
The Netflix Prize
The Netflix Prize
2006
• WXYZConsulting beat Cinematch on Oct 8
• UofT (led by Prof. Hinton) emerged as an early leader
2007
• 40,000 teams from 186 countries
• BellKor beat Cinematch by 8.43%
2008
• An ensemble of BellKor and BigChaos beat Cinematch by 9.54%
The Netflix Prize
The Winner
• BellKor’s Pragmatic Chaos beat Cinematch by 10.06%
• Declared the winner on September 18, 2009
• Ensemble of three teams
User groups
Cluster 290:
• Movies like Black Mirror, Lost, and Groundhog Day
The basics of clustering
Types of clustering
1. k-means clustering
2. Hierarchical / agglomerative clustering
How do we define “similar”?
$\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{iF})^T$
Symmetry: $d(\mathbf{x}_1, \mathbf{x}_2) = d(\mathbf{x}_2, \mathbf{x}_1)$
Distance metrics
Euclidean: $d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_2 = \left( \sum_{f=1}^{F} (x_{1f} - x_{2f})^2 \right)^{1/2}$
Manhattan: $d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_1 = \sum_{f=1}^{F} |x_{1f} - x_{2f}|$
Chebychev: $d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_\infty = \max_{f=1,\ldots,F} |x_{1f} - x_{2f}|$
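As a minimal sketch (not from the slides), these three metrics can be computed directly with NumPy; the two feature vectors below are made-up values for illustration.

```python
import numpy as np

x1 = np.array([5.1, 3.5, 1.4, 0.2])   # hypothetical 4-feature observations
x2 = np.array([6.7, 3.1, 4.7, 1.5])

euclidean = np.sqrt(np.sum((x1 - x2) ** 2))   # ||x1 - x2||_2
manhattan = np.sum(np.abs(x1 - x2))           # ||x1 - x2||_1
chebychev = np.max(np.abs(x1 - x2))           # ||x1 - x2||_inf

print(euclidean, manhattan, chebychev)
```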
Distance metrics
Minkowski: $d(\mathbf{x}_1, \mathbf{x}_2) = \|\mathbf{x}_1 - \mathbf{x}_2\|_p = \left( \sum_{f=1}^{F} |x_{1f} - x_{2f}|^p \right)^{1/p}$
Hamming: $d(\mathbf{x}_1, \mathbf{x}_2) = \sum_{f=1}^{F} \mathbb{I}(x_{1f} \ne x_{2f})$
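A short sketch of these two metrics using SciPy (the vectors are made-up examples). Note that SciPy's `hamming` returns the fraction of disagreeing features, so it is scaled by $F$ to match the count defined above.

```python
import numpy as np
from scipy.spatial.distance import minkowski, hamming

x1 = np.array([5.1, 3.5, 1.4, 0.2])
x2 = np.array([6.7, 3.1, 4.7, 1.5])

# Minkowski distance with p = 3 (p = 1 gives Manhattan, p = 2 gives Euclidean)
d_mink = minkowski(x1, x2, p=3)

# Hamming distance: number of features that disagree.
# SciPy returns the *fraction* of disagreements, so scale by F.
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])
d_hamm = hamming(a, b) * len(a)

print(d_mink, d_hamm)
```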
Index sets and centroids
Index set: $S_k = \{1, 3, 7, 21, 44\}$
Centroid: $\mathbf{s}_k = \frac{1}{|S_k|} \sum_{i \in S_k} \mathbf{x}_i$
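A small NumPy sketch of the centroid computation; the data matrix and index set below are made up for illustration (0-based indices are used for the rows).

```python
import numpy as np

# X: one row per observation, one column per feature (made-up values)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [4.0, 5.0],
              [5.0, 4.0],
              [9.0, 9.0]])

# Index set for cluster k
S_k = [0, 1, 3]

# Centroid: the mean of the observations in the index set
s_k = X[S_k].mean(axis=0)
print(s_k)   # -> [2.6667 2.3333]
```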
Cluster distances
k-means clustering
Basics
Partition the observations into k clusters such that the total distance between each observation and its nearest cluster centroid is minimized
Hyperparameters
• k – number of clusters
• $d(\mathbf{x}_1, \mathbf{x}_2)$ – distance metric
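A minimal usage sketch with scikit-learn's `KMeans` (the random data is made up; note that scikit-learn's implementation uses Euclidean distance only, so k is the main hyperparameter exposed here).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # made-up data: 100 observations, 2 features

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])             # cluster assignments of the first 10 points
print(kmeans.cluster_centers_)         # the 3 centroids
```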
Lloyd’s Algorithm
1. Initialize k centroids (e.g., by picking k observations at random)
2. Assign each observation to its closest centroid using the distance metric
3. Recompute each centroid as the mean of the observations assigned to it
4. Repeat steps 2–3 until the assignments no longer change
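A compact NumPy sketch of these steps, assuming Euclidean distance; the function name, random initialization scheme, and stopping test are illustrative, and it assumes no cluster ever becomes empty.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Plain Lloyd's algorithm with Euclidean distance (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize centroids by picking k observations at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each observation to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned observations
        #    (assumes every cluster keeps at least one observation)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop once the centroids (and hence the assignments) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```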
Fisher’s Iris dataset
Overview
4 features
• Petal: length and width
• Sepal: length and width
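The dataset ships with scikit-learn, so a quick sketch of loading it (used by the later examples as well) looks like this:

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target     # 150 observations x 4 features; species labels as targets
print(iris.feature_names)         # sepal length/width, petal length/width (cm)
print(iris.target_names)          # setosa, versicolor, virginica
print(X.shape)                    # (150, 4)
```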
Visualization – no labels
Visualization – true labels
Visualization – k-means labels
How do we determine the number of clusters?
Hierarchical / agglomerative clustering
Basics
Build a hierarchy of clusters by repeatedly merging the closest pair of clusters until only one cluster remains
Hyperparameters
• $d(\mathbf{x}_1, \mathbf{x}_2)$ – distance metric
• $d(S_1, S_2)$ – linkage criterion
Algorithm
1. Start with each observation in its own cluster
2. Merge each cluster with its closest neighboring cluster according to the chosen distance metric / linkage criterion
3. Continue until there is only one cluster (or a stopping criterion is met)
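A minimal sketch using SciPy's agglomerative clustering on the Iris features; the choice of Ward linkage and the cut at 3 clusters are illustrative assumptions (the linkage criteria themselves are defined on the next slides).

```python
from sklearn.datasets import load_iris
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

X = load_iris().data                                # 150 x 4 feature matrix

Z = linkage(X, method="ward", metric="euclidean")   # bottom-up merge history
dendrogram(Z)                                       # visualize the hierarchy
plt.show()

# Cut the tree early to obtain a flat labelling with 3 clusters instead
labels = fcluster(Z, t=3, criterion="maxclust")
```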
Linkage criteria
Centroid: $d(S_1, S_2) = d(\mathbf{s}_1, \mathbf{s}_2)$
Minimum: $d(S_1, S_2) = \min_{i \in S_1,\, j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$
Maximum: $d(S_1, S_2) = \max_{i \in S_1,\, j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$
Linkage criteria
Average: $d(S_1, S_2) = \frac{1}{|S_1|\,|S_2|} \sum_{i \in S_1} \sum_{j \in S_2} d(\mathbf{x}_i, \mathbf{x}_j)$
Minimum variance: $d(S_1, S_2) = \frac{|S_1|\,|S_2|}{|S_1| + |S_2|} \|\mathbf{s}_1 - \mathbf{s}_2\|_2^2$
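A small NumPy/SciPy sketch computing the five linkage criteria by hand for two made-up clusters, assuming Euclidean distance between points.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two made-up clusters of 2-D points
S1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
S2 = np.array([[4.0, 4.0], [5.0, 5.0]])

D = cdist(S1, S2)                          # pairwise Euclidean distances
s1, s2 = S1.mean(axis=0), S2.mean(axis=0)  # cluster centroids

centroid_link = np.linalg.norm(s1 - s2)
minimum_link  = D.min()
maximum_link  = D.max()
average_link  = D.mean()                   # average over all |S1|*|S2| pairs
min_var_link  = (len(S1) * len(S2)) / (len(S1) + len(S2)) * np.sum((s1 - s2) ** 2)

print(centroid_link, minimum_link, maximum_link, average_link, min_var_link)
```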
Dendrogram – Iris dataset
DailyKos
Overview
Internet blog, forum, and news site devoted to the Democratic Party and
liberal politics
Hierarchical clustering dendrogram
Articles per cluster – hierarchical vs. k-means
Top 5 words in each cluster – hierarchical vs. k-means
K-nearest neighbors
Overview
Simple, intuitive, and widely used method that can capture complex non-linear relationships
Two types: classification and regression
Hyperparameters
• K – number of neighbors
• $d(\mathbf{x}_1, \mathbf{x}_2)$ – distance metric
• Weighting scheme – uniform or distance-based
Algorithm
Given $n$ observations with features $(\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and targets $(y_1, \ldots, y_n)$, predict for a new observation $\mathbf{x}_p$:
1. Find the set $N_p$ of the $K$ nearest neighbors of $\mathbf{x}_p$ under the distance metric
2. Compute the prediction: $\hat{y}_p = \sum_{i \in N_p} w_i y_i$
where $w_i = \frac{1/d(\mathbf{x}_i, \mathbf{x}_p)}{\sum_{j \in N_p} 1/d(\mathbf{x}_j, \mathbf{x}_p)}$ for distance weighting, or $w_i = \frac{1}{K}$ for uniform weighting
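A from-scratch sketch of the prediction step above for the regression case; the function name, Euclidean distance, and the small guard against zero distances are illustrative assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, x_p, K=5, weighting="distance"):
    """Predict y for x_p from its K nearest neighbors (illustrative sketch)."""
    d = np.linalg.norm(X_train - x_p, axis=1)      # distances to every observation
    N_p = np.argsort(d)[:K]                        # indices of the K nearest

    if weighting == "uniform":
        w = np.full(K, 1.0 / K)                    # w_i = 1/K
    else:                                          # inverse-distance weights
        inv = 1.0 / np.maximum(d[N_p], 1e-12)      # guard against zero distance
        w = inv / inv.sum()

    return np.sum(w * y_train[N_p])                # weighted average of targets
```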
Applied to the Iris dataset
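The plots themselves are not reproduced here; a minimal scikit-learn sketch of how such a fit could be produced (the 70/30 split, K = 5, and distance weighting are assumptions, not values from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# K = 5 neighbors, distance-weighted votes, Euclidean (Minkowski p=2) metric
knn = KNeighborsClassifier(n_neighbors=5, weights="distance", p=2)
knn.fit(X_tr, y_tr)
print(knn.score(X_te, y_te))   # classification accuracy on the held-out split
```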